This cheat sheet summarizes common statistics tasks in R, Python, and Excel for grades 10-12. It helps students connect statistical ideas to the software commands used to calculate them. Use it as a quick reference when entering data, finding summaries, making graphs, or checking results. It is especially useful for avoiding syntax errors and choosing the right tool for each statistical question. The core ideas include descriptive statistics, data visualization, correlation, linear regression, and hypothesis testing. Important formulas include the sample mean $\bar{x} = \frac{\sum x_i}{n}$, sample standard deviation $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$, and correlation $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y}$. R, Python, and Excel can all compute these values, but each uses different syntax. Students should understand both the command and the statistic it produces.

Key Facts

  • The sample mean is calculated by $\bar{x} = \frac{\sum x_i}{n}$, using mean(x) in R, df['x'].mean() in Python, or =AVERAGE(range) in Excel.
  • The sample standard deviation is $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$, using sd(x) in R, df['x'].std() in Python, or =STDEV.S(range) in Excel.
  • The five-number summary includes the minimum, first quartile, median, third quartile, and maximum, often written as $\min, Q_1, Q_2, Q_3, \max$.
  • Correlation measures linear association with $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y}$, using cor(x, y) in R, df['x'].corr(df['y']) in Python, or =CORREL(range1, range2) in Excel.
  • A simple linear regression model has the form $\hat{y} = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$ is the slope.
  • A z-score standardizes a value using $z = \frac{x - \bar{x}}{s}$, which tells how many standard deviations the value is from the mean.
  • For a one-sample t statistic, use $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the hypothesized population mean.
  • Always check whether software functions use sample formulas with $n - 1$ or population formulas with $n$, especially for variance and standard deviation.
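
The formulas in the list above can be traced step by step in a short Python sketch. It uses only the standard library rather than the data-frame commands listed above, so each function mirrors one formula directly. The data values and the function names (sample_mean, sample_sd, correlation, z_score, one_sample_t) are illustrative assumptions, not part of the cheat sheet.

```python
import math
import statistics

# Illustrative data only; these values do not come from the cheat sheet.
x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]


def sample_mean(values):
    """x-bar = (sum of x_i) / n."""
    return sum(values) / len(values)


def sample_sd(values):
    """s = sqrt(sum((x_i - x-bar)^2) / (n - 1)), the sample (n - 1) formula."""
    xbar = sample_mean(values)
    return math.sqrt(sum((v - xbar) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = sum((x_i - x-bar)(y_i - y-bar)) / ((n - 1) * s_x * s_y)."""
    xbar, ybar = sample_mean(xs), sample_mean(ys)
    cross = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    return cross / ((len(xs) - 1) * sample_sd(xs) * sample_sd(ys))


def z_score(value, values):
    """z = (x - x-bar) / s."""
    return (value - sample_mean(values)) / sample_sd(values)


def one_sample_t(values, mu0):
    """t = (x-bar - mu_0) / (s / sqrt(n))."""
    return (sample_mean(values) - mu0) / (sample_sd(values) / math.sqrt(len(values)))


print("mean:", sample_mean(x))
print("sample sd:", round(sample_sd(x), 4))
# Cross-check the hand-written formula against the library's sample formula.
print("matches statistics.stdev:", math.isclose(sample_sd(x), statistics.stdev(x)))
print("r:", round(correlation(x, y), 4))
print("z-score of 10:", round(z_score(10, x), 4))
print("t against mu0 = 5:", round(one_sample_t(x, 5), 4))
```

With this data the mean is 6 and the correlation is 0.8. The cross-check against statistics.stdev confirms that the hand-written function uses the $n - 1$ formula.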

Vocabulary

Data frame
A table-like data structure with rows as observations and columns as variables, commonly used in R and Python.
Function
A named command that takes input values and returns an output, such as mean(x) or =AVERAGE(range).
Sample standard deviation
A measure of spread for sample data, calculated by $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$.
Correlation
A number from $-1$ to $1$ that describes the strength and direction of a linear relationship between two quantitative variables.
Linear regression
A method for modeling a straight-line relationship between variables using $\hat{y} = b_0 + b_1 x$.
P-value
The probability of getting a result at least as extreme as the observed result if the null hypothesis is true.
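
As a rough illustration of the linear regression entry, the sketch below fits $\hat{y} = b_0 + b_1 x$ by least squares and then makes a prediction. The slope and intercept formulas used here, $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$, are the standard least-squares results and are not stated in this cheat sheet. The function names fit_line and predict and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept
    return b0, b1


def predict(b0, b1, x_new):
    """Evaluate y-hat = b0 + b1 * x at a new x value."""
    return b0 + b1 * x_new


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 2, 5, 4]

b0, b1 = fit_line(xs, ys)
print("intercept b0:", round(b0, 4))
print("slope b1:", round(b1, 4))
print("prediction at x = 7:", round(predict(b0, b1, 7), 4))
```

For this data the slope is 0.4 and the intercept is 0.6, so the fitted line predicts 3.4 at $x = 7$.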

Common Mistakes to Avoid

  • Using the population standard deviation instead of the sample standard deviation is wrong when the data represent a sample, because $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$ and $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ answer different questions.
  • Forgetting quotation marks around column names in Python is wrong because df[x] looks for a variable named x, while df['x'] selects the column named x.
  • Mixing up correlation and causation is wrong because a value of $r$ close to $-1$ or $1$ shows strong linear association, not proof that one variable causes the other.
  • Selecting the wrong Excel range gives wrong answers because a cell that does not belong to the data, such as a total, a value from another column, or a 0 typed in for a missing value, can change outputs such as $\bar{x}$, $s$, and $r$. Functions such as =AVERAGE skip empty cells and text labels, but they do count a 0.
  • Interpreting the regression intercept without context can be wrong because $b_0$ only has a meaningful real-world interpretation when $x = 0$ is reasonable for the situation.
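
Three of the mistakes above can be reproduced in a few lines of Python. This is a minimal sketch with made-up data. A plain dictionary stands in for a data frame so that it runs without pandas, and the variable names are hypothetical.

```python
import statistics

data = [3, 5, 5, 7, 10]  # illustrative sample, not real measurements

# Mistake 1: sample and population formulas give different answers.
sd_sample = statistics.stdev(data)       # divides by n - 1
sd_population = statistics.pstdev(data)  # divides by n
print("sample sd:", round(sd_sample, 4))
print("population sd:", round(sd_population, 4))

# Mistake 2: a stray 0 swept into the selected range shifts the mean.
mean_clean = statistics.mean(data)
mean_stray = statistics.mean(data + [0])
print("mean without stray 0:", mean_clean)
print("mean with stray 0:", mean_stray)

# Mistake 3: missing quotation marks. The dict below only imitates
# column lookup; it is not a real data frame.
table = {"score": data}
try:
    table[score]  # no quotes: Python looks for a variable called score
    missing_quotes_failed = False
except NameError:
    missing_quotes_failed = True
print("lookup without quotes failed:", missing_quotes_failed)
print("lookup with quotes:", table["score"])
```

The two standard deviations differ because one divides by $n - 1$ and the other by $n$, and the stray 0 lowers the mean from 6 to 5.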

Practice Questions

  1. A data set has values $4, 6, 8, 10, 12$. Find the sample mean $\bar{x}$ and identify the matching software command in R, Python, and Excel.
  2. For the data $3, 5, 5, 7, 10$, calculate the sample standard deviation using $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$.
  3. A regression output gives $b_0 = 12$ and $b_1 = 3.5$. Write the prediction equation $\hat{y} = b_0 + b_1 x$ and predict $\hat{y}$ when $x = 8$.
  4. A spreadsheet shows a correlation of $r = 0.92$ between study time and test score. Explain what this means and why it does not prove causation.
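
After working the first three questions by hand, the numerical answers can be checked with a short Python sketch. It uses the standard-library statistics module, whose stdev function applies the sample ($n - 1$) formula. The variable names are hypothetical.

```python
import statistics

# Question 1: sample mean of 4, 6, 8, 10, 12.
q1_mean = statistics.mean([4, 6, 8, 10, 12])

# Question 2: sample standard deviation of 3, 5, 5, 7, 10.
q2_sd = statistics.stdev([3, 5, 5, 7, 10])

# Question 3: prediction from y-hat = b0 + b1 * x at x = 8.
b0, b1 = 12, 3.5
q3_prediction = b0 + b1 * 8

print("Q1 mean:", q1_mean)
print("Q2 sample sd:", round(q2_sd, 4))
print("Q3 prediction:", q3_prediction)
```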