Correlation and Regression
r, r^2, Slope & Least-Squares Line
Correlation and regression are tools for describing relationships between two quantitative variables. Correlation tells how strongly and in what direction the variables move together, while regression gives an equation that predicts one variable from the other. These ideas are used in science, economics, psychology, and engineering to analyze data and make informed decisions. Understanding both helps students move from simply plotting points to interpreting patterns mathematically.
A scatter plot is usually the starting point because it shows whether a linear pattern is reasonable. The correlation coefficient r measures the strength and direction of a linear association, with values from -1 to 1. A regression line, often written as y = a + bx, predicts y from x; its slope b estimates the average change in y for each one-unit increase in x. Good analysis also checks for outliers and nonlinearity, and keeps in mind that correlation alone does not prove causation.
Key Facts
- Correlation coefficient range: -1 <= r <= 1
- If r > 0, the association is positive; if r < 0, the association is negative
- A common linear regression model is y = a + bx
- Slope of any line: b = change in y / change in x; for the least-squares regression line, b = r(s_y / s_x), where s_x and s_y are the standard deviations of x and y
- The intercept a is the predicted value of y when x = 0
- Coefficient of determination: R^2 tells the proportion of variation in y explained by the regression model; for a line with one predictor, R^2 = r^2
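The facts above can be tied together with a short computation. The following is a minimal sketch in plain Python, not a reference implementation: the function name fit_line and the hours/scores data are hypothetical and chosen only for illustration. It computes r, the least-squares intercept a and slope b, and r^2 from the sums of squared deviations.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (r, a, b, r_squared) for the least-squares line y = a + b*x.

    fit_line is a hypothetical helper name used only for this sketch.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)   # correlation coefficient, between -1 and 1
    b = sxy / sxx               # least-squares slope, equal to r * (s_y / s_x)
    a = mean_y - b * mean_x     # the line passes through (mean_x, mean_y)
    return r, a, b, r ** 2

# Made-up data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
r, a, b, r_squared = fit_line(hours, scores)
print(f"r = {r:.3f}, line: y = {a:.1f} + {b:.1f}x, r^2 = {r_squared:.3f}")
```

For this made-up data the slope works out to 4.1 and the intercept to 47.7, so each additional hour is associated with a predicted increase of about 4.1 points.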
Vocabulary
- Correlation
- Correlation is a numerical measure of the strength and direction of a linear relationship between two variables.
- Regression line
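- A regression line is the line that best fits the data, usually in the least-squares sense of minimizing the sum of squared vertical distances from the points to the line. It is used to predict values of one variable from another.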
- Slope
- Slope describes how much the predicted y-value changes for each one-unit increase in x.
- Intercept
- The intercept is the predicted value of y where the regression line crosses the y-axis, at x = 0.
- Outlier
- An outlier is a data point that lies far from the overall pattern of the other points.
Common Mistakes to Avoid
- Assuming correlation proves causation, which is wrong because two variables can be related without one causing the other. A third variable or coincidence may explain the pattern.
- Using the regression line when the scatter plot is clearly curved, which is wrong because linear regression only models roughly straight-line relationships well. Always inspect the graph before fitting a line.
- Ignoring outliers, which is wrong because one unusual point can strongly change both the correlation and the regression equation. Check whether the point is an error or a meaningful extreme value.
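- Interpreting a negative r, such as r = -0.9, as a weak or absent relationship, which is wrong because the sign only shows direction. The magnitude |r| gives strength, so r = -0.9 is a strong negative association and only values of r near 0 indicate a weak linear relationship.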
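The warning about outliers can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper pearson_r and both data sets are hypothetical. It computes r for six points that follow a rising pattern, then again after adding one point far from that pattern.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r (hypothetical helper for this sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Made-up data with a clear upward trend
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 4, 6, 7]
r_clean = pearson_r(xs, ys)

# Same data plus one made-up point, (7, 0), far below the trend
r_outlier = pearson_r(xs + [7], ys + [0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In this made-up example a single added point moves r from above 0.9 to close to 0, which is why the scatter plot should be inspected before trusting either r or the fitted line.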
Practice Questions
- 1. A regression equation is y = 12 + 3x. What is the predicted value of y when x = 5, and what does the slope mean in context?
- 2. A data set has correlation coefficient r = -0.82. State the direction of the relationship and describe whether the linear association is weak, moderate, or strong.
- 3. Two variables have a correlation of 0.91. Explain why this does not automatically mean that changes in one variable cause changes in the other.