Statistics: Regression Diagnostics: Residuals and Conditions
Using residuals to check whether a linear regression model is appropriate
Statistics - Grade 9-12
- 1
A regression model predicts a student's test score using hours studied: predicted score = 52 + 7.5(hours studied). A student studied for 4 hours and earned an actual score of 86. Calculate the residual and explain what it means.
Residual = actual value - predicted value.
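The formula above can be sketched in a few lines of Python. This is only an illustration using the numbers from this problem; the function names `predict_score` and `residual` are made up for the example.

```python
def predict_score(hours_studied):
    """Predicted score from the model in this problem: 52 + 7.5 * hours."""
    return 52 + 7.5 * hours_studied


def residual(actual, predicted):
    """Residual = actual value - predicted value."""
    return actual - predicted


predicted = predict_score(4)  # model prediction for 4 hours of study
# A positive residual means the actual score was above the prediction.
print(residual(86, predicted))
```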
The predicted score is 52 + 7.5(4) = 82. The residual is actual minus predicted, so 86 - 82 = 4. The student scored 4 points higher than the model predicted.
- 2
A regression model predicts the cost of a used car from its age. One car has an actual price of $9,800 and a predicted price of $10,600. Find the residual and interpret it in context.
The residual is 9,800 - 10,600 = -800. The car cost $800 less than the regression model predicted.
- 3
A residual plot shows residuals scattered randomly around 0 with no clear pattern and roughly the same vertical spread across all x-values. What does this suggest about the linear regression model?
A good residual plot should look like random scatter around the horizontal line at 0.
This suggests that a linear regression model is reasonable because the residual plot does not show curvature or changing spread.
- 4
A residual plot forms a clear U-shape, with positive residuals for small and large x-values and negative residuals for middle x-values. What condition appears to be violated, and what should the analyst consider?
Patterns in a residual plot usually mean the model is missing some structure.
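To see how such a pattern can arise, here is a minimal sketch that fits a least-squares line to made-up data following y = x² and prints the residuals. The data and the helper name `fit_line` are assumptions for illustration, not part of the problem.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 6.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
# Residual = actual - predicted for each point.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)
```

With these values the residuals are positive at both ends and negative in the middle, which is the U-shape described in the question.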
The linearity condition appears to be violated. The analyst should consider using a curved model or transforming one of the variables instead of using a straight-line model.
- 5
In a regression analysis, residuals become more spread out as x increases. Name the condition that is not met and explain why it matters.
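The equal variance condition is not met because the residuals do not have a similar spread across all x-values. This matters because the model's uncertainty estimates, such as prediction intervals, assume a similar spread everywhere, so they may be too narrow for some x-values and too wide for others.
- 6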
List the four common conditions checked before using a linear regression model for inference.
Many classes summarize these as linear, independent, normal, and equal variance.
The four common conditions are linearity, independence of observations, approximately normal residuals, and equal variance of residuals.
- 7
A class fits a regression line to predict backpack weight from student height. The residual plot shows no pattern, but the data were collected by measuring 30 students from the same sports team after practice. Which condition might be a concern, and why?
The independence condition might be a concern because students from the same team may have similar routines, gear, or body types, so the observations may not be independent of each other.
- 8
A normal probability plot of the residuals is nearly straight, with only small random deviations. What does this suggest about the normal residuals condition?
For a normal probability plot, points close to a straight line are a good sign.
This suggests that the normal residuals condition is reasonably met because the residuals appear to follow an approximately normal distribution.
- 9
A scatterplot contains one point with an x-value much larger than all the other x-values. The point lies close to the regression line. Explain why this point may still be important in regression diagnostics.
Leverage depends mostly on how unusual the x-value is.
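The hint above can be made concrete with the standard leverage formula for a line with one predictor, h = 1/n + (x - mean of x)² / (sum of squared deviations of x). This formula goes beyond what the worksheet states, and the x-values below are made up for illustration.

```python
def leverages(xs):
    """Leverage of each point in simple linear regression.

    Uses h_i = 1/n + (x_i - x_bar)^2 / Sxx, which depends only on x.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical x-values: five clustered points and one far to the right.
xs = [1, 2, 3, 4, 5, 20]
for x, h in zip(xs, leverages(xs)):
    print(x, round(h, 3))
```

Note that the y-values never enter the calculation, so a point can have high leverage whether or not it lies close to the line.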
The point may have high leverage because its x-value is far from the rest of the data. Even if it is close to the line, it can strongly affect the slope and position of the regression line.
- 10
A data point has a large residual but an x-value near the center of the data. Is it more accurate to call it an outlier, a high-leverage point, or both? Explain.
It is more accurate to call it an outlier because it has a large residual. It is not a high-leverage point if its x-value is near the center of the data.
- 11
A point has a very unusual x-value and removing it changes the regression slope from 1.2 to 2.8. What type of point is this, and why?
Influential points noticeably change the regression model when removed.
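One way to check for influence is to fit the line twice, once with the point and once without it, and compare the slopes. The sketch below uses made-up data (not the data behind this problem's slopes of 1.2 and 2.8) and a hypothetical helper `fit_line`.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: the last point has an unusual x-value.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 6, 8, 10, 12]

slope_all, _ = fit_line(xs, ys)                # fit using every point
slope_without, _ = fit_line(xs[:-1], ys[:-1])  # fit with the last point removed
print(round(slope_all, 2), round(slope_without, 2))
```

A large gap between the two slopes is the sign that the removed point is influential.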
This point is influential because removing it causes a large change in the regression slope. It likely also has high leverage because its x-value is very unusual.
- 12
A regression model predicts monthly electricity cost from average outdoor temperature. The residuals for 12 months are measured in dollars: -5, 7, -3, 4, 0, -6, 8, -2, 5, -4, 3, -7. Are the residuals centered around 0? Explain using the sum or mean of the residuals.
Add all the residuals, then divide by the number of residuals if needed.
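The arithmetic in the hint above can be checked with a short Python sketch using the twelve residuals from the question. The variable names are chosen only for this example.

```python
# Monthly residuals in dollars, copied from the question.
residuals = [-5, 7, -3, 4, 0, -6, 8, -2, 5, -4, 3, -7]

total = sum(residuals)          # sum of the residuals
mean = total / len(residuals)   # mean residual
print(total, mean)
```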
The residuals are centered around 0 because their sum is 0 and their mean is 0. This is typical when a least-squares regression line with an intercept is used.
- 13
A regression model has residual standard deviation s = 3.2 minutes when predicting race time. Interpret this value in context.
The model's predictions are typically off by about 3.2 minutes. This value describes the typical size of the residuals when predicting race time.
- 14
A student says, "The correlation is strong, so we do not need to check the residual plot." Explain why this statement is incorrect.
Correlation measures the strength of a linear relationship, but diagnostics check whether the model assumptions are reasonable.
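As a rough illustration of this point, the sketch below computes the correlation for made-up data that follow a curve (y = x² for x from 1 to 10) and then looks at the residuals from the least-squares line. The data and variable names are assumptions for the example.

```python
from math import sqrt

# Hypothetical curved data: y = x^2 for x = 1, 2, ..., 10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)          # correlation coefficient
slope = sxy / sxx                  # least-squares slope
intercept = y_bar - slope * x_bar  # least-squares intercept
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(round(r, 3))
print([round(e, 1) for e in residuals])
```

Here the correlation is above 0.97, yet the residuals are positive at both ends and negative in the middle, so a straight line is still not an appropriate model for these data.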
The statement is incorrect because a strong correlation does not guarantee that a linear model is appropriate. A residual plot can reveal curvature, changing spread, or unusual points that the correlation alone may not show.
- 15
A regression model was created using data from houses between 800 and 3,000 square feet. A real estate agent uses the model to predict the price of a 5,200-square-foot house. Identify the diagnostic or modeling concern and explain the risk.
The concern is extrapolation because 5,200 square feet is outside the range of the data used to create the model. The prediction may be unreliable because the linear pattern may not continue for houses that large.
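The range check described above can be sketched as a small Python function. The name `is_extrapolation` is hypothetical, and the limits are the 800 and 3,000 square feet given in the question.

```python
def is_extrapolation(x_new, x_min, x_max):
    """Return True if x_new lies outside the range of x-values used to fit the model."""
    return x_new < x_min or x_new > x_max


# The model was built from houses between 800 and 3,000 square feet.
print(is_extrapolation(5200, 800, 3000))
print(is_extrapolation(1500, 800, 3000))
```

A check like this only flags the concern. It does not say how far the prediction may be off.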