Outliers and Their Impact
How Extreme Values Affect Statistics
Outliers are data values that lie far away from the rest of a data set. They matter because a single unusual value can change summaries like the mean, spread, and even the apparent pattern in a graph. In science, business, and social research, outliers can signal measurement error, rare events, or important discoveries. Learning to identify and interpret them helps students avoid misleading conclusions.
An outlier can strongly affect some statistics while leaving others nearly unchanged. For example, the mean is sensitive to extreme values, but the median is usually more resistant. Outliers can also stretch the scale of a graph, hide the main cluster, and change the slope of a best fit line. Good statistical practice is not just removing unusual points, but investigating why they appear and choosing methods that match the situation.
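The contrast between a sensitive and a resistant measure can be seen in a minimal sketch. The data values and the helper name `summarize` are hypothetical, chosen only for illustration; the sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median


def summarize(data):
    """Return the mean, median, and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }


# Hypothetical data: six typical values, then the same values plus one extreme value.
typical = [12, 13, 13, 14, 15, 15]
with_outlier = typical + [60]

before = summarize(typical)
after = summarize(with_outlier)

# Print each statistic before and after the extreme value is added.
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this made-up data set, adding the single value 60 raises the mean by more than 6 and the range from 3 to 48, while the median moves only from 13.5 to 14.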
Key Facts
- An outlier is a value much larger or smaller than most of the data.
- Mean = (sum of all data values) / n, and the mean is strongly affected by outliers.
- Median is the middle value of ordered data, and it is usually resistant to outliers.
- Range = maximum - minimum, so one extreme value can greatly increase the range.
- IQR = Q3 - Q1, and a common rule marks outliers below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
- In scatterplots, an outlier can change correlation and the equation of a line of best fit.
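The 1.5 × IQR rule from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function names `quartiles` and `iqr_outliers` and the data are hypothetical, and the quartiles use the median-of-halves convention often taught in introductory courses. Other conventions, including many software defaults, can give slightly different quartiles and therefore slightly different fences.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) using the median-of-halves convention.

    Assumption: when the count is odd, the middle value is left out
    of both halves. Other quartile conventions may differ slightly.
    """
    if len(data) < 2:
        raise ValueError("need at least two values to form two halves")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle
    return median(lower), median(upper)


def iqr_outliers(data, k=1.5):
    """Return the values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in data if x < low_fence or x > high_fence]


# Hypothetical data set, used only for illustration.
scores = [12, 13, 13, 14, 15, 15, 60]
print(quartiles(scores))     # Q1 and Q3
print(iqr_outliers(scores))  # values outside the fences
```

For this hypothetical data, Q1 = 13 and Q3 = 15, so IQR = 2 and the fences are 10 and 18. Only the value 60 falls outside them.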
Vocabulary
- Outlier: A data value that is unusually far from the rest of the values in a data set.
- Mean: The average found by adding all data values and dividing by the number of values.
- Median: The middle value when the data are arranged in order, or the average of the two middle values when the number of values is even.
- Interquartile range (IQR): The difference between the third quartile and the first quartile. It describes the spread of the middle half of the data.
- Resistant statistic: A measure that does not change much when an outlier is added or removed.
Common Mistakes to Avoid
- Assuming every extreme value is an error, which is wrong because some outliers represent real and important events that should be studied rather than deleted.
- Using only the mean to describe data with outliers, which is wrong because the mean can be pulled strongly by extreme values and may not represent the typical value well.
- Ignoring the graph and relying only on calculations, which is wrong because a plot often reveals outliers, clusters, and skew that summary numbers can hide.
- Removing outliers without explanation, which is wrong because data cleaning should be justified by context such as measurement mistakes, recording errors, or a clearly different population.
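One way to support a decision about an unusual point in a scatterplot is to report the analysis both with and without it, so the reader can see how much that single point matters. The sketch below does this for the correlation and the least-squares slope. The points and the helper name `fit_summary` are hypothetical, and the formulas are the standard ones for Pearson's r and the slope of the line of y on x.

```python
from math import sqrt


def fit_summary(xs, ys):
    """Return (r, slope) for the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    slope = sxy / sxx          # slope of the least-squares line
    return r, slope


# Hypothetical points following a clear upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# The same points plus one hypothetical point far to the right and far below the trend.
xs_all = xs + [15]
ys_all = ys + [1.0]

r_clean, slope_clean = fit_summary(xs, ys)
r_all, slope_all = fit_summary(xs_all, ys_all)

print(f"without the point: r = {r_clean:.3f}, slope = {slope_clean:.3f}")
print(f"with the point:    r = {r_all:.3f}, slope = {slope_all:.3f}")
```

In this made-up example, the one added point is enough to turn a strong positive correlation and slope into negative ones. Reporting both versions documents the point's influence instead of hiding it.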
Practice Questions
- 1 Find the mean and median of the data set 4, 5, 5, 6, 6, 7, 30. Then state which measure better represents the center of the data.
- 2 For the ordered data , find Q1, Q3, and the IQR, and determine whether is an outlier using the 1.5 × IQR rule.
- 3 A scatterplot shows a clear upward trend for most points, but one point lies far to the right and far below the pattern. Explain how that point could affect the correlation and the line of best fit.