Lesson 3 (Outliers)
Outliers
So far, we have seen fairly symmetrical distributions in which the data points were centered on a mean of 0. What if most of the data points were still centered on a mean of 0, but there was one additional data point far away from that mean? We can visualize this case using the following lines of code:
This produces the following histogram, located in the tab titled "Histogram[0,10]+Outlier.png":
Notice that one data point sits far away from the rest of the data (at 200 on the x-axis). Data points like this, which do not fit the pattern of the expected distribution, are called "outliers."
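To make the idea concrete, here is a minimal sketch of how such a dataset could be built. It assumes Python and its standard library; the sample size of 1000, the standard deviation of 10, the seed, and the bin width are illustrative choices, not values taken from the lesson. Instead of drawing a plot, it prints a rough text histogram of bin counts.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

# Assumed parameters: 1000 draws from a normal distribution, mean 0, sd 10
data = [random.gauss(0, 10) for _ in range(1000)]
data.append(200)  # the single outlier, far from the rest of the data

# Rough text histogram: count how many points fall in each bin of width 20
bin_width = 20
counts = Counter(int(x // bin_width) * bin_width for x in data)
for left in sorted(counts):
    print(f"[{left:>4}, {left + bin_width:>4}) {counts[left]}")
```

In the printed counts, almost every point lands in the bins near 0, and the bin starting at 200 holds exactly one point: the outlier.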
Medians
In our outlier example, the mean of the expected distribution is 0, but if we calculate the mean of the data including the outlier, the result is pulled away from 0 toward the outlier. One statistic that is much less influenced by the outlier is the median. To calculate the median, rank all of the data from the smallest value to the largest value and select the middle value (or, when the number of data points is even, the average of the two middle values). If we calculate the median for our data with the outlier, it will be much closer to the intuitive value of 0.
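A small worked example can show the difference. The sketch below assumes Python's standard `statistics` module and uses a tiny made-up dataset (not the lesson's data) so the arithmetic is easy to check by hand.

```python
import statistics

clean = [-3, -1, 0, 1, 3]       # made-up values, symmetric around 0
with_outlier = clean + [200]    # same values plus one outlier

# Without the outlier, the mean and the median are both 0
print(statistics.mean(clean), statistics.median(clean))

# With the outlier, the mean is 200 / 6, while the median is (0 + 1) / 2
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

Adding the single value 200 moves the mean from 0 to about 33.3, while the median moves only from 0 to 0.5.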
Now it's your turn!
Use the median() command to calculate the median of the human height data.