Lesson 3 (Histograms)

Histograms

We can use Python to start to get a sense of what our data actually looks like.

One tool for visualizing data quickly is a graph called a histogram. Histograms display the frequency of different values in your data so that you can see which values are most and least common. Histograms are essentially bar plots where the x-axis represents the values in your data and the y-axis represents counts (i.e. how many of each value we observed in our data).
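
To make the idea of counting concrete, here is a minimal sketch that tallies how often each value occurs in a small list of heights. The numbers are made up for illustration and are not from the lesson's data set. These counts are what the bars of a histogram would show.

```python
import pandas as pd

# Hypothetical measurements in inches, made up for illustration.
heights = pd.Series([61, 62, 62, 63, 63, 63, 64])

# Count how often each value occurs, then order the result by value
# (the same left-to-right order a histogram's x-axis uses).
frequency = heights.value_counts().sort_index()
print(frequency.to_dict())
```

Here 63 occurs three times, so its bar would be the tallest, while 61 and 64 occur once each.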

Histogram basics

Histograms in Python are super simple! They take just a few lines of code. Let's make one to look at the distribution of height in our data set. In addition to the Pandas package we used earlier, we’re going to use another package (Matplotlib) to help us make visualizations like this one.
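
As a rough sketch of what those few lines might look like: the example below builds a stand-in data set, because the lesson's actual data is not shown here. The `data` DataFrame, its `height` column, and the synthetic values are all assumptions; in the lesson you would use the data set you loaded earlier with Pandas.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (an assumption): 200 synthetic heights in inches,
# clipped so they stay between 60 and 76 like the lesson's data.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({"height": rng.normal(loc=68, scale=3, size=200).clip(60, 76)})

# Start a fresh figure and draw the histogram.
# Matplotlib splits the data into 10 equal-width bins by default.
plt.figure()
counts, bin_edges, _ = plt.hist(data["height"])

# Save the plot under the file name the lesson refers to.
plt.savefig("HeightHistogram1.png")
```

`plt.hist` also returns the count in each bin and the bin edges, which can be handy if you want to check the numbers behind the picture.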

To view the histogram plot created by this code, run the code and click the third tab that appears, titled HeightHistogram1.png. (To see this tab, you might have to click the icon on the left side of the terminal that looks like a page with a folded corner.) It should look like this:

HeightHistogram.png

We call the shape of the histogram a "distribution." The x-axis represents different values of height in the data, ranging from 60 inches to 76 inches. The y-axis tells us how many people in the data set were measured in each range of height (e.g. 62-64 inches, 64-66 inches, etc.). For example, we can see that there are about 25 people measured between 63 and 65 inches.

Labeling histograms

Let's make this graph more descriptive. A title, an x-axis label, and a y-axis label tell us what we're looking at, and they help anyone else you show the graph to understand it.
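
One way to add these labels with Matplotlib is sketched below. As before, the `data` DataFrame and its `height` column are stand-ins built from synthetic values, and the label wording is only a suggestion.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (an assumption): 200 synthetic heights in inches.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({"height": rng.normal(loc=68, scale=3, size=200).clip(60, 76)})

# Start a fresh figure and draw the same histogram as before.
plt.figure()
plt.hist(data["height"])

# Describe the plot: a title plus a label for each axis.
plt.title("Distribution of Height")
plt.xlabel("Height (inches)")
plt.ylabel("Number of People")

plt.savefig("HeightHistogram2.png")
```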

The labeled histogram should be found in a tab titled HeightHistogram2.png and look like this:

HeightHistogram2.png

Changing the scales of histograms

Here we have roughly 2-inch resolution, so we can see how many people fall into each 2-inch range. We call these ranges “bins.” What if instead we wanted to see how many people fall into each 1-inch bin (e.g. how many people are between 66 inches and 67 inches)?

We can do that by changing the number of bins in the histogram. In the previous histogram, there were 10 bins. Now, let's make 16 bins (since our range is between 60 inches and 76 inches, and 76 - 60 = 16). We will also set the range to 60 through 76 inches so that each bin covers exactly one inch.
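
A minimal sketch of this change uses the `bins` and `range` arguments of Matplotlib's `plt.hist`. The `data` DataFrame and its `height` column are again stand-ins built from synthetic values, and the labels are only suggestions.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (an assumption): 200 synthetic heights in inches.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({"height": rng.normal(loc=68, scale=3, size=200).clip(60, 76)})

# 16 equal-width bins spread over 60 to 76 inches gives 1-inch bins,
# because (76 - 60) / 16 = 1.
plt.figure()
counts, bin_edges, _ = plt.hist(data["height"], bins=16, range=(60, 76))

plt.title("Distribution of Height")
plt.xlabel("Height (inches)")
plt.ylabel("Number of People")

plt.savefig("HeightHistogram3.png")
```

Without `range`, Matplotlib would spread the 16 bins between the smallest and largest values in the data, so the bin edges would usually not land on whole inches.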

The resulting graph should be found in a tab titled HeightHistogram3.png and look like this:

HeightHistogram3.png

Notice that with this finer breakdown, the symmetry of the distribution is even more obvious than it was before. This is a good lesson in why data visualization matters: small changes in how you draw a plot can change how you interpret the data.

Now it's your turn!

  1. Make the same kind of plot as above (a histogram with labeled axes and a title), but with each bin representing 5 pounds, to show the distribution of human weight.