Lesson 3 (Correlation)

Correlations

So far, we've been exploring height and weight separately, looking at only one variable at a time. But what if we wanted to learn about the relationship between those two variables? Intuitively, we would expect taller people to weigh more, on average. Can we show this in our data?

Programming and statistics offer a variety of tools to examine these kinds of relationships. We've already talked about histograms, but those aren't very informative about the relationship between two variables; they mainly show the distribution of a single variable. To look at two variables together, we commonly use a type of graph called a scatter plot.

On a scatter plot, the value of one of the variables is plotted along the x-axis, and the value of the other is plotted on the y-axis. If we consider our height and weight data, each point would represent one person, with their height plotted on one axis and their weight on the other. Like histograms and box plots, scatter plots are easily created using Python!
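
Here's a minimal sketch of what that could look like with matplotlib. The data below are synthetic stand-ins generated with NumPy, not the lesson's dataset: we're assuming heights in inches and weights in pounds, and the names height and weight are ours, so swap in your own columns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 5,000 synthetic people (NOT the lesson's dataset).
rng = np.random.default_rng(seed=0)
height = rng.normal(loc=66, scale=4, size=5000)               # inches (assumed unit)
weight = 5 * height - 180 + rng.normal(scale=20, size=5000)   # pounds, rising with height

fig, ax = plt.subplots()
ax.scatter(height, weight)            # one point per person
ax.set_xlabel("Height (inches)")
ax.set_ylabel("Weight (pounds)")
ax.set_title("Height vs. weight")
```

In a notebook the figure shows up on its own; in a script, add plt.show() at the end.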

For our height and weight data, the scatter plot looks like this:

HeightWeightScatterPlot.png

You might notice that we can't really tell how many points there are in the middle of the graph -- there are so many points that there's tons of overlap. We can use a trick to improve our visualization: a plotting parameter called alpha, which controls the transparency of each point. Alpha can have any value from 0 (fully transparent) to 1 (fully opaque, which is the default). Let's try an example where we set alpha to 0.1, so we can see the point distribution a little more clearly (no pun intended).
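
As a sketch, here is the same kind of plot with alpha set to 0.1. Again, the data are synthetic stand-ins generated with NumPy under our own assumptions (inches and pounds), not the lesson's dataset. The only change to the plotting call is the alpha argument of matplotlib's scatter.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data: 5,000 synthetic people (NOT the lesson's dataset).
rng = np.random.default_rng(seed=0)
height = rng.normal(loc=66, scale=4, size=5000)               # inches (assumed unit)
weight = 5 * height - 180 + rng.normal(scale=20, size=5000)   # pounds, rising with height

fig, ax = plt.subplots()
# alpha=0.1 makes each point 90% transparent, so crowded regions look darker.
ax.scatter(height, weight, alpha=0.1)
ax.set_xlabel("Height (inches)")
ax.set_ylabel("Weight (pounds)")
ax.set_title("Height vs. weight (alpha = 0.1)")
```

Where many points overlap, their faint colors stack up, so the darkest region marks where most people fall.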

With alpha set to 0.1, the plot looks like this, and the density of the points is easier to interpret:

HeightWeightScatterPlot_withTransparency.png

In general, we see that points with lower heights also have lower weights, and points with higher heights have higher weights. This produces a diagonal shape in the scatter plot, running from the lower left to the upper right, which tells us that there is a correlation in these data. Specifically, height and weight are positively correlated. If higher heights instead went with lower weights, and vice versa, we would say the data are negatively correlated (in which case the diagonal would be flipped, running from the upper left to the lower right).

Another pair of positively correlated variables is temperature and ice cream consumption. By contrast, there is a negative correlation between temperature and hot chocolate consumption. Life is certainly sweet year-round! Keep in mind that not all variables are correlated: we wouldn't expect a correlation between temperature and potato chip consumption (we can eat potato chips year-round too!).
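
If you'd like a number to go with the picture, one common choice (not the only one) is the Pearson correlation coefficient, which ranges from -1 to 1: positive for positive correlation, negative for negative correlation, and near 0 when there is no linear relationship. The sketch below computes it with NumPy's corrcoef. Every dataset in it is made up for illustration, and the variable names and the helper pearson_r are ours.

```python
import numpy as np

# Hypothetical, synthetic data for one year of days (made up for illustration).
rng = np.random.default_rng(seed=0)
temperature = rng.uniform(30, 100, size=365)                         # degrees F
ice_cream = 2.0 * temperature + rng.normal(scale=20, size=365)       # rises with temperature
hot_chocolate = -1.5 * temperature + rng.normal(scale=20, size=365)  # falls with temperature
potato_chips = rng.normal(loc=50, scale=10, size=365)                # unrelated to temperature


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length arrays."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r for (x, y).
    return float(np.corrcoef(x, y)[0, 1])


for name, values in [
    ("ice cream", ice_cream),
    ("hot chocolate", hot_chocolate),
    ("potato chips", potato_chips),
]:
    print(f"temperature vs. {name}: r = {pearson_r(temperature, values):+.2f}")
```

With data built this way, we'd expect a clearly positive r for ice cream, a clearly negative r for hot chocolate, and an r close to 0 for potato chips.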

We want to leave you with a final thought about correlation:

Correlation does not equal causation
— Every data scientist, ever