Lesson 3 (Packages and data import)
Packages and data import
When we were working with Mendel's peas, we were only looking at a few numbers at a time. However, in science, we often get lots and lots of numbers that we want to analyze, and computation enables us to do so. Computation also allows us to create beautiful visualizations of patterns in data that would be impossible to generate by hand. In this part of the lesson, we're going to show you some simple examples of how we can use our newfound Python skills to visualize a large amount of data very quickly.
One example of an area that generates lots of data is human phenotypes. For example, scientists have worked to aggregate human height and weight data to study things like growth and obesity. We have downloaded a publicly available dataset from UCLA that we can work with for this lesson.
Instead of writing our own code to deal with large data sets, we will use what we refer to as a Python "package." Packages are collections of code someone else has already written to perform complicated tasks that you can then use so you don't have to reinvent the wheel! Think of it like using a template in Microsoft PowerPoint to build a slide rather than building one from scratch. You could put a text box for the title and another for the subtitle, but there's already a template that looks just like that, so it's quicker to just use that one.
It's not super important for you to understand the details of how package importing works, but we'll show you what it looks like. Here, we're importing a package called Pandas. Pandas contains many functions that aid in data import and analysis. We could, of course, spend a lot of time rewriting complicated functions that give the same end result, but why reinvent the wheel when one simple line of code can import all of these functions for us to use?
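As a minimal sketch, importing Pandas usually looks like the line below. The short name pd is only a widely used convention, not something the lesson requires; any alias would work.

```python
# Import the pandas package and give it the conventional short alias "pd".
import pandas as pd

# Everything pandas provides is now reachable through the "pd" prefix,
# for example the function that reads CSV files:
print(pd.read_csv.__name__)
```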
Now we're going to load the data, using the Pandas package we just imported. Again, the syntax is not important for you to understand right now. We're going to read in a file of the heights and weights of 25,000 people and put it in an object called human_data. You can look at the raw data file, which is stored in the second tab of the coding console, titled HumanHeightWeightData.csv (you can also download the data to take a look at it yourself by clicking on the file name). As you can probably surmise by looking at the data, the file type "csv" stands for "comma-separated values."
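Here is a rough sketch of what loading a csv file with Pandas can look like. So that it runs anywhere, it reads a tiny stand-in for the file from a text string; the column names (Height, Weight) and the three rows of values are made up for illustration and are not taken from the real data set.

```python
import io
import pandas as pd

# Stand-in for the contents of HumanHeightWeightData.csv.
# ASSUMPTION: these column names and values are invented for illustration.
csv_text = """Height,Weight
65.8,113.0
71.5,136.5
69.4,153.0
"""

# With the real file available, the call would instead look like:
#     human_data = pd.read_csv("HumanHeightWeightData.csv")
human_data = pd.read_csv(io.StringIO(csv_text))

# read_csv returns a table-like object called a DataFrame.
print(type(human_data))
```

The first line of a csv file is treated as the column names by default, and every following line becomes one row of the table.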
So what exactly did we just create? Let's take a look. Using the head() command, you can see the top of the file you just loaded.
Note: Because we're loading multiple packages and a large data set, please allow some time for the code to run (up to 1 minute).
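To illustrate head(), here is a small sketch that uses a made-up table in place of the real file. The column names and numbers are placeholders, not real measurements.

```python
import pandas as pd

# ASSUMPTION: a tiny invented table standing in for the real human_data.
human_data = pd.DataFrame({
    "Height": [65.8, 71.5, 69.4, 68.2, 67.8, 68.7, 69.8, 70.0],
    "Weight": [113.0, 136.5, 153.0, 142.3, 144.3, 123.3, 141.5, 136.5],
})

# head() with no argument returns the first 5 rows.
top_rows = human_data.head()
print(top_rows)

# Passing a number asks for that many rows instead.
print(human_data.head(3))
```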
Using the shape property, you can see how big the file is. The first number (25000) is the number of rows (in this data, each row is one individual person), and the second number (2) is the number of columns, which makes sense because we know that there are two columns - one for height and one for weight.
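As a quick sketch of shape, again on an invented stand-in table (4 rows instead of 25000, with placeholder column names and values):

```python
import pandas as pd

# ASSUMPTION: a tiny invented table standing in for the real human_data.
human_data = pd.DataFrame({
    "Height": [65.8, 71.5, 69.4, 68.2],
    "Weight": [113.0, 136.5, 153.0, 142.3],
})

# shape is a property, so it is written without parentheses.
print(human_data.shape)  # → (4, 2)

# The two numbers can be unpacked into separate variables.
n_rows, n_columns = human_data.shape
print(n_rows, "rows and", n_columns, "columns")
```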
Because the data take so long to load, we'll be working with a smaller file for the rest of the lesson (only 1000 rows). However, don't think that computers can't work with big data sets; it's only because the file needs to be loaded into each window that it seems too slow. Normally, you only have to load the data once, and then you can perform all of your calculations relatively quickly!
Now, try going to the website the data came from and see if you can figure out what the average height and weight of these individuals are... we bet you can't! Let's use Python to calculate the means in a fraction of a second using the mean() method!
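A minimal sketch of calculating the averages is shown below. In Pandas, mean is a method, so it is called with parentheses. The table is an invented stand-in with placeholder column names and values chosen so the averages are easy to check by hand.

```python
import pandas as pd

# ASSUMPTION: a tiny invented table standing in for the real human_data.
human_data = pd.DataFrame({
    "Height": [60.0, 70.0, 65.0, 65.0],
    "Weight": [100.0, 150.0, 125.0, 125.0],
})

# mean() averages each column and returns one value per column.
averages = human_data.mean()
print(averages)

# A single column's average can be picked out by its name.
print(averages["Height"])  # → 65.0
print(averages["Weight"])  # → 125.0
```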