From the course: Data Wrangling in R
Wide vs. long datasets
- [Instructor] There are many different ways to present the same dataset to the world. Let's take a look at one of the most important and fundamental distinctions: whether a dataset is wide or long. The difference between wide and long datasets boils down to whether we prefer to have more columns in our dataset or more rows. A dataset that puts additional data about a single subject in columns is called a wide dataset, because as we add more columns, the dataset becomes wider. Similarly, a dataset that puts additional data about a subject in rows is called a long dataset, because as we add more rows, the dataset becomes longer.

Now, it's important to point out that there's nothing inherently good or bad about wide or long data. In the world of data wrangling, we sometimes need to make a long dataset wider, and we sometimes need to make a wide dataset longer. However, data scientists who embrace the concept of tidy data generally prefer longer datasets over wider ones, because they're easier to manipulate in R and other statistical analysis packages. The key is to make sure that you continue to follow the rules of tidy data. Remember, we want to structure our datasets so that each variable is in its own column, each observation is in its own row, and each type of observational unit is in its own table. The definition of an observation can be a little bit squishy at times, and that's where we often find the wiggle room to make a dataset wide or long.

You've already seen the concept of wide versus long datasets at play in this course; I just didn't use the terms wide and long to describe them. Earlier, I showed you this example of a set of patient treatment data from Hadley Wickham's paper. This dataset contains all the treatments given to a single patient in one row, with a different column for each treatment type.
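To make the wide and long shapes concrete, here is a minimal sketch in R. It assumes the tidyr package and its `pivot_longer()` and `pivot_wider()` functions, which may not be the functions this course goes on to use. The patient labels, column names, and result values are made up for illustration and are not the values from the paper.

```r
library(tidyr)

# Hypothetical wide data: one row per patient, one column per treatment.
# Patient labels and result values are made up for illustration.
wide <- data.frame(
  patient     = c("Patient 1", "Patient 2", "Patient 3"),
  treatment_a = c(4, 16, 3),
  treatment_b = c(2, 11, 1)
)

# Wide to long: each patient/treatment pair gets its own row.
long <- pivot_longer(
  wide,
  cols      = c(treatment_a, treatment_b),
  names_to  = "treatment",
  values_to = "result"
)

# Long to wide: spread the treatment names back out into columns.
wide_again <- pivot_wider(
  long,
  names_from  = treatment,
  values_from = result
)

dim(wide)  # 3 rows, 3 columns
dim(long)  # 6 rows, 3 columns
```

The same six results appear in both shapes. The wide form treats a patient as the observation, while the long form treats a single treatment of a patient as the observation.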
We then converted the dataset so that each treatment of a patient had its own row. That was an example of taking a dataset that was wide and making it long. Now, this is a very small dataset, so it's not as easy to see width and length at play. Let's look back at another example where it's clearer.

Do you remember this Pew data on religion from earlier in the course? When we first came across this dataset, it had this structure, where each religion had its own row and each income range had a column in that row. This is an example of a wide dataset. The portion of the dataset that we see on the screen here has 10 rows, each containing six data points. That's 60 data values total. We then converted that dataset so that each unique pairing of religion and income range had its own row. The converted dataset doesn't fit on the screen, so I'm showing you a small portion, but converting those same 60 values puts each one in its own row, resulting in 60 rows. We've made our dataset much longer and narrower than its original form.

It might seem that we made the dataset more difficult to work with, and that's true when you're looking at it with the human eye, but the fact is that longer and narrower datasets are much easier to work with in R. Now that you understand the concepts of wide and long datasets, let's talk about the tools that you can use in R to convert between them.
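The 60-values-to-60-rows arithmetic can be checked with a short sketch. It assumes that the `relig_income` dataset bundled with tidyr is the same Pew survey shown in the lesson, and that taking its first 10 rows and first six income columns approximates the portion on screen. Neither assumption is confirmed by the transcript, and `pivot_longer()` is one possible tool rather than necessarily the one the course uses.

```r
library(tidyr)

# relig_income ships with tidyr: one row per religion, one column per income range.
# Keep the first 10 religions and the first 6 income columns (column 1 is religion).
pew_wide <- relig_income[1:10, 1:7]

# Wide to long: one row per religion/income-range pairing.
pew_long <- pivot_longer(
  pew_wide,
  cols      = -religion,
  names_to  = "income",
  values_to = "count"
)

nrow(pew_wide) * (ncol(pew_wide) - 1)  # 60 data values in the wide form
nrow(pew_long)                         # 60 rows in the long form
```

The long form has only three columns (religion, income, count), so it is narrower as well as longer, which is what makes it convenient to filter, group, and plot.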