Chapter 11 dplyr and the tidy data concept
Data wrangling refers to the process of manipulating raw data into the format we want, for example for data visualisation or statistical analyses. There are many ways we may want to manipulate our data, for example by creating new variables, subsetting the data, or calculating summaries. Data wrangling is often a time-consuming process. It is also not the most interesting part of any analysis: we are interested in answering biological questions, not in formatting data. However, it is a necessary step on the way to the analyses we're really interested in. Learning how to manipulate data efficiently can save us a lot of time and trouble and is therefore a really important skill to master.
11.2 The value of dplyr
The dplyr package has been carefully designed to make it easier to manipulate data frames and other, similar kinds of objects. A key reason for its ease of use is that dplyr is very consistent in the way its functions work. For example, the first argument of the main dplyr functions is always an object containing our data. This consistency makes it very easy to get to grips with each of the main dplyr functions: it's often possible to understand how one works after seeing just one or two examples of its use.
A second reason for favouring dplyr is that it is orientated around a few core functions, each of which is designed to do one thing well. The key dplyr functions are often referred to as "verbs", reflecting the fact that they "do something" to data. For example:

1. select is used to obtain a subset of variables;
2. mutate is used to construct new variables;
3. filter is used to obtain a subset of rows;
4. arrange is used to reorder rows; and
5. summarise is used to calculate information about groups.

We'll cover each of these verbs in detail in later chapters, as well as a few additional helper functions.
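To give a flavour of how these verbs read in practice, here is a minimal sketch. The data frame and its variables are made up for illustration; only the function names come from the list above:

```r
library(dplyr)

# a small made-up data frame for illustration only
plants <- data.frame(
  Treatment = c("Control", "Control", "Fertiliser", "Fertiliser"),
  Biomass   = c(284, 328, 956, 954)
)

select(plants, Treatment)                 # obtain a subset of variables (columns)
mutate(plants, LogBiomass = log(Biomass)) # construct a new variable
filter(plants, Treatment == "Control")    # obtain a subset of rows
arrange(plants, Biomass)                  # reorder the rows
summarise(group_by(plants, Treatment),    # calculate information about groups
          MeanBiomass = mean(Biomass))
```

Notice that in every case the first argument is the object containing our data, which is the consistency mentioned above.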
Apart from being easy to use, dplyr is also fast compared to base R functions. This won’t matter much for the small data sets we use in this book, but dplyr is a good option for large data sets. The dplyr package also allows us to work with data stored in different ways, for example, by interacting directly with a number of database systems. We won’t work with anything other than data frames (and the closely-related “tibble”) but it is worth knowing about this facility. Learning to use dplyr with data frames makes it easy to work with these other kinds of data objects.
A dplyr cheat sheet
The developers of RStudio have produced a very usable cheat sheet that summarises the main data wrangling tools provided by dplyr. Our advice is to download this, print out a copy, and refer to it often as you start working with dplyr.
11.3 Tidy data
dplyr will work with any data frame, but it’s at its most powerful when our data are organised as tidy data. The word “tidy” has a very specific meaning in this context. Tidy data has a specific structure that makes it easy to manipulate, model and visualise. A tidy data set is one where each variable is in only one column and each row contains only one observation. This might seem like the “obvious” way to organise data, but many people fail to adopt this convention.
We aren’t going to explore the tidy data concept in great detail, but the basic principles are not difficult to understand. We’ll use an example to illustrate what the “one variable = one column” and “one observation = one row” idea means. Let’s return to the made-up experiment investigating the response of communities to fertilizer addition. This time, imagine we had only measured biomass, but that we had measured it twice over the course of the experiment.
We’ll examine some artificial data for the experiment and look at two ways to organise it to help us understand the tidy data idea. The first way uses a separate column for each biomass measurement:
##   Treatment BiomassT1 BiomassT2
## 1   Control       284       324
## 2   Control       328       400
## 3   Control       291       355
## 4 Fertilser       956      1197
## 5 Fertilser       954      1012
## 6 Fertilser       685       859
This often seems like the natural way to store such data, especially for experienced Excel users. However, this format is not tidy. Why? The biomass variable has been split across two columns (“BiomassT1” and “BiomassT2”), which means each row corresponds to two observations.
We won’t go into the “whys” here, but take our word for it: adopting this format makes it difficult to efficiently work with data. This is not really an R-specific problem. This non-tidy format is sub-optimal in many different data analysis environments.
A tidy version of the example data set would still have three columns, but now these would be: “Treatment”, denoting the experimental treatment applied; “Time”, denoting the sampling occasion; and “Biomass”, denoting the biomass measured:
##    Treatment Time Biomass
## 1    Control   T1     284
## 2    Control   T1     328
## 3    Control   T1     291
## 4  Fertilser   T1     956
## 5  Fertilser   T1     954
## 6  Fertilser   T1     685
## 7    Control   T2     324
## 8    Control   T2     400
## 9    Control   T2     355
## 10 Fertilser   T2    1197
## 11 Fertilser   T2    1012
## 12 Fertilser   T2     859
These data are tidy: each variable is in only one column, and each observation has its own unique row. These data are well-suited to use with dplyr.
Always try to start with tidy data
The best way to make sure your data set is tidy is to store it in that format when it's first collected and recorded. There are packages that can help convert non-tidy data into the tidy data format (e.g. the tidyr package), but life is much simpler if we just make sure our data are tidy from the very beginning.
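As a sketch of the kind of conversion tidyr can do, the gather function reshapes the wide-format biomass data shown earlier into a tidy key/value pair of columns. The name biomass_wide is invented here for illustration:

```r
library(tidyr)

# wide-format ("non-tidy") version of the example data; the name
# 'biomass_wide' is made up for this illustration
biomass_wide <- data.frame(
  Treatment = c("Control", "Control", "Control",
                "Fertilser", "Fertilser", "Fertilser"),
  BiomassT1 = c(284, 328, 291, 956, 954, 685),
  BiomassT2 = c(324, 400, 355, 1197, 1012, 859)
)

# gather the two biomass columns into a key column ('Time') and a
# value column ('Biomass')
biomass_tidy <- gather(biomass_wide,
                       key = Time, value = Biomass,
                       BiomassT1, BiomassT2)
```

The Time column produced this way contains the old column names ("BiomassT1", "BiomassT2") rather than "T1" and "T2", so a little extra recoding would be needed to match the tidy table above exactly.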
11.4 A quick look at dplyr
We'll finish up this chapter by taking a quick look at a few features of the dplyr package, before really drilling down into how it works. The package is not part of the base R installation, so we have to install it first via install.packages("dplyr"). Remember, we only have to install dplyr once, so there's no need to leave the install.packages line in a script that uses the package. We do, however, have to add a library line to the top of any script that uses the package, to load and attach it:
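That is, every script that uses dplyr should begin with:

```r
library(dplyr)
```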
We need some data to work with. We'll use two data sets to illustrate the key ideas in the next few chapters: the iris data set in the datasets package and the storms data set in the nasaweather package.
The datasets package ships with R and is loaded and attached at start-up, so there's no need to do anything to make iris available. The nasaweather package doesn't ship with R, so it needs to be installed via install.packages("nasaweather"). Finally, we have to add a library line to the top of our script to load and attach the package:
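That line is:

```r
library(nasaweather)
```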
The nasaweather package is a bare-bones data package. It doesn't contain any new R functions, just data. We'll be using the storms data set from nasaweather: this contains information about tropical storms in North America from 1995 to 2000. We're just using it as a convenient example to illustrate the workings of the dplyr and, later, the ggplot2 packages.
11.4.1 Tibbles
The primary purpose of the dplyr package is to make it easier to manipulate data interactively. To facilitate this kind of work, dplyr implements a special kind of data object known as a tbl (pronounced "tibble"). We can think of a tibble as a special data frame with a few extra bells and whistles.
We can convert an ordinary data frame to a tibble using the tbl_df function. It's a good idea (though not necessary) to convert ordinary data frames to tibbles. Why? When a data frame is printed to the Console, R will try to print every column and row until it reaches a (very large) maximum permitted amount of output. The result is a mess of text that's virtually impossible to make sense of. In contrast, a tibble is printed to the Console in a compact way. To see this, we can convert the iris data set to a tibble using tbl_df and then print the resulting object to the Console:
# make a "tibble" copy of iris
iris_tbl <- tbl_df(iris)
# print it
iris_tbl
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # … with 140 more rows
Notice that only the first 10 rows are printed. This is much nicer than trying to wade through every row of a data frame.
Sometimes we just need a quick, compact summary of a data frame or tibble. This is the job of the glimpse function from dplyr. The glimpse function is very similar to the base R str function:
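The summary shown below was produced by calling glimpse on the tibble created above:

```r
glimpse(iris_tbl)
```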
## Observations: 150
## Variables: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
The function takes one argument: the name of a data frame or tibble. It then tells us how many rows it has, how many variables there are, what these variables are called, and what kind of data are associated with each variable. This function is useful when we’re working with a data set containing many variables.