Chapter 16 Building pipelines
This chapter introduces the pipe operator, %>%. We don’t often use the various dplyr verbs in isolation. Instead, starting with our raw data, we combine them in a sequence to prepare the data for further analysis (e.g. making a plot, calculating summaries, fitting a statistical model, and so on). The job of the pipe operator is to make the data wrangling part of such a workflow as transparent as possible.
16.1 Why do we need ‘pipes’?
We’ve seen that we can carry out calculations on a per-group basis by grouping a tibble, assigning the result a name, and then applying the summarise function to the new tibble. For example, if we need the mean wind speed for every storm recorded in storms_tbl, we could use:
# 1. make a grouped copy of the storms data
storms_grouped <- group_by(storms_tbl, name)
# 2. calculate the mean wind speed for each storm
summarise(storms_grouped, mean.wind = mean(wind))
## # A tibble: 79 × 2
## name mean.wind
## <chr> <dbl>
## 1 Alberto 63.04598
## 2 Alex 35.38462
## 3 Allison 44.39394
## 4 Ana 32.10526
## 5 Arlene 39.03846
## 6 Arthur 35.22727
## 7 Barry 39.76190
## 8 Bertha 60.00000
## 9 Beryl 36.11111
## 10 Bill 50.55556
## # ... with 69 more rows
There’s nothing wrong with this way of doing things. However, this approach to building up an analysis is quite verbose—especially if an analysis involves more than a couple of steps—because we have to keep storing intermediate steps. It also tends to clutter the global environment with lots of data objects we don’t need.
One way to make things more concise is to use a nested function call (we examined these in the Using functions chapter), like this:
summarise(group_by(storms_tbl, type), mean.wind = mean(wind))
## # A tibble: 4 × 2
## type mean.wind
## <chr> <dbl>
## 1 Extratropical 40.06068
## 2 Hurricane 84.65960
## 3 Tropical Depression 27.35867
## 4 Tropical Storm 47.32181
Here we placed the group_by function call inside the list of arguments to summarise. Remember, you have to read nested function calls from the inside out to understand what they are doing. Apart from grouping by type rather than name, this does exactly the same job as the two-step version: we get the grouped summary without having to store intermediate data. However, there are a couple of good reasons why this approach is not advised:
Experienced R users probably don’t mind this approach because they’re used to nested function calls. Nonetheless, no reasonable person would argue that nesting functions inside one another is intuitive. Reading outward from the inside of a large number of nested functions is hard work.
Even for experienced R users, function nesting is a fairly error-prone approach. For example, it’s very easy to accidentally put an argument or two on the wrong side of a closing ). If we’re lucky this will produce an error and we’ll catch the problem. If we’re not, we may just end up with nonsense in the output.
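To make the second point concrete, here is a minimal sketch of a misplaced parenthesis that runs without any error. The toy_storms tibble is a made-up stand-in for storms_tbl (its values are invented purely for illustration) so that the example runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for storms_tbl (values are invented)
toy_storms <- tibble(
  type = c("Tropical Storm", "Tropical Storm", "Hurricane",
           "Hurricane", "Tropical Storm", "Hurricane"),
  wind = c(40, 50, 90, 100, 60, 80)
)

# Intended call: group_by() is closed before the summary is specified
right <- summarise(group_by(toy_storms, type), mean.wind = mean(wind))

# Misplaced ")": mean.wind = mean(wind) is now an argument of group_by(),
# so it is computed over ALL rows and used as an extra grouping variable
wrong <- summarise(group_by(toy_storms, type, mean.wind = mean(wind)))

right$mean.wind  # one mean per type: 90 and 50
wrong$mean.wind  # the overall mean (70) repeated for every type
```

Both calls return a two-row tibble with the same column names, which is exactly why this kind of slip is easy to miss.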
There’s a third option for combining several functions that has the dual benefit of keeping our code concise and readable, while avoiding the need to clutter the global environment with intermediate objects. This third approach involves something called the “pipe” operator, %>% (no spaces allowed between the three characters). This isn’t part of base R. Instead, it comes from a package called magrittr, but there’s no need to install or load that separately if we’re using dplyr, because dplyr depends on magrittr and makes %>% available for us.
The %>% operator has become very popular in recent years, mainly because it allows us to specify a chain of function calls in a (reasonably) human-readable format. Here’s how we write the previous example using the pipe operator %>%:
storms_tbl %>% group_by(., type) %>% summarise(., mean.wind = mean(wind))
## # A tibble: 4 × 2
## type mean.wind
## <chr> <dbl>
## 1 Extratropical 40.06068
## 2 Hurricane 84.65960
## 3 Tropical Depression 27.35867
## 4 Tropical Storm 47.32181
How do we make sense of this? Every time we see the %>% operator it means the following: take whatever is produced by the left-hand expression and use it as an argument in the function on the right-hand side. The . serves as a placeholder for the location of the corresponding argument. This means we can understand what a sequence of calculations is doing by reading from left to right, just as we would read the words in a book. This example says: take the storms_tbl data, group it by type, then take the resulting grouped tibble and apply the summarise function to it to calculate the mean of wind. It is exactly the same calculation we did above.
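One way to convince ourselves that the piped and nested versions really are the same calculation is to run both and compare the results. The sketch below does this with toy_storms, a small made-up tibble used here in place of storms_tbl so that the code is self-contained.

```r
library(dplyr)

# Hypothetical toy data standing in for storms_tbl (values are invented)
toy_storms <- tibble(
  type = c("Tropical Storm", "Tropical Storm", "Hurricane",
           "Hurricane", "Tropical Storm", "Hurricane"),
  wind = c(40, 50, 90, 100, 60, 80)
)

# Nested version: read from the inside out
nested <- summarise(group_by(toy_storms, type), mean.wind = mean(wind))

# Piped version with the explicit '.' placeholder: read from left to right
piped <- toy_storms %>% group_by(., type) %>% summarise(., mean.wind = mean(wind))

# Compare the two results
identical(nested, piped)
```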
When using the pipe operator we can often leave out the . placeholder. Remember, this marks the argument of the function on the right of %>% that receives the result from the left of %>%. If we choose to leave out the ., the pipe operator assumes we meant to slot the result into the first argument. This means we can simplify our example even more:
storms_tbl %>% group_by(type) %>% summarise(mean.wind = mean(wind))
## # A tibble: 4 × 2
## type mean.wind
## <chr> <dbl>
## 1 Extratropical 40.06068
## 2 Hurricane 84.65960
## 3 Tropical Depression 27.35867
## 4 Tropical Storm 47.32181
This is why the first argument of a dplyr verb is always the data object. This convention ensures that we can use %>% without explicitly specifying the argument to match against.
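The placeholder earns its keep when a function does not take the data as its first argument. The sketch below uses lm from base R, whose first argument is a model formula, as an example of this. The toy_storms tibble and its pressure column are invented for illustration and are not taken from storms_tbl.

```r
library(dplyr)

# Hypothetical toy data (values are invented for illustration)
toy_storms <- tibble(
  wind     = c(40, 50, 90, 100, 60, 80),
  pressure = c(1005, 1000, 960, 950, 995, 975)
)

# lm() expects a formula first, so we use '.' to send the piped
# tibble to the data argument instead of the first argument
fit_piped <- toy_storms %>% lm(wind ~ pressure, data = .)

# The equivalent call without the pipe
fit_plain <- lm(wind ~ pressure, data = toy_storms)

all.equal(coef(fit_piped), coef(fit_plain))
```

With dplyr verbs we never need this, because the data always goes first.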
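Remember, R mostly ignores white space, which means we can break a chained set of function calls over several lines if it becomes too long. The one thing to watch is that each unfinished line has to end with %>%, so that R knows there is more to come: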
storms_tbl %>%
  group_by(type) %>%
  summarise(mean.wind = mean(wind))
## # A tibble: 4 × 2
## type mean.wind
## <chr> <dbl>
## 1 Extratropical 40.06068
## 2 Hurricane 84.65960
## 3 Tropical Depression 27.35867
## 4 Tropical Storm 47.32181
In fact, many dplyr users always place each part of a pipeline onto a new line to help with overall readability.
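There is one catch when splitting a pipeline over lines: the %>% has to sit at the end of a line, not at the start of the next one, otherwise R treats the first line as a complete expression. The sketch below illustrates this with a made-up toy_storms tibble (not the real storms_tbl), using parse to check the badly formatted version without actually running it.

```r
library(dplyr)

# Hypothetical toy data standing in for storms_tbl (values are invented)
toy_storms <- tibble(
  type = c("Tropical Storm", "Tropical Storm", "Hurricane",
           "Hurricane", "Tropical Storm", "Hurricane"),
  wind = c(40, 50, 90, 100, 60, 80)
)

# Works: every unfinished line ends with %>%
n_strong <- toy_storms %>%
  filter(wind > 50) %>%
  nrow()

# Does not work: the second line STARTS with %>%, so R sees
# 'toy_storms' as a finished expression and then a stray operator
bad_code <- "toy_storms\n  %>% nrow()"
parse_result <- tryCatch(
  {
    parse(text = bad_code)
    "parsed"
  },
  error = function(e) "syntax error"
)

n_strong      # 4 rows have wind above 50
parse_result  # "syntax error"
```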
Finally, when we need to assign the result of a chain of functions we have to break the left-to-right rule a bit, placing the assignment at the beginning:
new_data <-
  storms_tbl %>%
  group_by(type) %>%
  summarise(mean.wind = mean(wind))
(Actually, there is a rightward assignment operator, ->, but let’s not worry about that.)
Why is %>% called the ‘pipe’ operator?
The %>% operator takes the output from one function and “pipes it” to another as the input. It’s called ‘the pipe’ for the simple reason that it allows us to create an analysis ‘pipeline’ from a series of function calls. Incidentally, if you Google the phrase “Magritte pipe” you’ll see that magrittr is a very clever name for an R package.
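The piping idea is not specific to tibbles. As a minimal sketch, here is the same pattern with a plain numeric vector and two base R functions (the numbers are arbitrary and chosen only to keep the arithmetic easy to follow):

```r
library(dplyr)  # makes %>% available

# Pipe a vector into sqrt(), then pipe that result into sum()
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# The same calculation as a nested call
nested <- sum(sqrt(c(1, 4, 9)))

piped   # 1 + 2 + 3 = 6
nested  # 6
```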
One final piece of advice: learn how to use the %>% method of chaining together functions. Why? Because it’s the simplest and cleanest method for doing this, many of the examples in the dplyr help files and on the web use it, and the majority of people carrying out real world data wrangling with dplyr rely on piping.