Chapter 15 Grouping and summarising data

This chapter will explore the summarise and group_by verbs. These two verbs are considered together because they are often used together, and their usage is quite distinct from the other dplyr verbs we’ve encountered:

The group_by function adds information into a data object (e.g. a data frame or tibble), which makes subsequent calculations happen on a group-specific basis.
The summarise function is a data reduction function calculates single-number summaries of one or more variables, respecting the group structure if present.

15.0.1 Getting ready

We can start a new script by loading and attaching the dplyr package:

library("dplyr")

We’re going to use both the storms and iris data sets in the nasaweather and datasets packages, respectively. The datasets package ships is automatically loaded and attached at start up, so we need to make the nasaweather package available:

library("nasaweather")

Finally, let’s convert both data sets to a tibble so they print to the Console cleanly:

storms_tbl <- tbl_df(storms)
iris_tbl   <- tbl_df(iris)

15.1 Summarising variables with `summarise`

We use summarise to calculate summaries of variables in an object containing our data. We do this kind of calculation all the time when analysing data. In terms of pseudo-code, usage of summarise looks like this:

summarise(data_set, <expression1>, <expression2>, ...)

The first argument, data_set, must be the name of the data frame or tibble containing our data. We then include a series of one or more additional arguments, each of these is a valid R expression involving at least one variable in data_set. These are given by the pseudo-code placeholder <expression1>, <expression2>, ..., where <expression1> and <expression2> represent the first two expressions, and the ... is acting as placeholder for the remaining expressions. These expressions can be any calculation involving R functions. The only constraint is that they must generate a single value when evaluated.

That last sentence was important. It’s easy to use summarise if can we remember one thing: summarise is designed to work with functions that take a vector as their input and return a single value (i.e. a vector of length one). Any calculation that does this can be used with summarise.

The summarise verb is best understood by example. The R function called mean takes a vector of numbers (several numbers) and calculates their arithmetic mean (one number). We can use mean with summarise to calculate the mean of the Petal.Length and Petal.Width variables in iris_tbl like this:

summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))

## # A tibble: 1 x 2
##   `mean(Petal.Length)` `mean(Petal.Width)`
##                  <dbl>               <dbl>
## 1                 3.76                1.20

Notice what kind of object summarise returns: it’s a tibble with only one row and two columns. There are two columns because we calculated two means, and there is one row containing these means. Simple. There are a few other things to note about how summarise works:

As with all dplyr functions, the expression that performs the required summary calculation is not surrounded by quotes because it is an expression that it “does some calculations”.
The order of the expression in the resulting tibble is the same as the order in which they were used as arguments.
Even though the dimensions of the output object have changed, summarise returns the same kind of data object as its input. It returns a data frame if our data was originally in a data frame, or a tibble if it was in a tibble.

Notice that summarise used the expressions to name the variables. Variable names like mean(Petal.Length) and mean(Petal.Width) are not very helpful. They’re quite long for one. More problematically, they contain special reserved characters like (, which makes referring to columns in the resulting tibble more difficult than it needs to be:

# make a summary tibble an assign it a name
iris_means <- summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))
# extract the mean petal length
iris_means$`mean(Petal.Length)`

## [1] 3.758

We have to place ‘back ticks’ (as above) or ordinary quotes around the name to extract the new column when it includes special characters.

It’s better to avoid using the default names. The summarise function can name the new variables at the same time as they are created. Predictably, we do this by naming the arguments using =, placing the name we require on the left hand side. For example:

summarise(iris_tbl, Mean_PL = mean(Petal.Length), Mean_PW = mean(Petal.Width))

## # A tibble: 1 x 2
##   Mean_PL Mean_PW
##     <dbl>   <dbl>
## 1    3.76    1.20

There are very many base R functions that can be used with summarise. A few useful ones for calculating summaries of numeric variables are:

min and max calculate the minimum and maximum values of a vector.
mean and median calculate the mean and median of a numeric vector.
sd and var calculate the standard deviation and variance of a numeric vector.

We can combine more than one function in a summarise expression as long as it returns a single number. This means we can do arbitrarily complicated calculations in a single step. For example, if we need to know the ratio of the mean and median values of petal length and petal width in iris_tbl, we use:

summarise(iris_tbl,
          Mn_Md_PL = mean(Petal.Length) / median(Petal.Length),
          Mn_Md_PW = mean(Petal.Width)  / median(Petal.Width))

## # A tibble: 1 x 2
##   Mn_Md_PL Mn_Md_PW
##      <dbl>    <dbl>
## 1    0.864    0.923

Notice that we placed each argument on a separate line in this example. This is just a style issue—we don’t have to do this, but since R doesn’t care about white space, we can use new lines and spaces to keep everything a bit more more human-readable. It pays to organise summarise calculations like this as they become longer. It allows us to see the logic of the calculations more easily, and helps us spot potential errors when they occur.

15.1.1 Helper functions

There are a small number dplyr helper functions that can be used with summarise. These generally provide summaries that aren’t available directly using base R functions. For example, the n_distinct function is used to calculate the number of distinct values in a variable:

summarise(iris_tbl, Num.PL.Vals = n_distinct(Petal.Length))

## # A tibble: 1 x 1
##   Num.PL.Vals
##         <int>
## 1          43

This tells us that there are 43 unique values of Petal.Length. We won’t explore any others here. The handy cheat sheat is worth looking over to see what additional options are available.

15.2 Grouped operations using `group_by`

Performing a calculation with one or more variables over the whole data set is useful, but very often we also need to carry out an operation on different subsets of our data. For example, it’s probably more useful to know how the mean sepal and petal traits vary among the different species in the iris_tbl data set, rather than knowing the overall mean of these traits. We could calculate separate means by using filter to create different subsets of iris_tbl, and then using summary on each of these to calculate the relevant means. This would get the job done, but it’s not very efficient and very soon becomes tiresome when we have to work with many groups.

The group_by function provides a more elegant solution to this kind of problem. It doesn’t do all that much on its own though. All the group_by function does is add a bit of grouping information to a tibble or data frame. In effect, it defines subsets of data on the basis of one or more grouping variables. The magic happens when the grouped object is used with a dplyr verb like summarise or mutate. Once a data frame or tibble has been tagged with grouping information, operations that involve these (and other) verbs are carried out on separate subsets of the data, where the subsets correspond to the different values of the grouping variable(s).

Basic usage of group_by looks like this:

group_by(data_set, vname1, vname2, ...)

The first argument, data_set (“data object”), must be the name of the object containing our data. We then have to include one or more additional arguments, where each of these is the name of a variable in data_set. I have expressed this as vname1, vname2, ..., where vname1 and vname2 are names of the first two variables, and the ... is acting as placeholder for the remaining variables.

As usual, it’s much easier to understand how group_by works once we’ve seen it in action. We’ll illustrate group_by by using it alongside summarise with the storms_tbl data set. We’re aiming to calculate the mean wind speed for every type of storm. The first step is to use group_by to add grouping information to storms_tbl:

group_by(storms_tbl, type)

## # A tibble: 2,747 x 11
## # Groups:   type [4]
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows

Compare this to the output produced when we print the original storms_tbl data set:

storms_tbl

## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows

There is almost no change in the printed information—group_by really doesn’t do much on its own. The main change is that the tibble resulting from the group_by operation has a little bit of additional information printed at the top: Groups: type [4]. The Groups: type part of this tells us that the tibble is grouped by the type variable and nothing else. The [4] part tells us that there are 4 different groups.

The only thing group_by did was add this grouping information to a copy of storms_tbl. The original storms_tbl object was not altered in any way. If we actually want to do anything useful useful with the result we need to assign it a name so that we can work with it:

storms_grouped <- group_by(storms_tbl, type)

Now we have a grouped tibble called storms_grouped, where the groups are defined by the values of type. Any operations on this tibble will now be performed on a “by group” basis. To see this in action, we use summarise to calculate the mean wind speed:

summarise(storms_grouped, mean.wind = mean(wind))

## # A tibble: 4 x 2
##   type                mean.wind
##   <chr>                   <dbl>
## 1 Extratropical            40.1
## 2 Hurricane                84.7
## 3 Tropical Depression      27.4
## 4 Tropical Storm           47.3

When we used summarise on an ungrouped tibble the result was a tibble with one row: the overall global mean. Now the resulting tibble has four rows, one for each value of type: The type variable in the new tibble tells us what these values are; the mean.wind variable shows the mean wind speed for each value.

15.2.1 More than one grouping variable

What if we need to calculate summaries using more than one grouping variable? The workflow is unchanged. Let’s assume we want to know the mean wind speed and atmospheric pressure associated with each storm type in each year. We first make a grouped copy of the data set with the appropriate grouping variables:

# group the storms_tbl data by storm year + assign the result a name
storms_grouped <- group_by(storms_tbl, type, year)
# 
storms_grouped

## # A tibble: 2,747 x 11
## # Groups:   type, year [24]
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows

We grouped the storms_tbl data by type and year and assigned the grouped tibble the name storms_grouped. When we print this to the Console we see Groups: type, year [24] near the top, which tells us that the tibble is grouped by two variables with 24 unique combinations of values. We then calculate the mean wind speed and pressure of each storm type in each year:

summarise(storms_grouped, 
          mean_wind     = mean(wind), 
          mean_pressure = mean(pressure))

## # A tibble: 24 x 4
## # Groups:   type [?]
##    type           year mean_wind mean_pressure
##    <chr>         <int>     <dbl>         <dbl>
##  1 Extratropical  1995      38.7          995.
##  2 Extratropical  1996      40.4          991.
##  3 Extratropical  1997      38.9         1000.
##  4 Extratropical  1998      42.9          991.
##  5 Extratropical  1999      38.9          992.
##  6 Extratropical  2000      39.7          997.
##  7 Hurricane      1995      82.0          970.
##  8 Hurricane      1996      85.5          969.
##  9 Hurricane      1997      80.4          976.
## 10 Hurricane      1998      87.0          972.
## # … with 14 more rows

This calculates mean wind speed and atmospheric pressure for different combination of type and year. The first line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 38.7 mph and 995 millibars, the second line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 40.4 mph and 991 millibars, and so on. There are 24 rows in total because there were 24 unique combinations of type and year in the original storms_tbl.

15.2.2 Using `group_by` with other verbs

The summarise function is the only dplyr verb we’ll use with grouped tibbles in this book. However, all the main verbs alter their behaviour to respect group identity when used with tibbles with grouping information. When mutate or transmute are used with a grouped object they still add new variables, but now the calculations occur “by group”. Here’s an example using transmute:

# group the storms data by storm name + assign the result a name
storms_grouped <- group_by(storms_tbl, name)
# create a data set 'mean centred' wind speed variable
transmute(storms_grouped, wind_centred = wind - mean(wind))

## # A tibble: 2,747 x 2
## # Groups:   name [79]
##    name    wind_centred
##    <chr>          <dbl>
##  1 Allison       -14.4 
##  2 Allison       -14.4 
##  3 Allison        -9.39
##  4 Allison        -4.39
##  5 Allison         5.61
##  6 Allison        15.6 
##  7 Allison        20.6 
##  8 Allison        20.6 
##  9 Allison        20.6 
## 10 Allison        15.6 
## # … with 2,737 more rows

In this example we calculated the “group mean-centered” version of the wind variable. The new wind_centred variable contains the difference between the wind speed and the mean of whichever storm type is associated with the observation.

15.3 Removing grouping information

On occasion it’s necessary to remove grouping information from a data object. This is most often done when working with “pipes” (the topic of the next chapter) when we need to revert back to operating on the whole data set. The ungroup function removes grouping information:

ungroup(storms_grouped)

## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Allis…  1995     6     3     0  17.4 -84.3     1005    30 Tropi…       3
##  2 Allis…  1995     6     3     6  18.3 -84.9     1004    30 Tropi…       3
##  3 Allis…  1995     6     3    12  19.3 -85.7     1003    35 Tropi…       3
##  4 Allis…  1995     6     3    18  20.6 -85.8     1001    40 Tropi…       3
##  5 Allis…  1995     6     4     0  22   -86        997    50 Tropi…       4
##  6 Allis…  1995     6     4     6  23.3 -86.3      995    60 Tropi…       4
##  7 Allis…  1995     6     4    12  24.7 -86.2      987    65 Hurri…       4
##  8 Allis…  1995     6     4    18  26.2 -86.2      988    65 Hurri…       4
##  9 Allis…  1995     6     5     0  27.6 -86.1      988    65 Hurri…       5
## 10 Allis…  1995     6     5     6  28.5 -85.6      990    60 Tropi…       5
## # … with 2,737 more rows

Looking at the top right of the printed summary, we can see that the Group: part is now gone—the ungroup function effectively created a copy of storms_grouped that is identical to the original storms_tbl tibble.