Chapter 15 Grouping and summarising data
This chapter will explore the summarise
and group_by
verbs. These two verbs are considered together because they are often used together, and their usage is quite distinct from the other dplyr verbs we’ve encountered:
The
group_by
function adds information into a data object (e.g. a data frame or tibble), which makes subsequent calculations happen on a group-specific basis.The
summarise
function is a data reduction function calculates single-number summaries of one or more variables, respecting the group structure if present.
15.0.1 Getting ready
We can start a new script by loading and attaching the dplyr package:
library("dplyr")
We’re going to use both the storms
and iris
data sets in the nasaweather and datasets packages, respectively. The datasets package ships is automatically loaded and attached at start up, so we need to make the nasaweather package available:
library("nasaweather")
Finally, let’s convert both data sets to a tibble so they print to the Console cleanly:
storms_tbl <- tbl_df(storms)
iris_tbl <- tbl_df(iris)
15.1 Summarising variables with summarise
We use summarise
to calculate summaries of variables in an object containing our data. We do this kind of calculation all the time when analysing data. In terms of pseudo-code, usage of summarise
looks like this:
summarise(data_set, <expression1>, <expression2>, ...)
The first argument, data_set
, must be the name of the data frame or tibble containing our data. We then include a series of one or more additional arguments, each of these is a valid R expression involving at least one variable in data_set
. These are given by the pseudo-code placeholder <expression1>, <expression2>, ...
, where <expression1>
and <expression2>
represent the first two expressions, and the ...
is acting as placeholder for the remaining expressions. These expressions can be any calculation involving R functions. The only constraint is that they must generate a single value when evaluated.
That last sentence was important. It’s easy to use summarise
if can we remember one thing: summarise
is designed to work with functions that take a vector as their input and return a single value (i.e. a vector of length one). Any calculation that does this can be used with summarise
.
The summarise
verb is best understood by example. The R function called mean
takes a vector of numbers (several numbers) and calculates their arithmetic mean (one number). We can use mean
with summarise
to calculate the mean of the Petal.Length
and Petal.Width
variables in iris_tbl
like this:
summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))
## # A tibble: 1 x 2
## `mean(Petal.Length)` `mean(Petal.Width)`
## <dbl> <dbl>
## 1 3.76 1.20
Notice what kind of object summarise
returns: it’s a tibble with only one row and two columns. There are two columns because we calculated two means, and there is one row containing these means. Simple. There are a few other things to note about how summarise
works:
As with all dplyr functions, the expression that performs the required summary calculation is not surrounded by quotes because it is an expression that it “does some calculations”.
The order of the expression in the resulting tibble is the same as the order in which they were used as arguments.
Even though the dimensions of the output object have changed,
summarise
returns the same kind of data object as its input. It returns a data frame if our data was originally in a data frame, or a tibble if it was in a tibble.
Notice that summarise
used the expressions to name the variables. Variable names like mean(Petal.Length)
and mean(Petal.Width)
are not very helpful. They’re quite long for one. More problematically, they contain special reserved characters like (
, which makes referring to columns in the resulting tibble more difficult than it needs to be:
# make a summary tibble an assign it a name
iris_means <- summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))
# extract the mean petal length
iris_means$`mean(Petal.Length)`
## [1] 3.758
We have to place ‘back ticks’ (as above) or ordinary quotes around the name to extract the new column when it includes special characters.
It’s better to avoid using the default names. The summarise
function can name the new variables at the same time as they are created. Predictably, we do this by naming the arguments using =
, placing the name we require on the left hand side. For example:
summarise(iris_tbl, Mean_PL = mean(Petal.Length), Mean_PW = mean(Petal.Width))
## # A tibble: 1 x 2
## Mean_PL Mean_PW
## <dbl> <dbl>
## 1 3.76 1.20
There are very many base R functions that can be used with summarise
. A few useful ones for calculating summaries of numeric variables are:
min
andmax
calculate the minimum and maximum values of a vector.mean
andmedian
calculate the mean and median of a numeric vector.sd
andvar
calculate the standard deviation and variance of a numeric vector.
We can combine more than one function in a summarise
expression as long as it returns a single number. This means we can do arbitrarily complicated calculations in a single step. For example, if we need to know the ratio of the mean and median values of petal length and petal width in iris_tbl
, we use:
summarise(iris_tbl,
Mn_Md_PL = mean(Petal.Length) / median(Petal.Length),
Mn_Md_PW = mean(Petal.Width) / median(Petal.Width))
## # A tibble: 1 x 2
## Mn_Md_PL Mn_Md_PW
## <dbl> <dbl>
## 1 0.864 0.923
Notice that we placed each argument on a separate line in this example. This is just a style issue—we don’t have to do this, but since R doesn’t care about white space, we can use new lines and spaces to keep everything a bit more more human-readable. It pays to organise summarise
calculations like this as they become longer. It allows us to see the logic of the calculations more easily, and helps us spot potential errors when they occur.
15.1.1 Helper functions
There are a small number dplyr helper functions that can be used with summarise
. These generally provide summaries that aren’t available directly using base R functions. For example, the n_distinct
function is used to calculate the number of distinct values in a variable:
summarise(iris_tbl, Num.PL.Vals = n_distinct(Petal.Length))
## # A tibble: 1 x 1
## Num.PL.Vals
## <int>
## 1 43
This tells us that there are 43 unique values of Petal.Length
. We won’t explore any others here. The handy cheat sheat is worth looking over to see what additional options are available.
15.2 Grouped operations using group_by
Performing a calculation with one or more variables over the whole data set is useful, but very often we also need to carry out an operation on different subsets of our data. For example, it’s probably more useful to know how the mean sepal and petal traits vary among the different species in the iris_tbl
data set, rather than knowing the overall mean of these traits. We could calculate separate means by using filter
to create different subsets of iris_tbl
, and then using summary
on each of these to calculate the relevant means. This would get the job done, but it’s not very efficient and very soon becomes tiresome when we have to work with many groups.
The group_by
function provides a more elegant solution to this kind of problem. It doesn’t do all that much on its own though. All the group_by
function does is add a bit of grouping information to a tibble or data frame. In effect, it defines subsets of data on the basis of one or more grouping variables. The magic happens when the grouped object is used with a dplyr verb like summarise
or mutate
. Once a data frame or tibble has been tagged with grouping information, operations that involve these (and other) verbs are carried out on separate subsets of the data, where the subsets correspond to the different values of the grouping variable(s).
Basic usage of group_by
looks like this:
group_by(data_set, vname1, vname2, ...)
The first argument, data_set
(“data object”), must be the name of the object containing our data. We then have to include one or more additional arguments, where each of these is the name of a variable in data_set
. I have expressed this as vname1, vname2, ...
, where vname1
and vname2
are names of the first two variables, and the ...
is acting as placeholder for the remaining variables.
As usual, it’s much easier to understand how group_by
works once we’ve seen it in action. We’ll illustrate group_by
by using it alongside summarise
with the storms_tbl
data set. We’re aiming to calculate the mean wind speed for every type of storm. The first step is to use group_by
to add grouping information to storms_tbl
:
group_by(storms_tbl, type)
## # A tibble: 2,747 x 11
## # Groups: type [4]
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allis… 1995 6 3 0 17.4 -84.3 1005 30 Tropi… 3
## 2 Allis… 1995 6 3 6 18.3 -84.9 1004 30 Tropi… 3
## 3 Allis… 1995 6 3 12 19.3 -85.7 1003 35 Tropi… 3
## 4 Allis… 1995 6 3 18 20.6 -85.8 1001 40 Tropi… 3
## 5 Allis… 1995 6 4 0 22 -86 997 50 Tropi… 4
## 6 Allis… 1995 6 4 6 23.3 -86.3 995 60 Tropi… 4
## 7 Allis… 1995 6 4 12 24.7 -86.2 987 65 Hurri… 4
## 8 Allis… 1995 6 4 18 26.2 -86.2 988 65 Hurri… 4
## 9 Allis… 1995 6 5 0 27.6 -86.1 988 65 Hurri… 5
## 10 Allis… 1995 6 5 6 28.5 -85.6 990 60 Tropi… 5
## # … with 2,737 more rows
Compare this to the output produced when we print the original storms_tbl
data set:
storms_tbl
## # A tibble: 2,747 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allis… 1995 6 3 0 17.4 -84.3 1005 30 Tropi… 3
## 2 Allis… 1995 6 3 6 18.3 -84.9 1004 30 Tropi… 3
## 3 Allis… 1995 6 3 12 19.3 -85.7 1003 35 Tropi… 3
## 4 Allis… 1995 6 3 18 20.6 -85.8 1001 40 Tropi… 3
## 5 Allis… 1995 6 4 0 22 -86 997 50 Tropi… 4
## 6 Allis… 1995 6 4 6 23.3 -86.3 995 60 Tropi… 4
## 7 Allis… 1995 6 4 12 24.7 -86.2 987 65 Hurri… 4
## 8 Allis… 1995 6 4 18 26.2 -86.2 988 65 Hurri… 4
## 9 Allis… 1995 6 5 0 27.6 -86.1 988 65 Hurri… 5
## 10 Allis… 1995 6 5 6 28.5 -85.6 990 60 Tropi… 5
## # … with 2,737 more rows
There is almost no change in the printed information—group_by
really doesn’t do much on its own. The main change is that the tibble resulting from the group_by
operation has a little bit of additional information printed at the top: Groups: type [4]
. The Groups: type
part of this tells us that the tibble is grouped by the type
variable and nothing else. The [4]
part tells us that there are 4 different groups.
The only thing group_by
did was add this grouping information to a copy of storms_tbl
. The original storms_tbl
object was not altered in any way. If we actually want to do anything useful useful with the result we need to assign it a name so that we can work with it:
storms_grouped <- group_by(storms_tbl, type)
Now we have a grouped tibble called storms_grouped
, where the groups are defined by the values of type
. Any operations on this tibble will now be performed on a “by group” basis. To see this in action, we use summarise
to calculate the mean wind speed:
summarise(storms_grouped, mean.wind = mean(wind))
## # A tibble: 4 x 2
## type mean.wind
## <chr> <dbl>
## 1 Extratropical 40.1
## 2 Hurricane 84.7
## 3 Tropical Depression 27.4
## 4 Tropical Storm 47.3
When we used summarise
on an ungrouped tibble the result was a tibble with one row: the overall global mean. Now the resulting tibble has four rows, one for each value of type
: The type
variable in the new tibble tells us what these values are; the mean.wind
variable shows the mean wind speed for each value.
15.2.1 More than one grouping variable
What if we need to calculate summaries using more than one grouping variable? The workflow is unchanged. Let’s assume we want to know the mean wind speed and atmospheric pressure associated with each storm type in each year. We first make a grouped copy of the data set with the appropriate grouping variables:
# group the storms_tbl data by storm year + assign the result a name
storms_grouped <- group_by(storms_tbl, type, year)
#
storms_grouped
## # A tibble: 2,747 x 11
## # Groups: type, year [24]
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allis… 1995 6 3 0 17.4 -84.3 1005 30 Tropi… 3
## 2 Allis… 1995 6 3 6 18.3 -84.9 1004 30 Tropi… 3
## 3 Allis… 1995 6 3 12 19.3 -85.7 1003 35 Tropi… 3
## 4 Allis… 1995 6 3 18 20.6 -85.8 1001 40 Tropi… 3
## 5 Allis… 1995 6 4 0 22 -86 997 50 Tropi… 4
## 6 Allis… 1995 6 4 6 23.3 -86.3 995 60 Tropi… 4
## 7 Allis… 1995 6 4 12 24.7 -86.2 987 65 Hurri… 4
## 8 Allis… 1995 6 4 18 26.2 -86.2 988 65 Hurri… 4
## 9 Allis… 1995 6 5 0 27.6 -86.1 988 65 Hurri… 5
## 10 Allis… 1995 6 5 6 28.5 -85.6 990 60 Tropi… 5
## # … with 2,737 more rows
We grouped the storms_tbl
data by type
and year
and assigned the grouped tibble the name storms_grouped
. When we print this to the Console we see Groups: type, year [24]
near the top, which tells us that the tibble is grouped by two variables with 24 unique combinations of values. We then calculate the mean wind speed and pressure of each storm type in each year:
summarise(storms_grouped,
mean_wind = mean(wind),
mean_pressure = mean(pressure))
## # A tibble: 24 x 4
## # Groups: type [?]
## type year mean_wind mean_pressure
## <chr> <int> <dbl> <dbl>
## 1 Extratropical 1995 38.7 995.
## 2 Extratropical 1996 40.4 991.
## 3 Extratropical 1997 38.9 1000.
## 4 Extratropical 1998 42.9 991.
## 5 Extratropical 1999 38.9 992.
## 6 Extratropical 2000 39.7 997.
## 7 Hurricane 1995 82.0 970.
## 8 Hurricane 1996 85.5 969.
## 9 Hurricane 1997 80.4 976.
## 10 Hurricane 1998 87.0 972.
## # … with 14 more rows
This calculates mean wind speed and atmospheric pressure for different combination of type
and year
. The first line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 38.7 mph and 995 millibars, the second line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 40.4 mph and 991 millibars, and so on. There are 24 rows in total because there were 24 unique combinations of type
and year
in the original storms_tbl
.
15.2.2 Using group_by
with other verbs
The summarise
function is the only dplyr verb we’ll use with grouped tibbles in this book. However, all the main verbs alter their behaviour to respect group identity when used with tibbles with grouping information. When mutate
or transmute
are used with a grouped object they still add new variables, but now the calculations occur “by group”. Here’s an example using transmute
:
# group the storms data by storm name + assign the result a name
storms_grouped <- group_by(storms_tbl, name)
# create a data set 'mean centred' wind speed variable
transmute(storms_grouped, wind_centred = wind - mean(wind))
## # A tibble: 2,747 x 2
## # Groups: name [79]
## name wind_centred
## <chr> <dbl>
## 1 Allison -14.4
## 2 Allison -14.4
## 3 Allison -9.39
## 4 Allison -4.39
## 5 Allison 5.61
## 6 Allison 15.6
## 7 Allison 20.6
## 8 Allison 20.6
## 9 Allison 20.6
## 10 Allison 15.6
## # … with 2,737 more rows
In this example we calculated the “group mean-centered” version of the wind variable. The new wind_centred
variable contains the difference between the wind speed and the mean of whichever storm type is associated with the observation.
15.3 Removing grouping information
On occasion it’s necessary to remove grouping information from a data object. This is most often done when working with “pipes” (the topic of the next chapter) when we need to revert back to operating on the whole data set. The ungroup
function removes grouping information:
ungroup(storms_grouped)
## # A tibble: 2,747 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Allis… 1995 6 3 0 17.4 -84.3 1005 30 Tropi… 3
## 2 Allis… 1995 6 3 6 18.3 -84.9 1004 30 Tropi… 3
## 3 Allis… 1995 6 3 12 19.3 -85.7 1003 35 Tropi… 3
## 4 Allis… 1995 6 3 18 20.6 -85.8 1001 40 Tropi… 3
## 5 Allis… 1995 6 4 0 22 -86 997 50 Tropi… 4
## 6 Allis… 1995 6 4 6 23.3 -86.3 995 60 Tropi… 4
## 7 Allis… 1995 6 4 12 24.7 -86.2 987 65 Hurri… 4
## 8 Allis… 1995 6 4 18 26.2 -86.2 988 65 Hurri… 4
## 9 Allis… 1995 6 5 0 27.6 -86.1 988 65 Hurri… 5
## 10 Allis… 1995 6 5 6 28.5 -85.6 990 60 Tropi… 5
## # … with 2,737 more rows
Looking at the top right of the printed summary, we can see that the Group:
part is now gone—the ungroup
function effectively created a copy of storms_grouped
that is identical to the original storms_tbl
tibble.