Chapter 12 Working with variables

12.1 Introduction

This chapter will explore the the dplyr select and mutate verbs, as well as the related rename and transmute verbs. These four verbs are considered together because they all operate on the variables (i.e. the columns) of a data frame or tibble:

The select function selects a subset of variables to retain and (optionally) renames them in the process.
The mutate function creates new variables from preexisting ones and retains the original variables.
The rename function renames one or more variables while keeping the remaining variable names unchanged.
The transmute function creates new variables from preexisting ones and drops the original variables.

12.1.1 Getting ready

A script that uses dplyr will typically start by loading and attaching the package:

library("dplyr")

Obviously we need to have first installed dplyr package for this to work.

We’ll use the iris data set in the datasets package to illustrate the ideas in this chapter. The datasets package ships with R and is loaded and attached at start up, so there’s no need to do anything to make iris available. The iris data set is an ordinary data frame. Before we start working, it’s handy to convert this to a tibble so that it prints to the Console in a compact way:

iris_tbl   <- tbl_df(iris)

We gave the new tibble version a new name. We didn’t have to do this, but it will remind us that we’re working with tibbles.

12.2 Subset variables with `select`

We use select to select variables from a data frame or tibble. This is typically used when we have a data set with many variables but only need to work with a subset of these. Basic usage of select looks like this:

select(data_set, vname1, vname2, ...)

take note: this is not an example we can run. This is a “pseudo code” example, designed to show, in abstract terms, how we use select:

The first argument, data_set (“data object”), must be the name of the object containing our data.
We then include a series of one or more additional arguments, where each one is the name of a variable in data_set. We’ve expressed this as vname1, vname2, ..., where vname1 and vname2 are names of the first two variables, and the ... is acting as placeholder for the remaining variables (there could be any number of these).

It’s easiest to understand how a function like select works by seeing it in action. We select the Species, Petal.Length and Petal.Width variables from iris_tbl like this:

select(iris_tbl, Species, Petal.Length, Petal.Width)

## # A tibble: 150 x 3
##    Species Petal.Length Petal.Width
##    <fct>          <dbl>       <dbl>
##  1 setosa           1.4         0.2
##  2 setosa           1.4         0.2
##  3 setosa           1.3         0.2
##  4 setosa           1.5         0.2
##  5 setosa           1.4         0.2
##  6 setosa           1.7         0.4
##  7 setosa           1.4         0.3
##  8 setosa           1.5         0.2
##  9 setosa           1.4         0.2
## 10 setosa           1.5         0.1
## # … with 140 more rows

Hopefully nothing about this example is surprising or confusing. There are a few things to notice about how select works though:

The select function is one of those non-standard functions we briefly mentioned in the Using functions chapter. This means the variable names should not be surrounded by quotes unless they have spaces in them (which is best avoided).
The select function is just like other R functions: it does not have “side effects”. What this means is that it does not change the original iris_tbl. We printed the result produced by select to the Console, so we can’t access the new data set. If we need to access the result we have to assign it a name using <-.
The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied as arguments. This means we can reorder variables at the same time as selecting them if we need to.
The select function will return the same kind of data object it is working on. It returns a data frame if our data was originally in a data frame and a tibble if it was a tibble. In this example, R prints a tibble because we had converted iris_tbl from a data frame to a tibble.

It’s sometimes more convenient to use select to subset variables by specifying those we do not need, rather than specifying of the ones to keep. We use the - operator indicate that variables should be dropped. For example, to drop the Petal.Width and Petal.Length columns, we use:

select(iris_tbl, -Petal.Width, -Petal.Length)

## # A tibble: 150 x 3
##    Sepal.Length Sepal.Width Species
##           <dbl>       <dbl> <fct>  
##  1          5.1         3.5 setosa 
##  2          4.9         3   setosa 
##  3          4.7         3.2 setosa 
##  4          4.6         3.1 setosa 
##  5          5           3.6 setosa 
##  6          5.4         3.9 setosa 
##  7          4.6         3.4 setosa 
##  8          5           3.4 setosa 
##  9          4.4         2.9 setosa 
## 10          4.9         3.1 setosa 
## # … with 140 more rows

This returns a tibble with just the remaining variables:Sepal.Length, Sepal.Width and Species.

The select function can also be used to grab (or drop) a set of variables that occur in a sequence next to one another. We specify a series of adjacent variables using the : operator. This must be used with two variable names, one on the left hand side and one on the right. When we use : like this, select will subset both those variables along with any others that fall in between them. For example, if we need the two Petal variables and Species, we use:

select(iris_tbl, Petal.Length:Species)

## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl> <fct>  
##  1          1.4         0.2 setosa 
##  2          1.4         0.2 setosa 
##  3          1.3         0.2 setosa 
##  4          1.5         0.2 setosa 
##  5          1.4         0.2 setosa 
##  6          1.7         0.4 setosa 
##  7          1.4         0.3 setosa 
##  8          1.5         0.2 setosa 
##  9          1.4         0.2 setosa 
## 10          1.5         0.1 setosa 
## # … with 140 more rows

The : operator can be combined with - if we need to drop a series of variables according to their position in a data frame or tibble:

select(iris_tbl, -(Petal.Length:Species))

## # A tibble: 150 x 2
##    Sepal.Length Sepal.Width
##           <dbl>       <dbl>
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows

The extra ( ) around are Petal.Length:Species important here — select will throw an error if we don’t include them.

12.2.1 Renaming variables with `select` and `rename`

In addition to selecting a subset of variables, the select function can also rename variables at the same time. To do this, we have to name the arguments using =, placing the new name on the left hand side. For example, to select theSpecies, Petal.Length and Petal.Width variables from iris_tbl, but also rename Petal.Length and Petal.Width to Petal_Length and Petal_Width, we use:

select(iris_tbl, Species, Petal_Length = Petal.Length, Petal_Width = Petal.Width)

## # A tibble: 150 x 3
##    Species Petal_Length Petal_Width
##    <fct>          <dbl>       <dbl>
##  1 setosa           1.4         0.2
##  2 setosa           1.4         0.2
##  3 setosa           1.3         0.2
##  4 setosa           1.5         0.2
##  5 setosa           1.4         0.2
##  6 setosa           1.7         0.4
##  7 setosa           1.4         0.3
##  8 setosa           1.5         0.2
##  9 setosa           1.4         0.2
## 10 setosa           1.5         0.1
## # … with 140 more rows

Renaming variables is a common task when working with data frames and tibbles. What should we do if the only thing we would like to achieve is to rename a variables, rather than rename and select variables? The dplyr provides an additional function called rename for this purpose. This function renames certain variables while retaining all others. It works in a similar way to select. For example, to rename Petal.Length and Petal.Width to Petal_Length and Petal_Width, we use:

rename(iris_tbl, Petal_Length = Petal.Length, Petal_Width = Petal.Width)

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal_Length Petal_Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Notice that the rename function also preserves the order of the variables found in the original data.

12.3 Creating variables with `mutate`

We use mutate to add new variables to a data frame or tibble. This is useful if we need to construct one or more derived variables to support an analysis. Basic usage of mutate looks like this:

mutate(data_set, <expression1>, <expression2>, ...)

Again, this is not an example we can run; it’s pseudo code that highlights in abstract terms how to use mutate. As always with dplyr, the first argument, data_set, should be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in data_set. We’ve have expressed these as <expression1>, <expression2>, where <expression1> and <expression2> represent the first two expressions, and the ... is acting as placeholder for the remaining expressions. Remember, this is not valid R code. It is just intended to demonstrate the general usage of mutate.

To see mutate in action, let’s construct a new version of iris_tbl that contains a variable summarising the approximate area of sepals:

mutate(iris_tbl, Sepal.Width * Sepal.Length)

## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows, and 1 more variable: `Sepal.Width *
## #   Sepal.Length` <dbl>

This creates a copy of iris_tbl with a new column called Sepal.Width * Sepal.Length (mentioned at the bottom of the printed output). Most of the rules that apply to select also apply to mutate:

The expression that performs the required calculation is not surrounded by quotes. This makes sense, because an expression is meant to be evaluated so that it “does something”. It is not a value.
Once again, we just printed the result produced by mutate to the Console, rather than assigning the result a name using <-. The mutate function does not have side effects, meaning it does not change the original iris_tbl in any way.
The select function returns the same kind of data object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.

Creating a variable called something like Sepal.Width * Sepal.Length is not exactly ideal because it’s a difficult name to work with. The mutate function can name variables at the same time as they are created. We have to name the arguments using =, placing the name on the left hand side, to do this. Here’s how to use this construct to name the new area variable Sepal.Area:

mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)

## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
##  1          5.1         3.5          1.4         0.2 setosa        17.8
##  2          4.9         3            1.4         0.2 setosa        14.7
##  3          4.7         3.2          1.3         0.2 setosa        15.0
##  4          4.6         3.1          1.5         0.2 setosa        14.3
##  5          5           3.6          1.4         0.2 setosa        18  
##  6          5.4         3.9          1.7         0.4 setosa        21.1
##  7          4.6         3.4          1.4         0.3 setosa        15.6
##  8          5           3.4          1.5         0.2 setosa        17  
##  9          4.4         2.9          1.4         0.2 setosa        12.8
## 10          4.9         3.1          1.5         0.1 setosa        15.2
## # … with 140 more rows

We can create more than one variable by supplying mutate multiple (named) arguments:

mutate(iris_tbl, 
       Sepal.Area = Sepal.Width * Sepal.Length,
       Petal.Area = Petal.Width * Petal.Length,
       Area.Ratio = Petal.Area / Petal.Area)

## # A tibble: 150 x 8
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
##  1          5.1         3.5          1.4         0.2 setosa        17.8
##  2          4.9         3            1.4         0.2 setosa        14.7
##  3          4.7         3.2          1.3         0.2 setosa        15.0
##  4          4.6         3.1          1.5         0.2 setosa        14.3
##  5          5           3.6          1.4         0.2 setosa        18  
##  6          5.4         3.9          1.7         0.4 setosa        21.1
##  7          4.6         3.4          1.4         0.3 setosa        15.6
##  8          5           3.4          1.5         0.2 setosa        17  
##  9          4.4         2.9          1.4         0.2 setosa        12.8
## 10          4.9         3.1          1.5         0.1 setosa        15.2
## # … with 140 more rows, and 2 more variables: Petal.Area <dbl>,
## #   Area.Ratio <dbl>

Notice that here we placed each argument on a new line, remembering the comma to separate arguments. There is nothing to stop us doing this because R ignores white space. This is useful though, because it allows us, the user, to makes long function calls easier to read by breaking them up on different lines.

This last example reveals a nice feature of mutate: we can use newly created variables in further calculations. Here we constructed approximate sepal and petal area variables, and then used these to construct a third variable containing the ratio of these two quantities, Area.Ratio.

12.3.1 Transforming and dropping variables

Occasionally we may want to construct one or more new variables, and then drop all other variables in the original dataset. The transmute function is designed to do this. It works exactly like mutate, but it has a slightly different behaviour:

transmute(iris_tbl,
          Sepal.Area = Sepal.Width * Sepal.Length,
          Petal.Area = Petal.Width * Petal.Length,
          Area.Ratio = Petal.Area / Petal.Area)

## # A tibble: 150 x 3
##    Sepal.Area Petal.Area Area.Ratio
##         <dbl>      <dbl>      <dbl>
##  1       17.8      0.280          1
##  2       14.7      0.280          1
##  3       15.0      0.26           1
##  4       14.3      0.3            1
##  5       18        0.280          1
##  6       21.1      0.68           1
##  7       15.6      0.42           1
##  8       17        0.3            1
##  9       12.8      0.280          1
## 10       15.2      0.15           1
## # … with 140 more rows

Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain one or more variables without altering them we just have to pass them as unnamed arguments. For example, if we need to retain species identity in the output, we use:

transmute(iris_tbl,
          Species,
          Sepal.Area = Sepal.Width * Sepal.Length,
          Petal.Area = Petal.Width * Petal.Length,
          Area.Ratio = Petal.Area / Petal.Area)

## # A tibble: 150 x 4
##    Species Sepal.Area Petal.Area Area.Ratio
##    <fct>        <dbl>      <dbl>      <dbl>
##  1 setosa        17.8      0.280          1
##  2 setosa        14.7      0.280          1
##  3 setosa        15.0      0.26           1
##  4 setosa        14.3      0.3            1
##  5 setosa        18        0.280          1
##  6 setosa        21.1      0.68           1
##  7 setosa        15.6      0.42           1
##  8 setosa        17        0.3            1
##  9 setosa        12.8      0.280          1
## 10 setosa        15.2      0.15           1
## # … with 140 more rows