Chapter 12 Working with variables
12.1 Introduction
This chapter will explore the the dplyr select
and mutate
verbs, as well as the related rename
and transmute
verbs. These four verbs are considered together because they all operate on the variables (i.e. the columns) of a data frame or tibble:
The
select
function selects a subset of variables to retain and (optionally) renames them in the process.The
mutate
function creates new variables from preexisting ones and retains the original variables.The
rename
function renames one or more variables while keeping the remaining variable names unchanged.The
transmute
function creates new variables from preexisting ones and drops the original variables.
12.1.1 Getting ready
A script that uses dplyr will typically start by loading and attaching the package:
library("dplyr")
Obviously we need to have first installed dplyr
package for this to work.
We’ll use the iris
data set in the datasets package to illustrate the ideas in this chapter. The datasets package ships with R and is loaded and attached at start up, so there’s no need to do anything to make iris
available. The iris
data set is an ordinary data frame. Before we start working, it’s handy to convert this to a tibble so that it prints to the Console in a compact way:
iris_tbl <- tbl_df(iris)
We gave the new tibble version a new name. We didn’t have to do this, but it will remind us that we’re working with tibbles.
12.2 Subset variables with select
We use select
to select variables from a data frame or tibble. This is typically used when we have a data set with many variables but only need to work with a subset of these. Basic usage of select
looks like this:
select(data_set, vname1, vname2, ...)
take note: this is not an example we can run. This is a “pseudo code” example, designed to show, in abstract terms, how we use select
:
The first argument,
data_set
(“data object”), must be the name of the object containing our data.We then include a series of one or more additional arguments, where each one is the name of a variable in
data_set
. We’ve expressed this asvname1, vname2, ...
, wherevname1
andvname2
are names of the first two variables, and the...
is acting as placeholder for the remaining variables (there could be any number of these).
It’s easiest to understand how a function like select
works by seeing it in action. We select the Species
, Petal.Length
and Petal.Width
variables from iris_tbl
like this:
select(iris_tbl, Species, Petal.Length, Petal.Width)
## # A tibble: 150 x 3
## Species Petal.Length Petal.Width
## <fct> <dbl> <dbl>
## 1 setosa 1.4 0.2
## 2 setosa 1.4 0.2
## 3 setosa 1.3 0.2
## 4 setosa 1.5 0.2
## 5 setosa 1.4 0.2
## 6 setosa 1.7 0.4
## 7 setosa 1.4 0.3
## 8 setosa 1.5 0.2
## 9 setosa 1.4 0.2
## 10 setosa 1.5 0.1
## # … with 140 more rows
Hopefully nothing about this example is surprising or confusing. There are a few things to notice about how select
works though:
The
select
function is one of those non-standard functions we briefly mentioned in the Using functions chapter. This means the variable names should not be surrounded by quotes unless they have spaces in them (which is best avoided).The
select
function is just like other R functions: it does not have “side effects”. What this means is that it does not change the originaliris_tbl
. We printed the result produced byselect
to the Console, so we can’t access the new data set. If we need to access the result we have to assign it a name using<-
.The order of variables (i.e. the column order) in the resulting object is the same as the order in which they were supplied as arguments. This means we can reorder variables at the same time as selecting them if we need to.
The
select
function will return the same kind of data object it is working on. It returns a data frame if our data was originally in a data frame and a tibble if it was a tibble. In this example, R prints a tibble because we had convertediris_tbl
from a data frame to a tibble.
It’s sometimes more convenient to use select
to subset variables by specifying those we do not need, rather than specifying of the ones to keep. We use the -
operator indicate that variables should be dropped. For example, to drop the Petal.Width
and Petal.Length
columns, we use:
select(iris_tbl, -Petal.Width, -Petal.Length)
## # A tibble: 150 x 3
## Sepal.Length Sepal.Width Species
## <dbl> <dbl> <fct>
## 1 5.1 3.5 setosa
## 2 4.9 3 setosa
## 3 4.7 3.2 setosa
## 4 4.6 3.1 setosa
## 5 5 3.6 setosa
## 6 5.4 3.9 setosa
## 7 4.6 3.4 setosa
## 8 5 3.4 setosa
## 9 4.4 2.9 setosa
## 10 4.9 3.1 setosa
## # … with 140 more rows
This returns a tibble with just the remaining variables:Sepal.Length
, Sepal.Width
and Species
.
The select
function can also be used to grab (or drop) a set of variables that occur in a sequence next to one another. We specify a series of adjacent variables using the :
operator. This must be used with two variable names, one on the left hand side and one on the right. When we use :
like this, select
will subset both those variables along with any others that fall in between them. For example, if we need the two Petal
variables and Species
, we use:
select(iris_tbl, Petal.Length:Species)
## # A tibble: 150 x 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## 4 1.5 0.2 setosa
## 5 1.4 0.2 setosa
## 6 1.7 0.4 setosa
## 7 1.4 0.3 setosa
## 8 1.5 0.2 setosa
## 9 1.4 0.2 setosa
## 10 1.5 0.1 setosa
## # … with 140 more rows
The :
operator can be combined with -
if we need to drop a series of variables according to their position in a data frame or tibble:
select(iris_tbl, -(Petal.Length:Species))
## # A tibble: 150 x 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
The extra ( )
around are Petal.Length:Species
important here — select
will throw an error if we don’t include them.
12.2.1 Renaming variables with select
and rename
In addition to selecting a subset of variables, the select
function can also rename variables at the same time. To do this, we have to name the arguments using =
, placing the new name on the left hand side. For example, to select theSpecies
, Petal.Length
and Petal.Width
variables from iris_tbl
, but also rename Petal.Length
and Petal.Width
to Petal_Length
and Petal_Width
, we use:
select(iris_tbl, Species, Petal_Length = Petal.Length, Petal_Width = Petal.Width)
## # A tibble: 150 x 3
## Species Petal_Length Petal_Width
## <fct> <dbl> <dbl>
## 1 setosa 1.4 0.2
## 2 setosa 1.4 0.2
## 3 setosa 1.3 0.2
## 4 setosa 1.5 0.2
## 5 setosa 1.4 0.2
## 6 setosa 1.7 0.4
## 7 setosa 1.4 0.3
## 8 setosa 1.5 0.2
## 9 setosa 1.4 0.2
## 10 setosa 1.5 0.1
## # … with 140 more rows
Renaming variables is a common task when working with data frames and tibbles. What should we do if the only thing we would like to achieve is to rename a variables, rather than rename and select variables? The dplyr
provides an additional function called rename
for this purpose. This function renames certain variables while retaining all others. It works in a similar way to select
. For example, to rename Petal.Length
and Petal.Width
to Petal_Length
and Petal_Width
, we use:
rename(iris_tbl, Petal_Length = Petal.Length, Petal_Width = Petal.Width)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
Notice that the rename function also preserves the order of the variables found in the original data.
12.3 Creating variables with mutate
We use mutate
to add new variables to a data frame or tibble. This is useful if we need to construct one or more derived variables to support an analysis. Basic usage of mutate
looks like this:
mutate(data_set, <expression1>, <expression2>, ...)
Again, this is not an example we can run; it’s pseudo code that highlights in abstract terms how to use mutate
. As always with dplyr
, the first argument, data_set
, should be the name of the object containing our data. We then include a series of one or more additional arguments, where each of these is a valid R expression involving one or more variables in data_set
. We’ve have expressed these as <expression1>, <expression2>
, where <expression1>
and <expression2>
represent the first two expressions, and the ...
is acting as placeholder for the remaining expressions. Remember, this is not valid R code. It is just intended to demonstrate the general usage of mutate
.
To see mutate
in action, let’s construct a new version of iris_tbl
that contains a variable summarising the approximate area of sepals:
mutate(iris_tbl, Sepal.Width * Sepal.Length)
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows, and 1 more variable: `Sepal.Width *
## # Sepal.Length` <dbl>
This creates a copy of iris_tbl
with a new column called Sepal.Width * Sepal.Length
(mentioned at the bottom of the printed output). Most of the rules that apply to select
also apply to mutate
:
The expression that performs the required calculation is not surrounded by quotes. This makes sense, because an expression is meant to be evaluated so that it “does something”. It is not a value.
Once again, we just printed the result produced by
mutate
to the Console, rather than assigning the result a name using<-
. Themutate
function does not have side effects, meaning it does not change the originaliris_tbl
in any way.The
select
function returns the same kind of data object as the one it is working on: a data frame if our data was originally in a data frame, a tibble if it was a tibble.
Creating a variable called something like Sepal.Width * Sepal.Length
is not exactly ideal because it’s a difficult name to work with. The mutate
function can name variables at the same time as they are created. We have to name the arguments using =
, placing the name on the left hand side, to do this. Here’s how to use this construct to name the new area variable Sepal.Area
:
mutate(iris_tbl, Sepal.Area = Sepal.Width * Sepal.Length)
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 17.8
## 2 4.9 3 1.4 0.2 setosa 14.7
## 3 4.7 3.2 1.3 0.2 setosa 15.0
## 4 4.6 3.1 1.5 0.2 setosa 14.3
## 5 5 3.6 1.4 0.2 setosa 18
## 6 5.4 3.9 1.7 0.4 setosa 21.1
## 7 4.6 3.4 1.4 0.3 setosa 15.6
## 8 5 3.4 1.5 0.2 setosa 17
## 9 4.4 2.9 1.4 0.2 setosa 12.8
## 10 4.9 3.1 1.5 0.1 setosa 15.2
## # … with 140 more rows
We can create more than one variable by supplying mutate
multiple (named) arguments:
mutate(iris_tbl,
Sepal.Area = Sepal.Width * Sepal.Length,
Petal.Area = Petal.Width * Petal.Length,
Area.Ratio = Petal.Area / Petal.Area)
## # A tibble: 150 x 8
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.1 3.5 1.4 0.2 setosa 17.8
## 2 4.9 3 1.4 0.2 setosa 14.7
## 3 4.7 3.2 1.3 0.2 setosa 15.0
## 4 4.6 3.1 1.5 0.2 setosa 14.3
## 5 5 3.6 1.4 0.2 setosa 18
## 6 5.4 3.9 1.7 0.4 setosa 21.1
## 7 4.6 3.4 1.4 0.3 setosa 15.6
## 8 5 3.4 1.5 0.2 setosa 17
## 9 4.4 2.9 1.4 0.2 setosa 12.8
## 10 4.9 3.1 1.5 0.1 setosa 15.2
## # … with 140 more rows, and 2 more variables: Petal.Area <dbl>,
## # Area.Ratio <dbl>
Notice that here we placed each argument on a new line, remembering the comma to separate arguments. There is nothing to stop us doing this because R ignores white space. This is useful though, because it allows us, the user, to makes long function calls easier to read by breaking them up on different lines.
This last example reveals a nice feature of mutate
: we can use newly created variables in further calculations. Here we constructed approximate sepal and petal area variables, and then used these to construct a third variable containing the ratio of these two quantities, Area.Ratio
.
12.3.1 Transforming and dropping variables
Occasionally we may want to construct one or more new variables, and then drop all other variables in the original dataset. The transmute
function is designed to do this. It works exactly like mutate
, but it has a slightly different behaviour:
transmute(iris_tbl,
Sepal.Area = Sepal.Width * Sepal.Length,
Petal.Area = Petal.Width * Petal.Length,
Area.Ratio = Petal.Area / Petal.Area)
## # A tibble: 150 x 3
## Sepal.Area Petal.Area Area.Ratio
## <dbl> <dbl> <dbl>
## 1 17.8 0.280 1
## 2 14.7 0.280 1
## 3 15.0 0.26 1
## 4 14.3 0.3 1
## 5 18 0.280 1
## 6 21.1 0.68 1
## 7 15.6 0.42 1
## 8 17 0.3 1
## 9 12.8 0.280 1
## 10 15.2 0.15 1
## # … with 140 more rows
Here we repeated the previous example, but now only the new variables were retained in the resulting tibble. If we also want to retain one or more variables without altering them we just have to pass them as unnamed arguments. For example, if we need to retain species identity in the output, we use:
transmute(iris_tbl,
Species,
Sepal.Area = Sepal.Width * Sepal.Length,
Petal.Area = Petal.Width * Petal.Length,
Area.Ratio = Petal.Area / Petal.Area)
## # A tibble: 150 x 4
## Species Sepal.Area Petal.Area Area.Ratio
## <fct> <dbl> <dbl> <dbl>
## 1 setosa 17.8 0.280 1
## 2 setosa 14.7 0.280 1
## 3 setosa 15.0 0.26 1
## 4 setosa 14.3 0.3 1
## 5 setosa 18 0.280 1
## 6 setosa 21.1 0.68 1
## 7 setosa 15.6 0.42 1
## 8 setosa 17 0.3 1
## 9 setosa 12.8 0.280 1
## 10 setosa 15.2 0.15 1
## # … with 140 more rows