Chapter 14 Helper functions
14.1 Introduction
dplyr supplies a number of helper functions. Many of these are shown in the handy dplyr cheat sheet. This is a short chapter. We aren’t going to try to cover every single helper function here. Instead, we’ll highlight some of the more useful ones and point out where the others tend to be used. We also assume that the storms_tbl and iris_tbl tibbles have already been constructed (look over the previous two chapters to see how this is done).
14.2 Working with select
Relatively few helper functions can be used with select. Their job is to make it easier to match variable names according to various criteria. We’ll look at the three simplest of these, but the examples in the help file for select and the cheat sheet show what else is available.
We can select variables according to the sequence of characters used at the start of their name with the starts_with function. For example, to select all the variables in iris_tbl that begin with the word “Petal”, we use:
select(iris_tbl, starts_with("petal"))
## # A tibble: 150 x 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
## 7 1.4 0.3
## 8 1.5 0.2
## 9 1.4 0.2
## 10 1.5 0.1
## # … with 140 more rows
This returns a table containing just Petal.Length and Petal.Width. As one might expect, there is also a helper function to select variables according to characters used at the end of their name. This is the ends_with function (no surprises here). To select all the variables in iris_tbl that end with the word “Length”, we use:
select(iris_tbl, ends_with("length"))
## # A tibble: 150 x 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## 4 4.6 1.5
## 5 5 1.4
## 6 5.4 1.7
## 7 4.6 1.4
## 8 5 1.5
## 9 4.4 1.4
## 10 4.9 1.5
## # … with 140 more rows
Notice that we have to quote the character string that we want to match against. This is not optional. However, the starts_with and ends_with functions are not case sensitive by default. For example, we passed starts_with the argument "petal" instead of "Petal", yet it still selected variables beginning with the character string "Petal". If we want to select variables on a case-sensitive basis, we need to set the ignore.case argument to FALSE in starts_with and ends_with.
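The effect of ignore.case can be checked with a minimal sketch like the one below. It assumes that iris_tbl can be rebuilt from R's built-in iris data with as_tibble; the previous chapters may construct it differently.

```r
library(dplyr)

# Assumption: iris_tbl is rebuilt here from the built-in iris data frame
iris_tbl <- as_tibble(iris)

# Default behaviour: case is ignored, so "petal" matches
# Petal.Length and Petal.Width
names(select(iris_tbl, starts_with("petal")))

# Case-sensitive matching: no variable name starts with a lowercase
# "petal", so no columns are selected
names(select(iris_tbl, starts_with("petal", ignore.case = FALSE)))

# Case-sensitive matching with the capitalisation used in the data
names(select(iris_tbl, starts_with("Petal", ignore.case = FALSE)))
```

The same argument works in ends_with.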
The last select helper function we will look at is called contains. This allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass contains the argument ".":
select(iris_tbl, contains("."))
## # A tibble: 150 x 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # … with 140 more rows
This selects all the variables with a dot in their name, which in iris_tbl means every variable except Species.
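It is worth knowing that contains treats its argument as a literal string rather than as a pattern. The sketch below contrasts it with matches, a related select helper that is not covered in this chapter and that interprets its argument as a regular expression. As before, rebuilding iris_tbl with as_tibble is an assumption made to keep the example self-contained.

```r
library(dplyr)

# Assumption: iris_tbl is rebuilt here from the built-in iris data frame
iris_tbl <- as_tibble(iris)

# contains() looks for the literal character ".", so Species is left out
names(select(iris_tbl, contains(".")))

# matches() reads "." as a regular expression meaning "any character",
# so every variable is selected, including Species
names(select(iris_tbl, matches(".")))

# Escaping the dot makes matches() look for a literal "." again
names(select(iris_tbl, matches("\\.")))
```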
There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables whose names start with the word “Petal” or end with the word “Length”:
select(iris_tbl, ends_with("length"), starts_with("petal"))
## # A tibble: 150 x 3
## Sepal.Length Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 5.1 1.4 0.2
## 2 4.9 1.4 0.2
## 3 4.7 1.3 0.2
## 4 4.6 1.5 0.2
## 5 5 1.4 0.2
## 6 5.4 1.7 0.4
## 7 4.6 1.4 0.3
## 8 5 1.5 0.2
## 9 4.4 1.4 0.2
## 10 4.9 1.5 0.1
## # … with 140 more rows
When we apply more than one selection criterion like this, the select function returns all the variables that match any of the criteria, rather than just the set that meets all of them.
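If we do want only the variables that meet every criterion, recent versions of dplyr let us combine helpers with the & operator (and | for the "either" case). The sketch below assumes dplyr 1.0.0 or later and, as an assumption for self-containment, rebuilds iris_tbl from the built-in iris data.

```r
library(dplyr)

# Assumption: iris_tbl is rebuilt here from the built-in iris data frame
iris_tbl <- as_tibble(iris)

# Comma-separated helpers: variables matching either criterion
names(select(iris_tbl, ends_with("length"), starts_with("petal")))

# The | operator expresses the same "either" selection explicitly
names(select(iris_tbl, ends_with("length") | starts_with("petal")))

# The & operator keeps only variables matching both criteria
names(select(iris_tbl, ends_with("length") & starts_with("petal")))
```

The last call keeps only Petal.Length, the one variable that both starts with "Petal" and ends with "Length".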
14.3 Working with mutate and transmute
There are quite a few helper functions that can be used with mutate. These make it easier to carry out certain transformations that aren’t easy to do with base R functions. We won’t explore these here as they tend to be needed only in quite specific circumstances. However, in situations where we need to construct an unusual variable (for example, one that ranks the values of another variable), it’s always worth looking at that handy cheat sheet to see what options might be available.
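As a brief illustration of the ranking case, here is a minimal sketch using min_rank, one of dplyr's ranking functions. The new variable name Petal.Rank is hypothetical, and rebuilding iris_tbl with as_tibble is an assumption made to keep the example self-contained.

```r
library(dplyr)

# Assumption: iris_tbl is rebuilt here from the built-in iris data frame
iris_tbl <- as_tibble(iris)

# Add a variable holding the rank of each Petal.Length value (1 = smallest).
# min_rank() gives tied values the same rank and leaves gaps after ties.
iris_ranked <- mutate(iris_tbl, Petal.Rank = min_rank(Petal.Length))

# Show the original variable alongside its rank
select(iris_ranked, Petal.Length, Petal.Rank)
```

Other ranking functions, such as dense_rank and row_number, differ mainly in how they treat ties.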
14.4 Working with filter
There’s one dplyr helper function that works with filter that’s definitely worth knowing about: the between function. This is used to identify the values of a variable that lie inside a defined range:
filter(storms_tbl, between(pressure, 960, 970))
## # A tibble: 213 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Felix 1995 8 11 18 21.3 -56.5 965 90 Hurric… 72
## 2 Felix 1995 8 14 12 29.9 -63.4 962 80 Hurric… 75
## 3 Felix 1995 8 14 18 30.7 -64.1 962 75 Hurric… 75
## 4 Felix 1995 8 15 0 31.3 -65.1 962 75 Hurric… 76
## 5 Felix 1995 8 15 6 31.9 -66.2 964 75 Hurric… 76
## 6 Felix 1995 8 15 12 32.5 -67.4 968 70 Hurric… 76
## 7 Felix 1995 8 15 18 33.1 -68.8 965 70 Hurric… 76
## 8 Felix 1995 8 16 0 33.5 -70.1 963 70 Hurric… 77
## 9 Felix 1995 8 16 6 34 -71.3 966 70 Hurric… 77
## 10 Felix 1995 8 16 12 34.6 -72.4 968 70 Hurric… 77
## # … with 203 more rows
This example filters storms_tbl so that only rows with a pressure between 960 and 970 are retained. Both endpoints are included in the range. We could do the same thing using a combination of >= and <=, but the between function makes things a bit easier to read.
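The equivalence with the comparison operators can be checked with a small sketch. Because storms_tbl is built in an earlier chapter, the example uses a hypothetical five-row tibble called pressure_tbl as a stand-in.

```r
library(dplyr)

# Hypothetical stand-in for storms_tbl, holding only a pressure variable
pressure_tbl <- tibble(pressure = c(955L, 960L, 965L, 970L, 975L))

# between() keeps values from 960 to 970, including both endpoints
with_between <- filter(pressure_tbl, between(pressure, 960, 970))

# The same filter written with comparison operators
with_operators <- filter(pressure_tbl, pressure >= 960, pressure <= 970)

# Both approaches keep the rows with pressure 960, 965 and 970
with_between$pressure
identical(with_between$pressure, with_operators$pressure)
```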