Chapter 14 Helper functions
14.1 Introduction
dplyr supplies a number of helper functions, many of which are shown in the handy dplyr cheat sheet. This is a short chapter: we aren’t going to try to cover every helper function. Instead, we’ll highlight some of the more useful ones and point out where the others tend to be used. We also assume that the storms_tbl and iris_tbl tibbles have already been constructed (look over the previous two chapters to see how this is done).
14.2 Working with select
Only a few helper functions can be used with select. Their job is to make it easier to match variable names according to various criteria. We’ll look at the three simplest of these; the examples in the help file for select and the cheat sheet show what else is available.
We can select variables according to the sequence of characters used at the start of their name with the starts_with function. For example, to select all the variables in iris_tbl that begin with the word “Petal”, we use:
select(iris_tbl, starts_with("petal"))
## # A tibble: 150 x 2
## Petal.Length Petal.Width
## <dbl> <dbl>
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
## 7 1.4 0.3
## 8 1.5 0.2
## 9 1.4 0.2
## 10 1.5 0.1
## # … with 140 more rows
This returns a table containing just Petal.Length and Petal.Width. As one might expect, there is also a helper function to select variables according to characters used at the end of their name. This is the ends_with function (no surprises here). To select all the variables in iris_tbl that end with the word “Length”, we use:
select(iris_tbl, ends_with("length"))
## # A tibble: 150 x 2
## Sepal.Length Petal.Length
## <dbl> <dbl>
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## 4 4.6 1.5
## 5 5 1.4
## 6 5.4 1.7
## 7 4.6 1.4
## 8 5 1.5
## 9 4.4 1.4
## 10 4.9 1.5
## # … with 140 more rows
Notice that we have to quote the character string that we want to match against. This is not optional. However, the starts_with and ends_with functions are not case sensitive by default. For example, we passed starts_with the argument "petal" instead of "Petal", yet it still selected variables beginning with the character string "Petal". If we want to select variables on a case-sensitive basis, we need to set the ignore.case argument to FALSE in starts_with and ends_with.
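To see what ignore.case changes, the sketch below compares the default behaviour with case-sensitive matching. It assumes iris_tbl is simply the built-in iris data converted to a tibble, which may not be exactly how the earlier chapters construct it.

```r
library(dplyr)

# Assumption: iris_tbl is the built-in iris data frame converted to a tibble
iris_tbl <- as_tibble(iris)

# Default (ignore.case = TRUE): "petal" matches Petal.Length and Petal.Width
default_match <- select(iris_tbl, starts_with("petal"))
names(default_match)

# Case-sensitive: no variable name starts with lower-case "petal",
# so no columns are selected
strict_match <- select(iris_tbl, starts_with("petal", ignore.case = FALSE))
names(strict_match)
```

When nothing matches, select returns a tibble with no columns rather than an error, so a mistyped pattern can go unnoticed.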
The last select helper function we will look at is called contains. This allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass contains the argument ".":
select(iris_tbl, contains("."))
## # A tibble: 150 x 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # … with 140 more rows
This selects all the variables with a dot in their name.
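The match performed by contains is literal, so "." means an actual dot rather than a wildcard. As a rough illustration, the sketch below contrasts it with matches, another select helper that interprets its argument as a regular expression. The construction of iris_tbl from the built-in iris data is an assumption made to keep the example self-contained.

```r
library(dplyr)

# Assumption: iris_tbl is the built-in iris data frame converted to a tibble
iris_tbl <- as_tibble(iris)

# contains() looks for the literal string ".", so Species is left out
literal_cols <- names(select(iris_tbl, contains(".")))
literal_cols

# matches() treats "." as a regular expression that matches any character,
# so every variable is selected, including Species
regex_cols <- names(select(iris_tbl, matches(".")))
regex_cols
```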
There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables whose names start with the word “Petal” or end with the word “Length”:
select(iris_tbl, ends_with("length"), starts_with("petal"))
## # A tibble: 150 x 3
## Sepal.Length Petal.Length Petal.Width
## <dbl> <dbl> <dbl>
## 1 5.1 1.4 0.2
## 2 4.9 1.4 0.2
## 3 4.7 1.3 0.2
## 4 4.6 1.5 0.2
## 5 5 1.4 0.2
## 6 5.4 1.7 0.4
## 7 4.6 1.4 0.3
## 8 5 1.5 0.2
## 9 4.4 1.4 0.2
## 10 4.9 1.5 0.1
## # … with 140 more rows
When we apply more than one selection criterion like this, the select function returns all the variables that match any of the criteria, rather than just the set that meets all of them.
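If we do want only the variables that meet every criterion, one option is to combine helpers with the & operator. This is a minimal sketch that assumes a reasonably recent dplyr (version 1.0.0 or later, where select supports boolean operators on helpers) and builds iris_tbl from the built-in iris data.

```r
library(dplyr)

# Assumption: iris_tbl is the built-in iris data frame converted to a tibble
iris_tbl <- as_tibble(iris)

# Separate arguments: variables that match either helper
either <- select(iris_tbl, ends_with("length"), starts_with("petal"))
names(either)

# Helpers joined with &: only variables that match both helpers
both <- select(iris_tbl, starts_with("petal") & ends_with("length"))
names(both)
```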
14.3 Working with mutate and transmute
Quite a few helper functions can be used with mutate. These make it easier to carry out certain transformations that aren’t easy to do with base R functions. We won’t explore these here as they tend to be needed only in quite specific circumstances. However, when we need to construct an unusual variable (for example, one that ranks the values of another variable), it’s always worth looking at that handy cheat sheet to see what options might be available.
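As a brief illustration of the ranking case, the sketch below uses min_rank, one of dplyr's ranking helpers. The new variable name Petal.Rank is made up for this example, and iris_tbl is assumed to be the built-in iris data converted to a tibble.

```r
library(dplyr)

# Assumption: iris_tbl is the built-in iris data frame converted to a tibble
iris_tbl <- as_tibble(iris)

# min_rank() gives tied values the same rank (the smallest one they share)
small_ranks <- min_rank(c(10, 30, 20, 20))
small_ranks

# Add a variable holding the rank of each Petal.Length value
# (Petal.Rank is a hypothetical name chosen for this example)
ranked <- mutate(iris_tbl, Petal.Rank = min_rank(Petal.Length))
select(ranked, Petal.Length, Petal.Rank)
```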
14.4 Working with filter
There’s one dplyr helper function that works with filter that’s definitely worth knowing about: the between function. This is used to identify the values of a variable that lie inside a defined range:
filter(storms_tbl, between(pressure, 960, 970))
## # A tibble: 213 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Felix 1995 8 11 18 21.3 -56.5 965 90 Hurric… 72
## 2 Felix 1995 8 14 12 29.9 -63.4 962 80 Hurric… 75
## 3 Felix 1995 8 14 18 30.7 -64.1 962 75 Hurric… 75
## 4 Felix 1995 8 15 0 31.3 -65.1 962 75 Hurric… 76
## 5 Felix 1995 8 15 6 31.9 -66.2 964 75 Hurric… 76
## 6 Felix 1995 8 15 12 32.5 -67.4 968 70 Hurric… 76
## 7 Felix 1995 8 15 18 33.1 -68.8 965 70 Hurric… 76
## 8 Felix 1995 8 16 0 33.5 -70.1 963 70 Hurric… 77
## 9 Felix 1995 8 16 6 34 -71.3 966 70 Hurric… 77
## 10 Felix 1995 8 16 12 34.6 -72.4 968 70 Hurric… 77
## # … with 203 more rows
This example filters the storms dataset so that only rows with a pressure between 960 and 970 are retained. Both endpoints are included, so we could do the same thing by combining >= and <=, but the between function makes things a bit easier to read.
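To check how between treats the ends of the range, the sketch below applies it to a small made-up tibble. The pressure_tbl data and its storm names are hypothetical stand-ins used only so that the example runs without storms_tbl.

```r
library(dplyr)

# Hypothetical data: five pressure readings straddling the 960 to 970 range
pressure_tbl <- tibble(
  name     = c("A", "B", "C", "D", "E"),
  pressure = c(959, 960, 965, 970, 971)
)

# between() keeps values from 960 up to and including 970
with_between <- filter(pressure_tbl, between(pressure, 960, 970))

# The same filter written with comparison operators
with_operators <- filter(pressure_tbl, pressure >= 960, pressure <= 970)

with_between$pressure
identical(with_between$pressure, with_operators$pressure)
```

The rows with a pressure of exactly 960 and 970 are both retained, which shows that the range is inclusive at both ends.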