Chapter 14 Helper functions

14.1 Introduction

dplyr supplies a number of helper functions, many of which are shown in the handy dplyr cheat sheet. This is a short chapter. We aren’t going to try to cover every single helper function here. Instead, we’ll highlight some of the more useful ones, and point out where the others tend to be used. We also assume that the storms_tbl and iris_tbl tibbles have already been constructed (look over the previous two chapters to see how this is done).

14.2 Working with select

There are relatively few helper functions that can be used with select. The job of these functions is to make it easier to match variable names according to various criteria. We’ll look at the three simplest of these, but look at the examples in the help file for select and the cheat sheet to see what else is available.

We can select variables according to the sequence of characters used at the start of their name with the starts_with function. For example, to select all the variables in iris_tbl that begin with the word “Petal”, we use:

select(iris_tbl, starts_with("petal"))
## # A tibble: 150 x 2
##    Petal.Length Petal.Width
##           <dbl>       <dbl>
##  1          1.4         0.2
##  2          1.4         0.2
##  3          1.3         0.2
##  4          1.5         0.2
##  5          1.4         0.2
##  6          1.7         0.4
##  7          1.4         0.3
##  8          1.5         0.2
##  9          1.4         0.2
## 10          1.5         0.1
## # … with 140 more rows

This returns a table containing just Petal.Length and Petal.Width. As one might expect, there is also a helper function to select variables according to characters used at the end of their name. This is the ends_with function (no surprises here). To select all the variables in iris_tbl that end with the word “Length”, we use:

select(iris_tbl, ends_with("length"))
## # A tibble: 150 x 2
##    Sepal.Length Petal.Length
##           <dbl>        <dbl>
##  1          5.1          1.4
##  2          4.9          1.4
##  3          4.7          1.3
##  4          4.6          1.5
##  5          5            1.4
##  6          5.4          1.7
##  7          4.6          1.4
##  8          5            1.5
##  9          4.4          1.4
## 10          4.9          1.5
## # … with 140 more rows

Notice that we have to quote the character string that we want to match against. This is not optional. However, the starts_with and ends_with functions are not case sensitive by default. For example, we passed starts_with the argument "petal" instead of "Petal", yet it still selected variables beginning with the character string "Petal". If we want to select variables on a case-sensitive basis, we need to set the ignore.case argument to FALSE in starts_with and ends_with.
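The effect of ignore.case can be seen in a minimal sketch. It assumes that iris_tbl is simply a tibble version of R's built-in iris data; the names insensitive and sensitive are ours, chosen for illustration.

```r
library(dplyr)

# Assumption: iris_tbl is a tibble version of R's built-in iris data
iris_tbl <- as_tibble(iris)

# Default behaviour: case is ignored, so "petal" matches "Petal.Length" and "Petal.Width"
insensitive <- select(iris_tbl, starts_with("petal"))

# Case-sensitive matching: "petal" no longer matches any variable name
sensitive <- select(iris_tbl, starts_with("petal", ignore.case = FALSE))

names(insensitive)
ncol(sensitive)
```

With case-sensitive matching nothing starts with a lower-case "petal", so the second call should return a tibble with no columns rather than an error.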

The last select helper function we will look at is called contains. This allows us to select variables based on a partial match anywhere in their name. Look at what happens if we pass contains the argument ".":

select(iris_tbl, contains("."))
## # A tibble: 150 x 4
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
##           <dbl>       <dbl>        <dbl>       <dbl>
##  1          5.1         3.5          1.4         0.2
##  2          4.9         3            1.4         0.2
##  3          4.7         3.2          1.3         0.2
##  4          4.6         3.1          1.5         0.2
##  5          5           3.6          1.4         0.2
##  6          5.4         3.9          1.7         0.4
##  7          4.6         3.4          1.4         0.3
##  8          5           3.4          1.5         0.2
##  9          4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## # … with 140 more rows

This selects all the variables with a dot in their name.
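It is worth noting that contains treats its argument as a literal string rather than a regular expression. The sketch below contrasts it with matches, another select helper that does interpret its argument as a regular expression. As before, we assume iris_tbl is a tibble version of the built-in iris data, and the names literal_dot and regex_dot are ours.

```r
library(dplyr)

# Assumption: iris_tbl is a tibble version of R's built-in iris data
iris_tbl <- as_tibble(iris)

# contains() looks for a literal "." anywhere in the variable name
literal_dot <- select(iris_tbl, contains("."))

# matches() uses a regular expression, where "." means "any character",
# so every variable name matches, including Species
regex_dot <- select(iris_tbl, matches("."))

names(literal_dot)
names(regex_dot)
```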

There is nothing to stop us combining the different variable selection methods. For example, we can use this approach to select all the variables whose names start with the word “Petal” or end with the word “Length”:

select(iris_tbl, ends_with("length"), starts_with("petal"))
## # A tibble: 150 x 3
##    Sepal.Length Petal.Length Petal.Width
##           <dbl>        <dbl>       <dbl>
##  1          5.1          1.4         0.2
##  2          4.9          1.4         0.2
##  3          4.7          1.3         0.2
##  4          4.6          1.5         0.2
##  5          5            1.4         0.2
##  6          5.4          1.7         0.4
##  7          4.6          1.4         0.3
##  8          5            1.5         0.2
##  9          4.4          1.4         0.2
## 10          4.9          1.5         0.1
## # … with 140 more rows

When we apply more than one selection criterion like this, the select function returns all the variables that match any of the criteria, rather than just the set that meets all of them.
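If we do want only the variables that meet every criterion, the helpers can be combined with the & operator. This is a sketch that assumes a reasonably recent version of dplyr (1.0.0 or later), where selection helpers can be combined with & and |. It again assumes iris_tbl is a tibble version of the built-in iris data; either and both are names we made up.

```r
library(dplyr)

# Assumption: iris_tbl is a tibble version of R's built-in iris data
iris_tbl <- as_tibble(iris)

# Comma-separated helpers: variables matching either criterion are kept
either <- select(iris_tbl, ends_with("length"), starts_with("petal"))

# Helpers joined with &: only variables matching both criteria are kept
# (assumes dplyr >= 1.0.0)
both <- select(iris_tbl, starts_with("petal") & ends_with("length"))

names(either)
names(both)
```

Here only Petal.Length both starts with "Petal" and ends with "Length", so it is the single variable retained by the second call.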

14.3 Working with mutate and transmute

There are quite a few helper functions that can be used with mutate. These make it easier to carry out certain transformations that aren’t easy to do with base R functions. We won’t explore these here as they tend to be needed only in quite specific circumstances. However, in situations where we need to construct an unusual variable (for example, one that ranks the values of another variable), it’s always worth looking at that handy cheat sheet to see what options might be available.
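To give a flavour of what these helpers look like, here is a minimal sketch of the ranking case using two of dplyr's ranking functions, min_rank and dense_rank. The scores_tbl tibble and its values are invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for this example
scores_tbl <- tibble(score = c(10, 20, 10, 30))

ranked <- mutate(
  scores_tbl,
  rank_min   = min_rank(score),   # ties share the lowest rank, leaving gaps: 1, 3, 1, 4
  rank_dense = dense_rank(score)  # ties share a rank, with no gaps: 1, 2, 1, 3
)

ranked
```

The two functions differ only in how they number the values that follow a tie.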

14.4 Working with filter

There’s one dplyr helper function that works with filter that’s definitely worth knowing about: the between function. This is used to identify the values of a variable that lie inside a defined range:

filter(storms_tbl, between(pressure, 960, 970))
## # A tibble: 213 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>     <int>
##  1 Felix  1995     8    11    18  21.3 -56.5      965    90 Hurric…      72
##  2 Felix  1995     8    14    12  29.9 -63.4      962    80 Hurric…      75
##  3 Felix  1995     8    14    18  30.7 -64.1      962    75 Hurric…      75
##  4 Felix  1995     8    15     0  31.3 -65.1      962    75 Hurric…      76
##  5 Felix  1995     8    15     6  31.9 -66.2      964    75 Hurric…      76
##  6 Felix  1995     8    15    12  32.5 -67.4      968    70 Hurric…      76
##  7 Felix  1995     8    15    18  33.1 -68.8      965    70 Hurric…      76
##  8 Felix  1995     8    16     0  33.5 -70.1      963    70 Hurric…      77
##  9 Felix  1995     8    16     6  34   -71.3      966    70 Hurric…      77
## 10 Felix  1995     8    16    12  34.6 -72.4      968    70 Hurric…      77
## # … with 203 more rows

This example filters the storms dataset so that only rows with a pressure between 960 and 970 are retained. The range is inclusive, so observations with a pressure of exactly 960 or 970 are kept. We could do the same thing using a combination of >= and <=, but the between function makes things a bit easier to read.
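The relationship between between and the comparison operators can be checked with a small sketch. Because storms_tbl is built in an earlier chapter, we use a tiny stand-in tibble here; pressure_tbl and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for storms_tbl, with made-up values
pressure_tbl <- tibble(
  name     = c("A", "B", "C", "D", "E"),
  pressure = c(955, 960, 965, 970, 975)
)

# Using the between helper (both end points are included)
with_between <- filter(pressure_tbl, between(pressure, 960, 970))

# The same filter written with comparison operators
with_operators <- filter(pressure_tbl, pressure >= 960, pressure <= 970)

with_between$name
with_operators$name
```

Both calls should keep storms B, C and D, including the two rows that sit exactly on the boundaries.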