Chapter 13 Working with observations

13.1 Introduction

This chapter will explore the filter and arrange verbs. These are discussed together because they are used to manipulate observations (i.e. rows) of a data frame or tibble:

  • The filter function extracts a subset of observations based on supplied conditions involving the variables in our data.

  • The arrange function reorders the rows according to the values in one or more variables.

13.1.1 Getting ready

We should start a new script by loading and attaching the dplyr package:

library("dplyr")

We’re going to use the storms data set in the nasaweather package this time. This means we need to load and attach the nasaweather package to make storms available:

library("nasaweather")

The storms data set is an ordinary data frame, so let’s convert it to a tibble so that it prints nicely:

storms_tbl <- tbl_df(storms)

13.2 Subset observations with filter

We use filter to subset observations in a data frame or tibble containing our data. This is often done when we want to limit an analysis to a subset of observations. Basic usage of filter looks something like this:

filter(data_set, <expression1>, <expression1>, ...)

Remember, this is pseudo code (it’s not an example we can run). The first argument, data_set, must the name of the object containing our data. We then include one or more additional arguments, where each of these is a valid R expression involving one or more variables in data_set. Each expression must return a logical vector. We’ve expressed these as <expression1>, <expression2>, ..., where <expression1> and <expression2> represent the first two expressions, and the ... is acting as placeholder for the remaining expressions.

To see how filter works in action, we’ll use it to subset observations in the storms_tbl dataset, based on two relational criteria:

filter(storms_tbl, pressure <= 960, wind >= 100)
## # A tibble: 199 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>     <int>
##  1 Felix  1995     8    12     0  22.1 -57.8      955   100 Hurric…      73
##  2 Felix  1995     8    12     6  22.9 -59        943   110 Hurric…      73
##  3 Felix  1995     8    12    12  23.6 -60.2      932   115 Hurric…      73
##  4 Felix  1995     8    12    18  24.3 -61        929   120 Hurric…      73
##  5 Felix  1995     8    13     0  25.1 -61.6      930   115 Hurric…      74
##  6 Felix  1995     8    13     6  25.9 -61.9      937   105 Hurric…      74
##  7 Felix  1995     8    13    12  26.6 -62.3      942   100 Hurric…      74
##  8 Luis   1995     9     1     6  15.8 -42.6      958   105 Hurric…      93
##  9 Luis   1995     9     1    12  16.2 -43.6      950   115 Hurric…      93
## 10 Luis   1995     9     1    18  16.5 -44.7      948   115 Hurric…      93
## # … with 189 more rows

In this example we’ve created a subset of storms_tbl that only includes observations where the pressure variable is less than or equal to 960 and the wind variable is greater than or equal to 100. Both conditions must be met for an observation to be included in the resulting tibble. The conditions are not combined as an either/or operation.

This is probably starting to become repetitious, but there are a few features of filter that we should note:

  • Each expression that performs a comparison is not surrounded by quotes. This makes sense, because the expression is meant to be evaluated to return a logical vector – it is not “a value”.

  • As usual, the result produced by mutate in our example was printed to the Console. The mutate function did not change the original storms_tbl in any way (no side effects!).

  • The filter function will return the same kind of data object it is working on: it returns a data frame if our data was originally in a data frame, and a tibble if it was a tibble.

We can achieve the same result as the above example in a different way. This involves the & operator:

filter(storms_tbl, pressure <= 960 & wind >= 100)
## # A tibble: 199 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>     <int>
##  1 Felix  1995     8    12     0  22.1 -57.8      955   100 Hurric…      73
##  2 Felix  1995     8    12     6  22.9 -59        943   110 Hurric…      73
##  3 Felix  1995     8    12    12  23.6 -60.2      932   115 Hurric…      73
##  4 Felix  1995     8    12    18  24.3 -61        929   120 Hurric…      73
##  5 Felix  1995     8    13     0  25.1 -61.6      930   115 Hurric…      74
##  6 Felix  1995     8    13     6  25.9 -61.9      937   105 Hurric…      74
##  7 Felix  1995     8    13    12  26.6 -62.3      942   100 Hurric…      74
##  8 Luis   1995     9     1     6  15.8 -42.6      958   105 Hurric…      93
##  9 Luis   1995     9     1    12  16.2 -43.6      950   115 Hurric…      93
## 10 Luis   1995     9     1    18  16.5 -44.7      948   115 Hurric…      93
## # … with 189 more rows

Once again, we created a subset of storms_tbl that only includes observation where the pressure variable is less than or equal to 960 and the wind variable is greater than or equal to 100. However, rather than supplying pressure <= 960 and wind >= 100 as two arguments, we used a single R expression, combining them with the &. We’re pointing this out because we sometimes need to subset on an either/or basis, and in those cases we have to use this second approach. For example:

filter(storms_tbl, pressure <= 960 | wind >= 100)
## # A tibble: 266 x 11
##    name   year month   day  hour   lat  long pressure  wind type    seasday
##    <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>     <int>
##  1 Felix  1995     8    12     0  22.1 -57.8      955   100 Hurric…      73
##  2 Felix  1995     8    12     6  22.9 -59        943   110 Hurric…      73
##  3 Felix  1995     8    12    12  23.6 -60.2      932   115 Hurric…      73
##  4 Felix  1995     8    12    18  24.3 -61        929   120 Hurric…      73
##  5 Felix  1995     8    13     0  25.1 -61.6      930   115 Hurric…      74
##  6 Felix  1995     8    13     6  25.9 -61.9      937   105 Hurric…      74
##  7 Felix  1995     8    13    12  26.6 -62.3      942   100 Hurric…      74
##  8 Felix  1995     8    13    18  27.4 -62.3      947    95 Hurric…      74
##  9 Felix  1995     8    14     0  28.2 -62.5      948    90 Hurric…      75
## 10 Felix  1995     8    14     6  29   -62.9      954    80 Hurric…      75
## # … with 256 more rows

This creates a subset of storms_tbl that only includes observation where the pressure variable is less than or equal to 960 or the wind variable is greater than or equal to 100.

We’re also not restricted to using some combination of relational operators such as ==, >= or != when working with filter. The conditions specified in the filter function can be any expression that returns a logical vector. The only constraint is that the length of this logical vector has to equal the length of its input vectors.

Here’s an example. The group membership %in% operator (part of base R, not dplyr) is used to determine whether the values in one vector occurs among the values in a second vector. It’s used like this: vec1 %in% vec2. This returns a vector where the values are TRUE if an element of vec1 is in vec2, and FALSE otherwise. We can use the %in% operator with filter to select to subset rows by the values of one or more variables:

sub_storms_tbl <- filter(storms_tbl, name %in% c("Roxanne", "Marilyn", "Dolly"))
# print the output
sub_storms_tbl $ name
##   [1] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##   [8] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [15] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [22] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [29] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [36] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [43] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [50] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [57] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [64] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [71] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
##  [78] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
##  [85] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
##  [92] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
##  [99] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [106] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [113] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [120] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [127] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Dolly"   "Dolly"  
## [134] "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"  
## [141] "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"  
## [148] "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"   "Dolly"  
## [155] "Dolly"

13.3 Reording observations with arrange

We use arrange to reorder the rows of an object containing our data. This is sometimes used when we want to inspect a dataset to look for associations among the different variables. This is hard to do if they are not ordered. Basic usage of arrange looks like this:

arrange(data_set, vname1, vname2, ...)

Yes, this is pseudo-code. As always, the first argument, data_set, is the name of the object containing our data. We then include a series of one or more additional arguments, where each of these should be the name of a variable in data_set: vname1 and vname2 are names of the first two ordering variables, and the ... is acting as placeholder for the remaining variables.

To see arrange in action, let’s construct a new version of storms_tbl where the rows have been reordered first by wind, and then by pressure:

arrange(storms_tbl, wind, pressure)
## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Fran    1996     9     9    12  45.7 -72.3     1006    15 Extra…     101
##  2 Fran    1996     9     9    18  46   -71.1     1008    15 Extra…     101
##  3 Fran    1996     9    10     0  46.7 -70       1010    15 Extra…     102
##  4 Franc…  1998     9    13     6  31.7 -96.9     1002    20 Tropi…     105
##  5 Dean    1995     7    31    18  30.5 -96.5     1003    20 Tropi…      61
##  6 Erin    1995     8     4    12  33.2 -89.7     1003    20 Tropi…      65
##  7 Erin    1995     8     4    18  34.1 -90.2     1003    20 Tropi…      65
##  8 Erin    1995     8     5     0  34.8 -90.2     1003    20 Tropi…      66
##  9 Erin    1995     8     5     6  35.4 -90.1     1003    20 Tropi…      66
## 10 Erin    1995     8     5    12  36.3 -89.8     1003    20 Tropi…      66
## # … with 2,737 more rows

This creates a new version of storms_tbl where the rows are sorted according to the values of wind and pressure in ascending order – i.e. from smallest to largest. Since wind appears before pressure among the arguments, the values of pressure are only used to break ties within any particular value of wind.

For the sake of avoiding any doubt about how arrange works, let’s quickly review its behaviour:

  • The variable names used as arguments of arrange are not surrounded by quotes.

  • The arrange function did not change the original iris_tbl in any way.

  • The arrange function will return the same kind of data object it is working on.

There isn’t much else we need to to learn about arrange. By default, it sorts variables in ascending order. If we need it to sort a variable in descending order, we wrap the variable name in the desc function:

arrange(storms_tbl, wind, desc(pressure))
## # A tibble: 2,747 x 11
##    name    year month   day  hour   lat  long pressure  wind type   seasday
##    <chr>  <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>    <int>
##  1 Fran    1996     9    10     0  46.7 -70       1010    15 Extra…     102
##  2 Fran    1996     9     9    18  46   -71.1     1008    15 Extra…     101
##  3 Fran    1996     9     9    12  45.7 -72.3     1006    15 Extra…     101
##  4 Barry   1995     7     5     6  32   -72       1019    20 Extra…      35
##  5 Barry   1995     7     5    12  32   -72       1019    20 Extra…      35
##  6 Barry   1995     7     5    18  31.9 -72       1018    20 Extra…      35
##  7 Maril…  1995     9    30    12  34.6 -49.3     1016    20 Extra…     122
##  8 Maril…  1995     9    30    18  34.7 -50       1016    20 Extra…     122
##  9 Maril…  1995    10     1     0  34.8 -50.5     1016    20 Extra…     123
## 10 Maril…  1995    10     1     6  35   -51       1016    20 Extra…     123
## # … with 2,737 more rows

This creates a new version of storms_tbl where the rows are sorted according to the values of wind and pressure, in ascending and descending order, respectively.