Chapter 13 Working with observations
13.1 Introduction
This chapter will explore the filter
and arrange
verbs. These are discussed together because they are used to manipulate observations (i.e. rows) of a data frame or tibble:
The
filter
function extracts a subset of observations based on supplied conditions involving the variables in our data.The
arrange
function reorders the rows according to the values in one or more variables.
13.1.1 Getting ready
We should start a new script by loading and attaching the dplyr package:
library("dplyr")
We’re going to use the storms
data set in the nasaweather package this time. This means we need to load and attach the nasaweather package to make storms
available:
library("nasaweather")
The storms
data set is an ordinary data frame, so let’s convert it to a tibble so that it prints nicely:
storms_tbl <- tbl_df(storms)
13.2 Subset observations with filter
We use filter
to subset observations in a data frame or tibble containing our data. This is often done when we want to limit an analysis to a subset of observations. Basic usage of filter
looks something like this:
filter(data_set, <expression1>, <expression1>, ...)
Remember, this is pseudo code (it’s not an example we can run). The first argument, data_set
, must the name of the object containing our data. We then include one or more additional arguments, where each of these is a valid R expression involving one or more variables in data_set
. Each expression must return a logical vector. We’ve expressed these as <expression1>, <expression2>, ...
, where <expression1>
and <expression2>
represent the first two expressions, and the ...
is acting as placeholder for the remaining expressions.
To see how filter
works in action, we’ll use it to subset observations in the storms_tbl
dataset, based on two relational criteria:
filter(storms_tbl, pressure <= 960, wind >= 100)
## # A tibble: 199 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Felix 1995 8 12 0 22.1 -57.8 955 100 Hurric… 73
## 2 Felix 1995 8 12 6 22.9 -59 943 110 Hurric… 73
## 3 Felix 1995 8 12 12 23.6 -60.2 932 115 Hurric… 73
## 4 Felix 1995 8 12 18 24.3 -61 929 120 Hurric… 73
## 5 Felix 1995 8 13 0 25.1 -61.6 930 115 Hurric… 74
## 6 Felix 1995 8 13 6 25.9 -61.9 937 105 Hurric… 74
## 7 Felix 1995 8 13 12 26.6 -62.3 942 100 Hurric… 74
## 8 Luis 1995 9 1 6 15.8 -42.6 958 105 Hurric… 93
## 9 Luis 1995 9 1 12 16.2 -43.6 950 115 Hurric… 93
## 10 Luis 1995 9 1 18 16.5 -44.7 948 115 Hurric… 93
## # … with 189 more rows
In this example we’ve created a subset of storms_tbl
that only includes observations where the pressure
variable is less than or equal to 960 and the wind
variable is greater than or equal to 100. Both conditions must be met for an observation to be included in the resulting tibble. The conditions are not combined as an either/or operation.
This is probably starting to become repetitious, but there are a few features of filter
that we should note:
Each expression that performs a comparison is not surrounded by quotes. This makes sense, because the expression is meant to be evaluated to return a logical vector – it is not “a value”.
As usual, the result produced by
mutate
in our example was printed to the Console. Themutate
function did not change the originalstorms_tbl
in any way (no side effects!).The
filter
function will return the same kind of data object it is working on: it returns a data frame if our data was originally in a data frame, and a tibble if it was a tibble.
We can achieve the same result as the above example in a different way. This involves the &
operator:
filter(storms_tbl, pressure <= 960 & wind >= 100)
## # A tibble: 199 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Felix 1995 8 12 0 22.1 -57.8 955 100 Hurric… 73
## 2 Felix 1995 8 12 6 22.9 -59 943 110 Hurric… 73
## 3 Felix 1995 8 12 12 23.6 -60.2 932 115 Hurric… 73
## 4 Felix 1995 8 12 18 24.3 -61 929 120 Hurric… 73
## 5 Felix 1995 8 13 0 25.1 -61.6 930 115 Hurric… 74
## 6 Felix 1995 8 13 6 25.9 -61.9 937 105 Hurric… 74
## 7 Felix 1995 8 13 12 26.6 -62.3 942 100 Hurric… 74
## 8 Luis 1995 9 1 6 15.8 -42.6 958 105 Hurric… 93
## 9 Luis 1995 9 1 12 16.2 -43.6 950 115 Hurric… 93
## 10 Luis 1995 9 1 18 16.5 -44.7 948 115 Hurric… 93
## # … with 189 more rows
Once again, we created a subset of storms_tbl
that only includes observation where the pressure
variable is less than or equal to 960 and the wind
variable is greater than or equal to 100. However, rather than supplying pressure <= 960
and wind >= 100
as two arguments, we used a single R expression, combining them with the &
. We’re pointing this out because we sometimes need to subset on an either/or basis, and in those cases we have to use this second approach. For example:
filter(storms_tbl, pressure <= 960 | wind >= 100)
## # A tibble: 266 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Felix 1995 8 12 0 22.1 -57.8 955 100 Hurric… 73
## 2 Felix 1995 8 12 6 22.9 -59 943 110 Hurric… 73
## 3 Felix 1995 8 12 12 23.6 -60.2 932 115 Hurric… 73
## 4 Felix 1995 8 12 18 24.3 -61 929 120 Hurric… 73
## 5 Felix 1995 8 13 0 25.1 -61.6 930 115 Hurric… 74
## 6 Felix 1995 8 13 6 25.9 -61.9 937 105 Hurric… 74
## 7 Felix 1995 8 13 12 26.6 -62.3 942 100 Hurric… 74
## 8 Felix 1995 8 13 18 27.4 -62.3 947 95 Hurric… 74
## 9 Felix 1995 8 14 0 28.2 -62.5 948 90 Hurric… 75
## 10 Felix 1995 8 14 6 29 -62.9 954 80 Hurric… 75
## # … with 256 more rows
This creates a subset of storms_tbl
that only includes observation where the pressure
variable is less than or equal to 960 or the wind
variable is greater than or equal to 100.
We’re also not restricted to using some combination of relational operators such as ==
, >=
or !=
when working with filter
. The conditions specified in the filter
function can be any expression that returns a logical vector. The only constraint is that the length of this logical vector has to equal the length of its input vectors.
Here’s an example. The group membership %in%
operator (part of base R, not dplyr
) is used to determine whether the values in one vector occurs among the values in a second vector. It’s used like this: vec1 %in% vec2
. This returns a vector where the values are TRUE
if an element of vec1
is in vec2
, and FALSE
otherwise. We can use the %in%
operator with filter
to select to subset rows by the values of one or more variables:
sub_storms_tbl <- filter(storms_tbl, name %in% c("Roxanne", "Marilyn", "Dolly"))
# print the output
sub_storms_tbl $ name
## [1] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [8] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [15] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [22] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [29] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [36] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [43] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [50] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [57] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [64] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [71] "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn" "Marilyn"
## [78] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [85] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [92] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [99] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [106] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [113] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [120] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne"
## [127] "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Roxanne" "Dolly" "Dolly"
## [134] "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly"
## [141] "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly"
## [148] "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly" "Dolly"
## [155] "Dolly"
13.3 Reording observations with arrange
We use arrange
to reorder the rows of an object containing our data. This is sometimes used when we want to inspect a dataset to look for associations among the different variables. This is hard to do if they are not ordered. Basic usage of arrange
looks like this:
arrange(data_set, vname1, vname2, ...)
Yes, this is pseudo-code. As always, the first argument, data_set
, is the name of the object containing our data. We then include a series of one or more additional arguments, where each of these should be the name of a variable in data_set
: vname1
and vname2
are names of the first two ordering variables, and the ...
is acting as placeholder for the remaining variables.
To see arrange
in action, let’s construct a new version of storms_tbl
where the rows have been reordered first by wind
, and then by pressure
:
arrange(storms_tbl, wind, pressure)
## # A tibble: 2,747 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Fran 1996 9 9 12 45.7 -72.3 1006 15 Extra… 101
## 2 Fran 1996 9 9 18 46 -71.1 1008 15 Extra… 101
## 3 Fran 1996 9 10 0 46.7 -70 1010 15 Extra… 102
## 4 Franc… 1998 9 13 6 31.7 -96.9 1002 20 Tropi… 105
## 5 Dean 1995 7 31 18 30.5 -96.5 1003 20 Tropi… 61
## 6 Erin 1995 8 4 12 33.2 -89.7 1003 20 Tropi… 65
## 7 Erin 1995 8 4 18 34.1 -90.2 1003 20 Tropi… 65
## 8 Erin 1995 8 5 0 34.8 -90.2 1003 20 Tropi… 66
## 9 Erin 1995 8 5 6 35.4 -90.1 1003 20 Tropi… 66
## 10 Erin 1995 8 5 12 36.3 -89.8 1003 20 Tropi… 66
## # … with 2,737 more rows
This creates a new version of storms_tbl
where the rows are sorted according to the values of wind
and pressure
in ascending order – i.e. from smallest to largest. Since wind
appears before pressure
among the arguments, the values of pressure
are only used to break ties within any particular value of wind
.
For the sake of avoiding any doubt about how arrange
works, let’s quickly review its behaviour:
The variable names used as arguments of
arrange
are not surrounded by quotes.The
arrange
function did not change the originaliris_tbl
in any way.The
arrange
function will return the same kind of data object it is working on.
There isn’t much else we need to to learn about arrange
. By default, it sorts variables in ascending order. If we need it to sort a variable in descending order, we wrap the variable name in the desc
function:
arrange(storms_tbl, wind, desc(pressure))
## # A tibble: 2,747 x 11
## name year month day hour lat long pressure wind type seasday
## <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
## 1 Fran 1996 9 10 0 46.7 -70 1010 15 Extra… 102
## 2 Fran 1996 9 9 18 46 -71.1 1008 15 Extra… 101
## 3 Fran 1996 9 9 12 45.7 -72.3 1006 15 Extra… 101
## 4 Barry 1995 7 5 6 32 -72 1019 20 Extra… 35
## 5 Barry 1995 7 5 12 32 -72 1019 20 Extra… 35
## 6 Barry 1995 7 5 18 31.9 -72 1018 20 Extra… 35
## 7 Maril… 1995 9 30 12 34.6 -49.3 1016 20 Extra… 122
## 8 Maril… 1995 9 30 18 34.7 -50 1016 20 Extra… 122
## 9 Maril… 1995 10 1 0 34.8 -50.5 1016 20 Extra… 123
## 10 Maril… 1995 10 1 6 35 -51 1016 20 Extra… 123
## # … with 2,737 more rows
This creates a new version of storms_tbl
where the rows are sorted according to the values of wind
and pressure
, in ascending and descending order, respectively.