Chapter 6 Extracting subsets of vectors

6.1 Introduction

At the beginning of the last chapter we said that an atomic vector is a 1-dimensional object that contains an ordered collection of data values. Why is this view of a vector useful? It means that we can extract a subset of elements from a vector once we know their position in that vector. There are two main ways to subset atomic vectors, both of which we’ll cover in this chapter. Whatever the method we use, subsetting involves a pair of opening and closing square brackets ([ and ]). These are always used together.

6.2 Subsetting by position

We can use the [ construct with a vector to subset its elements directly using their position. Take a look at this example:

numvec <- c(7.2, 3.6, 2.9)
numvec[2]
## [1] 3.6

The numvec variable is a length 3 numeric vector. In order to subset it and retain only the second element we used the [ ] construct with the number 2 inside, placing [2] next to the name of the vector. Notice that we do not place a space anywhere in this construct. We could, but this is not conventional.

Remember, “the number 2” is in fact a numeric vector of length 1. This suggests we might be able to use longer vectors to extract more than one element:

# make a numeric sequence
my_vec <- seq(0, 0.5, by = 0.1)
my_vec
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
# make an indexing vector
i <- c(1, 3)
# extract a subset of values
my_vec[i]
## [1] 0.0 0.2

We first constructed a numeric vector of length 2 called i, which has elements 1 and 3. We then used this to extract his first and third value by placing i inside the [ ]. We didn’t have to carry out the subsetting operation in two steps. This achieves the same result:

my_vec[c(1, 3)]
## [1] 0.0 0.2

Notice that when we subset a vector of a particular type, we get a vector of the same type back, e.g. subsetting a numeric vector produces another numeric vector.

We can also subset a vector by removing certain elements. We use the - operator to do this. Here’s an example that produces the same result as the last example, but in a different way:

my_vec[-c(2, 4, 5, 6)]
## [1] 0.0 0.2

The my_vec[-c(2, 4, 5, 6)] expression indicates that we want all the elements of c_vec except those that at position 2, 4, 5, and 6.

We have learned how to use the [ construct with a numeric vector of integer(s) to subset the elements of vector by their position. This works exactly the same way with character vectors:

c_vec <- c("Dylan", "Zachary", "Childs")
c_vec[1]
## [1] "Dylan"

The c_vec variable is a length 3 character vector, with elements corresponding to his first, middle and last name. We used the [ ] construct with the number 1 inside, to extract the first element (i.e. the first name). Longer vectors can be used to extract more than one element, and we can use negative numbers to remove elements:

# extract the first and third value
c_vec[c(1, 3)]
## [1] "Dylan"  "Childs"
# drop the second value (equivalent)
c_vec[-2]
## [1] "Dylan"  "Childs"

6.3 Subsetting with logical operators

Subsetting vectors by position suffers from one major drawback—we have to know where the elements we want sit in the vector. A second way to subset a vector makes use of logical vectors alongside [ ]. This is usually done using two vectors of the same length: the focal vector we wish to subset, and a logical vector that specifies which elements to keep. Here is a very simple example:

i <- c(TRUE, FALSE, TRUE)
c_vec[i]  
## [1] "Dylan"  "Childs"

This should be fairly self-explanatory. The logical vector method of subsetting works element-by-element, and an element in c_vec is retained wherever the corresponding element in i contains a TRUE; otherwise it is discarded.

Why is this better than using position indexing? After all, using a vector of positions to subset a vector is much more direct. The reason is that we can use relational operators to help us select elements according to particular criteria. This is best illustrated with an example. We’ll start by defining two vectors of the same length:

name <- c("cat", "dog", "wren", "pig", "owl")
name
## [1] "cat"  "dog"  "wren" "pig"  "owl"
type <- c("mammal", "mammal", "bird", "mammal", "bird")
type
## [1] "mammal" "mammal" "bird"   "mammal" "bird"

The first, name, is a character vector containing the common names of a few animals. The second, type, is another character vector whose elements denote the type of animal, in this case, a bird or a mammal. The vectors are arranged such that the information is associated via the position of elements in each vector (cats and dogs are mammals, a wren is a bird, and so on).

Let’s assume that we want to create a subset of name that only contains the names of mammals. We can do this by creating a logical vector from type, where the values are TRUE when an element is equal to "mammal" and FALSE otherwise. We know how to do this using the == operator:

i <- type == "mammal"
i
## [1]  TRUE  TRUE FALSE  TRUE FALSE

We stored the result in a variable called i. Now all we need to do is use with i inside the [ ] construct to subset name:

name[i]
## [1] "cat" "dog" "pig"

We did this the long way to understand the logic of subsetting vectors with logical operators. This is quite verbose though, and we usually combine the two steps into a single R expression:

name[type == "mammal"]
## [1] "cat" "dog" "pig"

We can use any of the relational operators to subset vectors like this. If we define a numeric variable that contains the mean mass (in grams) of each animal, we can use this to subset names according to the associated mean mass. For example, if we want a subset that contains only those animals where the mean mass is greater than 1kg we use:

mass <- c(2900, 9000, 10, 18000, 2000)
name[mass > 1000]
## [1] "cat" "dog" "pig" "owl"

Just remember, this way of using information in one vector to create subsets of a second vector only works if the information in each is associated via the position of their respective elements. Keeping a bunch of different vectors organised like this is difficult and error prone. In the next block we’ll learn how to use something called a data frame and the dplyr package to make working with a collection of related vectors much easier.