Chapter 9 Data frames

9.1 Introduction

We learned in the A quick introduction to R chapter that the word “variable” is used as short hand for any kind of named object. For example, we can make a variable called num_vec that refers to a simple numeric vector using:

num_vec <- 1:10

When a computer scientist talks about variables they’re usually referring to these sorts of name-value associations. However, the word “variable” has a second, more abstract meaning in the world of data analysis and statistics: it refers to anything we can control or measure. For example, if our data comes from an experiment, the data will typically involve variables whose values describe the experimental conditions (e.g. “control plots” vs. “fertiliser plots”) and the quantities we chose to measure (e.g. species biomass and diversity).

These kinds of abstract variables are often called “statistical variables”. Statistical variables can be further broken down into a range of different types, such as numeric and categorical variables. We’ll discuss these later on in the Exploratory data analysis chapter. The reason we’re pointing out the dual meaning of the word “variable” here is because we need to work with both interpretations. The dual meaning can be confusing, but both meanings are in widespread use so we just have to get used to them. We’ll try to minimise confusion by using the phrase “statistical variable” when we are referring to data, rather than R objects.

We’re introducing these ideas now because we’re going to consider a new type of data object in R: the data frame. Real world data analysis involves collections of data (“data sets”) that involve several related statistical variables. We’ve seen that an atomic vector can only be used to store one type of data such as a collection of numbers. This means a vector can be used to store a single statistical variable, How should we keep a large collection of variables organised? We could work with them separately but this is very error prone. Ideally, we need a way to keep related variables together. This is the problem that data frames are designed to manage.

9.2 Data frames

Data frames are one of those R features that mark it out as a particularly good environment for data analysis. We can think of a data frame as table-like objects with rows and columns. They collect together different statistical variables, storing each of them as a different column. Related observations are all found in the same row. This will make more sense in a moment. Let’s consider the columns first.

Each column is a vector of some kind. These are usually simple vectors containing numbers or character strings, though it is also possible to include more complicated vectors inside data frames. We’ll only work with data frames made up of relatively simple vectors in this book. The key constraint that a data frame applies is that each vector must have the same length. This is what gives a data frame it table-like structure.

The simplest way to get a feel for data frames is to make one. Data frames are usually constructed by reading some external data into R, but for the purposes of learning about them it is better to build one from its component parts. We’ll make some artificial data describing a hypothetical experiment to do this. Imagine that we’ve conducted a small experiment to examine biomass and community diversity in six field plots. Three plots were subjected to fertiliser enrichment. The other three plots act as experimental controls. We could store the data describing this experiment in three vectors:

  • trt (short for “treatment”) shows which experimental manipulation was used.

  • bms (short for “biomass”) shows the total biomass measured at the end of the experiment.

  • div (short for “diversity”) shows the number of species present at the end of the experiment.

Here’s some R code to generate these three vectors (it doesn’t matter what the actual values are, they’re made up):

trt <- rep(c("Control","Fertilser"), each = 3) 
bms <- c(284, 328, 291, 956, 954, 685)
div <- c(8, 12, 11, 8, 4, 5)
trt
## [1] "Control"   "Control"   "Control"   "Fertilser" "Fertilser" "Fertilser"
bms
## [1] 284 328 291 956 954 685
div
## [1]  8 12 11  8  4  5

Notice that the information about different observations are linked by their positions in these vectors. For example, the third control plot had a biomass of ‘291’ and a species diversity ‘11’.

We can use the data.frame function to construct a data frame from one or more vectors. To build a data frame from the three vectors we created and print these to the Console, we use:

experim.data <- data.frame(trt, bms, div)
experim.data
##         trt bms div
## 1   Control 284   8
## 2   Control 328  12
## 3   Control 291  11
## 4 Fertilser 956   8
## 5 Fertilser 954   4
## 6 Fertilser 685   5

Notice what happens when we print the data frame: it is displayed as though it has rows and columns. That’s what we meant when we said a data frame is a table-like structure. The data.frame function takes a variable number of arguments. We used the trt, bms and div vectors as arguments, resulting in a data frame with three columns. Each of these vectors has 6 elements, so the resulting data frame has 6 rows. The names of the vectors were used to name its columns. The rows do not have names, but they are numbered to reflect their position.

The words trt, bms and div are not very informative. If we prefer to work with more informative column names—this is always a good idea—then we have to name the data.frame arguments:

experim.data <- data.frame(Treatment = trt, Biomass = bms, Diversity = div)
experim.data
##   Treatment Biomass Diversity
## 1   Control     284         8
## 2   Control     328        12
## 3   Control     291        11
## 4 Fertilser     956         8
## 5 Fertilser     954         4
## 6 Fertilser     685         5

The new data frame contains the same data as the previous one but now the column names correspond to the names we chose. These names are better because they describe each variable using a human-readable word.

Don’t bother with row names

We can also name the rows of a data frame using the row.names argument of the data.frame function. We won’t bother to show an example of this though. Why? We can’t easily work with the information in row names so there’s not much point adding it. If we need to include row-specific data in a data frame it’s best to include an additional variable, i.e. an extra column.

9.3 Exploring data frames

The first things we should do when presented with a new data set is explore its structure to understand what we’re dealing with. This is easy when the data is stored in a data frame. If the data set is reasonably small we can just print it to the Console. This is not very practical for even moderate-sized data sets though. The head and tail functions extract the first and last few rows of a data set, so these can be used to print part of a data set. The n argument controls the number of rows printed:

head(experim.data, n = 3)
##   Treatment Biomass Diversity
## 1   Control     284         8
## 2   Control     328        12
## 3   Control     291        11
tail(experim.data, n = 3)
##   Treatment Biomass Diversity
## 4 Fertilser     956         8
## 5 Fertilser     954         4
## 6 Fertilser     685         5

The View function can be used to visualise the whole data set in a spreadsheet like view:

View(experim.data)

This shows the rows and columns of the data frame argument in a table- or spreadsheet-like format. When we run this in RStudio a new tab opens up with the experim.data data inside it.

View only displays the data

The View function is designed to allow us to display the data in a data frame as a table of rows and columns. We can’t change the data in any way with the View function. We can reorder the way the data are presented, but keep in mind that this won’t alter the underlying data.

There are quite a few different R functions that will extract information about a data frame. The nrow and ncol functions return the number of rows and columns, respectively:

nrow(experim.data)
## [1] 6
ncol(experim.data)
## [1] 3

The names function is used to extract the column names from a data frame:

colnames(experim.data)
## [1] "Treatment" "Biomass"   "Diversity"

The experim.data data frame has three columns, so names returns a character vector of length three, where each element corresponds to a column name. There is also a rownames function if we need that too. The nrow, ncol, names and rownames functions each return a vector, so we can assign the result if we need to use it later. For example, if we want to extract and store the column names for any reason we could use varnames <- names(experim.data).

9.4 Extracting data from data frames

Data frames would not be much use if we could not extract and modify the data in them. In this section we will briefly review how to carry out these kinds of operations using basic R functions.

9.4.1 Extracting and adding a single variable

A data frame is just a collection of variables stored in columns, where each column is a vector of some kind. There are several ways to extract these variables from a data frame. If we just want to extract a single variable we have two options.

The first way of extracting a variable from a data frame uses a double square brackets construct, [[. For example, we extract the Biomass variable from our example data frame with the double square brackets like this:

experim.data[["Biomass"]]
## [1] 284 328 291 956 954 685

This prints whatever is in the Biomass column to the Console. What kind of object is this? It’s a numeric vector:

is.numeric(experim.data[["Biomass"]])
## [1] TRUE

A data frame really is nothing more than a collection of vectors. Notice that all we did was print the resulting vector to the Console. If we want to actually do something with this numeric vector we need to assign the result:

bmass <- experim.data$Biomass
bmass^2
## [1]  80656 107584  84681 913936 910116 469225

Here, we extracted the Biomass variable, assigned it to bmass, and then squared this. The value of Biomass variable inside the experim.data data frame is unchanged.

Notice that we used "Biomass" instead of Biomass inside the double square brackets, i.e. we quoted the name of the variable. This is because we want R to treat the word “Biomass” as a literal value. This little detail is important! If we don’t quote the name then R will assume that Biomass is the name of an object and go in search of it in the global environment. Since we haven’t created something called Biomass, leaving out the quotes generates an error:

experim.data[[Biomass]]
## Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, : object 'Biomass' not found

The error message is telling us that R can’t find a variable called Biomass in the global environment. On the other hand, this example does work:

vname <- "Biomass"
experim.data[[vname]]
## [1] 284 328 291 956 954 685

This works because we first defined vname to be a character vector of length one, whose value is the name of a variable in experim.data. When R encounters vname inside the [[ construct it goes and finds the value associated with it and uses this value to determine the variable to extract.

The second method for extracting a variable from a data frame uses the $ operator. For example, to extract the Biomass column from the experim.data data frame, we use:

experim.data$Biomass
## [1] 284 328 291 956 954 685

We use the $ operator by placing the name of the data frame we want to work with on the left hand side and and the name of the column (i.e. the variable) we want to extract on the right hand side. Notice that this time we didn’t have to put quotes around the variable name when using the $ operator. We can do this if we want to—i.e. experim.data$"Biomass" also works—but $ doesn’t require it.

Why is there more than one way to extract variables from a data frame? There’s no simple way to answer this question without getting into the details of how R represents data frames. The simple answer is that $ and [[ are not actually equivalent, even though they appear to do much the same thing. We’ve looked at the two extraction methods because they are both widely used. However, the $ method is a bit easier to read and people tend to prefer it for interactive data analysis tasks (the [[ construct tends to be used when we need a bit more flexibility).

9.4.2 Adding a variable to a data frame

How do we add a new variable to an existing data frame? It turns out that the $ operator is also be used for this job by combining it with the assignment operator. Using it this way is fairly intuitive. For example, if we want to add a new (made up) variable called Elevation to experim.data, we do it like this:

experim.data$Elevation <- c(364, 294, 321, 358, 298, 312)

This assigns some fake elevation data to a new variable in experim.data using the $ operator. The new variable is called Elevation because that was the name we used on the right hand side of $. This changes experim.data, such that it now contains four columns (variables):

head(experim.data, n = 3)
##   Treatment Biomass Diversity Elevation
## 1   Control     284         8       364
## 2   Control     328        12       294
## 3   Control     291        11       321

The [[ operator can also be used with <- to add variables to a data frame. We won’t bother to show an example, as it works in exactly the same way as $ and we won’t be using the [[ method in this book.

9.4.3 Subsetting data frames

What do we do if, instead of just extracting a single variable from a data frame, we need to select a subset of rows and/or columns? We use the single square brackets construct, [, to do this. There are two different ways we can use single square brackets, both of which involve the use of indexing vector(s) inside the [ construct.

The first use of [ allows us to subset one or more columns while keeping all the rows. This works exactly as the [ does for vectors. Just think of columns as the elements of the data frame. For example, if we want to subset experim.data such that we are only left with the first and second columns (Treatment and Biomass), we can use a numeric indexing vector:

experim.data[c(1:2)]
##   Treatment Biomass
## 1   Control     284
## 2   Control     328
## 3   Control     291
## 4 Fertilser     956
## 5 Fertilser     954
## 6 Fertilser     685

However, this is not a very good way to subset columns because we have to know the position of each variable. If for some reason we change the order of the columns, we have to update our R code accordingly. A better approach uses a character vector of column names inside the [:

experim.data[c("Treatment", "Biomass")]
##   Treatment Biomass
## 1   Control     284
## 2   Control     328
## 3   Control     291
## 4 Fertilser     956
## 5 Fertilser     954
## 6 Fertilser     685

The second use of [ ] is designed to allow us to subset rows and columns at the same time. We have to specify both the rows and the columns we require, using a comma (“,”) to separate a row and column index vector. This is easiest to understand with an example:

# row index
rindex <- 1:3
# column index 
cindex <- c("Treatment", "Biomass")
# subset the data frame
experim.data[rindex, cindex]
##   Treatment Biomass
## 1   Control     284
## 2   Control     328
## 3   Control     291

This example extracts a subset of experim.data corresponding to rows 1 through 3, and columns “Treatment” and “Biomass”. The rindex is a numeric vector of row positions, and cindex is a character vector of column names. This shows that rows and columns can be selected by referencing their position or their names. The rows are not named in experim.data, so we specified the positions.

Storing the index vectors first is quite a long-winded way of subsetting a data frame. However, there is nothing to stop us doing everything in one step:

experim.data[1:3, c("Treatment", "Biomass")]
##   Treatment Biomass
## 1   Control     284
## 2   Control     328
## 3   Control     291

If we need to subset just rows or columns we just leave out the appropriate index vector:

experim.data[1:3, ]
##   Treatment Biomass Diversity Elevation
## 1   Control     284         8       364
## 2   Control     328        12       294
## 3   Control     291        11       321

The absence of an index vector before/after the comma indicates that we want to keep every row/column. Here we kept all the columns but only the first three rows.

Be careful with [

Subsetting with the [rindex, cindex] construct produces another data frame. This should be apparent from the way the last example was printed. This is usually how this construct works. We say usually, because subsetting just one column produces a vector. This is very unfortunate, as it produces unpredictable behaviour if we’re not paying attention.

The [ construct works with three types of index vectors. We’ve just seen that the index vector can be a numeric or character type. The third approach uses a logical index vector. For example, we can subset the experim.data data frame, keeping just the rows where the Treatment variable is equal to “Control”, using:

# make a logical index vector
rindex <- experim.data $ Treatment == "Control"
rindex
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
# 
experim.data[rindex, ]
##   Treatment Biomass Diversity Elevation
## 1   Control     284         8       364
## 2   Control     328        12       294
## 3   Control     291        11       321

Notice that we construct the logical rindex vector by extracting the Treatment variable with the $ operator and using the == operator to test for equality with “Control”. Don’t worry too much if that seems confusing. We combined many different ideas in that example. We’re going to learn a much more transparent way to achieve the same result in later chapters.

9.5 Final words

We’ve seen how to extract/add variables and subset data frames using the $, [[ and [ constructs. The last example also showed that we can use a combination of relational operators (e.g. ==, != or >=) and the square brackets construct to subset a data frame according to one or more criteria. There are also a number of base R functions that allow us to manipulate data frames in a slightly more intuitive way. For example, there is a function called transform that adds new variables and changes existing ones, and a function called subset to select variables and subset rows in a data frame according to the values of its variables.

We’ve shown these approaches because they’re still used by many people. However, we will rely on the dplyr package to handle operations like subsetting and transforming data frame variables in this book. The dplyr package provides a much cleaner, less error prone framework for manipulating data frames, and can be used to work with similar kinds of objects that store data in consistent way. Before we can do that though, we need to learn a little bit about how to organise and import data into R.