Chapter 4 Numeric vectors
4.1 Introduction
The term “data structure” is computer science jargon for a particular way of organising data on our computers, so that it can be accessed easily and efficiently. Computer languages use many different kinds of data structures, but fortunately, we only need to learn about a couple of relatively simple ones to use R for data analysis. In fact, in this book only two kinds of data structure really matter: “vectors” and “data frames”. We’ll learn how to work with data frames in the next section of the book.
The next three chapters will consider vectors. This chapter has two goals. First, we want to learn the basics of how to work with vectors. For example, we’ll see how “vectorised operations” may be used to express a repeated calculation. Second, we’ll learn how to construct and use numeric vectors to perform various calculations. Keep in mind that although we’re going to focus on numeric vectors, many of the principles we learn here can be applied to the other kinds of vectors considered later.
4.2 Atomic vectors
A vector is a 1-dimensional object that contains a set of data values, which are accessible by their position: position 1, position 2, position 3, and so on. When people talk about vectors in R they’re often referring to atomic vectors. An atomic vector is the simplest kind of data structure in R. There are a few different kinds of atomic vector, but the defining feature of each one is that it can only contain data of one type. An atomic vector might contain all integers (e.g. 2, 4, 6, …) or all characters (e.g. “A”, “B”, “C”), but it can’t mix and match integers and characters (e.g. “A”, 2, “C”, 5).
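To make the one-type rule concrete, here is a small sketch of what happens if we try to mix characters and numbers anyway. It uses the c function, which is introduced later in this chapter, and the name mixed is just an illustrative choice. R does not complain; it quietly converts every element to a single type:

```r
# try to combine characters and numbers in one vector
mixed <- c("A", 2, "C", 5)
mixed
## [1] "A" "2" "C" "5"
# every element is now a character string
is.character(mixed)
## [1] TRUE
```

The quotation marks around 2 and 5 in the output show that they are no longer stored as numbers.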
The word “atomic” in the name refers to the fact that an atomic vector can’t be broken down into anything simpler—they are the simplest kind of data structure R knows about. Even when working with a single number we’re actually dealing with an atomic vector. It just happens to be of length one. Here’s the very first expression we evaluated in the first chapter:
1 + 1
## [1] 2
Look at the output. What is that [1] at the beginning? We ignored it before because we weren’t in a position to understand its significance. The [1] is a clue that the output resulting from 1 + 1 is an atomic vector. We can verify this with the is.vector and is.atomic functions:
x <- 1 + 1
# what value is associated with x?
x
## [1] 2
# is it a vector?
is.vector(x)
## [1] TRUE
# is it atomic?
is.atomic(x)
## [1] TRUE
This little exercise demonstrates an important point about R. Atomic vectors really are the simplest kind of data structure in R. Unlike many other languages there is simply no way to represent just a number. Instead, a single number must be stored as a vector of length one.
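As a quick check of the "length one" idea, we can ask R how many elements a single number contains. This is only a sketch using the length function, which reports the number of elements in a vector:

```r
x <- 1 + 1
# how many elements does x contain?
length(x)
## [1] 1
```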
4.3 Numeric vectors
A lot of work in R involves numeric vectors. After all, data analysis is all about numbers. Here’s a simple way to construct a numeric vector:
numeric(length = 50)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What happened? We made a numeric vector with 50 elements, each of which is the number 0. The word “element” is used to reference any object (a number in this case) that resides at a particular position in a vector.
When we create a vector but don’t assign it to a name using <-, R just prints it to the Console for us. Notice what happens when the vector is printed to the screen. Since the length-50 vector can’t fit on one line, it was printed over two. At the beginning of each line there is a [X]: the X is a number that describes the position of the element shown at the beginning of a particular line.
Different kinds of numbers
Roughly speaking, R stores numbers in two different ways depending on whether they are whole numbers (“integers”) or numbers containing decimal points (“doubles” – don’t ask). We’re not going to worry about this distinction. Most of the time the distinction is fairly invisible to users so it is easier to just think in terms of numeric vectors. We can mix and match integers and doubles in R without having to worry too much about how R is storing the numbers.
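For the curious, the sketch below shows one way to peek at how R is storing a set of numbers. It relies on the typeof function and on the colon operator and c function (both covered later in this chapter); the names int_vec and dbl_vec are made up for this illustration:

```r
# whole numbers made with the colon operator are stored as integers
int_vec <- 1:3
typeof(int_vec)
## [1] "integer"
# numbers typed in with decimal points are stored as doubles
dbl_vec <- c(0.5, 0.5, 0.5)
typeof(dbl_vec)
## [1] "double"
# the two kinds can be mixed freely in a calculation
int_vec + dbl_vec
## [1] 1.5 2.5 3.5
```

We rarely need to do this kind of check in practice, which is why we mostly just talk about numeric vectors.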
If we need to check that we really have made a numeric vector we can use the is.numeric function to do this:
# let's create a variable that is a numeric vector
numvec <- numeric(length = 50)
# check it really is a numeric vector
is.numeric(numvec)
## [1] TRUE
This returns TRUE as expected. A value of FALSE would imply that numvec is some other kind of object. This may not look like the most useful function in the world, but sometimes we need functions like is.numeric to understand what R is doing or root out mistakes in our scripts.
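To see the FALSE case, here is a short sketch that applies is.numeric to a vector of characters. The vector is built with the c function, which is covered in the next section, and the name charvec is just an illustrative choice:

```r
# a vector of characters is not a numeric vector
charvec <- c("a", "b", "c")
is.numeric(charvec)
## [1] FALSE
```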
Keep in mind that when we print a numeric vector to the Console, R only prints the elements to 7 significant figures by default. We can see this by printing the built-in constant pi to the Console:
pi
## [1] 3.141593
The value stored in pi is actually much more precise than this. We can see this by printing pi again using the print function:
print(pi, digits = 16)
## [1] 3.141592653589793
4.4 Constructing numeric vectors
We just saw how to use the numeric function to make a numeric vector of zeros. The size of the vector is controlled by the length argument: we used length = 50 above to make a vector with 50 elements. This is arguably not a particularly useful skill, as we usually need to work with vectors of particular values (not just 0). A very useful function for creating custom vectors is the c function. Take a look at this example:
c(1.1, 2.3, 4.0, 5.7)
## [1] 1.1 2.3 4.0 5.7
The “c” in the function name stands for “combine”. The c function takes a variable number of arguments, each of which must be a vector of some kind, and combines these into a single, new vector. We supplied the c function with four arguments, each of which was a vector of length 1 (remember: a single number is treated as a length-one vector). The c function combines these to generate a vector of length 4. Simple. Now look at this example:
vec1 <- c(1.1, 2.3)
vec2 <- c(4.0, 5.7, 3.6)
c(vec1, vec2)
## [1] 1.1 2.3 4.0 5.7 3.6
This shows that we can use the c function to concatenate (“stick together”) two or more vectors, even if they are not of length 1. We combined a length-2 vector with a length-3 vector to produce a new length-5 vector.
Notice that we did not have to name the arguments in those two examples: there were no = involved. The c function is an example of one of those flexible functions that breaks the simple rules of thumb for using arguments that we set out earlier: it can take a variable number of arguments, and these arguments do not have predefined names. This behaviour is necessary for c to be of any use: it needs to be flexible enough to take any combination of arguments.
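As a brief illustration of that flexibility, the sketch below passes four arguments to c, two single numbers and two longer vectors, in a single call. The particular values are arbitrary:

```r
vec1 <- c(1.1, 2.3)
vec2 <- c(4.0, 5.7, 3.6)
# mix single numbers and longer vectors in one call
c(0.5, vec1, 9.9, vec2)
## [1] 0.5 1.1 9.9 2.3 4.0 5.7 3.6
```

The elements appear in the result in the same order as the arguments were supplied.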
Finding out about a vector
Sometimes it is useful to be able to find out a little extra information about a vector you are working with, especially if it is very large. Three functions that can extract some useful information about a vector for us are length, head and tail. Using a variety of different vectors, experiment with these to find out what they do. Make sure you use vectors that aren’t too short (e.g. with a length of at least 10). Hint: head and tail can be used with a second argument, n.
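If it is not obvious where to begin, the following is one possible starting point rather than a full answer. The vector long_vec and its values are made up for this sketch; try changing the value of n and swapping head for tail to see what happens:

```r
# a vector with ten elements
long_vec <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
# how many elements are there?
length(long_vec)
## [1] 10
# the first few elements, and the last few
head(long_vec)
## [1] 3 1 4 1 5 9
tail(long_vec)
## [1] 5 9 2 6 5 3
# the n argument controls how many elements are returned
head(long_vec, n = 3)
## [1] 3 1 4
```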
4.5 Named vectors
What happens if we name the arguments to c when constructing a vector? Take a look at this:
namedv <- c(a = 1, b = 2, c = 3)
namedv
## a b c
## 1 2 3
What happened here is that the argument names were used to define the names of elements in the vector we made. The resulting vector is still a 1-dimensional data structure. When it is printed to the Console the value of each element is printed, along with the associated name above it. We can extract the names from a named vector using the names function:
names(namedv)
## [1] "a" "b" "c"
Being able to name the elements of a vector is very useful because it enables us to more easily identify relevant information and extract the bits we need—we’ll see how this works in the next chapter.
4.6 Generating sequences of numbers
The main limitation of the c function is that we have to manually construct vectors from their elements. This isn’t very practical if we need to construct very long vectors of numeric values. There are various functions that can help with this kind of thing though. These are useful when the elements of the target vector need to follow a sequence or repeating pattern. This may not appear all that useful at first, but repeating sequences are used a lot in R.
4.6.1 Generating sequences of numbers
The seq function allows us to generate sequences of numbers. We usually give it at least two arguments, but there are several different ways to control the sequence produced by seq. The method used is determined by our choice of arguments: from, to, by, length.out and along.with. We don’t need to use all of these though; setting 2-3 of these arguments will often work:
- Using the by argument generates a sequence of numbers that increase or decrease by the requested step size:
seq(from = 0, to = 12, by = 2)
## [1] 0 2 4 6 8 10 12
This is fairly self-explanatory. The seq function constructed a numeric vector with elements that started at 0 and ended at 12, with successive elements increasing in steps of 2. Be careful when using seq like this. If the by argument does not lead to a sequence that ends exactly on the value of to then that value won’t appear in the vector. For example:
seq(from = 0, to = 11, by = 2)
## [1] 0 2 4 6 8 10
We can generate a descending sequence by using a negative by value, like this:
seq(from = 12, to = 0, by = -2)
## [1] 12 10 8 6 4 2 0
- Using the length.out argument generates a sequence of numbers where the resulting vector has the length specified by length.out:
seq(from = 0, to = 12, length.out = 6)
## [1] 0.0 2.4 4.8 7.2 9.6 12.0
Using the length.out argument will always produce a sequence that starts and ends exactly on the values specified by from and to (if we need a descending sequence we just have to make sure from is larger than to).
- Using the along.with argument allows us to use another vector to determine the length of the numeric sequence we want to produce:
# make any vector we like
my_vec <- c(1, 3, 7, 2, 4, 2)
# use it to make a sequence of the same length
seq(from = 0, to = 12, along.with = my_vec)
## [1] 0.0 2.4 4.8 7.2 9.6 12.0
It doesn’t matter what the elements of my_vec are. The behaviour of seq is controlled by the length of my_vec when we use along.with.
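Two of the points made above are easy to check with a short sketch: a descending sequence made with length.out, and the claim that only the length of the along.with vector matters. The values and the name letters_vec are arbitrary choices for this illustration; a vector of characters is used to emphasise that its contents are ignored:

```r
# a descending sequence of length 4: from is larger than to
seq(from = 9, to = 0, length.out = 4)
## [1] 9 6 3 0
# a vector of characters, also of length 4
letters_vec <- c("a", "b", "c", "d")
# only its length is used to build the sequence
seq(from = 9, to = 0, along.with = letters_vec)
## [1] 9 6 3 0
```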
It’s conventional to leave out the from and to argument names when using the seq function. For example, we could rewrite the first example above as:
seq(0, 12, by = 2)
## [1] 0 2 4 6 8 10 12
When we leave out the name of the third argument its value is position matched to the by argument:
seq(0, 12, 2)
## [1] 0 2 4 6 8 10 12
However, our advice is to explicitly name the by argument instead of relying on position matching, because this reminds us what we’re doing and will stop us forgetting about the different control arguments.
If we do not specify a value for any of by, length.out or along.with when using seq, the default behaviour is to assume we meant by = 1:
seq(from = 1, to = 12)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
Generating sequences of integers that go up or down in steps of 1 is something we do a lot in R. Because of this, there is a special operator to generate these simple sequences: the colon, :. For example, to produce the last sequence again we use:
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
When we use the : operator the convention is to not leave spaces either side of it.
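The text mentions sequences that go down as well as up. As a small sketch, putting the larger number first produces a descending sequence, and we can also confirm that the colon operator gives the same values as the earlier seq call:

```r
# a descending sequence of integers
5:1
## [1] 5 4 3 2 1
# the colon operator and seq produce the same values here
all(seq(from = 1, to = 12) == 1:12)
## [1] TRUE
```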
4.6.2 Generating repeated sequences of numbers
The rep function is designed to replicate the values inside a vector, i.e. it’s short for “replicate”. The first argument (called x) is the vector we want to replicate. There are two other arguments that control how this is done. The times argument specifies the number of times to replicate the whole vector:
# make a simple sequence of integers
seqvec <- 1:4
seqvec
## [1] 1 2 3 4
# now replicate *the whole vector* 3 times
rep(seqvec, times = 3)
## [1] 1 2 3 4 1 2 3 4 1 2 3 4
All we did here was take a sequence of integers from 1 to 4 and replicate this end-to-end three times, resulting in a length-12 vector. Alternatively, we can use the each argument to replicate each element of a vector while retaining the original order:
# make a simple sequence of integers
seqvec <- 1:4
seqvec
## [1] 1 2 3 4
# now replicate *each element* of the vector 3 times
rep(seqvec, each = 3)
## [1] 1 1 1 2 2 2 3 3 3 4 4 4
This example produced a similar vector to the previous one. It contains the same elements and has the same length, but now the order is different. All the 1’s appear first, then the 2’s, and so on.
Predictably, we can also use both the times and each arguments together if we want to:
seqvec <- 1:4
rep(seqvec, times = 3, each = 2)
## [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
The way to think about how this works is to imagine that we apply rep twice, first with each = 2, then with times = 3 (or vice versa). We can achieve the same thing using nested function calls, though it is much uglier:
seqvec <- 1:4
rep(rep(seqvec, each = 2), times = 3)
## [1] 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4
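Rather than comparing the printed output by eye, we can ask R to confirm that these approaches agree. The sketch below uses the identical function, which reports whether two objects are exactly the same; the names combined, nested and reversed are made up for this illustration:

```r
seqvec <- 1:4
# both arguments in a single call
combined <- rep(seqvec, times = 3, each = 2)
# nested calls: each first, then times
nested <- rep(rep(seqvec, each = 2), times = 3)
# nested calls the other way round: times first, then each
reversed <- rep(rep(seqvec, times = 3), each = 2)
identical(combined, nested)
## [1] TRUE
identical(combined, reversed)
## [1] TRUE
```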
4.7 Vectorised operations
All the simple arithmetic operators (e.g. + and -) and many mathematical functions are vectorised in R. What this means is that when we use a vectorised function it operates on vectors on an element-by-element basis. Take a look at this example to see what we mean by this:
# make a simple sequence
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
# make another simple sequence *of the same length*
y <- seq(0.1, 1, length.out = 10)
y
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
# add them
x + y
## [1] 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11.0
We constructed two length-10 numeric vectors, called x and y, where x is a sequence from 1 to 10 and y is a sequence from 0.1 to 1. When R evaluates the expression x + y it does this by adding the first element of x to the first element of y, the second element of x to the second element of y, and so on, working through all 10 elements of x and y.
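The other arithmetic operators behave in the same way. Here is a short sketch using subtraction and division on two vectors of the same length; the names a and b and their values are arbitrary:

```r
a <- c(2, 4, 6, 8)
b <- c(1, 2, 3, 4)
# subtract element-by-element
a - b
## [1] 1 2 3 4
# divide element-by-element
a / b
## [1] 2 2 2 2
```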
Vectorisation is also implemented in the standard mathematical functions. For example, our friend the round function will round each element of a numeric vector:
# make a simple sequence of non-integer values
my_nums <- seq(0, 1, length.out = 13)
my_nums
## [1] 0.00000000 0.08333333 0.16666667 0.25000000 0.33333333 0.41666667
## [7] 0.50000000 0.58333333 0.66666667 0.75000000 0.83333333 0.91666667
## [13] 1.00000000
# now round the values
round(my_nums, digits = 2)
## [1] 0.00 0.08 0.17 0.25 0.33 0.42 0.50 0.58 0.67 0.75 0.83 0.92 1.00
The same behaviour is seen with other mathematical functions like sin, cos, exp, and log. Each of these will apply the relevant function to each element of a numeric vector.
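As a small sketch of this, we can apply sin to a vector of three angles (in radians) and round the results to tidy up the output. The sqrt function, which calculates square roots, is included as one more example of the same behaviour; the names angles and squares are illustrative:

```r
# three angles, in radians
angles <- c(0, pi / 2, pi)
# sin is applied to each element in turn
round(sin(angles), digits = 2)
## [1] 0 1 0
# sqrt works element-by-element too
squares <- c(1, 4, 9, 16)
sqrt(squares)
## [1] 1 2 3 4
```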
Not all functions are vectorised. For example, the sum function takes a vector of numbers and adds them up:
sum(my_nums)
## [1] 6.5
Although sum obviously works on a numeric vector it is not “vectorised” in the element-by-element sense: it collapses its input to a single value rather than returning an output vector of the same length as its main argument.
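One simple way to see the difference is to compare the length of the output with the length of the input. This sketch rebuilds the my_nums vector from above and checks the length of the result from round and from sum:

```r
my_nums <- seq(0, 1, length.out = 13)
# round returns one value for every element of its input
length(round(my_nums, digits = 2))
## [1] 13
# sum collapses the whole vector to a single value
length(sum(my_nums))
## [1] 1
```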
Many functions apply the vectorisation principle to more than one argument. Take a look at this example to see what we mean by this:
# make a repeated set of non-integer values
my_nums <- rep(2 / 7, times = 6)
my_nums
## [1] 0.2857143 0.2857143 0.2857143 0.2857143 0.2857143 0.2857143
# round each one to a different number of decimal places
round(my_nums, digits = 1:6)
## [1] 0.300000 0.290000 0.286000 0.285700 0.285710 0.285714
We constructed a length 6 vector containing the number 2/7 (~ 0.285714) and then used the round function to round each element to a different number of decimal places. The first element was rounded to 1 decimal place, the second to two decimal places, and so on. We get this behaviour because instead of using a single number for the digits argument, we used a vector that is an integer sequence from 1 to 6.
Vectorisation is not the norm
R’s vectorised behaviour may seem like the “obvious” thing to do, but most computer languages do not work like this. In other languages we typically have to write a much more complicated expression to do something so simple. This is one of the reasons R is such a good data analysis language: vectorisation allows us to express repetitious calculations in a simple, intuitive way. This behaviour can save us a lot of time. However, not every function treats its arguments in a vectorised way, so we always need to check (most easily, by experimenting) whether this behaviour is available before relying on it.
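To get a feel for what the non-vectorised alternative looks like, here is a rough sketch of adding two vectors one element at a time. It uses a for loop and square brackets to pick out individual elements, neither of which has been covered yet, so treat it as an illustration only. The name result is made up for this example:

```r
x <- 1:10
y <- seq(0.1, 1, length.out = 10)
# element-by-element approach: fill in an empty vector one position at a time
result <- numeric(length = 10)
for (i in 1:10) {
  result[i] <- x[i] + y[i]
}
# the vectorised expression x + y gives the same answer in one step
all.equal(result, x + y)
## [1] TRUE
```

The loop needs several lines to do what the vectorised expression x + y does in one.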