Chapter 1 A quick introduction to R
1.1 Using R as a big calculator
1.1.1 Basic arithmetic
The end of the Get up and running with R and RStudio chapter demonstrated that R can handle familiar arithmetic operations: addition, subtraction, multiplication, division. To add or subtract a pair of numbers, place the + or - symbol between them and hit Enter. R will read the expression, evaluate it, and print the result to the Console. This works exactly as we expect it to:
3 + 2
## [1] 5
5 - 1
## [1] 4
Multiplication and division are no different, though we don’t use x or ÷ for these operations. Instead, we use * and / to multiply and divide:
7 * 2
## [1] 14
3 / 2
## [1] 1.5
We can also exponentiate numbers, that is, raise one number to the power of another. We use the ^ operator to do this:
4^2
## [1] 16
This raises 4 to the power of 2 (i.e. we squared it). In general, we can raise a number x to the power of y using x^y. Neither x nor y needs to be a whole number.
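As a quick illustration of that last point, here is a minimal sketch using a fractional and a negative exponent. The particular numbers are our own choice for illustration, not taken from the text above:

```r
2^0.5      # a fractional exponent: the square root of 2
## [1] 1.414214
8^(1/3)    # the cube root of 8
## [1] 2
2^-1       # a negative exponent: 1 divided by 2
## [1] 0.5
```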
Arithmetic operations can also be combined into one expression. Assume we want to subtract 6 from \(2^3\). The expression to perform this calculation is:
2^3 - 6
## [1] 2
\(2^3=8\) and \(8-6=2\). Simple enough, but what if we had wanted to carry out a slightly longer calculation that required the last answer to then be divided by 2? This is the wrong way to do it:
2^3 - 6 / 2
## [1] 5
The answer we were looking for is \(1\). So what happened? R evaluated \(6/2\) first and then subtracted this answer from \(2^3\).
If that’s obvious, great. If not, it’s time to learn a bit about the order of precedence used by R. R uses a standard set of rules to decide the order in which arithmetic calculations feed into one another so that it can unambiguously evaluate any expression. It uses the same order as most other programming languages, which thankfully is the same one we all learned in mathematics class at school. The order of precedence used is:
exponents and roots (“taking powers”)
multiplication and division
addition and subtraction
BODMAS and friends
If you find it difficult to remember the order of precedence used by R, there are plenty of mnemonics that can help. Pick one you like and remember that instead.
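The order of precedence can be checked directly at the Console. The expressions below are a small sketch of our own. The last one shows a detail not covered above: a leading minus sign is applied after the power has been taken.

```r
2 + 3 * 4^2   # 4^2 = 16 first, then 3 * 16 = 48, then 2 + 48
## [1] 50
10 - 4 / 2    # 4 / 2 = 2 first, then 10 - 2
## [1] 8
2 * 3^2       # 3^2 = 9 first, then 2 * 9
## [1] 18
-2^2          # 2^2 = 4 first, then the minus sign is applied
## [1] -4
```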
In order to get the answer we were looking for we need to take control of the order of evaluation. We do this by grouping the necessary bits of the calculation inside parentheses (“round brackets”). That is, we place ( and ) either side of them. The order in which expressions inside different pairs of parentheses are evaluated follows the rules we all had to learn at school. The R expression we should have used is therefore:
(2^3 - 6) / 2
## [1] 1
We can use more than one pair of parentheses to control the order of evaluation in more complex calculations. For example, if we want to use the cube root of 2 (i.e. \(2^{1/3}\)) rather than \(2^3\) in that last calculation we would instead write:
(2^(1/3) - 6) / 2
## [1] -2.370039
The parentheses around the 1/3 in the exponent are needed to ensure this is evaluated prior to being used as the exponent.
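To see why those inner parentheses matter, compare the two expressions in this short sketch. Without them, the power is taken first and the result is then divided by 3:

```r
2^1/3     # evaluated as (2^1) / 3
## [1] 0.6666667
2^(1/3)   # the cube root of 2
## [1] 1.259921
```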
1.1.2 Problematic calculations
Now is a good time to highlight how R handles certain kinds of awkward numerical calculations. One of these involves division of a number by 0. Some programming languages will respond to an attempt to do this with an error. R is a bit more forgiving:
1/0
## [1] Inf
Strictly speaking, division by 0 is undefined in ordinary arithmetic, but R follows the usual floating-point convention and treats a non-zero finite number divided by 0 as (positive or negative) infinity. R has a special built-in data value that allows it to handle this kind of thing. This is Inf, which of course stands for “infinity”. The other special kind of value we sometimes run into can be generated by numerical calculations that don’t have a well-defined result. For example, it arises when we try to divide 0 or infinity by themselves:
0/0
## [1] NaN
The NaN in this result stands for Not a Number. R produces NaN because \(0/0\) is not defined mathematically: it produces something that is Not a Number. The reason we are pointing out Inf and NaN is not because we expect to use them. It’s important to know what they represent because they often arise as a result of a mistake somewhere in a program. It’s hard to track down such mistakes if we don’t know how Inf and NaN arise.
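A few more cases may help when tracking down where an Inf or NaN came from. This is a sketch only. The functions is.finite and is.nan are built into R but have not been introduced in the text, so treat them as a preview:

```r
-1/0            # dividing a negative number by 0 gives negative infinity
## [1] -Inf
Inf - Inf       # no well-defined result
## [1] NaN
Inf / Inf       # dividing infinity by itself
## [1] NaN
is.finite(1/0)  # is the result an ordinary finite number?
## [1] FALSE
is.nan(0/0)     # is the result Not a Number?
## [1] TRUE
```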
That is enough about using R as a calculator for now. What we’ve seen, even though we haven’t said it yet, is that R functions as a REPL: a read-eval-print loop (there’s no need to remember this term). R takes user input, evaluates it, prints the results, and then waits for the next input. This is handy, because it means we can use it interactively, working through an analysis line-by-line. However, to use R to solve complex problems we need to learn how to store and reuse results. We’ll look at this in the next section.
Working efficiently at the Console
Working at the Console soon gets tedious if we have to retype similar things over and over again. There is no need to do this though. Place the cursor at the prompt and hit the up arrow. What happens? This brings back the last expression sent to R’s interpreter. Hit the up arrow again to see the last-but-one expression, and so on. We go back down the list using the down arrow. Once we’re at the line we need, we use the left and right arrows to move around the expression and the delete key to remove the parts we want to change. Once an expression has been edited like this we hit Enter to send it to R again. Try it!
1.2 Storing and reusing results
So far we’ve not tried to do anything remotely complicated or interesting, though we now know how to construct longer calculations using parentheses to control the order of evaluation. This approach is fine if the calculation is very simple. It quickly becomes unwieldy for dealing with anything more. The best way to see what we mean is by working through a simple example: solving a quadratic equation. Quadratic equations look like this: \(ax^2 + bx + c = 0\). If we know the values of \(a\), \(b\) and \(c\) then we can solve this equation to find the values of \(x\) that ensure the left hand side equals the right hand side. Here’s the well-known formula for these solutions:
\[ x = \frac{-b\pm\sqrt{b^2-4ac}}{2a} \]
Let’s use R to calculate these solutions for us. Say that we want to find the solutions to the quadratic equation when \(a=1\), \(b=6\) and \(c=5\). We just have to turn the above equation into a pair of R expressions:
(-6 + (6^2 -4 * 1 * 5)^(1/2)) / (2 * 1)
## [1] -1
(-6 - (6^2 -4 * 1 * 5)^(1/2)) / (2 * 1)
## [1] -5
The output tells us that the two values of \(x\) that satisfy this particular quadratic equation are -1 and -5. What should we do if we now need to solve a different quadratic equation? Working at the Console, we could bring up the expressions we typed (using the up arrow) and then go through each of these, changing the numbers to match the new values of \(a\), \(b\) and \(c\). Editing individual expressions like this is fairly tedious, and more importantly, it’s fairly error prone because we have to make sure we substitute the new numbers at exactly the right positions.
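One way to check these solutions is to substitute each back into the left hand side of the equation and confirm that the result is 0. The sketch below treats \(a\) as the coefficient of \(x^2\), as the formula does, so the left hand side is \(x^2 + 6x + 5\). Note the parentheses around the negative numbers:

```r
1 * (-1)^2 + 6 * (-1) + 5   # substitute x = -1: 1 - 6 + 5
## [1] 0
1 * (-5)^2 + 6 * (-5) + 5   # substitute x = -5: 25 - 30 + 5
## [1] 0
```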
A partial solution to this problem is to store the values of \(a\), \(b\) and \(c\). We’ll see precisely why this is useful in a moment. First, we need to learn how to store results in R. The key to this is to use the assignment operator, written as a left arrow <-. Sticking with our original example, we need to store the numbers 1, 6 and 5. We do this using three expressions, one after another:
a <- 1
b <- 6
c <- 5
Notice that we don’t put a space between < and -. If we add one, R no longer sees an assignment: it reads < and - separately, as a comparison with a negative number. R didn’t print anything to screen, so what actually happened? We asked R to first evaluate the expression on the right hand side of each <- (just a number in this case) and then assign the result of that evaluation instead of printing it. Each result has a name associated with it, which appears on the left hand side of the <-.
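The effect of a stray space inside the assignment operator can be seen in this minimal sketch. With the space, R compares a with -1 and prints the answer to that question instead of storing anything:

```r
a <- 1     # assignment: nothing is printed
a < - 1    # with a space this asks "is a less than -1?"
## [1] FALSE
a          # the value associated with a is unchanged
## [1] 1
```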
RStudio shortcut
We use the assignment operator <- all the time when working with R, and because it’s inefficient to have to type the < and - characters over and over again, RStudio has a built-in shortcut for typing the assignment operator: Alt + -. Try it. Move the cursor to the Console, hold down the Alt key (‘Option’ on a Mac), and press the - key. RStudio will automatically insert <-.
The net result of all this is that we have stored the numbers 1, 6 and 5 somewhere in R, associating them with the letters a, b and c, respectively. What does this mean? Here’s what happens if we type the letter a into the Console and hit Enter:
a
## [1] 1
It looks the same as if we had typed the number 1 directly into the Console. The result of typing b or c is hopefully obvious. What we just did was to store the output that results from evaluating three separate R expressions, associating each with a name so that we can access them again.[3]
Whenever we use the assignment operator <- we are telling R to keep whatever kind of value results from the calculation on the right hand side of <-, giving it the name on the left hand side so that we can access it later. Why is this useful? Let’s imagine we want to do more than one thing with our three numbers. If we want to know their sum or their product we can now use:
a + b + c
## [1] 12
a * b * c
## [1] 30
So once we’ve stored a result and associated it with a name we can reuse it wherever it’s needed. Returning to our motivating example, we can now calculate the solutions to the quadratic equation by typing these two expressions into the Console:
(-b + (b^2 -4 * a * c)^(1/2)) / (2 * a)
## [1] -1
(-b - (b^2 -4 * a * c)^(1/2)) / (2 * a)
## [1] -5
Imagine we’d like to find the solutions to a different quadratic equation where \(a=1\), \(b=5\) and \(c=5\). We just changed the value of \(b\) here to keep things simple. To find our new solutions we have to do two things. First we change the value of the number associated with b…
b <- 5
…then we bring up those lines that calculate the solutions to the quadratic equation and run them, one after the other:
(-b + (b^2 -4 * a * c)^(1/2)) / (2 * a)
## [1] -1.381966
(-b - (b^2 -4 * a * c)^(1/2)) / (2 * a)
## [1] -3.618034
We didn’t have to retype those two expressions. We could just use the up arrow to bring each one back to the prompt and hit Enter. This is much simpler than editing the expressions. More importantly, we are beginning to see the benefits of using something like R: we can break down complex calculations into a series of steps, storing and reusing intermediate results as required.
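The idea of storing intermediate results can be taken one step further. In the sketch below the part of the formula under the square root is calculated once and reused. The names discriminant, root_plus and root_minus are our own, chosen for illustration, and sqrt is R's built-in square root function, which has not been introduced in the text:

```r
a <- 1
b <- 5
c <- 5
# Store the part of the formula under the square root once
discriminant <- b^2 - 4 * a * c
# Reuse it for both solutions
root_plus  <- (-b + sqrt(discriminant)) / (2 * a)
root_minus <- (-b - sqrt(discriminant)) / (2 * a)
root_plus
## [1] -1.381966
root_minus
## [1] -3.618034
```

Each expression is now shorter, and the shared calculation only has to be typed correctly once.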
1.3 How does assignment work?
It’s important to understand, at least roughly, how assignment works. The first thing to note is that when we use the assignment operator <- to associate names and values, we informally refer to this as creating (or modifying) a variable. This is much less tedious than using words like “bind”, “associate”, “value”, and “name” all the time. Why is it called a variable? What happens when we run these lines:
myvar <- 1
myvar <- 7
The first time we used <- with myvar on the left hand side we created a variable myvar associated with the value 1. The second line myvar <- 7 modified the value of myvar to be 7. This is why we refer to myvar as a variable: we can change its value as we please. What happened to the old value associated with myvar? In short, it is gone, kaput, lost… forever. The moment we assign a new value to myvar the old one is destroyed and can no longer be accessed. Remember this.
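Because the right hand side is evaluated before the assignment takes place, a variable can appear on both sides of <-. The following sketch, which is our own example, uses the old value of myvar to calculate its replacement:

```r
myvar <- 1
myvar <- myvar + 6   # the right hand side uses the old value, 1
myvar                # the old value has been replaced
## [1] 7
```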
Keep in mind that the expression on the right hand side of <- can be any kind of calculation, not just a number. For example, if we want to store the number 1, associating it with answer, we could do this:
answer <- (1 + 2^3) / (2 + 7)
That is a strange way to assign the number 1, but it illustrates the point. More generally, as long as the expression on the right hand side generates an output it can be used with the assignment operator. For example, we can create new variables from old variables:
newvar <- 2 * answer
What happened here? Start at the right hand side of <-. The expression on this side contained the variable answer so R went to see if answer actually exists in the global environment. It does, so it then substituted the value associated with answer into the requested calculation, and then assigned the resulting value of 2 to newvar. We created a new variable newvar using information associated with answer.
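The lookup step can be made visible with the built-in exists function, which has not been introduced in the text and is used here only as a sketch. The name not_yet_defined is hypothetical and assumed never to have been assigned:

```r
answer <- (1 + 2^3) / (2 + 7)
exists("answer")            # R can find a value for this name
## [1] TRUE
exists("not_yet_defined")   # no value has been associated with this name
## [1] FALSE
newvar <- 2 * answer        # works because answer exists
newvar
## [1] 2
```

Using a name that does not exist in a calculation stops R with an error saying that the object was not found.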
Now look at what happens if we just copy a variable using the assignment operator:
myvar <- 7
mycopy <- myvar
At this point we have two variables, myvar and mycopy, each associated with the number 7. There is something very important going on here: each of these is associated with a different copy of this number. If we change the value associated with one of these variables it does not change the value of the other, as this shows:
myvar <- 10
myvar
## [1] 10
mycopy
## [1] 7
R always behaves like this unless we work hard to alter this behaviour (we never do this in this book). So remember, every time we assign one variable to another, we actually make a completely new, independent copy of its associated value. For our purposes this is a good thing because it makes it much easier to understand what a long sequence of R expressions will do. That probably doesn’t seem like an obvious or important point, but trust us, it is.
1.4 Global environment
Whenever we associate a name with a value we create a copy of both these things somewhere in the computer’s memory. In R the “somewhere” is called an environment. We aren’t going to get into a discussion of R’s many different kinds of environments—that’s an advanced topic well beyond the scope of this book. The one environment we do need to be aware of though is the Global Environment.
Whenever we perform an assignment in the Console the name-value pair we create (i.e. the variable) is placed into the Global Environment. The current set of variables are all listed in the Environment tab in RStudio. Take a look. Assuming that at least one variable has been made, there will be two columns in the Environment tab. The first shows us the names of all the variables, while the second summarises their values.
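The contents of the Global Environment can also be inspected from the Console. The sketch below relies on two built-in functions that have not been introduced in the text, ls (list the names) and rm (remove a variable), together with the %in% operator to ask whether a name appears in the list:

```r
a <- 1
b <- 6
ls()            # prints the names of the variables created so far
rm(b)           # remove the variable b
"a" %in% ls()   # a is still there
## [1] TRUE
"b" %in% ls()   # b has gone
## [1] FALSE
```

What ls() prints depends on what has been created in the current session, so its output is not shown here.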
The Global Environment is temporary
By default, R will save the Global Environment whenever we close it down and then restore it in the next R session. It does this by writing a copy of the Global Environment to disk. In theory this means we can close down R, reopen it, and pick things up from where we left off. Don’t do this. It only increases the risk of making a serious mistake. Assume that when R and RStudio are shut down, everything in the Global Environment will be lost.
1.5 Naming rules and conventions
We don’t have to use a single letter to name things in R. The words tom, dick and harry could be used in place of a, b and c. It might be confusing to use them, but tom, dick and harry are all legal names as far as R is concerned:
A legal name in R is any sequence of letters, numbers, . (dot), or _ (underscore), but the sequence of characters we use must begin with a letter. Both upper and lower case letters are allowed. For example, num_1, num.1, num1, NUM1, myNum1 are all legal names, but 1num and _num1 are not because they begin with 1 and _.
R is case sensitive: it treats upper and lower case letters as different characters. This means that num and Num are treated as distinct names. Forgetting about case sensitivity is a good way to create errors when using R. Try to remember that.
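Both rules can be tried out directly. The first half of this sketch shows that num and Num are separate variables. The second half uses the built-in functions c and make.names, neither of which has been introduced in the text: make.names rewrites a name until it is legal, so a name that comes back unchanged was legal to begin with. The variable candidates is our own, for illustration:

```r
num <- 1
Num <- 2
num   # lower case n
## [1] 1
Num   # upper case N: a different variable
## [1] 2
# Check some candidate names against R's naming rules
candidates <- c("num_1", "num.1", "myNum1", "1num", "_num1")
# TRUE for the first three (legal), FALSE for the last two (illegal)
make.names(candidates) == candidates
```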
Don’t begin a name with .
We are allowed to begin a name with a . (dot), but this usually is A Bad Idea. Why? Because variables whose names begin with . are hidden from view in the Global Environment: the values they refer to exist but are invisible. This behaviour exists to allow R to create invisible variables that control how it behaves. This is useful, but it isn’t really meant to be used by the average user.
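The hiding behaviour can be demonstrated with the built-in ls function and its all.names argument, neither of which has been introduced in the text. The names .hidden_value and visible_value in this sketch are hypothetical:

```r
.hidden_value <- 42
visible_value <- 1
".hidden_value" %in% ls()                   # not listed by default
## [1] FALSE
".hidden_value" %in% ls(all.names = TRUE)   # listed when hidden names are requested
## [1] TRUE
.hidden_value                               # the value is still there
## [1] 42
```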
[3] Technically, this is called binding the name to a value. You really don’t need to remember this though.