APS 135: Introduction to Exploratory Data Analysis with R
Dylan Z. Childs
Information and overview
This is the online course book for the Introduction to Exploratory Data Analysis with R component of (APS 135) module. You can view this book in any modern desktop browser, as well as on your phone or tablet device. The site is self-contained—it contains all the material you are expected to learn this year. Bethan Hindle is running the course this year. Please email her if you spot any problems.
You will be introduced to the R ecosystem. R is now very widely used by biologists and environmental scientists to access data, carry out interactive data analysis, build mathematical models and produce high quality figures. We will teach you a little basic R programming so that you are in a position to address these needs in future if you need to. You don’t have to become an expert programmer to have a successful career in science, but knowing a little bit about programming has become (almost) a prerequisite for doing biological research in the 21st century.
You will learn how to begin using R to carry out data manipulation and visualisation. Designing good experiments, collecting data, and analysis are hard, and these activities often takes a great deal time and money. If you want to effectively communicate your hard-won, latest, greatest results, it is difficult to beat a good figure or diagram (conversely, if you want to be ignored, put everything into a boring table). R is really good at producing figures, so even if you end up just using it as a platform for visualising data, your time hasn’t been wasted.
This book provides a foundation for learning statistics later on. If you want to be a biologist, particularly one involved in research, there is really no way to avoid using statistics. You might be able to dodge it by becoming a theoretician, but if that really is your primary interest you should probably being studying for a mathematics degree. For the rest of us who collect data, or at least analyse other people’s data, knowing about statistics is essential: it allows us to distinguish between real patterns (the “signal”) and chance variation (the “noise”).
The topics we will cover in this book are divided into three sections:
The Getting Started with R block introduces the R language and the RStudio environment for working with R. We aim to run through much of what you need to know to start using R to improve your productivity. This includes some basic terminology, how to use R packages, and how to access help. As noted earlier, we are not trying to turn you into an expert programmer. That takes too long (not everyone enjoys programming, though many of you may be surprised to discover that you do in fact like it). By the end of this block you will know enough about R to begin learning the more practical material that follows.
The Data Wrangling with R block aims to show you how to manipulate your data with R. The truth is that if you regularly work with data, a large amount of time will inevitably be spent getting data into the format you need. The informal name for this is “data wrangling”. This is a topic that is not often taught well to undergraduates, which is a shame, because mastering the art of data wrangling saves you a lot of time in the long run. This block will briefly cover two R packages to help you do this: dplyr and tidyr. We’ll learn how to get data into and out of R, makes subsets of important variables, create new variables, summarise your data, and so on.
The Exploratory Data Analysis block is all about using R to help you understand and describe your data. The first step in any analysis after you have managed to wrangle the data into shape almost always involves some kind of visualisation and/or numerical summary (or at least that should be the next step if you are serious about getting your analysis right). In this block you will learn how to do this using one of the best plotting systems in R: ggplot2. We will review the different kinds of variables you might have to analyse, discuss the different ways you can describe them, both visually and with numbers, and learn how to explore relationships between variables.
How to use the book
This book covers all the material you need to get to grips with this year, some of which we will not have time to cover in the practicals. No one is expecting you to memorise everything in the book. It is designed to serve as a resource for you to refer to over the next 2-3 years (and beyond) as needed. However, you should aim to familiarise yourself with the content so that you know where to look for information or examples when needed. Try to understand the important concepts and then worry about the specific details.
What should you be doing as you read about each topic? There is a lot of R code embedded in the book, most of which you can just copy and paste into RStudio and then run. You are strongly encouraged to do this when you first work through a topic. The best way to learn something like R is to use it actively, not just read about it. Experimenting with different code snippets by changing them is also a very good way to learn what they do. You can’t really break R (well you can, but it is quite hard), and working out why something does or does not work will help you learn to use it.
Text, instructions, and explanations
Normal text, instructions, explanations etc. are written in the same type as this document, we will tend to use bold for emphasis and italics to highlight specific technical terms when they are first introduced (italics will also crop up with Latin names from time to time, but this is unlikely to produce too much confusion!)
At various points in the text you will come across text in different coloured boxes. These are designed to highlight stand-alone exercises or little pieces of supplementary information that might otherwise break the flow. There are three different kinds of boxes:
This is an action box. We use these when we want to say something important. For example, we might be summarising a key learning outcome or giving you instructions to do something.
This is a warning box. These contain a warning or a common “gotcha”. There are a number of common pitfalls that trip up new users of R. These boxes aim to highlight these and show you how to avoid them. It’s a good idea to pay attention to these.
This is an information box. These aim to offer a not-too-technical discussion of how or why something works the way it does. You do not have to understand everything in these boxes to use R, but the information will help you understand how it works.
R code and output in this book
We will try to illustrate as many ideas as we can using snippets of real R code. Stand alone snippets will be formatted like this:
tmp <- 1 print(tmp)
##  1
At this point it does not matter what the above actually means. You just need to understand how the formatting of R code in this book works. The lines that start with
## show us what R prints to the screen after it evaluates an instruction and does whatever was asked of it, that is, they show the output. The lines that do not start with
## show us the instructions, that is, they show us the input. So remember, the absence of
## shows us what we are asking R to do, otherwise we are looking at something R prints in response to these instructions.
This typeface is used to distinguish R code within a sentence of text: e.g. “We use the
mutate function to change or add new variables.”
A sequence of selections from an RStudio menu is indicated as follows: e.g. File ▶ New File ▶ R Script
File names referred to in general text are given in upper case in the normal typeface: e.g. MYFILE.CSV.
You will learn various ways of finding help about R in this book. If you find yourself stuck at any point these should your first port of call. If you are still struggling, try the following, in this order:
Google is your friend. One of the nice consequences of R’s growing popularity and the rise of blogging – take a look at R Bloggers for a flavour of R-specific blogs – is that the web is now packed full of useful tutorials and tips, many of which are aimed at beginners. One of the objectives of this book is turn you into a self sufficient useR. Learning how to solve your own R-related problems is an essential pre-requisite for this to happen. Solving your own problems will also help you learn how to use R more effectively.
If an hour of Googling does not solve a problem, post a question on the APS 135 Facebook page. If you find something difficult, the chances are that someone else finds it difficult too. You are strongly encouraged to try to address one anothers’ problems via this page. Thinking through and explaining the answer to a question someone else has posed is a really good way of learning. Dylan will check the Facebook page from time to time, and will offer a solution if no-one else has suggested one. We would much prefer you to help each other though.
We encourage you to try options 1 and 2 first. Nonetheless, on occasion Google may turn out not to be your friend and a post to the Facebook page might not elicit a satisfactory response. In these instances you are welcome to email Bethan with your query. You are unlikely to receive an answer at the weekend though.
We may occasionally decide to update the book in light of the results of comments and questions we receive from you. This is another reason why it is important for you to ask or post questions—it allows us to see where people are struggling. It is also a motivation for choosing to use a website rather than a static document—we can very easily adapt or extend the content to address problems as they arise. If we do update the book, we will let you know what has changed.