APS 135: Introduction to Exploratory Data Analysis with R
Information and overview
This is the online course book for the Introduction to Exploratory Data Analysis with R component of APS 135, a module taught by the Department and Animal and Plant Sciences at the University of Sheffield. You can view this book in any modern desktop browser, as well as on your phone or tablet device. Dylan Childs is running the course this year. Please email him if you spot any problems with the course book.
You will be introduced to the R ecosystem. R is widely used by biologists and environmental scientists to manipulate and clean data, produce high quality figures, and carry out statistical analyses. We will teach you some basic R programming so that you are in a position to address these needs in future if you need to. You don’t have to become an expert programmer to have a successful career in science, but knowing a little bit of programming has almost become a prerequisite for doing biological research in the 21st century.
You will learn how to use R to carry out data manipulation and visualisation. Designing good experiments, collecting data, and analysis are hard, and these activities often take a great deal time and money. If you want to effectively communicate your hard-won results, it is difficult to beat a good figure or diagram; conversely, if you want to be ignored, put everything into a boring table. R is really good at producing figures, so even if you end up just using it as a platform for visualising data, your time hasn’t been wasted.
This book provides a foundation for learning statistics later on. If you want to be a biologist, particularly one involved in research, there is really no way to avoid using statistics. You might be able to dodge it by becoming a theoretician, but if that really is your primary interest you should probably being studying for a mathematics degree. For the rest of us who collect and analyse data knowing about statistics is essential: it allows us to distinguish between real patterns (the “signal”) and chance variation (the “noise”).
The topics we will cover in this book are divided into three sections (‘blocks’):
The Getting Started with R block introduces the R language and the RStudio environment for working with R. Our aim is to run through much of what you need to know to start using R productively. This includes some basic terminology, how to use R packages, and how to access help. We are not trying to turn you into an expert programmer—though you may be surprised to discover that you do enjoy it. However, by the end of this block you will know enough about R to begin learning the practical material that follows.
The Data Wrangling with R block aims to show you how to manipulate your data with R. If you regularly work with data a large amount of time will inevitably be spent getting data into the format you need. The informal name for this is “data wrangling”. This is a topic that is often not taught to undergraduates, which is a shame because mastering the art of data wrangling can save a lot of time in the long run. We’ll learn how to get data into and out of R, makes subsets of important variables, create new variables, summarise your data, and so on.
The Exploratory Data Analysis block is all about using R to help you understand and describe your data. The first step in any analysis after you have managed to wrangle the data into shape almost always involves some kind of visualisation or numerical summary. In this block you will learn how to do this using one of the best plotting systems in R: ggplot2. We will review the different kinds of variables you might have to analyse, discuss the different ways you can describe them, both visually and with numbers, and then learn how to explore relationships between variables.
How to use the book
This book covers all the material you need to get to grips with this year, some of which we will not have time to cover in the practicals. No one is expecting you to memorise everything in the book. It is designed to serve as a resource for you to refer to over the next 2-3 years (and beyond) as needed. However, you should aim to familiarise yourself with the content so that you know where to look for information or examples when needed. Try to understand the important concepts and then worry about the details.
What should you be doing as you read about each topic? There is a lot of R code embedded in the book, most of which you can just copy and paste into R. It’s a good idea to do this as you work through a topic. The best way to learn something is to use it actively, not just read about it. Experimenting with different code snippets by changing them is also a very good way to learn what they do. You can’t really break R and working out why something does or does not work will help you learn how to use it.
Text, instructions, and explanations
Normal text, instructions, explanations etc. are written in the same type as this document, we will tend to use bold for emphasis and italics to highlight specific technical terms when they are first introduced (italics will also crop up with Latin names from time to time, but this is unlikely to produce too much confusion!)
At various points in the text you will come across text in different coloured boxes. These are designed to highlight stand-alone exercises or little pieces of supplementary information that might otherwise break the flow. There are three different kinds of boxes:
This is an action box. We use these when we want to say something important. For example, we might be summarising a key learning outcome or giving you instructions to do something. Do not ignore these boxes.
This is a warning box. These contain a warning or a common “gotcha”. There are a number of common pitfalls that trip up new users of R. These boxes aim to highlight these and show you how to avoid them. Pay attention to these.
This is an information box. These aim to offer a not-too-technical discussion of why something works the way it does. You do not have to understand everything in these boxes to use R but the information will help you understand how it works.
R code and output in this book
We will try to illustrate as many ideas as we can using snippets of real R code. Stand alone snippets will be formatted like this:
tmp <- 1
##  1
At this point it does not matter what the above actually means. You just need to understand how the formatting of R code in this book works. The lines that start with
## show us what R prints to the screen after it evaluates an instruction and does whatever was asked of it, that is, they show the output. The lines that do not start with
## show us the instructions, that is, they show us the input. So remember, the absence of
## shows us what we are asking R to do, otherwise we are looking at something R prints in response to these instructions.
This typeface is used to distinguish R code within a sentence of text: e.g. “We use the
mutate function to change or add new variables.”
A sequence of selections from an RStudio menu is indicated as follows: e.g. File ▶ New File ▶ R Script
File names referred to in general text are given in upper case in the normal typeface: e.g. MYFILE.CSV.
You will learn various ways of finding help about R in this book. If you find yourself stuck at any point these should your first port of call. If you are still struggling, try the following, in this order:
Google is your friend. One of the nice consequences of R’s growing popularity is that the web is now packed full of useful tutorials and tips, many of which are aimed at beginners. One of the objectives of this book is turn you into a self sufficient useR. Learning how to solve your own R-related problems is an essential pre-requisite for this to happen.
If an hour of Googling does not solve a problem, make a note of your question and ask a TA for help in a practical session. The TA’s have been chosen because they have a lot of experience with R. They’re a friendly bunch too and they like talking about R! If they can’t answer your question then the instructor (Dylan) will be able to. He gets bored when nobody asks him questions in practicals…
We encourage you to try options 1 and 2 first. Nonetheless, on occasion Google may turn out not to be your friend and a post to the Facebook page might not elicit a satisfactory response. In these instances you are welcome to email Dylan with your query. You are unlikely to receive an answer at the weekend though.