Chapter 17 Exploratory data analysis
Exploratory data analysis (EDA) was promoted by the statistician John Tukey in his 1977 book, “Exploratory Data Analysis”. The broad aim of EDA is to help us formulate and refine hypotheses that will lead to informative analyses or further data collection. The core objectives of EDA are:
to suggest hypotheses about the causes of observed phenomena,
to guide the selection of appropriate statistical tools and techniques,
to assess the assumptions on which statistical analysis will be based,
to provide a foundation for further data collection.
EDA involves a mix of both numerical and visual methods of analysis. Statistical methods are sometimes used to supplement EDA, but its main purpose is to facilitate understanding before diving into formal statistical modelling.
Even if we think we already know what kind of analysis we need to pursue, it’s always a good idea to explore a data set before diving into the analysis. At the very least, this will help us to determine whether or not our plans are sensible. Very often it uncovers new patterns and insights. In this chapter we’re going to examine some basic concepts that underpin EDA: 1) classifying different types of data, and 2) distinguishing between populations and samples. This will set us up to learn how to explore our data in later chapters.
17.2 Statistical variables and data
In the Data frames chapter we pointed out that the word “variable” is typically used to mean one of two things. In the context of programming, a variable is a name-value association that we create when we run some code. Statisticians use the word in a quite different way. To them, a variable is any characteristic or quantity that can be measured, classified or experimentally controlled. Much of statistics is about quantifying and explaining the variation in such quantities as best we can.
Biomass, species richness, population density, infection status, body size, and individual fitness are all examples of statistical variables we encounter in the organismal sciences. We can think of these as statistical variables because their values vary between different observations. For example, ‘annual fitness’—measured as the number of offspring produced—is a variable that differs both among the organisms in a population and over the life cycle of a given individual.
There are different ways to describe statistical variables according to the manner in which they can be analysed, measured, or presented. While not the most exciting topic, it’s important to be clear about what kind of variables we’re dealing with because this determines how we should visualise the data, and later, how we might analyse it statistically. Statisticians have thought long and hard about how we should classify variables. The distinctions can be very subtle, but we’ll only consider two fairly simple classification schemes.
17.2.1 Numeric vs. categorical variables
Numeric variables have values that describe a measurable quantity as a number, like ‘how many’ or ‘how much’. Numeric variables are also called quantitative variables; the data collected containing numeric variables are called quantitative data. Numeric variables may be further described as either continuous or discrete:
Continuous numeric variable: Observations can take any value between a certain set of real numbers, i.e. numbers represented with decimals. This set is typically either “every possible number” (e.g. the change in population density can be positive or negative, and very large or very small) or “all the positive numbers” (e.g. biomass may be very large or very small, but it is strictly positive). Examples of continuous variables include body mass, age, time, and temperature. Though in theory continuous variables may admit any number in the set of possible numbers, in practice the values given to an observation may be bounded and can only include values as small as the measurement protocol allows. Elephants are very large, but they never get as big as a passenger jet, and trying to measure the mass of an elephant at a precision of a few grams is probably not practical.
Discrete numeric variable: Observations can take a value based on a count from a set of whole values; e.g. 1, 2, 3, 4, 5, and so on. A discrete variable cannot take the value of a fraction between one value and the next closest value. Examples of discrete variables include the number of individuals in a population, number of offspring produced (‘reproductive fitness’), and number of infected individuals in an experiment. All of these are measured as whole units. Discrete variables are very common in the biological and environmental sciences. We love to count things…
Categorical variables have values that describe a characteristic of a data unit, like ‘what type’ or ‘which category’. Categorical variables fall into mutually exclusive (in one category or in another) and exhaustive (include all possible options) categories. Therefore, categorical variables are qualitative variables and tend to be represented by a non-numeric value. The data collected for a categorical variable are qualitative data. Categorical variables may be further described as ordinal or nominal:
Ordinal variable: Observations can take a value that can be logically ordered or ranked. The categories associated with ordinal variables can be ranked higher or lower than another, but do not necessarily establish a numeric difference between each category. Examples of ordinal categorical variables include academic grades (i.e. A, B, C), size class of a plant (i.e. small, medium, large) and behaviour.
Nominal variable: Observations can take a value that is not able to be organised in a logical sequence. Examples of nominal categorical variables include sex, business type, eye colour, religion and brand.
Don’t use numbers to classify categorical variables
Be careful when classifying variables. It is dangerous to assume that just because a numerical scheme has been used to describe it, a variable it must not be categorical – there is nothing to stop someone using numbers to describe a categorical variable (e.g. Male = 1, Female = 2, Hermaphrodite = 3). We can use numbers to describe categories, but it isn’t very sensible. It’s much clearer to use a non-numeric recording scheme (e.g. Male = “M”, Female = “F”, Hermaphrodite = “H”) to record categorical variables.
17.2.2 Ratio vs. interval scales
A second way of classifying numeric variables (not categorical variables) relates to the scale they are measured on. The measurement scale is important because it determines how things like differences, ratios, and variability are interpreted.
Interval scale: This allows for the degree of difference between data items, but not the ratio between them. This kind of scale does not have unique and non-arbitrary zero value. However, we can compare the ratio of differences on an interval scale though. A good example of an interval scale is date, which we measure relative to an arbitrary epoch (e.g. AD). We cannot really say that 2000 AD is twice as long as 1000 AD. We can talk about the amount of time that has passed between two dates though, i.e. it does make sense to say that twice as much time has passed since the epoch in 2000 AD versus 1000 AD.
Ratio scale: This scale does possess a meaningful zero value. It takes its name from the fact that a measurement on this scale represents a ratio between a measure of the magnitude of a quantity and a unit of the same kind. What this means in simple terms is that it is meaningful to say that something is “twice as …” as something else when working with a variable measured on a ratio scales. Ratio scales most often appear when we work with physical quantities. For example, we can say that one tree is twice as big as another, or that one elephant has twice the mass of another, because length and mass are measured on ratio scales.
Keep in mind that the distinction between ratio and interval scales is a property of the scale of measurement, not the thing being measured. For example, when we measure temperature in º C we’re working on an interval scale defined relative to the freezing and boiling temperatures of water under standard conditions. It doesn’t make any sense to say that 30º C is twice as hot as 15º C. However, if we measured the same two temperatures on the Kelvin scale, it is meaningful to say that 303.2K is 1.05 times hotter than 288.2K. This is because the Kelvin scale is relative to a true zero: absolute zero.
17.3 Populations, samples and distributions
When we collect data of any kind, we are working with a sample of objects (e.g. trees, insects, fields) from a wider population. We usually want to know something about the wider population, but since it’s impossible to study every member of the population, we study the properties of one or more samples instead.
The problem with samples is that they are ‘noisy’. If we were repeat the same data collection protocol more than once we should expect to end up with a different sample each time, even if the wider population never changes. This results purely from chance variation in the sampling of different units. Picking apart the relationship between samples and populations is the basis of much of statistics. This topic is best dealt with in a dedicated statistics book, so we won’t develop these ideas in much detail here.
The reason we mention the distinction between a population and a sample is because EDA is primarily concerned with properties of samples—it aims to characterise the sample in hand without trying to say too much about the wider population from which it is derived.
When we talk about “exploring a variable” what we are really doing is exploring is the sample distribution of that variable. What is this? The sample distribution is a statement about the frequency with which different values occur in a particular sample. Imagine we took a sample of undergraduates and measured their height. The majority of students would be round about 1.7m tall, even though there would obviously be some variation among students. Men would tend to be slightly taller than women, and very small or very tall people would be rare. We know from experience that no one in this sample would be over 3 meters tall. These are all statements about a (hypothetical) sample distribution of undergraduate heights.
Our goal when exploring the sample distribution of a variable is to answer questions such as, What are the most common values of the variable; and How much do observations differ from one another? Rather than simply describing these properties in verbal terms, as we did above, we want to describe in a more informative way. There are two ways to go about this:
Calculate descriptive statistics. Descriptive statistics are used to quantify the basic features of a sample distribution. They provide simple summaries about the sample that can be used to make comparisons and draw preliminary conclusions. For example, we often use ‘the mean’ to summarise the ‘most likely’ values of a variable in a sample.
Construct graphical summaries. Descriptive statistics are not much use on their own, for the simple reason that a few numbers can’t capture every aspect of a sample distribution. Graphical summaries are an ideal complement to descriptive statistics because they allow us to present lots of information about a sample distribution in one place and in a manner that is easy for people to understand.
So far we’ve been thinking about samples of one statistical variable. However, a sample may involve more than one variable. Moreover, data analysis is usually concerned with the relationships among two or more variables. These relationships might involve the same (e.g. numeric vs. numeric) or different types of variable (e.g. numeric vs. categorical). In either case, EDA is used to understand how the values of one variable depend on those of the other. Just as with single variable analyses, we use both descriptive statistics and graphical summaries to explore such relationships.