Chapter 19 Principles of experimental design

Hiawatha to convince them,
Organised a shooting contest.
Laid out in the proper manner
Of designs experimental
Recommended in the textbooks
Mainly used for tasting tea
(but sometimes used in other cases)
Still they couldn’t understand it,
so they couldn’t raise objections.
(Which is what so often happens
with analysis of variance.)

Maurice Kendall (after Longfellow)
from Hiawatha Designs an Experiment

19.1 Introduction

The data we use to test hypotheses may be generated by recording information from natural systems (‘observational studies’) or by carrying out some sort of experiment in which the system under study is manipulated in some way (‘experimental studies’). There is often considerable scope for deliberately arranging the system to generate data in the best way to test a particular effect when conducting experiments. For this reason we tend to use the term ‘design’ primarily in the context of experiments. However, collection of data in both situations requires thought and planning, and many of the considerations of what is termed experimental design apply equally to observational and experimental studies13.

The underlying principle of experimental design is: to extract data from a system in such a way that differences or variation in the data can be unambiguously attributed to the particular process we are investigating.

In order to do this we need to know how to maximise the statistical power of an experiment or data collection protocol. Statistical power is the probability that a study will detect an effect when there really is an effect present. In statistics, the word ‘effect’ is an umbrella term for anything measurable we care about, like differences between groups or associations between variables. Broadly speaking, statistical power is influenced by: (1) the size of the effect and (2) the size of the sample used to detect it. Bigger effects are easier to detect than smaller effects, and large samples give a more sensitive test than small samples. A third consideration is variability: the less variable the material we are using, the smaller the effects we will be able to detect.

Given these facts, there are obviously two things to do when designing an experiment:

  1. Use the maximum feasible sample sizes.

  2. Take steps to minimise the variability in the data14.

Exactly what combination of these is appropriate will depend on the subject area. In a physiological experiment using complex apparatus and monitoring equipment the scope for replication may be very limited. Here maximum effort should be put into experimentally controlling extraneous sources of variation. With the subject material this may mean using animals of the same age, reared under the same conditions, of the same stock; it may involve using clones of plant material. It will involve running the experiment under controlled conditions of light and temperature, and will require that the measurement methods are as precise as possible. On the other hand an ecologist studying an organism in the field may have relatively little scope for experimental control of either the material studied or the environmental conditions, and may be forced to make relatively crude measurements. In this case the best approach is to control what can be controlled and then try to maximise the sample size.
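The way sample size and variability trade off against each other can be made concrete with a rough power calculation. The sketch below is only an illustration: it assumes a two-sided comparison of two group means with a known, common standard deviation (a normal approximation, so it will be a little optimistic for small samples). The function name approx_power and all the numbers fed to it are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    Assumes both groups have n_per_group units and a known common sd.
    """
    z = NormalDist()                      # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)     # critical value for the chosen alpha
    se = sd * sqrt(2 / n_per_group)       # standard error of the difference
    shift = effect / se                   # true effect in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical numbers: a true difference of 1 unit between two groups.
for n in (5, 10, 20, 40):
    for sd in (1.0, 2.0):
        print(f"n = {n:2d}, sd = {sd}: power ~ {approx_power(1.0, sd, n):.2f}")
```

Under these assumptions, power rises as the number of units per group increases and falls as the standard deviation increases, which is why both replication and control of variability matter.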

19.2 Jargon busting

Before we delve any further into experimental design concepts we need to introduce a little bit of statistical jargon. We’ll define the terms and then run through an example to better understand them:

  • An experimental unit is the physical entity which can be assigned to a treatment (see next definition). Examples of possible experimental units are individual clones or organisms.

  • A treatment is any kind of manipulation applied to experimental units. A group of experimental units that all receive the same treatment is called a treatment group.

  • Most experiments include one or more complementary groups, called control groups. The experimental units in a control group receive either no treatment or some kind of standard treatment.

  • An experimental factor is a collection of related treatments and controls, and the different treatments/controls are called the levels of that factor.

Here’s an example. Suppose we wanted to compare the weight gain of cattle on 4 different dietary supplements to determine which is the most effective. We conduct an experiment in which groups of eight cows are given a particular supplement for one month. A fifth group serves as the control group: these cows do not receive any supplement. At the end of the experiment we measure how much weight each cow has gained over the month. In this example individual cows are the experimental units, dietary supplements are the treatments, and the ‘no supplement’ group is the control group. Together, the four supplement types and the ‘no supplement’ control constitute the five levels of the ‘dietary supplement’ factor.

Finally, a word of warning—it is common to lump control groups and treatment groups together and just call them ‘treatments.’ This is fine, but be aware of the distinction between the two.
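To see how the jargon maps onto a data set, here is a minimal sketch of the cattle experiment laid out as one record per experimental unit. The supplement names and cow labels are placeholders (the example does not name them), and the layout is just one reasonable way to organise such a design.

```python
# Levels of the 'dietary supplement' factor: four treatments plus one control.
# The supplement names are placeholders, not taken from the example.
levels = ["supplement 1", "supplement 2", "supplement 3", "supplement 4",
          "no supplement (control)"]
cows_per_group = 8

# One record per experimental unit (cow), tagged with its factor level.
design = [
    {"cow": f"{level} / cow {i}", "level": level,
     "is_control": level.startswith("no supplement")}
    for level in levels
    for i in range(1, cows_per_group + 1)
]

n_units = len(design)
n_control = sum(unit["is_control"] for unit in design)
print(n_units, "experimental units;", n_control, "in the control group")
```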

19.3 Replication

We cannot do statistics without understanding the idea of replication: the process of assigning several experimental units to the same treatment or combination of treatments. Why does replication matter? Replication affects the power of a statistical test. By increasing the replication in a study we increase the sample size available to detect specific effects. Replication is fundamental to many of the statistical methods we use, and is particularly important in biology because the material we work with is often inherently variable and hard to make precise measurements on. It seems like a simple idea: increased replication = more statistical power. We have to be very careful about how we replicate though…

19.3.1 Independence and pseudoreplication

An assumption of most statistical tests is that the data are independent. Independence means that the value of a measurement from one object is not affected by the values of other objects. Common sources of non-independence in biology include:

  • genetics - e.g. if a set of mice are taken from the litter of a single female, they are more likely to be similar to each other than mice taken from the litters of several different females.

  • geography - e.g. samples from sites close together will experience similar microclimate, have similar soil type etc.

  • sampling within biological ‘units’ - e.g. leaves on a tree will be more similar to each other than to leaves from other trees.

  • experimental arrangements in the lab - e.g. plants grown together in a pot, or fish kept in one aquarium will all be affected by the conditions in that pot/aquarium.

Non-independence occurs at many levels in biological data, and in statistical testing the common consequence of non-independence is pseudoreplication. Pseudoreplication is an artificial increase in the sample size caused by using non-independent data. It may be easiest to see what this means by example.

Imagine we are interested in whether plants of a particular species produce flowers with different numbers of petals when grown in two different soil types. We have three plants in each soil type and each plant produces 4 flowers. As it turns out, the 4 flowers within each individual plant have similar (though not identical) numbers of petals. If we count the petals in a single flower from each plant, and then test the difference using a t-test we get the following result:

                    Number of petals
Soil type      Plant 1    Plant 2    Plant 3    Mean
Soil type A    3          4          5          4
Soil type B    4          5          6          5

p = 0.29

The difference is not significant. Now instead of sampling a single flower from each plant we count the petals of all four flowers on each plant and (incorrectly) use all the values in the analysis (giving an apparent sample size of 12 in each treatment):

                    Number of petals
Soil type      Plant 1       Plant 2       Plant 3       Mean
Soil type A    3, 2, 3, 4    4, 4, 3, 5    3, 6, 7, 4    4
Soil type B    4, 5, 4, 3    5, 7, 3, 5    6, 5, 7, 6    5

p = 0.009

Even with the proviso that the data might be a bit suspect with regard to normality, the same difference in the means now appears to be highly significant! The problem here is that the flowers within each plant are not independent: there is variation among plants in petal numbers, but within a plant (perhaps for genetic reasons) the numbers of petals produced are similar. Because of this non-independence the apparent significance in the final result is spurious. There are only three independent entities in each soil type treatment (the plants), so the first of the two tests here is correct and the second is pseudoreplicated.

To illustrate the effect in a still more obvious way, consider if we were interested in the heights of plants in the two soil types, but we actually only had one plant in Soil A and one in Soil B. If we measure the plants and find they differ somewhat in height, we cannot tell whether this is due to the soil, or just because no two plants are identical. With one plant in each soil we cannot carry out a statistical test to compare the heights. Now, if it was suggested that we measure the height of each plant 20 times and then used those numbers to do a statistical test to compare the plant heights in the two soils we would realise that this was an entirely pointless exercise.

There is no more information about the effect of soil type in the two sets of 20 measurements than there was in the single measurement (except we now know how variable our measuring technique is). And why stop at 20? Why not just keep remeasuring until we have enough numbers to get a significant difference?! Clearly this is nonsense.

Put this way, the pitfall of pseudoreplication seems obvious. However, it can creep into biological studies in quite subtle ways and occurs in a significant number of published studies. One very common problem occurs in ecological studies where different habitats, or experimental plots, are being compared. Say we are looking at zooplankton abundance in two lakes, one with fish and one without. We would normally take a number of samples from each lake and could obviously compare the zooplankton numbers between these two sets of samples. It would be tempting to attribute any differences we observe to the effect of fish. However this would not be correct.

We have measured the difference in zooplankton between the two lakes (and this is quite a valid thing to do) but the lakes may differ in any number of ways, not just the presence of fish, so it is not correct to interpret our result in relation to the effect of fish. To do this, we would really need data on zooplankton abundance in several lakes with fish, and several without. In other words, for testing the effect of fish, our replicates should be whole lakes with and without the relevant factor (fish), not samples from within a single lake.

But surely it is still better to take lots of samples from each site than just one; it must give a more accurate picture? This is true. Taking several measurements or samples from each object guards against the possibility of the results being influenced by a single, possibly unusual, sample; so the accuracy of the information about the object is increased. It would be much more reliable to have twenty zooplankton samples from a lake than just one. This is important, but it is not the same as having measurements from more objects (lakes)—true replication—which increases the power of the statistical test to detect differences among objects with respect to the particular factor (e.g. fish / no fish) we are interested in.

So in cases such as those above, the best strategy would be to measure petal number on all the flowers on each plant, but then calculate a mean for each plant and use those means in the statistical test. The same idea applies in the lake situation: several plankton samples could be taken from each of a number of lakes, then combined to give one estimate of plankton density for each lake. Of course, with only the two lakes in our original example we would be left with just two means, and we couldn’t do much in the way of statistical analysis on those.
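The averaging strategy can be sketched with the petal counts from the second table above. This is an illustration rather than a full analysis: it uses only Python’s standard library, so it stops at the t statistic, its standard error and the degrees of freedom instead of computing p-values, and the helper pooled_t is a name introduced here for convenience.

```python
from math import sqrt
from statistics import mean, variance

# Petal counts from the second table: soil type -> plant -> four flowers.
petals = {
    "A": {1: [3, 2, 3, 4], 2: [4, 4, 3, 5], 3: [3, 6, 7, 4]},
    "B": {1: [4, 5, 4, 3], 2: [5, 7, 3, 5], 3: [6, 5, 7, 6]},
}


def pooled_t(x, y):
    """Return (t, df, se) for a two-sample t-test with pooled variance."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(y) - mean(x)) / se, df, se


# Appropriate analysis: one value (the mean) per plant, so n = 3 per soil type.
plant_means = {soil: [mean(f) for f in plants.values()]
               for soil, plants in petals.items()}
t_ok, df_ok, se_ok = pooled_t(plant_means["A"], plant_means["B"])

# Pseudoreplicated analysis: every flower treated as independent, n = 12.
flowers = {soil: [x for f in plants.values() for x in f]
           for soil, plants in petals.items()}
t_bad, df_bad, se_bad = pooled_t(flowers["A"], flowers["B"])

print(f"plant means: df = {df_ok}, se = {se_ok:.2f}")
print(f"all flowers: df = {df_bad}, se = {se_bad:.2f}")
```

The difference between the soil means is the same in both analyses. What changes is that treating every flower as a replicate shrinks the standard error and claims many more degrees of freedom than the three plants per soil type can support.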

So, in summary, when carrying out an investigation the key question to ask is: What is the biological unit of replication relevant to the effect we are trying to test? As this implies, the appropriate unit of replication may vary depending on what we are investigating. If we want to test for a difference in the plankton density between two lakes, then taking 10 samples from each lake and comparing them would be the correct approach. But if, as above, we wanted to assess the effect of fish on plankton density, it would be inappropriate: the correct unit of replication in this case is the whole lake and we would therefore want to sample several lakes with and without fish.

19.4 Controls

We are told repeatedly, probably starting at primary school, that every experiment must have a control—a reference treatment against which the other treatments can be compared. The idea does, however, sometimes generate confusion since it is not always clear what is being controlled for, and some experiments do not require a control while others require more than one.

In some cases the appropriate control is obvious. In a toxicity test we are interested in the mortality due to the toxicant, and clearly we want the control to tell us what the background mortality rate (without toxicant) would be under those experimental conditions. However, if we are measuring the movement rates of slugs on surfaces of differing moisture content there is no control required — indeed none possible. Slugs encounter many different moisture conditions in their daily lives and there isn’t a ‘control’ moisture level. So the first message is that there may not be a control for all experiments.

More tricky is the situation where the objects we are investigating are affected not just by the treatment we are administering, but also by other effects of applying that treatment. This too can sometimes be addressed by the use of control treatments, but these are now not simply the ‘natural’ situation, they may have to be quite specifically designed to mimic certain aspects of the experiment, and not others. These sorts of controls are discussed in more detail below.

19.5 Confounded and noisy experiments

Unwanted variation comes in two forms.

  1. The first is confounding variation. This occurs when there are one or more other sources of variation that work in parallel to the factor we are investigating and make it hard, or impossible, to unambiguously attribute any effects we see to a single cause. Confounding variation is particularly problematic in observational studies because, by definition, we don’t manipulate the factors we’re interested in.

  2. The second is noise. This describes variation that is unrelated to the factor we are investigating but adds variability to the results so that it is harder to see, and detect statistically, any effect of that factor. As noted above, much of experimental design is about improving our ability to account for noise in a statistical analysis.

We will consider these together, as some of the techniques for dealing with them are applicable to both.

19.5.1 Confounding

The potential for confounding effects may sometimes be easy to recognise. If we measure growth rates in plants growing at sites of differing altitude, there are several factors which all change systematically with altitude (temperature, ultraviolet radiation, precipitation, wind speed etc.) and it may be hard to use such data to examine effects of any one of these factors alone. The important thing to remember is that observing a relationship between two variables (e.g. a negative relationship between plant growth and increased precipitation up a mountain) does not necessarily indicate a causal link (plant growth may be determined by one or more of the other factors that vary with altitude).

Confounding effects can also be much more subtle. We may find that eagle owls take more large Norway rats at a particular time of year—but the factor we are interested in (rat size) is related to sex (males are larger) and the males spend more time moving around (hence out of cover and exposed to predation) at that time of year. So what seems to be a size effect, may actually be produced by sex-specific behaviour and not due to eagle owls selecting larger prey at all.

Confounding doesn’t just occur in observational studies. It can also arise in experiments, when administering a treatment itself generates other unwanted effects. An example might be in the administration of nutrients to plants. Changing the supply of nitrogen may be done by supplying different levels of a nitrate (NO3) salt (e.g. Mg(NO3)2 or Ca(NO3)2), but how can we be sure that the effects we see are a consequence of nitrogen addition, rather than effects of the magnesium or calcium cations?

19.5.2 Noise

Noise in the data can be generated by the same processes that generate confounding. The difference is that noise is generated even when the confounding factors don’t align with the treatments. So, going back to measuring growth rates in plants, if we were looking at growth rates of different subspecies of plant on a mountain then we might find that we can get five samples from each different subspecies, but the samples are scattered across very different altitudes on the mountain. This will add variation to the estimates of growth rate due to effects of altitude—this variation is unwanted noise. On the other hand, we might find that the subspecies each grow predominantly at different altitudes and in this situation the variation due to altitude is confounded with the variation due to subspecies—we cannot tell whether the subspecies are inherently different, or the differences are just down to altitude.
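The distinction between the two situations can be illustrated with a toy calculation. Everything in the sketch below is hypothetical: the growth model, the subspecies labels X and Y, the altitudes and the size of the effects are invented purely to show how the same unwanted factor (altitude) behaves as noise in one sampling scheme and as a confounder in another.

```python
from statistics import mean, stdev

# Hypothetical model: growth depends on subspecies and declines with altitude.
BASE = {"X": 10.0, "Y": 12.0}      # true subspecies difference (Y - X) is 2.0
SLOPE = -0.01                      # change in growth per metre of altitude


def growth(subspecies, altitude):
    return BASE[subspecies] + SLOPE * altitude


def summarise(altitudes):
    """Return (difference in mean growth Y - X, within-subspecies sd of X)."""
    g = {s: [growth(s, a) for a in alts] for s, alts in altitudes.items()}
    return mean(g["Y"]) - mean(g["X"]), stdev(g["X"])


# Noise: both subspecies are scattered across the same wide altitude range.
scattered = {"X": [100, 300, 500, 700, 900], "Y": [100, 300, 500, 700, 900]}
# Confounding: X is only sampled low down, Y only high up.
separated = {"X": [100, 150, 200, 250, 300], "Y": [700, 750, 800, 850, 900]}

diff_noise, sd_noise = summarise(scattered)
diff_conf, sd_conf = summarise(separated)

print(f"scattered: difference = {diff_noise:.1f}, sd within X = {sd_noise:.2f}")
print(f"separated: difference = {diff_conf:.1f}, sd within X = {sd_conf:.2f}")
```

In the scattered scheme the difference between subspecies is estimated without bias but the spread within each subspecies is large. In the separated scheme the spread is small, yet the observed difference mixes the subspecies effect with the altitude effect and no longer reflects the true difference built into the model.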

19.6 Dealing with confounding effects and noise

Confounding effects occur often in biological work and noise of some sort is always present. Techniques for dealing with such effects include:

  • randomisation

  • blocking

  • experimental control

  • additional treatments.

We’ll consider each of these in turn…

19.6.1 Randomisation

Randomisation is fundamental to experimental design. Although there may be specific confounding factors we can identify and explicitly counter using experimental techniques, we can never anticipate all such factors. Randomisation provides an ‘insurance’ against the unpredictable confounding effects encountered in experiments. The basic principle is that each experimental unit should be selected, or allocated to a particular treatment, ‘at random.’ This may involve selecting which patients to give a drug and which a placebo at random or it may involve setting out experimental plots at random locations in a field. The important thing is that of all the possible patients or plots, the ones that get a particular treatment are randomly selected.

Randomisation guards against a variety of possible biases and confounding effects, including the inadvertent biases that might be introduced simply in the process of setting up an experiment. For example, if in a toxicological experiment with freshwater invertebrates the chemical treatment is set up first and then the control, it may be that the animals caught most easily from the stock tank (the largest? the weakest?) will all end up in the chemical treatment and the remainder in the control, with consequent bias in the death rates observed in the subsequent experiment.

Randomisation is a critical method for guarding against confounding effects. It is the best insurance we have against unwittingly getting some other factor working in parallel to a treatment. It does not, of course, do anything to reduce noise in the data. In fact, if randomisation removes confounding effectively, it can appear to increase that variation, but this is a necessary price to pay for being able to interpret treatment effects correctly.

What does ‘at random’ mean in practice?

The random bit of the word randomisation has a specific meaning: objects chosen ‘at random’ are chosen independently with equal probabilities. How do we achieve this in practice? First we need a set of random numbers. For example, if we need to assign 10 experimental units to treatments we might start with a set of random integers: 4, 3, 5, 8, 7, 1, 10, 9, 6, 2. Obtaining a set of random numbers is easy enough. Tables of random numbers are published in most statistics books expressly for use in setting up experiments, and R can also be used to generate a set of random numbers (e.g. sample(1:10)).

Exactly how these numbers are used in setting up the experiment will depend on what is practical. In the toxicological experiment the best thing to do would be to place animals in each of the test containers to be used for the experiment, number each container and then use the first half of the set of random numbers to randomly select half the containers to be the test and use the remainder as the controls. In a field experiment, a grid could be mapped out and pairs of random numbers used to select co-ordinates at random for each plot—in this case we would generate random co-ordinate values instead of using integers.
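The two allocation schemes just described can be sketched in a few lines. The example below uses Python’s standard random module, whose random.sample plays a role similar to sample(1:10) in R. The seed value, the grid dimensions and the number of plots are arbitrary assumptions chosen only so that the sketch is reproducible.

```python
import random

rng = random.Random(2024)   # fixed seed only so the allocation can be reproduced

# Number the ten test containers, then put the numbers in a random order.
containers = list(range(1, 11))
order = rng.sample(containers, k=len(containers))

# First half of the random order -> chemical treatment, the rest -> control.
half = len(order) // 2
allocation = {"treatment": sorted(order[:half]), "control": sorted(order[half:])}
print(allocation)

# Field version: random (x, y) co-ordinates for plots on a hypothetical
# 50 m x 30 m grid, each drawn uniformly and independently.
plots = [(round(rng.uniform(0, 50), 1), round(rng.uniform(0, 30), 1))
         for _ in range(6)]
print(plots)
```

In a real experiment we would record the seed (or the random numbers themselves) alongside the data so that the allocation can be checked later.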

19.6.2 Blocking

Another way of tackling potential confounding effects, and the general heterogeneity of biological material leading to noise, is to organise experimental material into ‘blocks.’ This technique, called blocking, is arguably the most important experimental design concept after replication. It works as follows:

  1. Group the objects being studied into blocks such that variation among objects within blocks is small; variation between blocks may be larger.

  2. Each treatment should occur at least once within each block15.

For example, in an experiment in which mice are reared on three different diets (I, II, III), we might expect the responses of mice from within a particular litter to be fairly similar to each other, but they might be rather different to the responses of mice from different litters. If we have five litters of mice (A … E) it would be sensible to select three mice from each litter (at random) and allocate one of them to each treatment.

Diet    Litter A     Litter B     Litter C     Litter D     Litter E
I       \(A_{1}\)    \(B_{1}\)    \(C_{1}\)    \(D_{1}\)    \(E_{1}\)
II      \(A_{2}\)    \(B_{2}\)    \(C_{2}\)    \(D_{2}\)    \(E_{2}\)
III     \(A_{3}\)    \(B_{3}\)    \(C_{3}\)    \(D_{3}\)    \(E_{3}\)

(Where \(A_{1}\) is the first randomly chosen animal from litter \(A\), \(A_{2}\) the second, etc.)

This type of blocking should, if there are differences between litters, increase the power of the experiment to detect effects of the treatment and guards against the possibility that we might by chance end up with one diet having mice from, say, only two litters.
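A layout of this kind can be generated by randomising separately within each block. The sketch below is one way to do it, not a prescribed procedure: the mouse labels (A1, A2, …) are hypothetical identifiers for three mice per litter, and the seed is arbitrary.

```python
import random

rng = random.Random(7)     # arbitrary seed, used only for reproducibility

diets = ["I", "II", "III"]
litters = ["A", "B", "C", "D", "E"]

# Hypothetical labels: three mice per litter, numbered arbitrarily 1 to 3.
design = {diet: [] for diet in diets}
for litter in litters:
    mice = [f"{litter}{i}" for i in (1, 2, 3)]
    rng.shuffle(mice)                      # random order within the block
    for diet, mouse in zip(diets, mice):   # one mouse per diet per litter
        design[diet].append(mouse)

for diet in diets:
    print(f"{diet:>3}: {' '.join(design[diet])}")
```

Because the shuffle happens inside each litter, every diet is guaranteed exactly one mouse from every litter, while which mouse gets which diet is still left to chance.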

In the case of only two treatments (e.g. if we just had diets I and II), this type of blocking is simply the pairing of treatments we have encountered in the paired-sample t-test. Blocked designs with more than two treatments are typically analysed using Analysis of Variance (ANOVA). We will learn how to apply ANOVA to a blocked experimental design in later chapters.

Note that randomisation is important here also. Mice were selected at random from each litter to be allocated to each treatment, and litters are essentially ‘random’ in the sense that they are not deliberately chosen to be different in any particular way; we just anticipate that they are likely to be different in some ways.

Blocking crops up in all sorts of experimental (and non-experimental) study designs. Some examples are given below.

  • If plants in an experiment on soil water levels are being grown in pots on greenhouse benches, there may be differences in light or temperature at differing distances from the glass. Treatments could be blocked along the gradient—at each position on the bench we have one pot from each treatment. This way, every treatment is represented at each position along the gradient.

  • If a field experiment involving several treatments is set up in an environment known to have some spatial variation (e.g., different parts of a field, sections of a river, etc.) setting up one replicate of each treatment in blocks at different locations ensures that no one treatment ends up confounded by some environmental difference, and helps remove noise due to environmental effects in the final analysis.

  • An immunity response is being tested using insects kept in a parallel set of laboratory cultures. There are insufficient insects from a single culture to run the whole experiment, so we could set up one replicate of each treatment using insects from each culture. The cultures would be the blocks—we are not particularly interested in the differences between cultures, but we want to be able to control and remove any variation due to differences between cultures, so as to stand the best chance of detecting a treatment effect.

  • In a comparison of three new diagnostic techniques for measuring the frequency of abnormalities in tissue samples the techniques could be dependent on the person who carries them out (experience, standard of working etc.). The same workers could carry out all three techniques (in random order) and the results compared using individual workers as blocks to increase the power of the analysis to detect differences.

  • If the process of collecting and analysing samples from an experiment is very time consuming (relative to the rate at which things might change) then we could block the experiment in time. Set up one replicate of each treatment on each of a sequence of days, and then collect the samples after a particular time, again over the same sequence of days. Each replicate has then been run for the same length of time (we would randomise the order in which treatments were sampled each day), and we could then include ‘days’ as a block within the analysis to control for any unknown differences resulting from the different setup or sampling days.

It’s worth saying again: blocking is one of the most important experimental design concepts. Most experimental settings lend themselves to some kind of blocking scheme. If there is a way to block an experiment, we should do it. Why? Because a blocked experiment is more powerful, in the statistical sense, than the equivalent non-blocked version. That is, a study is more likely to detect an effect if it uses a blocked design. We will see how to analyse a blocked experiment in a later chapter.
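The gain in power from blocking can be seen in a small worked comparison. The numbers below are invented for illustration (two diets, five litters acting as blocks, with large differences between litters and a small, consistent diet effect within each litter). The sketch simply compares the t statistic obtained when the blocks are ignored with the one obtained from the within-litter differences.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical weights for two diets in five litters (blocks). Litters differ
# a lot from one another; the diet effect within a litter is small.
diet_1 = [10, 21, 30, 41, 50]   # one mouse per litter, litters A to E
diet_2 = [12, 22, 33, 43, 52]   # litter-mates of the mice above

n = len(diet_1)

# Ignoring the blocks: two-sample t statistic with pooled variance
# (with equal group sizes the pooled variance is the simple average).
pooled_var = (variance(diet_1) + variance(diet_2)) / 2
se_unblocked = sqrt(pooled_var * 2 / n)
t_unblocked = (mean(diet_2) - mean(diet_1)) / se_unblocked

# Using the blocks: analyse the within-litter differences (paired t statistic).
diffs = [b - a for a, b in zip(diet_1, diet_2)]
se_blocked = sqrt(variance(diffs) / n)
t_blocked = mean(diffs) / se_blocked

print(f"unblocked: t = {t_unblocked:.2f} on {2 * n - 2} df")
print(f"blocked:   t = {t_blocked:.2f} on {n - 1} df")
```

The blocked analysis has fewer degrees of freedom, but in this made-up example its standard error is far smaller because the variation among litters no longer contributes to it.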

19.6.3 Experimental control

It is obvious that some unwanted variation in data will arise if there is poor measurement, or careless implementation of the treatments (imprecise administration of doses, sloppy timing of trial periods, etc.). In every study we do we should look at the ‘protocol’ issues and see if they can be made tighter. This means considering the precision of the measurements we are making, etc. in relation to the sizes of effects we are interested in, and the resources available to carry out the work. There would be no point in timing measurement intervals over which seedling growth was determined to the millisecond, or determining the soil pH to 5 decimal places, but it would be good to measure seedling height using a standard approach (natural growth form, or stretched out to maximum length? Starting from where?) and to the nearest millimetre, rather than centimetre.

A second form of experimental control is where we can use experimental manipulation of some sort to control for factors that might vary among replicates or treatments. At its simplest obviously this involves controlling the other conditions (for example temperature) so that all treatments experience identical conditions (though note that it may not always be necessary for the conditions to be constant—it may be sufficient that whatever variation occurs is the same for all treatments).

More complex problems arise where the unwanted variation is directly produced as a by-product of the treatment we are administering (confounding again). For example, if we were interested in the effect of decomposition of leaf litter on the microbial communities in soils we might have an experimental treatment that involves varying the amount of leaf litter placed on the soil surface in the test plots. The problem is that this will vary not just the amount of decomposing material entering the soil: the physical presence of the leaf litter layer will also affect the microclimate at the soil surface (for example, how dry the surface of the soil becomes). So we might create some sort of artificial litter which can be mixed in with the real litter, but which does not decompose, so that each plot has a constant volume of ‘litter’ on the surface, but different amounts of decomposing material entering the soil.

Other situations in which this type of experimental ‘adjustment’ can be used include experiments in which different nutrient solutions have to be adjusted so that they have the same pH or where different temperature treatments have to have humidity adjusted to ensure that it remains constant. In general this type of approach can be very useful but it depends on the necessary adjustment being known, and sometimes requires continuous monitoring to keep the adjustments correct.

19.6.4 Additional treatments: ‘designing in’ unwanted variation

Often we are faced by situations in which the unwanted variation — in particular confounding effects — cannot be removed by manipulating the treatments themselves, but has to be tackled by creating additional treatments whose function is to measure the extent of the unwanted variation, and then allow us to remove it statistically, from the data after the experiment is done. In other words, instead of just designing the experiment with the factor we are interested in, we ‘design in’ the sources of unwanted variation.

19.6.4.1 Transplants and cross-factoring

Imagine we had an investigation that involved looking at effects of air pollution on the ability of trees to defend themselves chemically against attack by leaf-mining insects. The obvious thing to do would be to look at trees along a gradient of air pollution and monitor leaf damage by the insects. We might find that the trees in polluted areas are more attacked by the insects. However the problem here is that the trees growing in areas of high air pollution might be attacked more because they are stressed and less able to invest resources in defending themselves (as we hypothesised), or because the insects are more abundant there because their own natural enemies (birds and parasitoids) are less abundant in areas of high air pollution and so cannot control the abundance of the leaf-miners. One way of escaping this confounding effect would be to take tree saplings from polluted and unpolluted areas and do reciprocal transplants — moving trees from polluted areas into clean areas, and vice versa. This then enables us to separate out to a large extent the effect of tree quality from the effect of insect abundance as we can compare trees that have grown with and without air pollution, in both polluted and unpolluted areas.

It is also possible that by careful choice of location, or other elements of design, we can include the unwanted variation as an additional factor in the design without necessarily physically manipulating the subjects, but by sampling material systematically with regard to both the thing we are interested in and the additional unwanted factor(s), so that we can cross-factor the two. For example, if we were interested in how habitat use determines gut parasite load in dogfish, then we might sample dogfish from different habitats, but also record the sex, and age, or size, of the fish. It would then be possible to separate out the effects of sex, or age, from those of where the fish were living. If we didn’t do this, then both factors would probably contribute unwanted variation, either noise, or possibly confounding effects (for example male and female dogfish have somewhat different habitat preferences).

19.6.4.2 Procedural controls

Confounding effects are not only a problem along natural gradients; they can often be introduced by the experimental procedures themselves. For example, a marine biologist investigating the effect of crab predation on the density of bivalve molluscs in an estuarine ecosystem might set up cages on the mudflats from which crabs are removed, and in which any change in bivalve settlement and survival can be monitored. The obvious control would be equivalent plots on the adjacent mudflats with normal crab numbers. However, if the experiment simply compares bivalve density in cages with reduced crab numbers against density on the adjacent mudflat, any effect observed could be attributable to crab density, to environmental changes brought about by the cages, or to disturbance caused by the repeated netting to remove crabs. Several additional controls help to address this problem. In addition to the proper treatment, bivalve density could be monitored in:

  • a ‘no cage / no disturbance control’: open mudflat adjacent to the experiment (so no cage effects and no added disturbance).

  • a ‘cage control’: crabs at normal density but with a cage (usually done as a cage with openings to allow crabs to enter and leave).

  • a ‘disturbance control’: crabs at normal densities, but subject to the same disturbance as the reduced-density treatments (cages netted to remove crabs, but all crabs returned to the cages).

The latter two could be combined if it were not important to separate disturbance and cage effects. Even so, in some circumstances it is quite possible for an experiment to have as many controls as there are actual treatments.
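To see how these controls partition the comparison, the sketch below breaks the naive difference (crab removal minus open mudflat) into a cage component, a disturbance component and the crab effect itself. The densities are invented and the group names are our own labels. The sketch also assumes that the open-sided cage control has the same cage effect as a closed cage, and a real analysis would need replicated plots and a formal test.

```python
from statistics import mean

# Hypothetical bivalve densities (individuals per plot) in each group.
# All numbers are invented for illustration only.
density = {
    "open_control": [20.0, 22.0, 24.0],         # no cage, no netting
    "cage_control": [25.0, 27.0, 29.0],         # cage, no netting
    "disturbance_control": [21.0, 23.0, 25.0],  # cage, netted, crabs returned
    "crab_removal": [36.0, 38.0, 40.0],         # cage, netted, crabs removed
}
group_mean = {name: mean(values) for name, values in density.items()}

# Each contrast compares two groups that differ in only one respect.
cage_effect = group_mean["cage_control"] - group_mean["open_control"]
disturbance_effect = (
    group_mean["disturbance_control"] - group_mean["cage_control"]
)
crab_effect = group_mean["crab_removal"] - group_mean["disturbance_control"]

# The naive comparison mixes all three components together.
naive_difference = group_mean["crab_removal"] - group_mean["open_control"]

print("Cage effect:", cage_effect)
print("Disturbance effect:", disturbance_effect)
print("Crab effect:", crab_effect)
print("Naive difference:", naive_difference)
```

With these made-up numbers the naive difference is the sum of the three contrasts, so without the procedural controls we could not say how much of it was due to the crabs.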

The additional treatments in this sort of situation are effectively additional controls (in fact they may be termed procedural controls), but they are not simply the natural ‘background’ conditions. A classic example of this type of control is the use of placebo treatments in medical trials. For example, if we are investigating the effect of a drug, then there may be a confounding effect due to psychological, behavioural or even physiological changes in patients resulting simply from the process of being treated, rather than from any active compound in the drug. It is common, therefore, to give the drug to one group of patients and a ‘placebo’ (an equivalent treatment process, but with no active component in the substance administered) to another group. The placebo is a secondary manipulation designed to equalise the effect of simply ‘being treated’. There are many other examples of similar experimental controls. A treatment involving surgical implantation of some sort of device may require a control group who have the surgery but without the implantation itself, or even with implantation of an inactive device, to allow us to factor out the confounding effect of surgical trauma or of the body’s reaction to the implant itself.

19.7 Ethics and practicality

Although experimental design is often fairly straightforward in principle, the ideal design to test a hypothesis may turn out to be impractical, unaffordable or unethical. All experiments are constrained by practicality, most by finance, and a rather smaller but important set by ethical considerations. Ethical factors obviously constrain experiments in subjects such as psychology and animal physiology, and even in ecology, where experiments involving rare species, species introductions or environmental damage may be technically possible but ethically unacceptable. However, nowhere is the problem more pronounced than in medicine.

Drug testing presents the classic difficulty. Testing the efficacy of a drug effectively depends on comparing patients receiving the drug with closely equivalent patients who are not receiving it, or who are receiving some alternative treatment. Since it is highly likely that one of the treatments will be better than another, then by definition at least one group of people is having an available and better treatment withheld from them (e.g., Aspinal and Goodman 1995). Thus, as soon as the experimental evidence gives some indication of which treatment is best, it becomes very hard to justify withholding that treatment from any patient, even if the experimenter feels that further work is necessary.

Good experimental design and appropriate analysis cannot remove ethical, practical or financial problems, but they can help to ensure that where time and money are invested in investigating a problem, the maximum useful information is returned.

19.8 Further reading

Barnard, C., Gilbert, F. and McGregor, P. (2007) Asking questions in biology. Longman.

Ruxton, G. D. and Colegrave, N. (2010) Experimental design for the life sciences. Oxford Univ. Press.


  1. It is worth noting that in reports experiment and observation should always be distinguished. If we have carried out observations on a natural system of any sort, but where there has been no experimental manipulation of any aspect of the system, that is not an experiment. It would be inappropriate to write in a report: “This experiment consisted of measuring mean stomatal density from thirty trees growing at a range of altitudes.” Instead, we might write: “We conducted an observational study measuring mean stomatal density from thirty trees growing at a range of altitudes.”

  2. This variability could be due to all kinds of things: the organisms or material being used, the experimental conditions, and the methods of measurement.

  3. Actually, there are special types of experimental design that use blocking, but where each treatment does not appear in every block. These are much more advanced than anything we will cover in this book.