
Lecture 6: Experiments

1. Introduction
In this lecture, I focus on scientific experiments and the associated experimental practices. First,
I discuss some characteristic aspects of experiments which distinguish them from other
empirical methods. I also address some interesting borderline cases like “natural experiments”
and “simulation experiments”.
Next, I discuss some justifications for running experiments. Specifically, I focus on the capacity
of experiments to produce scientific knowledge, and what makes experiments better (or worse)
suited for certain tasks than other empirical methods. This section is structured according to the
three main functions of experiments: (i) to test theories, (ii) to discover new phenomena, and (iii)
to support policy formulation.
Third, I present some considerations on how to design experiments. As in the “Models” lecture, this
cannot yield a recipe for how to run experiments: too much depends on the specific context and the
creativity of the experimenter. But philosophical considerations may point to certain underlying
reasons that constrain and shape rationally designed, and hence “internally valid”, experiments.
At best, these internal validity considerations will help you choose between the “recipes” that your
respective disciplines have on offer.
Finally, I address the issue of how to interpret experiments. Many experiments are highly controlled
and run under laboratory conditions, and are therefore quite different from the real-world
situations in which we are interested. It is thus not trivial to apply results obtained from
experiments to those real-world situations. Instead, one has to give well-thought-out justifications
for why an experimental result may apply to a certain real-world situation. While often
overlooked, these external validity considerations are crucial for the success of any experimental
science.

2. What is an Experiment?
When you think of an experiment, you may well think of scientists in white coats working away
in a laboratory. But “to experiment” means something much broader than that. In fact, it covers
many non-scientific, everyday practices.
Imagine the fuse blows in your apartment. You notice that at the time, three appliances were operating: the
toaster, the washing machine, and the kitchen light. So you set out to find the culprit by switching off all
appliances and replacing the fuse. Then you test each appliance in isolation, switching it on with the others off.
You might find the culprit already at that point: maybe the toaster short-circuits and needs to be replaced. What
do you do if this does not work? You may have heard that two or more appliances may overload the grid, so
you try the appliances in pairs. If that still doesn’t work, you may want to test whether the fuse blowing
depends on the particular order in which you switch the appliances on. If that doesn’t work either, you had
better call the electrician!

This everyday household example bears all the hallmarks of an experiment: you seek to observe
a result (the fuse blowing again), but you intend to do so under controlled conditions (i.e.
knowing which appliances are on and which are off). Furthermore, you manipulate the situation
in order to make these controlled observations: you methodically switch the various appliances
on and off. So you performed an experiment, yet with no white coats or a laboratory in sight!

Experimenting is indeed a human practice that goes far beyond the narrow realms of science.
The Mende people of Sierra Leone, for example, have a special word, “hungoo”, for experiment.
A “hungoo” can consist in planting two seeds in adjacent rows, and then measuring the output in
order to determine which seed was best. This practice probably originated with the Mende
themselves, rather than being brought to them by visiting Europeans. Similar experiments also
occur in other parts of the world (Richards 1989). Scientists merely refined this practice and
employed it for their own purposes. While the household and the Mende experiments are
performed to solve practical problems, scientific experiments are also used to test theories and to
discover new phenomena.
Note that the modern tradition of scientific experimentation is relatively recent. Francis Bacon
(1561 – 1626) has been credited with the development of an inductive scientific method that
focused on controlled observation under manipulation. Galileo Galilei (1564 – 1642) described
many ground-breaking scientific experiments (although there is debate whether he actually
performed many of these experiments, or whether they were mere thought experiments). Robert
Boyle (1627–1691) is often credited as the initiator of the laboratory style in the sciences
(Hacking 1992).
What characterizes an experiment? The first aspect of experimenting is controlled observation.
A controlled observation is one in which the major features of the object of study are registered
in a way planned to provide us with adequate information about it. Controlled observation
involves knowledge of the relevant features of an object, allowing us to register all these
features in our observation. Precisely because of this required knowledge, controlled observation
is often difficult to achieve.
Many sciences, in particular astronomy, biology and the social sciences, often engage in
observational studies. Such studies involve observing a set of features over a period of
time, or over a population at a defined time. Typically, such observations are made under varying
conditions. To exert scientific control, all the relevant features must be recorded. Furthermore, if
two observations differ in more than one feature, the effects of these features must be separated
through statistical techniques. This is sometimes possible, but only under relatively strict
conditions of independence and completeness. The ability to make controlled observations in this
context thus depends on theoretical assumptions, which are often themselves difficult to verify.
In order to avoid such strong theoretical assumptions, empirically orientated scientists often
search for real-world situations in which two groups of objects are distinguished only by one
feature, while all the other features are identical. In such a situation, the effects of different
features need not be separated statistically, so that observed changes in outcome can be attributed
to the difference in that one feature.

A famous example of such an observational study is the 1854 Broad Street cholera outbreak in London,
England. By the end of the outbreak 616 people had died. Using a map of deaths and illnesses, the physician
John Snow identified the source of the outbreak as the local public water pump.

Figure 1

In this example, Snow discovered a strong association between the use of the water and deaths and illnesses due
to cholera. Snow found that the water company (the Southwark and Vauxhall Company) that supplied water to
districts with high attack rates obtained the water from the Thames downstream from where raw sewage was
discharged into the river. By contrast, districts that were supplied water by the Lambeth Company, which
obtained water upstream from the points of sewage discharge, had low attack rates. Given the near-haphazard
patchwork development of the water supply in mid-nineteenth-century London, Snow viewed the
developments as "an experiment...on the grandest scale."

Of course, the exposure to the polluted water was not under Snow’s control, and hence it is misleading to speak
of an experiment. Instead, Snow was lucky to be able to identify two groups – those infected with Cholera and
those not – who were (statistically) identical in all features (distribution of age, sex, education, culture,
occupation, class, etc.), except for their use of water. Snow thus found a “naturally controlled” situation, which
allowed him to attribute the cause of the Cholera outbreak to the water source (Freedman 2005, 6-9).

A situation like Snow’s Cholera study is sometimes called a “natural experiment”, because in it
features are “naturally” arranged in such a fashion as if they had been controlled by an
experimenter. Yet experimentation proper not only involves controlled observation, but also
manipulation. An experiment is a controlled observation in which the observer manipulates the
variables that are believed to influence the outcome. Thus, the experimenter does not just
passively observe the various features of an object, recording their correlations. Rather, she
actively manipulates some of these features in order to observe how other features are affected.
Such manipulations may be performed in the laboratory, but need not be.

A typical experiment in the wild is the following, involving wild vervet monkeys in the Loskop
Dam Nature Reserve, South Africa. The purpose was to measure the value of food providers for a monkey
group as a function of the demand for food and the supply of food providers. Experimenters allowed a varying number
of “providers” in a group of monkeys to open a box with fruit. The retrieved food was then shared with
everyone in the group. The experimenters observed the grooming time that other group members offered the
providers, and related the amount of grooming time to the numbers of providers (Fruteau et al 2009).

Figure 2
Because this observed animal behaviour is presumably highly sensitive to the environment, the experimenters
chose to run the experiment in the wild, rather than in a laboratory. Although observation in a zoo would
presumably have increased their degree of control, it would also have introduced so many potentially disturbing
factors that observation in the wild seemed preferable.

Often it is more practical not to construct in the laboratory every part of the context in which an
experiment is performed. Instead, one takes as much of an existing context as possible, manipulates
only a few features, and makes sure that these manipulations do not otherwise affect the context. Note
that in the above monkey experiment a manipulation takes place, but it is assumed that this
intervention (including the presence of the observers!) does not affect other possibly relevant
factors like habitat, hierarchy or security perception.
In biology, sociology and archaeology, to name just a few, such non-laboratory experiments are
often conducted. Yet in other sciences – like physics, chemistry, psychology and economics –
laboratory experiments predominate. The aim of the laboratory is to construct and hence fully
control the manipulation of as many of the relevant features of an object as possible. Some of
these sciences are so dependent on the laboratory that their theories make claims only about
phenomena observed in the laboratory. Such disciplines are called “laboratory sciences”
(Hacking 1992).
Physics is a prime example of a laboratory science: almost all its observations are made in highly
developed machines, controlled in every last feature, creating conditions (like nuclear fusion,
fission or collision) that never occur ‘in the wild’, at least not on earth. Thus, while the
laboratory offers an unprecedented degree of control and manipulability, it also produces objects
and situations that are often far removed from the real world.
Computer simulations are sometimes called “computer experiments” or “in silico experiments”.
Here, a model is implemented on a computer leading to a dynamic demonstration. Further
interventions can then be programmed to explore how they affect the scenario. In some sense,
this is clearly more about modelling than experimenting, because control and intervention are
performed on a representation, not on the “real stuff” used in field or laboratory experiments.
However, some authors have pointed out that what is really controlled and manipulated is the
implementation on the machine, not the representation itself. To run an experiment on a
computer is thus akin to experimenting with a laboratory animal: both are used to experiment on,
in order to learn about a substantially different target.

3. Why Experiment?
Experimenting is a widespread practice in almost all sciences. In fact, the public image of a
scientist is often that of a laboratory scientist. But why is it that science is so closely linked to
experiments? In particular, what are the epistemic contributions of experiments to scientific
knowledge? And what makes experiments better than other empirical methods? In the following,
I will discuss these questions in terms of the three main scientific uses of experimentation.
A famous economic experimenter once said that the primary goals of experiments are: “Speaking
to theorists, Searching for facts, and Whispering into the ears of princes” (Roth 1986). Thus,
experiments are used to (i) test theoretical hypotheses, (ii) discover and investigate novel
phenomena that cannot be explained by existing theories, and (iii) illuminate or support policy
making. In each of these three functions, experiments can contribute to the increase of scientific
knowledge, and indeed can do so better than other empirical methods. Let me explain these
claims in more detail.
Maybe the most obvious function of experiments is in theory testing. As you heard in the second
lecture, the standard hypothetico-deductive (HD) model of hypothesis testing suggests that
scientists propose theoretical hypotheses, deduce observable implications from these hypotheses,
and then check by observation and experimentation whether these implications are true or
false.
In 1920, Frederick Banting and his student Charles Best hypothesized that pancreatic extract from dogs would
have anti-diabetic qualities. The work of Naunyn, Minkowski, Opie, Schafer, and others had indicated that
diabetes was caused by lack of a protein hormone secreted by the Islets of Langerhans in the pancreas. To this
hormone Schafer had given the name insulin. It was supposed that insulin controls the metabolism of sugar, so
that lack of it results in the accumulation of sugar in the blood and the excretion of the excess in urine.
Attempts to supply the missing insulin by feeding patients with fresh pancreas, or extracts of it, had failed. It was
subsequently shown that the proteolytic enzyme of the pancreas destroyed insulin. The problem, therefore, was
how to extract insulin from the pancreas before it had been thus destroyed.
Banting and Best’s extraction method involved tying a string around the dog’s pancreatic duct. When examined
several weeks later, the pancreatic digestive cells had died and been absorbed by the immune system. The
process left behind thousands of islets. They isolated the extract from these islets and produced what they called
“isletin”, which later became known as insulin.
Banting and Best then tested this extract on dogs. They surgically removed a dog’s pancreas, and managed to
keep the dog alive throughout the whole summer by administering the extract. The extract regulated the dog’s
blood sugar levels.
In 1922 the insulin was tested on Leonard Thompson, a 14-year-old diabetes patient who lay dying at the
Toronto General Hospital. He was given an insulin injection. At first he suffered a severe allergic reaction, and
further injections were cancelled. The scientists worked hard on improving the extract, and a second course of
injections was then administered to Thompson. The results were spectacular. The scientists went to the other
wards with diabetic children, most of them comatose and dying from diabetic keto-acidosis. They went from
bed to bed and injected them with the new purified extract. This is known as one of medicine’s most
dramatic moments: before they had injected the last of the comatose children, the first were already awakening
from their comas.

The Banting and Best story is typical of the problems and potential of experimental HD testing.
In the earlier attempts, the hypothesis was rejected each time: the experiments seemed to show
that insulin does not cure diabetes. But instead of giving up on the hypothesis altogether (and
hence on the underlying theory of human metabolism), they successfully found errors in some of
the auxiliary assumptions – first by showing that the pancreas enzyme had destroyed the insulin,
and later that the insulin was so contaminated that it created an allergic reaction.

Because they were able to pin the negative test results on the auxiliary assumptions, and
developed appropriate techniques to rectify these errors, Banting and Best were subsequently
able to find positive evidence for their theoretical hypothesis. But they never would have gotten
there without experimentally determining the errors that yielded the negative results, nor
without the evidential support that the manipulation of the dog’s organism yielded.
In most cases, theory testing is not as straightforward. Recall the Duhem-Quine problem,
discussed in lecture 2. While the standard HD account claims that empirical tests check the
implications of theories alone, it almost always turns out that the implication is derived from the
hypothesis using various auxiliary assumptions. Auxiliary assumptions include the accuracy of
measurement instruments, the absence of disturbing factors, and various initial conditions. Hence
the HD-scheme does not look like (a) but like (b) in figure 4:
(a)  H → e                 (b)  (H & A) → e
     not-e                      not-e
     ∴ not-H                    ∴ not-(H & A)
Figure 4

From observing “not e” one can therefore conclude only “not (H & A)”, but not “not H”. And
from observing e one can conclude only “probably (H & A)”, but not “probably H”.
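The deductive side of this asymmetry can be checked mechanically. A minimal sketch in plain propositional terms: enumerate all truth assignments consistent with the premise (H & A) → e, keep those in which e is false, and see what follows about H & A and about H alone.

```python
from itertools import product

# All truth assignments to (H, A, e) consistent with the premise (H & A) -> e.
models = [
    (H, A, e)
    for H, A, e in product([True, False], repeat=3)
    if (not (H and A)) or e
]

# Observing "not e": keep only the assignments where e is false.
not_e_models = [(H, A, e) for (H, A, e) in models if not e]

# In every remaining assignment the conjunction H & A is false ...
conjunction_values = {H and A for (H, A, e) in not_e_models}   # {False}
# ... but H itself may still be either true or false.
h_values = {H for (H, A, e) in not_e_models}                   # {True, False}
```

So a negative observation refutes the conjunction of hypothesis and auxiliaries, but leaves the hypothesis itself undetermined, which is exactly the Duhem-Quine predicament.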
The Duhem-Quine problem of course applies to all empirical tests. Yet in contrast to other
empirical methods like mere controlled observation, experiments allow for greater control over
many of the relevant auxiliary conditions. Because scientists construct the instruments with
which they interfere in experimental situations, they can be more confident about their reliability.
Because they create laboratory background conditions, and themselves perform their
experimental manipulations, they can be more confident about their accuracy. Thus, while of
course still prone to error, experiments can provide a higher confidence in the reliability and
accuracy of auxiliary conditions.
With that, the Duhem-Quine problem can be mitigated: observing “not e”, one can conclude “not
H&A” – and with confidence in accuracy and reliability of A, also “not H”. Observing “e” one
can conclude “probably H&A” – and with confidence in accuracy and reliability of A, also
“probably H”. The higher the degree of experimental control, the higher the degree of confidence
in auxiliary conditions, and hence the less pressing the Duhem-Quine problem. Laboratory
experiments thus deal better with the Duhem-Quine problem than non-laboratory experiments,
which in turn deal with it better than natural experiments.
Note that dealing with the Duhem-Quine problem in this way requires a whole series of
experiments, not merely a single experiment aimed at proving or disproving some grand theory.
Instead, the hypotheses experimenters test with their experiments concern low-level claims about
the accuracy of a measuring tool, or the influence of some background factor. Such sources of
error are eliminated through experiments, which in turn increases the confidence with which one
can draw inferences from data to phenomena, or from phenomena to theory. Testing a theoretical
hypothesis thus involves a whole research program – a long series of related experiments –
rather than a single crucial experiment. Note how long such research programs may last:
Langerhans discovered the pancreatic islets in 1869, and it took more than 50 years until the
hypothesis was confirmed that these islets were connected to human metabolism!
Experiments also have a special role in testing causal theories. Causal theories propose
relationships between causes and effects. A commonsensical idea about causation is that causal
relationships are relationships that are potentially exploitable for purposes of manipulation and
control: very roughly, if C is genuinely a cause of E, then if I can manipulate C in the right way,
this should be a way of manipulating or changing E (cf. Woodward 2003). According to this
view, testing a causal claim involves changing C while holding constant everything else, and
observing whether a change occurs in E. This is often very difficult to do in non-experimental
observations, as one may not be able to find the required changes in C, or these changes only
occur with other background conditions changing as well. In experiments, however, we are often
able to manipulate the relevant causes, and keep background conditions constant. Hence
experiments are a particularly suitable instrument for testing causal hypotheses.
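The contrast between manipulating a cause and passively observing it can be illustrated with a toy simulation. All numbers and the functional form below are hypothetical; the point is only that intervening on C while holding the background factor B fixed recovers the stipulated causal effect, whereas observing C and B varying together does not:

```python
def E(c, b):
    """Hypothetical causal structure: E is caused by C (with effect size 2) and by B."""
    return 2.0 * c + b

# Experiment: manipulate C while holding the background factor B constant.
experimental_change = E(c=1.0, b=5.0) - E(c=0.0, b=5.0)   # recovers the true effect, 2.0

# Passive observation: in the field, B happens to covary with C (say B = 3*C),
# so the observed change in E mixes the two influences.
observed_change = E(c=1.0, b=3.0) - E(c=0.0, b=0.0)       # 5.0, not the causal effect
```

The observational comparison attributes to C what is partly the work of B; only the controlled intervention isolates the causal relationship.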
Experimentation is not limited to a contributing role in theory testing. In many scientific areas,
theory is not properly developed – and yet, some of these very areas are very fertile grounds for
experimenters. Thus, besides theory testing, another crucial role of experiments is that of
discovering new phenomena. Recall that phenomena are distinguished from mere observational
data. Phenomena are not mere reports of data points and distributions of these, but rather identify
relatively stable features that characterize such events. Scientific theories are about phenomena,
not data. At the same time, observational data is used as evidence for phenomena.
Scientists often make observations that suggest phenomena that have not been accounted for by
theory. Take the following example from social science.
In the iterated public goods game experiment, subjects are given a number of tokens, which they can exchange
for money at the end of the experiment. They then secretly choose how many of their private tokens to put into
a public pot. At the end of each round, the public pot is multiplied by a factor larger than 1 and then evenly split
between the subjects. Each subject keeps the tokens they do not contribute, plus an even split of the tokens in
the pot. The same group of subjects then plays this game over a series of rounds.
The group as a whole does best when everyone contributes all of their tokens into the public pool. If everyone
puts every token they start with into the pot then the group will extract the maximum total reward from the
economists running the test. However, game theory predicts that no player will contribute to the pot, because
any player does better contributing zero than any other amount regardless of whatever anyone else does.
However, zero contributions are rarely seen in experiments; people do tend to add something into the pot. The
actual initial level of contribution varies widely. Yet whatever the initial level of contribution is, it is generally
observed that it declines with time. However, the amount contributed to the pool rarely drops to zero when
rounds of the game are iterated (Burlando & Guala 2005). After many replications of this experiment, the
majority of scientists have concluded that this limited decay is indeed a stable phenomenon.

Figure 3
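The incentive structure behind these predictions is easy to make explicit. A minimal sketch (the endowment and multiplier are illustrative values, not those of any particular study):

```python
def payoffs(contributions, endowment=20, multiplier=1.6):
    """One round of a public goods game: each subject keeps what they did not
    contribute, plus an even share of the multiplied public pot."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

# With everyone else contributing 10 tokens, the free rider does best individually...
mixed = payoffs([0, 10, 10, 10])      # [32.0, 22.0, 22.0, 22.0]
# ...yet the group total is maximized when everyone contributes everything.
full = payoffs([20, 20, 20, 20])      # 32.0 each, group total 128.0
```

Since each contributed token returns only multiplier/n (here 0.4) to its contributor, contributing less is always individually better, which is why zero contributions are the game-theoretic prediction and the observed positive, slowly decaying contributions are a genuine anomaly.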
If limited decay is indeed a new phenomenon, it calls for a new theory, as no current theory
accounts for this phenomenon. In that case, experimenters are ahead of theory: they have isolated
certain features in a laboratory situation that are stable enough to call for theoretical
development. Thus experiments, by producing such “freestanding phenomena” (Guala 2005, 46),
“exhibits” (Sugden 2006) or “bottled phenomena” (Kahneman), can drive theory development
instead of merely following it in testing theories.
Such “freestanding phenomena” obviously contribute to scientific knowledge. Because the
ability to manipulate background conditions and to design an intervention makes identifying new
phenomena easier – and in particular because intervention and control are needed to ascertain
that a phenomenon is stable – experiments are even better suited than other empirical methods to
discover new phenomena.
I started out with a practical example of non-scientific experimentation: how to determine the
cause of the short circuit. The goal of this experiment was neither to test theory nor to discover
new phenomena, but rather to solve a practical problem. Scientists themselves often engage in
such practically-minded experiments. Medical researchers, for example, routinely evaluate the
effects of certain drugs on human health without testing hypotheses about the underlying
mechanisms. Similarly, policy makers often seek trials for new social policies, without testing
underlying mechanisms or aiming to establish new phenomena.
Aspirin, the most widely used drug in the world, was initially prescribed as an analgesic to relieve minor aches
and pains, as an antipyretic to reduce fever, and as an anti-inflammatory medication. It is believed that this
effect is due to Aspirin's ability to inactivate the cyclooxygenase (COX) enzyme, which in turn suppresses the
production of prostaglandins and thromboxanes. This theory gave rise to the idea that Aspirin may be useful for
the primary prevention of heart attack, as heart attack is caused by blood clotting, and thromboxanes are the
main agents of blood clotting. However, the assessment of Aspirin as a primary preventive drug against heart
attack did not proceed through testing theoretical hypotheses about biochemical effects of Aspirin on blood
clotting. Rather, what clinical trials tested how much aspirin decreased the overall incidence of heart attacks
and ischaemic strokes amongst persons without cardiovascular problems, and how much the risks of
hemorrhagic strokes and gastrointestinal bleeding completely offset these benefits of aspirin. Further trials are
still in progress, feeding into a current debate whether Aspirin should be routinely prescribed to persons without
cardiovascular problems (Baigent et al. 2009).

In studies that are directed to such practical goals, the need for an experimental method is
particularly obvious. Because a lot of background factors and underlying mechanisms are only
incompletely known, such studies seek to isolate the effect of an intervention on a particular
feature or phenomenon. This requires, first, that such an intervention is performed, which is often
only possible in an experimental setting. Second, it requires that all potential background
conditions are controlled for, even those that are not known. Finally, it requires that the ensuing
effects can be monitored without disturbance, often over a long period of time. It is for this
reason that in medical research, evidence from certain kinds of experiments is ranked as “higher-
quality evidence” than that from observational studies or expert opinion.

4. How Should We Run Experiments?


It is the ability to interfere and manipulate, and through that the higher degree of control, that
makes experiments so special for the scientific project. Yet with these higher abilities also comes
an increased possibility of error. In particular, experimenters are prone (i) to misunderstand the
nature of the experimental manipulation, (ii) to describe wrongly the background conditions of
the experimental setting, and (iii) to misread the experimental observations. Any of these
mistakes leads to experimental artefacts – an interpretation of the experiment that is a mere
illusion, unconnected to real causes and real phenomena. An example of an experimental artefact
is the famous Hawthorne effect:
During the 1920s and early 1930s, the Hawthorne Works was the site of a series of landmark human behavior
studies that examined how fatigue, monotony and supervision on an assembly line dramatically affected
productivity. The first phase of the studies was designed to find the level of illumination that made the work of
female coil winders, relay assemblers and small parts inspectors more efficient.
Workers were divided into test and control groups. Lighting for the test group was increased from 24 to 46 to
70 foot-candles. Production of the test group increased as expected, which seemed to indicate that better
lighting conditions improved productivity. But then it was found that production of the control group increased
approximately the same amount. The experimenters were confused. Ultimately, they argued that a factor they
had not controlled for had affected the change: the experimental subjects knew that they were observed, and
this knowledge made them work harder and increase productivity.

Here is an example of a medical experimental artefact:


Transplantation of human embryonic dopamine neurons into the brains of patients with Parkinson’s disease has
proved beneficial in open clinical trials. Initially, it was believed that the transplanted neurons themselves
contributed to the improvement. In a number of more recent experiments, however, a control group received “sham
surgery”: holes were drilled in the skull but the dura was not penetrated. The mean (±SD) scores on the global
rating scale for improvement or deterioration at one year were 0.0±2.1 in the transplantation group and
0.4±1.7 in the sham-surgery group. There was no significant improvement in older patients (60+) in the
transplantation group, as compared with the sham-surgery group (Freed et al. 2001).

Such experimental artefacts may make one reject true hypotheses, claim false causal relations,
present illusory phenomena, or take inefficient actions. It is therefore of great importance to
design and perform experiments in such a fashion that experimental artefacts are avoided.
Experiments free of artefacts are called internally valid. In this section, I will discuss ways to
design internally valid experiments.
The means to avoid experimental artefacts lie in the various ways of exerting experimental control.
Control consists both in accurately identifying the features that are relevant for an experimental
result, and in being able to influence these features in such a way that alternative explanations
of the experimental result can be ruled out. The original Hawthorne experiment, for example,
failed to control for the influence of observation on workers’ behaviour, while the
transplantation experiments failed to control for surgery placebo effects. Both experimental
designs thus failed to identify a feature relevant for the experimental outcome. But the two
experimental studies reacted differently to this failure: the transplantation study found a way to
control this feature, by performing sham surgery on a control group, and thereby showing that a
placebo effect was indeed responsible. The Hawthorne experimenters concluded that it must have
been observation that influenced workers’ productivity, but were never able to actually control for
observation – perhaps because one cannot experimentally observe such a large group of people
without being recognized.
How do experimenters identify the relevant features? One of the things that you learn when you
are trained as an experimenter is the list of likely flaws that are worth checking in a given
experimental context. Such a list is part of what Thomas Kuhn (1962) called the paradigm: a set
of rules for doing good normal science. Partly, this information comes from established theories:
from them we often know that certain features might have an influence on the experimental result,
and that we had better control for them. Partly, this information also comes from related experiments.
A lot of experiments, after all, concern low-level claims about the accuracy of a measuring tool,
or the influence of some background factor. Hunting possible errors is a crucial function of
experiments themselves, so that most experiments are not one-shot activities, but belong to a
long series of related trials. Finally, the list is also partly determined by purely social factors.
For example, experimental economists worry a lot about the influence of deception on experimental subjects,
and consequently insist on offering only real monetary incentives – while experimental psychologists are often
quite happy to offer fictional rewards, which are not translated into real material gain at the end of the
experiments.

Identifying relevant factors thus is an open-ended procedure: at a given moment in time,
experimenters may not be able to control for all possible flaws, because some of them may not
even be known or conceivable given the present state of scientific knowledge and convention.
Many scientific pioneers have missed important factors that their successors have discovered to be in need of
control. One example of this is a famous plant physiological experiment that was performed in the 17th century
by the Belgian Jan Baptista van Helmont (1580 – 1644). In order to find out where growing plants take their
new material from, he planted a weighed willow plant in a pot with dried and weighed earth, and then had it
watered with rain water. After five years the earth and the willow plant were dried and weighed again. The
plant had increased considerably in weight, whereas the earth had only lost a small amount of weight. Since
only water had been added, van Helmont concluded that the plant mass had been formed solely by water. Only
later was it realized that air could contribute matter to the plant.

But let’s start with those features that are identifiable with current scientific knowledge and
convention. How do experimenters actually exert control?
In the ideal case, we can control a factor by eliminating its effect on the experimental result
altogether. For example, we eliminate the effects of air pressure and air movement by creating a
vacuum. We eliminate the effect of gravitation by performing an experiment in outer space. Or
we exclude the effects of external electromagnetic fields by using a Faraday cage.
Blinding is also an elimination method. Single-blinding leaves experimental subjects in the dark
about what their experimental treatment is. For example, Parkinson patients are not told whether
they got a neuron implant or only sham surgery. This eliminates expectation effects.
In double-blinding neither the individuals nor the researchers know who belongs to the control
group and the experimental group. This eliminates effects that the experimenter’s expectation
could have either directly on the experimental result, or indirectly through influencing the
experimental subjects. Note that while single-blinding only makes sense in experiments with
cognisant beings, double-blinding may be relevant also in experiments with inanimate matter.
Take the following example:

The chemist Michel Eugène Chevreul (1786 – 1889) made observations in the 1830s that initially made him
believe that a pendulum reacted to some previously unknown natural force. As a pendulum he used an iron ring
hanging on a string. When he held it over mercury, it gave a clear indication. When he put a pane of glass
between the iron ring and the mercury, the indication disappeared. It seemed as if the pendulum was influenced
by a force that was shielded by the pane of glass.
But Chevreul was not fully convinced. Could this possibly be what we would today have called a psychological
phenomenon? Perhaps his own expectations made the pendulum swing? In order to settle the issue he had to
perform an experiment in which he could separate a possible unknown physical force from a possible
expectation effect. He did this by eliminating the expectation effect. He repeated the experiment with blind-
folded eyes. An assistant now and then introduced the glass pane between the pendulum and the mercury, but
without letting Chevreul know when he did this. If it was only the effects of expectations that ruled the
pendulum, the effect would then disappear, otherwise it would remain.
It turned out that when Chevreul, who was himself holding the pendulum, did not know that the iron was
shielded from the mercury, then shielding had no effect on the movement of the pendulum. Chevreul drew the
correct conclusion from this, namely that his own expectations, conveyed through small subconscious muscular
movements, ruled the pendulum. He wrote about his experiment in a letter to Ampère in 1833. Since then, a
large number of experiments, both with the pendulum and with the dowsing rod, have shown that dowsing is
fully explained by the expectation effect.

There are also cases in which confounding factors cannot be eliminated. Often, effect separation
can then be used instead. This means that an experimental set-up is constructed such that the
confounding effects and those of the studied phenomenon can be registered separately.
An important modern example of effect separation is the use of magnetic fields to separate
particles with different charges. If an experiment aims at measuring neutron radiation, it must be
distinguished from radiation with charged particles. This can be achieved by letting the radiation
pass through a magnetic field, where the charged particles are deflected.
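The physical principle behind this separation is standard and can be stated in one line: a particle of charge $q$ moving with velocity $\vec{v}$ through a magnetic field $\vec{B}$ experiences the Lorentz force

```latex
\vec{F} = q\,\vec{v} \times \vec{B}
```

For neutrons $q = 0$, so $\vec{F} = 0$: they pass through the field undeflected, while charged particles are bent onto curved paths. Reading off the deflection thus registers the two kinds of radiation separately.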
Yet in many experiments, neither elimination nor effect separation can be achieved. To observe
the effect of changing work conditions on factory workers, as in the Hawthorne case, without
them knowing that they are observed, is a hard task to complete. Nor can the effect of
observation be separated from the effect of the change in work conditions. In that case, the
suspicion that another, uncontrolled feature influenced the experimental result came from the
observation that the productivity increase persisted however the controllable conditions were
varied.
The Hawthorne case points to a widespread problem in designing experiments: we often do not
know that a factor is important (at the time there was no theory of reactivity, nor were there
scientific conventions to address such issues). When it is not known which factors are important,
it is often useful to perform preliminary experiments with extreme values of the factors suspected
to have an influence. The factors that have an influence at such extreme values are selected then
for being controlled in continued experiments.
In order to run such exploratory experiments, we need to vary one feature while holding other
factors constant. When for example exploring the effect of a drug in animal experiments, this is
achieved by using animals that are genetically as similar as possible, and treating them as equally
as possible with respect to food, care, temperature, etc. The only difference should be that the
experimental groups receive the drug, whereas the control group does not.
But it is not always possible to hold all other features constant. In a clinical experiment, where
the drug is tested on patients, we can of course not keep genetics or food constant. Instead,
patients are distributed randomly between groups that receive different treatments. If the number
of patients is sufficiently large, the background factors will then be reasonably evenly distributed
between the groups. The effects of this are essentially the same as those of constancy, namely
that these factors will not have any major influence on the outcome. With respect to the patient
groups as a whole, randomisation can be seen as a means to achieve constancy.
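To make this logic concrete, here is a minimal simulation sketch. The numbers are made up for illustration, not taken from any actual trial: a pool of patients, some of whom carry a background factor we cannot hold constant directly, is shuffled and split into two groups.

```python
import random

random.seed(0)  # fixed seed, so the sketch is reproducible

# Hypothetical pool: 1000 patients, 30% carry some background factor
# (e.g. a risk gene) that cannot be held constant directly.
patients = [{"has_factor": i < 300} for i in range(1000)]

# Randomisation: shuffle the pool, then split into two equal groups.
random.shuffle(patients)
treatment, control = patients[:500], patients[500:]

def prevalence(group):
    """Share of group members carrying the background factor."""
    return sum(p["has_factor"] for p in group) / len(group)

# With large enough groups, the factor ends up roughly evenly
# distributed -- randomisation mimics holding it constant.
print(prevalence(treatment), prevalence(control))
```

With 500 patients per group, the two prevalences typically differ by only a few percentage points; with very small groups, the same procedure can easily produce badly unbalanced groups, which is precisely why randomisation requires sufficiently large samples.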
In particular for clinical trials, randomized, double-blind experiments are considered evidence
of the highest quality (see figure 4).

Figure 4 (U.S. Preventive Services Task Force 1989)


Yet randomization is not a cure-all and has its own specific problems. One issue is that the
number of experimental subjects must be sufficiently large for randomization to work. With rare
phenomena, a large experimental sample may be difficult to obtain – and small samples will
invalidate any conclusion from randomized studies!
Another issue is that randomization eliminates selection biases (and hence is essentially the same
as holding factors constant) by recreating the population’s distribution of features in the
experimental groups. Results from such an experiment then may not be valid for another
population, if the relevant features are distributed differently in that population! For example,
you may want to hold constant the influence of a genetic factor on the phenomenon of interest by
randomization. But the relevant genetic feature may be more prevalent in the population from
which you draw the experimental subjects than in the population to which you intend to apply
the experimental results. This may lead to invalid inferences.
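A toy calculation, with hypothetical numbers chosen purely for illustration, shows how such a prevalence mismatch distorts the inference. Suppose a drug helps carriers of the genetic factor much more than non-carriers:

```python
# Hypothetical effect sizes: the drug's benefit depends on a genetic factor.
effect_with_factor = 10.0     # symptom reduction in carriers
effect_without_factor = 2.0   # symptom reduction in non-carriers

def average_effect(prevalence):
    """Population-average effect for a given carrier prevalence."""
    return (prevalence * effect_with_factor
            + (1 - prevalence) * effect_without_factor)

# The experimental pool over-represents carriers relative to the
# target population to which the result is to be applied.
experiment_estimate = average_effect(0.5)   # 50% carriers in the sample
target_effect = average_effect(0.05)        # 5% carriers in the target

print(experiment_estimate, target_effect)   # 6.0 2.4
```

The experiment's average effect (6.0) would badly overstate what the drug does in the target population (2.4), even though the experiment itself was internally valid.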
But let’s leave these technical fine points and return to the bigger picture. Experimenters have a
number of techniques available for exerting experimental control. These techniques will not
work in all cases; furthermore, the experimenter never knows the whole list of relevant factors
that need to be controlled. Thus the problem of identifying and controlling background
conditions is essentially open-ended, and experimentation a continuous process.
In this continuous process, experimenters pursue two avenues to maintain internal validity: on
the one hand, they attempt to reproduce or repeat already performed experiments, in order to
ensure that the controls were indeed sufficient for the result to occur and did not hinge on some
freak accident.
All scientific journals of importance have referees that assess the articles before they are published. The journal
Organic Synthesis goes one step further. The journal publishes methods of chemical synthesis. Before a method
can be published it has to be reproduced by another researcher (a member of the board of editors). The methods
are also described in more detail than in most other chemical journals.
The reason for this unusual practice is that it is difficult to describe a complicated organic synthesis so precisely
that the description can serve as a guide for repeating the synthesis. To obtain repeatability an extra control has
therefore been introduced, which is not common in other areas.

On the other hand, experimenters seek to replicate results. Replication and repetition need to be
clearly distinguished (Guala 2005, 14): while a repetition indeed reproduces an experiment in all
of its known details, a replication seeks to reproduce the result of an experiment by means of a
slightly or even radically different design. For example, the decay phenomenon in the public
goods game from above has been replicated with heterogeneous groups, higher payoffs, different
multiplication factors, and many more.
The prime motivation to replicate experiments is to see how stable the observed phenomenon is,
by varying some of the background conditions. Indeed, experimenters often spend a lot of time
trying “to make the [observed experimental] effect go away” (Galison 1997). This is not some
perverse wish to destroy one’s own or others’ work, but rather an integral part of doing
experimental science in a context where not all relevant background conditions are known, and
knowledge is continuously being revised and expanded. Experimenters should try as hard as they
can to find an experimental set-up that starts from varied but similar conditions to the original
experiment, but does not yield the original result. If they fail at that – within reasonable
variations of conditions – they may conclude that their experimental efforts have established a
phenomenon that may give rise to new theory, reject an existing theory, or help in taking
efficient action. If they succeed, they have to go back to the lab, seeking increased control
through the techniques at their disposal.
It is not clear that there are rational stopping rules for such a process (Franklin 1994) – in the
future, we may always find new relevant background conditions, or fail at a relevant replication,
or learn about a biased measurement procedure. Experimentation, like all empirical science, is
fallible, after all. But that does not prevent us from trying to make experiments and inferences
from them as good as we can, given our current background knowledge. Against that knowledge,
we can decide whether an inference from a particular experimental design is internally valid or
not. That there may be unknown flaws is not a good reason not to accept an experimental result.

4. How to Interpret Experiments?


Some experiments do not need much of an interpretation. The little household experiment with
which I began this lecture was run in the context in which efficient action needed to be taken. As
long as the experiment was internally valid, inferences from it to the real world were trivial: just
exchange that appliance that blew the fuse in the experiment, stupid! It was trivial, because the
stuff that was experimented on was the stuff that we wanted to affect through our action.
Some real experiments are close to this trivial case. The Broad Street Cholera “natural
experiment” immediately led Snow to the real source of contamination. Banting and Best’s trial
of insulin in the children’s ward led them to conclude directly that their new substance regulated
sugar levels in diabetics.
But many other experiments do not offer such trivial inferences. Take the decay phenomenon in
the repeated public goods game – what can we infer from that experimental result about the
world? Surely not that public morale is deteriorating everywhere?
My point here is that we often perform experiments in ways that make inferring from them to the
phenomena of our interest non-trivial. Medical researchers test drugs on rats, even though they
develop these drugs for humans. Economists run market experiments in classrooms, even though
prices are not set at university. Biologists study behaviour of animals in captivity. Physicists
build billion-dollar machines to let minuscule particles collide. These experiments, even if
internally valid, have no obvious relevant real-world target to which their results can be trivially
applied. Instead, inferences from them to relevant real-world targets require justification.
Philosophers call justified inferences of this sort externally valid.
In biomedical research, for example, the inference may take the following form:
1. Humans have symptoms Y
2. Laboratory animals have symptoms Y
3. In laboratory animals, the symptoms are caused by factor X (e.g. bacteria, toxin, or
deficiency)
The human symptoms are therefore caused by X
It should be obvious to you that such an inference is not generally valid. It requires extra
information, not stated in (1) to (3), in order to justify the conclusion. Consequently, inferences
from many experiments are not obviously externally valid. But what kind of information would
justify these inferences?
One could answer with a counterquestion: Why then engage in such experiments at all? Why not
just aim at those experiments whose interpretation is trivial? First, because it is not always
practically feasible. Running an experiment in a real market, for example, would require
resources above the means of most central banks. Scientists just don’t have that kind of money,
or that amount of time, to run such experiments.
Second, there is a serious ethical concern here. Experimenting with people’s livelihoods, or with
their health, is not something that scientists are morally permitted to do.
Third, scientists often seek a compromise between internal and external validity. From the
examples I just gave, you may have noted that external validity is easier to achieve for natural
experiments than for field experiments, and easier for field experiments than for laboratory
experiments. But the ability to control is highest in laboratory experiments, less high in field
experiments, and even lower in natural experiments. Hence, most of the time our confidence in
the internal validity of laboratory experiments is higher than that of field experiments, which in
turn is higher than that of natural experiments. There is a trade-off between internal and external
validity, and it may be wise to sacrifice some external validity for a sufficient degree of
confidence in the internal validity of an experiment.
So, there are reasons to choose experimental designs that raise the issue of external validity. But
of course not at any price. The point of experimenting, after all, is to increase our knowledge
about the real world. In the extreme case, however, such experiments would not allow any
inferences to the world outside the laboratory at all. Such a conclusion should be unacceptable
for any empirical scientist. Instead, ways must be found to establish the external validity of
experiments.
At least three strategies of establishing the external validity of an experiment for a given real-
world target can be distinguished. The first is to design the experiment in such a way that it
indeed mirrors the real target. Where possible, this is a good strategy: if we manipulate a cause
in the experiment that is the same as in the target, then we can also expect to see the same result.
But I just mentioned a few reasons why this is often not possible or desirable. In scientific
practice the conditions of experiments and their real-world targets are rarely the same!

Another strategy is to always direct the focus of experiments towards eliminating false theories,
and only apply the best (remaining) theories to the world. External validity problems are solved
by basically abstaining from drawing conclusions – about phenomena, or efficient courses of
action – directly from experiments. Instead, by going through theories when applying
experiments, the best theories bundle all the available experimental evidence, thus decreasing the
probability of making externally invalid inferences from a single experiment.
While this is a strategy that is obviously very widespread in the sciences, it is not a cure-all, as
many scientific questions – both concerning understanding as well as practical actions – are not
covered by theory. Theories are almost never complete, covering all aspects of the real world.
With theories incomplete, we will always have to answer questions that are not covered by
established theories, leaving only direct inferences from empirical investigations as an option.
A third strategy is to show that there are relevant analogies between experimental and real-world
phenomena (Guala 2005, 193). Seeking analogies is a lot weaker than seeking to make
experiments and real-world targets the same. We know that the target is different from the
experimental situation in certain respects. By seeking an analogy, we ask whether these
differences can confound the external validity inference or not.
The first thing to note is that the situation is somewhat similar to the relation between models and
their targets: we know that the model and target are not identical, but we wonder whether the
model is good enough to make the desired inferences. Indeed, there is a considerable overlap
between some types of models (e.g. computer simulations, model organisms) and experiments in
need of external validation.
The second thing to note is that the claim of analogy between experiment and target is itself an
empirical hypothesis: we have to observe both the experiment and the target to see whether the
analogy is true. In particular, we have to check whether differences between experiment and
target system can confound the external validity inference. This requires identifying the
differences, and checking them in the laboratory by incorporating them one by one into the
experimental design.
The third thing to note here is that testing the analogy hypothesis requires a lot of empirical
knowledge about the target. So we need to run empirical studies on the target system, in order to
check the analogy. We cannot just rely on the experiment alone to tell us how the target works.
The experiment is only an intermediate step, not the whole story: we must show
that both experimental and field evidence have been generated by systems that are similar in all
relevant respects. “Strong external validity inferences begin and end in the field” (Guala 2005,
199).

5. Summary
An experiment is a controlled observation in which the observer manipulates the variables that
are believed to influence the outcome. Experiments are thus distinct from observational studies,
including “natural experiments”. Experiments are often performed in the laboratory, but also “in
the field”. Commonly, the laboratory offers a higher degree of experimental control.
Experiments have a special epistemic role amongst empirical studies. They offer better means to
deal with the Duhem-Quine problem in theory testing, and better means to test causal theories.
They are particularly apt for the discovery of new phenomena, and for addressing practical
problems. Experiments have this special role because they allow a higher degree of control than
other empirical studies, and because they allow intervention and manipulation.
The quality of inference from experiments has two aspects. Internally valid inferences are
conclusions that avoid experimental artefacts. Externally valid inferences justify conclusions
from experiments to their targets. Internal validity depends on the sufficient degree of
experimental control. External validity depends on the sufficiently strong analogy between
experiment and target. Internal and external validity of experiments often trade off against each
other.
Experimental control consists in identifying factors that influence the phenomenon of interest,
and controlling these factors. Typical techniques include elimination, effect separation,
constancy and randomization. Because control is never complete, it is important to repeat and to
replicate experiments.
External validity is an important problem, because we often cannot or do not want to experiment
on the real-world target itself. Making externally valid claims involves either making the
experiment mirror the target, applying experimental evidence only through well-confirmed
theory, or establishing strong analogies between experiment and target.

6. References
Baigent C, Blackwell L, Collins R, et al. (2009) Aspirin in the primary and secondary prevention of vascular
disease: collaborative meta-analysis of individual participant data from randomised trials. Lancet 373 (9678):
1849–60.
Burlando, R. M. and F. Guala (2005) Heterogeneous Agents in Public Good Experiments. Experimental Economics 8: 35-
54.
Franklin, A. (1994) How to Avoid the Experimenters' Regress. Studies in History and Philosophy of Science Part A
25 (3):463-491
Freed C, Greene P, Breeze R, et al. (2001) Transplantation of embryonic dopamine neurons for severe Parkinson’s
disease. N Engl J Med. 344:710-719.
Freedman, David A. (2005) Statistical Models: Theory and Practice, Cambridge University Press.
Fruteau C, Voelkl B, van Damme E, Noe R (2009) Supply and demand determine the market value of food
providers in wild vervet monkeys. Proc Natl Acad Sci USA 106: 12007–12012.
Galison, P. (1997) Image and Logic. University of Chicago Press.
Guala, F. (2005) The Methodology of Experimental Economics. Cambridge University Press
Hacking, I. (1992) The Self-Vindication of the Laboratory Sciences, in A. Pickering. (ed.) Science as Practice and
Culture. Chicago: University of Chicago Press
Kuhn, T. (1962) The Structure of Scientific Revolutions. Chicago, IL: University of Chicago Press.
Richards, P. (1989) Farmers also experiment: A neglected intellectual resource in African science. Discovery and
Innovation 1: 19-25.
Roth, A. E. (1986) Laboratory Experimentation in Economics, Economics and Philosophy 2: 245–73.
U.S. Preventive Services Task Force (August 1989). Guide to clinical preventive services: report of the U.S.
Preventive Services Task Force. DIANE Publishing.
Woodward, J. (2003) Making Things Happen. Oxford University Press, Oxford.
