
The Scientific Study of Politics

OVERVIEW
Most political science students are interested in the substance of politics
and not in its methodology. We begin with a discussion of the goals of this
book and why a scientific approach to the study of politics is more interesting
and desirable than a "just-the-facts" approach. In this chapter we provide an
overview of what it means to study politics scientifically. We begin with an
introduction to how we move from causal theories to scientific knowledge,
and a key part of this process is thinking about the world in terms of models
in which the concepts of interest become variables that are causally linked
together by theories. We then introduce the goals and standards of political
science research that will be our rules of the road to keep in mind throughout
this book. The chapter concludes with a brief overview of the structure of
this book.
Doubt is the beginning, not the end, of wisdom.
- Chinese proverb
1.1 POLITICAL SCIENCE?
"Which party do you support?" "When are you going to run for office?"
These are questions that students often hear after announcing that they
are taking courses in political science. Although many political scientists
are avid partisans, and some political scientists have even run for elected
offices or have advised elected officials, for the most part this is not the
focus of modern political science. Instead, political science is about the
scientific study of political phenomena. Perhaps like you, a great many of
today's political scientists were attracted to this discipline as undergraduates
because of intense interests in a particular issue or candidate. Although we
are often drawn into political science based on political passions, the most
respected political science research today is conducted in a fashion that
makes it impossible to tell the personal political views of the writer.

(Source: Kellstedt and Whitten (2013), The Fundamentals of Political Science Research, Chapter 1.)
Many people taking their first political science research course are sur-
prised to find out how much science and, in particular, how much math are
involved. We would like to encourage the students who find themselves in
this position to hang in there with us, even if your answer to this encourage-
ment is "but I'm only taking this class because they require it to graduate,
and I'll never use any of this stuff again." Even if you never run a regression
model after you graduate, having made your way through these materials
should help you in a number of important ways. We have written this book
with the following three goals in mind:
- To help you consume academic political science research in your other
  courses. One of the signs that a field of research is becoming scientific
  is the development of a common technical language. We aim to make
  the common technical language of political science accessible to you.
- To help you become a better consumer of information. In political
  science and many other areas of scientific and popular communication,
  claims about causal relationships are frequently made. We want you
  to be better able to evaluate such claims critically.
- To start you on the road to becoming a producer of scientific research
  on politics. This is obviously the most ambitious of our goals. In our
  teaching we often have found that once skeptical students get comfort-
  able with the basic tools of political science, their skepticism turns into
  curiosity and enthusiasm.
To see the value of this approach, consider an alternative way of learn-
ing about politics, one in which political science courses would focus on
"just the facts" of politics. Under this alternative way, for example, a course
offered in 1995 on the politics of the European Union (EU) would have
taught students that there were 15 member nations who participated in
governing the EU through a particular set of institutional arrangements
that had a particular set of rules. An obvious problem with this alternative
way is that courses in which lists of facts are the only material would prob-
ably be pretty boring. An even bigger problem, though, is that the political
world is constantly changing. In 2011 the EU was made up of 27 member
nations and had some new governing institutions and rules that were dif-
ferent from what they were in 1995. Students who took a facts-only course
on the EU back in 1995 would find themselves lost in trying to understand
the EU of 2011. By contrast, a theoretical approach to politics helps us to
better understand why changes have come about and their likely impact on
EU politics.
In this chapter we provide an overview of what it means to study pol-
itics scientifically. We begin this discussion with an introduction to how
we move from causal theories to scientific knowledge. A key part of this
process is thinking about the world in terms of models in which the con-
cepts of interest become variables¹ that are causally linked together by
theories. We then introduce the goals and standards of political science
research that will be our rules of the road to keep in mind throughout this
book. We conclude this chapter with a brief overview of the structure of
this book.
1.2 APPROACHING POLITICS SCIENTIFICALLY: THE SEARCH FOR
CAUSAL EXPLANATIONS
I've said, I don't know whether it's addictive. I'm not a doctor. I'm not a
scientist.
- Bob Dole, in a conversation with Katie Couric about tobacco during
the 1996 U.S. presidential campaign
The question of "how do we know what we know" is, at its heart, a
philosophical question. Scientists are lumped into different disciplines that
develop standards for evaluating evidence. A core part of being a scientist
and taking a scientific approach to studying the phenomena that interest
you is always being willing to consider new evidence and, on the basis of
that new evidence, change what you thought you knew to be true. This
willingness to always consider new evidence is counterbalanced by a stern
approach to the evaluation of new evidence that permeates the scientific
approach. This is certainly true of the way that political scientists approach
politics.
So what do political scientists do and what makes them scientists? A
basic answer to this question is that, like other scientists, political scientists
develop and test theories. A theory is a tentative conjecture about the causes
of some phenomenon of interest. The development of causal theories about
the political world requires thinking in new ways about familiar phenom-
ena. As such, theory building is part art and part science. We discuss this
in greater detail in Chapter 2, "The Art of Theory Building."
¹ When we introduce an important new term in this book, that term appears in boldface
type. At the end of each chapter, we will provide short definitions of each bolded term that
was introduced in that chapter. We discuss variables at great length later in this and other
chapters. For now, a good working definition is that a variable is a definable quantity that
can take on two or more values. An example of a variable is voter turnout; researchers
usually measure it as the percentage of voting-eligible persons in a geographically defined
area who cast a vote in a particular election.
[Figure 1.1. The road to scientific knowledge: causal theory → hypothesis → empirical test → evaluation of hypothesis → evaluation of causal theory → scientific knowledge.]
Once a theory has been developed, like all
scientists, we turn to the business of testing our
theory. The first step in testing a particular theory
is to restate it as one or more testable hypotheses.
A hypothesis is a theory-based statement about a
relationship that we expect to observe. For every
hypothesis there is a corresponding null hypothe-
sis. A null hypothesis is also a theory-based state-
ment but it is about what we would observe if
there were no relationship between an independent
variable and the dependent variable. Hypothesis
testing is a process in which scientists evaluate sys-
tematically collected evidence to make a judgment
of whether the evidence favors their hypothesis or
favors the corresponding null hypothesis. The pro-
cess of setting up hypothesis tests involves both
logical reasoning and creative design. In Chapter
3, "Evaluating Causal Relationships," we focus on
the logical reasoning side of this process. In Chapter 4, "Research Design," we
focus on the design part of this process. If a hypothesis survives rigorous
testing, scientists start to gain confidence in that hypothesis rather than in
the null hypothesis, and thus they also gain confidence in the theory from
which they generated their hypothesis.
Figure 1.1 presents a stylized schematic view of the path from theories
to hypotheses to scientific knowledge.² At the top of the figure, we begin
with a causal theory to explain our phenomenon of interest. We then derive
one or more hypotheses about what our theory leads us to expect when we
measure our concepts of interest (which we call variables, as was previously
discussed) in the real world. In the third step, we conduct empirical tests of
our hypotheses.³
From what we find, we evaluate our hypotheses relative
to corresponding null hypotheses. Next, from the results of our hypothesis
tests, we evaluate our causal theory. In light of our evaluation of our theory,
we then think about how, if at all, we should revise what we consider to be
scientific knowledge concerning our phenomenon of interest.
A core part of the scientific process is skepticism. On hearing of a
new theory, other scientists will challenge this theory and devise further
tests. Although this process can occasionally become quite combative, it is
a necessary component in the development of scientific knowledge. Indeed,
² In practice, the development of scientific knowledge is frequently much messier than this
step-by-step diagram. We show more of the complexity of this approach in later chapters.
³ By "empirical" we simply mean "based on observations of the real world."
a core component of scientific knowledge is that, as confident as we are in a
particular theory, we remain open to the possibility that there is still a test
out there that will provide evidence that makes us lose confidence in that
theory.
It is important to underscore here the nature of the testing that scien-
tists carry out. One way of explaining this is to say that scientists are not
like lawyers in the way that they approach evidence. Lawyers work for a
particular client, advocate a particular point of view (like "guilt" or "inno-
cence"), and then accumulate evidence with a goal of proving their case
to a judge or jury. This goal of proving a desired result determines their
approach to evidence. When faced with evidence that conflicts with their
case, lawyers attempt to ignore or discredit such evidence. When faced with
evidence that supports their case, lawyers try to emphasize the applicability
and quality of the supportive evidence. In many ways, the scientific and legal
approaches to evidence couldn't be further apart. Scientific confidence in a
theory is achieved only after hypotheses derived from that theory have run a
gantlet of tough tests. At the beginning of a trial, lawyers develop a strategy
to prove their case. In contrast, at the beginning of a research project, sci-
entists will think long and hard about the most rigorous tests that they can
conduct. A scientist's theory is never proven because scientists are always
willing to consider new evidence.
The process of hypothesis testing reflects how hard scientists are on
their own theories. As scientists evaluate systematically collected evidence to
make a judgment of whether the evidence favors their hypothesis or favors
the corresponding null hypothesis, they always favor the null hypothesis.
Statistical techniques allow scientists to make probability-based statements
about the empirical evidence that they have collected. You might think that,
if the evidence was 50-50 between their hypothesis and the corresponding
null hypothesis, the scientists would tend to give the nod to the hypothesis
(from their theory) over the null hypothesis. In practice, though, this is
not the case. Even when the hypothesis has an 80-20 edge over the null
hypothesis, most scientists will still favor the null hypothesis. Why? Because
scientists are very worried about the possibility of falsely rejecting the null
hypothesis and therefore making claims that others ultimately will show to
be wrong.
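To make this concrete, here is a minimal sketch (our own, not from the book) of what such a probability-based judgment can look like in practice, using Python's scipy library and invented data; the 0.05 threshold is the conventional standard under which even evidence that favors the hypothesis 80-20 is not enough to reject the null.

```python
# A minimal sketch of probability-based hypothesis testing with
# hypothetical data (not the book's example). We favor the null
# hypothesis unless the evidence against it is very strong.
from scipy import stats

# Hypothetical measurements of an independent and a dependent variable
x = [1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1]
y = [52.0, 50.5, 53.1, 54.0, 53.2, 55.9, 56.5, 58.0]

r, p_value = stats.pearsonr(x, y)  # correlation and its p-value

# By convention, scientists reject the null hypothesis only when the
# probability of seeing a relationship this strong by random chance
# is very small (commonly below 0.05), not merely below 0.5 or 0.2.
if p_value < 0.05:
    print(f"r = {r:.2f}, p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"r = {r:.2f}, p = {p_value:.3f}: favor the null hypothesis")
```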
Once a theory has become established as a part of scientific knowl-
edge in a field of study, researchers can build upon the foundation that this
theory provides. Thomas Kuhn wrote about these processes in his famous
book The Structure of Scientific Revolutions. According to Kuhn, scien-
tific fields go through cycles of accumulating knowledge based on a set of
shared assumptions and commonly accepted theories about the way that
the world works. Together, these shared assumptions and accepted theories
6 The Scientific Study of Politics
form what we call a paradigm. Once researchers in a scientific field have
widely accepted a paradigm, they can pursue increasingly technical ques-
tions that make sense only because of the work that has come beforehand.
This state of research under an accepted paradigm is referred to as nor-
mal science. When a major problem is found with the accepted theories
and assumptions of a scientific field, that field will go through a revolu-
tionary period during which new theories and assumptions replace the old
paradigm to establish a new paradigm. One of the more famous of these
scientific revolutions occurred during the 16th century when the field of
astronomy was forced to abandon its assumption that the Earth was the
center of the known universe. This was an assumption that had informed
theories about planetary movement for thousands of years. In the book
On Revolutions of the Heavenly Bodies, Nicolaus Copernicus presented his
theory that the Sun was the center of the known universe. Although this
radical theory met many challenges, an increasing body of evidence con-
vinced astronomers that Copernicus had it right. In the aftermath of this
paradigm shift, researchers developed new assumptions and theories that
established a new paradigm, and the affected fields of study entered into
new periods of normal scientific research.
It may seem hard to imagine that the field of political science has gone
through anything that can compare with the experiences of astronomers in
the 16th century. Indeed, Kuhn and other scholars who study the evolu-
tion of scientific fields of research have a lively and ongoing debate about
where the social sciences, like political science, are in terms of their devel-
opment. The more skeptical participants in this debate argue that political
science is not sufficiently mature to have a paradigm, much less a paradigm
shift. If we put aside this somewhat esoteric debate about paradigms and
paradigm shifts, we can see an important example of the evolution of sci-
entific knowledge about politics from the study of public opinion in the
United States.
In the 1940s the study of public opinion through mass surveys was in
its infancy. Prior to that time, political scientists and sociologists assumed
that U.S. voters were heavily influenced by presidential campaigns - and,
in particular, by campaign advertising - as they made up their minds about
the candidates. To better understand how these processes worked, a team
of researchers from Columbia University set up an in-depth study of public
opinion in Erie County, Ohio, during the 1944 presidential election. Their
study involved interviewing the same individuals at multiple time periods
across the course of the campaign. Much to the researchers' surprise, they
found that voters were remarkably consistent from interview to interview
in terms of their vote intentions. Instead of being influenced by particular
events of the campaign, most of the voters surveyed had made up their minds
about how they would cast their ballots long before the campaigning had
even begun. The resulting book by Paul Lazarsfeld, Bernard Berelson, and
Hazel Gaudet, titled The People's Choice, changed the way that scholars
thought about public opinion and political behavior in the United States.
If political campaigns were not central to vote choice, scholars were forced
to ask themselves what was critical to determining how people voted.
At first other scholars were skeptical of the findings of the 1944 Erie
County study, but as the revised theories of politics of Lazarsfeld et al. were
evaluated in other studies, the field of public opinion underwent a change
that looks very much like what Thomas Kuhn calls a "paradigm shift." In
the aftermath of this finding, new theories were developed to attempt to
explain the origins of voters' long-lasting attachments to political parties in
the United States. An example of an influential study that was carried out
under this shifted paradigm is Richard Niemi and Kent Jennings's seminal
book from 1974, The Political Character of Adolescence: The Influence
of Families and Schools. As the title indicates, Niemi and Jennings studied
the attachments of schoolchildren to political parties. Under the pre-Erie
County paradigm of public opinion, this study would not have made much
sense. But once researchers had found that voters' partisan attachments
were quite stable over time, studying them at the early ages at which they
form became a reasonable scientific enterprise. You can see evidence of
this paradigm at work in current studies of party identification and debates
about its stability.

1.3 THINKING ABOUT THE WORLD IN TERMS OF VARIABLES AND
CAUSAL EXPLANATIONS
So how do political scientists develop theories about politics? A key element
of this is that they order their thoughts about the political world in terms of
concepts that scientists call variables and causal relationships between vari-
ables. This type of mental exercise is just a more rigorous way of expressing
ideas about politics that we hear on a daily basis. You should think of each
variable in terms of its label and its values. The variable label is a descrip-
tion of what the variable is, and the variable values are the denominations
in which the variable occurs. So, if we're talking about the variable that
reflects an individual's age, we could simply label this variable "Age" and
some of the denominations in which this variable occurs would be years,
days, or even hours.
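As a minimal illustration (ours, not the book's), a variable can be represented in code as a label plus the values it takes on; the names below are hypothetical.

```python
# A hypothetical representation of a variable: its label plus the
# denominations (values) in which it occurs.
age = {
    "label": "Age",
    "unit": "years",                     # could also be days or even hours
    "values": [19, 23, 35, 35, 47, 62],  # one value per observation
}
print(age["label"], age["values"])
```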
It is easier to understand the process of turning concepts into variables
by using an example of an entire theory. For instance, if we're thinking
about U.S. presidential elections, a commonly expressed idea is that the
incumbent president will fare better when the economy is relatively healthy.
If we restate this in terms of a political science theory, the state of the
economy becomes the independent variable, and the outcome of presidential
elections becomes the dependent variable. One way of keeping the lingo of
theories straight is to remember that the value of the "dependent" variable
"depends" on the value of the "independent" variable. Recall that a theory
is a tentative conjecture about the causes of some phenomenon of interest.
In other words, a theory is a conjecture that the independent variable is
causally related to the dependent variable; according to our theory, change
in the value of the independent variable causes change in the value of the
dependent variable.
This is a good opportunity to pause and try to come up with your own
causal statement in terms of an independent and dependent variable; try
filling in the following blanks with some political variables:

__________ causes __________.

Sometimes it's easier to phrase causal propositions more specifically in terms
of the values of the variables that you have in mind. For instance,

higher __________ causes lower __________,

or

higher __________ causes higher __________.
Once you learn to think about the world in terms of variables you will be
able to produce an almost endless slew of causal theories. In Chapter 4 we
will discuss at length how we design research to evaluate the causal claims
in theories, but one way to initially evaluate a particular theory is to think
about the causal explanation behind it. The causal explanation behind a
theory is the answer to the question, "why do you think that this indepen-
dent variable is causally related to this dependent variable?" If the answer
is reasonable, then the theory has possibilities. In addition, if the answer is
original and thought provoking, then you may really be on to something.
Let's return now to our working example in which the state of the econ-
omy is the independent variable and the outcome of presidential elections
is our dependent variable. The causal explanation for this theory is that
we believe that the state of the economy is causally related to the outcome
of presidential elections because voters hold the president responsible for
management of the national economy. As a result, when the economy has
been performing well, more voters will vote for the incumbent. When the
economy is performing poorly, fewer voters will support the incumbent
candidate.

[Figure 1.2. From theory to hypothesis: the causal theory links the independent variable (concept) to the dependent variable (concept); operationalization turns each concept into a measured variable, and the hypothesis links the measured independent variable to the measured dependent variable.]

If we put this in terms of the preceding fill-in-the-blank exercise,
we could write
economic performance causes presidential election outcomes,
or, more specifically, we could write
higher economic performance causes higher incumbent vote.
For now we'll refer to this theory, which has been widely advanced and
tested by political scientists, as "the theory of economic voting."
To test the theory of economic voting in U.S. presidential elections, we
need to derive from it one or more testable hypotheses. Figure 1.2 provides
a schematic diagram of the relationship between a theory and one of its
hypotheses. At the top of this diagram are the components of the causal
theory. As we move from the top part of this diagram (Causal theory) to
the bottom part (Hypothesis), we are moving from a general statement
about how we think the world works to a more specific statement about a
relationship that we expect to find when we go out in the real world and
measure (or operationalize) our variables.⁴

⁴ Throughout this book we will use the terms "measure" and "operationalize" interchange-
ably. It is fairly common practice in the current political science literature to use the term
"operationalize."
At the theory level at the top of Figure 1.2, our variables do not need to
be explicitly defined. With the economic voting example, the independent
variable, labeled "Economic Performance," can be thought of as a concept
that ranges from values of very strong to very poor. The dependent vari-
able, labeled "Incumbent Vote," can be thought of as a concept that ranges
from values of very high to very low. Our causal theory is that a stronger
economic performance causes the incumbent vote to be higher.
Because there are many ways in which we can measure each of our
two variables, there are many different hypotheses that we can test to find
out how well our theory holds up to real-world data. We can measure
economic performance in a variety of ways. These measures include infla-
tion, unemployment, real economic growth, and many others. "Incumbent
Vote" may seem pretty straightforward to measure, but here there are also
a number of choices that we need to make. For instance, what do we do in
the cases in which the incumbent president is not running again? Or what
about elections in which a third-party candidate runs? Measurement (or
operationalization) of concepts is an important part of the scientific pro-
cess. We will discuss this in greater detail in Chapter 5, which is devoted
entirely to evaluating different variable measurements and variation in vari-
ables. For now, imagine that we are operationalizing economic performance
with a variable that we will label "One-Year Real Economic Growth Per
Capita." This measure, which is available from official U.S. government
sources, measures the one-year rate of inflation-adjusted (thus the term
"real") economic growth per capita at the time of the election. The adjust-
ments for inflation and population (per capita) reflect an important part
of measurement - we want our measure of our variables to be comparable
across cases. The values for this variable range from negative values for
years in which the economy shrank to positive values for years in which
the economy expanded. We operationalize our dependent variable with a
variable that we label "Incumbent Party Percentage of Major Party Vote."
This variable takes on values based on the percentage of the popular vote,
as reported in official election results, for the party that controlled the pres-
idency at the time of the election and thus has a possible range from 0 to
100. In order to make our measure of this dependent variable comparable
across cases, votes for third-party candidates have been removed from this
measure.⁵

⁵ If you're questioning the wisdom of removing votes for third-party candidates, you are
thinking in the right way: any time you read about a measurement you should think about
different ways in which it might have been carried out. And, in particular, you should focus
on the likely consequences of different measurement choices on the results of hypothesis
tests. Evaluating measurement strategies is a major topic in Chapter 5.
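As a quick illustration of this operationalization (our own sketch, with made-up vote totals rather than real election returns), the incumbent party's percentage of the major-party vote is computed by discarding third-party votes before taking the share:

```python
# Hypothetical raw vote totals for one election (not real data).
incumbent_party_votes = 39_000_000
other_major_party_votes = 44_000_000
third_party_votes = 19_000_000  # discarded for comparability across cases

# Incumbent Party Percentage of Major Party Vote: third-party votes are
# removed so the measure always ranges from 0 to 100 on the same basis.
major_party_total = incumbent_party_votes + other_major_party_votes
incumbent_pct = 100 * incumbent_party_votes / major_party_total
print(f"Incumbent share of the major-party vote: {incumbent_pct:.1f}%")
```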
[Figure 1.3. What would you expect to see based on the theory of economic voting? An empty graph with "One-Year Real Economic Growth Per Capita" on the horizontal axis (ranging from -20 to 20) and "Incumbent-Party Percentage of Major-Party Vote" on the vertical axis (ranging from 0 to 100).]
Figure 1.3 shows the axes of the graph that we could produce if we
collected the measures of these two variables. We could place each U.S.
presidential election on the graph in Figure 1.3 by identifying the point
that corresponds to the value of both "One-Year Real Economic Growth"
(the horizontal, or x, axis) and "Incumbent-Party Vote Percentage" (the
vertical, or y, axis). For instance, if these values were (respectively) 0 and
50, the position for that election year would be exactly in the center of
the graph. Based on our theory, what would you expect to see if we col-
lected these measures for all elections? Remember that our theory is that
a stronger economic performance causes the incumbent vote to be higher.
And we can restate this theory in reverse such that a weaker economic
performance causes the incumbent vote to be lower. So, what would this
lead us to expect to see if we plotted real-world data onto Figure 1.3? To
get this answer right, let's make sure that we know our way around this
graph. If we move from left to right on the horizontal axis, which is labeled
"One-Year Real Economic Growth," what is going on in real-world terms?
We can see that, at the far left end of the horizontal axis, the value is -20.
This would mean that the U.S. economy had shrunk by 20% over the past
year, which would represent a very poor performance (to say the least). As
we move to the right on this axis, each point represents a better economic
performance up to the point where we see a value of +20, indicating that
the real economy has grown by 20% over the past year. The vertical axis
depicts values of "Incumbent-Party Vote Percentage." Moving upward on
this axis represents an increasing share of the popular vote for the incum-
bent party, whereas moving downward represents a decreasing share of the
popular vote.
Now think about these two axes together in terms of what we would
expect to see based on the theory of economic voting. In thinking through
these matters, we should always start with our independent variable. This is
because our theory states that the value of the independent variable exerts a
causal influence on the value of the dependent variable. So, if we start with
a very low value of economic performance, let's say -15 on the horizontal
axis, what does our theory lead us to expect in terms of values for the
incumbent vote, the dependent variable? We would also expect the value of
the dependent variable to be very low. This case would then be expected to
be in the lower-left-hand corner of Figure 1.3. Now imagine a case in which
economic performance was quite strong at +15. Under these circumstances,
our theory would lead us to expect that the incumbent-vote percentage
would also be quite high. Such a case would be in the upper-right-hand cor-
ner of our graph. Figure 1.4 shows two such hypothetical points plotted on
the same graph as Figure 1.3. If we draw a line between these two points, this
line would slope upward from the lower left to the upper right. We describe
such a line as having a positive slope. We can therefore hypothesize that
the relationship between the variable labeled "One-Year Real Economic
Growth" and the variable labeled "Incumbent-Party Vote Percentage" will
be a positive relationship. A positive relationship is one for which higher
values of the independent variable tend to coincide with higher values of
the dependent variable.

[Figure 1.4. What would you expect to see based on the theory of economic voting? Two hypothetical cases plotted on the axes of Figure 1.3: one point in the lower-left corner (very negative growth, low incumbent vote) and one in the upper-right corner (strong growth, high incumbent vote).]
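To make the hypothesized pattern concrete, here is a minimal plotting sketch (ours, with invented data points rather than real election results) showing what a positive relationship would look like on axes like those of Figure 1.3:

```python
# A sketch of a positive relationship using invented data (not real
# election results): stronger growth coincides with a higher
# incumbent-party share of the major-party vote.
import matplotlib.pyplot as plt

growth = [-15, -8, -2, 0, 3, 6, 10, 15]            # One-Year Real Economic Growth
incumbent_vote = [38, 43, 47, 50, 52, 55, 58, 63]  # Incumbent-Party Vote %

plt.scatter(growth, incumbent_vote)
plt.xlabel("One-Year Real Economic Growth Per Capita")
plt.ylabel("Incumbent-Party Percentage of Major-Party Vote")
plt.xlim(-20, 20)
plt.ylim(0, 100)
plt.title("Hypothetical positive relationship")
plt.show()
```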
Let's consider a different operationalization of our independent vari-
able. Instead of economic growth, let's use "Unemployment Percentage" as
our operationalization of economic performance. We haven't changed our
theory, but we need to rethink our hypothesis with this new measurement or
operationalization. The best way to do so is to draw a picture like Figure 1.3
but with the changed independent variable on the horizontal axis. This is
what we have in Figure 1.5. As we move from left to right on the horizontal
axis in Figure 1.5, the percentage of the members of the workforce who are
unemployed goes up. What does this mean in terms of economic perfor-
mance? Rising unemployment is generally considered a poorer economic
performance whereas decreasing unemployment is considered a better eco-
nomic performance. Based on our theory, what should we expect to see
in terms of incumbent vote percentage when unemployment is high? What
about when unemployment is low?
Figure 1.6 shows two such hypothetical points plotted on our graph
of unemployment and incumbent vote from Figure 1.5. The point in the
upper-left-hand corner represents our expected vote percentage when unem-
ployment equals zero. Under these circumstances, our theory of economic
voting leads us to expect that the incumbent party will do very well. The
point in the lower-right-hand corner represents our expected vote percent-
age when unemployment is very high. Under these circumstances our theory
of economic voting leads us to expect that the incumbent party will do very
poorly.

[Figure 1.5. What would you expect to see based on the theory of economic voting? An empty graph with "Unemployment Percentage" on the horizontal axis (0 to 100) and "Incumbent-Party Percentage of Major-Party Vote" on the vertical axis (0 to 100).]

[Figure 1.6. What would you expect to see based on the theory of economic voting? Two hypothetical cases: one point in the upper-left corner (zero unemployment, high incumbent vote) and one in the lower-right corner (very high unemployment, low incumbent vote).]

If we draw a line between these two points, this line would slope
downward from the upper left to the lower right. We describe such a line as
having a negative slope. We can therefore hypothesize that the relationship
between the variable labeled "Unemployment Percentage" and the variable
labeled "Incumbent-Parry Vote Percentage" will be a negative relationship.
A negative relationship is one for which higher values of the independent
variable tend to coincide with lower values of the dependent variable.
In this example we have seen that the same theory can lead to a hypoth-
esis of a positive or a negative relationship. The theory to be tested, together
with the operationalization of the independent and the dependent variables,
determines the direction of the hypothesized relationship. The best way to
translate our theories into hypotheses is to draw a picture like Figure 1.3
or 1.5. The first step is to label the horizontal axis with the variable label for
the independent variable (as operationalized) and then label the low (left)
and high (right) ends of the axis with appropriate value labels. The second
step in this process is to label the vertical axis with the variable label for
the dependent variable and then label the low and high ends of that axis
with appropriate value labels. Once we have such a figure with the axes and
low and high values for each properly labeled, we can determine what our
expected value of our dependent variable should be if we observe both a
low and a high value of the independent variable. And, once we have placed
the two resulting points on our figure, we can tell whether our hypothesized
relationship is positive or negative.
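As a small sketch of that last step (our own illustration, not from the book), once the two expected points are placed, the sign of the line between them tells us the hypothesized direction:

```python
# Classify a hypothesized relationship from two expected (x, y) points:
# a rising line is a positive relationship, a falling line is negative.
def relationship_direction(point_low, point_high):
    (x1, y1), (x2, y2) = point_low, point_high
    slope = (y2 - y1) / (x2 - x1)
    return "positive" if slope > 0 else "negative"

# Economic growth example: low growth -> low vote, high growth -> high vote.
print(relationship_direction((-15, 35), (15, 65)))   # positive
# Unemployment example: low unemployment -> high vote, high -> low vote.
print(relationship_direction((0, 65), (90, 35)))     # negative
```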
Once we have figured out our hypothesized relationship, we can collect
data from real-world cases and see how well these data reflect our expec-
tations of a positive or negative relationship. This is a very important step
that we can carry out fairly easily in the case of the theory of economic vot-
ing. Once we collect all of the data on economic performance and election
outcomes, we will, however, still be a long way from confirming the theory
that economic performance causes presidential election outcomes. Even if
a graph like Figure 1.3 produces compelling visual evidence, we will need
to see more rigorous evidence than that. Chapters 7-11 focus on the use
of statistics to evaluate hypotheses. The basic logic of statistical hypothesis
testing is that we assess the probability that the relationship we find could
be due to random chance. The stronger the evidence that such a relationship
could not be due to random chance, the more confident we would be in our
hypothesis. The stronger the evidence that such a relationship could be due
to random chance, the more confident we would be in the corresponding
null hypothesis. This in turn reflects on our theory.
We also, at this point, need to be cautious about claiming that we
have "confirmed" our theory, because social scientific phenomena (such as
elections) are usually complex and cannot be explained completely with
a single independent variable. Take a minute or two to think about what
other variables, aside from economic performance, you believe might be
causally related to U.S. presidential election outcomes. If you can come up
with at least one, you are on your way to thinking like a political scientist.
Because there are usually other variables that matter, we can continue to
think about our theories two variables at a time, but we need to qualify our
expectations to account for other variables. We will spend Chapters 3 and
4 expanding on these important issues.

1.4 MODELS OF POLITICS
When we think about the phenomena that we want to better understand as
dependent variables and develop theories about the independent variables
that causally influence them, we are constructing theoretical models. Polit-
ical scientist James Rogers provides an excellent analogy between models
and maps to explain how these abstractions from reality are useful to us as
we try to understand the political world:
The very unrealism of a model, if properly constructed, is what makes it
useful. The models developed below are intended to serve much the same
function as a street map of a city. If one compares a map of a city to the real
topography of that city, it is certain that what is represented in the map
16 The Scientific Study of Politics
is a highly unrealistic portrayal of what the city actually looks like. The
map utterly distorts what is really there and leaves out numerous details
about what a particular area looks like. But it is precisely because the map
distorts reality, because it abstracts away from a host of details about what
is really there, that it is a useful tool. A map that attempted to portray
the full details of a particular area would be too cluttered to be useful
in finding a particular location or would be too large to be conveniently
stored. (2006, p. 276, emphasis in original)
The essential point is that models are simplifications. Whether or not they
are useful to us depends on what we are trying to accomplish with the
particular model. One of the remarkable aspects of models is that they
are often more useful to us when they are inaccurate than when they are
accurate. The process of thinking about the failure of a model to explain
one or more cases can generate a new causal theory. Glaring inaccuracies
often point us in the direction of fruitful theoretical progress.
1.5 RULES OF THE ROAD TO SCIENTIFIC KNOWLEDGE
ABOUT POLITICS
In the chapters that follow, we will focus on particular tools of political sci-
ence research. As we do this, try to keep in mind our larger purpose - trying
to advance the state of scientific knowledge about politics. As scientists, we
have a number of basic rules that should never be far from our thinking:
- Make your theories causal.
- Don't let data alone drive your theories.
- Consider only empirical evidence.
- Avoid normative statements.
- Pursue both generality and parsimony.
1.5.1 Make Your Theories Causal
All of Chapter 3 deals with the issue of causality and, specifically, how we
identify causal relationships. When political scientists construct theories, it
is critical that they always think in terms of the causal processes that drive
the phenomena in which they are interested. For us to develop a better
understanding of the political world, we need to think in terms of causes and
not mere covariation. The term covariation is used ro describe a situation in
which two variables vary together (or covary). If we imagine two variables,
A and B, then we would say that A and B covary if it is the case that,
when we observe higher values of variable A, we generally also observe
higher values of variable B. We would also say that A and B covary if it
1.5 Rules of the Road 17
is the case that, when we observe higher values of variable A, we generally
also observe lower values of variable B.⁶ It is easy to assume that when we
observe covariation we are also observing causality, but it is important not
to fall into this trap.
1.5.2 Don't Let Data Alone Drive Your Theories
This rule of the road is closely linked to the first. A longer way of stating
it is "try to develop theories before examining the data on which you will
perform your tests." The importance of this rule is best illustrated by a silly
example. Suppose that we are looking at data on the murder rate (number
of murders per 1000 people) in the city of Houston, Texas, by months of
the year. This is our dependent variable, and we want to explain why it
is higher in some months and lower in others. If we were to take as many
different independent variables as possible and simply see whether they
had a relationship with our dependent variable, one variable that we might
find to strongly covary with the murder rate is the amount of money spent
per capita on ice cream. If we perform some verbal gymnastics, we might
develop a "theory" about how heightened blood sugar levels in people who
eat too much ice cream lead to murderous patterns of behavior. Of course, if
we think about it further, we might realize that both ice cream sales and the
number of murders committed go up when temperatures rise. Do we have
a plausible explanation for why temperatures and murder rates might be
causally related? It is pretty well known that people's tempers tend to fray
when the temperature is higher. People also spend a lot more time outside
during hotter weather, and these two factors might combine to produce a
causally plausible relationship between temperatures and murder rates.
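A small simulation (our own, with invented numbers) makes the point: if temperature drives both ice cream spending and the murder rate, the two will covary strongly even though neither causes the other.

```python
# Simulate a common cause: temperature drives both ice cream spending
# and the murder rate, producing covariation without causation.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(0, 35, size=120)  # monthly averages, Celsius

ice_cream = 2 + 0.3 * temperature + rng.normal(0, 1, 120)      # $ per capita
murder_rate = 1 + 0.05 * temperature + rng.normal(0, 0.3, 120)

# Ice cream and murders covary strongly although neither causes the other.
print(np.corrcoef(ice_cream, murder_rate)[0, 1])
```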
What this rather silly example illustrates is that we don't want our
theories to be crafted based entirely on observations from real-world data.
We are likely to be somewhat familiar with empirical patterns relating to
the dependent variables for which we are developing causal theories. This
is normal; we wouldn't be able to develop theories about phenomena about
which we know nothing. But we need to be careful about how much we let
what we see guide our development of our theories. One of the best ways to
do this is to think about the underlying causal process as we develop our the-
ories and to let this have much more influence on our thinking than patterns
that we might have observed. Chapter 2 is all about strategies for develop-
ing theories. One of these strategies is to identify interesting variation in our
⁶ A closely related term is correlation. For now we use these two terms interchangeably.
In Chapter 7, you will see that there are precise statistical measures of covariance and
correlation that are closely related to each other but produce different numbers for the
same data.
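For instance (a sketch of ours, with arbitrary data), numpy reports a covariance and a correlation that share a sign but differ in magnitude:

```python
# Covariance and correlation of the same two variables: same sign,
# different numbers (correlation is scaled to lie between -1 and 1).
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

print(np.cov(a, b)[0, 1])       # covariance, depends on the units
print(np.corrcoef(a, b)[0, 1])  # correlation, unit-free
```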
dependent variable. Although this strategy for theory development relies on
data, it should not be done without thinking about the underlying causal
processes.
1.5.3 Consider Only Empirical Evidence
As we previously outlined, we need to always remain open to the possibility
that new evidence will come along that will decrease our confidence in
even a well-established theory. A closely related rule of the road is that, as
scientists, we want to base what we know on what we see from empirical
evidence, which, as we have said, is simply "evidence based on observing
the real world." Strong logical arguments are a good start in favor of a
theory, but before we can be convinced, we need to see results from rigorous
hypothesis tests.⁷
1.5.4 Avoid Normative Statements
Normative statements are statements about how the world ought to be.
Whereas politicians make and break their political careers with norma-
tive statements, political scientists need to avoid them at all costs. Most
political scientists care about political issues and have opinions about how
the world ought to be. On its own, this is not a problem. But when nor-
mative preferences about how the world "should" be structured creep into
their scientific work, the results can become highly problematic. The best
way to avoid such problems is to conduct research and report your findings
in such a fashion that it is impossible for the reader to tell what your
normative preferences about the world are.
This does not mean that good political science research cannot be used
to change the world. To the contrary, advances in our scientific knowledge
about phenomena enable policy makers to bring about changes in an effec-
tive manner. For instance, if we want to rid the world of wars (normative),
we need to understand the systematic dynamics of the international system
that produce wars in the first place (empirical and causal). If we want to rid
America of homelessness (normative), we need to understand the pathways
⁷ It is worth noting that some political scientists use data drawn from experimental settings to
test their hypotheses. There is some debate about whether such data are, strictly speaking,
empirical or not. We discuss political science experiments and their limitations in Chapter
4. In recent years some political scientists have also made clever use of simulated data to
gain leverage on their phenomena of interest, and the empirical nature of such data can
certainly be debated. In the context of this textbook we are not interested in weighing in
on these debates about exactly what is and is not empirical data. Instead, we suggest that
one should always consider the overall quality of data on which hypothesis tests have been
performed when evaluating causal claims.
into and out of being homeless (empirical and causal). If we want to help our
favored candidate win elections (normative), we need to understand what
characteristics make people vote the way they do (empirical and causal).
1.5.5 Pursue Both Generality and Parsimony
Our final rule of the road is that we should always pursue generality and
parsimony. These two goals can come into conflict. By "generality," we
mean that we want our theories to be applied to as general a class of phe-
nomena as possible. For instance, a theory that explains the causes of a
phenomenon in only one country is less useful than a theory that explains
the same phenomenon across multiple countries. Additionally, the more
simple or parsimonious a theory is, the more appealing it becomes.⁸
In the real world, however, we often face trade-offs between generality
and parsimony. This is the case because, to make a theory apply more
generally, we need to add caveats. The more caveats that we add to a theory,
the less parsimonious it becomes.

1.6 A QUICK LOOK AHEAD
You now know the rules of the road. As we go through the next 11 chapters,
you will acquire an increasingly complicated set of tools for developing and
testing scientific theories about politics, so it is crucial that, at every step
along the way, you keep these rules in the back of your mind. The rest of this
book can be divided into three different sections. The first section, which
includes this chapter through Chapter 4, is focused on the development of
theories and research designs to study causal relationships about politics. In
Chapter 2, "The Art of Theory Building," we discuss a range of strategies
for developing theories about political phenomena. In Chapter 3, "Evalu-
ating Causal Relationships," we provide a detailed explanation of the logic
for evaluating causal claims about relationships between an independent
variable, which we call "X," and a dependent variable, which we call "Y."
In Chapter 4, "Research Design," we discuss the research strategies that
political scientists use to investigate causal relationships.
In the second section of this book, we expand on the basic tools that
political scientists need to test their theories. Chapter 5, "Getting to Know
Your Data: Evaluating Measurement and Variations," is a detailed discus-
sion of how we measure (or operationalize) our variables, along with an
⁸ The term "parsimonious" is often used in a relative sense. So, if we are comparing two
theories, the theory that is simpler would be the more parsimonious. Indeed, this rule of
the road might be phrased "pursue both generality and simplicity." We use the words
"parsimony" and "parsimonious" because they are widely used to describe theories.
introduction to a set of tools that can be used to summarize the charac-
teristics of variables one at a time. Chapter 6, "Probability and Statistical
Inference," introduces both the basics of probability theory as well as the
logic of statistical hypothesis testing. In Chapter 7, "Bivariate Hypothe-
sis Testing," we begin to apply the lessons from Chapter 6 to a series of
empirical tests of the relationship between pairs of variables.
The third and final section of this book introduces the critical con-
cepts of the regression model. Chapter 8, "Bivariate Regression Models,"
introduces the two-variable regression model as an extension of the con-
cepts from Chapter 7. In Chapter 9, "Multiple Regression: The Basics," we
introduce the multiple regression model, with which researchers are able
to look at the effects of independent variable X on dependent variable Y
while controlling for the effects of other independent variables. Chapter 10,
"Multiple Regression Model Specification," and Chapter 11, "Limited
Dependent Variables and Time-Series Data," provide in-depth discussions
of and advice for commonly encountered research scenarios involving mul-
tiple regression models. Lastly, in Chapter 12, "Putting It All Together to
Produce Effective Research," we discuss how to apply the lessons learned
in this book to begin to produce original research of your own.
CONCEPTS INTRODUCED IN THIS CHAPTER⁹
causal - implying causality. A central focus of this book is on theories
about "causal" relat ionships.
correlation - a statistical measure of covariation which summarizes the
direction (positive or negative) and strength of the linear relationship
between two variables.
covary (or covariation) - when two variables vary together, they
are said to "covary." The term "covariation" is used to describe
circumstances in which two variables covary.
data - a collection of variable values for at least two observations.
dependent variable - a variable for which at least some of the variation
is theorized to be caused by one or more independent variables.
empirical - based on real-world observation.
hypothesis - a theory-based statement about what we would expect
to observe if our theory is correct. A hypothesis is a more explicit
statement of a theory in terms of the expected relationship between a
⁹ At the end of each chapter, we will provide short definitions of each bolded term that was
introduced in that chapter. These short definitions are intended to help you get an initial
grasp of the term when it is introduced. A full understanding of these concepts, of course,
can only be gained through a thorough reading of the chapter.
measure of the independent variable and a measure of the dependent
variable.
hypothesis testing - the act of evaluating empirical evidence in order
to determine the level of support for the hypothesis versus the null
hypothesis.
independent variable - a variable that is theorized to cause variation
in the dependent variable.
measure - a process by which abstract concepts are turned into real-
world observations.
negative relationship- higher values of the independent variable tend
to coincide with lower values of the dependent variable.
normal science - scientific research that is carried out under the shared
set of assumptions and accepted theories of a paradigm.
normative statements - statements about how the world ought to be.
null hypothesis - a theory-based statement about what we would
observe if there were no relationship between an independent variable
and the dependent variable.
operationalize - another word for measurement. When a variable
moves from the concept-level in a theory to the real-world measure
for a hypothesis test, it has been operationalized.
paradigm - a shared set of assumptions and accepted theories in a
particular scientific field.
paradigm shift - when new findings challenge the conventional wisdom
of a paradigm to the point where the set of shared assumptions and
accepted theories in a scientific field is redefined.
parsimonious - synonym for simple or succinct.
positive relationship- higher values of the independent variable tend
to coincide with higher values of the dependent variable.
theoretical model - the combination of independent variables, the
dependent variable, and the causal relationships that are theorized to
exist between them.
theory - a tentative conjecture about the causes of some phenomenon
of interest.
variable - a definable quantity that can take on two or more values.
variable label- the label used to describe a particular variable.
variable values - the values that a particular variable can take on.
EXERCISES
1. Pick another subject in which you have taken a course and heard mention of
scientific theories. How is political science similar to and different from that
subject?
Statistics and Political Science
Agnar F. Helgason
The Ohio State University
PS4781
Chapter Outline
1. Political Science?
2. Thinking about the World in Terms of Variables and Causal Explanations
3. Rules of the Road to Scientific Knowledge about Politics
4. Why Statistics?
Approaching Politics Scientifically: The Search for Causal Explanations
- Political science is about the scientific study of political phenomena
- The scientific method involves
  - Developing falsifiable theories
  - Testing those theories with data
- Skepticism about scientific knowledge is inherent in the scientific method
  - We never prove a theory, we only fail to disprove it
  - All knowledge is tentative
The Scientific Method
Causal theory → Hypothesis → Empirical test → Evaluation of hypothesis → Evaluation of causal theory → Scientific knowledge
Thinking about the World in Terms of Variables and Causal Explanations
- From concepts to variables
- Variable label and variable values
- Independent variable, dependent variable
- Try to come up with your own causal statement in terms of an independent and dependent variable; try filling in the following blanks with some political variables: __________ causes __________
From theory to hypothesis
[Diagram: the causal theory links the independent variable (concept) to the dependent variable (concept); operationalization yields the measured variables, and the hypothesis links the independent variable (measured) to the dependent variable (measured).]
Theory of Economic Voting
What would you expect to see based on the theory of economic voting?
[Empty graph: One-Year Real Economic Growth Per Capita (x axis, -20 to 20) vs. Incumbent Party Percentage of Major Party Vote (y axis, 0 to 100).]
What would you expect to see based on the theory of economic voting? Two hypothetical cases
[Same axes, with two plotted points: low growth with a low incumbent vote share, and high growth with a high incumbent vote share.]
What would you expect to see based on the theory of economic voting?
[Empty graph: Unemployment Percentage (x axis, 0 to 100) vs. Incumbent Party Percentage of Major Party Vote (y axis, 0 to 100).]
What would you expect to see based on the theory of economic voting? Two hypothetical cases
[Same axes, with two plotted points: low unemployment with a high incumbent vote share, and high unemployment with a low incumbent vote share.]
Rules of the Road to Scientific Knowledge about Politics
- Make Your Theories Causal
- Don't Let Data Alone Drive Your Theories
- Consider Only Empirical Evidence
- Avoid Normative Statements
- Pursue Both Generality and Parsimony
What's Statistics got to do with it?
- Statistics: a set of methods for collecting and analyzing data (the art and science of learning from data). Statistics is the language of science.
- Data: information collected to gain knowledge about a field or to answer a question of interest.
- Data sources include:
  - Surveys (mail, telephone, internet)
  - Experiments
  - Observational studies
Statistics provides methods for:
- Design: planning and implementing a study
  - Sample survey? Experiment? Observational study?
  - How to choose people (subjects) for the study, and how many?
- Description: graphical and numerical methods for summarizing the data
- Inference: methods for making predictions about a population (the total set of subjects of interest), based on a sample (the subset of the population on which the study collects data)
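As a small illustration of the "description" step (our own sketch, with made-up data), pandas provides quick numerical summaries of a sample:

```python
# Numerical description of a small hypothetical sample.
import pandas as pd

sample = pd.DataFrame({
    "approval": [1, 0, 1, 1, 0, 1, 0, 1],   # 1 = approves, 0 = does not
    "age":      [22, 35, 41, 58, 63, 29, 47, 51],
})
print(sample.describe())          # count, mean, std, min, quartiles, max
print(sample["approval"].mean())  # sample proportion approving
```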
Parameters and Statistics
- Parameter: a numerical summary of the population
  - Population mean (or median or some other measure)
  - Population proportion (or percentage)
- Statistic: a numerical summary of the sample
We'll learn how to use sample statistics to make inferences about population parameters.
Parameters and Statistics: Examples

Parameter: % of all adult Americans who approve of Barack Obama's
performance as President.
Statistic: % of 1,000 adult Americans in a poll who approve of Obama's
performance as President.

Parameter: Mean reaction time to a red light in an experiment when
using (not using) a cell phone while driving.
Statistic: Mean reaction time to a red light for 100 students in the
experiment when using (not using) a cell phone while driving.
PS4781 Statistics and Political SCIENCE 20 / 21
Why study Statistics?
One answer: You need it to understand research findings in political
science, psychology, medicine, business, ...
Another answer: In a competitive job market, understanding how to
deal with quantitative information provides an important advantage.
("The sexy job of the next 10 years will be statistician" - Hal Varian,
chief economist at Google)
Broader answer: In your everyday life, it will help you make sense of
what to heed and what to ignore in statistical information provided in
news reports, medical research, surveys, political campaigns,
advertisements, ...
PS4781 Statistics and Political SCIENCE 21 / 21
Week 13: Interaction Effects
PS4781 Techniques of Political Analysis
Section 11.5 in the Agresti and Finlay textbook contains an example (ex. 11.6), which
shows how two independent variables can interact to affect a dependent variable. Section
11.5 should have been assigned for this week, but was omitted from the syllabus by mistake.
Rather than (unfairly and unexpectedly) require you to read the section, I've prepared this
short handout, which goes over the example. It should help with questions 4, 5, and 6 on
Homework 4.
The data in the example come from a study in Florida on the relationship between mental
health impairment and several explanatory variables. The dependent variable (y) is an
index which incorporates various aspects of psychiatric symptoms, including anxiety and
depression, with higher scores indicating greater impairment. The two main independent
variables are x₁ = life events score and x₂ = socioeconomic status (SES). The life events
score is a composite measure of both the number and severity of major life events the subject
experienced within the past three years (e.g., a death in the family or a jail sentence), with
higher scores representing a greater number and/or greater severity of these life events. The
SES score is a composite index based on occupation, income, and education. The higher the
score, the higher the status.
One model we might estimate could be

E(y) = α + β₁x₁ + β₂x₂

where α is the constant (a.k.a. intercept) and the βs are the coefficients (a.k.a. slopes) on
the two independent variables. The βs show how y changes as the xs change.
In this model, the effect of each independent variable on the dependent variable
is always the same, regardless of the value of the other variable in the model. Say that
β₁ = 0.5; then increasing the value of the x₁ variable by 1 unit would lead y to increase
by 0.5 units, regardless of the particular value of the x₂ variable.
However, we might have reason to believe that this assumption does not hold. In particular,
we might suspect that the effect of severe life events on mental impairment depends
on a person's socioeconomic status. A plausible hypothesis is that extreme life events should
be less likely to lead to mental impairment the higher a person's status.
How could we go about testing this? The simplest way is to add an interaction effect to
the model, which allows for the possibility that the effect of, say, x₁ on y depends on the
specific value of the x₂ variable. The model could then be written

E(y) = α + β₁x₁ + β₂x₂ + β₃x₁x₂
Notice that now we have an extra term in the model, which is the cross-product of the other
two terms in the model.
The textbook shows the results of estimating this model, which leads to the following
prediction equation:

ŷ = 26.0 + 0.156x₁ − 0.060x₂ − 0.00087x₁x₂
What do we make of this? Well, first of all, we can no longer just look at the coefficient
in front of, say, x₁ to evaluate the relationship between x₁ and y. Now we need to keep the
coefficient on the interaction term in mind as well. Secondly, we can see a couple of things
directly from the prediction equation: more extreme life events (x₁) are associated with
higher levels of mental impairment (because the coefficient on x₁ is positive), and a higher
socioeconomic status (x₂) is associated with lower levels of mental impairment (because the
coefficient on x₂ is negative). The coefficient on the interaction term is negative, suggesting
that increasing either x₁ or x₂ will make the coefficient on the other variable more negative
(or less positive).
Let's take an example. The socioeconomic scale runs from 0 to 100. How do the effects of
severe life events on mental impairment differ when the value on the socioeconomic scale is
0, 50, and 100? To figure that out, we input each suggested value of x₂ into the equation,
simplify, and see how the coefficient on x₁ changes.
When x₂ = 0,

ŷ = 26.0 + 0.156x₁ − 0.060(0) − 0.00087x₁(0) = 26.0 + 0.156x₁

When x₂ = 50,

ŷ = 26.0 + 0.156x₁ − 0.060(50) − 0.00087x₁(50) = 23.0 + 0.113x₁

When x₂ = 100,

ŷ = 26.0 + 0.156x₁ − 0.060(100) − 0.00087x₁(100) = 20.0 + 0.069x₁
What can we conclude when comparing the prediction equations at different levels of x₂?
The higher the value of socioeconomic status (x₂), the smaller the slope between predicted
mental impairment (ŷ) and life events (x₁), and so the weaker the effect of life events. This
suggests that subjects who possess greater resources, in the form of a higher socioeconomic
status, are better able to withstand the mental stress of potentially traumatic life events.
Thus, the effects of life events on mental impairment depend on (a.k.a. are conditioned by)
socioeconomic status.
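To make the arithmetic above easy to check, here is a minimal R sketch that reproduces the worked example directly from the prediction equation; the function names are mine, not the textbook's.

# Intercept and slope on x1 (life events) at a given SES score (x2),
# taken from the prediction equation in the text:
# y-hat = 26.0 + 0.156*x1 - 0.060*x2 - 0.00087*x1*x2
intercept_at <- function(x2) 26.0 - 0.060 * x2
slope_x1_at  <- function(x2) 0.156 - 0.00087 * x2

sapply(c(0, 50, 100), intercept_at)  # 26.0  23.0  20.0
sapply(c(0, 50, 100), slope_x1_at)   # 0.156  0.1125 (~0.113)  0.069

With the raw data in hand (say, a data frame d with columns y, x1, and x2, names I am assuming), the interaction model itself could be estimated with lm(y ~ x1 * x2, data = d).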
Bivariate Regression I
PS4781 Bivariate Regression I 1 / 19
Chapter Outline
1. Motivation
2. Fitting the Best Line
3. Example
PS4781 Bivariate Regression I 2 / 19
Motivation
Last week we covered several bivariate hypotheses tests
This week we turn our attention to the workhorse model of
quantitative political science, regression
We start very gently, considering regression in the two variable case,
when both variables are quantitative (interval variables)
In the following weeks (actually for all the remaining weeks of the
course), we add to this simple framework and go over how we can
extend regression modeling to account for many of the issues we have
covered in the course.
Most importantly, we show how regression allows us to control for Zs
while we are exploring the linear relationship between Xs and Ys
PS4781 Bivariate Regression I 4 / 19
Scatter plot of change in GDP and incumbent-party vote
share
[Figure: scatter plot. Vertical axis: Incumbent Party Vote Percentage, 35 to 60. Horizontal axis: Percentage Change in Real GDP Per Capita, -15 to 10.]
PS4781 Bivariate Regression I 6 / 19
Which line fits best? Estimating the regression line
In estimating a regression line, our task is to draw a straight line that
describes the relationship between our independent variable X and
our dependent variable Y.
We clearly want to draw a line that comes as close as possible to the
cases in our scatter plot of data.
But how do we decide which line is best?
PS4781 Bivariate Regression I 7 / 19
An aside on notation
Many of you may remember the formula for a line from a geometry
class:
y = b + mx
where b is the y-intercept and m is the slope, often explained as
the "rise over run" component of the line.
For a one-unit increase in x (run), m is the corresponding amount
of rise in y.
PS4781 Bivariate Regression I 8 / 19
An aside on notation
In OLS, part of the equation for the best line is based on this formula,
but we use different symbols:

Yᵢ = α + βXᵢ + uᵢ

α is the intercept, β is the slope, Yᵢ and Xᵢ are individual data points,
and uᵢ is the error or residual term (which is what we are trying to
minimize).
Remember, because our theories always involve probabilistic
causation, we can never perfectly predict Y with X; hence the
residual term.
We often refer to α + βXᵢ as the systematic component of Yᵢ and uᵢ
as the stochastic or random component.
PS4781 Bivariate Regression I 9 / 19
Three possible lines
A: Ŷ = 50.21 + 1.15Xᵢ
B: Ŷ = 51.51 + 0.62Xᵢ
C: Ŷ = 52.01 + 0.25Xᵢ
[Figure: the three candidate lines drawn through the scatter plot. Vertical axis: Incumbent Party Vote Percentage, 30 to 70. Horizontal axis: Percentage Change in Real GDP Per Capita, -15 to 15.]
PS4781 Bivariate Regression I 10 / 19
Which line is best?
So, how do we decide which line "best fits" the data that we see in
our scatter plot of Xᵢ and Yᵢ values?
We find the line that minimizes the vertical distances between each
Yᵢ and the line.
Because there are points both above and below the line, we use the
squared distance, so that the distances don't cancel each other out.
This is called ordinary least squares regression, or just OLS.
PS4781 Bivariate Regression I 11 / 19
Finding the Line with the Least Squares
PS4781 Bivariate Regression I 12 / 19
So how do we find the line?
So how do we find (= estimate) α, β, and uᵢ based on the Xs and Ys
we have?
For the two-variable case, the formulas for the estimates are:

β̂ = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²

α̂ = Ȳ − β̂X̄

ûᵢ = Yᵢ − Ŷᵢ
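The formulas above are easy to verify by hand in R. A minimal sketch, using made-up vectors x and y rather than course data:

# Two-variable OLS estimates computed from the formulas above
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
u_hat     <- y - (alpha_hat + beta_hat * x)   # residuals

coef(lm(y ~ x))   # the same estimates from R's built-in routine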
PS4781 Bivariate Regression I 13 / 19
So how do we find the line?
If we examine the formula for β̂, we can see that the numerator is the
same as the numerator for calculating the covariance between X and Y.
Thus the logic of how each case contributes to this formula is the same.
The denominator in the formula for β̂ is the sum of squared
deviations of the Xᵢ values from the mean value of X (X̄).
Thus, for a given covariance between X and Y, the more (less)
spread out X is, the less (more) steep the estimated slope of the
regression line.
PS4781 Bivariate Regression I 14 / 19
OLS regression line through scatter plot with
mean-delimited quadrants
[Figure: scatter plot with the OLS regression line drawn through mean-delimited quadrants. Vertical axis: Incumbent Party Vote Percentage, 35 to 60. Horizontal axis: Percentage Change in Real GDP Per Capita, -15 to 10.]
PS4781 Bivariate Regression I 15 / 19
Interpretation of β̂
If β̂ > 0, the line slopes upward (positive relationship)
If β̂ = 0, the line is horizontal (no relationship)
If β̂ < 0, the line slopes downward (negative relationship)
PS4781 Bivariate Regression I 16 / 19
Do rich countries pollute more than poor countries?
X = Gross Domestic Product (GDP, in thousands of dollars per capita)
Y = Carbon Dioxide Emissions (per capita, in metric tons)
PS4781 Bivariate Regression I 18 / 19
Do rich countries pollute more than poor countries?
The least squares line given by the formula is Ŷᵢ = 0.42 + 0.31Xᵢ
So when X = 0, predicted emissions would be 0.42 (which is actually irrelevant,
because no GDP values fall near 0)
For each increase of 1 thousand dollars in per capita GDP, predicted
pollution increases by 0.31 metric tons per capita
But this linear equation is just an approximation. The correlation
between X and Y for these nations was 0.64, not 1.0
At X = 39.7 (the value for the US), predicted pollution is
Ŷᵢ = 0.42 + 0.31(39.7) = 12.7
Actual US emissions are 19.8, so the residual for the US is
ûᵢ = Yᵢ − Ŷᵢ = 19.8 − 12.7 = 7.1
So the US pollutes quite a bit more than would be predicted based on
its GDP
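As a quick check, the slide's numbers can be reproduced in R from the reported line; nothing here goes beyond the equation Ŷ = 0.42 + 0.31X:

# Predicted CO2 emissions and the US residual from the fitted line
predict_co2 <- function(gdp) 0.42 + 0.31 * gdp
y_hat_us <- predict_co2(39.7)   # about 12.7 metric tons per capita
19.8 - y_hat_us                 # US residual, about 7.1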
PS4781 Bivariate Regression I 19 / 19
Causation in the Social Sciences
PS4781 Causation in the Social Sciences 1 / 11
Chapter Outline
1. Causality and everyday language
PS4781 Causation in the Social Sciences 2 / 11
Bivariate theories and a multivariate world
Recall that the goal of political science (and all science) is to create
and then evaluate causal theories.
Most causal theories, almost all of them, are bivariate, which means
that they ask whether a single independent variable is a cause of a
single dependent variable.
In a generic way, we'll usually say that we're wondering whether X
causes Y.
But, crucially, even though our theory might be bivariate, the real
world is not. The real world is a place that is multivariate. That is,
every interesting phenomenon (or dependent variable) has several
causes.
So, in the real world, perhaps X does cause Y, or perhaps not, but
the key is to remember that perhaps some other variables, which we'll
call Z, might cause Y, too (or instead).
PS4781 Causation in the Social Sciences 4 / 11
Does ice cream consumption cause increases in crime?
PS4781 Causation in the Social Sciences 5 / 11
Probably not!
PS4781 Causation in the Social Sciences 6 / 11
Evaluating bivariate theories in a multivariate world
So if our theories are bivariate, but reality is multivariate, we need to
figure out how to control for the other possible causes of Y when
evaluating whether X causes Y.
If we don't control for Z, the other possible causes of Y, then our
conclusions about whether X causes Y might very well be mistaken.
Precise strategies for doing this, which we'll call research designs, will
be dealt with next week. What we'll talk about this week are the
logical foundations of how to evaluate causal connections.
logical foundations about how to evaluate causal connections.
PS4781 Causation in the Social Sciences 7 / 11
A note on the word "causality"
The words "cause" and "causality" appear everywhere. But what do
we mean by them in political science?
In fact, the branch of philosophy of science that is dedicated to
fleshing out what we mean by causality is quite extensive.
In most understandings, causality is understood as deterministic.
But social reality is more complex than physics, so in our world,
causation is normally understood as probabilistic.
PS4781 Causation in the Social Sciences 8 / 11
Deterministic vs. Probabilistic Causation
Deterministic: If X, then always Y
Probabilistic: If X, then Y is more likely
Example: Theory of Economic Voting
Relates economic conditions with electoral success of incumbent
Deterministic: If the economy is doing poorly, the incumbent will
always lose elections
Probabilistic: If the economy is doing poorly, the incumbent is more
likely to lose the elections
PS4781 Causation in the Social Sciences 9 / 11
Obama wins reelection despite shaky economy
PS4781 Causation in the Social Sciences 10 / 11
Sidenote: More on the Rube Goldberg Machine
http://www.nytimes.com/2012/01/08/nyregion/brooklyns-joseph-herscher-and-his-rube-goldberg-machines.html?_r=1&
PS4781 Causation in the Social Sciences 11 / 11
Pearson's Correlation Coefficient
a.k.a. Pearson's r
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 1 / 14
Chapter Outline
1. When to Use the Correlation Coefficient?
2. Example from Kellstedt and Whitten
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 2 / 14
Conditions
We use Pearson's Correlation Coefficient when
The dependent (response) variable is continuous
The independent (explanatory) variable is also continuous
It gives an indication of the strength of the linear association between
two variables
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 4 / 14
Properties of r
r is always between -1 and 1
If -1, the two variables are perfectly negatively correlated
If 1, the two variables are perfectly (positively) correlated
The larger the absolute value, the stronger the association
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 5 / 14
Examples of Different Correlations
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 6 / 14
Correlation Coefficient
When we have an independent variable and a dependent variable that
are both continuous, we can visually detect covariation pretty easily in
graphs.
Scatter plots are useful for getting an initial look at the relationship
between two continuous variables:
- Any time that you examine a scatter plot, you should figure out what
the axes are and then what each point in the scatter plot represents.
- In these plots, the dependent variable (in this case incumbent vote)
should be displayed on the vertical axis while the independent variable
(in this case economic growth) should be displayed on the horizontal
axis.
- Each point in the scatter plot should represent the values for the two
variables for an individual case.
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 8 / 14
Scatter plot of change in GDP and incumbent-party vote
share
[Figure: scatter plot. Vertical axis: Incumbent Party Vote Percentage, 35 to 60. Horizontal axis: Percentage Change in Real GDP Per Capita, -15 to 10.]
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 9 / 14
Covariance
Covariance is a statistical way of summarizing the general pattern of
association (or the lack thereof) between two continuous variables.
The formula for covariance between two variables X and Y is

cov(X, Y) = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / n

To better understand the intuition behind the covariance formula, it is
helpful to think of individual cases in terms of their values relative to
the mean of X (X̄) and the mean of Y (Ȳ):
- If Xᵢ − X̄ > 0 and Yᵢ − Ȳ > 0, that case's contribution to the
numerator in the covariance equation will be positive.
- If Xᵢ − X̄ < 0 and Yᵢ − Ȳ < 0, that case's contribution to the
numerator in the covariance equation will also be positive, because
multiplying two negative numbers yields a positive product.
- If a case has a combination of one value greater than the mean and
one value less than the mean, its contribution to the numerator in the
covariance equation will be negative, because multiplying a positive
number by a negative number yields a negative product.
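A minimal R sketch of this formula, using made-up vectors (note that the formula above divides by n, while R's built-in cov() divides by n − 1):

# Covariance computed from the formula on this slide
x <- c(-2, 0, 1, 4, 7)
y <- c(40, 44, 47, 51, 58)
n <- length(x)

sum((x - mean(x)) * (y - mean(y))) / n   # covariance as defined above
cov(x, y) * (n - 1) / n                  # same value via the built-in cov()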
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 10 / 14
Scatter plot of change in GDP and incumbent-party vote
share with mean-delimited quadrants
Sign of each case's contribution by quadrant: (−, −) = +; (+, −) = −; (−, +) = −; (+, +) = +
[Figure: the scatter plot with quadrants delimited by the means of X and Y. Vertical axis: Incumbent Party Vote Percentage, 35 to 60. Horizontal axis: Percentage Change in Real GDP Per Capita, -15 to 10.]
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 11 / 14
Covariance table for economic growth and incumbent-party
presidential vote, 1880-2004

            Vote      Growth
Vote        35.4804
Growth      18.6846   29.8997

In a covariance table, the cells across the main diagonal (from
upper-left to lower-right) are cells for which the column and the row
reference the same variable.
In this case the cell entry is the variance for the referenced variable.
Each of the cells off of the main diagonal displays the covariance for a
pair of variables.
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 12 / 14
From covariance to correlation
The covariance calculation tells us that we have a positive or negative
relationship, but it does not tell us how confident we can be that this
relationship is different from what we would see if our independent
and dependent variables were not related in our underlying population
of interest.
To make this assessment, we use Pearson's r, the formula for which is

r = cov(X, Y) / √(var(X) var(Y))

PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 13 / 14
t-test for r
We can calculate a t-statistic for a correlation coefficient as

tᵣ = r √(n − 2) / √(1 − r²),

with n − 2 degrees of freedom, where n is the number of cases.
In this case, our degrees of freedom equal 34 − 2 = 32.
Armed with this knowledge, we can calculate the critical value and
P-value in R.
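A short R sketch putting the last few slides together, using the covariance table from above (cov = 18.6846, var(Vote) = 35.4804, var(Growth) = 29.8997) and n = 34 cases:

# Pearson's r from the covariance and variances, then its t-test
r <- 18.6846 / sqrt(35.4804 * 29.8997)   # about 0.57
n <- 34
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)   # about 3.9
pt(-abs(t_r), df = n - 2)                # one-sided p-value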
PS4781 Pearson's Correlation Coefficient a.k.a. Pearson's r 14 / 14
Difference of Means Test
PS4781 Difference of Means Test 1 / 17
Chapter Outline
1. When to Use the Test?
2. Test choices you might come across
3. Example from Kellstedt and Whitten
PS4781 Difference of Means Test 2 / 17
Conditions
We use a difference of means hypothesis test when
The dependent (response) variable is continuous or a proportion
The independent (explanatory) variable is categorical, with two groups

                        Group 1   Group 2   Estimate
Population mean         μ₁        μ₂        ȳ₂ − ȳ₁
Population proportion   π₁        π₂        π̂₂ − π̂₁

It is used to test whether the outcome differs between the two groups
PS4781 Difference of Means Test 4 / 17
Assumptions
Always bear in mind the assumptions associated with the statistical tests
we discuss. In this case we assume:
We have a random sample from the population
The population distribution is either normal or the sample size is big
enough so the CLT applies
If neither of the two conditions holds, there are alternative tests
available. They are discussed in Agresti and Finlay; we won't cover them
in this class, but be aware of them.
PS4781 Difference of Means Test 5 / 17
Independent vs. Dependent Samples
Are the two groups being compared inherently matched?
Independent Samples: Dierent samples, no natural matching. E.g.
comparing men vs. women, Americans vs. non-Americans, young vs.
old etc.
Dependent Samples: Natural matching between each subject in one
sample and a subject in other sample. E.g. experiment where we
compare scores of subjects before and after receiving treatment.
PS4781 Difference of Means Test 7 / 17
Equal vs. Unequal Variance
Are we willing to assume that the two groups being compared have the
same variance and standard deviation? (This mostly matters with
continuous variables.)
If we are willing to assume that, then we can pool the observations
when calculating the standard error.
It is reasonable to make the assumption when the sample standard
deviations are similar.
If we don't assume equal variance, then the formula for the degrees of
freedom is very complex, so we must use statistical software to calculate it.
PS4781 Difference of Means Test 8 / 17
Example from Kellstedt and Whitten
Do majority governments last longer than minority governments in
parliamentary democracies?
- H₀: Duration_Maj = Duration_Min
- Hₐ: Duration_Maj > Duration_Min
DV: Government duration (days, continuous)
IV: Type of government (majority/minority, 2 categories)
- Independent samples
- Let's assume equal variances for simplicity
PS4781 Difference of Means Test 10 / 17
Box-whisker plot of Government Duration for majority and
minority governments
[Figure: box-whisker plots of Number of Days in Government, 0 to 2,000, for majority and minority governments.]
PS4781 Difference of Means Test 11 / 17
Difference of means test
The plot suggests that majority governments last longer than minority
governments.
To determine whether the difference from the figure is statistically
significant, we turn to a difference of means test.
In this test we compare what we have seen in the figure with what we
would expect if there were no relationship between Government Type
and Government Duration.
If there were no relationship between these two variables, then the
world would be such that the durations of governments of both types
were drawn from the same underlying distribution. If this were the
case, the mean or average value of Government Duration would be
the same for minority and majority governments.
PS4781 Difference of Means Test 12 / 17
Difference of means test
A test of this null hypothesis is a version of the t-test.
The formula for this particular t-test is

t = (ȳ₁ − ȳ₂) / se(ȳ₁ − ȳ₂)

where ȳ₁ is the mean of the dependent variable for the first value of
the independent variable and ȳ₂ is the mean of the dependent
variable for the second value of the independent variable.
We can see from this formula that the greater the difference between
the mean value of the dependent variable across the two values of the
independent variable, the further the value of t will be from zero.
The further apart the two means are and the less dispersed the
distributions (as measured by the standard deviations s₁ and s₂), the
greater confidence we have that ȳ₁ and ȳ₂ are different from each
other.
PS4781 Difference of Means Test 13 / 17
Government type and government duration

Government   Number of      Mean       Standard
type         observations   duration   deviation
Majority     124            930.5      466.1
Minority     53             674.4      421.4
Combined     177            853.8      467.1

From the values displayed in this table we can calculate the t-test
statistic for our hypothesis test. The standard error of the difference
between two means (ȳ₁ and ȳ₂), se(ȳ₁ − ȳ₂), is calculated from the
following formula:

se(ȳ₁ − ȳ₂) = √[ ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) ] × √(1/n₁ + 1/n₂)

where n₁ and n₂ are the sample sizes, and s₁² and s₂² are the sample
variances.
PS4781 Difference of Means Test 14 / 17
Difference of means test
If we label the number of days in government for majority
governments ȳ₁ and the number of days in government for minority
governments ȳ₂, then we can calculate the standard error as

se(ȳ₁ − ȳ₂) = √[ ((124 − 1)(466.1)² + (53 − 1)(421.4)²) / (124 + 53 − 2) ] × √(1/124 + 1/53)

se(ȳ₁ − ȳ₂) = 74.39.

Now that we have the standard error, we can calculate the t-statistic:

t = (ȳ₁ − ȳ₂) / se(ȳ₁ − ȳ₂) = (930.5 − 674.4) / 74.39 = 256.1 / 74.39 = 3.44.

Now that we have calculated this t-statistic, we need only find the degrees of
freedom to get to our p-value.
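The same calculation is easy to script in R from the summary statistics in the table (a sketch of my own, not code from the textbook):

# Pooled standard error and t-statistic from the summary statistics
n1 <- 124; n2 <- 53
m1 <- 930.5; m2 <- 674.4
s1 <- 466.1; s2 <- 421.4

pooled <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se     <- sqrt(pooled) * sqrt(1/n1 + 1/n2)   # about 74.39
(m1 - m2) / se                               # t-statistic, about 3.44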
PS4781 Difference of Means Test 15 / 17
Difference of means test: degrees of freedom
We calculate the degrees of freedom for a difference of means
t-statistic as the total sample size minus two. Thus our
degrees of freedom are n₁ + n₂ − 2 = 124 + 53 − 2 = 175.
PS4781 Difference of Means Test 16 / 17
Difference of means test: p-value
In R, we type pt(-abs(3.44), df=175)*1
- pt because we have the t-statistic and want the p-value
- -abs(t-statistic) for technical reasons. Just trust me!
- and *1 because we want the one-sided p-value (*2 if this were a
2-sided test)
We get a p-value of 0.0004 and can reject the null hypothesis that
majority governments and minority governments last equally long, and
accept the alternative hypothesis that majority governments last
longer than minority governments.
PS4781 Difference of Means Test 17 / 17
Statistical Inference
Agnar F. Helgason
The Ohio State University
PS4781 Statistical Inference 1 / 24
Chapter Outline
1. Estimation
2. Confidence Interval for Sample Proportion
3. Confidence Interval for Sample Mean
4. Using Confidence Intervals
PS4781 Statistical Inference 2 / 24
Estimation
Goal: How can we use sample data to estimate values of population
parameters?
Point estimate: A single statistic value that is the "best guess" for
the parameter value
Interval estimate: An interval of numbers around the point estimate
that has a fixed confidence level of containing the true parameter
value. It gives an indication of the accuracy of the point estimate.
PS4781 Statistical Inference 4 / 24
Point Estimators
It is most common to use sample values:
For quantitative variables
- Sample mean → Population mean
- Sample std. dev. → Population std. dev.
For qualitative variables
- Sample proportion → Population proportion
PS4781 Statistical Inference 5 / 24
Properties of Good Estimators
Unbiased: The sampling distribution of the estimator centers around the
parameter value being estimated
- Example of a biased estimator: the sample range. It can never be larger
than the population range (why?), so it systematically underestimates
the population range
Efficient: Smallest possible standard error, compared to other estimators
- Example: If the population is symmetric and bell-shaped, the sample
mean is more efficient than the sample median in estimating both the
population mean and median.
PS4781 Statistical Inference 6 / 24
Interval Estimators
A confidence interval is an interval of numbers believed to contain
the parameter value.
The probability that the method produces an interval that contains the
parameter is called the confidence level. Most studies use a
confidence level of 0.95 or 0.99.
Confidence intervals have the form

point estimate ± margin of error
point estimate ± Z × standard error

with the margin of error based on the spread of the sampling
distribution of the point estimator and the desired confidence level
(Z-score).
PS4781 Statistical Inference 7 / 24
Confidence Interval for a Proportion (in a particular
category)

point estimate ± margin of error
π̂ ± Z(σ_π̂), where σ_π̂ = σ/√n

Recall that the population standard deviation of a proportion is
σ = √(π(1 − π))
Then the standard error of the sample proportion is
σ/√n = √(π(1 − π)/n)
But π is the unknown parameter we are trying to estimate!
Solution: Use π̂ to approximate π. When we use this approximation,
we refer to the standard error as se rather than σ_π̂.
PS4781 Statistical Inference 9 / 24
Example: What percentage of 18-22 year-old Americans
report being "very happy"?
35 of n = 164 say they are "very happy"
So π̂ = 35/164 = 0.213
And se = √(π̂(1 − π̂)/n) = √(0.213(0.787)/164) = 0.032
Let's say we want a 95% confidence level (z-score: 1.96)
Then the interval estimate would be
π̂ ± z(se) = 0.213 ± 1.96(0.032) = 0.213 ± 0.063 = (0.15, 0.28)
We're 95% confident the population proportion who are "very happy"
is between 0.15 and 0.28.
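The same interval in R (a direct transcription of the slide's arithmetic):

# 95% confidence interval for the sample proportion
p_hat <- 35 / 164                        # 0.213
se    <- sqrt(p_hat * (1 - p_hat) / 164)
p_hat + c(-1, 1) * qnorm(0.975) * se     # about (0.15, 0.28)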
PS4781 Statistical Inference 10 / 24
What happens to the interval when we want a higher
confidence level or have a larger sample?
R to the rescue!
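For instance, a small sketch of my own along those lines, reusing the proportion from the previous slide:

# Margin of error as a function of confidence level and sample size
moe <- function(p, n, level) qnorm(1 - (1 - level) / 2) * sqrt(p * (1 - p) / n)

moe(0.213, 164, 0.95)    # the baseline margin of error from the example
moe(0.213, 164, 0.99)    # higher confidence level -> wider interval
moe(0.213, 1000, 0.95)   # larger sample -> narrower interval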
PS4781 Statistical Inference 11 / 24
What assumptions are necessary?
The method requires large n so the sampling distribution is approximately
normal (otherwise the CLT doesn't apply). In practice, at least 15
observations in the category of interest and at least 15 observations
not in it are needed. If you have fewer than that, you can check the section
in Agresti and Finlay on small-sample solutions.
PS4781 Statistical Inference 12 / 24
Correct Interpretation of Confidence Intervals
There is NOT a 95% probability that the parameter is contained
within the confidence interval. The true parameter is a constant
(unfortunately unknown to us), and it either is or is not within the
interval (there is no randomness).
If we repeatedly took random samples of some fixed size n and each
time calculated a 95% CI, in the long run about 95% of the CIs
would contain the population proportion π.
The probability refers to the interval containing π, not to π being in
the interval.
PS4781 Statistical Inference 13 / 24
Correct Interpretation of Confidence Intervals
PS4781 Statistical Inference 14 / 24
Confidence Interval for Sample Mean
In large random samples, the sample mean has approximately a
normal sampling distribution with mean μ and standard error

σ_ȳ = σ/√n

Thus, the confidence interval would simply be ȳ ± z(σ_ȳ)
However, the σ needed to calculate σ_ȳ is unknown, and we can't
calculate it based on the estimate for the sample mean (like with
proportions)
Rather, we must use the sample standard deviation as an estimate for
the population standard deviation: se = s/√n
This works OK for large n, because s is then a good estimate of σ (and
the CLT applies). But for small n, replacing σ by its estimate s
introduces extra error, and the CI is not quite wide enough unless we
replace the z-score by a slightly larger t-score.
PS4781 Statistical Inference 16 / 24
PS4781 Statistical Inference 17 / 24
PS4781 Statistical Inference 18 / 24
The t distribution
Quite similar to the normal distribution. However, the tails are a little
bit thicker, reflecting the notion that a higher proportion of samples
falls far away from the population mean in small samples
Its precise shape depends on the degrees of freedom (df). For
inference about a mean, df = n − 1.
As the sample size increases, we have more degrees of freedom
Above a certain sample size threshold, the t and normal distributions
are nearly identical.
The t-distribution allows us to make inferences for any random
sample size. However, it requires the underlying population
distribution to be normal.
PS4781 Statistical Inference 19 / 24
T table
PS4781 Statistical Inference 20 / 24
Example: Anorexia Study
Weight measured before and after period of treatment
y = weight at end - weight at beginning
For n=17 girls receiving family therapy:
y = 11.4, 11.0, 5.5, 9.4, 13.6, -2.9, -0.1, 7.4, 21.5, -5.3, -3.8, 13.4,
13.1, 9.0, 3.9, 5.7, 10.7
What is the mean population weight change associated with family
therapy?
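In R, the interval can be computed directly from the 17 observations listed above (t.test() uses the t distribution with df = 16 here):

# 95% t confidence interval for the mean weight change
y <- c(11.4, 11.0, 5.5, 9.4, 13.6, -2.9, -0.1, 7.4, 21.5, -5.3, -3.8,
       13.4, 13.1, 9.0, 3.9, 5.7, 10.7)
mean(y)              # point estimate of the mean weight change
t.test(y)$conf.int   # interval estimate based on the t distribution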
PS4781 Statistical Inference 21 / 24
Comments about the t-distribution
The method is robust to violations of the assumption of a normal
population distribution
(But be careful if the sample data distribution is very highly skewed, or if
there are severe outliers. Look at the data!)
PS4781 Statistical Inference 22 / 24
Using Confidence Intervals
What is the variable of interest?
If quantitative: inference about a mean
- If small n: use the t-distribution
- If large n: use the z-distribution
If qualitative: inference about a proportion
- Always use the z-distribution
PS4781 Statistical Inference 24 / 24
Research Design and the Logic of Experimentation
PS4781 Research Design and the Logic of Experimentation 1 / 22
Chapter Outline
1. Comparison as the key to establishing causal relationships
2. Experimental research designs
PS4781 Research Design and the Logic of Experimentation 2 / 22
What is being compared to what?
Making good comparisons is one of the keys to doing social science.
Example from last week: The simple bivariate comparison of
participants in a Head Start program to those who are not in the
program, despite its initial appeal, can be very misleading. (Why?)
If the comparisons we make are faulty, then our conclusions about
whether or not a causal relationship is present are also likely to be
faulty. (But we're never sure one way or the other.)
What to do?
PS4781 Research Design and the Logic of Experimentation 4 / 22
Comparing apples to... nothing
"90% of cold sufferers report significant improvement after 3 days
when using Vicks VapoRub"
Implicit claim: Vicks VapoRub caused the improvements
But most people's colds improve after a few days...
There is no comparison group, therefore we cannot make inferences
PS4781 Research Design and the Logic of Experimentation 5 / 22
Research design
So let's say that we have a theory that says that some X causes some Y.
We don't know whether, in reality, X causes Y. We may be armed
with a theory that suggests that X does, indeed, cause Y, but
theories can be (and often are) wrong or incomplete.
So how do scientists generally, and political scientists in particular, go
about testing whether X causes Y? There are several strategies, or
research designs, that researchers can employ toward that end.
The goal of all types of research designs is to help us evaluate how
well a theory fares as it makes its way over the four causal
hurdles, that is, to answer as conclusively as is possible the question
about whether X causes Y.
PS4781 Research Design and the Logic of Experimentation 6 / 22
Components of a research design
A research design is a plan to answer your research question, and
includes:
A causal theory and implied hypotheses.
A unit of analysis on which the hypothesis operates.
A set of variables, including a dependent variable and an
independent variable
A plan to collect these data
A plan to analyze these data
PS4781 Research Design and the Logic of Experimentation 7 / 22
Two approaches
We will talk about two broad approaches to designing research. The
first is called an experimental design, and it is the benchmark of
scientific research. The second is meant to emulate the first, and is
called an observational study.
We'll cover the former in this video, and the latter in the next video.
PS4781 Research Design and the Logic of Experimentation 8 / 22
An example from day-to-day political life
Suppose that you were a candidate for political office locked in what
seems to be a tight race. You're deciding whether or not to make
some television ad buys for a spot that sharply contrasts your record
with your opponent's.
The campaign manager has had a public-relations firm craft the spot,
and has shown it to you in your strategy meetings. You like it, but
you look to your staff and ask the bottom-line question: Will the ad
work with the voters?
PS4781 Research Design and the Logic of Experimentation 10 / 22
Can you see the causal question?
Exposure to a candidate's negative ad (X) may, or may not, affect a
voter's likelihood of voting for that candidate (Y).
And it is important to add here that the causal claim has a particular
directional component to it; that is, exposure to the advertisement
will increase the chances that a voter will choose that candidate.
How might researchers in the social sciences evaluate such a causal
claim?
PS4781 Research Design and the Logic of Experimentation 11 / 22
Think about the comparison
How can we most effectively make a comparison to answer our causal
question?
It is very important, and not at all surprising, to realize that voters
may vote for or against you for a variety of reasons (Zs) that have
nothing to do with exposure to the ad: varying socioeconomic
statuses, varying ideologies, and party identifications can all cause
voters to favor one candidate over another.
So how can we establish whether or not, among these other influences
(Z), the advertisement (X) also causes voters to be more likely to
vote for you (Y)?
PS4781 Research Design and the Logic of Experimentation 12 / 22
The possibly confounding effects of political interest in the
advertisement-vote intention relationship
[Diagram: Exposure to campaign advertisement (X) → Voter's likelihood of supporting a particular candidate (Y), with Interest in politics (Z) as a potential influence on both.]
PS4781 Research Design and the Logic of Experimentation 13 / 22
An experiment
The word "experiment" has many uses in the English language, but in
this class we'll use it in a rather precise way.
An experiment is a research design in which the researcher both
controls and randomly assigns values of the independent variable to
the subjects.
These two components, control and random assignment, form a
necessary and sufficient definition of an experiment.
PS4781 Research Design and the Logic of Experimentation 14 / 22
Control
What does it mean to say that a researcher "controls" the value of
the independent variable that the subjects receive?
It means, most importantly, that the values of the independent
variable that the subjects receive are not determined either by the
subjects themselves, or by nature.
In our campaign-advertising example, this requirement means that we
cannot compare people who, by their own choice, already view the ad
to those who do not (in this case the choice of whether or not to view
the ad is a Z variable that may exert an influence on Y separate from X).
It means that we, the researchers, have to decide which of our
experimental subjects will view the ad and which ones will not.
PS4781 Research Design and the Logic of Experimentation 15 / 22
Random assignment
We, the researchers, not only must control the values of the
independent variable, but we must also assign those values to subjects
randomly.
In the context of our campaign-ad example, this means that we must
toss coins, draw numbers out of a hat, use a random-number
generator, or some other such mechanism to ensure that our subjects
are divided into a treatment group (who will view the ad) and a
control group (who will not view it, but will instead presumably
watch something innocuous, akin to a placebo).
PS4781 Research Design and the Logic of Experimentation 16 / 22
The benefits of control and random assignment
Why is randomly assigning subjects to treatment groups important?
What scientific benefits arise from the random assignment of subjects
to treatment groups?
Recall that we have emphasized that all science is about comparisons,
and also that any interesting dependent variable is caused by many
factors, not just one.
Random assignment to treatment groups ensures that the comparison
we make between the treatment group and the control group is as
pure as possible, and that some other cause (Z) of the dependent
variable will not pollute that comparison. What we have ensured,
through random assignment, is that the subjects will not be
systematically different from one another.
And this is radically different from any non-experimental design (as
we'll see).
PS4781 Research Design and the Logic of Experimentation 17 / 22
Experiments and internal validity
How do experiments help us cross the four hurdles? Take them one at a time.
1. Is there a credible causal mechanism that connects X to Y?
2. Can we rule out the possibility that Y could cause X?
3. Is there covariation between X and Y?
4. Have we controlled for all confounding variables Z that might make the
association between X and Y spurious?
Because experiments deal with the fourth hurdle so effectively, they
are said to have high degrees of internal validity; that is, the
inferences we make about whether X causes Y or not are likely to be
correct.
PS4781 Research Design and the Logic of Experimentation 18 / 22
Drawbacks to experiments
1. Can we really assign X to subjects? What if we are interested in
something that cannot (logically or feasibly) be manipulated? Gender,
race, religiosity, wealth? Democracy, war, trade, economic development?
2. Are there ethical considerations?
3. What about external validity, the extent to which our conclusions
generalize to different populations, times, and settings?
- Samples of convenience and replication
- External validity of the stimulus
PS4781 Research Design and the Logic of Experimentation 19 / 22
Examples of Threats to External Validity
Subjects are not a random sample from the population; most social
science experiments use college undergrads
This is problematic if we expect the effects of the treatment to differ
based on age, education, or anything else that systematically differs
between college undergrads and the general population
Note the difference between random assignment and random sampling:
the latter is not needed for internal validity, only external validity
PS4781 Research Design and the Logic of Experimentation 20 / 22
Examples of Threats to External Validity
Location, location, location: artificiality of the situation
Subjects are aware that they are in an experiment and being watched
(the Hawthorne Effect)
The stimulus may be too strong (negative ads in real life vs. in an
experiment)
Possible biases of subjects and experimenters are the reason for
double-blind studies
PS4781 Research Design and the Logic of Experimentation 21 / 22
A note on the homework
Table: Deaths in the first five years of the HIP screening trial, by cause.

                              Breast cancer          All other
                   Sample              Rate                  Rate
                   size      Number    (per 1000)   Number   (per 1000)
Treatment Group
  Examined         20,200    23        1.1          428      21
  Refused          10,800    16        1.5          409      38
  Total            31,000    39        1.3          837      27
Control Group      31,000    63        2.0          879      28
PS4781 Research Design and the Logic of Experimentation 22 / 22
Evaluating Causal Relationships
PS4781 Evaluating Causal Relationships 1 / 15
Chapter Outline
1. Four hurdles along the route to establishing causal relationships
2. Why is studying causality so important? Three examples from political science
PS4781 Evaluating Causal Relationships 2 / 15
The focus on causality
Recall that the goal of political science (and all science) is to evaluate
causal theories.
Bear in mind that establishing causal relationships between variables
is not at all akin to hunting for DNA evidence like some episode from
a television crime drama. Social reality does not lend itself to such
simple, cut-and-dried answers.
Is there a "best practice" for trying to establish whether X causes Y?
PS4781 Evaluating Causal Relationships 4 / 15
The four causal hurdles
1. Is there a credible causal mechanism that connects X to Y?
2. Can we rule out the possibility that Y could cause X?
3. Is there covariation between X and Y?
4. Have we controlled for all confounding variables Z that might make
the association between X and Y spurious?
PS4781 Evaluating Causal Relationships 5 / 15
Hurdle 1
What do we mean by "Is there a credible causal mechanism that
connects X to Y?"
Can you answer the "how" and "why" questions?
PS4781 Evaluating Causal Relationships 6 / 15
Hurdle 2
Can we rule out the possibility that Y could cause X?
It's possible that X → Y, Y → X, X ↔ Y, or neither.
PS4781 Evaluating Causal Relationships 7 / 15
Hurdle 3
Is there covariation between X and Y?
This is the easiest one. No, correlation is not causation, but it's
normally a key component of causation.
PS4781 Evaluating Causal Relationships 8 / 15
Hurdle 4
Have we controlled for all confounding variables Z that might make
the association between X and Y spurious?
This is the toughest hurdle to cross in most social sciences.
PS4781 Evaluating Causal Relationships 9 / 15
But what if we don't cross that fourth hurdle?
A substantial portion of disagreements between scholars boil down to
this fourth causal hurdle. When one scholar is evaluating another's
work, perhaps the most frequent objection is that the researcher
failed to control for some potentially important cause of the
dependent variable.
So long as a credible case can be made that some uncontrolled-for Z
might be related to both X and Y, we cannot conclude with full
confidence that X indeed causes Y. Since the main goal of science is
to establish whether causal connections between variables exist,
failing to control for other causes of Y is a potentially serious
problem.
PS4781 Evaluating Causal Relationships 10 / 15
Applying the four hurdles
Are we talking merely about political science research here?
Absolutely not.
Where else would these thinking skills apply?
PS4781 Evaluating Causal Relationships 11 / 15
Life satisfaction and democratic stability
What is the relationship between life satisfaction in the mass public
and the stability of democratic institutions?
Inglehart (1988) argues that life satisfaction (X) causes democratic
system stability (Y). If people in a democratic nation are more
satisfied with their lives, they will be less likely to want to overthrow
their government.
Stay focused on the causal hurdles. How can we evaluate this claim?
PS4781 Evaluating Causal Relationships 13 / 15
Race and political participation in the U.S.
What is the relationship between an individuals race and the amount
of political participation that individual engages in?
Many scholars have noticed that Anglos participate more in politics
than do African Americans. But is that relationship causal?
What confounding (Z) variables would we need to control for that
might shed light on this relationship?
PS4781 Evaluating Causal Relationships 14 / 15
Evaluating whether Head Start is eective
Does attending Head Start affect Kindergarten readiness?
What confounding (Z) variables would we need to control for that
might shed light on this relationship?
PS4781 Evaluating Causal Relationships 15 / 15
The Logic of Hypothesis Testing
PS4781 The Logic of Hypothesis Testing 1 / 14
Chapter Outline
1. Significance Tests
2. Five Parts of a Significance Test
   - Assumptions
   - Specifying Hypotheses
   - Test Statistic
   - P-value
   - Conclusion
PS4781 The Logic of Hypothesis Testing 2 / 14
Our Goal
Use statistical methods to test hypotheses such as:
For treating anorexia, cognitive behavioral and family therapies have
the same mean weight change as placebo (no effect)
Mental health tends to be better at higher levels of socioeconomic
status (SES) (i.e., there is an effect)
Spending money on other people has a more positive impact on
happiness than spending money on oneself
PS4781 The Logic of Hypothesis Testing 4 / 14
Our Goal
It is appropriate to use hypothesis tests to answer whether two variables
are related
They can also be used to see if a treatment has an effect in a well-designed
experimental study (i.e., when we can be sure that there are no Zs
lurking in the background)
We wouldn't use such tests on observational data to establish a causal
relationship
However, the same logic applies to more complex methods
PS4781 The Logic of Hypothesis Testing 5 / 14
Correspondence with Confidence Intervals
Hypothesis tests are in many ways the flip side of confidence intervals
Remember, confidence intervals give us an interval of numbers
believed to contain the true parameter value.
Significance tests, however, involve choosing a specific parameter
value and asking: If this were the true parameter value, how likely
would it be to observe the sample data we have?
PS4781 The Logic of Hypothesis Testing 6 / 14
Assumptions
When we test hypotheses we make assumptions about:
Type of data (categorical vs. quantitative)
Sampling method (random or not)
Population distribution of the parameter (e.g., normal)
Sample size (large enough?)
The choice of hypothesis test depends on what kind of assumptions we are
willing to make.
PS4781 The Logic of Hypothesis Testing 8 / 14
Specifying Hypotheses
Hypothesis tests always take on a particular form.
We specify a null hypothesis (H₀) and an alternative hypothesis (Hₐ).
H₀ is a statement that a parameter takes a specific value (usually: no
effect, or some critical threshold)
Hₐ states that the parameter value falls in some alternative range of
values (an effect)
PS4781 The Logic of Hypothesis Testing 9 / 14
Test Statistic
Compares the data to what H₀ predicts, often by finding the number of
standard errors between the sample point estimate and the H₀ value of
the parameter.
We are already familiar with test statistics; remember Z-scores and
t-scores from week 5?
PS4781 The Logic of Hypothesis Testing 10 / 14
P-value
A probability measure of evidence about H₀.
The probability of observing a test statistic as extreme or more
extreme (in the direction predicted by Hₐ) as the one observed, if H₀
were true.
The smaller the P-value, the stronger the evidence against H₀.
PS4781 The Logic of Hypothesis Testing 11 / 14
Example of P-value Interpretation
Say we compare the annual income of men and women, with the null
hypothesis that they are the same and the alternative hypothesis that
men have higher income than women.
Formally, H₀: μ₁ − μ₂ = 0 and Hₐ: μ₁ − μ₂ > 0
Say we get a p-value of 0.01
Interpretation: There is only a 1% probability of observing the results
we got, or more extreme results in the direction of Hₐ, if H₀ were true
Ergo, if income did not differ by sex in the population, the observed
results from our sample would be very unlikely
PS4781 The Logic of Hypothesis Testing 12 / 14
Conclusion
If no decision is needed, report and interpret the P-value.
If a decision is needed, select a cutoff point (such as 0.05 or 0.01) and
reject H₀ if the P-value ≤ that value
If, say, the P-value ≤ 0.05, we say that the results are significant at the
0.05 level.
If the P-value is not sufficiently small, we fail to reject H₀ (i.e., we
NEVER accept the null hypothesis)
The cutoff point corresponds to the confidence level for confidence
intervals
PS4781 The Logic of Hypothesis Testing 13 / 14
Analogous to the American Judicial System
H₀: Defendant is innocent
Hₐ: Defendant is guilty
If the defendant is innocent, how unlikely would it be to observe the
evidence against him?
If there is reasonable doubt, the jury fails to reject H₀.
That does not mean that the defendant IS innocent (H₀ hasn't been
accepted). Just that, given the evidence, we cannot reject the
possibility that he's innocent.
Likewise, when we fail to reject the null hypothesis of (say) no effect, it
doesn't mean that there IS no effect.
PS4781 The Logic of Hypothesis Testing 14 / 14
Introduction to the R Statistical Computing
Environment
Agnar F. Helgason
The Ohio State University
PS4781 Introduction to the R Statistical Computing Environment 1 / 9
Chapter Outline
1. What is R?
2. Installing R and RStudio
PS4781 Introduction to the R Statistical Computing Environment 2 / 9
What is R?
A statistical programming language and computing environment
It is different from a statistical package, such as SPSS
Packages offer point-and-click menus. Standard analysis is easy;
anything unconventional is hard to accomplish.
R requires you to type all commands to program. It is more difficult to
get started, but standard analysis and innovation are easy after an
initial time investment.
PS4781 Introduction to the R Statistical Computing Environment 4 / 9
Why choose R?
The de facto standard programming language for statisticians
Incredibly powerful: it can do anything from simple arithmetic to
complex analyses of millions of data points
It can create beautiful graphs with relatively little effort
Easy to use (for a programming language)
It's free!
PS4781 Introduction to the R Statistical Computing Environment 5 / 9
How does R work?
R includes basic operations and a simple user interface
However, R is extendable: you can download free packages that
add functionality to R
You can also add a more pleasing graphical interface by installing
RStudio. Definitely recommended!
PS4781 Introduction to the R Statistical Computing Environment 6 / 9
Installing R
Visit http://cran.r-project.org/
Download the appropriate version of R (Windows/Linux/Mac)
Once downloaded, run the installer
You might want to consider changing the installation directory to
c:/R/R-3.0.1
Otherwise, use default options
PS4781 Introduction to the R Statistical Computing Environment 8 / 9
Installing RStudio
Visit http://www.rstudio.com/
Download the desktop version
Once downloaded, run the installer using the default options
RStudio links automatically to the previously installed R
PS4781 Introduction to the R Statistical Computing Environment 9 / 9
Logistic Regression
PS4781 Logistic Regression 1 / 12
The basics of the BNL and BNP models
To understand the BNL (a.k.a. logit or logistic regression) and BNP
(a.k.a. probit), let's first rewrite the LPM from the last video in terms
of a probability statement:

Pᵢ = P(Yᵢ = 1) = α + β₁Party IDᵢ + β₂War Evaluationᵢ + β₃Economic Evaluationᵢ + uᵢ.
This is just a way of expressing the probability part of the LPM in a
formula in which P(Yᵢ = 1) translates to "the probability that Yᵢ is
equal to one," which in the case of our running example is the
probability that the individual cast a vote for Bush. We then further
collapse this to

Pᵢ = P(Yᵢ = 1) = α + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ,
PS4781 Logistic Regression 2 / 12
More basics of BNL and BNP models
This reduces further to:

Pᵢ = P(Yᵢ = 1) = Xᵢβ + uᵢ,

where we define Xᵢβ as the systematic component of Y such that

Xᵢβ = α + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ.

The term uᵢ continues to represent the stochastic or random
component of Y. So if we think about our predicted probability for a
given case, we can write this as

Ŷᵢ = P̂ᵢ = P̂(Yᵢ = 1) = Xᵢβ̂ = α̂ + β̂₁X₁ᵢ + β̂₂X₂ᵢ + β̂₃X₃ᵢ.
PS4781 Logistic Regression 3 / 12
From the LPM to the BNL...
A BNL model with the same variables would be written as:

Pᵢ = P(Yᵢ = 1) = Λ(α + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ) = Λ(Xᵢβ + uᵢ).

The predicted probabilities from this model would be written as

P̂ᵢ = P̂(Yᵢ = 1) = Λ(α̂ + β̂₁X₁ᵢ + β̂₂X₂ᵢ + β̂₃X₃ᵢ) = Λ(Xᵢβ̂).
PS4781 Logistic Regression 4 / 12
... to the BNP
A BNP with the same variables would be written as:

Pᵢ = P(Yᵢ = 1) = Φ(α + β₁X₁ᵢ + β₂X₂ᵢ + β₃X₃ᵢ + uᵢ) = Φ(Xᵢβ + uᵢ).

The predicted probabilities from this model would be written as:

P̂ᵢ = P̂(Yᵢ = 1) = Φ(α̂ + β̂₁X₁ᵢ + β̂₂X₂ᵢ + β̂₃X₃ᵢ) = Φ(Xᵢβ̂).
PS4781 Logistic Regression 5 / 12
Link functions
The difference between the BNL model and the LPM is the Λ, and
the difference between the BNP model and the LPM is the Φ.
Λ and Φ are known as link functions. A link function links the linear
component of a logit or probit model, Xᵢβ, to the quantity in which
we are interested, the predicted probability that the dummy
dependent variable equals one, P̂(Yᵢ = 1) or P̂ᵢ.
A major result of using these link functions is that the relationship
between our independent and dependent variables is no longer
assumed to be linear. In the case of a logit model, the link function,
abbreviated as Λ, uses the cumulative logistic distribution function
(and thus the name "logit") to link the linear component to the
probability that Yᵢ = 1.
In the case of the probit function, the link function, abbreviated as Φ,
uses the cumulative normal distribution function to link the linear
component to the predicted probability that Yᵢ = 1.
PS4781 Logistic Regression 6 / 12
Understanding the BNL and BNP: An example
The best way to understand how the LPM, BNL, and BNP work
similarly to and differently from each other is to look at them all with
the same model and data.
PS4781 Logistic Regression 7 / 12
The effects of partisanship and performance evaluations on
votes for Bush in 2004: Three different types of models

                                     LPM       BNL       BNP
Party Identification                 0.09***   0.82***   0.45***
                                     (0.01)    (0.09)    (0.04)
Evaluation: War on Terror            0.08***   0.60***   0.32***
                                     (0.01)    (0.09)    (0.05)
Evaluation: Health of the Economy    0.08***   0.59***   0.32***
                                     (0.01)    (0.10)    (0.06)
Intercept                            0.60***   1.11***   0.58***
                                     (0.01)    (0.20)    (0.10)

Notes:
The dependent variable is equal to one if the respondent voted for Bush and
equal to zero if they voted for Kerry. Standard errors in parentheses.
Two-sided significance tests: *** p < .01; ** p < .05; * p < .10
PS4781 Logistic Regression 8 / 12
Interpretation
Across the three models the parameter estimate for each independent
variable has the same sign and significance level. But it is also
apparent that the magnitude of these parameter estimates is different
across the three models. This is mainly due to the difference in link
functions.
To better illustrate the differences between the three models
presented on the previous slide, let's plot the predicted probabilities
from the models. These predicted probabilities are for an individual
who strongly approved of the Bush administration's handling of the
war on terror but who strongly disapproved of the Bush
administration's handling of the economy.
PS4781 Logistic Regression 9 / 12
Three dierent models of Bush vote
PS4781 Logistic Regression 10 / 12
Interpretation
The horizontal axis in this figure is this individual's party
identification, ranging from strong Democratic Party identifiers on the
left end to strong Republican Party identifiers on the right end. The
vertical axis is the predicted probability of voting for Bush.
We can see from this figure that the three models make very similar
predictions. The main differences come as we move away from a
predicted probability of 0.5.
The LPM line has, by definition, a constant slope across the entire
range of X. The BNL and BNP lines of predicted probabilities change
their slope such that they slope more and more gently as we move
farther from predicted probabilities of 0.5.
The differences between the BNL and BNP lines are trivial. This
means that the effect of a movement in Party Identification on the
predicted probability is constant for the LPM. But for the BNL and
BNP, the effect of a movement in Party Identification depends on the
value of the other variables in the model.
The logit link function
The logit link function may seem complex:

\log\left(\frac{\hat{P}(y = 1)}{1 - \hat{P}(y = 1)}\right) = \alpha + \beta x

But there are some neat features we can find in the formula. The right-hand side of the formula contains the parameter estimates we would get from a regression. However, we can't directly translate them into predicted probabilities, because there is more than just \hat{P}(y = 1) on the left-hand side.
With some fairly straightforward algebra, we can isolate that term and find that

\hat{P}(y = 1) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
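A minimal numerical check of this algebra (our own sketch, not from the original slides; the probability 0.75 is an arbitrary example):

import math

p = 0.75
log_odds = math.log(p / (1 - p))                        # the left-hand side
p_back = math.exp(log_odds) / (1 + math.exp(log_odds))  # the isolated form
print(log_odds, p_back)                                 # p_back recovers 0.75

Starting from any probability, taking the log-odds and then applying the inverse-logit formula returns the original probability, which confirms that the two expressions are equivalent.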
Linear Probability Model
Different Types of Dependent Variables
So far, we have discussed multiple regression in the context of a
continuous dependent variable.
But what if we have a dummy dependent variable, a variable that takes only two values?
These are actually very common in political science: Did the
respondent vote? Was the incumbent reelected? Did the two
countries engage in conict?
So your dependent variable is a dummy?
Very often, this means that we need to move to a statistical model
other than OLS if we want to get reasonable estimates for our
hypothesis testing.
One exception to this is the linear probability model (LPM). The LPM is an OLS model in which the dependent variable is a dummy variable. It is called a probability model because we can interpret the \hat{Y} values as predicted probabilities.
But, as we will see, it is not without problems. Because of these problems, most political scientists do not use the LPM.
An example of the Linear Probability Model
As an example of a dummy dependent variable, we use the choice
that most U.S. voters in the 2004 presidential election made between
voting for the incumbent George W. Bush and his Democratic
challenger John Kerry.
Our dependent variable, which we will call Bush, is equal to one for
respondents who reported voting for Bush and equal to zero for
respondents who reported voting for Kerry.
For our model we theorize that the decision to vote for Bush or Kerry is a function of an individual's partisan identification (ranging from -3 for strong Democrats, to 0 for independents, to +3 for strong Republican identifiers) and their evaluations of the job that Bush did in handling the war on terror and the health of the economy (both of these evaluations range from +2 for "approve strongly" to -2 for "disapprove strongly").
The model
The formula for this model is:

\text{Bush}_i = \alpha + \beta_1 \text{Party ID}_i + \beta_2 \text{War Evaluation}_i + \beta_3 \text{Economic Evaluation}_i + u_i.
The effects of partisanship and performance evaluations on
votes for Bush in 2004

Independent variable                 Parameter estimate
Party Identification                 0.09***
                                     (0.01)
Evaluation: War on Terror            0.08***
                                     (0.01)
Evaluation: Health of the Economy    0.08***
                                     (0.01)
Intercept                            0.60***
                                     (0.01)
N                                    780
R^2                                  .73

Notes: The dependent variable is equal to one if the respondent voted for Bush
and equal to zero if they voted for Kerry. Standard errors in parentheses.
Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Interpretation
We can see from the table that all of the parameter estimates are statistically significant in the expected (positive) direction.
Not surprisingly, we see that people who identified with the Republican Party and who had more approving evaluations of the president's handling of the war and the economy were more likely to vote for him.
This model performs pretty well overall, with an R^2 statistic equal to .73.
Predicted values
To examine how the interpretation of this model differs from that of a regular OLS model, let's calculate some individual \hat{Y} values. We know from the table that the formula for \hat{Y} is

\hat{Y}_i = 0.6 + 0.09 \times \text{Party ID}_i + 0.08 \times \text{War Evaluation}_i + 0.08 \times \text{Economic Evaluation}_i.

For a respondent who reported being a pure independent (Party ID = 0) with a somewhat approving evaluation of Bush's handling of the war on terror (War Evaluation = 1) and a somewhat disapproving evaluation of Bush's handling of the health of the economy (Economic Evaluation = -1), we would calculate \hat{Y}_i as follows:

\hat{Y}_i = 0.6 + (0.09 \times 0) + (0.08 \times 1) + (0.08 \times -1) = 0.6.
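The same arithmetic as a small function (our own sketch; the coefficients are the LPM estimates from the table above):

def lpm_predict(party_id, war_eval, econ_eval):
    # Y-hat = intercept + b1*PartyID + b2*WarEval + b3*EconEval
    return 0.60 + 0.09 * party_id + 0.08 * war_eval + 0.08 * econ_eval

print(lpm_predict(0, 1, -1))   # the pure independent above: 0.6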
Predicted probabilities
One way to interpret this predicted value is to think of it as a
predicted probability that the dummy dependent variable is equal to
one, or, in other words, the predicted probability of this respondent
voting for Bush.
Using the example for which we just calculated \hat{Y}_i, we would predict that such an individual would have a 0.6 probability (or a 60% chance) of voting for Bush in 2004. As you can imagine, if we change the values of our three independent variables, the predicted probability of the individual voting for Bush changes correspondingly.
This means that the LPM is a special case of OLS for which we can think of the predicted values of the dependent variable as predicted probabilities.
We represent predicted probabilities for a particular case as \hat{P}_i or \hat{P}(Y_i = 1), and we can summarize this special property of the LPM as

\hat{P}_i = \hat{P}(Y_i = 1) = \hat{Y}_i.
Problems with the LPM
One of the problems with the LPM comes when we arrive at extreme
values of the predicted probabilities.
Consider, for instance, a respondent who reported being a strong Republican (Party ID = 3) with a strongly approving evaluation of Bush's handling of the war on terror (War Evaluation = 2) and a strongly approving evaluation of Bush's handling of the health of the economy (Economic Evaluation = 2). For this individual, we would calculate \hat{P}_i as follows:

\hat{P}_i = \hat{Y}_i = 0.6 + (0.09 \times 3) + (0.08 \times 2) + (0.08 \times 2) = 1.19.

This means that we would predict that such an individual would have a 119% chance of voting for Bush in 2004. Such a predicted probability is, of course, nonsensical, because probabilities cannot be smaller than zero or greater than one.
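A minimal sketch (our own illustration) of why the logistic link fixes this boundary problem: however large the linear index gets, the inverse logit maps it into (0, 1). The logit coefficients below reuse the BNL estimates from the three-model table purely for illustration, with signs as printed there.

import math

def inverse_logit(xb):
    return math.exp(xb) / (1 + math.exp(xb))

lpm = 0.60 + 0.09 * 3 + 0.08 * 2 + 0.08 * 2   # = 1.19, outside [0, 1]
bnl = inverse_logit(1.11 + 0.82 * 3 + 0.60 * 2 + 0.59 * 2)
print(lpm, bnl)   # the BNL prediction stays strictly below one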
Problems with and alternatives to the LPM
The LPM has two potentially more serious problems: heteroscedasticity and functional form.
We do not cover the former; suffice it to say that a model that is heteroscedastic is not a good model.
The problem of functional form is related to the assumption of parametric linearity. In the context of the LPM, this assumption amounts to saying that the impact of a one-unit increase in an independent variable X is equal to \beta regardless of the value of X or of any other independent variable. This assumption may be problematic for LPMs because the effect of a change in an independent variable may be greater for cases whose predicted probability would otherwise be near 0.5 than for cases whose predicted probability would otherwise be close to zero or one.
Problems with and alternatives to the LPM
For these reasons, the typical political science solution to having a
dummy dependent variable is to avoid using the LPM. Most
applications that you will come across in political science research will
use a binomial logit (BNL) or binomial probit (BNP) model instead
of the LPM.
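In practice, estimating a BNL or BNP instead of an LPM is a one-line change in most statistical software. A minimal sketch (our own, using simulated data with hypothetical coefficients, and assuming the statsmodels library is installed):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
party = rng.integers(-3, 4, n)                    # -3 to +3, as in the example
war = rng.integers(-2, 3, n)                      # -2 to +2
econ = rng.integers(-2, 3, n)                     # -2 to +2
xb = -1.0 + 0.8 * party + 0.6 * war + 0.6 * econ  # hypothetical true model
y = rng.binomial(1, 1 / (1 + np.exp(-xb)))        # simulated vote choice

X = sm.add_constant(np.column_stack([party, war, econ]))
bnl = sm.Logit(y, X).fit(disp=0)    # binomial logit
bnp = sm.Probit(y, X).fit(disp=0)   # binomial probit
print(bnl.params)   # roughly recovers the simulated coefficients
print(bnp.params)   # same signs, smaller magnitudes, as in the table above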
Descriptive Statistics II:
Measures of Centrality and Dispersion
Agnar F. Helgason
The Ohio State University
Measures of central tendency and variation
There are two types of descriptive statistics that are most relevant in
the social sciences:
1. Measures of central tendency tell us about typical values for a particular variable.
2. Measures of variation (or dispersion) tell us about the distribution (or spread, or range) of values that a variable takes across the cases for which we measure it.
Importantly, different descriptive statistics are appropriate (and inappropriate) for different variable types.
Measures of central tendency and variation
Central tendency: Mode, median, mean
Dispersion: Standard deviation, Interquartile range
Chapter Outline
1. Measures of Centrality
2. Measures of Dispersion
Nominal variables
The only measure of central tendency that is appropriate for a nominal variable is the mode, which we define as the most frequently occurring value.
Neither the median nor the mean makes any sense for nominal variables.
The simplest thing to do is create a table or a graph.
Example
Table: Frequency table for religious identification in the 2004 ANES

Category      Number of Cases    Percent
Protestant    672                56.14
Catholic      292                24.39
Jewish         35                 2.92
Other          17                 1.42
None          181                15.12
Ordinal variables
With ordinal variables we can also use the median value of the
variable. The median value is the value of the case that sits at the
exact center of our cases when we rank them from the smallest to the
largest observed values.
When we have an even number of cases, we average the value of the
two center-most ranked cases to obtain the median value. This is also
known as the value of the variable at the 50 percent rank.
Interval variables
Finally, we can use the mean value, or average value, to describe interval variables. For a variable Y, the mean value is depicted and calculated as:

\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}

where \bar{Y}, known as "Y-bar," indicates the mean of Y, which is equal to the sum of the values of Y across all individual cases, Y_i, divided by the total number of cases, n.
Example of median and mean
Annual per capita carbon dioxide emissions (metric tons) for the n = 8 largest nations in population size:

Bangladesh     0.3
Brazil         1.8
China          2.3
Indonesia      1.4
Pakistan       0.7
India          1.2
Russia         9.9
U.S.          20.1

Ordered sample: 0.3, 0.7, 1.2, 1.4, 1.8, 2.3, 9.9, 20.1
Median = (1.4 + 1.8)/2 = 1.6
Mean = (0.3 + 0.7 + 1.2 + ... + 20.1)/8 = 4.7
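The same calculation with Python's standard library (our own sketch, not from the original slides):

import statistics

co2 = [0.3, 1.8, 2.3, 1.4, 0.7, 1.2, 9.9, 20.1]
print(statistics.median(co2))   # 1.6 (sorts the data internally)
print(statistics.mean(co2))     # 4.7125, reported above as 4.7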
Properties of mean and median
For symmetric distributions, mean = median
For skewed distributions, mean is drawn in direction of longer tail,
relative to median
Mean sensitive to outliers (median often preferred for highly skewed
distributions)
When distribution symmetric or mildly skewed or discrete with few
values, mean preferred because uses numerical values of observations
Example
New York Yankees baseball team, 2006:
  mean salary = $7.0 million
  median salary = $2.9 million
How is this possible? What is the direction of the skew?
Variation: The standard deviation
The most common measure of variation is the standard deviation:

sd(Y) = s_Y = \sqrt{\mathrm{var}(Y)} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}

Roughly speaking, this is the average difference between the values of Y (the Y_i) and the mean of Y (\bar{Y}). At first glance, this may not be apparent. But the important thing to understand about this formula is that the purpose of squaring each difference from the mean, and then taking the square root of the resulting sum of squared deviations, is to keep the negative and positive deviations from canceling each other out.
An example
Consider the following exam scores on a 7-point scale: y = 2, 3, 7, 5, 6, 7, 5, 6, 4
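Working through the formula on these scores (our own sketch): the mean is 5, the squared deviations sum to 24, and dividing by n - 1 = 8 before taking the square root gives s of about 1.73.

import math

y = [2, 3, 7, 5, 6, 7, 5, 6, 4]
n = len(y)
y_bar = sum(y) / n                          # mean = 5.0
ss = sum((yi - y_bar) ** 2 for yi in y)     # sum of squared deviations = 24.0
s = math.sqrt(ss / (n - 1))                 # standard deviation ~ 1.73
print(y_bar, s)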
Properties of the standard deviation
s >= 0, and s = 0 only if all observations are equal
s increases with the amount of variation around the mean
Division by n - 1 (not n) is due to technical reasons (discussed later)
s depends on the units of the data
Like the mean, s is affected by outliers
Empirical rule
If distribution is approx. bell-shaped:
about 68% of data within 1 standard dev. of mean
about 95% of data within 2 standard dev. of mean
nearly all data within 3 standard dev. of mean
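A quick simulation check of the empirical rule (our own sketch, assuming NumPy is available):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=100_000)   # bell-shaped data
for k in (1, 2, 3):
    share = np.mean(np.abs(x) <= k)            # share within k standard devs
    print(k, round(share, 3))                  # ~0.683, ~0.954, ~0.997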
Alternative measure of dispersion: IQR
pth percentile: p percent of observations fall below it, (100 - p) percent above it.
p = 50: median
p = 25: lower quartile (LQ)
p = 75: upper quartile (UQ)
Interquartile range: IQR = UQ - LQ
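Quartiles and the IQR for the exam scores above (our own sketch, assuming NumPy):

import numpy as np

y = [2, 3, 7, 5, 6, 7, 5, 6, 4]
lq, median, uq = np.percentile(y, [25, 50, 75])
print(lq, median, uq, uq - lq)   # the last value is the IQR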
Back to Music...
Mode = 1
Median = 1
Mean = 1.83
Min = 1; Max = 20
LQ = 1; UQ = 2
s = 2.08
Significance Tests for Means and Proportions for One Population
Drawing inferences from a sample to a population
This video: we aren't interested in differences between groups of the population (hence, no independent variable)
Rather, we are asking questions about specific parameter values
Is there more than 50% support for Obama?
Is average income in the U.S. higher than $70,000?
Very similar procedures to the ones from week 5
Chapter Outline
1. Significance Test for Proportion
2. Significance Test for Mean
3. Errors in decisions about significance
Significance Test for Proportion
Assumptions: Randomization, categorical variable, large-ish sample
Null hypothesis: H_0: \pi = \pi_0, where \pi_0 is a particular value for the population proportion (e.g., the 0.50 threshold for a majority)
Alternative hypothesis:
  2-sided: H_a: \pi \neq \pi_0
  1-sided: H_a: \pi < \pi_0
  1-sided: H_a: \pi > \pi_0
Usually, we use the 2-sided alternative, to err on the side of caution (it's easier to find significant relationships with 1-sided tests)
Significance Test for Proportion
Test statistic: The number of standard errors that the sample proportion falls from the H_0 value:

z = \frac{\hat{\pi} - \pi_0}{se_0} = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}

NOTE: We use \pi_0 in the standard error measure, not \hat{\pi} (as when we are forming confidence intervals)
In words: (estimate of parameter - H_0 value)/(standard error) = number of standard errors the estimate falls away from the H_0 value
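A worked example (our own sketch with made-up numbers): suppose a poll finds 540 of 1,000 respondents support a candidate, and we test H_0: \pi = 0.50 against the two-sided alternative.

import math
from scipy.stats import norm

n, successes, pi0 = 1000, 540, 0.50
pi_hat = successes / n
se0 = math.sqrt(pi0 * (1 - pi0) / n)    # standard error under H_0
z = (pi_hat - pi0) / se0                # ~ 2.53
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-tail probability, ~ 0.011
print(z, p_value)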
Recall intuition behind sampling distribution from Week 4
Significance Test for Proportion
P-value: Under the presumption that H_0 is true, the probability that the z test statistic equals the observed value or something even more extreme (i.e., larger in absolute value), which would provide even stronger evidence against H_0
This is a two-tail probability for the two-sided H_a (i.e., our sample can fall in either tail of the sampling distribution)
Conclusion: Report and interpret the P-value. If needed, make a decision about H_0
Significance Test for Mean
Assumptions: Randomization, quantitative variable, normal population distribution (robustness?)
Null hypothesis: H_0: \mu = \mu_0, where \mu_0 is a particular value for the population mean (typically "no effect" or "no change" from some standard)
Alternative hypothesis:
  2-sided: H_a: \mu \neq \mu_0
  1-sided: H_a: \mu < \mu_0
  1-sided: H_a: \mu > \mu_0
Usually, we use the 2-sided alternative, to err on the side of caution (it's easier to find significant relationships with 1-sided tests)
Significance Test for Mean
Test statistic: The number of standard errors that the sample mean falls from the H_0 value:

t = \frac{\bar{y} - \mu_0}{se}, where se = s / \sqrt{n}

\bar{y} is the sample mean, s is the sample standard deviation, and n is the sample size.
When H_0 is true, the sampling distribution of the t test statistic is the t distribution with df = n - 1.
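A worked example (our own sketch with made-up incomes) testing H_0: \mu = 70,000 against the two-sided alternative:

import math
import statistics
from scipy.stats import t as t_dist

y = [72000, 65000, 81000, 69000, 74000, 78000, 62000, 90000]
n = len(y)
y_bar = statistics.mean(y)
s = statistics.stdev(y)                 # sample sd, divides by n - 1
se = s / math.sqrt(n)
t_stat = (y_bar - 70000) / se
p_value = 2 * (1 - t_dist.cdf(abs(t_stat), df=n - 1))
print(t_stat, p_value)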
Significance Test for Mean
P-value: Under the presumption that H_0 is true, the probability that the t test statistic equals the observed value or something even more extreme (i.e., larger in absolute value), which would provide even stronger evidence against H_0
This is a two-tail probability for the two-sided H_a
Conclusion: Report and interpret the P-value. If needed, make a decision about H_0
Decisions in Tests
Recall, when we need to make a decision about the results of a test, we specify a significance level \alpha
Usually, \alpha = 0.01 or 0.05 (which lead to 0.99 and 0.95 confidence levels, respectively)
Decisions for \alpha = 0.05:

P-value    H_0 conclusion    H_a conclusion
<= 0.05    Reject            Accept
> 0.05     Do not reject     Do not accept
Errors in Decisions
Note that our conclusions will have error: we only observe a sample from the population, and any inference we make from the sample to the population involves some uncertainty
How much uncertainty we are willing to live with is determined by the significance level we set
If we set \alpha = 0.05, we are essentially willing to accept that the inferences we draw will be wrong in 1 out of every 20 independent tests we perform
If we set \alpha = 0.01, we will draw the wrong inference in 1 out of every 100 tests
Types of Errors
Every time there is a mismatch between the true population parameter and the inference we draw about it, we make a decision error: we draw the wrong conclusion about the parameter
Two types of errors:
  Type 1 Error: Reject H_0 when it is true
  Type 2 Error: Do not reject H_0 when it is false

                     Test Result
Reality       Reject H_0      Do not reject H_0
H_0 True      Type 1 Error    Correct
H_0 False     Correct         Type 2 Error
Type 1 error
The significance level \alpha we select is the probability of a Type 1 error
For example, if we set \alpha = 0.05, then there is a 5% probability of a Type 1 error
Why not just set \alpha = 0.000001 and be (nearly) sure not to make a Type 1 error?
An extremely low significance level leads to extremely large confidence intervals AND increases the probability of a Type 2 error
Think of the courtroom analogy: if we reduce the probability of a Type 1 error (sending an innocent person to prison), we increase the probability of a Type 2 error (not convicting a guilty person)
Type 2 error
\beta, the probability of a Type 2 error, depends most crucially on the significance level we set, the sample size, and how far the true value of the parameter is from the null hypothesis we test.
Say the true value in the population is \mu = 99 and the sample mean is 99, but we test the null hypothesis that \mu = 100
If we don't have a large sample, we will probably fail to reject the null hypothesis
We can't discern between the true value and the null hypothesis
1 - \beta = the power of the test
When we have large n, we often say that we have a "well-powered" test, i.e., the probability of a Type 2 error is small
Limitations of Significance Tests
Statistical significance does not mean practical significance
Significance tests don't tell us about the size of the effect (as confidence intervals do)
Some tests may be statistically significant just by chance (i.e., Type 1 errors occur)
Concepts and Measurement in Political Science
Chapter Outline
1. Putting Theories to the Test
2. Social science measurement: The varying challenges of quantifying humanity
3. Problems in measuring concepts of interest
4. Are there consequences to poor measurement?
Remember what a theory is
As we have said, a theory is a statement (or a question) about the
possible causal relationship between two (or more) concepts.
We have been using both the abstract "Does X cause Y?" language as well as the more specific "Does cigarette smoking cause heart disease?" language.
How do we evaluate our theories?
That is, how do we come to a conclusion about whether our theory is
likely to be correct?
We need to make empirical observations. In other words, we need to compare our abstract theoretical ideas with reality. (Remember, "empirical" just means "based on observations." The observations might be quantitative or in-depth qualitative.)
There's a potential problem here
We need to be as confident as possible that the concepts in our theory correspond as closely as possible to our empirical observations.
This is called the problem of measurement.
What's the big deal?
If we want to do a good job evaluating whether X causes Y, then we need to do a precise job measuring both X and Y.
If we are sloppy in measuring X and Y, then how can we be confident whether or not our assessment of the theory is right?
Recall Figure 1.2
[Diagram: a causal theory links the independent variable (concept) to the dependent variable (concept); each concept is operationalized into a measured variable, and the hypothesis links the measured independent variable to the measured dependent variable.]
Do you see the disconnect?
The relationship that we care about most is one we cannot directly
observe. We therefore have to rely on potentially imperfect measures
of the concepts we care about.
That means that measuring our concepts with care is one of the most
important (and overlooked) parts of social science.
Measurement problems in the social sciences
Economics: Dollars, people
Political Science: ???
Psychology: Depression, anxiety, prejudice
What's the best measure of the concept of poverty?
Compare a pre-transfer measure of poverty to a post-transfer measure
of poverty?
Which is superior?
The three issues of measurement
1. Conceptual clarity
2. Reliability
3. Validity
Conceptual clarity
What is the exact nature of the concept we're trying to measure?
Example: How should a survey question measure income?
1. What is your income?
2. What is the total amount of income earned in the most recently completed tax year by you and any other adults in your household, including all sources of income?
Reliability
An operational measure of a concept is said to be reliable to the
extent that it is repeatable or consistent; that is, applying the same
measurement rules to the same case or observation will produce
identical results.
The bathroom scale
Validity
A valid measure accurately represents the concept that it is supposed
to measure, while an invalid measure measures something other than
what was originally intended.
Example: A lying bathroom scale
The consequences of poor measurement
How can we know if our theory is supported if we have done a poor
job measuring the key concepts that we observe?
If our empirical analysis is based on measures that do not capture the essence of the abstract concepts in our theory, then we are unlikely to have any confidence in the findings themselves.
Correlates of War Project
National Material Capabilities Data Documentation
Version 4.0
Last update completed: June 2010*
*Note from update carried out at PSU that concluded in May 2005: Many individuals have contributed to the collection of national capabilities data and this documentation/coding manual over many years. Particular contributors to the v3.0 coding manual include Reşat Bayer, Diane Dutka, Faten Ghosn, and Christopher Housenick. Important contributors to the 1990 version of the manual included Paul Williamson, C. Bradley, Dan Jones, and M. Coyne.

Note from the update carried out at UNT that concluded in June 2010: J. Michael Greig and
Andrew J. Enterline directed the update with the assistance of graduate students, Christina Case
and Joseph Magagnoli. Greig and Enterline acknowledge the support of the UNT Department of
Political Science for its general research support, and in particular its generous provision of $500
to purchase an electronic copy of the United Nations Energy Statistics Database in April 2010.


Table of Contents


!"#$%#$&
Note on Version 4.0 (2010) Update
Introduction
General Considerations
Basic Dimensions
Overview of Version 3.0
    Sub-Component Data
    Discontinuities and Source/Quality Codes
    Individual Data Set Updates
Military Personnel
    What's New in Version 3.0
    Data Acquisition and Generation
    Problems and Potential Errors
    Component Data Set Layout
    2010 Update
Military Expenditures
    Data Acquisition and Generation
    Currency Conversion
    Problems and Possible Errors
    The Future of Military Expenditures
    Component Data Set Layout
    2010 Update
Total Population
    Data Acquisition and Generation
    Problems and Potential Errors
    Quality Codes
    Anomaly Codes
    Component Data Set Layout
    2010 Update
Urban Population
    What's New in Version 3.0
    Data Acquisition and Generation
    Problems and Potential Errors
    Component Data Set Layout
    2010 Update
Iron and Steel Consumption
    What's New in Version 3.0
    Data Acquisition and Generation
    Data Sources
    Quality Codes
    Component Data Set Layout
    2010 Update
    Bibliography
Primary Energy Consumption
    What's New in Version 3.0
        Assumption of Zero Values
        Conversion into One Thousand Metric Coal-Ton Equivalents
        Interpolation
        Bringing Technology to Bear
        Data Merging Methods
    Data Acquisition and Generation
        Petroleum
        Electricity
        Natural Gas
    Problems and Potential Errors
        Negative Values
        Multiple Data Values
        Missing Data
    Quality Codes
    Anomaly Codes
    Component Data Set Layout
    2010 Update
        Conversion to Metric-ton Coal Units
        Primary Commodities
        Special Country Series Extractions
    Bibliography
Note on Version 4.0 (2010) Update
From the perspective of the UNT update team, the purpose of the 2010 update of the National Material Capabilities (NMC) data is twofold. First, to collect data for the six NMC components (i.e., total population, urban population, military personnel, military expenditures, primary energy consumption, and iron and steel production) for the 2002-2007 period, and to merge these new data with the existing 1816-2001 data from v3.02, completed in May 2005.

Data Set Layout
The variables in the NMC_v4.0.csv file, in order, are:
Position Variable Description
1 stateabb 3 letter country Abbreviation
2 ccode COW Country code
3 year Year of observation
4 irst Iron and steel production (thousands of tons)
5 milex Military Expenditures (For 1816-1913: thousands of
current year British Pounds. For 1914+: thousands
of current year US Dollars.)
6 milper Military Personnel (thousands)
7 pec Primary Energy Consumption (thousands of coal-ton
equivalents. Formerly, energy.)
8 tpop Total Population (thousands)
9 upop Urban population (population living in cities with
population greater than 100,000; in thousands)
10 cinc Composite Index of National Capability (CINC) score
11 version Version number of the data set

Missing values are indicated by the value -9. Users must ensure that their statistical analysis
software takes this coding into account.
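For example, a minimal sketch (ours, not part of the official documentation) of loading the file while converting the -9 codes, assuming the pandas library and that NMC_v4.0.csv is in the working directory:

import pandas as pd

nmc = pd.read_csv("NMC_v4.0.csv", na_values=[-9])   # -9 becomes NaN
print(nmc[["stateabb", "year", "cinc"]].head())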
Second, to transition the data generation from routines executed in MS-Access to STATA, and to release the data in both STATA and comma-separated-variable format. The rationale for the transition from MS-Access to STATA is grounded in (a) the widespread use of STATA for data management purposes, and (b) the ease with which STATA command files (known as "do-files") can be read with any text editor, thereby enabling subsequent update teams to decipher the decision-making executed during the current update, even if said subsequent update teams choose an alternative data management software. After conversion to STATA, the final version was saved as .csv and is available on the website (see the subcomponent data set layout below).
Subcomponent Data Set Layout
The variables in the NMC_Supplement_v4.0.csv file, in order, are:

Position Variable Description
1 statename Country name
2 stateabb 3 letter country Abbreviation
3 ccode COW Country code
4 year Year of observation
5 irst Iron and steel production (thousands of tons)
6 irstsource Information source for Iron and steel production
7 irstnote Note on Iron and steel production
8 irstqualitycode Iron and steel production Quality Code, takes on letter values.
Please see documentation for v3.0 further below for greater
detail.
9 irstanomalycode Iron and steel production Anomaly Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
10 pec Primary Energy Consumption (thousands of coal-ton
equivalents. Formerly, energy.)
11 pecsource Information source for Primary Energy Consumption
12 pecnote Note on Primary Energy Consumption
13 pecqualitycode Primary Energy Consumption Quality Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
14 pecanomalycode Primary Energy Consumption Anomaly Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
15 milper Military Personnel (thousands)
16 milpersource Information source for Military Personnel
17 milpernote Note on Military Personnel
18 milex Military Expenditures (For 1816-1913: thousands of
current year British Pounds. For 1914+: thousands
of current year US Dollars.)
19 milexsource Information source for Military Expenditures
20 milexnote Note on Military Expenditures
21 upop Urban population (population living in cities with
population greater than 100,000; in thousands)
22 upopsource Information source for Urban population
23 upopqualitycode Urban population Quality Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
24 upopanomalycode Urban population Anomaly Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
25 upopgrowth Urban population growth
26 upopgrowthsource Information source for Urban population growth
27 tpop Total Population (thousands)
28 tpopsource Information source for Total Population
29 tpopnote Note on Total Population
30 tpopqualitycode Total Population Quality Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
31 tpopanomalycode Total Population Anomaly Code, takes on letter
values. Please see documentation for v3.0 further below for
greater detail.
32 cinc Composite Index of National Capability (CINC) score
33 version Version number of the data set

A final note regarding our contributions to this codebook, the bulk of which was written by the update team at the Pennsylvania State University and filed in May 2005. We did not undertake a revision of the codebook written by the PSU coders; rather, at the conclusion of each NMC component section we insert an additional sub-section labeled "2010 Update," within which we detail information and issues relevant to the 2010 update. That said, our additions to the codebook sometimes clarify issues that were vague in the original text, a byproduct of Stuart Bremer's passing in 2002. Thus, we recommend that users of the NMC data consider both the pre-existing and the updated codebook materials prior to using the data.

Introduction
"Power" - here defined as the ability of a nation to exercise and resist influence - is a
function of many factors, among them the nation's material capabilities. Power and material
capabilities are not identical; but given their association it is essential that we try to define the
latter in operational terms so as to understand the former.
This manual examines some of the more crucial issues in the selection and construction of indicators of these capabilities, discusses the implications of the various options, and indicates the decisions made by the Correlates of War project for the Composite Index of National Capability. It presents in detail the terminology and definitions of each indicator, data collection techniques, problems and irregularities, and data manipulation procedures. Additionally, it functions as a guideline for reading the data set and provides a bibliography. Not all of the decisions undertaken were optimal, and often the trade-offs are difficult. Nor did the enterprise start from scratch. Historians, social and physical scientists, military analysts, and operations researchers have examined the ideas of power base, national strength, and material capabilities. As the bibliography makes clear, about two dozen authors have tried to develop - and generate data for - indicators of national attributes. We profited greatly from these prior efforts, be they speculative, empirical, or both. This literature has been of great assistance, especially in illuminating the difficulties and highlighting those myriad strategies we have avoided.
General Considerations
There are certain general considerations we must note before turning to the specific
dimensions in any detail. First and foremost is that of comparability across a long time period
(1816 to the present) of a staggering variety of territorial states, peoples, cultures, and institutions
at radically different stages of their economic, social, and political development at any given
moment. An indicator that might validly compare a group of European states in 1960 may very
well be useless in comparing one of them to a North African state in the same year, let alone
1930 or 1870. We selected our indicators from among those that were both meaningful at any
given time and that had roughly the same meaning across this broad time range. This
requirement limited our choices, even in the statistically better endowed post-World War I years.
Various caveats must be made concerning the validity of the indicators the project
selected. The first of these is comparison, which relies on the sometimes questionable
assumption that equal values of the same indicator make equal contributions to capability. To
differentially weight the contributions of individual nations entails questions that the project was
not ready to address. Certain indices where this caution especially applies are noted later.
A second caveat concerns the choice of coding rules given several equally
plausible alternatives. Here, the purpose is that the value assigned to the underlying concept not
be highly sensitive to this choice. In some cases, we estimated this sensitivity by recollecting data
for a sample subset, applying alternative choices, and determining their distribution of data values
around those previously gathered.
A third caveat is information sources. We consulted several sources. We were
particularly interested in series having long runs of data from multiple sources overlapping the
same time period because this allowed better discrimination of reliable figures. Given different
volumes of the same series, we used the most recent data reported, although alert to the
possibility that revisions reflected manipulation by the reporting nation or changes in the methods
of reporting, rather than improvements in accuracy.
A fourth caveat is the role of estimation. It is not surprising that we could not find all the
requisite information. We did not expend considerable time and effort to produce a series
complete save for some small remaining bit of ignorance. Rather, we filled in the gaps through
interpolation, where it was reasonable to assume that the difference in values of the endpoints of
the gap were accurate and that the change rate between them was uniform. We discuss this
further under particular sections. In the case of missing data or lack of comparability among
sources, we often resorted to bivariate regression of the known values on time, using the latter to
estimate all the data in the series. A contrast between the two methods is that estimates obtained
by interpolation are assumed correct even if they depart from the long-run trend. Estimates
obtained by regression assume that the true change rate is constant over a longer sequence of
several known data points, of which the endpoints and all other reported values may be in error.
The approach we used depended on the context of all that was known about each individual
case.
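To illustrate the contrast (a sketch of our own, with hypothetical numbers rather than actual NMC values): linear interpolation trusts the endpoints of a gap and assumes a uniform change rate between them, while regression on time fits a constant rate to the whole run of known values.

import numpy as np

years = np.array([1900, 1901, 1902, 1903, 1904])
tpop = np.array([10000.0, np.nan, np.nan, 10600.0, 10800.0])  # gap in 1901-02

known = ~np.isnan(tpop)

# Interpolation: accept the endpoints as accurate, fill the gap linearly
interp = np.interp(years, years[known], tpop[known])

# Regression on time: assume a constant change rate across all known points,
# so even the reported values may be replaced by the fitted trend
slope, intercept = np.polyfit(years[known], tpop[known], 1)
trend = intercept + slope * years

print(interp)
print(trend)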
A fifth caveat is data availability and the inevitability of error. Most of the indicators used in
the Correlates of War project are generated by the application of operational criteria and coding
rules to the often ambiguous "traces" of history. In some cases we can be quite confident about
the reliability of this approach because we ourselves developed the data. In other cases, we rely
on apparently precise numerical traces recorded by others at earlier times with coding and
scaling criteria ranging from unknown to inconsistent. For instance, given that our definitions of
national territories sometimes differ from source definitions, and given the imprecision of the
latter, the figures we obtained may have reflected these incorrect boundaries. Likewise, errors
could have been introduced through efforts to correct for boundary changes.
Error could also arise from inappropriate uses of estimation. The assumption (in the case
of interpolation) of accurate endpoints or (in the case of regression) that transient residuals in
documented values do not represent historically real fluctuations may be wrong. In either case,
the assumption of constant change rates may have been mistaken. While we sought to leave no
stone unturned, the reporting of national statistics is a recent practice. As one moves further back
toward 1816, statistical availability and quality deteriorates. Given the paucity of documentation,
figures and estimates of inferior reliability often were the only kind available. In those cases, and
despite the possibility of error, we had no choice but to identify, select, and combine numerical
estimates of evidence, hoping that we have recognized and taken account of differing criteria.
Given the multiplicity of interpretations as well as the difficulty of validation, we expect
alternative national capability indicators to be put forth with some regularity well into the future.
This leads us to a brief consideration of the dimensions and indicators of capability we adopted
and why. We intended to tap the scholarly consensus on the major components of general
capabilities, and not the development of the most comprehensive predictor of success in
diplomacy, crisis, or war. The extent to which these capabilities do account for such success is an
empirical question and there is mounting evidence that the two differ in important ways.
Basic Dimensions
The project selected demographic, industrial, and military indicators as the most effective
measures of a nation's material capabilities. These three indicators reflect the breadth and depth
of the resources that a nation could bring to bear in instances of militarized disputes.
Why have we treated only demographic, industrial, and military indicators of national capabilities? Why have geographic location, terrain, and natural resources (all of which clearly affect material capabilities) not been addressed? Location, for example, could be important in
several senses: island and peninsular states are often more able to trade with a larger number of
others, are somewhat more defensible against invasion, emphasize sea power over land power
(thus appearing less able to threaten another with invasion), and have fewer close neighbors with
whom to quarrel. Landlocked states are typically more restricted in their choice of trading
partners, are more vulnerable to invasion, occupation, or annexation, have more immediate
neighbors, and "require" greater land forces that often appear threatening. All these facets could
detract from or enhance a state's capabilities. However, they are too dyad-specific to permit valid
cross-national comparison because they pertain to the relationship between nations rather than to
the characteristics of a given nation. As to natural resources such as arable land, climate, and
resource availability, these factors are already reflected to a considerable extent in the indicators
we employed.
There is, of course, the question of effective political institutions, citizen competence,
regime legitimacy, and the professional competence of the national security elites. While these
are far from negligible, they contribute to national power and the efficiency with which the basic
material capabilities are utilized, but they are not a component of such capabilities.
A final and major point is that while most researchers grant that the demographic,
industrial, and military dimensions are three of the most central components of material strength,
nevertheless they may quarrel either with (1) the specific subcomponents or (2) the decision to
stay with them over nearly two centuries. These issues are dealt with later in their specific
contexts. The value of uniform series throughout the period is a question that must be subject to further inquiry, by empirical means based on datasets such as this one.
Next we address the procedures and problems of the individual indicators. Where there
are important departures from core procedures, we note them in this document and in the data
set itself. For each of the three indicators, we begin with an introductory section and follow it, for
each of the two subdimensions on which the indicators rest, with discussions of data acquisition
and generation, and data problems and potential errors.
Overview of Version 3.0
Version 3.0 of the National Material Capabilities data set is the result of several years of
effort undertaken at the Pennsylvania State University by the COW2 Project. Two major updates
have taken place. First, additional detail about the source for and quality of data points was
added to some component sets. We hope to continue this practice in the future. Second, each
component series was extended and each series was examined and in some cases was revised.
A brief overview of these changes is outlined below, starting with the universal updates and
moving then to individual component updates. Once those two discussions are complete, this
manual then goes into greater detail about each of the six indicators of national capabilities.
Sub-Component Data
Along with overall data, the COW 2 project is releasing additional information about each
separate sub-component. Each sub-component has its own separate data set (saved in
Microsoft Access format) which contains new detail about the particular variable. Information in
these sub-data sets includes in particular source data identification, Quality Codes, and Anomaly
Codes, along with the values for the variables in each state-year. The final values for each state
year are then placed in the final overall 6-component data set typically used by analysts.
Discontinuities and Source/Quality Codes
It is important to document the source of and confidence we have in our data points.
Therefore, coding schemes for source and quality codes have been developed during the
collection of v3.0, and included as was possible and practical during the update. For instance,
the sub-component data sets include the source of the value in the data series. In many cases,
we were unable to trace a data value to a particular source. In such cases, we have kept the original values, which did come from specific sources, but sources that we simply do not know.
In any data set, there are data points that must be interpolated, extrapolated, or
estimated. Previously, COW data sets have not listed which data points are interpolated and
which come from solid data sources.1 In this version of the national capabilities data set, we
made these estimations transparent to users when possible by creating a quality code variable as
a separate column in four of the national capability indicators. These 4 indicators are iron and
steel production, primary energy consumption, total population, and urban population. It is
important to note that each component has its own quality coding scheme. Because of very
different coding rules and potential fluctuations, each component needed its own coding
approach. For instance, total population changes very slowly, and a census every ten years is
the norm. Basic growth can easily be calculated for each country, and anything that can radically
alter a state's population will most often be well documented. Examining a concept like primary
energy consumption, however, it is quite possible for there to be quite rapid fluctuations in energy
usage. Oil embargoes, new technologies, and wars can make energy consumption values
fluctuate greatly. Therefore, this commodity has a higher standard for its data point quality, and
that higher standard is reflected in its quality codes.
Ideally, these data quality codes would be a temporary element of this data set. The
long-term goal of this project should be to eventually find specific data for each data point that
falls short of the standard for receiving an A (the universal designation for a well-documented
data point). As this research advances, once all data points in a series receive an A, the quality
codes for that series would then be irrelevant and could be dropped from the data set.
A second new element added to these data sets is the identification of anomalies. One of the most routine questions that arises about any data set concerns major fluctuations in data values. Oftentimes, these fluctuations reflect true changes in the data. In other cases, however, they can be created by the coders themselves. Changing data sources, differing conversion factors, or introducing new components can create an apparent disconnect in a data series.
In a proactive approach to these discontinuities that appear in many data sets, each
component now has an anomaly code column included in the data set. When a potential
discontinuity was found in a data series, it was noted and supplemental research was done
attempting to identify the cause of the anomaly. In some cases, a specific cause was easy to
identify and document, such as changes in population after wars or losses of territory. In such
cases, the fluctuation is real, and understandable. In other cases, anomalies were created
because of changes in the data structure itself, such as when switching indicators from iron to
steel production. In other cases a new source introduces a jump in a series. In these cases, the
apparent increase or decrease in an indicator is artificial, and the jump must be accounted for in
time-series analysis of the component series. Unfortunately, there were cases where no discernible reason could be found for the anomaly between previous and subsequent data points. These points were documented, and it should be the future goal of this project to fully document all the reasons for anomalies in these data sets.
Individual Data Set Updates
Each of the six indicators of national capabilities underwent revisions and updates over
the course of this project. While there is more detail in the sections that follow, it is important to
note at least briefly what some of the major modifications and improvements are.
The Military Personnel Data Set was both updated and modified. It was modified from
previous versions by replacing previous data with data from the U.S. Arms Control and
Disarmament Agency (ACDA) for all data points from 1961 until 1993. The data were also
extended from 1993 forward using ACDA data where possible, supplemented with data from the
Military Balance.
The Military Expenditure Data Set was updated from 1993 to 2001.2
The Iron and Steel Data Set was first updated to 2001. Then researchers went back
through the data set and re-confirmed the entire series, re-documenting the sources for all data
points in the series.
The Primary Energy Consumption Data Set was completely re-constructed for version
3.0 of the data set. All energy values were re-calculated from raw data sources, and compiled
into a total energy consumption data value for each state in a given year. The data were also
extended to 2001.
The Total Population Data Set was first updated from 1993 until 2001 using United
Nations data. Then researchers went back through the data set, re-documenting the data points;
some data series were replaced, and some interpolations were re-calculated.
The Urban Population Data Set was updated from 1990 until 2001.

Notes on the format of data file NMC_3.0.csv

The file NMC_3.0.csv contains version 3.0 of the Correlates of War National Material Capabilities
Data Set (1816-2001). The file is in comma-separated values (comma-delimited) form, a flat
text format which may also be read automatically into computer software packages such as
Microsoft Excel, or read into other programs using specific commands (e.g. the command
insheet using nmc_3.0.csv in Stata). The first line of the data set contains the variable names.
The data set contains the following 11 variables, in order:

Position Variable Description
1 stateabb Three-letter country abbreviation
2 ccode COW country code
3 year Year of observation
4 irst Iron and steel production (thousands of tons)
5 milex Military expenditures (for 1816-1913: thousands of current-year British pounds; for 1914+: thousands of current-year U.S. dollars)
6 milper Military personnel (thousands)
7 energy Energy consumption (thousands of coal-ton equivalents)
8 tpop Total population (thousands)
9 upop Urban population (population living in cities with population greater than 100,000; in thousands)
10 cinc Composite Index of National Capability (CINC) score
11 version Version number of the data set

Missing values are indicated by the value -9. Users must ensure that their statistical analysis
software takes this coding into account.
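
For example, a minimal sketch in Python using pandas (the package choice and the recoding
step are ours; the file name, variable names, and the -9 convention come from this document):

    import pandas as pd

    # Read the flat comma-delimited file; the first line holds the variable names.
    nmc = pd.read_csv("NMC_3.0.csv")

    # Recode the -9 sentinel as a proper missing value before any analysis,
    # so that means, sums, and cross-state comparisons are not contaminated.
    nmc = nmc.replace(-9, pd.NA)

    # Example: the military personnel series for the United States.
    usa_milper = nmc.loc[nmc["stateabb"] == "USA", ["year", "milper"]]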

Military Personnel
The Military Personnel Data Set contains data on the size of state armies (defined below)
from 1816 until 2001.
What's New in Version 3.0
This version of the data set has undergone three important modifications. First,
whenever possible, researchers have re-documented the source for the data points. Second, the
data were extended from 1991 until 2001. Third, the data between 1961 and 1999 now come
from the U.S. Arms Control and Disarmament Agency (ACDA). Previous versions used a
combination of both ACDA data and data from the International Institute for Strategic Studies
(IISS). Version 3.0 uses ACDA data for all data points where it was available and only
supplements with IISS data in cases where ACDA data were not available.
Data Acquisition and Generation
Military personnel are defined as troops under the command of the national government,
intended for use against foreign adversaries, and held ready for combat as of January 1 of the
referent year. It is important to note that any date besides January 1st would have been
appropriate for the majority of cases because the data values change slowly. On occasion,
however, there are instances where there are rapid changes in troop strength, such as
mobilizations for conflicts and wars. Short-term variations in strength are not reflected in the
project's data unless the changes remained in effect until the following January 1. With this
definition in place, there are five important aspects of quantifying military personnel that need
elaboration.
First, the project counted only those troops under the command of the national
government. These troop strengths include active, regular military units of the land, naval, and air
components. Troops in the reserves such as those found in the United States were not included
in the state's annual total. Colonial troops (such as Indian troops under British command during
India's colonial period) were usually not included in this total if they were a separately
administered force.
Second, the military personnel data exclude foreign military forces, the forces of
semi-autonomous states and protectorates, and insurgent troops. Such units were not part of a
regular national armed force under a military chain of command. Their inclusion would distort the
number of personnel that could be summoned when deemed necessary.
Third, these figures reflect the project's best judgment on which forces were intended for
combat with foreign parties. Irregular forces such as civil defense units, frontier guards,
gendarmerie, carabinieri, and other quasi-military units were nominally responsible for defending
outlying districts or for internal security and could be mobilized in time of war. We usually
excluded them, however, because they were not integral to the regular armed forces (e.g.
Cossack troops of nineteenth century Russia). When these forces were the only military a nation
had they were still excluded (e.g. Costa Rica and Switzerland).
A fourth aspect concerns armed forces in several semi-feudal nations, including the
warlord armies in pre-modern Japan and China, and Janissary troops in the Ottoman Empire.
Not all nations were quick to adopt Western military organization. We counted only those forces
that were acting at the behest of the central government. For example, in the case of pre-modern
Japan, we included only the Imperial troops and those armies of feudal lords operating on behalf
of the throne.
A final aspect concerns national police forces organized for both foreign and domestic
purposes and found in several developing nations in the twentieth century. Such units come
directly under the military chain of command and are fully a part of the armed forces at the
immediate disposal of a national government. Examples include the old National Guard of
Nicaragua and the national police forces of many African states. When such forces provided the dual
functions of foreign combat and internal security, we included them in the state's military personnel
figures; otherwise, they were excluded.
Usually it was only after 1960 that we found ready-made data (including army, navy, and
air force totals) meeting our coding criteria and aggregated into the desired totals. Elsewhere, we
assembled the data from bits and pieces. Given a figure that did not fully meet our
inclusion/exclusion criteria, we used it only after locating supplementary information that could be
used to adjust it. Confronted with conflicting figures, we adopted those that best matched the
contemporary data, and only if they seemed historically plausible. In practice, frequently it was
impossible to find documentation reflecting the January 1 criterion. In most such cases, however,
the figures were changing sufficiently slowly to afford an acceptable approximation. In cases of
rapid military change, such as the onset of war, we took note of the fact in arriving at a plausible
estimate. Because of the relatively great sensitivity of personnel levels to transitory circumstances
such as war involvement, we used estimates to fill missing entries only when they did not occur in
such circumstances.
Problems and Potential Errors
The precise number of active forces remains uncertain on a conceptual basis. It is easy
to see that during the course of their foreign policy, states often have an incentive to exaggerate
their troop strengths when deterring a potential opponent or understate their troop strength when
attempting to avoid notice by other powerful states or a potential target of hostilities. These
potential motivations to misrepresent troop strengths can create difficulties with this project's data
collection efforts. However, because we use sources that themselves often draw on multiple sources
and channels of estimation, we believe that these differences of opinion are ironed out of the
data and that the numbers presented here reflect the military personnel of these states.
Inadequate source documentation is another potential source of difficulty in assembling
this data. There is some possibility that personnel which were never counted in a general source
total have been missed. We were not aware of such flaws in our research, however, and do not
consider this a major potential for error.
Similarly, our criteria for including or excluding various "irregular" types of forces may
have led us to exclude forces which did indeed contribute to national totals. It is equally plausible
that we classed as active some military units that should have been excluded by our criteria, such
as those performing internal security functions. Source limitations frequently precluded the
requisite distinctions.

Quality/Anomaly Codes
There are no quality or anomaly codes for this component.
Component Data Set Layout
The layout of the military personnel Access data set is found in Table MILPER 1 below.
The data set contains seven columns. The first and second columns correspond to the COW
state abbreviation and COW state number. The third column is the year of observation. The
fourth column contains the value for that year (in thousands), unless the value is missing. Missing
values are indicated by -9. The fifth column provides the source of the data point or "See note." If
the column contains "See note," the note column should be consulted to see how that data point
was calculated. The next (sixth) column, "Note," explains how that data point was obtained
(estimation, or whether the value was verified as coming from a particular source). All data points
that have been verified are so indicated. The seventh column is entitled "Source Code" but has
not been used and is blank.

Table MILPER 1: Data Set Layout
Military Personnel
StateAbb CCode Year MilPer Source Note Source code Version
USA 2 1816 17 "Historical Statistics of the U.S., Colonial Times to 1957" (U.S. Department of Commerce and Bureau of the Census), p. 737 | verified 10/22/2001. DLD | (blank) | 3.01


2010 Update
Military Personnel data (MILPER) are coded for the 2002-2007 period using the annual
International Institute for Strategic Studies reports, The Military Balance. Unlike both
previous and subsequent reports, the 2004 edition of The Military Balance does not report armed
force size. As a result, military sizes for all countries were interpolated for the year 2004.
Consistent with previous updates of military personnel, decimal values of MILPER greater than .50
were rounded up and decimal values of MILPER of .50 or less were rounded down (see the sketch
after the list below). Issues arose with specific country series; they were addressed as follows:
1. Bhutan. As was the case with the Maldives, no data was available for the size of
the military of Bhutan. Information from the U.S. State Department, however,
made clear that Bhutan did indeed have a regular military force. As a result,
MILPER data for Bhutan are coded missing for 2002-2007;
2. Comoros. No data was available for the size of the military of Comoros and its
military expenditures. Information from the U.S. State Department, however,
made clear that Comoros did indeed have a regular military force. As a result,
MILPER data for Comoros are coded missing;
3. East Timor. No data was available for the size of the military of East Timor for
2002-2005. Information from the U.S. State Department, however, made clear
that East Timor did indeed have a regular military force. As a result, MILPER
data for East Timor for 2002-2005 are coded as missing;
4. Maldives. Information from the U.S. State Department indicates that the Maldives
does have a military, but no concrete information could be located regarding the
size of this military force. As a result, data for the Maldives was coded as
missing rather than a value of zero;
5. Sao Tome and Principe. As was the case with the Maldives, no data was
available for the size of the military of Sao Tome and Principe or its level of
military spending. Information from the U.S. State Department, however, made
clear that Sao Tome and Principe did indeed have a regular military force. As a
result, MILPER was coded missing;
6. Somalia. MILPER data is missing from 2002-2004, 2007, and coded as zero for
2005-2006;
7. Swaziland. No data was available for the size of the military of Swaziland or its
military expenditures. Information from the U.S. State Department, however,
made clear that Swaziland did indeed have a regular military force. As a result,
MILPER is coded missing; and
8. Switzerland. IISS reports a significant drop in military personnel, moving from
28,000 troops in 2003 to 4,000 troops in 2005, 2006, and 2007. This drop in troop strength
is reversed in 2008, when Switzerland's troop level returns to 23,000. The
2005, 2006, and 2007 troop values are each reported in different IISS annual
reports so the likelihood of an incorrect data entry seems small. Instead, this
change in troop levels seems to be tied to the Army XXI reforms that were
adopted by the Swiss in 2003 that called for a drastic reduction in the force
strength of the Swiss military. This 4,000 full-time troop strength value was
confirmed in the following U.S. State Department background note on
Switzerland (August 2009, http://www.state.gov/r/pa/ei/bgn/3431.htm).
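
To illustrate the rounding rule used for the 2002-2007 MILPER values, a minimal sketch in
Python (the function name is ours):

    import math

    def round_milper(value):
        # Fractional parts greater than .50 round up; .50 or less round down,
        # per the convention described for this update.
        frac = value - math.floor(value)
        return math.ceil(value) if frac > 0.50 else math.floor(value)

    # round_milper(12.51) -> 13; round_milper(12.50) -> 12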

Bibliography
Historical Statistics of the U.S., Colonial Times to 1957. U.S. Department of Commerce (1960).
Page 730.

Statistical Abstract of the U.S. U.S. Department of Commerce (1970). Page 257.

The Statesman's Year-Book. New York.

Almanach de Gotha.

Annual Abstract of Statistics. Great Britain.

Clode, Charles M. The Military Forces of the Crown. London: John Murray, 1869. Vol. I, pp.
399-400.

Whiting, Kenneth R. The Development of the Soviet Armed Forces, 1917-1966. Maxwell Air
Force Base, Alabama: Air University, 1966.

Erickson, John. The Soviet High Command: A Military-Political History, 1918-1941. London:
Macmillan & Co, 1962.

The Institute for Strategic Studies. The Military Balance, 1965-1966. 18 Adam Street, London,
WC 2. December, 1965.

The Europa Year Book. London: Europa Publishing Co., Ltd.

National Attributes Data.

Upton, Emory. The Armies of Asia and Europe. New York: D. Appleton & Company, 1878.

Schnitzler, J. H. Essai d'une Statistique Generale de l'Empire de Russie. Paris, 1829.

The Japan Year Book. 1906-1952. Tokyo: The Japan Year Book Office.

League of Nations. Armaments Year Book. Geneva, 1935 = vol. XI.

Wolowski, M. Louis. Les Finances de la Russie. Paris: Librairie de Guillaumin, 1864.

Great Britain, Central Statistical Office. Annual Abstract of Statistics. London: Her Majesty's
Stationery Office.


Great Britain, House of Commons, Sessional Papers, 1816-1965.

Cobden, Richard. The Three Panics: An Historical Episode. London: Cassell and Company, Ltd.,
1884.

Ravay, Espagnac du. Vingt Ans de Politique Navale, 1919-1939. Grenoble: Arthaud, 1942.

Mulhall, Michael G. The Dictionary of Statistics. London: George Routledge & Sons, Ltd., 1899.

The Encyclopaedia Britannica, 11th Edition. Cambridge, England, 1910. Published by University
Press, New York.

Monteilhet, Joseph. Les Institutions Militaires de la France, 1814-1932, De la Paix Armee a la
Paix Desarmee. Paris: Librairie Felix Alcan, 1932.

U.S., Office of Naval Intelligence. Information Concerning Some of the Principal Navies of the
World. Washington.: Government Printing Office, December, 1911.

Annuaire de l'Economie Politique. Paris: Rue Richelieu, No. 14.

Annuario Statistico Italiano.

Lobell, V. Jahresberichte uber die Veranderungen und Fortschritte im Militarwesen. (L.C. title:
Rustung und Abrustung) 1874-1913. (Vol. 1 = 1874) E. S. Mittler und Sohn, Berlin.

League of Nations. Armaments Yearbook. (Yearly 1924-1939.)

The Military Balance: 19_/_. London, Institute for Strategic Studies. (1959/60 = vol. 1)

U.N. Office of the Secretary-General. Economic and Social Consequences of Disarmament:
Replies of the various Governments and Communications from International Organizations. (U.N.,
New York; 1962)

World-wide Military Expenditures and Related Data: 1965. U.S. Arms Control and Disarmament
Agency, Washington; 1967. Research Report 67-6.

Statistical Abstract of Latin America 19-. (1954 = Vol. 1) Latin American Center, Univ. of
California, Los Angeles; 1965.

M.D.R. Foot. Men in Uniform. London, Institute for Strategic Studies; 1961.

New York Times. Military Survey. May 12, 1947: p. 14.

New York Times, March 27, 1949. p. E-5.

New York Times, May 16, 1948, p. 4E.

Hanson, Harry (ed.). The World Almanac and Book of Facts: 19-. New York, Scripps-Howard.

Information Please Almanac. Doubleday, New York.

Whitaker's Almanack. (1961 = 93rd annual). London.

Republic of Ghana, Statistical Yearbook for 1963. Accra.

Central Statistical Office. Statistical Abstract for the U.K.


Bureau of Statistics, Office of the P.M. Japan Statistical Yearbook. (1906 = vol. 1)

Census and Statistics Department of N.Z. New Zealand Official Yearbook.

Commonwealth Bureau of Census and Statistics, Official Yearbook of the Commonwealth of
Australia.

Dominion Bureau of Statistics, Dept. of Trade and Commerce. The Canada Yearbook.

Vaterlandischer Pilger. (1809-1848; published irregularly in the early (pre-1817) years, annual
thereafter.) Publisher not given. 804.J95

Genealogisch-historisch-statistischer Almanach fur das Jahr -. Verlag von
Landes-Industrie-Comptoir, Weimar. (1823 = Vol. 1) CS27.G32

Australia. The Parliament of the Commonwealth of Australia. Naval Expenditures of the Principal
Naval Powers. Cmd Paper, Nov. 27, 1914.

Reader's Digest Almanac. (1966 = vol. 1)

Mitteilungen aus dem Gebiet des Seewesens. Seidel und Sohn, Jura.

H. Roberts Coward. Military Technology in Developing Countries.

Beer, Francis A., ed. Alliances: Latent War Communities in the Contemporary World. New York;
Holt, Rinehart and Winston, Inc., 1970.

Annuaire Statistique de la France.

Hinsley. Catalogue of German Naval Archives.

Das koenigliche Preussische Kriegsministerium, 1809-1909 (Berlin, 1909).

Ono, Giichi. War and Armament Expenditures of Japan. (Concord, Carnegie Endowment for
International Peace, 1922)

Rathgen, Karl. Japans Volkswirtschaft und Staatshaushalt. (Leipzig: Verlag von Duncker und
Humblot, 1891)

Kolb, Georg Friedrich. Handbuch der Vergleichenden Statistik. (Leipzig, A. Forstnersche
Buchhandlung, published irregularly beginning in 1862)

Schnabel, General Statistik des Europaischen Kaiserstaats. NA1107.536 1841

Almanach de Paris. (Paris, annual, 1865=vol. 1)

Genealogisches und Statistisches Almanach fur Zeitungsleser. (Weimar; only one volume in
L.C., no information as to dates of series publication)

Block, Maurice. Statistique de la France. (2 vol.) Paris 1860.


France. Ministre du Commerce. Documents Statistiques sur la France. (Paris, Imprimerie Royale.
1835)

Great Britain, House of Commons. East India Reports. Synopsis of the Evidence Taken Before
the Select Committee in Relation to the Army of India. 1832.

U. S. Arms Control and Disarmament Agency. World-Wide Military Expenditure and Related Data
1971

Stein, Friedrich von. Geschichte des russischen Heeres. (Leipzig, 1895).

U.S. War Dept., Military Commission to Europe. The Armies of Europe. (Philadelphia, 1861.)

Ruestow, Wilhelm von. Die Russische Armee. Vienna, 1876.

Fadieev. Russlands Kriegsmacht und Kriegspolitik. Leipzig, 1870.

Larroque, Patrice. De La Guerre Et Des Armees Permanentes. Third ed., Paris, 1870.

Jerome Chen, 1972 (no title given)

Fujiwara Akira, 1961. Gunjishi (The History of Military Affairs in Japan)

China Yearbook

China H.B.

J. G. Godaire "The Claim of the Russian Military Establishment" in
Dimensions of Soviet Economic Power (GPO, Washington, 1962)

A.J.A. Gerlach. Fastes Militaires des Indes-Orientales Neerlandaises (?) (Paris, 1859)

Bodart, Gaston. Losses of Life in Modern Wars (Oxford, Clarendon; 1916)

O'Ballance, Edgar. The Sinai Campaign of 1956 (New York, Praeger, 1960)

Clowes, William Laird. The Royal Navy; A History from Earliest Times to the Present. (London,
1901) 7 vol.

Schnabel, James F. The United States Army in the Korean War. Policy and Direction: The First
Year. (Office of the Chief of Military History, United States Army, Washington, 1972)

Naikoku Tokeikyoku (Cabinet Statistical Bureau), Meiji Hyakunen Shiryo
(Statistical Data for the Hundred Years since Meiji) (Tokyo, 1967)

Appleman, Roy E. The United States Army in the Korean War: South to the Naktong, North to
the Yalu. (Office of the Chief of Military History, Department of the Army, Washington, 1961)

Naikoku Tokei Kyoku: Nihon Teikoku Tokei Nenkan

O'Ballance, Edgar. Korea 1950-1953 (Faber and Faber, London, 1959)

Walter G. Hermes. The United States Army in the Korean War; Truce Tent and Fighting Front.
(Office of the Chief of Military History, United States Army, Washington, 1966).


Howard, Michael. The Franco-Prussian War (N.Y.: Collier, 1969; first published 1961)

Hammer, Kenneth M. "Huks in the Philippines," in Franklin M. Osanka, Modern Guerrilla Warfare.
Free Press of Glencoe, 1962, pp. 177-183.


Wood, David G. "Twentieth Century Conflicts" Adelphi Papers #1968 (From English (?) Foreign
Policy Institute.)

Ralph L. Powell. The Rise of Chinese Military Power 1895-1912. Princeton U. Press, 1955

Statistisk Arsbok fur Sverige (Statistical Abstract for Sweden)

John Gittings. The Role of the Chinese Army. London, Oxford U. Press, 1967

Dernberger, Robert "Evaluation of Existing Estimates For China's Military Costs and Preliminary
Illustration of the 'Best' Available Method for Making New Estimates." (unpublished
paper)

U.S. Arms Control and Disarmament Agency. World Military Expenditures and Arms Transfers,
1966-1975. Washington, 1976.

Rothenberg, Gunther. The Army of Franz Joseph. Purdue Univ. Press, W. Lafayette, 1976.

Alvin D. Coox, "Effects of Attrition on National War Effort: The Japanese Army Experience in
China, 1937-38," Military Affairs v. XXXII, no. 2 (1968), pp. 57-62.

Military Expenditures
The second indicator of military capabilities is military expenditures. Military expenditure
is defined as the total military budget for a given state for a given year.

What's New in Version 3.0
The data were updated through 2001.
Data Acquisition and Generation
Since our primary interest was to index all financial resources available to the military in
time of war, we coded all resources devoted to military forces that could be deployed, irrespective
of their active or reserve status.
Appropriations for all the types of units mentioned earlier were included when the units
were under the authority of officials of the national government, even if the units did not contribute
to the personnel variable. Such units typically were excluded from published budgets, in any
case. It is important to note that in our assessments the sources of military expenditure data
often provided gross (rather than net) expenditure figures.
We sought to identify and exclude all appropriations of a non-military character because
some nations have civil ministries under military control (national police forces are the most
prevalent example). The use of such unadjusted budgets would substantially over-estimate the
military capability of those nations. If there was a clear bureaucratic division between the
execution of civil and military functions, this task was easily accomplished. For instance, if there
were separate accounting and authorization procedures for merchant- and military-marine,
expenditures of the former were excluded. On the other hand, merchant marine expenditures
charged to the same administrative units which carried out military marine functions were
included in the project's tabulations. Likewise, the budget figures were adjusted upward where we
determined that outlays in other parts of the budget served to enhance military capacity.
Having made the above distinction concerning money spent on military forces, we
delimited the part of the latter directly related to a country's war-fighting capacity; that is, we had to
distinguish which figures going for military purposes were destined to enhance capability. We
deemed that expenditures on pensions, superannuation pay, relief, and subsidies to widows and
orphans do not contribute to military power and excluded them where possible. For most
statistically developed countries, these items were found to be readily identified in a separate
section of the military budget, or charged outright to the finance ministry.
We decided to identify gross rather than net expenditures, so as to sidestep problems of
accounting for the yearly variations in stockpile buildup, depreciation, and liquidation. As with the
accounting of energy stocks, little was found that would have allowed us to determine net
expenditures.
We closely attended to allocations, usually found in supplemental budgets, special
accounts, and war credits and loans, over and above regular appropriations. Examples include
the special funds and credits voted during the mobilizations prior to and during the two world
wars, and the loans contracted by Prussia prior to the Franco-Prussian War.
With regard to these special appropriations, some ambiguity exists as to which year the
expenditures should be assigned. Since our objective was that each unit of currency spent on
military capabilities should be counted only in the year that it directly enhanced military capability,
we counted surpluses and credits transferred from past years (when known) among the
expenditures of the referent year.
For example, expenditures from special accounts (such as the construction of
fortifications or the purchase of armaments) were included in the expenditure totals for that year.
If the special account was composed of transfers from the general budget, expenditures on that
account were included in the year in which they were spent or projected to be spent. If the special
account was composed of credits budgeted to a war ministry in previous years, but unspent in
those previous years, we included only actual expenditures from that account in the project's
totals for the appropriate years. Outlays for the amortization of debts incurred were excluded,
since the project had already counted them in the year in which the military items were acquired.
Thus, if a naval ship was acquired in 1923 but not paid for until 1926, we counted the
corresponding expenditure in 1923. Surplus military appropriations from previous years were
counted as military expenditures only for those years when the funds were actually spent.
The customary difficulties in Soviet statistics were resolved by period. For the years prior
to the Second World War, the fragmentation of the evidence precludes an appraisal of real
expenditures. Rather than engage in speculation, the project reported the official figures
published in the League of Nations Armaments Year-Book from 1924 to 1940. From 1955 to 1963
we utilized SIPRI estimates, and from 1963 on we have used ACDA figures.
Currency Conversion
In most cases, expenditures were originally collected in national currency. The data were
then converted into a standard unit - British pounds sterling prior to 1914, U.S. dollars thereafter -
using the COW currency conversion dataset (which uses current exchange rates). We entered
beginning of the year market rates wherever available, except for periods of marked inflation in
the twentieth century, where we entered black market rates, if available. This was the case for
most Western nations throughout the data period, and for most nations since 1945. Otherwise,
we used government rates, except for Eastern European states in the period after 1945, for which
we used dollar amounts. In all remaining cases - most in the first half of the nineteenth century,
for which documentation is particularly scarce - we used project estimates. Principal sources
were the Times (London) for the years prior to 1914, League of Nations Statistical Yearbook for
1919-1939, and International Monetary Fund from 1948 onward. Supplementary sources included
de Gotha and Statesman's Yearbook, as well as economic and historical monographs.
To moderate short-term fluctuations, we sometimes revised the resulting series by a
smoothing process that used a seven-year moving average. A prime example of its application is
the smoothing of rate changes during the wholesale suspension of the gold standard in the
1930s. In the event of introduction of a new currency, we omitted this process. Occasional
interpolations were performed to fill small intervals in a series, but only when currency conditions
seemed stable. Data during extremely inflationary times (e.g. Germany during the early Weimar
Republic) should be viewed with special care.
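
To illustrate the smoothing step, a minimal sketch in Python using pandas (the series values
are invented for illustration, and the centering of the seven-year window is our assumption; the
text does not specify how the window was aligned):

    import pandas as pd

    # Illustrative annual exchange-rate series (local currency per dollar).
    rates = pd.Series(
        [4.2, 4.3, 4.1, 6.8, 4.4, 4.5, 4.3, 4.2, 4.4],
        index=range(1928, 1937),
    )

    # Seven-year moving average to moderate short-term fluctuations.
    smoothed = rates.rolling(window=7, center=True).mean()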
Problems and Possible Errors
It was often difficult to identify and exclude civil expenditures from reported budgets of
less developed nations. For many countries, including some major powers, published military
budgets are a catch-all category for a variety of developmental and administrative expenses -
public works, colonial administration, development of the merchant marine, construction, and
improvement of harbor and navigational facilities, transportation of civilian personnel, and the
delivery of mail - of dubious military relevance. Except when we were able to obtain finance
ministry reports, it was impossible to make detailed breakdowns. Even when such reports were
available, it proved difficult to delineate "purely" military outlays. For example, consider the case
in which the military builds a road that facilitates troop movements, but which is used primarily
by civilians. A related problem concerns those instances in which the reported military budget
does not reflect all of the resources devoted to that sector. This usually happens when a nation
tries to hide such expenditures from scrutiny; for instance, most Western scholars and military
experts agree that officially reported post-1945 Soviet-bloc totals are unrealistically low, although
they disagree on the appropriate adjustments.
We also encountered difficulty concerning lack of sufficient information about local
currencies. Nineteenth century sources frequently shift from one name to another, for the same
currency. Thus, Almanac de Gotha uses the "thaler", the "thaler en espece," and the "riksdaler"
as currency unit names. After consulting several sources dealing with currencies, we determined
all three to be the same unit. Occasionally, the sources report a budget, particularly of states
newly independent in the nineteenth century, in different units from one referent year to the next.
Thus, Statesman's Yearbook and Almanac de Gotha report Guatemalan expenditures first in
silver pesos and later in paper pesos. Although we encountered situations in which currencies of
the same name but of different values were in circulation, usually the values were sufficiently
different to distinguish by comparison the units in question. Not surprisingly, these difficulties
were less prevalent in later years. Thus, SIPRI informed us that their series are always
represented in the most recent currency unit, to which prior data are adjusted. Again, usually the
scale of the reported figures is indicative of the referent unit.
A final problem concerning currency conversion is conceptual in nature. When comparing
economic magnitudes across time or space, there is a choice to be made concerning what price
weights apply to what quantities of each good or service under consideration. Our particular
choice of standard units (sterling and dollars) implies a decision to assign these weights to each
nation's military program according to British or U.S. opportunity costs for the referent year. This
choice is implicitly made when the project converts local currency units to sterling or dollars, for it
is then computing what Britain or the U.S. would have given up in order to make the same
purchases. Given the relatively free international monetary and trade movements that obtained
during much of the nineteenth century, in which pounds sterling, dollars, francs, deutschmarks, lira,
etc., were readily convertible into each other, there was arguably a single world economy. These
opportunity costs would then have been approximately the same for any standard unit, since
each nation was drawing on this single economy. In the most autarkical situations that
occasionally arose in the twentieth century, the opportunity costs were no longer roughly
equivalent; the relative monetary costs often depended on the currency in which they were
expressed. This was the situation during the world wars, when normal monetary and commercial
exchange was disrupted.
The most extreme cases, however, are the economies of the Soviet Union, China since
1949, and the centrally directed economies of Eastern European states since 1945, for which
there has been relatively little freedom of movement. Here, one might find Soviet military
expenditures exceeding U.S. expenditures, when they are valued in U.S. dollars, but the reverse,
when they are both valued in rubles. Moreover, because Soviet prices were set by fiat rather
than by market bidding, the prices of military goods and services, compared among themselves
or to civilian items, are not necessarily reflective of their relative value in the sense that we
normally ascribe, even as measured in the local currency. These difficulties compound the
problem we noted earlier, that the military accounts in question have often been distorted or
partially hidden to outside eyes. Like others before it, we found no way around these inherent
ambiguities. In the cases noted, we simply stuck with them.
The Future of Military Expenditures
Two tasks remain for the future of the military expenditure data. First, the military
expenditure data set requires that there be a consistent, accurate, and well documented currency
conversion data set. Raw military expenditure data often come in a variety of different monetary
units, such as rubles, dollars, pounds, francs, or marks. Because of these differing units, it is
quite important to have a universally accepted and accurate key for converting all those raw data
values into one common metric.
Unfortunately, the original COW Currency Conversion data set appears to have been lost
to time, and so although we have converted expenditure variables, we do not have the
conversion series.
If a new version of the currency conversion data set could be completed, a second major
endeavor of the military expenditure data set could begin: the re-documentation of the data
points before approximately 1960.

Quality/Anomaly Codes
There are no quality or anomaly codes for this component.
Component Data Set Layout
The layout of the data set is found in Table MILEX 1 below. The data set contains six
columns. The first and second columns correspond to the COW state abbreviation and COW
state number, respectively. The third column is the year of observation. The fourth column
contains the value for that year (from 1816 to 1913, in thousands of current year British pounds
and from 1914 onwards, in thousands of current year U.S. dollars), unless the value is missing.
Missing values are indicated by -9. The fifth and sixth columns contain any information that was
available from the original COW project. Since we did not attempt to verify these data, the
columns are often left blank in cases where we could not find any information about sources from
the original project.

Table MILEX 1: Data Set Layout
Military Expenditure
StateAbb CCode Year MilEx Source Note Version
USA 2 1816 3823 3.01

2010 Update
Military Expenditure data (MILEX) were coded for the 2002-2007 period with the
International Institute for Strategic Studies annual reports, The Military Balance. Issues arose with
specific country series; they were addressed as follows:
1. Bhutan. Military expenditure data is missing from 2004-2007. Information from
the U.S. State Department, however, made clear that Bhutan did indeed have a
regular military force. As a result, MILEX data is coded missing for 2004-2007;
2. Comoros. As was the case with the Maldives, no data was available for the size
of the military of Comoros or its military expenditures. Information from the U.S.
State Department, however, made clear that Comoros did indeed have a regular
military force. As a result, MILEX data for Comoros is coded missing;
3. East Timor. Military expenditure data was missing for the full 2002-2007 period.
Information from the U.S. State Department, however, made clear that East
Timor did indeed have a regular military force. As a result, MILEX data for the
2002-2007 period is coded missing;
4. Guyana. No military expenditure data for Guyana was available from
2002-2007, and these observations are coded missing;
5. Haiti. No military expenditure data for Haiti was available from 2004-2007. These
observations are coded missing;
6. Iceland. Military expenditure data was only sporadically available for Iceland.
While data is available for 2004-2006, other years in the update are coded
missing;
7. Iraq. Military expenditure data is absent for the 2002-2007 period and is coded
missing.
8. Liberia. Military expenditure drops from 45000 in 2003 (as reported in IISS 2004)
to 1 in 2004 (as reported in IISS 2006). Subsequent IISS reports report 2004-
2006 as missing. As a result, we coded 2004-2006 as missing;
9. Sao Tome and Principe. No data was available for the size of the military of Sao
Tome and Principe or its level of military spending. Information from the U.S.
State Department, however, made clear that Sao Tome and Principe did indeed
have a regular military force. As a result, MILEX data for Sao Tome and Principe
were coded missing;
10. Somalia. Military expenditure data was missing from 2003-2007 and coded
missing;
11. St. Kitts & Nevis. No military expenditure data for St. Kitts & Nevis was available
from 2002-2007. These observations were coded missing; and
12. Swaziland. No data was available for the size of the military of Swaziland or its
military expenditures. Information from the U.S. State Department, however,
made clear that Swaziland did indeed have a regular military force. As a result,
MILEX data for Swaziland is coded missing.

Bibliography
See Source Notes in sub-component data set.


Total Population
The total population of a state has been theorized to be one of the major factors in
determining the relative strength of states. A state with a large population can field a larger army,
maintain its home industries during times of war, and absorb wartime losses more easily than a state
with a smaller population.

What's New in Version 3.0
The series was updated through 2001. The original data were verified and in some
cases replaced.
Data Acquisition and Generation
While the most reliable total population figures usually appear in national government
tallies, modern census-taking was rare before 1850 in Europe and countries of European
settlement, and rare before the First World War elsewhere. In all periods, the accuracy and
reliability of national census data seem to vary with the level of economic development. As a
result, data from the developing world require particular scrutiny.
A census may be of the de facto population, comprising all residents within the national
boundaries, or of the de jure population, comprising only those who are legal residents. We used
the former, where possible, to which totals of military personnel abroad were added. Since the
differences between de jure and de facto (between "total" and "total home") population are
typically small, we did not analyze this data for sensitivity to these coding distinctions.
The United Nations Statistical Office has an estimated yearly total population series,
corrected for over- and under-enumeration to the extent possible, for most nations since 1919.
We relied on those series where possible.
For prior years and nations where we found one or more plausible time series, we took
data from the sources presenting the greatest continuity with the U.N. data. We uncovered most
of the general censuses taken since 1816 and used alternative sources for the numerous
remaining gaps. For example, Japan maintained a system of population registration through a
rough running tally. Other countries took sample surveys from which they constructed estimates
of the total population. We judged these sources the most reliable.
For the occasional nation maintaining reasonably complete registers of vital events (e.g.
the United Kingdom), we estimated missing data utilizing Formula TPOP One:

Formula TPOP 1: Missing Total Population Data Estimations

p(t) = p(t0) + b(t) - d(t) + i(t) - e(t),

where:
p(t) is the known or estimated population at time t,
p(t0) is the population recorded at time t0, and
b(t), d(t), i(t), and e(t) are the respective numbers of births, deaths, immigrants, and
emigrants recorded since t0.

Net migration is usually small enough to be safely disregarded. For nations maintaining registers
of births and deaths but not of migration, we estimated i(t) - e(t) to be zero.
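
Formula TPOP 1 translates directly into code; a minimal sketch in Python (the function and
argument names are ours):

    def estimate_population(p0, births, deaths, immigrants=0, emigrants=0):
        # Roll a recorded total forward using registered vital events; net
        # migration defaults to zero when migration registers are unavailable.
        return p0 + births - deaths + immigrants - emigrants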
In lieu of complete demographic records, we resorted to estimation either (1) by
interpolation, or (2) by least-squares linear regression with time as the independent variable. The
choice between them was based on the availability and quality of information, and on whether the
period in question was marked by major wars or territorial boundary changes. First, however, we
must note four types of situations in which such change did not take place.
The first concerns the many cases for which censuses had been taken regularly. In these
cases, the only missing records were for the intervening years. In such instances we interpolated
between census records using Formula TPOP Two Below:

Formula TPOP Two: Interpolation Between Known Data Points

p(tx) = p(t1) * [ p(t2) / p(t1) ]^((tx - t1) / (t2 - t1)),

where:
p(t1) and p(t2) are the known population figures at times t1 and t2, and p(tx) is the
population estimated for an intervening time tx.

This method entailed the assumption of a constant growth rate over the period delineated by t1, t2,
and tx, from which the formula is derived.
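
A minimal sketch of this constant-growth interpolation in Python (the function and variable
names are ours):

    def interpolate_population(p1, p2, t1, t2, tx):
        # Implied constant annual growth factor between the two known points.
        growth = (p2 / p1) ** (1.0 / (t2 - t1))
        # Population at the intervening (or, for extrapolation, adjacent) year tx.
        return p1 * growth ** (tx - t1)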
In a second type of situation, taking account of the country's demographic history, the manner
and quality of its census-taking, the later population trends, and the opinion of demographers, we
were able to discern a plausibly reliable population series even though regular censuses were not
available. Again, we resorted to interpolation, as it seemed appropriate here.
A third type of concern arose where data for the final years in a series were missing,
for example because of loss of national identity (e.g. Estonia, Latvia, and Lithuania in 1940, or Austria-
Hungary during World War One). In these cases, we resorted to extrapolation using the above
interpolation formula.
A fourth concern was the problem of having no uniform data series at our disposal, with
the available sources providing only a patchwork of spotty and conflicting coverage. In this type
of situation, we estimated population by regression performed on the logarithms of the known
data. A prime consideration in our willingness to use this technique was that data for
well-documented nations indicate that growth rates usually change quite slowly. Distortions were
thereby introduced, but not to as great a degree as would arise from applying interpolation and
extrapolation to these highly erratic data. Where necessary, to bring the estimates into agreement
with the uniform (typically post-1919 United Nations) series of an adjoining epoch, we raised or
lowered the regression line - while maintaining the same slope - such that the line passed through
the adjacent values of the series.
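
A minimal sketch of this procedure in Python using NumPy (the names are ours; the anchoring
step implements the parallel shift of the regression line described above):

    import numpy as np

    def loglinear_fill(years, pops, target_years, anchor_year, anchor_pop):
        # Least-squares regression of log(population) on time.
        slope, intercept = np.polyfit(np.asarray(years), np.log(pops), 1)
        # Shift the fitted line, holding the slope fixed, so that it passes
        # through a known adjacent value of the uniform series.
        shift = np.log(anchor_pop) - (slope * anchor_year + intercept)
        return np.exp(slope * np.asarray(target_years) + intercept + shift)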
Finally, in situations of marked discontinuity in population trends associated with wars
and exchanges of territory, we applied the above methods as appropriate, but only to the
separate intervals on either side of the discontinuity, and only where we could document its
demographic magnitude. Interpolation, for example, could be used only if total population before
and after the break, or one of them plus the magnitude of the change, was known. We treated all
cases in which the nation gained or lost at least 1% of its total home population in this manner.
For territorial exchanges, we were able to document many of the gains and losses. We were,
however, usually unsuccessful in documenting war losses. Our method was to adjust for the effect
of territorial changes and then to extrapolate forward from pre-war and backward from post-war
data. Unless otherwise noted, population losses due to war were prorated over the war's duration.
The most pronounced instance was our estimate of the Chinese population during and immediately
after the Taiping Rebellion.
Problems and Potential Errors
Two difficulties concern territorial boundary changes and our estimation
assumptions. Concerning the former, the United Nations series occasionally fails to adjust the
base to reflect such changes; estimates for prior years may measure the population living within the
present national boundaries even though territorial changes occurred in the interim. We
attempted to determine where this was the case and to make adjustments to reflect the actual
boundaries at the time.
The second difficulty is that we assumed a constant growth rate in regression. We
regard any observed deviations from constant growth as due to under- or over-enumeration. This
procedure would cause us to miss the effects of, for example, a famine. The population growth
rate for a particular state may have been much higher than our estimate; when a famine struck
for a few years between censuses, the population dropped severely, but then
resumed a higher growth rate than our estimation procedure captured. The result is that our
estimation procedures capture the overall trend of the population rather than an accurate reflection
of the year-to-year population fluctuations. We assume that these circumstances are rather rare
and generally not of a magnitude large enough to distort the data in any significant
manner, particularly when combined in the aggregate CINC score.


Quality Codes
In version 3.0 of this data set, a measure has been included to capture the source of
each data point, reflecting the confidence we have in each point. The quality codes for total
population are listed in Table TPOP 1 below.
Table TPOP 1: Total Population Quality Codes
Code Interpretive Meaning
A Value from identified source
B Linear Interpolation from identified sources
C Linear Interpolation from at least one unidentified source
D Regression from identified sources
E Regression from at least one unidentified source
F Extrapolation from identified sources
G Extrapolation from at least one unidentified source
M Missing value.

In documenting and revising the entire population data set for version 3.0, we first
identified data points in the 2.1 data for which we do or do not have an identified source.
Data points from identified sources were given a rating of "A." Data points generated from linear
interpolation were given a rating of "B" if they were produced using two known data points and a
"C" if they came from one or more unidentified sources (including a value from the version 2.1
capabilities data set if the source was unknown). Data generated utilizing regression techniques
received quality codes of "D" and "E," again based on the number of known data points that were
utilized in generating them. Extrapolated data points received quality codes of "F" and "G." Any
missing data points received a quality code of "M." It is important to note two things about this
quality code scheme. First, it is NOT meant to be an ordinal scale for all data points; while all "A"
data points are of the highest quality and standards, a point with a "C" quality code should not be
taken as being of superior quality to a "G"-valued point. The second important point to note is
that the vast majority (over 85%) of the data points have a quality code of "A." The eventual goal
of this project should be to gather data on the points where data is less available (codes "B"
through "M") and convert them into "A" values.
Anomaly Codes
Version 3.0 of this data set also identifies points where the time series of total population
makes radical changes. Identifying these inconsistencies will make future versions of this data
set more robust, as it will be easier to identify where there are difficulties and concerns with
particular data points.
Each indicator of the CINC possesses differing standards for what constitutes an
anomaly. For total population, the standard change is a two percent increase or decrease in one
year's time. For a sense of scale, this would be the equivalent of the United States losing the total
population of the state of Washington in just one year. If the population changed by more than
+/- 2%, we investigated the data point further to try to determine the source of the rapid change.
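
A minimal sketch of this screening rule in Python using pandas (the names are ours; tpop is
assumed to be a single state's total population series indexed by consecutive years):

    import pandas as pd

    def flag_tpop_anomalies(tpop):
        # Absolute year-to-year percentage change in total population.
        pct_change = tpop.pct_change().abs()
        # Flag observations exceeding the +/- 2% screening threshold.
        return pct_change > 0.02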
The anomaly codes for total population are listed in Table TPOP 2 below.

Table TPOP 2: Total Population Anomaly Codes
Code Substantive Meaning
A No Anomaly (< 2% change)
B Explained Inconsistency (e.g. change in territory, loss in wartime)
C Change of Sources (between 2 non-UN sources or 1 non-UN to UN source)
D Change of UN Sources
E UN Internal Inconsistency within same UN source
F Internal inconsistency within non-UN source
G Unexplained Anomaly

Over 95% of the anomaly codes for total population are "A"s. A second point worth mentioning
about these codes is that "C," "D," "E," and "F" are constrained by time. "F" values are almost
always found before 1919, when the League of Nations began collecting data, while the other three
potential anomalies are always found during the times when UN data are utilized (1919 to the
present).
Component Data Set Layout
The layout of the Access sub-component data set is found in Table TOT POP 3 below.
The data set contains nine columns. The first and second columns correspond to the COW state
abbreviation and COW state number, respectively. The third column is the year of observation.
The fourth column contains the value for that year (in thousands), unless the value is missing.
Missing values are indicated by -9. The fifth column provides the source of the data point or "See
note." If the column contains "See note," the note column should be consulted to see how that
data point was calculated. The sixth and seventh columns, respectively, list the data anomaly
and quality codes for that value. The eighth column, "Note," explains how that data point was
obtained (i.e. linear interpolation, extrapolation, etc.). This column is usually empty for data points
with a quality code of "A." The ninth and final column lists the version number of this data set.

Table TOT POP 3: Data Set Layout
Total Population 3.0
StateAbb CCode Year TPop Source AnomCode QualCode Note Version
USA 2 1816 8659 Historical Statistics of the United States: Colonial Times to 1970, Part 1, Page 8 | A | A | (blank) | 3.01



2010 Update
The 2010 update of the total population data followed the methodology described above in the
3.0 update. The primary source for total population data for the 2002-2007 period was the online
version of the United Nations Demographic Yearbook (UNDYB). Because the United Nations
updates population values at each iteration of the yearbook, for each year in which we coded a
population value we used the most recent version of the UNDYB that reported a value for that
year. Because many of the values coded for 2001 during the previous update of the data were
extrapolated from existing data, we updated those values where data from an identified source
was available.

For cases in which no data was available from the UNDYB for the period 2002-2007, we
substituted data from the online edition of the World Bank's World Development Indicators (WDI)
and the CIA World Factbook (CIA). We compared country-years for which the UNDYB, WDI, and
CIA each had data. We noted in this comparison that, quite often, these values did not match
exactly. WDI values, for example, tended to be larger than those reported by the UNDYB. As a
result, to avoid introducing excessive source-induced variance in the data, for countries for which
we were able to locate UNDYB population data we interpolated and extrapolated missing values
from these known data points rather than use WDI or CIA data to fill in these missing values.

In checking the data for anomalies, we followed the procedures outlined above. We closely
examined all observations with a year-to-year population change greater than 2%. The vast
majority of these population anomalies were confirmed by the UNDYB data. Many of these
anomalies came from countries with very small populations, making them particularly susceptible
to large population swings. We also compared cases with significant outliers to the Correlates of
War Territorial Change data set (v 4.01) in order to determine potential explanations for
significant anomalies in the population data.

Bibliography
Bunle, Henri. 1954. Le Mouvement Naturel de la Population dans le Monde de 1906 a 1936.
Paris: L'Institut National d'Etudes Demographiques.
Flora, Peter. 1983. State, Economy, and Society in Western Europe 1815-1975 A Data
Handbook in two Volumes. Frankfurt: Campus Verlag; London: Macmillan Press;
Chicago: St. James Press.
France, Bureau de la Statistique Generale de la France. 1907. Statistique internationale du
mouvement de la population d'apres les registres d'etat civil. Paris: Imprimerie Nationale.

Maddison, Angus. 1995. Monitoring the World Economy, 1820-1992. Paris: OECD.
Mitchell, Brian R. 1988. British Historical Statistics. Cambridge: Cambridge University Press.
Mitchell, Brian R. 1998. International Historical Statistics: Europe, 1750-1993. London: Macmillan
Reference; New York: Stockton Press.
Mitchell, Brian R. 1998. International Historical Statistics: Africa, Asia & Oceania, 1750-1993.
London: Macmillan Reference; New York: Stockton Press.
Mitchell, Brian R. 1998. International Historical Statistics: the Americas, 1750-1993. London:
Macmillan Reference; New York: Stockton Press.
Statistics Division of the Department of Economic and Social Affairs of the United Nations
Secretariat. United Nations Demographic Yearbook Historical Supplement 1948-1997.
49th issue. CD-ROM.
Tir, Jaroslav, Philip Schafer, Paul F. Diehl, and Gary Goertz. 1998. "Territorial Changes, 1816-
1996." Conflict Management and Peace Science 16: 89-97.
United Nations Statistics Division. 2009. United Nations Demographic Yearbook.
http://unstats.un.org/unsd/demographic/products/dyb/dyb2.htm
Wilkie, James. 2000. Statistical Abstract of Latin America. V36. Los Angeles: University of
California at Los Angeles, Latin American Center Publications.
The World Factbook 2009. Washington, DC: Central Intelligence Agency, 2009.
https://www.cia.gov/library/publications/the-world-factbook/index.html

World Bank. (2009). World Development Indicators Online (WDI) database.
http://data.worldbank.org/indicator

Urban Population
Besides sheer numbers of people, it is important to capture other elements of a state's
population. Factors such as education, societal organization, and social services are not
captured by the measure of total population. In order to capture the net effect of these more
abstract and amorphous ideas, this project includes a measure of urban population. Urbanization
is associated with higher education standards and life expectancies, with industrialization and
industrial capacity, and with the concentrated availability of citizens who may be mobilized during
times of conflict.
What's New in Version 3.0
The series was updated through 2001. Some series were recomputed when new data
suggested that reinterpretation or extrapolation was necessary.
Data Acquisition and Generation
"Urban population" is a difficult concept to specify and operationalize for a professional
demographer, let alone an international relations researcher. What criterion best captures the
meaning of the term? A common approach is to include all cities that exceed a size threshold.
Many such thresholds, ranging from 5,000 to 100,000 inhabitants, have been advanced. By
virtue of its simplicity, we adopted the threshold criterion using the upper value of 100,000.
This choice has the advantage of facilitating data completeness, which is problematic at
lower values. It has the corresponding liability that, in the early 1800s, many areas that one might
consider "urban" did not contain 100,000 people. Moreover, the approach appears less well
suited for the contemporary period, when build-up areas frequently are comprised, in large part,
of many smaller cities and unincorporated places.
While the best data came from national censuses, several of them do not tabulate urban
population. Some developed nations take sample surveys to construct reasonable estimates of
urban population while multinational sources and demographic experts also publish data based
on their own estimation procedures. We used such estimates whenever they did not contradict
formal census figures.
The data reflect varying national definitions of what constitutes an incorporated city or
urban area; we used these figures where alternatives were unavailable. Occasionally, a source
changed its city definition, thus creating a discontinuity in the time series. In instances before
1945 where more than one alternative was offered as to the boundaries of a city, we adopted the
one more closely reflecting the built-up area. Otherwise, we entered the data as it was reported.
Occasionally, the data reflect a mix of de facto and de jure information. In some states, there
would be de facto data for one urban area while there would only be de jure data for another
urban area within the same state. For instance, looking to Russian urban data, it is
rather easy to find recorded urban population data for the Moscow urban area; finding recorded
data on St. Petersburg or Vladivostok is much more challenging. Usually we found only one or
the other, and secondary sources offered scant clarification. In order to present a series with as
much documented data as possible in the face of this ambiguity, we averaged across de facto
and de jure totals. For the occasional country that mixes data from different years in the same report, the
project used interpolation and extrapolation to estimate the referent year.
Often, the value of the same urban population datum is revised from one demographic
yearbook to the next. Presuming that revised data are more accurate, we used them. When, as
often was the case, this introduced a discontinuity between the first year appearing in the revised
series and the previous year appearing in the old, we performed log-linear regression on all the
old data in our pooled series and adjusted the regression line to match the revised data points.
When we encountered numbers from other sources significantly different from the United
Nations series, we used the U.N. figures unless they were irregular. In the latter cases, we used
the log-linear regression method on available data points, the United Nations and otherwise. For
cases of recently declining urbanization (e.g. Belgium and the Netherlands in the 1970s), we filled
the data gaps in the same way using a constant negative growth rate.
We conceive of urbanization as a continuous process, for which the growth rate should
vary smoothly. On the other hand, the inclusion of additional cities, as they exceed the population
threshold, introduces discontinuities in the census totals. Moreover, some cities appear in one
enumeration, but are absent from the next. Cities also occasionally make first-time appearances
bearing totals well over the threshold population value. Secondary sources remedied the situation
to a limited extent. Since interpolated and extrapolated values can be dominated by such
irregularities, we frequently used log-linear regression as a means of smoothing the data obtained
by the above methods to obtain a final estimate.
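
As a rough illustration of this smoothing step, the sketch below fits log(urban population) on
year by ordinary least squares and reads smoothed values off the fitted line. It is a minimal
sketch of the kind of log-linear regression described above; the helper names and the example
figures are illustrative, not the project's actual code or data.

```python
import math

def log_linear_fit(years, values):
    # Fit log(value) = a + b * year by ordinary least squares.
    logs = [math.log(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs)) / \
            sum((x - mean_x) ** 2 for x in years)
    intercept = mean_y - slope * mean_x
    return intercept, slope

def smoothed_value(years, values, target_year):
    # Read the smoothed (or extrapolated) value off the fitted line.
    a, b = log_linear_fit(years, values)
    return math.exp(a + b * target_year)

# Hypothetical urban-population points (in thousands) with a gap at 1960:
print(smoothed_value([1950, 1955, 1965, 1970], [120.0, 140.0, 190.0, 225.0], 1960))
```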
Problems and Potential Errors
In the contemporary period, there is some debate over whether urban population or
urban agglomeration is a better measure of a countrys level of development. Urban
agglomeration includes both the population of people living within the city proper and its suburbs.
Since there has been a population shift away from large cities and toward suburban areas in most
industrialized countries, we thought that shifting to urban agglomeration data in the years after
1945 would provide a better indicator of each countrys level of development.
Unfortunately, logistical problems prevented us from making the shift from urban
population to urban agglomeration. The United Nations statistical yearbook reports figures for
both urban population and urban agglomeration, but the numbers for urban agglomeration are
much less complete. Many countries do not report urban agglomeration at all, and those that do

35
generally report it less frequently than they report urban population. Furthermore, there is not
one year in which a critical mass of countries began reporting urban agglomeration. While some
developed countries began releasing data on urban agglomeration shortly after the end of World
War II, other developed countries did not release any data on urban agglomeration until the
1980s.
A small number of countries released only urban agglomeration data instead of urban
population data. In those cases, we included the agglomeration figures in the data set with a note
indicating that they were agglomeration rather than population figures.
We also investigated a number of other sources for data on urban population, most
notably the U.N. World Urbanization Prospects. While these sources provided data at regular
intervals, they did not provide a clear definition of urban population, and so we did not use these
sources.

Quality Codes
Urban population employs a system of alphabetical codes to identify the relative strength or
confidence a particular data point, as listed in Table UPOP 1.
Table UPOP 1: Urban Population Quality Codes
Code Substantive Interpretation
A Value from UN Demographic
B Assumed 0 (Ex.: Vanuatu)
C Linear Interpolation from identified sources
D Linear Interpolation from at least one unidentified source
E Extrapolation from identified sources
F Extrapolation from at least one unidentified source


It is important to note that there is a unique quality code for urban population: the
assumption of zero, coded as B. While it is a rare data value, there are states with no cities that
reach the 100,000-inhabitant threshold set above. It is also important to note that these quality
codes are not meant to form an ordinal scale, except for the value of A, which should be taken
to mark the most reliable and highest-quality data points in the data set.

Anomaly Codes
There are no anomaly codes for this component.
Component Data Set Layout
The layout of the Access sub-component data set is found in Table UPOP 2 below. The
data set contains eleven columns. The first and second columns correspond to the COW state
abbreviation and COW state number, respectively. The third column is the year of observation.
The fourth column contains the value for that year (in thousands), unless the value is missing.
Missing values are indicated by -9. The fifth column provides the source of the data point, when
this information is available. The next two columns deal with cases where figures were estimated
using growth rates: the "Growth Rate" column gives the growth rate used for that particular year,
if one was needed, while the "Growth Rate Source" column indicates the source for that rate. The
"Note" column contains any other pertinent information. The ninth and tenth columns,
respectively, list the data quality and anomaly codes for that value. The eleventh and final column
lists the version number of this data set.

Table UPOP 2: Data Set Layout (Urban Population)

StateAbb | CCode | Year | UPop | Population Source | Growth Rate | Growth Rate Source | Note | Quality | Anomaly | Version
USA | 2 | 1816 | 101 |  |  |  | For 1810, HS US 1975 gives 0. |  |  | 3.01

2010 Update
In updating the urban population data for 2002-2007 we followed the methodology described
above. Data for urban population were taken from the United Nations Demographic Yearbook.
Consistent with previous versions of the data, we used urban population growth rate data to
estimate the urban population values for country-years without data. Data on urban population
growth were taken from the United Nations World Urbanization Prospects database.

Because actual measures of urban population are rarely produced on an annual basis for any
country, much of the data during the 3.0 update, as well as this current update, were produced
through regression, interpolation, and extrapolation from known data points. One consequence
of this characteristic of the data is that, because the current update added new observations of
identified values, the known points on which prior extrapolations were based have changed. As a
result, this update required both the addition of new data for 2002-2007 and changes to some
data before 2002. The following example illustrates why these changes beyond the primary
update period of 2002-2007 were necessary:





Example of Changes Made to Prior Urban Population Data



In the example in the above table, the last urban population value for Somalia from an identified
source observed during the update of the data to version 3.02 is 377 in 1980. All of the
subsequent values for Somalia from 1981-2001 were extrapolated from that prior data point in
version 3.02. In the 2010 update, we use a newly identified urban population value for Somalia in
2001 from the UN Demographic Yearbook. This value, 1212, is substantially larger than the
extrapolated value of 692 that was produced during the version 3.02 update. As a result, it was
necessary not only to change Somalia's urban population value for 2001 from the value reported
in the prior version of the data, but also to recalculate by interpolation the values from 1981-2000.
By way of comparison, the urban population values for Somalia produced in the current update
are listed in the column "upop 2010" above, while the values produced during the previous
update are listed in the column "upop old."
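
To make the mechanics of that recalculation concrete, the sketch below reproduces the
Somalia re-interpolation under a constant (log-linear) growth-rate assumption, using the two
identified values quoted above (377 in 1980 and 1212 in 2001). This is our illustration of the
method, not the project's actual code, and the published series may have been adjusted further.

```python
# Constant annual growth rate implied by the two identified data points:
rate = (1212 / 377) ** (1 / 21)          # roughly 1.057, i.e. about 5.7% per year

upop = {1980: 377.0}
for year in range(1981, 2002):
    upop[year] = upop[year - 1] * rate   # geometric (log-linear) interpolation

print(round(upop[2001]))                 # recovers 1212 by construction
```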


Bibliography
See Source Notes in sub-component dataset.

Iron and Steel Production
Iron and Steel Production is one of the two components of the industrial dimension, and one of
the six indicators of national power. It reflects all domestically produced pig iron through 1899
and steel from 1900 onward.
What's New in Version 3.0
In addition to cleaning and updating the data set through 2001, this version contains two main
new features: data quality codes and anomaly codes.
Data Acquisition and Generation

Iron and steel production trends since 1816 involve transitions concerning the categories
of iron produced and the types of fuels used in making iron and steel. In general, "cast iron"
means all iron, including pig iron, that has at least 0.3% carbon. Specifically, cast iron includes
all iron that has been molded into functional shapes. "Wrought iron" (puddle iron or bar iron) is
made from pig iron (except in a small percentage prior to 1850, when it was made directly from
ore) in a puddling furnace. It is very pure (containing less than 0.04% carbon) and relatively
malleable. "Steel" has an intermediate carbon content, between 0.04 and 2.25%.
Until around 1870, cast iron and wrought iron were the principal products. The proportion
of the former as a final product steadily decreased until castings, as a proportion of total blast
furnace production, amounted to less than 0.1%, and wrought iron became the primary metal of
construction.
By 1880, the Bessemer process and improvements in coking had made wrought iron
production obsolete. The use of coke as an inexpensive, non-volatile, and structurally solid fuel
allowed the construction of larger blast furnaces. The use of coke, combined with the rapid steel
production of the Bessemer process, made steel the primary commercial metal.
While wrought iron was of primary importance as a finished good prior to 1870, we did
not use it as an indicator because: 1) pig iron data are more readily available; 2) in our judgment,
use of wrought iron would underestimate industrial activity in some states, notably the United
States; and 3) such use would downplay the importance of cast iron production, especially prior
to 1850. Steel production totals were too low in many states to reflect accurately industrial activity
in the nineteenth century. Instead, for the years 1816-1899, we estimated iron production from pig
iron output. When direct castings output was reported separately from pig iron, we added these
totals to the reported pig iron. This reflects our judgment that direct castings are nothing more
than cast pig iron. Our selection of crude pig iron plus separately reported direct castings is
plausible because this output was part of every activity in the iron and steel sectors of the
economy.

Where iron production appeared in disaggregated form, we summed the appropriate raw
figures to form the total pig iron output. This was done most often for Prussian and Austrian data
when we had to transform the old Prussian and Austrian centners into tons.
By 1900, the preferred product of this economic sector was clearly steel, hence our use
of steel output as an indicator. This date is somewhat arbitrary, since any year from 1890 to
around 1910 could have been chosen for the same reason. It is, however, a reasonable midpoint
for our analysis. By 1910, virtually every nation that produced iron in the nineteenth century had
matched, in its steel output, its previous rank as measured in pig iron. We are confident that the
two indicators are roughly equivalent measures of industrial activity at the point of transition.
Data Sources
The approach to refining and updating the data was similar to detective work. In many
cases we had the data and a source list but did not know which sources corresponded to which
values. As a result, we had to rely on memos from the original COW project at the University of
Michigan to put the puzzle together. Hence, most of the sources used were the same as those
used by the original COW project. However, some minor changes were made when extending
the data set to 2001. In some cases, the original COW data set had estimates for which we were
unable to identify a source. Since there is no reason to doubt those numbers, we retained those
estimates. A list of states where we used COW1 estimates is found in Table IRST 1.

Table IRST 1: States Utilizing Original COW1 Data Points

State | Years
Mexico | 1874-1899
Netherlands | 1945
Switzerland | 1850-1899
Bavaria | 1816-1833, 1851-1852, 1859-1871
Germany | 1945
Saxony | 1837-1867
Wurttemberg | 1834-1870
Austria-Hungary | 1816-1820
Greece | 1979
Sweden | 1816-1820
China | 1860-1899



In many cases the values from the original data set matched the values in B.R. Mitchell's
volumes of International Historical Statistics. A second important source, which allowed us to
update the data set through 2001, was the Steel Statistical Yearbook published by the
International Iron and Steel Institute. This not only allowed us to update the data but also led us
to discover that there were many states which, according to COW1, had a value of zero, but
which actually had production. Changes were made to replace those zero values with the values
provided by the Steel Statistical Yearbook. All the states in which we found no evidence of
production were given a value of zero, and a note was placed in the note column indicating that
there was no known production capability for these states.

Table IRST 2: Interpolations and Extrapolations

Log-Linear Interpolations:
State | Years
United States | 1816-1927
France | 1816-1819, 1821-1823
Spain | 1831-1845
Poland | 1919
Albania | 1981-1987
Rumania | 1919, 1941, 1942, 1944
Soviet Union | 1941-1944
Denmark | 1939-1940
Morocco | 1976-1979
Egypt | 1957
Israel | 1954-1958
Pakistan | 1983-1989
Dem. Republic of Vietnam | 1984-1989

Log-Linear Extrapolations:
State | Years
Cuba | 1956-62
Mexico | 1919
Belgium | 1830
Switzerland | 1905-14
Germany | 1816-20
Italy | 1816-45
Yugoslavia | 1919
Bulgaria | 1908-1936
China | 1935



For some states where a complete series did not exist, the value was estimated using
log-linear interpolation or, in some cases, log-linear extrapolation. States where log-linear
interpolation or extrapolation was performed are listed in Table IRST 2 above.

Table IRST 3: Special Estimation Cases (See Data File for Specific Details)

State | Years
Netherlands | 1816-1830, 1831-1841
Spain | 1846-60
Germany | 1821-60
Austria-Hungary | 1821-1840, 1841-1899, 1900-1909
Austria | 1919
Italy | 1846-60
Greece | 1951-1952, 1954-1956
Sweden | 1821-1835
Angola | 1976-1979, 1981-1982, 1986
Morocco | 1963-64
Iran | 1974, 1977-1979, 1980
China | 1936-37


There are also a few data points where specialized estimation procedures were utilized.
For example, to calculate the values for Sweden 1821-35, COW1 took the number provided by
Woytinsky's five-year interval and divided it by 5 to get the average over the five years. The
states and years where specialized estimation techniques were utilized appear in Table IRST 3
above.

Problems and Potential Errors
Some might question the project's retention of steel to the present. Steel production is
currently declining for some highly developed states, and many scholars argue that it is no longer
a valid indicator of industrial activity. This decline, though, reflects the trend in virtually every
industrialized sector of such states. The decline of steel production in the United States, for
example, closely parallels the decline in automobile production. We think it fair to say that the
downward trend primarily characterizes the manufacture of such durable goods and represents
the passage from one stage of development (heavy industrialization and consumer durables) to
another (computers, information processing, and other high technology). Therefore, we are not
troubled by our use of steel production as an indicator, since it mirrors the more general trend.
Our choice of pig iron and steel as indicators of industrial strength is plausible since these
materials are both the primary product of the blast furnace and hence the closest thing we can
find to raw industrialization. The project has considered shifting to (or adding) materials such as
aluminum, semiconductors, or PCs, but each indicator brings with it its own problems, and such
discussions have not been finalized.
Quality Codes
The quality coding scheme is listed in Table IRST 4 below. A data point received the
quality code A if the value came from an identified data source such as Mitchell or the Steel
Statistical Yearbook. A quality code of B was given to those data points where a state had no
known production capability. If a data point was interpolated by COW2, it received a quality code
of C. The quality code D refers to data from the earlier COW data set; it includes values in the
earlier data set for which we could not confirm the source, as well as interpolations,
extrapolations, or other estimations performed by COW1. Finally, a data point receives the
quality code M if the value is missing.

Table IRST 4: Iron and Steel Quality Codes

Quality Code | Interpretation
A | Value from identified source
B | No known production; assumed to be zero
C | COW2 interpolation
D | Data from earlier COW data set, but with missing or unidentified source
M | Missing value



Anomaly Codes
In the data set there are places where there is a large increase in iron or steel production
from one year to the next. We identified these large increases and created a coding scheme to
alert users to discontinuities in the time series. An anomaly was defined by the project as an
increase or decrease in a value from the previous year of at least 100%. We have identified 263
data points where the difference from the previous year was at least 100%; these encompass
2% of all data points (13,002 total data points).
When the difference from year t to t+1 was less than 100%, the value at year t+1
was coded as A. A data point was coded as B when the increase occurred as a result of initial
industrialization, that is, moving from having no production capability to having production; there
are 59 data points with this type of anomaly code. A value is coded as C if the difference from the
previous year resulted from changing sources; there are 29 anomalies of this type, most of
which occur where we moved from using UN data to Mitchell data. A value is coded as D if the
increase occurred within the same source; there are 175 data points with internal source
inconsistencies. For example, if Mitchell reports a value at year t and there is at least a 100%
increase at year t+1, the value at t+1 would be coded as D. Finally, a value is coded as E if we
could not find an explanation for the increase; no values received this code. In all cases, the
second year of the anomaly is given the code: if there is an anomaly from year t to year t+1, year
t+1 receives the anomaly code.
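
A minimal sketch of this flagging rule appears below. The helper is illustrative only: it applies
the 100% threshold and the initial-industrialization case, but distinguishing the C, D, and E
codes in practice requires source information that is not modeled here.

```python
def flag_anomalies(years, values):
    # Assign an anomaly code to the *second* year of each transition,
    # per the rule described above. 'years' and 'values' are parallel lists.
    codes = {}
    for i in range(1, len(years)):
        prev, curr = values[i - 1], values[i]
        if prev == 0 and curr > 0:
            codes[years[i]] = "B"   # initial industrialization
        elif prev > 0 and abs(curr - prev) / prev >= 1.0:
            codes[years[i]] = "D"   # placeholder: C/D/E depend on the sources
        else:
            codes[years[i]] = "A"   # no anomaly
    return codes

print(flag_anomalies([1923, 1924, 1925], [0, 0, 35]))  # {1924: 'A', 1925: 'B'}
```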

Component Data Set Layout
The layout of the Access sub-component data set is found in Table IRST 6 below. The
data set contains nine columns. The first and second columns correspond to the COW state
abbreviation and COW state number, respectively. The third column is the year of observation.
The fourth column contains the value for that year (in thousands of tons), unless the value is
missing. Missing values are indicated by -9. The fifth column provides the source of the data
point, or "See note." If the column contains "See note," the Note column should be consulted to
see how that data point was calculated. The next (sixth) column, "Note," explains how that data
point was obtained (i.e., linear interpolation or a COW1 memo); this column is usually empty for
data points with a quality code of A. The seventh and eighth columns, respectively, list the data
quality and anomaly codes for that value. The ninth and final column lists the version number of
this data set.



Table IRST 5: Iron and Steel Anomaly Codes
Anomaly Code Interpretation
A No anomaly
B No known production capability to production (Ex.: Brazil 1924-1925)
C Changes of sources (Ex.: China 1911-1912)
D Internal source inconsistency (Ex.: Algeria 1963-64)
E Unexplained anomaly



Table IRST 6: Data Set Layout (Iron & Steel Production)

StateAbb | CCode | Year | IrSt | Source | Note | QCode | Anomaly Code | Version
USA | 2 | 1816 | 80 | Mulhall, Michael. "The Dictionary of Statistics," George Routledge and Sons, Limited, 1892, p. 332. | COW1 memo states that they interpolated 1816-1819 using Mulhall's 1810 (55,000 metric tons) and 1820 (110,000 metric tons) figures. | D | A | 3.01

2010 Update
The World Steel Association's Steel Statistical Yearbook 2008 (Table 1, pp. 3-5) was employed to
update IRST for the period 2002-2007.

Bibliography
Europa Yearbook. 1986. London: Europa Publications Limited.
Hartmann, Carl. 1861. Der Heutige Standpunkt Des Deutschen
Eisenhuttengewerbes, Leipzig: Verlag von Veit und Comp.

Hood, Christopher. 1911. Iron and Steel: Their Production and Manufacture.
London: Sir Isaac Pitman & Sons, Ltd.

Hsia, Ronald. 1971. Steel in China: Its Output Behavior, Productivity & Growth
Pattern. Wiesbaden: O. Harrassowitz.

Imperial Institute. 1938. The Mineral Industry of the British Empire and Foreign
Countries: Statistical Summary. His Majesty's Stationery Office.

International Iron and Steel Institute. 2000. Steel Statistical Yearbook (CD-Rom).
Belgium: International Iron and Steel Institute.

International Iron and Steel Institute. Steel Statistical Yearbook 2002. Belgium:
International Iron and Steel Institute.

League of Nations. Various years. Statistical Yearbook of the League of Nations.
Geneva: Series of League of Nations Publications.

Mitchell, B.R. 1998. International Historical Statistics: Africa, Asia & Oceania
1750-1993. Third Edition. New York, NY: Stockton Press.

Mitchell, B.R. 1998. International Historical Statistics: Europe 1750-1993. Fourth
Edition. New York, NY: Stockton Press.

Mitchell, B.R. 1998. International Historical Statistics: The Americas 1750-1993.
Fourth Edition. New York, NY: Stockton Press.


Mulhall, Michael. 1892. The Dictionary of Statistics. London: Routledge and
Sons, Limited.

Oechelhauser, Wilhelm. 1852. Vergleichende Statistik der eisen-industrie aller
lander, und erorterung ihrer okonomischen lage im Zollverein. Berlin:
Derlag bon Veit und Comp.

Pakistan Statistics Division. 1990 & 1994. Pakistan Statistical Yearbook. Karachi:
The Manager of Publications.

Ruess, Conrad, Emile Koutny, and Leon Tychon. 1960. Le Progrès Économique
en Sidérurgie: Belgique, Luxembourg, Pays-Bas, 1830-1955. Louvain.

Seiichi, Tohata (ed.). 1966. The Modernization of Japan. Tokyo: Institute of Asian
Economic Affairs.

Singer, J. David, with P. Williamson, C. Bradley, D. Jones, and M. Coyne. 1990.
National Material Capabilities Dataset: Users' Manual. University of
Michigan: Correlates of War Project.

Steel Statistical Yearbook. 2008. http://www.worldsteel.org/. World Steel Association,
Brussels, Belgium.

Strumlin, S.G. 1955. Istoria Chernoi Metalurgii v SSSR (The History of Ferrous
Metallurgy in the USSR). Moscow.

Svennilson, Ingvar. 1954. Growth and Stagnation in the European Economy.
Geneva: United Nations Economic Commission for Europe.

Temin, Peter. 1964. Iron and Steel in Nineteenth-Century America: An Economic
Inquiry. Cambridge, MA: The M.I.T. Press.

The Middle East and North Africa Yearbook. Various years. London: Europa
Publications Limited.

United Nations. Various years. Statistical Yearbook. New York: United Nations.

Woytinsky, Wladimir S. 1925-28. Die Welt in Zahlen. 7 Bd. Berlin: Rudolf Mosse.

Renmei, Nihon Tekko. 1970. "Tokei kara mita Nihon tekkogyo hyakunenkan no
ayumi."

U.S. Bureau of Mines. 1954.

Collection of Modern China's Economic Statistics (translated), 1955.

Statistics of China's Steel and Iron Industry (translated), 1985.


Primary Energy Consumption
This section deals with the similarities and differences between this new Primary Energy
Consumption (abbreviated PEC) data set and previous versions of this data set. Of the six
indicators of national capabilities, this data series underwent the most extensive reconstruction
and re-evaluation of previous coding rules. Therefore, this documentation supersedes all
previous versions.
What's New in Version 3.0
The energy values contained in the Version 3.0 data set have been recomputed from raw
figures. There are seven areas where changes or additions have been made to the basic coding
rules as compared with previous versions: 1) assumptions of zero values; 2) conversion of
energy commodities into one-thousand metric coal-ton equivalents; 3) interpolation; 4) bringing
technology to bear; 5) data quality codes; 6) data merging methods; and 7) identifying
anomalies. Quality and anomaly codes are discussed in a separate section.
Assumption of Zero Values. One major difference between previous data sets and the
version presented here is a change in the coding of developing states. Previous versions of
this data set contained almost no values of zero. If a state had no PEC, it was always assumed
to be missing; for instance, Colombia (COL, 100) has missing data values from its founding in
1831 until 1925. While the data may be missing, it is very possible that there was no industry
(and therefore no commercial energy consumption) in this state at that time. Most Central and
South American states were almost exclusively agrarian societies well into the twentieth century.
It is quite possible that they did not experience industrialization until very late in the data
presented here. Looking at version 2.1 of this data set as a whole, the extent of this assumption
becomes readily apparent: there are only eight data points out of a possible 11,323 that have a
value equal to zero. On the other hand, there are 2,815 missing data points.
Assuming that these data points are all missing does not account for the pre-industrial
periods that most states would seem to possess. It is possible that many states without available
data simply did not have industrial energy consumption of any kind. Therefore, it was deemed
necessary to change the coding rule and code a "0" in order to reflect pre-industrial societies. A
list of states where this applies appears in Table ENER 1.
The coding rule used to determine whether a state was pre-industrial is as follows: if the
first data entry for a given state is 10 or less, then it is assumed that all values before this point
are zero. This threshold was chosen because of the data values contained in the Mitchell (1998)
volumes; for many states, this is the lowest possible value that a state could have and still be
provided data. The states that fell into this category are listed in Table ENER 1, and make up half
of the states in the international system (twenty-six out of fifty-three) that went through a
pre-industrial period in this data set.
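
A minimal sketch of this zero-coding rule, under the assumption that a state's series is stored
as a year-to-value mapping with None for missing years (the helper name and layout are our
own, not the COW tooling):

```python
def backfill_preindustrial_zeros(series, threshold=10):
    # If the first observed PEC value is at or below the threshold,
    # code every earlier (missing) year as 0, per the rule above.
    observed = sorted(year for year, value in series.items() if value is not None)
    if observed and series[observed[0]] <= threshold:
        for year in series:
            if year < observed[0]:
                series[year] = 0
    return series

print(backfill_preindustrial_zeros({1920: None, 1921: None, 1922: 8}))
# -> {1920: 0, 1921: 0, 1922: 8}
```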

Table ENER 1: States with Pre-Industrial Periods

State | Years With Zero Values | State | Years With Zero Values
Afghanistan | 1920-1949 | Laos | 1954-1959
Albania | 1914-1925 | Liberia | 1920-1942
Bolivia | 1848-1928 | Mauritania | 1960-1964
Burundi | 1962-1965 | Nepal | 1920-1953
Dominican Republic | 1894-1945 | Nicaragua | 1900-1948
El Salvador | 1875-1950 | Panama | 1920-1945
Ethiopia | 1898-1929 | Paraguay | 1846-1945
Guatemala | 1868-1945 | Peru | 1839-1898
Haiti | 1859-1950 | Spain | 1816-1830
Honduras | 1899-1950 | Sri Lanka | 1948-1950
Japan | 1860-1868 | Thailand | 1887-1934
Jordan | 1946-1956 | Venezuela | 1841-1884
Korea | 1887-1905 | Yemen Arab Republic | 1926-1948

If a given state's first data value was more than 10, however, this assumption does not
hold, and it was necessary to apply another coding rule: an industrializing period was computed
for that state. We assumed that the state without data developed at a similar rate (in terms of
energy consumption per capita) to another state with full data. Using the PEC data for the similar
state, in conjunction with the population data for the two states and the first measured data point
of the state in question, it was possible to compute a reasonable approximation of the PEC for
the state with missing data. A list of all the states where this technique was utilized, as well as
the similar states and the extrapolated periods, appears in Table ENER 2.

Table ENER 2: States with Computed Pre-Industrial Periods

State | Similar State | Extrapolated Years | First Year With Mitchell Data
Argentina | Spain | 1841-1886 | 1887
Brazil | Spain | 1836-1900 | 1901
Chile | Spain | 1839-1894 | 1895
Colombia | Mexico | 1891-1921 | 1922
Costa Rica | Mexico | 1924-1949 | 1950
Cuba | Mexico | 1902-1927 | 1928
Denmark | Germany | 1816-1842 | 1843
Ecuador | Mexico | 1900-1924 | 1925
Greece | Austria-Hungary | 1828-1866 | 1867
Iran | Turkey | 1898-1910 | 1911
Italy | Spain | 1833-1860 | 1861
Mexico | Spain | 1838-1890 | 1891
Mongolia | China | 1921-1956 | 1957
Portugal | Spain | 1836-1871 | 1872
Romania | Austria-Hungary | 1878-1881 | 1882
Russia | Austria-Hungary | 1816-1859 | 1860
Saudi Arabia | Iraq | 1933-1936 | 1937
Sweden | Germany | 1816-1839 | 1840
Switzerland | Germany | 1816-1857 | 1858
Turkey | Austria-Hungary | 1816-1897 | 1898
Uruguay | Mexico | 1910-1945 | 1946
Yugoslavia | Austria-Hungary | 1878-1909 | 1910
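
As a rough illustration of this similar-state procedure (which reappears later under the name
"population-based energy consumption estimation"), the sketch below scales the similar state's
energy consumption per capita by the target state's population. The function and its dict-based
inputs are our own illustrative choices, not the project's code.

```python
def similar_state_estimate(similar_pec, similar_pop, target_pop):
    # Estimate PEC for a state with missing data by assuming it consumed
    # energy at the same per-capita rate as a similar, documented state.
    # All arguments map year -> value (PEC in coal-ton equivalents,
    # population in the same units for both states).
    estimates = {}
    for year, population in target_pop.items():
        if year in similar_pec and similar_pop.get(year):
            per_capita = similar_pec[year] / similar_pop[year]
            estimates[year] = per_capita * population
    return estimates
```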

Four exceptions exist to the rules listed above. Morocco (MOR, 600), Tunisia
(TUN, 616), and Egypt (EGY, 651) were states early in the time span covered by this data set
(1847-1911, 1825-1881, and 1855-1882, respectively) with no available data during their initial
existence in the international system. These states were eventually all subsumed by other states
for extended periods of time. When the colonial system in Africa broke down, however, these
states re-entered the international system with PEC values that were greater than zero. Because
these were occupied states, it appears safe to assume that their industrialization occurred during
the occupations, and not during these early independent periods. Therefore, we assume that the
PEC for these three states is zero during their independence in the 1800s.
The final exception, the Netherlands (NTH, 210), was somewhat more complicated.
From 1830 (the first available data point) to 1846 (the secession and independence of Belgium),
the Netherlands was assumed to have the same yearly change in PEC as Belgium. Using this
yearly change, the data series for the Netherlands was extrapolated backwards. For 1829, the
1830 values for Belgium and the Netherlands were added together, and an annual growth rate of
five percent was then assumed. From 1816 to 1828, the same five percent annual growth rate
was assumed, and the PEC values were extrapolated over this span. This method produced
logically consistent data values for the series.
Conversion into One-Thousand Metric Coal-Ton Equivalents. One element that
particularly complicated this research was the validation of the conversion formulas used to turn
quantities of energy-producing substances into the coin of the realm. There appear to be two
major methods for converting various energy commodities into thousands of metric coal-ton
equivalents: Darmstadter's and the UN's. In previous versions of this data set, this project relied
primarily on Darmstadter for the conversion formulas, the reasoning being that Darmstadter was
the primary source for a majority of data points. As this project continues to grow and evolve by
adding more data points computed with UN techniques, however, this reasoning becomes less
valid.
In order to correct for this, version three of the data set adheres to UN standards. For this
reason, there have been some small changes to the conversion factors (which will be discussed
in greater detail in the respective commodity sections) that may alter the final computed energy
consumption from version 2.1 to the version presented here.
Interpolation. In the original version of the data set, most interpolation was done using
the total energy consumption of a given state. This stemmed from the notion that Darmstadter
(the original raw data source for earlier versions of this data set) would often report total energy
consumption for a state, already converted into one thousand metric coal-ton equivalents.
However, this source would only list data points intermittently, leaving out certain spans of data
values. For instance, data for the United States (USA, 002) was available for 1925, 1929, 1933,
1937, 1938, 1950, 1953, 1955, 1957, 1960, 1961, 1962, 1963, 1964, and 1965 (Darmstadter et
al, p. 225). All other data points (particularly the war years between 1941 and 1945) were not
available from this source. In order to obtain data points for the missing years, previous
researchers would have to interpolate the total energy consumption.
The Mitchell data, however, made it possible to avoid doing dramatic interpolations. In
data points assembled using Mitchell data, any necessary interpolations were calculated using
individual commodity data (i.e., coal, petroleum, etc.).
Whenever an interpolation was performed in the Mitchell data period, it was computed
using Log-Linear Interpolation (abbreviated LLI). These interpolations assume a logarithmic
growth rate, and are computed using the following formula:

Equation ENER 2: Logarithmic Growth Rate Computation Formula

Rate = (X(n+t) / X(n)) ^ (1/t),

where X(n) and X(n+t) are the known starting and finishing points of the range of values
to be interpolated, and t is the number of data points to be interpolated.
This rate is then multiplied through for all points, as shown in Equation ENER 3 below:

Equation ENER 3: Interpolation of Data Points Using Logarithmic Growth Rate

X(n+1) = Rate * X(n); X(n+2) = Rate * X(n+1); ...; X(n+t) = Rate * X(n+t-1)
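
A minimal sketch of Equations ENER 2 and ENER 3 together (the function and its signature
are illustrative, not the project's code):

```python
def log_linear_interpolate(x_start, x_end, t):
    # Equation ENER 2: constant logarithmic growth rate over t steps.
    rate = (x_end / x_start) ** (1.0 / t)
    # Equation ENER 3: multiply the rate through for all points.
    points = [x_start]
    for _ in range(t):
        points.append(points[-1] * rate)
    return points  # points[0] == x_start and points[-1] ~= x_end

print(log_linear_interpolate(100.0, 200.0, 4))
# -> [100.0, 118.92..., 141.42..., 168.17..., 200.0...]
```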


Bringing Technology to Bear. The original Correlates of War energy consumption data
set relied on individual paper computation sheets for assimilating much of the data. Each data
point originally consisted of a computation page that listed raw data values, sources, conversion
calculations, and total computed energy consumption. Overall, there should have been almost
12,000 of these computation sheets. Unfortunately, over time these sheets were lost and only a
few dozen remain. Therefore, it was necessary to re-compute every data point from scratch,
including documentation and computation.

Re-creating this data set using these previous technologies would have been impossible.
However, due to advances in computer technology and computing power (particularly in
spreadsheet and scanning technology), completely re-creating this data set was possible. These
new technologies were fully utilized for this project. Raw data sources were scanned from their
source books into computer-readable tables. From these raw data sources, a Microsoft Excel
workbook was constructed for each state from 1816 or its inception until 1970. These workbooks
contained five pages: one for each of the four energy commodities, and one for the total energy
consumption of the state in question. The data cells in each of these workbooks are fully linked
together, in order to make updating the data simpler. Each workbook page contains data points,
source listings, conversion factors, any necessary interpolations, and documentation and
discussion of individual problems. After 1970, the data came from the UN and were already
converted into one-thousand metric coal-ton equivalents.
Data Merging Methods. One potential problem area in version 2.1 was where previous
researchers merged the UN data together with data from various other sources. In version 2.1,
UN data were used for every state only after 1970; literally every state in the international
system changed conversion formulas and data sources at exactly the same point. The authors of
the Users' Manual wrote: "The slight difference in conversion methods introduced discontinuities
from one year to the next in coal-ton energy values" (Singer et al, p. 30). Having every state in
the international system change data source and conversion methods at once created a
potentially large discontinuity in the data, making examinations over time much more difficult.
This version of the energy consumption data attempts to correct for this potential bias.
The most recent UN data covered every state in the international system with very little missing
data starting in either 1968 or 1970. Some states (particularly major ones such as the United
States, the Soviet Union, Western Europe, and Japan) had UN energy consumption data starting
in 1950. With this in mind, energy consumption data were computed from the Mitchell volumes
for all states from either 1816 or their inception until 1970. These two data sources were then
merged; if there were both UN and Mitchell data values, the UN values were utilized for the data
point. This merging method should smooth out some of the discontinuity contained in
previous versions.
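
The merging rule itself is simple enough to state in a few lines; the sketch below (our own
illustration, with dict-based series keyed by year) prefers the UN figure whenever both sources
report a value:

```python
def merge_pec_series(mitchell, un):
    # Start from the Mitchell-based values, then let UN values
    # overwrite them for any year both sources cover.
    merged = dict(mitchell)
    merged.update(un)
    return merged

print(merge_pec_series({1968: 410.0, 1969: 430.0}, {1969: 442.0, 1970: 455.0}))
# -> {1968: 410.0, 1969: 442.0, 1970: 455.0}
```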
Data Acquisition and Generation
Primary Energy Consumption measures one element of the industrial capacity of states
in the international system. Simply put, the greater the energy consumption, the larger the
potential manufacturing base of an economy, the larger the potential economy of the state in
question, and the more wealth and potential influence that state could or should have. PEC is a
derived indicator, computed using Equation ENER 1 below:


Equation ENER 1: Primary Energy Consumption Formula

Consumption = Production + Imports - Exports - Change in Domestic Stocks

This formula is quite similar to the one utilized in the original coding manual, except for
one change: the inclusion of domestic stocks in the equation (Singer et al, p. 21). This reflects
the fact that states maintain supplies of energy-producing commodities in the event that there
are disruptions of import or export flows.
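
A minimal sketch of Equation ENER 1 (the function name is illustrative; treating a missing
component as zero anticipates the UN-data assumption documented in Table ENER 3 below):

```python
def primary_energy_consumption(production, imports, exports, stock_change):
    # Equation ENER 1: PEC = production + imports - exports
    # - change in domestic stocks; None marks a missing component.
    def as_zero(x):
        return 0.0 if x is None else x
    return (as_zero(production) + as_zero(imports)
            - as_zero(exports) - as_zero(stock_change))

# e.g., a state reporting production only (UN code 9 in Table ENER 3):
print(primary_energy_consumption(1200.0, None, None, None))  # -> 1200.0
```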
Primary Energy Consumption comes from (and is computed using data about) four broad
categories of sources: coal, petroleum, electricity, and natural gas. Each of these categories is
broken into a variety of different elements. It is important to note that these forms of energy are
all types of commercial energy. Many other forms (such as animal waste, peat, and wood-
burning) exist; however, these other sources supply such small amounts of energy that they do
not qualify as industrial energy sources. The raw data for each commodity are converted into a
common unit (in this case, one-thousand metric coal-ton equivalents) and then summed to
produce the energy consumption for a given state in a particular year.
The data series runs from 1816 (when the Correlates of War project begins to track the
international system) until 1998 (the last year for which the United Nations published
comparable, cross-national data on energy consumption). Data on these commodities come
primarily from two sources. For the pre-1970 portion of the data, much of the data necessary to
compute PEC comes from the Mitchell International Historical Statistics series. After 1970, the
data come from the Energy Statistics Yearbook published by the United Nations. This is a
change from previous data sets. Older versions of the data set obtained much of the pre-1970
PEC data from state-specific sources rather than a single, common source, which made tracing
the source of many of the original data points impossible. In this version, far more points come
from only a few sources instead of an amalgamation.
United Nations Data. This data source was utilized for all states whenever possible.
Overall, the UN began collecting PEC data for some states (particularly the United States,
Western Europe, the Soviet Union, China, Japan, and Australia) starting in 1950. Comprehensive
data on all the states in the international system only begin between 1968 and 1970.
The United Nations data arrive already converted into one-thousand metric coal-ton
equivalents. However, whereas Mitchell data were disaggregated into four major commodities
(coal, petroleum, electricity, and natural gas), UN data are aggregated into four major categories:
production, imports, exports, and changes in domestic stocks (in accordance with Equation
ENER 1 above). This required a different combination scheme. Simply put, Equation ENER 1
was applied to the UN data to calculate PEC. However, there were a number of blank cells
contained within the data that had to be addressed in order to make the calculations. The
assumption used, for the UN data only, was that if the data were missing, the value was zero.
These entries are tabulated in Table ENER 3 below.

Table ENER 3: UN Data Codes

UN Code | Present Data | Missing Data (Assumed to be 0) | N
1 | None | All Data | 0
2 | Stock Change | Production, Imports, Exports | 0
3 | Exports | Production, Imports, Stock Change | 0
4 | Exports, Stock Change | Production, Imports | 0
5 | Imports | Production, Exports, Stock Change | 0
6 | Imports, Stock Change | Production, Exports | 0
7 | Imports, Exports | Production, Stock Change | 0
8 | Imports, Exports, Stock Change | Production | 0
9 | Production | Imports, Exports, Stock Change | 543
10 | Production, Stock Change | Imports, Exports | 58
11 | Production, Exports | Imports, Stock Change | 141
12 | Production, Exports, Stock Change | Imports | 191
13 | Production, Imports | Exports, Stock Change | 587
14 | Production, Imports, Stock Change | Exports | 506
15 | Production, Imports, Exports | Stock Change | 1030
16 | All Data | None | 2771

Note: UN data codes 1-8 are included here to account for the future possibility that certain
states' data values may be missing.

Using this technique, there were no negative data values produced. As is also apparent,
there are no data cells where all the information is missing; there are some values that can be
calculated for each state in the international system.
Mitchell Data. Primary Energy Consumption computed using Mitchell data is composed
of four energy-producing commodities: 1) coal; 2) petroleum; 3) electricity; and 4) natural gas.
This section discusses each of these commodities, looking at a brief history, conversion
formulas, and potential problems found within each commodity.
Coal. Of all the industrial indicators, coal is the only one that covers the entire time
span from 1816 to the present. Coal was the primary energy consumption element for all states
prior to World War One. It is also the metric standard by which all of this energy consumption
data is measured.
In this data collection effort, three types of coal were identified: anthracite, bituminous,
and brown. Anthracite and bituminous are very similar; they are the hard, black coal found in
most mines throughout the world. These two types of coal are the standard by which all other
energy consumption elements are measured.
Brown coal, on the other hand, is softer, quicker burning, and less efficient as an
industrial fuel. There are a variety of different types of brown coal (a type called lignite is
mentioned most often), and their quality often depends on where a state is located in the world.
In order to account for these differences, this data set utilized a state-by-state brown coal
conversion table. These conversion values appear as Table ENER 4.
Some similar conversion values appeared in previous versions of the coders' manual
(Singer et al, Table Three, p. 28). There are some differences between that table and the one
presented by Darmstadter; we chose to utilize the table as presented by Darmstadter. One
potential problem arose in these brown coal conversions: there were three cases where no
brown coal conversion was presented for a given state, even though the Mitchell data
documented that the state in question produced brown coal. These states are Hungary (HUN,
310), Iran (IRN, 630), and Mongolia (MON, 712). For these three states, this computation utilized
the conversion factor of a geographically proximate state on the list. These proximate states
were Austria (AUS, 305), Turkey (TUR, 640), and North Korea (PRK, 731), respectively.

Table ENER 4: Brown Coal Conversion Values for Given States

State | Conversion | State | Conversion
Thailand | 0.7 | Netherlands | 0.33
Canada | 0.65 | Tunisia | 0.33
Czechoslovakia | 0.6 | Turkey | 0.33
France | 0.6 | United States | 0.33
Hungary | 0.6 | Germany, West | 0.31
Romania | 0.6 | Bulgaria | 0.3
Albania | 0.5 | Germany, East | 0.3
Austria | 0.5 | India | 0.3
Greece | 0.5 | Indo-China | 0.3
Japan | 0.5 | Italy | 0.3
New Zealand | 0.5 | Korea, North | 0.3
Portugal | 0.5 | Korea, South | 0.3
Spain | 0.5 | Poland | 0.3
Yugoslavia | 0.5 | Denmark | 0.29
Chile | 0.33 | Australia | 0.25
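
Applying these factors is a straightforward lookup; the sketch below illustrates it with a small
subset of Table ENER 4 (the helper is our own illustration, not the project's code):

```python
# Illustrative subset of the state-specific factors in Table ENER 4
# (thousand coal-ton equivalents per thousand tons of brown coal).
BROWN_COAL_FACTORS = {
    "Thailand": 0.7, "Canada": 0.65, "France": 0.6,
    "Austria": 0.5, "United States": 0.33, "Australia": 0.25,
}

def brown_coal_to_cte(state, k_tons_brown):
    # Convert thousands of tons of brown coal into thousand
    # coal-ton equivalents using the state-specific factor.
    return k_tons_brown * BROWN_COAL_FACTORS[state]

print(brown_coal_to_cte("Austria", 200.0))  # -> 100.0
```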


Petroleum. Petroleum is the second most prevalent source of industrial energy
consumption. Relatively speaking, petroleum products were a minor source of commercial
energy until the advent of the automobile after the turn of the century. Since then, however,
petroleum has become a highly important industrial energy source.
In generating usable data from the raw petroleum figures, it is often necessary to perform
two types of conversions. First, it is necessary to convert the raw data into metric tons. Second, it
is then necessary to convert the metric tons of petroleum into metric tons of coal equivalency.
We will look at each of these in turn.

There were two alternate measures used throughout the Mitchell data: "One Million US
Gallons" and "One Thousand Barrels." The conversions from these respective units into
metric-ton equivalencies are listed in Equations ENER 4 and ENER 5 below:

Equation ENER 4: Millions of US Gallons to Thousands of Metric Tons of Oil

____M US Gallons * 3.2468 = ____K Metric Tons of Oil

Equation ENER 5: Thousands of Barrels to Thousands of Metric Tons of Oil

____K Barrels * 0.1366 = ____K Metric Tons of Oil

In converting petroleum into coal-ton equivalents, Mitchell distinguished between two major
forms of petroleum: 1) crude, and 2) refined. The conversions for each of these two types of oil
are listed below:

Equation ENER 6: Crude Petroleum to Coal-Ton Equivalents

____K Metric Tons Crude Petroleum * 1.429 = ____K Coal-Ton Equivalents

Equation ENER 7: Refined Petroleum to Coal-Ton Equivalents

____K Metric Tons Refined Petroleum * 1.474 = ____K Coal-Ton Equivalents

One of the difficulties of converting petroleum products into coal equivalents is that petroleum
products come in a variety of different weights and types, each of which has its own conversion
value. Unfortunately, Mitchell does not distinguish between the many different weights and types
of petroleum products; this source only utilizes the crude and refined categories listed above. In
order to overcome this problem, it was necessary to make some assumptions about the
conversion formulas utilized here. For crude oil, the conversion factor utilized here is the
conversion for crude oil of average viscosity. For refined products, the conversion stems from
two considerations. First, this value is the conversion for kerosene, the major refined petroleum
product prior to World War One. Second, it is also the approximate mean value for all refined
petroleum products produced since World War One. For instance, gasoline and liquefied
petroleum gases both have higher conversion factors to coal-ton equivalents than kerosene
(1.500 and 1.554, respectively, as compared to 1.474 for kerosene; taken from the Energy
Statistics Yearbook, p. xlv), while gas-diesel oils and residual fuel oil have lower conversion
factors (1.450 and 1.416, respectively; same source). Because some types of refined petroleum
products have greater conversion factors and others have smaller ones, this conversion value
appears to be a good approximation for all types of refined petroleum products.
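
The pre-1970 petroleum conversions can be chained mechanically. The sketch below does so
with the constants copied from Equations ENER 4-7; the function, its refined/crude flag, and the
example call are our own illustration, not the project's code:

```python
M_GALLONS_TO_KT_OIL = 3.2468   # Equation ENER 4
K_BARRELS_TO_KT_OIL = 0.1366   # Equation ENER 5
CRUDE_TO_CTE = 1.429           # Equation ENER 6
REFINED_TO_CTE = 1.474         # Equation ENER 7

def barrels_to_cte(k_barrels, refined=False):
    # Thousands of barrels -> thousands of metric tons of oil
    # -> thousand coal-ton equivalents, under the homogeneity
    # assumption described above for the Mitchell data.
    k_tons = k_barrels * K_BARRELS_TO_KT_OIL
    return k_tons * (REFINED_TO_CTE if refined else CRUDE_TO_CTE)

print(barrels_to_cte(1000.0))  # ~195.2 thousand coal-ton equivalents
```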
It is vital to note, however, that in the UN data this assumption is unnecessary. The United
Nations collects data on each individual commodity, in its original form, and makes its
calculations based on the values of these individual characteristics instead of on an assumption
of homogeneity. Therefore, the assumed values above apply only to the Mitchell data calculated
before 1970.
Electricity. After 1900, electrical production enters the world energy picture. It is
important to note, however, that the electrical production values listed here DO NOT include
electricity produced from fossil fuels; those types of energy production are included in the coal,
oil, and natural gas components of the data. Electrical production here includes three types:
1) hydroelectric, 2) nuclear, and 3) geothermal.
Conversion from electrical energy production into coal-ton equivalents utilizes the
following formula:

Equation ENER 8: Electrical Energy Conversion

____ Gigawatt-Hours * 0.123 = ____K Coal-Ton Equivalents

Mitchell's aggregation of the raw data again makes an assumption necessary, and that
assumption is central to this conversion factor. Mitchell does not distinguish between
hydroelectric, nuclear, and geothermal energy; he aggregates these three vastly different
categories into one. Therefore, it was again necessary to make an assumption about the type of
electricity produced prior to 1970.
The assumption utilized here (and in the conversion above) is that all electricity before
1970 is hydroelectric power. Prior to 1970, this assumption is quite tenable: before World War
Two, nuclear and geothermal electricity did not exist. After World War Two, the nuclear reactor
only became commercially viable in the early 1960s, and only by 1970 were there enough
nuclear plants to make any measurable contribution to energy production. Only after the oil
shocks of the 1970s (when this data set utilizes UN data that applies separate conversion rates
for each type of electricity) did research on and utilization of these alternate forms of electrical
generation step into high gear.
The potential biasing impact of this assumption is further diminished by the prevalence
of UN data. The UN distinguishes between all three of the aforementioned electricity types,
converting each according to its own conversion factor. Therefore, for states where nuclear
power is prevalent, such as the United States, Western Europe, the Soviet Union, China, and
Japan, the data come from the UN beginning in 1950, so this assumption does not apply to
them.

One difference between the version presented here and version 2.1 is a change in the
conversion rates utilized. Version 2.1 utilized a conversion rate that evolved from 1.0 in 1919 to
0.3 in 1971 (Singer et al, 1990, p. 28; originally published in Darmstadter 1971, p. 830). The
original researchers believed that electric-generating technology evolved, making more efficient
electrical production possible over time and necessitating a moving scale. The UN, however,
rejects this approach and utilizes fixed conversion factors. Because the UN data are utilized for
computing far more data points than anything from the Darmstadter era, version three utilizes
the UN conversion techniques, which for electricity means a constant conversion rate.
Natural Gas. Natural gas production was the last of the four energy commodities to
appear on the industrial scene. Present for as long as there were petroleum production facilities,
natural gas was often burned off at the site, instead of being used for more commercial purposes.
Only in the last fifty years has the condensation, refrigeration, and storage technology been
available to harness this source of energy for commercial purposes.
The conversion formulas for computing natural gas production into coal ton equivalents
appear as Formulas ENER 9 and ENER 10 below.

Formula ENER 9: Cubic Meters of Natural Gas to Coal Ton Equivalents
____M Cubic Meters Natural Gas * 1.33 = ____K Coal-Ton Equivalents

Formula ENER 10: Petajoules of Natural Gas to Coal Ton Equivalents
____Petajoules * 34.121 = ____K Coal-Ton Equivalents

For most of the data points, million cubic meters is the standard unit for natural gas
production. The petajoule became the basic unit of natural gas reporting in 1966, necessitating
a different conversion formula.
Problems and Potential Errors
In calculating these revised energy consumption data values, certain problems arose.
This section describes four of the more important ones: 1) negative values, 2) multiple data
values, 3) missing data, and 4) undocumented data points.
Negative Values. One of the problems with generating historical energy consumption
data was that in a small number of cases (twelve, to be precise) the computations produced a
negative value for energy consumption. These data points are listed in Table ENER 5.

Two issues surrounding the data available on one particular commodity, petroleum, contribute
to this problem of negative PEC. Looking again at Equation ENER 1, the formula for calculating
PEC, reproduced below, can also help illuminate these potential biases.

PEC = Production + Imports - Exports - Change in Domestic Stocks

The "Change in Domestic Stocks" portion of the equation above is the first avenue through
which apparent negative PEC can arise. Simply put, there are no historical records of domestic
stockpiles of energy commodities such as coal and petroleum. Oil-producing countries often
have domestic stockpiles of petroleum. Much like the United States' Strategic Petroleum
Reserve, most states keep some sort of petroleum stockpile in case of shortages, embargoes,
or other disruptions that can stop or reduce the flow of oil into or out of a state. In petroleum-
producing states, however, these stockpiles can be massive. In lower-production (or higher-
demand) years, these states would often export from these domestic stocks while keeping
production low. The UN was the first source to begin gathering complete domestic stock data, in
the 1970s; without being able to account for these stocks, a state could easily appear to export
more oil than it produced, creating the problem of apparent negative energy consumption.
Without some sort of entry for this variable, an omitted variable bias is created in the above
equation, making it appear that a state had negative PEC.
The production portion of the equation is the second issue driving apparent negative
PEC. As a policy, OPEC monitors and attempts to manage petroleum production by looking at
production values (www.opec.org) and setting quotas based on them, while not appearing to
examine import or export amounts. Therefore, a state that wanted to break its quota could
simply falsify its production figures by reporting less oil production than it truly produced while
maintaining accurate import and export figures. This again results in a negative computed value
for energy consumption.
Negative energy consumption values were corrected by altering the production of
crude oil. These corrections appear in the "Adjustments Made" column of Table ENER 5 below.
For the most part, domestic crude oil production was inflated by between one and ten percent.
The exact amount was determined by looking at the data points surrounding the negative value,
using a common-sense approach. The only exception was Iran, where some data values were
shifted from one year to another to account for what appeared to be oil produced in one year and
exported in another.

Table ENER 5: Negative PEC Data Points and Their Corrections

State | Year | Original PEC | Adjusted PEC | Adjustments Made
Mexico | 1922 | -263 | 1603 | Petrol. production increased by 5%
Venezuela | 1930 | -568 | 869 | Petrol. production increased by 5%
Venezuela | 1931 | -370 | 861 | Petrol. production increased by 5%
Venezuela | 1936 | -180 | 1436 | Petrol. production increased by 5%
Gabon | 1963 | -78 | 49 | Petrol. production increased by 10%
Gabon | 1964 | -10 | 66 | Petrol. production increased by 5%
Gabon | 1965 | -19 | 71 | Petrol. production increased by 5%
Iran | 1911 | -1 | 11 | No petrol. production value; LLI from 1910 (0) to 1912 (80)
Iran | 1919 | -278 | 151 | Moved 300 1K MT oil production from 1915 to 1919 in order to smooth curve
Iran | 1920 | -1011 | 319 | No mathematical correction possible; assumed petrol. production missing; LLI production from 1919 to 1921
Iran | 1933 | -545 | 375 | Moved 644 1K MT petrol. production from 1931 to 1933 in order to smooth curve
Iraq | 1948 | -34 | 210 | Petrol. production increased by 5%



Multiple Data Values. Throughout the Mitchell (1998) data, there are a number of points where the source lists two data points for a state in a given year.[19] Often, these come from some sort of change in reporting, which can take place in a variety of ways. There could have been some change in accounting procedure that generates two data points; for instance, many states changed their accounting from calendar years to fiscal years or vice versa. Two data points could be generated if a new region was incorporated into a state. There could be changes of measurement units, for instance moving from millions of cubic meters to petajoules of natural gas production. Generally, the procedure for handling this potential problem was to average the two values and assign a data quality value of B. This tended to create a smoother time-series picture of the change in a given commodity over time.
Missing Data. There was a problem of missing data upon the completion of the major data recreation utilizing Mitchell (1998) as a source. Nineteen states, often due to brief or early existences in the international system, had data values in version 2.1 of the data set but did not have data values available through Mitchell (1998). Therefore, it was necessary to perform some sort of estimation of these phantom data points.
The technique utilized to estimate fourteen of these problematic data series is called population-based energy consumption estimation. This technique involves three steps. First, a state that is geographically proximate and industrially similar to the state with missing data is identified.[20] Second, energy consumption per capita (that is, energy consumption divided by total population) was computed for the neighboring state with documented data. Third, the yearly energy consumption per capita values from this similar state were multiplied by the population

data for the state with missing energy consumption data. This produces an estimate of what the
energy consumption would be for that state.
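The three steps above can be summarized in a short sketch. This is an illustrative implementation only, assuming annual series aligned by year; the function and variable names are ours, not part of the original data-construction routines.

    import pandas as pd

    def population_based_pec(proxy_pec: pd.Series,
                             proxy_pop: pd.Series,
                             target_pop: pd.Series) -> pd.Series:
        # Step 2: per capita energy consumption of the similar (proxy) state.
        per_capita = proxy_pec / proxy_pop
        # Step 3: scale by the population of the state with missing data.
        return per_capita * target_pop

Step 1, choosing a geographically proximate and industrially similar proxy state, remains a substantive judgment that no code can make.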
Table ENER 6 contains a list of the states whose energy consumption was computed in this manner. It also lists the proxy states whose energy consumption was utilized in the calculations, as well as the span of years over which these calculations were performed and the number of data points generated in this manner.

Table ENER 6: States with Population-Based PEC Estimations

State with Missing Data Points   Similar State Utilized for Estimation   Estimation Span               Number of Data Points
Luxemburg                        Belgium                                 All data points before 1970   48
Estonia                          Poland                                  All data points before 1970   23
Latvia                           Poland                                  All data points before 1970   23
Lithuania                        Poland                                  All data points before 1970   23
Saxony                           Germany                                 1850 to 1867                  18
Hanover                          Germany                                 1838 to 1866                  29
Bavaria                          Germany                                 1816 to 1871                  56
Hesse Electoral                  Germany                                 1816 to 1866                  51
Cyprus                           Greece                                  All data points before 1970   11
Malta                            Italy                                   All data points before 1970   7
Equatorial Guinea                Cameroon                                All data points before 1970   3
Gambia                           Senegal                                 All data points before 1970   6
Zanzibar                         Tanzania                                All data points before 1970   2
Maldive Islands                  Sri Lanka                               All data points before 1970   6


The final three undocumentable states, Hanover (HAN, 240), Bavaria (BAV, 245), and Hesse Electoral (HSE, 273), all possessed a very unusual data pattern. These three states had only one data point each in the original COW PEC data set: 1853. Every other data point for each of these three states, both before and after 1853, was missing. Somehow, some researcher found that one value for these three states; unfortunately, that one data source for that one point cannot now be identified. First attempts to use population-based energy consumption estimates produced figures that were far too high to be realistic, especially once these states amalgamated into Germany proper. Therefore, this technique was dismissed, and it was apparent that some other technique was necessary for estimating these unusual data points.
The following equation was utilized in order to make an educated estimation of the data values for these three states:

Equation ENER 11: PEC Estimates for Hanover, Bavaria, and Hesse Electoral

PEC_x = PEC_(x-1) * (Population_x / Population_(x-1)) * (German PEC_x / German PEC_(x-1)) * (German Population_(x-1) / German Population_x)




This formula rests on three thoughts. First, it follows the industrial growth rate of Germany over the same time span. Second, it anchors these three series to a value that some researcher was able to find and document (1853, even though its source is undocumented as of this writing). Third, it centers on the population growth rate of these three states (which is fully documented in version three of the data set). This technique produced data values for these three states that seem fairly reasonable. Future research should focus on finding more exact measures for these states.
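As a check on the reconstruction of Equation ENER 11 above, the estimation can be written as a one-line function: last year's PEC is scaled by the state's own population growth and by Germany's growth in PEC per capita. This is a sketch with hypothetical argument names, not the original routine.

    def estimate_pec(prev_pec, pop, prev_pop,
                     german_pec, prev_german_pec,
                     german_pop, prev_german_pop):
        # Germany's growth in PEC per capita over the year in question.
        german_per_capita_growth = (german_pec / prev_german_pec) \
            * (prev_german_pop / german_pop)
        # Anchor on last year's value; follow the state's population growth.
        return prev_pec * (pop / prev_pop) * german_per_capita_growth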
Quality Codes
One of the realizations in creating this version of the PEC data set was that numerous methods were used to calculate data points in both previous and current versions of the primary energy consumption data set. Some data points are compiled using very precise component data, gathered for a state in a given year. Sometimes it was necessary to extrapolate or interpolate particular commodities. In other instances, it was necessary to make estimations about the energy consumption of a state with little available data. The quality codes for this data series reflect these situations.

Table ENER 7: Primary Energy Consumption Quality Codes

Quality Code   Substantive Interpretation
A              All components present; or, only electricity interpolated from 1900-1945
B              All components known, but averaged. Often happens when a state changes reporting units (for example, moving from calendar years to fiscal years or vice versa).
C              Some (but not all) component data points interpolated
D              All component data points interpolated (Example: China during the Boxer Rebellion)
E              Log-linear extrapolation based on growth rates (Example: Mexico before 1981)
M              Missing data values (Example: Lesotho, Papal States)

Anomaly Codes
In many data sets, there are often discontinuities in the data: a state's international trade will suddenly increase by 400% in one year.
In version 3.0, this project identified data points that appear to create discontinuities in the data values. These discontinuities can wreak havoc during analysis; if a researcher runs a time-series analysis across these anomalies, they will create estimation problems that can lead to Type I or Type II error.
For energy consumption, an anomaly was defined as an increase or decrease in total primary energy consumption of at least 100% from its previous value.[21]


Component Data Set Layout
The layout of the Access sub-component data set is found in Table ENER 9 below. The data set contains nine columns. The first and second columns correspond to the COW state number and COW state abbreviation, respectively. The third column is the year of observation. The fourth column contains the PEC value for that year (in thousands of coal-ton equivalents), unless the value is missing; missing values are indicated by -9. The fifth column provides the source of the data point or "See note". If the column contains "See note", the note column should be consulted to see how that data point was calculated. The next (sixth) column, Note, explains how that data point was obtained (i.e., linear interpolation or extrapolation); this column is usually empty for data points with a quality code of A. The seventh and eighth columns, respectively, list the data anomaly and quality codes for that value. The ninth column records the version of the data set.

Table ENER 9: Data Set Layout (PE Consumption), with one example record

CCode:         2
State:         USA
Year:          1816
Energy:        254
Source:        B.R. Mitchell, International Historical Statistics: The Americas, 1750-1993
Note:          Derived from production, export, and import values of coal, petroleum, natural gas, and electricity.
Anomaly Code:  A
QCode:         A
Version:       3.01

Table ENER 8: Total Population Anomaly Codes

Code   Substantive Meaning
A      No anomaly (< 2% change)
B      Explained inconsistency (e.g. change in territory, loss in wartime)
C      Change of sources (between 2 non-UN sources or 1 non-UN to UN source)
D      Change of UN sources
E      UN internal inconsistency within same UN source
F      Internal inconsistency within non-UN source
G      Unexplained anomaly

2010 Update
As noted above, for data points post-WWII, the COW PEC operationalization transitioned from reliance on Mitchell's International Historical Statistics (IHS) to the electronic United Nations Energy Statistics Database (UNESD). Although the UNESD data collection commences in 1950, complete data for all countries are generally unavailable until approximately 1970. However, the v3.02 update frequently relied on IHS for PEC values for the post-1969 period. Furthermore, the 2005 update had access to the UNESD only for the 1950-97 period, relying on hard-bound editions of the United Nations Energy Statistics Yearbook and growth-based interpolations to identify PEC values for the 1998-2001 period.
The goal of the 2010 update with regard to coding PEC and integrating it with the existing data was to replace the observations that were coded by the 2005 update with UNESD data, but preserve PEC values computed with IHS. Stated differently, the 2010 update did not implement a wholesale replacement of PEC values for the 1970-2001 period, but only did so in the absence of a value computed from IHS. Our reasoning is that the Bremer-led PSU 2005 update team weighed the decision to use IHS vs. UNESD, and we do not revisit this decision in the 2010 update.
As noted by the 2005 update team, reliance on the electronic form of the UNESD removes the often complex burden of (a) hand coding data points and (b) generating suitable metric-ton coal units, the base COW PEC unit. While the UNESD enables rapid conversion of the international standard energy unit, the terajoule, into metric-ton coal units, several coding decisions and sub-routines are necessary to generate the final PEC scores. These decisions, as well as their respective rationales, remain obscured or unrecorded in the MS-Access routines employed by Stuart Bremer, or are insufficiently specified in the 2005 update codebook material above. While our coding decisions are embedded in the STATA do-files within which we managed the data, it is worthwhile to highlight some of them here.
Conversion to Metric-ton Coal Units

The energy commodities contained in the UNESD are reported in what might be termed native units, such as metric tons, terajoules, and so forth. In order to compute COW PEC, it is necessary to execute two steps with the raw UNESD energy commodity data prior to computing the standard COW PEC formula (referred to above as Equation ENER 1):
1. Conversion to a common unit. The modern energy unit is the terajoule. The UNESD provides a conversion table so that the native units corresponding to each energy commodity can be converted into terajoules (UNESD file Energy DB Codes.xls, worksheet Commodities); and
2. Conversion to metric-ton coal equivalent. Translation of terajoules into metric-ton coal equivalents requires the following transformation, which is an international standard: (Tj * 34.120842) / 1000.
We follow this two-stage conversion procedure in our update.
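In code, the second stage is a single constant; the sketch below assumes the commodity value has already been converted to terajoules with the UNESD conversion table, and the function name is ours.

    def terajoules_to_thousand_mtce(tj: float) -> float:
        # International-standard transformation quoted above:
        # terajoules -> thousands of metric-ton coal equivalents.
        return (tj * 34.120842) / 1000.0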

Primary Commodities

COW PEC is derived from the primary energy commodities reported in the UNESD. The primary commodities identified in Stuart Bremer's MS-Access databases depart slightly from the full list of primary commodities reported in the UNESD. We use Bremer's list of primary commodities. The set of primary energy commodities is as follows:

Table ENER 10: Primary Commodities from Bremer's MS-Access Database
Code Commodity Name
AL Alcohol
AW Other biomass and wastes
BS Bagasse
CL Coal
CR Crude Petroleum
EG Geothermal
EH Hydro
EL Total Electricity
EN Nuclear Electricity
EO Tide, Wave Electricity
ES Solar Electricity
EW Wind Electricity
FW Fuelwood
GL Natural Gas Liquids
LB Lignite/Brown Coal
MP Natural Gas Liquids (NGL) n.e.s.
MW Municipal Wastes
NC Other Non-commercial Energy Sources
NG Natural Gas (including LNG)
OS Oil Shale
PT Peat (for fuel use)
PU Pulp and Paper Waste
ST Steam and Hot Water
TH Thorium
UR Uranium
VW Vegetal Waste
WF Falling Water



Special Country Series Extractions

Four countries (Taiwan, Liechtenstein, Monaco, and San Marino) are either not reported in the UNESD (Taiwan is not a country in the UN system) or are reported as paired with other states (Liechtenstein, Monaco, and San Marino are reported jointly with Switzerland, France, and Italy, respectively). These cases were coded as follows:
1. Taiwan. PEC values were created by relying on the CIA World Factbook (2010, online edition, p. 662) to determine the 2008 energy production value, which was estimated to be 225.3 kWH. Next, we located and employed the transformation used by the 2005 update team to convert the CIA figure for 2008 into metric-ton coal equivalents ((225.3 * 1000) * (0.123)). In turn, we used STATA's ipolate routine to generate a linear interpolation of Taiwan's PEC between the value for the year 2000 coded during the 2005 update and the CIA-identified value for the year 2008. Last, the interpolated values were used as the final PEC values for the period 2001-2007;
2. Liechtenstein. The United Nations reports energy data for Liechtenstein in conjunction with Switzerland. As such, we extracted Liechtenstein's PEC by computing the joint total population of Liechtenstein and Switzerland, identifying the per capita contribution of Liechtenstein to the joint PEC, and then extracting the appropriate PEC for each country;
3. Monaco. The United Nations reports energy data for Monaco in conjunction with France. We extracted Monaco's PEC in the same manner, using the joint total population of Monaco and France; and
4. San Marino. The United Nations reports energy data for San Marino in conjunction with Italy. We extracted San Marino's PEC in the same manner, using the joint total population of San Marino and Italy.
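For cases 2 through 4, the extraction logic is the same and can be sketched as below; the names are illustrative, and the actual computations live in our STATA do-files.

    def split_joint_pec(joint_pec, pop_small, pop_large):
        # Allocate a jointly reported PEC between two states in
        # proportion to their shares of the joint population.
        per_capita = joint_pec / (pop_small + pop_large)
        return per_capita * pop_small, per_capita * pop_large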
Bibliography
One of the improvements of version three of the energy consumption data set over previous research efforts is that far fewer sources were utilized for gathering data. The sources are listed below in annotated bibliographic form.

Australia's HDR Resources. 2000. The University of New South Wales School of Petroleum Engineering Website.
URL: http://www.petrol.unsw.edu.au/research/resource.html

This web site was invaluable in determining what a petajoule is, both theoretically and mathematically. The UN conversions and documentation deal in terajoules, not petajoules. Without this source, the conversion could not have been performed.

CIA World Factbook. 2002. Published by the Central Intelligence Agency.
URL: http://www.cia.gov/cia/publications/factbook/index.html

This web-based source was important in extending some of the data until 1997, because it
provided a measure of total energy consumption when other sources did not.

Darmstadter, Joel, Perry D. Teitelbaum, and Jaroslav G. Polach. 1971. Energy in the World Economy: A Statistical Review of Trends in Output, Trade, and Consumption Since 1925. Washington, DC: Johns Hopkins Press.

This source was utilized primarily for the brown coal conversion values. However, for two states
(Albania and Iceland) this source was also utilized for primary energy consumption data.

Energy Statistics Yearbook (United Nations. Statistical Office). 1997. New York: United
Nations Press.

This volume was the source for all the conversion formulas utilized throughout this research. It was also the data source for some states from 1950 until 1970 and for all states starting around 1970. The UN only publishes data on energy consumption with a four-year lag; therefore, their data collection (and the scope of this project) ends in 1997.

Mitchell, B.R. 1998. International Historical Statistics: The Americas 1750-1993. Fourth
Edition. New York, New York: Stockton Press.

Mitchell, B.R. 1998. International Historical Statistics: Africa, Asia, & Oceania 1750-1993.
Third Edition. New York, New York: Stockton Press.

Mitchell, B.R. 1998. International Historical Statistics: Europe 1750-1993. Fourth Edition.
New York, New York: Stockton Press.

These three volumes contain international historical statistics on most states in the international
system from 1816 until approximately 1993. They were the major source of raw energy
commodity data for all states in the international system.

Singer, J. David, with contributions from P. Williamson, C. Bradley, D. Jones, and M. Coyne. May 1, 1990. National Material Capabilities Dataset: User's Manual. Correlates of War Project: The University of Michigan.


1. Some data sets, such as alliances or contiguity, do not have this sort of consideration. Others, such as interstate trade or foreign direct investment, are the type being addressed in this discussion.

2. It is important to note that 1993 was part of version 2.1 of the data set. It was necessary to update this data value as well. Because the last revision was in 1992-1993, many of these data points were either estimates or missing. Therefore, we were able to go back and enter non-estimated data values for many of these previously troublesome data points.

3. De facto information or data is information that is taken with surveys, censuses, and other forms of direct counting. De jure data is best described as data that comes from historians' impressions or estimates of the population of an urban center at the time of their writing.

4. The coding rules and procedures are largely taken from the 1990 coding manual. This manual also includes a discussion of the theoretical relationship between iron and steel production and national power.

5. In order to determine whether 100% was the appropriate threshold, a stratified random sample was conducted with ten states using thresholds of 25 and 50%. There was not a large difference in the number of anomalies identified at these thresholds.

6. Unless otherwise stated, anything not discussed in this coding manual should be assumed to remain the same as in the original coding manual (Singer et al., 1990).

7. The letters and numbers appearing in parentheses are the COW abbreviation and country numbers for the states in question throughout the rest of this article.

8. There appears to be no reasoning or pattern as to when the UN began collecting this data.

9. The only state where a distinction was made between anthracite and bituminous coal was the United States (USA, 002). Assuming that these two types were the same yielded an interesting result: 43 out of 44 data points for this state from 1816 to 1859 were identical to those contained in version 2.1 of the PEC data set.

10. Reproduced from Darmstadter, p. 828.

11. Notation used from here on is K = 1,000 and M = 1,000,000. Cross-cancellations supporting these conversion factors appear in Appendix Two at the end of this document.

12. Energy Statistics Yearbook, p. xlix.

13. Ibid.

14. Energy Statistics Yearbook, p. xlv.

15. Ibid.

16. Much like brown coal, crude petroleum is not the same everywhere on the planet. However, these differences have not been documented or mathematically differentiated as well as in the brown coal case.

17. http://geocities.com/RainForest/Andes/6180/history.html#top

18. It would appear that the original project researchers interpolated conversion values from 1965 to 1971.

19. Specific numbers are not available; however, as a best guess, probably every state in the international system between 1816 and 1970 has at least one individual commodity data point with two values for the same year.

20. Needless to say, the state selected must have energy consumption data available.

21. We also performed these tests at thresholds of 50% and 20% for a sample of 10 states spread through all the regions of the world, and found that there was very little change.

Observational Studies

Chapter Outline
1. The Logic of Observation
2. Cross-sectional vs. Time-series Studies
If not an experiment, then what?
If we cannot evaluate causal theories in a controlled setting like an experiment, we have to take the world as it already is, and use what are called observational studies.
Definition: An observational study is a research design in which the researcher does not have control over values of the independent variable, which occur naturally. However, it is necessary that there be some degree of variability on the independent variable between cases, as well as variation in the dependent variable.
Observational studies and the four causal hurdles
How do the four causal hurdles change? Not at all.
Hurdles 1 and 3 are rather similar between experiments and observational studies. But...
It can be argued that the causal mechanism connecting X and Y needs to be better developed for observational studies; theory plays a larger role in how we evaluate observational studies.
Hurdles 2 and 4 are a bit different, though. How so?
Hurdle 2 in Observational Studies
Hurdle 2: Reverse Causation
Remember, we might argue that economic development (X) causes democracy (Y), but there is also a case to be made for democracy (X) causing economic development (Y).
Difficult to disentangle.
Hurdle 4 in Observational Studies
Hurdle 4: Omitted Variables (Zs)
Ice cream consumption (X) doesn't cause increased crime (Y); high temperature (Z) causes both.
Unlike experimental research, observational research forces us to consider (and account for) alternative explanations.
In the language of experimentation: We don't control the assignment mechanism of X (and usually don't know the mechanism).
Two types of data, two types of observational studies
What is the unit of observation? (individuals vs. aggregates)
Two types of data sets.
Two types of observational studies, focusing on two different types of variation.
Variation through space at one point in time

Nation        Government debt as a percentage of GNP   Unemployment rate
Finland       6.6                                      2.6
Denmark       5.7                                      1.6
USA           27.5                                     5.6
Spain         13.9                                     3.2
Sweden        15.9                                     2.7
Belgium       45.0                                     2.4
Japan         11.2                                     1.4
New Zealand   44.6                                     0.5
Ireland       63.8                                     5.9
Italy         42.5                                     4.7
Portugal      6.6                                      2.1
Norway        28.1                                     1.7
Variation through time in one spatial unit

Month     Presidential Approval   Inflation
2002.01   83.7                    1.14
2002.02   82.0                    1.14
2002.03   79.8                    1.48
2002.04   76.2                    1.64
2002.05   76.3                    1.18
2002.06   73.4                    1.07
2002.07   71.6                    1.46
2002.08   66.5                    1.80
2002.09   67.2                    1.51
2002.10   65.3                    2.03
2002.11   65.5                    2.20
2002.12   62.8                    2.38
Cross-sectional observational studies
A cross-sectional observational study examines a cross-section of social reality, focusing on variation between individual spatial units (like citizens, elected officials, voting districts, or countries) and explaining the variation in the dependent variable across them.
Example: What, if anything, is the connection between the preferences of the voters from a district (X) and a representative's voting behavior (Y)?
We could compare the aggregated preferences of voters from a variety of districts (X) to the voting records of the representatives (Y).
This particular X is not at all subject to experimental manipulation.
Time-series observational studies
In the time-series observational study, political scientists typically examine the variation within one spatial unit over time.
For example, how, if at all, do changes in media coverage about the economy (X) affect public concern about the economy (Y)? That is, when the media spend more time talking about the potential problem of inflation, does the public show more concern about inflation; and when the media spend less time on the subject of inflation, does public concern about inflation wane?
We need to focus hard on that fourth causal hurdle. Are there any other variables (Z) that are related to both the varying volume of news coverage about inflation (X) and public concern about inflation (Y)?
Cross-Sectional vs. Time-series
Debate in comparative politics about the effects of oil resources (X) on democracy (Y).
Short version of the theory: Oil wealth hinders the development of democracy, i.e. natural resources can be a curse of sorts.
One mechanism: Rulers have access to oil revenue, so they don't need to tax citizens, and citizens don't demand representation (no taxation, no representation).
Cross-Sectional Research Design
Ross (2001) tests the theory, leveraging (mostly) cross-sectional differences. Uses data on 113 countries.
Main hypothesis being tested: Countries that have more oil wealth will be less democratic.
Finds support for the hypothesis. Conclusion: Oil does hurt democracy.
Time-Series Research Design
But wait!
Haber and Menaldo (2011) argue that Ross and other proponents of the theory are missing important variables that differ across countries (Zs).
They take a time-series approach, analyzing data from 1800 to 2006 for 168 countries.
Main hypothesis being tested: As countries become more wealthy from oil resources, they become less democratic.
Do not find support for the hypothesis. Conclusion: Oil doesn't hurt democracy.
How many variables do I need to control for?
All of the possibly relevant ones.
How is this done? By the use of statistical controls.
Notice how the basis of comparison differs from experiments. In experiments, we achieve comparability between subjects by design.
This means that observational studies generally require more complex statistical methods than experimental studies.
External and Internal Validity
Because there is the potential for reverse causation and omitted variables in observational studies, they are considered less internally valid than experiments.
That is, we can't be as sure that X really caused Y as in an experiment.
However, because observational studies rely on naturally occurring phenomena, they do not suffer from the artificiality problem that haunts many experiments. Thus, they are considered more externally valid.
That is, the results we find are more likely to generalize to phenomena outside of our study than the results from experiments.
Multiple Regression: Categorical Predictors

Chapter Outline
1. Being Smart with Dummy Independent Variables in OLS
2. The Dummy Trap
3. Categorical Variables with More than Two Values
4. Multiple Dummy Variables
Being Smart with Dummy Independent Variables in OLS
In this section, though, we consider a series of scenarios involving independent variables that are not continuous:
- Using dummy variables to test hypotheses about a categorical independent variable with only two values
- Using dummy variables to test hypotheses about a categorical independent variable with more than two values
- Using dummy variables to test hypotheses about multiple independent variables
Using Dummy Variables to Test Hypotheses about a
Categorical Independent Variable with Only Two Values
We begin with a relatively simple case in which we have a categorical
independent variable that takes on one of two possible values for all
cases.
Categorical variables like this are commonly referred to as dummy
variables.
The most common form of dummy variable is one that takes on
values of one or zero.
These variables are also sometimes referred to as indicator variables
when a value of one indicates the presence of a particular
characteristic and a value of zero indicates the absence of that
characteristic.
Hillary Clinton Thermometer Scores Example
Data from the 1996 NES.
Dependent variable: Hillary Clinton Thermometer Rating.
Independent variables: Income and Gender.
Each respondent's gender was coded as equaling either 1 for male or 2 for female.
Although we could leave this gender variable as it is and run our analyses, we chose to use this variable to create two new dummy variables: male, equaling 1 for yes and 0 for no, and female, equaling 1 for yes and 0 for no.
Our first inclination is to estimate an OLS model in which the specification is the following:

Hillary Thermometer_i = α + β1 Income_i + β2 Male_i + β3 Female_i + u_i.
Regression output when we include both gender dummy
variables in our model
The dummy trap
We can see that the software has reported the results from the following model instead of what we asked for:

Hillary Thermometer_i = α + β1 Income_i + β3 Female_i + u_i.

This is the case because we have failed to meet the additional minimal mathematical criterion that we introduced when we moved from two-variable OLS to multiple OLS: no perfect multicollinearity.
The reason that we have failed to meet this is that, for two of the independent variables in our model, Male_i and Female_i, it is the case that

Male_i + Female_i = 1 for all i.

In other words, our variables Male and Female are perfectly correlated.
This situation is known as the dummy trap.
Avoiding the dummy trap
To avoid the dummy-variable trap, we have to omit one of our dummy variables.
But we want to be able to compare the effects of being male with the effects of being female to test our hypothesis.
How can we do this if we have to omit one of our two variables that measure gender? Before we answer this question, let's look at the results from the two different models in which we omit one of these two variables. We can learn a lot by looking at what is and what is not the same across these two models.
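A quick numerical illustration of the trap (made-up values, not the NES data): with an intercept, the Male and Female columns sum to the intercept column, so the design matrix is rank-deficient, and OLS cannot be estimated until one dummy is dropped.

    import numpy as np

    male = np.array([1, 0, 1, 0, 1])
    female = 1 - male
    X_trap = np.column_stack([np.ones(5), male, female])
    print(np.linalg.matrix_rank(X_trap))   # 2, not 3: perfect multicollinearity

    X_ok = np.column_stack([np.ones(5), female])   # drop Male
    print(np.linalg.matrix_rank(X_ok))     # 2 = number of columns: estimable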
Two models of the effects of gender and income on Hillary Clinton Thermometer scores

Independent variable   Model 1        Model 2
Male                   8.08 (1.50)
Female                                8.08 (1.50)
Income                 0.84 (0.12)    0.84 (0.12)
Intercept              61.18 (2.22)   69.26 (1.92)
n                      1542           1542
R²                     .06            .06

Notes: The dependent variable in both models is the respondent's thermometer score for Hillary Clinton. Standard errors in parentheses. Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Regression lines from the model with a dummy variable for
gender
Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with More Than Two Values

Value   Category     Frequency   Percent
0       Protestant   683         39.85
1       Catholic     346         20.19
2       Jewish       22          1.28
3       Other        153         8.93
4       None         510         29.75

When we have a categorical variable with more than two categories and we want to include it in an OLS model, things get more complicated.
The best strategy for modeling the effects of such an independent variable is to include a dummy variable for all values of that independent variable except one.
Using Dummy Variables to Test Hypotheses about a Categorical Independent Variable with More Than Two Values
The value of the independent variable for which we do not include a dummy variable is known as the reference category.
This is the case because the parameter estimates for all of the dummy variables representing the other values of the independent variable are estimated in reference to that value of the independent variable.
So let's say that we choose to estimate the following model:

Hillary Thermometer_i = α + β1 Income_i + β2 Protestant_i + β3 Catholic_i + β4 Jewish_i + β5 Other_i + u_i.

For this model we would be using None as our reference category for religious identification.
This would mean that β2 would be the estimated effect of being Protestant relative to being nonreligious.
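Mechanically, this amounts to creating one 0/1 column per category and dropping the reference column before estimation. A pandas sketch, with illustrative names:

    import pandas as pd

    df = pd.DataFrame({"religion":
        ["Protestant", "Catholic", "Jewish", "Other", "None"]})
    dummies = pd.get_dummies(df["religion"])
    # "None" is the reference category, so it is omitted from the model:
    X = dummies.drop(columns=["None"])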
The same model of religion and income on Hillary Clinton Thermometer scores with different reference categories

Independent variable   Model 1   Model 2   Model 3   Model 4   Model 5
Income                 0.97      0.97      0.97      0.97      0.97
                       (0.12)    (0.12)    (0.12)    (0.12)    (0.12)
Protestant             4.24      6.66      24.82     6.30
                       (1.77)    (2.68)    (6.70)    (2.02)
Catholic               2.07      0.35      18.51               6.30
                       (2.12)    (2.93)    (6.80)              (2.02)
Jewish                 20.58     18.16               18.51     24.82
                       (6.73)    (7.02)              (6.80)    (6.70)
Other                  2.42                18.16     0.35      6.66
                       (2.75)              (7.02)    (2.93)    (2.68)
None                             2.42      20.58     2.07      4.24
                                 (2.75)    (6.73)    (2.12)    (1.77)
Intercept              68.40     70.83     88.98     70.47     64.17
                       (2.19)    (2.88)    (6.83)    (2.53)    (2.10)
n                      1542      1542      1542      1542      1542
R²                     .06       .06       .06       .06       .06

Notes: The dependent variable in all models is the respondent's thermometer score for Hillary Clinton. Standard errors in parentheses. Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Using Dummy Variables to Test Hypotheses about Multiple Independent Variables
It is often the case that we will want to use multiple dummy independent variables in the same model.
Remember from last week: when we moved from a bivariate regression model to a multiple regression model, we had to interpret each parameter estimate as the estimated effect of a one-point increase in that particular independent variable on the dependent variable, while controlling for the effects of all other independent variables in the model.
When we interpret the estimated effect of each dummy independent variable, we are interpreting the parameter estimate as the estimated effect of that variable having a value of one versus zero on the dependent variable, while controlling for the effects of all other independent variables in the model, including the other dummy variables.
Model of Bargaining Duration

Independent variable                    Parameter estimate
Ideological Range of the Government     2.57* (1.95)
Number of Parties in the Government     -15.44*** (2.30)
Post-Election                           5.87** (2.99)
Continuation Rule                       -6.34** (3.34)
Intercept                               19.63*** (3.82)
n                                       203
R²                                      .62

Notes: The dependent variable is the number of days before each government was formed. Standard errors in parentheses. One-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Two Overlapping Dummy Variables in Models by Martin and Vanberg

                          Continuation Rule
                          No (0)   Yes (1)
Post-Election?   No (0)   61       25
                 Yes (1)  76       41

Note: Numbers in cells represent the number of cases.
Multiple Regression: Interaction and Multicollinearity

Chapter Outline
1. Testing Interactive Hypotheses with Dummy Variables
2. Multicollinearity
Testing Interactive Hypotheses with Dummy Variables
All of the OLS models that we have examined so far have been what we could call additive models.
To calculate the predicted value Ŷ for a particular case from an additive model, we simply multiply each independent variable value for that case by the appropriate parameter estimate and add these values together.
Interactive models contain at least one independent variable that we create by multiplying together two or more independent variables.
When we specify interactive models, we are testing theories about how the effects of one independent variable on our dependent variable may be contingent on the value of another independent variable.
Testing Interactive Hypotheses with Dummy Variables
We begin with an additive model with the following specification:

Hillary Thermometer_i = α + β1 Women's Movement Thermometer_i + β2 Female_i + u_i.

In this model we are testing the theory that a respondent's feelings toward Hillary Clinton are a function of their feelings toward the women's movement and their own gender.
This specification seems pretty reasonable, but we also want to test an additional theory: that feelings toward the women's movement have a stronger effect on feelings toward Hillary Clinton among women than they do among men.
In essence, we want to test the hypothesis that the slope of the line representing the relationship between the Women's Movement Thermometer and the Hillary Clinton Thermometer is steeper for women than it is for men.
Testing Interactive Hypotheses with Dummy Variables
To test this hypothesis, we need to create a new variable that is the product of the two independent variables in our model and include this new variable in our model:

Hillary Thermometer_i = α + β1 Women's Movement Thermometer_i + β2 Female_i + β3 (Women's Movement Thermometer_i × Female_i) + u_i.

By specifying our model as such, we have created two different models for women and men. So we can rewrite our model as

for Men (Female = 0): Hillary Thermometer_i = α + β1 Women's Movement Thermometer_i + u_i;

for Women (Female = 1): Hillary Thermometer_i = α + β1 Women's Movement Thermometer_i + β2 + β3 (Women's Movement Thermometer_i) + u_i.

And we can rewrite the formula for women as

for Women (Female = 1): Hillary Thermometer_i = (α + β2) + (β1 + β3)(Women's Movement Thermometer_i) + u_i.
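The algebra above can be mirrored in a tiny prediction function; the parameter names are ours, and the point is only that the men's and women's lines differ in both intercept and slope.

    def predicted_hillary(wm_therm, female, a, b1, b2, b3):
        # Interactive model: the product term lets the slope differ by gender.
        return a + b1 * wm_therm + b2 * female + b3 * (wm_therm * female)

    # For men (female = 0): intercept a, slope b1.
    # For women (female = 1): intercept a + b2, slope b1 + b3.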
The effects of gender and feelings toward the women's movement on Hillary Clinton Thermometer scores

Independent variable                     Additive model   Interactive model
Women's Movement Thermometer             0.68 (0.03)      0.75 (0.05)
Female                                   7.13 (1.37)      15.21 (4.19)
Women's Movement Thermometer × Female                     0.13 (0.06)
Intercept                                5.98 (2.13)      1.56 (3.04)
n                                        1466             1466
R²                                       .27              .27

Notes: The dependent variable in both models is the respondent's thermometer score for Hillary Clinton. Standard errors in parentheses. Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Regression lines from the interactive model
Multicollinearity
We know from last week that a minimal mathematical property for estimating a multiple OLS model is that there is no perfect multicollinearity.
Perfect multicollinearity, you will recall, occurs when one independent variable is an exact linear function of one or more other independent variables in a model.
In practice, perfect multicollinearity is usually the result of a small number of cases relative to the number of parameters we are estimating, limited independent variable values, or model misspecification.
A much more common and vexing issue is high multicollinearity.
As a result, when people refer to multicollinearity, they almost always mean high multicollinearity. From here on, when we refer to multicollinearity, we will mean high, but less-than-perfect, multicollinearity.
Multicollinearity is induced by a small number of degrees of freedom and/or high correlation between independent variables.

Venn diagram with multicollinearity
Detecting Multicollinearity
It is very important to know when you have multicollinearity.
If we have a high R² statistic, but none (or very few) of our parameter estimates are statistically significant, we should be suspicious of multicollinearity.
We should also be suspicious of multicollinearity if we see that, when we add and remove independent variables from our model, the parameter estimates for other independent variables (and especially their standard errors) change substantially.
A more formal way to diagnose multicollinearity is to calculate the variance inflation factor (VIF) for each of our independent variables.
This calculation is based on an auxiliary regression model in which one independent variable, which we will call X_j, is the dependent variable and all of the other independent variables are independent variables.
The R² statistic from this auxiliary model, R²_j, is then used to calculate the VIF for variable j as follows:

VIF_j = 1 / (1 - R²_j).
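The auxiliary-regression recipe translates directly into code. A numpy-only sketch (illustrative; an intercept is added to the auxiliary model):

    import numpy as np

    def vif(X: np.ndarray, j: int) -> float:
        # Regress column j on all other columns, then VIF_j = 1 / (1 - R^2_j).
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), others])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        return 1.0 / (1.0 - r2)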
Multicollinearity: A Real-World Example
We estimate a model of the thermometer scores of U.S. voters for George W. Bush in 2004. Our model specification is the following:

Bush Thermometer_i = α + β1 Income_i + β2 Ideology_i + β3 Education_i + β4 Party ID_i + u_i.

Although we have distinct theories about the causal impact of each independent variable on people's feelings toward Bush, the table on the next slide indicates that some of these independent variables are substantially correlated with each other.
Pairwise correlations between independent variables

              Bush Therm.   Income   Ideology   Education   Party ID
Bush Therm.   1.00
Income        0.09          1.00
Ideology      0.56          0.13     1.00
Education     0.07          0.44     0.06       1.00
Party ID      0.69          0.15     0.60       0.06        1.00

Notes: Cell entries are correlation coefficients. Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Model results from random draws of increasing size from the 2004 NES

Independent variable   Model 1    Model 2    Model 3
Income                 0.77       0.72       0.11
                       (0.90)     (0.51)     (0.15)
                       {1.63}     {1.16}     {1.24}
Ideology               7.02       4.57       4.26
                       (5.53)     (2.22)     (0.67)
                       {3.50}     {1.78}     {1.58}
Education              6.29       2.50       1.88
                       (3.32)     (1.83)     (0.55)
                       {1.42}     {1.23}     {1.22}
Party ID               6.83       8.44       10.00
                       (3.98)     (1.58)     (0.46)
                       {3.05}     {1.70}     {1.56}
Intercept              21.92      12.03      13.73
                       (23.45)    (13.03)    (3.56)
n                      20         74         821
R²                     .71        .56        .57

Notes: The dependent variable is the respondent's thermometer score for George W. Bush. Standard errors in parentheses; VIF statistics in braces. Two-sided t-tests: *** indicates p < .01; ** indicates p < .05; * indicates p < .10.
Multicollinearity: What Should I Do?
The reason why multicollinearity is vexing is that there is no magical statistical cure for it.
What is the best thing to do when you have multicollinearity?
Easy (in theory): Collect more data. But data are expensive to collect. If we had more data, we would use them, and we wouldn't have hit this problem in the first place.
So, if you do not have an easy way to increase your sample size, then multicollinearity ends up being something that you just have to live with.
It is important to know that you have multicollinearity and to present it by reporting VIF statistics or by showing what happens to your model when you add and drop the guilty variables.
Multiple Regression: Interpretation

Chapter Outline
1. Interpreting multiple regression
2. Which effect is biggest?
3. Statistical and Substantive Significance
An example

Table: Three regression models of U.S. presidential elections

            A         B         C
Growth      0.65*               0.57*
            (0.16)              (0.16)
Good News             0.96*     0.72*
                      (0.34)    (0.30)
Constant    51.86*    47.20*    48.12*
            (0.88)    (2.07)    (1.75)
R²          0.36      0.20      0.46
N           32        32        32

Note: Standard errors are in parentheses. * = p < 0.05
Which effect is biggest?
We might be tempted to look at the coefficients in column C for Growth (0.57) and for Good News (0.72) and conclude that the effect for Good News is roughly one-third larger than the effect for Growth.
Don't jump to that conclusion: The two independent variables are measured in different metrics, which makes that comparison misleading.
The short-run growth rate variable ranges from negative numbers (for times when the economy shrank) all the way through stronger periods where growth exceeded five percent per year.
The number of quarters of consecutive strong growth ranges from zero in the data set through ten.
That makes comparing the coefficients misleading.
Standardized coefficients
Because the coefficients in the table each exist in the native metric of each variable, they are referred to as unstandardized coefficients. While they are normally not comparable, there is a rather simple method to remove the metric of each variable to make them comparable to one another. As you might imagine, such coefficients, because they are on a standardized metric, are referred to as standardized coefficients. They are computed, quite simply, by taking the unstandardized coefficients and taking out the metrics, in the form of the standard deviations, of both the independent and dependent variables. The formula for this is:

β_Std = β × (s_X / s_Y)

where β_Std is the standardized regression coefficient, β is the unstandardized coefficient (as in the table), and s_X and s_Y are the standard deviations of X and Y, respectively. The interpretation of the standardized coefficients changes, not surprisingly. Whereas the unstandardized coefficients represent the expected change in Y given a one-unit increase in X, the standardized coefficients represent the expected standard deviation change in Y given a one standard deviation increase in X. Now, since all parameter estimates are in the same units (that is, standard deviations), they become comparable.
For the previous table
Implementing this formula for the unstandardized coefficients in column C of the above table produces the following results. First, for Growth:

β_Std = 0.57 × (5.54 / 6.07) = 0.52

Next, for Good News:

β_Std = 0.72 × (2.85 / 6.07) = 0.34
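The arithmetic is easy to verify by hand or in code, using the standard deviations quoted above (5.54 for Growth, 2.85 for Good News, and 6.07 for the dependent variable):

    def standardized(beta, sd_x, sd_y):
        # beta_Std = beta * (s_X / s_Y)
        return beta * sd_x / sd_y

    print(round(standardized(0.57, 5.54, 6.07), 2))   # 0.52 for Growth
    print(round(standardized(0.72, 2.85, 6.07), 2))   # 0.34 for Good News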
Comments
Partial effects in multiple regression refer to controlling for the other variables in the model, so they differ from the effects in bivariate models, which ignore all other variables.
The partial effect of X (controlling for Z) is the same as the bivariate effect of X when the correlation between X and Z is 0 (as is true in most designed experiments).
The partial effect of a predictor in this multiple regression model is identical at all fixed values of the other predictors in the model.
Example:
Let's say we have the prediction equation ŷ = 2 + 1X - 3Z.
Then at Z = 0, the partial effect of X is 1.
And at Z = 100, the partial effect of X is also 1.
Comments
This parallelism means that this model assumes no interaction between predictors in their effects on Y (i.e., the effect of X does not depend on the value of Z).
This may or may not be a good approximation. We discuss interaction effects next week.
How do we know if an effect is big?
Are the effects found in column C of the above table big?
It's tempting to answer, "Well of course they're big. Both coefficients are statistically significant. Therefore, they're big." That logic, although perhaps appealing, is faulty.
Recall the discussion of the effects of sample size on the magnitude of the standard error of the mean. And we noted the same effects of sample size on the magnitude of the standard errors of our regression coefficients.
Big sample sizes make for small standard errors, and hence larger t-statistics, and hence a greater likelihood of statistical significance.
But don't mistake this for a big substantive impact.
Multiple Regression: The Basics

Chapter Outline
1. Modeling multivariate reality
2. The population regression function
3. What happens when we fail to control for Z?
Crossing the fourth causal hurdle
In any observational study, how do we control for the effects of other variables?
Multiple regression, this week's topic, is by far the most common method in the social sciences.
The population regression function
We can generalize the population regression model from last week:

bivariate population regression model: Y_i = α + β X_i + u_i,

to include more than one cause of Y, which we have been calling Z:

multiple population regression model: Y_i = α + β1 X_i + β2 Z_i + u_i.
Notational differences!
Pay attention to how the notation has changed. In the two-variable formula for a line, there was no numeric subscript below the coefficient, because, well, there was only one of them. But now we have two independent variables, X and Z, that help to explain the variation in Y, and therefore we have two different coefficients β, and so we subscript them β1 and β2 in order to be clear that the values of these two effects are different from one another.
Curb your enthusiasm
It's been a good number of weeks between the first moment when we discussed the importance of "controlling for Z" and the point when we showed you how to do it. The fourth causal hurdle has never been too far from front-and-center, and now you know the process of crossing it.
Don't get too optimistic, though. The three-variable setup we just mentioned can easily be generalized to more than three variables. But the formula for β1 only controls for the effects of the Z variables that are included in the regression equation. It does not control for other variables that are not measured and not included in the model. And what happens when we fail to include a relevant cause of Y in our regression model? Statistically speaking, bad things.
The importance of controlling for Z
Controlling for the effects of other possible causes of our dependent variable Y is critical to making the correct causal inferences.
But how does omitting Z from a regression model affect inferences about whether X causes Y? Z isn't X, and Z isn't Y, so why should omitting Z matter?
If we don't estimate the correct model
Consider the following three-variable regression model involving our now-familiar trio of X, Y, and Z:

Y_i = α + β1 X_i + β2 Z_i + u_i

And assume, for the moment, that this is the correct model of reality. That is, the only systematic causes of Y are X and Z; and, to some degree, Y is also influenced by some random error component, u.
Now let's assume that, instead of estimating this correct model, we fail to estimate the effects of Z. That is, we estimate:

Y_i = α + β*1 X_i + u*_i
Omitted variables bias
Remember that the value of β1 in the correct, three-variable equation and the value of β*1 will not be identical under most circumstances. (We'll see the exceptions below.) And that, right there, should be enough to raise red flags of warning. For, if we know that the three-variable model is the correct model (and what that means, of course, is that the estimated value of β1 that we obtain from the data will be equal to the true population value), and if we know that β1 will not be equal to β*1, then there is a problem with the estimated value of β*1. That problem is a statistical problem called bias, which means that the expected value of the parameter estimate that we obtain from a sample will not be equal to the true population parameter. The specific type of bias that results from the failure to include a variable that belongs in our regression model is called omitted-variables bias.
The specifics of omitted variables bias
If, instead of estimating the true three-variable model, we estimate the incorrect two-variable model, the formula for the slope β*1 will be:

β*1 = Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / Σ_{i=1}^{n} (X_i - X̄)²

Notice that this is simply the bivariate formula for the effect of X on Y. (Of course, the model we just estimated is a bivariate model, in spite of the fact that we know that Z, as well as X, affects Y.) But since we know that Z should be in the model, and we know from Chapter 9 that regression lines travel through the mean values of each variable, we can figure out that the following is true:

(Y_i - Ȳ) = β1 (X_i - X̄) + β2 (Z_i - Z̄) + (u_i - ū)
More
(Y_i - Ȳ) = β1 (X_i - X̄) + β2 (Z_i - Z̄) + (u_i - ū)

We can do this because we know that the plane will travel through each mean.
Now notice that the left-hand side of the above equation, the (Y_i - Ȳ), is identical to one portion of the numerator of the slope formula for β*1.
Therefore, we can substitute the right-hand side of the above equation (yes, that entire mess) into the numerator of the formula for β*1.
The bias
The resulting math isn't anything that is beyond your skills in algebra, but it is cumbersome, so we won't derive it here. After a few lines of multiplying and reducing, though, the formula for β*1 will reduce to:

E(β*1) = β1 + β2 × [Σ_{i=1}^{n} (X_i - X̄)(Z_i - Z̄) / Σ_{i=1}^{n} (X_i - X̄)²]
More bias
E(β*1) = β1 + β2 × [Σ_{i=1}^{n} (X_i - X̄)(Z_i - Z̄) / Σ_{i=1}^{n} (X_i - X̄)²]

This might seem like a mouthful (a fact that's rather hard to deny), but there is a very important message in it. What the equation says is that the estimated effect of X on Y, β*1, where we do not include the effects of Z on Y (but should have), will be equal to the true β1 (that is, the effect with Z taken into account) plus a bundle of other stuff. That other stuff, strictly speaking, is bias. And since this bias came about as a result of omitting a variable (Z) that should have been in the model, this type of bias is known as omitted-variable bias.
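The formula is easy to see in a simulation (our own illustration, with made-up parameters): let the true model be Y = 1 + 2X + 3Z + u, with Z related to X by a slope of 0.5. Regressing Y on X alone should then recover roughly 2 + 3 × 0.5 = 3.5, exactly the true effect plus the bias term.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)        # Z depends on X (slope 0.5)
    y = 1 + 2 * x + 3 * z + rng.normal(size=n)

    # Bivariate slope of Y on X, omitting Z:
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    print(slope)                             # about 3.5 = 2 + 3 * 0.5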
What's the bias equal to?
Obviously, we'd like the expected value of our β*1 (estimated without Z) to equal the true β1 (as if we had estimated the equation with Z). And if the things to the right of the + sign in the equation above equal zero, it will. When will that happen? In two circumstances, neither of which is particularly likely. First, β*1 = β1 if β2 = 0. Second, β*1 = β1 if the large quotient at the end of the equation is equal to zero. What is that quotient? It should look familiar; in fact, it is the bivariate slope parameter of a regression of Z on X.
When will the bias equal zero?
In the first of these two special circumstances, the bias term will equal zero if and only if the effect of Z on Y (that is, the parameter β2) is zero. Okay, so it's safe to omit an independent variable from a regression equation if it has no effect on the dependent variable. (If that seems obvious to you, good.) The second circumstance is a bit more interesting: It's safe to omit an independent variable Z from an equation if it is entirely unrelated to the other independent variable X. Of course, if we omit Z in such circumstances, we'll still be deprived of understanding how Z affects Y; but at least, so long as Z and X are absolutely unrelated, omitting Z will not adversely affect our estimate of the effect of X on Y.
The size of bias
If we estimate a regression model that omits an independent variable (Z) that belongs in the model, then the effects of that Z will somehow work their way into the parameter estimates for the independent variable that we do estimate (X), and pollute our estimate of the effect of X on Y.
The equation above also suggests when the magnitude of the bias is likely to be large, and when it is likely to be small. If either or both of the components of the bias term (β2 and the quotient Σ(X_i - X̄)(Z_i - Z̄) / Σ(X_i - X̄)²) are close to zero, then the bias is likely to be small (because the bias term is the product of both components); but if both are likely to be large, then the bias is likely to be quite large.
The direction of bias
Moreover, the equation also suggests the likely direction of the bias. All we have said thus far is that the coefficient \(\hat{\beta}_1\) will be biased; that is, it will not equal its true value. But will it be too large or too small? If we have good guesses about the values of \(\beta_2\) and the correlation between X and Z, then we can anticipate the direction of the bias. For example, suppose that \(\beta_1\), \(\beta_2\), and the correlation between X and Z are all positive. That means that our estimated coefficient \(\hat{\beta}_1\) will be larger than it is supposed to be, since a positive number plus the product of two positive numbers will be a still-larger positive number. And so on.
PS4781 Multiple Regression: The Basics 20 / 20
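To see the omitted-variable result in action, here is a minimal simulation sketch in R; the variable names and the true parameter values (a true slope of 2 for X, 3 for Z, and a slope of 0.6 for Z on X) are mine, chosen purely for illustration:

    # Simulate data in which Z affects Y and is correlated with X
    set.seed(42)
    n <- 1000
    X <- rnorm(n)
    Z <- 0.6 * X + rnorm(n)              # Z is correlated with X
    Y <- 1 + 2 * X + 3 * Z + rnorm(n)    # true beta1 = 2, beta2 = 3

    coef(lm(Y ~ X + Z))["X"]   # close to the true beta1 of 2
    coef(lm(Y ~ X))["X"]       # biased: roughly 2 + 3 * 0.6 = 3.8

Because Z affects Y and is correlated with X, the short regression's slope absorbs part of Z's effect, exactly as the bias formula predicts.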
Probability Distributions
Agnar F. Helgason
The Ohio State University
PS4781 Probability Distributions 1 / 22
Looking back, looking ahead
We now know how to use descriptive statistics, that is, measures of central tendency and measures of dispersion, to describe what a distribution of data looks like.
For example, we can describe a class's scores on an exam or a paper with things like the mode, median, and mean, and its standard deviation.
PS4781 Probability Distributions 2 / 22
Populations versus samples
But we also know that many of our statistics are derived from samples of data. We've said that we tend not to care about our samples in and of themselves, but only insofar as they tell us something about the population as a whole.
PS4781 Probability Distributions 3 / 22
This is statistical inference
Statistical inference is the process of making probabilistic statements about a population characteristic based on our knowledge of the sample characteristic.
In other words, there are things we know about with certainty, like the mean of some variable in our sample. But we care about the likely values of that variable in the entire population. Since we almost never have data for an entire population, we need to use what we know to infer the likely range of values in the population.
Probability theory plays a fundamental role in the process of statistical inference.
PS4781 Probability Distributions 4 / 22
Statistical Inference
PS4781 Probability Distributions 5 / 22
Chapter Outline
1. Some Basics of Probability Theory
2. The Normal Distribution
PS4781 Probability Distributions 6 / 22
Meaning of Probability
What does it mean to say that something has some X probability of
occurring?
Examples:
0.5 probability of getting heads when you flip a coin
0.7 probability of rain tomorrow
0.9 probability that Obama wins the election, if he wins Ohio
PS4781 Probability Distributions 8 / 22
Definition
The probability of an outcome is the proportion of times that outcome would occur in a long sequence of observations.
Examples:
If you flip a coin 1000 times, you can expect about 500 heads.
The last 1000 times weather conditions have been similar to those we expect tomorrow, 700 rainy days were observed.
In the last 10 elections, similar candidates who have won Ohio have gone on to win the presidency 9 times.
PS4781 Probability Distributions 9 / 22
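This long-run-frequency definition is easy to see by simulation. A quick sketch in R (an illustration of the idea, not something from the slides):

    set.seed(1)
    flips <- sample(c("H", "T"), 1000, replace = TRUE)
    mean(flips == "H")   # close to 0.5, and closer still as the number of flips grows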
Basic Rules
\(0 \leq P(y) \leq 1\)
All outcomes have some probability ranging from 0 to 1.
\(\sum_y P(y) = 1\)
The probabilities of all possible outcomes must sum to exactly 1.
PS4781 Probability Distributions 10 / 22
Basic Rules
Let A and B denote possible outcomes.
P(not A) = 1 - P(A)
P(A and B) = P(A)*P(B given A)
If A and B are independent events:
P(A and B) = P(A)*P(B)
P(A or B) = P(A) + P(B) - P(A and B)
If A and B are mutually exclusive events:
P(A or B) = P(A) + P(B)
PS4781 Probability Distributions 11 / 22
Properties
A probability distribution specifies probabilities for all values of a variable. P(y) denotes the probability of value y.
Like frequency distributions, probability distributions have descriptive measures, such as mean and standard deviation.
Mean (expected value): \(\mu = E(Y) = \sum_y y \, P(y)\)
For a single die:
\(E(Y) = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \cdots + 6 \cdot \tfrac{1}{6} = 3.5\)
PS4781 Probability Distributions 12 / 22
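The die's expected value can be checked directly from the definition, for instance in R:

    y <- 1:6
    p <- rep(1/6, 6)   # each face has probability 1/6
    sum(y * p)         # 3.5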
The Normal Distribution
Symmetric, bell-shaped.
Characterized by mean (\(\mu\)) and standard deviation (\(\sigma\)), representing center and spread.
The probability within any particular number of standard deviations of \(\mu\) is the same for all normal distributions.
An individual observation from an approximately normal distribution has probability:
0.68 of falling within 1 standard deviation of the mean
0.95 of falling within 2 standard deviations
0.997 of falling within 3 standard deviations
When \(\mu = 0\) and \(\sigma = 1\), we call the distribution a standard normal distribution.
PS4781 Probability Distributions 14 / 22
It looks like this
PS4781 Probability Distributions 15 / 22
Z-Score
For any normal distribution, the number of standard deviations from the mean is called the z-score:
\[
z = \frac{x - \mu}{\sigma}
\]
Table A in Agresti and Finlay gives the probability of being more than z standard deviations above the mean.
PS4781 Probability Distributions 16 / 22
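In R, pnorm() plays the role of Table A. A small sketch for the standard normal:

    1 - pnorm(1)           # probability of being more than 1 sd above the mean, about 0.16
    1 - pnorm(2)           # about 0.023
    pnorm(1) - pnorm(-1)   # probability within 1 sd of the mean, about 0.68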
Part of table A in AF
PS4781 Probability Distributions 17 / 22
Example
If the mean IQ in the population is 100 with a standard deviation of
16, how high does your IQ need to be to be in the 99th percentile?
PS4781 Probability Distributions 18 / 22
Example
The Minnesota Multiphasic Personality Inventory (MMPI), based on responses to 500 true/false questions, provides scores for several scales (e.g., depression, anxiety, substance abuse), with \(\mu = 50\) and \(\sigma = 10\). The distribution of scores is normal and anything above 65 is considered abnormally high. What percentage is this?
PS4781 Probability Distributions 19 / 22
Example
For 5459 pregnant women using Aarhus University Hospital in
Denmark in a two-year period who reported information on length of
gestation until birth, the mean was 281.9 days, with standard
deviation 11.4 days. A baby is classified as premature if the gestation
time is 258 days or less. If gestation times are normally distributed,
what proportion of babies would be born prematurely?
PS4781 Probability Distributions 20 / 22
Example
On the midterm exam in introductory statistics, an instructor always
gives a grade of B to students who score between 80 and 90. One
year, the scores have approximately a normal distribution with mean
83 and standard deviation 5. About what proportion of the students
get a B?
PS4781 Probability Distributions 21 / 22
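For reference, here is one way the four example slides above could be worked in R with pnorm() and qnorm(); treat this as a sketch of the method rather than the official solutions:

    qnorm(0.99, mean = 100, sd = 16)      # IQ at the 99th percentile, about 137
    1 - pnorm(65, mean = 50, sd = 10)     # MMPI share above 65, about 0.067 (6.7%)
    pnorm(258, mean = 281.9, sd = 11.4)   # share of gestations of 258 days or less, about 0.018
    pnorm(90, mean = 83, sd = 5) -
      pnorm(80, mean = 83, sd = 5)        # share of B grades (80 to 90), about 0.645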
Why is the normal distribution so important?
In the next video, we'll learn that if different studies take random samples and calculate a statistic (e.g. the sample mean) to estimate a parameter (e.g. the population mean), the collection of statistic values from those studies usually has approximately a normal distribution.
PS4781 Probability Distributions 22 / 22
R Exercises
PS4781 Techniques of Political Analysis
1. Create a sequence that runs from 1 to 1000, in increments of 4 units. What is the
mean of the sequence?
2. How many numbers are in the sequence? (hint: ?length)
3. What is the value of the 150th element of the sequence?
4. Now create a new variable based on the sequence that omits elements 50 through 70.
What is the mean of this new variable?
5. With this new variable, only keep items with a value higher than 200 and lower than
300 (note that this condition refers to the values of each element, not the indices of
the elements themselves in the vector). How many elements are left?
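As a sketch of the functions these first five exercises call for (not the official solutions), something along these lines works in R:

    s <- seq(1, 1000, by = 4)       # exercise 1: then mean(s)
    length(s)                       # exercise 2
    s[150]                          # exercise 3
    s2 <- s[-(50:70)]               # exercise 4: drop elements 50 through 70
    s3 <- s2[s2 > 200 & s2 < 300]   # exercise 5: keep values strictly between 200 and 300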
For the next exercises, we will be using a subset of the National Military Capabilities
dataset. It contains annual values for total population, urban population, iron and steel
production, energy consumption, military personnel, and military expenditure for all
countries in the world. It is a dataset commonly used by scholars of international politics. If you want to read more about the dataset, you can have a look at this site.
For our purposes, I've subsetted the dataset so that it only contains values for the year 2007, for all countries in the world. The dataset, along with the codebook, can be found on Carmen. Note that the dataset is a Stata dataset, so the file type is .dta (hint: ?read.dta after loading the foreign library).
6. What are the dimensions of the dataset?
7. What would you type to show the rst column of the dataframe?
8. What is the value of element [140,8]?
9. What is the mean value of the milper variable?
10. Ahh... we may have made a mistake in the previous question. Have a look at page 4
in the codebook. Missing values are not coded as NA, but as -9. We need to fix this,
so we can calculate the correct mean. Run the following command:
nmc$milper[which(nmc$milper==-9)] <- NA
The command tells R to assign the value NA to the milper variable for all rows in
which milper equals -9. What is the mean for this new variable?
11. What is the median of the variable?
12. What do the median and mean suggest about the distribution of the data? (hint: you
can also plot a histogram to see the distribution)
13. But wait! Countries dier in size so using the raw number of military personnel might
be a bit misleading. Create a new variable in the dataframe that contains the share of
military personnel of the total population (hint: the tpop variable might help). What
is the mean of this new variable?
14. How many countries have over 2% of their population in the military? (hint: use the
which function to find the row numbers of observations in the dataframe that meet
the condition. Then wrap the which within the length function).
15. What is the value of the highest share of military personnel in the dataset?
16. Which country has the highest share of military personnel in the dataset?
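A sketch of the tools the dataset exercises rely on; note that the actual file name on Carmen may differ from the hypothetical one used here:

    library(foreign)
    nmc <- read.dta("nmc_2007.dta")   # hypothetical file name
    dim(nmc)                          # exercise 6: rows and columns
    nmc[, 1]                          # exercise 7: the first column
    nmc$milper[which(nmc$milper == -9)] <- NA   # recode missings (exercise 10)
    mean(nmc$milper, na.rm = TRUE)    # na.rm = TRUE drops the NAs from the calculation
    nmc$milshare <- nmc$milper / nmc$tpop       # exercise 13: share of population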
Bivariate Regression II
PS4781 Bivariate Regression II 1 / 18
Chapter Outline
1. Inference based on OLS
2. Measuring our Uncertainty about the OLS line
PS4781 Bivariate Regression II 2 / 18
Inference based on OLS
In the previous video, we talked about finding the best linear approximation for a given set of points.
What we didn't consider is that we rarely have population data, i.e. the data points we observe are just a sample from a larger population.
As such, we are interested in making inferences from the sample to the population (just as with the hypothesis tests we covered last week).
PS4781 Bivariate Regression II 4 / 18
An aside on notation
The formula we used to describe the relationship was
\[
Y_i = \hat{\alpha} + \hat{\beta} X_i + \hat{u}_i
\]
The reason \(\hat{\alpha}\), \(\hat{\beta}\), and \(\hat{u}_i\) have hats is because we need to distinguish them, our estimates, from the true population parameters \(\alpha\), \(\beta\), and \(u_i\) being estimated.
This is just like in previous weeks, where we have, for example, used \(\bar{y}\) for a sample mean that is an estimate of the population mean.
The reason \(Y_i\) and \(X_i\) don't have hats is because they aren't estimates, just data points that have been measured and sampled from the population.
PS4781 Bivariate Regression II 5 / 18
Measuring our Uncertainty about the OLS line
With an OLS regression model, we have several different ways in which to measure our uncertainty.
We discuss these measures in terms of the overall fit between X and Y first and then discuss the uncertainty about individual parameters.
Our uncertainty about individual parameters is used in the testing of our hypotheses (i.e. finding out if the results are statistically significant).
PS4781 Bivariate Regression II 7 / 18
How well does the regression model fit the data?
Measures of the overall fit between a regression model and the dependent variable are called goodness-of-fit measures.
One of the most popular measures is the R-squared statistic (\(R^2\)).
The \(R^2\) statistic ranges between zero and one, indicating the proportion of the variation in the dependent variable that is accounted for by the model.
We need two statistics to calculate \(R^2\):
The total variation in Y, also known as the Total Sum of Squares (TSS):
\[
\text{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
\]
The residual variation in Y, also known as the Residual Sum of Squares (RSS):
\[
\text{RSS} = \sum_{i=1}^{n} \hat{u}_i^2
\]
PS4781 Bivariate Regression II 8 / 18
How well does the regression model fit the data?
Once we have these two quantities, TSS and RSS, we can calculate the \(R^2\) statistic as
\[
R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
\]
PS4781 Bivariate Regression II 9 / 18
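A quick sketch in R with simulated data (the data-generating values are mine), showing that computing TSS and RSS by hand reproduces the R-squared that lm() reports:

    set.seed(7)
    x <- rnorm(100)
    y <- 2 + 0.5 * x + rnorm(100)
    fit <- lm(y ~ x)
    TSS <- sum((y - mean(y))^2)    # total sum of squares
    RSS <- sum(residuals(fit)^2)   # residual sum of squares
    1 - RSS / TSS                  # matches summary(fit)$r.squared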
What is a good value for R-squared?
Many people want to designate a threshold for \(R^2\) beyond which we can say we have a "good" model of the dependent variable.
But it's really hard to come up with a universal value, because so much of \(R^2\) depends on the overall variation of Y.
If we are trying to explain some Y that has a high variance, we are always going to have a more difficult time getting a high \(R^2\) than if we have a Y that has low variance.
PS4781 Bivariate Regression II 10 / 18
Uncertainty about Individual Components of the Sample Regression Model: \(\hat{\sigma}^2\)
A crucial part of the uncertainty in OLS regression models is the degree of uncertainty about individual estimates of population parameter values from the sample regression model.
One estimate that factors into the calculations of our uncertainty about each of the population parameters is the estimated variance of the population stochastic component, \(u_i\).
This unseen variance, \(\sigma^2\), is estimated from the residuals (\(\hat{u}_i\)), after the parameters for the sample regression model have been estimated, by the following formula:
\[
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - 2}
\]
PS4781 Bivariate Regression II 11 / 18
Uncertainty about Individual Components of the Sample Regression Model: \(\hat{\sigma}^2\)
\[
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - 2}
\]
Looking at this formula, we can see two components that play a role in determining the magnitude of this estimate.
The first component comes from the individual residual values (\(\hat{u}_i\)). Remember that these values (calculated as \(\hat{u}_i = Y_i - \hat{Y}_i\)) are the vertical distance between each observed \(Y_i\) value and the regression line.
The larger these values are, the further the individual cases are from the regression line.
The second component of this formula comes from n, the sample size.
The larger the sample size, the smaller the variance of the estimate.
PS4781 Bivariate Regression II 12 / 18
The variance and standard errors for the slope parameter estimate (\(\hat{\beta}\))
Once we have estimated \(\hat{\sigma}^2\), the variance and standard errors for the slope parameter estimate (\(\hat{\beta}\)) are estimated from the following formulae:
\[
\text{var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\text{se}(\hat{\beta}) = \sqrt{\text{var}(\hat{\beta})} = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}.
\]
In the numerator, we find \(\hat{\sigma}^2\), the estimated variance of the stochastic component. So the larger this value is, the larger will be the variance and standard error of the slope parameter estimate.
PS4781 Bivariate Regression II 13 / 18
The variance and standard errors for the slope parameter estimate (\(\hat{\beta}\))
\[
\text{var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\text{se}(\hat{\beta}) = \sqrt{\text{var}(\hat{\beta})} = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}.
\]
In the denominator of this equation, we see the term \(\sum_{i=1}^{n} (X_i - \bar{X})^2\), which is a measure of the variation of the \(X_i\) values around their mean (\(\bar{X}\)).
The greater this variation, the smaller will be the variance and standard error of the slope parameter estimate.
This is an important property; in real-world terms it means that the more variation we have in X, the more precisely we will be able to estimate the relationship between X and Y.
PS4781 Bivariate Regression II 14 / 18
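A sketch, again with simulated data, of how \(\hat{\sigma}^2\) and the slope's standard error follow from these formulas and can be checked against lm()'s output:

    set.seed(7)
    x <- rnorm(100)
    y <- 2 + 0.5 * x + rnorm(100)
    fit <- lm(y ~ x)
    sigma2_hat <- sum(residuals(fit)^2) / (length(y) - 2)   # estimated variance of u
    se_slope <- sqrt(sigma2_hat / sum((x - mean(x))^2))     # se of the slope
    se_slope   # matches summary(fit)$coefficients["x", "Std. Error"]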
The variance and standard errors for the intercept parameter estimate (\(\hat{\alpha}\))
The variance and standard errors for the intercept parameter estimate (\(\hat{\alpha}\)) are estimated from the following formulae:
\[
\text{var}(\hat{\alpha}) = \frac{\hat{\sigma}^2 \sum_{i=1}^{n} X_i^2}{n \sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\text{se}(\hat{\alpha}) = \sqrt{\text{var}(\hat{\alpha})} = \hat{\sigma} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n \sum_{i=1}^{n} (X_i - \bar{X})^2}}.
\]
PS4781 Bivariate Regression II 15 / 18
Confidence Intervals and Hypothesis Testing about Parameter Estimates
In previous weeks we've discussed how we use the normal distribution (supported by the central limit theorem) to estimate confidence intervals and do hypothesis tests for the unseen population mean from sample data.
We go through the same logical steps to estimate the unseen parameters from the population regression model by using the results from the sample regression model.
PS4781 Bivariate Regression II 16 / 18
Traditional OLS hypothesis testing
The traditional approach to hypothesis testing with OLS regression is
that we specify a null hypothesis and an alternative hypothesis and
then compare the two.
Although we can test hypotheses about either the slope or the
intercept parameter, we are usually more concerned with tests about
the slope parameter.
In particular, we are usually concerned with testing the hypothesis
that the population slope parameter is equal to zero.
The logic of this hypothesis test corresponds closely with the logic of the bivariate hypothesis tests from last week:
We observe a sample slope parameter, which is an estimate of the population slope.
Then, from the value of this parameter estimate, the confidence interval around it, and the size of our sample, we evaluate how likely it is that we would observe this sample slope if the true but unobserved population slope were equal to zero.
If the answer is "very likely," then we fail to reject the null hypothesis that the population slope is equal to zero.
PS4781 Bivariate Regression II 17 / 18
Notes on OLS
Always remember that linear regression is a statistical model, a simplification of the data we observe. We don't truly expect exactly a linear relationship, but it is nonetheless often a good and simple approximation in practice.
Extrapolation beyond the observed range of x-values is dangerous. For y = high school GPA and x = weekly hours watching TV, suppose \(\hat{Y} = 3.44 - 0.03X\). If we observe x between 0 and 30, say, it does not make sense to plug in x = 100 and get a predicted GPA of 0.44.
Observations are very influential if they take extreme values (small or large) of x and fall far from the linear trend the rest of the data follow. Such outliers can unduly affect least squares results.
PS4781 Bivariate Regression II 18 / 18
Sampling Distributions and the Central Limit Theorem
Agnar F. Helgason
The Ohio State University
PS4781 Sampling Distributions and the Central Limit Theorem 1 / 10
Sampling
Imagine a (hypothetical) world in which we took an infinite number of samples, took the mean of each sample, and then plotted those means. How would those plotted means be distributed?
PS4781 Sampling Distributions and the Central Limit Theorem 2 / 10
An example
Imagine that we rolled a six-sided die. It can come out as a 1, 2, 3, 4, 5, or 6 with equal probability, right?
Let's say you rolled that die 600 times. What would that distribution look like?
PS4781 Sampling Distributions and the Central Limit Theorem 3 / 10
A uniform (not normal) distribution
[Figure: histogram of 600 simulated die rolls. The x-axis shows the values 1 through 6; the y-axis shows the number of rolls (0 to 120). The six bars are roughly equal in height.]
PS4781 Sampling Distributions and the Central Limit Theorem 4 / 10
That's not normal
That's not normal, right?
Let's say we rolled that die 600 times. What do you think the mean would be (about)?
Would it be exactly 3.5? Every time? No, of course not.
But what would happen if we did that roll-it-600-times thing, say, a billion times, then plotted the means? (Not the rolls, the means. Be careful!)
PS4781 Sampling Distributions and the Central Limit Theorem 5 / 10
It would be normal
Think about this carefully. In our frequency distribution, we could get
a score of 1 to 6 with equal likelihood. But in our sample means, we
would never get means of 1 or 6. All of our means would be
somewhere around 3.5. Moreover, they would be distributed around
that mean (3.5) normally.
PS4781 Sampling Distributions and the Central Limit Theorem 6 / 10
This is the Central Limit Theorem
The Central Limit Theorem says that, no matter what the underlying shape of the frequency distribution (whether it's uniform, normal, or whatever), the hypothetical distribution of sample means, which is called a sampling distribution, will be normal, with mean equal to the true population mean, and standard deviation equal to
\[
\sigma_{\bar{Y}} = \frac{\sigma_Y}{\sqrt{n}}
\]
The quantity above is called the standard error of the mean.
PS4781 Sampling Distributions and the Central Limit Theorem 7 / 10
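A simulation sketch of this result in R (the number of replications is arbitrary):

    set.seed(3)
    means <- replicate(10000, mean(sample(1:6, 600, replace = TRUE)))
    hist(means)   # bell-shaped and centered near 3.5
    sd(means)     # close to 1.708 / sqrt(600), about 0.07, where 1.708 is the sd of one roll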
Pretty amazing stuff!
Each sample you take from a population comes from a sampling distribution.
You don't know where in the sampling distribution it comes from; it could be an unusual sample.
However, armed with our knowledge of the normal distribution, we can test how likely it is that the results from our single sample were the product of an unusual sample (i.e. chance), or whether the results indicate something substantive.
PS4781 Sampling Distributions and the Central Limit Theorem 8 / 10
Example
An exit poll of 2293 voters in the 2006 Ohio Senatorial election indicated
that 44% voted for the Republican candidate, Mike DeWine, and 56%
voted for the Democratic candidate, Sherrod Brown.
If actually 50% of the population voted for DeWine, would it have been surprising to obtain the results in this exit poll?
Note: For binary data (y = 1 or y = 0) with \(P(Y = 1) = \pi\), one can show that
\[
\sigma = \sqrt{\pi (1 - \pi)}
\]
When \(\pi = 0.5\), then \(\sigma = 0.5\) and the standard error is \(se = \sigma / \sqrt{n} = 0.5 / \sqrt{n}\).
PS4781 Sampling Distributions and the Central Limit Theorem 9 / 10
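A sketch of the arithmetic in R:

    n <- 2293
    se <- 0.5 / sqrt(n)   # standard error if pi really were 0.5
    (0.44 - 0.50) / se    # z is about -5.7

A z-score of about -5.7 means that observing 44% in the poll would be extremely surprising if half the population had actually voted for DeWine.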
Looking ahead
Next week we will start using what we've learned in the last two videos to make principled inferences about population parameters.
We'll also talk more generally about survey design. What kind of samples do we need to make inferences? And given a good sample, can we get an unbiased measure of all possible population parameters?
PS4781 Sampling Distributions and the Central Limit Theorem 10 / 10
Descriptive Statistics I:
Scales of Measurement and Distribution of Data
Agnar F. Helgason
The Ohio State University
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 1 / 17
Outline of the session
1. Scales of Measurement
2. Distribution of data
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 3 / 17
Measuring variables
Variable: a characteristic that can vary in value among subjects in a sample or a population.
[Figure: Examples of Variables. (a) Age, (b) Eye Color, (c) Partisanship.]
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 5 / 17
What is a variable's measurement metric?
One major way to distinguish among variables is measurement metric. A variable's measurement metric is the type of values that the variable takes on.
There are three types of variables, categorized according to the metric in which the values of the variable occur: nominal, ordinal, and interval.
Variables can also be classified according to whether they take discrete or continuous values.
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 6 / 17
Nominal (categorical) variables
Nominal variables are variables for which cases have values that are either different from or the same as the values for other cases, but about which we cannot make any universally-holding ranking distinctions.
Example: religious identification. Some values for this variable are "Catholic," "Muslim," "non-religious," and so on. While these values are clearly different from each other, we cannot make universally-holding ranking distinctions across them.
More casually, with nominal variables like this one, it is not possible to rank order the categories from least to greatest: the value "Muslim" is neither greater than nor less than "non-religious."
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 7 / 17
Ordinal variables
Ordinal variables are also variables for which cases have values that are either different from or the same as the values for other cases.
The distinction between ordinal and nominal variables is that we can make universally-holding ranking distinctions across the variable values for ordinal variables.
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 8 / 17
Example: Retrospective family financial situation
"We are interested in how people are getting along financially these days. Would you say that you (and your family living here) are better off or worse off than you were a year ago?" Researchers then asked respondents who answered "better" or "worse": "Much [better/worse] or somewhat [better/worse]?" The resulting variable was then coded as follows:
1. much better
2. somewhat better
3. same
4. somewhat worse
5. much worse
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 9 / 17
Interval (continuous) variables
An important characteristic that ordinal variables do not have is equal unit differences.
The metric in which we measure a variable has equal unit differences if a one-unit increase in the value of that variable indicates the same amount of change across all values of that variable. Interval variables are variables that do have equal unit differences.
Examples: Age (in years), Income (in dollars)
In analyses, we are often forced to treat ordinal variables as if they were continuous.
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 10 / 17
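As a rough sketch of how these three metrics map onto R's data types (the example values are mine):

    religion <- factor(c("Catholic", "Muslim", "None"))        # nominal: unordered categories
    finances <- ordered(c("worse", "same", "better"),
                        levels = c("worse", "same", "better")) # ordinal: ranked categories
    age <- c(24, 31, 58)                                       # interval: equal unit differences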
Back to the examples
[Figure: Examples of Variables. (a) Age, (b) Eye Color, (c) Partisanship.]
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 11 / 17
Frequency and distribution
The most fundamental descriptive statistic is a summary of frequencies.
Frequency distribution: lists the possible values of a variable and the number of times each occurs.
Histogram: a bar graph of frequencies or percentages.
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 13 / 17
Example
Example: Political ideology measured as an ordinal variable with 1 = very liberal, ..., 4 = moderate, ..., 7 = very conservative.
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 14 / 17
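A sketch in R with hypothetical ideology data coded 1 through 7:

    set.seed(11)
    ideology <- sample(1:7, 500, replace = TRUE)   # hypothetical scores
    table(ideology)                                # the frequency distribution
    barplot(table(ideology))                       # a bar graph of the frequencies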
Histogram
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 15 / 17
Shapes of histograms
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 16 / 17
Back to Music...
PS4781 Descriptive Statistics I: Scales of Measurement and Distribution of Data 17 / 17
Survey Design
Agnar F. Helgason
The Ohio State University
PS4781 Survey Design 1 / 19
Chapter Outline
1. Sample Size Selection
2. Sampling Bias
3. Response and Nonresponse Bias
4. Looking Back, Looking Ahead
PS4781 Survey Design 2 / 19
Choosing the Sample Size
So far, we have treated the sample as given. However, we might want to perform a study ourselves, and a critical question is how large a sample is necessary.
We need to determine an acceptable margin of error (ME) and confidence level, and make an educated guess about the likely parameter value.
Proportion: \(n = \pi (1 - \pi) \left( \frac{z}{ME} \right)^2\)
Mean: \(n = \sigma^2 \left( \frac{z}{ME} \right)^2\)
PS4781 Survey Design 4 / 19
Example: Future Anorexia Study
We want n to estimate the population mean weight change to within 2 pounds, with probability 0.95.
Based on a past study, we guess that \(\sigma = 7\).
\[
n = \sigma^2 \left( \frac{z}{ME} \right)^2 = 7^2 \left( \frac{1.96}{2} \right)^2 \approx 47
\]
Given the assumption, we need a sample size of 47.
PS4781 Survey Design 5 / 19
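The two sample-size formulas are easy to wrap as R functions; a sketch (the function names are mine):

    n_for_mean <- function(sigma, ME, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)   # e.g. 1.96 for 95% confidence
      ceiling(sigma^2 * (z / ME)^2)
    }
    n_for_prop <- function(pi, ME, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      ceiling(pi * (1 - pi) * (z / ME)^2)
    }
    n_for_mean(sigma = 7, ME = 2)   # 48; the slide's 47 comes from rounding to the nearest integer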
Simple Random Sample
A crucial assumption behind all the inferences we've done up until now is that the sample we have is a simple random sample.
A simple random sample is based on a sampling method where each possible sample of size n has the same chance of being selected.
Remember the sampling distribution: we are equally likely to draw any of the samples in the distribution.
PS4781 Survey Design 7 / 19
How do we randomize?
There are a number of ways in which we can obtain a simple random sample from a population, e.g. using random number tables or statistical software.
A sampling frame (a listing of all subjects in the population) must exist to implement simple random sampling.
There are a number of different methods one can use to select a random sample (see AF#2.4 if you are interested).
PS4781 Survey Design 8 / 19
What happens when we don't have a random sample?
In many cases, we don't have a random sample from the population. This is referred to as nonprobability sampling, because we cannot specify the probabilities of sample selection from the population.
Examples: volunteer samples, such as polls on the internet or call-in surveys on radio shows.
Because such samples are not random, any inferences from the sample to the population are highly suspect.
PS4781 Survey Design 9 / 19
Example
Lou Dobbs (CNN) asks (August 2009): "Uberliberal Bill Maher says the American people are too stupid to decide whether Obama's unwritten health-care legislation is right for them, and that the president should just ram it through Congress. Do you believe that the president knows best on health care?"
Yes, I agree we need his reforms: 5%
No thanks, I'll decide for myself: 95%
Textbook sampling bias, based on a nonprobability sample.
PS4781 Survey Design 10 / 19
Are convenience samples then useless?
No, not at all! They cannot, however, be the basis of statistical inference to a population. Some uses include:
A pilot study
Pedagogical purposes
When you want to learn something about the particular participants in the survey (effectively treating the sample observations as the population)
Sometimes the best we can do is a convenience sample.
PS4781 Survey Design 11 / 19
Other Biases
In addition to sampling bias, we need to be aware of two different biases that can lead us to draw wrong conclusions about the population in question.
PS4781 Survey Design 13 / 19
Response Bias
Response bias occurs when the question or survey affects the true response of a subject. Examples include poorly worded or confusing questions, and questions pertaining to socially sensitive topics.
Example from a 2006 NY Times poll:
"Do you favor a gasoline tax..." = 12% yes
"...to reduce U.S. dependence on foreign oil" = 55% yes
PS4781 Survey Design 14 / 19
Nonresponse Bias
Nonresponse bias occurs when some sampled subjects either do not
participate in the study or fail to answer particular questions. If such
missing values are systematic, our inferences will be biased.
For example, imagine a simple random internet survey meant to
estimate the population proportion that uses the internet.
Why will this approach give us a biased answer?
PS4781 Survey Design 15 / 19
What Have We Accomplished?
Descriptive statistics: summarize data visually and numerically.
Probability and sampling: understand the basis for drawing inferences about populations based on a random sample.
Point and interval estimation: able to make statements about population means and proportions with a given confidence level.
Not least: able to use R for basic statistics (no small feat!).
PS4781 Survey Design 17 / 19
How Does That Fit Into the Bigger Picture?
We've essentially mastered the single variable.
Now we are ready to move on and start thinking about two (or more) variables at the same time, thinking in terms of how one variable affects another.
PS4781 Survey Design 18 / 19
What Lies Ahead?
In the next three weeks we will be focusing on theory, hypotheses, and research design: basically, how we formulate questions that can be tested with quantitative data and, given a question, what research strategy will give us the best estimate of the underlying population relationship.
After that, we'll pick up where we leave this week and start testing hypotheses with some simple techniques: for example, comparing the means of different groups or the correlation between two variables, and testing whether the relationship is statistically significant.
In the last section of the course, we'll move away from the bivariate world and learn how to test hypotheses with multiple variables at a time using a technique called regression.
PS4781 Survey Design 19 / 19
Political Science 4781 Online
Techniques of Political Analysis
Spring 2014
Syllabus Version: 1.0
Instructor: Mr. Agnar F. Helgason
E-mail: helgason.2@osu.edu
Skype: agnar.f.helgason
Office Hours: On Skype, 10-11AM Tue and Fri
or by appointment.
© xkcd
Course Description
In this class, students will learn how social scientists leverage quantitative data to address interesting
research questions. We will talk about how to describe data, fruitful strategies to formulate theories,
how to measure important concepts, and how to design research projects. The most important
aspect of the course, however, will be in applying statistical methods to analyze data. This will
involve learning the foundations of statistical inference and applying those tools to a wide variety of
statistical problems, using a popular (and free!) statistical software program, R. While the majority
of the examples used in the class will be related to political science, we will also use examples from
other fields of study, such as sociology and the health sciences.
At the successful completion of the course, students should be able to critically evaluate social scientific research they come across, both in other courses and in their daily lives, and seek out and analyze data to address research questions that interest them, for example for a senior thesis.
Moreover, familiarity with statistical methods and statistical software is increasingly valued by
employers. If you do well, you can add completion of the course to your CV and, in particular,
that you have experience with using R.
A Note on the Mode of Instruction
The course will be delivered entirely online, with all lectures, assignments, exams, and feedback provided through Carmen. This gives students considerable flexibility in managing how they progress through the course, but with great flexibility comes great responsibility. While I will do my best to facilitate learning over the entire span of the course, it is ultimately the student's responsibility to keep up with lectures and assignments and to actively seek out help from the instructor if help is needed. Even though this is an online class, it is every bit as challenging as its face-to-face equivalent and requires the same time commitment from students. While this format suits the learning needs of many students, it is by no means the ideal format for all students.
Prerequisites
While much of this course is devoted to learning methods of statistical analysis, it is not necessary for
students to have an extensive mathematical background in order to gain a thorough understanding
of the material. However, the material does require extensive hands-on practice using statistical
software. Since this is an online course, it is expected that students are able (and willing) to install
and use the statistical software program R on their own. That said, I am here to help you learn
and will, of course, provide extensive information on how to perform data analysis in R.
From prior experience, three types of students have struggled with the class. First, students who expect this to be an easy online class with minimal work on their behalf. That is definitely not the case; the workload is the same as in the face-to-face section of the class. Second, students who are unwilling to put in the time and effort to learn how to perform statistical analysis using the R software. Since using R is an integral part of the course, that is a recipe for failure. Third, students who are unwilling or unable to receive assistance over email and/or Skype. I do my best to answer emails promptly, be available for Skype sessions on short notice and, within reason, be as helpful as possible to students who have questions or concerns about the course material. However, for that help to be of any use, students must proactively reach out to me and inform me of their questions and/or concerns and be willing to discuss matters via email or on Skype.
Course Objectives
This course satisfies the GE data analysis requirement, described by the University as follows:
Goals: Students develop skills in drawing conclusions and critically evaluating results based on data.
Expected Learning Outcomes: Students understand basic concepts of statistics and probability, comprehend methods needed to analyze and critically evaluate statistical arguments, and recognize the importance of statistical ideas.
Texts and Material
Required Texts
Rather than using a single textbook, we will be using two main textbooks that complement each other during the semester. One of the textbooks is geared towards applying quantitative methodology to political science, while the other is more grounded in the nuts and bolts of statistical methods.
Although they emphasize different aspects of the material, the books also overlap to some extent in their coverage. This is a feature, not a bug: learning statistics from two different sources can help shed light on difficult concepts and methods.
The following textbooks are required:
Paul M Kellstedt and Guy D Whitten. The Fundamentals of Political Science Research.
Cambridge University Press, New York, NY, 2nd edition, 2013. ISBN 978-1107621664
This book (henceforth KW) covers research design and empirical analysis in an integrated
fashion, with an emphasis on political science applications. I strongly recommend that you
get the 2nd edition of the book, which was published in 2013. Backwards compatibility with
the 1st edition is not guaranteed.
Alan Agresti and Barbara Finlay. Statistical Methods for the Social Sciences. Pearson Prentice
Hall, Upper Saddle River, NJ, 4th edition, 2009. ISBN 978-0130272959
This book (henceforth AF) provides a more in-depth treatment of statistical methods than KW. Although I will lecture from the 4th edition, using the 3rd edition (ISBN-13: 978-0135265260) is perfectly acceptable (a used version of the 3rd edition currently goes for about $10 on Amazon). [1]
While we will not read the following book from cover to cover, it is a handy reference on R.
Peter Dalgaard. Introductory Statistics with R. Springer, New York, NY, 2nd edition, 2008.
ISBN 978-0387790534
This book (henceforth PD) serves as our main text for learning the intricacies of R, the statistical package we will be using. It is available as a free electronic resource through the OSU library (i.e. there is no reason to purchase the book). You can choose to read the book online or download PDFs of the chapters. You can access the book's library page here.
The first two books have been put on 2-hour reserve at the Thompson library.
Computer Software
Quantitative social science research requires the use of computers. While there are many different statistical software packages available, we will be using a package called R, which is a free, open source programming language and environment for statistical computing. It can be downloaded at http://www.r-project.org/. To assist you in using R, I also recommend that you install RStudio, which is a free, open source interface to R that makes working with R significantly easier, particularly on Windows. You can download it for various platforms at http://www.rstudio.org/. In week 3, I will provide instructions on how to install both programs.
Data
Data necessary to complete assignments will either be available on Carmen or freely available for
download elsewhere on the web.
[1] Note that there are slight differences in section numberings between the editions. The class schedule is based on the 4th edition; I'll try to point out any relevant differences between the editions each week.
Assignments and Grading
Assessment for this course will be based on weekly quizzes, four homework assignments, and a comprehensive final exam. The final course grade will be determined based on the following breakdown and grading scale:
Weekly quizzes: best 12 out of 14 @ 2.5% each = 30%
Homework assignments: 4 @ 10% each = 40%
Final exam: 1 @ 30% = 30%
A 93-100%
A- 90-92%
B+ 87-89%
B 84-86%
B- 80-83%
C+ 77-79%
C 74-76%
C- 70-73%
D+ 67-69%
D 64-66%
D- 60-63%
E <60%
The quizzes are meant to test students' working knowledge of the materials covered in lectures and assigned readings. The quizzes will consist of about ten questions, which you will have to answer in a maximum of 25 minutes. Most weeks, the questions will be randomly chosen from a larger pool of questions, which means that you and your classmates will not necessarily be asked exactly the same questions. The quizzes are open book, i.e. you are free to check the book, videos, R, etc. for answers, but you are not allowed to consult with your classmates during the quiz. Quizzes are due by 11.59 PM (EST) on Sunday each week. After the deadline has passed, students will not be able to take that week's quiz and will receive a score of 0. (Protip: plan for contingencies and do not wait until the last minute to take the quiz; no extensions will be granted due to technical difficulties.) Students can miss (or do poorly on) 2 out of the 14 quizzes without it affecting their grade, since only the scores on the 12 best quizzes count toward their grade.
Homework assignments are meant to assist students in developing the problem-solving, analytic, and computer skills necessary to perform modern social scientific research. Assignments will give students the opportunity to engage deeply with the course material and provide them with hands-on experience in working with real-world data. The schedule below contains information on when assignments are due to the instructor; they should be handed in by 11.59 PM (EST) on Sunday in the week they are due. Students handing in homeworks 1, 2, or 3 (4) up to five (two) days late will incur a 20-point penalty. Homework will not be accepted after the five (two) days have passed. [2]
[2] The five-day limit applies to homeworks 1, 2, and 3, while the two-day limit applies to homework 4 (which is due quite close to the final exam).
The final exam will be comprehensive, covering all materials presented or assigned during the course. The exam will take place from 8 AM on April 23 to 11.59 PM on April 24. Students can choose when during the two-day period they take the exam, although once students have begun taking the exam, they only have 80 minutes to complete it. As with the quizzes, do not wait until the last minute to complete the final exam. Extensions based on technical problems that occur at the last minute will not be granted.
Extra Credit Opportunities
No individual extra credit opportunities will be made available over the semester. However, in an effort to encourage students to fill out the student evaluations of instruction (SEIs), there will be a group-wise extra credit granted if more than 75% of the students completing the course complete their SEI. Since the SEIs are anonymous, everyone in the class will receive the extra credit if the threshold is passed. The extra credit will add 3 extra points to your final grade.
Policies and Procedures
Academic Honesty
I expect all of the work you do in the course to be your own. Academic misconduct of any sort
will not be tolerated and will be reported to the university committee on academic misconduct and
handled according to university policy. The quizzes, homework assignments, and exam are
to be taken during the allotted time period without the aid of other students and/or
people.
Communication
The primary method of communication will be through Carmen. Please be sure that you have
access to Carmen and that you check it regularly.
I will have online office hours twice a week, Tuesdays and Fridays 10-11AM EST, and by appointment. If you'd like to make an appointment outside the regular office hours, please email me and we can find a mutually agreeable time to talk. Office hours will be conducted through Skype; please add agnar.f.helgason to your contacts prior to your first office hours appointment.
The class site on Carmen includes a discussion board where students can ask questions about the course. If you have questions of a general nature or ones that you believe might benefit other students, please consider posting them on the discussion board, rather than emailing me directly. Since this is the first time the course includes a discussion board, consider it an experimental feature. Please adhere to the (fairly reasonable) code of conduct when you post on the discussion board. Repeated violations will result in the discussion board being closed.
Disability Services
Students with disabilities that have been certified by the Office for Disability Services (http://www.ods.ohio-state.edu/) will be appropriately accommodated and should inform the instructor as soon as possible of their needs. The Office for Disability Services is located in 150 Pomerene Hall, 1760 Neil Avenue; telephone 292-3307, TDD 292-0901.
Exam Absences
If a student is unable to complete the final exam in the allotted time period, they will be allowed to schedule a make-up exam only if the absence is due to a documented medical, family, or similar serious emergency, observance of religious holy days (which requires written notification to the instructor at least 14 days prior to the exam), or properly documented University-sponsored planned activities. Absences in all other cases will result in a score of zero on the exam. If you become aware that you will not be able to attend the exam before the time of the exam, please contact the instructor and seek permission to be absent as soon as possible. If a make-up exam is granted, failure to complete it within the allotted time period will result in a score of zero on the exam.
Grade Disputes
Grade disputes will be considered only if they adhere to this policy. Grade disputes must be made in writing (typed!). Your written dispute must contain a documented logic for why you believe your answer for each disputed item was incorrectly marked: you must cite specific passages in the texts and/or lectures and explain why you thought they applied to the item in question. The instructor will then review your dispute and issue a decision within one week.
Weekly Schedule
Every week, I will post a short study guide for the week on Carmen. The guide will include recommendations for what parts of the readings you should focus on, key terms and concepts you should pay attention to, and anything else that might be important to that week's subject. Please begin each week by reading the schedule.
Class Schedule
The class schedule below is tentative, and is subject to change depending on how fast we get through the material. If the schedule does change, an updated syllabus will be posted to Carmen. Readings from each book are denoted with a # sign; for example, KW#5.7-5.12 means that you should read sections 7 through 12 of Chapter 5 in Kellstedt and Whitten (2013), and AF#1 means you should read all of Chapter 1 in Agresti and Finlay (2009). Full citations for all readings are given on the last page of the syllabus.
W Dates Topics and Readings Assignments
1 1/6 - 1/12 Introduction and Syllabus Review; Political SCIENCE? Q1 Due
Read: This syllabus, KW#1, AF#1
Part I
2 1/13 - 1/19 Descriptive Statistics Q2 Due
Read: KW#5.7-5.12, AF#2.1&3.1-3.4
3 1/20 - 1/26 Statistical Computing: Installing and Using R Q3 Due
Read: PD#1,2,4
Do: Code School's (2013) online introduction to R; exercises provided by the instructor
4 1/27 - 2/2 Probability and Sampling Q4 Due
Read: KW#6.1-6.3, AF#4
Recommended: PD#3
5 2/3 - 2/9 Statistical Inference and Condence Intervals Q5 & HW1 Due
Read: KW#6.4-6.5, AF#2.2-2.3 & 5
Part II
6 2/10 - 2/16 Theories and Hypotheses; Concepts and Measurement Q6 Due
Read: KW#2&5.0-5.6
7 2/17 - 2/23 Evaluating Causal Relationships Q7 Due
Read: KW#3 and AF#10-10.3
8 2/24 - 3/2 Research Design Q8 & HW2 Due
Read: KW#4
Part III
9 3/3 - 3/9 Hypothesis Testing Q9 Due
Read: KW#7.1-7.3 and AF#6 (excl. 6.6-6.7)
Recommended: PD#5.1
10 3/10 - 3/16 SPRING BREAK
10 3/17 - 3/23 Hypothesis Testing II Q10 Due
Read: KW#7.4-7.5 and AF#3.5&8.1-8.2&9.4
Recommended: PD#5.3,5.6,6.4,8.4
11 3/24 - 3/30 Bivariate Regression Q11 & HW3 Due
Read: KW#8 and AF#9 (excl. 9.4)
Recommended: PD#6.0-6.3
Part IV
12 3/31 - 4/6 Multiple Regression I Q12 Due
Read: KW#9 and AF#11.1-11.3
13 4/7 - 4/13 Multiple Regression II Q13 & HW4 Due
Read: KW#10 and AF#13.2-13.3
14 4/14 - 4/20 Logistic Regression & Review Q14 Due
Read: KW#11.1-11.2 and AF#15.1-15.2
16 4/23 - 4/24 FINAL EXAM
Readings
Alan Agresti and Barbara Finlay. Statistical Methods for the Social Sciences. Pearson Prentice Hall, Upper Saddle River, NJ, 4th edition, 2009. ISBN 978-0130272959.
Code School. Try R, 2013. URL http://tryr.codeschool.com.
Peter Dalgaard. Introductory Statistics with R. Springer, New York, NY, 2nd edition, 2008. ISBN 978-0387790534.
Paul M. Kellstedt and Guy D. Whitten. The Fundamentals of Political Science Research. Cambridge University Press, New York, NY, 2nd edition, 2013. ISBN 978-1107621664.
Tabular Analysis: The \(\chi^2\) test of independence
Tabular Analysis: The \(\chi^2\) test of independence 1 / 17
Chapter Outline
1. When to use the test?
2. Example from Kellstedt and Whitten
Tabular Analysis: The \(\chi^2\) test of independence 2 / 17
Outline of the session
1. When to use the test?
2. Example from Kellstedt and Whitten
Tabular Analysis: The \(\chi^2\) test of independence 3 / 17
Conditions
We use a \(\chi^2\) test when:
The dependent (response) variable is categorical
The independent (explanatory) variable is also categorical
The test is used to assess whether two categorical variables are statistically independent.
Tabular Analysis: The \(\chi^2\) test of independence 4 / 17
Assumptions
Always bear in mind the assumptions associated with the statistical tests we discuss. In this case we assume:
We have a random sample from the population
The sample size is large
If the sample-size condition doesn't hold, there are alternative tests available; they are discussed in Agresti and Finlay. We won't cover them in this class, but be aware of them.
Tabular Analysis: The \(\chi^2\) test of independence 5 / 17
Outline of the session
1. When to use the test?
2. Example from Kellstedt and Whitten
Tabular Analysis: The \(\chi^2\) test of independence 6 / 17
Tabular analysis: reading a table
Any time that you see a table, it is very important to take some time to make sure that you understand what is being conveyed in the table.
We can break this into the following three-step process:
1. Figure out what the variables are that define the rows and columns of the table.
2. Figure out what the individual cell values represent. Sometimes they will be the number of cases that take on the particular row and column values; other times the cell values will be proportions (ranging from 0 to 1.0) or percentages (ranging from 0 to 100). If this is the case, it is critical that you figure out whether the researcher calculated the percentages or proportions for the entire table or for each column or row.
3. Figure out what, if any, general patterns you see in the table.
Tabular Analysis: The \(\chi^2\) test of independence 7 / 17
Gender and Voting Behavior
Does gender affect voting behavior in U.S. presidential elections?
\(H_0\): Gender and voting behavior are statistically independent.
\(H_a\): Gender and voting behavior are statistically dependent.
Tabular Analysis: The \(\chi^2\) test of independence 8 / 17
Gender and vote in the 2008 U.S. presidential election: hypothetical scenario

Candidate      Male    Female   Row total
McCain         ?       ?        45.0
Obama          ?       ?        55.0
Column total   100.0   100.0    100.0

Note: Cell entries are column percentages.

What would we expect to find if there were no relationship between these two variables?
In other words, what values should replace the question marks in this table if there were no relationship between our independent variable (X) and dependent variable (Y)?
Tabular Analysis: The \(\chi^2\) test of independence 9 / 17
Gender and vote in the 2008 U.S. presidential election: Expectations for the hypothetical scenario if there were no relationship

Candidate       Male     Female   Row total
McCain          45.0      45.0      45.0
Obama           55.0      55.0      55.0
Column total   100.0     100.0     100.0

Note: Cell entries are column percentages.
Gender and vote in the 2008 U.S. presidential election

Candidate       Male     Female   Row total
McCain            ?         ?      1,434
Obama             ?         ?      1,755
Column total    1,379     1,810    3,189

Note: Cell entries are numbers of respondents.
Gender and vote in the 2008 U.S. presidential election: Calculating the expected cell values if gender and presidential vote are unrelated

Candidate   Male                             Female
McCain      (45% of 1,379)                   (45% of 1,810)
            = 0.45 × 1,379 = 620.55          = 0.45 × 1,810 = 814.5
Obama       (55% of 1,379)                   (55% of 1,810)
            = 0.55 × 1,379 = 758.45          = 0.55 × 1,810 = 995.5

Note: Cell entries are expectation calculations if these two variables are unrelated.
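A minimal sketch in R of this computation (object names are illustrative): under independence, each expected count is (row total × column total) / grand total. Note that the slide rounds the row proportions to 45%/55%, so its figures differ very slightly from the exact values (e.g., 620.1 rather than 620.55).

```r
# Expected cell counts under independence: (row total * column total) / N.
# Marginal totals are taken from the table above.
row_totals <- c(McCain = 1434, Obama = 1755)
col_totals <- c(Male = 1379, Female = 1810)
expected <- outer(row_totals, col_totals) / sum(row_totals)  # grand total = 3,189
expected
```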
Gender and vote in the 2008 U.S. presidential election

Candidate       Male     Female   Row total
McCain           682       752     1,434
Obama            697      1,058    1,755
Column total    1,379     1,810    3,189

Note: Cell entries are numbers of respondents.
Gender and vote in the 2008 U.S. presidential election

Candidate   Male                        Female
McCain      O = 682; E = 620.55         O = 752; E = 814.5
Obama       O = 697; E = 758.45         O = 1,058; E = 995.5

Note: Cell entries are the number observed (O) and the number expected if there were no relationship (E).

Among males, the proportion observed voting for Obama is lower than what we would expect if there were no relationship between the two variables. Also, among males, the proportion voting for McCain is higher than what we would expect if there were no relationship.
For females this pattern is reversed.
The pattern of these differences is in line with the theory that women support Democratic Party candidates more than men do.
To assess whether or not these differences are statistically significant, we turn to the chi-squared (χ²) test for tabular association.
The χ² test for tabular association
The formula for the χ² statistic is

    χ² = Σ (O − E)² / E.

The summation sign in this formula signifies that we sum over each cell in the table; so a 2 × 2 table would have four cells to add up.
If we think about an individual cell's contribution to this formula, we can see the underlying logic of the χ² test:
- If the value observed, O, is exactly equal to the value expected if there were no relationship between the two variables, E, then that cell contributes zero to the overall formula (because O − E would be zero).
- If all observed values were exactly equal to the values that we would expect if there were no relationship between the two variables, then χ² = 0.
- The more the O values differ from the E values, the greater the value of χ².
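As a quick check of the formula's mechanics, here is a minimal sketch in R; the O and E vectors hold the four cell values worked out above.

```r
# The chi-squared statistic by hand: sum over cells of (O - E)^2 / E.
O <- c(682, 752, 697, 1058)            # observed counts
E <- c(620.55, 814.5, 758.45, 995.5)   # expected counts if no relationship
sum((O - E)^2 / E)                     # ~ 19.79
```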
χ² calculation
Here are the calculations of χ² for our gender and voting example:

    χ² = Σ (O − E)² / E
       = (682 − 620.55)²/620.55 + (752 − 814.5)²/814.5 + (697 − 758.45)²/758.45 + (1,058 − 995.5)²/995.5
       = 3,776.1/620.55 + 3,906.25/814.5 + 3,776.1/758.45 + 3,906.25/995.5
       = 6.09 + 4.80 + 4.98 + 3.92 = 19.79

We need to compare that 19.79 with some predetermined standard, called a critical value, of χ².
If our calculated value is greater than the critical value, then we conclude that there is a relationship between the two variables; if the calculated value is less than the critical value, we cannot draw such a conclusion.
To make this evaluation, we need to calculate the degrees of freedom (df) for our test: df = (r − 1)(c − 1), where r is the number of rows in the table and c is the number of columns. In our table there are two rows and two columns, so df = (2 − 1)(2 − 1) = 1.
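Rather than reading the critical value from a printed table, you can ask R for it directly; a minimal sketch using base R's qchisq(), which returns quantiles of the χ² distribution:

```r
# Critical value of chi-squared with df = 1 at the .05 significance level.
qchisq(0.95, df = 1)   # ~ 3.84; our calculated 19.79 far exceeds it
```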
From χ² to statistical significance
You can find a table with critical values of χ² in Appendix A of Kellstedt and Whitten, or use R.
In R, we tabulate our data and apply the test to the table with the chisq.test command; see the R code below.
With df = 1, the critical value of χ² at the .05 level is 3.84. Our calculated value of 19.79 exceeds it, so the relationship between the two variables meets a conventionally accepted standard of statistical significance (i.e., p < .05).
Although this result is supportive of our hypothesis, we have not yet established a causal relationship between gender and presidential voting.
With a bivariate analysis, we cannot know whether some other variable Z is relevant because, by definition, there are only two variables in such an analysis. So, until we see evidence that Z variables have been controlled for, our scorecard for this causal claim is [y y y n].
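Here is a minimal sketch in R of the full test. The matrix holds the observed counts from the table above (the object name is illustrative), and correct = FALSE turns off the Yates continuity correction so that the output matches the hand calculation.

```r
# Chi-squared test of independence on the gender-by-vote table.
vote_table <- matrix(c(682, 697,     # Male column: McCain, Obama
                       752, 1058),   # Female column: McCain, Obama
                     nrow = 2,
                     dimnames = list(candidate = c("McCain", "Obama"),
                                     gender    = c("Male", "Female")))
chisq.test(vote_table, correct = FALSE)
# X-squared ~ 19.79, df = 1, p-value far below .05
```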
Theories and Hypotheses in Political Science
Chapter Outline
1. Political Science?
2. Good Theories Come from Good Theory-Building Strategies
   - Promising Theories Offer Answers to Interesting Research Questions
   - Identifying Interesting Variation
   - Learning to Use Your Knowledge
   - Examine Previous Research
3. How Do I Know If I Have a Good Theory?
Approaching Politics Scientifically: The Search for Causal Explanations
Political science is about the scientific study of political phenomena.
The scientific method involves:
- Developing falsifiable theories
- Testing those theories with data
Skepticism about scientific knowledge is inherent in the scientific method:
- We never prove a theory; we only fail to disprove it
- At best, our theory is supported by the data
- All knowledge is tentative
Good Theories Come from Good Theory-Building Strategies
A good theory is one that changes the way that we think about some aspect of the political world.
This is a tall order, and a logical question to ask at this point is: "How do I come up with such a theory?"
Unfortunately, there is neither an easy answer nor a single answer. Instead, what we can offer is a set of strategies.
Promising Theories Offer Answers to Interesting Research Questions
One of the main ways in which theories can be evaluated is in terms of the questions that they answer.
If the question being answered by a theory is interesting and important, then that theory has potential.
Identifying Interesting Variation
Because theories are designed to explain variation in the dependent variable, identifying some variation that is of interest to you is a good jumping-off point.
When we think about measuring our dependent variable, the first things that we need to identify are the time and spatial dimensions over which we would like to measure this variable:
- The time dimension identifies the point or points in time at which we would like to measure our variable.
- The spatial dimension identifies the physical units that we want to measure.
Identifying Interesting Variation
Throughout the course, we think about measuring our dependent variable such that one of these two dimensions will be static (or constant).
This means that our measures of our dependent variable will be of one of two types:
- The first is a time-series measure, in which the spatial dimension is the same for all cases and the dependent variable is measured at multiple points in time.
- The second is a cross-sectional measure, in which the time dimension is the same for all cases and the dependent variable is measured for multiple spatial units.
[Figure: Presidential approval, 1995–2005. A time-series measure: monthly presidential approval (vertical axis, roughly 40 to 90 percent) plotted by year/month.]
[Figure: Military spending in 2005. A cross-sectional measure: military expenditures as a percentage of GDP (roughly 0 to 5 percent) for a sample of countries, from Bangladesh, Chad, and Sierra Leone through the USA, Morocco, and Syria.]
Learning to Use Your Knowledge
It is helpful to know some specifics about politics, but it is also important to be able to distance yourself from the specifics of one case and to think more broadly about the underlying causal process:
- Moving from a specific event to more general theories
- Know local, think global: can you drop the proper nouns?
[Figure: Presidential approval, 1995–2005, shown again. Note the sharp jump in approval after September 2001, which motivates the chain of generalizations below.]
9/11 caused an increase in Bush's approval ratings
Terrorist attacks caused an increase in Bush's approval ratings
Terrorist attacks cause increases in presidential approval ratings
Conflict causes increases in presidential approval ratings
Mueller's "rally 'round the flag" effect
Examine Previous Research
As you examine previous research in your substantive courses or for a research project, keep the following list of questions in mind:
- What (if any) other causes of the dependent variable did the previous researchers miss?
- Can their theory be applied elsewhere?
- If we believe their findings, are there further implications?
- How might this theory work at different levels of aggregation (micro–macro)?
How Do I Know If I Have a Good Theory?
Does your theory offer an answer to an interesting research question?
Is your theory causal?
Can you test your theory on data that you have not yet observed?
How general is your theory?
How parsimonious is your theory?
How new is your theory?
Bonus: How nonobvious is your theory?
Types of Causal Relationships: Concepts and Terminology
Chapter Outline
1. A Typology of Causal Relationships
Bivariate Relationships
X → Y
Example: Smoking → Cancer
Spurious Relationships
X and Y are associated, but...
Z → X
Z → Y
Z is called an omitted variable.
Example: Height and math achievement are associated, but both increase with age.
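A minimal simulation sketch in R of this example (all coefficients are invented for illustration): age drives both height and math achievement, so the two are correlated even though neither causes the other.

```r
# Spurious association induced by an omitted variable Z (age).
set.seed(123)
age    <- runif(500, min = 6, max = 12)        # Z: age in years
height <- 80 + 6 * age + rnorm(500, sd = 5)    # X: rises with age
math   <- 10 * age + rnorm(500, sd = 8)        # Y: rises with age
cor(height, math)                              # sizeable raw correlation
coef(lm(math ~ height + age))["height"]        # near zero once age is controlled
```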
Chain Relationships
X1 → X2 → Y
X1 indirectly causes Y.
X2 is called an intervening variable or a mediator variable.
The association between X1 and Y disappears after controlling for X2.
Example: Parents' education → Offspring's education → Offspring's income
Different from spurious relationships: here X1 really does cause Y, just indirectly through X2.
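A minimal simulation sketch in R of a chain relationship (coefficients invented for illustration): X1 affects Y only through the mediator X2, so controlling for X2 makes the X1–Y association vanish.

```r
# Chain (mediated) relationship: X1 -> X2 -> Y.
set.seed(123)
x1 <- rnorm(500)               # e.g., parents' education
x2 <- 0.7 * x1 + rnorm(500)    # e.g., offspring's education (mediator)
y  <- 1.5 * x2 + rnorm(500)    # e.g., offspring's income
coef(lm(y ~ x1))["x1"]         # clearly positive: the indirect effect shows up
coef(lm(y ~ x1 + x2))["x1"]    # near zero once the mediator is controlled
```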
Conditional/Interactive Relationships
When Z is present: X → Y
When Z is absent: X has no effect on Y
The relationship between X and Y changes as the value of Z changes.
Z is called a moderator or moderating variable.
Example:
For daughters, mother's employment → higher GPA
For sons, mother's employment does not affect GPA
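A minimal simulation sketch in R of a conditional relationship (numbers invented for illustration): X affects Y only when the moderator Z equals 1, which shows up as a large X-by-Z interaction term.

```r
# Conditional (interactive) relationship: X matters for Y only when Z == 1.
set.seed(123)
z <- rbinom(500, size = 1, prob = 0.5)   # moderator: e.g., daughter = 1, son = 0
x <- rbinom(500, size = 1, prob = 0.5)   # e.g., mother employed = 1
y <- 2 * x * z + rnorm(500)              # effect of x is present only when z == 1
coef(lm(y ~ x * z))                      # x:z coefficient ~ 2; x alone ~ 0
```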
Types of Causal Relationships
[Summary figure: see AF, p. 315 in the 4th edition; p. 373 in the 3rd.]