A Concept-driven Approach to Measurement:

The Lexical Scale

John Gerring
Department of Political Science
Boston University

Svend-Erik Skaaning
Department of Political Science
Aarhus University

Paper prepared for presentation at the APSA annual meeting, Chicago, August 29-September 1,



This paper introduces a concept-driven method of scale construction, the lexical scale, that is viable
in situations where a concept can be meaningfully operationalized according to a series of necessary-
and-sufficient conditions arrayed in an ordinal scale. We offer several examples of how lexical scales
might be applied to key social science concepts, with a focus on electoral democracy, for which
we provide a new dataset extending back to 1800. We proceed to contrast the lexical scale with the
Mokken scale and conclude with an evaluation of the overall strengths and weaknesses of the
proposed approach.


The theoretical burden of social science is carried by highly abstract concepts such as democracy,
corruption, property rights, and rule of law. We require such concepts in order to articulate high-
order theories. Yet, they are difficult to operationalize, even when agreement can be reached on a
general definition for a term.
A key obstacle is aggregation. Faced with a number of indicators that seem relevant to a
concept, the researcher must decide how to combine them into a single index. Factor analysis,
structural equation models, and IRT models are three generic approaches to this problem (DeVellis
2011). Other approaches, such as those employed by Polity IV (Marshall, Jaggers 2007) and
Freedom House (2007) to measure various facets of regimes, are more idiosyncratic but nonetheless
Where attributes co-vary in a predictable fashion the problem of aggregation is easily solved.
But where they do not, as is generally the case with multivalent concepts, researchers are at pains to
solve the aggregation problem in a non-arbitrary fashion.
This paper introduces a novel approach to scale construction that builds on the properties of
concepts. Any attempt at measurement must be oriented around a concept, the thing that is
purportedly being measured. Indeed, problems of validity often stem from inattention to the core
concept and a consequent lack of fit between the term, its definition, and the indicators employed to
measure it (Adcock, Collier 2001). This much is generally recognized, even if not always achieved.
In this study, we look to concepts to do more than identify relevant indicators. Specifically,
we enlist the structure of concepts to solve the aggregation problem. This is accomplished by
regarding conceptual attributes as necessary-and-sufficient conditions arrayed in an ordinal scale.
Following Rawls (1971), we refer to this as a lexical scale.
The first section reviews the concept-guided approach to measurement with a focus on
binary scales. The following section lays out the construction of a lexical scale. In the third section,
we propose a Lexical index for the key concept of electoral democracy, providing an original dataset
including all countries and semisovereign entities from 1800 to 2008, which is then tentatively
explored in the context of research on state repression. The fourth section situates the lexical scale
in the measurement literature, with special reference to Mokken scales. The paper concludes with
some general observations about the strengths and weaknesses of the lexical scale relative to other
approaches to scale construction.
A few notes on terminology will be helpful before we begin. Concept formation refers to the
construction of a concept, including the choice of terms, defining characteristics, and referents.
Defining properties of a concept refer to attributes that provide its formal definition. Associated
properties are attributes that are thought to be associated with the defining properties, but not
definitional. Measurement refers to concept operationalization, i.e., the instructions or instruments
required to identify membership, or degrees of membership, in the extension of a concept. This
involves the construction of an indicator or index (a group of indicators combined in some fashion). A
scale is a generic type of indicator or index.

I. Concept-Driven Approaches

Concepts are indispensable at a fundamental level; one cannot apprehend reality without verbal
tools. However, in choosing how to operationalize a term social scientists often rely on the empirical
properties of things out there to construct an index. Concept formation is thus often subordinated
to measurement, as Sartori (1970: 1038) lamented some time ago.

This state of affairs is partly a product of the underdevelopment of concept-led approaches
to measurement. With the exception of a small body of work on concept formation and typologies
(Bailey 1994; Collier et al. 2012; Collier, Gerring 2009; Elman 2005; Gerring 2012: ch 5; Goertz
2006; Sartori 1984), and a few studies focused explicitly on the connection between
conceptualization and measurement (e.g., Adcock, Collier 2001; Munck 2009), no literature
addresses the question of how one might construct a scale based on the properties of a concept.
Despite efforts to overcome the qualitative/quantitative gap (Brady, Collier 2010), the twin tasks of
conceptualization and measurement seem far removed from one another.
Insofar as a tradition of concept-driven measurement exists it is located in the binary scale,
where a concept is operationalized as a series of 0s and 1s. Commonly, membership criteria consist
of one or more necessary conditions, jointly understood as necessary-and-sufficient. Occasionally,
sufficient conditions are invoked (Goertz 2006). In either case, binary scales are generally
constructed with a view to represent ordinary meanings and/or important theoretical properties of a
concept. This is central to the classical tradition of concept formation (Collier, Gerring 2009;
Sartori 1984) and to set-theoretic approaches to social science (Goertz 2006; Goertz, Mahoney 2012;
Schneider, Wagemann 2012). It is also implicit in experimental and quasi-experimental studies,
where treatments are usually understood in a binary fashion and are derived from a priori research
hypotheses (Shadish, Cook, Campbell 2002).
In the study of democracy, one binary scale, known as Democracy-Dictatorship (DD), has
come to play an especially influential role. According to Przeworski and colleagues (Alvarez et al.
1996; Cheibub et al. 2010: 69), a regime is a democracy if leaders are selected through contested
elections. To operationalize this conception, they identify four criteria:
1. The chief executive must be chosen by popular election or by a body that was itself
popularly elected.
2. The legislature must be popularly elected.
3. There must be more than one party competing in the elections.
4. An alternation in power under electoral rules identical to the ones that brought the
incumbent to office must have taken place.
Like many binary scales, the DD index adopts a minimal definition of the theoretical concept and
operationalizes that concept with one or more necessary conditions (in this case, four), all of which
must be satisfied in order to receive a score of 1 (=democracy).
The main complaint against binary scales when imposed on complex concepts is that they reduce
all aspects of that concept to two categories, converting a plethora of information into a series of 0s
and 1s (Elkins 2000). Of course, this procedure is perfectly reasonable (a) if the associated
properties of a concept are highly correlated with the binary division or (b) if the associated
properties are randomly distributed across the two groups. However, with non-experimental data
the likelihood of either (a) or (b) is slight. Associated properties of democracy are not likely to be
correlated perfectly with the binary scale; nor are they likely to be distributed in a random fashion.
Likely as not, they will be somewhere in between. This means that the resulting scale is difficult to
interpret. It is neither a proxy for democracy at-large nor a uniform treatment, and is apt to be
associated with confounders (other properties of democracy that don't align with the binary index).

Consider the DD index, as defined above, and consider an attribute that is probably associated with its defining
features such as executive constraints (EC). Let us suppose that EC can be conceptualized in a binary fashion, and let us
suppose that it correlates partially with the DD index: countries that are scored 1 on the DD index are more likely to
score 1 on a binary index of EC, but the association is not perfect. It follows that any causal analysis that places DD on
the right side of the model runs into a problem of omitted variable bias if EC is unmeasured and collinearity (and
possible endogeneity) if it is included. (We are presuming that EC may be affected by DD, but is not entirely the product

Binary scales play a vital role in social science and also constitute an important link to the
classical tradition of concept formation. However, they cannot carry all the freight that is sometimes
assigned to them.

II. A Lexical Scale

There may be a way to reconcile the virtues of conceptually driven scales with the need for more
differentiated scales. This, in brief, is the strategy of the lexical scale, which incorporates necessary-
and-sufficient conditions as distinct levels of an ordinal scale.
Before discussing the properties of the lexical scale and comparing it with other scales it will
be helpful to sketch out the procedure envisioned for the formation of such a scale. This procedure
begins by identifying relevant attributes of a concept. In order to make sure that all possible
attributes are considered, and none arbitrarily excluded, it is important to survey definitions and
usage patterns of a concept in ordinary language and in whatever specialized language region may be
relevant to the research. The initial culling should be as comprehensive as possible, excluding only
idiosyncratic features.

Next, one must arrange these attributes so that each serves as a necessary-and-sufficient
condition within an ordered scale. That is, each successive level adds one further condition,
so that the scale is defined in a cumulative fashion. Condition A is necessary and
sufficient for L1; conditions A&B are necessary and sufficient for L2; and so forth, as illustrated in
Table 1. If there are five levels to an index, five necessary conditions must be satisfied in order to
justify a score of 5. This means that each level in a lexical scale is defined by a set of conditions that
are both necessary and sufficient, fulfilling a goal of the classical concept. Note that the structure of
a lexical scale presupposes that there is a true zero, representing phenomena that do not meet the
first condition (~A).
[Table 1 about here]
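The cumulative logic can also be stated compactly. As a sketch in notation introduced here for illustration (the symbols c_k, K, and L are assumptions, not part of the original presentation), let c_k equal 1 if the k-th condition in the lexical ordering is satisfied for a given case and 0 otherwise, with K conditions in total. The score L is then the number of conditions satisfied consecutively from the bottom of the scale:

```latex
L \;=\; \sum_{k=1}^{K} \prod_{j=1}^{k} c_j
  \;=\; \max\bigl\{\, k \in \{0, 1, \dots, K\} : c_1 = c_2 = \dots = c_k = 1 \,\bigr\}
```

Under this formulation a case that satisfies A, B, and D but not C has c = (1, 1, 0, 1) and scores 1 + 1 + 0 + 0 = 2. Satisfying D contributes nothing once C fails, which is what distinguishes the lexical scale from a simple additive index.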
In achieving these desiderata four criteria must be satisfied: (1) binary values, (2)
unidimensionality, (3) qualitative differences, and (4) centrality or dependence.
First, each level in the scale must be measurable in a binary fashion without recourse to
arbitrary distinctions. It is either satisfied or it is not. To be sure, the construction of a binary
condition may be the product of a set of necessary and/or sufficient conditions. Collectively,
however, these conditions must be regarded as necessary and sufficient.
Second, levels in a lexical scale must be understood as elements of a single latent
(unobserved) concept. Empirical multidimensionality may persist, as discussed below. However,
conceptual multidimensionality must be eliminated, either by dropping the offending attribute
and/or by re-defining the concept in a clearer and perhaps more restrictive fashion.
Third, each level must demarcate a distinct step or threshold in a concept, not simply a
matter of degrees. Levels in a lexical concept identify qualitative differences. A 3 on a lexical scale
is not simply a midway station between 2 and 4. Indeed, each level may be viewed as a subtype
of the larger concept. Note that these sub-types are defined by cumulative combinations of the
attributes possessed by the full concept (A, A&B, A&B&C, and so forth), fulfilling the criterion of
a classical concept.

of DD.) Thus, even when binary coding is transparent and consistent with a common reading of the concept, as with
DD, the resulting index may engender confusion.
Examples of this sort of semantic surveying can be found in Collier, Gerring (2009) and Sartori (1984).

The most challenging aspect of lexical scale construction is the ordering of attributes, which
follows a conceptual (rather than empirical) logic. One attribute may be considered prior to another
if it is more central to the concept of theoretical interest (from some theoretical vantage point). This
follows a constitutive approach to measurement, where attributes are the defining elements of a
concept. Alternatively, one attribute may be considered prior if it is a logical, functional, or causal
pre-requisite of another. The dependence of B on A is what mandates that A assume a lower level on a
scale. Whether responding to considerations of centrality or dependence, the levels of a lexical scale
bear an asymmetric relationship to each other; some are more fundamental than others.
This is the most distinctive feature of a lexical scale, and clearly differentiates it from
Mokken scales (discussed below). While traditional measurement models consider attributes in an
independent and additive fashion, the lexical scale presumes that attributes can be understood only
according to their inter-relationship with one another. A concept is thus defined by specific
combinations of attributes.

Where lexical ordering is unclear a priori (according to considerations of centrality and
dependence) one is well-advised to consider the shape of the empirical universe. Specifically, if A is
always (or almost always) present where B is present, there may be grounds for considering A as
more central or more fundamental than B. However, any conclusions reached on the basis of an
exploration of empirical properties must be justified as a matter of centrality or dependence. Thus,
we regard the relative prevalence of attributes as a clue to asymmetric relationships among the
properties of a concept, not as a desideratum. In constructing a lexical scale, deductive
considerations trump data distributions.
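The empirical check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the attribute labels A and B, and the four coded cases are all hypothetical. It computes, among cases where one attribute is present, the share in which the other is present as well.

```python
from typing import Iterable, Mapping


def share_with_prerequisite(
    cases: Iterable[Mapping[str, bool]], a: str, b: str
) -> float:
    """Among cases where attribute `b` is present, return the share in which `a` is also present."""
    with_b = [case for case in cases if case[b]]
    if not with_b:
        raise ValueError(f"no case exhibits attribute {b!r}")
    return sum(1 for case in with_b if case[a]) / len(with_b)


# Hypothetical observations: four cases coded on two binary attributes.
cases = [
    {"A": True, "B": True},
    {"A": True, "B": False},
    {"A": True, "B": True},
    {"A": False, "B": False},
]

# Share of B-cases that also exhibit A, and the reverse comparison.
print(share_with_prerequisite(cases, a="A", b="B"))
print(share_with_prerequisite(cases, a="B", b="A"))
```

In this toy example A is present wherever B is present, while B is present in only two of the three cases exhibiting A. Such an asymmetry would be a clue that A might sit below B on the scale, but, as argued above, the ordering would still need to be justified on grounds of centrality or dependence.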

Deliberation and Disagreement
A product of deliberation, the lexical scale follows the procedure by which John Rawls (1971) orders
the three core principles that establish his theory of justice: (1) the Liberty principle, (2) the Fair
Equality of Opportunity principle, and (3) the Difference principle. These are arranged in order of
lexical (short for lexicographical) priority. That is, one should not consider 2 or 3 until 1 has been fully
satisfied, nor 3 until 1 and 2 are fully satisfied. Thus, each principle serves as a necessary condition
of the next, creating a lexical scale with four levels. The force of this argument hinges, in large part,
on a conceptual argument that the core meaning of justice is reflected in this particular ordering of
attributes, with category (1) understood as the most basic or essential (Moldau 1992).
Of course, Rawls was interested in defining the terms by which institutions within a society
could be established and justified. He was not interested in measuring the presence/absence or
degrees of justice in a society, and it is not clear whether he would have applied the same rules to
such a measurement instrument. Even so, his approach is remarkably similar to that which we
envision for empirical concepts in the social sciences. The lexical scale depends upon reaching a
reflective equilibrium with respect to the defining attributes of a concept and their lexical ordering.

Insofar as this solution is persuasive, the scale will be useful. Insofar as it strains the meaning of a
concept or theory it will seem arbitrary and forced, and is on that account unlikely to perform
any useful function in social science. A lexical scale must resonate with everyday usage of a word as
well as with considered judgments about what a concept should mean in a given theoretical context.
Just as there are disagreements over the meaning of justice, so there will be disagreements
over how to scale empirical concepts (concepts whose purpose is to capture something specific and
measurable out there). These may be placed into three categories.

This echoes the approach of qualitative comparative analysis (QCA) to empirical relationships more generally
(Schneider, Wagemann 2012).
On the concept of reflective equilibrium see Daniels (1996).

First, we must consider disagreements over the definition of a term. Evidently, many high-
order concepts in the social-science lexicon are contested (Collier et al. 2006), and this sort of
contestation necessarily affects a concept-driven approach to measurement (though it also affects
any other approach to measurement). In particular, it may affect the attributes included and excluded
as conditions in a scale.
Yet, because the lexical scale is developed by reference to a concept's core meaning, we
anticipate that disagreements over attribute inclusion are likely to affect positions on the periphery
of the scale. As such, they will impact only those cases whose value is determined by features at the
high end of the scale. For example, if two 7-point scales for the same concept differ in the chosen
attributes these differences are most likely to be located at levels 6 and 7 and least likely to be located
at levels 1 and 2. As such, only cases with the highest scores (i.e., those whose score is affected by
the 6th or 7th conditions) will be affected. It follows that measurement error arising from errors in the
hierarchy is more likely at the high end of the scale than at the low end.
The second sort of disagreement concerns the number of levels assigned to a lexical scale.
Evidently, scholars working on the same concept may produce scales of differing lengths. Indeed,
the decision about when to aggregate and when to disaggregate attributes (to form conditions) is
somewhat arbitrary, hinging on contextual matters such as the sort of data that is at hand and the
use envisioned for the scale. For example, a scale developed for use on the right side of a causal
model (X) is likely to be more concise than a scale developed for use on the left side of a causal
model (Y), as discrimination is vital in measuring outcomes whereas in constructing independent
variables one usually strives to limit the number of treatments.
In principle, there is no limit to the number of levels in a lexical scale. In practice, we
anticipate that the number of levels will rarely surpass ten. This is because some concepts do not
have a great many (truly distinct) attributes and because, even where a great many attributes are
present, they are unlikely to follow the logic of lexicality. In any case, disagreement over scale length
is not critical, as differently-sized scales will co-vary so long as the identifiable elements are ordered
in the same fashion. Compare two hypothetical scales: I (A-B-C) and II (A1-A2-B1-B2-C1-C2), where
the latter disaggregates each element of the former into two components. These alternate scales for
the same concept are different insofar as one has more levels than the other; but they are not in
conflict with each other.
A third, more damaging, sort of disagreement concerns the lexical priority of different
elements. If researchers cannot agree on how to prioritize the attributes of a concept there is little
hope of arriving at a useful lexical scale or, to put the matter differently, each scale will be useful
only in a very narrow context and may seem idiosyncratic. This sort of basic-level disagreement is
encountered whenever the attributes of a concept bear no apparent relationship, functional or
logical, to each other or where multiple attributes are equally important to the core concept.
While there is no simple solution to this situation, one strategy is to redefine the boundaries
of the concept in a narrower fashion so as to exclude elements that cannot easily be integrated. This
may be understood as a shift from a background concept to a systematized concept (Adcock, Collier
2001), and causes no damage so long as the re-definition is plausible (it does not strain the meaning
of the core concept), or can be communicated by a compound noun that makes clear how the
narrower concept relates to the parent concept. Accordingly, we eschew democracy in favor of a
diminished subtype, electoral democracy.

Examples in Brief
As with most things, it is easier to grasp the workings of a method when specific examples are
brought into view. In this section we provide a cursory exploration of possible lexical indices for the
following well-known concepts: (1) civil liberty, (2) party strength, and (3) rule of law. Each
discussion begins with a brief definition, clarifying how we understand the concept. This is followed
by a proposed index, following the principles of the lexical scale (as laid out above).
Civil liberty is a human right, as well as a key component of democracy. Here, we understand
civil liberty as a property of a government, including parties, civil society groups, and paramilitary
groups that are closely associated with that government. We intend to measure the extent to which
governments respect civil liberties, not the extent to which civil liberties exist in a society. An
important caveat is that we are concerned with the actions of government relative to the citizens of a
polity. Its actions towards non-citizens (foreign nationals) lie outside the boundaries of our concept.
Finally, it should be clarified that we are concerned primarily with the relevance of civil liberty to
citizens. That is, we order the components of the index according to their presumed importance to
individuals living in that society.
This is how we arrive at judgments of relative centrality across
attributes. With these qualifications, the following Lexical index is proposed:
0. Political and extra-judicial killings.
1. No political and extra-judicial killings. The government does not organize or condone
arbitrary killings or the killing of dissidents or of citizens based on their ascriptive
characteristics (e.g., ethnic minorities).
2. No torture. The government does not organize or condone the torture of dissidents or of
citizens based on their ascriptive characteristics (e.g., ethnic minorities).
3. Due process. The government does not arbitrarily arrest, imprison, or harass its citizens.
4. Free movement. The government does not restrict movement and residence within the
5. Free discussion. The government does not restrict discussion in private arenas (among
family, friends).
6. Free public speech. The government does not restrict speech in public arenas including the
7. Free association. The government does not restrict association, including political parties,
labor unions, religious organizations, and other civil society organizations.
Political parties may be defined minimally as organizations that nominate officials for public
office, a key function in most theories of representative democracy (Schumpeter 1950). Within this
context, the relative strength of these organizations is often considered to be an important
component of democracy and perhaps also of good governance (Ranney 1962; Schattschneider
1942). Party strength within an authoritarian context may also be important, e.g., for regime stability.
However, measuring party strength in this context would require a different sort of scale since
parties function quite differently in authoritarian contexts (Brownlee 2007). Thus, our focus is
restricted to parties as they operate within minimally democratic settings. Party strength is
understood here as the mean strength of all political parties that gain entrance into the legislature,

For further examples of concept operationalizations that seem to follow the logic of a lexical scale one might consider
constitutionalism (Nino 1998: 3-4), human security (Tadjbakhsh & Chenoy 2007: Ch.2), peasants (Kurtz 2000: 96), and liberal
democracy (Howard, Roessler 2006; Møller, Skaaning 2013).
Of course, one might argue that it is the responsibility of governments to protect civil liberties, regardless of who is
infringing upon them. Nonetheless, it seems important to distinguish between the actions of governmental and non-
governmental actors.
One might also consider the issue from the perspective of Rawls's original position: which attributes would one
consider most important in establishing a society de novo?

and is not to be confused with party system strength (the durability of a set of parties within a polity).
With these clarifications, we propose the following Lexical index:

0. Not allowed. Parties are not allowed to organize.
1. Allowed. Parties are allowed to organize. If the system is minimally democratic, the state
may restrict entry to small parties judged to be hostile to democratic principles.
2. Independence. Parties are independent of the state (e.g., the bureaucracy, the military) and
independent of each other (though naturally members of a coalition will be to some
extent constrained by coalition agreements).
3. Defections rare. Party officials rarely leave their party voluntarily (to join another party or
to continue their political career as an independent). Expulsions and retirements are not
counted as defections.
4. Legislative cohesion. Members of a party usually vote together in the legislature.
5. Centralization. Parties do not have strong factions or regional strongholds with distinct
organizational structures; important decisions over policy and candidate selection are
made at the center, or can be overturned by central party leadership.
6. Programmatic. Parties publicly embrace policies and ideologies that are relatively
The rule of law is a virtually universal political ideal which has in recent decades been
identified as crucial for economic and human development (Tamanaha 2004: 1-4). Among the
varying definitions of this concept, most of the attributes may be understood along a continuum
from thin to thick conceptions of the concept (Bedner 2010; Møller, Skaaning 2012; Tamanaha
2004; Trebilcock, Daniels 2008: 12-13). In this spirit, we suggest the following Lexical index:
0. No rule by law.
1. Rule by law. Law is used as an instrument for government action.
2. Formal legality. Laws are general, clear, prospective, certain, and equally applied.
3. Institutional checks. An institutionalized system of government characterized by checks and
balances, including an independent judiciary and penalties for misconduct.
4. Civil liberties. Liberal (negative) rights in the form of physical integrity rights and First
Amendment-type rights are safeguarded.
5. Democratic consent determines laws. The citizens, through their elected representatives, are the
ultimate source of laws.

It should be clear that our purpose in offering this set of examples is to demonstrate the
potential applicability of the lexical scale to a diverse array of concepts in the social sciences, not to
make claims about particular indices. Evidently, a great deal more work would need to be done
before promulgating the foregoing indices for general consumption, and especially in
operationalizing each condition.

III. Electoral Democracy

The index does not include a consideration of party nationalization. In our view, parties may be strong while also being
rooted in particular regions, as in the United Kingdom.

Having explored several concepts in a cursory fashion, we now turn to a more extended example.
Our focus will be on electoral democracy, the idea that democracy is achieved through competition
among leadership groups which vie for the electorate's approval during periodic elections before a
broad electorate. This view of our subject is sometimes referred to as polyarchy, contestation, or as
the competitive, elite, minimal, realist, or Schumpeterian conception of democracy (Dahl 1956,
1971; Przeworski et al. 2000; Schumpeter 1950).
In identifying attributes for possible inclusion we are mindful of the vast literature on this
topic, with special attention to linguistic studies of the concept (e.g., Held 2006; Lively 1975; Naess
1956) and foundational works on the electoral conception of democracy (see above). From these
attributes we tried to identify those that were most central to the concept, combining attributes
where they seemed to express the same general idea and discarding attributes if they seemed
peripheral to the electoral conception of democracy. Conveniently, most of these attributes can be
understood in a binary fashion (as being present/absent). Equally important, it seemed plausible to
regard them as cohering to a unidimensional latent variable, i.e., more/less electoral democracy.
To order these attributes in a single index we attempted to gauge their relative centrality or
interdependence. In this fashion, the existence of elections was judged fundamental since no other
attributes commonly associated with electoral democracy make any sense outside of an electoral
context. One could not say, for example, that Country A is more of an electoral democracy than
Country B if neither polity holds elections, regardless of what other characteristics those polities
might possess. Likewise, some of the attributes depended upon other attributes in a logical manner.
For example, one cannot hold multi-party elections unless an electoral regime is in place. Finally,
some of the attributes seem to depend for their meaning on other attributes in a functional manner.
For example, the extent of suffrage allowed in an election is of little import unless elections count
for something, i.e., unless they allow for multi-party competition and the most important
policymaking offices are elective. Whether or not suffrage is restricted or universal in North Korea is
not a question that is generally regarded as consequential. (North Korea would hardly be less
democratic if it decided to restrict access to the ballot, while changing no other aspect of its
totalitarian system.)
After considerable deliberation, we arrived at a Lexical index with six conditions and seven
levels, as follows:
0. No elections. Elections are not held for any national-level policymaking offices. This
includes situations in which elections are postponed indefinitely or the constitutional
timing of elections is violated in a more than marginal fashion.
1. Elections. There are regular national elections.
2. Multi-party elections. Opposition parties are allowed to participate in legislative
elections and to take office.
3. Executive accountability. The chief executive is accountable either directly to the
electorate or indirectly to an elected parliament.
4. Competitive elections. The chief executive offices and the seats in the effective
legislative body are filled by elections characterized by uncertainty, meaning that the
elections are, in principle, sufficiently free and fair to enable the opposition to gain
5. Male or female suffrage. Virtually all adult male or female citizens are allowed to vote in
national elections.

6. Universal suffrage. Virtually all adult citizens are allowed to vote in national elections.
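The scoring rule implied by this index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the coding procedure used for the dataset: the condition labels are shorthand introduced here for levels 1 through 6, and the example polities are hypothetical.

```python
from typing import Mapping

# Conditions in lexical order, lowest level first (shorthand for levels 1-6 of the index).
CONDITIONS = (
    "elections",
    "multi_party_elections",
    "executive_accountability",
    "competitive_elections",
    "male_or_female_suffrage",
    "universal_suffrage",
)


def lexical_score(case: Mapping[str, bool]) -> int:
    """Count conditions satisfied in order, stopping at the first one that fails."""
    score = 0
    for condition in CONDITIONS:
        if not case.get(condition, False):
            break
        score += 1
    return score


# Hypothetical polity: elections with universal suffrage but no opposition parties.
one_party_state = {"elections": True, "universal_suffrage": True}

# Hypothetical polity: competitive elections, but suffrage below the level-5 threshold.
restricted_franchise = {
    "elections": True,
    "multi_party_elections": True,
    "executive_accountability": True,
    "competitive_elections": True,
}

print(lexical_score(one_party_state))       # 1
print(lexical_score(restricted_franchise))  # 4
print(lexical_score({}))                    # 0
```

The first example reflects the point made above about suffrage: a wide franchise does not raise a polity's score when the lower-order conditions, such as multi-party competition, are not met.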
Several features of this index deserve clarification. First, our reading of the concept of
electoral democracy is fairly narrow. We do not presume that the electorate is encompassing.
Indeed, it may be highly restrictive, though it must be separable from, and much larger than, the
group of officials it is charged with selecting. Thus, South Africa under Apartheid may be considered
an electoral democracy for white South Africans, just as Ancient Athens functioned as a direct
democracy for male citizens. Likewise, in measuring suffrage we take a juridical approach. Suffrage is
achieved when constitutionally prescribed, even though local or informal practices may impede the
achievement of this right (as in the American South prior to the Civil Rights movement). This is
consistent with the usage of the concept by Schumpeter, Dahl (for polyarchy), and Przeworski, and
with many extant indices such as the Polity2 index.
Second, indirect elections do not qualify as elections unless the electors endorse specific
candidates or parties, as in US presidential elections. In the absence of explicit endorsements,
indirect elections usually serve to restrict competition, allowing the in-group to monopolize political
power. Third, electoral democracy does not presume complete sovereignty. A polity may be
constrained in its actions by other states, by imperial control (as over a colony), by international
treaties, or by world markets. In sum, to say that a polity is an electoral democracy, according to our
proposed index, is to say that it functions as such (a) for those who are allowed to vote and (b) for
policies over which it enjoys decision-making power.
To be sure, expansions of this rather minimalist approach to electoral democracy can be
envisioned. One might, for example, include a seventh condition measuring high electoral integrity
(i.e., only minor irregularities with respect to intimidation, disruption, violence, registration, and high
respect for political liberties such as freedom of expression, assembly, and association). Lexical
indices for a given concept can usually be crafted so as to encompass a greater or lesser number
of levels, as noted. For present purposes, we restrict ourselves to what might be considered the most
basic or fundamental aspects of electoral democracy. (Extant sources do not provide sufficient
guidance on the aforementioned factors to allow for historical coding, as set forth in the next
section.)
The reader should be aware that electoral democracy identifies one aspect, or dimension, of
the parent concept. Other dimensions of democracy (sometimes articulated as liberal, deliberative,
egalitarian, or participatory) are ignored. Presumably, lexical scales might also be developed for
these concepts.
In any case, the utility of this definition and operationalization of democracy (like all others)
rests ultimately on how faithfully it explains the world around us. Specifically, the electoral
interpretation of democracy rests on a theory about which aspect of democracy has the greatest
impact on governance, wellbeing, and on other aspects of democracy (liberal, deliberative, etc.).
Schumpeter, Przeworski, and many other writers were convinced that the electoral component of
democracy is causally exogenous, and it is on this basis that we separate it out from other
dimensions. Likewise, our electoral index is premised on an idea of which features of electoral
democracy are likely to be most fundamental. It is on this basis that we included some attributes and
excluded others, and arrived at a lexical ordering of those that were included. In the following
sections, we offer a brief exploration of how the Lexical index might be applied to the world of

A New Dataset
To apply the Lexical index of electoral democracy in the broadest manner possible we code all
sovereign or semisovereign countries from 1800 to 2008, generating a new dataset with 228
countries and 17,176 country-years. Crossnational coding sources include the Political Institutions
and Political Events (PIPE) dataset (Przeworski et al. 2013) and the Dataset of Political Regimes
(Boix et al. 2013). In addition, we consult numerous books, articles, and reports in order to code
missing observations, to corroborate or revise codings suggested by the principal sources, and to
create a new indicator of competitive elections. A detailed description of sources and coding
procedures is contained in Appendix A.
In order to get a feel for the application of this index we provide country scores for the
median year in our sample, 1904, as shown in Table 2. At that time, there were fifty-three
independent countries in the world. These were distributed fairly evenly across the 7 categories of
the Lexical index, with the exception of the most democratic category (6), which has only one
occupant. (Only Australia granted universal suffrage to both men and women, while satisfying the
other criteria stipulated in the index.)
[Table 2 about here]
A frequency distribution of scores across the entire 1800-2008 period is provided in Table 3.
It will be seen that the most populated categories are L0, L1, L3, and L6, while others (notably L5)
have fewer occupants. Although a fairly high proportion of cases stack up at the two ends of the
index, the distribution of cases is not bimodal; bimodality is a problem that affects ordinal/interval
indices such as PR and Polity2 (as noted by Cheibub et al. 2010: 77; see also Treier, Jackman 2008).
The distribution of cases changes over time, as one might expect. This feature may be
portrayed in a stacked graph, as shown in Figure 1. Here, the width of each category represents the
number of countries falling into that category for each year from 1800 to 2008. Note that our
sample grows over time from 27 in 1800 to 194 in 2008 due to the appearance of newly
sovereign states (e.g., in Africa) and the break-up of sovereign states (e.g., the Soviet Union).
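The tabulation behind a stacked graph of this kind can be sketched as follows. The records are hypothetical and serve only to show the bookkeeping: for each year, count the countries that fall into each of the seven levels.

```python
from collections import Counter, defaultdict

# Hypothetical country-year records: (country, year, lexical score).
records = [
    ("A", 1900, 0), ("B", 1900, 1), ("C", 1900, 1),
    ("A", 1950, 1), ("B", 1950, 4), ("C", 1950, 6), ("D", 1950, 6),
]


def counts_by_year(rows, levels=range(7)):
    """Count countries at each level (0-6) for every year in `rows`."""
    tally = defaultdict(Counter)
    for _country, year, score in rows:
        tally[year][score] += 1
    # A Counter returns 0 for missing keys, so every level gets an entry.
    return {year: [tally[year][lvl] for lvl in levels] for year in sorted(tally)}


table = counts_by_year(records)
for year, counts in table.items():
    print(year, counts)
```

Each row of the resulting table gives the heights of the seven bands of the stacked graph for one year; the row sum is the number of countries in the sample that year.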
[Figure 1 about here]
Figure 1 also illustrates a key feature of a lexical scale, namely the possibility of decomposing
a concept into constituent parts, parts that are intrinsically meaningful because they represent
qualitative (i.e., step or threshold) differences. Each lexical scale contains an implicit typology. This
allows one to interpret category membership and changes in membership over time in a
meaningful way. For example, one can evaluate the composition of political regimes at each point in
time over the past two centuries. In 1800, polities were predominantly of type 0 (no elections). Later
in the nineteenth century, we see the rise of types 1-5 and the concomitant decline of type 0. This is
the most diverse period, when no single type is dominant, as illustrated by our snapshot of the world
in 1904 (see Table 2). Over the course of the 20th century, we can see the extraordinary rise of type 1
(elections without multiparty competition), followed by a steep decline beginning in the 1980s and
coincident with the Third Wave of democratization (Huntington 1991). Type 3 (multi-party
executive and legislative elections without real competition), which had been a modestly sized
category for a century, begins to grow in the late 20th century to the point where it constitutes the
second-most dominant regime-type. It is worth noting that this regime-type is quite similar to
polities described as competitive authoritarian (Levitsky, Way 2002) or limited multi-party (Hadenius,
Teorell 2007). The other striking pattern over the past century is the rise of type 6, the highest level
of our Lexical index, corresponding to polities that satisfy all six criteria. This category now
comprises over half of all polities in the world. Figure 1 also offers insight into those periods in
which electoral democracy advanced throughout the world (e.g., at the end of World War I, World
War II, and the Cold War), as well as those periods in which it declined (e.g., the 1930s).
By way of contrast, interval indices such as the Unified Democracy Scores (Pemstein et al.
2010) allow one to track only the overall increase and decrease in democracy. Ordinal indices such as
Political Rights (or, perhaps, the Polity2 scale) are similar in this respect, given that the levels in the
index do not indicate categories that are qualitatively different. As such, extant indices are capable
of indicating overall trends (more or less democracy) but they cannot indicate anything about the
specific content (quality) of regimes, or about which regime-types expanded or contracted at
different points in time. The latter information is both substantively important and useful for
tracing causal mechanisms.

In comparison, the coverage of the Polity2 variable for the same period is 189 countries and 15,823 country-years.

As is to be expected, our Lexical index generally co-varies with other measures of the same
general concept, as shown in Table A1. For example, it correlates with Polity2 at 0.79 and with the
Political Rights index at 0.85 (Spearman's rho). When the highest-scoring cases (Lexical=6), i.e.,
those codings likely to be least controversial, are dropped, these correlations fall to 0.59 and 0.43, respectively.
These are not especially high correlations, suggesting that empirical relationships with political
regime type as the independent or dependent variable are likely to reveal different patterns when
explored with our Lexical index, relative to extant (interval, ordinal, and binary) indices.
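The comparison just described (a rank correlation computed on the full sample and again after dropping the top category) can be sketched as follows. The two score vectors are invented for illustration and are unrelated to the figures reported in Table A1; Spearman's rho is computed here as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for ten country-years on two indices.
lexical = [0, 1, 1, 2, 3, 4, 5, 6, 6, 6]
other = [-9, -7, -3, -6, 2, -1, 5, 9, 10, 10]

rho_all = spearman_rho(lexical, other)

# Repeat the calculation after dropping the top category (Lexical = 6).
kept = [(a, b) for a, b in zip(lexical, other) if a < 6]
rho_subset = spearman_rho([a for a, _ in kept], [b for _, b in kept])

print(round(rho_all, 2), round(rho_subset, 2))
```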
In sum, the Lexical index allows us to represent more information than is possible in a
binary scale (e.g., DD) but without the ambiguities of a composite scale whose composition is
derived either from complex models (e.g., UDS) or from less formulaic but often opaque weightings
across dimensions (e.g., Freedom House and Polity). The level affixed to a country in a particular
year is immediately interpretable. The cost is that elements of democracy not associated with a
narrow definition of electoral democracy are left aside.

Empirical uses
One purpose of the Lexical index of electoral democracy is univariate and descriptive: to
differentiate regime-types in the world (e.g., Table 2) and to portray changes over time (e.g., Figure
1). Another use is to show relationships between regime-type and other factors. Such inter-
relationships may be descriptive, causal, or putatively causal (where causal inference is suspected but
not entirely proven).
Consider the vaunted democratic peace hypothesis (Brown et al. 1996). While a new scale of
democracy will assuredly not solve this obdurate research question by itself, it does allow a more
nuanced test of the thesis (at least as pertains to the electoral components of democracy).
Specifically, we can explore whether there is a specific level in the Lexical index beyond which
conflict between nations ceases to occur, and whether one or both members of the dyad must
surpass that threshold. This is arguably more informative than a binary or ordinal/interval analysis
of the problem.
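A threshold test of this kind might be set up along the following lines. The dyad-year records are invented, and the function names are hypothetical; the sketch only shows how, for a candidate threshold, dyads can be split according to whether both, one, or neither member reaches that level.

```python
# Hypothetical dyad-years: (score of state A, score of state B, conflict?).
dyads = [
    (6, 6, False), (6, 5, False), (4, 6, False), (4, 4, False),
    (6, 2, True), (3, 3, True), (1, 5, False), (0, 0, True), (2, 6, False),
]


def conflict_rate(rows):
    """Share of dyad-years with conflict; None if there are no rows."""
    rows = list(rows)
    if not rows:
        return None
    return sum(1 for _a, _b, conflict in rows if conflict) / len(rows)


def rates_by_threshold(rows, threshold):
    """Conflict rates for dyads in which both, exactly one, or neither
    member reaches `threshold` on the Lexical index."""
    both = [r for r in rows if min(r[0], r[1]) >= threshold]
    one = [r for r in rows if max(r[0], r[1]) >= threshold > min(r[0], r[1])]
    neither = [r for r in rows if max(r[0], r[1]) < threshold]
    return {
        "both": conflict_rate(both),
        "one": conflict_rate(one),
        "neither": conflict_rate(neither),
    }


for k in range(1, 7):
    print(k, rates_by_threshold(dyads, k))
```

Scanning the output across thresholds shows whether there is a level at which the "both" rate drops to zero while the "one" rate does not, which is the pattern the dyadic version of the hypothesis would predict.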

The 7-point Political Rights index, constructed by Freedom House, is explained as follows: 1. Enjoy a wide range of
political rights, including free and fair elections. Candidates who are elected actually rule, political parties are competitive,
the opposition plays an important role and enjoys real power, and minority groups have reasonable self-government or
can participate in the government through informal consensus. 2. Have slightly weaker political rights than those with a
rating of 1 because of such factors as some political corruption, limits on the functioning of political parties and
opposition groups, and foreign or military influence on politics. 3-5. Include those that moderately protect almost all
political rights [and] those that more strongly protect some political rights while less strongly protecting others. The
same factors that undermine freedom in countries with a rating of 2 may also weaken political rights in those with a
rating of 3, 4, or 5, but to an increasingly greater extent at each successive rating. 6. Very restricted political rights. They
are ruled by one-party or military dictatorships, religious hierarchies, or autocrats. They may allow a few political rights,
such as some representation or autonomy for minority groups, and a few are traditional monarchies that tolerate political
discussion and accept public petitions. 7. Few or no political rights because of severe government oppression,
sometimes in combination with civil war. They may also lack an authoritative and functioning central government and
suffer from extreme violence or warlord rule that dominates political power. Downloaded 1/16/2013 from

As a second example, one might consider the equally contested relationship between
development and democracy (Przeworski et al. 2000). With democracy on the left side of the model,
one may investigate whether the empirical relationship of socio-economic development to electoral
democracy is different at various points in the Lexical index. Do increases in per capita GDP have a
greater impact on electoral democracy at certain thresholds? With democracy on the right side of the
model, one might investigate whether different thresholds of electoral democracy have varying
relationships to economic growth (Gerring, Skaaning 2013). For example, does the initial transition
to multiparty elections have a different impact on growth performance than the transition to
competitive elections?
Another research agenda that lends itself to reassessment is the role of democracy in
structuring state repression of personal integrity rights (Davenport, Armstrong 2004). Here, we shall
explore the empirical relationship in order to demonstrate in greater detail how the Lexical index
may be brought to bear on a (putatively) causal relationship.
Democracies are expected to be less repressive than autocracies for a variety of reasons. A
democratic framework is thought to promote tolerance; low respect for human rights may be
punished by the electorate at the ballot box; and political participation and contestation provide an
outlet for protests and secure legitimacy in the broad population, alleviating the extra-constitutional
challenges that often spur violent government repression. Extant theory thus presents a strong
prima facie case for political regime type as an influence on state repression.
However, it is not clear what the precise empirical relationship might be. Extant work on the
subject suggests three possible patterns. As summarized by Davenport and Armstrong (2004: 538-
39) [hereafter D&A]: (1) with every step toward democracy, the likelihood of state-related civil
peace is enhanced; (2) human rights conditions are not only improved when full democracy exists
but also when full autocracy is present; or (3) there may be some threshold of domestic
democratic peace, below which there is no effect of democracy on repression, but above which a
negative influence can be found.
To investigate this question we adopt the empirical format employed by D&A, with some
minor modifications to update the analysis to include all years until 2008. State repression is
measured by the widely used Political Terror Scale (PTS), based on the State Department human
rights country reports (Wood, Gibney 2010). Like D&A, we use OLS (TSCS) regression to
assess the model, and we employ a similar battery of covariates: interstate armed conflict (UCDP,
categories 1-3 collapsed), internal armed conflict (UCDP, categories 1-3 collapsed), military
dictatorship (Cheibub et al. 2010), population (ln) (PWT), GDP/cap (ln) (PWT), and a one-period
lag of the outcome.
Following D&A, democracy is measured by the 11-point Polity Democracy index (scaled
from 0 to 10) drawn from the Polity IV dataset. Our second measure of democracy is the Lexical
index, with one coding change. Note that since data on state repression is available only from 1976,
there is little variation in suffrage laws during the observed period. Distinctions across levels L4-L6
of the Lexical index are therefore rendered moot, prompting us to collapse L4-L6 into a single
category (L4). The resulting index has five levels (L0-L4, with roughly equal membership) rather
than seven (L0-L6), and is otherwise identical to the index described above.
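The recoding amounts to capping the score at 4. A minimal sketch, with a hypothetical function name:

```python
def collapse_top_levels(score, top=4):
    """Collapse all levels at or above `top` into `top` (here L4-L6 -> L4)."""
    if not 0 <= score <= 6:
        raise ValueError("Lexical scores range from 0 to 6")
    return min(score, top)


print([collapse_top_levels(s) for s in range(7)])  # prints [0, 1, 2, 3, 4, 4, 4]
```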
But how does democracy impact state repression? In this discussion we shall leave aside the
knotty problem of causal inference. This is not to gainsay the importance or the knottiness of the
problem. Rather, it is because issues of causal inference are extraneous to this comparison, whose
purpose is to illustrate the value-added of a lexical approach to measurement.

Apart from Lexical, all data used is taken from the QoG standard dataset 2013. Downloaded 08/27/2013 from
Moreover, the analyses are limited to overlapping country-years in order to increase the comparability of the results.

In Table 4, we begin by testing the possibility of a linear relationship between democracy and
state repression. Polity (Model 1) and Lexical (Model 2) both indicate a negative relationship: more
democracy is correlated with less repression, vindicating the general theory presented above. Next,
we test the possibility of a curvilinear relationship by introducing a multiplicative term. The
coefficients for Polity (Model 3) and Lexical (Model 4) are similar, though only Polity offers support
for the notion that democracy's impact on repression is nonlinear.
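A curvilinear specification of this sort can be illustrated with a bare-bones least-squares fit. The sketch below assumes that the multiplicative term is the democracy score multiplied by itself (a squared term); it omits the covariates and lagged outcome used in Table 4, and the data are generated without noise purely for illustration.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination
    with partial pivoting (a is a small square matrix)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(x, y):
    """OLS fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    design = [[1.0, xi, xi ** 2] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical data: a democracy score (0-10) and a repression score that
# first rises and then falls.
democracy = list(range(11))
repression = [3 + 0.4 * d - 0.06 * d ** 2 for d in democracy]

b0, b1, b2 = fit_quadratic(democracy, repression)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

A negative coefficient on the squared term (b2), paired with a positive coefficient on the linear term (b1), is the signature of an inverted-U relationship.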
Finally, again following the lead of D&A, we attempt to get inside the causal box by
exploring each category of these indices separately through the use of dummy variables representing
each level (with the first level omitted as a reference category). Results, shown in Models 5 and 6, are
again broadly similar across the two indices, though there are some important differences. The
coefficient for L1 in the Polity index indicates significantly more repression than the reference
category; L2-6 show results that are not statistically distinguishable from the null; and L7-10 show
negative and statistically significant coefficients. By contrast, L1, L2, and L4 in the Lexical index
differ significantly from the reference category, but L3 does not.
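The dummy-variable coding used in Models 5 and 6 can be sketched as follows. The function name is hypothetical; the point is simply that the lowest level is omitted, so each remaining indicator is read as a contrast with that reference category.

```python
def level_dummies(score, levels, reference):
    """One indicator per level except `reference`, which is omitted so that
    each coefficient is read as a contrast with the reference category."""
    if score not in levels:
        raise ValueError(f"unknown level: {score}")
    return {f"L{lvl}": int(score == lvl) for lvl in levels if lvl != reference}


# A polity at L3 on the collapsed five-level Lexical index (L0-L4).
print(level_dummies(3, levels=range(5), reference=0))
# prints {'L1': 0, 'L2': 0, 'L3': 1, 'L4': 0}
```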
Leaving aside for a moment the question of which index offers a truer representation of the
relationship between democracy and repression, let us consider what might be learned from Models
5 and 6. D&A (2004: 548) conclude that there are important differences between the political
systems associated with the highest levels of the Polity measure, a reasonable conjecture. But
they cannot follow this statement up with any speculation about what is distinctive about the higher
levels of the Polity index or what might be driving the apparently curvilinear relationship between
democracy and repression. This is because the levels of the Polity index are not individually
interpretable. In this respect, ordinal indices of democracy such as Polity, PR, and CL function very
much like interval indices. They inform us about quantities (more or less of some latent trait) but not
about qualities (categorical differences across levels).
By contrast, the Lexical index provides ample fodder for theorizing because each level
defines a discrete category and each category is plausibly approached as a regime-type. Let us begin
by reviewing the information contained in Model 6. Two levels in the Lexical index reveal higher
levels of state repression: L0 (no elections) and L3 (multi-party elections and executive
accountability, but no competitive elections). This accounts for the curvilinear relationship discovered in Models 3-4. While it is
unsurprising to discover that a nonelectoral state has high levels of repression (for all the reasons set
forth in our initial theory), it is somewhat surprising to find that there is no (statistically significant)
difference in levels of repression across L0 and L3. In other words, repression increases as a polity
moves (hypothetically) from multiparty elections (L2) to multiparty elections plus executive
accountability (L3).
[Table 4 about here]
An explanation may be found in the hybrid nature of this regime, which installs many of the
constitutional forms of democracy without the crucial missing step in which elections are allowed to
become competitive. That is, L3 polities look like they are democratic, and undoubtedly are
portrayed by their leaders as democratic. But leaders have mechanisms to prevent the possibility of
opposition parties gaining power (Schedler 2002). Opposition groups are free to organize and to
participate in the political system but are not allowed to win. This setting may engender repression
for several complementary reasons. First, the government is compelled to recognize the opposition;
it cannot prevent them from organizing as it might in a one-party or no-party regime. Second,
because the opposition is free to organize it is likely to pose a significant challenge to the
government. And because it is not allowed to compete fairly in elections it is likely to pursue
extraconstitutional measures, which in turn are likely to provoke government repression. Finally,
since the government cannot use constitutional means to maintain its power (for otherwise the
opposition might win at the ballot box) it must resort to state repression. In short, both government
and opposition have means and motive to engage in a cycle of protest and retaliation, a setting that
is likely to feed high levels of state repression, as measured by the PTS.
This short explanatory sketch is not intended to convince. In order to be fully convincing a
causal explanation would need to be accompanied by a much longer theoretical discussion intended
to make sense of case-based evidence and extant theorizing on this well-trodden subject (not to
mention a battery of robustness tests). Our purpose is heuristic. We hope to have shown that a
lexical approach to measurement provides a useful tool for gaining insight into (putatively) causal
relationships and specifically into the causal mechanisms that may (or may not) be at work. Note
that the information gained from categories on a lexical scale is useful for shedding light on why X
might be a cause of Y and why it might not be a cause of Y.
We are not proposing that a Lexical index has any claim to ontological priority over other
sorts of indices, each of which represents certain aspects of reality and each of which has its uses.
Sometimes, relationships are continuous (and hence best measured with an interval scale) and
sometimes they have only one threshold (and hence best measured with a binary scale). By the same
token, sometimes causal relationships are ordinal in character or require an ordinal scale to test
various threshold possibilities. In these settings, which surely apply to many theories about
democratic development (as cause or effect), a lexical scale, where ordinal levels represent
qualitatively different categories, may be appropriate.

IV. Situating the Lexical Scale

Within the literature on concept formation, the lexical scale may be viewed as an attempt to
reconcile minimal (thin) and maximal (thick, ideal-type) strategies (Coppedge 1999; Gerring 2012: ch
5). Note that the first condition (or first several conditions) establishes a minimal definition while
the last condition in the scale completes what might be viewed as an ideal-type concept. Granted,
ideal-type definitions are often more expansive than those envisioned by the lexical scale, primarily
because the requirements of an ideal-type are less restrictive (anything that coheres with the concept
is admissible). Even so, the lexical scale serves as a bridge between minimal and maximal concepts.
The lexical scale is also closely linked to a long intellectual tradition focused on typologies
(Bailey 1994; Collier et al. 2012; Elman 2005; Gerring 2012: ch 5; Lazarsfeld 1937; Lazarsfeld,
Barton 1951). Note that each level of a lexical scale corresponds to a distinctive category or type,
some of which may have recognizable names and identities.
Within the literature on measurement the lexical scale is similar in structure to Guttman
scales (Coppedge, Reinicke 1990; Guttman 1950) and Mokken scales, aka ordinal or nonparametric
item response theory (Cingranelli, Richards 1999; Mokken 1971; Sijtsma, Molenaar 2002; Sijtsma,
Debets, Molenaar 1990; van Schuur 2003, 2011). However, there are important differences.
First, a Mokken scale typically attempts to integrate a large number of items, many of which
are not qualitatively different from each other. In doing so, the resulting index has a better chance of
identifying fine-grained differences among similar entities, at the cost of introducing levels in the
scale that have no intrinsic or theoretical meaning.
Second, in ordering these items the Mokken scale approach is inductive rather than
deductive. Specifically, the most common attributes constitute the bottom rungs of the scale while
the least common attributes constitute the top rungs.
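The inductive ordering step can be sketched in a few lines. This shows only the frequency-based ordering described above, not a full Mokken scaling procedure; the items and cases are hypothetical.

```python
def order_by_frequency(cases, items):
    """Order items from most to least frequently satisfied across `cases`.
    Ties keep the order given in `items` (Python's sort is stable)."""
    freq = {item: sum(1 for c in cases if c[item]) / len(cases) for item in items}
    return sorted(items, key=lambda item: -freq[item]), freq


# Hypothetical cases coded on four attributes, listed in a deductive order A-D.
cases = [
    {"A": 1, "B": 1, "C": 0, "D": 1},
    {"A": 1, "B": 0, "C": 0, "D": 1},
    {"A": 1, "B": 1, "C": 1, "D": 1},
    {"A": 0, "B": 0, "C": 0, "D": 1},
]

order, freq = order_by_frequency(cases, ["A", "B", "C", "D"])
print(order)  # prints ['D', 'A', 'B', 'C']
```

In this invented sample attribute D is the most common and therefore lands on the bottom rung, even though it sits at the top of the deductive ordering. A different sample could yield a different ordering.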

Third, the inductive method of scaling is probabilistic, while the lexical method of scaling is
deterministic. By this, we mean that a Mokken scale presumes that some cases will not fit neatly into
the monotonicity presuppositions of the scale. (The degree of fit is captured in a statistic called
Loevinger's H, which is based on the frequency of Guttman errors compared to expected Guttman
errors.) By contrast, the score for a lexical scale is deterministic because it arises from a series of
necessary-and-sufficient conditions. Naturally, there is error associated with the coding of cases, but
this error is at the level of brute data; sometimes, one does not have enough information at one's
disposal to code a particular case, or the coding criteria for a scale are too ambiguous to resolve
coding decisions.
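For readers unfamiliar with the statistic, the scale-level version of Loevinger's H for dichotomous items can be sketched as follows. This is a textbook-style illustration with hypothetical data, not output from any Mokken software: H is one minus the ratio of observed Guttman errors to the number expected if the items were statistically independent.

```python
from itertools import combinations


def loevinger_h(cases, items):
    """Scale-level Loevinger's H for dichotomous (0/1) items.

    For each item pair, a Guttman error is a case that satisfies the rarer
    (harder) item but not the more common (easier) one.
    """
    n = len(cases)
    p = {item: sum(c[item] for c in cases) / n for item in items}
    observed = expected = 0.0
    for a, b in combinations(items, 2):
        easy, hard = (a, b) if p[a] >= p[b] else (b, a)
        observed += sum(1 for c in cases if c[hard] and not c[easy])
        expected += n * p[hard] * (1 - p[easy])
    if expected == 0:
        raise ValueError("H is undefined when no Guttman errors are expected")
    return 1 - observed / expected


def as_case(pattern):
    """Turn a string such as '110' into {'A': 1, 'B': 1, 'C': 0}."""
    return dict(zip("ABC", (int(ch) for ch in pattern)))


# Hypothetical data: a perfectly cumulative sample, then the same sample
# with one case (001) that violates the cumulative pattern.
perfect = [as_case(s) for s in ["111", "110", "110", "100", "100", "100", "000"]]
with_error = perfect + [as_case("001")]

h_perfect = loevinger_h(perfect, "ABC")
h_error = loevinger_h(with_error, "ABC")
print(h_perfect, round(h_error, 3))
```

A perfectly cumulative sample yields H = 1; each pattern that violates the ordering pulls H downward.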
Both Mokken and lexical scales are cumulative. However, all cells in the matrix are defined
in a Mokken scale while only the positive elements of the scale are defined in a lexical scale. This
may be seen by directly comparing lexical and Mokken scales with the same number of levels. (As
we pointed out, this would be rare, as inductively derived indices generally seek to integrate many
more items than the corresponding lexical scale.) Returning to Table 1, the reader will notice that all
cells containing a term are defined while empty cells are undefined. By defined, we mean that they
define the lexical scale in a deterministic fashion (successive necessary-and-sufficient conditions). If
a case receives a score of 3 it must code positively on A, B, and C, and negatively on D. Condition E
is undefined; whether it is satisfied or not will not affect the coding for that case.
Table 5 presents a hypothetical Mokken scale, mirroring the features of Table 1 in all but
one respect. Here, all cells are relevant, which means that the predictive capacity of the scale is much
greater. Specifically, knowledge of a case's score will allow one to predict values for all conditions in
the scale, including both lower- and higher-level conditions. Of course, because the scale is
probabilistic these predictions include error. This error should be interpreted as error in the fit
between the model and reality, not measurement error in the usual sense (although the latter may be
present as well).
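The contrast between defined and undefined cells can be made concrete with a short sketch. The function names are hypothetical, and the second function treats the cumulative scale as if it were error-free, whereas on a Mokken scale its output would be a prediction subject to error.

```python
def lexical_pattern(score, conditions):
    """Values implied by a lexical score: True up to the score, False for the
    next condition, None (undefined) for all higher conditions."""
    pattern = {}
    for position, name in enumerate(conditions, start=1):
        if position <= score:
            pattern[name] = True
        elif position == score + 1:
            pattern[name] = False
        else:
            pattern[name] = None
    return pattern


def cumulative_pattern(score, conditions):
    """Values predicted from a score on a fully cumulative scale:
    every condition is defined, including those above the score."""
    return {name: position <= score
            for position, name in enumerate(conditions, start=1)}


conditions = ["A", "B", "C", "D", "E"]
print(lexical_pattern(3, conditions))
print(cumulative_pattern(3, conditions))
```

For a score of 3, the lexical pattern leaves E as None (it may or may not be satisfied), while the cumulative pattern predicts that E is not satisfied.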
[Table 5 about here]
Naturally, lexical and Mokken scales lead to different indices wherever the deductive
properties of a concept do not line up neatly with its inductive properties. Consider what happens
when we construct a hypothetical Mokken index of electoral democracy using the same seven
elements we identified as components of our Lexical index. The ordering of elements in the Lexical
index is presented in the first row of Table 6. The ordering of elements in a Mokken index (using
the same sample of country-years and based on the relative frequency of each attribute) is located
in the second row. The ordering of elements in two Mokken indices constructed from restricted
samples is presented in the final rows. Note that not only are the Mokken scales quite different from
the lexical scale; each of the three Mokken scales is also quite different from the others due to the
fact that they draw on varying samples.
[Table 6 about here]
To say that lexical and Mokken scaling procedures sometimes arrive at different indices for
the same general concept is not to say that one is necessarily superior to the other. Both surely have
their uses. However, there are reasons to imagine that the solutions provided by Mokken/IRT
models are less appropriate for some concepts than for others.
In addressing this issue we shall compare concepts such as civil liberty, party strength, rule of law,
and electoral democracy (which we explored in Section III) with concepts that have been central to the
development of Mokken/IRT models in the fields of education and psychology such as intelligence
and aptitude. Note that while the first class of concepts describes features of institutions, the second
class of concepts describes features of individuals. (This is not the only aspect in which they differ,
but it may be a critical one.) Four important differences across these two classes of concepts may be

First, with individual-level concepts it is often possible to identify outcome indicators that
measure the latent concept of interest. A subject's performance on a test offers a good indication of
their aptitude: if not of their overall intelligence, at least of their knowledge in a subject area. With
institutional concepts the outcome-based approach is often difficult to apply. Outcome measures of
democracy might include the closeness of the vote between parties in an election or the frequency of
turnover as the result of an election (Gerring, Teorell, Zarecki 2013; Vanhanen 2000). While
informative, such indicators obviously do not capture the entirety of the concept and can be
misleading. (A large margin of victory, and infrequent turnover, may be a sign of citizen satisfaction
rather than of authoritarian tendencies.) Thus, for institutional concepts the items on a scale are
likely to consist of substantive attributes that define the concept of interest. In measuring electoral
democracy, for example, an item might consist of the question "Are there competitive elections?" or
"Is suffrage universal?" This is roughly equivalent to an aptitude test in geography that includes the
question "Are you good at identifying place-names?"
Second, with individual-level concepts that can be measured with outcome indicators (like
answers to an aptitude test) it is fair to assume that all questions (that relate in some way to the
subject) are relevant and none are more important than others, except insofar as they might serve a
more useful function in discriminating between subjects. A question's utility derives from its
function within the overall schedule of items on a test, and is easily judged by the extent to which it
improves the overall capacity of that test to discriminate among subjects. By contrast, when
institutional qualities are measured by the attributes that define them, it follows that some items will
be more important than others (in defining the latent concept of interest). Consider the following
two indicators of electoral democracy: "Are there competitive elections?" and "Are 18-21-year-olds
allowed to vote?" Evidently, the first is more central to the concept (as normally understood) than
the latter. Inductive methods of scale construction offer no way of sorting this out. (Granted, our
lexical ordering of attributes results in an ordinal scale whose levels are probably not equidistant
from each other. Nonetheless, the construction of the scale presumes that each level is qualitatively
different and theoretically significant, guarding against the introduction of trivial conditions.)
Third, with individual-level concepts like intelligence it is often reasonable to construct a
scale inductively by reference to the pervasiveness of different responses. This is because responses
to items on a questionnaire often possess a cumulative quality. All subjects who answer a hard
question correctly will also answer easier questions correctly. However, the empirical distribution of
institutional concepts does not always follow this cumulative logic. For example, polities that are
undemocratic may thwart the will of the people in different ways, meaning that they will not receive
the same score on some questionnaire items even when they are equally authoritarian. IRT models
do not provide a means for identifying functional substitutes (several individually sufficient
Finally, within individual-level concepts like intelligence the meanings of items on a scale are
independent of each other. A subject's answer to question #4 does not affect the meaning of her
answer to question #7. However, in a survey purporting to measure democracy the meaning of "Is
there universal suffrage?" changes depending upon the answer to the question "Are there
competitive elections?" Specifically, it is not clear that universal suffrage has much significance to
electoral democracy unless and until there are minimally competitive elections. This is not an
empirical relationship; indeed, many countries hold elections without competition. It is, rather, about
the meaning of the attributes relative to the underlying concept. This issue of non-independence
does not arise in most individual-level concepts; accordingly, IRT models do not usually take into
account relationships of logical necessity. (If they do, they are constructed in a highly deductive
manner. Recall that our argument is not with technical methods of scale construction per se but
rather with their use.)

Of course, some institutional concepts have properties that are conducive to Mokken/IRT
scaling. And others can be scaled in different ways depending upon the goal of the researcher or
one's interpretation of a concept. Each scaling tradition has its uses. The point of our extended
comparison is simply to indicate that, in some settings, the Mokken/IRT approach may result in
problems of concept validity and, in these settings, a more concept-driven approach to measurement
may be warranted.

V. Discussion

We shall now attempt to summarize our wide-ranging discussion pertaining to the strengths and
weaknesses of the lexical scale.
In principle, the recalcitrant aggregation problem is solved by treating defining attributes as
necessary-and-sufficient conditions arrayed in an ordinal fashion. If the scale is true to its
objectives, each level in the scale defines a stronger, more complete instantiation of the underlying
concept. This is no mean feat, as composite indices are often plagued by problems of aggregation
(Goertz 2006; Munck 2009). Indeed, wherever multiple indicators are combined in a single index
there are usually multiple principles of aggregation that may be invoked, and rarely is there a
definitive justification for choosing one over another.
Because conceptualization is integrated into measurement there should, in principle, be less
slippage between concept and indicator than is typically encountered with other methods of scale
construction. However, if the analyst drops attributes of a concept from an index (because they
cannot be meaningfully arrayed in an ordinal scale), forces continuous phenomena into an arbitrary
binary coding, or prioritizes conditions without some underlying rationale, the resulting index will
depart from ordinary meanings implied by the concept. The lexical scale is by no means immune to
problems of concept/construct validity.
Likewise, the strictures of the lexical scale are not universally applicable. They require that
relevant attributes of a concept be coded in a binary fashion without undue distortion and that the
chosen attributes be arrayed along a single dimension according to their centrality to the concept or
relations of dependence. These are not easy requirements to satisfy.
We have noted that the deductive properties of a lexical scale require many judgments on the
part of the analyst. Accordingly, different analysts may arrive at different scales for the same
concept. This, by itself, does not differentiate the lexical scale from scales constructed in a more
inductive fashion. After all, there are many moving parts to any scale, particularly when one is
attempting to operationalize a highly abstract concept. One must choose an indicator or set of
indicators to represent a concept and, if more than one indicator is chosen, one must decide upon
an aggregation technique(s) that combines those elements into a single scale. Accordingly, it is not
the case that lexical scales are more subjective than other scales.

This presumes, of course, that each condition can be accurately measured in a binary fashion without too much loss of
information.

Authors' choices of indicators to include in an index are often somewhat arbitrary (Goertz 2006; Haig, Borsboom
2008; Munck 2009). For example, the Worldwide Governance Indicator for rule of law (Kaufmann, Kraay, Mastruzzi
2007) primarily measures crime and property rights, downplaying or entirely excluding other attributes of the concept
(Skaaning 2010). Likewise, the Freedom House Political Rights and Civil Liberties indices (a component of various
latent-variable models of democracy) include indicators pertaining to corruption, civilian control of the police, the
absence of widespread violent crime, willingness to grant political asylum, the right to buy and sell land, and the
distribution of state enterprise profits (Freedom House 2007). Some observers might regard these features as elements
of political rights and civil liberties; others might not. Since most abstract concepts can be defined in a variety of ways
and do not possess sharp boundaries, it is no surprise to discover that one analyst's bundle of indicators may be quite
different from another's, even when they purport to operationalize the same term.

Arguably, the assumptions employed in the construction of a lexical scale are more
transparent than the assumptions used to construct many other composite indices, especially when a
number of aggregation principles are embedded in a complex statistical model. On the other hand,
because of the set-theoretic nature of the lexical scale it seems likely that alternative lexical scales (for
the same concept) will be less highly correlated than varying inductive scales (for the same concept).
A small change in an ordinal scale generally has greater consequences than a small change in an
interval scale.
Lexical scale construction is a highly deductive enterprise insofar as the resulting index is
constructed to suit a priori requirements drawn from the concept rather than from the empirical
distribution of the data. Yet, many concepts do not provide clear guidance with respect to the
relative priority of their defining conditions, a limiting condition on the applicability of lexical
scales. Where applicable, however, the deductive properties of the scaling procedure offer certain
advantages. Note that insofar as the distribution of data is allowed to govern the construction of an
index the resulting index is sample-dependent. If key properties of a sample change (e.g., when
drawn from different populations or when drawn non-randomly from a single population), so does
the resulting scale. Sometimes, sample bias can be corrected with an IRT model; however, this
presumes considerable knowledge about the larger population of interest. In many circumstances
(e.g., where the population extends into the future), it is not possible to determine what the shape of
a larger population looks like. In these situations, indices are biased or, alternatively, they lack
generalizability (they are sample-dependent). Note that the Mokken indices of electoral democracy
that we constructed in Table 6 vary across each chosen sample period.
Relatedly, a basic (and nearly universal) operating assumption of standard index construction
is that one can combine information from observed variables by paying attention to their
commonalities and discarding their differences (as error). This is a reasonable set of assumptions in
many circumstances, especially when the commonalities are great and the remaining differences do
not seem to represent anything of substantive significance (i.e., they do not compose an identifiable
dimension). However, it involves a considerable simplification of reality, especially when
intercorrelations are modest. In such circumstances, the lexical scale offers a viable alternative.
With respect to discrimination, the lexical scale may be counted as modestly successful. It
provides much more information than the classical concept, understood as a binary scale. It is on
par with many ordinal indices, which generally incorporate a handful of levels. It is also on par with
indices that purport to be interval scales but, in reality, are probably better understood as ordinal,
such as the Polity and Freedom House indices of democracy (Armstrong 2011; Cheibub et al. 2010;
Pemstein et al. 2010; Treier and Jackman 2008). To be sure, a lexical scale will discriminate less
successfully than a scale whose construction is geared to detect small differences (e.g., scales
constructed with IRT models).
While sensitivity to small differences is valuable, it is not the only factor of importance in
constructing a scale. Note that some concepts in the social science universe are probably lumpy
rather than continuous. This appears to be the case with electoral democracy. One is at pains to
describe the difference between a regime with popular elections and one without (the first condition
of our proposed Lexical index) as a matter of degrees. The same point might be made with reference
to some of the other examples discussed in section III.
Likewise, where a concept is being formulated as a right-side variable in a causal model it
may be helpful to recognize distinct treatments, understood as a cumulative series of compound
treatments: A, A&B, A&B&C, etc. These can be tested with (a) pairwise comparisons and
matching algorithms, (b) dummy variables in a regression model, (c) generalized additive models
(GAM [Beck, Jackman 1998]), or (d) Bayesian shrinkage models (Alvarez, Bailey, Katz 2011). If used
to achieve covariate balance in a matching analysis a categorical variable is generally more tractable
than a continuous variable. In these respects, lexical scales are well-suited for causal inference.
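
Option (b) can be illustrated with a minimal sketch in Python. It assumes a hypothetical list of lexical scores and turns each level above zero into its own indicator variable, with level 0 as the reference category. The function name `level_dummies` and the toy scores are illustrative assumptions, not part of our analysis.

```python
def level_dummies(scores, max_level):
    """Return one row of 0/1 indicators per unit, one column per level 1..max_level.

    Level 0 is the omitted reference category, so a unit scored 0 gets all zeros.
    """
    return [
        [1 if score == level else 0 for level in range(1, max_level + 1)]
        for score in scores
    ]


# Toy lexical scores for five hypothetical units (not real data).
scores = [0, 2, 6, 3, 2]
for score, row in zip(scores, level_dummies(scores, max_level=6)):
    print(score, row)
```

Each column then corresponds to one distinct compound treatment, so a coefficient on a given column can be read as the contrast between that level and level 0.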
By contrast, inductively derived indices often function awkwardly on the right side of a
causal model. A useful treatment is uniform, imposing the same condition on all those within the
treatment group. However, indices usually include heterogeneous elements (a little bit of this and a
little bit of that), in portions that are difficult to account for. Typically, there are many ways to obtain
a score of "3" along a continuous scale. Consequently, it is difficult to say what the treatment
consists of, what causal mechanisms might be at work, and whether the resulting relationship should
be interpreted as causal.
Composite scales generally indicate differences of degree, but not of kind. A "4" on the
Polity2 scale indicates that a regime is more democratic than a country receiving a "2." But it offers
no additional information about the qualities of these regimes. In this respect, the information
contained in a standard composite index is quantitative (more/less) rather than qualitative
(differences of type). Accordingly, a point on a composite index rarely has an obvious interpretation
or meaning except in terms of standard deviations from the mean, and thresholds used to convert a
continuous scale into a nominal or ordinal scale are apt to be highly arbitrary. This makes it difficult
to evaluate concept validity, even if aggregation rules are perfectly transparent. And it makes it
difficult to apply concepts to real-world situations, detracting from social science's relevance to
politics and policy.
By contrast, a lexical scale is relatively transparent. Researchers and reviewers know exactly
what a shift from "2" to "3" or "3" to "4" means because each level in the scale is achieved by only
one additional criterion. This eases the burden of ex ante coding and ex post interpretation.
Likewise, insofar as levels correspond to distinctive types, membership in each category of an
ordinal scale is meaningful. Units coded as "3" share various characteristics, which may signal
important theoretical properties (e.g., as inputs or outputs of a causal model). Qualitative differences
are sometimes more informative than quantitative differences.

With respect to the Freedom House indices, Cheibub et al. (2010: 75) note: "for each of the ten categories in the
political rights checklist and the 15 categories of the civil liberties checklist, coders assign ratings from zero to four and
the points are added so that a country can obtain a maximum score of 40 in political rights and 60 in civil rights. With
five alternatives for each of ten and 15 categories, there are 5^10 = 9,765,625 possible ways to obtain a sum of scores
between zero and 40 in political rights, and 5^15 = 30,517,578,125 possible ways to obtain a sum of scores between zero
and 60 in civil liberties. All of these possible combinations are then distilled into the two seven-point scales of political
rights and civil liberties."
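
The counts quoted in this note follow from simple arithmetic and can be checked directly. The Python sketch below reproduces the two totals and, as an illustrative extension that is not in the quoted source, tallies how many distinct rating profiles produce each summed score. The helper name `profiles_per_total` is hypothetical.

```python
def profiles_per_total(n_items, n_options=5):
    """Count rating profiles (each item rated 0..n_options-1) by their summed score."""
    counts = {0: 1}  # one way to reach a total of zero before any item is rated
    for _ in range(n_items):
        updated = {}
        for total, ways in counts.items():
            for rating in range(n_options):
                updated[total + rating] = updated.get(total + rating, 0) + ways
        counts = updated
    return counts


political_rights = profiles_per_total(10)  # ten checklist categories
civil_liberties = profiles_per_total(15)   # fifteen checklist categories

print(sum(political_rights.values()) == 5 ** 10)  # all profiles, political rights
print(sum(civil_liberties.values()) == 5 ** 15)   # all profiles, civil liberties
print(political_rights[20])  # number of distinct profiles that sum to 20
```

Only the extreme totals (0 and the maximum) are reached by a single profile; every intermediate total can be reached in many ways, which is the heterogeneity problem described in the text.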

Table 1:
Generic Lexical scale

0. ~A
1. A ~B
2. A B ~C
3. A B C ~D
4. A B C D ~E
5. A B C D E

0-5 = ordinal scale. A-E = conditions that are satisfied. ~A-E = conditions that are not satisfied. Empty
cells = undefined. Relationships are deterministic.
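
The scoring rule in Table 1 can be sketched in a few lines of Python. This is a minimal illustration under the assumption that conditions are supplied in their lexical order (A first); the function name `lexical_score` is ours and purely illustrative.

```python
def lexical_score(conditions):
    """Return the number of leading conditions satisfied before the first failure.

    `conditions` is an ordered sequence of booleans (A, B, C, ...). Anything
    after the first unsatisfied condition is ignored, mirroring the empty cells.
    """
    score = 0
    for satisfied in conditions:
        if not satisfied:
            break
        score += 1
    return score


print(lexical_score([False, True, True, True, True]))   # ~A: later conditions are irrelevant
print(lexical_score([True, True, False, True, False]))  # A B ~C
print(lexical_score([True, True, True, True, True]))    # A B C D E
```

Note that the second example satisfies D but still scores 2, because D carries no weight until C is met.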


Table 2:
Lexical Index of Democracy: The World in 1904

0 1 2 3 4 5 6
Male or
female suffrage
Ottoman Emp
Dom. Rep.
El Salvador
Costa Rica
United States



Table 3:
Frequency Distribution of the Lexical Index of Electoral Democracy, 1800-2008

Level      N      %
0      4,501  26.20
1      2,920  17.00
2      1,507   8.77
3      2,594  15.10
4        863   5.02
5        503   2.93
6      4,291  24.98


Table 4:
Electoral Democracy as a Predictor of State Repression: A Comparison of Indices

Linear Curvilinear Disaggregated

1 2 3 4 5 6
Polity -.027***


Polity, L1 .115*

Polity, L2 -.086

Polity, L3 -.046

Polity, L4 .028

Polity, L5 -.053

Polity, L6 -.010

Polity, L7 -.093*

Polity, L8 -.148***

Polity, L9 -.182***

Polity, L10 -.394***

Lexical -.041***


Lexical, L1 -.107**
Lexical, L2 -.167**
Lexical, L3 -.038
Lexical, L4 -.226***
.752 .749 .755 .749 .756 .751

Sample period: 1976-2008. Countries: 158. Observations: 3582. Estimator: OLS (TSCS). Standard errors in
parentheses. *<.1, **<.01, ***<.001 (two-tailed test). L=levels on an ordinal scale (not lags). Additional
covariates (presented in the text) are not shown.


Table 5:
Generic Mokken Scale

0. ~A ~B ~C ~D ~E
1. A ~B ~C ~D ~E
2. A B ~C ~D ~E
3. A B C ~D ~E
4. A B C D ~E
5. A B C D E

0-5 = ordinal scale. A-E = conditions that are satisfied. ~A-E = conditions that are not satisfied.
Relationships are probabilistic.


Table 6:
Lexical and Mokken Democracy Indices Compared

Indices Ordering of attributes
Lexical 0,1,2,3,4,5,6
Mokken (1800-2008) 0,1,3,5,2,6,4
Mokken (1800-1945) 0,1,2,3,5,4,6
Mokken (1946-2008) 0,5,6,1,3,2,4

0=No elections 1=National elections 2=Multi-party elections 3=Executive elections 4=Minimally
competitive elections 5=Male or female adult suffrage 6=Universal adult suffrage. Full definitions
are contained in the text. Mokken analyses reveal Loevinger's H coefficients of 0.66-0.77 for the
scales and 0.56-0.96 for the individual items, indicating rather strong scalability. (Note: Mokken
coding for 0 presumes no national elections and no universal suffrage for males or females.)
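
One simple way to summarize how far each Mokken ordering in Table 6 departs from the Lexical ordering is to count the pairs of attributes that the two orderings rank in opposite directions. The Python sketch below does this for the orderings listed above. The pair-counting measure and the function name `discordant_pairs` are our illustrative assumptions; they are not part of the Mokken procedure itself.

```python
from itertools import combinations


def discordant_pairs(order_a, order_b):
    """Count attribute pairs that appear in opposite relative order in the two orderings."""
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


# Orderings of attributes copied from Table 6.
lexical = [0, 1, 2, 3, 4, 5, 6]
mokken = {
    "1800-2008": [0, 1, 3, 5, 2, 6, 4],
    "1800-1945": [0, 1, 2, 3, 5, 4, 6],
    "1946-2008": [0, 5, 6, 1, 3, 2, 4],
}

for period, ordering in mokken.items():
    print(period, discordant_pairs(lexical, ordering), "of 21 pairs reversed")
```

With seven attributes there are 21 pairs in total, so the count gives a rough sense of how much the data-driven ordering shifts from one sample period to the next.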


Figure 1:
Distribution of Countries
Across the Lexical index of Democracy, 1800-2008


Appendix A:
The Lexical Index of Electoral Democracy

When introducing a new index of democracy to the world it is traditional to begin by critiquing
extant indices. Because this critique has been issued so many times in recent years it seems gratuitous
to revisit these issues here. Instead, we invite readers to peruse the voluminous literature on the
subject (e.g., Beetham 1994; Berg-Schlosser 2004a, 2004b; Bollen 1993; Bollen and Paxton 2000;
Bowman, Lehoucq, and Mahoney 2005; Coppedge et al. 2013; Coppedge, Alvarez, Maldonado 2008;
Foweraker and Krznaric 2000; Gleditsch and Ward 1997; Hadenius and Teorell 2005; McHenry
2000; Munck 2009; Munck and Verkuilen 2002; Treier and Jackman 2008; Vermillion 2006).
Accordingly, we focus our discussion in the text and in this appendix on that which
distinguishes the Lexical index from other indices. Occasionally, these comparisons will verge on
critique. But the general point should be kept in mind: most indices of a concept (e.g., democracy)
are useful for some purposes. It is exceedingly rare to discover an indicator that is inferior to another
on all accounts. Thus, the principal goal in launching a new index should be to make its distinctive
qualities clear, without derogating the alternatives.
We begin by sketching out in some detail the coding process entailed in the construction of
the Lexical index, which encompasses all independent countries from 1800 to 2008. Next, we
compare the Lexical index to other prominent democracy indices in a series of tables (some of
which have already been referenced in the text).

To code the Lexical index of electoral democracy we began by identifying independent countries
that would serve as the units of analysis. Here, we rely on Gleditsch (2013) and Correlates of War
(2011), supplemented for the years 1800-1815 by various country-specific sources.
To operationalize the different levels of the Lexical index we make use of four indicators
from the PIPE dataset (Przeworski et al. 2013): LEGSELEC, EXSELEC, OPPOSITION, and
FRANCHISE, as described below. We also construct a new indicator, COMPETITION, also
described below. Note that each of these indicators refers to the status of a country on the last day of
the calendar year (31 December), and is not intended to reflect the mean value of an indicator
across the previous 365 days. These five indicators are employed, as follows, in order to construct
the Lexical index:
0. No elections. Elections are not held for any national-level policymaking
offices. This includes situations in which elections are postponed indefinitely
or the constitutional timing of elections is violated in a more than marginal
fashion. LEGSELEC = 0 (the lower house of the legislature is not elected) &
EXSELEC = 0 (the chief executive is not elected whether directly or
indirectly, i.e., by people who have been elected). Sources: PIPE, country-
specific sources.
1. National elections. There are regular national elections. LEGSELEC = 1 (the
lower house of the legislature is at least partly elected) or EXSELEC = 1 (the
chief executive is either directly or indirectly elected, i.e., by people who have
been elected). Sources: PIPE, country-specific sources.
2. Multi-party elections. Opposition parties are allowed to participate in legislative
elections and to take office. OPPOSITION = 1 (there is a legislature that is at
least in part elected by voters facing more than one choice). Sources: PIPE,
country-specific sources.
3. Executive accountability. The chief executive is accountable either directly to
the electorate or indirectly to an elected parliament. EXSELEC = 1 (the chief
executive is either directly or indirectly elected, i.e., by people who have been
elected). Sources: PIPE, country-specific sources.
4. Competitive elections. The chief executive offices and the seats in the effective
legislative body are directly or indirectly filled by elections characterized
by uncertainty, meaning that the elections are, in principle, sufficiently free to
enable the opposition to gain power. COMPETITION = 1 (there is a
positive probability that the opposition can win government power). Sources:
Cheibub et al. (2010), Boix et al. (2013), country-specific sources.
5. Male or female adult suffrage. Virtually all adult male or female citizens are
allowed to vote in national elections. (In no extant cases was universal female
suffrage introduced before universal male suffrage, so in practice this level is
reserved for countries with male (only) suffrage.) MALE SUFFRAGE = 1
(virtually universal male suffrage) or FEMALE SUFFRAGE = 1 (virtually
universal female suffrage). Sources: PIPE (FRANCHISE), country-specific sources.
6. Universal suffrage. Virtually all adult citizens are allowed to vote in national
elections. MALE SUFFRAGE = 1 (virtually universal male suffrage) &
FEMALE SUFFRAGE = 1 (virtually universal female suffrage). Sources:
PIPE (FRANCHISE), country-specific sources.
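
The coding rules above can be sketched as a short Python function. This is a minimal illustration, not our actual coding procedure: it assumes the levels are applied cumulatively (as in Table 1), that `exselec` is the recoded executive indicator, and that the two suffrage arguments have already been derived from FRANCHISE. The function and argument names are hypothetical lower-case stand-ins for the indicators.

```python
def lexical_index(legselec, exselec, opposition, competition,
                  male_suffrage, female_suffrage):
    """Return the 0-6 level implied by the six ordered conditions (all inputs 0/1)."""
    conditions = [
        legselec == 1 or exselec == 1,                # 1. national elections
        opposition == 1,                              # 2. multi-party elections
        exselec == 1,                                 # 3. executive accountability
        competition == 1,                             # 4. competitive elections
        male_suffrage == 1 or female_suffrage == 1,   # 5. male or female suffrage
        male_suffrage == 1 and female_suffrage == 1,  # 6. universal suffrage
    ]
    level = 0
    for satisfied in conditions:
        if not satisfied:
            break  # later conditions are ignored once one fails
        level += 1
    return level


# Hypothetical country-years (not real codings).
print(lexical_index(0, 0, 0, 0, 1, 1))  # no elections: suffrage does not count
print(lexical_index(1, 0, 1, 0, 1, 1))  # elected multi-party legislature only
print(lexical_index(1, 1, 1, 1, 1, 0))  # competitive elections, male suffrage
```

The first example shows the non-independence of attributes discussed in the text: universal suffrage contributes nothing to the score in the absence of national elections.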

It is important to emphasize that although we employ PIPE as an initial source for coding
L0-3 and L5-6, we deviate from PIPE codings based on our reading of country-specific sources
in several ways. First, with respect to executive elections, in the PIPE dataset Prime ministers are
always coded as elected if the legislature is open. However, for our purposes we need an indicator
that also takes into account whether the government is responsible to an elected parliament if the
executive is not directly elected, which, for example, was not the case in a number of
European monarchies before World War I and is not the case in parts of the Middle East today. To illustrate, PIPE codes
Denmark as having executive elections from 1849 to 1900 although the parliamentary principle was
not established until 1901. Before then, the government was accountable to the king. Among the
current cases with elected multi-party legislatures not fulfilling this condition, we find Jordan and
Morocco. In order to achieve a higher level of concept-measure consistency, we have thus recoded
all country-years (based on country-specific accounts) for this variable where our sources suggested
doing so.
Second, we have filled out all the missing values in the PIPE dataset, meaning that we have a
complete dataset for all conditions for all independent countries of the world in the period 1800-
2008. In general, except for the minor adjustment regarding executive elections mentioned above,
we have followed the more specific coding rules laid out in the PIPE codebook. This work meant not
only that we used country-specific sources to fill the gaps but also that many years and a
number of countries (such as the German principalities of the 19th century) were added. If, in
our coding of the missing values, we came across information that suggested a recoding of any PIPE
scores, we did so. All values refer to the status of countries as of the end of a particular year.
Whereas the numbers of observations for the employed PIPE indicators range between 14,465 and
15,302, the additional codings mean that our dataset has 17,179 observations for all indicators.
Third, in order to measure whether elections are competitive, we have constructed a new variable
to capture whether there is a positive probability (see Przeworski et al. 2000: 16-17) that the opposition can
win government power. Like Cheibub et al. (2010; see also Przeworski et al. 2000), we have
considered instances of electoral incumbent turnover as a rather robust indicator of contested
elections. However, like Boix et al. (2013), we have not considered electoral executive turnover to be
either a necessary or a sufficient criterion for genuinely contested elections. Instead, we have also
taken into account more general impressions of how free elections were according to country-
specific sources. Doing so, we have followed Schumpeter (1950; see also Przeworski et al. 2000; Boix
et al. 2013) by establishing a modest threshold, e.g., not insisting on an entirely level playing field or
a high level of respect for civil liberties. Thus, elections are generally considered competitive if
voters experience little or no systematic coercion in exercising their electoral choice, and electoral
fraud does not determine who wins the elections.
By and large, the coding decisions required for the Lexical index of democracy are factual in
nature, resting on institutional features that require historical knowledge but not subjective
judgments on the part of the coder. Uncertainties are introduced when source material for a country
or era is weak. But we can probably assume that this sort of bias is random rather than systematic (as
it might be if coder judgments involved questions of meaning and interpretation). In this respect, the
Lexical index echoes a feature of the DD and BMR indices. Indeed, it is quite similar to these
indices insofar as it relies on binary codings, which are combined to form the lexical scale.



Table A1:
Democracy Indices Compared


Index                            Type      Range       Countries  Years      Obs     Full     Restricted
Lexical (authors)                Lexical   0-6         228        1800-2008  $
DD (Cheibub et al.)              Binary    0-1         200        1946-2008  $       .84 (S)  .36 (S)
BMR (Boix et al.)                Binary    0-1         213        1800-2007  15,972  $ (S)    $ (S)
Polity2 (Marshall, Jaggers)      Ordinal   -10-10      189        1800-2012  $       .79 (S)  .59 (S)
PR (Freedom House)               Ordinal   1-7         200        1972-2012  $       .85 (S)  .43 (S)
CL (Freedom House)               Ordinal   1-7         200        1972-2012  $       .79 (S)  .31 (S)
Democracy Index (Vanhanen)       Interval  0-100       187        1810-2000  $       $ (P)    $ (P)
Contestation (Coppedge et al.)   Interval  -1.84-1.96  200        1950-2000  $       .91 (P)  $ (P)
Inclusiveness (Coppedge et al.)  Interval  -3.04-1.91  200        1950-2000  $       .59 (P)  .54 (P)
UDS (Pemstein et al.)            Interval  -2.10-2.12  200        1946-2008  $       .88 (P)  .63 (P)

The final columns show the Spearman's (S) or Pearson's (P) correlation coefficient between the Lexical index and other
indices of democracy in a full sample (all available country-years) and a restricted sample (Lexical<6).

Table A2:

DD (Cheibub et al)
| Lexical
| 0 1 2 3 4 5 6 | Total
0 | 3,162 2,548 1,119 1,882 418 62 130 | 9,321
1 | 11 13 11 80 351 432 2,387 | 3,285
Total | 3,173 2,561 1,130 1,962 769 494 2,517 | 12,606


| Lexical
| 0 1 2 3 4 6 | Total
1 | 0 0 14 1 0 1,307 | 1,322
2 | 3 0 13 27 0 805 | 848
3 | 5 15 13 73 0 318 | 424
4 | 24 55 45 211 4 142 | 481
5 | 150 214 69 192 16 24 | 665
6 | 330 505 35 228 1 1 | 1,100
7 | 385 478 21 61 0 0 | 945
Total | 897 1,267 210 793 21 2,597 | 5,785

Polity2 (Polity IV)

| Lexical
| 0 1 2 3 4 5 6 | Total
-10 | 1,067 52 39 0 0 0 0 | 1,158
-9 | 256 618 114 128 0 0 4 | 1,120
-8 | 122 265 76 76 0 1 0 | 540
-7 | 629 798 224 82 0 0 0 | 1,733
-6 | 657 197 154 191 0 0 0 | 1,199
-5 | 140 158 39 191 20 0 10 | 558
-4 | 42 67 210 169 24 2 1 | 515
-3 | 241 265 83 410 45 42 14 | 1,100
-2 | 16 22 34 139 39 1 3 | 254
-1 | 49 132 51 225 6 4 21 | 488
0 | 107 51 10 126 8 9 19 | 330
1 | 37 31 76 101 8 7 3 | 263
2 | 17 35 43 177 66 21 20 | 379
3 | 4 26 10 77 76 14 30 | 237
4 | 87 10 24 119 113 8 51 | 412
5 | 2 26 11 84 50 19 119 | 311
6 | 5 3 25 15 53 22 264 | 387
7 | 5 13 0 36 66 36 281 | 437
8 | 6 12 3 20 50 32 457 | 580
9 | 0 1 1 4 50 56 363 | 475
10 | 4 2 0 10 93 225 1,801 | 2,135
Total | 3,493 2,784 1,227 2,380 767 499 3,461 | 14,611