Advanced Quantitative Research Methodology, Lecture Notes: Text Analysis II: Unsupervised Learning via Cluster Analysis

Gary King http://GKing.Harvard.Edu

December 23, 2011

Copyright 2010 Gary King, All Rights Reserved.

Reading

Justin Grimmer and Gary King. 2010. Quantitative Discovery of Qualitative Information: A General Purpose Document Clustering Methodology. http://gking.harvard.edu/files/abs/discov-abs.shtml

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

2 / 23

The Problem: Discovery from Unstructured Text

Examples: scholarly literature, news stories, medical information, blog posts, comments, product reviews, emails, social media updates, audio-to-text summaries, speeches, press releases, legal decisions, etc.
10 minutes of worldwide email = 1 LOC (Library of Congress) equivalent
An essential part of discovery is classification: "one of the most central and generic of all our conceptual exercises. . . . the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis. . . . Without classification, there could be no advanced conceptualization, reasoning, language, data analysis or, for that matter, social science research." (Bailey, 1994)
We focus on cluster analysis: discovery through (1) classification and (2) simultaneously inventing a classification scheme
(We analyze text; our methods apply more generally)

Why Johnny Can't Classify (Optimally)

Bell(n) = number of ways of partitioning n objects
Bell(2) = 2 (AB, A B)
Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)
Bell(5) = 52
Bell(100) ≈ 10^28 × Number of elementary particles in the universe
Now imagine choosing the optimal classification scheme by hand!
That we think of all this as astonishing . . . is astonishing
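
To make the growth of Bell(n) concrete, here is a minimal sketch that computes Bell numbers with the Bell triangle recurrence. The function name `bell` is chosen here for illustration and is not from the lecture; the recurrence itself is standard.

```python
def bell(n: int) -> int:
    """Return the n-th Bell number: the number of ways to partition n objects."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Row 0 of the Bell triangle. The first entry of row k is Bell(k).
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row ...
        new_row = [row[-1]]
        # ... and every further entry is its left neighbour plus the entry above.
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
    return row[0]


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, bell(n))
    # Number of decimal digits of Bell(100), to show how fast the count grows.
    print(len(str(bell(100))))
```

Python integers have arbitrary precision, so the same function handles n = 100 without overflow; enumerating the partitions themselves at that size is out of the question, which is the point of the slide.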

Why HAL Can't Classify Either

The Goal, an optimal application-independent cluster analysis method, is mathematically impossible:
No free lunch theorem: every possible clustering method performs equally well on average over all possible substantive applications

Existing methods:
Many choices: model-based, subspace, spectral, grid-based, graph-based, fuzzy k-modes, affinity propagation, self-organizing maps, . . .
Well-defined statistical, data analytic, or machine learning foundations
How to add substantive knowledge: With few exceptions, who knows?!
The literature: little guidance on when methods apply
Deep problem in cluster analysis literature: no way to know which method will work ex ante

If Ex Ante doesn't work, try Ex Post

Methods and substance must be connected (no free lunch theorem)
The usual approach fails: hard to do it by understanding the model
We do it ex post (by qualitative choice). For example:
Create long list of clusterings; choose the best
Too hard for mere humans!
An organized list will make the search possible
E.g.: consider two clusterings that differ only because one document (of many) moves from category 5 to 6

Our Idea: Meaning Through Geography

We develop a (conceptual) geography of clusterings

A New Strategy
Make it easy to choose best clustering from millions of choices

1. Code text as numbers (in one or more of several ways)
2. Apply all clustering methods we can find to the data, each representing different (unstated) substantive assumptions (<15 mins) (Too much for a person to understand, but organization will help)
3. Develop an application-independent distance metric between clusterings, a metric space of clusterings, and a 2-D projection
4. Local cluster ensemble creates a new clustering at any point, based on weighted average of nearby clusterings
5. A new animated visualization to explore the space of clusterings (smoothly morphing from one into others)
6. Millions of clusterings, easily comprehended (takes about 10-15 minutes to choose a clustering with insight)
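
The "weighted average of nearby clusterings" idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each clustering is represented by its document co-membership matrix, the weights come from a Gaussian kernel on distance in the 2-D projection, and the names `comembership`, `local_ensemble`, and `bandwidth` are hypothetical. The final step of turning the averaged matrix back into a hard clustering is omitted.

```python
import math


def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def local_ensemble(point, positions, clusterings, bandwidth=1.0):
    """Kernel-weighted average of co-membership matrices around `point`.

    point       -- (x, y) location chosen in the 2-D space of clusterings
    positions   -- (x, y) location of each existing clustering
    clusterings -- one list of cluster labels per existing clustering
    bandwidth   -- assumed Gaussian kernel width (illustrative parameter)
    """
    weights = []
    for x, y in positions:
        sq_dist = (x - point[0]) ** 2 + (y - point[1]) ** 2
        weights.append(math.exp(-sq_dist / (2 * bandwidth ** 2)))
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so weights sum to 1

    n = len(clusterings[0])
    averaged = [[0.0] * n for _ in range(n)]
    for weight, labels in zip(weights, clusterings):
        matrix = comembership(labels)
        for i in range(n):
            for j in range(n):
                averaged[i][j] += weight * matrix[i][j]
    return averaged


if __name__ == "__main__":
    clusterings = [[0, 0, 1], [0, 1, 1]]
    positions = [(0.0, 0.0), (2.0, 0.0)]
    # A point halfway between the two clusterings weights them equally.
    for row in local_ensemble((1.0, 0.0), positions, clusterings):
        print(row)
```

Entries of the averaged matrix near 1 mark document pairs that nearby clusterings agree belong together; entries near 0 mark pairs they agree to separate.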

Many Thousands of Clusterings, Sorted & Organized


You choose one (or more), based on insight, discovery, useful information, . . .

[Figure: three panels titled "Cluster Solution 1", "Space of Cluster Solutions", and "Cluster Solution 2". The center panel is a 2-D map in which each point is labeled with a clustering method and distance measure (e.g., kmeans correlation, hclust canberra median, affprop cosine, kmedoids euclidean, mixvmf, som, spec_cos, mult_dirproc). The side panels show two selected clusterings of U.S. presidents (Roosevelt, Truman, Eisenhower, Kennedy, Johnson, Nixon, Ford, Carter, Reagan, HWBush, Clinton, Bush, Obama), with cluster labels "Other Presidents", "Reagan Republicans", "Roosevelt To Carter", and "Reagan To Obama".]

Application-Independent Distance Metric: Axioms

Metric based on 3 assumptions
1. Distance between clusterings: a function of the pairwise document agreements (pairwise agreements ⇒ triples, quadruples, etc.)
2. Invariance: Distance is invariant to the number of documents (for any fixed number of clusters)
3. Scale: the maximum distance is set to log(num clusters)

Only one measure satisfies all three (the variation of information)
Meila (2007): derives same metric using different axioms (lattice theory)
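
As a concrete reference point, the sketch below computes the variation of information between two clusterings using its standard definition, VI = H(A) + H(B) - 2 I(A; B), with natural logarithms and no rescaling. The function name and the label-list input format are assumptions made for illustration, and the scale convention stated on the slide may differ from this unnormalized form by a constant.

```python
import math
from collections import Counter


def variation_of_information(a, b):
    """Variation of information between two clusterings of the same documents.

    `a` and `b` are equal-length sequences; a[i] and b[i] are the cluster
    labels that each clustering assigns to document i.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("clusterings must be non-empty and of equal length")

    count_a = Counter(a)            # cluster sizes in clustering a
    count_b = Counter(b)            # cluster sizes in clustering b
    count_ab = Counter(zip(a, b))   # sizes of the pairwise intersections

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mutual_information = sum(
        (c / n) * math.log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )
    return entropy(count_a) + entropy(count_b) - 2 * mutual_information


if __name__ == "__main__":
    # Same partition with different label names: distance is zero.
    print(variation_of_information([0, 0, 1, 1], ["x", "x", "y", "y"]))
    # Two partitions that carry no information about each other.
    print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The measure depends only on which documents are grouped together, so renaming clusters leaves the distance unchanged.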


The Future of Political Science


100 Perspectives
Edited by Gary King, Harvard University, Kay Lehman Schlozman, Boston College and Norman H. Nie, Stanford University
"The list of authors in The Future of Political Science is a 'who's who' of political science. As I was reading it, I came to think of it as a platter of tasty hors d'oeuvres. It hooked me thoroughly." Peter Kingstone, University of Connecticut
"In this one-of-a-kind collection, an eclectic set of contributors offer short but forceful forecasts about the future of the discipline. The resulting assortment is captivating, consistently thought-provoking, often intriguing, and sure to spur discussion and debate." Wendy K. Tam Cho, University of Illinois at Urbana-Champaign
"King, Schlozman, and Nie have created a visionary and stimulating volume. The organization of the essays strikes me as nothing less than brilliant. . . It is truly a joy to read." Lawrence C. Dodd, Manning J. Dauer Eminent Scholar in Political Science, University of Florida

Available March 2009: 304pp Pb: 978-0-415-99701-0: $24.95

www.routledge.com/politics

Evaluators Rate Machine Choices Better Than Their Own

Scale: (1) unrelated, (2) loosely related, or (3) closely related
Table reports: mean(scale)

Pairs from             Overall Mean   Evaluator 1   Evaluator 2
Random Selection       1.38           1.16          1.60
Hand-Coded Clusters    1.58           1.48          1.68
Hand-Coding            2.06           1.88          2.24
Machine                2.24           2.08          2.40

p.s. The hand-coders did the evaluation!

Evaluating Performance

Goals:
Validate Claim: computer-assisted conceptualization outperforms human conceptualization
Demonstrate: new experimental designs for cluster evaluation
Inject human judgement: relying on insights from survey research

We now present three evaluations
Cluster Quality ⇒ RA coders
Informative discoveries ⇒ Experienced scholars analyzing texts
Discovery ⇒ You're the judge

Evaluation 1: Cluster Quality

What Are Humans Good For?


- They can't: keep many documents & clusters in their head
- They can: compare two documents at a time
- => Cluster quality evaluation: human judgement of document pairs

Experimental Design to Assess Cluster Quality

- automated visualization to choose one clustering
- many pairs of documents
- for coders: (1) unrelated, (2) loosely related, (3) closely related
- Quality = mean(within cluster) - mean(between clusters)
- Bias results against ourselves by not letting evaluators choose clustering
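The quality measure can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `cluster_quality`, the dictionary layout for coder ratings, and the toy data are all hypothetical. Ratings use the slide's 1-3 scale, and a pair counts as "within cluster" when both documents carry the same label in the clustering being evaluated.

```python
from statistics import mean

def cluster_quality(ratings, labels):
    """Mean coder rating of same-cluster pairs minus mean rating of different-cluster pairs.

    ratings: dict mapping (i, j) document-index pairs to a coder score
             (1 = unrelated, 2 = loosely related, 3 = closely related).
    labels:  sequence giving the cluster label of each document.
    """
    within = [s for (i, j), s in ratings.items() if labels[i] == labels[j]]
    between = [s for (i, j), s in ratings.items() if labels[i] != labels[j]]
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return mean(within) - mean(between)

# Hypothetical toy data: four documents in two clusters, four rated pairs.
labels = ["a", "a", "b", "b"]
ratings = {(0, 1): 3, (2, 3): 2, (0, 2): 1, (1, 3): 2}
quality = cluster_quality(ratings, labels)
```

A larger value means coders judged same-cluster pairs to be more closely related than pairs split across clusters.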

Evaluation 1: Cluster Quality

[Figure: cluster quality difference, (Our Method) - (Human Coders), one point per data set; axis ticks run from -0.3 to 0.3]

Lautenberg: 200 Senate press releases (appropriations, economy, education, tax, veterans, ...)

Policy Agendas: 213 quasi-sentences from Bush's State of the Union (agriculture, banking & commerce, civil rights/liberties, defense, ...)

Reuters: financial news (trade, earnings, copper, gold, coffee, ...); gold standard for supervised learning studies

Evaluation 2: More Informative Discoveries


- Found 2 scholars analyzing lots of textual data for their work
- Created 6 clusterings:
  - 2 clusterings selected with our method (biased against us)
  - 2 clusterings from each of 2 other methods (varying tuning parameters)
- Created info packet on each clustering (for each cluster: exemplar document, automated content summary)
- Asked for $\binom{6}{2} = 15$ pairwise comparisons
- User chooses, so we only care about the one clustering that wins
- Both cases a Condorcet winner:
  - Immigration: Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2
  - Genetic testing: Our Method 1 → {Our Method 2, K-Means 1, K-Means 2} → Dir Proc. 1 → Dir Proc. 2
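The pairwise design and the Condorcet criterion can be illustrated with a short sketch. It assumes, for illustration only, that the immigration ordering listed on the slide reads from most to least preferred; the helper `condorcet_winner` and the `prefers` callback are hypothetical names, not part of any published code.

```python
from itertools import combinations

def condorcet_winner(candidates, prefers):
    """Return the candidate that beats every other one head-to-head, or None.

    prefers(a, b) must return True when the evaluator prefers a to b.
    """
    for c in candidates:
        if all(prefers(c, other) for other in candidates if other != c):
            return c
    return None

clusterings = ["Our Method 1", "Our Method 2", "vMF 1", "vMF 2",
               "K-Means 1", "K-Means 2"]

# Every unordered pair of the 6 clusterings: 6 choose 2 = 15 comparisons.
pairs = list(combinations(clusterings, 2))

# Assumed reading of the slide: listed order = preference order (immigration case).
ranking = ["Our Method 1", "vMF 1", "vMF 2", "Our Method 2",
           "K-Means 1", "K-Means 2"]
rank = {name: pos for pos, name in enumerate(ranking)}
winner = condorcet_winner(clusterings, lambda a, b: rank[a] < rank[b])
```

A Condorcet winner need not exist in general: if the pairwise preferences cycle, the function returns `None`.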
Evaluation 3: What Do Members of Congress Do?

- David Mayhew's (1974) famous typology
  - Advertising
  - Credit Claiming
  - Position Taking
- Data: 200 press releases from Frank Lautenberg's office (D-NJ)
- Apply our method


Example Discovery

[Figure: space of clusterings. Each label marks the clustering produced by one existing method. Label families include mult_dirproc, mixvmf, affprop (cosine, manhattan, euclidean, maximum, info.costs), kmeans and kmedoids (by distance metric), hclust (by distance metric and linkage: single, complete, average, median, centroid, mcquitty, ward), spec_* and mspec_*, dist_*, divisive, som, rock, mec, biclust_spectral, sot_cor, sot_euc, clust_convex.]

Red point: a clustering by Affinity Propagation-Cosine (Dueck and Frey 2007)

Close to: Mixture of von Mises-Fisher distributions (Banerjee et al. 2005)

Space between methods: local cluster ensemble
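One way to picture a local cluster ensemble is as a weighted blend of nearby clusterings. The sketch below is an assumption-laden illustration, not the authors' algorithm: it represents each clustering by a document-by-document co-membership matrix and averages those matrices with weights. The function names, the three toy clusterings, and the weights are all hypothetical, and the final step of turning the blended similarity matrix back into a clustering is omitted.

```python
def comembership(labels):
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

def local_ensemble_similarity(clusterings, weights):
    """Weighted average of co-membership matrices (weights are normalized to sum to 1)."""
    total = sum(weights)
    n = len(clusterings[0])
    sim = [[0.0] * n for _ in range(n)]
    for labels, w in zip(clusterings, weights):
        m = comembership(labels)
        for i in range(n):
            for j in range(n):
                sim[i][j] += (w / total) * m[i][j]
    return sim

# Hypothetical example: three clusterings of four documents.
clusterings = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
weights = [0.5, 0.3, 0.2]  # illustrative weights, not values from the slides
S = local_ensemble_similarity(clusterings, weights)
```

Entries of `S` near 1 mark document pairs that the heavily weighted clusterings agree belong together. Any similarity-based clustering method could then be applied to `S` to obtain a new clustering at that point in the space.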

Found a region with particularly insightful clusterings

Mixture:
- 0.39 Hclust-Canberra-McQuitty
- 0.30 Spectral clustering Random Walk (Metrics 1-6)

Example Discovery
mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust binary complete hclust correlationmixvmfVA mcquitty affprop cosine hclust pearson mcquitty hclust pearson average hclust correlation complete hclust pearson complete hclust correlation average hclust binary average hclust binary mcquitty spec_man spec_cos spec_mink spec_euc spec_max mspec_mink spec_canb mspec_man mspec_max mspec_cos mspec_canb mspec_euc kmeans pearson

hclust pearson single hclust pearson median hclust correlation single mec hclust correlation median hclust binary single som hclustpearson correlation centroid rock hclust centroid hclust binary median hclust canberra single hclust spearman complete biclust_spectral affprop maximum hclust canberra kmeans kendall median kmeans spearman kmeans manhattan kmeans canberra hclust binary centroid hclust kendall single hclust spearman centroid hclust kendall centroid average average hclust spearman median kendall median hclust spearman single hclust kendall mcquitty hclust spearman mcquitty hclust kendall complete hclust canberra centroid kmedoids manhattan hclust manhattan centroid hclust manhattan median hclust manhattan average affprop manhattan hclust euclidean single hclust manhattan single hclust euclidean median divisive manhattan hclust maximum single hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust euclidean mcquitty kmedoids euclidean hclustmaximum maximum centroidaffprop euclidean hclust median divisive euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust maximum hclust manhattan complete mcquitty

Mixture:
0.39 Hclust-Canberra-McQuitty 0.30 Spectral clustering Random Walk (Metrics 1-6) 0.13 Hclust-Correlation-Ward

hclust canberra average

q
hclust correlation ward kmedoids hclust pearson wardstand.euc hclust canberra mcquitty

dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max dist_cos dismea hclust manhattan ward hclust canberra complete hclust binary ward

affprop info.costs

euclidean ward kmeanshclust euclidean sot_euc

spearman ward hclusthclust kendall ward hclust maximum ward kmeans maximum kmeans binary

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

18 / 23

Example Discovery
mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust binary complete hclust correlationmixvmfVA mcquitty affprop cosine hclust pearson mcquitty hclust pearson average hclust correlation complete hclust pearson complete hclust correlation average hclust binary average hclust binary mcquitty spec_man spec_cos spec_mink spec_euc spec_max mspec_mink spec_canb mspec_man mspec_max mspec_cos mspec_canb mspec_euc kmeans pearson

hclust pearson single hclust pearson median hclust correlation single mec hclust correlation median hclust binary single som hclustpearson correlation centroid rock hclust centroid hclust binary median hclust canberra single hclust spearman complete biclust_spectral affprop maximum hclust canberra kmeans kendall median kmeans spearman kmeans manhattan kmeans canberra hclust binary centroid hclust kendall single hclust spearman centroid hclust kendall centroid average average hclust spearman median kendall median hclust spearman single hclust kendall mcquitty hclust spearman mcquitty hclust kendall complete hclust canberra centroid kmedoids manhattan hclust manhattan centroid hclust manhattan median hclust manhattan average affprop manhattan hclust euclidean single hclust manhattan single hclust euclidean median divisive manhattan hclust maximum single hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust euclidean mcquitty kmedoids euclidean hclustmaximum maximum centroidaffprop euclidean hclust median divisive euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust maximum hclust manhattan complete mcquitty

Mixture:
0.39 Hclust-Canberra-McQuitty 0.30 Spectral clustering Random Walk (Metrics 1-6) 0.13 Hclust-Correlation-Ward 0.09 Hclust-Pearson-Ward

hclust canberra average

q
hclust correlation ward kmedoids hclust pearson wardstand.euc hclust canberra mcquitty

dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max dist_cos dismea hclust manhattan ward hclust canberra complete hclust binary ward

affprop info.costs

euclidean ward kmeanshclust euclidean sot_euc

spearman ward hclusthclust kendall ward hclust maximum ward kmeans maximum kmeans binary

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

18 / 23

Example Discovery
mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust binary complete hclust correlationmixvmfVA mcquitty affprop cosine hclust pearson mcquitty hclust pearson average hclust correlation complete hclust pearson complete hclust correlation average hclust binary average hclust binary mcquitty spec_man spec_cos spec_mink spec_euc spec_max mspec_mink spec_canb mspec_man mspec_max mspec_cos mspec_canb mspec_euc kmeans pearson

hclust pearson single hclust pearson median hclust correlation single mec hclust correlation median hclust binary single som hclustpearson correlation centroid rock hclust centroid hclust binary median hclust canberra single hclust spearman complete biclust_spectral affprop maximum hclust canberra kmeans kendall median kmeans spearman kmeans manhattan kmeans canberra hclust binary centroid hclust kendall single hclust spearman centroid hclust kendall centroid average average hclust spearman median kendall median hclust spearman single hclust kendall mcquitty hclust spearman mcquitty hclust kendall complete hclust canberra centroid kmedoids manhattan hclust manhattan centroid hclust manhattan median hclust manhattan average affprop manhattan hclust euclidean single hclust manhattan single hclust euclidean median divisive manhattan hclust maximum single hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust euclidean mcquitty kmedoids euclidean hclustmaximum maximum centroidaffprop euclidean hclust median divisive euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust maximum hclust manhattan complete mcquitty

Mixture:
0.39 Hclust-Canberra-McQuitty 0.30 Spectral clustering Random Walk (Metrics 1-6) 0.13 Hclust-Correlation-Ward 0.09 Hclust-Pearson-Ward 0.05 Kmediods-Cosine

hclust canberra average

q
hclust correlation ward kmedoids hclust pearson wardstand.euc hclust canberra mcquitty

dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max dist_cos dismea hclust manhattan ward hclust canberra complete hclust binary ward

affprop info.costs

euclidean ward kmeanshclust euclidean sot_euc

spearman ward hclusthclust kendall ward hclust maximum ward kmeans maximum kmeans binary

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

18 / 23

Example Discovery
mult_dirproc kmeans correlation hclust canberra ward sot_cor divisive stand.euc mixvmf hclust binary complete hclust correlationmixvmfVA mcquitty affprop cosine hclust pearson mcquitty hclust pearson average hclust correlation complete hclust pearson complete hclust correlation average hclust binary average hclust binary mcquitty spec_man spec_cos spec_mink spec_euc spec_max mspec_mink spec_canb mspec_man mspec_max mspec_cos mspec_canb mspec_euc kmeans pearson

hclust pearson single hclust pearson median hclust correlation single mec hclust correlation median hclust binary single som hclustpearson correlation centroid rock hclust centroid hclust binary median hclust canberra single hclust spearman complete biclust_spectral affprop maximum hclust canberra kmeans kendall median kmeans spearman kmeans manhattan kmeans canberra hclust binary centroid hclust kendall single hclust spearman centroid hclust kendall centroid average average hclust spearman median kendall median hclust spearman single hclust kendall mcquitty hclust spearman mcquitty hclust kendall complete hclust canberra centroid kmedoids manhattan hclust manhattan centroid hclust manhattan median hclust manhattan average affprop manhattan hclust euclidean single hclust manhattan single hclust euclidean median divisive manhattan hclust maximum single hclust euclidean centroid hclust euclidean average hclust manhattan mcquitty clust_convex hclust euclidean mcquitty kmedoids euclidean hclustmaximum maximum centroidaffprop euclidean hclust median divisive euclidean hclust maximum average hclust maximum complete hclust euclidean complete hclust maximum hclust manhattan complete mcquitty

Mixture:
0.39 Hclust-Canberra-McQuitty 0.30 Spectral clustering Random Walk (Metrics 1-6) 0.13 Hclust-Correlation-Ward 0.09 Hclust-Pearson-Ward 0.05 Kmediods-Cosine 0.04 Spectral clustering Symmetric (Metrics 1-6)

hclust canberra average

q
hclust correlation ward kmedoids hclust pearson wardstand.euc hclust canberra mcquitty

dist_ebinary dist_binary dist_fbinary dist_minkowski dist_canb dist_max dist_cos dismea hclust manhattan ward hclust canberra complete hclust binary ward

affprop info.costs

euclidean ward kmeanshclust euclidean sot_euc

spearman ward hclusthclust kendall ward hclust maximum ward kmeans maximum kmeans binary

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

18 / 23
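One way to read the mixture weights above is as a weighted average of the pairwise co-membership matrices of the listed methods. The following is a minimal sketch under that assumption, not necessarily the procedure used in the lecture. The weights come from the slide; the function names `comembership` and `mix_clusterings` and the toy cluster labels for five documents are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def comembership(labels: List[int]) -> Matrix:
    """n x n matrix with 1.0 where two documents share a cluster, else 0.0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mix_clusterings(clusterings: Dict[str, List[int]],
                    weights: Dict[str, float]) -> Matrix:
    """Weighted average of co-membership matrices (weights are normalized)."""
    total = sum(weights.values())
    n = len(next(iter(clusterings.values())))
    mixed = [[0.0] * n for _ in range(n)]
    for name, w in weights.items():
        c = comembership(clusterings[name])
        for i in range(n):
            for j in range(n):
                mixed[i][j] += (w / total) * c[i][j]
    return mixed


# Mixture weights as listed on the slide.
weights = {
    "Hclust-Canberra-McQuitty": 0.39,
    "Spectral clustering Random Walk": 0.30,
    "Hclust-Correlation-Ward": 0.13,
    "Hclust-Pearson-Ward": 0.09,
    "Kmedoids-Cosine": 0.05,
    "Spectral clustering Symmetric": 0.04,
}

# ASSUMPTION: made-up cluster labels for five documents, for illustration only.
clusterings = {
    "Hclust-Canberra-McQuitty": [0, 0, 1, 1, 2],
    "Spectral clustering Random Walk": [0, 0, 0, 1, 1],
    "Hclust-Correlation-Ward": [0, 1, 1, 2, 2],
    "Hclust-Pearson-Ward": [0, 0, 1, 1, 1],
    "Kmedoids-Cosine": [0, 1, 0, 1, 0],
    "Spectral clustering Symmetric": [0, 0, 1, 2, 2],
}

mixed = mix_clusterings(clusterings, weights)
```

Each entry `mixed[i][j]` is the total weight of the methods that place documents `i` and `j` in the same cluster. A final partition could then be obtained by running any standard clustering algorithm on this matrix; that step is left out of the sketch.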

Example Discovery

[Figure: map of the space of clusterings (one label per clustering method), with a panel "Clusters in this Clustering" plotting the documents as points. Cluster labels: "Credit Claiming, Pork", "Credit Claiming, Legislation", "Advertising". Annotation: "Mayhew".]

Clusters in this Clustering

- Credit Claiming, Pork: Sens. Frank R. Lautenberg (D-NJ) and Robert Menendez (D-NJ) announced that the U.S. Department of Commerce has awarded a $100,000 grant to the South Jersey Economic Development District
- Credit Claiming, Legislation: As the Senate begins its recess, Senator Frank Lautenberg today pointed to a string of victories in Congress on his legislative agenda during this work period
- Advertising: Senate Adopts Lautenberg/Menendez Resolution Honoring Spelling Bee Champion from New Jersey

Example Discovery: Partisan Taunting

[Figure: map of the space of clusterings (one label per clustering method), with the panel "Clusters in this Clustering" plotting the documents as points. Cluster labels: "Credit Claiming, Pork", "Credit Claiming, Legislation", "Advertising", "Partisan Taunting". Annotations: "Mayhew", "Taunting ruins deliberation".]

Clusters in this Clustering

- Partisan Taunting: Republicans Selling Out Nation on Chemical Plant Security
- Partisan Taunting: Senator Lautenberg's amendment would change the name of ...the Republican bill...to "More Tax Breaks for the Rich and More Debt for Our Grandchildren Deficit Expansion Reconciliation Act of 2006"

Definition: Explicit, public, and negative attacks on another political party or its members

Taunting ruins deliberation

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

- Senator Lautenberg Blasts Republicans as "Chicken Hawks" [Government Oversight]
- "The Scopes trial took place in 1925. Sadly, President Bush's veto today shows that we haven't progressed much since then" [Healthcare]
- "Every day the House Republicans dragged this out was a day that made our communities less safe." [Homeland Security]

Sen. Lautenberg on Senate Floor 4/29/04

Out of Sample Conrmation of Partisan Taunting


- Discovered using 200 press releases; 1 senator.

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

20 / 23

Out of Sample Conrmation of Partisan Taunting


- Discovered using 200 press releases; 1 senator. - Conrmed using 64,033 press releases; 301 senator-years.

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

20 / 23

Out of Sample Conrmation of Partisan Taunting


- Discovered using 200 press releases; 1 senator. - Conrmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure proportion of press releases a senator taunts other party

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

20 / 23

Out of Sample Conrmation of Partisan Taunting


- Discovered using 200 press releases; 1 senator. - Conrmed using 64,033 press releases; 301 senator-years. - Apply supervised learning method: measure proportion of press releases a senator taunts other party

Frequency

10

20

30

0.1

0.2

0.3 Prop. of Press Releases Taunting

0.4

0.5

Gary King (Harvard, IQSS)

Quantitative Discovery from Text

21 / 23

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.
- Confirmed using 64,033 press releases; 301 senator-years.
- Apply supervised learning method: measure the proportion of each senator's press releases that taunt the other party

[Figure: histogram titled "On Avg., Senators Taunt in 27% of Press Releases"; x-axis: Prop. of Press Releases Taunting (ticks 0.1 to 0.5); y-axis: Frequency (ticks 10 to 30)]
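The measurement step above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the study: it assumes a supervised classifier has already labeled each press release as taunting or not, and it only shows the aggregation into one proportion per senator-year and the average across senator-years. The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def taunting_proportions(records):
    """Return {senator_year: share of press releases labeled as taunting}.

    `records` is an iterable of (senator_year, is_taunting) pairs, where
    is_taunting is the (assumed) output of a supervised classifier.
    """
    counts = defaultdict(lambda: [0, 0])  # senator_year -> [taunting, total]
    for senator_year, is_taunting in records:
        counts[senator_year][0] += int(is_taunting)
        counts[senator_year][1] += 1
    return {key: taunting / total for key, (taunting, total) in counts.items()}


def mean_proportion(proportions):
    """Unweighted average of the per-senator-year proportions."""
    return sum(proportions.values()) / len(proportions)


# Hypothetical toy labels, invented for illustration only.
toy_records = [
    ("Senator A, 2005", True),
    ("Senator A, 2005", False),
    ("Senator A, 2005", False),
    ("Senator A, 2005", False),
    ("Senator B, 2005", True),
    ("Senator B, 2005", False),
]

props = taunting_proportions(toy_records)
average = mean_proportion(props)
```

A histogram of the values in `props` would correspond to the kind of plot described on the slide. The average here weights each senator-year equally, which is one possible reading of "On Avg."; weighting by the number of press releases would be another.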

Advancing the Objective of Discovery

1) Conceptualization: Qualitative Methods (reading!)
2) Measurement: Quantitative Methods
3) Validation: Quantitative Methods

Quantitative methods for conceptualization: aiding discovery

- Few formal methods designed explicitly for conceptualization
- Belittled: "Tom Swift and His Electric Factor Analysis Machine" (Armstrong 1967)
- Evaluation methods measure progress in discovery

For more information:

http://GKing.Harvard.edu
