
Frontiers of Computational Journalism
Columbia Journalism School
Week 5: Quantification and Statistics
October 13, 2016

This class

Quantification
Stats 101 Freestyle
Conditional probability
Analyzing discrimination

Quantification

Definition of data?

My Definition of data
a collection of related pieces of
recorded information

structured data

unstructured data

Where does data come from?

Data

Quantification
$\begin{bmatrix} x_1 & x_2 & x_3 & \cdots & x_N \end{bmatrix}^{T}$

Different types of quantitative data


Numeric
o continuous
o countable
o bounded?
o units of measurement?

Categorical
o finite, e.g. {on, off}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
o equivalence classes or other structure?

Different types of scales


Temperature
Continuous scale, fixed zero point, physical
units, comparative, uniform

Likert Scale
Discrete scale, no fixed origin, abstract units,
comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale


It's not linear, so is 2*X1 twice as good?
(X1+c) (X2+c) X1 X2
Lots of things don't make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
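A minimal sketch of why rank-order statistics hold up better than averages on a non-uniform scale. The two groups of Likert responses and the alternative coding (which stretches the top category from 5 to 10) are hypothetical, chosen only so that both codings keep the same order of answers.

```python
from statistics import mean, median

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 1, 5, 5]

# Hypothetical alternative coding: same ordering of categories,
# but the gap between "4" and "5" is stretched.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def recoded(responses):
    """Apply the order-preserving recoding to every response."""
    return [recode[r] for r in responses]


def a_beats_b(stat, a, b):
    """True if group A scores higher than group B under the given statistic."""
    return stat(a) > stat(b)


# Does the A-versus-B comparison change when only the coding changes?
mean_flips = a_beats_b(mean, group_a, group_b) != a_beats_b(
    mean, recoded(group_a), recoded(group_b)
)
median_flips = a_beats_b(median, group_a, group_b) != a_beats_b(
    median, recoded(group_a), recoded(group_b)
)

print("mean comparison flips:", mean_flips)
print("median comparison flips:", median_flips)
```

In this example the comparison of means reverses under the recoding while the comparison of medians does not, because a median depends only on the order of the responses.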

Issues with quantitative data
Where did the data come from?
o physical measurement
o computer logging
o human recording

What are the sources of error?
o measurement error
o missing data
o ambiguity in human classification
o process errors
o intentional bias / deception

Other things that are tricky to quantify, but quantified anyway

Intelligence
Academic performance
Gender
Race, ethnicity, nationality
Number of sexual harassment incidents
Income
Political Ideology
...

Interview the Data

Who created this data?


What is this data supposed to count?
How was this data actually collected?
Does it really count what it's supposed to?
For what purpose was this data collected?
How do we know it is complete?
If the data was collected from people, who was
asked and how?

Interview the Data


Who is going to look bad or lose money because of
this data?
Is the data consistent with other sources?
Is the data consistent from day to day, or when
collected by different people?
Who has already analyzed it?
Are there multiple versions?
Does this data have known problems?

GDP = C + I + G + (X - M)

It looks like Lucknow and Kanpur have few traffic accidents,
but deaths data suggests that accidents are not being counted.

Evaluating Data Quality


Internal validity: check the data against itself
o row counts (e.g. all 50 states?)
o histograms
o do the numbers add up?

External validity: compare the data to something else
o alternate data sources
o expert knowledge
o previous versions
o common sense!
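Two of the internal validity checks above (row counts, and whether the numbers add up) can be sketched in a few lines. The rows, field names, and expected set of states here are all hypothetical, made up only to show the shape of the checks.

```python
# Hypothetical expected coverage and hypothetical data rows.
EXPECTED_STATES = {"NY", "NJ", "CT"}

rows = [
    {"state": "NY", "male": 10, "female": 12, "total": 22},
    {"state": "NJ", "male": 5, "female": 7, "total": 13},  # parts sum to 12, not 13
]


def missing_keys(rows, expected, key="state"):
    """Row-count check: which expected entries never appear in the data?"""
    seen = {row[key] for row in rows}
    return sorted(expected - seen)


def rows_not_adding_up(rows, parts=("male", "female"), total="total"):
    """Arithmetic check: rows whose parts do not sum to the reported total."""
    return [row for row in rows if sum(row[p] for p in parts) != row[total]]


print("missing:", missing_keys(rows, EXPECTED_STATES))
print("bad totals:", [row["state"] for row in rows_not_adding_up(rows)])
```

Checks like these do not say which number is wrong, only that the data disagrees with itself and needs a follow-up question.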

Stats 101 Freestyle

If we could repeat the survey many times,
we could just look at the distribution of average values
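On a computer we can actually repeat the survey many times. The sketch below simulates it under stated assumptions: a yes/no question, a hypothetical true share of 40% "yes" in the population, 100 respondents per survey, and 2000 repeated surveys. None of these numbers come from the slides.

```python
import random
from statistics import mean, stdev

TRUE_SHARE = 0.4    # hypothetical true fraction of "yes" answers
SURVEY_SIZE = 100   # hypothetical respondents per survey
N_SURVEYS = 2000    # hypothetical number of repeated surveys

rng = random.Random(0)  # fixed seed so the run is repeatable


def run_survey():
    """Ask SURVEY_SIZE random people; return the average answer (1 = yes, 0 = no)."""
    answers = [1 if rng.random() < TRUE_SHARE else 0 for _ in range(SURVEY_SIZE)]
    return mean(answers)


# The distribution of average values across repeated surveys.
averages = [run_survey() for _ in range(N_SURVEYS)]
center = mean(averages)
spread = stdev(averages)

print(f"mean of survey averages:    {center:.3f}")
print(f"std dev of survey averages: {spread:.3f}")
```

The spread of these averages is what a margin of error tries to estimate when only one real survey is available.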

Same data, different meaning

More than one true story


Conditional Probability

Taxi Accidents
Imagine you live in a city where one in every ten rides
ends in an accident, and last year there were
- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs
Which taxi company is more dangerous?
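The accident counts alone cannot answer this; what matters is accidents per ride for each company. The sketch below assumes two hypothetical splits of 1000 total rides (so the overall rate stays at one in ten) to show how the answer depends on how many rides each company gives. The ride counts are assumptions, not figures from the problem statement.

```python
ACCIDENTS = {"yellow": 75, "blue": 25}  # from the problem statement

# HYPOTHETICAL ride counts; each scenario totals 1000 rides,
# so the overall accident rate is 100 / 1000 = 0.10 in both.
SCENARIOS = {
    "yellow gives most rides": {"yellow": 900, "blue": 100},
    "rides split evenly": {"yellow": 500, "blue": 500},
}


def accident_rates(accidents, rides):
    """Pr(Accident | company) = accidents for that company / rides by that company."""
    return {company: accidents[company] / rides[company] for company in accidents}


results = {}
for name, rides in SCENARIOS.items():
    rates = accident_rates(ACCIDENTS, rides)
    results[name] = rates
    more_dangerous = max(rates, key=rates.get)
    print(f"{name}: {rates} -> more dangerous: {more_dangerous}")
```

With the same 75 and 25 accidents, blue comes out more dangerous in the first scenario and yellow in the second.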

[Figure: all rides divided into Accident / No accident and into Blue / Yellow cabs]

Pr(Accident) = 0.10
Pr(Yellow) = 0.6
Pr(Accident|Blue) = 0.6

Definition
Pr(B|A) = Pr(AB)/Pr(A)

Bayes Theorem
Tells us how to go from Pr(A|B) to Pr(B|A)
Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)
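For reference, the theorem follows in two steps from the definition of conditional probability, assuming Pr(A) and Pr(B) are both nonzero:

```latex
\begin{align*}
\Pr(AB) &= \Pr(B \mid A)\,\Pr(A)
  && \text{(definition, rearranged)} \\
\Pr(AB) &= \Pr(A \mid B)\,\Pr(B)
  && \text{(same definition with $A$ and $B$ swapped)} \\
\Pr(B \mid A) &= \frac{\Pr(A \mid B)\,\Pr(B)}{\Pr(A)}
  && \text{(equate the two and divide by $\Pr(A)$)}
\end{align*}
```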

Mammograms and Cancer


Suppose I tell you:
o 14 of 1000 women under 50 have breast cancer
o If a woman has cancer, a mammogram is positive 75% of the time
o If a woman does not have cancer, a mammogram is positive 10% of the time

If a woman has a positive mammogram, how likely is she to have cancer?

From The Signal and the Noise, Nate Silver

[Figure: population divided into cancer / no cancer and positive / negative]

Pr(positive|cancer) = 0.75
= N(positive & cancer) / N(cancer)
N(cancer) = 4
N(positive & cancer) = 3

Pr(positive|no cancer) = 0.1
= N(positive & no cancer) / N(no cancer)
N(no cancer) = 1000
N(positive & no cancer) = 100

Pr(cancer) = 0.014
= N(cancer) / N

Pr(cancer|positive)
= 9.6%

Conditional probabilities
Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%
What is Pr(cancer|positive)?

Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014
Pr(positive) =
Pr(positive|no cancer)Pr(no cancer) +
Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091

Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
= (0.75 * 0.014) / (0.1091)
= 0.0962
= 9.6% chance she has cancer
if mammogram is positive
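The same arithmetic as a short script, as a check on the numbers above. The function and parameter names are illustrative only.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """Bayes' theorem for a yes/no hypothesis H and a positive test result.

    prior             -- Pr(H)
    p_pos_given_h     -- Pr(positive | H)
    p_pos_given_not_h -- Pr(positive | not H)
    """
    # Total probability of a positive result, with or without H.
    p_pos = p_pos_given_h * prior + p_pos_given_not_h * (1 - prior)
    return p_pos_given_h * prior / p_pos


p_cancer_given_positive = posterior(
    prior=0.014,             # 14 of 1000 women under 50
    p_pos_given_h=0.75,      # Pr(positive | cancer)
    p_pos_given_not_h=0.10,  # Pr(positive | no cancer)
)
print(round(p_cancer_given_positive, 4))  # 0.0962
```

The result is small because positives from the large no-cancer group far outnumber positives from the small cancer group.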

Bayes learns from evidence


Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
or

Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Pr(H|E): How likely is H given evidence E?
Pr(E|H): Likelihood (model of H). Probability of seeing E if H is true.
Pr(H): Prior. How likely was H to begin with?
Pr(E): Model of E. How commonly do we see E at all?
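One way to see the "learning" is to feed each result back in as the next prior. The sketch below reuses the mammogram numbers and assumes, purely for illustration, a second positive test that is independent of the first given the woman's true status. That independence is an assumption of this example, not a claim about real mammograms.

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayesian update: return Pr(H|E) from Pr(H), Pr(E|H) and Pr(E|not H)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)  # Pr(E)
    return p_e_given_h * prior / p_e


belief = 0.014  # prior: Pr(cancer) before any test
history = [belief]

# ASSUMPTION: two positive results, conditionally independent given true status.
for _ in range(2):
    belief = update(belief, p_e_given_h=0.75, p_e_given_not_h=0.10)
    history.append(belief)

for n_positives, b in enumerate(history):
    print(f"after {n_positives} positive test(s): Pr(cancer) = {b:.3f}")
```

Each piece of evidence moves the estimate, and yesterday's posterior becomes today's prior.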

Analyzing Discrimination
