Computational Journalism 2016 Week 5: Quantification and Statistics

Frontiers of
Computational Journalism
Columbia Journalism School
Week 5: Quantification and Statistics
October 13, 2016
This class
Quantification
Stats 101 Freestyle
Conditional probability
Analyzing discrimination
Quantification
Definition of data?
My Definition of data
a collection of related pieces of
recorded information
structured data
unstructured data
Where does data come from?
Data
Quantification
(x1, x2, x3, ..., xN)
Different types of quantitative data

Numeric
o continuous
o countable
o bounded?
o units of measurement?
Categorical
o finite, e.g. {on, off}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
o equivalence classes or other structure?
Different types of scales

Temperature
Continuous scale, fixed zero point, physical units, comparative, uniform
Likert Scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform
Likert scales are non-uniform
No averages on a non-uniform scale

It's not linear, so is 2X1 twice as good?
(X1+c) / (X2+c) ≠ X1 / X2
Lots of things don't make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
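
A minimal sketch of why averages are fragile on a non-uniform scale while rank order statistics hold up. The two response groups and the "stretched" coding are made-up illustrations, not data from the lecture; both codings keep the answers in the same order, and nothing in a Likert survey says which spacing is the right one.

```python
from statistics import mean, median

# Hypothetical 5-point Likert responses from two groups (made-up data).
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 2, 5, 5]

# Two numeric codings that both preserve the order of the answers.
linear = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def summarize(responses, coding):
    """Return (mean, median) of the responses under a given numeric coding."""
    coded = [coding[r] for r in responses]
    return mean(coded), median(coded)


for name, coding in [("linear", linear), ("stretched", stretched)]:
    mean_a, median_a = summarize(group_a, coding)
    mean_b, median_b = summarize(group_b, coding)
    print(f"{name:9s} mean A={mean_a} B={mean_b} | median A={median_a} B={median_b}")
```

In this example group A has the higher mean under the linear coding and the lower mean under the stretched one, while the comparison of medians comes out the same under both.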
Issues with quantitative data
Where did the data come from?
o physical measurement
o computer logging
o human recording
What are the sources of error?

o measurement error
o missing data
o ambiguity in human classification
o process errors
o intentional bias / deception
Other things that are tricky to quantify, but quantified anyway
Intelligence
Academic performance
Gender
Race, ethnicity, nationality
Number of sexual harassment incidents
Income
Political Ideology
...
Interview the Data
Who created this data?

What is this data supposed to count?
How was this data actually collected?
Does it really count what it's supposed to?
For what purpose was this data collected?
How do we know it is complete?
If the data was collected from people, who was
asked and how?
Interview the Data

Who is going to look bad or lose money because of
this data?
Is the data consistent with other sources?
Is the data consistent from day to day, or when
collected by different people?
Who has already analyzed it?
Are there multiple versions?
Does this data have known problems?
GDP = C + I + G + (X - M)
It looks like Lucknow and Kanpur have few traffic accidents, but deaths data suggests that accidents are not being counted.
Evaluating Data Quality

Internal validity: check the data against itself
row counts (e.g. all 50 states?)
histograms
do the numbers add up?
External validity: compare the data to something else.
alternate data sources
expert knowledge
previous versions
common sense!
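
The internal validity checks listed above could be sketched roughly as follows. The records, field names (`men`, `women`, `total`) and the helper functions are hypothetical, chosen only to illustrate row counts, a crude histogram, and checking whether the numbers add up.

```python
from collections import Counter

# Hypothetical records; field names and numbers are made up for illustration.
rows = [
    {"state": "NY", "men": 40, "women": 60, "total": 100},
    {"state": "NJ", "men": 30, "women": 30, "total": 60},
    {"state": "CT", "men": 25, "women": 20, "total": 50},  # parts sum to 45
]


def internal_checks(rows, expected_rows):
    """Return a list of human-readable problems found in the data."""
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        if row["men"] + row["women"] != row["total"]:
            problems.append(f"{row['state']}: parts do not add up to total")
    return problems


def text_histogram(values, bin_width):
    """Crude histogram: count values per bin and draw one '#' per value."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [f"{start:>4}-{start + bin_width - 1:<4} {'#' * bins[start]}"
            for start in sorted(bins)]


# All 50 states? Do the numbers add up?
for problem in internal_checks(rows, expected_rows=50):
    print(problem)
print("\n".join(text_histogram([r["total"] for r in rows], bin_width=50)))
```

External validity cannot be automated the same way: it needs a second source, an expert, or a previous version to compare against.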
Stats 101 Freestyle
If we could repeat the survey many times, we could just look at the distribution of average values
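
A small simulation of that thought experiment, assuming a yes/no survey question. The true support level, sample size, and number of repeats are invented for illustration; the point is only that the survey average itself varies from one repetition to the next.

```python
import random
from statistics import mean, stdev

TRUE_SUPPORT = 0.52   # assumed "true" share answering yes
SAMPLE_SIZE = 500     # assumed respondents per survey
REPEATS = 1000        # number of imaginary repeated surveys

rng = random.Random(0)  # fixed seed so the run is repeatable


def run_survey(rng, p, n):
    """Simulate one survey: average of n yes/no answers with Pr(yes) = p."""
    return mean(1 if rng.random() < p else 0 for _ in range(n))


averages = [run_survey(rng, TRUE_SUPPORT, SAMPLE_SIZE) for _ in range(REPEATS)]

print("mean of survey averages:", round(mean(averages), 3))
print("spread (std dev) of survey averages:", round(stdev(averages), 3))
print("range:", min(averages), "to", max(averages))
```

The spread of `averages` is what a margin of error tries to describe when only one real survey is available.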
Same data, different meaning
More than one true story
Conditional Probability
Taxi Accidents
Imagine you live in a city where one in every ten rides
ends in an accident, and last year there were
- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs
Which taxi company is more dangerous?
[Diagrams: all rides split into Accident / No accident and into Blue / Yellow]
Pr(Accident) = 0.10
Pr(Yellow) = 0.6
Pr(Accident|Blue) = 25/400 ≈ 0.06
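
A sketch of the taxi comparison, to show that raw accident counts are not enough and that the number of rides per company is the missing denominator. It assumes yellow cabs give 60% of all rides (the slide's 0.6) and infers 1,000 total rides from 100 accidents at one accident per ten rides; the function name is made up for illustration.

```python
def accident_rate(accidents, rides):
    """Pr(Accident | company) = that company's accidents / that company's rides."""
    return accidents / rides


accidents = {"yellow": 75, "blue": 25}

# One in ten rides ends in an accident, so 100 accidents imply 1,000 rides.
total_rides = sum(accidents.values()) * 10

# Assumption: yellow cabs give 60% of all rides, blue cabs the other 40%.
share_of_rides = {"yellow": 0.6, "blue": 0.4}

rates = {}
for company, n_accidents in accidents.items():
    rides = share_of_rides[company] * total_rides
    rates[company] = accident_rate(n_accidents, rides)
    print(f"{company}: {n_accidents} accidents in {rides:.0f} rides "
          f"-> {rates[company]:.4f}")
```

With a different share of rides the ranking can change, which is why the counts alone cannot say which company is more dangerous.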
Definition
Pr(B|A) = Pr(AB)/Pr(A)
Bayes' Theorem
Tells us how to go from Pr(A|B) to Pr(B|A)
Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)
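
For readers who want the missing step, Bayes' theorem follows from applying the definition of conditional probability in both directions. A short derivation, assuming Pr(A) > 0 and Pr(B) > 0:

```latex
\begin{align*}
\Pr(AB) &= \Pr(B \mid A)\,\Pr(A) && \text{definition, rearranged} \\
\Pr(AB) &= \Pr(A \mid B)\,\Pr(B) && \text{same definition with } A \text{ and } B \text{ swapped} \\
\Pr(B \mid A)\,\Pr(A) &= \Pr(A \mid B)\,\Pr(B) \\
\Pr(B \mid A) &= \frac{\Pr(A \mid B)\,\Pr(B)}{\Pr(A)}
\end{align*}
```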
Mammograms and Cancer

Suppose I tell you:
14 of 1000 women under 50 have breast cancer
If a woman has cancer, a mammogram is positive 75% of the time
If a woman does not have cancer, a mammogram is positive 10% of the time
If a woman has a positive mammogram, how likely is she to have cancer?
From The Signal and the Noise, Nate Silver
[Diagrams: population split into cancer / no cancer and into positive / negative]
Pr(positive|cancer) = 0.75
= N(positive & cancer) / N(cancer)
N(cancer) = 4
N(positive & cancer) = 3
Pr(positive|no cancer) = 0.1
= N(positive & no cancer) / N(no cancer)
N(no cancer) = 1000
N(positive & no cancer) = 100
Pr(cancer) = 0.014
= N(cancer) / N
Pr(cancer|positive)
= 9.6%
Conditional probabilities
Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%
What is Pr(cancer|positive)?
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014
Pr(positive) =
Pr(positive|no cancer)Pr(no cancer) + Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
= (0.75 * 0.014) / (0.1091)
= 0.0962
= 9.6% chance she has cancer
if mammogram is positive
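
The same calculation as a short sketch in code, using only the three numbers given on the slides. The function and variable names are illustrative choices, not part of the lecture.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: Pr(H|E) from Pr(H), Pr(E|H) and Pr(E|not H)."""
    # Total probability of seeing the evidence at all, Pr(E).
    p_evidence = (p_evidence_if_true * prior
                  + p_evidence_if_false * (1 - prior))
    return p_evidence_if_true * prior / p_evidence


p_cancer = 0.014               # 14 of 1000 women
p_pos_given_cancer = 0.75      # Pr(positive|cancer)
p_pos_given_no_cancer = 0.10   # Pr(positive|no cancer)

p_cancer_given_pos = posterior(p_cancer, p_pos_given_cancer, p_pos_given_no_cancer)
print(f"Pr(cancer|positive) = {p_cancer_given_pos:.4f}")  # 0.0962
```

Changing `p_cancer` shows how strongly the answer depends on how rare the condition is to begin with.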
Bayes learns from evidence

Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
or
Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Pr(H|E): How likely is H given evidence E?
Pr(E|H): Likelihood (model of H). Probability of seeing E if H is true
Pr(H): Prior. How likely was H to begin with?
Pr(E): Model of E. How commonly do we see E at all?
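
One way to see "learning from evidence" is to feed each posterior back in as the next prior. The sketch below reuses the mammogram numbers and makes a strong simplifying assumption that is not in the lecture: repeated positive tests are treated as independent given the woman's true status, which real repeat mammograms are unlikely to be.

```python
def update(prior, p_e_if_h, p_e_if_not_h):
    """One Bayesian update: return Pr(H|E) given Pr(H), Pr(E|H), Pr(E|not H)."""
    p_e = p_e_if_h * prior + p_e_if_not_h * (1 - prior)
    return p_e_if_h * prior / p_e


belief = 0.014        # prior: Pr(cancer) before any test
history = [belief]

# ASSUMPTION: each positive result is independent evidence (unrealistic
# for repeat mammograms; used here only to illustrate updating).
for _ in range(3):
    belief = update(belief, p_e_if_h=0.75, p_e_if_not_h=0.10)
    history.append(belief)

for n_positives, p in enumerate(history):
    print(f"after {n_positives} positive tests: Pr(cancer) = {p:.3f}")
```

Each piece of evidence multiplies the current belief by Pr(E|H)/Pr(E), so the prior matters most when evidence is scarce.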
Analyzing Discrimination
