
Frontiers of Computational Journalism
Columbia Journalism School
Week 5: Quantification and Probability

October 6, 2017
This class

Quantification and Data Quality


What is statistics for?
Models made of Counting
Interpretation Generally
Quantification and Data Quality
Definition of data?
My definition of data:

A collection of similar pieces of recorded information


Structured data
Unstructured data
Quantification

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]
Different types of counting
Numeric
o Continuous or discrete
o Units of measurement?
o Non-linear scales?

Categorical
o finite, e.g. {true, false}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
Choices about what to count
GDP = C + I + G + (X - M)
1940 U.S. census enumerator
instructions
2010 U.S. census race and
ethnicity questions
Some things that are tricky to quantify,
but usefully quantified anyway
Intelligence
Academic performance
Race, ethnicity, nationality, gender
Number of incidents of some type
Income
Political Ideology
Intentional or unintentional problems
It looks like Lucknow and Kanpur have few traffic accidents, but
deaths data suggests that accidents are not being counted.
Evaluating Data Quality
Internal validity: check the data against itself
row counts (e.g. all 50 states?)
related data
histograms
do the numbers add up?

External validity: compare the data to something else.


alternate data sources
expert knowledge
previous versions
common sense!
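
Two of the internal checks above (row counts and "do the numbers add up?") are easy to automate. The following is a minimal sketch, not a complete validator: the function name `internal_checks`, the field names, and the sample records are all made up for illustration.

```python
from collections import Counter

def internal_checks(rows, expected_keys, key_field, parts_fields, total_field):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    seen = Counter(row[key_field] for row in rows)
    # Row counts: is every expected unit present, and present only once?
    for key in sorted(set(expected_keys) - set(seen)):
        problems.append(f"missing row: {key}")
    for key, n in sorted(seen.items()):
        if n > 1:
            problems.append(f"duplicate row: {key} appears {n} times")
    # Do the numbers add up? Compare the sum of the parts to the stated total.
    for row in rows:
        parts = sum(row[f] for f in parts_fields)
        if parts != row[total_field]:
            problems.append(
                f"{row[key_field]}: parts sum to {parts}, total says {row[total_field]}"
            )
    return problems

# Made-up records, for illustration only.
rows = [
    {"state": "NY", "urban": 10, "rural": 5, "total": 15},
    {"state": "NJ", "urban": 7, "rural": 2, "total": 10},
]
problems = internal_checks(rows, ["NY", "NJ", "CT"], "state", ["urban", "rural"], "total")
for p in problems:
    print(p)
```

A clean result only means the data is consistent with itself; external checks are still needed.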
Interview the Data

Who created this data?


What is this data supposed to count?
How was this data actually collected?
Does it really count what it's supposed to?
For what purpose was this data collected?
How do we know it is complete?
If the data was collected from people, who was
asked and how?
Interview the Data

Who is going to look bad or lose money because of this data?
Is the data consistent with other sources?
Is the data consistent from day to day, or when
collected by different people?
Who has already analyzed it?
Are there multiple versions?
Does this data have known problems?
What is statistics for?
Description
Explanation
Prediction
Models made of Counting
The Simplest Model:
Counting One Thing
P(Yellow) = 0.6

Blue Yellow
The Second Simplest model:
Counting Two Things
[Figure: grid split into Accident and No accident]

Pr(Accident) = 0.15

[Figure: 2x2 grid, rows Accident and No accident, columns Blue and Yellow]
Deadly Force in Black and White, ProPublica 10/10/2014
Relative risk (risk ratio)
AP Clinton Foundation Story
At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State
Department donated to her family charity or pledged commitments to
its international programs, according to a review of State Department
calendars.

odds
AP Clinton Foundation Story

odds

Not enough information to compute the odds ratio... which you can tell immediately because four values are required.
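
The AP figures describe one group only (85 of the 154 people donated), so they give odds but not an odds ratio. A small sketch of both calculations follows; the function names are illustrative, and the story supplies no comparison group to feed into `odds_ratio`.

```python
from fractions import Fraction

def odds(with_property, total):
    """Odds in favor: count with the property over count without it."""
    return Fraction(with_property, total - with_property)

def odds_ratio(a, b, c, d):
    """Odds ratio of two groups: (a/b) / (c/d).

    a, b = with / without the property in group 1
    c, d = with / without the property in group 2
    """
    return Fraction(a, b) / Fraction(c, d)

# From the AP story: 85 donors out of 154 people, so 69 non-donors.
donor_odds = odds(85, 154)
print(donor_odds, round(float(donor_odds), 2))
```

The odds come out to 85 to 69, or about 1.23 to 1. Calling `odds_ratio` would need the same two counts for a second group, for example people who did not get a meeting.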
P(Accident|Blue) = 0.1

[Figure: 2x2 grid, rows Accident and No accident, columns Blue and Yellow]
Conditional Probability

Pr(B|A) = Pr(AB)/Pr(A)
Relative risk as conditional probability

N = a+b+c+d
N(disease) = a+c
N(no disease) = b+d

Pr(disease) = (a+c) / (a+b+c+d)


Pr(disease|smoker) = a / (a+b)
Pr(disease|non-smoker) = c / (c+d)

RR = Pr(disease|smoker)/Pr(disease|non-smoker) = (a/(a+b)) / (c/(c+d))
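
As a sketch, the relative risk formula maps directly onto a function of the four cell counts. The counts passed in below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def relative_risk(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a = smokers with the disease,     b = smokers without it
    c = non-smokers with the disease, d = non-smokers without it
    """
    risk_smoker = a / (a + b)          # Pr(disease|smoker)
    risk_non_smoker = c / (c + d)      # Pr(disease|non-smoker)
    return risk_smoker / risk_non_smoker

# Hypothetical counts: 50 of 100 smokers and 25 of 100 non-smokers are sick.
rr = relative_risk(a=50, b=50, c=25, d=75)
print(rr)
```

A value above 1 means the disease is more common among smokers in the table; a value of 1 means no difference.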


Predicting Cats
Predicting Recidivism

How We Analyzed the COMPAS Recidivism Algorithm, ProPublica, 2016
Confusion Matrix
ROC curve (for adjustable thresholds)
Base Rates - Taxi Accidents
Imagine you live in a city where 15% of all rides end in
an accident, and last year there were

- 75 accidents involving yellow cabs


- 25 accidents involving blue cabs

Which taxi company is more dangerous?


Base rate
We know

P(accident) = 0.15
P(blue|accident) = 0.25
P(yellow|accident) = 0.75

We do not know the base rate:

P(yellow)

or equivalently

N(yellow)
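
To see why the base rate matters, the sketch below tries several assumed values for the share of rides taken in yellow cabs. The shares are hypothetical, since that is exactly the number the problem does not give; the function name is also illustrative.

```python
def accident_rates(yellow_share, yellow_accidents=75, blue_accidents=25,
                   p_accident=0.15):
    """Return (P(accident|yellow), P(accident|blue)) for an ASSUMED yellow share."""
    total_accidents = yellow_accidents + blue_accidents
    total_rides = total_accidents / p_accident      # 15% of rides end in an accident
    yellow_rides = yellow_share * total_rides
    blue_rides = (1 - yellow_share) * total_rides
    return yellow_accidents / yellow_rides, blue_accidents / blue_rides

# Hypothetical values of P(yellow); the slide does not tell us which is true.
for share in (0.5, 0.75, 0.9):
    p_yellow, p_blue = accident_rates(share)
    print(f"P(yellow)={share:.2f}  "
          f"P(accident|yellow)={p_yellow:.3f}  P(accident|blue)={p_blue:.3f}")
```

With half the rides in yellow cabs, yellow looks three times as dangerous; with 90% of rides in yellow cabs, blue does. The same accident counts support opposite conclusions until the base rate is known.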
Evidence and Conditional Probability

Hypothesis H = Alice has a cold


Evidence E = we just saw her cough
Alice is coughing. Does she have a cold?

Most people with colds cough

P(coughing|cold) = 0.9
P(A|B) ≠ P(B|A)
Most people with colds cough

P(coughing|cold) = 0.9

but we want

P(cold | coughing)
Bayes Theorem

Tells us how to go from Pr(A|B) to Pr(B|A)

Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)


Alice is coughing. Does she have a cold?
Prior P(H) = 0.05 (5% of our friends have a cold)
Likelihood P(E|H) = 0.9 (most people with colds cough)
Base rate P(E) = 0.1 (10% of everyone coughs today)

P(H|E) = P(E|H)P(H)/P(E)
= 0.9 * 0.05 / 0.1
= 0.45

If you believe your initial probability estimates, you should now believe there's a 45% chance she has a cold.
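
The same calculation as a short sketch; the function and argument names are illustrative, and the inputs are the three estimates from the slide.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_cold = posterior(
    prior=0.05,       # P(H): 5% of our friends have a cold
    likelihood=0.9,   # P(E|H): most people with colds cough
    evidence=0.1,     # P(E): 10% of everyone coughs today
)
print(round(p_cold, 2))
```

Changing any one input shows how sensitive the answer is: if 20% of everyone coughs today, the same cough is only half as informative.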
Evidence
Information that justifies a belief.

Presented with evidence E for X, we should believe X "more."

In terms of probability, P(X|E) > P(X)


Bayes learns from evidence
Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
or

P(H|E) = Pr(E|H)/Pr(E) * Pr(H)

Posterior Pr(H|E): How likely is H given evidence E?
Likelihood Pr(E|H): Probability of seeing E if H is true
Base rate Pr(E): How commonly do we see E at all?
Prior Pr(H): How likely was H to begin with?
Bayes Theorem - Diagnostic tests
Suppose I tell you:

14 of 1000 women under 50 have breast cancer


If a woman has cancer, a mammogram is
positive 75% of the time
If a woman does not have cancer, a
mammogram is positive 10% of the time

If a woman has a positive mammogram, how likely is she to have cancer?
The Signal and the Noise, Nate Silver
[Figure: 2x2 grid, rows cancer and no cancer, columns positive and negative]

Pr(positive|cancer) = 0.75
= N(positive & cancer) / N(cancer)

N(cancer) = 4
N(positive & cancer) = 3

Pr(positive|no cancer) = 0.1
= N(positive & no cancer) / N(no cancer)

N(no cancer) = 1000
N(positive & no cancer) = 100

Pr(cancer) = 0.014
= N(cancer) / N
Conditional probabilities

Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%

What is Pr(cancer|positive)?
[Figure: 2x2 grid, rows cancer and no cancer, columns positive and negative]

Pr(cancer|positive) = 9.6%
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)

Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014

Pr(positive) = Pr(positive|no cancer)Pr(no cancer) + Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)

= (0.75 * 0.014) / (0.1091)

= 0.0962

= 9.6% chance she has cancer if the mammogram is positive
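
A sketch of the full mammogram calculation, including the step that builds Pr(positive) from the two conditional rates. The function name and argument names are illustrative; the numbers are the ones given on the slides.

```python
def posterior_from_rates(prior, true_positive_rate, false_positive_rate):
    """Pr(condition|positive) from the prior and the two test rates."""
    # Law of total probability: positives come from both groups.
    p_positive = (true_positive_rate * prior
                  + false_positive_rate * (1 - prior))
    # Bayes' theorem.
    return true_positive_rate * prior / p_positive

p_cancer = posterior_from_rates(
    prior=0.014,               # Pr(cancer): 14 of 1000 women under 50
    true_positive_rate=0.75,   # Pr(positive|cancer)
    false_positive_rate=0.10,  # Pr(positive|no cancer)
)
print(f"{p_cancer:.4f}")
```

The result is small because false positives from the large no-cancer group far outnumber true positives from the small cancer group.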
Interpretation generally
Same data, different meaning
More than one true story
