Computational Journalism 2017 Week 5: Quantification and Statistics

Frontiers of
Computational Journalism
Columbia Journalism School
Week 5: Quantification and Probability
October 6, 2017
This class
Quantification and Data Quality

What is statistics for?
Models made of Counting
Interpretation Generally
Quantification and Data Quality
Definition of data?
My definition of data:
A collection of similar pieces of recorded information

Structured data
Unstructured data
Quantification
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix} \]
Different types of counting
Numeric
o Continuous or discrete
o Units of measurement?
o Non-linear scales?
Categorical
o finite, e.g. {true, false}
o infinite e.g. {red, yellow, blue, ... chartreuse}
o ordered?
Choices about what to count
GDP = C + I + G + (X - M)
1940 U.S. census enumerator
instructions
2010 U.S. census race and
ethnicity questions
Some things that are tricky to quantify,
but usefully quantified anyway
Intelligence
Academic performance
Race, ethnicity, nationality, gender
Number of incidents of some type
Income
Political Ideology
Intentional or unintentional problems
It looks like Lucknow and Kanpur have few traffic accidents, but
deaths data suggests that accidents are not being counted.
Evaluating Data Quality
Internal validity: check the data against itself
row counts (e.g. all 50 states?)
related data
histograms
do the numbers add up?
External validity: compare the data to something else.

alternate data sources
expert knowledge
previous versions
common sense!
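The internal checks listed above (row counts, duplicates, whether the numbers add up) can be sketched in a few lines. This is only an illustration: the rows, the expected states, and the published total are all made up, and `internal_checks` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical rows: (state, category, count). All values are invented.
rows = [
    ("NY", "injury", 120),
    ("NY", "fatal", 5),
    ("CA", "injury", 300),
    ("CA", "fatal", 12),
    ("CA", "fatal", 12),  # an accidental duplicate row
]
reported_total = 437  # hypothetical total published alongside the data


def internal_checks(rows, expected_states, reported_total):
    """Return a list of human-readable problems found inside the data itself."""
    problems = []

    # Row coverage: is every expected state present?
    states = {state for state, _, _ in rows}
    missing = sorted(expected_states - states)
    if missing:
        problems.append(f"missing states: {missing}")

    # Exact duplicate rows often signal a copy/paste or merge error.
    duplicates = [row for row, n in Counter(rows).items() if n > 1]
    if duplicates:
        problems.append(f"duplicate rows: {duplicates}")

    # Do the numbers add up to the published total?
    total = sum(count for _, _, count in rows)
    if total != reported_total:
        problems.append(f"sum {total} != reported total {reported_total}")

    return problems


problems = internal_checks(rows, {"NY", "CA", "TX"}, reported_total)
for problem in problems:
    print(problem)
```

Here the duplicate row is also what makes the sum disagree with the published total, which is why several cheap checks together are more informative than any one alone.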
Interview the Data
Who created this data?

What is this data supposed to count?
How was this data actually collected?
Does it really count what it's supposed to?
For what purpose was this data collected?
How do we know it is complete?
If the data was collected from people, who was
asked and how?
Interview the Data
Who is going to look bad or lose money because of this data?
Is the data consistent with other sources?
Is the data consistent from day to day, or when
collected by different people?
Who has already analyzed it?
Are there multiple versions?
Does this data have known problems?
What is statistics for?
Description
Explanation
Prediction
Models made of Counting
The Simplest Model:
Counting One Thing
P(Yellow) = 0.6
Blue Yellow
The Second Simplest model:
Counting Two Things
Accident
No accident
Pr(Accident) = 0.15
Accident
No Accident
Blue Yellow
Deadly Force in Black and White, ProPublica 10/10/2014
Relative risk (risk ratio)
AP Clinton Foundation Story
At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State
Department donated to her family charity or pledged commitments to
its international programs, according to a review of State Department
calendars.
odds
AP Clinton Foundation Story
odds
Not enough information to compute the odds ratio... which you can tell immediately because four values are required.
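As a rough sketch of the arithmetic, the two numbers in the AP sentence give a proportion and an odds, but not an odds ratio. The variable names are mine, and since the story says "at least 85", the results are lower bounds. The `odds_ratio` helper is hypothetical; it is only there to show the four counts that would be needed.

```python
met = 154      # people from private interests with meetings or calls scheduled
donated = 85   # of those, at least this many donated or pledged

proportion = donated / met          # share of the 154 who donated
odds = donated / (met - donated)    # donors per non-donor among the 154

print(f"proportion = {proportion:.2f}")
print(f"odds = {odds:.2f} to 1")


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: donors and non-donors among people who met Clinton
    c, d: donors and non-donors among comparable people who did not
    The story reports only a and b, so this cannot be evaluated from it.
    """
    return (a / b) / (c / d)
```

The comparison group (c and d) is the missing piece: without knowing how common donating was among similar people who did not get a meeting, 85 of 154 cannot be judged high or low.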
Accident
No Accident
P(Accident|Blue) = 0.1
Blue Yellow
Conditional Probability
Pr(B|A) = Pr(AB)/Pr(A)
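A minimal sketch of this formula using counts. The table below is hypothetical: 1000 rides, constructed on the assumption that the figures shown on the earlier slides (P(Yellow) = 0.6, Pr(Accident) = 0.15, P(Accident|Blue) = 0.1) all describe the same city. The function names are mine.

```python
# Hypothetical counts for 1000 rides, chosen to match the slide figures.
counts = {
    ("blue", "accident"): 40,
    ("blue", "no accident"): 360,
    ("yellow", "accident"): 110,
    ("yellow", "no accident"): 490,
}


def prob(counts, color=None, outcome=None):
    """Probability of the given color and/or outcome (None means 'any')."""
    total = sum(counts.values())
    matching = sum(
        n for (c, o), n in counts.items()
        if (color is None or c == color) and (outcome is None or o == outcome)
    )
    return matching / total


def conditional(counts, outcome, color):
    """Pr(outcome | color) = Pr(color and outcome) / Pr(color)."""
    return prob(counts, color=color, outcome=outcome) / prob(counts, color=color)


print(f"Pr(accident)          = {prob(counts, outcome='accident'):.3f}")
print(f"Pr(accident | blue)   = {conditional(counts, 'accident', 'blue'):.3f}")
print(f"Pr(accident | yellow) = {conditional(counts, 'accident', 'yellow'):.3f}")
```

Under these assumed counts the overall accident rate sits between the two conditional rates, as it must, because it is their average weighted by each company's share of rides.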
Relative risk as conditional probability
N = a+b+c+d
N(disease) = a+c
N(no disease) = b+d
Pr(disease) = (a+c) / (a+b+c+d)

Pr(disease|smoker) = a / (a+b)
Pr(disease|non-smoker) = c / (c+d)
RR = Pr(disease|smoker) / Pr(disease|non-smoker) = (a/(a+b)) / (c/(c+d))
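The relative risk calculation can be sketched directly from the four cells of the table. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; `relative_risk` is an illustrative name.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table.

    a: smokers with disease       b: smokers without disease
    c: non-smokers with disease   d: non-smokers without disease
    """
    risk_smoker = a / (a + b)        # Pr(disease | smoker)
    risk_non_smoker = c / (c + d)    # Pr(disease | non-smoker)
    return risk_smoker / risk_non_smoker


# Hypothetical counts: 100 smokers and 100 non-smokers.
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
print(f"RR = {rr:.2f}")
```

The parentheses matter when the formula is typed into code: `a / a + b` evaluates as `1 + b`, not as `a / (a + b)`.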

Predicting Cats
Predicting Recidivism
How We Analyzed the COMPAS Recidivism Algorithm, ProPublica, 2016
Confusion Matrix
ROC curve (for adjustable thresholds)
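A small sketch of how a confusion matrix and the points of an ROC curve come out of the same scores. The scores and outcomes are invented for illustration and are not COMPAS data; the rule "predict high risk when score >= threshold" and the function names are assumptions.

```python
# Hypothetical risk scores (1-10) and whether each person later re-offended.
scores = [2, 9, 4, 7, 3, 8, 6, 1, 5, 10]
reoffended = [False, True, False, True, True, False, True, False, False, True]


def confusion(scores, actual, threshold):
    """Return (TP, FP, FN, TN) for 'predict high risk when score >= threshold'."""
    tp = fp = fn = tn = 0
    for score, outcome in zip(scores, actual):
        predicted = score >= threshold
        if predicted and outcome:
            tp += 1
        elif predicted and not outcome:
            fp += 1
        elif not predicted and outcome:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_points(scores, actual):
    """One (false positive rate, true positive rate) point per threshold."""
    points = []
    # The extra threshold above the maximum score gives the (0, 0) corner.
    for threshold in sorted(set(scores)) + [max(scores) + 1]:
        tp, fp, fn, tn = confusion(scores, actual, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)


tp, fp, fn, tn = confusion(scores, reoffended, threshold=5)
print(f"threshold 5: TP={tp} FP={fp} FN={fn} TN={tn}")
for fpr, tpr in roc_points(scores, reoffended):
    print(f"FPR={fpr:.1f} TPR={tpr:.1f}")
```

Each threshold gives a different confusion matrix; the ROC curve is just the set of those matrices summarized as two rates, which is why it applies only when the threshold is adjustable.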
Base Rates - Taxi Accidents
Imagine you live in a city where 15% of all rides end in an accident, and last year there were
- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs
Which taxi company is more dangerous?

Base rate
We know
P(accident) = 0.15
P(blue|accident) = 0.25
P(yellow|accident) = 0.75
We do not know the base rate:
P(yellow)
or equivalently
N(yellow)
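To see why the missing base rate matters, here is a sketch that tries several values of P(yellow). The counts (75 and 25 accidents) and the 15% figure come from the slide; the three market shares are hypothetical, and `accident_rates` is an illustrative name.

```python
p_accident = 0.15                    # from the slide: 15% of rides end in an accident
p_yellow_given_accident = 75 / 100   # 75 of the 100 accidents involved yellow cabs


def accident_rates(p_yellow):
    """Return (P(accident|yellow), P(accident|blue)) for a given share of yellow rides.

    Uses Bayes: P(accident|company) = P(company|accident) P(accident) / P(company).
    """
    rate_yellow = p_yellow_given_accident * p_accident / p_yellow
    rate_blue = (1 - p_yellow_given_accident) * p_accident / (1 - p_yellow)
    return rate_yellow, rate_blue


# Hypothetical shares of all rides taken in yellow cabs.
for p_yellow in (0.5, 0.75, 0.9):
    rate_yellow, rate_blue = accident_rates(p_yellow)
    print(f"P(yellow)={p_yellow:.2f}  "
          f"P(accident|yellow)={rate_yellow:.3f}  "
          f"P(accident|blue)={rate_blue:.3f}")
```

With half the rides, yellow is three times as dangerous as blue; with 75% of the rides the two are equal; with 90% blue is three times as dangerous. The same accident counts support opposite conclusions until the base rate is known.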
Evidence and Conditional Probability
Hypothesis H = Alice has a cold

Evidence E = we just saw her cough
Alice is coughing. Does she have a cold?
Most people with colds cough
P(coughing|cold) = 0.9
P(A|B) ≠ P(B|A)
Most people with colds cough
P(coughing|cold) = 0.9
but we want
P(cold | coughing)
Bayes Theorem
Tells us how to go from Pr(A|B) to Pr(B|A)
Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)

Alice is coughing. Does she have a cold?
Prior P(H) = 0.05 (5% of our friends have a cold)
Likelihood P(E|H) = 0.9 (most people with colds cough)
Base rate P(E) = 0.1 (10% of everyone coughs today)
P(H|E) = P(E|H)P(H)/P(E)
= 0.9 * 0.05 / 0.1
= 0.45
If you believe your initial probability estimates, you should now believe there's a 45% chance she has a cold.
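The same calculation as a short sketch, using the three numbers from the slide. The function and argument names are mine.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence


p_cold = 0.05               # prior: 5% of our friends have a cold
p_cough_given_cold = 0.9    # likelihood: most people with colds cough
p_cough = 0.1               # base rate: 10% of everyone coughs today

p_cold_given_cough = posterior(p_cold, p_cough_given_cold, p_cough)
print(f"P(cold | coughing) = {p_cold_given_cough:.2f}")
print(f"belief multiplied by {p_cold_given_cough / p_cold:.0f}")
```

The ratio P(E|H)/P(E) is 9 here, so seeing the cough makes a cold nine times more probable than it was before, even though the result is still below 50%.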
Evidence
Information that justifies a belief.
Presented with evidence E for X, we should believe X "more."
In terms of probability, P(X|E) > P(X)

Bayes learns from evidence
Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
or
Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)
Posterior Pr(H|E): How likely is H given evidence E?
Likelihood Pr(E|H): Probability of seeing E if H is true
Base rate Pr(E): How commonly do we see E at all?
Prior Pr(H): How likely was H to begin with?
Bayes Theorem - Diagnostic tests
Suppose I tell you:
14 of 1000 women under 50 have breast cancer

If a woman has cancer, a mammogram is
positive 75% of the time
If a woman does not have cancer, a
mammogram is positive 10% of the time
If a woman has a positive mammogram, how likely is she to have cancer?
The Signal and the Noise, Nate Silver
cancer
no cancer
positive negative
cancer
no cancer
Pr(positive|cancer) = 0.75
= N(positive & cancer) / N(cancer)
N(cancer) = 4
N(positive & cancer) = 3
positive negative
cancer
no cancer
Pr(positive|no cancer) = 0.1
= N(positive & no cancer) / N(no cancer)
N(no cancer) = 1000

N(positive & no cancer) = 100
positive negative
cancer
no cancer
Pr(cancer) = 0.014
= N(cancer) / N
positive negative
Conditional probabilities
Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%
What is Pr(cancer|positive)?
cancer
no cancer
Pr(cancer|positive)
= 9.6%
positive negative
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014
Pr(positive) = Pr(positive|no cancer)Pr(no cancer) + Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091
Bayesian Mammograms
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
= (0.75 * 0.014) / (0.1091)
= 0.0962
= 9.6% chance she has cancer if the mammogram is positive
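A sketch of the mammogram calculation done two ways: with the formula, and by counting expected cases in a group of 1000 women. The three input rates are the ones given on the slides; the function and variable names are assumptions of mine.

```python
def posterior_from_test(prior, p_pos_given_h, p_pos_given_not_h):
    """Pr(H | positive), with Pr(positive) from the law of total probability."""
    p_pos = p_pos_given_h * prior + p_pos_given_not_h * (1 - prior)
    return p_pos_given_h * prior / p_pos


p_cancer = 0.014         # 14 of 1000 women under 50 have breast cancer
p_pos_cancer = 0.75      # Pr(positive | cancer)
p_pos_no_cancer = 0.10   # Pr(positive | no cancer)

p = posterior_from_test(p_cancer, p_pos_cancer, p_pos_no_cancer)
print(f"Pr(cancer | positive) = {p:.1%}")

# The same answer by counting expected cases among 1000 women.
n = 1000
with_cancer = n * p_cancer                          # 14 women
true_positives = with_cancer * p_pos_cancer         # 10.5 positive, with cancer
false_positives = (n - with_cancer) * p_pos_no_cancer  # 98.6 positive, no cancer
by_counting = true_positives / (true_positives + false_positives)
print(f"by counting: {by_counting:.1%}")
```

The counting version shows where the small answer comes from: the roughly 99 false positives among healthy women outnumber the roughly 10 true positives by almost ten to one.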
Interpretation generally
Same data, different meaning
More than one true story
