Computational Journalism
Columbia Journalism School
Week 5: Quantification and Probability
October 6, 2017
This class
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}
\]
Different types of counting
Numeric
o Continuous or discrete
o Units of measurement?
o Non-linear scales?
Categorical
o finite, e.g. {true, false}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
Choices about what to count
GDP = C + I + G + (X - M)
1940 U.S. census enumerator instructions
2010 U.S. census race and ethnicity questions
Some things that are tricky to quantify,
but usefully quantified anyway
Intelligence
Academic performance
Race, ethnicity, nationality, gender
Number of incidents of some type
Income
Political Ideology
Intentional or unintentional problems
It looks like Lucknow and Kanpur have few traffic accidents, but
deaths data suggests that accidents are not being counted.
Evaluating Data Quality
Internal validity: check the data against itself
row counts (e.g. all 50 states?)
related data
histograms
do the numbers add up?
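The internal checks above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the rows, the reported totals, and the function name `check_internal_validity` are all hypothetical and stand in for whatever dataset is being vetted.

```python
from collections import Counter

# Hypothetical rows: (state, category, count). Not from any real dataset.
rows = [
    ("NY", "urban", 120),
    ("NY", "rural", 30),
    ("CA", "urban", 200),
    ("CA", "rural", 50),
]
# Hypothetical published totals to compare the rows against.
reported_totals = {"NY": 150, "CA": 260}


def check_internal_validity(rows, reported_totals, expected_states):
    """Return a list of human-readable problems found in the data."""
    problems = []

    # Row-count check: is every expected state present?
    seen = {state for state, _, _ in rows}
    missing = sorted(set(expected_states) - seen)
    if missing:
        problems.append(f"missing states: {missing}")

    # Do the numbers add up? Sum the rows and compare with reported totals.
    sums = Counter()
    for state, _, count in rows:
        sums[state] += count
    for state, total in reported_totals.items():
        if sums[state] != total:
            problems.append(f"{state}: rows sum to {sums[state]}, reported {total}")

    return problems


problems = check_internal_validity(rows, reported_totals, ["NY", "CA", "TX"])
for problem in problems:
    print(problem)
```

A flagged problem is a question to ask the data source, not proof of an error.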
The Second Simplest model:
Counting Two Things
[Figure: Accident / No accident, by Blue / Yellow]
Pr(Accident) = 0.15
Deadly Force in Black and White, ProPublica 10/10/2014
Relative risk (risk ratio)
AP Clinton Foundation Story
At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State
Department donated to her family charity or pledged commitments to
its international programs, according to a review of State Department
calendars.
odds
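As a rough sketch, the AP figures can be turned into a proportion and into odds. Reading "odds" as donors to non-donors is an interpretation of the slide, and the helper name `proportion_and_odds` is made up for illustration. The story says "at least 85", so both values are lower bounds.

```python
def proportion_and_odds(successes, total):
    """Return (proportion, odds) for `successes` out of `total`.

    proportion = successes / total
    odds       = successes / failures
    """
    failures = total - successes
    return successes / total, successes / failures


# 85 donors among the 154 people counted in the AP review.
proportion, odds = proportion_and_odds(85, 154)
print(f"proportion = {proportion:.3f}, odds = {odds:.2f}")
# prints: proportion = 0.552, odds = 1.23
```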
P(Accident|Blue) = 0.1
Conditional Probability
Pr(B|A) = Pr(AB)/Pr(A)
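The definition can be checked directly on a table of counts. The counts below are hypothetical (they are not in the slides) and were chosen only so that Pr(accident) = 0.15 and Pr(accident | blue) = 0.1; the helper names `pr` and `pr_given` are likewise illustrative.

```python
# Hypothetical counts, keyed by (color, outcome).
counts = {
    ("blue", "accident"): 10,
    ("blue", "no accident"): 90,
    ("yellow", "accident"): 20,
    ("yellow", "no accident"): 80,
}


def pr(counts, predicate):
    """Probability of the cells matching `predicate`, as a share of all counts."""
    total = sum(counts.values())
    return sum(n for key, n in counts.items() if predicate(key)) / total


def pr_given(counts, event, given):
    """Pr(event | given) = Pr(event and given) / Pr(given)."""
    joint = pr(counts, lambda key: event(key) and given(key))
    return joint / pr(counts, given)


def is_blue(key):
    return key[0] == "blue"


def is_accident(key):
    return key[1] == "accident"


p_accident = pr(counts, is_accident)
p_accident_given_blue = pr_given(counts, is_accident, is_blue)
print(round(p_accident, 2), round(p_accident_given_blue, 2))
```

Conditioning on "blue" simply restricts the count to the blue column before dividing.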
Relative risk as conditional probability
N = a+b+c+d
N(disease) = a+c
N(no disease) = b+d
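A minimal sketch of relative risk from the 2x2 counts. The cell layout is an assumption consistent with the totals above (a + c have the disease, b + d do not): a and b are taken to be the exposed group, c and d the unexposed group. The example counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table.

    Assumed layout (not stated in the slides):
      a = exposed, disease       b = exposed, no disease
      c = unexposed, disease     d = unexposed, no disease
    """
    risk_exposed = a / (a + b)      # Pr(disease | exposed)
    risk_unexposed = c / (c + d)    # Pr(disease | unexposed)
    return risk_exposed / risk_unexposed


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the disease.
rr = relative_risk(a=30, b=70, c=10, d=90)
print(f"relative risk = {rr:.1f}")
```

Written this way, relative risk is a ratio of two conditional probabilities.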
P(accident) = 0.15
P(accident|blue) = 0.25
P(accident|yellow) = 0.75
P(yellow)
or equivalently
N(yellow)
Evidence and Conditional Probability
P(coughing|cold) = 0.9
P(A|B) ≠ P(B|A)
Most people with colds cough
P(coughing|cold) = 0.9
but we want
P(cold | coughing)
Bayes Theorem
P(H|E) = P(E|H)P(H)/P(E)
= 0.9 * 0.05 / 0.1
= 0.45
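The same arithmetic as a small Python function. Taking H = "cold" and E = "coughing", the numbers are read as P(coughing|cold) = 0.9, P(cold) = 0.05 and P(coughing) = 0.1; the mapping of 0.05 and 0.1 to those two terms is inferred from the order of the formula, and the function name `posterior` is illustrative.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e


p_cold_given_coughing = posterior(p_e_given_h=0.9, p_h=0.05, p_e=0.1)
print(round(p_cold_given_coughing, 2))  # prints 0.45
```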
[Figure: test result (positive / negative) by condition (cancer / no cancer)]
Pr(positive|cancer) = 0.75
N(cancer) = 4
N(positive & cancer) = 3
Pr(cancer) = N(cancer) / N = 0.014
Conditional probabilities
Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%
What is Pr(cancer|positive)?
Pr(cancer|positive) = 9.6%
Bayesian Mammograms
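Pr(cancer|positive) = Pr(positive|cancer) Pr(cancer) / Pr(positive)
Pr(positive|cancer) = 0.75
Pr(positive|no cancer) = 0.10
Pr(cancer) = 0.014
Pr(positive) = 0.75 * 0.014 + 0.10 * 0.986 = 0.1091
Pr(cancer|positive) = 0.75 * 0.014 / 0.1091 = 0.0962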
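A short sketch of the mammogram calculation, using the three figures given on the slides (75% positive rate with cancer, 10% positive rate without, 1.4% base rate). Pr(positive) is not given directly, so it is computed here with the law of total probability; the function name is made up for illustration.

```python
def pr_cancer_given_positive(p_pos_given_cancer, p_pos_given_no_cancer, p_cancer):
    """Pr(cancer | positive) via Bayes' theorem.

    Pr(positive) is expanded over the two cases, cancer and no cancer.
    """
    p_positive = (p_pos_given_cancer * p_cancer
                  + p_pos_given_no_cancer * (1 - p_cancer))
    return p_pos_given_cancer * p_cancer / p_positive


result = pr_cancer_given_positive(
    p_pos_given_cancer=0.75,
    p_pos_given_no_cancer=0.10,
    p_cancer=0.014,
)
print(f"{result:.4f}")  # prints 0.0962
```

Most positives come from the much larger no-cancer group, which is why the answer is far below 75%.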