
Frontiers of Computational Journalism
Columbia Journalism School
Week 1: Introduction
September 11, 2015

Lecture 1: Basics
Computer Science and Journalism
Course Structure
Interpreting High Dimensional Data

Computational Journalism:
Definitions
Broadly defined, it can involve changing how stories
are discovered, presented, aggregated, monetized,
and archived. Computation can advance journalism
by drawing on innovations in topic detection, video
analysis, personalization, aggregation, visualization,
and sensemaking.
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Computational Journalism:
Definitions
Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials' calendars
or meeting notes, and regulators' email messages that no
one today has time or money to mine. With a suite of
reporting tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these documents.
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Cohen et al. model

[Diagram with boxes labeled: Data, Reporting, User, Computer Science]

CS for presentation / interaction

[Diagram with boxes labeled: CS, Data, Reporting, User]

Filter stories for user

[Diagram with boxes labeled: CS, Data, Reporting, Filtering, User]

Examples of filters

Facebook news feed


What an editor puts on the front page
Google News
Reddit's comment system
Twitter
Techmeme
New York Times recommendation system

http://snap.stanford.edu/nifty

Kony 2012 early network, by Gilad Lotan

CS in Journalism

[Diagram with boxes labeled: CS, Data, Reporting, Effects, Filtering, User]

Journalism with algorithms


vs.
Journalism about algorithms

Websites Vary Prices, Deals Based on Users' Information


Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012

Where does data come from?

Computer Science in Journalism
Reporting
Presentation
Filtering
Tracking
Algorithmic accountability

Quantification

Data

Journalism as a cycle

[Diagram with boxes labeled: CS, Effects, Data, Reporting, User, Filtering]

Computational Journalism:
Definitions
the application of computer science to the problems
of public information, knowledge, and belief, by
practitioners who see their mission as outside of both
commerce and government.
- Jonathan Stray, A Computational Journalism Reading List,
2011

Course Structure

Information retrieval: TF-IDF, search engines


Text analysis: clustering and topic modeling
Information filtering systems
Social network analysis
Knowledge representation
Drawing conclusions from data
Writing about data
Information Security
Tracking flow and effects

Information Retrieval

Visualization
Clustering

Natural Language Processing

Text Analysis
Filter Design
Social Network Analysis

Artificial Intelligence

Sociology

Knowledge Representation

Graph Theory

Drawing Conclusions
Cognitive Science

Statistics

Epistemology

Administration
Assignment after each class
Four assignments require programming, but
your writing counts for more than your code!

Course blog
http://compjournalism.com

Final project
for 6-pt students only

Grading
Dual degree students
Pass/Fail.
Final project: paper, story, or software.

Non-journalism students
80% assignments
20% class participation

Definition of data?

My Definition of data

a collection of related pieces of
recorded information

structured data

unstructured data

Quantification

$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$

Other things that are tricky to quantify, but quantified anyway

Intelligence
Academic performance
Gender
Race, ethnicity, nationality
Number of sexual harassment incidents
Income
Political Ideology
...

Different types of quantitative
Numeric
o continuous
o countable
o bounded?
o units of measurement?

Categorical
o finite, e.g. {on, off}
o infinite, e.g. {red, yellow, blue, ... chartreuse}
o ordered?
o equivalence classes or other structure?

Different types of scales
Temperature
Continuous scale, fixed zero point,
physical units, comparative, uniform
Likert Scale
Discrete scale, no fixed origin, abstract units,
comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale


It's not linear, so is 2 × X1 twice as good?
Equal gaps in score need not be equal gaps in meaning: (X1+c) - (X2+c) vs. X1 - X2
Lots of things don't make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
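
A minimal sketch of the point above, using invented responses on a 1-5 Likert scale. The two groups and the alternative spacing `stretched` are hypothetical. The assumption is that if the steps between answers are not uniform, any order-preserving relabeling of the scale is as defensible as 1..5. Under such a relabeling the comparison of means can flip, while the comparison of medians (a rank order statistic) does not.

```python
from statistics import mean, median

# Hypothetical survey responses on a 1-5 Likert scale (invented data).
group_a = [1, 1, 5, 5, 5]
group_b = [3, 3, 4, 4, 4]

# An order-preserving relabeling of the same five answers.
# Only the spacing changes; the ranking of answers is untouched.
stretched = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def relabel(responses, mapping):
    """Apply a relabeling of the scale to every response."""
    return [mapping[r] for r in responses]


a2 = relabel(group_a, stretched)
b2 = relabel(group_b, stretched)

# Does each statistic say "A scores lower than B", before and after?
mean_order_before = mean(group_a) < mean(group_b)
mean_order_after = mean(a2) < mean(b2)
median_order_before = median(group_a) < median(group_b)
median_order_after = median(a2) < median(b2)

print("mean says A < B:  ", mean_order_before, "->", mean_order_after)
print("median says A < B:", median_order_before, "->", median_order_after)
```

With these numbers the mean changes its verdict about which group scores lower, and the median does not.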

Other issues with quantitative
Where did the data come from?
o physical measurement
o computer logging
o human recording

What are the sources of error?
o measurement error
o missing data
o ambiguity in human classification
o process errors
o intentional bias / deception

Vector representation of objects


Fundamental representation for many data mining, clustering,
machine learning, visualization, NLP, etc. algorithms.

$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$

Each xi is a numerical or categorical feature


N = number of features or dimension

Examples of features

number of claws
latitude
color {red, yellow, blue}
number of break-ins
1 for bought X, 0 for did not buy X
time, duration, etc.
number of times word Y appears in document
votes cast
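
As a rough illustration, a few of the features above can be packed into a single vector. The record fields, the helper names `one_hot` and `to_vector`, and the sample values are all hypothetical. Encoding the color as three 0/1 indicators is one common convention for categorical features, assumed here rather than taken from the slides.

```python
COLORS = ["red", "yellow", "blue"]  # categorical values from the list above


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1 if value == c else 0 for c in categories]


def to_vector(record):
    """Flatten a record (a dict) into a fixed-order list of numbers."""
    return (
        [record["claws"], record["latitude"]]   # numeric features
        + one_hot(record["color"], COLORS)      # categorical feature
        + [1 if record["bought_x"] else 0]      # 1 for bought X, 0 otherwise
    )


# Invented example object.
record = {"claws": 4, "latitude": 40.8, "color": "yellow", "bought_x": True}
x = to_vector(record)

print(x)       # [4, 40.8, 0, 1, 0, 1]
print(len(x))  # N = 6
```

The order of positions is arbitrary but must be the same for every object, so that position i always means the same feature.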

Feature selection
Technical meaning in machine learning etc.:
which variables matter?
We're journalists, so we're interested in an earlier
process:
how to describe the world in numbers?

Choosing Features

Journalism
How do we represent the world numerically?

$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
$$

Machine learning
Which variables carry the most information?

$$
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k \le N
$$
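
The machine learning sense of feature selection can be sketched as choosing k column indices f(1)..f(k) out of N. The example below ranks features by variance, which is only one crude stand-in for "carries the most information" and is an assumption of this sketch, not a method named in the lecture. The data and the helper `select_features` are invented, and indices are 0-based.

```python
from statistics import pvariance

# Invented data: each row is one object, each column one of N = 4 features.
rows = [
    [1.0, 5.0, 0.0, 10.0],
    [1.0, 6.0, 0.0, 30.0],
    [1.0, 5.0, 0.0, 20.0],
    [1.0, 6.0, 1.0, 40.0],
]


def select_features(rows, k):
    """Return the (0-based) indices of the k highest-variance columns."""
    columns = list(zip(*rows))
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


f = select_features(rows, k=2)

# Keep only the selected features of every object.
reduced = [[row[i] for i in f] for row in rows]

print(f)           # [1, 3]
print(reduced[0])  # [5.0, 10.0]
```

The first column is constant across all objects, so it cannot distinguish any of them and is dropped first.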

Examples of vector representations


Obvious
o movies watched / items purchased
o Legislative voting history for a politician
o crime locations

Less obvious, but standard


o document vector space model
o psychological survey results

Tricky research problem: disparate field types


o Corporate filing document
o Wikileaks SIGACT
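
A small sketch of the document vector space model mentioned above, where each feature is the number of times a word appears in a document. The two sentences, the crude tokenizer, and the helper names are hypothetical. Real systems usually add steps such as stop-word removal or TF-IDF weighting, which are left out here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


# Invented example documents.
docs = [
    "The council approved the budget.",
    "Budget cuts hit the schools.",
]

# The sorted vocabulary fixes what each vector position means.
vocab = sorted({word for doc in docs for word in tokenize(doc)})


def doc_vector(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [doc_vector(d) for d in docs]

print(vocab)
for v in vectors:
    print(v)
```

Every document becomes a vector of the same length N, the vocabulary size, regardless of how long the document is.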

What can we do with vectors?


Predict one variable based on others
o this is called regression
o or maybe "classification"
o supervised machine learning

Group similar items together


o This is clustering
o or maybe "classification" with unknown categories
o unsupervised machine learning
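
Both uses can be sketched in a few lines of plain Python. The data are invented, and `fit_line` and `kmeans` are illustrative helpers rather than any library's API. The regression is an ordinary least-squares line through one predictor. The clustering is a bare-bones k-means that, as a simplifying assumption, starts from the first k points instead of a random initialization.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx


def dist2(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: dist2(p, centers[c]))


def kmeans(points, k, iterations=10):
    """Tiny k-means; returns one cluster label per point."""
    centers = [list(p) for p in points[:k]]  # simplistic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centers) for p in points]


# Regression: predict one variable from another (supervised).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept
print(slope, intercept, prediction)  # 2.0 1.0 11.0

# Clustering: group similar vectors with no labels given (unsupervised).
points = [(1, 1), (8, 8), (1, 2), (9, 8)]
labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

The regression needs known values of the target variable to learn from, while the clustering sees only the vectors themselves.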
