
Frontiers of Computational Journalism
Columbia Journalism School
Week 1: Basics
September 4, 2013

Lecture 1: Basics
Computer Science and Journalism
Representing Data
Interpreting High Dimensional Data

Computational Journalism: Definitions

"Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking."
- Cohen, Hamilton, Turner, Computational Journalism

Computational Journalism: Definitions

"Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents."
- Cohen, Hamilton, Turner, Computational Journalism

Cohen et al. model

[Diagram labels: Data, Reporting, User, Computer Science]

CS for presentation / interaction

[Diagram labels: Data, CS, Reporting, CS, User]

Filter many stories for user

[Diagram labels: several Data, CS, Reporting chains; Filtering; User]

Examples of filters
What an editor puts on the front page
Google News
Reddit's comment system
Twitter
Facebook news feed
Techmeme

Memetracker by Leskovec, Backstrom, Kleinberg

Kony 2012 early network, by Gilad Lotan / SocialFlow

Track effects

[Diagram labels: several Data, CS, Reporting chains; Filtering; User; Effects]

Computer Science in Journalism


Reporting
Presentation
Filtering
Tracking

Computational Journalism: Definitions

"the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government."
- Jonathan Stray, A Computational Journalism Reading List

Course Structure
Information retrieval: TF-IDF, search engines
Text analysis: clustering and topic modeling
Information filtering systems
Social network analysis
Knowledge representation
Drawing conclusions from data
Information security
Tracking flow and effects

[Diagram labels: Information Retrieval, Data Science, Natural Language Processing, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions, Sociology, Graph Theory, Artificial Intelligence, Statistics, Cognitive Science]

Administration
Assignment after each class
Four assignments require programming, but your writing counts for more than your code!

Course blog
http://jmsc.hku.hk/courses/jmsc6041spring2013/

Final project
to be completed Feb-April

Lecture 1: Basics
Computer Science and Journalism
Representing Data
Interpreting High Dimensional Data

Definition of data: a collection of similar pieces of information

structured data

unstructured data

Vector representation of objects

Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Each $x_i$ is a numerical or categorical feature.
$N$ = number of features, or "dimension".

Examples of features
number of claws
latitude
color {red, yellow, blue}
number of break-ins
1 for bought X, 0 for did not buy X
time, duration, etc.
number of times word Y appears in document
votes cast
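
As a minimal sketch of how features like these become one vector, the Python below encodes a single made-up record. The field names (`claws`, `latitude`, `color`, `bought_x`) and the one-hot encoding of the categorical color are illustrative assumptions, not something the lecture prescribes.

```python
# Hypothetical record layout; the field names are invented for illustration.
COLORS = ["red", "yellow", "blue"]  # possible values of the categorical feature

def to_vector(record):
    """Turn a dict of mixed features into a flat list of numbers."""
    # One-hot encode the color: one slot per possible value.
    one_hot = [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return [
        float(record["claws"]),              # a count
        float(record["latitude"]),           # a continuous measurement
        1.0 if record["bought_x"] else 0.0,  # 1 for bought X, 0 for did not
    ] + one_hot

animal = {"claws": 4, "latitude": 40.8, "color": "yellow", "bought_x": False}
x = to_vector(animal)
print(x)       # the feature vector
print(len(x))  # N, its dimension
```

One-hot encoding is only one way to turn a category into numbers; it avoids implying an order among the colors.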

Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?

Choosing Features

Journalism: How do we represent the world numerically?

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Machine learning: Which variables carry the most information?

\[
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k \ll N
\]

Different types of quantitative data

Numeric
continuous
countable
bounded?
units of measurement?

Categorical
finite, e.g. {on, off}
infinite, e.g. {red, yellow, blue, ... chartreuse}
ordered?
equivalence classes or other structure?

Different types of scales

Temperature
Continuous scale, fixed zero point, physical units, comparative, uniform

Likert Scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale


It's not linear, so is $2X_1$ twice as good?

\[
\frac{X_1 + c}{X_2 + c} \neq \frac{X_1}{X_2}
\]

Lots of things don't make much sense, such as $(X_1 + \dots + X_N) / N = {?}$
The average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
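
One way to see why averages are fragile on a non-uniform scale while rank order statistics hold up: relabel the scale with any order-preserving mapping and check which comparisons survive. The responses and the relabeling below are invented for illustration.

```python
from statistics import mean, median

# Invented Likert-style responses (1-5) from two groups.
group_a = [1, 1, 5, 5, 5]
group_b = [3, 3, 4, 4, 4]

# An order-preserving relabeling (an assumption for illustration):
# answers keep their ranking, but the spacing between them changes.
relabel = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}
a2 = [relabel[v] for v in group_a]
b2 = [relabel[v] for v in group_b]

# Does each statistic still rank the groups the same way?
mean_order_before = mean(group_a) > mean(group_b)
mean_order_after = mean(a2) > mean(b2)
median_order_before = median(group_a) > median(group_b)
median_order_after = median(a2) > median(b2)

print("mean says A > B:  ", mean_order_before, "->", mean_order_after)
print("median says A > B:", median_order_before, "->", median_order_after)
```

With these numbers the comparison of means flips after relabeling, while the comparison of medians does not.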

Other issues with quantitative data

Where did the data come from?
physical measurement
computer logging
human recording

What are the sources of error?
measurement error
missing data
ambiguity in human classification
process errors
intentional bias / deception

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Even with all these caveats, the vector representation is incredibly flexible and powerful.

Examples of vector representations

Obvious
movies watched / items purchased
legislative voting history for a politician
crime locations

Less obvious, but standard
document vector space model
psychological survey results

Tricky research problem: disparate field types
corporate filing document
Wikileaks SIGACT
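
The document vector space model can be sketched as word counts over a shared vocabulary, so that each document becomes one vector with one dimension per word. The two sample sentences and the whitespace tokenizer are assumptions made for brevity.

```python
from collections import Counter

# Two invented documents.
docs = [
    "the council voted on the budget",
    "the budget vote was delayed",
]

# Shared vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

def doc_vector(doc):
    """Count how many times each vocabulary word appears in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [doc_vector(d) for d in docs]
print(vocab)
for v in vectors:
    print(v)
```

Real systems usually add steps such as lowercasing, punctuation removal, and term weighting; this sketch leaves them out.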

What can we do with vectors?

Predict one variable based on others
this is called "regression"
"supervised" machine learning

Group similar items together
This is classification or clustering
We may or may not know pre-existing classes
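
To make "predict one variable based on others" concrete, here is a minimal least-squares line fit in plain Python. The data points are invented and the helper name `fit_line` is hypothetical; in practice a statistics library would do this.

```python
# Invented (x, y) observations; x and y stand for two features of each item.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted y for an unseen x = 5
print(slope, intercept, prediction)
```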

Lecture 1: Basics
Computer Science and Journalism
Representing Data
Interpreting High Dimensional Data

Interpreting High Dimensional Data

UK House of Lords voting record, 2000-2012.
N = 1043 votes by M = 1630 lords
2 = aye, 4 = nay, -9 = didn't vote

Vote vectors
Let v(i,j) = vote of MP i on issue j. Then we can look at all votes for a particular MP:

\[
\mathrm{mp}_i = \begin{bmatrix} v(i,0) & v(i,1) & \cdots & v(i,N-1) \end{bmatrix}
\]

Now we have 1630 vectors, each of dimension 1043. What could we learn from this? What is their structure?
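
A small sketch of building vote vectors from records coded as in the dataset (2 = aye, 4 = nay, -9 = didn't vote). The 3-member, 4-vote table is invented, and the recoding to +1 / -1 / 0 is an assumed choice that places abstention between aye and nay rather than far away at -9.

```python
# Invented raw records: raw[i][j] = vote of member i on issue j,
# coded as 2 = aye, 4 = nay, -9 = didn't vote.
raw = [
    [2, 2, 4, -9],
    [2, 4, 4, 2],
    [-9, 2, 2, 4],
]

# Assumed recoding so that distances between votes are meaningful.
RECODE = {2: 1, 4: -1, -9: 0}

def vote_vector(i):
    """All votes cast by member i, as one vector."""
    return [RECODE[v] for v in raw[i]]

vectors = [vote_vector(i) for i in range(len(raw))]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```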

Visualizing High Dimensional Data

We can visualize 3 dimensions at a time. What do we do with 1043?

Looking at all MPs for votes 100, 200, 300

Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.
We have to go from $x \in \mathbb{R}^N$ to much lower dimensional points $y \in \mathbb{R}^K$, $K \ll N$.
Probably K=2 or K=3.

This is called "projection"

Projection from 3 to 2 dimensions

Think of this as rotating to align the "screen" with coordinate axes, then simply throwing out values of higher dimensions.
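
The rotate-then-drop idea can be sketched directly: rotate each 3D point, then keep only the first two coordinates. Rotating about the y axis is an arbitrary choice made for this example, and the two points are invented.

```python
import math

def rotate_y(p, theta):
    """Rotate a 3D point about the y axis by angle theta (radians)."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def project(p, theta):
    """Rotate, then throw away the third coordinate."""
    x, y, _ = rotate_y(p, theta)
    return (x, y)

points = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
front = [project(p, 0.0) for p in points]         # look straight down the z axis
side = [project(p, math.pi / 2) for p in points]  # look after a quarter turn
print(front)
print(side)
```

From the front, the second point lands on the origin; after the quarter turn it is the first point that does (up to floating-point rounding), so the view chosen decides what can be seen.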

Projection from 3 to 2 dimensions

Direction of projection matters!

Which direction should we look from?

Intuition: find a direction that "spreads out" points.

House of Lords PCA analysis

Principal Components Analysis finds the directions of maximum variance. Here, we're plotting the two dims of greatest variance.
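
A rough sketch of PCA in plain Python, assuming nothing about how the plot was actually produced. It centers the data, builds the covariance matrix, and uses power iteration with deflation to find the two directions of greatest variance; a linear algebra library would normally do this in a few lines. The tiny vote table (+1 aye, -1 nay) and all function names are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(a, b) / (n - 1) for b in cols] for a in cols]

def top_direction(cov, iters=500):
    """Power iteration: unit vector of maximum variance, and that variance."""
    v = [float(j + 1) for j in range(len(cov))]  # arbitrary non-zero start
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v, dot(v, [dot(row, v) for row in cov])

def pca_2d(rows):
    """Project rows onto their two directions of greatest variance."""
    x = center(rows)
    cov = covariance(x)
    v1, var1 = top_direction(cov)
    # Deflation: remove the variance along v1, then find the next direction.
    d = len(cov)
    rest = [[cov[a][b] - var1 * v1[a] * v1[b] for b in range(d)]
            for a in range(d)]
    v2, var2 = top_direction(rest)
    coords = [(dot(r, v1), dot(r, v2)) for r in x]
    return coords, (v1, v2), (var1, var2)

# Invented votes: the first three members form one bloc, the last three another.
votes = [
    [1, 1, -1, -1],
    [1, 1, -1, 1],
    [1, 1, -1, -1],
    [-1, -1, 1, 1],
    [-1, -1, 1, -1],
    [-1, -1, 1, 1],
]
coords, directions, variances = pca_2d(votes)
for member, (pc1, pc2) in enumerate(coords):
    print(member, round(pc1, 3), round(pc2, 3))
```

On this toy table the first coordinate separates the two blocs. The sign of each direction is arbitrary, so which bloc lands on the positive side carries no meaning.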

Interpretation requires context

Conservative and Liberal Democrats really do vote together, mostly. Cross-benchers and bishops in the middle, Labour opposite.
