of Computational Journalism
Columbia Journalism School
Week 1: Basics
September 4, 2013
Lecture 1: Basics
Computer Science and Journalism
Representing Data
Interpreting High Dimensional Data
[Figure: diagram with the labels Data, Reporting, Filtering, User, and CS (Computer Science)]
Examples of filters
What an editor puts on the front page
Google News
Reddit's comment system
Twitter
Facebook news feed
Techmeme
Track effects
[Figure: diagram with the labels Data, Reporting, Filtering, User, Effects, and CS]
Course Structure
Information retrieval: TF-IDF, search engines
Text analysis: clustering and topic modeling
Information filtering systems
Social network analysis
Knowledge representation
Drawing conclusions from data
Information security
Tracking flow and effects
[Figure labels. Topics: Information Retrieval, Clustering, Text Analysis, Filter Design, Social Network Analysis, Knowledge Representation, Drawing Conclusions. Fields: Data Science, Sociology, Graph Theory, Artificial Intelligence, Statistics, Cognitive Science]
Administration
Assignment after each class
Four assignments require programming, but your writing counts for more than your code!
Course blog: http://jmsc.hku.hk/courses/jmsc6041spring2013/
Final project to be completed Feb-April
Representing Data
structured data
unstructured data
Examples of features
number of claws
latitude
color ∈ {red, yellow, blue}
number of break-ins
1 for bought X, 0 for did not buy X
time, duration, etc.
number of times word Y appears in document
votes cast
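As a rough illustration of the examples above, the sketch below turns one made-up record into a list of numbers. The record fields, the `featurize` helper, and the two-word vocabulary are all hypothetical, chosen only to mirror the slide's examples (a count, a continuous value, a 0/1 indicator, and word counts).

```python
from collections import Counter

def featurize(record, vocabulary):
    """Turn one hypothetical record into a flat list of numbers."""
    # Count how often each word appears in the record's text.
    words = Counter(record["text"].lower().split())
    features = [
        float(record["claws"]),               # a count
        float(record["latitude"]),            # a continuous value
        1.0 if record["bought_x"] else 0.0,   # 1 for bought X, 0 for did not
    ]
    # One feature per vocabulary word: number of times it appears.
    features += [float(words[w]) for w in vocabulary]
    return features

# Made-up record, for illustration only.
record = {
    "claws": 2,
    "latitude": 40.8,
    "bought_x": True,
    "text": "Data data everywhere and not a vote to cast",
}
vector = featurize(record, vocabulary=["data", "vote"])
print(vector)  # [2.0, 40.8, 1.0, 2.0, 1.0]
```

The order of the entries is arbitrary, but it has to be the same for every record so that the vectors can be compared.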
Feature selection
Technical meaning in machine learning etc.: which variables matter?
We're journalists, so we're interested in an earlier process: how to describe the world in numbers?
Choosing Features
Journalism
How do we represent the world numerically?
where $k \in \mathbb{N}$
Categorical
finite, e.g. {on, off}
infinite, e.g. {red, yellow, blue, ... chartreuse}
ordered?
equivalence classes or other structure?
Likert scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform
Even with all these caveats, the vector representation is incredibly flexible and powerful.
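To make the categorical and Likert cases concrete, here is a minimal sketch of two common encodings. The helper names `one_hot` and `ordinal` and the five-point label list are assumptions for illustration, not something the slides specify.

```python
# Assumed five-point Likert labels, in order.
LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

def one_hot(value, categories):
    """Encode an unordered categorical value as a 0/1 vector."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered_categories):
    """Encode an ordered value by its position in the scale."""
    return ordered_categories.index(value)

print(one_hot("off", ["on", "off"]))                 # [0, 1]
print(one_hot("blue", ["red", "yellow", "blue"]))    # [0, 0, 1]
print(ordinal("agree", LIKERT))                      # 3
```

The ordinal code keeps the ordering, but the gaps between codes are not guaranteed to mean anything: with a non-uniform scale, "agree" minus "neutral" need not equal "neutral" minus "disagree". One-hot encoding avoids implying an order at all, at the cost of one dimension per category.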
Interpreting High Dimensional Data
UK House of Lords voting record, 2000-2012.
N = 1043 votes by M = 1630 lords
2 = aye, 4 = nay, -9 = didn't vote
Vote vectors
Let v(i,j) = vote of MP i on issue j. Then we can look at all votes for a particular MP:
$\vec{mp}_i = \begin{bmatrix} v(i,0) & v(i,1) & \cdots & v(i,N) \end{bmatrix}$
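A small sketch of the vote-vector idea, using a tiny made-up table rather than the real voting record. The raw codes (2 = aye, 4 = nay, -9 = didn't vote) come from the slide; recoding them to +1, -1, and 0 is an assumption made here so that the numbers behave sensibly as coordinates.

```python
import numpy as np

# Made-up raw data: rows are members i, columns are votes j.
# Codes as on the slide: 2 = aye, 4 = nay, -9 = didn't vote.
raw = np.array([
    [2,  4, -9, 2],
    [4,  4,  2, 2],
    [2, -9, -9, 4],
])

# Assumed recoding: aye -> +1, nay -> -1, didn't vote -> 0.
votes = np.zeros(raw.shape)
votes[raw == 2] = 1.0
votes[raw == 4] = -1.0

# The vote vector for member 0 is simply row 0 of the matrix.
mp_0 = votes[0]
print(mp_0)         # [ 1. -1.  0.  1.]
print(votes.shape)  # (3, 4): 3 members, 4 votes
```

Treating "didn't vote" as 0 places an absence halfway between aye and nay, which is a modeling choice rather than a fact about the data.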
Now we have M = 1630 vectors, each of dimension N = 1043.
What could we learn from this?
What is their structure?
Dimensionality reduction
Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.
We have to go from $x \in \mathbb{R}^N$ to much lower dimensional points $y \in \mathbb{R}^K$, with $K \ll N$.
Probably K=2 or K=3.
Think of this as rotating to align the "screen" with coordinate axes, then simply throwing out values of higher dimensions.
Principal Components Analysis finds the directions of maximum variance. Here, we're plotting the two dimensions of greatest variance.
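The rotate-then-discard picture can be sketched with NumPy's singular value decomposition. This is a minimal illustration on synthetic data, not the code behind the plot; the function name `pca_project` and the random test matrix are assumptions.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the k directions of greatest variance."""
    # Center each column so variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing
    # singular value, i.e. by decreasing variance along that direction.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Rotate into the new axes and keep only the first k coordinates.
    return X_centered @ Vt[:k].T

# Synthetic data: 200 points in 10 dimensions, two of them with much
# larger spread than the rest.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.array([5.0, 3.0] + [0.5] * 8)

Y = pca_project(X, k=2)
print(Y.shape)  # (200, 2)
```

Each row of `Y` is a point that can be drawn on a two-dimensional screen. With the vote data, `X` would hold one row per member and one column per vote.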
Conservatives and Liberal Democrats really do vote together, mostly. Cross-benchers and bishops are in the middle, with Labour opposite.