
Frontiers of Computational Journalism
Columbia Journalism School
Week 9: Social Network Analysis
November 20, 2015

Network
A set of people and a set of connections between pairs of them

Types of connections
Social network analysis: only one type of connection between
individuals (e.g. "friend")
Link analysis: multiple types of connections
friend
brother
employer
went to university with
sold a car to
owns 51% of

Link analysis is much more relevant to journalism, because it
allows representation of much more detail and context.
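
As a rough sketch of what "multiple types of connections" means as data, each link can be stored as a (source, target, type) triple. The people, the company, and the helper names below are invented for illustration and are not from the lecture.

```python
from collections import defaultdict

# Hypothetical people and relationships, invented purely for illustration.
links = [
    ("Alice", "Bob", "friend"),
    ("Alice", "Bob", "employer"),
    ("Bob", "Carol", "sold a car to"),
    ("Carol", "Acme Ltd", "owns 51% of"),
]

def connections_by_type(links):
    """Group (source, target) pairs under each connection type."""
    grouped = defaultdict(list)
    for source, target, kind in links:
        grouped[kind].append((source, target))
    return dict(grouped)

def relations_between(links, a, b):
    """List every connection type recorded between a and b, in either direction."""
    return [kind for s, t, kind in links if {s, t} == {a, b}]

print(relations_between(links, "Alice", "Bob"))
```

A plain social network would keep only one of these types; link analysis keeps all of them, so the same pair of people can appear several times.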

People Act in Groups


Family and friendships: I am most closely connected to a small set of
people, who are usually closely connected to each other.
Business: I am much more likely to do business with people I already
know.
Influence: I listen to people I know more than I listen to strangers.
Norms: what is right depends on what the people around me think.
People tend to marry, do business with, spend time with, etc. people
from similar backgrounds... and people who have social ties tend to
be similar.

Homophily
Homophily is the principle that contact between similar people
occurs at a higher rate than among dissimilar people. The
pervasive fact of homophily means that cultural, behavioral,
genetic, or material information that flows through networks
will tend to be localized. Homophily implies that distance in
terms of social characteristics translates into network distance,
the number of relationships through which a piece of
information must travel to connect two individuals.
- McPherson, Smith-Lovin, Cook
Birds of a feather: homophily in social networks

Structure Relates to Behavior

In a 1951 experiment, researchers had five people work together, only
allowed to communicate according to one of the patterns above. They
were each given a card with several symbols on it. The task was to
determine which symbol was in common between all of the cards. It
was repeated many times.
How did the groups organize themselves? Which patterns were fastest?
From H. Leavitt, Some effects of certain communication patterns on group performance,
Journal of Abnormal Psychology 46(1)

Correlation of different types of info


Suppose you have a record of phone numbers called, a database of
political campaign donations, and a list of government appointees. Put
them together, and you have this story:
WASHINGTON - Time and again, Texas Gov. Rick Perry picked up his office phone in
the months before he would announce his bid for the presidency. He dialed
wealthy friends who were his big fundraisers and state officials who owed him for
their jobs.
Perry also met with a Texas executive who would later co-found an independent
political committee that has promised to raise millions to support Perry but is
prohibited from coordinating its activities with the governor.
- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
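
A minimal sketch of this kind of cross-referencing, assuming the three sources have already been cleaned into simple Python structures. Every number, name, and amount is an invented placeholder, and the `directory` that maps phone numbers to names is an extra assumption: in practice, building it is part of the reporting.

```python
# All names, numbers, and amounts below are invented placeholders.
phone_log = ["555-0101", "555-0102", "555-0101", "555-0199"]
directory = {"555-0101": "Donor A", "555-0102": "Official B"}
donations = {"Donor A": 250000}
appointees = {"Official B"}

def describe_calls(phone_log, directory, donations, appointees):
    """Count calls per identified person and attach donor/appointee context."""
    summary = {}
    for number in phone_log:
        name = directory.get(number)
        if name is None:
            continue  # unidentified numbers still need manual reporting
        entry = summary.setdefault(name, {
            "calls": 0,
            "donated": donations.get(name, 0),
            "appointee": name in appointees,
        })
        entry["calls"] += 1
    return summary

summary = describe_calls(phone_log, directory, donations, appointees)
for name, info in summary.items():
    print(name, info)
```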

Social Network Analysis in Journalism

Identify people or communities


Track money and criminal networks
Understand spread of information and behavior
Illustrate complex stories

Useful in all areas where CS intersects journalism! (Reporting,
communication, filtering, effect tracking)

Two major analysis methods


These apply after you have the network data; assembling it may be a very
manual process.
Look at a visualization
Apply algorithm
In both cases, the results are not interpretable without context!

Force-Directed Layout

Each edge is a "spring" with a fixed preferred length.


Plus global repulsive force that pushes all nodes apart.
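
The two forces can be sketched as a small simulation. This is an illustrative toy, not the algorithm of any particular visualization tool: the linear spring law, the inverse-square repulsion, the per-step movement cap, and all parameter values are assumptions chosen to keep the example stable.

```python
import math
import random

def force_directed_layout(nodes, edges, rest_length=1.0, stiffness=0.1,
                          repulsion=0.1, steps=500, max_move=0.1, seed=0):
    """Return {node: (x, y)} after a simple spring-and-repulsion simulation.

    `nodes` is a list; `edges` is a list of (a, b) pairs.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}

    def distance(a, b):
        return math.hypot(pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]) or 1e-9

    for _ in range(steps):
        # Collect (a, b, pull): positive pulls a and b together, negative pushes apart.
        pulls = []
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                pulls.append((a, b, -repulsion / distance(a, b) ** 2))
        for a, b in edges:
            pulls.append((a, b, stiffness * (distance(a, b) - rest_length)))

        force = {n: [0.0, 0.0] for n in nodes}
        for a, b, pull in pulls:
            d = distance(a, b)
            fx = pull * (pos[b][0] - pos[a][0]) / d
            fy = pull * (pos[b][1] - pos[a][1]) / d
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy

        # Move each node along its net force, capped at max_move per step.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_move / size) if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return {n: tuple(p) for n, p in pos.items()}

# A three-node chain a-b-c: the unconnected ends should settle farthest apart.
layout = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Because of the repulsion term, edges settle somewhat longer than `rest_length`; the layout shows relative structure, not exact distances.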

From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.

We asked respondents three questions about the same
five focal nodes in each sociogram:
1) how many subgroups were in the sociogram
2) how prominent was each player in the sociogram
3) how important a bridging role did each player occupy
in the sociogram


Centrality
Often identified with "influence" or "power." Often important in
journalism.
We can visualize the graph and use our eyes, or we can compute centrality
values algorithmically.

Degree centrality: number of edges

Models: cases where the number of connections is important.


Example: which celebrity can reach the most people at once?
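
A minimal sketch of degree centrality on an undirected edge list. The names in the example graph are hypothetical.

```python
def degree_centrality(edges):
    """Count the number of edges touching each node."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree

# Hypothetical follower graph: everyone is connected to "celebrity".
edges = [("celebrity", "fan1"), ("celebrity", "fan2"),
         ("celebrity", "fan3"), ("fan1", "fan2")]
degrees = degree_centrality(edges)
print(max(degrees, key=degrees.get))
```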

Closeness centrality: average distance to all other nodes

Models: cases where time taken to reach a node is important.


Example: who finds out about gossip first?
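
A minimal sketch using the definition on this slide: the average shortest-path distance from a node to every other node, found by breadth-first search. With this definition a lower value means a more central node; many libraries report the reciprocal instead. The sketch assumes the graph is connected and has at least two nodes.

```python
from collections import deque

def average_distance(adjacency, source):
    """Average number of hops from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others)

# Hypothetical chain of five people: a - b - c - d - e
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
         "d": ["c", "e"], "e": ["d"]}
print(average_distance(chain, "c"), average_distance(chain, "a"))
```

In the chain, the middle person hears gossip after fewer hops on average than someone at either end.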

Betweenness centrality:
number of shortest paths that pass through node

Models: cases where control over transmission is important.


Example: who has the most power to make introductions?
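
A brute-force sketch for small undirected graphs. It uses the common fractional convention, which is an assumption beyond the slide's wording: when a pair of nodes has several shortest paths, each intermediate node is credited with the fraction of those paths that pass through it.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adjacency, source):
    """Return (distance, number of shortest paths) from source to each node."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                paths[neighbor] += paths[node]
    return dist, paths

def betweenness(adjacency):
    """Sum, over all pairs s,t, the share of shortest s-t paths through each node."""
    info = {n: bfs_counts(adjacency, n) for n in adjacency}
    score = {n: 0.0 for n in adjacency}
    for s, t in combinations(adjacency, 2):
        dist_s, paths_s = info[s]
        dist_t, paths_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for v in adjacency:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score

# Hypothetical graph: two triangles joined by the single edge c-d.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
         "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
scores = betweenness(graph)
```

Here `c` and `d` sit on every path between the two triangles, so they are the only people who can make introductions across groups.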

Eigenvector centrality:
how likely you are to end up at a node on a random walk
(same idea as PageRank)

Models: cases where importance of neighbors is important.


Example: the private adviser to the president
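
A sketch of the random-walk idea in the PageRank style: repeatedly move each node's score to its neighbors, with a small chance of jumping to a random node. The damping value of 0.85 and the iteration count are conventional assumptions, not values from the lecture, and this is the PageRank variant rather than the strict eigenvector of the adjacency matrix.

```python
def random_walk_scores(adjacency, damping=0.85, iterations=100):
    """Approximate how often a random walker is found at each node."""
    nodes = list(adjacency)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random node.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if neighbors:
                share = damping * score[node] / len(neighbors)
                for neighbor in neighbors:
                    new[neighbor] += share
            else:
                # A node with no links spreads its score evenly.
                for other in nodes:
                    new[other] += damping * score[node] / n
        score = new
    return score

# Hypothetical star graph: three people who only know the hub.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
walk = random_walk_scores(star)
print(max(walk, key=walk.get))
```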

Journalism centrality:
how important is this person to this story?

Who is "important"?
What type of person do you want to identify in the network?
Often assumed we're after "influential." But sociology says
"power" is a complicated thing and difficult to define and
measure.
Network analysis has mostly ignored this problem. I know of no
successful use of centrality metrics in journalism; maybe you'll
be the first.

Finding Communities
No one definition of "community." Could mean a town, or a club, or an
industry network.
But for our purposes, a community is "a group of people with pre-existing
patterns of association."
In social network analysis, that translates into clusters in the graph.

Friends/followers

Co-consumption: Network of political book sales, Orgnet.com

Communications network: Exploring Enron, Jeffrey Heer

Web link structure: Map of Iranian Blogosphere, Berkman Center

Individual time/location trails: CitySense, Sense Networks

Warning: no network is ever "complete."
Otherwise there would be 7 billion people in it.

Mathematical definitions of "cluster"


You've already seen several! If you can compute distance between any
two items, you can cluster.
But in social networks, not everyone is connected to everyone else...

Modularity

Are there more intra-group edges than we would
expect randomly?

Modularity
n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if edge between i,j, 0 otherwise
$g_{ij}$ = 1 if i,j in same group, 0 otherwise
There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.
If they go between random vertices then the expected
number of edges between i,j is $k_i k_j / 2m$.

Modularity
n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if edge between i,j, 0 otherwise
$g_{ij}$ = 1 if i,j in same group, 0 otherwise
Modularity

$$Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$$

If Q>0 then there are "excess" edges inside the
groups (and fewer edges between them.)
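
A direct, unoptimized sketch of this sum: for every ordered pair i,j in the same group, add A_ij minus the expected number of edges k_i k_j / 2m. It follows the unnormalized form used here; many references divide the result by 2m. The example graph and its grouping are hypothetical.

```python
def modularity(nodes, edges, group):
    """Sum of (A_ij - k_i * k_j / 2m) over all ordered pairs in the same group."""
    degree = {n: 0 for n in nodes}
    adjacent = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adjacent.add((a, b))
        adjacent.add((b, a))
    two_m = sum(degree.values())  # every edge is counted once per endpoint
    q = 0.0
    for i in nodes:
        for j in nodes:
            if group[i] == group[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                q += a_ij - degree[i] * degree[j] / two_m
    return q

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {n: "x" for n in nodes}
print(modularity(nodes, edges, by_triangle), modularity(nodes, edges, one_group))
```

Grouping by triangle gives a positive Q, while putting everyone in one group gives zero, since the observed and expected edge counts then cancel exactly.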

Modularity algorithm
Look for a division of nodes into two groups that
maximizes Q
Can find this through eigenvector technique
Possible that no division has Q>0, in which case the
graph is a single community
If a division with Q>0 found, split
Recursively split sub-graphs
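
One way to sketch the eigenvector step, as an illustration only: build the matrix B_ij = A_ij - k_i k_j / 2m, find its leading eigenvector by power iteration, and split nodes by the sign of their entry. The diagonal shift, the starting vector, and the iteration count are implementation assumptions. The sketch performs a single split; a faithful recursive version needs an adjusted matrix for each sub-graph and is omitted.

```python
def modularity_matrix(nodes, edges):
    """B[i][j] = A_ij - k_i * k_j / 2m for an undirected graph."""
    index = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    adj = [[0.0] * size for _ in range(size)]
    for a, b in edges:
        adj[index[a]][index[b]] = 1.0
        adj[index[b]][index[a]] = 1.0
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(size)]
            for i in range(size)]

def leading_eigenvector(matrix, iterations=500):
    """Power iteration on (matrix + shift*I); the shift makes all eigenvalues >= 0."""
    size = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vec = [1.0 + 0.1 * i for i in range(size)]  # deliberately not uniform
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(size)) + shift * vec[i]
               for i in range(size)]
        norm = max(abs(x) for x in new) or 1.0
        vec = [x / norm for x in new]
    return vec

def split_in_two(nodes, edges):
    """Return two groups if the split has Q > 0, otherwise one community."""
    b = modularity_matrix(nodes, edges)
    side = [x > 0 for x in leading_eigenvector(b)]
    size = len(nodes)
    q = sum(b[i][j] for i in range(size) for j in range(size)
            if side[i] == side[j])
    if all(side) or not any(side) or q <= 1e-9:
        return [set(nodes)]
    return [{n for n, s in zip(nodes, side) if s},
            {n for n, s in zip(nodes, side) if not s}]

# Hypothetical graph: two triangles joined by the edge 2-3.
parts = split_in_two([0, 1, 2, 3, 4, 5],
                     [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(parts)
```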

The Hairball problem

Real social networks are big, with complex, overlapping communities in the central
component. Modularity and other community detection algorithms give poor results.

K-core Decomposition
Find the nodes at the "center" of a network.
for k=1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
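
The pseudocode above can be sketched in Python roughly as follows. The adjacency-dictionary input format and the function name are assumptions, and the loop stops as soon as no nodes remain rather than running up to the maximum degree.

```python
def core_numbers(adjacency):
    """Return {node: core number} by repeatedly peeling low-degree nodes."""
    remaining = {n: set(neighbors) for n, neighbors in adjacency.items()}
    core = {n: 0 for n in adjacency}
    k = 1
    while remaining:
        # Repeat: remove all nodes with degree < k, until none are left to remove.
        while True:
            low = [n for n, neighbors in remaining.items() if len(neighbors) < k]
            if not low:
                break
            for n in low:
                for neighbor in remaining.pop(n):
                    if neighbor in remaining:
                        remaining[neighbor].discard(n)
        # Everything that survived this round is in the k-core.
        for n in remaining:
            core[n] = k
        k += 1
    return core

# Hypothetical graph: a triangle a-b-c with one extra person d attached to c.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(core_numbers(graph))
```

Note that `c` has the highest degree but the same core number as `a` and `b`: core number measures how deeply embedded a node is, not how many links it has.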

K-core Decomposition

Carmi et al., A model of Internet topology using k-shell decomposition

Protest Dynamics on Twitter

González-Bailón et al., The Dynamics of Protest Recruitment through an Online
Network

k-core number vs. maximum cascade size. Color = sent at least one tweet
which reached this fraction of users (orange = reached all users)

Key insight: triangles not edges

Simmel's theory of sociology (early 20th C.) says that a
relationship between two people cannot be
understood without context.

Idea: count shared triangles


1. For each node A and each of A's friends B, count the
number of triangles involving A and B (= number of shared
friends of A and B).
2. Rank A's friends (each B) by number of shared friends
(number of C's for A,B) to create a "top friends" list for A.
3. Keep the edge between nodes A,D only if there is some
threshold percentage overlap in their top friends lists.
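
The three steps can be sketched as below. The list length `top_k`, the threshold `min_overlap`, the alphabetical tie-breaking, and measuring overlap as shared entries divided by `top_k` are all assumptions made for this example; the published method may define them differently.

```python
def simmelian_backbone(adjacency, top_k=2, min_overlap=0.5):
    """Keep only edges whose endpoints have overlapping "top friends" lists.

    `adjacency` maps each node to a set of its friends.
    """
    top_friends = {}
    for a, friends in adjacency.items():
        # Step 1: shared friends of A and B = triangles on the edge A-B.
        shared = {b: len(friends & adjacency[b]) for b in friends}
        # Step 2: rank A's friends by shared friends (ties broken by name).
        ranked = sorted(friends, key=lambda b: (-shared[b], b))
        top_friends[a] = set(ranked[:top_k])
    backbone = set()
    for a, friends in adjacency.items():
        for d in friends:
            # Step 3: keep the edge only if the two top lists overlap enough.
            overlap = len(top_friends[a] & top_friends[d]) / top_k
            if overlap >= min_overlap:
                backbone.add(tuple(sorted((a, d))))
    return backbone

# Hypothetical graph: two groups of four mutual friends, joined by the edge d-w.
graph = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "w"},
    "w": {"d", "x", "y", "z"},
    "x": {"w", "y", "z"}, "y": {"w", "x", "z"}, "z": {"w", "x", "y"},
}
kept = simmelian_backbone(graph)
print(("d", "w") in kept, len(kept))
```

The edge d-w is part of no triangle, so it drops out, while the twelve edges inside the two groups survive.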

Simmelian Backbones

SNA in journalism

ICIJ Offshore Tax Haven leak


ICIJ human tissue investigation
Organized Crime and Corruption Reporting Project
WSJ Galleon's Web insider trading story
SCMP's Who Runs Hong Kong
Muckety.com

The other challenge was the data itself. How to separate the extraordinary from the routine
and find the public interest inside a maze of more than 37,000 offshore company holders? A
first step was to build as many lists as possible of public figures: Politburo members, military
commanders, mayors of large cities, billionaires listed in Forbes and Hurun's rankings of the
mega-wealthy and so-called princelings (relatives of the current leadership or former
Communist Party elders).
Through painstaking database work, a reporter in Spain cross-referenced the lists of notable
Chinese against the names of offshore clients listed within ICIJ's Offshore Leaks data. The
added difficulty was that in most cases, names in the offshore files were registered in
Romanized form, not Chinese characters. This made making exact matches extremely hard,
because Romanized spellings from Chinese characters tend to vary widely: Wang might be
spelled Wong, Zhang could be Cheung, and Ye might be spelled Yeh. Addresses and ID
numbers helped confirmed many identities but many others names were dropped because
the reporting team could not be 100 percent sure that the person was a correct match.
A picture slowly began to emerge: China's elites were aggressively using offshore havens to
hold assets, list companies in the world's stock exchanges, buy and sell real estate and
conduct their business away from Beijing's red tape and capital controls.

How We Did Offshore Leaks China, ICIJ

Analyzing the Data behind Skin and Bone, ICIJ

Who Runs HK? The Fight over Stanley Ho's Fortune


South China Morning Post, 2010

SNA that could be used in Journalism

The Network of Global Corporate Control paper


Network of campaign finance contributions (SuperPACs)
International financial system / HFT
"Revolving door" / regulatory capture
Political elite in any country
Find audience for story, akin to targeted marketing
...

Vitali, Glattfelder, Battiston, The Network of Global Corporate Control

SNA in journalism
Visualization widely used
Link analysis successful in investigative reporting
Most of the work required to do these types of stories is
traditional research, not algorithmically-guided.
I am not aware of successful application of centrality metrics
or community detection algorithms.
This may change as the graphs journalism examines get
bigger...
Would it be possible to use community detection to find the
"right" audience for a story?
