Social Network Analysis. Computational Journalism Week 10

Frontiers of Computational Journalism
Columbia Journalism School
Week 9: Social Network Analysis
November 20, 2015
Network
A set of people
and a set of connections between pairs of them
Types of connections
Social network analysis: only one type of connection between
individuals (e.g. "friend")
Link analysis: multiple types of connections
friend
brother
employer
went to university with
sold a car to
owns 51% of
Link analysis is much more relevant to journalism, because it
allows representation of much more detail and context.
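A minimal sketch of the difference between the two views, assuming links are stored as (source, relation, target) triples. The people, the company and the helper names `links_by_type` and `social_network` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical link-analysis records: (source, relation, target).
links = [
    ("Alice", "friend", "Bob"),
    ("Alice", "employer", "Carol"),
    ("Bob", "sold a car to", "Carol"),
    ("Carol", "owns 51% of", "Acme Ltd"),
]

def links_by_type(links):
    """Group (source, target) pairs under each relation type."""
    grouped = defaultdict(list)
    for source, relation, target in links:
        grouped[relation].append((source, target))
    return dict(grouped)

def social_network(links, relation):
    """Collapse to one relation type, as plain social network analysis would."""
    return {frozenset((s, t)) for s, r, t in links if r == relation}

by_type = links_by_type(links)
friends_only = social_network(links, "friend")
```

Keeping the relation on every edge preserves the context; collapsing to a single relation throws most of it away.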
People Act in Groups

Family and friendships: I am most closely connected to a small set of
people, who are usually closely connected to each other.
Business: I am much more likely to do business with people I already
know.
Influence: I listen to people I know more than I listen to strangers.
Norms: what is right depends on what the people around me think.
People tend to marry, do business with, spend time with, etc. people
from similar backgrounds... and people who have social ties tend to
be similar.
Homophily
Homophily is the principle that contact between similar people
occurs at a higher rate than among dissimilar people. The
pervasive fact of homophily means that cultural, behavioral,
genetic, or material information that flows through networks
will tend to be localized. Homophily implies that distance in
terms of social characteristics translates into network distance,
the number of relationships through which a piece of
information must travel to connect two individuals.
- McPherson, Smith-Lovin, Cook
Birds of a feather: homophily in social networks
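One rough way to check for homophily in a data set is to compare the share of ties that join similar people with the share expected if ties ignored the attribute. The sketch below assumes each person has a single categorical attribute; the toy data and function names are invented, and this is not the measure used in the paper.

```python
from itertools import combinations

# Hypothetical toy data: six people, one attribute, six ties.
group = {"a": "red", "b": "red", "c": "red",
         "d": "blue", "e": "blue", "f": "blue"}
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("c", "d")]

def same_group_share(edges, group):
    """Fraction of observed ties that join two people in the same group."""
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)

def baseline_share(group):
    """Fraction of all possible pairs that are in the same group."""
    pairs = list(combinations(group, 2))
    same = sum(1 for u, v in pairs if group[u] == group[v])
    return same / len(pairs)

observed = same_group_share(edges, group)
baseline = baseline_share(group)
```

An observed share well above the baseline is consistent with homophily, though it does not say whether similarity caused the ties or the ties caused the similarity.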
Structure Relates to Behavior
In a 1951 experiment, researchers had five people work together, only
allowed to communicate according to one of the patterns above. They
were each given a card with several symbols on it. The task was to
determine which symbol was in common between all of the cards. It
was repeated many times.
How did the groups organize themselves? Which patterns were fastest?
From H. Leavitt, Some effects of certain communication patterns on group performance,
Journal of Abnormal and Social Psychology 46(1), 1951
Correlation of different types of info

Suppose you have a record of phone numbers called, a database of
political campaign donations, and a list of government appointees. Put
them together, and you have this story:
WASHINGTON - Time and again, Texas Gov. Rick Perry picked up his office phone in
the months before he would announce his bid for the presidency. He dialed
wealthy friends who were his big fundraisers and state officials who owed him for
their jobs.
Perry also met with a Texas executive who would later co-found an independent
political committee that has promised to raise millions to support Perry but is
prohibited from coordinating its activities with the governor.
- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
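The cross-referencing behind a story like this can be sketched as a join across the three sources. Everything below is invented: the records, the names, and the `phone_book` lookup that maps numbers to people, which in practice is often the hardest part to build. It is a sketch of the general idea, not a description of how the AP did it.

```python
# Hypothetical records; every name and number here is made up.
calls = [
    {"number": "555-0101", "date": "2011-03-02"},
    {"number": "555-0101", "date": "2011-04-11"},
    {"number": "555-0199", "date": "2011-03-05"},
    {"number": "555-0150", "date": "2011-05-20"},
]
phone_book = {"555-0101": "Pat Donor",
              "555-0199": "Sam Official",
              "555-0150": "Lee Stranger"}
donations = {"Pat Donor": 50000}      # name -> total donated
appointees = {"Sam Official"}         # names of government appointees

def cross_reference(calls, phone_book, donations, appointees):
    """Count calls per person, keeping only donors and appointees."""
    rows = {}
    for call in calls:
        name = phone_book.get(call["number"])
        if name is None:
            continue  # number could not be identified
        row = rows.setdefault(name, {
            "calls": 0,
            "donated": donations.get(name, 0),
            "appointee": name in appointees,
        })
        row["calls"] += 1
    return {n: r for n, r in rows.items() if r["donated"] > 0 or r["appointee"]}

flagged = cross_reference(calls, phone_book, donations, appointees)
```

The output is a list of leads to report out, not a finding in itself.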
Social Network Analysis in Journalism
Identify people or communities

Track money and criminal networks
Understand spread of information and behavior
Illustrate complex stories
Useful in all areas where CS intersects journalism! (Reporting,
communication, filtering, effect tracking)
Two major analysis methods
after you have the network data, which may be a very manual
process.
Look at a visualization
Apply an algorithm
In both cases, the results are not interpretable without context!
Force-Directed Layout
Each edge is a "spring" with a fixed preferred length.
Plus a global repulsive force that pushes all nodes apart.
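A minimal sketch of such a layout, assuming linear springs on edges and an inverse-square repulsion between every pair of nodes. The constants, the step cap and the function name `force_layout` are illustrative choices, not taken from any particular tool.

```python
import math
import random

def force_layout(nodes, edges, spring_length=1.0, spring_k=0.1,
                 repulsion=0.05, steps=500, max_step=0.1, seed=0):
    """Return {node: [x, y]} after a fixed number of simulation steps."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion: every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Springs: each edge pulls or pushes toward its preferred length.
        for a, b in edges:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = spring_k * (dist - spring_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos

# Two linked nodes settle slightly beyond the spring length,
# where the spring's pull balances the repulsion.
pair = force_layout(["a", "b"], [("a", "b")])

# Two triangles joined by a single bridge edge (c-d).
barbell = force_layout(
    list("abcdef"),
    [("a", "b"), ("b", "c"), ("a", "c"),
     ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")],
)
```

Different seeds and constants give different pictures of the same graph, which is one reason a layout should not be read as evidence on its own.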
From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.
We asked respondents three questions about the same
five focal nodes in each sociogram:
1) how many subgroups were in the sociogram
2) how prominent was each player in the sociogram
3) how important a bridging role did each player occupy
in the sociogram
Centrality
Often identified with "influence" or "power." Often important in
journalism.
We can visualize the graph and use our eyes, or we can compute centrality
values algorithmically.
Degree centrality: number of edges
Models: cases where the number of connections is important.

Example: which celebrity can reach the most people at once?
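Degree centrality is simple enough to compute by counting edge endpoints. A small sketch with an invented, undirected edge list:

```python
from collections import Counter

# Hypothetical undirected edges.
edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ana", "Dee"),
         ("Ben", "Cy"), ("Dee", "Eli")]

def degree_centrality(edges):
    """Count how many edges touch each node."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree

degree = degree_centrality(edges)
top_name, top_degree = degree.most_common(1)[0]
```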
Closeness centrality: inverse of the average distance to all other nodes
Models: cases where time taken to reach a node is important.

Example: who finds out about gossip first?
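Closeness is commonly reported as the reciprocal of the average shortest-path distance, so that a higher score means "closer." A sketch using breadth-first search, assuming an unweighted, connected graph stored as a dict of neighbor sets (the path graph below is made up):

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest-path length from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness(adj):
    """(number of other reachable nodes) / (sum of distances to them)."""
    scores = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(dist.values())
        scores[node] = (len(dist) - 1) / total if total else 0.0
    return scores

# A chain of five people: a - b - c - d - e.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
scores = closeness(adj)
```

In the chain, the middle person c scores highest: gossip starting anywhere reaches c in at most two steps.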
Betweenness centrality:
number of shortest paths that pass through node
Models: cases where control over transmission is important.

Example: who has the most power to make introductions?
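Counting shortest paths by hand gets tedious quickly. The sketch below uses Brandes' algorithm, a standard method for unweighted graphs, on an invented graph of two triangles joined by one bridge; when several shortest paths tie, each path contributes a fractional share.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Two triangles (a, b, c) and (d, e, f) joined by the bridge c - d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
between = betweenness(adj)
```

Only c and d sit on paths between the two triangles, so they are the only nodes with a nonzero score.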
Eigenvector centrality:
a node scores highly when its neighbors score highly; closely related to
how likely you are to end up at a node on a random walk
(same idea as PageRank)
Models: cases where importance of neighbors is important.

Example: the private adviser to the president
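A sketch of eigenvector centrality by power iteration, where each node's score is repeatedly replaced by its own score plus the sum of its neighbors' scores and then rescaled. Adding the node's own score is a common stabilizing trick that leaves the final ranking unchanged. The graph and its labels are invented to mirror the adviser example.

```python
import math

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on the adjacency structure (plus a self-term)."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[nb] for nb in adj[v]) for v in adj}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}
    return score

# Hypothetical graph: the adviser and the clerk each have exactly one tie,
# but the adviser's tie is to the best-connected node.
adj = {
    "president": {"m1", "m2", "m3", "adviser"},
    "m1": {"president", "aide"},
    "m2": {"president"},
    "m3": {"president"},
    "adviser": {"president"},
    "aide": {"m1", "clerk"},
    "clerk": {"aide"},
}
score = eigenvector_centrality(adj)
```

Degree centrality cannot tell the adviser from the clerk; this measure ranks the adviser higher because of who the single connection is.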
Journalism centrality:
how important is this person to this story?
Who is "important"?
What type of person do you want to identify in the network?
Often assumed we're after "influential." But sociology says
"power" is a complicated thing and difficult to define and
measure.
Network analysis has mostly ignored this problem. I know of no
successful use of centrality metrics in journalism; maybe you'll
be the first.
Finding Communities
No one definition of "community." Could mean a town, or a club, or an
industry network.
But for our purposes, a community is "a group of people with pre-existing
patterns of association."
In social network analysis, that translates into clusters in the graph.
Friends/followers
Co-consumption: Network of political book sales, Orgnet.com
Communications network: Exploring Enron, Jeffrey Heer
Web link structure: Map of Iranian Blogosphere, Berkman Center
Individual time/location trails: CitySense, Sense Networks
Warning: no network is ever "complete."
Otherwise there would be 7 billion people in it.
Mathematical definitions of "cluster"

You've already seen several! If you can compute distance between any
two items, you can cluster.
But in social networks, not everyone is connected to everyone else...
Modularity
Are there more intra-group edges than we would
expect randomly?
Modularity
$n$ = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if there is an edge between $i$ and $j$, 0 otherwise
$g_{ij}$ = 1 if $i$ and $j$ are in the same group, 0 otherwise
There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.
If they go between random vertices, then the expected
number of edges between $i$ and $j$ is $k_i k_j / 2m$.
Modularity
$Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$
If Q>0 then there are "excess" edges inside the
groups (and fewer edges between them.)
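A worked sketch of the quantity Q = sum over all pairs i,j of (A_ij - k_i k_j / 2m) g_ij, computed directly from an edge list. The graph (two triangles joined by one bridge) and the group labels are invented. Many references divide this sum by 2m so that Q is at most 1; the version here keeps the unnormalized form, which has the same sign.

```python
def modularity(edges, group):
    """Sum of (A_ij - k_i * k_j / 2m) over ordered pairs in the same group."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    m = len(edges)
    adjacent = set(edges) | {(v, u) for u, v in edges}
    q = 0.0
    for i in degree:
        for j in degree:
            if group[i] == group[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                q += a_ij - degree[i] * degree[j] / (2 * m)
    return q

edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]

by_triangle = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
everyone = {n: 1 for n in "abcdef"}

q_split = modularity(edges, by_triangle)
q_whole = modularity(edges, everyone)
```

Here m = 7. Each triangle contributes 6 - 49/14 = 2.5, so the split scores 5, while putting everyone in one group scores 0.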
Modularity algorithm
Look for a division of nodes into two groups that
maximizes Q
Can find this through eigenvector technique
Possible that no division has Q>0, in which case the
graph is a single community
If a division with Q>0 found, split
Recursively split sub-graphs
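The splitting step can be illustrated on a tiny graph by brute force: try every two-way division and keep the one with the largest Q. This exhaustive search stands in for the eigenvector technique, which is what makes the method practical on real graphs; the function names and the example graphs are invented, and the recursion over sub-graphs is left out.

```python
from itertools import combinations

def split_score(nodes, edges, group_one):
    """Unnormalized Q for the division {group_one, everything else}."""
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    two_m = 2 * len(edges)
    q = 0.0
    for part in (group_one, set(nodes) - group_one):
        # Each internal edge appears twice in the sum over ordered pairs.
        inside = sum(2 for u, v in edges if u in part and v in part)
        degree_sum = sum(degree[n] for n in part)
        q += inside - degree_sum ** 2 / two_m
    return q

def best_split(nodes, edges):
    """Return (Q, group) for the best two-way division, or None if no Q > 0."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best = None
    # Fix `first` in group one so each division is tried once;
    # the group never grows to include every node.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group_one = {first, *extra}
            q = split_score(nodes, edges, group_one)
            if q > 1e-12 and (best is None or q > best[0]):
                best = (q, group_one)
    return best

barbell_edges = [("a", "b"), ("b", "c"), ("a", "c"),
                 ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
found = best_split("abcdef", barbell_edges)

triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]
no_split = best_split("abc", triangle_edges)
```

The two-triangle graph splits at the bridge, while a single triangle has no division with Q > 0 and stays one community.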
The Hairball problem
Real social networks are big, with complex, overlapping communities in the central
component. Modularity and other community detection algorithms give poor results.
K-core Decomposition
Find the nodes at the "center" of a network.
for k = 1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
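The peeling procedure translates almost line for line into Python. A sketch, assuming an undirected graph stored as a dict of neighbor sets; the example graph (a triangle with a two-node tail) is made up.

```python
def core_numbers(adj):
    """Return {node: core number} by repeatedly peeling low-degree nodes."""
    remaining = {n: set(nbs) for n, nbs in adj.items()}
    core = {n: 0 for n in adj}
    k = 1
    while remaining:
        # Repeat until every remaining node has degree >= k.
        while True:
            weak = [n for n, nbs in remaining.items() if len(nbs) < k]
            if not weak:
                break
            for n in weak:
                for nb in remaining.pop(n):
                    if nb in remaining:
                        remaining[nb].discard(n)
        # Whatever survived this round is in the k-core.
        for n in remaining:
            core[n] = k
        k += 1
    return core

# Triangle a, b, c with a tail c - d - e.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
core = core_numbers(adj)
```

The tail peels away at k = 2, leaving the triangle as the innermost core.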
K-core Decomposition
Carmi et al., A model of Internet topology using k-shell decomposition
Protest Dynamics on Twitter
González-Bailón et al., The Dynamics of Protest Recruitment through an Online
Network
k-core number vs. maximum cascade size. Color = sent at least one tweet
which reached this fraction of users (orange = reached all users)
Key insight: triangles not edges
Simmel's theory of sociology (early 20th C.) says
a relationship between two people cannot be
understood without context.
Idea: count shared triangles

1. For each node A and each of A's friends B, count the
number of triangles involving A and B (= number of shared
friends of A and B).
2. Rank A's friends (each B) by number of shared friends
(number of C's for A,B) to create a "top friends" list for A.
3. Keep the edge between nodes A,D only if there is some
threshold percentage overlap in their top friends lists.
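The three steps can be sketched as follows, assuming an undirected graph stored as a dict of neighbor sets. The list size, the overlap threshold and the way overlap is measured (shared entries divided by list size) are simplifying assumptions made for illustration, not the published definition.

```python
def shared_friends(adj, a, b):
    """Number of triangles on edge a-b, i.e. friends a and b have in common."""
    return len(adj[a] & adj[b])

def top_friends(adj, a, size):
    """A's friends ranked by shared friends (ties broken by name)."""
    ranked = sorted(adj[a], key=lambda b: (-shared_friends(adj, a, b), b))
    return set(ranked[:size])

def backbone(adj, size=2, min_overlap=0.5):
    """Keep an edge only if the two top-friends lists overlap enough."""
    top = {a: top_friends(adj, a, size) for a in adj}
    kept = set()
    for a in adj:
        for b in adj[a]:
            if a < b and len(top[a] & top[b]) / size >= min_overlap:
                kept.add((a, b))
    return kept

# Two triangles (a, b, c) and (d, e, f) joined by the bridge c - d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
kept = backbone(adj)
```

The bridge c - d sits in no triangle, so it is dropped, and the two tightly knit groups fall apart into separate pieces.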
Simmelian Backbones
SNA in journalism
ICIJ Offshore Tax Haven leak

ICIJ human tissue investigation
Organized Crime and Corruption Reporting Project
WSJ Galleon's Web insider trading story
SCMP's Who Runs Hong Kong
Muckety.com
The other challenge was the data itself. How to separate the extraordinary from the routine
and find the public interest inside a maze of more than 37,000 offshore company holders? A
first step was to build as many lists as possible of public figures: Politburo members, military
commanders, mayors of large cities, billionaires listed in Forbes and Hurun's rankings of the
mega-wealthy and so-called princelings (relatives of the current leadership or former
Communist Party elders).
Through painstaking database work, a reporter in Spain cross-referenced the lists of notable
Chinese against the names of offshore clients listed within ICIJ's Offshore Leaks data. The
added difficulty was that in most cases, names in the offshore files were registered in
Romanized form, not Chinese characters. This made making exact matches extremely hard,
because Romanized spellings from Chinese characters tend to vary widely: Wang might be
spelled Wong, Zhang could be Cheung, and Ye might be spelled Yeh. Addresses and ID
numbers helped confirmed many identities but many others names were dropped because
the reporting team could not be 100 percent sure that the person was a correct match.
A picture slowly began to emerge: China's elites were aggressively using offshore havens to
hold assets, list companies in the world's stock exchanges, buy and sell real estate and
conduct their business away from Beijing's red tape and capital controls.
How We Did Offshore Leaks China, ICIJ
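The cross-referencing problem described in the quote can be sketched as matching on normalized names. The variant table below contains only the three spelling pairs mentioned in the quote; all other names are placeholders, the function names are invented, and this is not ICIJ's actual process. Matches are candidates to verify against addresses and ID numbers, never confirmations.

```python
# Variant spellings taken from the examples in the quote above;
# a real table would be far larger.
SURNAME_VARIANTS = {"wong": "wang", "cheung": "zhang", "yeh": "ye"}

def normalise(name):
    """Lowercase, map known variants, and ignore word order."""
    parts = name.lower().replace("-", " ").split()
    return tuple(sorted(SURNAME_VARIANTS.get(p, p) for p in parts))

def candidate_matches(notable, clients):
    """Pair each client name with notable names that normalise the same way."""
    index = {}
    for person in notable:
        index.setdefault(normalise(person), []).append(person)
    return [(client, index[normalise(client)])
            for client in clients if normalise(client) in index]

# Placeholder names, not real people.
notable = ["Wang Alpha", "Zhang Beta", "Ye Gamma"]
clients = ["Alpha Wong", "Cheung Beta", "Yeh Delta"]
candidates = candidate_matches(notable, clients)
```

A loose rule like this trades false negatives for false positives, which is why the reporting team dropped names they could not verify.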
Analyzing the Data behind Skin and Bone, ICIJ
Who Runs HK? The Fight over Stanley Ho's Fortune

South China Morning Post, 2010
SNA that could be used in Journalism
The Network of Global Corporate Control paper

Network of campaign finance contributions (SuperPACs)
International financial system / HFT
"Revolving door" / regulatory capture
Political elite in any country
Find audience for story, akin to targeted marketing
...
Vitali, Glattfelder, Battiston, The Network of Global Corporate Control
SNA in journalism
Visualization widely used
Link analysis successful in investigative reporting
Most of the work required to do these types of stories is
traditional research, not algorithmically-guided.
I am not aware of successful application of centrality metrics
or community detection algorithms.
This may change as the graphs journalism examines get
bigger...
Would it be possible to use community detection to find the
"right" audience for a story?
