Computational Journalism
Columbia Journalism School
Week 9: Social Network Analysis
November 20, 2015
Network
A set of people
Types of connections
Social network analysis: only one type of connection between
individuals (e.g. "friend")
Link analysis: multiple types of connections
friend
brother
employer
went to university with
sold a car to
owns 51% of
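One way to hold both views in code, as a minimal sketch: store each connection as a (source, target, type) triple for link analysis, then filter down to a single type to get a plain social network. The names in `links` and the helpers `single_type_network` and `connections_of` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical link-analysis data: every connection carries its own type.
links = [
    ("Alice", "Bob", "friend"),
    ("Alice", "Dan", "friend"),
    ("Bob", "Dan", "brother"),
    ("Dan", "Acme Ltd", "owns 51% of"),
    ("Acme Ltd", "Bob", "employer"),
]

def single_type_network(links, relation):
    """Keep only one connection type and return an undirected adjacency map."""
    adjacency = defaultdict(set)
    for source, target, kind in links:
        if kind == relation:
            adjacency[source].add(target)
            adjacency[target].add(source)
    return dict(adjacency)

def connections_of(links, name):
    """List every (other party, connection type) pair that involves `name`."""
    found = []
    for source, target, kind in links:
        if source == name:
            found.append((target, kind))
        elif target == name:
            found.append((source, kind))
    return found

# Social network analysis view: only the "friend" ties survive.
friend_network = single_type_network(links, "friend")
# Link analysis view: everything known about one entity, whatever the tie type.
dan_links = connections_of(links, "Dan")
```

Note that the typed list can mix people and companies, which a single-type "friend" network cannot.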
Homophily
Homophily is the principle that contact between similar people
occurs at a higher rate than among dissimilar people. The
pervasive fact of homophily means that cultural, behavioral,
genetic, or material information that flows through networks
will tend to be localized. Homophily implies that distance in
terms of social characteristics translates into network distance,
the number of relationships through which a piece of
information must travel to connect two individuals.
- McPherson, Smith-Lovin, Cook
Birds of a feather: homophily in social networks
Force-Directed Layout
Centrality
Often identified with "influence" or "power." Often important in
journalism.
We can visualize the graph and use our eyes, or we can compute centrality
values algorithmically.
Betweenness centrality:
number of shortest paths that pass through node
Eigenvector centrality:
how likely you are to end up at a node on a random walk
(same idea as PageRank)
Journalism centrality:
how important is this person to this story?
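The two computable metrics above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the graph is unweighted and undirected, betweenness uses the usual convention that a pair with several shortest paths splits its count between them, and eigenvector centrality is approximated by power iteration on the adjacency matrix (each node also keeps its own score at every step, an assumption made here so the iteration settles). "Journalism centrality" has no formula, so it is not computed. The example graph, two triangles joined by one edge, is made up.

```python
from collections import deque

def betweenness_centrality(adj):
    """Betweenness for an unweighted, undirected graph (Brandes-style BFS)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        dist = {source: 0}
        paths = {v: 0 for v in adj}      # number of shortest paths from source
        paths[source] = 1
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        order = []                       # nodes in order of discovery
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2 for v, s in score.items()}

def eigenvector_centrality(adj, iterations=100):
    """Power iteration: a node's score is fed by its neighbours' scores."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        largest = max(new.values())
        score = {v: x / largest for v, x in new.items()}  # rescale so max is 1
    return score

# Made-up example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
between = betweenness_centrality(graph)
eigen = eigenvector_centrality(graph)
```

In this graph both metrics single out nodes 2 and 3, the two ends of the bridge between the triangles.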
Who is "important"?
What type of person do you want to identify in the network?
Often assumed we're after "influential." But sociology says
"power" is a complicated thing and difficult to define and
measure.
Network analysis has mostly ignored this problem. I know of no
successful use of centrality metrics in journalism; maybe you'll
be the first.
Finding Communities
No one definition of "community." Could mean a town, or a club, or an
industry network.
But for our purposes, a community is "a group of people with pre-existing
patterns of association."
In social network analysis, that translates into clusters in the graph.
Friends/followers
Modularity
n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if edge between i,j, 0 otherwise
$g_{ij}$ = 1 if i,j in same group, 0 otherwise
There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.
If they go between random vertices then
number of edges between i,j is $k_i k_j / 2m$
Modularity
$$Q = \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$$
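As a rough sketch, Q can be computed directly from an edge list by summing, over every ordered pair of vertices in the same group, the actual edge indicator minus the edge count expected at random, k_i k_j / 2m. The function name `modularity` and the example graph are assumptions for illustration. The sum is left unnormalized, following the definitions above; other presentations divide Q by 2m, which does not change its sign.

```python
from itertools import product

def modularity(edges, group):
    """Sum of (A_ij - k_i*k_j / 2m) over all ordered pairs i,j in the same group.

    edges: list of undirected (i, j) pairs; group: dict mapping vertex -> label.
    """
    nodes = sorted(group)
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        adjacent.add((i, j))
        adjacent.add((j, i))
    m = sum(degree.values()) / 2          # total number of edges
    q = 0.0
    for i, j in product(nodes, repeat=2):
        if group[i] == group[j]:          # g_ij = 1
            a_ij = 1 if (i, j) in adjacent else 0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q

# Made-up example: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}

q_split = modularity(edges, by_triangle)   # positive: the split matches the structure
q_whole = modularity(edges, one_group)     # zero: one big group is the baseline
```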
Modularity algorithm
Look for a division of nodes into two groups that
maximizes Q
Can find this through eigenvector technique
Possible that no division has Q>0, in which case the
graph is a single community
If a division with Q>0 found, split
Recursively split sub-graphs
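A minimal sketch of the first step only, under loud assumptions: instead of the eigenvector technique, it tries every two-group division by brute force, which is only feasible for a handful of nodes, and it does not recurse into sub-graphs. The function names and example graphs are hypothetical, Q is the unnormalized sum of (A_ij - k_i k_j / 2m) over same-group pairs, and the graph is assumed to have at least one edge.

```python
from itertools import combinations

def modularity(nodes, edges, group_a):
    """Q for the division of nodes into group_a and everything else."""
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        adjacent.update([(i, j), (j, i)])
    two_m = sum(degree.values())          # 2m = sum of all degrees
    q = 0.0
    for i in nodes:
        for j in nodes:
            if (i in group_a) == (j in group_a):   # same side of the split
                a_ij = 1 if (i, j) in adjacent else 0
                q += a_ij - degree[i] * degree[j] / two_m
    return q

def best_two_way_split(nodes, edges):
    """Brute-force search for the two-group division with the largest Q > 0.

    Returns (Q, (group_a, group_b)), or (0.0, None) when no division has
    Q > 0, meaning the graph is treated as a single community.
    """
    first, rest = nodes[0], nodes[1:]
    best_q, best_split = 0.0, None
    # Pin the first node to group A so each division is tried once;
    # sizes stop short of len(rest) so group B is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group_a = {first, *extra}
            q = modularity(nodes, edges, group_a)
            if q > best_q + 1e-12:
                best_q = q
                best_split = (group_a, set(nodes) - group_a)
    return best_q, best_split

# Two triangles joined by one edge: the best split separates the triangles.
bridge_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_bridge, split_bridge = best_two_way_split(list(range(6)), bridge_edges)

# A single triangle: no division has Q > 0, so it stays one community.
q_tri, split_tri = best_two_way_split([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
```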
Real social networks are big, with complex, overlapping communities in the central
component. Modularity and other community detection algorithms give poor results.
K-core Decomposition
Find the nodes at the "center" of a network.
for k = 1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
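The pseudocode above translates fairly directly into Python. This is a straightforward sketch rather than an optimized one; it assumes an undirected graph given as a symmetric adjacency map, and the function name `core_numbers` and the example graph are made up for illustration.

```python
def core_numbers(adjacency):
    """Assign each node the largest k for which it survives the pruning."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    core = {v: 0 for v in adjacency}
    max_degree = max((len(nbrs) for nbrs in adjacency.values()), default=0)

    for k in range(1, max_degree + 1):
        # Repeat: removing a node lowers its neighbours' degrees,
        # which may push them below k as well.
        while True:
            too_low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
            if not too_low:
                break
            for v in too_low:
                for u in remaining.pop(v):
                    if u in remaining:
                        remaining[u].discard(v)
        if not remaining:
            break
        for v in remaining:
            core[v] = k
    return core

# Made-up example: a triangle (0, 1, 2), a pendant node 3, an isolated node 4.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
cores = core_numbers(graph)
```

Here the triangle ends up with core number 2, the pendant node with 1, and the isolated node with 0, even though node 2 has the highest degree.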
K-core Decomposition
k-core number vs. maximum cascade size. Color = sent at least one tweet
which reached this fraction of users (orange = reached all users)
Simmelian Backbones
SNA in journalism
The other challenge was the data itself. How to separate the extraordinary from the routine
and find the public interest inside a maze of more than 37,000 offshore company holders? A
first step was to build as many lists as possible of public figures: Politburo members, military
commanders, mayors of large cities, billionaires listed in Forbes' and Hurun's rankings of the
mega-wealthy and so-called princelings (relatives of the current leadership or former
Communist Party elders).
Through painstaking database work, a reporter in Spain cross-referenced the lists of notable
Chinese against the names of offshore clients listed within ICIJ's Offshore Leaks data. The
added difficulty was that in most cases, names in the offshore files were registered in
Romanized form, not Chinese characters. This made making exact matches extremely hard,
because Romanized spellings from Chinese characters tend to vary widely: Wang might be
spelled Wong, Zhang could be Cheung, and Ye might be spelled Yeh. Addresses and ID
numbers helped confirm many identities, but many other names were dropped because
the reporting team could not be 100 percent sure that the person was a correct match.
A picture slowly began to emerge: China's elites were aggressively using offshore havens to
hold assets, list companies in the world's stock exchanges, buy and sell real estate and
conduct their business away from Beijing's red tape and capital controls.
SNA in journalism
Visualization widely used
Link analysis successful in investigative reporting
Most of the work required to do these types of stories is
traditional research, not algorithmically-guided.
I am not aware of successful application of centrality metrics
or community detection algorithms.
This may change as the graphs journalism examines get
bigger...
Would it be possible to use community detection to find the
"right" audience for a story?