
Web Data Analysis

Department of Communication PhD Student Workshop


Web Mining for Communication Research
April 22-25, 2014
http://weblab.com.cityu.edu.hk/blog/project/workshops

Cheng-Jun Wang

Outline
I. Key features of web data
II. Major approaches to web data analysis
i. Network analysis
ii. Temporal analysis
iii. Spatial analysis
iv. Sentiment analysis

III. Reflections on web data analysis

FEATURES OF WEB DATA

Traditional vs. Web Data


Analysis of traditional (cross-sectional, fat) data
[Table: columns ID, V1, V2, V3, ...; rows 1 to 1,000]
Typical methods: multiple regression, log-linear model, multilevel analysis, structural equation modeling, etc.

Analysis of web (longitudinal, tall) data
[Table: columns ID, Time, V1; row labels 1, 2, ..., 1,000, ..., 10,000]
Typical methods: time series analysis, network analysis, spatial analysis, text mining, etc.

APPROACHES TO WEB DATA ANALYSIS

What Can We Do with Web Data?

Features

Temporal features
Spatial features
Structural/behavioral features (e.g., RT, @)
Content features (term/topic/sentiment)

Approaches

Time series analysis
Spatial analysis
Network analysis
Text mining

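Frequently Used Tools

Tools grouped by operation (pull-down menus vs. programming-based) and license (open source vs. commercial):

Pull-down menus, open source: OpenOffice, Google Docs Spreadsheet
Pull-down menus, commercial: SPSS, Excel
Programming-based, commercial: Stata, SAS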

NETWORK ANALYSIS (ELEMENTARY-LEVEL)

What Is a Network?
A network consists of:
Nodes (actors, agents, etc.)
Edges (relations, ties, etc.)

The same set of nodes and edges can also be called:
a graph
a matrix
a web
a map
etc.

A pair of adjacent nodes are neighbors. (Are A and C neighbors?)

Key Concepts

Network
Node
Edge
Ego-network
Component
Triadic closure

Individual-level analysis:
Centrality metrics

Group-level analysis:
Transitivity

Global-level analysis:
Density
Modularity

Examples of Nodes and Edges

Nodes:
Persons (e.g., Facebook users)
Organizations (McDonald's restaurants)
Nations (EU members)
Machines (web servers)
Locations (airports)
Ideas (words in articles)
etc.

Edges:
Kinship links (family ties)
Friendship ties (factual or perceived)
Business transactions
Travel routes (highways, subways, air flights)
Similarities (word co-occurrences in articles)
etc.

Examples of Innovative Network Analysis

Food Flavor Network
http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html

Music Notes Network
http://www.eie.polyu.edu.hk/~xfliu/publications/LiuXF.2010.physa.Music.pdf

More on Edges
Directed (one-way) vs. undirected (two-way)
Observed (directly measured, e.g., hyperlinks) vs. hidden (inferred, e.g., co-occurrences)
Formal (institutionally arranged, top-down) vs. informal (self-organized, bottom-up)
Static (unchanged over time) vs. dynamic (evolving)
Positive (e.g., friending) vs. negative (e.g., defriending)
The key challenge in innovative network analysis is to identify hidden, informal, and evolving edges.

Classification of Online Social Networks

Classified by manifestation of ties (directly observed vs. indirectly inferred) and direction of ties (undirected vs. directed):

Directly observed, undirected: Friendship networks (e.g., Facebook, Google+)
Directly observed, directed: Microblog networks (e.g., Twitter, Sina Weibo)
Indirectly inferred, undirected: Semantic networks (e.g., recommendation systems, social tagging systems)
Indirectly inferred, directed: Newsgroups, blogs, WWW hyperlink networks

Source: Ackland and Zhu (forthcoming). Social network analysis, Sage.

Components
A component is a subset of a network such that:
i. every node in the subset has a path to every other node in the subset
ii. the subset is not part of some larger set with the same property

Most online social networks have one (or a few) giant components.
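The definition of a component can be sketched in base R. This is a minimal illustration only: the edge list, the node names, and the helper `find_components` are hypothetical, and packages such as igraph offer ready-made equivalents.

```r
# Hypothetical undirected network: two groups of three nodes plus an isolated node "G".
nodes <- c("A", "B", "C", "D", "E", "F", "G")
edges <- rbind(c("A", "B"), c("B", "C"), c("D", "E"), c("E", "F"), c("D", "F"))

# Symmetric adjacency matrix built from the edge list.
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Label components with a breadth-first search started from every unvisited node.
find_components <- function(adj) {
  membership <- setNames(rep(NA_integer_, nrow(adj)), rownames(adj))
  current <- 0L
  for (start in rownames(adj)) {
    if (!is.na(membership[start])) next
    current <- current + 1L
    membership[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      neighbors <- names(which(adj[node, ] > 0))
      new_nodes <- neighbors[is.na(membership[neighbors])]
      membership[new_nodes] <- current
      queue <- c(queue, new_nodes)
    }
  }
  membership
}

membership <- find_components(adj)
component_sizes <- table(membership)
print(membership)
print(component_sizes)
```

Every node reachable from the start node receives the same label, which mirrors condition (i); starting a new label only from unvisited nodes mirrors condition (ii).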

Components in a High School Network

Source: Bearman, Moody & Stovel (2004).
http://www.jstor.org/discover/10.1086/386272?uid=3738176&uid=2&uid=4&sid=21103878405327

Components in World Wide Web

Daisy Model (Donato et al., 2005)

Bowtie Model (Broder et al., 2000)
SCC: strongly connected component
IN: nodes that can reach the SCC but cannot be reached from it
OUT: nodes that can be reached from the SCC but cannot reach it

Teapot Model (Zhu et al., 2008)

Ego-Centric Network
Ego-network: a subset of a network that includes a designated node (ego) and its neighbors (alters).
For example, the followers of a VIP account on Twitter or Sina Weibo form an ego-network.
All snowball samples of online social networks are ego-networks.
An important property of ego-networks is their depth (see next slide).
A family tree is a special case of an ego-network (see the second next slide).

The Depth of Ego Networks

1.0 Ego Network
1.5 Ego Network
2.0 Ego Network
2.5 Ego Network

Family Tree: Special Ego-Networks

Is it an undirected or directed graph?
Are there multiple paths from a parent node to a child node?
What are the similarities or differences between family trees and other types of ego-networks?

Triadic Closure (Transitivity)

[Figure: the triad A, B, C at times t0 and t1]

Why are friends (B and C) of a common friend (A) more likely to become friends themselves?
1. They have chances to meet each other.
2. They are similar to each other.

Triads of Undirected Networks

Closed Triad
Open Triad
Connected Pair
Unconnected

Triads of Directed Networks

The 1st number: N of bidirectional edges
The 2nd number: N of unidirectional edges
The 3rd number: N of nonexistent edges
The letter code: directed variations of the same triad, with U for up, D for down, C for circle, and T for transitive (i.e., having 2 paths that lead to the same endpoint)

Measure of Triadic Closure

Nodes in a graph usually have multiple triads each.
Therefore, there is a need to measure quantitatively the overall degree of triadic closure for each node.
The clustering coefficient (CC) is the most frequently used measure for this purpose.

[Figure: example graph with nodes A, B, C, and D]
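As a rough sketch of how a local clustering coefficient can be computed, the base R code below takes the share of a node's neighbor pairs that are themselves connected. The four-node edge list is hypothetical (the ties in the slide's figure are not recoverable), and `local_clustering` is an illustrative helper, not a function from any package.

```r
# Hypothetical undirected graph: A is tied to B, C, and D; B and C are also tied.
nodes <- c("A", "B", "C", "D")
edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))

adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Local clustering coefficient: connected neighbor pairs / possible neighbor pairs.
local_clustering <- function(adj, node) {
  neighbors <- names(which(adj[node, ] > 0))
  k <- length(neighbors)
  if (k < 2) return(NA_real_)  # undefined with fewer than two neighbors
  # Each undirected tie among neighbors appears twice in the symmetric matrix.
  links <- sum(adj[neighbors, neighbors]) / 2
  links / (k * (k - 1) / 2)
}

cc <- sapply(nodes, function(v) local_clustering(adj, v))
print(round(cc, 3))
```

In this toy graph, only one of A's three neighbor pairs (B and C) is connected, so A's coefficient is 1/3, while B and C each sit in a fully closed triad.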

Triadic Closure: Driven by Social Selection or Social Influence?

(a) Triadic closure: Person, Person, Person
(b) Focal closure: Person, Person, Focus (e.g., recommended books on Amazon)
(c) Membership closure: Person, Person, Focus (e.g., recommended groups on Facebook)

Goals of Social Network Analysis
Perer & Shneiderman (2008):
1. Overall network metrics (e.g., number of nodes, number of edges, density, diameter), global
2. Node rankings (degree, betweenness, closeness centrality), individual
3. Edge rankings (weight, betweenness centrality), local
4. Node rankings in pairs (degree vs. betweenness, plotted on a scattergram), local
5. Edge rankings in pairs, local
6. Cohesive subgroups (finding communities), local
7. Multiplexity (analyzing comparisons between different edge types, such as friends vs. enemies), cross-levels

Levels of Network Analysis

Individual-level: nodes, focusing on who the most popular/important/influential nodes in the network are.
Local-level: groups (or clusters, communities, components, etc.), focusing on how the nodes are clustered/grouped together.
Global-level: network, focusing on how densely/closely the network is connected as a whole.

Individual-level Analysis
Find popular/important/influential nodes, usually based on centrality metrics:
Degree centrality: How many nodes are you connected to?
Closeness centrality: How close are you to other nodes?
Betweenness centrality: How many paths go through you?
Eigenvector centrality: How many important nodes are around you?

Interpretation of Centrality Scores

High centrality scores:
Individuals with high centrality scores are often more likely to be:
leaders
key conduits of information
early adopters of anything that spreads in a network

Low centrality scores:
Individuals with low scores are in peripheral positions:
who may be protected from negative contagion and influence
who may be associated with less work overload in an organization

Example: Krackhardt's Kite Graph

[Figure: kite graph with lettered nodes]

A network of 10 nodes and 18 edges:
Who has the highest degree centrality?
Who has the highest betweenness centrality?
Who has the highest closeness centrality?

Degree Centrality
The number of neighbors a node is directly connected to
Indicates how well the node is connected within the graph

Degree of G = 6
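Degree centrality can be reproduced on the kite graph with a few lines of base R. The edge list below uses the conventional first names from Krackhardt's kite rather than the slide's letter labels, so the mapping between the two is an assumption; under this labeling, the degree-6 node is Diane.

```r
# Krackhardt's kite graph with the conventional node names (18 undirected edges).
kite_edges <- rbind(
  c("Andre", "Beverly"), c("Andre", "Carol"), c("Andre", "Diane"),
  c("Andre", "Fernando"), c("Beverly", "Diane"), c("Beverly", "Ed"),
  c("Beverly", "Garth"), c("Carol", "Diane"), c("Carol", "Fernando"),
  c("Diane", "Ed"), c("Diane", "Fernando"), c("Diane", "Garth"),
  c("Ed", "Garth"), c("Fernando", "Garth"), c("Fernando", "Heather"),
  c("Garth", "Heather"), c("Heather", "Ike"), c("Ike", "Jane")
)

# Degree = number of times a node appears as an endpoint of an edge.
degree <- sort(table(as.vector(kite_edges)), decreasing = TRUE)
print(degree)
```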

Betweenness Centrality
The number of shortest paths between pairs of other nodes that pass through a node (as compared with the total number of shortest paths in the graph)
Indicates how critical the node is to the flow of information or resources in the graph

Betweenness of H = 14

Closeness Centrality
The number of steps along the shortest paths from the focal node to all other nodes (the smaller the total, the closer the node is to everyone else)
Indicates how quickly information travels between the node and anyone else in the graph

Closeness of D and E = 14 each
14 = 1*5 + 2*3 + 3*1
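The total of 14 steps can be checked with a small base R sketch that computes all shortest-path lengths on the kite graph (Floyd-Warshall style) and sums them per node. The node names are the conventional ones for Krackhardt's kite, not the slide's letters; under that assumption, the two nodes with the smallest total are Fernando and Garth.

```r
# Krackhardt's kite graph with the conventional node names (18 undirected edges).
kite_edges <- rbind(
  c("Andre", "Beverly"), c("Andre", "Carol"), c("Andre", "Diane"),
  c("Andre", "Fernando"), c("Beverly", "Diane"), c("Beverly", "Ed"),
  c("Beverly", "Garth"), c("Carol", "Diane"), c("Carol", "Fernando"),
  c("Diane", "Ed"), c("Diane", "Fernando"), c("Diane", "Garth"),
  c("Ed", "Garth"), c("Fernando", "Garth"), c("Fernando", "Heather"),
  c("Garth", "Heather"), c("Heather", "Ike"), c("Ike", "Jane")
)
nodes <- sort(unique(as.vector(kite_edges)))
n <- length(nodes)

# Distance matrix: 0 on the diagonal, 1 for direct ties, Inf otherwise.
dist_mat <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
diag(dist_mat) <- 0
dist_mat[kite_edges] <- 1
dist_mat[kite_edges[, 2:1]] <- 1

# Allow paths through each intermediate node in turn and keep the shorter route.
for (k in nodes) {
  dist_mat <- pmin(dist_mat, outer(dist_mat[, k], dist_mat[k, ], "+"))
}

farness <- rowSums(dist_mat)   # total steps to all other nodes (the slide's "14")
closeness <- 1 / farness       # inverse form: larger means closer
print(sort(farness))
```

The slide reports the raw sum of steps; many software packages report the inverse, so it is worth checking which form a given tool uses before comparing scores.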

Eigenvector Centrality
The extent to which a node is a big fish connected with other big fish in a big pond.
Calculated by assessing how well connected a node is to the parts of the network with the greatest connectivity.
Nodes with high eigenvector scores have many connections that themselves have many connections, and so on, similar to the logic of Google PageRank.

Highly connected individuals within highly interconnected clusters, or big fish in big ponds, have high eigenvector centrality.
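One way to illustrate the idea is to take the leading eigenvector of the adjacency matrix, as in the base R sketch below. The five-node graph is hypothetical, and rescaling the scores so that the maximum equals 1 is just one common convention.

```r
# Hypothetical undirected graph: A sits in a triangle (A, B, C) and also links to D, who links to E.
nodes <- c("A", "B", "C", "D", "E")
edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"), c("D", "E"))

adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Leading eigenvector of the symmetric adjacency matrix (eigenvalues come sorted in decreasing order).
ev <- eigen(adj, symmetric = TRUE)
scores <- abs(ev$vectors[, 1])       # the sign of an eigenvector is arbitrary
scores <- scores / max(scores)       # rescale so that the top node scores 1
names(scores) <- nodes
print(round(scores, 3))
```

B and C score higher than E even though D and E are only one or two steps from A, because B and C are tied both to A and to each other.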

Group-level Analysis
Central question: How are nodes clustered (grouped) together?
Based on clustering analysis, a method to merge n nodes into g groups such that:
the nodes within the same group are maximally similar or homogeneous
the nodes in different groups are maximally different or heterogeneous

Process of Clustering Analysis

[Figure: clustering of nodes 1 to 10]

At step 1, there are 10 clusters, each with a node that is uniquely different from all others.
At step 2, nodes 1 and 2 are considered similar enough to form a cluster; the same goes for nodes 9 and 10. There are now 8 clusters.
At step 3, node 3 joins the cluster of 1 and 2, and node 8 joins the cluster of 9 and 10. The process continues until every node is included in one giant cluster at step 6.
An optimal solution keeps a small number of clusters with maximal similarity within clusters and maximal difference between them.
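The merging process can be tried out with R's built-in `hclust`. The one-dimensional positions below are hypothetical stand-ins for node similarity, and average linkage is an arbitrary choice. Note that `hclust` merges one pair per step, whereas the slide groups several merges into one step.

```r
# Hypothetical positions for 10 nodes; nodes with closer values are more similar.
x <- c(1.0, 1.1, 1.9, 3.5, 4.8, 6.0, 7.4, 8.6, 9.3, 9.5)
names(x) <- 1:10

# Agglomerative clustering: start with 10 singleton clusters and merge the closest pair each time.
tree <- hclust(dist(x), method = "average")

# Each row of `merge` is one step; negative numbers denote single nodes, positive numbers earlier clusters.
print(tree$merge)

# Cut the tree to keep a small number of clusters.
groups <- cutree(tree, k = 2)
print(groups)
```

Calling `plot(tree)` draws the dendrogram, which helps when choosing where to cut.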

Island Method for Group Detection
By raising the threshold of edge strength (e.g., the mean, the median, or k standard deviations above the mean), an increasing number of groups (communities) will emerge successively from a giant connected component.

Group-level Metrics in NodeXL

Vertex counts
Edge counts
Geodesic distances
Group density
Number of edges between each pair of groups

Global-level Analysis
Key question: How densely or closely connected is the network as a whole?
Fig a (top): connected based on 67% agreement
Fig b (bottom): connected based on 75% agreement

Global-level Metrics in NodeXL (1)

Graph Type: Directed or undirected.
Vertices: The number of vertices in the graph.
Unique Edges: The number of edges that do not have duplicates.
Edges With Duplicates: The number of edges that have duplicates.
Total Edges: The number of edges in the graph. This is the sum of Unique Edges and Edges With Duplicates.
Self-Loops: The number of edges that connect a vertex to itself.
Reciprocated Vertex Pair Ratio: In a directed graph, this is the N of vertex pairs that have edges in both directions divided by the N of vertex pairs that are connected by any edge. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined.
Reciprocated Edge Ratio: In a directed graph, this is the number of edges that are reciprocated divided by the total number of edges. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined and is not calculated.
Connected Components: The number of connected components in the graph. A connected component is a set of vertices that are connected to each other but not to the rest of the graph.

Global-level Metrics in NodeXL (2)

Single-Vertex Connected Components: The N of connected components that have only one vertex.
Maximum Vertices in a Connected Component: The N of vertices in the connected component that has the most vertices.
Maximum Edges in a Connected Component: The N of edges in the connected component that has the most edges.
Maximum Geodesic Distance (Diameter): The maximum geodesic distance among all vertex pairs, where geodesic distance is the shortest path between two vertices.
Average Geodesic Distance: The average geodesic distance among all vertex pairs, where geodesic distance is the distance between two vertices along the shortest path between them.
Graph Density: A ratio that compares the N of edges with the maximum N of edges the graph would have if all the vertices were connected to each other. Duplicate edges and self-loops are ignored.
Modularity: When the graph has groups, this is a measure of the "quality" of the grouping. Graphs with high modularity have dense connections among the vertices within the group but sparse connections among vertices in different groups. When the graph does not have groups, this is undefined.
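Several of these definitions can be checked by hand on a toy graph. The base R sketch below assumes a hypothetical directed edge list with no duplicates or self-loops and follows the verbal definitions above; it is not NodeXL's own code.

```r
# Hypothetical directed edge list (source, target), no duplicates or self-loops.
edges <- rbind(c("A", "B"), c("B", "A"), c("A", "C"),
               c("C", "D"), c("D", "C"), c("B", "D"))
nodes <- sort(unique(as.vector(edges)))
n <- length(nodes)
m <- nrow(edges)

# Graph density: observed edges over the maximum possible, n * (n - 1) for a directed graph.
density <- m / (n * (n - 1))

# Reciprocated edge ratio: an edge is reciprocated if its reverse is also present.
key <- paste(edges[, 1], edges[, 2])
reverse_key <- paste(edges[, 2], edges[, 1])
reciprocated <- reverse_key %in% key
reciprocated_edge_ratio <- sum(reciprocated) / m

# Reciprocated vertex pair ratio: pairs tied in both directions over pairs tied in any direction.
pair_key <- apply(edges, 1, function(e) paste(sort(e), collapse = " "))
reciprocated_pair_ratio <- length(unique(pair_key[reciprocated])) / length(unique(pair_key))

print(c(density = density,
        edge_ratio = reciprocated_edge_ratio,
        pair_ratio = reciprocated_pair_ratio))
```

Here four of the six edges are reciprocated (A-B and C-D in both directions), while two of the four connected pairs are reciprocated, which shows why the two ratios generally differ.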

HANDS-ON TUTORIALS

use R! aRe you suRe?

NodeXL and R
[Figure: Super R logo. Source: www.redbubble.com/]

R for Web Data Analysis

Network Analysis
R packages: igraph, statnet, RSiena

Spatial Analysis
Task view: http://cran.r-project.org/web/views/Spatial.html
R packages: sp, Spatial, OpenStreetMap, RgoogleMaps

Temporal Analysis
Task views: http://cran.r-project.org/web/views/TimeSeries.html
http://cran.r-project.org/web/views/SpatioTemporal.html
R packages: tseries, forecast, urca, wavelets, SpatioTemporal

Text Mining
Task view: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
R packages: tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe

Machine
http://cran.r-
nnet, rpart, trees, party,

I came, I saw, and I walked away?
Picture: Gareth Jenkins/Solent
http://www.telegraph.co.uk/news/picturegalleries/picturesoftheday/8561204/Pictures-of-the-day-7-June-2011.html?image=6

Figure from the movie Daddy Day Care (2003)
http://img0.joyreactor.com/pics/post/gif-eddie-murphy-reaction-gifs-party-394848.gif

Plunge into the water!

HANDS-ON!


Demo 1. Software Installation
Download and install R, RStudio, and NodeXL
http://cran.r-project.org/
https://www.rstudio.com/ide/
http://nodexl.codeplex.com/

Learn the basics of R
http://tryr.codeschool.com/

More information
https://www.rstudio.com/training/online.html

NETWORK ANALYSIS (ADVANCED-LEVEL)

Network Topology


Regular or random?

Regular network
Nodes are connected in a regular neighborhood, with a fixed number k of edges per node
They do not exhibit small-world characteristics
They may exhibit clustering

Random network
Random networks have randomly connected edges; each node has an average number of edges
They exhibit small-world characteristics
They do not exhibit clustering

Small-World Networks
Between order and chaos

Watts and Strogatz (1998) propose a model for networks between order and chaos.

Network generation
The model is built by simply rewiring at random a small percentage of the edges of a regular network, which dramatically shortens the average path length without destroying clustering.

As a result:
The network exhibits the small-world feature, as random networks do
It also exhibits clustering, as regular lattices do

Watts and Strogatz (1998)

Scale-free network

Power law
Long-tail distribution
$P(k) \sim k^{-a}, \quad 0 < a < 2$
$\log P(k) \sim -a \log k$

Zipf distribution
Pareto distribution

Properties
Scale-invariance
$P(ck) \sim (ck)^{-a}$
Thus, $P(ck) \sim c^{-a} k^{-a}$
$P(ck) \propto k^{-a}$
No average
Universality

Barabási, Albert, and Jeong, Scale-free characteristics of random networks: The topology of the world wide web, Physica A, 281, 2000, pp. 69-77.
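The log-log relation and the scale-invariance property can be checked numerically. The base R sketch below uses an exact, noise-free power law with a hypothetical exponent a = 1.5 and scaling factor c = 3; with real degree data the fit would be noisy, and simple log-log regression is only a rough first look.

```r
k <- 1:100
a <- 1.5                        # hypothetical exponent within the slide's range 0 < a < 2
p <- k^(-a) / sum(k^(-a))       # P(k) proportional to k^(-a), normalized to sum to 1

# On log-log axes a power law is a straight line with slope -a.
fit <- lm(log(p) ~ log(k))
slope <- unname(coef(fit)[2])
print(slope)

# Scale invariance: rescaling k by a constant c multiplies P by c^(-a) for every k.
c_factor <- 3
ratio <- (c_factor * k)^(-a) / k^(-a)
print(range(ratio))
```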

Demo 2. Generate the Network

R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
# install.packages("igraph")  # run once if igraph is not installed
library(igraph)
size = 50
# regular structures (function names as of igraph 1.0 and later)
g = make_tree(size, children = 2); plot(g)
g = make_star(size); plot(g)
g = make_full_graph(size); plot(g)
g = make_ring(size); plot(g)
g = connect(make_ring(size), 2); plot(g)
# random network
g = sample_gnp(size, 0.1); plot(g)
# small-world network: rewire a small share of the edges of a ring lattice
g = sample_smallworld(dim = 1, size = size, nei = 2, p = 0.05); plot(g)
# scale-free network
g = sample_pa(size); plot(g)

The Political Blogosphere vs. Congressmen's Retweet Network

L. A. Adamic and N. Glance, 'The Political Blogosphere and the 2004 U.S. Election: Divided They Blog', LinkKDD 2005

Peng, Zhu, Liu, Wu, Liu (2014) Friendship, Interaction networks and Vote agreement of congressmen in the United States. 7th APNC, Montreal, Canada

How to Represent a Network?

[Figure: graph with nodes A to F and edges e1 to e6]

Edge list:
A, B
A, D
A, C
C, D
C, E
C, F
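The same edge list can be turned into an adjacency matrix, the other common representation. The base R sketch below treats the ties as undirected, which is an assumption because the slide's figure does not survive in the text.

```r
# Edge list from the slide: one row per tie.
edge_list <- data.frame(
  from = c("A", "A", "A", "C", "C", "C"),
  to   = c("B", "D", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# Adjacency matrix: a 1 in row i, column j means i and j are tied.
nodes <- sort(unique(c(edge_list$from, edge_list$to)))
adj <- matrix(0L, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[cbind(edge_list$from, edge_list$to)] <- 1L
adj[cbind(edge_list$to, edge_list$from)] <- 1L   # mirror the ties (undirected assumption)
print(adj)
```

For a directed network, the mirroring line would be dropped so that rows represent senders and columns represent receivers.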

Demo 3. Describe the Network
Compute graph metrics using NodeXL
Step 1: Paste the edge list here.

NodeXL: Set node attributes
Step 2: Paste the node attribute here, and name it "party".
Remember to click here to shift to the Vertices window.

NodeXL: Calculating graph metrics
Step 3: Click Graph Metrics here.
Remember to set the graph as directed here.

NodeXL: Set vertex color and vertex size
Step 4: Set vertex color and vertex size by clicking here.
Remember to set the party as a categorical variable.

Demo 3. Describe the Network

R script
http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/

Graph statistics
Centrality measures
Graph algorithms
Shortest path
Connected component algorithms

The exponential random graph model (p*)
An ERGM (p*) is a statistical model for the ties in a network
Independent (pairs of) ties (p1; Holland and Leinhardt, 1981; Fienberg and Wasserman, 1979, 1981)
Markov graphs (Frank and Strauss, 1986)
Extensions (Pattison & Wasserman, 1999; Robins, Pattison & Wasserman, 1999; Wasserman & Pattison, 1996)
New specifications (Snijders et al., 2006; Hunter & Handcock, 2006)

Why do we use stochastic network models?
To capture complex social phenomena that are caused by both regularities and randomness.
To infer whether certain network signatures appear more often than expected by chance.
To distinguish between different social processes (e.g., homophily vs. structural balance).
To better understand the way local social processes interact and combine to shape global network patterns.
Deterministic approaches are not always good enough.

Procedures of ERGM

Assume we have an observed network of size n. What are the mechanisms driving the formation of our network (e.g., reciprocity, transitivity)?
Given those mechanisms, are some network configurations (e.g., mutual dyads, transitive triplets) more common than you would expect by chance?
Include a parameter for each configuration in the model. The parameter values help us identify a probability distribution over all graphs of size n (e.g., if we have a high value for the reciprocity parameter, graphs that have a lot of mutual dyads will be more probable than ones that do not).
Estimate the parameters: find the parameter values that best match the observed network. We do that using MCMC-MLE: Markov Chain Monte Carlo Maximum Likelihood Estimation techniques.
Once we have our probability distribution, we can draw random graphs from it and compare any of their characteristics to those of our observed network.
http://www.kateto.net/wordpress/wp-content/uploads/2012/12/COMM%20645%20-%20ERGM.pdf

Network Configurations: Undirected Networks

Edge
2-star
3-star
4-star
...
K-star
Triangle

Network Configurations: Directed Networks

Arc
Reciprocity
Isolate
2-in star
2-out star
2-mixed star
K-in star
K-out star
Transitive triad
Cyclic triad

Exponential Random Graph Models (ERGM)

$$P(Y = y \mid X) = \frac{\exp\{\theta^{T} g(y, X)\}}{k(\theta)}$$

Y: all the possible ties
y: the observed ties
X: node attributes
g(y, X): network configurations (a vector)
θ: a vector of model parameters
k(θ): normalizing constant

One example

denotes the vector of change statistics

Johan Koskinen (2012) An introduction to ERGM. 8th UKSNA Conference, Bristol
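To make the idea of change statistics concrete, the base R sketch below computes, for one pair of nodes, how much the edge count and the triangle count would change if the tie were added, and converts the weighted sum into a conditional tie probability. The network, the two-term model (edges and triangles), and the parameter values in `theta` are all hypothetical; real models are estimated with packages such as statnet's ergm.

```r
# Hypothetical undirected network: A and B are not tied but share two neighbors, C and D.
nodes <- c("A", "B", "C", "D")
ties <- rbind(c("A", "C"), c("B", "C"), c("A", "D"), c("B", "D"))
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

# Hypothetical parameters: ties are costly, but closing triangles is rewarded.
theta <- c(edges = -1.0, triangle = 0.5)

# Change statistics for adding the tie i-j: one more edge, and one more triangle
# for every neighbor that i and j have in common.
change_stats <- function(adj, i, j) {
  c(edges = 1, triangle = sum(adj[i, ] * adj[j, ]))
}

delta <- change_stats(adj, "A", "B")
log_odds <- sum(theta * delta)   # conditional log-odds of the tie, given the rest of the network
prob <- plogis(log_odds)         # inverse logit
print(delta)
print(prob)
```

With two shared neighbors, the triangle reward (2 x 0.5) exactly offsets the edge cost (-1), so the conditional probability of the A-B tie is 0.5 in this toy setting.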



Tie-Network configuration matrix

Columns: edges, 2-star, K-star, Triangle
Rows: Y1,2; Y1,3; Y2,3; ...; Yn,n-1

Online Collective Identity

Ackland (2011) Online collective identity. SN


Demo 4. ERGM with R

R script
Load data
Build up network objects
Set the node attributes
Plot the network
Fit a basic ERG model
http://chengjun.github.io/web_data_analysis/demo4_ergm_analysis/

TEMPORAL ANALYSIS


Time Series Analysis

Time series data can be analyzed within either the time domain or the frequency domain.
Whereas time domain analysis is routinely conducted, frequency domain analysis is rarely adopted.

Time domain:
ARIMA/VAR analysis
Survival analysis
Multilevel analysis

Frequency domain:
Fourier transformation
Spectrum analysis (comparing a_k and b_k of different time series)
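A frequency-domain analysis can be sketched with R's built-in `fft`. The series below is synthetic (a sine wave with 4 cycles plus a weaker cosine wave with 10 cycles over 64 time points), and reading a_k and b_k as the cosine and sine Fourier coefficients is an assumption about the slide's notation.

```r
n <- 64
time_index <- 0:(n - 1)
# Synthetic series: a strong 4-cycle sine component plus a weaker 10-cycle cosine component.
x <- sin(2 * pi * 4 * time_index / n) + 0.5 * cos(2 * pi * 10 * time_index / n)

# Discrete Fourier transform, scaled by the series length.
fourier <- fft(x) / n
freq <- 1:(n / 2 - 1)                 # cycles per series, excluding 0 and the Nyquist frequency

a_k <- 2 * Re(fourier[freq + 1])      # cosine coefficients
b_k <- -2 * Im(fourier[freq + 1])     # sine coefficients
power <- a_k^2 + b_k^2                # strength of each frequency

dominant <- freq[which.max(power)]
print(dominant)
print(round(c(b_4 = b_k[4], a_10 = a_k[10]), 3))
```

Comparing the `a_k` and `b_k` profiles of two series shows whether they share the same cycles, even when their raw values look different in the time domain.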

Time Series Analysis

Forecasting and Univariate Modeling
Frequency analysis
Decomposition and Filtering
Seasonality
Stationarity, Unit Roots, and Cointegration
Nonlinear Time Series Analysis
Dynamic Regression Models
Multivariate Time Series Models

Survival Analysis of
Blogging Behavior

Source: Zhu et al., ICA 2010


SPATIAL ANALYSIS


Spatial Analysis

Spatial Data:
Location names
IP addresses
Map visits
GPS usage
etc.
Well developed for offline data, but underdeveloped and underutilized for web data beyond visual inspections.

Spatial Analysis:
Spatial clusters/patterns (by visual inspections)
Spatial autocorrelation
Spatial regression
Spatial dependence (correlation between nearby locations)
Spatial interaction (correlation between geo-coded variables)
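Spatial autocorrelation is often summarized with Moran's I. The base R sketch below computes it for four hypothetical locations arranged along a line, where only adjacent locations count as neighbors; `morans_i` is an illustrative helper, and dedicated spatial packages provide tested versions with significance tests.

```r
# Spatial weights: locations 1-2, 2-3, and 3-4 are neighbors (symmetric 0/1 matrix).
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1

# Moran's I: weighted cross-products of deviations from the mean, scaled by the variance.
morans_i <- function(x, w) {
  d <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(d, d)) / sum(d^2)
}

smooth_values <- c(1, 2, 3, 4)        # hypothetical: similar values sit next to each other
alternating_values <- c(1, 4, 1, 4)   # hypothetical: neighbors are always dissimilar

i_smooth <- morans_i(smooth_values, w)
i_alternating <- morans_i(alternating_values, w)
print(c(smooth = i_smooth, alternating = i_alternating))
```

A positive value means nearby locations resemble each other, while a negative value means neighbors tend to differ.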

Geospatial Distribution of the Communication on Twitter

Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, et al. (2013) The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE 8(3): e55957. doi:10.1371/journal.pone.0055957

Spatial Distribution of Tweets in Milan
kernel smoother of point density

Are the tweets randomly distributed?

Temporal Distribution of Tweets in Milan

SENTIMENT ANALYSIS


Sentiment Analysis

Decompose sentiment
Emotion: joy, surprise, anger, sadness, fear, disgust
Polarity: positivity, negativity, neutral

Lexicon method
Carlo Strapparava and Alessandro Valitutti's emotions lexicon
Janyce Wiebe's subjectivity lexicon
Bing Liu's polarity lexicon

Supervised machine learning
Combine lexicon and machine learning
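The lexicon method can be illustrated with a toy polarity scorer in base R. The word lists and example tweets below are made up for the sketch and are far smaller than the published lexicons named above; `score_polarity` is a hypothetical helper.

```r
# Tiny hypothetical polarity lexicon (real lexicons contain thousands of entries).
positive_words <- c("good", "great", "happy", "love")
negative_words <- c("bad", "sad", "hate", "terrible")

# Count positive and negative words and label the text by the sign of the difference.
score_polarity <- function(text) {
  cleaned <- tolower(gsub("[^[:alpha:] ]", " ", text))   # strip punctuation and digits
  tokens <- strsplit(cleaned, " +")[[1]]
  score <- sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("I love this great phone",
            "What a terrible, sad day",
            "The meeting is at noon")
labels <- sapply(tweets, score_polarity, USE.NAMES = FALSE)
print(data.frame(tweet = tweets, polarity = labels))
```

Simple word counting ignores negation and sarcasm, which is one reason the slide lists supervised machine learning and combined approaches as alternatives.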

Sentiment in the Tweet Stream

Miller (2011) Social scientists wade into the tweet stream. Science

Twitter Mood Predicts the Stock Market?
Decompose sentiment
Emotion: Calm, Alert, Sure, Vital, Kind, Happy
Bollen (2011) Twitter mood predicts the stock market. JOCS

Calm Sentiment Predicts the Stock Market

Demo 5. Sentiment Analysis with Supervised Machine Learning
R script
http://chengjun.github.io/web_data_analysis/demo5_sentiment_analysis/

Figure source: http://courtneylambert.co/official-twitter-stats-from-chirp

REFLECTION ON WEB DATA ANALYSIS

Google Correlate & Google Flu Prediction

http://www.google.com/trends/correlate/comic?p=2

Nature reported that Google Flu Trends (GFT) was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2).
Lazer et al. (2014) The parable of Google Flu: Traps in big data analysis. Science

Tweet Sentiment and U.S. Election 2012

Figures source: election.twitter.com

Facebook Insight and Twitter Mention Result

Facebook Insight
http://www.cnn.com/election/2012/facebook-insights/

http://www.zerogeography.net/2012/11/obama-wins-election-ontwitter.html
http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html

Predict Political Orientation with Machine Learning

Colleoni et al. (2014) Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter. JOC

To Move on

R Style Guide: http://adv-r.had.co.nz/Style.html
R bloggers
stackoverflow
github

Yet, It Is Not Finished

Creating an R package (intro) http://gastonsanchez.com/teaching/
