Web Data Analysis
Department of Communication PhD Student Workshop

Web Mining for Communication Research
April 22-25, 2014
http://weblab.com.cityu.edu.hk/blog/project/workshops
Cheng-Jun Wang
Outline
I. Key features of web data
II. Major approaches to web data analysis
i. Network analysis
ii. Temporal analysis
iii. Spatial analysis
iv. Sentiment analysis
III. Reflections on web data analysis
FEATURES OF WEB DATA
Traditional vs. Web Data

Analysis of traditional (cross-sectional, fat) data
Data layout: one row per ID (1, ..., 1,000) and many variable columns (V1, V2, V3, ...)
Multiple regression
Log-linear model
Multilevel analysis
Structural equation modeling
etc.
Analysis of web (longitudinal, tall) data
Data layout: columns ID, Time, and V1, with many rows (1, 2, ..., 1,000, ..., 10,000, ...)
Time series analysis
Network analysis
Spatial analysis
Text mining
etc.
APPROACHES TO WEB DATA ANALYSIS
What Can We Do with Web Data?
Features
Temporal features
Spatial features
Structural/behavioral features (e.g., RT, @)
Content features (term/topic/sentiment)
Approaches
Time series analysis
Spatial analysis
Network analysis
Text mining
Frequently Used Tools
Pull-down menus, open source: OpenOffice, Google Docs Spreadsheet
Pull-down menus, commercial: SPSS, Excel
Programming-based, commercial: Stata, SAS
NETWORK ANALYSIS
(ELEMENTARY-LEVEL)
What Is a Network?
A network consists of
Nodes (actors, agents, etc.)
Edges (relations, ties, etc.)
The same set of nodes and edges can also be called:
a graph
a matrix
a web
a map
etc.
A pair of adjacent nodes are neighbors. (Are A and C neighbors?)
Key Concepts
Network
Node
Edge
Ego-network
Component
Triadic closure
Individual-level analysis:
Centrality metrics
Group-level analysis:
Transitivity
Global-level analysis:
Density
Modularity
Examples of Nodes and Edges
Nodes:
Persons (e.g., Facebook users)
Organizations (McDonald's restaurants)
Nations (EU members)
Machines (web servers)
Locations (airports)
Ideas (words in articles)
etc.
Edges:
Kinship links (family ties)
Friendship ties (factual or perceived)
Business transactions
Travel routes (highways, subways, air flights)
Similarities (word co-occurrences in articles)
etc.
Examples of Innovative Network Analysis
Food Flavor Network
http://www.nature.com/srep/2011/111215/srep00196/full/srep00196.html
Music Notes Network
http://www.eie.polyu.edu.hk/~xfliu/publications/LiuXF.2010.physa.Music.pdf
More on Edges
Directed (one-way) vs. undirected (two-way)
Observed (directly measured, e.g., hyperlinks) vs. hidden (inferred, e.g., co-occurrences)
Formal (institutionally arranged, top-down) vs. informal (self-organized, bottom-up)
Static (unchanged over time) vs. dynamic (evolving)
Positive (e.g., friending) vs. negative (e.g., defriending)
The key challenge to innovative network analysis is to identify hidden, informal, and evolving edges.
Classification of Online Social Networks
Manifestation of ties by direction of ties:
Directly observed, undirected: Friendship networks (e.g., Facebook, Google+)
Directly observed, directed: Microblog networks (e.g., Twitter, Sina Weibo)
Indirectly inferred, undirected: Semantic networks (e.g., recommendation systems, social tagging systems)
Indirectly inferred, directed: Newsgroups, blogs, WWW hyperlink networks
Source: Ackland and Zhu (forthcoming). Social network analysis, Sage.
Components
A component is a subset of a network in which:
i. every node in the subset has a path to every other
ii. the subset is not part of some larger set
Most online social networks have one (or a few) giant components
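The two conditions above can be checked mechanically. The following is a minimal sketch in base R, assuming an undirected network stored as a 0/1 adjacency matrix; the helper name `find_components` and the toy network are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: label the connected components of an undirected
# graph (0/1 adjacency matrix) using breadth-first search.
find_components <- function(adj) {
  n <- nrow(adj)
  membership <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(membership[start])) next   # already assigned to a component
    current <- current + 1L
    membership[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      # unlabeled neighbors of the current node join the same component
      neighbors <- which(adj[node, ] > 0 & is.na(membership))
      membership[neighbors] <- current
      queue <- c(queue, neighbors)
    }
  }
  membership
}

# Toy network (an assumption): nodes 1-4 form a chain,
# nodes 5-6 form a pair, node 7 is an isolate.
adj <- matrix(0, 7, 7)
toy_edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(5, 6))
adj[toy_edges] <- 1
adj[toy_edges[, 2:1]] <- 1   # mirror each tie: the graph is undirected

membership <- find_components(adj)
print(membership)            # 1 1 1 1 2 2 3
print(table(membership))     # component sizes: 4, 2, and 1
```

In this toy case the chain of four nodes plays the role of the giant component.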
Components in a High School Network
Source: Bearman, Moody & Stovel (2004).
http://www.jstor.org/discover/10.1086/386272?uid=3738176&uid=2&uid=4&sid=21103878405327
Components in World Wide Web
Daisy Model (Donato et al., 2005)
Bowtie Model (Broder et al., 2000)
SCC: strongly connected component
IN: nodes that can reach the SCC but cannot be reached from it
OUT: nodes that can be reached from the SCC but cannot reach it
Teapot Model (Zhu et al., 2008)
Ego-Centric Network
Ego-network: a subset of a network including a particularly designated node (ego) and its neighbors (alters)
For example, followers of a VIP account on Twitter or Sina Weibo form an ego-network
All snowball samples of online social networks are ego-networks.
An important property of ego-networks is the depth (see next slide).
A family tree is a special case of ego-networks (see the second next slide).
The Depth of Ego Networks

1.0 Ego Network
1.5 Ego Network
2.0 Ego Network
2.5 Ego Network
Family Tree: Special Ego-Networks
Is it an undirected or directed graph?
Are there multiple paths from a parent node to a child node?
What are the similarities or differences between family trees and other types of ego-networks?
Triadic Closure (Transitivity)
[Figure: nodes A, B, and C at time t0 and at time t1]
Why are friends (B and C) of a common friend (A) more likely to become friends themselves? 1. They have more chances to meet each other; 2. they tend to be similar to each other.
Triads of Undirected
Networks
Closed Triad
Connected Pair
Open Triad
Unconnected
Triads of Directed Networks
The 1st number: N of bidirectional edges;
The 2nd number: N of uni-directional edges;
The 3rd number: N of nonexistent edges;
The letter code: directed variations of the same triad, with U for up, D for down, C for circle, and T for transitive (i.e., having 2 paths that lead to the same endpoint).
Measure of Triadic Closure
Nodes in a graph usually have multiple triads each.
Therefore, there is a need to measure quantitatively the overall degree of triadic closure for each node.
The clustering coefficient (CC) is the most frequently used measure for the purpose.
[Figure: example graph with nodes A, B, C, and D]
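As an illustration of the clustering coefficient, the sketch below computes the local version for each node: the number of ties among a node's neighbors divided by the number of ties that could exist among them. It is written in base R; the helper name `local_clustering` and the four-node example (A tied to B, C, and D, plus a tie between B and C) are assumptions and may not match the slide figure.

```r
# Hypothetical helper: local clustering coefficient of every node in an
# undirected graph given as a 0/1 adjacency matrix.
local_clustering <- function(adj) {
  cc <- sapply(seq_len(nrow(adj)), function(i) {
    nb <- which(adj[i, ] > 0)          # neighbors of node i
    k <- length(nb)
    if (k < 2) return(NaN)             # undefined with fewer than two neighbors
    links <- sum(adj[nb, nb]) / 2      # ties that exist among the neighbors
    links / (k * (k - 1) / 2)          # divided by the ties that could exist
  })
  names(cc) <- rownames(adj)
  cc
}

# Assumed example: A is tied to B, C, and D; B and C are also tied.
nodes <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
ties <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

cc <- local_clustering(adj)
print(cc)
```

Here only one of the three possible pairs among A's neighbors is tied, so A's coefficient is 1/3, while B and C each sit in a fully closed triad.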
Triadic Closure: Driven by Social Selection or Social Influence?
(a) Triadic closure: Person, Person, Person
(b) Focal closure: Person, Person, Focus (e.g., recommended books on Amazon)
(c) Membership closure: Person, Person, Focus (e.g., recommended groups on Facebook)
Goals of Social Network Analysis
Perer & Shneiderman (2008):
1. Overall network metrics (e.g., number of nodes, number of edges, density, diameter), global
2. Node rankings (degree, betweenness, closeness centrality), individual
3. Edge rankings (weight, betweenness centrality), local
4. Node rankings in pairs (degree vs. betweenness, plotted on a scattergram), local
5. Edge rankings in pairs, local
6. Cohesive subgroups (finding communities), local
7. Multiplexity (analyzing comparisons between different edge types, such as friends vs. enemies), cross-levels
Levels of Network Analysis
Individual-level: nodes, focusing on which nodes are the most popular/important/influential in the network
Local-level: groups (or clusters, communities, components, etc.), focusing on how the nodes are clustered/grouped together
Global-level: network, focusing on how densely/closely the network is connected as a whole
Individual-level Analysis
Find popular/important/influential nodes, usually based on centrality metrics
Degree centrality: How many nodes are you connected to?
Closeness centrality: How close are you to other nodes?
Betweenness centrality: How many shortest paths run through you?
Eigenvector centrality: How many important nodes are around you?
Interpretation of Centrality Scores
High centrality scores: Individuals with high centrality scores are often more likely to be:
leaders
key conduits of information
early adopters of anything that spreads in a network
Low centrality scores: Individuals with low scores are in peripheral positions:
who may be protected from negative contagion and influence
who may be associated with less work overload in an organization
Example: Krackhardt's Kite Graph
[Figure: kite graph with lettered nodes]
A network of 10 nodes and 18 edges:
Who has the highest degree centrality?
Who has the highest betweenness centrality?
Who has the highest closeness centrality?
Degree Centrality
Number of neighbors a node is directly connected to
Indicates how well the node is connected within the graph
Degree of G = 6
Betweenness Centrality
The number of shortest paths between pairs of other nodes that pass through a node (as compared with the total number of shortest paths in the graph)
Indicates how critical the node is to the flow of information or resources in the graph
Betweenness of H = 14
Closeness Centrality
Number of steps along the shortest paths from the focal node to all other nodes (a smaller total means a more central node)
Indicates how quickly information travels between the node and anyone else in the graph
Closeness of D and E = 14 each
14 = 1*5 + 2*3 + 3*1
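The degree and closeness figures quoted for the kite graph can be reproduced with a few lines of base R. This is a sketch under stated assumptions: the edge list follows a commonly used lettering of Krackhardt's kite (A to J), which may differ from the letters in the slide figure, and `bfs_distances` is a hypothetical helper.

```r
# Krackhardt's kite: 10 nodes, 18 undirected edges.
# The lettering A-J is an assumption and may not match the slide figure.
kite_edges <- matrix(c(
  "A", "B",  "A", "C",  "A", "D",  "A", "F",  "B", "D",  "B", "E",
  "B", "G",  "C", "D",  "C", "F",  "D", "E",  "D", "F",  "D", "G",
  "E", "G",  "F", "G",  "F", "H",  "G", "H",  "H", "I",  "I", "J"
), ncol = 2, byrow = TRUE)
nodes <- LETTERS[1:10]
adj <- matrix(0, 10, 10, dimnames = list(nodes, nodes))
adj[kite_edges] <- 1
adj[kite_edges[, 2:1]] <- 1

# Hypothetical helper: shortest-path distances from one node,
# found by expanding one breadth-first level at a time.
bfs_distances <- function(adj, source) {
  start <- match(source, rownames(adj))
  dist <- rep(Inf, nrow(adj))
  dist[start] <- 0
  frontier <- start
  level <- 0
  while (length(frontier) > 0) {
    level <- level + 1
    reached <- which(colSums(adj[frontier, , drop = FALSE]) > 0 &
                       is.infinite(dist))
    dist[reached] <- level
    frontier <- reached
  }
  dist
}

degree <- rowSums(adj)
# Total number of steps to all other nodes (smaller = more central)
farness <- sapply(nodes, function(v) sum(bfs_distances(adj, v)))

print(degree)
print(farness)
```

Under this lettering the node with degree 6 is D, and the two nodes with a total distance of 14 are F and G; their 5 neighbors at one step, 3 nodes at two steps, and 1 node at three steps give 1*5 + 2*3 + 3*1 = 14.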
Eigenvector Centrality
The extent to which a node is a big fish connected with other big fish in a big pond.
Calculated by assessing how well connected a node is to the parts of the network with the greatest connectivity.
Nodes with high eigenvector scores have many connections who have many connections, etc., similar to the logic of Google PageRank.
Highly connected individuals within highly interconnected clusters, or big fish in big ponds, have high eigenvector centrality.
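One common way to compute this score is to take the leading eigenvector of the adjacency matrix. The following base R sketch assumes an undirected, connected network; the helper name `eigenvector_centrality` and the four-node example are hypothetical.

```r
# Hypothetical helper: eigenvector centrality as the leading eigenvector
# of a symmetric adjacency matrix, rescaled so the top node scores 1.
eigenvector_centrality <- function(adj) {
  decomposition <- eigen(adj, symmetric = TRUE)
  # eigen() returns eigenvalues in decreasing order, so column 1 is the
  # leading eigenvector; abs() removes the arbitrary sign.
  leading <- abs(decomposition$vectors[, 1])
  leading / max(leading)
}

# Assumed example: a triangle (1, 2, 3) with a pendant node 4 tied to node 1.
adj <- matrix(0, 4, 4)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(1, 4))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

score <- eigenvector_centrality(adj)
lambda <- eigen(adj, symmetric = TRUE)$values[1]
print(round(score, 3))
```

Node 1 gets the top score because it is tied to the two other well-connected nodes, while node 4 inherits a lower score from its single tie.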
Group-level Analysis
Central question: How are nodes clustered (grouped) together?
Based on clustering analysis, a method to merge n nodes into g groups such that:
the nodes within the same group are maximally similar or homogeneous
the nodes between the groups are maximally different or heterogeneous
Process of Clustering Analysis
At step 1, there are 10 clusters, each with a node that is uniquely different from all others.
At step 2, nodes 1 and 2 are considered to be similar enough to form a cluster; the same goes for nodes 9 and 10. There are now 8 clusters.
At step 3, node 3 joins the cluster of 1 and 2, and node 8 joins the cluster of 9 and 10. The process continues until every node is included in one giant cluster at step 6.
An optimal solution is to keep a small number of clusters with maximal similarity within and maximal difference between.
[Figure: clustering of nodes 1 to 10 across the steps]
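The merging process can be sketched with base R's `hclust`, which performs agglomerative clustering by merging the two most similar clusters at each step (one pair per step, unlike the slide, where several pairs merge at the same step). The node positions below are made up for illustration, chosen so that nodes 1 and 2 and nodes 9 and 10 merge first.

```r
# Ten nodes placed on a line; the positions are an assumption, chosen so
# that the closest pairs are (9, 10) and (1, 2).
positions <- c(1, 1.4, 2.2, 5, 7, 9, 11, 13.2, 14, 14.3)
names(positions) <- 1:10

# Agglomerative clustering on pairwise distances (average linkage)
tree <- hclust(dist(positions), method = "average")

# Cut the tree at different numbers of clusters to follow the merging
groups_8 <- cutree(tree, k = 8)   # after the first two merges
groups_6 <- cutree(tree, k = 6)   # after nodes 3 and 8 have joined
print(groups_8)
print(groups_6)

plot(tree)  # dendrogram of the whole merging sequence
```

Choosing where to cut the tree is the practical form of the trade-off described above between few clusters and high within-cluster similarity.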
Island Method for Group Detection
By raising the threshold of edge strength (e.g., mean, median, or k standard deviations above the mean), an increasing number of groups (communities) will emerge successively from a giant connected component.
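A minimal base R sketch of the island method follows, assuming a weighted, undirected network stored as a matrix of edge strengths. The helper `count_components`, the toy weights, and the thresholds are all hypothetical; isolates left behind by the threshold are counted as groups of their own here.

```r
# Hypothetical helper: number of connected components of an undirected
# graph given as a 0/1 adjacency matrix.
count_components <- function(adj) {
  seen <- rep(FALSE, nrow(adj))
  count <- 0
  for (start in seq_len(nrow(adj))) {
    if (seen[start]) next
    count <- count + 1
    seen[start] <- TRUE
    frontier <- start
    while (length(frontier) > 0) {
      reached <- which(colSums(adj[frontier, , drop = FALSE]) > 0 & !seen)
      seen[reached] <- TRUE
      frontier <- reached
    }
  }
  count
}

# Toy weighted network (an assumption): two triangles joined by a weak tie.
ties <- rbind(c(1, 2, 5), c(2, 3, 3), c(1, 3, 3),
              c(4, 5, 5), c(5, 6, 3), c(4, 6, 3),
              c(3, 4, 1))
weights <- matrix(0, 6, 6)
weights[ties[, 1:2]] <- ties[, 3]
weights[ties[, 2:1]] <- ties[, 3]

# Raise the threshold and keep only the ties that are stronger than it
thresholds <- c(0, 2, 4)
n_groups <- sapply(thresholds, function(threshold) {
  count_components((weights > threshold) * 1)
})
print(data.frame(threshold = thresholds, groups = n_groups))
```

The single component at threshold 0 splits into the two triangles at threshold 2 and into four pieces at threshold 4, where only the strongest ties survive.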
Group-level Metrics in NodeXL
Vertex counts
Edge counts
Geodesic distances
Group density
Number of edges between each pair of groups
Global-level Analysis
Key question: How densely or closely connected is the network as a whole?
Fig a (top): connected based on 67% agreement
Fig b (bottom): connected based on 75% agreement
Global-level Metrics in NodeXL (1)
Graph Type: Directed or undirected.
Vertices: The number of vertices in the graph.
Unique Edges: The number of edges that do not have duplicates.
Edges With Duplicates: The number of edges that have duplicates.
Total Edges: The number of edges in the graph. This is the sum of Unique Edges and Edges With Duplicates.
Self-Loops: The number of edges that connect a vertex to itself.
Reciprocated Vertex Pair Ratio: In a directed graph, this is the N of vertex pairs that have edges in both directions divided by the N of vertex pairs that are connected by any edge. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined.
Reciprocated Edge Ratio: In a directed graph, this is the number of edges that are reciprocated divided by the total number of edges. Duplicate edges and self-loops are ignored. In an undirected graph, this is undefined and is not calculated.
Connected Components: The number of connected components in the graph. A connected component is a set of vertices that are connected to each other but not to the rest of the graph.
Global-level Metrics in NodeXL (2)
Single-Vertex Connected Components: The N of connected components that have only one vertex.
Maximum Vertices in a Connected Component: The N of vertices in the connected component that has the most vertices.
Maximum Edges in a Connected Component: The N of edges in the connected component that has the most edges.
Maximum Geodesic Distance (Diameter): The maximum geodesic distance among all vertex pairs, where geodesic distance is the shortest path between two vertices.
Average Geodesic Distance: The average geodesic distance among all vertex pairs, where geodesic distance is the distance between two vertices along the shortest path between them.
Graph Density: A ratio that compares the N of edges with the maximum N of edges the graph would have if all the vertices were connected to each other. Duplicate edges and self-loops are ignored.
Modularity: When the graph has groups, this is a measure of the "quality" of the grouping. Graphs with high modularity have dense connections among the vertices within the group but sparse connections among vertices in different groups. When the graph does not have groups, this is undefined.
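The graph density definition above can be written out as a formula. As a sketch, with n vertices and m edges (duplicate edges and self-loops ignored), and using the 10-node, 18-edge kite graph from the earlier example as a worked case:

```latex
% Density of an undirected and of a directed graph
D_{\text{undirected}} = \frac{m}{n(n-1)/2} = \frac{2m}{n(n-1)},
\qquad
D_{\text{directed}} = \frac{m}{n(n-1)}

% Worked case: undirected kite graph, n = 10, m = 18
D = \frac{2 \times 18}{10 \times 9} = \frac{36}{90} = 0.4
```

The symbols n, m, and D are notation introduced here for illustration; NodeXL reports the resulting ratio directly.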
HANDS-ON TUTORIALS
use R! aRe you suRe?
NodeXL and Super R logo
Source: www.redbubble.com/
R for Web Data Analysis
Network Analysis
R packages: igraph, statnet, RSiena
Spatial Analysis
http://cran.r-project.org/web/views/Spatial.html
R packages: sp, Spatial, OpenStreetMap, RgoogleMaps
Temporal Analysis
http://cran.r-project.org/web/views/TimeSeries.html
http://cran.r-project.org/web/views/SpatioTemporal.html
R packages: tseries, forecast, urca, wavelets, SpatioTemporal
Text Mining
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
R packages: tm, RWeka, openNLP, wordcloud, topicmodels, RTextTools, sentiment, ReadMe
Machine
http://cran.r-
R packages: nnet, rpart, trees, party,
I came, I saw, and I walked away?
Picture: Gareth Jenkins/Solent
http://www.telegraph.co.uk/news/picturegalleries/picturesoftheday/8561204/Pictures-of-the-day-7-June-2011.html?image=6
Figure from the movie Daddy Day Care (2003)
http://img0.joyreactor.com/pics/post/gif-eddie-murphy-reaction-gifs-party-394848.gif
Plunge into the water!
HANDS-ON!
Demo 1. Software Installation
Download and install R, RStudio, and NodeXL
http://cran.r-project.org/
https://www.rstudio.com/ide/
http://nodexl.codeplex.com/
Learn the basics of R
http://tryr.codeschool.com/
More information
https://www.rstudio.com/training/online.html
NETWORK ANALYSIS
(ADVANCED-LEVEL)
Network Topology
Regular or random?
Regular network
Nodes are connected in a regular neighborhood with a fixed number k of edges per node
They do not exhibit the small-world characteristics
They may exhibit clustering
Random network
Random networks have randomly connected edges; each node has an average number of edges
They exhibit the small-world characteristics
They do not exhibit clustering
Small-World Networks
Between order and chaos
Network generation
Watts and Strogatz (1998) propose a model for networks between order and chaos.
The model is built by simply rewiring at random a small percentage of the regular edges, which dramatically shortens the average path length without destroying clustering.
As a result, the network exhibits the small-world feature, as random networks do, and exhibits clustering, as regular lattices do.
Watts and Strogatz (1998)
Scale-free network
Power law
Long-tail distribution
$P(k) \sim k^{-a}$, $0 < a < 2$
$\log P(k) \sim -a \log k$
Zipf distribution
Pareto distribution
Properties
Scale-invariance
$P(ck) \sim (ck)^{-a}$
Thus, $P(ck) \sim c^{-a} k^{-a}$
$P(ck) \propto k^{-a}$
No average
Universality
Barabási, Albert, and Jeong, Scale-free characteristics of random networks: The topology of the world-wide web, Physica A, 281, 2000, pp. 69-77.
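Because the log of a power law is linear in log k, the exponent a shows up as the negative slope of a straight line on a log-log plot. The base R sketch below illustrates this with an exact, noise-free distribution; the exponent 1.5 and the range of k are arbitrary choices for illustration. For real, noisy degree data a regression on the log-log plot is known to be unreliable, so this is a demonstration of the relationship rather than a recommended estimator.

```r
# Exact power-law distribution over degrees 1..100 (values are assumptions)
a <- 1.5
k <- 1:100
p <- k^(-a) / sum(k^(-a))    # P(k) proportional to k^(-a), normalized

# log P(k) = constant - a * log(k), so the fitted slope recovers -a
fit <- lm(log(p) ~ log(k))
estimated_a <- -unname(coef(fit)[2])
print(estimated_a)

plot(k, p, log = "xy", main = "Power law on log-log axes")
```

The straight line in the plot is the visual signature usually checked first when a degree distribution is suspected to be scale-free.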
Demo 2. Generate the Network
R script: http://chengjun.github.io/web_data_analysis/demo2_simulate_networks/
install.packages("igraph")  # run once
library(igraph)
size = 50
# regular structures: tree, star, complete graph, ring
g = make_tree(size, children = 2); plot(g)
g = make_star(size); plot(g)
g = make_full_graph(size); plot(g)
g = make_ring(size); plot(g)
# ring lattice: each node is linked to the neighbors within 2 steps
g = connect(make_ring(size), 2); plot(g)
# random network (Erdos-Renyi)
g = sample_gnp(size, 0.1); plot(g)
# small-world network: rewire a small share of the lattice edges at random
g = sample_smallworld(dim = 1, size = size, nei = 2, p = 0.05); plot(g)
# scale-free network (preferential attachment)
g = sample_pa(size); plot(g)
The Political Blogosphere vs. Congressmen's Retweet Network
L. A. Adamic and N. Glance, 'The Political Blogosphere and the 2004 U.S. Election: Divided They Blog', LinkKDD 2005
Peng, Zhu, Liu, Wu, Liu (2014) Friendship, Interaction networks and Vote agreement of congressmen in the United States. 7th APNC, Montreal, Canada
How to Represent a Network?
[Figure: graph with nodes A to F and edges e1 to e6]
Edge list:
A, B
A, D
A, C
C, D
C, E
C, F
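The same six edges can be stored either as an edge list or as an adjacency matrix. Here is a minimal base R sketch of the conversion, assuming the network is undirected; the helper name `edge_list_to_adjacency` is hypothetical.

```r
# Edge list from the slide: one row per edge (e1 to e6)
edges <- data.frame(
  from = c("A", "A", "A", "C", "C", "C"),
  to   = c("B", "D", "C", "D", "E", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert an undirected edge list to a 0/1 adjacency matrix
edge_list_to_adjacency <- function(edges) {
  nodes <- sort(unique(c(edges$from, edges$to)))
  adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
                dimnames = list(nodes, nodes))
  for (i in seq_len(nrow(edges))) {
    adj[edges$from[i], edges$to[i]] <- 1
    adj[edges$to[i], edges$from[i]] <- 1   # undirected: mirror the tie
  }
  adj
}

adj <- edge_list_to_adjacency(edges)
print(adj)
print(rowSums(adj))   # degree of each node
```

The edge list is the compact form that tools such as NodeXL expect as input, while the matrix form is convenient for computing metrics.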
Demo 3. Describe the Network
Compute graph metrics using NodeXL
Step 1: Paste the edge list into the worksheet.
Step 2 (NodeXL: Set node attributes): Paste the node attribute and name it "party". Remember to switch to the Vertices window first.
Step 3 (NodeXL: Calculating graph metrics): Click Graph Metrics. Remember to set the graph as directed.
Step 4 (NodeXL: Set vertex color and vertex size): Set the vertex color and vertex size. Remember to set party as a categorical variable.
R script
http://chengjun.github.io/web_data_analysis/demo3_describe_the_network/
Graph Statistics
Centrality Measures
Algorithms of graphs
Shortest path
Connected component algorithms
The exponential random graph model (p*)
An ERGM (p*) model is a statistical model for the ties in a network
Independent (pairs of) ties (p1, Holland and Leinhardt, 1981; Fienberg and Wasserman, 1979, 1981)
Markov graphs (Frank and Strauss, 1986)
Extensions (Pattison & Wasserman, 1999; Robins, Pattison & Wasserman, 1999; Wasserman & Pattison, 1996)
New specifications (Snijders et al., 2006; Hunter & Handcock, 2006)
Why do we use stochastic network models?
To capture complex social phenomena that are caused by both regularities and randomness
To infer whether certain network signatures appear more often than expected by chance
To distinguish between different social processes (e.g., homophily vs. structural balance)
To better understand the way local social processes interact and combine to shape global network patterns
Deterministic approaches are not always good enough
Procedures of ERGM
Assume we have an observed network of size n. What are the mechanisms driving the formation of our network (e.g., reciprocity, transitivity)?
Given those mechanisms, are some network configurations (e.g., mutual dyads, transitive triplets) more common than you would expect by chance?
Include a parameter for each configuration in the model. Parameter values will help us identify a probability distribution for all graphs of size n (e.g., if we have a high value for the reciprocity parameter, graphs that have a lot of mutual dyads will be more probable than ones that do not).
Estimate the parameters: find the parameter values that best match the observed network. We do that using MCMC-MLE: Markov Chain Monte Carlo Maximum Likelihood Estimation techniques.
Once we have our probability distribution, we can draw random graphs from it and compare any of their characteristics to those of our observed network.
http://www.kateto.net/wordpress/wp-content/uploads/2012/12/COMM%20645%20-%20ERGM.pdf
Network Configurations: Undirected Networks
Edge, 2-star, 3-star, 4-star, ..., K-star, Triangle
Network Configurations: Directed Networks
Arc, Reciprocity, Isolate, 2-mixed star, 2-in star, 2-out star, ..., K-in star, K-out star, Transitive triad, Cyclic triad
Exponential Random Graph Models (ERGM)
Y: all the possible ties
y: the observed ties
X: node attributes
g(y,X): network configurations (a vector)
θ: a vector of model parameters
k(θ): normalizing constant
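With the quantities listed above, the general ERGM form is commonly written as follows. This is the standard textbook expression, given here as a sketch; θ denotes the vector of model parameters and the sum in the normalizing constant runs over all possible networks y' on the same set of nodes.

```latex
P(Y = y \mid X) = \frac{\exp\{\theta^{\top} g(y, X)\}}{k(\theta)},
\qquad
k(\theta) = \sum_{y'} \exp\{\theta^{\top} g(y', X)\}
```

A large positive parameter for a configuration (say, triangles) makes networks with many such configurations more probable, which is the reading used in the procedure described earlier.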
One example
denotes the vector of change statistics
Johan Koskinen (2012) An introduction to ERGM. 8th UKSNA Conference, Bristol

Tie-Network configuration matrix
Columns (configurations): edges, 2-star, K-star, Triangle
Rows (ties): Y1,2; Y1,3; Y2,3; ...; Yn,n-1
Online Collective Identity
Ackland (2011) Online collective identity. SN
Demo 4. ERGM with R
R script
Load data
Build up network objects
Set the node attributes
Plot the network
Fitting a basic ERG model
http://chengjun.github.io/web_data_analysis/demo4_ergm_analysis/
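The demo script itself is not reproduced here. As a sketch of what fitting a basic ERG model means in the simplest case: a model with only an edges term treats every tie as independent with the same probability, so its single parameter equals the log-odds of the network density, and base R's `glm` can recover it. The toy network below is an assumption. Models with dependence terms (e.g., triangles) need MCMC-MLE in a dedicated package such as ergm from statnet, which is not shown.

```r
# Toy undirected network (an assumption): 6 nodes, 6 ties
adj <- matrix(0, 6, 6)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1

# One 0/1 outcome per pair of nodes (upper triangle of the matrix)
dyads <- adj[upper.tri(adj)]

# Edges-only model: logistic regression with an intercept only
fit <- glm(dyads ~ 1, family = binomial)
theta_edges <- unname(coef(fit)[1])

# The estimate equals the log-odds of the observed density
density <- mean(dyads)
print(c(theta = theta_edges, log_odds = log(density / (1 - density))))
```

A negative edges parameter, as here, simply says that ties are present in fewer than half of all pairs.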
TEMPORAL ANALYSIS
Time Series Analysis
Time series data can be analyzed within either the time domain or the frequency domain.
Time domain:
ARIMA/VAR analysis
Survival analysis
Multilevel analysis
Frequency domain:
Fourier transformation
Spectrum analysis (comparing ak and bk of different time series)
While time domain analysis is routinely conducted, frequency domain analysis is rarely adopted.
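To make the frequency domain concrete, the base R sketch below applies a Fourier transformation to a synthetic daily series and reads off the cosine and sine coefficients (a_k and b_k) at each frequency. The series, with a weekly cycle and a weaker 35-day cycle, is made up for illustration.

```r
# Synthetic daily series over 70 days (10 full weeks); values are assumptions
n <- 70
day <- 0:(n - 1)
series <- 10 + 3 * cos(2 * pi * day / 7) + 0.5 * sin(2 * pi * day / 35)

# Discrete Fourier transform, scaled by the series length
coefficients <- fft(series) / n

# Frequencies k = 1, 2, ... count the cycles completed within the n days
k <- 1:(n / 2 - 1)
a_k <- 2 * Re(coefficients[k + 1])    # cosine coefficients
b_k <- -2 * Im(coefficients[k + 1])   # sine coefficients
amplitude <- sqrt(a_k^2 + b_k^2)

dominant <- k[which.max(amplitude)]
period <- n / dominant                # length of the dominant cycle in days
print(c(dominant_frequency = dominant, period_in_days = period))
```

Comparing a_k and b_k across different series, as the slide suggests, shows whether they share the same cycles; the built-in `spectrum` function offers a ready-made periodogram for the same purpose.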
Time Series Analysis
Forecasting and Univariate Modeling

Frequency analysis
Decomposition and Filtering
Seasonality
Stationarity, Unit Roots, and Cointegration
Nonlinear Time Series Analysis
Dynamic Regression Models
Multivariate Time Series Models
Survival Analysis of
Blogging Behavior
Source: Zhu et al., ICA 2010
SPATIAL ANALYSIS
Spatial Analysis
Spatial Data:
Location names
IP addresses
Map visits
GPS usage
etc.
Spatial analysis is well developed for offline data but underdeveloped and underutilized for web data beyond visual inspections.
Spatial Analysis:
Spatial clusters/patterns (by visual inspections)
Spatial autocorrelation
Spatial regression
Spatial dependence (correlation between nearby locations)
Spatial interaction (correlation between geo-coded variables)
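Spatial autocorrelation, i.e., correlation between nearby locations, is often summarized with Moran's I. The statistic is not named in the slides; it is used here as one standard example. The base R sketch below assumes five locations on a line whose neighbors are the adjacent locations; the helper name `morans_i` and the data are hypothetical.

```r
# Hypothetical helper: Moran's I for values x and a spatial weight matrix w
morans_i <- function(x, w) {
  n <- length(x)
  z <- x - mean(x)                      # deviations from the mean
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Five locations on a line; adjacent locations are neighbors (assumption)
w <- matrix(0, 5, 5)
w[cbind(1:4, 2:5)] <- 1
w[cbind(2:5, 1:4)] <- 1

clustered   <- c(1, 2, 3, 4, 5)   # similar values sit next to each other
alternating <- c(1, 5, 1, 5, 1)   # neighbors are as different as possible

i_clustered <- morans_i(clustered, w)
i_alternating <- morans_i(alternating, w)
print(c(clustered = i_clustered, alternating = i_alternating))  # 0.5 and -1
```

Positive values indicate that nearby locations resemble each other, negative values that they differ, and values near zero a spatially random pattern.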
Geospatial Distribution of the Communication on Twitter
Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, et al. (2013) The Geospatial Characteristics of a Social Movement Communication Network. PLoS ONE 8(3): e55957. doi:10.1371/journal.pone.0055957
Spatial Distribution of Tweets in Milan
kernel smoother of point density
Are the tweets randomly distributed?
Temporal Distribution of Tweets in Milan
SENTIMENT ANALYSIS
Sentiment Analysis
Decompose sentiment
Emotion: joy, surprise, anger, sadness, fear, disgust
Polarity: positivity, negativity, neutral
Lexicon method
Carlo Strapparava and Alessandro Valitutti's emotions lexicon
Janyce Wiebe's subjectivity lexicon
Bing Liu's polarity lexicon
Supervised machine learning
Combine lexicon and machine learning
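The lexicon method can be sketched in a few lines of base R: count the words of a text that appear in a positive list and in a negative list, and label the text by the difference. The two word lists below are tiny made-up stand-ins, not the published lexicons named above, and the helper name `score_polarity` is hypothetical.

```r
# Tiny illustrative word lists (assumptions, not a published lexicon)
positive_words <- c("good", "great", "happy", "love")
negative_words <- c("bad", "sad", "angry", "hate")

# Hypothetical helper: label a text by its count of positive minus negative words
score_polarity <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  score <- sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("I love this great workshop",
            "Bad data makes me sad and angry",
            "The workshop starts at nine")
labels <- sapply(tweets, score_polarity, USE.NAMES = FALSE)
print(labels)   # "positive" "negative" "neutral"
```

Word counting of this kind ignores negation and context, which is one reason the slide also lists supervised machine learning and combined approaches.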
Sentiment in the Tweet Stream
Miller (2011) Social scientists wade into the tweet stream. Science
Twitter Mood Predicts the Stock Market?
Decompose sentiment
Emotion: Calm, Alert, Sure, Vital, Kind, Happy
Bollen (2011) Twitter mood predicts the stock market. JOCS
Calm Sentiment Predicts the Stock Market
Demo 5. Sentiment Analysis with Supervised Machine Learning
R script
http://chengjun.github.io/web_data_analysis/demo5_sentiment_analysis/
Figure source: http://courtneylambert.co/official-twitter-stats-from-chirp
REFLECTION ON WEB
DATA ANALYSIS
Google Correlate & Google Flu Prediction
http://www.google.com/trends/correlate/comic?p=2
Nature reported that Google Flu Trends (GFT) was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2).
Lazer et al. (2014) The parable of Google Flu: Traps in big data analysis. Science
Tweet Sentiment and U.S. Election 2012
Figures source: election.twitter.com
Facebook Insight and Twitter Mention
Result
Facebook Insight
http://www.cnn.com/election/2012/facebook-insights/
http://www.zerogeography.net/2012/11/obama-wins-election-ontwitter.html
http://www.huffingtonpost.com/simon-jackman/pollster-predictions_b_2081013.html
Predict Political Orientation with Machine Learning
Colleoni et al. (2014) Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter. JOC
To Move On
R Style Guide: http://adv-r.had.co.nz/Style.html
R bloggers
stackoverflow
github
Yet, It Is Not Finished
Creating an R package (intro) http://gastonsanchez.com/teaching/
