Completed UNIT-IV 18.9.17

UNIT-IV
CLASSIFICATION AND CLUSTERING
Text Classification and Naive Bayes

Vector Space Classification
Support vector machines and Machine
learning on documents.
Flat Clustering
Hierarchical Clustering
Matrix decompositions and latent semantic
indexing
Fusion and Meta learning
Text Classification
General classes are usually referred to as
topics, and the classification task is then called text
classification, text categorization, or topic
classification.
Introduction to Information Retrieval
Document Classification
Test data: "planning language proof intelligence"

Classes:
(AI)          ML, Planning
(Programming) Semantics, Garb.Coll.
(HCI)         Multimedia, GUI

Training data (sample terms per class):
ML:         learning, intelligence, algorithm, reinforcement, network...
Planning:   planning, temporal, reasoning, plan, language...
Semantics:  programming, semantics, language, proof...
Garb.Coll.: garbage, collection, memory, optimization, region...
Multimedia: ...
GUI:        ...
Ch. 13
Classification Methods (1)

Manual classification
Used by the original Yahoo! Directory
Looksmart, about.com, ODP, PubMed
Accurate when job is done by experts
Consistent when the problem size and team are
small
Difficult and expensive to scale
So we need automatic classification methods for
big problems

Classification Methods (2)
Hand-coded rule-based classifiers
One technique used by news agencies, intelligence
agencies, etc.
Widely deployed in government and enterprise
Vendors provide an IDE for writing such rules
Sec. 13.1
Classification Methods (3)

Supervised learning
Given:
A document d
A fixed set of classes:
$C = \{c_1, c_2, \ldots, c_J\}$
A training set D of documents, each with a label in C
Determine:
A learning method or algorithm which will enable us
to learn a classifier $\gamma$
For a test document d, we assign it the class
$\gamma(d) \in C$

Supervised learning
Naive Bayes (simple, common) see video
k-Nearest Neighbors (simple, powerful)
Support-vector machines (new, generally more
powerful)
plus many other methods
No free lunch: requires hand-classified training data
But data can be built up (and refined) by amateurs
Many commercial systems use a mixture of
methods
The bag of words representation

I love this movie! It's sweet, but with
satirical humor. The dialogue is great
and the adventure scenes are fun... It
manages to be whimsical and romantic
while laughing at the conventions of the
fairy tale genre. I would recommend it to
just about anyone. I've seen it several
times, and I'm always happy to see it
again whenever I have a friend who
hasn't seen it yet.
great 2
love 2
recommend 1
laugh 1
happy 1
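The bag-of-words idea can be sketched in a few lines. This is a minimal illustration, not the tokenizer used for the slide: the helper name `bag_of_words`, the simple regular-expression tokenizer and the short sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token.

    Word order is discarded; only the counts survive (hypothetical helper).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Short sample text chosen for this sketch (not the full review from the slide).
review = "I love this movie! It's sweet. I would recommend this movie to anyone."
counts = bag_of_words(review)
print(counts["movie"], counts["love"], counts["recommend"])
```

A real system would usually add stemming or lemmatization (so that "laughing" counts as "laugh") and stop-word handling; both are left out here to keep the sketch short.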
Sec. 13.6
Evaluating Categorization
Evaluation must be done on test data that are
independent of the training data
Sometimes use cross-validation (averaging results
over multiple training and test splits of the overall
data)
Easy to get good performance on a test set
that was available to the learner during
training (e.g., just memorize the test set)
Sec. 13.6
Evaluating Categorization
Measures: precision, recall, F1, classification
accuracy
Classification accuracy: r/n where n is the
total number of test docs and r is the number
of test docs correctly classified
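As a small illustration of these measures, the sketch below computes precision, recall, F1 and classification accuracy (r/n) for one class on held-out test labels. The function name `evaluate` and the six toy labels are hypothetical and only serve the example.

```python
def evaluate(gold, predicted, positive):
    """Precision, recall and F1 for the class `positive`, plus overall accuracy r/n."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)          # r
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(pairs)}                # r / n

# Hypothetical test set: gold labels vs. classifier output.
gold = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
scores = evaluate(gold, pred, positive="spam")
print(scores)
```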
Naive Bayes Classifier
Bayesian classification is both a supervised learning method and a
statistical method for classification. It assumes an underlying probabilistic model,
which lets us capture uncertainty about the model in a principled way by
determining the probabilities of the outcomes. It can solve diagnostic and predictive
problems.
The method is named after Thomas Bayes, who proposed Bayes'
theorem.
Bayesian classification provides practical learning algorithms in which prior knowledge
and observed data can be combined.
For example, say we have data on 1000 pieces of fruit. Each fruit is a Banana, an Orange or
some Other fruit, and we know 3 features of each fruit: whether it is long or
not, sweet or not, and yellow or not (the table of counts is not reproduced here).
An Example of Text
Classification with Naive Bayes
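A minimal sketch of multinomial Naive Bayes with add-one (Laplace) smoothing is given below. The four training documents, the class names and the helper names `train_nb` and `classify_nb` are assumptions made for illustration, in the style of a standard textbook toy example; they are not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors, per-class term counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, term_counts, vocab

def classify_nb(tokens, priors, term_counts, vocab):
    """Return the class with the highest log-probability score, and all scores."""
    scores = {}
    for c, prior in priors.items():
        total = sum(term_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                # add-one smoothing: (count + 1) / (class length + vocabulary size)
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy training set.
training = [
    ("Chinese Beijing Chinese".split(), "china"),
    ("Chinese Chinese Shanghai".split(), "china"),
    ("Chinese Macao".split(), "china"),
    ("Tokyo Japan Chinese".split(), "other"),
]
priors, term_counts, vocab = train_nb(training)
label, scores = classify_nb("Chinese Chinese Chinese Tokyo Japan".split(),
                            priors, term_counts, vocab)
print(label)
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied; the ranking of classes is unchanged.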
Vector Space Classification
Support vector machines and Machine
learning on documents.
Flat and Hierarchical Clustering
Document clustering
Motivations
Document representations
Success criteria
Clustering algorithms
Partitional
Hierarchical
Ch. 16
What is clustering?
Clustering: the process of grouping a set of
objects into classes of similar objects
Documents within a cluster should be similar.
Documents from different clusters should be
dissimilar.
The commonest form of unsupervised learning
Unsupervised learning = learning from raw data, as
opposed to supervised data where a classification of
examples is given
A common and important task that finds many
applications in IR and other places
Ch. 16
A data set with clear cluster structure
How would you design an algorithm for finding
the three clusters in this case?
Sec. 16.1
Applications of clustering in IR
Whole corpus analysis/navigation
Better user interface: search without typing
For improving recall in search applications
Better search results (like pseudo RF)
For better navigation of search results
Effective user recall will be higher
For speeding up vector space retrieval
Cluster-based retrieval gives faster search
Sec. 16.1
Scatter/Gather: Cutting, Karger, and Pedersen

Sec. 16.2
Issues for clustering

Representation for clustering
Document representation
Vector space? Normalization?
Centroids aren't length normalized
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid trivial clusters - too large or small
If a cluster's too large, then for navigation purposes you've
wasted an extra user click without whittling down the set of
documents much.
Clustering Algorithms
Flat algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
K means clustering
(Model based clustering)
Hierarchical algorithms
Bottom-up, agglomerative
(Top-down, divisive)
Hard vs. soft clustering
Hard clustering: Each document belongs to exactly one cluster
More common and easier to do
Soft clustering: A document can belong to more than one
cluster.
Makes more sense for applications like creating browsable
hierarchies
You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
You can only do that with a soft clustering approach.
We won't do soft clustering today. See IIR 16.5, 18
Partitioning Algorithms
Partitioning method: Construct a partition of n
documents into a set of K clusters
Given: a set of documents and the number K
Find: a partition of K clusters that optimizes the
chosen partitioning criterion
Globally optimal solution
Intractable for many objective functions
Ergo, we would have to exhaustively enumerate all partitions
Effective heuristic methods: K-means and K-medoids
algorithms
Sec. 16.4
K-Means
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of
gravity or mean) of points in a cluster, c:
$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
Reassignment of instances to clusters is based
on distance to the current cluster centroids.
(Or one can equivalently phrase it in terms of
similarities)
Sec. 16.4
K-Means Algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(xi, sj) is minimal.
  (Next, update the seeds to the centroid of each cluster)
  For each cluster cj:
    sj = $\vec{\mu}(c_j)$
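The loop above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: documents are small dense tuples, distance is squared Euclidean, an empty cluster keeps its old seed, and the function names and toy points are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def kmeans(docs, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(docs, k)              # select K random docs as seeds
    assignment = None
    for _ in range(max_iter):
        # assign each doc to the cluster whose seed is closest
        new_assignment = [min(range(k), key=lambda j: dist2(d, seeds[j])) for d in docs]
        if new_assignment == assignment:     # converged: no doc changed cluster
            break
        assignment = new_assignment
        # update each seed to the centroid of its cluster
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:                      # assumption: an empty cluster keeps its old seed
                seeds[j] = centroid(members)
    return assignment, seeds

# Hypothetical 2-D "documents" forming two visible groups.
docs = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
assignment, centroids = kmeans(docs, k=2)
print(assignment, centroids)
```

The result depends on which docs are drawn as seeds, so a different `seed` value may label the clusters differently or, on harder data, converge to a different local optimum.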
Sec. 16.4
K Means Example (K=2)
Figure steps: pick seeds; reassign clusters; compute centroids; reassign clusters;
compute centroids; reassign clusters; converged!
Sec. 16.4
Time Complexity
Computing distance between two docs is
O(M) where M is the dimensionality of the
vectors.
Reassigning clusters: O(KN) distance
computations, or O(KNM).
Computing centroids: Each doc gets added
once to some centroid: O(NM).
Assume these two steps are each done once
for I iterations: O(IKNM).
Sec. 16.4
K-means issues, variations, etc.

Recomputing the centroid after every
assignment (rather than after all points are re-
assigned) can improve speed of convergence
of K-means
Assumes clusters are spherical in vector space
Sensitive to coordinate changes, weighting etc.
Disjoint and exhaustive
Doesn't have a notion of outliers by default
But can add outlier filtering
STEPS K-Means
Number of clusters K is given
Partition n docs into predetermined number of
clusters
Finding the right number of clusters is part of
the problem
Given docs, partition into an appropriate number of
subsets.
E.g., for query results - ideal value of K not known up
front - though UI may impose limits.
Can usually take an algorithm for one flavor and
convert to the other.
K not specified in advance
Say, the results of a query.
Solve an optimization problem: penalize having
lots of clusters
application dependent, e.g., compressed summary
of search results list.
Tradeoff between having more clusters (better
focus within each cluster) and having too many
clusters
K not specified in advance
Given a clustering, define the Benefit for a
doc to be the cosine similarity to its
centroid
Define the Total Benefit to be the sum of
the individual doc Benefits.
Why is there always a clustering of Total Benefit n?

Ch. 17
Hierarchical Clustering
Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
animal
vertebrate invertebrate
fish reptile amphib. mammal worm insect crustacean
One approach: recursive application of a
partitional clustering algorithm.
Dendrogram: Hierarchical Clustering
Clustering obtained by cutting the dendrogram at a
desired level: each connected component forms a
cluster.
Sec. 17.1
Hierarchical Agglomerative Clustering (HAC)
Starts with each doc in a separate cluster
then repeatedly joins the closest pair of
clusters, until there is only one cluster.
The history of merging forms a binary tree
or hierarchy.
Note: the resulting clusters are still hard and induce a partition
Sec. 17.2
Closest pair of clusters

Many variants to defining closest pair of clusters
Single-link
Similarity of the most cosine-similar (single-link)
Complete-link
Similarity of the furthest points, the least cosine-
similar
Centroid
Clusters whose centroids (centers of gravity) are the
most cosine-similar
Average-link
Average cosine between pairs of elements
Sec. 17.2
Single Link Agglomerative Clustering

Use maximum similarity of pairs:
$sim(c_i, c_j) = \max_{x \in c_i,\ y \in c_j} sim(x, y)$
Can result in straggly (long and thin) clusters
due to chaining effect.
After merging ci and cj, the similarity of the
resulting cluster to another cluster, ck, is:
$sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$

Sec. 17.2
Single Link Example

Sec. 17.2
Complete Link
Use minimum similarity of pairs:
$sim(c_i, c_j) = \min_{x \in c_i,\ y \in c_j} sim(x, y)$
Makes tighter, spherical clusters that are typically
preferable.
After merging ci and cj, the similarity of the resulting cluster
to another cluster, ck, is:
$sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$
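The two update rules can be compared with a naive HAC sketch. It keeps a table of cluster-to-cluster similarities and, after each merge, combines the two old rows with `max` (single-link) or `min` (complete-link). The function names and the four toy vectors are assumptions for illustration, and no attempt is made at the efficient priority-queue implementation.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def hac(docs, linkage="single"):
    """Naive HAC. Returns the merge history as (cluster_a, cluster_b, similarity)."""
    combine = max if linkage == "single" else min
    clusters = [(i,) for i in range(len(docs))]      # each doc starts in its own cluster
    sim = {}
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            sim[frozenset((a, b))] = cosine(docs[a[0]], docs[b[0]])
    merges = []
    while len(clusters) > 1:
        best = max(sim, key=sim.get)                 # closest pair of clusters
        a, b = sorted(best)
        merged = tuple(sorted(a + b))
        merges.append((a, b, sim[best]))
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # similarity of the merged cluster to c: max (single) or min (complete)
            sim[frozenset((merged, c))] = combine(sim.pop(frozenset((a, c))),
                                                  sim.pop(frozenset((b, c))))
        del sim[best]
        clusters.append(merged)
    return merges

# Hypothetical document vectors.
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
single = hac(docs, "single")
complete = hac(docs, "complete")
print(single[-1], complete[-1])
```

On this data both linkages merge the same pairs, but the similarity recorded for the final merge differs: single-link reports the most similar cross pair, complete-link the least similar one.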
Sec. 17.2
Complete Link Example

Sec. 17.2.1
Computational Complexity
In the first iteration, all HAC methods need to compute
similarity of all pairs of N initial instances, which is
$O(N^2)$.
In each of the subsequent N-2 merging iterations,
compute the distance between the most recently
created cluster and all other existing clusters.
In order to maintain an overall $O(N^2)$ performance,
computing similarity to each other cluster must be
done in constant time.
Often $O(N^3)$ if done naively or $O(N^2 \log N)$ if done more
cleverly
Sec. 16.3
What Is A Good Clustering?

Internal criterion: A good clustering will produce
high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is
high
the inter-class similarity is low
The measured quality of a clustering depends on
both the document representation and the
similarity measure used
Sec. 16.3
Purity example
Cluster I Cluster II Cluster III
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
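The per-cluster values above, and the overall purity they imply, can be checked with a short sketch. The tuples hold the class counts per cluster as given in the example; the function name `purity` is hypothetical, and the overall figure is computed here as the sum of the majority counts divided by the total number of points.

```python
def purity(cluster_counts):
    """Overall purity: sum of each cluster's majority-class count over all points."""
    total = sum(sum(counts) for counts in cluster_counts)
    return sum(max(counts) for counts in cluster_counts) / total

# Class counts per cluster, as in the purity example.
clusters = [(5, 1, 0), (1, 4, 1), (2, 0, 3)]
per_cluster = [max(c) / sum(c) for c in clusters]   # 5/6, 4/6, 3/5
overall = purity(clusters)                           # (5 + 4 + 3) / 17
print(per_cluster, round(overall, 2))
```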

Sec. 16.3
Rand Index measures agreement between pair
decisions. Here RI = 0.68

Number of points                    Same cluster     Different clusters
                                    in clustering    in clustering
Same class in ground truth          20               24
Different classes in ground truth   20               72
Sec. 16.3
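Rand index and Cluster F-measure
A = pairs in the same class and the same cluster, B = pairs in different classes
but the same cluster, C = pairs in the same class but different clusters,
D = pairs in different classes and different clusters.
$RI = \frac{A + D}{A + B + C + D}$
Compare with standard Precision and Recall:
$P = \frac{A}{A + B}$, $R = \frac{A}{A + C}$
People also define and use a cluster F-measure,
which is probably a better measure.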
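A short sketch of the pair-counting measures follows, using the counts from the Rand index example (20 same-class pairs in the same cluster, 24 same-class pairs split across clusters, 20 different-class pairs in the same cluster, 72 different-class pairs in different clusters). The mapping of these four counts to A, B, C, D follows the precision and recall formulas and is an assumption of the sketch, as are the function names.

```python
def rand_index(a, b, c, d):
    """Fraction of pair decisions that are correct."""
    return (a + d) / (a + b + c + d)

def pair_precision(a, b):
    return a / (a + b)

def pair_recall(a, c):
    return a / (a + c)

# A: same class, same cluster      B: different classes, same cluster
# C: same class, different clusters  D: different classes, different clusters
A, B, C, D = 20, 20, 24, 72
ri = rand_index(A, B, C, D)
p = pair_precision(A, B)
r = pair_recall(A, C)
f1 = 2 * p * r / (p + r)
print(round(ri, 2), round(p, 2), round(r, 2), round(f1, 2))
```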
Matrix-vector multiplication
S has eigenvalues 30, 20, 1 with
corresponding eigenvectors $\vec{v}_1, \vec{v}_2, \vec{v}_3$.
On each eigenvector, S acts as a multiple of the identity
matrix: but as a different multiple on each.
Any vector (say $\vec{x}$) can be viewed as a combination of
the eigenvectors: $\vec{x} = 2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3$
Semantic Indexing
Thus a matrix-vector multiplication such as $S\vec{x}$
(S, x as in the previous slide) can be rewritten
in terms of the eigenvalues/vectors:
$S\vec{x} = S(2\vec{v}_1 + 4\vec{v}_2 + 6\vec{v}_3) = 2 \cdot 30\,\vec{v}_1 + 4 \cdot 20\,\vec{v}_2 + 6 \cdot 1\,\vec{v}_3 = 60\vec{v}_1 + 80\vec{v}_2 + 6\vec{v}_3$
Even though x is an arbitrary vector, the
action of S on x is determined by the
eigenvalues/vectors.
Suggestion: the effect of small eigenvalues is
small.
If we ignored the smallest eigenvalue (1), then
instead of $60\vec{v}_1 + 80\vec{v}_2 + 6\vec{v}_3$
we would get $60\vec{v}_1 + 80\vec{v}_2$.
These vectors are similar (in cosine similarity,
etc.)
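The claim that dropping the smallest eigenvalue changes the result only slightly can be checked numerically. The sketch works purely with coefficients in the eigenvector basis, using the eigenvalues 30, 20, 1 and the combination x = 2v1 + 4v2 + 6v3. Treating those coefficients as ordinary coordinates assumes the eigenvectors are orthonormal, which is an assumption of this example.

```python
import math

eigenvalues = [30, 20, 1]   # eigenvalues of S
coeffs = [2, 4, 6]          # x = 2*v1 + 4*v2 + 6*v3

# S x in the eigenvector basis: each coefficient is scaled by its eigenvalue.
sx = [lam * c for lam, c in zip(eigenvalues, coeffs)]
# The same product with the smallest eigenvalue ignored (treated as 0).
smallest = min(eigenvalues)
sx_truncated = [0 if lam == smallest else lam * c for lam, c in zip(eigenvalues, coeffs)]

def cosine(u, v):
    # Equals the cosine between the actual vectors only for orthonormal v1, v2, v3.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

similarity = cosine(sx, sx_truncated)
print(sx, sx_truncated, round(similarity, 4))
```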
Eigenvalues & Eigenvectors

For symmetric matrices, eigenvectors for distinct
eigenvalues are orthogonal
All eigenvalues of a real symmetric matrix are real.
All eigenvalues of a positive semidefinite matrix

are non-negative
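Example
Let $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$. Real, symmetric.
Then $S - \lambda I = \begin{pmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{pmatrix}$, so $|S - \lambda I| = (2-\lambda)^2 - 1 = 0$.
The eigenvalues are 1 and 3 (nonnegative, real).
Plug in these values and
solve for eigenvectors.
The eigenvectors are orthogonal (and real): $(1, -1)^T$ and $(1, 1)^T$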
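Eigen/diagonal Decomposition
Let $S \in \mathbb{R}^{m \times m}$ be a square matrix with m
linearly independent eigenvectors (a non-defective
matrix)
Theorem: There exists an eigen decomposition $S = U \Lambda U^{-1}$ with $\Lambda$ diagonal
(cf. matrix diagonalization theorem); it is unique for distinct eigenvalues
Columns of U are the eigenvectors of S
Diagonal elements of $\Lambda$ are eigenvalues of S
Diagonal decomposition: why/how
Let U have the eigenvectors as columns: $U = [\vec{v}_1 \ \cdots \ \vec{v}_m]$
Then, SU can be written
$SU = S[\vec{v}_1 \ \cdots \ \vec{v}_m] = [\lambda_1 \vec{v}_1 \ \cdots \ \lambda_m \vec{v}_m] = [\vec{v}_1 \ \cdots \ \vec{v}_m]\,\mathrm{diag}(\lambda_1, \ldots, \lambda_m)$
Thus $SU = U\Lambda$, or $U^{-1} S U = \Lambda$
And $S = U \Lambda U^{-1}$.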
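Diagonal decomposition - example
Recall $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$; $\lambda_1 = 1$, $\lambda_2 = 3$.
The eigenvectors $(1, -1)^T$ and $(1, 1)^T$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$
Inverting, we have $U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$ (recall $U U^{-1} = I$).
Then, $S = U \Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$
Example continued
Let's divide U (and multiply $U^{-1}$) by $\sqrt{2}$
Then, $S = Q \Lambda Q^T$ with $Q = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$ ($Q^{-1} = Q^T$)
Why? Stay tuned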

Symmetric Eigen Decomposition
If $S \in \mathbb{R}^{m \times m}$ is a symmetric matrix:
Theorem: There exists a (unique) eigen
decomposition $S = Q \Lambda Q^T$
where Q is orthogonal:
$Q^{-1} = Q^T$
Columns of Q are normalized eigenvectors
Columns are orthogonal.
(everything is real)
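The symmetric decomposition can be verified by hand-rolled matrix arithmetic. The sketch assumes the 2x2 symmetric matrix S = [[2, 1], [1, 2]], whose eigenvalues are 1 and 3 with eigenvectors (1, -1) and (1, 1); the choice of that matrix and the helper names are assumptions of the example.

```python
import math

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

r = 1 / math.sqrt(2)
# Columns of Q: the normalized eigenvectors (1, -1)/sqrt(2) and (1, 1)/sqrt(2).
Q = [[r, r],
     [-r, r]]
Lam = [[1, 0],
       [0, 3]]          # eigenvalues on the diagonal, in the same column order

S = matmul(matmul(Q, Lam), transpose(Q))   # should reproduce [[2, 1], [1, 2]]
QQt = matmul(Q, transpose(Q))              # should be the identity, i.e. Q^-1 = Q^T
print(S, QQt)
```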
Exercise
Examine the symmetric eigen decomposition,
if any, for each of the following matrices:
Similarity Clustering
We can compute the similarity between two
document vector representations xi and xj by $x_i x_j^T$
Let $X = [x_1 \ \ldots \ x_N]$
Then $XX^T$ is a matrix of similarities
$XX^T$ is symmetric
So $XX^T = Q \Lambda Q^T$
So we can decompose this similarity space into a
set of orthonormal basis vectors (given in Q)
scaled by the eigenvalues in $\Lambda$
This leads to PCA (Principal Components Analysis)
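Singular Value Decomposition
For an M × N matrix A of rank r there exists a
factorization (Singular Value Decomposition = SVD)
as follows:
$A = U \Sigma V^T$, where U is M×M, $\Sigma$ is M×N, V is N×N
(Not proven here.)
$AA^T = Q \Lambda Q^T$
$AA^T = (U \Sigma V^T)(U \Sigma V^T)^T = (U \Sigma V^T)(V \Sigma U^T) = U \Sigma^2 U^T$
The columns of U are orthogonal eigenvectors of $AA^T$.
The columns of V are orthogonal eigenvectors of $A^TA$.
Eigenvalues $\lambda_1 \ldots \lambda_r$ of $AA^T$ are the eigenvalues of $A^TA$.
Singular values: $\sigma_i = \sqrt{\lambda_i}$, $\Sigma = \mathrm{diag}(\sigma_1 \ldots \sigma_r)$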

Illustration of SVD dimensions and sparseness
SVD example
Let A be a 3 × 2 matrix (the matrix and its factors are not reproduced here).
Thus M=3, N=2.
Typically, the singular values are arranged in decreasing order.

Reduced SVD
If we retain only k singular values, and set the
rest to 0, then we don't need the matrix parts
in color
Then $\Sigma$ is k×k, U is M×k, $V^T$ is k×N, and $A_k$ is
M×N
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual
form for computational applications
It's what Matlab gives you
SVD Low-rank approximation

Whereas the term-doc matrix A may have
M=50000, N=10 million (and rank close to
50000)
We can construct an approximation A100 with
rank 100.
Of all rank 100 matrices, it would have the lowest
Frobenius error.
Great but why would we??
Answer: Latent Semantic Indexing
C. Eckart, G. Young, The approximation of a matrix by another of lower rank.
Psychometrika, 1, 211-218, 1936.
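For reference, the low-rank approximation and the optimality result cited above can be written out as follows. The notation ($\sigma_i$ for the singular values in decreasing order, $\vec{u}_i$ and $\vec{v}_i$ for the corresponding columns of U and V, Z for a competing matrix) is chosen for this sketch.

```latex
A_k = U \Sigma_k V^T = \sum_{i=1}^{k} \sigma_i \, \vec{u}_i \vec{v}_i^{\,T},
\qquad
\min_{Z:\ \mathrm{rank}(Z) \le k} \lVert A - Z \rVert_F
  = \lVert A - A_k \rVert_F
  = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}
```

In words: the Frobenius error of the best rank-k approximation is determined entirely by the singular values that were dropped, so the error is small whenever the discarded singular values are small.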
Latent Semantic Indexing

via the SVD
What it is
From term-doc matrix A, we compute the
approximation Ak.
There is a row for each term and a column
for each doc in Ak
Thus docs live in a space of k<<r
dimensions
These dimensions are not the original axes
But why?
Vector Space Model: Pros

Automatic selection of index terms
Partial matching of queries and documents (dealing
with the case where no document contains all search terms)
Ranking according to similarity score (dealing with
large result sets)
Term weighting schemes (improves retrieval performance)
Various extensions
Document clustering
Relevance feedback (modifying query vector)
Geometric foundation
Problems with Lexical Semantics

Ambiguity and association in natural language
Polysemy: Words often have a multitude of
meanings and different types of usage (more
severe in very heterogeneous collections).
The vector space model is unable to discriminate
between different meanings of the same word.
Problems with Lexical Semantics

Synonymy: Different terms may have an
identical or a similar meaning (weaker:
words indicating the same topic).
No associations between words are
made in the vector space
representation.
Polysemy and Context
Document similarity on single word level:
polysemy and context
Figure (word lists): meaning 1: ring, jupiter, space, voyager, planet, saturn, ...;
meaning 2: car, company, dodge, ford, ...
A shared word contributes to similarity if it is used in the 1st meaning,
but not if it is used in the 2nd
Latent Semantic Indexing (LSI)

Perform a low-rank approximation of the
document-term matrix (typical rank 100 to 300)
General idea
Map documents (and terms) to a low-
dimensional representation.
Design a mapping such that the low-dimensional
space reflects semantic associations (latent
semantic space).
Compute document similarity based on the inner
product in this latent semantic space
Goals of LSI
LSI takes documents that are semantically similar
(= talk about the same topics), but are not similar
in the vector space (because they use different
words) and re-represents them in a reduced
vector space in which they have higher similarity.
Similar terms map to similar locations in
low-dimensional space
Noise reduction by dimension reduction
Latent Semantic Analysis

Latent semantic space: illustrating example
courtesy of Susan Dumais

Performing the maps
Each row and column of A gets mapped into
the k-dimensional LSI space, by the SVD.
Claim: this is not only the mapping with the
best (Frobenius error) approximation to A, but
in fact improves retrieval.
A query q is also mapped into this space, by
$\vec{q}_k = \Sigma_k^{-1} U_k^T \vec{q}$
The mapped query is NOT a sparse vector.

LSA Example
A simple example term-document matrix
(binary)
LSA Example
Example of $C = U \Sigma V^T$: The matrix U
LSA Example
Example of $C = U \Sigma V^T$: The matrix $\Sigma$
LSA Example
Example of $C = U \Sigma V^T$: The matrix $V^T$
LSA Example: Reducing the dimension
Original matrix C vs. reduced $C_2 = U \Sigma_2 V^T$
Why the reduced dimension matrix is
better
Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52 * 0.28 + 0.36 * 0.16 + 0.72 * 0.36 + 0.12 *
0.20 + (-0.39) * (-0.08) ≈ 0.52
Typically, LSA increases recall and hurts
precision
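The reduced-space similarity is just an inner product of the two document columns, which a few lines can confirm. The five pairs of values are read off the sum in the slide; the negative signs on the last pair are an assumption of this sketch (the product is the same either way).

```python
# Columns for d2 and d3 in the reduced matrix, as read from the slide's sum.
d2 = [0.52, 0.36, 0.72, 0.12, -0.39]
d3 = [0.28, 0.16, 0.36, 0.20, -0.08]

# Inner product of the two columns.
similarity = sum(a * b for a, b in zip(d2, d3))
print(round(similarity, 2))
```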
Simplistic picture
Topic 1
Topic 2
Topic 3
Data Fusion
Outline
What is data fusion?
Why use data fusion?
Previous work
Components of data fusion
System selection
Bias concept
Data fusion methods
Experiments
Conclusion
Data Fusion
Merging the retrieval results of multiple
systems.
A data fusion algorithm accepts two or more
ranked lists and merges these lists into a single
ranked list with the aim of providing better
effectiveness than all systems used for data
fusion.
Combining evidence from different
systems leads to performance
improvement
Use data fusion to achieve better
performance than the individual
systems involved in the process.
Example metasearch systems
www.dogpile.com
www.copernic.com
Same idea is also used for different query
representations
Fuse the results of different query
representations for the same request and
obtain better results
Measuring relative performance of IR systems
such as web search engines is essential
Use data fusion for finding pseudo relevant
documents and use these for automatic
ranking of retrieval systems
Components of data fusion
1. DB/search engine selector
Select systems to fuse
2. Query dispatcher
Submit queries to selected search engines
3. Document selector
Select documents to fuse
4. Result merger
Merge selected document results
Ranking retrieval systems
System selection methods
1. Best: certain percentage of top performing
systems used
2. Normal: all systems to be ranked are used
3. Bias: certain percentage of systems that
behave differently from the norm (majority
of all systems) are used
Calculating bias of a system
Similarity value
$s(v, w) = \frac{\sum_i v_i w_i}{\sqrt{\sum_i (v_i)^2 \cdot \sum_i (w_i)^2}}$
v: vector of norm
w: vector of retrieval system
Bias of a system
$B(v, w) = 1 - s(v, w)$
Example of calculating bias
2 systems: A and B
7 documents: a, b, c, d, e, f, g
ith row is the result for ith query
XA=(3, 3, 3, 2, 1, 0, 0) XB=(0, 2, 3, 0, 2, 3, 2)
norm vector X = XA+XB = (3, 5, 6, 2, 3, 3, 2)
$s(X_A, X) = 49 / \sqrt{32 \cdot 96} = 0.8841$
Bias(A) = 1 - 0.8841 = 0.1159
$s(X_B, X) = 47 / \sqrt{30 \cdot 96} = 0.8758$
Bias(B) = 1 - 0.8758 = 0.1242
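The bias computation can be reproduced with a short sketch. It uses the frequency vectors XA and XB from the example and takes the norm as their element-wise sum; the function names are hypothetical.

```python
import math

def cosine(v, w):
    """Similarity s(v, w) between the norm vector v and a system vector w."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

def bias(system, norm):
    """Bias of a system: 1 minus its similarity to the norm."""
    return 1 - cosine(norm, system)

xa = [3, 3, 3, 2, 1, 0, 0]
xb = [0, 2, 3, 0, 2, 3, 2]
norm = [a + b for a, b in zip(xa, xb)]   # (3, 5, 6, 2, 3, 3, 2)

bias_a = bias(xa, norm)
bias_b = bias(xb, norm)
print(round(bias_a, 4), round(bias_b, 4))
```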
Bias calculation with order
Order is important because users usually just look
at the documents of higher rank.
2 systems: A and B
7 documents: a, b, c, d, e, f, g
ith row is the result for ith query
Increment the frequency count of a document by m/i instead of 1,
where m is the number of positions and i is the position of the document.
m=4
XA=(10, 8, 4, 2, 1, 0, 0); XB=(0, 8, 22/3, 0, 2, 8/3, 7/3)
Bias(A)=0.0874; Bias(B)=0.1226
Data fusion methods
1. Similarity value models
CombMIN, CombMAX, CombMED,
CombSUM, CombANZ, CombMNZ
2. Rank based models

Rank position (reciprocal rank) method
Borda count method
Condorcet method
Logistic regression model
Similarity value methods
CombMIN - choose min of similarity values
CombMAX - choose max of similarity values
CombMED - take median of similarity values
CombSUM - sum of similarity values
CombANZ - CombSUM / # non-zero similarity values
CombMNZ - CombSUM * # non-zero similarity values
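The six combination rules can be sketched for a single document as follows. The input is the list of similarity values that document received from each system; using 0 for "not retrieved" and letting zeros take part in the min, max and median are assumptions of this sketch, as are the function name and the sample scores.

```python
from statistics import median

def comb_scores(scores):
    """Fused scores for one document from its per-system similarity values."""
    nonzero = sum(1 for s in scores if s != 0)
    total = sum(scores)
    return {
        "CombMIN": min(scores),
        "CombMAX": max(scores),
        "CombMED": median(scores),
        "CombSUM": total,
        "CombANZ": total / nonzero if nonzero else 0.0,   # average of non-zero values
        "CombMNZ": total * nonzero,                       # rewards agreement between systems
    }

# Hypothetical similarity values from four systems (third system did not retrieve the doc).
fused = comb_scores([0.5, 0.25, 0.0, 0.75])
print(fused)
```

In practice the similarity values of different systems are usually normalized to a common range before they are combined; that step is omitted here.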
Rank position method
Merge documents using only rank positions
Rank score of document i (j: system index)
$r(d_i) = \frac{1}{\sum_j 1 / pos(d_{ij})}$
If a system j has not ranked document i at all,
skip it.
Rank position example
4 systems: A, B, C, D
documents: a, b, c, d, e, f, g
Query results:
A={a,b,c,d}, B={a,d,b,e},
C={c,a,f,e}, D={b,g,e,f}
r(a)=1/(1+1+1/2)=0.4
r(b)=1/(1/2+1/3+1)=0.55
Final ranking of documents (lower score ranks higher):
(most relev) a > b > c > e > d > f > g (least relev)
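The rank position scores can be recomputed with a short sketch. It applies the formula literally to the four result lists of the example: each system contributes 1/position for the documents it ranked, and the fused score is the reciprocal of that sum (lower is better). The function name is hypothetical.

```python
def rank_position_scores(rankings):
    """r(d) = 1 / sum over systems of 1 / pos(d); systems that did not rank d are skipped."""
    inverse_sum = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            inverse_sum[doc] = inverse_sum.get(doc, 0.0) + 1.0 / pos
    return {doc: 1.0 / s for doc, s in inverse_sum.items()}

# Result lists of systems A, B, C, D from the example.
rankings = [list("abcd"), list("adbe"), list("cafe"), list("bgef")]
scores = rank_position_scores(rankings)
fused = sorted(scores, key=scores.get)      # smallest score first = most relevant
print(fused, round(scores["a"], 2), round(scores["b"], 2))
```

Evaluated this way, r(b) = 6/11 ≈ 0.55, r(d) ≈ 1.33 and r(e) = 1.2, so e is placed ahead of d.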
Borda Count method
Based on democratic election strategies.
The highest ranked document in a system gets
n Borda points and each subsequent gets one
point less where n is the number of total
retrieved documents by all systems.
Borda Count example
3 systems: A, B, C
Query results:
A={a,c,b,d}, B={b,c,a,e}, C={c,a,b,e}
5 distinct docs retrieved: a, b, c, d, e. So, n=5.
BC(a)=BCA(a)+BCB(a)+BCC(a)=5+3+4=12
BC(b)=BCA(b)+BCB(b)+BCC(b)=3+5+3=11
Final ranking of documents:
(most relevant) c > a > b > e > d (least relevant)
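A sketch of the Borda count for the example above is shown below. The top document of each system gets n points and each following one gets one point less. How to score documents that a system did not retrieve is not stated on the slide; the sketch assumes the leftover points are shared equally among them, which is one common convention. The function name is hypothetical.

```python
def borda_count(rankings):
    docs = sorted({d for ranking in rankings for d in ranking})
    n = len(docs)                                   # number of distinct retrieved docs
    totals = dict.fromkeys(docs, 0.0)
    for ranking in rankings:
        for pos, doc in enumerate(ranking):
            totals[doc] += n - pos                  # n, n-1, n-2, ...
        unranked = [d for d in docs if d not in ranking]
        if unranked:
            # Assumption: points for the unfilled positions are split evenly.
            leftover = sum(range(1, n - len(ranking) + 1))
            for d in unranked:
                totals[d] += leftover / len(unranked)
    return totals

# Result lists of systems A, B, C from the example.
rankings = [list("acbd"), list("bcae"), list("cabe")]
totals = borda_count(rankings)
fused = sorted(totals, key=totals.get, reverse=True)   # most points first
print(totals, fused)
```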
Condorcet method
Also, based on democratic election strategies.
Majoritarian method
The winner is the document which beats each of
the other documents in a pair wise comparison.
Condorcet example
3 candidate documents: a, b, c
5 systems: A, B, C, D, E
A: a>b>c - B: a>c>b - C: a>b=c - D: b>a - E: c>a

Pairwise comparison (each entry: win, lose, tie for the row document against the column document)
      a          b          c
a     -          4, 1, 0    4, 1, 0
b     1, 4, 0    -          2, 2, 1
c     1, 4, 0    2, 2, 1    -

Pairwise winners
      Win   Lose   Tie
a     2     0      0
b     0     1      1
c     0     1      1

Final ranking of documents
a > b = c
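The pairwise tallies of the Condorcet example can be reproduced with the sketch below. Each system's result is written as a list of tiers, so that ties such as b=c share a tier. The sketch assumes that a document a system did not rank loses to every document that system did rank; that rule, like the function names, is an assumption made to match the example.

```python
from itertools import combinations

def prefers(ranking, x, y):
    """Return 1 if the system prefers x to y, -1 if it prefers y, 0 for a tie."""
    def tier(doc):
        for i, docs_in_tier in enumerate(ranking):
            if doc in docs_in_tier:
                return i
        return len(ranking)          # assumption: unranked docs sit below all ranked ones
    tx, ty = tier(x), tier(y)
    return (tx < ty) - (tx > ty)

def condorcet(rankings, docs):
    """Win/lose/tie record of each document over all pairwise majority contests."""
    record = {d: {"win": 0, "lose": 0, "tie": 0} for d in docs}
    for x, y in combinations(docs, 2):
        votes = [prefers(r, x, y) for r in rankings]
        x_votes, y_votes = votes.count(1), votes.count(-1)
        if x_votes > y_votes:
            record[x]["win"] += 1
            record[y]["lose"] += 1
        elif y_votes > x_votes:
            record[y]["win"] += 1
            record[x]["lose"] += 1
        else:
            record[x]["tie"] += 1
            record[y]["tie"] += 1
    return record

# Systems A to E from the example, written as tiers.
systems = [
    [{"a"}, {"b"}, {"c"}],      # A: a > b > c
    [{"a"}, {"c"}, {"b"}],      # B: a > c > b
    [{"a"}, {"b", "c"}],        # C: a > b = c
    [{"b"}, {"a"}],             # D: b > a
    [{"c"}, {"a"}],             # E: c > a
]
record = condorcet(systems, ["a", "b", "c"])
print(record)
```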
Experiments
Turkish Text Retrieval System will be used
All Milliyet articles from 2001 to 2005
80 different system ranked results
8 matching methods
10 stemming functions
72 queries for each system
4 approaches for the experiments
Experiments
First Approach
Mean average precision value of the merged system
is significantly greater than those of all the individual
systems
Second Approach
Find the data fusion method that gives the highest
mean average precision value
Experiments
Third Approach
Find the best stemming method in terms of mean
average precision values
Fourth Approach
See the effect of system selection methods
Conclusion
Data Fusion is an active research area
We will use several data fusion techniques on
the now famous Milliyet database and
compare their relative merits
We will also use TREC data for testing if
possible
We will hopefully find some novel approaches
in addition to existing methods