[Figure: example cluster labels such as "Academic Documents", "Professional Documents", "Culture Documents"]
an overview of topics covered in the search results and help them to identify
the specific group of documents they were looking for. We feel this problem
has not been sufficiently addressed in previous research, which often results
in overly long, ambiguous, or even meaningless group labels.
In this paper we briefly present our novel algorithm, Lingo, which we
believe is able to capture thematic threads in search results, that is, to
discover groups of related documents and describe the subject of these groups
in a way meaningful to a human. Lingo combines several existing methods to
place special emphasis on meaningful cluster descriptions, in addition to
discovering similarities among documents.
2 Theoretical background
Vector Space Model. The Vector Space Model (VSM) is an information
retrieval technique that transforms the problem of comparing textual data into
the problem of comparing algebraic vectors in a multidimensional space. Once
the transformation is done, linear algebra operations are used to calculate
similarities among the original documents. Every unique term (word) from
the collection of analyzed documents forms a separate dimension in the VSM,
and each document is represented by a vector spanning all these dimensions.
For example, if vector v represents document j in a k-dimensional space Ω,
then component t of v, where t ∈ 1 . . . k, represents the degree of the
relationship between document j and the term corresponding to dimension t
of Ω. The full set of such relationships is best expressed as a k × d matrix A,
usually called a term-document matrix, where k is the number of unique terms
and d is the number of documents. Element aij of A is therefore a numerical
representation of the relationship between term i and document j. There are
many methods for calculating aij, commonly referred to as term weighting
methods; refer to [9] for an overview. Once matrix A has been constructed,
the similarity between the vectors representing documents a and b can be
calculated in a variety of ways; the most common measure is the cosine of the
angle between a and b, computed using the vector dot product formula.
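The cosine measure mentioned above can be sketched as follows; the three example documents and the binary term weights are invented purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Columns of a small term-document matrix with binary weights,
# over the terms [singular, value, decomposition, information, retrieval]:
d1 = [1, 1, 0, 0, 0]  # "singular value"
d2 = [1, 1, 1, 0, 0]  # "singular value decomposition"
d3 = [0, 0, 0, 1, 1]  # "information retrieval"

print(cosine(d1, d2))  # high: the documents share two terms
print(cosine(d1, d3))  # 0.0: no terms in common
```

Documents sharing many terms yield vectors with a small angle (cosine close to 1), while documents with disjoint vocabularies are orthogonal.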
Suffix arrays. Let A = a1 a2 a3 . . . an be a sequence of objects. Let us
denote by Ai the suffix of A starting at position i ∈ 1 . . . n, such that Ai =
ai ai+1 ai+2 . . . an. An empty suffix is also defined for every A as An+1 = ∅. A
suffix array is an array of all suffixes of A sorted in lexicographic order. Suffix
arrays, introduced in [5], are an efficient data structure for verifying whether
a sequence of objects B is a substring of A, or more formally, whether ∃i such
that B is a prefix of Ai (sequence equality is equality of elements at their
corresponding positions). The complexity of this lookup is O(P + log N),
where P is the length of B and N is the length of A; the suffix array itself
can be built in O(N log N).
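The lookup described above can be sketched as a binary search over the sorted suffixes; the construction below is a naive O(n² log n) one (the O(N log N) algorithms of [5] are more involved):

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in sorted order.
    Naive construction: sorting n+1 suffixes with string comparison.
    The extra index len(s) corresponds to the empty suffix."""
    return sorted(range(len(s) + 1), key=lambda i: s[i:])

def contains(s, sa, pattern):
    """Check whether pattern is a substring of s: pattern occurs in s
    iff it is a prefix of some suffix of s. Binary search for the first
    suffix that is >= pattern, then test the prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern

text = "singular value decomposition"
sa = suffix_array(text)
print(contains(text, sa, "value"))   # True
print(contains(text, sa, "values"))  # False
```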
Latent Semantic Indexing and Singular Value Decomposition. LSI
is a feature extraction technique that attempts to reduce the rank of a
term-frequency matrix in order to get rid of noisy or synonymous words and
to expose the latent, more general concepts of the collection; the reduction is
typically obtained by computing the Singular Value Decomposition of the
matrix and retaining only its largest singular values.
Lingo: Search Results Clustering Algorithm. . . 3
3.1 Preprocessing
Stemming and stop-word removal are very common operations in Informa-
tion Retrieval. Interestingly, their influence on results is not always positive:
in certain applications stemming has yielded no improvement in overall quality.
Be that as it may, our previous work [10] and current experiments show that
preprocessing is of great importance in Lingo, because the input snippets are
4 Stanislaw Osiński, Jerzy Stefanowski, Dawid Weiss
details here, refer to [7] for a full overview of the corrected algorithm. It does
not affect further discussion of Lingo, because any algorithm capable of dis-
covering frequent phrases could be used at this stage; we use the suffix arrays
approach because it is convenient to implement and very efficient.
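Since any frequent-phrase discovery method would do at this stage, the idea can be sketched with naive word n-gram counting; this is not Lingo's suffix-array algorithm (see [7] for that), and the snippets below are invented for the example:

```python
from collections import Counter

def frequent_phrases(snippets, min_count=2):
    """Naive frequent-phrase discovery: count every contiguous word
    n-gram across the snippets and keep those occurring at least
    min_count times. A suffix-array walk over the concatenated
    snippets finds the same phrases far more efficiently."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                counts[" ".join(words[i:j])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

snips = ["Singular Value Computations",
         "Sparse Singular Value Decomposition",
         "Information Retrieval",
         "Intelligent Information Retrieval"]
print(frequent_phrases(snips))  # includes "singular value", "information retrieval"
```

The quadratic number of n-grams per snippet is acceptable only because snippets are short; for longer inputs the suffix-array approach is clearly preferable.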
4 An illustrative example
Let us assume that the following input data is given (the keyword and
frequent phrase extraction phases have been omitted):
The t = 5 terms:
  T1: Information
  T2: Singular
  T3: Value
  T4: Computations
  T5: Retrieval

The d = 7 documents:
  D1: Large Scale Singular Value Computations
  D2: Software for the Sparse Singular Value Decomposition
  D3: Introduction to Modern Information Retrieval
  D4: Linear Algebra for Intelligent Information Retrieval
  D5: Matrix Computations
  D6: Singular Value Analysis of Cryptograms
  D7: Automatic Information Organization

The p = 2 phrases:
  P1: Singular Value
  P2: Information Retrieval
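Under the assumption of simple binary term weights (Lingo's actual weighting scheme may differ; see the term weighting discussion in Section 2), the term-document matrix for this example could be assembled as:

```python
# Build the 5x7 term-document matrix A for the example above:
# element a[i][j] is 1 if term i occurs in document j, 0 otherwise.
terms = ["Information", "Singular", "Value", "Computations", "Retrieval"]
documents = [
    "Large Scale Singular Value Computations",
    "Software for the Sparse Singular Value Decomposition",
    "Introduction to Modern Information Retrieval",
    "Linear Algebra for Intelligent Information Retrieval",
    "Matrix Computations",
    "Singular Value Analysis of Cryptograms",
    "Automatic Information Organization",
]
A = [[1 if term.lower() in doc.lower().split() else 0 for doc in documents]
     for term in terms]
for term, row in zip(terms, A):
    print(f"{term:>12}: {row}")
```

For instance, the row for "Singular" is [1, 1, 0, 0, 0, 1, 0]: the term occurs only in documents D1, D2 and D6.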
5 Evaluation
Lingo has been evaluated empirically in an experiment involving 7 users
and a set of 4 search results, 2 in Polish and 2 in English. The users were
asked to establish whether cluster labels were meaningful and whether docu-
ment assignments to those clusters made sense. Unfortunately, because of
this paper's length limitations we are unable to present the full results of the
evaluation; the full set of metrics and results is given in [7]. Let us mention
here that the results, although obtained from a small number of users (7),
were quite promising: users found 70–80% of the clusters useful, and 80–95%
of the snippets inside those clusters matched their topic. Over 75% of cluster
labels were marked as useful (with noise clusters concentrated among the
lower-scoring ones).
Acknowledgment
The authors would like to thank an anonymous reviewer for helpful sugges-
tions. This research has been supported by funds from an internal university
grant.
References