If the agenda has length d + 1 due to the last binary split of one of its elements, the agenda is shortened by the element with least support (line 15). If the agenda has the correct number of features, it is added to the output set describing a selection of concepts, hence an aggregation that represents documents by d-dimensional concept vectors (line 17).

Thus, Algorithm 1 zooms into those concepts that exhibit the strongest support, while taking into account the support of subconcepts. Finally, it proposes sets of views for clustering that imply a d-dimensional representation of documents by concept vectors. Each entry of a vector specifies how often the corresponding concept occurs in a document, using the logarithmic representation of term or concept frequencies.
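As an illustration of this representation, a minimal sketch follows; the entry formula log(1 + frequency) and the dictionary of per-document concept frequencies are our assumptions, since the exact logarithmic form is not spelled out above:

import math

def concept_vector(concept_counts, view):
    """Map one document to a d-dimensional vector for a given view.

    concept_counts: dict from concept to its frequency in the document
        (a hypothetical representation, for illustration only).
    view: list of d concepts selected by GenerateConceptViews.
    """
    # Logarithmic representation of frequencies, assumed to be log(1 + f).
    return [math.log(1 + concept_counts.get(c, 0)) for c in view]

view = ["Accommodation", "Action", "SeaRessort"]   # a 3-dimensional view
doc = {"Accommodation": 7, "Action": 2}
print(concept_vector(doc, view))  # [2.079..., 1.098..., 0.0]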
Algorithm 1 (GenerateConceptViews)
Input: number of dimensions d; Ontology O with top concept ROOT; document set D
1 begin
2 Agenda := [ROOT];
3 repeat
4 Elem := First(Agenda);
5 Agenda := Rest(Agenda);
6 continue := TRUE;
7 if Leaf(Elem)
8 then continue := FALSE;
9 else
10 if Atom(Elem) then Elem := Subconcepts(Elem); fi;
11 NewElem := BestSupportElem(Elem);
12 RestElem := Elem \ NewElem;
13 if ¬Empty(RestElem) then Agenda := SortInto(RestElem; Agenda); fi;
14 Agenda := SortInto(NewElem; Agenda);
15 if Length(Agenda) > d then Agenda := Butlast(Agenda); fi;
16 fi;
17 if Length(Agenda) = d then Output(Agenda); fi;
18 until continue = FALSE;
19 end
Output: Set of lists consisting of single concepts and lists of concepts, which describe views onto the
document corpus D.
Auxiliary functions used:
Subconcepts(C) returns an arbitrarily ordered list of direct subconcepts of C.
Support(C) cf. equation 4.
Support(ListC) is the sum over all concepts C in ListC of Support(C).
SortInto(Element; List2) sorts Element, which may be a single concept or a list of concepts, as a whole into List2, ordering according to Support(Element) and removing redundant elements.
BestSupportElem(List) returns the element of List with maximal Support(Element).
[Element] constructs a list with one Element.
[Element; List] list constructor extending List such that Element is first.
First(List), Rest(List) are the common list processing functions.
Atom(E) returns true when E is not a list.
Leaf(E) returns true when E is a concept without subconcepts.
List \ E removes element E from List.
Length(List) returns the length of List.
Butlast(List) returns a list identical to List, but excluding the last element.
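To make the control flow of Algorithm 1 concrete, the following Python sketch mirrors the pseudocode. The dictionary encoding of the ontology and the support table are our assumptions, and SortInto's removal of redundant elements is omitted for brevity:

def generate_concept_views(d, subconcepts, support):
    """Sketch of Algorithm 1; agenda elements are concepts or lists of concepts.

    subconcepts: dict mapping a concept to its direct subconcepts.
    support: dict mapping a concept to its support (equation 4); the
        support of a list is the sum of its members' supports.
    """
    def sup(elem):
        return sum(support[c] for c in elem) if isinstance(elem, list) else support[elem]

    views = []
    agenda = ["ROOT"]
    while True:
        elem, agenda = agenda[0], agenda[1:]
        if not isinstance(elem, list) and not subconcepts.get(elem):
            break  # Leaf: the best-supported element cannot be split further
        if not isinstance(elem, list):
            elem = list(subconcepts[elem])  # Atom: replace by its subconcepts
        new_elem = max(elem, key=lambda c: support[c])    # BestSupportElem
        rest = [c for c in elem if c != new_elem]         # Elem \ NewElem
        if rest:
            agenda.append(rest)     # keep the remaining siblings aggregated
        agenda.append(new_elem)
        agenda.sort(key=sup, reverse=True)                # SortInto by support
        if len(agenda) > d:
            agenda = agenda[:-1]    # Butlast: drop the least-supported element
        if len(agenda) == d:
            views.append(list(agenda))                    # output one view
    return views

subcs = {"ROOT": ["Accommodation", "Action"],
         "Accommodation": ["Hotel", "Campground"],
         "Action": ["Water_Sports", "Sauna"]}
supp = {"ROOT": 10, "Accommodation": 6, "Action": 4,
        "Hotel": 5, "Campground": 1, "Water_Sports": 3, "Sauna": 1}
print(generate_concept_views(2, subcs, supp))
# [['Accommodation', ['Action']], ['Hotel', ['Action']]]

On this toy input the sketch successively refines the best-supported concept while keeping its remaining siblings aggregated as one feature, exactly the zooming behavior described above.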
4 Evaluation

This section describes the evaluation of applying K-Means to the preprocessing strategies SiVeR, TES, and COSA introduced above.

Setting

We have performed all evaluations on a document set from the tourism domain. For this purpose, we have manually modeled an ontology O consisting of a set of concepts C (|C| = 1030) and a word lexicon consisting of 1950 stem entries (the coverage of different terms L by SMES is much larger!). The heterarchy H has an average depth of 4.6; the longest uni-directed path from root to leaf has length 9.

Our document corpus D has been crawled from a WWW provider for tourist information (URL: http://www.all-in-all.de) and currently consists of 2234 HTML documents with a total of over 16 million words. The documents in this corpus describe objects and services of the tourism domain.
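To make the reported structure measures concrete, here is a small sketch over a toy heterarchy (the encoding as a dictionary from concepts to direct subconcepts and all concept names are ours, for illustration only); it computes the average leaf depth and the longest root-to-leaf path:

# Toy heterarchy: concept -> list of direct subconcepts (a DAG in general).
HETERARCHY = {
    "ROOT": ["Accommodation", "Action"],
    "Accommodation": ["Hotel", "Campground"],
    "Action": ["Water_Sports", "Sauna"],
}

def leaf_depths(h, node="ROOT", depth=0):
    """Yield the depth of every leaf reachable from node."""
    subs = h.get(node, [])
    if not subs:
        yield depth
    for s in subs:
        yield from leaf_depths(h, s, depth + 1)

depths = list(leaf_depths(HETERARCHY))
print(sum(depths) / len(depths))  # average depth (4.6 for the paper's ontology)
print(max(depths))                # longest root-to-leaf path (9 in the paper)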
Silhouette Coefficient

Our aim was to compare SiVeR, TES, and COSA for a wide range of parameter settings. In order to be rather independent of the number of features used for clustering and of the number of clusters produced as a result, our main comparisons refer to the silhouette coefficient (cf. [8]):

Definition 3 (Silhouette Coefficient) Let D_M = {D_1, ..., D_k} describe a clustering result, i.e. it is an exhaustive partitioning of the set of documents D. The distance of a document d ∈ D to a cluster D_i ∈ D_M is given as

  dist(d, D_i) = Σ_{p ∈ D_i} dist(d, p) / |D_i| .

The silhouette coefficient is a measure of clustering quality that is rather independent of the number of clusters, k. Experiences, such as those documented in [8], show that values between 0.7 and 1.0 indicate clustering results with excellent separation between clusters, viz. data points are very close to the center of their cluster and remote from the next nearest cluster. For the range from 0.5 to 0.7 one finds that data points are clearly assigned to cluster centers. Values from 0.25 to 0.5 indicate that cluster centers can be found, though there is considerable noise, i.e. there are many data points that cannot be clearly assigned to clusters. Below a value of 0.25 it becomes practically impossible to find significant cluster centers and to definitely assign the majority of data points.

For the comparison of the three different preprocessing methods we have used standard K-Means. We are well aware that for high-dimensional data, approaches like [3] would very likely improve the results for all three preprocessing strategies. However, in preliminary tests we found that in the low-dimensional realms where the silhouette coefficient indicated reasonable separation between clusters, the quality measures for standard and improved K-Means coincided.
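Since Definition 3 only fixes the document-to-cluster distance, the sketch below fills in the standard silhouette of Kaufman and Rousseeuw [8]: for each document, a is the average distance to the other members of its own cluster, b is the distance to the nearest other cluster, and the coefficient is the mean of (b - a) / max(a, b). The function names and the Euclidean document distance are our assumptions:

import math

def dist(p, q):
    """Euclidean distance between two concept vectors (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist_to_cluster(d, cluster):
    """dist(d, D_i) from Definition 3: average distance to cluster members."""
    return sum(dist(d, p) for p in cluster) / len(cluster)

def silhouette_coefficient(clusters):
    """Mean silhouette over all documents; clusters is a list of lists."""
    total, n = 0.0, 0
    for i, own in enumerate(clusters):
        if len(own) < 2:
            continue  # the silhouette is undefined for singleton clusters
        for d in own:
            a = dist_to_cluster(d, [p for p in own if p is not d])
            b = min(dist_to_cluster(d, c)
                    for j, c in enumerate(clusters) if j != i)
            total += (b - a) / max(a, b)
            n += 1
    return total / n

clusters = [[(0.1, 0.0), (0.2, 0.1)], [(5.0, 5.0), (5.1, 4.9)]]
print(silhouette_coefficient(clusters))  # about 0.98

By the ranges quoted above, such a value would indicate excellently separated clusters.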
The general result of our evaluation using the silhouette measure was that K-Means based on COSA preprocessing excelled the comparison baseline, viz. K-Means based on TES, to a large extent. K-Means based on SiVeR was so strongly handicapped by having to cope with overly many dimensions that its silhouette coefficient always approached 0, indicating that no reasonable clustering structures could be found.

One exemplary, but overall characteristic, diagram depicted in Figure 1 shows the silhouette coefficient for a fixed number of features (namely 15) and a fixed number of clusters (namely 10). It does so for K-Means based on SiVeR, on TES, and on COSA. The results for SiVeR are strictly disappointing. TES is considerably better, but it still yields a silhouette coefficient that indicates practically non-existent distinctions between clusters. COSA produces 89 views for this parameter setting. We found that the best aggregations produced by COSA delivered clustering results with silhouette measures of up to 0.48, indicating indeed very reasonable separation between clusters.

[Figure 1: Comparing TES with the 89 views produced by COSA; silhouette coefficient (SC) per view # for SiVeR, TES, and COSA; k = 10, d = 15.]
Mean Squared Error

To base our comparisons not only on a single measure, we also did a study evaluating the mean squared error (MSE) for our approach and the corresponding baseline results. The mean squared error is a measure of the compactness of a given clustering and is defined as follows.

Definition 4 (MSE) The overall mean squared error MSE for a given clustering D_M is

  MSE(D_M) = Σ_{i=1}^{k} MSE(D_i) .   (8)

The mean squared error for one cluster D_i is defined as

  MSE(D_i) = Σ_{p ∈ D_i} dist(p, µ_{D_i})² ,   (9)

where µ_{D_i} is the centroid of cluster D_i.

We found that also for the MSE measure, K-Means based on COSA compared favorably against K-Means based on TES. 49 of COSA's views are worse, but 40 of them are considerably better than the TES baseline. Figure 2 shows the corresponding results, with a baseline for TES at an MSE of 3240, while the best views of COSA exhibit an MSE of only 1314.

[Figure 2: MSE per view # for the 89 COSA views, compared against the TES baseline.]
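Definition 4 translates directly into code. A minimal sketch, assuming Euclidean distance and the usual componentwise-mean centroid of K-Means:

def centroid(cluster):
    """Componentwise mean of the vectors in a cluster."""
    n = len(cluster)
    return tuple(sum(v[j] for v in cluster) / n for j in range(len(cluster[0])))

def mse_cluster(cluster):
    """Equation (9): summed squared distances of members to their centroid."""
    mu = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, mu)) for p in cluster)

def mse(clustering):
    """Equation (8): overall MSE as the sum over all clusters."""
    return sum(mse_cluster(c) for c in clustering)

print(mse([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]))  # 2.0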
Figure 3 shows the silhouette coefficient for the best aggregation from the ones generated by GenerateConceptViews when the number of dimensions d is varied. We see that for TES and COSA the quality of the results decreases, as expected, for the higher dimensions, though COSA still compares favorably against TES.

[Figure 3: Comparing TES and the best aggregation of COSA; k = 10; d = 10, 15, 30, 50, 100.]

We have not included the lower bound of COSA in Figure 3. The reason is that so far we have not been very attentive to optimizing GenerateConceptViews in order to eliminate the worst views up front. This, however, should be easily possible, because we observed that the bad results are produced by views that contain too many overly general concepts like MaterialThing or Intangible.

In the next experiments, we varied the number of clusters, k, between 2 and 100, while d remained at 15 (cf. Figure 4). The general result is that the number of clusters does not affect the results produced by COSA and TES very much. The slight increase of the silhouette coefficient is due to a growing number of documents (up to 30%) that cluster exactly at one point, viz. at (0, ..., 0), where all features disappear.

[Figure 4: Silhouette coefficient (SC) of COSA and TES for varying numbers of clusters k; d = 15.]
[Figure 5: An example view generated by COSA, containing concepts such as Sauna, Terrace, Go_In_For_Sports, Solarium, and Water_Sports.]

Here, one may recognize that NonPrivateFacilitiesOfAccomodation and Action were important concepts used for clustering and distinguishing documents. Interpreting the results, we may conjecture that HotelCategory (three, four, five star hotels, etc.) is a concept which might correlate with the facilities of the accommodation, a correlation that happens not to be described in the given ontology. Finally, we see SeaRessort in this view, which might play a role for clustering, or which might occur just because of uninterpretable noise.

We currently explore GUI possibilities in order to tie the interpretation of clustering results to the navigation of the heterarchy, so as to give the user a good grip on the different clustering views.

Conclusion

Our results support the general statement that structure can mostly be found in a low-dimensional space (cf. [2]). Our proposal is well suited to provide a selected number of aggregations in subspaces, exploiting standard K-Means and comparing favorably with baselines like clustering based on d terms ranked by tf/idf measures.

The selected concepts may be used to indicate to the user which text features were most relevant for the particular clustering results and to distinguish different views.

5 Related Work

All clustering approaches based on frequencies of terms/concepts and on similarities of data points suffer from the same mathematical properties of the underlying high-dimensional data space.

Hinneburg & Keim [6] show how projections improve the effectiveness and efficiency of the clustering process. Their work shows that projections are important for improving the performance of clustering algorithms. In contrast to our work, they do not focus on cluster quality with respect to the internal structures contained in the clustering.

The problem of clustering high-dimensional data sets has been researched by Agrawal et al. [1]: they present a clustering algorithm called CLIQUE that identifies dense clusters in subspaces of maximum dimensionality. Cluster descriptions are generated in the form of minimized DNF expressions.

A straightforward preprocessing strategy may be derived from multivariate statistical data analysis, known under the name of principal component analysis (PCA). PCA reduces the number of features by replacing a set of features by a new feature representing their combination.

In [14], Schuetze and Silverstein have researched and evaluated projection techniques for efficient document clustering. They show how different projection techniques significantly improve clustering performance without a loss of cluster quality. They distinguish between local and global projection: local projection maps each document onto a different subspace, while global projection selects the relevant terms for all documents using latent semantic indexing.

On the other hand, in real-world applications the statistically optimal projection, such as the ones used in the approaches just cited, often does not coincide with the projection most suitable for humans to solve a particular task, such as finding the right piece of knowledge in a large set of documents. Users typically prefer explicit background knowledge that indicates the foundations on which a clustering result has been achieved.
Hinneburg et al. [7] consider this general problem a domain-specific optimization task. Therefore, they propose to use a visual and interactive environment to derive meaningful projections involving the user. Our approach may be seen as automatically solving part of the task they assign to the user environment, while giving the user some first means to explore the result space interactively in order to select the projection most relevant for her particular objectives.

Finally, we want to mention an interesting proposal for feature selection made in [13]. Devaney and Ram describe feature selection for an unsupervised learning task, namely conceptual clustering. They discuss a sequential feature selection strategy based on an existing COBWEB conceptual clustering system. In their evaluation they show that feature selection significantly improves the results of COBWEB. The drawback that Devaney and Ram face, however, is that COBWEB is not scalable like K-Means. Hence, for practical purposes of clustering in large document repositories, COSA seems better suited.

6 Conclusion

In this paper we have shown how to include background knowledge in the form of a heterarchy in order to generate different clustering views onto a set of documents. We have compared our approach against a sophisticated baseline, achieving a result favourable for our approach. In addition, we have shown that it is possible to automatically produce results for diverging views onto the same input. Thereby, the user can rely on a heterarchy to control and possibly interpret clustering results.

The preprocessing method, COSA, that we propose is a very general one. Currently we are trying to apply our techniques to a high-dimensional data set that is not based on text documents, but on a real-world customer database with over 3 million clients. First preliminary results of our method on this database are very encouraging.

References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD Intl. Conference on Management of Data, Seattle, Washington, June 1998. ACM Press, 1998.

[2] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. of ICDT-1999, Jerusalem, Israel, 1999, pages 217-235, 1999.

[3] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. of KDD-1998, New York, NY, USA, August 1998, pages 9-15, Menlo Park, CA, USA, 1998. AAAI Press.

[4] J. Fuernkranz, T. Mitchell, and E. Riloff. A case study in using linguistic phrases for text categorization on the WWW. In Proc. of AAAI/ICML Workshop Learning for Text Categorization, Madison, WI, 1998. AAAI Press, 1998.

[5] A. Hinneburg, C. Aggarwal, and D.A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. of VLDB-2000, Cairo, Egypt, September 2000, pages 506-515. Morgan Kaufmann, 2000.

[6] A. Hinneburg and D.A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. of VLDB-1999, Edinburgh, Scotland, September 1999. Morgan Kaufmann, 1999.

[7] A. Hinneburg, M. Wawryniuk, and D.A. Keim. Visual mining of high-dimensional data. Computer Graphics & Applications Journal, September 1999.

[8] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

[9] S.A. Macskassy, A. Banerjee, B.D. Davison, and H. Hirsh. Human performance on clustering web pages: a preliminary study. In Proc. of KDD-1998, New York, NY, USA, August 1998, pages 264-268, Menlo Park, CA, USA, 1998. AAAI Press.

[10] A. Maedche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 2001.

[11] G. Miller. WordNet: A lexical database for English. CACM, 38(11):39-41, 1995.

[12] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An information extraction core system for real world German text processing. In ANLP-1997, Proceedings of the Conference on Applied Natural Language Processing, pages 208-215, Washington, USA, 1997.

[13] M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In Proc. of ICML-1997, Nashville, TN, 1997. Morgan Kaufmann, 1997.

[14] H. Schuetze and C. Silverstein. Projections for efficient document clustering. In Proc. of SIGIR-1997, Philadelphia, PA, July 1997, pages 74-81. Morgan Kaufmann, 1997.