Академический Документы
Профессиональный Документы
Культура Документы
Media
∗
K. Selçuk Candan Luigi Di Caro Maria Luisa Sapino
Arizona State University Universita’ di Torino Universita’ di Torino
Tempe AZ, USA Torino, Italy Torino, Italy
candan@asu.edu dicaro@di.unito.it mlsapino@di.unito.it
ABSTRACT
In social media, such as blogs, since the content naturally
evolves over time, it is hard or in many cases impossible
to organize the content for effective navigation. Thus, one
commonly has to resort to simple tools, such as tags and tag
clouds, for presenting frequently used keywords to users to
provide them at least some high level idea about the content
of a given set of social media entries. Most visualizations of
tag clouds vary the sizes of the fonts to differentiate impor-
tant tags from those that are less important. We propose an (a) an un-condensed tag cloud sample
alternative “contextual-layout” method, TMine, for analyz-
ing and presenting tags that are extracted from textual con- motor vehicle
tent. In TMine tags are first mapped onto a latent semantic
space. Then, TMine analyzes the relationships between tags
relying on an extended boolean interpretation of the seman- car ......... truck motorcycle
tic space. The tag cloud is condensed into a hierarchy in a
way that captures contextual relationships between tags: in
particular, descendant terms in the hierarchy occur within
bus cab sedan minibike trail bike
the context defined by the ancestor terms. This provides a
mechanism for navigation within the tag space as well as for (b) a contextually-organized taxonomy segment
classification of the text documents based on the contextual
Figure 1: A tag cloud (obtained from
structure implied by the tags.
http://expertvoices.nsdl.org/ on 05/26/2008) vs.
a taxonomy: the taxonomy presents a hierarchy
Categories and Subject Descriptors that describes the given domain, while the tag cloud
H.3.1 [Information Storage and Retrieval]: Content fails to provide any contextual clues between terms
Analysis and Indexing; H.5 [Information Interfaces and
Presentation]; I.7 [Document and Text Processing] 1. INTRODUCTION
With the quick growth of social media content over the
General Terms web (e.g., the blogosphere), tag-based searches and tag cloud
Algorithms (sets of tags) based visualizations have become increasingly
popular [18]. Most visualizations of tag clouds vary the sizes
of the fonts to differentiate most important tags from those
Keywords that are less important (Figure 1(a)). Yet, this representa-
Tag hierarchy construction, visualization of social media tion falls short in describing the context in which these terms
occur in the collection. [18, 21] observe that for large blog
∗This author is also Adjunct Professor at Arizona State Uni- archives, a simple chronological order is not sufficient and a
versity. table of contents (TOC) like navigational hierarchy, depict-
ing the topical relationships within the blog archive would
be more effective. In this paper, we propose a “contextual-
layout” method for tag clouds, based on an analysis method
Permission to make digital or hard copies of all or part of this work for we refer to as TMine, for presenting tags extracted from
personal or classroom use is granted without fee provided that copies are content and their relationships.
not made or distributed for profit or commercial advantage and that copies Our main observation is that taxonomies are, essentially,
bear this notice and the full citation on the first page. To copy otherwise, to hierarchical representations of terms that are important in a
republish, to post on servers or to redistribute to lists, requires prior specific given application domain (Figure 1(b)) and their main differ-
permission and/or a fee.
SSM’08, October 30, 2008, Napa Valley, California, USA. ence from tag clouds is that they are contextually organized:
Copyright 2008 ACM 978-1-60558-258-0/08/10 ...$5.00. a node in a given taxonomy clusters and contextualizes all
75
(a) Tag hierarchy extracted from Kat- (b) Tag hierarchy extracted from Kat- (c) Tag hierarchy extracted from Kat-
rina related news articles published on rina related news articles published be- rina related news articles published be-
Aug 30th 2005 tween October 1st and 9th tween February 10th and 13th 2006
Figure 2: Tag hierarchies (and the associated tag maps) extracted from Katrina related news articles published
at different times (the colored lines show other relationships found by TMine). Each of the tag hierarchies
presented here are extracted using only 15 news entries randomly picked among those published at [2] on
the stated dates.
used explicit (Figure 3). We call this process condensing the
tag cloud into a tag flake (or tagFlake for short).
Since tag hierarchies impose a structure on the tags that
correspond to the context of the given social media collec-
tion, this structural knowledge can be used for organizing
entries. Tag hierarchies can also enable TMine to help
users track the topic development patterns in the tempo-
rally evolving social media collections. A social media visu-
alization and tracking interface based on this concept was
demonstrated at [9]. In this paper, we describe the under-
lying technology in detail.
76
further evaluates the contextual relationships between the Semi-automatic construction of taxonomies. The
terms, relying on an extended Boolean interpretation of the goal of the OntoGen [11, 12] system is similar to ours: the
tag space. TMine then leverages the resulting contextual authors aim at designing a semi-automatic method to mine
knowledge, along with the relative positioning of the terms ontological relations on concepts extracted from text cor-
with respect to the bases of the latent semantic space, to pora, by using Latent Semantic Analysis techniques. By
condense the tag cloud onto a tag hierarchy. referring to the TF/IDF vector based representation of the
The proposed method is language neutral, since it does documents, authors use cosine metric to compare the simi-
not rely on any natural language processing technique and it larity between documents, and apply K-Means multidimen-
is completely unsupervised. The only information source for sional clustering on the documents. Documents in the same
the process is a given set of documents. Thus, the hierarchy cluster are the ones that share more information content;
directly reflects the document content and the way different i.e., which have more keywords in common. As for the key-
topics are discussed in it. Experimental evaluation results word vector representation of the topics, [11] considers two
show the effectiveness of TMine in capturing the semantic alternative methods. The first method computes the cen-
structure of text documents. troid of the multidimensional cluster representing the topic,
and selects the keywords with higher weight in the topic
1.2 Related Work centroid. The second method relies on support vector ma-
chines. A graphical interface visualizes candidate topics as
In this section we present an overview of the related work
nodes, and possible relationships among them as edges. The
in the area of social media and tag clouds, textual document
user is asked to confirm or modify the topics, as well as the
visualizations, statistical analysis of text documents and of
relationships among topics, and to build the ontology. [11,
the (semi)automatic topic extraction from textual corpora.
12] consider ontologies in general, while we focus on the spe-
Tag clouds and Visualization of Text Document
cial case of tag hierarchies. Yet, the method in [11, 12] is
Collections. [14] proposes a graphical visualization of tag
not more general than ours, since it is a supervised method
clouds. Tags are selected on the basis of their frequency of
which needs the user’s action, while we are proposing a fully
use. Relationships among tags are defined in terms of their
unsupervised tag hierarchy construction procedure.
similarity, quantified by means of the Jaccard coefficient.
[5] proposes a method based on Formal Concept Analy-
K-means clustering is then applied on the tag similarity ma-
sis (FCA) to automatically acquire taxonomies or concept
trix, with an a priori chosen number of clusters and fixed
hierarchies out of domain-specific texts. The method re-
number of selected relevant tags. Unlike this work, which
lies on natural language processing tools; in particular, it
clusters tags for imposing some form of organization on the
is based on the analysis of the selectional restrictions that
tag clouds, we further aim to analyze the contexts in which
verbs pose on their arguments. [26] also relies on natural
tags are used and using this knowledge to create navigable
language processing tools as well as Wordnet to bootstrap
hierarchical structures of the tags. [15] furthers the work
taxonomies out of email collections, which are then further
in [14] by applying Multi-dimensional scaling, MDS, (using
refined using concept clustering techniques. [19] introduces
Pearson’s correlation as the tag similarity function) to cre-
a data-driven approach to semi-automatic construction of
ate a bi-dimensional space, which is then visualized through
multimedia ontologies. The authors concentrate on the ex-
a fish-eye system. We note, however, that while preserving
traction of ontologies from a video collection. The text re-
distances, MDS does not preserve the energies of the input
flecting the speech extracted from the video and the avail-
tags. Similarly, using only two dimensions can be overly
able textual annotations is processed. Concepts and rela-
lossy. Research on effective use of 2D spaces for multidi-
tionships to be included in the hierarchy are then manually
mensional data visualization focus on careful selection of
selected from the processed text. This is a major difference
the relevant dimensions [25] and organizing data in hierar-
with respect to our approach, which is fully unsupervised.
chical visualization structures, such as TreeMaps, along the
An unsupervised method uncovering latent topics in text
relevant dimensions, and mapping these two 2D spaces [3].
documents is described in [27]. Like ours, the proposed
[1] aims creating visually pleasant tag clouds, by present-
method is language neutral. To identify the concepts, [27]
ing tags in the form of seemingly random collections of circles
relies on the Markov Chain Monte Carlo process of Gibbs
with varying sizes: the size of the circle denotes its frequency.
sampling, following the Latent Dirichelet Allocation model,
Unlike our approach, this visualization scheme intentionally
while we use the SVD transform. To discover subsumption
randomizes the placement of the tags with the hope of pro-
relations, [27] uses conditional independence tests among the
jecting a more pleasant (if not highly informative) feeling.
concepts themselves, while we rely on the extended Boolean
[10] describes a system to visualize the semantic contained
model. [20] also relies on dependence (predictivity) to deter-
in a textual corpus. Like us, it relies on Latent Semantic
mine topic/sub-topic relationships, but skips the step for un-
Analysis techniques to extract the information about the
covering latent semantics; thus, the resulting structure can-
principal dimensions emerging from the text by means of
not be interpreted in terms of these concept dimensions. [16]
Singular Value Decomposition (SVD) applied on the term-
presents TaxaMiner for automated taxonomy construction
document frequency matrix [13]. After identifying the rele-
from a large corpus of documents, using a suite of clustering
vant dimensions, in [10], the authors apply Multidimensional
and NLP techniques. OntoMiner [7, 8] is an unsupervised
Scaling (MDS) [6] to reduce the dimensionality to two di-
system for bootstrapping and populating specialized domain
mensions, to allow the data visualization on the plan. As a
ontologies by organizing and mining taxonomy-directed Web
difference with respect to our work, the goal of [10] is purely
sites. The HTML regularities in the Web documents are at
the visualization of the concepts extracted from the text cor-
the basis of OntoMiner mining process. The HTML regular-
pora, to give a hint on their mutual distances and thus their
ities detected in Web documents are used and reflected into
mutual similarities, without any plan for other use of the
hierarchical semantic structures encoded as XML. A concept
extracted information.
77
taxonomy is then expanded with sub-concepts by selectively
crawling through the links corresponding to key concepts.
78
• the weight associated to the edge (tagi , tagj ) is
w(eij ) = 1 − cos(tvi , tvj ). Thus the weights reflect
the dissimilarity between the connected tags, i.e., low
weights correspond to high similarity values.
The tag hierarchy will be a subgraph of the tag graph. By
construction, the directionality of the edges in the tag graph
is such that higher energy tags precede the lower energy
ones, thus any subgraph satisfies the requirement that an-
cestors need to have higher energy than their descendants.
Figure 5: Outline of the hierarchy construction 2.4.2 Minimum Spanning Tree of the Tag Graph
process: Node labels reflect the order in which the To enforce the requirement that tags which are more sim-
corresponding tags are inserted in the tree: i.e., how ilar to each other are clustered in the same subtree, we rely
dominant they are on the weights associated to the edges. By construction, the
weights associated to each edge reflect the degree of similar-
ity of the tags associated to the connected vertices: the lower
space. For example, consider Figure 4 which shows a simple the weight, the higher the similarity. Thus, minimizing the
tag space, with two bases roughly corresponding to two pred- overall cost of a subgraph corresponds to maximizing the
icates about oil() and about business(). According to Ex- overall mutual similarity of the pairs of tags associated to
tended boolean interpretation of the tag space, the tag “gas” the connected nodes. This leads to the following statement:
has higher coverage (i.e., energy) than the tags “price” and
“ref inery”. Consequently, based on the extended Boolean Statement 1. Given a tag graph, its Directed Minimum
interpretation, we can argue that, as long as they are not Spanning Tree satisfies the requirements that
stop-words, high energy tags are more appropriate to be
placed closer to the root in the hierarchy as they provide • in any branch on the tree, tags closer to the root have
the context in which the other tags are used. higher energy than those further down in the hierarchy,
Based on the above observation, TMine organizes the tags and
distributed in the tag space into a hierarchy by differentiat- • the tree is compact in terms of tag distances.
ing contexts in which tags are used from each other and sepa-
rating the tags into different branches of the hierarchy based The minimization of the cost of the returned tree corre-
on these topics. If a tag is used in multiple contexts (e.g., sponds, by construction, to the maximization of the cosine
“gas” which may be used both in terms of business and oil), similarities between the tag vectors associated to the con-
then it is identified to have higher coverage than say “refin- nected nodes. Roughly speaking, each node is placed under
ery” (used mainly in oil related blogs and news articles) and the higher energy node that is most related to it.
“price” (used mainly in business related entries), and thus Notice that, while the construction of the Directed Mini-
can be used as a context for “refinery” and “price”. Thus, mum Spanning Tree can be solved in quadratic time (in the
the tag construction process is guided by the context defined size of the graph), using the Chu-Liu/Edmonds algorithm
by the higher energy tags in the tag space: each tag vector [6], in our case the problem is much simpler since, due to
defines a context and, thus, imposes a contextual-constraint the nature of the edges’ orientation, the tag graph does not
on the other tags that will be considered as candidates to contain any loop. Thus, Chu-Liu/Edmonds algorithm [6] re-
be included in the corresponding subtree. turns the spanning tree in linear time. This is due to the fact
that, in the absence of loops, the algorithm simply returns
2.4 Construction of the Tag Hierarchy the tree consisting of all the nodes of the graph, each one
In this subsection, we present the tag construction algo- connected through the edge with minimum weight among
rithm in more detail. the ones entering it in the given graph.
2.4.1 From Tag Space to Tag Graph 2.4.3 Preventing Context Drift
TMine considers a tag graph which captures the mutual Caution, however, is needed against context-drift of the
energy and contextual similarity relationships among tags (in tags in the hierarchy. This is because, the Directed Mini-
reality this graph is never materialized; see Section 2.4.3): mum Spanning Tree of the tags graph does not guarantee
that the similarity between any child node and its parent
Definition 2.1 (Tag Graph). Let T = is significant enough. The minimum spanning tree is opti-
{tag1 , . . . , tagn } be a set of tags (possibly pre-filtered mized against the overall cost associated to the entire tree,
based on their energies), each associated to its correspond- but it does not guarantee local optimization at the node
ing tag vector tvi . G = (V, E) is the directed, weighted level. On the other hand, tags that are to be considered
graph such that under a given tag need to be sufficiently close to the corre-
sponding tag vector. Thus, to avoid improper connections
• the set of vertices is V = T ∪ {All}, where All is a tag
between a node and other nodes more shared than itself,
with maximal energy, i.e., the tag whose corresponding
we enforce a similarity lowerbound, simlowerbound ; i.e., the
tag vector has the maximal length in the given space.
minimum cosine similarity between any parent-child pair in
• there is an edge eij = (tagi , tagj ) iff tagi has higher the hierarchy1 . Thus, if for a given tag node, tagi , there is
energy than tagj , i.e, iff tvi is longer than tvj (for this 1
The only node that is allowed to violate the constraint as-
construction, we denote All as the tag tag0 ); sociated to the threshold is “All”, which covers all tags.
79
TreeGen( C, simlowerbound )
Links the tags in C in a tree structured graph reflecting their shar-
ing factor, or energy
begin
1. Let tag0 , tag1 , . . . , tagm be the tags in C, in descending or-
der of energy;
2. attach to(tag1 , tag0 );
3. for i = 2, . . . , m
(a) bestSimilarity = 0;
(b) for j = 1, . . . , i − 1
i , tag
i. cosineSimilarity = cos(tag j );
ii. if bestSimilarity < cosineSimilarity Figure 7: Tag vectors associated to the nodes of a
A. bestSimilarity = cosineSimilarity;
B. bestN ode = tagj ;
tag hierarchy using CP/CV [17]: each tag vector has
one dimension for each tag-node in the hierarchy
4. if (bestSimilarity < simlowerbound )
then attach to(tagi , tag0 ) else attach to(tagi , bestN ode)
80
3.2 Measuring the Difference between
Relative KL-distances of Hierarchies
Hierarchies
1.2
We evaluate the difference between a given pair of hier-
T(100PermutedD2)
if a node appears in the common sub-hierarchy, its parent is
0.8
the closest ancestor (in the original hierarchy) which is also no permutation vs.
T(D1) vs.
a common node, and thus also appears in the sub-hierarchy. 100% permutation
0.6 Linear (no permutation
We then compute CP/CV vectors of the nodes of these vs. 100% permutation)
two extracted sub-hierarchies. In particular, we concentrate 0.4
on the tag vectors associated to their roots. Intuitively, the
tag vector associated to the root of a hierarchy represents 0.2
the semantic structure of the entire hierarchy.
0.0
We finally compare the CP/CV vectors of the roots of the
0.0 0.2 0.4 0.6 0.8 1.0 1.2
given hierarchies: T(D1) vs. T(D2)
81
Relative KL-distances of Hierarchies
5. REFERENCES
(News Article Data Set) [1] http://phasetwo.org/post/a-better-tag-cloud.html, 05/31/08.
3.5
[2] http://www.usatoday.com, http://www.abcnews.go.com,
http://www.forbes.com, http://www.nytimes.com,
vs.T(100permuted_article_set2)
3 http://www.smh.com.au, http://www.guardian.co.uk,
http://news.bbc.co.uk, http://www.cbsnews.com,
2.5 http://www.chron.com, http://www.washingtonpost.com,
T(article_set1)
2
http://news.com.com, http://www.boston.com, and
http://www.iht.com
1.5 [3] G.Chintalapani, C. Plaisant, B.Shneiderman: Extending the
Utility of Treemaps with Flexible Hierarchy. IV’04, 2004.
1
[4] S. Deerwester, S. Dumais, G.Furnas, R. Harshman, T.
0.5 Landauer, K. Lochbaum and L. Streeter. Computer
Information Retrieval using Latent Semantic Structure, US
0 Patent, 1989.
0 0.5 1 1.5 2 2.5 3 3.5
[5] P. Cimiano, S. Staab, J. Tane. Automatic Acquisition of
T(article_set1) vs. T(article_set2) Taxonomies from Text: FCA meets NLP. ECML/PKDD
Workshop on Adaptive Text Extraction and Mining , 2003.
[6] M.F. Cox, M.A.A. Cox. Multidimensional Scaling. Chapman
Figure 9: Experimental results for the News Articles and Hall, 2001.
Data Set [7] H. Davulcu, S. Vadrevu, and S. Nagarajan OntoMiner:
Bootstrapping and Populating Ontologies From Domain
Specific Web Sites SWDB 2003.
that the results for the case with lower distortion (first plot) [8] H. Davulcu, S. Vadrevu, S. Nagarajan. OntoMiner:
are closer to the 450 line, indicating better coherence), while Bootstrapping Ontologies From Overlapping Domain Specific
in the cases of large distortions (second plot), the results are Web Sites. 13th Int. WWW Conference 2004.
further away from the 450 line, indicating worse coherence.
[9] L.Di Caro, K.S. Candan, M.L. Sapino. Using tagFlake for
Condensing Navigable Tag Hierarchies from Tag Clouds.
Thus, the second hypothesis is also consistently confirmed KDD08 (demo session), 2008.
by the experiments. [10] B. Fortuna, D. Mladenic, and M. Grobelnik. Visualization of
text document corpus. Informatica Journal, 29 (2005), 497-502
3.3.2 News Articles Results [11] B. Fortuna, M. Grobelnik, and D. Mladenic. System for
semi-automatic ontology construction. ESWC, 2006.
As shown in Figure 9, the experiment results on the News [12] B. Fortuna, M. Grobelnik and D. Mladenic. Semi-automatic
Articles data set are consistent with the results on the edited data-driven ontology construction system KDD, 2006.
[13] C. Eckart and G. Young. The approximation of one matrix by
manuscripts. Note that, for this data set, there are some another of lower rank. Psychometrika, I, 211-218, 1936.
cases which fell below the 450 line. This is due to the fact [14] Y. Hassan-Montero, V. Herrero-Solana. Improving Tag-Clouds
that randomness in the construction of the article sets (i.e., as Visual Information Retrieval Interfaces. Int. Conf. on
selection of random news entries around the same time frame Multidisciplinary Information Sciences and Technologies, 2006.
[15] Y. Hassan-Montero, V. Herrero-Solana. Interfaz visual para
to construct individual sets of articles) results in some pairs recuperación de información basada en análisis de metadatos,
of sets that are not comparable to each other in terms of escalamiento multidimensional y efecto ojo de pez El
topical content. Profesional de la Información 15(4), 2006.
4. CONCLUSION
Summaries for Web Searches. SIGIR 2003.
[21] Yan Qi and K. Selcuk Candan. CUTS: CUrvature-Based
In this paper, we presented a completely unsupervised al- Development Pattern Analysis and Segmentation for Blogs and
other Text Streams. Hypertext 2006.
gorithm for constructing tag hierarchies from a given social [22] R. Rada, H. Mili, E.Bicknell, and M. Blettner. Development
media collection. The algorithm first performs latent se- and application of a metric on semantic nets. IEEE
mantic analysis of the resulting corpus of documents to map Transactions on Systems, Man and Cybernetics, 19(1), 1989.
each term/tag onto a tag space with independent basis di- [23] R. Richardson and A.F. Smeaton. Using WordNet in a
Knowledge-Based Approach to Information Retrieval. Working
mensions. It then analyzes the resulting space to rank-order paper CA-1294, Dublin City Univ., Dublin, 1994.
the tag in terms of their relative degrees of energy. Once [24] G. Salton, E.A. Fox, and H. Wu. Extended Boolean
this energy is computed, the algorithm clusters similar tags information retrieval. CACM, 26(11). 1983.
under the most suitable shared node to obtain a tag hierar- [25] J.Seo and B.Shneiderman. A Rank-by-Feature Framework for
Unsupervised Multidimensional Data Exploration Using Low
chy, where tags closer to the root are more shared than those Dimensional Projections. Infovis’04 2004.
further down in the hierarchy and, given all the descendants [26] Hui Yang and Jamie Callan. Ontology generation for large
of a tag, those that are similar to each other are clustered in email collections. DG.O, 2008
the same subtree. Experimental results show that the pro- [27] E. Zavitsanos, G. Paliouras, G. A. Vouros, and S. Petridis.
Discovering Subsumption Hierarchies of Ontology Concepts
posed scheme is able to extract compatible tag-hierarchies from Text Corpora. Int. Conf. on Web Intelligence, 2007
from different text on the same general topic.
82