Академический Документы
Профессиональный Документы
Культура Документы
www.elsevier.com/locate/dsw
School of Accountancy and Information Management, SAIM, College of Business, Arizona State University,
Box 873606, Tempe, AZ 85287-3606, USA
b
Department of MIS, School of Business and Public Administration, University of Arizona, Tucson, AZ 85721, USA
Abstract
This work demonstrates how the World Wide Web can be mined in a fully automated manner for discovering the semantic
similarity relationships among the concepts surfaced during an electronic brainstorming session, and thus improving the
accuracy of automated clustering meeting messages. Our novel Context Sensitive Similarity Discovery (CSSD) method takes
advantage of the meeting context when selecting a subset of Web pages for data mining, and then conducts regular concept cooccurrence analysis within that subset. Our results have implications on reducing information overload in applications of text
technologies such as email filtering, document retrieval, text summarization, and knowledge management.
D 2002 Elsevier Science B.V. All rights reserved.
Keywords: Data mining; Context sensitive similarity discovery; Empirical study; Group decision support systems; Internet; Machine learning;
Organizational concept space; Text clustering; Web mining
1. Introduction
Many researchers and practitioners believe that the
World Wide Web is a gold mine filled with useful
information. Indeed, such vast amount of textual and
multimedia information was not available for
researchers before the mid 90s. According to Lyman
and Varian [18], the Web currently contains more than
2.5 billion of pages, consisting of at least 10 terabytes
of textual information. What also makes it a novel and
exciting opportunity is its massive hyperlink structure
and the ability to trace peoples browsing patterns.
Although the term Web mining started to appear
*
Corresponding author.
E-mail addresses: dmitri.roussinov@asu.edu (D. Roussinov),
lzhao@bpa.arizona.edu (J.L. Zhao).
frequently in journals and magazines, not many successfully tested applications have been reported yet.
This paper presents our empirically proven Web
mining approach that is capable of discovering semantic relationships between specified concepts, and as a
result, helps to organize messages produced during
electronic meetings supported by Group Decision
Support Systems.
Electronic meeting support has been proven to
have great impact on productivity of group discussions [6,20]. However, it has been reported that Group
Decision Support Systems (GDSS) meetings (or computer-mediated meetings in general) very often result
in information overload [2,11,14]. Indeed, due to
anonymity, the ability to contribute in parallel and
automated record keeping, the number of text messages generated by a dozen of participants can exceed
0167-9236/02/$ - see front matter D 2002 Elsevier Science B.V. All rights reserved.
PII: S 0 1 6 7 - 9 2 3 6 ( 0 2 ) 0 0 1 0 2 - 1
150
Although our study focuses on resolving the information overload problem in computer-supported
meetings, the approach developed in our study is
useful to more general textual information processing
tasks such as search/retrieval, filtering, categorization,
and summarization.
The next section briefly reviews the pertinent
literature. Section 3 presents the research questions
and the research methodology. Section 4 gives the
details of the algorithms and their implementation
used in our experiments. Section 5 contains the results
of our experiments and their interpretation and implication. Finally, Section 6 concludes the paper and
outlines several directions for future research.
2. Literature review
2.1. Clustering and summarizing electronic meeting
messages
Several previous studies have explored automated
detection and summarization of structures in meeting
messages by identifying and listing the most important concepts [2], representing the messages with
semantic maps [21], or clustering the messages into
automatically created subsets of topics [23]. These
studies established the ability to organize the meeting
contents and to help prepare the meeting participants
for the decision making phase. However, the automated message clustering techniques do not always
offer satisfactory accuracy. When compared with the
results of manual clustering, researchers have found
that the outputs of computer algorithms do not always
match human expectations [21,23].
As we wrote in Introduction, these automatic message clustering approaches, along with other widely
accepted text technologies, inherently rely on automated indexing, by which each message is represented
with a vector, while the vector coordinates are determined by the words or phrases (called terms) in the
message. To perform clustering or to build semantic
maps, the algorithms rely on the computation of
similarity between the messages. While the particular
scoring formulas may be different (cosine, Jaccard,
Euclidean, etc.), all of them just implement simple
intuitive consideration that the more common words
the messages share the more similar they are. Mes-
151
152
153
3. Research design
In this section, we first set up the research questions
that we examine in this study. Then, we discuss the
experiments designed to answer our research questions
followed by the explanation of the metrics we used and
our hypothesis in the operationalized form.
3.1. Research questions
We pose four research questions that will accomplish our research objective aforementioned:
Q1: What is the degree of the agreement among
subjects with respect to the perceived relationships
between concepts mentioned in the meeting?
This was a basic and necessary question. If there
was no agreement among the meeting participants with
respect to the relationships between concepts, then the
entire idea of trying to represent them using any
framework would likely to be futile. On the other
hand, if the agreement was high, then we may try to
capture them by an OCS or a similar framework. This
leads to our second question:
154
Table 1
List L: 20 most frequently occurring concepts during the meeting
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
MEETINGS
MEETING
TECHNOLOGY
COLLABORATIVE SYSTEMS
INFORMATION
COLLABORATIVE
DISTRIBUTED
SYSTEMS
LINEAR THREAD MEETING
ENVIRONMENTS
NETWORKS
SUPPORT
NOTES
HARDWARE
FACILITATORS
TECHNOLOGIES
LANGUAGE
WIRELESS
NETWORK
BANDWIDTH
155
E
At
156
157
158
159
Table 2
A fragment of the list of the queries used to collect Web pages for mining
160
5. Experimental results
5.1. Analysis of results
This section presents our results using the metrics
introduced earlier in Section 3 and discusses their
implications.
Table 3
A fragment of a listing with mined similarity relationships
No OCS
With M-OCS
Difference
1
2
3
4
5
6
7
8
9
10
11
Average
Standard deviation
Standard mean error
t-test, p-value
0.63
0.89
0.84
0.61
0.60
0.79
0.84
0.82
0.81
0.74
0.78
0.76
0.10
0.03
0.50
0.82
0.77
0.48
0.47
0.76
0.89
0.78
0.82
0.73
0.62
0.69
0.15
0.05
0.13
0.06
0.07
0.13
0.13
0.03
0.05
0.04
0.02
0.02
0.17
0.07
0.07
0.02
0.004346
161
162
Table 5
Normalized cluster error with and without I-OCS
Subject
No OCS
With I-OCS
Difference
1
2
3
4
5
6
7
8
9
10
11
Average
Standard deviation
Standard mean error
t-test, p-value
0.63
0.89
0.84
0.61
0.60
0.79
0.84
0.82
0.81
0.74
0.78
0.76
0.10
0.03
0.62
0.84
0.84
0.56
0.54
0.78
0.82
0.74
0.77
0.73
0.68
0.72
0.11
0.03
0.02
0.05
0.00
0.05
0.06
0.02
0.03
0.08
0.04
0.02
0.10
0.04
0.03
0.01
0.0004489
clustering accuracy was detected under any combinations of parameters in either the manually built or the
mined concept spaces. This finding supported our
belief that the correctness of similarity relationships
was crucial for the discovered effects.
5.2.4. Importance of knowing the context
We also investigated the importance of knowing
the context for both manual and mined concept
spaces. To explore the M-OCS case, we asked three
(3) other subjects to perform the concept selection
tasks, but without preceding it with the clustering
tasks. Thus, those three subjects were not familiar
with the context of the meeting. We built and applied
the OCS built from this non-context specific data the
same way as we described above. In that case the
accuracy of clustering became only worse! The normalized clustering error was in 0.78 0.85 range,
under any combination of parameters. This finding
confirmed our conjecture that knowing the context
was crucial for the effectiveness when OCS was built
manually. Having only three subjects for this simple
test was a limitation, but we still think it is worth
mentioning this additional finding.
To test the importance of the context to the mined
OCS case, we created a different collection for mining
in such a way that we did not provide contextual cues
to the underlying search engine. Specifically, we
simplified our heuristic querying algorithm by removing all the nonrequired keywords (not preceded with
+ sign in Table 2). Each of the 30 queries consisted
of exactly one of the top 30 most frequent concepts.
Those queries resulted in the pages about each of the
concepts taken stand-along, i.e., the top 200 pages
from AltaVista about MEETING, then the top 200
about LOTUS NOTES, etc., ignoring the context in
which the concepts were mentioned. We also manually double-checked that those pages were more
general, thus as we hoped, much less context specific.
We obtained 5241 pages after removing duplicates
and empty pages. We randomly selected 3270 from
them, the same number as for context specific mining
in order to make comparison more objective. Then,
we mined the OCS from the selected documents in the
same way as described above. The resulting improvement was indeed smaller than with the I-OCS using
context (0.74 normalized clustering error, instead of
0.72), with the difference between them being statisti-
163
164
Acknowledgements
It is our pleasure to acknowledge the help from
Artificial Intelligence Laboratory and the Center for
Management at the University of Arizona for providing data sets and software to support our study.
References
[1] H. Chen, K.J. Lynch, Automatic construction of networks
of concepts characterizing document databases, IEEE Transactions on Systems, Man, and Cybernetics 22 (5) (1992)
885 902.
165
[2] H. Chen, P. Hsu, R. Orwig, L. Hoopes, J.F. Nunamaker, Automatic concept classification of text from electronic meetings,
Communications of the ACM 37 (10) (1994) 56 73.
[3] H. Chen, B.R. Schatz, T.D. Ng, J.P. Martinez, A.J. Kirchhoff,
C. Lin, A parallel computing approach to creating engineering
concept spaces for semantic retrieval: the Illinois digital library
initiative project, IEEE Transactions on Pattern Analysis and
Machine Intelligence 18 (8) (August 1996) 771 782.
[4] H. Chen, C. Schuffels, R. Orwig, Internet categorization and
search: a self-organizing approach, Journal of Visual Communication and Image Representation 7 (1) (1996) 88 102.
[5] H. Chen, J. Martinez, A. Kirchhoff, T.D. Ng, B.R. Schatz,
Alleviating search uncertainty through concept associations:
automatic indexing, co-occurrence analysis, and parallel computing, Journal of the American Society for Information Science 49 (3) (1998) 206 216.
[6] T. Connolly, L.M. Jessup, J.S. Valacich, Effects of anonymity
and evaluative tone on idea generation in computer mediated
groups, Management Science 36 (6) (1990) 689 703.
[7] R. Cooley, B. Mobasher, J. Srivastava, Web mining: information and pattern discovery on the world wide web, in: R.
Cooley, J. Srivastava (Eds.), Proceedings of the 9th IEEE
International Conference on Tools with Artificial Intelligence
(ICTAI97), November 1997.
[8] C.J. Crouch, An approach to the automatic construction of
global thesauri, Information Processing and Management 26
(5) (1990) 629 640.
[9] J.A. Firth, Synopsis of Linguistic Theory 1930 1955, Studies
in Linguistic Analysis, Philological Society, Oxford, 1957,
reprinted in Palmer, F. (ed. 1968) Selected Papers of J.R. Firth,
Longman, Harlow.
[10] G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The
vocabulary problem in human-system communication, Communications of the ACM 30 (11) (1987) 964 971.
[11] R.B. Gallupe, W.H. Cooper, Brainstorming electronically,
Sloan Management Review 35 (1) (Fall 1993) 27 36.
[12] G. Grefenstette, Explorations in Automatic Thesaurus Discovery, Kluwer Academic Publishing, Boston, MA, 1994.
[13] J.M. Hammersly, D.C. Handscomb, Monte Carlo Methods,
Methuen, London, 1964.
[14] S.R. Hiltz, M. Turoff, Structuring computer-mediated communication systems to avoid information overload, Communications of the ACM 28 (7) (1985) 680 689.
[15] A. Houston, Information classification: usability studies on
two automatic approaches, Doctoral Dissertation, University
of Arizona (1998).
[16] L. Hubert, P. Arabie, Comparing partitions, Journal of Classification, (1985) 193 218.
[17] N. Ide, J. Veronis, Word sense disambiguation: the state of the
art, Computational Linguistics 241 (1998) 1 40.
[18] P. Lyman, H. Varian, How Much Information? A project report
of the Regents of the University of California, available at
http://www.sims.berkeley.edu/how-much-info (2000).
[19] J. Minker, G.A. Wilson, B.H. Zimmerman, An evaluation of
query expansion by the addition of clustered terms for a document retrieval system, Information Storage and Retrieval,
(1972) 329 348.
166
[20] J.F. Nunamaker Jr., A.R. Dennis, J.S. Valacich, D.R. Vogel,
J.F. George, Electronic meeting systems to support group
work: theory and practice at Arizona, Communications of
the ACM 34 (7) (1991) 40 61.
[21] R.E. Orwig, H. Chen, J.F. Nunamaker, A graphical, self-organizing approach to classifying electronic meeting output,
Journal of the American Society for Information Science 48
(2) (1997) 157 170.
[22] H.J. Peat, P. Willett, The limitations of term co-occurrence data
for query expansion in document retrieval systems, Journal of
the American Society for Information Science 42 (5) (1991)
378 383.
[23] D. Roussinov, H. Chen, Document clustering for electronic
meetings: an experimental comparison of two techniques, Decision Support Systems 27 (1 2) (1999) 67 79.
[24] D. Roussinov, K. Tolle, M. Ramsey, M. McQuaid, H. Chen,
Visualizing internet search results with adaptive self-organizing maps, Proceedings of ACM SIGIR, August 15 19, 1999,
Berkeley, CA.
[25] G. Salton, M.J. McGill, Introduction to Modern Information
Retrieval, McGraw-Hill, New York, 1983.
[26] C.J. van Rijsbergen, A theoretical basis for the use of cooccurrence data in information retrieval, Journal of Documentation 33 (2) (1977) 106 119.
[27] J. Ward, Hierarchical grouping to optimize an objection function, Journal of the American Statistical Association 58 (1963)
236 244.
[28] J. Xu, W.B. Croft, Corpus-based stemming using co-occurrence of word variants, ACM Transactions on Information
Systems 16 (1) (1998) 61 81.
[29] J.L. Zhao, A. Kumar, E.A. Stohr, A dynamic grouping technique for distributing codified-knowledge in large organizations, Proceedings of the 10th Workshop on Information
Technology and Systems, December 9 10, 2000, Brisbane
Australia.