TEXT MINING:
APPROACHES AND APPLICATIONS¹
Miloš Radovanović², Mirjana Ivanović²
1. Introduction
Text mining is a new area of computer science which fosters strong connections
with natural language processing, data mining, machine learning, information
retrieval and knowledge management. Text mining seeks to extract useful
information from unstructured textual data through the identification and
exploration of interesting patterns [2]. This paper will discuss several
approaches to the identification of global patterns in text, based on the
bag-of-words (BOW) representation described in Section 2. The covered
approaches are automated classification and clustering (Section 3), and
dimensionality reduction (Section 5). Pattern exploration will be illustrated
through two applications from our recent work: presentation of Web meta-search
engine results (Section 4) and visualization of coauthorship relationships
automatically extracted from a semi-structured collection of documents
describing researchers in the Serbian province of Vojvodina (Section 6).
Finally, preliminary results concerning the application of dimensionality
reduction techniques to problems in sentiment classification are presented in
Section 7.
¹ This work was supported by the project "Abstract Methods and Applications in Computer
Science" (no. 144017A) of the Serbian Ministry of Science and Environmental Protection.
² University of Novi Sad, Faculty of Science, Department of Mathematics and Informatics,
Figure 1: Results for the query "animals england" classified into Arts Music
5. Dimensionality Reduction
It is clear that even for small document collections the BOW document
vector will have high dimensionality. This may hinder the application of ML
methods not only for technical reasons, but also by degrading the performance of
learning algorithms which cannot scale to such high dimensions. There are two
230 M. Radovanović, M. Ivanović
[Figure: diagram with labels "Chair of Numerical Mathematics", "Chair of Information Systems", "Chair of Computer Science"]
[Figure: panels (a) and (b); y-axis: Accuracy (%); x-axis: Number of features; legend: IG, SVD, SIMPLS, SVD+LDA, LDA/QR (1), SRDA (1), All features; SVM, Naïve Bayes]
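To make the dimensionality issue raised in Section 5 concrete, the following minimal Python sketch builds BOW count vectors for a toy corpus and then keeps only the terms with the highest information gain (the "IG" entry in the plot legend). The corpus, the labels, the function names and the value of k are illustrative assumptions. The binary term-presence form of information gain used here is one common variant, and not necessarily the exact setup behind the reported results.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_bow(docs):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, count in Counter(tokenize(doc)).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(vectors, labels, j):
    """Information gain of the binary event 'term j occurs in the document'."""
    present = [y for x, y in zip(vectors, labels) if x[j] > 0]
    absent = [y for x, y in zip(vectors, labels) if x[j] == 0]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_top_k(vocab, vectors, labels, k):
    """Keep the k terms with the highest information gain (ties broken by term)."""
    scores = [information_gain(vectors, labels, j) for j in range(len(vocab))]
    keep = sorted(range(len(vocab)), key=lambda j: (-scores[j], vocab[j]))[:k]
    keep.sort()  # preserve the original column order in the reduced vectors
    reduced = [[vec[j] for j in keep] for vec in vectors]
    return [vocab[j] for j in keep], reduced


# Hypothetical toy corpus: two positive and two negative "reviews".
docs = [
    "great film with a great cast",
    "a great and moving story",
    "boring film with a weak plot",
    "a weak and boring story",
]
labels = ["pos", "pos", "neg", "neg"]

vocab, vectors = build_bow(docs)
top_terms, reduced = select_top_k(vocab, vectors, labels, k=3)
print(len(vocab), "features before,", len(top_terms), "after:", top_terms)
```

Even these four short documents produce an 11-dimensional BOW space. The three terms retained by the selection step (boring, great, weak) are exactly those that occur in documents of only one class, which is the kind of term a classifier benefits from most.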
8. Conclusion
Classification, clustering and dimensionality reduction are only some of the
methods; enhancing Web search, mining bibliographic data and sentiment
classification are only some of the applications. The aim of this paper was to provide
an illustration of the already vast field of text mining through discussion of our
recent work, together with preliminary results which connect dimensionality
reduction with sentiment classification.
References
[1] Cai, D., He, X., Han, J., SRDA: An efficient algorithm for large-scale discriminant
analysis. IEEE T. Knowl. Data En., 20(1) (2008), 1-12.
[2] Feldman, R., Sanger, J., The Text Mining Handbook. Cambridge University
Press, 2007.
[3] de Jong, S., SIMPLS: An alternative approach to partial least squares regression.
Chemometr. Intell. Lab., 18(3) (1993), 251-263.
[4] Pang, B., Lee, L., A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In: Proceedings of the ACL, pages
271-278, 2004.
[5] Pang, B., Lee, L., Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In: Proceedings of the ACL, pages
115-124, 2005.
[6] Radovanović, M., Ferlež, J., Mladenić, D., Grobelnik, M., Ivanović, M., Mining
and visualizing scientific publication data from Vojvodina. Novi Sad Journal of
Mathematics 37(2) (2007), 161-180.
[7] Radovanović, M., Ivanović, M., CatS: A classification-powered meta-search engine.
In: Last, M., et al., editors, Advances in Web Intelligence and Data Mining, pages
191-200, Springer-Verlag, 2006.
[8] Sebastiani, F., Text categorization. In: Zanasi, A., editor, Text Mining and its
Applications, pages 109-129, Southampton, UK: WIT Press, 2005.
[9] Torkkola, K., Linear discriminant analysis in document classification. In: IEEE
ICDM Workshop on Text Mining, pages 800-806, 2001.
[10] Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann Publishers, 2nd edition, 2005.
[11] Ye, J., Li, Q., A two-stage linear discriminant analysis via QR-decomposition.
IEEE T. Pattern Anal. 27(6) (2005), 929-941.
[12] Zeng, X.-Q., Wang, M.-W., Nie, J.-Y., Text classification based on partial least
square analysis. In: Proceedings of ACM SAC, pages 834-838, 2007.