
Online New Event Detection Based on IPLSA

Xiaoming Zhang and Zhoujun Li


School of Computer Science and Engineering, Beihang University, 100083 Beijing
yolixs@163.com

Abstract. New event detection (NED) involves monitoring one or multiple news streams to detect the stories that report on new events. With the overwhelming volume of news available today, NED has become a challenging task. In this paper, we propose a new NED model based on incremental PLSA (IPLSA), which can handle new documents arriving in a stream and update parameters with lower time complexity. Moreover, to avoid the limitations of the TF-IDF method, a new approach to term reweighting is proposed. By dynamically exploiting the importance of documents in the discrimination of terms, together with the topic information of documents, this approach is more accurate. Experimental results on the Linguistic Data Consortium (LDC) dataset TDT4 show that the proposed model improves both recall and precision of the NED task significantly, compared to the baseline system and other existing systems.

Keywords: New event detection, PLSA, Term reweighting, TDT.

1 Introduction
The Topic Detection and Tracking (TDT) program, a DARPA-funded initiative, aims to develop technologies that search, organize and structure multilingual news-oriented textual materials from a variety of sources. A topic is defined as a seminal event or activity, along with directly related events and activities [1]. An earthquake at a particular place could be an example of a topic. The first story on this topic is the story that first reports the earthquake's occurrence. The other stories that make up the topic are those discussing the death toll, the rescue efforts, the commercial impact, and so on. In this paper we define New Event Detection (NED) as the task of online identification of the earliest story for each topic as soon as that report arrives in the sequence of documents. NED systems are very useful in situations where novel information needs to be ferreted out from a mass of rapidly growing data. Examples of real-life scenarios are financial markets, news analysis, intelligence gathering, etc. [2]. On the other hand, NED is an open challenge in text mining; it has been recognized as the most difficult task in the research area of TDT. A performance upper-bound analysis by Allan et al. [3] provided a probabilistic justification for the observed performance degradation in NED compared to event tracking, and suggested that new approaches must be explored in order to significantly enhance the current performance level achieved in NED.

Generally speaking, NED is difficult for several reasons. First, it has to be done in an online fashion, which imposes constraints on both strategy and efficiency. Second, similar to other problems in text mining, we have to deal with a high-dimensional space with tens of thousands of features. Finally, the number of topics can be as large as thousands, as in newswire data. In this paper we reduce the dependence on TF-IDF weighting with new approaches that exploit latent analysis of stories and reweight terms based on other information about documents and terms. We also use the aging character of stories to detect new events. Experiments show that this approach improves the recall and precision of NED greatly, and that it identifies both new stories and old stories well.

2 Related Work

In current NED systems, there are mainly two methods of comparing the news story at hand with previous topics. In the first, each news story is compared to all previously received stories; Papka et al. proposed Single-Pass clustering for NED in this manner [11]. The other method organizes previous stories into clusters, each of which corresponds to a topic, and a new story is compared to the previous clusters instead of to individual stories. Lam et al. build up query representations of story clusters, each of which corresponds to a topic [12]. In this manner comparisons happen between stories and clusters, and concept terms of a story, besides named entities, are derived from statistical context analysis on a separate concept database. Nevertheless, it has been shown that this manner is less accurate [5, 6].

In recent years, most work has focused on better methods of story comparison and document representation. Stokes et al. [14] utilized a combination of evidence from two distinct representations of a document's content; a marginal increase in effectiveness was achieved when the combined representation was used. In paper [15], an ONED framework is proposed that combines indexing and compression methods to improve the document processing rate by orders of magnitude. However, none of these systems considered that terms of different types (e.g., noun, verb, or person name) have different effects for different classes of stories. Some efforts have been made to utilize named entities to improve NED [7, 8, 9, 10]. Yang et al. gave location named entities four times the weight of other terms and named entities [7]. The DOREMI research group combined semantic similarities of person names, location names, and time together with textual similarity [8, 9]. The UMass research group [10] split the document representation into two parts: named entities and non-named entities. Paper [4] assumes that every news story is characterized by a set of named entities and a set of terms that discuss the topic of the story; it is assumed that a new story can share at most one of either the named-entity terms or the topic terms with a single story, and further, that these two stories must themselves be on different topics. Named entities are also used in other research [17, 18]. There are other methods that use an indexing tree or a probabilistic model to improve NED. In paper [16] a new NED model is proposed that speeds up the NED task by using a news indexing-tree dynamically. Probabilistic models for online clustering of documents, with a mechanism for handling the creation of new clusters, have also been developed, where each cluster is assumed to correspond to a topic; experimental results did not show any improvement over baseline systems [19].

In practice, it is not possible to determine the true optimal threshold, because there is no knowledge about incoming and future news documents. Thus, to preserve high performance for non-optimal thresholds, an event analysis algorithm has to be resilient. In this paper we use IPLSA to exploit the latent relations between stories; it reduces the dependence on an optimal threshold because it identifies new and old stories more accurately. By using an incremental approach, IPLSA greatly reduces the time of parameter re-estimation. We also reduce the dependence of term weights on TF-IDF by introducing another term weighting algorithm, which exploits the discriminative power of documents and topics in term weighting.

3 PLSA Methods and Existing Incremental PLSA Methods


The Probabilistic Latent Semantic Analysis (PLSA) model incorporates higher-level latent concepts or semantics to smooth the weights of terms in documents [20]. The latent semantic variables can be viewed as intermediate concepts or topics placed between documents and terms. Meanwhile, the associations between documents, concepts, and terms are represented as conditional probabilities and are estimated by the EM algorithm, an iterative technique that converges to a maximum likelihood estimator under incomplete data [21]. After the PLSA parameters have been estimated, the similarities between new documents (called query documents in [20]) and existing documents can be calculated by using the smoothed term vectors. The PLSA algorithm, which can be used in text classification and information retrieval applications [22], achieves better results than traditional VSM methods [20].

3.1 PLSA Model

PLSA is a statistical latent class model that has been found to provide better results than LSA for term matching in retrieval applications. In PLSA, the conditional probability between documents d and words w is modeled through a latent variable z, which can be loosely thought of as a class or topic. A PLSA model is parameterized by P(w|z) and P(z|d); words may belong to more than one class, and a document may discuss more than one topic. It is assumed that the distribution of words given a class, P(w|z), is conditionally independent of the document, i.e., P(w|z, d) = P(w|z). Thus the joint probability of a document d and a word w is represented as:

$P(w, d) = P(d) \sum_{z} P(w|z)\,P(z|d)$    (1)

The parameters of a PLSA model, P(w|z) and P(z|d), are estimated using the iterative Expectation-Maximization (EM) algorithm to fit a training corpus D by maximizing the log-likelihood function L:

$L = \sum_{d \in D} \sum_{w \in d} f(d, w) \log P(d, w)$    (2)

where f(d, w) is the frequency of word w in document d [11]. Starting from random initial values, the EM procedure iterates between 1) the E-step, where the probability that a word w in a particular document d is explained by the class corresponding to z is estimated as:

$P(z|w,d) = \frac{P(w|z)\,P(z|d)}{\sum_{z'} P(w|z')\,P(z'|d)}$    (3)
and 2) the M-step, where the parameters P(w|z) and P(z|d) are re-estimated to maximize L:

$P(w|z) = \frac{\sum_{d} f(d,w)\,P(z|w,d)}{\sum_{d} \sum_{w'} f(d,w')\,P(z|w',d)}$    (4)

$P(z|d) = \frac{\sum_{w} f(d,w)\,P(z|w,d)}{\sum_{z'} \sum_{w} f(d,w)\,P(z'|w,d)}$    (5)
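To make Eqs. (3)-(5) concrete, here is a minimal batch PLSA EM sketch in Python/NumPy; the dense count matrix F holding f(d, w) and all variable names are our own illustration, not from the paper:

```python
import numpy as np

def plsa_em(F, K, n_iter=50, seed=0):
    """Batch PLSA via EM. F is a (D, W) matrix of word counts f(d, w)."""
    rng = np.random.default_rng(seed)
    D, W = F.shape
    # Random initial values for P(w|z) and P(z|d), each normalized to sum to 1.
    P_w_z = rng.random((K, W)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = rng.random((D, K)); P_z_d /= P_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step (Eq. 3): P(z|w,d) is proportional to P(w|z) * P(z|d).
        joint = P_z_d[:, None, :] * P_w_z.T[None, :, :]       # (D, W, K)
        P_z_wd = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step (Eqs. 4-5): renormalize the expected counts f(d,w) * P(z|w,d).
        expected = F[:, :, None] * P_z_wd                     # (D, W, K)
        P_w_z = expected.sum(axis=0).T                        # sum over d -> (K, W)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + 1e-12
        P_z_d = expected.sum(axis=1)                          # sum over w -> (D, K)
        P_z_d /= P_z_d.sum(axis=1, keepdims=True) + 1e-12
    return P_w_z, P_z_d
```

The dense (D, W, K) posterior tensor is only practical for small vocabularies; a real system would iterate over the sparse nonzero counts instead.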

Although PLSA has been successfully developed, it has two main shortcomings. First, the PLSA model is estimated only for those documents appearing in the training set. (PLSA was shown to be a special variant of LDA with a uniform Dirichlet prior in a maximum a posteriori model [23].) Second, PLSA lacks incremental ability, i.e., it cannot handle new data arriving in a stream. To handle streaming data, a naive approach is to re-train the model using both the existing training data and the new data. However, this is not efficient, since it is computationally very expensive; moreover, for practical applications that need real-time online updates, it is infeasible. Therefore, we need a fast incremental algorithm that does not compromise NED performance.
3.2 Existing PLSA Incremental Methods

There is some existing work on incremental learning of PLSA. Paper [24] provided a simple update scheme called Fold-In. The main idea is to update the P(z|d) part of the model while keeping P(w|z) fixed. However, P(w|z) can change significantly during EM iterations and affect P(z|d) as well; thus, the result of Fold-In might be biased. Tzu-Chuan Chou et al. [25] proposed Incremental PLSA (IPLSA), a complete Bayesian solution aiming to address the problem of online event detection. For the time complexity, the algorithm needs $O(n_{iter} \cdot (n_{nd}+n_{od}) \cdot (n_{nw}+n_{ow}) \cdot K)$ operations to converge whenever new documents are added, where $n_{nd}$ is the number of new documents, $n_{od}$ the number of old documents, $n_{nw}$ the number of new words, $n_{ow}$ the number of old words, $K$ the number of latent topics, and $n_{iter}$ the number of iterations. Note that this computational complexity is the same as that of the batched PLSA algorithm, although fewer EM iterations are needed. In another paper, Chien and Wu proposed a PLSA incremental learning algorithm named MAP-PLSA [26], whose complexity is also high.

4 A Novel Incremental PLSA Model with Time Window


There are several issues that should be considered for the incremental task in NED systems:

- the word-topic and document-topic probabilities for new documents;
- the efficiency of parameter updating.

In order to address the above problems, a novel incremental PLSA learning algorithm and a new term weighting algorithm are proposed. When a new document arrives, the probability of a latent topic given the document, P(z|d), is updated accordingly, as is the probability of words given a topic, P(w|z) [27]. The formulae for the incremental update are as follows:

E-step:
$P(z|q,w)^{(n)} = \frac{P(z|q)^{(n)}\,P(w|z)^{(n)}}{\sum_{z'} P(z'|q)^{(n)}\,P(w|z')^{(n)}}$    (6)

M-step:

$P(z|q)^{(n)} = \frac{\sum_{w} n(q,w)\,P(z|q,w)^{(n)}}{\sum_{z'} \sum_{w'} n(q,w')\,P(z'|q,w')^{(n)}}$    (7)

$P(w|z)^{(n)} = \frac{\sum_{d} n(d,w)\,P(z|d,w)^{(n)} + \alpha\,P(w|z)^{(n-1)}}{\sum_{d} \sum_{w'} n(d,w')\,P(z|d,w')^{(n)} + \alpha \sum_{w''} P(w''|z)^{(n-1)}}$    (8)

where the superscript (n-1) denotes the old model parameters and (n) the new ones, and $w \in d$ and $w'' \in W$ are the words in this document and all the other words in the dictionary, respectively. The values of $\alpha$ are hyper-parameters that are manually selected based on empirical results. The time complexity of this algorithm is $O(n_{iter} \cdot n_{nd} \cdot \|n_{nd}\| \cdot K)$, where $n_{iter}$ is the number of iterations, $n_{nd}$ is the number of new documents, $\|n_{nd}\|$ is the average number of words in these documents, and $K$ is the number of latent topics z. When the PLSA model is used in the calculation of the similarity between two documents, a document d is represented by a smoothed version of the term vector $(P(w_1|d), P(w_2|d), P(w_3|d), \ldots)$, where

$P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$    (9)

Then, after weighting by the IDF, the similarity between any two documents can be calculated by the following cosine function:
$sim(d_1, d_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1|\,|\vec{d}_2|}$    (10)

where

$\vec{d} = (P(w_1|d)\,we_1,\; P(w_2|d)\,we_2,\; \ldots)$    (11)

and $we_i$ is the weight of term $w_i$. In our approach the weights of named entities differ from those of topic terms (terms in the document not identified as named entities), because different types of entities have different effects on NED; for example, to distinguish stories about earthquakes in two different places, the named entities may play important roles.
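A sketch of Eqs. (9)-(11), assuming PLSA parameters like those returned by plsa_em above; the per-term weight array we (e.g., boosted for named entities) is an input, since the paper sets these weights empirically:

```python
import numpy as np

def smoothed_vector(p_z_d, P_w_z, we):
    """Eqs. (9) and (11): d = (P(w1|d)*we1, P(w2|d)*we2, ...)."""
    p_w_d = p_z_d @ P_w_z     # Eq. (9): P(w|d) = sum_z P(w|z) P(z|d)
    return p_w_d * we         # Eq. (11): apply per-term weights we_i

def cosine_sim(v1, v2):
    """Eq. (10): cosine similarity of two smoothed term vectors."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2) / denom if denom > 0 else 0.0
```

For a single document d, p_z_d is the corresponding row of P_z_d, and we is a length-W array of weights over the vocabulary.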

As for the IPLSA model, the number of documents to be processed affects the time complexity greatly. We can decrease the time complexity by reducing the number of documents to be processed. The rationale behind our approach is that some stories belonging to the same topic are very similar, so we can combine them into one story; the weights of terms in the new story are then updated accordingly. But how do we combine two stories into one? In our approach, if the similarity between two stories is greater than 2θ (where θ is the threshold value that determines whether a story is the first story of an event), then the two stories are combined, as sketched after Fig. 1. In online NED, events have an aging nature: an old inactive event is less likely to attract new stories than recently active events. Therefore, temporal relations of stories can be exploited to improve NED. Using a lookup window is a popular way of limiting the time frame that an incoming story can relate to. An example of a window-based NED system is shown in Fig. 1. In each advance of the window, which can be measured in time units or by a certain number of stories, the system discards old stories and folds in new ones.
Fig. 1. The discarding and folding-in of a window
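The story-merging rule above can be sketched as follows; the term-frequency dictionaries and the additive combination of counts are our assumptions, since the paper does not spell out how the merged story's term weights are formed:

```python
def merge_if_duplicate(story_a, story_b, similarity, theta):
    """Combine two stories into one when sim > 2*theta.

    story_a, story_b: dicts mapping term -> frequency.
    theta: the first-story (new-event) threshold.
    Returns (representation, merged_flag).
    """
    if similarity > 2 * theta:
        merged = dict(story_a)
        for term, freq in story_b.items():
            merged[term] = merged.get(term, 0) + freq  # update combined term weights
        return merged, True
    return story_b, False
```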

In our approach, when a story is labeled as the first story of an event, the window advances and the EM algorithm re-estimates all the parameters of the PLSA model. Because we only hold the stories in the window, the number of stories does not increase as time advances; as a result, the time complexity does not grow over time. The similarity between two documents in the same time window is obtained by modifying (10) as follows, where T(d) is the time stamp of document d:
$sim(d_1, d_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1|\,|\vec{d}_2|} \cdot \left(1 - \frac{|T(d_1) - T(d_2)|}{1 + \text{window size}}\right)$    (12)

The similarity between a new document and an event is defined as the maximum similarity between the new document and the documents previously clustered into that event. A document is deemed the first story of a new event if its similarity with all the events in the current window is below a predetermined threshold θ; otherwise, it is assigned to the most similar event.
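The window-based decision rule of this section, sketched end to end; the timestamps are assumed to be in the same units as the window size, and the events list is our own representation of the clusters held in the current window:

```python
import numpy as np

def time_decayed_sim(v1, t1, v2, t2, window_size):
    """Eq. (12): cosine similarity damped by the timestamp gap."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    cos = float(v1 @ v2) / denom if denom > 0 else 0.0
    return cos * (1.0 - abs(t1 - t2) / (1.0 + window_size))

def detect(doc_vec, doc_time, events, theta, window_size):
    """Return (is_new_event, best_event_index) for an incoming document.

    events: list of events; each event is a list of (vector, timestamp)
    pairs for the stories previously clustered into it within the window.
    """
    best_sim, best_idx = 0.0, None
    for idx, event in enumerate(events):
        # Document-to-event similarity is the max over the event's member stories.
        for vec, ts in event:
            s = time_decayed_sim(doc_vec, doc_time, vec, ts, window_size)
            if s > best_sim:
                best_sim, best_idx = s, idx
    return best_sim < theta, best_idx
```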

5 Term Reweighting
We use the χ² statistic to compute the correlation between terms and topics, and use it to select features from documents. For each document, we keep only the top-K terms with the largest χ² values rather than all the terms, where K is a predetermined percentage constant. Only the top-K terms are used to compute the similarity values of document pairs; reducing the number of saved terms greatly reduces the time complexity of the incremental PLSA model.

TF-IDF has been the most prevalent term weighting method in information retrieval systems. The basic idea of TF-IDF is that the fewer documents a term appears in, the more important the term is in discriminating between documents. However, the TF-IDF method cannot properly weight terms of the following classes:

- terms that occur frequently within a news category;
- terms that occur frequently in one topic and infrequently in other topics;
- terms with low document frequency that appear in different topics.

Besides, this method rarely considers the importance of document weight. In fact, documents are also important in the discrimination of terms. The main assumption behind document weighting is the following: the more information a document gives to its terms, the more effect it has on the latent variables, and the less information a document gives to its terms, the less effect it has on the latent variables. To address the above problems, we propose a term weight composed of the following parts:

$W(i,j) = LT(i,j) \cdot GT(i) \cdot GD(j) \cdot KD(i)$    (13)

The notation used in the following equations is defined as:

- $tf_{ij}$: the frequency of term i in document j;
- $dl_j$: the length of document j;
- $df_i$: the number of documents that contain term i;
- $df_{ci}$: the number of documents containing term i within cluster c;
- $gf_i$: the frequency of term i in the document collection;
- $sgf$: the summed frequency of all terms in the document collection;
- $N_c$: the number of documents in cluster c;
- $N_t$: the total number of documents in the collection.

We replace the TF in TF-IDF with the following:

$LT(i,j) = \frac{\log(tf_{ij} + 1)}{\log(dl_j + 1)}$    (14)

Entropy theory is used to define GT(i) and GD(j), which replace the IDF:

$GT(i) = \frac{H(d) - H(d|t_i)}{H(d)} = 1 - \frac{H(d|t_i)}{H(d)}$    (15)

$H(d|t_i) = -\sum_{j} p(j|i) \log p(j|i), \qquad p(j|i) = \frac{tf_{ij}}{gf_i}$    (16)

$H(d) = \log(N_t)$    (17)

GD(j) is defined as follows, and it mainly measures the importance of a document to its terms:

$GD(j) = \frac{H(t) - H(t|d_j)}{H(t)} = 1 - \frac{H(t|d_j)}{H(t)}$    (18)

$H(t) = -\sum_{i} p(t_i) \log p(t_i) = -\sum_{i=1}^{n} \frac{gf_i}{sgf} \log \frac{gf_i}{sgf}$    (19)

$H(t|d_j) = -\sum_{i} p(i|j) \log p(i|j), \qquad p(i|j) = \frac{tf_{ij}}{dl_j}$    (20)

KD(i) is used to enhance the weights of terms that occur frequently in one topic and infrequently in other topics; it is defined as:

$KD(i) = KL(P_{ci} \,\|\, P_{ti})$    (21)

$P_{ci} = \frac{df_{ci}}{N_c}, \qquad P_{ti} = \frac{df_i}{N_t}, \qquad KL(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$    (22)
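Combining Eqs. (13)-(22), a sketch of the full weight W(i,j) over a term-document count matrix; the epsilon guards and the Bernoulli reading of the KL divergence in Eq. (22) (comparing the rates P_ci and P_ti as two-outcome distributions) are our assumptions:

```python
import numpy as np

def term_weights(TF, df_c, df, Nc, Nt, eps=1e-12):
    """W(i,j) = LT(i,j) * GT(i) * GD(j) * KD(i), Eqs. (13)-(22).

    TF:   (n_terms, n_docs) matrix of tf_ij over the collection.
    df_c: per-term document frequency df_ci inside cluster c.
    df:   per-term document frequency df_i over the whole collection.
    Nc, Nt: number of documents in cluster c and in the collection.
    """
    gf = TF.sum(axis=1)                       # gf_i: collection frequency of term i
    sgf = gf.sum()                            # total frequency of all terms
    dl = TF.sum(axis=0)                       # dl_j: length of document j

    LT = np.log(TF + 1.0) / np.log(dl + 1.0)[None, :]          # Eq. (14)

    p_j_i = TF / (gf[:, None] + eps)                           # p(j|i) = tf_ij / gf_i
    H_d_t = -(p_j_i * np.log(p_j_i + eps)).sum(axis=1)         # H(d|t_i), Eq. (16)
    GT = 1.0 - H_d_t / np.log(Nt)                              # Eqs. (15), (17)

    p_t = gf / (sgf + eps)
    H_t = -(p_t * np.log(p_t + eps)).sum()                     # H(t), Eq. (19)
    p_i_j = TF / (dl[None, :] + eps)                           # p(i|j) = tf_ij / dl_j
    H_t_d = -(p_i_j * np.log(p_i_j + eps)).sum(axis=0)         # H(t|d_j), Eq. (20)
    GD = 1.0 - H_t_d / (H_t + eps)                             # Eq. (18)

    # Eqs. (21)-(22): KL(P_ci || P_ti), read as Bernoulli distributions (assumption).
    p, q = df_c / Nc, df / Nt
    KD = (p * np.log((p + eps) / (q + eps))
          + (1 - p) * np.log((1 - p + eps) / (1 - q + eps)))

    return LT * GT[:, None] * GD[None, :] * KD[:, None]        # Eq. (13)
```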

6 Evaluation
In the evaluation, we used the standard TDT-4 corpus from the NIST TDT corpora. Only English documents tagged as definitely on a news topic (that is, tagged YES) were chosen for evaluation.
6.1 Performance Metrics

We follow the performance measurements defined in [19]. An event analysis system may generate any number of clusters, but only the clusters that best match the labeled topics are used for evaluation.
Table 1. A Cluster-Topic Contingency Table

                  In topic    Not in topic
In cluster            a             b
Not in cluster        c             d

Table 1 illustrates a 2×2 contingency table for a cluster-topic pair, where a, b, c, and d represent the numbers of documents in the four cases. Four singleton evaluation measures, Recall, Precision, Miss, and False Alarm, together with F1, are defined as follows:

- Recall = a/(a+c) if (a+c) > 0; otherwise it is undefined.
- Precision = a/(a+b) if (a+b) > 0; otherwise it is undefined.
- Miss = c/(a+c) if (a+c) > 0; otherwise it is undefined.
- False Alarm = b/(b+d) if (b+d) > 0; otherwise it is undefined.
- F1 = 2 · Recall · Precision / (Recall + Precision).
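A direct transcription of these measures, with None marking the undefined cases:

```python
def cluster_topic_metrics(a, b, c, d):
    """Recall, Precision, Miss, False Alarm, and F1 from the Table 1 counts."""
    recall = a / (a + c) if (a + c) > 0 else None
    precision = a / (a + b) if (a + b) > 0 else None
    miss = c / (a + c) if (a + c) > 0 else None
    false_alarm = b / (b + d) if (b + d) > 0 else None
    f1 = (2 * recall * precision / (recall + precision)
          if recall and precision else None)
    return recall, precision, miss, false_alarm, f1
```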

To test the approaches proposed in this paper, we implemented and tested three systems:

- System 1 (SS), in which the story at hand is compared to each previously received story, and the highest similarity is used to determine whether the current story is about a new event.
- System 2 (SC), in which the story at hand is compared to all previous clusters, each of which represents a topic, and the highest similarity is used for the final decision on the current story: if the highest similarity exceeds the threshold θ, the story is an old story and is put into the most similar cluster; otherwise it is a new story and creates a new cluster.
- System 3 (IPLSA), implemented based on the IPLSA model proposed in Section 4, with terms reweighted according to Section 5; 40 latent variables z are used in this system.
6.2 Evaluation of Base Performance and Story Identification

In the first experiment we tested the recall, precision, miss rate, and F1 of the three systems. Table 2 summarizes the results of the systems based on SC, SS, and IPLSA. All these systems conducted multiple runs with different parameter settings; here we present the best result for each system with respect to the F1 measure. As shown, the results of the evaluations demonstrate that the proposed IPLSA algorithm greatly outperforms the SS and SC models.
Table 2. New event detection results

         Recall (%)   Precision (%)   Miss (%)   False Alarm (%)    F1
SC           62            57            38           0.162        0.59
SS           70            68            30           0.160        0.68
IPLSA        88            85            12           0.157        0.86

In the second experiment, we mainly tested the score distributions of new stories and old stories. The main goal of our effort was to find a way to correctly identify new stories. In practice, it is not possible to determine the true optimal threshold, because there is no knowledge about incoming and future news documents; thus, to preserve high performance for non-optimal thresholds, an event analysis algorithm has to be resilient. This means that concentrating new-story scores and old-story scores at the low and high ends, respectively, makes for good identification of new events. To understand what we had actually achieved by using the proposed model, we studied the distributions of the confidence scores assigned to new and old stories by the three systems on the TDT4 collection (Figures 3 and 4, respectively). In Figure 3, we observe that the scores of a small fraction of new stories that were initially missed (between scores 0.8 and 1) are decreased by the system based on the proposed model, and a small fraction (between scores 0.1 and 0.3) is increased by a large amount. As for the old-story scores, there is also a major impact from using the IPLSA-based system: in Figure 4, we observe that the scores of a significant number of old stories (between scores 0.2 and 0.4) have been decreased. This has the effect of increasing the score difference between old and new stories, and hence improves new event identification at the minimum cost.

Fig. 3. Distribution of new story scores for the SS, SC, and IPLSA based systems

Fig. 4. Distribution of old story scores for the SS, SC, and IPLSA based systems

7 Conclusion
We have shown the applicability of IPLSA techniques to the NED problem. Significant improvements were made over the SS and SC systems on the corpora tested. We have also presented a new term weighting algorithm that adjusts term weights based on term distributions between the whole corpus and a cluster story set, and that also uses document information and entropy theory. Our experimental results on the TDT4 dataset show that the algorithm contributes significantly to the improvement in accuracy. NED requires not only detection and reporting of new events, but also suppression of stories that report old events. From the study of the distributions of scores assigned to stories by the SS, SC, and IPLSA systems, we can also see that the IPLSA model does a better job of detecting old stories (reducing false alarms). Thus, it may well be that attacking the problem as old story detection is a better and more fruitful approach.

References
1. Allan, J.: Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers, Dordrecht (2002)
2. Papka, R., Allan, J.: On-line New Event Detection Using Single Pass Clustering. Technical Report UM-CS-1998-021 (1998)
3. Allan, J., Lavrenko, V., Jin, H.: First Story Detection in TDT Is Hard. In: Proceedings of the Ninth International Conference on Information and Knowledge Management, Washington, DC (2000)

4. Giridhar, K., Allan, J., Andrew, M.: Classification Models for New Event Detection. In: Proceedings of CIKM (2004)
5. Yang, Y., Pierce, T., Carbonell, J.: A Study on Retrospective and On-line Event Detection. In: Proceedings of SIGIR, Melbourne, Australia, pp. 28-36 (1998)
6. Allan, J., Lavrenko, V., Malin, D., Swan, R.: Detections, Bounds, and Timelines: UMass and TDT-3. In: Proceedings of the Topic Detection and Tracking Workshop (TDT-3), Vienna, VA, pp. 167-174 (2000)
7. Yang, Y., Zhang, J., Carbonell, J., Jin, C.: Topic-conditioned Novelty Detection. In: Proceedings of the 8th ACM SIGKDD International Conference, pp. 688-693 (2002)
8. Juha, M., Helena, A.M., Marko, S.: Applying Semantic Classes in Event Detection and Tracking. In: Proceedings of the International Conference on Natural Language Processing, pp. 175-183 (2002)
9. Juha, M., Helena, A.M., Marko, S.: Simple Semantics in Topic Detection and Tracking. Information Retrieval, 347-368 (2004)
10. Giridhar, K., Allan, J.: Text Classification and Named Entities for New Event Detection. In: Proceedings of the 27th Annual International ACM SIGIR Conference, New York, NY, USA, pp. 297-304 (2004)
11. Papka, R., Allan, J.: On-line New Event Detection Using Single Pass Clustering. Technical Report UM-CS-1998-021 (1998)
12. Lam, W., Meng, H., Wong, K., Yen, J.: Using Contextual Analysis for News Event Detection. International Journal on Intelligent Systems, 525-546 (2001)
13. Thorsten, B., Francine, C., Ayman, F.: A System for New Event Detection. In: Proceedings of the 26th Annual International ACM SIGIR Conference, pp. 330-337. ACM Press, New York (2003)
14. Stokes, N., Carthy, J.: Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection. In: Proceedings of the 24th Annual International ACM SIGIR Conference, pp. 424-425. ACM Press, New York (2001)
15. Luo, G., Tang, C., Yu, P.S.: Resource-Adaptive Real-Time New Event Detection. In: SIGMOD, pp. 497-508 (2007)
16. Kuo, Z., Zi, L.J., Gang, W.: New Event Detection Based on Indexing-tree and Named Entity. In: Proceedings of SIGIR, pp. 215-222 (2007)
17. Makkonen, J., Ahonen-Myka, H., Salmenkivi, M.: Applying Semantic Classes in Event Detection and Tracking. In: Proceedings of the International Conference on Natural Language Processing, pp. 175-183 (2002)
18. Makkonen, J., Ahonen-Myka, H., Salmenkivi, M.: Simple Semantics in Topic Detection and Tracking. Information Retrieval, pp. 347-368 (2004)
19. Zhang, J., Ghahramani, Z., Yang, Y.: A Probabilistic Model for Online Document Clustering with Application to Novelty Detection. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 17, pp. 1617-1624. MIT Press, Cambridge (2005)
20. Hofmann, T.: Probabilistic Latent Semantic Indexing. In: Proc. ACM SIGIR 1999 (1999)
21. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Royal Statistical Soc. B 39, 1-38 (1977)
22. Brants, T., Chen, F., Tsochantaridis, I.: Topic-Based Document Segmentation with Probabilistic Latent Semantic Analysis. In: Proc. 11th ACM Int'l Conf. on Information and Knowledge Management (2002)
23. Girolami, M., Kaban, A.: On an Equivalence Between PLSI and LDA. In: Proc. of SIGIR, pp. 433-434 (2003)


24. Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning Journal 42(1-2), 177-196 (2001)
25. Chou, T.C., Chen, M.C.: Using Incremental PLSI for Threshold-Resilient Online Event Analysis. IEEE Transactions on Knowledge and Data Engineering 20(3), 289-299 (2008)
26. Chien, J.T., Wu, M.S.: Adaptive Bayesian Latent Semantic Analysis. IEEE Transactions on Audio, Speech, and Language Processing 16(1), 198-207 (2008)
27. Wu, H., Yongji, W., Xiang, C.: Incremental Probabilistic Latent Semantic Analysis for Automatic Question Recommendation. In: Proceedings of the ACM Conference on Recommender Systems, Lausanne, Switzerland, October 23-25 (2008)
28. Yang, Y., Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. In: Fisher, D.H. (ed.) The Fourteenth International Conference on Machine Learning, pp. 412-420. Morgan Kaufmann, San Francisco (1997)
