
Automatic text summarization based on sentences clustering and extraction

ZHANG Pei-ying
College of Computer & Communication Engineering, China University of Petroleum, Dongying, Shandong
e-mail: smartfrom1024@yahoo.com.cn
Abstract—Technology of automatic text summarization plays an important role in information retrieval and text classification, and may provide a solution to the information overload problem. Text summarization is the process of reducing the size of a text while preserving its information content. This paper proposes a sentence-clustering based summarization approach consisting of three steps: first, cluster the sentences based on the semantic distance among the sentences of the document; then, for each cluster, calculate the accumulative sentence similarity based on a multi-feature combination method; finally, choose the topic sentences by extraction rules. The purpose of the present paper is to show that the summarization result depends not only on the sentence features but also on the sentence similarity measure. Experimental results on the DUC2003 dataset show that the proposed approach improves performance compared with other summarization methods.

Keywords—text summarization; similarity measure; sentences clustering; sentence extractive technique

LI Cun-he
College of Computer & Communication Engineering, China University of Petroleum, Dongying, Shandong
e-mail: licunhe@hdpu.edu.cn

I. INTRODUCTION

The continuing growth of the World Wide Web and of on-line text collections makes a large volume of information available to users. This information overload either wastes significant time in browsing all the information or causes useful information to be missed. The technology of automatic text summarization is maturing and may provide a solution to the information overload problem. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic (Wan, 2008). Text summarization is a complex task which would ideally involve deep natural language processing capacities. To simplify the issue, current research focuses on extractive-summary generation, and sentence-based extractive techniques are commonly used to produce extractive summaries. Traditional summarization methods use sentence features to evaluate the importance of the sentences of a document; their limitation is that they do not compute the semantic similarity between sentences. This paper proposes a sentence similarity measure based on three features of a sentence — the word form feature, the word order feature and the semantic feature — using a weight to describe the contribution of each feature, which describes sentence similarity more precisely. We determine the number of clusters, use the K-means method to cluster the sentences of the document, and extract topic sentences to generate an extractive summary for the document. Experiments show that our method outperforms other summarization methods under the ROUGE-1, ROUGE-2 and F1-Measure evaluation metrics.

The rest of this paper is organized as follows: Section 2 introduces related work. The proposed sentence-clustering based summarization algorithm is presented in Section 3. Section 4 presents evaluation results on the DUC2003 dataset. The last section gives the conclusions.

II. RELATED WORK

In the past, extractive summarizers have mostly been based on scoring the sentences of the source document. The most common and recent text summarization techniques use statistical approaches, for example (Zechner, 1996), (Carbonell, 1998), (Strzalkowski, 1998), (Berger and Mittal, 2000), (Nomoto, 2001); linguistic techniques, for example (Klavans, 1995), (Radev, 1998), (Nakao, 2000); or some linear combination of these: (Goldstein et al., 1999), (Mani, 2002) and (Barzilay, 1997). Our algorithm is markedly different from each of these and tries to capture the semantic distance between the sentences of the document. None of the above approaches selects sentences based on the semantic content of a sentence and the relative importance of that content to the semantics of the text. Our algorithm, unlike almost all previous ones, performs automatic text summarization by identifying semantic relations among sentences.

III. SENTENCES CLUSTERING BASED SUMMARIZATION

A. Similarity measure between sentences

Definition 1: Word Form Similarity. The word form similarity describes the surface similarity between two sentences and is measured by the number of words the two sentences have in common; stop words are removed before the computation. If S1 and S2 are two sentences, the word form similarity is calculated by formula (1):

Sim1(S1,S2) = 2*SameWord(S1,S2) / (Len(S1)+Len(S2))   (1)
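As an illustrative sketch (not the authors' code), formula (1) can be implemented as follows; the tiny stop-word list is a stand-in assumption, and shared words are counted as distinct word types:

```python
def word_form_similarity(s1, s2, stop_words=frozenset({"the", "a", "of"})):
    """Sim1 of formula (1): word-form overlap, stop words removed."""
    w1 = [w for w in s1.lower().split() if w not in stop_words]
    w2 = [w for w in s2.lower().split() if w not in stop_words]
    if not w1 and not w2:
        return 0.0
    same = len(set(w1) & set(w2))            # SameWord(S1, S2)
    return 2.0 * same / (len(w1) + len(w2))  # Len(S1) + Len(S2)
```

Identical sentences score 1.0 and sentences with no shared content words score 0.0, matching the bounds the paper assumes for the combined measure.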

_____________________________
978-1-4244-4520-2/09/$25.00 © 2009 IEEE

167
Authorized licensed use limited to: CENTRE FOR DEVELOPMENT OF ADVANCED COMPUTING. Downloaded on July 31,2010 at 08:09:07 UTC from IEEE Xplore. Restrictions apply.

Here SameWord(S1,S2) is the number of words the two sentences have in common, and Len(S) is the number of words in sentence S.

Definition 2: Word Order Similarity. The word order similarity describes the sequence similarity between two sentences. A Chinese sentence can be expressed in many styles, and different word orders convey different meanings. We describe a sentence by three vectors:

V1 = {d11, d12, …, d1n1}
V2 = {d21, d22, …, d2n2}
V3 = {d31, d32, …, d3n3}

Here the weight d1i in vector V1 is the tf-idf value of the i-th word; the weight d2i in vector V2 indicates whether the i-th bi-gram occurs in the sentence (0 for absent, 1 for present); and the weight d3i in vector V3 likewise indicates whether the i-th tri-gram occurs. The word order similarity between S1 and S2 is:

Sim2(S1,S2) = β1*Cos(V1(S1),V1(S2)) + β2*Cos(V2(S1),V2(S2)) + β3*Cos(V3(S1),V3(S2))   (2)

Here β1+β2+β3=1, and βi stands for the weight of each part.

Definition 3: Word Semantic Similarity. The word semantic similarity describes the semantic similarity between two sentences. Here the word semantic similarity computation (Jiang Min, 2008) is based on HowNet [1]. Based on the semantic similarity among words, we define the word-sentence similarity WSSim(w,S) to be the maximum similarity between a word w and the words within sentence S:

WSSim(w,S) = max{ Sim(w,Wi) | Wi ∈ S, where w and Wi are words }   (3)

Here Sim(w,Wi) is the word similarity between w and Wi. With WSSim(w,S), we define the semantic similarity between sentences as:

Sim3(S1,S2) = [ Σ_{w∈S1} WSSim(w,S2) + Σ_{w∈S2} WSSim(w,S1) ] / ( |S1| + |S2| )   (4)

Here S1 and S2 are sentences, and |S| is the number of words in sentence S.

Definition 4: Sentence Similarity. The sentence similarity is usually described as a number between zero and one: zero stands for no similarity and one for total similarity; the larger the number, the more similar the sentences. The sentence similarity between S1 and S2 is defined as:

Sim(S1,S2) = λ1*Sim1(S1,S2) + λ2*Sim2(S1,S2) + λ3*Sim3(S1,S2)   (5)

Here λ1, λ2 and λ3 are constants satisfying λ1+λ2+λ3=1. In this paper, λ1=0.2, λ2=0.1, λ3=0.7.

B. Estimating the number of clusters

Determining the optimal number of sentence clusters in a text document is a difficult issue: it depends on the compression ratio of the summary and the chosen similarity measure, as well as on the document topics. Users cannot predict the latent number of topics in a document, so it is impossible to supply k directly. The strategy we use to determine the optimal number of clusters (the number of topics in a document) is based on the distribution of terms over the sentences:

k = n·|D| / Σ_{i=1..n} |Si|   (6)

where |D| is the number of terms in the document D, |Si| is the number of terms in sentence Si, and n is the number of sentences in document D. We illustrate this estimate with two extreme cases; see (Ramiz M. Aliguliyev, 2008) for the detailed proof. (1) The document consists of n sentences that all have the same set of terms, so the term set of the document coincides with the term set of each sentence: D = {t1, t2, …, tm} = Si = S. Then Σ_{i=1..n} |Si| = n|S| = n|D|, and from (6) it follows that k = n|D|/(n|D|) = 1. (2) The document consists of n sentences that have no terms in common, that is, Si∩Sj=∅ for i≠j. Then each term of the document belongs to exactly one sentence, so D = ∪_{i=1..n} Si and |D| = Σ_{i=1..n} |Si|, from which it follows that k = n.
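A minimal sketch of the cluster-number estimate of formula (6) together with the K-means procedure of the next subsection. The word-overlap similarity used here is a stand-in assumption for the paper's combined measure Sim, and terms are approximated as unique lowercase words:

```python
import random

def estimate_k(sentences):
    """Formula (6): k = n*|D| / sum(|Si|), terms taken as unique words."""
    term_sets = [set(s.lower().split()) for s in sentences]
    doc_terms = set().union(*term_sets)        # D = union of all Si
    total = sum(len(t) for t in term_sets)     # sum of |Si|
    return max(1, round(len(sentences) * len(doc_terms) / total))

def kmeans_sentences(sentences, k, similarity, iters=20):
    """K-means over sentences; a cluster's central sentence is the one
    with the greatest accumulated similarity to its cluster members."""
    centers = random.sample(range(len(sentences)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for i, s in enumerate(sentences):
            best = max(range(k), key=lambda c: similarity(s, sentences[centers[c]]))
            clusters[best].append(i)
        new_centers = []
        for c, members in enumerate(clusters):
            if not members:                    # keep old center for empty cluster
                new_centers.append(centers[c])
                continue
            new_centers.append(max(members, key=lambda i: sum(
                similarity(sentences[i], sentences[j]) for j in members)))
        if new_centers == centers:             # centers stopped moving
            break
        centers = new_centers
    return clusters, centers
```

For the degenerate cases analyzed above, identical sentences drive the estimate toward k=1 and fully disjoint sentences toward k=n, as the proof sketch predicts.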

C. Sentences Clustering

Once the number of sentence clusters is determined, we use the K-means method to cluster the sentences of the document. The algorithm is as follows:

Input: n sentences; K, the number of clusters.
Output: the sentence clusters.
Step 1: Randomly select K sentences as the initial central sentences of the K clusters.
Step 2: Assign each sentence to the cluster with the closest central sentence.
Step 3: When all sentences have been assigned, recalculate the central sentence of each cluster; the central sentence is the one with the greatest accumulative similarity to the other sentences of its cluster.
Step 4: Repeat Steps 2 and 3 until the central sentences no longer move.

This produces a separation of the sentences into K clusters from which the metric to be minimized can be calculated.

D. Topic Sentences Extraction

Based on the clustering result of Section C, assume the sentence clusters are D = {C1, C2, …, Ck}. First, determine the central sentence ci of each cluster based on the accumulative similarity between each sentence Si and the other sentences; then calculate the similarity between each sentence Si and the central sentence ci. Taking the similarity of the central sentence itself as 1, sort the sentences by their similarity weight and choose the highest-weighted sentences as topic sentences. At the same time, considering the recall rate of the summarization, the summary should include sentences from every cluster, extracting from the clusters by priority during sentence extraction.

IV. EXPERIMENTS AND RESULTS

In this section, we conduct experiments to evaluate the performance of the sentence-clustering based automatic text summarization system.

A. Evaluation metrics

Evaluating summaries and automatic text summarization systems is not a straightforward process. Many measures can calculate the topical similarity between two summaries; we use two of them to evaluate the results. The first is precision (P), recall (R) and F1-measure, which are widely used in information retrieval. For each document, the manually extracted sentences are considered the reference summary (denoted Summref). This approach compares the candidate summary (denoted Summcand) with the reference summary and computes the P, R and F1-measure values as shown in formula (7) (Shen et al., 2007).
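The P, R and F1 computation just described can be sketched as follows, representing each summary as a set of extracted sentences (an illustration, not the authors' evaluation code):

```python
def prf1(reference, candidate):
    """Precision, recall and F1 over sets of extracted sentences."""
    ref, cand = set(reference), set(candidate)
    overlap = len(ref & cand)                  # |Summref intersect Summcand|
    p = overlap / len(cand) if cand else 0.0
    r = overlap / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For example, a candidate sharing two of three reference sentences while containing one extra sentence scores P = R = F1 = 2/3.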
P = |Summref ∩ Summcand| / |Summcand|,  R = |Summref ∩ Summcand| / |Summref|,  F1 = 2PR / (P + R)   (7)

The second measure we use is ROUGE, which was adopted by DUC for automatic summarization evaluation and has been shown to be very effective for measuring document summarization. It measures summary quality by counting overlapping units, such as N-grams, word sequences and word pairs, between the candidate summary and the reference summary. The ROUGE-N measure compares the N-grams of two summaries and counts the number of matches; it is defined by formula (8):

ROUGE-N = Σ_{S∈Summref} Σ_{N-gram∈S} Count_match(N-gram) / Σ_{S∈Summref} Σ_{N-gram∈S} Count(N-gram)   (8)

where N stands for the length of the N-gram, Count_match(N-gram) is the maximum number of N-grams co-occurring in the candidate summary and the set of reference summaries, and Count(N-gram) is the number of N-grams in the reference summaries.

B. Runs and Evaluation Results

To evaluate the performance of our method, we conduct experiments on the DUC2003 document dataset, comparing our method with the MMR (Carbonell, 1998) and WAA (Zhang Qi, 2004) methods. As shown in Table 1, on the DUC2003 dataset the ROUGE-1, ROUGE-2 and F1 values of our method are better than those of the other summarization methods.

TABLE 1. THE VALUES OF EVALUATION METRICS FOR SUMMARIZATION METHODS (DUC2003 DATASET)

Methods      ROUGE-1   ROUGE-2   F1-Measure
MMR          0.34813   0.07917   0.43245
WAA          0.38023   0.09121   0.45335
Our Method   0.43512   0.10142   0.47576

V. CONCLUSION

We have presented an approach to automatic text summarization based on sentence clustering and extraction. Our approach consists of three steps: first, cluster the sentences of the document; then, for each cluster, calculate the accumulative sentence similarity based on the multi-feature combination; finally, choose the topic sentences by the extraction rules. Comparing our method with other existing summarization methods on the open DUC2003 dataset, we found that it improves the summarization results significantly under the ROUGE-1, ROUGE-2 and F1-Measure evaluation metrics. The main contributions of this study are as follows: (1) it proposes a sentence similarity measure based on three features of a sentence — the word form feature, the word order feature and the semantic feature — using a weight to describe the contribution of each feature, which describes sentence similarity more precisely; (2) it gives a method for determining the number of sentence clusters; (3) it gives an approach to text summarization based on sentence clustering.

REFERENCES

[1] Dong Zhen-dong. HowNet. http://www.keenage.com
[2] Barzilay, Elhadad, 1997. Using lexical chains for text summarization. Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization.
[3] Berger, Mittal, 2000. Query-relevant summarization using FAQs. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
[4] Carbonell, Goldstein, 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st ACM-SIGIR International Conference on Research and Development in Information Retrieval, Melbourne, Australia.
[5] Jiang Min, Xiao Shi-bin, Wang Hong-wei, et al., 2008. An improved word similarity computing method based on HowNet. Journal of Chinese Information Processing.
[6] Goldstein, Kantrowitz, Mittal, Carbonell, 1999. Summarizing text documents: Sentence selection and evaluation metrics. Proceedings of SIGIR.
[7] Katz, S. M., 1996. Distribution of content words and phrases in text and language modeling. Natural Language Engineering.
[8] Klavans, Shaw, 1995. Lexical semantics in summarization. Proceedings of the First Annual Workshop of the IFIP Working Group for Natural Language Processing and Knowledge Representation.
[9] Mani, 2002. Automatic summarization. A tutorial presented at ICON.
[10] Nakao, 2000. An algorithm for one-page summarization of a long text based on thematic hierarchy detection. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
[11] Nomoto, Matsumoto, 2001. A new approach to unsupervised text summarization. Proceedings of the 24th ACM SIGIR.
[12] Radev, McKeown, 1998. Generating natural language summaries from multiple online sources. Computational Linguistics.
[13] Ramiz M. Aliguliyev, 2008. A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications.
[14] Shen, D., Sun, J.-T., Li, H., Yang, Q., Chen, Z., 2007. Document summarization using conditional random fields. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), January 6-12, pp. 2862-2867, Hyderabad, India.
[15] Strzalkowski, Wise, Wang, 1998. A robust practical text summarization system. Proceedings of the Fifteenth National Conference on AI.
[16] Wan, X., 2008. Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Information Retrieval.
[17] Zechner, 1996. Fast generation of abstracts from general domain text corpora by extracting relevant sentences. COLING.
[18] Zhang Qi, Huang Xuan-jing, Wu Li-de, 2004. A new method for calculating similarity between sentences and application on automatic text summarization. Journal of Chinese Information Processing.

