A Survey On Automated Text Summarization Methodologies

International Journal of Scientific Research Engineering & Technology (IJSRET)
Volume 2 Issue 8 pp 512-517 November 2013 www.ijsret.org ISSN 2278 0882
A Survey on Automated Text Summarization Methodologies

K. Divya
Computer Science Engineering, Medicaps Institute of Technology and Management, Indore
ABSTRACT
Availability of ample of data on the web is difficult of access. Hence it has become an important research area of automatic text summarization within the Natural Language processing. Text mining is another important research field that brings meaning to the natural language on the web. Automatic Text Summarization is the Process in which the input is to the computer is the text, whereas the output is the concise extract of the input data. The entire process of automatic text summarization takes four stages. They are tokenization, feature identification, characterization, tagging and summarization. The work can be used in much practical application. Keywords - Machine Learning, Natural Language Processing (NLP), Text Mining, Tokenization, Information Retrieval, Automated Text Summarization, Topic Identification, Interpretation and Summary Generation users to extract precise and important content from the original document instead of the entire original document. Automatic Text summarization is the process of extracting the main and important content from the original data. The extracted data has to be relevant and the relevancy is obtained by different techniques. Text summarization is categorized in two parts. They are: extraction and abstraction. Extraction is mostly used to create the summary in automated text summarization. The method for filtering information from large volumes of text is called Information Extraction. It is a limited task than understanding the full text. In full text understanding, we express in an explicit fashion about all the information in a given text. But, in Information Extraction, we delimit in advance, as part of the specification of the task and the semantic range of the result. Only extractive summarization method is considered and developed for the study.. Various analyses of results were also discussed by comparing it with the English language.
I.
INTRODUCTION
II.
KEY CONCEPTS
In the field of computer science, artificial intelligence and linguistics, Natural language Processing (NLP) is the application that deals with the computer human language interaction [1]. Applications of NLP include a number of fields of studies, such as machine translation, natural language text processing and summarization, user interfaces, multilingual and cross language information retrieval (CLIR), speech recognition, artificial intelligence and expert systems, machine learning, and so on. Text Mining the import field of the NLP which also an important field of research. It is an interesting technology to find text patters from the large data in the web. It is also known as Knowledge Discovery in the Text (KDD) and has provided the users the ease to use the relevant data on the web [2]. It is an interdisciplinary field which consist of information retrieval, machine learning, data mining, statistics and computational linguistics. It became important for the
To create a summary is very important to go through some important terminologies. These terms would help us to understand the process of automated text summarization more accurately and easily. A. Machine Learning Machine learning is the method of automating the expansion of any part of a system which performs some task. The algorithm, parameters to a process or algorithm can be learnt adaptively over a period of time. There are two types of situation where machine learning is necessary. The first kind of situation is one in which a human cannot easily, quickly or reliably creates a system (including the algorithm and its parameters) which performs some task. Most multi-class classification tasks fall into this category.
IJSRET @ 2013

B. Natural Language Processing
Natural Language Processing (NLP) is the computerized approach to analysing text that is based on both a set of theories and a set of technologies. And, being a very active area of research and development, there is not a single agreed-upon definition that would satisfy everyone, but there are some aspects, which would be part of any knowledgeable persons definition [3]. C. Text Mining Text mining is similar to data mining, except data mining tools handles structured data from databases, but it can work with unstructured or semi-structured data sets such as emails, full-text documents and HTML files etc. Text mining is an application of data mining techniques to automated discovery of useful or interesting knowledge from unstructured text [4]. Several techniques have been proposed for text mining including conceptual structure, association rule mining, episode rule mining, decision trees, and rule induction methods. Text mining is becoming one of the most popular topic in research field as the data on World Wide Web is increasing day by day. D. Tokenization Tokenization is the task of breaking up the given character sequence or document unit into small pieces called tokens, simultaneously omitting certain characters, such as punctuation [6]. Here is an example of tokenization: Input: Youre good reputation is more worth than money;
Youre more Good worth reputation is
Output
tha mone n y These tokens are usually referred as terms or words,
but many a times it becomes necessary to make type and token separately. A token is defined as the part of the sequence of a particular document .Whereas the type is the class of all the tokens containing the same character. A term is often referred as a type is included in an IR dictionary. E. Information Retrieval Extraction of high quality of data from the large database that is usually unstructured. According to am edition of Cambridge UP [6] Information retrieval (IR) is ending material (usually documents) of an unstructured nature (usually text) that
satises information need from wit hin large collections (usually stored on computers). Information retrieval system identifies the document that match the users query from collection of documents. The well known information retrieval system is the Google that identifies the documents form the world wide web , which matches the users search query at the same time the retrieved document form the world wide web is also relevant [8]. F. Automated Text summarization Most of the data on the web today dont have summaries. System as well as for human the job of summarization is challenging task. As a result of which the task of automating to get an indicative and informative summaries became a researchers main concern. The process of automated text summarization constructs a summary automatically for some text [7]. The system that summarizes a single document is called a single document summarization. System that summarizes a set of documents to form a single summary is called as multi document summarization .The task of single text summarization is a difficult task but creating a summary of multi document summary is much more difficult then single text summarization. To create the summary of the documents based on query or a question is called a query relevant summarization and the system is called as query relevant summarization system .This is system is same as question answering. The generated summary is modified according to the users interest. To create a summary the researchers has give three different stages which when performed gives a summary. They are: topic identification, Interpretation and summary generation. Automated text summarization is the process of automatically constructing summaries for a text. Systems summarizing a single document are called single document summarization systems. Systems summarizing a set of documents to form a single summary are called multi document summarization systems. Single document summarization is a difficult task by itself, but multi document summarization has additional difficulties. G. Topic Identification For the topic identification system uses different techniques [4]. The process of these modules is assigning the score to the input (word, sentence or passages) i.e. the text document. The module called as the combination module combines the score of the respective words, sentence or the passages assigned to the single integrated unit. As a result the system return
IJSRET @ 2013

the highest scoring unit .At the end the system generates the summary with the length requested by the user. For every unit we identify the correct, wrong and missed .Where correct is the number of sentences extracted by the system and the human; wrong the number of sentences extracted by the system but not by the human; and missed the number of sentences extracted by the human but not by the system. Hence the Precision = correct / (correct + wrong) Recall = correct / (correct + missed) Where the precision show how many systems extracted sentences are good and recall reflects the how many good sentences are missed by the system. H. Interpretation To differentiate extract type summarization system and the abstract type summarization system; the interpretation is process is used [7]. In the process of interpretation the topics considered as important are coupled together. These fused topics are represented in the form given by new formulations such that without using word that is in the original text. I. Summary Generation The final and the most important stage of the process is the summary generation [7]. When the summary content has been created through abstracting and/or information extraction, it exists within the computer in internal notation, and thus requires the techniques of natural language generation, namely text planning, sentence (micro-) planning, and sentence realization.
Summarization = +Interpretation + Generation
Topic
Identification
All the above three stages the SUMMARIST equation uses a combination of symbolic world knowledge and statistical or IR-based techniques. Each stages different and complementary methods. But before all these stages pre-processing stage takes place.. Each module either performs certain pre-processing tasks (such as tokenization) or attaches additional features (such as part-of-speech tags) to the input texts. These modules are [7]: Tokeniser: reads English texts and outputs tokenized texts. Part-of-speech tagger: reads tokenized texts and outputs part-of-speech tagged texts. This tagger is based on Brills part-of-speech tagger (Brill 93). Converter: converts tagged texts into SUMMARIST internal representation. Morpher: finds all root forms of each input token, using a modification of Word Nets (Miller et al. 90) demorphing program. Phraser: finds collocations (multi-word phrases), as recorded in Word Net. Token frequency counter: counts the occurrence of each token in an input text. Weight calculator: calculates the tf.idf weight Salton 88) for each input token, and ranks the tokens according to this weight. Query relevance calculator: to produce query sensitive summaries, this module records with each sentence the number of (demorphed) content words in the users query that also appear in that sentence. ii. Automated Text Categorization and Summarization using Rule Reduction This paper suggests the summary of the document or data by using rule reduction [9].The text analyser parses the given input into tokens and recognizes the features of the alphabets and group them into Noun Phrase or Verb Phrase or Prepositional Phrase. The system generates some rules for the respective phrase. Generation of Rules: The rules [9] generated are given below with explanations: Rules for Noun Phrase (NP) are: NP -+ (Det) (Adj) N (PP) (1) where 'Det' is Determiner, 'Adj' is an Adjective, 'N' is
III.
RELATED WROK-A SUMMARY OF METHEDOLOGY
Automated text summarization is not a new research idea. It is been a research topic since last few years .Many techniques have been tried during 1950s and 60s. i. Automated Text Summarization in SUMMARIST One such similar pioneer work is by Eduard Hovy and Chin-Yew-Lin. He introduced a summarist, an attempt to create a robust automated text summarization system [7],. He gave an equation for the summarist system that is:
IJSRET @ 2013

noun, 'PP' is the Preposional Phrase and () specifies the optionality. I.e., NP can be, NP -+ N eg. John NP -+ Det N eg. the Boy. NP -+ Det Adj N eg. the little bee. NP -+ Det N PP ie., NP D Det N P NP ie., NP -+ Det N P Det N ego the bee in the bin i.e. the determiner followed by adjective and a noun construct a noun phase or the determiner followed by a noun and a prepositional phrase construct a noun phase or the determiner followed a noun construct a noun phase or a simple noun may construct a noun phase. Rule created for Possessives (POSS) is: Adj -+ N's (2) i.e., a noun followed by an apostrophe is an adjective. ego Baby's. Rule created for prepositional phrase (PP) is: PP -+ P NP (3) i.e. a preposition followed by a noun phrase constructs a prepositional phrase. Eg, in the bin. Rules created for Verb Phrase (VP) are: VP -+ (Adj) V (NP/PP) (PP/NP) (Adj) (4) i.e., VP -+ V ego Ran VP -+ V NP ego Did the work VP -+ Adv V NP ego Eagerly seen the match VP -+ V NP PP ego Got a pen in the class VP -+ V PP Adv ego Fell into the well slowly That is, a verb followed by a noun phrase and a prepositional phrase constructs a verb phrase or a verb followed by a prepositional phrase and a noun phrase constructs a verb phrase or a verb followed by a noun phrase constructs a verb phrase or a verb followed by a prepositional phrase constructs a verb phrase or a verb itself constructs a verb phrase. Rule created for a sentence is a noun phrase followed by a Verb phrase as shown below. S -+ NP VP (5) Eg: Sita (NP) ate the cake (VP). The process of summarization takes palace through four stages. These are: Tokenization, Feature
Identification, Categorization and Summary Generation. These are explained below: Create Tokens for the given input. In this step, the input is broken into three major categories of tokens namely alphabets, white spaces and punctuation symbols. Recognize the feature of the created tokens In this step, the features of alphabet tokens are identified namely as Determiner, Preposition, Noun, Verb, Adjective etc based on the rules defined in the text analyser. Categorize the alpha tokens and summarize it to a sentence. In this step, depending on the rules, the analyser categorizes the tokens into Noun Phrase, Possessive Pass, Prepositional Phrase or Verb Phrase based on its feature and then summarizes them to formulate a sentence.
Fig. 1 Overview of the Automated Text Categorization and Summarization using Rule reduction
IV.
RESULT ANAYLSIS
After studying the different methodology a framework can be created that depicts the comparison of the methodologies. The framework is given below in TABLE 2.
TABLE I: AN OVERVIEW OF THE PRE DEFINED METHODS FOR SUMMARY GENERATION Serial Analysis of the existing methodology No: Regular Rule Reduction Summarist 1 Concept Rule Reduction: SUMMAR It is the concept IST is the in which the concept in author has which the developed new author rules to identify combines the noun, verb, three
IJSRET @ 2013

preposition phase of the sentences.
Architecture
Tokenization + Feature Identification + Categorization + Summary Generation Generate the rules , Tokenize the sentence , Identify the features of the rules according the rules and categorize the tokens according the rules
Modules
Conclusion
The work illustrates the automatic text categorization and summarization.
smaller concepts to create a system that creates the summary with an ease with relevant accuracy. Topic Identificati on + Interpretati on + Summary Generation Tokenizer, Part of speech tagger , Converter , morpher , phraser, token frequency counter, weight calculator , query relevance calculator This work creates a system SUMMAR IST that is capable of producing not only the summary but the summary is also relevantly abstract.
approaches have been attempted, depending on the application. Usually, abstractive summarization requires heavy machinery for language generation and is difficult to replicate or extend to broader domains. In contrast, simple extraction of sentences has produced satisfactory results in large-scale applications, especially in multidocument summarization. This survey emphasizes approaches to summarization methods and systems. A distinction steps in respective summarization system has been briefed in this survey. Since a lot of interesting work is being done far from the mainstream research in this field, the paper briefs on some methods that are relevant to future research. Even on focusing on small detail related to a general summarization process and not on building an entire summarization system, a research work can be created. Now some recent trends in automatic evaluation of summarization systems have been surveyed. Automatic text summarization is the technique, where a computer summarizes a text. A text is entered into the computer and a summarized text is returned, which is a non redundant extract from the original text. The technique has its roots in the 60's and has been developed during 30 years, but today with the Internet and the WWW the technique has become more important
VI.
ACCKNOLEDGEMENT
The authors express their thanks to, Rajiv Gandhi Proudyogiki Vishwavidhyalaya, for the motivation and encouragement given to complete this work successfully
REFERENCES
[1] [2] K.R Chowdhary, Natural Language Processing, 2nd ed.,April 29 , 2012. R. Feldman and I. Dagan, " Kdt - knowledge discovery in texts," In Proc. of the First Int. Conf. on Knowledge Discovery (KDD), 1995, pp. Ted Briscoe, Introduction to Linguistics for Natural Language Processing, Computer Laboratory University of Cambridge, October 4, 2011. Eduard Hovy , Text Summarization M. A. Hearst. Untangling text data mining. In Proceedings of the 37th Meeting of the Association for Computational Linguistics
V.
CONCLUSION
[3]
The growth rate of information on the World Wide Web has called for a need to develop accurate summarization systems. Although research on summarization is been going since 1950s and 60s.Howerver the attention has drifted from summarizing scientific articles to news articles, electronic mail messages, advertisements, and blogs. To create the summarization both abstractive and extractive
[4] [5]
IJSRET @ 2013

[6]
[7]
[8]
[9]
[10]
[11]
(ACL-99), pages 310, College Park, MD, June 1999.Annual Christopher D.Manning, Prabhar Raghavan , Hinrich Schutze An Introduction to Information Retrieval Cambridge University Press Cambridge , England , Online edition (c) 2009. Eduard Hovy and Chin-Yew Lin ,Automated Text Summarization in SUMMARIST In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. MIT Press. PDF written by mMembers of The National Center for text mining and produced ND EDITED BY Judy Redfearn and the JISC Communications team C.Lakshmi Devasena1 and M.Hemalatha2 , Automatic Text Categorization and Summarization using Rule Reduction,1Research Scholar ,2Professor , Department of Computer Science, IEEE International Conference On Advances In Engineering, Science And Management (ICAESM -2012) Karpagam University , Coimbatore , India,March 30,31,2012 Dipan Das AndreF.T.Marins A Survey on Automatic Text Summarization Language Technologies Institute Carnegie Mellon University, November 21,2007 Elena Lloret ,Text Summarizatoin : An Overview, Dept lenguajes y Sistemas Informaticos Universidad De Alicante , Spain. (TIN2006-15265-C06-01)
IJSRET @ 2013

A Survey On Automated Text Summarization Methodologies

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

A Survey On Automated Text Summarization Methodologies

Загружено:

Авторское право:

Доступные форматы

International Journal of Scientific Research Engineering & Technology (IJSRET)

Volume 2 Issue 8 pp 512-517 November 2013 www.ijsret.org ISSN 2278 0882