
Abstractive Text Summarization of Multimedia News Content Using RNN

Vishal Pawar
PG Student, Computer Engineering
Vishwakarma Institute of Information Technology
vishal.17p007@viit.ac.in

Prof. Manisha Mali
Assistant Professor, Computer Engineering
Vishwakarma Institute of Information Technology
manisha.mali@viit.ac.in

Abstract - Abstractive summarization is a core NLP technique that aims to condense a source text into a shorter version. The rapid growth of all kinds of data on the internet calls for abstractive summarization over asynchronous collections of text, image, audio and video. We propose an abstractive summarization approach that combines techniques from NLP, speech processing, computer vision and recurrent neural networks to exploit the rich information contained in every modality and to improve the quality of multimedia news summarization. The main idea is to bridge the semantic gaps between multimodal content. Audio and visual are the dominant modalities in video. For audio data, we design an approach to selectively use the transcription and to infer the salience of the transcription from the audio signals. For visual data, we learn joint representations of text and pictures using a computer-vision technique. Previous research on text summarization has focused mainly on extractive methods. In this work, we put forward an abstractive summarization method built on a sequence-to-sequence architecture. Finally, all the multimodal aspects are combined to generate a textual summary by maximizing salience, non-redundancy, readability and coverage through the joint optimization of submodular functions.

Keywords: Summarization, Multimedia, RNN, NLP, Sequence-to-Sequence

Introduction
Text summarization plays an essential role in daily life and has been studied for a long time. With the advent of the information age and the development of multimedia technology, multimedia data (including text, image, audio and video) have grown enormously. Multimedia data have profoundly changed the way people live and make it difficult for them to obtain important information efficiently. Most summarization systems rely on NLP alone; the opportunity to jointly improve summary quality with the aid of automatic speech recognition (ASR) and computer vision (CV) techniques is usually overlooked.

Abstractive text summarization aims to generate a short summary of a complete article that covers all the important information. Summarization techniques are mainly divided into extractive and abstractive methods. Extractive methods construct a summary by extracting the salient words, phrases, or sentences from the source text
itself. Abstractive methods, on the other hand, produce a summary that resembles a human-written abstract by concisely paraphrasing the source content. That is, the former ensures the grammatical and semantic correctness of the generated summaries, while the latter creates more diverse and novel content. In this paper, we focus on abstractive text summarization.

The fast advancement of deep learning has encouraged many sequence-to-sequence models that project an input sequence into another output sequence. These approaches have been successful in applications such as speech recognition [3], video captioning [4] and machine translation [5]. Unlike these tasks, in text summarization the output sequence (the summary) is much shorter than the input sequence (the document). To realize contextual summarization, [6] proposes an attentional sequence-to-sequence model based on RNNs that projects the original document into low-dimensional embeddings. However, RNNs tend to be inefficient to train, since each step depends on the previous one. We likewise adopt a sequence-to-sequence model based on RNNs to create the representations of the source texts.

Related Work
Text summarization aims to extract the important information from source documents. With the growth of multimedia data on the web, several researchers (Shah et al., 2016; Li et al., 2017) have recently turned to multimodal summarization. Existing studies (Li et al., 2017, 2018a) have shown that, compared with text-only summarization, multimodal summarization can improve the quality of the generated abstract by using information in the visual modality. However, the output of existing multimodal summarization systems is usually expressed in a single modality, for instance textual or visual (Li et al., 2017; Evangelopoulos et al., 2013).

The majority of work in the past decade has focused on extractive summarization [18]-[26], where a summary consists of key words or sentences taken from the source text (article). Unlike extractive methods, which copy units from the source article directly, abstractive summarization uses human-readable language to summarize the key information of the original text. Therefore, abstractive approaches can produce much more diverse and richer summaries. The abstractive summarization task was standardized by the DUC2003 and DUC2004 competitions [27], from which a series of notable non-neural methods emerged, e.g., the best performer, the TOPIARY system [28].

Deep learning has been developing fast recently, and it can handle many NLP tasks [20], so researchers have begun to consider it an effective, completely data-driven alternative for text summarization. Reference [21] used convolutional encoders for the original text and an attentional feed-forward neural network to generate summaries, but it did not use hierarchical CNNs and was therefore less effective. Reference [22] extends [23] by replacing the decoder with an RNN. Reference [24] introduced a large Chinese dataset for short text summarization and used an RNN-based seq2seq model. Reference [25] used a similar seq2seq model with an RNN encoder and decoder on English corpora, reporting state-of-the-art results on the DUC and Gigaword datasets.
Reference [26] used generative adversarial networks to generate abstractive summaries, together with a time-decay attention mechanism. Reference [27] built on BERT to exploit a pre-trained language model in the seq2seq framework, and designed a two-stage decoder process that considers the two-sided context of every word in a summary. We denote these two methods as GAN and BERT respectively, and we compare our model with them in detail in the experiments. Reference [28] proposed a two-stage sentence-selection model based on clustering and optimization techniques. Reference [29] proposed a text summarization model based on an LSTM-CNN framework that constructs summaries by exploring semantic phrases. Reference [30] presented an automatic text-abstraction process based on fuzzy rules over a variety of extracted features to find the most important information in the source text.

Multi-document Summarization
Multi-document summarization (MDS) tries to extract the important information from a set of documents related to an event and produce a much shorter overview. MDS can be abstractive or extractive. Extractive models use various linguistic features, for instance sentence position [17], [18] and tf*idf [19], to identify the most salient sentences in a set of documents. Graph-based methods [20] are widely used in extractive MDS models. Finally, the top-ranked sentences are selected to build the summary.

Multi-modal Summarization
Much work has been done on summarizing meeting recordings, sports videos, movies, pictorial storylines and social multimedia. Erol et al. [2] aim to detect the important segments of a meeting recording based on an analysis of audio, text and visual activity. Tjondronegoro et al. [4] propose a technique to summarize a sports game by integrating the textual data extracted from multiple sources and recognizing the important content. Li et al. summarize news pictures with text and visualize text with pictures.

A news story and a picture are then selected to represent each topic. For social-network summarization, Fabro et al. [10] and Schinas et al. [12] propose to outline real-world events based on multimedia content. A multimodal LDA identifies topics by capturing the connections between the text and image features of microblogs with embedded pictures. The output of their method is a set of representative pictures that describe the events.

Problem Definition
Current applications related to text summarization include meeting summarization, sports-video summarization, movie summarization, pictorial-storyline summarization, timeline summarization and social multimedia summarization. Past studies on these topics predominantly focus on summarizing synchronous multimodal content. Pictorial storylines comprise a set of pictures with text descriptions. None of these applications focuses on summarizing multimedia data that contain asynchronous information about events.

Implementation Model Overview
There are several essential aspects in producing a good textual summary for multi-modal data. The salient content in the documents should be retained,
and the key facts in the videos and pictures should be covered. Further, the summary should be readable and non-redundant, and should obey a fixed length constraint. All of these aspects can be jointly optimized by the weighted combination of submodular functions:

max_{S ⊆ T} { F(S) : Σ_{s∈S} l_s ≤ L }

Above, T is the set of candidate sentences, S is the summary, l_s is the length of sentence s in words, L is the budget, i.e., the length constraint on the summary, and the submodular function F(S) is the weighted sum of the scores corresponding to the aforementioned aspects. Text is the primary modality of documents, and occasionally pictures are embedded in the documents. Videos involve at least two kinds of modalities: audio and visual. Next, we present the processing strategies for the different modalities.

Audio, i.e., speech, can be automatically converted into text by an ASR system. For the visual modality, which is essentially a sequence of pictures (frames), most adjacent frames carry redundant information, so we first extract the most significant frames, i.e., key frames. We learn joint representations for the textual and visual modalities and can then identify the sentence that is relevant to a picture. In this way, we can ensure that the generated summary covers the visual data.

This text output is then fed to our trained RNN model.
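The budgeted objective above (maximize F(S) subject to the length budget L) is typically optimized greedily, since F is submodular. The following sketch illustrates the idea with an invented word-coverage function standing in for the paper's combined score F; sentences are plain strings and l_s is measured in words:

```python
def greedy_summary(sentences, budget):
    """Greedy budgeted maximization of a (toy) submodular coverage
    function F(S) = number of distinct words covered by S."""
    selected, covered, length = [], set(), 0
    candidates = list(sentences)
    while candidates:
        def gain(s):
            # Marginal coverage gain per word of length (cost-benefit ratio).
            words = set(s.lower().split())
            return len(words - covered) / max(len(s.split()), 1)
        best = max(candidates, key=gain)
        candidates.remove(best)
        cost = len(best.split())
        if length + cost <= budget and gain(best) > 0:
            selected.append(best)
            covered |= set(best.lower().split())
            length += cost
    return selected
```

For monotone submodular F, this cost-benefit greedy strategy carries a constant-factor approximation guarantee (Khuller et al. [22] analyze the budgeted maximum coverage case).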

Architecture Model

Fig1: Implementation Architecture Work Flow
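The key-frame extraction step in the architecture above can be sketched as follows. This is a minimal illustration using gray-level histogram differences between adjacent frames; the frames here are synthetic NumPy arrays, and a real system would decode them from the video (e.g. with OpenCV), a detail the paper does not specify:

```python
import numpy as np

def key_frames(frames, threshold=0.2):
    """Keep a frame only when its gray-level histogram differs enough
    from the last kept frame, discarding redundant adjacent frames."""
    def hist(f):
        h, _ = np.histogram(f, bins=16, range=(0, 256))
        return h / h.sum()
    kept = [0]                       # always keep the first frame
    ref = hist(frames[0])
    for i in range(1, len(frames)):
        h = hist(frames[i])
        if np.abs(h - ref).sum() / 2 > threshold:   # total-variation distance
            kept.append(i)
            ref = h
    return kept
```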


Text-Image Matching Model
The key frames in the videos and the pictures embedded in the documents often capture news highlights that convey the important information the summary should cover. Before assessing the coverage of the pictures, we need a model that bridges the gap between text and pictures. We can address this issue with cross-modal analysis: cross-modal semantic matching can be better explored when the multi-modal data are projected into a joint subspace.

Once the text-image matching model is trained, for each text-image pair (si, pj) in our task we can compute the matching score m(si, pj). We set the threshold to the average matching score of the positive text-image pairs.

Semantic Frame-Level Text-Image Matching
The basic idea is that each verb in a sentence is marked with its propositional arguments, and the labeling for a particular verb is called a "frame". Each frame represents an event, and the arguments express the important information about this event. A set of arguments indicates the semantic role of each term in a frame. An example of frame-semantic parsing is shown in the figure below. The original sentence "President Bush approved federal disaster relief for the affected areas and made plans for an inspection tour of the state" is transformed into two simplified sentences, "President Bush approved federal disaster relief" and "President Bush made plans for an inspection tour of the state". The simplified sentences have less variation in meaning, which benefits text-image matching.
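Assuming sentence and image embeddings have already been projected into the joint subspace by the trained matching model, the matching score m(si, pj) and the positive-pair threshold described above can be sketched as cosine similarity plus an average; the embedding vectors below are placeholders, not outputs of the real model:

```python
import numpy as np

def match_score(s, p):
    """Cosine similarity between a sentence embedding and an image
    embedding that live in the same joint subspace."""
    return float(np.dot(s, p) / (np.linalg.norm(s) * np.linalg.norm(p)))

def matching_threshold(positive_pairs):
    """Threshold = average matching score over known positive pairs."""
    return sum(match_score(s, p) for s, p in positive_pairs) / len(positive_pairs)
```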

Fig2: Example for simplified sentence based on frame-semantic parsing.
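The simplification in Fig2 can be mimicked with a toy data structure: each frame is a verb with its arguments, and each frame is realized as one simplified sentence. The frames below are hand-written for the example sentence; a real system would obtain them from a frame-semantic (semantic-role labeling) parser:

```python
def realize(frames):
    """Turn (subject, verb, object) frames into simplified sentences,
    one sentence per event frame."""
    return [f"{subj} {verb} {obj}" for subj, verb, obj in frames]

# Hand-written frames for the Fig2 example sentence.
frames = [
    ("President Bush", "approved", "federal disaster relief"),
    ("President Bush", "made", "plans for an inspection tour of the state"),
]
```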

Multi-modal Topic Modeling
After the text-image matching model is trained, we obtain the joint representation of text and pictures. Next, we identify the topics of the text and pictures. The motivation for this method is that textual descriptions of pictures often provide important information about semantic aspects (topics), and picture features are consistently associated with semantic topics. Wang et al. build a timeline summarization for tweet streams by recognizing topic evolution. For our task, the multi-modal topic model can reveal the various aspects of text and pictures; we can then select a representative set of text that covers the aspects of the pictures. Topic models such as LDA can jointly learn latent topics and the topic assignments of documents. To reveal the semantic aspects, we build the multi-modal topic model on a neural topic model (NTM). The multimodal topic model computes the conditional probability p(w|d) using the word (or picture)-topic distribution p(w|t) and the topic-document (or video) distribution p(t|d):

p(w|d) = Σ_{i=1}^{T} p(w|t_i) p(t_i|d)

RNN Sequence-to-Sequence Model
All three of the models we implemented relied on an encoder-decoder RNN model. For abstractive summarization, it is common to use either GRU or LSTM cells for the RNN encoder and decoder. We chose LSTM cells for the additional control offered by their memory unit, although many top models use GRU cells for their cheaper computation time (Nallapati et al., 2016).

Encoder-Decoder with Attention Mechanism
Our model is based on the neural machine translation model of Bahdanau et al. (2014). The encoder is a bidirectional LSTM-RNN, while the decoder is a uni-directional LSTM-RNN with the same hidden-state size as the encoder and an attention mechanism over the source hidden states. Furthermore, this method accelerates convergence by concentrating the modeling effort only on the words that are essential to a given example. This strategy is especially suitable for summarization, since a large proportion of the words in the summary come from the source document in any case.

For encoder-decoder neural networks, the attention mechanism allows the creation of a context vector at each timestep, conditioned on the decoder's current hidden state and a subset of the encoder's hidden states. With global attention, the context vector is conditioned on all of the encoder's hidden states, whereas local attention uses a strict subset of the encoder's hidden states.

Preprocessing of the training dataset
We trained our model on sentence-headline pairs from the unannotated English Gigaword corpus, which is commonly used for training abstractive summarization models. During preprocessing, we extracted 700,000 pairs of headlines and first sentences, and we trained our model to predict the headline given the first sentence.

Data Collection and Annotation
We build our dataset as follows. We took the publicly available TOI dataset [36] of news articles in image format. Further, audio clips of All India Radio news are extracted from its official YouTube channel [35]. We use the FFmpeg multimedia framework from Python to extract these audio clips in WAV format. The criteria for assembling the records are: (1) retain the important content of the input documents; (2) avoid redundant information; (3) maintain good readability; (4) respect the length limit as far as possible.
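The context-vector computation described in the attention section above can be sketched in NumPy: score each encoder hidden state against the decoder's current hidden state, softmax the scores, and take the weighted sum. Global attention uses all encoder states, while the local variant restricts to a window. The dot-product scoring used here is one common choice; Bahdanau et al. actually use an additive score:

```python
import numpy as np

def context_vector(dec_state, enc_states, window=None):
    """Attention: weight encoder hidden states (T, H) by softmaxed
    scores against the decoder's current hidden state (H,)."""
    if window is not None:              # local attention: a strict subset
        lo, hi = window
        enc_states = enc_states[lo:hi]
    scores = enc_states @ dec_state     # (T,) dot-product alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over source positions
    return weights @ enc_states, weights   # (H,) context vector, weights
```

At each decoding timestep the context vector is concatenated with the decoder state to predict the next summary word.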
Experimental Studies
Several models are compared in our experiments, including generating summaries from different modalities and using different approaches to exploit pictures.

Text only: This model produces summaries using only the text in the documents.

Audio only: This model produces summaries using only the speech transcriptions from the videos.

Fig3: Seq-to-seq with GloVe embeddings: accuracy and loss.

The following models produce summaries using both the documents and the videos, but exploit pictures in different ways. The salience scores for the text are obtained with the aforementioned techniques.

Implementation Details
We first convert the multimedia news data (audio and images) into text form. For this, a GUI is developed with a user login and an admin login. We also categorized the news articles into categories such as Politics, Education, Movies, Electronics, Fashion and Others. The extracted text data is then given as input to our trained RNN model, which produces an abstractive summary as output.

Fig4: Seq-to-seq accuracy and loss.
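The audio-to-WAV extraction mentioned in the Data Collection section can be sketched by invoking FFmpeg through subprocess. The file names are placeholders, and the flags request mono 16 kHz 16-bit PCM, a common input format for ASR systems:

```python
import subprocess

def extract_wav_cmd(video_path, wav_path):
    """Build the FFmpeg command that strips the video stream (-vn)
    and writes mono 16 kHz 16-bit PCM audio."""
    return ["ffmpeg", "-y", "-i", video_path,
            "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
            wav_path]

def extract_wav(video_path, wav_path):
    # Raises CalledProcessError if FFmpeg fails (e.g. missing input).
    subprocess.run(extract_wav_cmd(video_path, wav_path), check=True)
```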

Experimental Results
Conclusion
This paper addresses an asynchronous abstractive summarization task, namely, how to utilize related text, audio and video data to create a textual summary. In this work, we apply the sequence-to-sequence framework to the task of abstractive summarization with very promising results. Our model is built on an encoder-decoder model with an attention mechanism. As future work, we will concentrate on developing more advanced models that handle the full range of multimedia data to extract more comprehensive summaries.
References
[1] H. Li, J. Zhu, C. Ma, J. Zhang, and C. Zong, "Multi-modal summarization for asynchronous collection of text, image, audio and video," in EMNLP, 2017, pp. 1092-1102.
[2] B. Erol, D.-S. Lee, and J. Hull, "Multimodal summarization of meeting recordings," in ICME, vol. 3. IEEE, 2003, pp. III-25.
[3] R. Gross, M. Bett, H. Yu, X. Zhu, Y. Pan, J. Yang, and A. Waibel, "Towards a multimodal meeting record," in ICME, vol. 3. IEEE, 2000, pp. 1593-1596.
[4] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in ICCV 2015, Santiago, Chile, Dec. 2015, pp. 4534-4542, doi:10.1109/ICCV.2015.515.
[5] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
[6] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in CoNLL 2016, Berlin, Germany, Aug. 2016, pp. 280-290.
[7] D. Tjondronegoro, X. Tao, J. Sasongko, and C. H. Lau, "Multimodal summarization of key events and top players in sports tournament videos," in WACV. IEEE, 2011, pp. 471-478.
[8] D. Wang, T. Li, and M. Ogihara, "Generating pictorial storylines via minimum-weight connected dominating set approximation in multi-view graphs," in AAAI, 2012.
[9] W. Y. Wang, Y. Mehdad, D. R. Radev, and A. Stent, "A low-rank approximation approach to learning joint embeddings of news stories and images for timeline summarization," in NAACL-HLT, 2016, pp. 58-68.
[10] I. Mademlis, A. Tefas, N. Nikolaidis, and I. Pitas, "Multimodal stereoscopic movie summarization conforming to narrative characteristics," IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5828-5840, 2016.
[11] T. Hasan, H. Bořil, A. Sangwan, and J. H. Hansen, "Multi-modal highlight generation for sports videos using an information-theoretic excitability measure," EURASIP Journal on Advances in Signal Processing, vol. 2013, no. 1, p. 173, 2013.
[12] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Rapantzikos, G. Skoumas, and Y. Avrithis, "Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention," IEEE Transactions on Multimedia, vol. 15, no. 7, pp. 1553-1568, 2013.
[13] M. Del Fabro, A. Sobe, and L. Böszörmenyi, "Summarization of real-life events based on community-contributed content," in The Fourth International Conferences on Advances in Multimedia, 2012, pp. 119-126.
[14] J. Bian, Y. Yang, and T.-S. Chua, "Multimedia summarization for trending topics in microblogs," in CIKM. ACM, 2013, pp. 1807-1812.
[15] M. Schinas, S. Papadopoulos, G. Petkos, Y. Kompatsiaris, and P. A. Mitkas, "Multimodal graph-based event detection and summarization in social media streams," in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, pp. 189-192.
[16] J. Bian, Y. Yang, H. Zhang, and T.-S. Chua, "Multimedia summarization for social events in microblog stream," IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 216-228, 2015.
[17] R. R. Shah, A. D. Shaikh, Y. Yu, W. Geng, R. Zimmermann, and G. Wu, "EventBuilder: Real-time multimedia event summarization by visualizing social media," in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, pp. 185-188.
[18] J. L. Neto, A. A. Freitas, and C. A. A. Kaestner, "Automatic text summarization using a machine learning approach," in Proc. 16th Braz. Symp. Artif. Intell. (SBIA), Nov. 2002, pp. 205-215.
[19] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," CoRR, vol. abs/1109.2128, 2011. [Online]. Available: http://arxiv.org/abs/1109.2128
[20] V. Varma et al., "Sentence position revisited: A robust light-weight update summarization 'baseline' algorithm," in International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, 2009, pp. 46-52.
[21] R. R. Shah, Y. Yu, A. Verma, S. Tang, A. D. Shaikh, and R. Zimmermann, "Leveraging multimodal information for event summarization and concept-level sentiment analysis," Knowledge-Based Systems, vol. 108, pp. 102-109, 2016.
[22] S. Khuller, A. Moss, and J. S. Naor, "The budgeted maximum coverage problem," Information Processing Letters, vol. 70, no. 1, pp. 39-45, 1999.
[23] K. Wong, M. Wu, and W. Li, "Extractive summarization using supervised and semi-supervised learning," in Proc. 22nd Int. Conf. Comput.
[24] Y. Ouyang, W. Li, Q. Lu, and R. Zhang, "A study on position information in document summarization," in COLING, 2010, pp. 919-927.
[25] D. R. Radev, H. Jing, M. Styś, and D. Tam, "Centroid-based summarization of multiple documents," Information Processing & Management, vol. 40, no. 6, pp. 919-938, 2004.
[26] Z. Yong, J. E. Meng, Z. Rui, and M. Pratama, "Multiview convolutional neural networks for multidocument extractive summarization," IEEE Trans. Cybern., vol. 47, no. 10, pp. 3230-3242, Oct. 2016.
[27] K. Filippova and Y. Altun, "Overcoming the lack of parallel data in sentence compression," in Proc. Conf. Empir. Methods Nat. Lang. Process. (EMNLP), Oct. 2013, pp. 1481-1491.
[28] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proc. 27th Annu. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2013, pp. 3111-3119.
[29] C. A. Colmenares, M. Litvak, A. Mantrach, and F. Silvestri, "HEADS: Headline generation as sequence prediction using an abstract feature-rich space," in Proc. NAACL HLT, Denver, CO, USA, May/Jun. 2015, pp. 133-142.
[30] J. Cheng and M. Lapata, "Neural summarization by extracting sentences and words," in Proc. 54th Annu. Meeting Assoc. Comput. Linguist. (ACL), vol. 1, Berlin, Germany, Aug. 2016, pp. 484-494.
[31] X. Wan and J. Yang, "Improved affinity graph based multi-document summarization," in NAACL, 2006, pp. 181-184.
[32] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Qiqihar Junior Teachers College, vol. 22, 2011.
[33] X. Zhou, X. Wan, and J. Xiao, "CMiner: Opinion extraction and summarization for Chinese microblogs," IEEE Transactions on Knowledge & Data Engineering, vol. 28, no. 7, pp. 1650-1663, 2016.
[34] X. Li, L. Du, and Y. D. Shen, "Update summarization via graph-based sentence ranking," IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 5, pp. 1162-1174, 2013.
[35] R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents," in Proc. 31st AAAI Conf. Artif. Intell., San Francisco, CA, USA, Feb. 2017, pp. 3075-3081.
[36] R. Mihalcea and P. Tarau, "TextRank: Bringing order into texts," in ACL, 2004.
[37] https://epaper.timesgroup.com/Olive/ODN/TimesOfIndia/#
[38] http://epaper.indianexpress.com/
[39] http://paper.hindustantimes.com/epaper/viewer.aspx
