Contents
1 Introduction
2 Motivation
3 Previous Works
3.1 Bag of Words Model
3.2 Naive Bayesian Classifier
3.3 Support Vector Machine
4 Implementation Details
4.1 Approach
4.1.1 Bag of Words Model
4.1.2 Other features specific to short text messages like tweets
4.1.3 Support Vector Machines
4.1.4 Support Vector Machine + Bag of Words Model
4.2 Data
6 Future Work
1 Introduction
In the past decade, new forms of communication, such as microblogging and text messaging, have emerged and become ubiquitous. While there is no limit to the range of
information conveyed by tweets and texts, often these short messages are used to share
opinions and sentiments that people have about what is going on in the world around
them. We have worked on the Message Polarity Classification task, which was a part of the SemEval 2013 challenge: given a message, decide whether it expresses a positive, negative, or neutral sentiment.
2 Motivation
Working with these informal text genres presents challenges for natural language process-
ing beyond those typically encountered when working with more traditional text genres,
such as newswire data. Tweets and texts are short: a sentence or a headline rather than
a document. The language used is very informal, with creative spelling and punctuation,
misspellings, slang, new words, URLs, and genre-specific terminology and abbreviations,
such as RT for "re-tweet" and #hashtags, which are a type of tagging for Twitter messages. How to handle such challenges so as to automatically mine and understand the
opinions and sentiments that people are communicating has only very recently been the
subject of research (Jansen et al., 2009; Barbosa and Feng, 2010; Bifet and Frank, 2010; Davidov et al., 2010; O'Connor et al., 2010; Pak and Paroubek, 2010; Tumasjan et al., 2010; Kouloumpis et al., 2011).
Another aspect of social media data such as Twitter messages is that it includes rich
structured information about the individuals involved in the communication. For example, Twitter maintains information about who follows whom, and re-tweets and tags inside tweets provide discourse information. Modelling such structured information is important
because: (i) it can lead to more accurate tools for extracting semantic information, and
(ii) because it provides means for empirically studying properties of social interactions
(e.g., we can study properties of persuasive language or what properties are associated
with influential users).
3 Previous Works
3.1 Bag of Words Model
The bag-of-words model is a simplifying representation used in natural language pro-
cessing and information retrieval (IR). In this model, a text (such as a sentence or a
document) is represented as an unordered collection of words, disregarding grammar and
even word order. The bag-of-words model is commonly used in methods of document
classification, where the (frequency of) occurrence of each word is used as a feature for
training a classifier. Most commonly, a word list is used in which each word has been assigned a sentiment score. The overall polarity (or sentiment strength) of a text is then determined by aggregating the polarities of all the words it contains. In SemEval 2013, Team IITB used a Bag of Words model with discourse information and achieved an accuracy of 39.80%.
4 Implementation Details
4.1 Approach
4.1.1 Bag of Words Model
We have already given a brief description of the Bag of Words model as used in previous methods for sentiment analysis. We implemented it with some additional features in order to improve accuracy. In general, each word in the list is given a polarity from the set {−1, 0, 1}. We instead used two lists for this purpose: one contained the most common words, given polarities from the set {−4, −3, −2, −1, 0, 1, 2, 3, 4}, and the other, containing the less common words, was marked with polarities from the set {−1, 0, 1}. With this simple change, we achieved an accuracy of around 42%.
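As an illustration, this two-list scoring scheme can be sketched as follows (the word lists and their scores here are toy placeholders, not our actual lexicons):

```python
# Two-lexicon bag-of-words scoring: common words carry fine-grained
# scores in {-4..4}, rarer words coarse scores in {-1, 0, 1}.
COMMON = {"great": 3, "love": 3, "awful": -4, "boring": -2}  # illustrative
RARE = {"meh": -1, "yay": 1}                                 # illustrative

def tweet_polarity(tweet):
    """Aggregate word scores and report overall polarity."""
    score = 0
    for word in tweet.lower().split():
        if word in COMMON:
            score += COMMON[word]
        elif word in RARE:
            score += RARE[word]
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```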
Discourse relations: We have also considered the use of discourse relations, as first suggested by Mukherjee and Bhattacharyya [1]. This is one factor that significantly improves accuracy. Consider this example: "I didn't think I would like the movie but it turned out to be great." A naive Bag of Words model would find the overall sentiment to be neutral, but this is not the case, because of the presence of "would" and "but". Yet another factor taken into consideration was assigning double weight to polarity words in sentences occurring later in a tweet, but this turned out to make no overall difference in accuracy.
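One simple way to operationalize the effect of "but" is to give words after such a conjunction extra weight, so that the clause following it dominates the overall polarity. The conjunction set, lexicon, and weight below are illustrative assumptions, not the exact scheme of [1]:

```python
CONJUNCTIONS = {"but", "however", "yet"}     # illustrative assumption
SCORES = {"like": 2, "great": 3, "bad": -2}  # illustrative lexicon

def discourse_score(tweet, after_weight=2.0):
    """Score a tweet; words after a 'but'-style conjunction count double,
    so the later clause dominates the aggregate."""
    weight = 1.0
    score = 0.0
    for word in tweet.lower().split():
        if word in CONJUNCTIONS:
            weight = after_weight  # boost everything after the conjunction
            continue
        score += weight * SCORES.get(word, 0)
    return score
```

With these toy scores, "bad but great" comes out positive while "great but bad" comes out negative, matching the intuition that the clause after "but" carries the intended sentiment.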
Hash tags: We also incorporated the effect of #hashtags. We first segmented the text of each #hashtag at word boundaries and applied the Bag of Words model to the result. If the hashtags yielded a clear sentiment, we reported the aggregate polarity of the hashtags for that tweet; only if no clear sentiment appeared did we proceed with the rest of the pipeline.
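Word-boundary segmentation of a hashtag can be sketched as a small dynamic program over a known vocabulary (the vocabulary here is a toy placeholder):

```python
VOCAB = {"so", "happy", "not", "good", "a"}  # toy vocabulary

def segment(hashtag):
    """Split e.g. '#sohappy' into ['so', 'happy'] via dynamic programming.
    Returns None if the tag cannot be fully segmented."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)  # best[i] = a segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in VOCAB:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]
```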
Polarity boosters: We also incorporated the effect of modifiers like "very", "too",
etc.
Normalize and correct spelling: Tweets often contain important words that are intentionally misspelt to convey emotion; for example, happy might be spelt as haaaaaappyyyyy. So, whenever a letter occurs more than once consecutively, it is counted only once.
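This repeated-letter rule is a one-line regular expression. Note that, as stated, it squeezes canonical spellings too ("happy" and "haaaaaappyyyyy" both become "hapy"), so lexicon entries must be squeezed the same way before lookup:

```python
import re

def squeeze_repeats(word):
    """Collapse consecutive repeated letters: 'haaaaaappyyyyy' -> 'hapy'."""
    return re.sub(r"(.)\1+", r"\1", word)
```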
Considering negation: For every tweet, the polarity of a word is reversed if it occurs near a negation word. All of this together helped achieve an accuracy of around 56%.
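A minimal sketch of this negation handling, assuming a fixed window of preceding tokens (the window size, negation list, and scores below are illustrative):

```python
NEGATIONS = {"not", "no", "never", "didn't", "don't"}  # illustrative
SCORES = {"good": 2, "bad": -2}                        # illustrative

def negated_score(tweet, window=3):
    """Sum word polarities, reversing a word's polarity if a negation
    word occurs within `window` preceding tokens."""
    tokens = tweet.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        pol = SCORES.get(tok, 0)
        if pol and any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
            pol = -pol
        score += pol
    return score
```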
• Most frequent words: The top 1000 words in the training set are separated out. The presence/absence of each of these words counts as a feature. This constitutes 1000 features.
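These presence/absence features can be sketched as follows (toy data; in our setup k = 1000):

```python
from collections import Counter

def top_word_features(training_tweets, test_tweet, k=1000):
    """Binary presence/absence features over the k most frequent
    training-set words."""
    counts = Counter(w for t in training_tweets for w in t.lower().split())
    vocab = [w for w, _ in counts.most_common(k)]
    present = set(test_tweet.lower().split())
    return [1 if w in present else 0 for w in vocab]
```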
4.2 Data
We have used data from the following sources in our project:
• SemEval 2013 provided around 10,000 labelled tweets for the "Message Polarity Classification" problem.
http://www.cs.york.ac.uk/semeval-2013/task2/index.php?id=data
• Lists of emoticons, emotions, negating words, and booster words (like very, too,
etc) are taken from http://sentistrength.wlv.ac.uk/.
We achieved an accuracy of 68.36% while training on only around 9000 tweets and testing on 1100 tweets. The only team in SemEval 2013 with better performance achieved an accuracy of 69.02%, while using 1.6 million tweets for training. Our method thus achieves good accuracy with a relatively small amount of training data.
6 Future Work
• We have covered most of the relevant features in our classification, but we did not include the effect of the following features on classification accuracy:
– Taking care of emotions conveyed by abbreviations
– Analysing whether subsequent sentences in a tweet are more important (e.g., giving greater weight to the 2nd line of a two-line tweet)
• Although it was clear from work done by others on the same problem that SVMs tend to perform better than other classifiers, it would be interesting to see how a hybrid of other classifiers (such as the Naive Bayes classifier) with an SVM would perform. (In our work we tried a hybrid of Bag of Words with SVM, which improved the accuracy.)
References
[1] S. Mukherjee and P. Bhattacharyya. Sentiment analysis in Twitter with lightweight
discourse analysis, December 2012.