Reviews
Abstract
With the rapid advancement of web technology, a huge amount of data is available on the
web to internet users. Much of this data comes from social media, where millions of people
express their thoughts and views in their daily interactions, including their sentiments or
opinions about a particular thing. For e-commerce shoppers in particular, customer reviews
are a reliable source of information. A customer usually searches for online product
reviews while evaluating alternative products. E-commerce websites let customers write
product reviews and score the product from 1 to 5, commonly referred to as a star rating.
These data are very useful for companies seeking to improve the customer experience. By
analysing customer feedback and drawing insights from it, companies have better
information for making strategic decisions and a more accurate understanding of what the
customer actually wants, which results in a better experience for everyone.
To automate the analysis of such data, sentiment analysis is used. Sentiment analysis is a
rapidly emerging research area within Natural Language Processing (NLP) [1]. It depends on
our ability to identify the sentiment terms in a corpus and their orientation [2]. This
machine learning tool can provide insights by automatically analysing product reviews and
sorting them into tags: Positive, Neutral, and Negative [3].
Essentially, there are two different approaches to extracting sentiment automatically. The
first is the classification approach, a supervised method that involves building a
classifier from labelled instances of texts or sentences. The second is the lexicon-based
approach, which derives the orientation of a document from the semantics of its words or
phrases and is more of an unsupervised method [4]. Much work has already been done using
different machine learning techniques. In this work, we propose a novel lexicon-based,
unsupervised model that differs from existing models in the way it aggregates the
sentiment values of positive and negative words within a message.
In this paper we attempt to find out whether there is any difference between the sentiment
of product reviews and the product ratings given by customers on the Amazon website. We
use text mining to summarize users’ reviews and extract the sentiments of the review
writers. We then use our sentiment lexicons to mark up all sentiment words and associated
entities in our corpus. We use a dictionary of words annotated with each word’s semantic
orientation, or polarity, from which we derive the orientation of the review. We treat
ratings of 4 and 5 as positive sentiment and ratings of 1 to 3 as negative sentiment. We
then compare the ratings with the sentiments of the product reviews provided by the
customers to find out whether they are associated with each other. Consequently, a more
comprehensive analysis of the sentiment can be undertaken than a
positive-negative-neutral classification allows.
Key Words: Sentiment analysis; Natural Language Processing; Text Mining; Corpus;
Lexicon.
1. Introduction
Sentiment analysis is a line of research that makes it possible to determine people’s
attitudes and opinions in relation to different topics, products, services, events, and
their attributes.
The role of sentiment analysis has been growing significantly with the rapid spread of
social networks, microblogging applications and forums. Today, almost every web page has a
section where users can leave comments about products or services and share them with
friends on Facebook, Twitter or Pinterest, something that was not possible just a few
years ago. Mining this volume of opinions provides information for understanding
collective human behaviour. A growing body of evidence suggests that analysing the
sentiment of social-media content may make it possible to predict the size of markets, the
results of marketing campaigns and marketing ROI.
2. Research Methodology
The two main methods of sentiment analysis, the lexicon-based method and the machine
learning-based approach, both rely on the bag-of-words representation. In the supervised
machine learning method, the classifiers use unigrams as features. In the lexicon-based
method, the unigrams found in the lexicon are assigned a polarity score, and the overall
polarity score of the text is then computed as the sum of the polarities of the unigrams.
In recent years, more advanced algorithms for sentiment analysis have been developed that
take into consideration not only the message itself but also the context in which the
message is published: who the author of the message is, who the author’s friends are, and
what the underlying structure of the network is.
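The lexicon-based summation described above can be written compactly. The notation is introduced here for illustration and is not taken from the original work: a text is the token sequence $w_1, \dots, w_n$, $L$ is the lexicon, and $p(w)$ is the polarity score the lexicon assigns to unigram $w$. Mapping the sign of the total to a label is an assumed convention.

```latex
S(w_1, \dots, w_n) = \sum_{i=1}^{n} \pi(w_i),
\qquad
\pi(w) =
\begin{cases}
p(w) & \text{if } w \in L,\\
0    & \text{otherwise,}
\end{cases}
\qquad
\text{label} =
\begin{cases}
\text{positive} & \text{if } S > 0,\\
\text{negative} & \text{if } S < 0.
\end{cases}
```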
The lexicon-based model follows these steps to extract the sentiment of the texts:
2.1.1 Data Collection
For our work, we collected data from the online reviews of the “OnePlus 7 Pro” phone
available through Amazon.com. Review data on Amazon.com is provided through the product’s
page, along with the general product review and the star rating provided by the customers.
We retrieved the pages containing all customer reviews for the “OnePlus 7 Pro” mobile
phone. Our first criterion for selecting this particular product was that it had a
relatively large number of product reviews compared with other products in that category.
We then scraped the data using the ParseHub tool, available online, to get the desired
dataset.
Figure 1. Flow Diagram of the Work to be Performed
For our work, we obtained the posted reviews for the “OnePlus 7 Pro” mobile phone; a total
of 2000 reviews were collected along with the respective star ratings given by the
customers who bought this phone. Each web page containing the set of reviews for a
particular product was parsed to remove the HTML formatting from the text and then
transformed into an XML file that separated the data into records (the reviews) and fields
(the data in each review).
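A minimal sketch of this HTML-to-XML step, using only the Python standard library. The element names (`reviews`, `review`) and the field names in the sample record are hypothetical; the original work does not specify its XML schema, and the regex-based tag stripping is a simplification that is only adequate for small, well-formed fragments.

```python
import html
import re
import xml.etree.ElementTree as ET


def strip_html(fragment: str) -> str:
    """Remove tags, decode entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(text).split())


def reviews_to_xml(reviews: list[dict]) -> str:
    """Serialize each review as a <review> record with one child per field."""
    root = ET.Element("reviews")
    for review in reviews:
        record = ET.SubElement(root, "review")
        for field, value in review.items():
            ET.SubElement(record, field).text = strip_html(str(value))
    return ET.tostring(root, encoding="unicode")


# Hypothetical scraped record; field names are illustrative only.
sample = [
    {
        "rating": 5,
        "helpful_votes": 3,
        "text": "<p>Great&nbsp;phone, <b>fast</b> charging.</p>",
    }
]
xml_text = reviews_to_xml(sample)
print(xml_text)
```

For production scraping, a real HTML parser would be a safer choice than a regular expression.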
We excluded from the analysis any review that had received no votes on whether it was
helpful.
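The exclusion rule, together with the rating categories stated in the abstract (4 and 5 as positive, 1 to 3 as negative), could be sketched as follows. The dictionary keys `helpful_votes` and `rating` are assumed names for illustration, not fields confirmed by the original dataset.

```python
def label_from_rating(stars: int) -> str:
    """Map a 1-5 star rating to the two rating categories."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    return "Positive" if stars >= 4 else "Negative"


def prepare(reviews):
    """Drop reviews nobody voted on and attach a rating category to the rest."""
    kept = []
    for review in reviews:
        if review["helpful_votes"] == 0:  # assumed field: nobody voted
            continue
        kept.append({**review, "label": label_from_rating(review["rating"])})
    return kept


# Hypothetical records for illustration.
raw = [
    {"text": "Superb display", "rating": 5, "helpful_votes": 4},
    {"text": "Average battery", "rating": 3, "helpful_votes": 1},
    {"text": "Nice", "rating": 4, "helpful_votes": 0},
]
prepared = prepare(raw)
print([r["label"] for r in prepared])
```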
Table 1. Reviews with their star ratings and the rating category each falls into,
“Positive” or “Negative”
We perform pre-processing steps before the actual methods of sentiment analysis are
applied. A typical pre-processing procedure includes the following steps:
Text: Data Science is Fun.
Tokens: “Data”, “Science”, “is”, “Fun”
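The text-to-tokens example above can be reproduced with a short tokenizer. This is a minimal sketch that assumes English text and treats runs of letters, digits and apostrophes as tokens; the original work does not state which tokenizer was used.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)


tokens = tokenize("Data Science is Fun.")
print(tokens)  # ['Data', 'Science', 'is', 'Fun']
```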
A bag-of-words model, or BOW for short, is a way of extracting features from text for use in
modelling, such as with machine learning algorithms. The approach is very simple and
flexible, and can be used in a myriad of ways for extracting features from
documents.
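As an illustration, a bag-of-words representation can be built with word counts over a shared vocabulary. The sketch below uses only the standard library; the two toy documents are invented for the example and lowercasing is an assumed normalization choice.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each (lowercased) token occurs; order is discarded."""
    return Counter(token.lower() for token in tokens)


def vectorize(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over the vocabulary."""
    return [bag[word] for word in vocabulary]


# Two toy tokenized reviews (illustrative only).
docs = [["Good", "phone", "good", "camera"], ["Bad", "battery"]]
bags = [bag_of_words(doc) for doc in docs]
vocabulary = sorted(set().union(*bags))
vectors = [vectorize(bag, vocabulary) for bag in bags]

print(vocabulary)  # ['bad', 'battery', 'camera', 'good', 'phone']
print(vectors)     # [[0, 0, 1, 2, 1], [1, 1, 0, 0, 0]]
```

Each vector can then serve as the feature input to a classifier, or the bag itself can be looked up word by word against a lexicon.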
4. Lexicon-based classification
Application of a lexicon is one of the two main approaches to sentiment analysis, and it
involves calculating the sentiment from the semantic orientation of the words or phrases
that occur in a text. In the unsupervised technique, classification is done by comparing
the features of a given text against sentiment lexicons whose sentiment values are
determined prior to their use. A sentiment lexicon contains lists of words and expressions
used to express people’s subjective feelings and opinions. This approach requires a
dictionary of positive and negative words, with a positive or negative sentiment value
assigned to each word. Generally speaking, in lexicon-based approaches a text message is
represented as a bag of words. Following this representation, sentiment values from the
dictionary are assigned to all positive and negative words or phrases within the message.
A combining function, such as sum or average, is applied in order to make the final
prediction regarding the overall sentiment of the message. Apart from the sentiment value,
the local context of a word, such as negation or intensification, is usually taken into
consideration.
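A minimal sketch of this generic scheme follows: dictionary lookup, local-context handling for negation and intensification, and a sum or average as the combining function. The lexicon entries, modifier lists, weights and the two-token context window are all illustrative assumptions. This is not the aggregation model proposed in this work, whose details are not given in this section.

```python
# Toy lexicon and modifier lists; every value here is an assumption.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}


def score_tokens(tokens, combine="sum"):
    """Score a tokenized message with the lexicon and local context."""
    values = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word not in LEXICON:
            continue
        value = LEXICON[word]
        # Inspect up to two preceding tokens for negation or intensification.
        for prev in tokens[max(0, i - 2):i]:
            prev = prev.lower()
            if prev in INTENSIFIERS:
                value *= INTENSIFIERS[prev]
            elif prev in NEGATORS:
                value = -value
        values.append(value)
    if not values:
        return 0.0
    total = sum(values)
    return total if combine == "sum" else total / len(values)


def classify(tokens):
    """Map the combined score to a sentiment tag."""
    score = score_tokens(tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


print(classify(["The", "camera", "is", "not", "good"]))  # Negative
print(classify(["very", "good", "battery"]))             # Positive
```

A fixed window is a crude stand-in for real negation scope detection; it is used here only to keep the example short.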
Figure 6. The result obtained from the KNN algorithm, where the accuracy achieved was 63%.
Random forests, formally proposed in 2001 by Leo Breiman and Adèle Cutler, are a machine
learning technique. The algorithm combines the concepts of random subspaces and “bagging”:
it trains multiple decision trees on slightly different subsets of the data. The random
forest algorithm is regarded as one of the best classification algorithms, able to
classify large amounts of data accurately. It is an ensemble learning method for
classification and regression that constructs a number of decision trees at training time,
each using randomly selected features, and predicts the class of a test instance by
voting, that is, by taking the mode of the classes output by the individual trees.
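The three ingredients named above (bagging, random feature subsets, and voting) can be sketched in isolation. This is not a full random forest: training the individual decision trees is omitted, and the tree predictions passed to the vote are hypothetical. All function names are introduced here for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Random-subspace step: pick k distinct feature indices."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    """Voting step: return the mode of the classes predicted by the trees.

    Ties are broken in favour of the class encountered first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [
    ("good phone", "Positive"),
    ("bad battery", "Negative"),
    ("great camera", "Positive"),
]
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=10, k=3, rng=rng)

# Hypothetical outputs of three trained trees for one test review.
winner = majority_vote(["Positive", "Negative", "Positive"])
print(winner)  # Positive
```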
For our study we also adopted random forest to see whether we could achieve better
accuracy in scoring or predicting the polarity of the review texts. RF is relatively
insensitive to its input parameters; thus, we used the default parameters for each
classifier. The trained classifiers return scores between 0 and 1; these scores are then
mapped to ‘negative’ or ‘positive’. For each combination, the existence of an element is
considered positive (P) or negative (N). The classification metrics considered for the
sentiment analysis are Accuracy, Precision, Recall and F-Measure, and these parameters are
evaluated based on the calculated positivity and negativity of reviews by the proposed
hybrid approach. With the random forest method we were able to achieve 62% accuracy.
Figure 7. The result obtained from the RandomForest algorithm, where the accuracy achieved
was 62%.
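The four metrics can be computed from the counts of true and false positives and negatives. The sketch below uses their standard definitions; the 0.5 threshold for turning a classifier score into a label is an assumption, as the original work does not state the cut-off it used, and the sample labels and scores are invented.

```python
def label_from_score(score, threshold=0.5):
    """Turn a score in [0, 1] into a label (threshold is an assumption)."""
    return "positive" if score >= threshold else "negative"


def binary_metrics(y_true, y_pred, positive="positive"):
    """Return accuracy, precision, recall and F-measure for two classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Invented labels and scores for illustration.
y_true = ["positive", "positive", "negative", "negative"]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [label_from_score(s) for s in scores]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```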
6. Conclusion
The interest in sentiment analysis as a field of research is growing rapidly. It has been
shown that transforming the huge volume of textual data from the web into meaningful
information can be very useful. However, accurate opinion extraction remains challenging.
In many cases the sentiment of a review does not match the star rating provided by the
customer. This may affect the business of online shopping sites: online reviews have
become a reference point for buyers across the globe, many people trust them when making
purchasing decisions, and star ratings give the overall picture of the customer
experience. It is important to address the difference between the sentiment of the reviews
and the star ratings because many customers use rating filters to simplify their searches,
so a product whose average star rating falls below 3 would be considered unimpressive.
Conversely, a review may be unfair or even false, yet if the rating given is “5” the
product would be considered among the best. We undertook this work to address such fake
and biased reviews. Machine learning algorithms such as KNN and random forest achieved
accuracies of 63% and 62% respectively, which indicates an inconsistency between the
sentiment scores of the reviews and the star ratings.
7. Future Work
There is a lot of scope in analysing the videos and images on the web. Nowadays, with the
advent of Facebook, Instagram and Vine videos, people express their thoughts with pictures
and videos along with text.
Sentiment analysis will have to keep pace with this change. Tools that help companies
change strategies based on Facebook and Twitter will also have to account for the number
of likes and re-tweets that a post generates on social media.
Many people follow and unfollow other users and comment threads on social media without
ever commenting themselves, so there is scope in analysing these aspects of the web as
well.
References
1. Maha Al-Ghalibi, Adil Al-Azzawi and Kai Lawonn (2019). NLP based sentiment analysis for
Twitter's opinion mining and visualization.
2. Muhammad Taimoor Khan, Mehr Durrani, Armughan Ali, Irum Inayat, Shehzad Khalid and
Kamran Habib Khan (2016). Sentiment analysis and the complex natural language. Complex
Adapt Syst Model.
3. Xing Fang and Justin Zhan (2015). Sentiment analysis using product review data. Journal
of Big Data.
4. Prabu Palanisamy, Vineet Yadav and Harsha Elchuri (2013). Simple and Practical lexicon
based approach to Sentiment Analysis.
5. Hitesh Parmar, Sanjay Bhanderi and Glory Shah (2014). Sentiment Mining of Movie Reviews
using Random Forest with Tuned Hyperparameters.
6. Yassine Al Amrani, Mohamed Lazaar and Kamal Eddine El Kadiri (2018). Random Forest and
Support Vector Machine based Hybrid Approach to Sentiment Analysis.
7. Soudamini Hota and Sudhir Pathak (2018). KNN classifier-based approach for multi-class
sentiment analysis of twitter data.
8. Olga Kolchyna, Thársis T. P. Souza, Philip C. Treleaven and Tomaso Aste. Twitter
Sentiment Analysis: Lexicon Method, Machine Learning Method and Their Combination.
Department of Computer Science, UCL, Gower Street, London, UK; Systemic Risk Centre, London
School of Economics and Political Sciences, London, UK.
9. Prabu Palanisamy, Vineet Yadav and Harsha Elchuri. Serendio: Simple and Practical
lexicon based approach to Sentiment Analysis. Serendio Software Pvt Ltd, Guindy, Chennai
600032, India.