
A Cloud-based Big Data Sentiment Analysis Application for Enterprises’ Brand Monitoring in Social Media Streams

A. Tedeschi, Student Member, IEEE, and F. Benedetto, Senior Member, IEEE


Signal Processing for Telecommunications and Economics Lab.
University of Rome “Roma TRE”
Via Vito Volterra 62, 00146 - Rome, Italy
{antonio.tedeschi, francesco.benedetto}@uniroma3.it

Abstract— Due to the explosive diffusion of social media streams, users can evaluate brands’ reputation and enterprises’ quality by exploiting the information provided by new digital marketing channels. Consequently, enterprises need to spot and analyze a large amount of digital data in order to improve their reputation among consumers. This work proposes a cloud-based big data sentiment analysis application for brand monitoring and analysis in social media streams. By exploiting our tool, enterprises can enhance their competitiveness and satisfy consumers’ needs and expectations, detecting the sentiment of a tweet and how it influences people when the author’s popularity is taken into account. The obtained results, showing that branding strategies play an even more important role in online environments, evidence the effectiveness of our approach for enterprises’ brand monitoring and analysis.

Index Terms— Big Data Sentiment Analysis, Brand Monitoring, Social Media Streams, Cloud-based Applications.

I. INTRODUCTION
Usually, one source for big data is user interactions on social media platforms and mobile applications [1]. The information shared between users is not restricted to status updates about their private life, but also includes their opinions on products and/or news [2]. However, managing and analyzing such big data requires an amount of computational resources that is too expensive to buy, especially for Small and Medium Enterprises (SMEs). Such a need led to the definition of the Cloud Computing technology. It relies on the sharing of computational resources without forcing enterprises and users to buy the network infrastructure. Enterprises can rent the computational resources from the cloud provider, according to their needs [3]. Through the cloud computing technology and its service models, social media enterprises can easily manage users’ data, increasing the quality of their services and reducing the management costs of data and network infrastructures by exploiting the pay-per-use model [4].
Recently, enterprises have started to recognize the benefit derived from the use of both cloud computing technology and social media data analyses. User-generated contents, such as status updates and individual opinions, are becoming a useful resource to understand what people think about a certain brand. A brand is commonly associated with products and services offered by an enterprise for a certain time interval, shaping deeply anchored and clear images in the mind of the end consumers [5]. Hence, through the analysis of the contents produced by social users, enterprises can: (i) manage their brand and (ii) create ad-hoc marketing campaigns and advertisements by analyzing users’ feelings [6]. Actually, brands are not only the products and services portfolio of an enterprise, but they also represent an emotional differentiation in the form of orientation and the generation of trust. Due to the lack of personal ties, the wealth of information available on the Internet can lead to a depersonalization of relationships and a consequent decrease of consumers’ loyalty [5]. In addition, negative examples perpetuated in the press can increase the perceived risks of online purchases, especially in the case of less well-known brands. Traditional market monitoring, based on human effort, has become virtually impossible. Individuals are no longer able to manually spot and analyze all the information of particular importance for global large-scale corporations. This need led to the definition of automated market intelligence and tools which allow the monitoring of a brand’s reputation in a mechanized fashion. Hence, it has become possible for enterprises to automatically identify the main and upcoming topics of interest and to monitor the reputation of their own brands as well as those of their competitors [7]. These reputation platforms are based on the idea that consumers’ emotions are valid indicators of their satisfaction with the brand and can also be used as the basis for customers’ repeat purchases. Through these new investigation methods, online businesses should pay more attention to reinforcing the refinement of the brand value and experience design, thus improving customer value [8].
Addressing some of these issues, we propose an innovative big data sentiment analysis application for brand monitoring in social media streams, named Social Brand Monitoring (SBM). Currently, the SBM prototype is based on a Client-Server architecture developed in Java. Moreover, Twitter is the target social media and, through the services provided by the SBM tool, an enterprise can monitor its brand and image, as well as those of its competitors, by analyzing the user-generated data.

978-1-4673-8167-3/15/$31.00 ©2015 IEEE


The remainder of this work is organized as follows. Section 2 depicts the most important related works, while Section 3 introduces the SBM tool system model, the client-server architecture, and the big data processing algorithm. Section 4 depicts the results obtained through our monitoring tool, while our conclusions are briefly summarized in Section 5.

II. RELATED WORKS
Understanding ‘what other people think’ has always been a key concept during the decision-making process [9]. Sentiment analysis and opinion mining have become two important fields of Artificial Intelligence. Through sentiment analysis, it is possible to establish the sentiment associated to social media users in a mechanized and automated fashion. Recognizing the benefit deriving from them, several enterprises have developed their own sentiment analysis monitoring tools in order to keep track of their brand reputation. The brand monitoring activities over social media and web platforms have become more and more important. The monitoring tools are typically based on different sentiment analysis approaches, which compute the sentiment behind the users’ opinions and sentences. As depicted in [10], several techniques are proposed in the literature, which can be classified into three main approaches: supervised learning, unsupervised learning, and the knowledge-based approach. Usually, supervised learning is based on machine learning approaches, while unsupervised learning is based on the sentiment orientation of specific sentences within the source document. For this approach, no prior training is required in order to label data. The knowledge-based approach, also called the lexicon-based or text-based approach, is a well-known and widespread strand of literature for detecting the sentiment related to a text, such as a blog post or a tweet of a Twitter user. This approach is based on lexical resources, such as ontological dictionaries, where each word can have a score that identifies the strength of the related sentiment [11]. A basic algorithm is the evaluation of texts’ polarity using lexical resources such as SentiWordNet [12].
Recent approaches also establish the influence of the user who wrote the sentence expressing his opinion. By computing and analyzing the influence of a user, it is possible to determine who the popular users are and how they influence other users with their words. The main state-of-the-art approaches are designed for Twitter. The basic techniques used to compute the user’s popularity exploit both the user’s features, such as followers and following count, and the features of the user’s tweets, like the retweet count [13].
A list of both commercial and academic examples of brand monitoring tools is provided in [14]. The authors analyze the state of the art of company and university research on sentiment analysis tools and systems designed for business marketing, such as Brandwatch [15], Synthesio [16], and SproutSocial [17]. These social media monitoring tools allow finding insights into a brand’s visibility on social networks, and searching and streaming social media data using criteria such as keywords or languages. These tools also provide useful functions to analyze the data and understand people’s feelings and opinions. However, these services and tools are expensive and the stakeholder, who pays for these services, cannot directly access the social media data, because the data analysis is carried out outside the company. Moreover, a large number of tools do not allow streaming from social media or are not online tools. Another web-based tool for social media data analysis is proposed in [18]. Here, the authors developed the SWAB (Social Web Analysis Buddy) tool, integrating both data quality analysis and large-scale data mining techniques. They propose a Client-Server architecture, where the client manages users’ input while the server manages several tasks such as the crawling of Twitter’s data and the processing of the retrieved data in order to establish the related sentiments. Finally, the authors used a Naïve Bayes classification model in order to compute the sentiments of user-generated content.

Figure 1. The SBM tool system model

III. CLOUD-BASED BIG DATA SENTIMENT ANALYSIS

A. System Model
The SBM tool adopts the PaaS model, provided by Microsoft Windows Azure, and respects the main software quality requirements, such as scalability and fault tolerance, for this kind of distributed services. As shown in Fig. 1, the prototype is based on three main entities: (i) the target social media represented by Twitter, which allows us to gather the contents to analyze; (ii) the client, which is a user-friendly website; (iii) the server released on the Windows Azure Platform. We chose Twitter as target social media because of its features and characteristics. Thanks to its microblogging nature and features, Twitter has evolved into an actual news media allowing users to easily discover what is happening at any moment in time in the world. Its evolution stems from the fact that Twitter not only provides a virtual space used as a new communication channel but it also stores any kind of news generated by its users’ networks in real-time. Twitter’s users exploit the provided communication channels in order to express and share their opinions and sentiments not only with friends but also with strangers. The SBM’s server provides several services through the defined RESTful Web Service and the Windows Azure platform. For instance, the server allows SBM’s users to start a new search by sending a specific query to the server, which starts a crawling session and stores all the retrieved data in a shared database. Through the provided services, a user can query the SBM client in order to know the overall sentiments about a certain brand for a specific date and/or time interval. Finally, we developed the SBM tool involving different programming languages and frameworks, choosing Java as the programming language for the server side and JavaServer Pages (JSP) technology for the
client side. The rest of this section is divided into two sub-sections which describe the design of the server and client. In addition, in the rest of this paper we use the words “author” and “researcher” to refer to the Twitter and the SBM tool users, respectively.

B. Server Design
In order to design a well-defined architecture for the SBM tool, we followed the principles of several design and architectural patterns [19]. The Layers and N-Tier patterns were the main ones used to design the architecture. Through the benefits derived from applying the N-Tier pattern, it is possible to physically separate the architecture into three tiers: (i) the presentation tier, which manages all the aspects of the user interface in order to show the required result of an action to the end-user; (ii) the application logic tier, which is the core tier that coordinates the application behaviors and manages the data between the two surrounding tiers; (iii) the data management tier, which includes the data persistence mechanisms (database servers, file shares, etc.) and the data access layer that encapsulates the persistence mechanisms and exposes the data. The main benefit is the possibility to independently develop and test each tier, improving the debugging phases. In order to develop a well-designed application logic tier, we exploited the principle of the Layers pattern. This allows us to design four layers, which manage specific behaviors of the SBM tool. The defined layers are: (i) the web crawler layer, which crawls contents from Twitter; (ii) the data access object layer, which manages the data stored in a relational database; (iii) the sentiment analysis layer, which computes the sentiment and the influence associated to a certain tweet of an author; and (iv) the RESTful web-service layer, which manages the researchers’ queries.

Figure 2. Entity-Relationship (E-R) diagram of the database.

The web crawler layer is based on the Twitter4J library [20], an unofficial Java library that can be used to access both contents and services of Twitter. In order to exploit this library, it is necessary to register the owner’s Twitter account as a developer in order to obtain the required credentials. Then, it is possible to access Twitter’s contents and services. Moreover, it is possible to set the ConfigurationBuilder class provided by Twitter4J with the obtained credentials in order to enable the web crawler layer to access Twitter using the OAuth protocol.
The developed crawler allows us to crawl Twitter’s contents using the following parameters: name of brand or product; start date of search; end date of search. The rate limit window has been divided into chunks of 15 minutes per endpoint, with most individual calls allowing for 15 requests in each window. The crawled data are then stored in a shared relational database defined for the SBM tool. This layer currently provides only the crawling of Twitter’s contents, but it is based on an interface which can be implemented by other classes in order to develop new web crawlers, e.g., for Facebook. After the crawling step, the obtained data are processed and the database’s entries are updated with the correlated statistics, such as the computed sentiment, the author’s popularity and the influence of each tweet.
We designed a second layer, the Data Access Object (DAO) layer, which allows the application to store and manage the data. The relational database is designed starting from the analysis of the crawled data, such as: tweets and their information (e.g., retweets and location); author information (e.g., followers count and nickname); and hashtags used for the searches. The result of our analysis and design process is the Entity-Relationship (E-R) diagram of the relational database shown in Fig. 2. Combining this E-R database with Hibernate, we have defined a DAO layer which helps the researcher to query the tool in order to obtain the data of interest. Hibernate is a well-known object-relational mapping framework and, through its features, it is possible to map Java classes to database tables. In order to query the database, it is possible to use Hibernate’s query language, which is more lightweight than other frameworks. Finally, the proposed database is shared between all researchers. This means that each brand or product search contributes new data which can be analyzed by other researchers.
It is well known that relational databases may not be very efficient for Big Data management since they do not scale well to very large sizes. As a proof of concept of the application devised here, we have decided to work with the relational database directly provided by Microsoft Windows Azure. However, thanks to the design of our SBM tool, it will be extremely simple and fast to change the services provided by the DAO layer in order to migrate to and manage a NoSQL database, also in Microsoft Windows Azure. In order to give a point of access to the tool and its functions, such as the retrieval of the stored data, the RESTful web service layer was designed. This layer defines a web service based on the RESTful paradigm. Its guidelines allow us to define a service interface used by the researcher to communicate with the SBM platform.
Finally, the last but not least layer is the sentiment analysis (SA) layer, which is the core of our tool. This layer provides the algorithms to compute: the sentiment associated to tweets; the author’s popularity represented by the rank value; and the author’s influence, which indicates how the author’s words can influence the other authors and change their opinion. The sentiment analysis algorithm developed for the SBM tool is based on the knowledge-based approach, adopting SentiWordNet as the English lexical resource. Through SentiWordNet, verbs, adverbs, nouns and adjectives are grouped
into sets of cognitive synonyms, namely synsets, each one expressing a distinct concept and having an assigned score. We can easily compute the polarity score of a tweet as the sum of the polarity of each word after their classification through the Part of Speech (POS) tagging process. As shown in Table 1, we consider seven polarity levels where each range of score values is matched to a certain sentiment. The main problem of this kind of approach is that it is not possible to directly compute the polarity of a tweet on the retrieved tweet’s text due to several issues, such as the presence of slang and emoticons which are not in any ontological dictionary. Hence, it is necessary to preprocess the text in order to improve the accuracy of the algorithm.

TABLE 1. Polarity levels.

C. Data Processing Algorithm
The first aspect that we introduced is a check on the tweet text in order to understand if it is truncated. If so, the tweet is not taken into consideration and is deleted from the database. This phase is necessary due to Twitter’s behavior on retweeted tweets. Following the suggestions and the approaches defined in [21]–[24], we developed a preprocessing algorithm based on 10 steps. After the application of the preprocessing algorithm, it is possible to compute the polarity level associated to the normalized text of the processed tweet. In order to do that, a second phase is necessary: the Part-Of-Speech (POS) tagging. Through a POS tagger, it is possible to assign parts of speech to each word. For the SBM tool, we adopted the POS tagger developed by Stanford University [25], taking into consideration only the parts of speech defined by SentiWordNet.
A key factor of our tool is the authors’ popularity, which helps the researcher to understand how the author’s status and the sentiment associated to his words can influence the opinion of his followers’ networks about a certain topic, brand or product. To obtain the rank value, which represents the author’s popularity, we suggest considering the followers and following count, the mentions and retweets of each author and his tweets. Finally, the author’s influence is obtained by multiplying the rank value by the polarity score, obtaining a new polarity level representing how the author’s followers perceive his words. For instance, a popular author can write a weak negative tweet which is perceived by others as a strong negative tweet due to the author’s rank.

Figure 3. Sentiments and influence results.

IV. RESULTS AND DISCUSSION
In this section, we show the results of the analysis of a certain brand in the Technology area, selected as one of the best-selling mobile phone brands, during the marketing campaign of its new smartphone series.
Before evaluating the performance of the proposed approach, we have analyzed the benefits provided by Microsoft Windows Azure technology. The cloud computing platform allowed us to easily deploy our solution through the PaaS service paradigm and to dynamically set the size of the relational database. We have deployed our method on Windows Azure and, at the same time, on a free Java web hosting service, starting a new search on each. As expected, the tool deployed on Windows Azure proved to be faster than the alternative solution. This is a consequence of the fact that Windows Azure allowed us to streamline the crawling and sentiment analysis computations, hence reducing the time required for such operations. In the following, all the results are obtained by deploying our tool with the Windows Azure technology.
Fig. 3 presents the retrieved data through three different charts. The searches have been anonymized to meet the requirements on disclosure and data responsibility. The two pie charts show the sentiment and influence for each polarity level about the brand. As depicted by the sentiment pie chart, the authors’ opinions are in favor of the brand with 29% of Weak Positive tweets. Such a result is underlined also by the influence pie chart. This chart involves the author’s popularity, which strongly changes the polarity levels. The tool shows, for example, the change from 9.7% of Strong Positive tweets, if only the sentiment is considered, to 31% if the authors’ influence is also involved. Then, the line chart displays the daily trend of each sentiment and influence level in order to understand how certain events can modify the authors’ opinion.
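
The shift between the sentiment and influence distributions follows from the rank-weighting step of Section III.C, where the influence is the rank value multiplied by the polarity score. The following is a minimal sketch of that step only. The level boundaries in `BOUNDS` are hypothetical (the actual ranges of Table 1 are not reproduced here), the names of the two intermediate levels ("Positive" and "Negative") are assumed, and the function names are illustrative rather than taken from the SBM code base.

```python
from bisect import bisect_right

# Hypothetical boundaries between the seven polarity levels.
# The real score ranges are those of Table 1, which are not available here.
BOUNDS = [-1.0, -0.5, -0.1, 0.1, 0.5, 1.0]

# Level names; "Negative" and "Positive" are assumed names for the
# intermediate levels between the weak and strong ones.
LEVELS = [
    "Strong Negative", "Negative", "Weak Negative", "Neutral",
    "Weak Positive", "Positive", "Strong Positive",
]


def polarity_level(score: float) -> str:
    """Map a polarity (or influence) score to one of the seven levels."""
    return LEVELS[bisect_right(BOUNDS, score)]


def influence_score(polarity: float, rank: float) -> float:
    """Influence as described in the text: rank value times polarity score."""
    return rank * polarity
```

Under these assumed boundaries, a tweet with polarity score -0.3 is Weak Negative on its own, while `polarity_level(influence_score(-0.3, 5.62))` gives Strong Negative when weighted by a rank of 5.62, the maximum value reported for author 1. This reproduces the qualitative behavior described in the text, where a weak negative tweet by a popular author is perceived as a strong negative one.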
Figure 4. Sentiments versus influence results.

In this case, we can study both the daily trend of the polarity levels and the number of tweets for each day and sentiment (and influence). Through the line chart, we can analyze not only the number of tweets for each polarity level but also their temporal distribution. Then, Fig. 4 displays, for each polarity level and each day, how the sentiment changes if the authors’ popularity is involved. Through this chart, we can easily assess how the sentiments can be influenced by the users’ popularity. In particular, it is evident that the weak positive (or negative) sentiments shift towards strong positive (or negative) levels. These analyses confirm the positive trend of this brand and its smartphones on Twitter. Such data indicate that the new marketing campaign gains the favor of authors, who predominantly tweet positive opinions about the selected brand.
In addition and as mentioned before, the authors’ popularity is one of the main aspects of Twitter. In this work, we measure the author’s popularity through the rank value, computed from the main features of an author such as followers and following count. Through this numerical value, it is possible to assess how the tweets of a popular author influence the opinions of his followers’ networks. Hence, in order to test and analyze the behavior of the rank values and influence algorithm, we have generated a dataset of more than 300,000 tweets on several brands and products tweeted by more than 200,000 authors. Again, the obtained results are anonymized. The rank value is computed using the features of an author and also his tweets and retweet count. This means that for each author it is possible to have different values of rank. As a consequence, we have also monitored the variation of rank based on the number of retweets. Hence, we have selected two authors with a high number of different ranks. The obtained data refer to the same brand search. The rank trend of the first examined author, namely author 1, is shown in Figure 5. This chart shows both the minimum rank value when the author’s tweet is not shared by his followers, which is equal to 2.71, and the maximum rank value of 5.62 obtained for 71227 retweets. Similarly, the trend of rank for author 2 shows the same behavior (not reported here) and the minimum and maximum values of rank are respectively 2.64 and 5.53 for 64993 retweets.

Figure 5. Rank values for author 1 (657 followers and 1652 following).

Through these two case studies, it is possible to assess that the variation of the rank trend is related to the increase of the associated retweets. Both cases show how the retweet count is a good index to measure the popularity and the influence of an author among Twitter networks. Finally, these statistical analyses on authors’ popularity underline the usefulness of the rank value, which shifts the real sentiment of a tweet towards the maximum or minimum level of polarity, reducing the data for the middle polarity levels.

V. CONCLUSION
This paper has introduced a cloud-based brand monitoring tool for the Twitter social media. In particular, we have designed our tool as a cloud-based service, exploiting the benefit derived from the Platform as a Service model of Microsoft Windows Azure. Twitter is used as the target social network, since it is currently a pervasive social news media, due to its features and characteristics. Then, we have designed and developed a user-friendly interface for our tool and several analysis functions which allow users to understand: (i) what people feel about the searched brand; (ii) who the influential users are; (iii) the brand’s diffusion worldwide. The obtained results have evidenced that our tool can be a valid asset in providing strategic and innovative potentialities to all small and medium enterprises, helping them to be more efficient and effective in satisfying the customers’ needs and expectations. Future research will be devoted to fully analyzing new sentiment analysis algorithms, as well as different social media streams.

REFERENCES
[1] O'Reilly, T. What is web 2.0 - design patterns and business models for the next generation of software (2005). Resource document. http://oreilly.com/web2/archive/what-is-web-20.html. Accessed 15 September 2014.
[2] Kietzmann, J. H., Hermkens, K., McCarthy, I. P., Silvestre, B. S. (2011). Social media? Get serious! Understanding the functional building blocks of social media. Business Horizons. 54(3). 241-251.
[3] Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., Lee, G., Patterson, D. A., Rabkin, A., Stoica, I., Zaharia, M. (2009). Above the clouds: A berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Tech. Rep. UCB/EECS. 28.
[4] Shuai, Z., Shufen, Z., Xuebin, C., Xiuzhen, H. (2010). Cloud Computing Research and Development Trend. 2nd Int. Conf. on Future Networks. 93-97.
[5] Strauss, R.E., Schoder, D., Gebauer, J. (2001). The Relevance of Brands for Electronic Commerce Results from an Empirical Study of Consumers in Europe. 34th Annual Hawaii Int. Conf. on System Sciences. 1-9.
[6] Becker, K., Nobre, H., Kanabar, V. (2013). Monitoring and protecting company and brand reputation on social networks: when sites are not enough. Global Business and Economics Review, Inderscience Enterprises Ltd. 15(2/3). 293-308.
[7] Ziegler, C.N., Skubacz, M. (2006). Towards Automated Reputation and Brand Monitoring on the Web. IEEE Int. Conf. on Web Intelligence. 1066 – 1072.
[8] Li, L. (2013). Study on the interactive relationship between customer's emotional response and the brand trust — In the view of online shopping. IEEE Int. Conf. on Service Operations and Logistics, and Informatics (SOLI). 245 – 248.
[9] Pang, B., Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval. 1 – 135.
[10] Feldman, R. (2013). Techniques and Applications for Sentiment Analysis. Communications of the ACM. 56(4). 82-89.
[11] Vinodhini, G., Chandrasekaran, R. M. (2012). Sentiment analysis and opinion mining: a survey. International Journal of Advanced Research in Computer Science and Software Engineering. 2(6).
[12] Hamouda, A., Rohaim, A. (2011). Reviews classification using sentiwordnet lexicon. Online Journal on Computer Science and Information Technology. 2(1). 120-123.
[13] Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K. P. (2010). Measuring user influence in Twitter: The million follower fallacy. 4th Int. Conf. on Weblogs and Social Media. 10-17.
[14] Tedeschi A., Benedetto F. (2014), “A cloud-based tool for brand monitoring in social networks”, in Proc. of the IEEE Int. Conf. on Future Internet of Things and Cloud (FiCloud 2014), pp. 541-546
[15] Social Media Monitoring Tools - Brandwatch, Brandwatch. [Online]. Available: http://www.brandwatch.com/
[16] Synthesio [Online]. Available: http://synthesio.com/corporate/en
[17] SproutSocial [Online]. Available: http://sproutsocial.com/
[18] Xin, C., Madhavan, K., Vorvoreanu, M. (2013). A Web-Based Tool for Collaborative Social Media Data Analysis. 3rd Int. Conf. on Cloud and Green Computing (CGC). 383-388.
[19] Gamma, E., Helm, R., Johnson, R., Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software.
[20] Twitter4J. Resource document. http://twitter4j.org/en/index.html. Accessed 15 September 2014.
[21] Balahur, A. (2013). Sentiment analysis in social media texts. 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 120-128.
[22] Habernal, I., Ptácek, T., & Steinberger, J. (2013). Sentiment analysis in czech social media using supervised machine learning. 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 65-74.
[23] Kouloumpis, E., Wilson, T., Moore, J. (2011). Twitter sentiment analysis: The good the bad and the omg!. 5th Int. Conf. on Weblogs and Social Media. 538-541.
[24] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., Passonneau, R. (2011). Sentiment analysis of twitter data. Workshop on Languages in Social Media. 30-38.
[25] Stanford University. Stanford Log-linear Part-Of-Speech Tagger. Resource document. http://nlp.stanford.edu/software/tagger.shtml Accessed 15 September 2014.
