BIG DATA ANALYTICS APPLIED TO TRACK SENTIMENT ANALYSIS
_______________
A Thesis
Presented to the
Faculty of
San Diego State University
_______________
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
in
Computer Science
_______________
by
Riddhi A. Shah
Spring 2016
ProQuest Number: 10125135
Published by ProQuest LLC (2016). Copyright of the Dissertation is held by the Author.
Copyright © 2016
by
Riddhi A. Shah
All Rights Reserved
DEDICATION
This work is dedicated to my family, my friends and the faculty of San Diego State
University, for their incredible support, motivation and love. Special thanks to Professor Carl
Eckberg for his continuous feedback and guidance.
ABSTRACT OF THE THESIS
Big Data Analytics Applied to Track Sentiment Analysis

by
Riddhi A. Shah
Master of Science in Computer Science
San Diego State University, 2016
Since its beginning, the internet has provided many ways of gathering users’ reviews
on almost everything, and a web-based application is one of them. Earlier methods of
getting a review included mail, telephone and conversation in person. Now, in addition to
these traditional methods, we can get reviews from people with just a few clicks and in
almost no time.
My aim was to develop a review-based search engine that allows users to get
reviews or opinions about a product based on the available information.
In this application, I obtain a raw dataset from Twitter using Twitter’s REST
API with OAuth for authorization. The responses are unstructured JSON, so they are
stored in MongoDB (a NoSQL database). To analyze the tweets, I have used the
Apache OpenNLP and SentiWordNet packages. The images of the products are retrieved using
the Google API for images.
The purpose of this thesis is to present the basic concepts behind creating and
developing a user-friendly review-based search engine, which can be extended to a
larger scale by adding features that support more detailed reviews of a particular product.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
1.1 Overview
1.2 Twitter Overview
1.3 Motivation
1.4 Research Objectives
1.5 Overview
1.5.1 Web Search Tool
1.5.2 Web Service and Database
1.5.3 Parsing and Analyzing Packages
1.6 Summary of Chapters
2 TECHNOLOGIES
2.1 Java
2.1.1 Servlets
2.1.2 Java Server Pages
2.2 Restful Web Services
2.3 NoSQL Database
2.4 Apache Tomcat
3 APPLICATION DESIGN AND IMPLEMENTATION
3.1 System Design
3.2 Web Data
3.2.1 Twitter
3.3 Data Puller
3.3.1 Application-Only Authentication
3.3.2 The Search API
3.4 Raw Database
3.5 Indexer
3.5.1 Apache Open NLP
3.5.1.1 Tokenizer
3.5.1.2 Part-Of-Speech Tagger
3.6 Sentiment Analyzer
3.6.1 Sentiment
3.6.2 Sentiment Analysis
3.7 Indexed Database
3.8 Presentation Builder
4 APPLICATION WORKFLOW
5 CONCLUSION AND FUTURE ENHANCEMENT
5.1 Future Work
REFERENCES
LIST OF FIGURES
Figure 2.1. Example of SOAP request.
Figure 2.2. Example of REST request.
Figure 2.3. Example of REST request with multiple parameters.
Figure 2.4. Example of Key-Value pair.
Figure 3.1. System Layered Architecture
Figure 3.2. Example of tweets on Twitter about an iPhone.
Figure 3.3. Web Opinion and the system flow.
Figure 3.4. Working of Data Puller.
Figure 3.5. Auth Flow.
Figure 3.6. Sample Request of Twitter API.
Figure 3.7. Sample Response from Twitter.
Figure 3.8. Source to download pre-trained models.
Figure 3.9. Sample code to load the model.
Figure 3.10. Sample code to instantiate the model.
Figure 3.11. Sample code to execute the function.
Figure 3.12. Sample of input text.
Figure 3.13. Sample of individual tokens.
Figure 3.14. Sample code loading Tokenizer Model.
Figure 3.15. Sample code to instantiate a TokenizerME (learnable tokenizer).
Figure 3.16. Sample code calling tokenizer methods.
Figure 3.17. Sample of generated tokens.
Figure 3.18. Sample code to load the POS model
Figure 3.19. Sample code to instantiate the POSTaggerME.
Figure 3.20. Sample code calling method to generate tags.
Figure 3.21. List of POS tags.
Figure 3.22. Working of Indexer.
Figure 3.23. Query retrieving Collection.
Figure 3.24. Data in collection.
Figure 3.25. Working of Presentation builder.
Figure 3.26. Sample request of Google API.
Figure 3.27. Collection storing Image information.
Figure 4.1. Home screen of an Application.
Figure 4.2. Search Results.
Figure 4.3. Level 0-DFD.
LIST OF ABBREVIATIONS
HTML HyperText Markup Language
CSS Cascading Style Sheets
SQL Structured Query Language
POS Part-of-Speech
API Application Programming Interface
REST Representational State Transfer
URIs Uniform Resource Identifiers
JSP JavaServer Pages
NOSQL Non SQL or Non Relational
Blob Binary large object
Tomcat Apache Tomcat
OAUTH Open Authorization
NLP Natural Language Processing
RDBMS Relational Database Management System
ACKNOWLEDGEMENTS
I am very thankful to Professor Carl Eckberg for serving as Chair of my thesis
committee and for his constant support, guidance and encouragement during my program of
study and my thesis work. I also take this opportunity to express my gratitude and sincere
thanks to Professor Xiaobai Liu and Professor Timothy Dunster for their feedback, guidance
and presence on my thesis committee. I am very thankful to my parents, family and friends
for their love and support.
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
We live in the age of the internet, and ever since its beginning, social
networking, blogging and online sales have been booming. With the help of the internet it has
become very easy to get people’s opinions on almost anything within just a few minutes.
Earlier, people needed to call each other or use mail to get an opinion on something; now, in
addition to those traditional methods, new ways of gathering users’ opinions have
emerged, and doing so keeps getting easier. As a result, people want to get opinions from
several users before making a decision, e.g. buying a product, choosing a restaurant for
dinner or casting a political vote. Our behavior and decision-making processes are highly
influenced by the opinions of other users [1].
These days, a large number of people express their reviews on microblogging
websites such as Twitter. The ability to analyze this data and extract reviews on
different topics can help us make better choices and predictions regarding those
topics. As a result, sentiment analysis of tweets is gaining importance in various
domains such as e-commerce [2].
Considering all this, the main idea behind the development of this tool
is to provide a simple and user-friendly application for getting users’ opinions based on
tweets from Twitter. We can search for reviews of any kind of product, and with a single
click we get a chart displaying the results of the analyzed tweets or reviews. The
chart shows how many tweets sounded positive, negative and neutral. We also
display each tweet next to its username, with a color marking it positive or negative.
Finally, a user can enter a review directly into the database, where it is
analyzed and marked as positive, negative or neutral.
1.2 TWITTER OVERVIEW

Twitter is an online social networking tool that allows users to read
and broadcast short messages known as tweets. These messages can be no longer than 140
characters. Twitter was created in March 2006 by Jack Dorsey, Evan Williams, Biz
Stone and Noah Glass and launched in July 2006. It has more than 500 million users, of whom
almost 332 million are active [3].
Messages are about what someone is doing, where they are, what their opinion about
something is and so on. It is also possible to forward someone else’s tweet (a retweet),
giving them credit for the original tweet, e.g. "RT @user: Join us in San Diego for a
workshop on social media, May 30" [4]. There are many ways both to tweet about something
and to read about a particular topic.
1.3 MOTIVATION
For the past 5 years, I have worked as a software developer, mostly as a web developer.
My role during these years was to build small as well as large-scale web applications and
web services. While pursuing a Master’s degree at San Diego State University, I tried to
learn as many programming languages as I could. I learned a lot about new technologies in web
programming, the design of large-scale applications and working with databases.
Considering all this, I decided to do my thesis work in web and data mining.
1.4 RESEARCH OBJECTIVES

Developing a review-based search engine that allows users to get a rating, review
or opinion about a product based on the available tweets from Twitter brings several
challenges. The first is to invoke the Twitter REST API with OAuth authorization to fetch
tweets and to parse the JSON responses for the required information. The second is to find
a method to break the tweets into words (i.e. tokenize them), identify each word as a noun,
verb, adjective, etc., analyze them and decide whether each tweet is positive, negative or
neutral. Further objectives include handling large numbers of semi-structured JSON
responses with a NoSQL database, storing them and presenting them in a readable format
for a great user experience, and invoking the Google API to retrieve images.
1.5 OVERVIEW
The search engine has been designed and developed using current web
technologies, a NoSQL database and object-oriented programming. The main sections of
this search engine are:
• Web search tool
• REST API (web services) and NoSQL database
• Packages for parsing, analyzing and scoring the tweets
1.5.1 Web Search Tool

The web search tool is designed and developed using HTML, JavaScript and JSP. The core
functionality of the search engine is to invoke the Twitter REST API with the input given by
the user and present the analyzed results of the tweets in a user-friendly way.
• Twitter Search API: https://dev.twitter.com/rest/public/search
1.5.2 Web Service and Database

The Twitter REST API is invoked to fetch the tweets, using OAuth for authorization. The
Google API is invoked to retrieve images of the product searched for. MongoDB is used
for storing semi-structured JSON data. General references:
• Twitter Search API: https://dev.twitter.com/rest/public/search
• Google Custom Search: https://developers.google.com/custom-search/json-api/v1/reference/cse/list#http-request
• MongoDB: https://www.mongodb.org/
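As a rough illustration of the web service call described above, the following Java sketch builds (but does not send) a search request with a bearer token in the Authorization header. The endpoint URL is an assumption based on the v1.1 Search API that was current at the time of writing, the token is a placeholder, and the class and method names are hypothetical. It uses java.net.http, which requires Java 11, newer than the JDK 8 used for this application.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class TwitterSearchRequestSketch {

    // Assumption: the v1.1 search endpoint; it may have changed since this thesis was written.
    static final String SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json";

    // Builds a GET request for tweets matching the query; nothing is sent over the network.
    static HttpRequest buildSearchRequest(String query, String bearerToken) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(SEARCH_URL + "?q=" + encoded))
                .header("Authorization", "Bearer " + bearerToken)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // PLACEHOLDER_TOKEN stands in for a token obtained through the OAuth flow.
        HttpRequest request = buildSearchRequest("iphone 6", "PLACEHOLDER_TOKEN");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers without network access or real credentials.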
1.5.3 Parsing and Analyzing Packages

The Apache OpenNLP library is used for processing the tweets. Using it, we
perform part-of-speech (POS) tagging. Once the words are tagged, sentiment analysis is
performed using rule-based classification. SentiWordNet is a lexical resource for opinion
mining that assigns to each synset of WordNet three sentiment scores: positivity,
negativity and objectivity [5].
• POS Tagging with OpenNLP: http://blog.dpdearing.com/2011/06/part-of-speech-pos-tagging-with-opennlp-1-5-0/
• POS Tagging with OpenNLP: http://blog.dpdearing.com/2011/12/opennlp-part-of-speech-pos-tags-penn-english-treebank/
• SentiWordNet: http://sentiwordnet.isti.cnr.it/
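To make the idea of rule-based classification over sentiment scores concrete, here is a minimal Java sketch. It is not the implementation used in this application: the tiny lexicon and its scores are hypothetical stand-ins for SentiWordNet entries, the sign rule is an assumed simplification, and POS tags are ignored. It relies only on the SentiWordNet property that positivity, negativity and objectivity sum to 1 for each entry.

```java
import java.util.Locale;
import java.util.Map;

class LexiconSentimentSketch {

    // Hypothetical {positivity, negativity} scores; real values would come from SentiWordNet.
    static final Map<String, double[]> LEXICON = Map.of(
            "great", new double[] {0.75, 0.0},
            "love", new double[] {0.625, 0.0},
            "slow", new double[] {0.0, 0.5},
            "terrible", new double[] {0.0, 0.875});

    // Objectivity is what remains after positivity and negativity are subtracted from 1.
    static double objectivity(String word) {
        double[] scores = LEXICON.getOrDefault(word, new double[] {0.0, 0.0});
        return 1.0 - scores[0] - scores[1];
    }

    // Sums (positivity - negativity) over known words and applies a simple sign rule.
    static String classify(String text) {
        double total = 0.0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            double[] scores = LEXICON.get(token);
            if (scores != null) {
                total += scores[0] - scores[1];
            }
        }
        if (total > 0) {
            return "positive";
        }
        if (total < 0) {
            return "negative";
        }
        return "neutral";
    }

    public static void main(String[] args) {
        System.out.println(classify("I love the new phone, great camera"));
        System.out.println(classify("Battery is terrible and the app is slow"));
        System.out.println(classify("Picked up the phone today"));
    }
}
```

A fuller version would look up scores per word and POS tag, since SentiWordNet scores belong to synsets and the same word can carry different sentiment as a noun, verb or adjective.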
1.6 SUMMARY OF CHAPTERS

• Chapter 1
Explains the need for and overview of this application, relating the details to
real-world examples.
• Chapter 2
Explains the technologies used to develop this application, such as Java, REST
APIs and NoSQL databases.
• Chapter 3
Explains the application design, architecture and use case.
• Chapter 4
Explains the application workflow and the different tools used to write the code.
• Chapter 5
Explains the future enhancements and vision for this tool.
CHAPTER 2
TECHNOLOGIES
2.1 JAVA
There are several reasons for choosing Java over other languages. Firstly, Java is a
platform-independent language that can easily be used to create software to be embedded
in a variety of electronic devices, hence Java’s slogan “Write once, run anywhere”.
It can be described as a simple, object-oriented, network-savvy, interpreted, robust,
secure, architecture-neutral, portable, high-performance, multithreaded, dynamic language
[6]. In addition to these features, Java provides a strong and vibrant ecosystem to
programmers:
• Java is backed by an industry leader, Oracle.
• Many open source libraries are available for building applications in Java.
• Various tools and IDEs are available to enhance and ease Java
development.
• Various frameworks are available to help us build highly reliable applications
quickly.
Software developed in Java can be used on any platform that provides a Java runtime,
without alterations to the code or to the libraries being used. Another important feature of
Java is security. The platform and the language were both designed with
security as a prime concern. Users can download code from the network
and run it in a restricted environment that is designed to prevent harm to the host system,
such as infection by a virus. This feature makes the Java platform
unique [7].
In the past few years, I have worked with several languages, including C#, Clojure,
Python and Java. Among these, I chose Java for the reasons
described above, but the main reason was to take advantage of
servlets. With the help of servlets one can create fast and efficient web applications
that can be executed on any servlet-enabled web server. Servlets
can handle multiple requests at a time and generate content or web pages dynamically, and
they are easy to write and fast to execute within web servers [8].
A number of paid and free IDEs are available to support Java development, such as
NetBeans, IntelliJ IDEA, Eclipse and JCreator. For the development of this application, I
used JDK 8 with Eclipse Mars as the IDE. Eclipse eases development with features like
code completion, debugging and easy integration with version control systems. JDK 8
can be downloaded from its official website by selecting the appropriate
platform.
2.1.1 Servlets
Servlets are an essential part of developing a J2EE application. Their advantages
include portability, safety, performance, efficiency, power, integration and
extensibility.
Core features:
• Portability: Servlets are written in Java and hence can be used on any
platform. A servlet can be developed on a
Windows machine using a server like Tomcat and deployed on another
operating system such as UNIX.
• Power: Servlets make it easy to implement database
connections, and the servlet session-tracking mechanism makes it easy to
maintain information from request to request.
• Efficiency: Servlets scale well because concurrent requests are handled
by separate threads, i.e. a single instance of a servlet
class can serve many threads at once.
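The threading model behind the efficiency point can be sketched without a servlet container. In the following Java sketch, the Handler class is a hypothetical stand-in for a servlet (it does not use the Servlet API): a single instance is shared by a pool of worker threads, so any state it keeps must be thread-safe.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SharedHandlerSketch {

    // Stand-in for a servlet: one instance whose handle method is called from many threads.
    static class Handler {
        private final AtomicInteger served = new AtomicInteger();

        String handle(String query) {
            served.incrementAndGet(); // shared state must be updated atomically
            return "results for " + query;
        }

        int servedCount() {
            return served.get();
        }
    }

    // Dispatches the given number of simulated requests to a single Handler instance.
    static int serveConcurrently(int requests, int threads) {
        Handler handler = new Handler();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < requests; i++) {
            final String query = "query-" + i;
            pool.execute(() -> handler.handle(query));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handler.servedCount();
    }

    public static void main(String[] args) {
        System.out.println(serveConcurrently(100, 8));
    }
}
```

In a real container the thread pool belongs to the server, and the same caution applies: instance fields of a servlet are shared across requests.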
2.1.2 Java Server Pages

JSP is a technology for developing web pages with dynamic content. Using JSP
scripting elements, e.g. <% %>, Java code can be integrated into the HTML code. The
Expression Language syntax ${ } is used for retrieving the values of object properties.
With the help of JSP, it becomes easy to gather input from the user through web page forms.
Web pages can be created dynamically and can easily present data from the database [9].
JSPs are essentially an extension of servlets. They ease the application development
process with several additional features: beyond the features of servlets, they offer custom
tags, implicit objects, the expression language, etc. As an extension of servlets, they
are also platform independent, secure and dynamic. JSP is considered
browser and server independent, as tags are processed and executed by the server-side web
container [10].
Servlets can also emit HTML, but while building an application they are
generally used for business-layer logic. JSP supports HTML plus Java code, hence
it is used for the presentation layer.
The POST method is usually considered good for transferring information to the
backend system. Information is sent in the body of the request message; the backend system
then parses the input and processes it. The implicit request object available in a JSP
provides a method named “getParameter()”, and using this method we can read the data
passed by the front end:
<%= request.getParameter("search_text") %>
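To show what parsing the posted input involves, the following Java sketch decodes a form-encoded request body into named parameters. It is only an approximation of what a servlet container does before getParameter() is called, written without the Servlet API; the class name and the sample body are hypothetical. It requires Java 10 or later for the URLDecoder overload that takes a Charset.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBodySketch {

    // Parses an application/x-www-form-urlencoded body such as "search_text=iphone+6&lang=en".
    static Map<String, String> parse(String body) {
        Map<String, String> params = new LinkedHashMap<>();
        if (body == null || body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            // Keep the first value for a repeated name, as getParameter() does.
            params.putIfAbsent(
                    URLDecoder.decode(name, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse("search_text=iphone+6&lang=en");
        System.out.println(params.get("search_text"));
    }
}
```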
2.2 RESTFUL WEB SERVICES

RESTful web services are designed to work well on the web. Representational State
Transfer (REST) is an architectural style that specifies constraints, such as the uniform
interface, which help a web service achieve performance, scalability and
modifiability. These properties allow services to work well on the web. Data and
functionality are considered resources in the REST architectural style and are typically
accessed as links on the web, i.e. using URIs [11].
The REST architecture constrains a system to a client-server
architecture and a stateless communication protocol, usually HTTP, with the standard HTTP
verbs (GET, POST, PUT, DELETE, etc.). Using these verbs it is easy to send data to and
retrieve data from remote servers over the web [11]. REST systems interface with external
systems as web resources identified by URIs. For example, using a standard
verb such as DELETE we can operate on the URI /class/science as DELETE /class/science [12].
In the REST architecture, clients and servers exchange representations of resources
using a standardized interface and protocol. RESTful applications tend to be
lightweight, simple and fast because of the following principles [11]:
• Resource identification through URI: A RESTful web
service exposes a set of resources, which are identified by URIs. These URIs provide a
global addressing space for resource and service discovery [11].
• Uniform interface: Resources are manipulated using a fixed set of four operations:
PUT, GET, POST and DELETE. For example, GET retrieves the current state of a
resource [11].
• Self-descriptive messages: Resources are decoupled from their representations, so
that their content can be accessed in various formats such as JSON, HTML, XML
and PDF [11].
• Stateful interactions through hyperlinks: Every interaction with a resource is
stateless, i.e. request messages are self-contained; any state is transferred
explicitly, for example through hyperlinks [11].
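The uniform interface can be sketched in Java as follows: the same resource URI is combined with different HTTP verbs to express different operations. The host name is a hypothetical example, the requests are only built and never sent, and the sketch uses java.net.http, which requires Java 11.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class UniformInterfaceSketch {

    // Hypothetical resource URI, following the /class/science example in the text.
    static final URI RESOURCE = URI.create("http://www.example.com/class/science");

    // GET retrieves the current state of the resource.
    static HttpRequest read() {
        return HttpRequest.newBuilder(RESOURCE).GET().build();
    }

    // PUT sends a new representation of the resource to the server.
    static HttpRequest replace(String body) {
        return HttpRequest.newBuilder(RESOURCE)
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // DELETE removes the resource.
    static HttpRequest remove() {
        return HttpRequest.newBuilder(RESOURCE).DELETE().build();
    }

    public static void main(String[] args) {
        for (HttpRequest request : new HttpRequest[] {read(), replace("{}"), remove()}) {
            System.out.println(request.method() + " " + request.uri().getPath());
        }
    }
}
```

Because each request carries its verb, URI and body, the server needs no memory of earlier requests to interpret it, which is the stateless property described above.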
REST is an architecture ideally suited to designing networked applications. The idea
is to provide a simple mechanism for connecting machines by making calls over HTTP.
REST is a lightweight alternative to SOAP-based web services. As an example, consider
getting a user’s information by querying a phone book application using web services and
SOAP. In this scenario, the request would look something like the following [13]:
<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
  <soap:Body xmlns:pb="http://www.acme.com/phonebook">
    <pb:GetUserDetails>
      <pb:UserID>12345</pb:UserID>
    </pb:GetUserDetails>
  </soap:Body>
</soap:Envelope>
Figure 2.1. Example of SOAP request. Source: [13]
However, if we fetch the same information using REST, the query is
much simpler:
http://www.acme.com/phonebook/UserDetails/12345
Figure 2.2. Example of REST request. Source: [13]
This is just a URL; there is no request body. The URL is sent to the server using a
simple GET request, and the result is raw data that we can use directly. With
REST, all we need is a network connection, and we can even test the API
directly in a browser [13].
REST can also handle more complex requests with multiple parameters, which are passed
in the URL:
http://www.acme.com/phonebook/UserDetails?firstName=John&lastName=Doe
Figure 2.3. Example of REST request with multiple parameters. Source: [13]
GET requests should be used for read-only queries; they are not appropriate for
cases where we need to change the data or the state of the server [13].
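A URL with several parameters, like the one in Figure 2.3, can be assembled programmatically. The Java sketch below is one possible way to do it; the class and method names are hypothetical, and the URLEncoder overload that takes a Charset requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class QueryUrlSketch {

    // Appends the parameters to the base URL as an encoded query string, in map order.
    static String withParams(String base, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return query.isEmpty() ? base : base + "?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("firstName", "John");
        params.put("lastName", "Doe");
        System.out.println(withParams("http://www.acme.com/phonebook/UserDetails", params));
    }
}
```

Encoding each name and value matters as soon as the input contains spaces or reserved characters such as & or =, which would otherwise change the meaning of the query string.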
2.3 NOSQL DATABASE

Traditionally, a database meant storing data in tabular form, i.e. in tables having rows
and columns. A table is like a mathematical relation, hence the term Relational Database
Management System (RDBMS). As technology progressed, a need arose to store
unstructured and semi-structured data as well. NoSQL databases
were designed to satisfy these emerging needs [14].
NoSQL, also read as “non-SQL” or “non-relational”, encompasses
a variety of database technologies designed to satisfy the needs of modern web
applications. Unlike in relational databases, data in a NoSQL database are stored in
formats other than tables. Key-value, wide-column and graph structures are among the data
structures used in NoSQL databases. For some workloads, these
data structures make operations faster than in relational
databases. The use of NoSQL in real-time web applications is increasing. MongoDB and
Cassandra are examples of NoSQL databases [14].
NoSQL Database Types [15]

• Document databases: In this type, each key is paired with a complex data structure
known as a document. Documents may be XML, JSON, BSON, etc.
They can contain key-array pairs, key-value pairs, maps, collections
or scalar values, but are self-describing and have a hierarchical
structure. Documents do not need to be exactly the same, but are typically
similar to each other. One example of this kind of database is MongoDB,
which provides a rich query language, indexing, etc. to ease the transition from
relational databases. This type of database is preferable for content management
systems, e-commerce applications, web analytics, blogging systems, etc.
Figure 2.4. Example of Key-Value pair. Source: [15]
• Graph stores: These store information about networks of data and are
preferable for applications whose data are highly connected, e.g. routing
of goods and money, social networks, etc.
• Column family stores: This type of database stores columns of data
together instead of rows. Examples include Cassandra and HBase. They are
optimized for queries over large datasets and are preferable for
systems with high read-write volume.
• Key-value stores: These store data in key-value form. Each item has
a key and an associated value. The data store simply stores the
value without interpreting what is inside; to the data store it is just a
blob, and the application using it needs to understand what was stored. This type is
preferable for systems that need to store user profiles,
session information, etc.
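The key-value model can be sketched in a few lines of Java. This is an in-memory illustration only, not a real database client: the store keeps each value as an opaque byte array, and it is the calling code that decides the bytes are, for example, a JSON string. The class name and the sample key and value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class KeyValueStoreSketch {

    // The store never inspects the value; to it, every value is just a blob of bytes.
    private final Map<String, byte[]> data = new HashMap<>();

    void put(String key, byte[] blob) {
        data.put(key, blob.clone());
    }

    // Returns a copy of the stored blob, or null if the key is unknown.
    byte[] get(String key) {
        byte[] blob = data.get(key);
        return blob == null ? null : blob.clone();
    }

    public static void main(String[] args) {
        KeyValueStoreSketch store = new KeyValueStoreSketch();
        // Only the application knows that this blob is JSON describing a session.
        store.put("session:42", "{\"user\":\"alice\"}".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.get("session:42"), StandardCharsets.UTF_8));
    }
}
```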
In comparison to relational databases, NoSQL databases can provide superior
performance for such workloads. They can support more than 100,000 read and write requests
per second. eBay, considered the largest online auction site, stores its media metadata in
MongoDB [16].
MongoDB is a document database. In this type of database it is preferable to
store data in a small number of collections, so that all the data needed for a specific
task is present in a single document and can be fetched with a single retrieval call. For
example, all the comments related to a blog post would be saved in the
blog post document, and every comment would be retrieved in a single call.
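The blog example can be mimicked with plain Java collections. The sketch below is an in-memory illustration of embedding comments inside the post document, not MongoDB driver code; the field names and the POSTS map standing in for a collection are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlogPostDocumentSketch {

    // Hypothetical in-memory "collection": post id mapped to its document.
    static final Map<String, Map<String, Object>> POSTS = new LinkedHashMap<>();

    // Creates a post document with an embedded, initially empty list of comments.
    static void savePost(String id, String title) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("_id", id);
        document.put("title", title);
        document.put("comments", new ArrayList<String>());
        POSTS.put(id, document);
    }

    // Adds a comment inside the post document rather than in a separate table.
    @SuppressWarnings("unchecked")
    static void addComment(String id, String comment) {
        ((List<String>) POSTS.get(id).get("comments")).add(comment);
    }

    // One lookup by id returns the post together with all of its comments.
    @SuppressWarnings("unchecked")
    static List<String> commentsOf(String id) {
        return (List<String>) POSTS.get(id).get("comments");
    }

    public static void main(String[] args) {
        savePost("post-1", "Hello");
        addComment("post-1", "Nice post");
        addComment("post-1", "Thanks for sharing");
        System.out.println(POSTS.get("post-1"));
    }
}
```

In a relational design the comments would typically live in their own table and be joined to the post at query time; embedding trades that flexibility for a single read.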
2.4 APACHE TOMCAT

Apache Tomcat, also known simply as Tomcat, is an open source web server developed by
the Apache Software Foundation (ASF) [17].
Tomcat provides a “pure Java” HTTP web server environment in which Java code can run,
as it implements several Java EE specifications such as JSP, WebSocket and Servlet.
It is a servlet container that renders web pages containing JavaServer
Pages code and executes servlets.
Tomcat can be used together with other servers such as the Apache HTTP Server or Netscape
Enterprise Server, or as a standalone product with its own internal web server.
CHAPTER 3
APPLICATION DESIGN AND IMPLEMENTATION
3.1 SYSTEM DESIGN
[Figure: stacked layers, in the order shown: Presentation Builder, Indexed Database, Sentiment Analyzer, Indexer, Raw Database, Data Puller, Web Data]
Figure 3.1. System Layered Architecture
The system is composed of the major layers shown above.
Each layer performs a different action to help retrieve reviews from the web for the
product requested by the user. Each layer is described in detail in the following pages,
along with its design, flow and connection to the next layer.
13
3.2 WEB DATA

With the rapid expansion of e-commerce over the past 10 years, more and more products
are sold on the Web, and more and more people are buying products online. Amazon is
considered to be the largest Internet-based retailer in the United States and handles
millions of transactions per day. People like online retail because it makes the
customer's life easier. In a physical store, customers have to travel to the store,
search for the product in a large space, and then stand in a long billing queue. They
cannot know in advance how the product will perform, and it is not feasible to check
every type of product available for their requirement. Online, it is easy to browse
different products, check whether they are in stock, read reviews from people who have
already used them, and then purchase with just a few clicks. To enhance the shopping
experience, it has become common practice for online merchants to allow customers to
write reviews of the products they have purchased. At the same time, users are becoming
more comfortable with the Web, so the number of people writing reviews is increasing
rapidly.
Various websites, such as Yelp, Twitter, and Amazon, provide reviews from people about a
product, a place, a restaurant, etc. Reviews exist for almost anything one can name.
Sometimes we need to check different websites or blogs for reviews of different kinds of
things: Amazon for almost all products, Yelp for places and restaurants, and so on.
Twitter, however, is a website where we can get reviews about almost everything, e.g.
products such as the iPhone, PlayStation, laptops, and other electronic devices, tourist
places such as “Grand Canyon” and “Sea World”, or a political party. One can get a
review or an opinion about almost anything from Twitter.
3.2.1 Twitter
Twitter is an online social networking tool that allows users to read and broadcast
short messages known as tweets. Twitter was created in March 2006 by Jack Dorsey, Evan
Williams, Biz Stone, and Noah Glass and launched in July 2006. It has more than 500
million users, of whom almost 332 million are active [3]. On average, approximately 500
million tweets are posted per day.
Tweets (messages) can be no longer than 140 characters. Messages are about what someone
is doing, where they are, what their opinion about something is, and so on. It is also
possible to rebroadcast someone else's tweet and give them credit for the original
tweet, e.g. "RT @user: Join us in San Diego for a workshop on social media, May 30" [4].
There are various ways to tweet about something and to read about a particular topic. We
can also follow famous personalities to learn their views and what they think about.
Figure 3.2. Example of tweets on Twitter about an iPhone. Source: [18]
Our application uses tweets of this kind to gather reviews of the product we requested.
[Panels shown in the figure: Web (web opinion data), Server (data processing by
opinionMiner), Client (review display on the user interface)]
Figure 3.3. Web Opinion and the system flow.

3.3 DATA PULLER

This layer fetches the data (tweets) about the requested product from the web, i.e.
Twitter, and stores it raw in a database. The indexer then uses this data for
processing. To accomplish this we use a REST API exposed by Twitter for developers.
Twitter has exposed REST APIs to give developers programmatic access to read and write
Twitter data, e.g. to read tweets about a product, follow data, or tweet our view as a
user. The REST API identifies users and applications using OAuth. Responses are returned
in JSON format [18].
Figure 3.4. Working of Data Puller.
OAuth is an open standard protocol that allows secure authorization from many different
kinds of applications in a simple and standard way. It is considered secure because
users do not need to share their passwords. Twitter exposes the following two
authentication models:
• Application-user authentication
• Application-only authentication
In this application we have used application-only authentication.
3.3.1 Application-Only Authentication

In this model, the application makes API requests on its own behalf rather than on
behalf of a specific user. Because we use this model, we cannot make any request that
requires a user's context; e.g. posting tweets would not work. The authentication steps
are as follows [19]:
• The application encodes its consumer key and consumer secret into a specially encoded
set of credentials.
• The application makes a request to the POST oauth2/token endpoint to exchange these
credentials for a bearer token.
• The application uses this bearer token to get authorized when accessing the REST
API.
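The three steps above can be sketched with the Java standard library only. This is a minimal sketch under stated assumptions: the full endpoint URL (the API host is taken from the search URL used elsewhere in this chapter), the "Basic" and "Bearer" header formats, and the `grant_type=client_credentials` request body follow our understanding of Twitter's application-only flow at the time of writing and should be checked against [20]. The class and method names are hypothetical, and no network request is made.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Builds the header values used in application-only authentication (no network calls). */
final class BearerCredentials {

    /** Assumed full URL of the POST oauth2/token endpoint. */
    static final String TOKEN_URL = "https://api.twitter.com/oauth2/token";

    /** Assumed form body sent with the token request. */
    static final String TOKEN_REQUEST_BODY = "grant_type=client_credentials";

    /** Step 1: URL-encode key and secret, join them with ':', then Base64-encode the pair. */
    static String encode(String consumerKey, String consumerSecret) {
        String pair = urlEncode(consumerKey) + ":" + urlEncode(consumerSecret);
        return Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    /** Step 2: Authorization header value for the request that obtains the bearer token. */
    static String tokenRequestHeader(String consumerKey, String consumerSecret) {
        return "Basic " + encode(consumerKey, consumerSecret);
    }

    /** Step 3: Authorization header value for later REST API calls. */
    static String apiRequestHeader(String bearerToken) {
        return "Bearer " + bearerToken;
    }

    private static String urlEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The consumer key and secret never leave the application in plain form on later calls; only the bearer token is sent with each search request.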
Figure 3.5. Auth Flow. Source: [20]
3.3.2 The Search API

The purpose of this API (exposed by Twitter) is to allow developers to query for recent
or popular tweets. This search feature is similar to the search available in the Twitter
web or mobile clients. We use the GET method to retrieve the tweets from the URL
“https://api.twitter.com/1.1/search/tweets.json”.
The response is in JSON format. We are allowed to make 450 requests per 15-minute
window.
The following are the parameters we use to make a request:
• q (required): the search query, with a maximum length of 500 characters. E.g.
“Iphone 6S”
• lang (optional): restricts the tweets in the response to the given language. The value
must be an ISO 639-1 code. E.g. “en” for English
• locale: specifies the language of the query we are sending. Currently only “ja” is
effective. E.g. “ja”
• count: specifies the number of tweets we want per page. The maximum is 100 and the
default is 15. E.g. 25
GET
https://api.twitter.com/1.1/search/tweets.json?q=%22IPhone%206S%22&count=25&lang=en&locale=ja
Figure 3.6. Sample Request of Twitter API. Source: [21]
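A request URL of this shape can be assembled as in the following sketch, which uses only the Java standard library. The class and method names are hypothetical; the endpoint and the parameter names (q, count, lang) are the ones described above. `URLEncoder` produces form encoding, where a space becomes `+`, so the sketch rewrites `+` as `%20` to match the percent-encoded form of the sample request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Assembles the GET URL for the search endpoint from its query parameters. */
final class SearchRequest {

    static final String ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json";

    /** Builds the URL; lang may be null, count must lie within the range 1 to 100. */
    static String buildUrl(String query, String lang, int count) {
        if (count < 1 || count > 100) {
            throw new IllegalArgumentException("count must be between 1 and 100");
        }
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", query);
        params.put("count", Integer.toString(count));
        if (lang != null) {
            params.put("lang", lang);
        }
        StringJoiner url = new StringJoiner("&", ENDPOINT + "?", "");
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.add(encode(p.getKey()) + "=" + encode(p.getValue()));
        }
        return url.toString();
    }

    /** Percent-encodes a value; a literal '+' in the input is already encoded as %2B. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```

For example, `SearchRequest.buildUrl("\"IPhone 6S\"", "en", 25)` yields the sample request without the locale parameter.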
Figure 3.7. Sample Response from Twitter. Source: [21]

3.4 RAW DATABASE

The raw database here means JSON files. Every time we ask Twitter for tweets about a
product, it returns a JSON response containing tweets and other information. We save
these responses as JSON files in a directory.
For example, suppose the user searched for “IPhone” and Twitter returned a response for
that query. We create a directory (if it does not exist) named after the product we
searched for, in this example “IPhone”. In this directory we create a .json file, e.g.
“IPhone_1.json”, containing the current response. Every time we save a response, we
increment the counter that indicates the number of responses we have so far. The idea
behind saving these files is to have them as a backup for the data we have in the
database. In an emergency, we can recover the data from these files. Once the file is
saved for the current response, the indexer reads the tweets in it and starts tokenizing
them.
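The naming scheme described above can be sketched as follows. This is a minimal sketch, not the application's actual code: the class name and method names are hypothetical, the counter is derived from the number of .json files already present in the product directory (an assumption), and product names are assumed to be valid directory names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Saves each raw JSON response as <root>/<product>/<product>_<n>.json. */
final class RawResponseStore {

    private final Path root;

    RawResponseStore(Path root) {
        this.root = root;
    }

    /** Writes one response and returns the path of the file that was created. */
    Path save(String product, String jsonResponse) {
        try {
            Path dir = root.resolve(product);
            Files.createDirectories(dir); // does nothing if the directory already exists
            int next = countSaved(dir) + 1; // counter = responses saved so far + 1
            Path file = dir.resolve(product + "_" + next + ".json");
            Files.write(file, jsonResponse.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Counts the .json files already stored for a product. */
    private static int countSaved(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return (int) files
                    .filter(p -> p.getFileName().toString().endsWith(".json"))
                    .count();
        }
    }
}
```

Deriving the counter from the directory contents keeps it correct across restarts, at the cost of listing the directory on every save.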
3.5 INDEXER
The function of the Indexer layer is to read the data from the raw database, which here
is a JSON file, and to process every tweet in it. Each tweet is broken down into words,
analyzed, and finally tagged with its appropriate label, e.g. noun, verb, adjective. The
words are tagged using the part-of-speech tagging process supported by the Apache
OpenNLP library.
3.5.1 Apache OpenNLP

Apache OpenNLP is a library for processing natural language text. Most common NLP
tasks, such as part-of-speech tagging, parsing, sentence segmentation, and tokenization,
can be performed with this library. The library has various components, such as a
sentence detector, tokenizer, part-of-speech tagger, parser, and chunker; these
components allow one to build a whole natural language processing pipeline. Each
component includes the parts required to execute the respective natural language
processing task and to train and evaluate its model. Pre-trained models can be
downloaded from the following link [22].
Figure 3.8. Source to download pre-trained models.
Models are available in various languages; the language of the input text should match
the language of the model. In this application we have used English language models
only. To execute a task, a model and an input text are required. To load a model, open
the model file with a FileInputStream and pass the stream to the constructor of the
model class [22].
Figure 3.9. Sample code to load the model. Source: [23]
Once the model is loaded, instantiate the tool.
Figure 3.10. Sample code to instantiate the model. Source: [23]
Once the tool is successfully instantiated, the processing task can be executed as
below. Tools have their own formats for input and output, but usually these are arrays
of Strings.
Figure 3.11. Sample code to execute the function. Source: [23]
3.5.1.1 TOKENIZER
This component of OpenNLP splits the given input character sequence into tokens. Words,
punctuation, numbers, etc. are considered tokens.
Figure 3.12. Sample of input text. Source: [24]
The following result shows the individual tokens in a whitespace-separated
representation:
Figure 3.13. Sample of individual tokens. Source: [24]
The tokenizer can be integrated into the application as follows. First, we load the
model:
Figure 3.14. Sample code loading Tokenizer Model. Source: [24]
Once the model is loaded, TokenizerME can be instantiated as below:
Figure 3.15. Sample code to instantiate a TokenizerME (learnable tokenizer). Source: [24]
Thereafter, we call the tokenize method to generate the tokens of the given input
sentence:
Figure 3.16. Sample code calling tokenizer methods. Source: [24]
The output is an array of tokens (a string array), as below:
Figure 3.17. Sample of generated tokens. Source: [24]
3.5.1.2 PART-OF-SPEECH TAGGER

The part-of-speech tagger marks each token with its corresponding word type, based on
the token itself and on its context. A token may have multiple possible POS tags
depending on the token and the context, so the OpenNLP part-of-speech tagger uses a
probability model to predict the correct tag out of the tag set.
The sample code below shows the integration of the POS model into the application.
First, we load the POS model:
Figure 3.18. Sample code to load the POS model. Source: [25]
Once the model is loaded, instantiate the POSTaggerME:
Figure 3.19. Sample code to instantiate the POSTaggerME. Source: [25]
Now the tagger is ready to tag the data. The input should be a tokenized sentence given
as an array of strings, where each string is one token.
Figure 3.20. Sample code calling method to generate tags. Source: [25]
For each token (string object) in the input array there is a POS tag in the “tags”
array. Each tag is at the same index as its token in the input array.
Suppose we read the raw database for an iPhone and it contains the following tweet:
Example:
• Input: Iphone6 has excellent features.
• Tagged statement: Iphone6 (NN) has (VB) excellent (JJ) features (NN).
The tagged statement above can be considered the output of the indexer.
Here,
NN – Noun, singular or mass
VB – Verb
JJ – Adjective
The POS tags mentioned above are known as the Penn English Treebank POS tags. A list of
the tags can be found at
Figure 3.21. List of POS tags.
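The index alignment between the token array and the tag array can be sketched as below. The arrays in the usage example are hard-coded from the example tweet above; in the application they would come from the tokenizer and the POS tagger. The class and method names are hypothetical.

```java
/** Renders a tagged statement from parallel token and tag arrays. */
final class TaggedStatement {

    /** Pairs tokens[i] with tags[i] and joins the pairs as "token (TAG)". */
    static String format(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("each token needs exactly one tag");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            // The tag for a token sits at the same index in the tags array.
            out.append(tokens[i]).append(" (").append(tags[i]).append(')');
        }
        return out.toString();
    }
}
```

Calling `TaggedStatement.format(new String[] {"Iphone6", "has", "excellent", "features"}, new String[] {"NN", "VB", "JJ", "NN"})` reproduces the tagged statement of the example, without the final period.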
Figure 3.22. Working of Indexer.
3.6 SENTIMENT ANALYZER
3.6.1 Sentiment
Sentiment is a private state that cannot be verified or observed; it includes emotions,
opinions, etc. Subjectivity refers to the subject and his or her perspective, feelings,
beliefs, and desires [1]. Something is objective if there is proof for it, that is, if
it can be backed up with solid data, whereas something subjective cannot be proved, e.g.
opinions, interpretations, etc. Different people may have different opinions about a
product, so opinions are conclusions that remain open for discussion. An opinion that
reflects one's feelings can be described as a sentiment [1].
3.6.2 Sentiment Analysis

Sentiment analysis can be described as a form of text analysis whose goal is to help
users and ease the decision-making process. This is achieved by analyzing the opinions
of several users and classifying them as positive or negative, or on an n-point scale.
Nowadays Twitter is considered a good source of user opinions, so sentiment analysis of
tweets can be considered an effective way of measuring public opinion, both for the
business market and for general studies. Text can be classified as objective or
subjective. Opinions are subjective, and they indicate the sentiment of a person about
something, e.g. an electronic device, an organization, a city, or the service at a
restaurant [26].
Here we have used SentiWordNet for sentiment analysis of the opinions. SentiWordNet
provides a text file containing the model data. Using this data we create a MongoDB
collection named SentiWordNet, in which we save each word, its tag, and its score. To
analyze a tweet, we loop through all of its words along with their tags and look up a
score for each in the model data. Finally, we sum all the scores. If the total score is
greater than zero, the opinion is considered positive; otherwise it is considered
negative. To ease the summation, each word score is mapped to an integer: a score
greater than 0.5 is returned as 2, and a score between 0 and 0.5 as 1. Similarly, a
score less than -0.5 is returned as -2, and a score between -0.5 and 0 as -1.
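The scoring rule above can be sketched as follows. This is an illustration under stated assumptions rather than the application's code: the lexicon is a small in-memory map keyed by "word#tag" instead of the MongoDB collection, words missing from the lexicon contribute nothing, a score of exactly 0.5 or -0.5 falls into the 1 or -1 bucket, and a score of exactly 0 maps to 0. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

/** Sums bucketed word scores for one tweet, following the thresholds described above. */
final class TweetScorer {

    /** Hypothetical lexicon: "word#tag" mapped to a raw score between -1 and 1. */
    private final Map<String, Double> lexicon;

    TweetScorer(Map<String, Double> lexicon) {
        this.lexicon = lexicon;
    }

    /** Maps a raw score to one of -2, -1, 0, 1, 2. */
    static int bucket(double score) {
        if (score > 0.5) return 2;
        if (score > 0) return 1;
        if (score < -0.5) return -2;
        if (score < 0) return -1;
        return 0;
    }

    /** Looks up every (word, tag) pair and adds up the bucketed scores. */
    int total(String[] words, String[] tags) {
        int sum = 0;
        for (int i = 0; i < words.length; i++) {
            Double score = lexicon.get(words[i].toLowerCase(Locale.ROOT) + "#" + tags[i]);
            if (score != null) {
                sum += bucket(score);
            }
        }
        return sum;
    }

    /** A total greater than zero counts as positive; anything else counts as negative. */
    static boolean isPositive(int total) {
        return total > 0;
    }
}
```

Bucketing keeps the running total an integer, so a single strongly polarized word outweighs one mildly polarized word of the opposite sign.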
In an opinion, the presence of negative words such as “no”, “never”, or “problem” can
reverse the expressed opinion. Opinions of this type need additional work. We have used
the following negation rules [27]:
• Negative Negative → Positive (example: “no problem”)
• Negative Positive → Negative (example: “not good”)
• Negative Neutral → Negative (example: “does not work”)
If, at the end, the negative flag is found to be true, we subtract the score from the
total score.
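The three negation rules can be written as a small lookup, sketched below. Only the rule table is taken from the text; the enum, the class name, and the method name are hypothetical, and the sketch does not attempt to reproduce how the application detects negation words or applies the negative flag to the total.

```java
/** Encodes the three negation rules as a function of the word that follows the negation. */
final class NegationRules {

    enum Polarity { POSITIVE, NEGATIVE, NEUTRAL }

    /** Polarity of the phrase "negation word + next word", given the next word's polarity. */
    static Polarity afterNegation(Polarity next) {
        switch (next) {
            case NEGATIVE:
                return Polarity.POSITIVE; // Negative Negative -> Positive, e.g. "no problem"
            case POSITIVE:
                return Polarity.NEGATIVE; // Negative Positive -> Negative, e.g. "not good"
            default:
                return Polarity.NEGATIVE; // Negative Neutral -> Negative, e.g. "does not work"
        }
    }
}
```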
For this application our approach resembles Bag of Words: we consider only the words and
their meaning. We have used Apache OpenNLP for part-of-speech (POS) tagging and sentence
detection, and SentiWordNet for the meaning of each word.
We calculate the final polarity by evaluating the weight of each word. Each word has a
predefined polarity with respect to its POS, and we use those values to calculate the
approximate final polarity of a document or a sentence.
3.7 INDEXED DATABASE

Once the tweets are assigned a polarity, we save them into MongoDB. In MongoDB we create
collections (tables, in RDBMS terms) named after the product for which we are saving the
tweets. If the collection already exists, we simply save more tweets in addition to the
existing records. To get the collection object we can use a query as below:
Figure 3.23. Query retrieving Collection.
Here the parameter has the data type string. If the specified collection does not exist,
MongoDB creates one.
In the collection we store a tweet, its username, and its score (polarity) as one
document (a row, in RDBMS terms).
Figure 3.24. Data in collection.

Once the tweets from the current search response are inserted into the database, we
fetch all the tweets from the collection and save them in a JSON object, which is then
used for displaying the tweets.
3.8 PRESENTATION BUILDER

The task of the presentation builder is to build a graphical representation of the
opinions. All the tweets from the respective collection are read, and a chart is
generated showing the total number of positive, negative, and neutral tweets. In the
sidebar we display the positive and negative tweets along with the username. Negative
tweets are colored red, while positive ones are colored green.
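The counts that feed such a chart can be computed as in the sketch below. It assumes each stored tweet carries an integer score and, because the chart reports neutral tweets, it treats a score of exactly zero as neutral; that threshold is an assumption of this sketch. The class, enum, and method names are hypothetical.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Counts positive, negative, and neutral tweets from their scores. */
final class OpinionSummary {

    enum Label { POSITIVE, NEGATIVE, NEUTRAL }

    /** Assumed labeling: above zero is positive, below zero is negative, zero is neutral. */
    static Label labelOf(int score) {
        if (score > 0) return Label.POSITIVE;
        if (score < 0) return Label.NEGATIVE;
        return Label.NEUTRAL;
    }

    /** Returns one count per label; labels with no tweets are reported as 0. */
    static Map<Label, Integer> count(List<Integer> scores) {
        Map<Label, Integer> counts = new EnumMap<>(Label.class);
        for (Label label : Label.values()) {
            counts.put(label, 0);
        }
        for (int score : scores) {
            counts.merge(labelOf(score), 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting three numbers are what the chart needs; the same `labelOf` call could also decide whether a tweet is shown in green or red in the sidebar.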
Figure 3.25. Working of Presentation builder.
To get images of the search item we have used the Google Custom Search JSON API. This
API lets developers retrieve and display results from Google Custom Search
programmatically in an application or website. We use RESTful requests to get image
search results in JSON format: we invoke the service with the HTTP GET method, and the
response is returned in JSON format [28].
Figure 3.26. Sample request of Google API.
From the response we pick the first image. Once we have the image, we create a
collection in MongoDB and store the image URL in it.
Figure 3.27. Collection storing Image information.
The next time we search for the item, we check whether it already exists in the Images
collection. If it does, we fetch the image URL from the collection and use it for
display; otherwise we call the API.
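This check-then-fetch behaviour can be sketched as below. An in-memory map stands in for the Images collection and a function argument stands in for the image search API call; both are assumptions made to keep the sketch self-contained, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Looks up an item's image URL locally first and calls the search API only on a miss. */
final class ImageUrlCache {

    /** Stand-in for the Images collection: item name mapped to image URL. */
    private final Map<String, String> images = new HashMap<>();

    /** Stand-in for the image search API call. */
    private final Function<String, String> imageSearch;

    ImageUrlCache(Function<String, String> imageSearch) {
        this.imageSearch = imageSearch;
    }

    /** Returns the stored URL if present; otherwise fetches it once and stores it. */
    String imageUrlFor(String item) {
        return images.computeIfAbsent(item, imageSearch);
    }
}
```

Repeated searches for the same item then cost one local lookup instead of one API request each.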
CHAPTER 4
APPLICATION WORKFLOW
Home Screen:
Figure 4.1. Home screen of an Application.
The picture above is the home screen of the application. From the search box we can
start a search for an item. From this screen the user can also enter a review directly
into the database, so there is no need to go to Twitter to write a review about an item.
Thereafter, whenever we search for the item for which we entered a review, our review is
displayed in the sidebar.
Figure 4.2. Search Results.
The image above is the final page, showing the results for the searched item along with
a picture of it. In the sidebar we display all the tweets for the searched item, marking
them as positive or negative in green and red respectively. Clicking the “Home” button
loads the home screen.
DATAFLOW DIAGRAM
Figure 4.3. Level 0-DFD.

The diagram above shows the flow of the application. The client (user interface), i.e.
the front end, sends a search request for an item to the application. The application,
i.e. the middle layer, invokes the REST API to get the tweets; Apache OpenNLP tags the
words, and SentiWordNet is used to analyze them and generate a score. Thereafter the
middle layer communicates with MongoDB, the back end, and saves the tweets in it.
Lastly, the middle layer fetches all the tweets from the database and passes them to the
front end for display.
CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENT
This application helps the user get an opinion about almost anything with a single
click. Users do not need to browse many different websites to gather other users'
opinions. The application summarizes the opinions in a graphical presentation, so the
user can get an overview of the searched item at a glance.
Currently we do not consider the context of a word with respect to the sentence or the
document, so the scores are not very accurate. People often write in shorthand on social
networking sites, which causes problems in identifying the exact word. Negation used
together with an adjective is not handled properly. All of these things can be improved
as future enhancements.
5.1 FUTURE WORK

• Use a rule-based classifier to identify the orientation of a statement more accurately
and precisely.
• Handle negation in statements more effectively by using negation rules.
• Use a feature-based extraction technique to handle context-dependent words.
• Build a more precise lexicon dictionary using a statistical classifier.
• Integrate the API of another website, such as BestBuy, and aggregate the opinions.
REFERENCES
[1] Y. MEJOVA, Sentiment analysis: an overview, Comprehensive exam paper, University
of Iowa, Iowa City, IA, 2009.
[2] P. CHIKERSAL, S. PORIA, AND E. CAMBRIA, Sentiment analysis of tweets by combining a
rule-based classifier with supervised learning, in Proceedings of the 9th International
Workshop on Semantic Evaluation (Proc., Denver, 2015), Association for
Computational Linguistics, Stroudsburg, Pennsylvania, 2015, pp. 647-651,
http://anthology.aclweb.org/S/S15/S15-2.pdf#page=689, accessed April 2016, n.d.
[3] WIKIPEDIA, Twitter. Wikipedia, https://en.wikipedia.org/wiki/Twitter, accessed April
2016, n.d.
[4] MARCUS INSTITUTE FOR DIGITAL EDUCATION IN THE ARTS, Twitter overview. MIDEA,
http://midea.wiki.nmc.org/Twitter+Overview, accessed April 2016, n.d.
[5] SENTIWORDNET, Home page. SentiWordNet, http://sentiwordnet.isti.cnr.it/, accessed
April 2016, n.d.
[6] K. KARANDIKAR, Java in education. EDC385G Interactive Multimedia Design &
Production at the University of Texas, Austin,
http://www.edb.utexas.edu/minliu/multimedia/PDFfolder/JAVA_I~1.PDF, accessed
April 2016, n.d.
[7] D. FLANAGAN, Java in a Nutshell, 3rd edition, O’Reilly, Sebastopol, California, 1999,
http://docstore.mik.ua/orelly/java-ent/jnut/ch01_02.htm, accessed April 2016, n.d.
[8] J2EEBRAIN, Advantages of servlets. J2EEBrain,
http://www.j2eebrain.com/java-J2ee-advantages-of-servlets.html, accessed April 2016, n.d.
[9] TUTORIALS POINT, JSP overview. Tutorials Point,
http://www.tutorialspoint.com/jsp/jsp_overview.htm, accessed April 2016, n.d.
[10] TUTORIAL4US, Features of JSP. Tutorial4Us,
http://www.tutorial4us.com/jsp/feature-of-jsp, accessed April 2016, n.d.
[11] E. JENDROCK, R. CERVERA-NAVARRO, I. EVANS, D. GOLLAPUDI, K. HAASE, W.
MARKITO, AND C. SRIVATHSA, Java EE 6 tutorial. Oracle,
http://docs.oracle.com/javaee/6/tutorial/doc/gijqy.html, accessed April 2016, published
2013.
[12] WIKIPEDIA, Representational state transfer. Wikipedia,
https://en.wikipedia.org/wiki/Representational_state_transfer, accessed April 2016, n.d.
[13] M. ELKSTEIN, Learn REST: a tutorial. Blogger.com, http://rest.elkstein.org/, accessed
April 2016, n.d.
[14] WIKIPEDIA, NoSQL. Wikipedia, https://en.wikipedia.org/wiki/NoSQL, accessed April
2016, n.d.
[15] P. SADALAGE, NoSQL databases: an overview. ThoughtWorks,
https://www.thoughtworks.com/insights/blog/nosql-databases-overview, accessed April
2016, published October 2014.
[16] MONGODB, Home page. MongoDB, https://www.mongodb.com, accessed April 2016,
last modified 2016.
[17] INTERNET LIVE STATS, Twitter usage statistic. Internet Live Stats,
http://www.internetlivestats.com/twitter-statistics/, accessed April 2016, n.d.
[18] TWITTER, Antonio J. Santos tweet. Twitter, https://twitter.com/, accessed April 2016,
published March 3, 2016.
[19] TWITTER, Twitter APIs. Twitter, https://dev.twitter.com, accessed April 2016, n.d.
[20] TWITTER, Application-only authentication. Twitter,
https://dev.twitter.com/oauth/application-only, accessed April 2016, n.d.
[21] TWITTER, GET search/tweets. Twitter,
https://dev.twitter.com/rest/reference/get/search/tweets, accessed April 2016, n.d.
[22] THE APACHE SOFTWARE FOUNDATION, Apache OpenNLP. The Apache Software
Foundation, https://opennlp.apache.org, accessed April 2016, published 2010.
[23] APACHE OPENNLP DEVELOPMENT COMMUNITY, Description. The Apache Software Foundation,
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#intro.description,
accessed April 2016, n.d.
[24] APACHE OPENNLP DEVELOPMENT COMMUNITY, Tokenization. The Apache Software Foundation,
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.tokenizer.introduction,
accessed April 2016, n.d.
[25] APACHE OPENNLP DEVELOPMENT COMMUNITY, Tagging. The Apache Software Foundation,
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.postagger.tagging,
accessed April 2016, n.d.
[26] A. Z. H. KHAN, M. ATIQUE, AND V. M. THAKARE, Combining lexicon-based and
learning based methods for Twitter sentiment analysis, Int. J. Elec., Comm. Soft
Comput. Sci. Eng., suppl. (2015), pp. 89-91,
http://search.proquest.com/openview/3bb4518ced02bbd25b28594d28b35094/1?pq-origsite=gscholar,
accessed April 2016.
[27] X. DING, B. LIU, AND P. S. YU, A holistic lexicon-based approach to opinion mining, in
Proceedings of the 2008 International Conference on Web Search and Data Mining
(Proc., Stanford, 2008), ACM, New York, New York, 2008, pp. 231-240.
[28] GOOGLE DEVELOPERS, Google custom search. Google,
https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request,
accessed April 2016, last modified December 2015.
