_______________
A Thesis
Presented to the
Faculty of
_______________
In Partial Fulfillment
Master of Science
in
Computer Science
_______________
by
Riddhi A. Shah
Spring 2016
ProQuest Number: 10125135
Published by ProQuest LLC (2016). Copyright of the Dissertation is held by the Author.
Copyright © 2016
by
Riddhi A. Shah
All Rights Reserved
DEDICATION
This work is dedicated to my family, my friends, and the faculty of San Diego State
University, for their incredible support, motivation, and love. Special thanks to Professor Carl
Eckberg for his continuous feedback and guidance.
ABSTRACT
Since its beginning, the internet has offered many ways to gather users’ reviews on almost
everything, and a web-based application is one of them. Earlier methods of collecting a
review included mail, telephone, and conversation in person. In addition to these traditional
methods, we can now get reviews from people with just a few clicks and in very little time.
My aim was to develop a review-based search engine that allows users to get
reviews or opinions about a product based on the available information.
In this application, I obtain a raw dataset from Twitter using Twitter’s REST API, with
OAuth for authorization. The responses are unstructured JSON, so I store them in MongoDB,
a NoSQL database. To analyze the tweets, I have used the Apache OpenNLP and
SentiWordNet packages. The images of the products are retrieved using the Google API for
images.
The purpose of this paper is to present the basic concepts of creating and developing a
user-friendly review-based search engine, which can be extended to a larger scale by
adding features that support more detailed reviews of a particular product.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
1.1 Overview
1.2 Twitter Overview
1.3 Motivation
1.4 Research Objectives
1.5 Overview
1.5.1 Web Search Tool
1.5.2 Web Service and Database
1.5.3 Parsing and Analyzing Packages
1.6 Summary of Chapters
2 TECHNOLOGIES
2.1 Java
2.1.1 Servlets
2.1.2 Java Server Pages
2.2 Restful Web Services
2.3 NoSQL Database
2.4 Apache Tomcat
3 APPLICATION DESIGN AND IMPLEMENTATION
3.1 System Design
3.2 Web Data
ACKNOWLEDGEMENTS
I am very thankful to Professor Carl Eckberg for being the Chair of the thesis
committee and for his constant support, guidance, and encouragement during my program of
study and my thesis work. I also express my gratitude and sincere thanks
to Professor Xiaobai Liu and Professor Timothy Dunster for their feedback, guidance, and
presence on my thesis committee. I am very thankful to my parents, family, and friends
for their love and support.
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
The internet is now part of everyday life, and since its early days social
networking, blogging, and online sales have been booming. With the help of the internet it has
become very easy to get people’s opinions on almost anything within just a few minutes.
Earlier, people needed to call each other or use mail to get an opinion on something; now, in
addition to those traditional methods, online channels make gathering users’ opinions
much easier. As a result, people want opinions from several users before making a
decision, e.g. buying a product, choosing a restaurant for dinner, or casting a political
vote. Our behavior and decision-making processes are highly influenced by the opinions of
other users [1].
These days, a large number of people express their reviews on microblogging
websites such as Twitter. The ability to analyze this data and extract reviews on
different topics can help us make better choices and predictions regarding those
topics. As a result, sentiment analysis of tweets is gaining importance in
domains such as e-commerce [2].
With this in mind, the idea behind this tool is to provide a simple, user-friendly
application for getting users’ opinions based on tweets from Twitter. We can search for
reviews of any kind of product, and with a single click we get a chart displaying the
results of the analyzed tweets or reviews. The chart shows how many tweets sounded
positive, negative, and neutral. We also display each tweet next to its username, with a
color marking it as positive or negative. In addition, a user can enter a review directly
into the database; that review is then analyzed and marked as positive, negative, or neutral.
1.3 MOTIVATION
For the past 5 years, I have worked as a software developer, mostly as a web developer.
My role during these years was to build small as well as large-scale web applications and
web services. While pursuing a Master’s degree at San Diego State University, I tried to
learn as many computer languages as I could. I learned a lot, including new technologies in
web programming, how to design large-scale applications, and how to work with databases;
considering all of this, I decided to do my thesis work in web and data mining.
1.5 OVERVIEW
The search engine has been designed and developed using the latest available web
technologies, a NoSQL database, and object-oriented programming. The main sections of
this search engine are:
Web search tool
REST API (web services) and NoSQL database
Packages for parsing, analyzing, and scoring the tweets
CHAPTER 2
TECHNOLOGIES
2.1 JAVA
There are several reasons for choosing Java over other languages. Firstly, Java is a
platform-independent language that can easily be used to create software embedded in
many different electronic devices, hence Java’s slogan “Write once, run anywhere”.
Java can be described as a simple, object-oriented, network-savvy, interpreted, robust,
secure, architecture-neutral, portable, high-performance, multithreaded, dynamic language
[6]. In addition to these features, Java provides programmers with a strong and vibrant
ecosystem:
Java is backed by one of the industry leaders, Oracle.
Many open source libraries are available for building Java applications.
Various tools and IDEs are available to ease Java development.
Various frameworks are available to help us build highly reliable applications
quickly.
Software developed in Java can generally be used on any platform with a Java runtime
without altering the code or the libraries being used. Another important feature of
Java is security. Both the platform and the language were designed with security as a
prime concern. Code downloaded from the network can be run in a restricted environment
intended to keep it from harming the host system, for example by infecting it with a
virus. This feature helps make the Java platform distinctive [7].
In the past few years, I have worked with several languages, including C#, Clojure,
Python, and Java. Among these, I chose Java for the reasons described above, but the main
reason was to get the advantage of using servlets. With the help of servlets one can
create fast and efficient web applications, which can then be executed on any
servlet-enabled web server. Servlets can handle multiple requests at a time and generate
content or web pages dynamically, and they are easy to write and fast to execute within
web servers [8].
A number of paid and free IDEs are available to support Java development, such as
NetBeans, IntelliJ IDEA, Eclipse, and JCreator. For the development of this application, I
have used JDK 8 and Eclipse Mars as the IDE. Eclipse eases development with features like
code completion, debugging, and easy integration with version control systems. JDK 8
can be downloaded from its official website by selecting the appropriate installation
platform.
2.1.1 Servlets
Servlets are an essential part of developing any J2EE application. Their advantages
include portability, safety, performance, efficiency, power, integration, and
extensibility.
Core features:
Portability: Servlets are written in Java and hence can be used on any
platform. It is straightforward to develop a servlet on a Windows machine
using a server like Tomcat and deploy it on another operating system such as
UNIX.
Powerful: With the help of servlets, it is easy to implement a database
connection. With the help of the servlet session tracking mechanism, it is easy to
maintain information from request to request.
Efficiency: Servlets scale well because concurrent requests are handled by
separate threads, i.e. a single instance of a servlet class can serve many
requests at once.
2.1.2 Java Server Pages
JSPs are essentially an extension of servlets. They ease application development
by adding several features on top of those of servlets, such as custom tags, implicit
objects, and the expression language. As an extension of servlets, they share features
like being platform independent, secure, and dynamic. JSP is considered
browser and server independent, as tags are processed and executed by the server-side web
container [10].
Servlets can also emit HTML, but in an application they are
generally used for business layer logic. JSP supports HTML plus Java code, so in an
application it is used for the presentation layer.
The POST method is usually considered a good choice for transferring information to the
backend system. The information is sent in the body of the request; the backend system then
parses the input and processes it. In a JSP, the implicit request object provides a method
named “getParameter()”, which we can use to read the data passed by the front end.
<%= request.getParameter("search_text") %>
2.2 RESTFUL WEB SERVICES
Resource identification through URI: A RESTful web service exposes a set of
resources, which are identified by URIs. These URIs provide a global addressing
space for resource and service discovery [11].
Uniform interface: Resources are manipulated using a fixed set of four operations:
PUT, GET, POST, and DELETE. For example, GET retrieves the current state of a
resource [11].
Self-descriptive messages: Resources are decoupled from their representation, so
their content can be accessed in a variety of formats such as JSON, HTML, XML,
and PDF [11].
Stateful interactions through hyperlinks: Every interaction with a resource is
stateless, because request messages are self-contained [11].
REST is an architectural style for designing networked applications. The idea
is to provide a simple mechanism for connecting machines by making calls over HTTP.
REST is a lightweight alternative to SOAP-based web services. As an example, consider
getting a user’s information by querying a phone book application using web services and
SOAP. In this scenario, the request would look something like the following [13]:
<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
  <soap:Body xmlns:pb="http://www.acme.com/phonebook">
    <pb:GetUserDetails>
      <pb:UserID>12345</pb:UserID>
    </pb:GetUserDetails>
  </soap:Body>
</soap:Envelope>
Figure 2.1. Example of SOAP request. Source: [13]
However, if we fetch the same information using REST, the query is much simpler:
http://www.acme.com/phonebook/UserDetails/12345
This is just a URL; there is no request body. The URL is sent to the server using a
simple GET request, and the result is raw data that we can use directly. With REST, all
we need is a network connection, and we can test the request directly in the
browser [13].
REST can also handle more complex requests with multiple parameters, which are passed
along in the URL:
http://www.acme.com/phonebook/UserDetails?firstName=John&lastName=Doe
Figure 2.3. Example of REST request with multiple parameters. Source: [13]
Ideally, GET requests are used for read-only queries, not for cases where we need to
change the data or the state of the server [13].
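To illustrate how the two phone book URLs above carry all the information of the request, the following sketch splits a URL into its resource path and its query parameters using only JDK classes. The class and method names (RestUrlParts, resourcePath, queryParams) are hypothetical and are not part of the application; only the URLs come from the example in [13].

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: shows which parts of a REST URL identify the resource
// and which parts carry parameters.
final class RestUrlParts {

    // The path names the resource, e.g. /phonebook/UserDetails/12345.
    static String resourcePath(String url) {
        return URI.create(url).getPath();
    }

    // Splits the query string into decoded name/value pairs, in order.
    // Returns an empty map when the URL has no query string.
    static Map<String, String> queryParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String rawQuery = URI.create(url).getRawQuery();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String byId = "http://www.acme.com/phonebook/UserDetails/12345";
        String byName =
            "http://www.acme.com/phonebook/UserDetails?firstName=John&lastName=Doe";
        System.out.println(resourcePath(byId));
        System.out.println(queryParams(byName));
    }
}
```

Unlike the SOAP envelope, nothing has to be parsed from a request body here; the path and the query string are enough for a read-only GET request.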
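CHAPTER 3
APPLICATION DESIGN AND IMPLEMENTATION
3.1 SYSTEM DESIGN
[Figure: system layers, in the order shown: Presentation Builder, Indexed Database, Sentiment Analyzer, Indexer, Raw Database, Data Puller, Web Data]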
The system can be considered as composed of the major layers shown above.
Each layer performs a different action to help get reviews from the web for the
product the user requested. Each layer is described in detail in the following pages,
along with its design, flow, and connection with the next layer.
3.2.1 Twitter
Twitter is an online social networking tool that allows users to read
and broadcast short messages known as tweets. Twitter was created in March 2006
by Jack Dorsey, Evan Williams, Biz Stone, and Noah Glass and launched in July 2006. It has
more than 500 million users, of which almost 332 million are active [3]. On average,
approximately 500 million tweets are posted per day.
Tweets (messages) can be no longer than 140 characters. Messages are about what
someone is doing, where they are, what their opinion about something is, and so on. It is
also possible to rebroadcast someone else’s tweet and give them credit for the original
tweet, e.g. "RT @user: Join us in San Diego for a workshop on social media, May 30" [4].
There are various ways to tweet about something and to read about something in particular.
We can follow famous personalities and get to know their views.
Figure 3.2. Example tweets on Twitter about an iPhone. Source: [18]
These kinds of tweets are used by our application to get reviews of the
requested product.
OAuth is an open standard used to provide secure authorization in a simple,
standard way for many different kinds of applications. It is considered
secure because users don’t need to share their passwords with the application. Twitter
exposes the following two authentication models:
Application-user authentication
Application-only authentication
In this application we have used application-only authentication, which works as follows:
The application encodes its consumer key and consumer secret into a specially
encoded set of credentials.
The application makes a request to the POST oauth2/token endpoint to exchange
these credentials for a bearer token.
The application uses this bearer token for authorization when accessing the REST
API.
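The first step above can be sketched with JDK classes only. The sketch assumes the encoding described in Twitter's application-only authentication documentation: the consumer key and secret are URL-encoded, joined with a colon, and Base64-encoded, and the result is sent in a Basic Authorization header. The class name BearerCredentials, its method names, and the placeholder key and secret are hypothetical, and no network request is made.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for application-only authentication (no network calls).
final class BearerCredentials {

    // Step 1: URL-encode key and secret, join them with ':' and Base64-encode.
    static String encode(String consumerKey, String consumerSecret) {
        String joined = urlEncode(consumerKey) + ":" + urlEncode(consumerSecret);
        return Base64.getEncoder()
                     .encodeToString(joined.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2: Authorization header value for the POST oauth2/token request.
    static String basicAuthHeader(String consumerKey, String consumerSecret) {
        return "Basic " + encode(consumerKey, consumerSecret);
    }

    // Step 3: Authorization header value for later REST API calls.
    static String bearerAuthHeader(String bearerToken) {
        return "Bearer " + bearerToken;
    }

    private static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "key" and "secret" are placeholders, not real credentials.
        System.out.println(basicAuthHeader("key", "secret"));
    }
}
```

The token request itself would additionally carry the form body grant_type=client_credentials; it is left out here so the example runs without network access.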
q (required): the search query, with a maximum length of 500 characters. E.g. “Iphone 6S”
lang (optional): restricts the tweets in the response to the given language. The value
must be an ISO 639-1 code. E.g. “en” for English
locale (optional): specifies the language of the query we are sending. Currently only
“ja” is effective. E.g. “ja”
count (optional): specifies the number of tweets we want per page. The maximum is 100
and the default is 15. E.g. 25
GET https://api.twitter.com/1.1/search/tweets.json?q=%22IPhone%206S%22&count=25&lang=en&locale=ja
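A request URL of this form could be assembled as in the sketch below, which enforces the limits listed above for q and count. The class name TweetSearchUrl and its build method are hypothetical; the length check is applied to the unencoded query, which is an assumption about how the 500-character limit is counted. Spaces are written as %20 to match the example URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical builder for the search request URL; it performs no network call.
final class TweetSearchUrl {

    static final String ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json";

    // query is required; lang may be null to omit the parameter.
    static String build(String query, int count, String lang) {
        if (query == null || query.isEmpty() || query.length() > 500) {
            throw new IllegalArgumentException("q must have 1 to 500 characters");
        }
        if (count < 1 || count > 100) {
            throw new IllegalArgumentException("count must be between 1 and 100");
        }
        StringBuilder url = new StringBuilder(ENDPOINT)
            .append("?q=").append(encode(query))
            .append("&count=").append(count);
        if (lang != null) {
            url.append("&lang=").append(encode(lang));
        }
        return url.toString();
    }

    // URLEncoder writes spaces as '+'; the example URL uses %20 instead.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("\"IPhone 6S\"", 25, "en"));
    }
}
```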
3.5 INDEXER
The function of the Indexer layer is to read the data from the raw database, which here
is a JSON file, and process every tweet in it. Tweets are broken down into
words, analyzed, and finally tagged with the appropriate label, e.g. noun, verb, or
adjective. Words are tagged using the part-of-speech tagging process supported by the
Apache OpenNLP library.
Models are available in various languages; the language of the input text should match
the language the model was trained on. In this application we have used English language
models only. To execute a task, a model and an input text are required. To load a model,
open the model file with a FileInputStream and pass the stream to the constructor of the
model class [22].
Once the tool is successfully loaded, the processing task can be executed.
Tools have their own formats for input and output, but usually these are arrays of
Strings.
3.5.1.1 TOKENIZER
This component of OpenNLP generates tokens from the given input
character sequence. Words, punctuation, numbers, etc. are considered tokens.
The tokenizer can be integrated into the application as follows. First, we need to load
the model.
Thereafter, we call the method tokenize to generate the tokens of the given input
sentence.
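As a dependency-free illustration of what tokenization produces, the sketch below splits a sentence into words, numbers, and punctuation with a regular expression. It is only a stand-in: it does not load an OpenNLP model or call the OpenNLP tokenizer, and the class name RegexTokenizer is hypothetical. A model-based tokenizer can make different decisions on real tweets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a model-based tokenizer.
final class RegexTokenizer {

    // A run of letters, digits, or underscores is one token;
    // any other non-space character is a token of its own.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    static String[] tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(sentence);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Iphone6 has excellent features.")));
    }
}
```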
Figure 3.18. Sample code to load the POS model. Source: [25]
Now the tagger is ready to tag the data. The input should be a tokenized sentence as an
array of strings, where each string is one token.
Figure 3.20. Sample code calling method to generate tags. Source: [25]
For each token (string object) in the input array there is a POS tag in the
“tags” array, at the same index as the token in the input array.
Suppose we read the raw database for an iPhone and it contains a tweet like the one
below.
Example:
Input: Iphone6 has excellent features.
Tagged statement: Iphone6 (NN) has (VB) excellent (JJ) features (NN).
The tagged statement above can be considered the output of the indexer.
Here,
NN – Noun, singular or mass
VB – Verb
JJ – Adjective
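Because each tag sits at the same index as its token, the tagged statement can be rebuilt by walking the two arrays in parallel. In the sketch below the tags are hard-coded from the example above and stand in for the tagger's output; the class name TaggedSentence and its format method are hypothetical.

```java
// Hypothetical helper: pairs each token with the tag at the same index.
final class TaggedSentence {

    static String format(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("one tag is required per token");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(tokens[i]).append(" (").append(tags[i]).append(')');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Tags copied from the example; in the application they come from the tagger.
        String[] tokens = {"Iphone6", "has", "excellent", "features"};
        String[] tags = {"NN", "VB", "JJ", "NN"};
        System.out.println(format(tokens, tags));
    }
}
```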
Above mentioned POS tags are known as the Penn English Treebank POS tags. A list of the
tags can be found at
3.6.1 Sentiment
Sentiment is something that cannot be verified or observed; it is a private state
that includes emotions, opinions, etc. Subjectivity refers to the subject and his or her
perspective, feelings, beliefs, and desires [1]. Something objective has proof for its
existence; it can be backed up with solid data. Something subjective, by contrast,
cannot be proved, e.g. opinions and interpretations. Everyone
might have a different opinion about a product, so opinions are conclusions that
are open for discussion. An opinion reflecting one’s feelings can be described as
sentiment [1].
For this application our approach is similar to bag of words: we consider only
the words and their meaning. We have used Apache OpenNLP for part-of-speech (POS)
tagging and sentence detection, and SentiWordNet for the meaning of each word.
We calculate the final polarity by evaluating the weight of each word. Each word has a
predefined polarity with respect to its POS. We use those values to calculate the final
approximate polarity of a document or a sentence.
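A minimal sketch of this calculation is shown below, assuming the word weights are simply summed and the sign of the sum decides the label. The lexicon is a small in-memory map standing in for SentiWordNet, and the class name PolarityScorer, the key format, and the weights in the usage example are all hypothetical rather than values taken from SentiWordNet or from the application.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical bag-of-words scorer: sums one predefined weight per (word, POS) pair.
final class PolarityScorer {

    // Stand-in for the sentiment lexicon; keys look like "excellent#JJ".
    private final Map<String, Double> lexicon = new HashMap<>();

    void put(String word, String pos, double weight) {
        lexicon.put(key(word, pos), weight);
    }

    // Words that are missing from the lexicon contribute 0.
    double score(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("one tag is required per token");
        }
        double total = 0.0;
        for (int i = 0; i < tokens.length; i++) {
            total += lexicon.getOrDefault(key(tokens[i], tags[i]), 0.0);
        }
        return total;
    }

    static String label(double score) {
        if (score > 0) {
            return "positive";
        }
        return score < 0 ? "negative" : "neutral";
    }

    private static String key(String word, String pos) {
        return word.toLowerCase(Locale.ROOT) + "#" + pos;
    }

    public static void main(String[] args) {
        PolarityScorer scorer = new PolarityScorer();
        scorer.put("excellent", "JJ", 0.75); // made-up weight for illustration
        String[] tokens = {"Iphone6", "has", "excellent", "features"};
        String[] tags = {"NN", "VB", "JJ", "NN"};
        System.out.println(label(scorer.score(tokens, tags)));
    }
}
```

Because every word is scored in isolation, a sketch like this ignores context such as negation, which is the limitation of a bag-of-words approach.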
Here the parameter has the data type String. If the specified collection does not
exist, MongoDB creates it.
In the collection we store a tweet, its username, and its score (polarity) as one document
(a row in RDBMS terms).
Once the tweets from the current search response are inserted into the database, we
fetch all the tweets from the collection and save them in a JSON object, which is then
used for displaying the tweets.
To get images of the search item we have used the Google Custom
Search JSON API. This API lets developers of an application or
website retrieve and display results from Google Custom Search programmatically. We use
RESTful requests to get image search results: the service is invoked with the HTTP GET
method and the response is in JSON format [28].
From the response we pick the first image. Once we have the image, we create a
collection in MongoDB and store the image URL in it.
The next time we search for the item, we check whether it already exists
in the Images collection; if it does, we fetch the image URL from the collection and use
it for display, otherwise we call the API.
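The lookup described above is a check-then-fetch cache. The sketch below shows the control flow with an in-memory map standing in for the Images collection and a function standing in for the image search call; it does not use the MongoDB driver or the Google API, and the class name ImageUrlCache and the example URL are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the image lookup: stored URL first, API call otherwise.
final class ImageUrlCache {

    // Stand-in for the Images collection: item name -> image URL.
    private final Map<String, String> images = new HashMap<>();

    // Stand-in for the image search request.
    private final Function<String, String> imageSearch;

    ImageUrlCache(Function<String, String> imageSearch) {
        this.imageSearch = imageSearch;
    }

    String imageUrlFor(String item) {
        String stored = images.get(item);
        if (stored != null) {
            return stored;                    // item already in the collection
        }
        String url = imageSearch.apply(item); // otherwise call the API
        images.put(item, url);                // and store the result for next time
        return url;
    }

    public static void main(String[] args) {
        ImageUrlCache cache =
            new ImageUrlCache(item -> "http://example.com/" + item + ".jpg");
        System.out.println(cache.imageUrlFor("iphone"));
        System.out.println(cache.imageUrlFor("iphone")); // served from the map
    }
}
```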
CHAPTER 4
APPLICATION WORKFLOW
Home Screen:
The picture above is the home screen of the application. From the search box we can
start a search for an item. From this screen the user can also enter a review directly
into the database, without going to Twitter to write a review about an item.
Thereafter, whenever we search for that item, our review is displayed in the sidebar.
The above image is the final page, showing the results for the searched item along with
a picture of it. In the sidebar we display all the tweets for the searched item,
marking each as positive or negative in green or red respectively. Clicking the “Home”
button loads the Home screen.
DATAFLOW DIAGRAM
The above diagram shows the flow of the application. The client (user
interface), the front end, sends a search request for an item to the application. The
application, the middle layer, invokes the REST API to get the tweets; Apache OpenNLP
tags the words, and SentiWordNet is used to generate a score for them.
The middle layer then communicates with MongoDB, the backend, and saves the tweets
in it. Lastly, the middle layer fetches all the tweets from the database and gives them
to the front end for display.
CHAPTER 5
This application helps the user get opinions about almost anything with a
single click. Users don’t need to browse many different websites to gather other users’
opinions. The application summarizes the opinions in a graphical presentation, so at a
glance a user can get an overview of the searched item.
Currently we do not consider the context of a word with respect to the sentence
or the document, so scores are not very accurate. People often write in shorthand
on social networking sites, which causes problems in identifying the
exact word. Negation used along with an adjective is not handled properly. All of these
can be improved as future enhancements.
REFERENCES