
PROJECT REPORT

ON
SENTIMENT DATA ANALYSIS ON TWITTER
Submitted in partial fulfillment of the requirements
For the award of the degree of

Bachelor of Computer Application

Submitted To:
Dr. Pawan Whig

Submitted By:
Shivam Arora
Roll No. -07329802014

(BCA Batch 2014-2017)

Vivekananda Institute of Professional Studies, India


(Affiliated to Guru Gobind Singh Indraprastha University,
Dwarka, New Delhi)

CERTIFICATE
This is to certify that Shivam Arora of BCA 5th Semester from Vivekananda
Institute of Professional Studies, Delhi has presented this project work entitled
"Sentiment Data Analysis on Twitter" in partial fulfillment of the requirement
for the award of the degree of Bachelor of Computer Applications under my
supervision and guidance.

Dr. Pawan Whig

ACKNOWLEDGEMENT
It is my proud privilege to express my profound gratitude to the entire
management of Vivekananda Institute of Professional Studies and the teachers of the
institute for providing me with the opportunity to avail of the excellent facilities and
infrastructure. The knowledge and values inculcated have proved to be of
immense help at the very start of my career. I am grateful to Dr. Supriya Madan
(Dean, VSIT) and Dr. Pawan Whig for their astute guidance, constant
encouragement and sincere support for this project work.
Sincere thanks to all my family members, seniors and friends for their support and
assistance throughout the project.

Name of Student: Shivam Arora


Enrolment No- 07329802014

TABLE OF CONTENTS
1. INTRODUCTION
   1.1 Introduction
   1.2 Project Objectives
   1.3 Sentiment Data
   1.4 Future Scope
   1.5 Literature Study
   1.6 Methodology

2. DESIGNING TECHNIQUES
   2.1 Data Collection
   2.2 Data Flow Diagram
   2.3 Use Case Diagram

3. CODING & IMPLEMENTATION
   3.1 Coding
   3.2 Implementation
   3.3 Reports

4. CONCLUSION & FUTURE SCOPE
   4.1 Conclusion
   4.2 Future Scope

CHAPTER 1
1.1 Introduction
This report describes how to extract raw Twitter data, store it in Hadoop, assign a positive,
negative or neutral sentiment to every tweet, and analyze and visualize this
sentiment data.

1.2 Project Objectives


In this Hadoop project, you are going to perform the following activities:

- Connect to a live social media (Twitter) data stream, then extract and store this data in Hadoop
- Process the data in Hadoop: restructure it, clean it and derive useful insights from it
- Create tables in Hadoop and provide an interface so that end users can run simple queries
- Perform sentiment analysis by aggregating the sentiments that people express about a specific item or subject
- Provide visualization of the sentiment analytics

You will need to work with Hadoop tools such as HDFS, MapReduce, Hive, Pig, Flume,
ODBC connectors and others.
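For instance, once the Hive tables described in Chapter 3 are in place, an end user could run a simple query like the sketch below (it assumes the tweetsbi table and its country and sentiment columns defined later in this report; the country value 'India' is only illustrative):

-- Illustrative sketch: count tweets per sentiment for one country
SELECT sentiment, COUNT(*) AS tweet_count
FROM tweetsbi
WHERE country = 'India'
GROUP BY sentiment;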

1.3 Sentiment Data


Sentiment data is unstructured data that represents opinions, emotions and attitudes
contained in sources such as social media posts, blogs, online product reviews, and
customer support interactions.

1.4 Future Scope

Organizations use sentiment analysis to understand how the public feels about something
at a specific moment in time, and also to track how those opinions change over time.
An enterprise may analyze sentiment about:
A. A product: For example, does the target segment understand and appreciate the messaging
around a product launch? What products do customers tend to buy together, and
what are they most likely to buy in the future?
B. A service: For example, a hotel or restaurant chain can look into which of its locations have
especially strong or poor service.
C. Competitors: In what areas do people perceive our company as stronger (or
weaker) than our competition?
D. Reputation: What does the public really think about our company? Is our
reputation positive or negative?

1.5 Literature Study

Sentiment analysis is an active research area with many applications in business. Earlier
research was carried out for sentiment analysis in various domains such as company
products, movie reviews, politics, etc. Some examples:
1) Earthquake shakes Twitter users: Real-time event detection by social sensors: T. Sakaki et
al. developed an event notification system which monitors tweets and delivers
notifications considering the time constraint. They detect real-time events on Twitter,
such as earthquakes, and have proposed an algorithm that monitors tweets to detect a
target event. Each Twitter user is considered a sensor. Kalman filtering and particle
filtering are used to estimate the location. Data set: for classification of tweets, they
prepared 597 positive examples which report earthquake occurrence as a training set.
Advantages: 1. Main task of earthquake detection is done using the system. Users are
registered with it and email messages are sent to them. 2. The two filtering techniques
detect and provide estimation for location. Disadvantages: 1. Multiple events cannot be
detected at a time. 2. It cannot provide advanced algorithms to expand queries. 3.
Limited to only one target event detection at a single time event. 4. It uses SVM as a
classifier into positive and negative sentiments which is not applicable to small data
sets.
2) Interpreting the Public Sentiment Variations on Twitter: Twitter sentiment analysis is an
important research area for academic as well as business fields, supporting decision making:
for example, a seller can decide whether a product should be produced in large quantity based on
buyers' feedback, and a student can decide whether a piece of study material should be referred to or
not. In this work, Shulong Tan et al. [5] have proposed two LDA-based models to
interpret the sentiment variations on Twitter, i.e. FB-LDA to distill out the foreground topics

and RCB-LDA to find out the reasons why public sentiment has changed towards the
target. Dataset: they have considered a Twitter dataset for sentiment classification. It
is obtained from the Stanford Network Analysis Platform. It consists of tweets from June 11,
2009 to December 31, 2009, with 476 million tweets in total, but the evaluation of results is
done on the dataset from June 13, 2009 to October 31, 2009. Advantages: 1. Distilled
out the foreground topics effectively and removed the noisy data accurately. 2. Found
the exact reasons behind sentiment variations in Twitter data using the RCB-LDA model,
which is very useful for decision making. Disadvantages: uses the sentiment analysis
tools TwitterSentiment and SentiStrength, whose accuracy is lower than that of other
sentiment analysis techniques.

1.6 Methodology
In this approach, an enterprise has a computer to store and process big data. The data is
warehoused in an RDBMS such as Oracle Database, MS SQL Server or DB2, and
sophisticated software can be written to interact with the database, process the
required data and present it to the users for analysis.
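For illustration only (the table and columns here are hypothetical and not part of this project), the kind of query such software might issue against a single centralized database could look like:

-- Hypothetical example: summarize low-rated reviews per product on one RDBMS server
SELECT product_id, COUNT(*) AS low_rating_reviews
FROM customer_reviews
WHERE rating <= 2
GROUP BY product_id;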

Limitation
This approach works fine where the volume of data is small enough to be accommodated by
standard database servers, or up to the limit of the processor that is handling the data.
But when it comes to dealing with massive amounts of data, it becomes a tedious task to
process such data through a single traditional database server.

Google's Solution
Google solved this problem using an algorithm called MapReduce. This algorithm
splits the task into small parts, assigns those parts to many computers
connected over the network, and collects the partial results to form the final result dataset.
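As a rough sketch of the same idea using the tools in this project, consider the Hive query below: Hive compiles such an aggregation into a MapReduce job, where mappers emit (word, 1) pairs from their split of the input and reducers sum the counts per word. The raw_lines table and its line column are hypothetical, shown only to illustrate the split-and-gather pattern.

-- Illustrative word count over a hypothetical table of raw text lines;
-- Hive runs this as a MapReduce job (map: tokenize and emit pairs, reduce: sum per word)
SELECT word, COUNT(*) AS occurrences
FROM raw_lines
LATERAL VIEW explode(split(line, ' ')) w AS word
GROUP BY word;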

The figure above shows several commodity hardware machines, which could be single-CPU
machines or servers with higher capacity.

CHAPTER 2
Designing Techniques
2.1 Data Flow Diagrams

1. For Hadoop

2. For the Sentiment Analysis Process

3. Proposed System of Project

4. To show the whole process of Sentiment Data Analysis on Twitter

2.2 Use Case Diagrams

Fig 1 : Topic and Sentiment Diagram

Fig 2: Topic and Influential user Diagram

Fig 3: Sentiment and Influential User Diagram

Fig 4: Influential user according to time

Fig 5: Influential User Diagram

CHAPTER 3
3.1 Coding
Create the Mytweets_raw table containing the records as received from Twitter:
CREATE TABLE Mytweets_raw (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING )
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
-- LOCATION '/data/tweets_raw';
Create the sentiment dictionary table:
CREATE EXTERNAL TABLE dictionary (
type string,
length int,
word string,
pos string,
stemmed string,
polarity string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/dictionary';

Load the data into the dictionary table:

LOAD DATA INPATH 'data/dictionary/dictionary.tsv' INTO TABLE dictionary;
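Optionally, as a quick sanity check (a hedged addition of ours, not part of the original script), the loaded dictionary can be inspected with:

-- Count dictionary entries per polarity class
SELECT polarity, COUNT(*) AS word_count
FROM dictionary
GROUP BY polarity;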
CREATE TABLE time_zone_map (
time_zone string,
country string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/time_zone_map';

Load the data into the time_zone_map table:

LOAD DATA INPATH 'data/time_zone_map/time_zone_map.tsv' INTO TABLE time_zone_map;

Clean up tweets
CREATE VIEW tweets_simple AS
SELECT
  id,
  cast(from_unixtime(unix_timestamp(concat('2014 ', substring(created_at, 5, 15)), 'yyyy MMM dd hh:mm:ss')) AS timestamp) ts,
  text,
  user.time_zone
FROM Mytweets_raw;

CREATE VIEW tweets_clean AS
SELECT id, ts, text, m.country
FROM tweets_simple t LEFT OUTER JOIN time_zone_map m ON t.time_zone = m.time_zone;
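As an optional, hedged sanity check (not part of the original script), the cleaned view can be sampled with:

SELECT id, ts, country, text
FROM tweets_clean
LIMIT 10;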

Compute sentiment
-- Tokenize: sentences() splits the lowercased text into arrays of words (l1),
-- then each array is exploded into individual words (l2)
create view l1 as select id, words from Mytweets_raw lateral view explode(sentences(lower(text))) dummy as words;
create view l2 as select id, word from l1 lateral view explode(words) dummy as word;
-- Look up each word in the dictionary and map its polarity to +1 / -1 / 0
create view l3 as select id, l2.word,
  case d.polarity
    when 'negative' then -1
    when 'positive' then 1
    else 0 end as polarity
from l2 left outer join dictionary d on l2.word = d.word;
-- Sum the word polarities per tweet and label the tweet accordingly
create table tweets_sentiment as select id,
  case
    when sum(polarity) > 0 then 'positive'
    when sum(polarity) < 0 then 'negative'
    else 'neutral' end as sentiment
from l3 group by id;
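A quick optional check (our own hedged addition) of the overall sentiment distribution:

SELECT sentiment, COUNT(*) AS tweets
FROM tweets_sentiment
GROUP BY sentiment;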

Put everything back together and attach the sentiment label to each cleaned tweet:

CREATE TABLE tweetsbi AS
SELECT t.*, s.sentiment
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s ON t.id = s.id;

Aggregate the data into tweet counts per country and sentiment:

CREATE TABLE tweetsbiaggr AS
SELECT country, sentiment, count(sentiment) AS tweet_count
FROM tweetsbi
GROUP BY country, sentiment;

Pivot the data for analysis: one view per sentiment class, joined into a single comparison table:

CREATE VIEW A AS SELECT country, tweet_count AS positive_response FROM tweetsbiaggr WHERE sentiment = 'positive';
CREATE VIEW B AS SELECT country, tweet_count AS negative_response FROM tweetsbiaggr WHERE sentiment = 'negative';
CREATE VIEW C AS SELECT country, tweet_count AS neutral_response FROM tweetsbiaggr WHERE sentiment = 'neutral';
CREATE TABLE tweetcompare AS
SELECT A.*, B.negative_response AS negative_response, C.neutral_response AS neutral_response
FROM A JOIN B ON A.country = B.country JOIN C ON B.country = C.country;
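As an optional hedged check (our own addition), the countries with the most positive responses can be listed with:

SELECT country, positive_response, negative_response, neutral_response
FROM tweetcompare
ORDER BY positive_response DESC
LIMIT 10;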
To export the data for use in Excel:

INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/Desktop/hive-sampleout'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM tweetcompare;

3.2 Implementation
Step 1: Install VMware Workstation
The installation of VMware Workstation consists of the following steps as shown:

Step 2: Install Cloudera on VMware.

Step 3: Creating a Twitter Application

If you already have a Twitter account, log in there; otherwise create a new Twitter
account on Twitter.com and log in.
Next, browse to https://dev.twitter.com/apps/ and click the Create New App button.

Next, fill in the basic app info form:

The application Name must be globally unique across all Twitter apps for all
users, so pick something unique. After filling in the info, agree to the terms of use and
press the Create App button at the bottom of the form.
You'll be redirected to the management page for your new app.
Switch to the API Keys tab and click the create my access token button.

OK, you're done! There are four pieces of information you need to copy from the form
above; these will be used when we set up the Flume agent:

API key

API secret

Access token

Access token secret

Step4: Install and configure Flume


Flume is a distributed, reliable, and available service for efficiently gathering,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows.

A) Install Flume
Flume is easy to install with HDP. Just run the following yum command:

yum install flume

After this command is complete, Flume is installed and ready to be used.
Copy the .jar file flume-sources-1.0-SNAPSHOT.jar provided by us in the Source Jar
folder to the /usr/lib/flume/lib folder on the node where you installed the Flume software,
and add it to the Flume classpath as shown below in the /etc/flume/conf/flume-env.ps1 file:

FLUME_CLASSPATH="/usr/lib/flume/lib/flume-sources-1.0-SNAPSHOT.jar"
Now that the agent code is in place, we need to configure flume to create an agent using
the class in this .jar. We do this by updating the /etc/flume/conf/flume.conf file.
Make the following changes. Note that the configuration file uses the terms consumerKey
and consumerSecret; Twitter now calls these the API key and API secret, respectively. Simply substitute
the keys you copied from the Twitter app configuration screen earlier.
The TwitterAgent.sources.Twitter.keywords property contains a comma-separated list of words
used to select which tweets you want to add to HDFS.
This is the place where you can add your favorite keywords. Tweets with these keywords
will be extracted from Twitter.com.
The TwitterAgent.sinks.HDFS.hdfs.path property specifies the path under the NameNode where the
tweets should be saved.
Be sure that the user running the Flume agent can write to this HDFS file location.
The configuration of the flume.conf file is as below:
flume.conf

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = x-men,Interstellar
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/root/data/tweets_raw
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Note that as a sample we have provided a flume.conf file in the Configuration file folder; you
can use this file and make the changes as shown above.

The consumerKey, consumerSecret, accessToken and accessTokenSecret have to be
replaced with those obtained from https://dev.twitter.com/apps.
Also, TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the location
in HDFS where the tweets will go.

The TwitterAgent.sources.Twitter.keywords value can be modified to get the tweets for any topic.

Step 5: Run the Twitter Extraction Program using Flume

To make things simpler, we have prepared a Data folder with all the reference data files we
need.
With the help of Cloudera, drag and drop the Data folder to the root folder of your
CentOS. Now type the commands below to create the folders in HDFS and copy the reference files into them:

hadoop fs -mkdir data
hadoop fs -mkdir data/dictionary
hadoop fs -mkdir data/time_zone_map
hadoop fs -mkdir data/tweets_raw
hadoop fs -put data/dictionary/dictionary.tsv data/dictionary
hadoop fs -put data/time_zone_map/time_zone_map.tsv data/time_zone_map

The dictionary file contains the list of all positive, negative and neutral sentiment words.

The time_zone_map file maps time zones to countries; this will help us identify the
country a tweet came from.

All tweets which we are going to extract will be stored in the tweets_raw folder. View the
structure with:

hadoop fs -ls -R data

and check that the folders have been created accordingly:

drwxr-xr-x  - root root  0 2015-01-02 02:12 data/dictionary/dictionary.tsv
drwxr-xr-x  - root root  0 2015-01-02 02:37 data/time_zone_map/time_zone_map
drwxr-xr-x  - root root  0 2015-01-02 01:20 data/tweets_raw

We can start the Flume agent with the command below:

flume-ng agent --conf ./conf/ -f /usr/lib/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

This will start loading Twitter data into HDFS on the Cloudera VM.

Now you can view the files in the target folder by using the ls and cat commands. The output
should look like this:

Step 6: Run the Hive Script to Create Tables and Views on the Extracted Data Files
We will now copy the tweets.sql file to the Cloudera VM.
This file has the definitions of all tables and views which we will create in Hive. Drag and
drop tweets.sql from the DDL folder to the root folder of Cloudera.

If you look at the tweets.sql script, the first table Mytweets_raw is created on the same
folder where your Twitter data is being stored; so if you have changed the path of the folder,
or you have Twitter data in a different folder, please edit the path in tweets.sql as well.

Step 7: After fetching and loading all the sentiment data and creating all the tables and
views in Hive, we have to download the main table file in which all the data is stored. To
download that file we use the query shown below:

INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/Desktop/hivesample-out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM tweetcompare;

The highlighted file is the data file in which all the data fetched from Twitter is stored.

Step 8: Access the Refined Sentiment Data with Excel

In this section, we will use Excel 2010 to access the refined sentiment data.

In Windows, open a new Excel workbook, then select File and open the downloaded file
from the recent workbooks list in Excel.

Select the downloaded file; a dialog box will open. Click Yes in the dialog box.

Now choose the file type and click Next.

Now choose the delimiter which you used in Cloudera and click the Next button.

Now choose the column data format and click the Finish button.

Step 9: Visualize the Sentiment Data Using Excel

Now that we have successfully imported the Twitter sentiment data into Microsoft Excel, we
can use Excel column or pie charts to analyze and visualize the data.
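If a per-country breakdown is wanted before charting, an optional hedged query (our own addition, run in Hive before the export step) could compute the positive share for each country:

SELECT country,
       positive_response,
       negative_response,
       neutral_response,
       positive_response / (positive_response + negative_response + neutral_response) AS positive_share
FROM tweetcompare;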

CHAPTER 4

4.1 Conclusion
We are committed to making more Big Data, Hadoop, Analytics and Business Intelligence
projects open source and free, and to making them available to the larger learner community. The
availability of big data, low-cost commodity hardware, and new information management and
analytic software has produced a unique moment in the history of data analysis. Using Big Data
and Hadoop, we can fetch data from online social networking sites, and it is easy to run and
learn queries over big data and to pick up Hadoop concepts, so this technology is easy to work
with and learn. In future, Big Data and Hadoop offer great scope for everyone who wants to
learn them.

4.2 Future Scope

The applicability of sentiment analysis to future business and marketing, using keywords
and analyzing the public sentiment around those keywords, is only going to increase as the
popularity of Twitter grows over the next few years. However, in terms of long-term
development or research, the ability of the Twitter API to pull older data should be
improved, as should other social media APIs, so that sentiment analysis can be performed
over a period of time. This is especially relevant in the social sciences, where researchers could
enquire into social and political shifts of opinion on social media sites. Equally, the lack of change in
opinion over time on some issues might be worth pursuing as a topic of research for Twitter
sentiment analysis. Such a sentiment analyzer would allow for an interesting
analysis of social and political issues.
