ON
SENTIMENT DATA ANALYSIS ON TWITTER
Submitted in partial fulfillment of the requirements
for the award of the degree of Bachelor of Computer Applications
Submitted To:
Dr. Pawan Whig
Submitted By:
Shivam Arora
Roll No. -07329802014
CERTIFICATE
This is to certify that Shivam Arora of BCA 5th Semester from Vivekananda
Institute of Professional Studies, Delhi has presented this project work entitled
Sentiment Data Analysis on Twitter in partial fulfillment of the requirement
for the award of the degree of Bachelor of Computer Applications under my
supervision and guidance.
Dr. Pawan Whig
ACKNOWLEDGEMENT
It is my proud privilege to express my profound gratitude to the entire
management of Vivekananda Institute of Professional Studies and the teachers of
the institute for providing me with the opportunity to avail the excellent
facilities and infrastructure. The knowledge and values inculcated have proved
to be of immense help at the very start of my career. I am grateful to
Dr. Supriya Madan (Dean, VSIT) and Dr. Pawan Whig for their astute guidance,
constant encouragement and sincere support for this project work.
Sincere thanks to all my family members, seniors and friends for their support
and assistance throughout the project.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Introduction
1.2 Project Objectives
1.3 Sentiment Data
1.4 Future Scope
1.5 Literature Study
1.6 Methodology
2. DESIGNING TECHNIQUES
2.1 Data Collection
2.2 Data Flow Diagram
2.3 Use Case Diagram
CHAPTER 1
1.1 Introduction
This project describes how to extract raw Twitter data, store it in Hadoop,
assign a positive, negative or neutral sentiment to every tweet, and then
analyze and visualize this sentiment data. The main tasks are to:
Connect to a live social media (Twitter) data stream, then extract and store this data in Hadoop
Process the data in Hadoop: restructure it, clean it and derive useful insights from it
Create tables in Hadoop and provide an interface so that end users can run simple queries
This requires working with Hadoop tools such as HDFS, MapReduce, Hive, Pig,
Flume and ODBC connectors.
1.5 Literature Study
One study applied LDA-based models such as RCB-LDA to find out the reasons why
public sentiment toward a target has changed. Dataset: the authors considered a
Twitter dataset for sentiment classification, obtained from the Stanford
Network Analysis Platform. It consists of 476 million tweets from June 11, 2009
to December 31, 2009, although the evaluation of results was done on the subset
from June 13, 2009 to October 31, 2009. Advantages: (1) it distilled out the
foreground topics effectively and removed the noisy data accurately; (2) it
found the exact reasons behind sentiment variations in Twitter data using the
RCB-LDA model, which is very useful for decision making. Disadvantage: it uses
the sentiment analysis tools TwitterSentiment and SentiStrength, whose accuracy
is lower than that of other sentiment analysis techniques.
1.6 Methodology
In the traditional approach, an enterprise has a single computer to store and
process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL
Server or DB2, and sophisticated software can be written to interact with the
database, process the required data and present it to users for analysis.
Limitation
This approach works fine where the volume of data is small enough to be
accommodated by standard database servers, or up to the limit of the processor
handling the data. But when it comes to dealing with massive amounts of data,
it is a tedious task to process such data through a traditional database
server.
Google's Solution
Google solved this problem using an algorithm called MapReduce. This algorithm
divides the task into small parts, assigns those parts to many computers
connected over the network, and collects the results to form the final result
dataset.
The figure above shows several commodity machines, which could be single-CPU
machines or servers with higher capacity.
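The split-and-gather idea behind MapReduce can be sketched in plain Python. This is a toy single-process illustration of word counting, not a real distributed job; the function names map_phase and reduce_phase are our own.

```python
from collections import defaultdict

def map_phase(fragment):
    # Map: emit a (word, 1) pair for every word in this fragment of text.
    return [(word, 1) for word in fragment.split()]

def reduce_phase(pairs):
    # Reduce: gather the pairs from all mappers and sum the counts per word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# The task is split into small fragments, each handled independently
# (on a real cluster, by different machines on the network).
fragments = ["big data big", "data analysis"]
all_pairs = []
for fragment in fragments:
    all_pairs.extend(map_phase(fragment))

result = reduce_phase(all_pairs)
print(result)  # {'big': 2, 'data': 2, 'analysis': 1}
```

On a real cluster, the map calls run in parallel on different nodes and the framework shuffles the pairs to the reducers; the logic per word, however, is the same as in this sketch.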
CHAPTER 2
Designing Techniques
2.1 DATA FLOW DIAGRAM
1. For Hadoop
CHAPTER 3
3.1 Coding
Create the Mytweets_raw table containing the records as received from
Twitter:
CREATE TABLE Mytweets_raw (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING )
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe';
-- LOCATION '/data/tweets_raw';
Create the sentiment dictionary table:
CREATE EXTERNAL TABLE dictionary (
type string,
length int,
word string,
pos string,
stemmed string,
polarity string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/dictionary';
Clean up the tweets:
CREATE VIEW tweets_simple AS
SELECT id,
cast(from_unixtime(unix_timestamp(concat('2014 ', substring(created_at,5,15)),
'yyyy MMM dd hh:mm:ss')) as timestamp) ts, text, user.time_zone
FROM Mytweets_raw ;
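The string manipulation in this view can be illustrated in Python. Note that Hive's substring(created_at,5,15) is 1-indexed and takes the 15 characters starting at the month name; the created_at value below is a made-up sample in Twitter's format, and hard-coding '2014 ' assumes all tweets are from 2014, exactly as the view does.

```python
from datetime import datetime

# Made-up sample created_at value in Twitter's format.
created_at = "Mon Jun 11 07:12:34 +0000 2014"

# Hive: substring(created_at, 5, 15)  ->  "Jun 11 07:12:34"
month_day_time = created_at[4:19]

# Hive: cast(from_unixtime(unix_timestamp(concat('2014 ', ...),
#       'yyyy MMM dd hh:mm:ss')) as timestamp)
ts = datetime.strptime("2014 " + month_day_time, "%Y %b %d %H:%M:%S")
print(ts)  # 2014-06-11 07:12:34
```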
Compute the sentiment:
create view l1 as select id, words from Mytweets_raw lateral view
explode(sentences(lower(text))) dummy as words;
create view l2 as select id, word from l1 lateral view explode( words ) dummy as word ;
create view l3 as select id, l2.word, case d.polarity
when 'negative' then -1
when 'positive' then 1
else 0 end as polarity
from l2 left outer join dictionary d on l2.word = d.word;
create table tweets_sentiment as select id,
case
when sum( polarity ) > 0 then 'positive'
when sum( polarity ) < 0 then 'negative'
else 'neutral' end as sentiment
from l3 group by id;
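The logic of these views (lowercase and tokenize each tweet, look each word up in the polarity dictionary, map it to +1/-1/0, and sum the scores per tweet) can be sketched in Python. The dictionary below is a tiny made-up sample, not the real dictionary.tsv.

```python
# Tiny made-up polarity dictionary; the real one is loaded from dictionary.tsv.
DICTIONARY = {"good": "positive", "great": "positive", "bad": "negative"}

def tweet_sentiment(text):
    # Mirrors views l1-l3 and the tweets_sentiment table: each word
    # contributes +1, -1 or 0, and the sign of the sum labels the tweet.
    score = 0
    for word in text.lower().split():
        polarity = DICTIONARY.get(word)
        if polarity == "positive":
            score += 1
        elif polarity == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tweet_sentiment("great story good acting"))  # positive
```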
Finally, join the cleaned tweets with their sentiment into the tweetcompare
table (queried later in Step 7):
CREATE TABLE tweetcompare AS SELECT t.*, s.sentiment
FROM tweets_clean t LEFT OUTER JOIN tweets_sentiment s on t.id = s.id;
3.2 Implementation
Step 1: Install VMware Workstation
The installation of VMware Workstation consists of the following steps, as shown:
OK, you're done! There are four pieces of information you need to copy from the
form above; these will be used when we set up the Flume agent:
API key
API secret
Access token
Access token secret
A) Install Flume
Flume is easy to install with HDP. Just run the following yum command:
yum install flume
After this command is complete, Flume is installed and ready to be used.
Copy the .jar file flume-sources-1.0-SNAPSHOT.jar, provided by us in the Source
Jar folder, to the /usr/lib/flume/lib folder on the node where you installed
the Flume software, and add it to the Flume classpath as shown below in the
/etc/flume/conf/flume-env.ps1 file:
FLUME_CLASSPATH="/usr/lib/flume/lib/flume-sources-1.0-SNAPSHOT.jar"
Now that the agent code is in place, we need to configure flume to create an agent using
the class in this .jar. We do this by updating the /etc/flume/conf/flume.conf file.
Make the following changes. Note that the configuration file uses the terms
consumerKey and consumerSecret; Twitter now calls these the API Key and API
Secret, respectively. Simply substitute the keys you copied from the Twitter
app configuration screen earlier.
The TwitterAgent.sources.Twitter.keywords property contains a comma-separated
list of words used to select which tweets you want to add to HDFS. This is
where you can add your favorite keywords; tweets containing these keywords
will be extracted from Twitter.com.
The TwitterAgent.sinks.HDFS.hdfs.path property specifies the path on the
NameNode where the tweets should be saved. Be sure that the user running the
Flume agent can write to this HDFS location.
Configuration of flume.conf file is as below:
flume.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = x-men,Interstellar
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/root/data/tweets_raw
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Note that we have provided a sample flume.conf file in the Configuration file
folder; you can use this file and make the changes shown above.
The consumerKey, consumerSecret, accessToken and accessTokenSecret values have
to be replaced with those obtained from https://dev.twitter.com/apps.
Also, TwitterAgent.sinks.HDFS.hdfs.path should point to the NameNode and the
location in HDFS where the tweets will go.
The dictionary file has the full list of positive, negative and neutral
sentiment words. The time_zone_map file maps time zones to countries, which
will help us identify the country from a tweet.
All tweets we extract will be stored in the tweets_raw folder. View the
structure with:
# hadoop fs -ls -R data
The result should show that the folders have been created accordingly:
drwxr-xr-x - root root 0 2015-01-02 02:12 data/dictionary/dictionary.tsv
drwxr-xr-x - root root 0 2015-01-02 02:37 data/time_zone_map/time_zone_map
drwxr-xr-x - root root 0 2015-01-02 01:20 data/tweets_raw
Step 6: Run the Hive Script and Create Tables and Views on the Extracted Data Files
We will now copy the tweets.sql file to the Cloudera VM. This file has the
definitions of all the tables and views we will create in Hive. Drag and drop
tweets.sql from the DDL folder to the root folder of the Cloudera VM.
If you look at the tweets.sql script, the first table, Mytweets_raw, is created
on the same folder where your Twitter data is stored; so if you have changed
the folder path, or you have Twitter data in a different folder, please edit
the path in tweets.sql as well.
Step 7: After fetching and loading all the sentiment data and creating all the
tables and views in Hive, we have to export the main table in which all the
data is stored. To export it we use the query shown below:
INSERT OVERWRITE LOCAL DIRECTORY '/home/cloudera/Desktop/hivesample-out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM tweetcompare;
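Before importing the exported file into Excel, it can also be checked quickly from Python. This sketch uses made-up sample rows standing in for the Hive output, and assumes the sentiment label is the last comma-delimited field, as in the tweetcompare table.

```python
import csv
from collections import Counter

def sentiment_counts(lines):
    # Tally the sentiment labels, assuming the label is the last column
    # of each comma-delimited row exported from Hive.
    counts = Counter()
    for row in csv.reader(lines):
        if row:
            counts[row[-1]] += 1
    return counts

# Made-up sample rows standing in for the exported Hive output.
sample = [
    "1001,2014-06-11 07:12:34,Great movie,positive",
    "1002,2014-06-11 07:13:02,Bad plot,negative",
    "1003,2014-06-11 07:14:55,Just watched it,neutral",
    "1004,2014-06-11 07:15:10,Loved it,positive",
]
print(sentiment_counts(sample))
```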
The highlighted file is the data file in which all the data fetched from
Twitter is stored.
In Windows, open a new Excel workbook, then select File and open the downloaded
file from the recent workbooks list in Excel. When a dialog box opens, click
Yes.
Now choose the delimiter you used in Cloudera (a comma) and click the Next
button. Then choose the column data format and click the Finish button.
Now that we have successfully imported the Twitter sentiment data into
Microsoft Excel, we can use an Excel column or pie chart to analyze and
visualize the data.
CHAPTER 4
4.1 Conclusion
We are committed to making more Big Data, Hadoop, Analytics and Business
Intelligence projects open source and free, and to making them available to the
larger learner community. The availability of big data, low-cost commodity
hardware, and new information management and analytic software has produced a
unique moment in the history of data analysis. Using Big Data and Hadoop, we
can fetch data from online social networking sites, and it is easy to run
queries and learn big data and Hadoop concepts. In the future, Big Data and
Hadoop hold great scope for anyone who wants to learn these technologies.