Twitter Analysis in Hadoop

Notice of Violation of IEEE Publication Principles
"A Speedy Data Uploading Approach for Twitter Trend and Sentiment Analysis Using
HADOOP"
by Gaurav Digambarrao Rajurkar and Rajeshwari M Goudar
in the Proceedings of the International Conference on Computing Communication Control and
Automation (ICCUBEA) pp 580 - 584
After careful and considered review of the content and authorship of this paper by a duly
constituted expert committee, this paper has been found to be in violation of IEEE's Publication
Principles.
This paper contains substantial duplication of original text from the paper cited below. The
original text was copied without attribution (including appropriate references to the original
author(s) and/or paper title) and without permission.
Due to the nature of this violation, reasonable effort should be made to remove all past
references to this paper, and future references should be made to the following article:
"How-to: Analyze Twitter Data with Apache Hadoop"
by Jon Natkins
in the Cloudera Engineering Blog, http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/, September 19, 2012
2015 International Conference on Computing Communication Control and Automation
A Speedy Data Uploading Approach for Twitter Trend and Sentiment Analysis Using HADOOP

Gaurav D Rajurkar
Computer Engineering Department
MIT AOE Alandi Pune
Alandi(D), Pune, India
rajurkargaurav39@gmail.com

Rajeshwari M Goudar
Computer Engineering Department
MIT AOE Alandi Pune
Alandi(D), Pune, India
rmgoudar66@gmail.com
Abstract
The analytics tools and models currently available in the market
are costly, unable to handle Big Data, and less secure. Traditional
analytics systems take a long time to produce results, so they are
not suitable for real-time analytics. The proposed work addresses
these problems by combining Apache open source modules on the
HADOOP platform and configuring them to produce the required
results. Using open source software also provides scalability and
reduces the cost of analytics. The system also provides a solution
for speedy data downloading onto HDFS by using a source and sink
(data ingestion) mechanism. Hadoop is a flexible and scalable
architecture. The proposed work is based on combining open source
software with commodity hardware, which will increase the profit of
the IT industry.
Keywords- Social media, Twitter Trend Analysis, Apache HADOOP,
Source and Sink Mechanism, Apache HBase, Hive, Pig, Apache Oozie,
Zookeeper.
I. INTRODUCTION
Social media comprises web-based and mobile-based internet
applications that allow the creation, access and exchange of
user-generated content that is ubiquitously accessible [3,10].
Besides social networking media like Twitter and Facebook, the term
social media also encompasses really simple syndication (RSS)
feeds, blogs, wikis and news, all typically yielding unstructured
text and accessible through the web. Social media is especially
important for research into computational social science, which
investigates questions using quantitative techniques (for example,
computational statistics, machine learning and complexity) and
so-called big data for data mining and simulation modeling [2].
Social media has led to numerous data services, tools and analytics
platforms. The tools available to researchers either give only
superficial access to the raw data or, for non-superficial access,
require researchers to program analytics in a language such as
Java. The proposed work improves on the available tools with
respect to cost, efficient handling of Big Data, and scalability.
Analysts and businesses need to gain new insights from social
media; they require analytics tools and expertise to transform this
big data, with its large volume and variety, into strategies from
which conclusions can be drawn. Social media analytics is a useful
tool for capturing customer sentiments that are distributed across
online sources [13]. The Apache Hadoop software library is a
framework [9] that allows for the distributed processing of large
data sets across clusters of computers using simple programming
models [16]. It provides a highly scalable and flexible
architecture for parallel processing. Rather than rely on hardware
to deliver high availability, the library itself is designed to
detect and handle failures at the application layer, so delivering
a highly available service on top of a cluster of computers, each
of which may be prone to failures. Social analytics collects and
analyzes consumer opinions, converts them into insights, and helps
businesses identify areas of customer satisfaction or customer
grievance about a product. It also provides quick feedback on
marketing campaigns, so that businesses can identify which
campaigns will be well received by consumers. Social analytics acts
as a new channel between consumers and industries [11]. It also
helps companies review the influence of their products in the
market. The proposed model meets these needs by analyzing the data
efficiently and delivering results. Many analytical tools and
dashboards are available in the market, such as SocialBro,
TweetStats, Twentyfeet, Twtrland and Topsy [14], but they are very
costly and inefficient.
978-1-4799-6892-3/15 $31.00 2015 IEEE
DOI 10.1109/ICCUBEA.2015.119
II. LIMITATIONS OF AVAILABLE SYSTEMS AND TOOLS FOR ANALYTICS
The limitations are as follows:
1) The available systems, such as TwitterMonitor [3] and the
Real-Time Twitter Trend Mining System [4], require extensive data
cleaning, data scraping and integration strategies, which
ultimately increase the overhead.
2) The available systems [3] are inefficient for real-time
analytics.
3) The available methods and systems involve time-consuming
processes. The proposed work eliminates the drawbacks mentioned
above.
III. METHODOLOGY FOR PROPOSED SYSTEM
Social media has acquired immense popularity and interest with
marketing teams. Twitter is an effective tool for any company to
learn how people are reacting to its products. Companies can use
Twitter to engage users and communicate directly with them, and in
turn, users can provide word-of-mouth marketing for companies by
discussing the products. With limited resources, and knowing that
one can't target the intended consumers directly, marketing
departments can be more efficient by being selective about the
consumers they reach out to. In this proposed work, Apache Pig,
Apache HDFS, Apache Oozie, and Apache Hive are used to design a
data pipeline that enables analysis of Twitter data. To find out
who is prominent in social media, one should understand the
mechanism of Twitter, which works on tweets and retweets. A retweet
is a repost of an update, similar to forwarding an email. Querying
Twitter data in a traditional RDBMS is inefficient. Several Twitter
APIs provide streaming of Twitter data.
Figure 1. Data Pipeline Model
A. Developing Twitter API
Develop a Twitter API application on the Twitter side. The Twitter
API communicates directly with the Source and Sink Mechanism via a
network-based application. Authentication keys and tokens are
established to enable communication with the Twitter server.
B. Establishing the Connection via Source and Sink Mechanism
After creating the Twitter API application, design the source and
sink mechanism that enables speedy data downloading from the
Twitter server to HDFS (Hadoop Distributed File System). The source
agent communicates with the Twitter API and channels the data. The
streaming data is in JSON form, i.e., an event form of data. The
data is queued and channeled via the channel mechanism. Finally,
the data is sunk into HDFS. Then the tweets are analyzed using PIG
or HIVE.
C. Analyzing the Data on HDFS, i.e. Tweets, Using PIG or Hive
The data now stored on the data nodes is analyzed using PIG or
Hive. For example, to perform Twitter Trend Analysis, we just fire
a COUNT query that counts the occurrences of a given keyword. We
have performed Log Analysis, Fraud Detection and Click Stream
Analysis using this system.
IV. ARCHITECTURE OF PROPOSED WORK
The proposed system is based on combining Open Source Software [17]
with hardware. The proposed high level architecture is shown in
figure 2. The data downloading speed is increased with the help of
the network-based application and the Source and Sink Mechanism. A
reliable connection is established by creating the application on
the Twitter side. Network delays and various latencies are reduced
for speedy data downloading. The source, channel and sink mechanism
is used to transfer the data to HDFS. The Twitter server is present
in the public domain [1] [2], so we can access the Twitter data by
establishing a connection with the Twitter server [4]. There are
many techniques available for acquiring
the data from Twitter. One can access the data via
application programming interface (API) provided by Twitter [3].
This API establishes the connection between the Twitter server and
the developer's application. One would otherwise have to write a
program in the Python language to fetch the data. In the proposed
work, the Source and Sink mechanism performs the same task very
efficiently. We have configured and implemented the required
algorithm for establishing a connection with the Twitter server and
continuously fetching the streaming data. The Source and Sink
mechanism not only fetches the data efficiently but also reduces
network latency and delay. It sources the data from the Twitter
server, channels (queues) the data, and finally sinks that data
into the Hadoop Distributed File System (HDFS).
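The source, channel and sink steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (source, sink, run_pipeline), the sample events, the batch size and the in-memory list standing in for HDFS are all hypothetical and are not taken from the proposed system, which the paper describes only as a data ingestion mechanism.

```python
import json
import queue
import threading

SENTINEL = None  # end-of-stream marker (an assumption of this sketch)


def source(events, channel):
    """Source: read raw JSON events and put them on the channel."""
    for raw in events:
        channel.put(raw)
    channel.put(SENTINEL)


def sink(channel, store, batch_size=2):
    """Sink: drain the channel and append parsed events to the store in batches."""
    batch = []
    while True:
        raw = channel.get()
        if raw is SENTINEL:
            break
        batch.append(json.loads(raw))
        if len(batch) >= batch_size:
            store.extend(batch)
            batch = []
    store.extend(batch)  # flush whatever is left in the last batch


def run_pipeline(events, capacity=100):
    """Wire one source and one sink together through a bounded queue."""
    channel = queue.Queue(maxsize=capacity)  # the channel buffers events
    store = []  # hypothetical stand-in for files written to HDFS
    producer = threading.Thread(target=source, args=(events, channel))
    consumer = threading.Thread(target=sink, args=(channel, store))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return store


# Hypothetical tweet events in JSON (event) form.
sample_events = [
    '{"id": 1, "text": "Loving the new phone"}',
    '{"id": 2, "text": "Battery life is bad"}',
    '{"id": 3, "text": "Great camera"}',
]
stored = run_pipeline(sample_events)
```

The bounded queue is the design point worth noting: it decouples the rate at which events arrive from the rate at which they are written, so a slow write does not stall the connection to the data source until the buffer is full.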
HDFS is the distributed storage layer of the Hadoop ecosystem.
Hadoop is a scalable and flexible architecture that supports
parallelism. Once the data has arrived on HDFS, it is processed
with the help of Apache Hive, which provides a query language
designed to handle complex and big data. The data can be stored in
the cloud or on a local system. We can also use Apache Pig, which
combines the Pig Latin language with a compiler that produces
sequences of MapReduce jobs for parallelism. Hive and Pig provide
an easy interface for handling the data. The proposed work also
includes Apache Oozie, which is a workflow scheduler for Hadoop
jobs and helps in the coordination process. The other component is
ZooKeeper, which helps in maintaining configuration information,
providing distributed synchronization and providing group services.
By configuring all the components, we can process the data. The
results are presented with the help of graphs, charts, etc. The
generated results and reports are analyzed to perform analytics and
take decisions accordingly.
Figure 2. Architecture of proposed work
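The section above notes that Pig compiles scripts into MapReduce sequences. As a rough illustration of what one such job computes, the following Python sketch runs the map, shuffle and reduce phases of a word count in a single process. It is not Pig or Hadoop code: the function names and the sample tweets are hypothetical, and a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(tweet_text):
    """Map: emit a (word, 1) pair for every word in one tweet."""
    return [(word.lower(), 1) for word in tweet_text.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical tweet texts already stored by the ingestion step.
tweets = [
    "Hadoop makes big data easy",
    "big data needs Hadoop",
    "Hive runs on Hadoop",
]
pairs = [pair for text in tweets for pair in map_phase(text)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

Because each tweet is mapped independently and each key is reduced independently, both phases can be split across machines, which is where the parallelism mentioned above comes from.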
V. APPLICATIONS OF PROPOSED WORK
A. Twitter Trend Analysis:
The proposed system calculates the trend on social media, which is
beneficial for marketing people. The trend is calculated by
counting the bursty keywords [7]. The COUNT algorithm is applied to
count all the bursty keywords that are considered useful in
determining the trend. A graph of the trend against time is then
plotted. This helps marketing people to take decisions.
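A minimal sketch of this counting step, in Python, is given below. The paper does not define when a keyword counts as bursty, so the rule used here (a time window whose count is at least twice the mean count over the windows in which the keyword appears) is purely an assumption for illustration, as are the function names, the one-hour window and the sample tweets.

```python
from collections import Counter


def keyword_counts_per_window(tweets, keyword, window_seconds=3600):
    """Count tweets mentioning `keyword` in each fixed-size time window."""
    counts = Counter()
    needle = keyword.lower()
    for timestamp, text in tweets:
        if needle in text.lower().split():
            counts[timestamp // window_seconds] += 1
    return dict(sorted(counts.items()))


def bursty_windows(counts, factor=2.0):
    """Return the windows whose count is at least `factor` times the mean.

    The mean is taken over windows that contain at least one mention;
    this threshold rule is an assumption, not the paper's definition.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return [window for window, n in counts.items() if n >= factor * mean]


# Hypothetical (timestamp in seconds, tweet text) pairs.
tweets = [
    (10, "hadoop is great"),
    (20, "learning hive today"),
    (3700, "hadoop cluster up"),
    (3800, "hadoop and pig"),
    (3900, "more hadoop news"),
    (4000, "hadoop again"),
    (7300, "pig latin scripts"),
    (7400, "hadoop jobs"),
]
counts = keyword_counts_per_window(tweets, "hadoop")
bursts = bursty_windows(counts)
```

The per-window counts are exactly the series one would plot against time to obtain the trend graph; the burst test is an optional extra on top of it.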
B. Sentiment Analysis:
The proposed system is used for estimating the sentiment of people
on social media about a specific product [5][8]. The sentiment is
calculated using the generalized formula:
Count for Sentiment = COUNT(positive words) - COUNT(negative words)
Let PN(Count) be the count function.
If PN(Count) >= 1, then the sentiment about the product is positive.
If PN(Count) <= -1, then the sentiment about the product is negative.
If PN(Count) = 0, then the sentiment about the product is neutral.
Subjectivity = (Count of positive words + Count of negative words) /
(Count of neutral words)
C. Log Analysis:
The proposed system performs log analysis, which can be used for
fraud detection and various other purposes. Log analysis reveals
many problems. The log data is first moved onto HDFS, and then HIVE
or PIG queries are fired to get the required results. Suppose we
want to determine the validity of a user (i.e., whether the user
really intends to buy a product on a website). We get the result by
analyzing the time the user spends on the site:
Time spent on website = Time_Out - Time_In
This is calculated from the log file. If the time spent by the user
is less than the threshold determined by that specific website,
then that user is considered naive to that site.
D. Click Stream Analysis:
Click stream analysis is also performed with the help of the
proposed system. An advertising website (e.g., Google) gets paid by
product hosting sites such as Amazon, Flipkart, etc. When a user
searches for a specific product on Google, the results shown come
from various product hosting sites. If the user clicks on one of
the displayed icons, the product hosting site has to pay a certain
amount to Google. If a destructive, naive or unintended user clicks
on the displayed icon of a product, it causes a loss to the product
hosting site. Click stream analysis helps the product hosting sites
by analyzing the validity of the user. We can block the user's IP
address (for advertising purposes) if the user is found invalid.
Similarly, we can show advertisements to the intended people by
analyzing their sentiment and their patterns of buying a product.
MATHEMATICAL MODEL AND AGGREGATION QUERIES
A probabilistic stream [15] is a data stream
$A = \{v_1, v_2, v_3, \ldots, v_m\}$ in which each data item encodes
a random variable that takes a value in $[n] \cup \{\bot\}$. In
particular, each $v_i$ consists of a set of at most $l$ tuples of
the form $(j, p_i(j))$ for some $j \in [n]$ and $p_i(j) \in [0, 1]$.
These tuples define the random variable $X_i$, where $X_i = j$ with
probability $p_i(j)$ and $X_i = \bot$ otherwise. We define
$p_i(\bot) = \Pr[X_i = \bot]$ and $p_i(j) = 0$ if not otherwise
determined. Here, we use $\bot$ to denote the null event. We
consider the $i$th element of the stream to be determined according
to the random variable $X_i$, and each element of the stream to be
determined independently. Hence,
$$\Pr[x_1, x_2, x_3, \ldots, x_m] = \prod_{i \in [m]} p_i(x_i)$$
AGGREGATION QUERIES:
$$\text{SUM} = E\Big[\sum_{i \in [m] : X_i \neq \bot} X_i\Big] \qquad \text{eqn (1)}$$
$$\text{COUNT} = E\big[\,|\{i \in [m] : X_i \neq \bot\}|\,\big] \qquad \text{eqn (2)}$$
$$\text{MEAN} = E\Big[\frac{\sum_{i \in [m] : X_i \neq \bot} X_i}{|\{i \in [m] : X_i \neq \bot\}|} \;\Big|\; |\{i \in [m] : X_i \neq \bot\}| \neq 0\Big] \qquad \text{eqn (3)}$$
$$\text{DISTINCT} = E\big[\,|\{j \in [n] : \exists\, i \in [m],\ X_i = j\}|\,\big] \qquad \text{eqn (4)}$$
The aggregation queries determine the performance of HIVE and PIG
processing. Eqn (2) is used for trend analysis as well as sentiment
analysis.
CONCLUSION
The proposed work describes the problems with the tools and systems
available in the market. It proposes an effective solution for
Twitter sentiment analysis that reduces time delay and cost by
using HADOOP. Various analytics systems are available in the
market, but they are very costly, less efficient and less secure.
The proposed system uses efficient Apache open source products and
presents a model for Twitter Trend Analysis using HADOOP in which
no extra work such as scraping, cleansing and data protection is
required. It also provides a speedy data downloading approach for
efficient Twitter Trend Analysis. The proposed work concludes that
combining Open Source Software with commodity hardware will
increase the profit of the IT industry.
VI. ACKNOWLEDGEMENT
I express a true sense of gratitude towards my project guide Prof.
Rajeshwari M Goudar, Asst. Prof. of the computer engineering
department, for her invaluable cooperation and guidance in helping
me with my project study. I would like to thank her once again for
inspiring me and providing me with all the lab facilities, which
made this work very convenient and easy. I would also like to
express my appreciation and thanks to Prof. Uma Nagaraj, Head of
the Computer Engineering Department, principal Dr. Y. J. Bhalerao,
and all my friends who knowingly and unknowingly have assisted me
throughout my work. Last but not the least, heartfelt thanks to my
family for being there.
REFERENCES
[1] Claudio Cioffi-Revilla, "Computational social science", WILEY Interdisciplinary Reviews: Computational Statistics, Vol. 2, no. 3, May/June 2010, pp. 259-271.
[2] Andreas M. Kaplan, Michael Haenlein, "Users of the world, unite! The challenges and opportunities of Social Media", Business Horizons (2010) 53, 59-68, ELSEVIER.
[3] Michael Mathioudakis, Nick Koudas, "TwitterMonitor: Trend Detection over the Twitter Stream", SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0032-2/10/06.
[4] Min Song, Meen Chul Kim, "RT2M: Real-time Twitter Trend Mining System", 978-0-7695-4998-9/13, 2013 IEEE International Conference on Social Intelligence and Technology.
[5] Saeideh Shahheidari, Hai Dong, Md Nor Ridzuan Bin Daud, "Twitter sentiment mining: A multi domain analysis", 978-0-7695-4992-7/13, 2013 IEEE Seventh International Conference on Complex, Intelligent, and Software Intensive Systems.
[6] Mutia N. Kurniati, Woo-Jong Ryu, Md. Hijbul Alam, SangKeun Lee, "Examining the Performance of Topic Modeling Techniques in Twitter Trends Extraction", 978-1-4799-3689-2/14, 2014 IEEE.
[7] Beiming Sun, Vincent TY Ng, "Analyzing Sentimental Influence of Posts on Social Networks", 978-1-4799-3776-9/14, 2014 IEEE.
[8] David Alfred Ostrowski, System Analytics Research and Innovation Center, Ford Motor Company, "Semantic Social Network Analysis for Trend Identification", 978-0-7695-4859-3/12, 2012 IEEE.
[9] Yang Lai, Shi ZhongZhi, "An Efficient Data Mining Framework on Hadoop using Java Persistence API", 978-0-7695-4108-2/10, 2010 IEEE.
[10] Kala Karun A, Chitharanjan K, Sree Chitra Thirunal College of Engineering, Thiruvananthapuram, "A Review on Hadoop HDFS Infrastructure Extensions", 978-1-4673-5758-6/13, 2013 IEEE.
[11] Bogdan Batrinca, Philip C. Treleaven, "Social media analytics: a survey of techniques, tools and platforms", AI & Soc, DOI 10.1007/s00146-014-0549-4. This article is published with open access at Springerlink.com.
[12] Daehoon Kim, Daeyong Kim, Eenjun Hwang, Seungmin Rho, "TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme", Springer-Verlag Berlin Heidelberg 2013, Multimedia Systems, DOI 10.1007/s00530-013-0342-0.
[13] Krushikanth R. Apala, Merin Jose, Supreme Motnam, C.-C. Chan, Kathy J. Liszka, and Federico de Gregorio, "Prediction of Movies Box Office Performance Using Social Media", 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.
[14] Pitiphat Santidhanyaroj, Talha Ahmad Khan, "A SENTIMENT ANALYSIS PROTOTYPE SYSTEM FOR SOCIAL NETWORK DATA", 978-1-4799-3010-9/14, 2014 IEEE.
[15] Asli Celikyilmaz, Dilek Hakkani, Junlan Feng, "PROBABILISTIC MODEL-BASED SENTIMENT ANALYSIS OF TWITTER MESSAGES", 978-1-4244-7903-0/10, 2010 IEEE.
[16] Javier Conejero, Jeffery Morgan, "Scaling Archived Social Media Data Analysis using a Hadoop Cloud", 2013 IEEE Sixth International Conference on Cloud Computing, 978-0-7695-5028-2/13, 2013 IEEE, DOI 10.1109/CLOUD.2013.120.
[17] www.cloudera.com
[18] www.hadoop.apache.org/