"A Speedy Data Uploading Approach for Twitter Trend and Sentiment Analysis Using
HADOOP"
by Gaurav Digambarrao Rajurkar and Rajeshwari M Goudar
in the Proceedings of the International Conference on Computing Communication Control and
Automation (ICCUBEA) pp 580 - 584
After careful and considered review of the content and authorship of this paper by a duly
constituted expert committee, this paper has been found to be in violation of IEEE's Publication
Principles.
This paper contains substantial duplication of original text from the paper cited below. The
original text was copied without attribution (including appropriate references to the original
author(s) and/or paper title) and without permission.
Due to the nature of this violation, reasonable effort should be made to remove all past
references to this paper, and future references should be made to the following article:
"How-to: Analyze Twitter Data with Apache Hadoop"
by Jon Natkins
in the Cloudera Engineering Blog http://blog.cloudera.com/blog/2012/09/analyzing-twitter-datawith-hadoop/ September 19, 2012
A speedy data uploading approach for Twitter Trend And Sentiment Analysis using
HADOOP
Gaurav D Rajurkar
Rajeshwari M Goudar
Abstract
I. INTRODUCTION
II.
Finally, the data is sunk into HDFS. The tweets are then
analyzed using Pig or Hive.
C. Analyzing the Data on HDFS (Tweets) Using Pig or Hive
The data now stored on the data nodes is analyzed using Pig or
Hive. For example, to perform Twitter trend analysis, we only
have to run a COUNT query that counts the occurrences of a
given keyword. We have also performed log analysis, fraud
detection, and click stream analysis using this system.
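As a rough illustration of what such a count computes, the Python sketch below counts whole-word occurrences of a keyword across a handful of tweets. The function name `count_keyword` and the sample tweets are hypothetical; in the system described here, the equivalent count would be expressed as a Hive or Pig query over the data on HDFS.

```python
import re


def count_keyword(tweets, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in all tweets."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in tweets)


# Hypothetical sample data standing in for tweets stored on HDFS.
sample_tweets = [
    "Hadoop makes big data easy",
    "Learning hadoop and Hive today",
    "Pig or Hive? Both run on Hadoop.",
]

hadoop_count = count_keyword(sample_tweets, "hadoop")
hive_count = count_keyword(sample_tweets, "hive")
```

Matching on word boundaries avoids counting a keyword that only appears inside a longer word.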
IV.
the data from Twitter. The data can be accessed through the
application programming interface (API) provided by
Twitter [3]. This API establishes the connection between the
Twitter server and the developer's application. Ordinarily, one
has to write a program in Python to fetch the data.
In the proposed work, a source and sink mechanism performs
the same task efficiently. We have configured and implemented
the algorithm required to establish a connection with the
Twitter server and continuously fetch the streaming data. The
source and sink mechanism not only fetches the data
efficiently but also reduces network latency and delay. It
sources the data from the Twitter server, channels (queues) it,
and finally sinks it into the Hadoop Distributed File System
(HDFS). HDFS is the storage component of the Hadoop
ecosystem. Hadoop is a scalable and flexible architecture that
supports parallelism.
Once the data has arrived on HDFS, it is processed with
Apache Hive, which provides an SQL-like query language
designed to handle large and complex data. The data can be
stored in the cloud or on a local system. We can also use
Apache Pig, which combines the Pig Latin language with a
compiler that produces sequences of MapReduce jobs for
parallel execution. Hive and Pig provide an easy interface for
handling the data. The proposed work also includes Apache
Oozie, a workflow scheduler for Hadoop jobs that helps
coordinate them. Another component is ZooKeeper, which
maintains configuration information and provides distributed
synchronization and group services. With all the components
configured, we can process the data. The results are
presented as graphs, charts, etc. The generated results and
reports are then analyzed to perform analytics and make
decisions accordingly.
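The source, channel, and sink stages can be pictured with a small in-memory sketch in Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the batch size, and the list used as a stand-in for HDFS are all hypothetical, and a real deployment would read from the Twitter streaming API and write files to HDFS.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def source(events, channel):
    """Source: push each incoming event onto the channel, then signal the end."""
    for event in events:
        channel.put(event)
    channel.put(_STOP)


def sink(channel, store, batch_size):
    """Sink: drain the channel and append events to the store in batches."""
    batch = []
    while True:
        event = channel.get()
        if event is _STOP:
            break
        batch.append(event)
        if len(batch) >= batch_size:
            store.append(list(batch))
            batch.clear()
    if batch:  # flush whatever is left when the stream ends
        store.append(list(batch))


def run_pipeline(events, batch_size=2, capacity=100):
    """Wire source -> channel (bounded queue) -> sink and return the stored batches."""
    channel = queue.Queue(maxsize=capacity)
    store = []  # hypothetical stand-in for files written to HDFS
    producer = threading.Thread(target=source, args=(events, channel))
    consumer = threading.Thread(target=sink, args=(channel, store, batch_size))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return store


batches = run_pipeline(["tweet-1", "tweet-2", "tweet-3"])
```

The bounded queue decouples the two ends: the source blocks only when the channel is full, and the sink writes in batches rather than once per event.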
V.
A. Twitter Trend Analysis
The proposed system calculates the trend on social media,
which is useful to marketing teams. The trend is
calculated by counting bursty keywords [7]. A COUNT
operation is applied to count all the bursty keywords that
are considered useful for determining the trend. A graph of
the trend against time is then plotted, which helps marketing
teams make decisions.
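The data behind such a trend-over-time graph can be sketched as a per-hour keyword count. In the Python example below, the function name `keyword_trend`, the hourly bucket size, and the timestamped sample tweets are assumptions made for illustration; the paper does not specify the time granularity or how bursty keywords are selected.

```python
from collections import Counter
from datetime import datetime


def keyword_trend(tweets, keyword):
    """Return {hour bucket: number of tweets mentioning keyword}, sorted by hour."""
    counts = Counter()
    needle = keyword.lower()
    for timestamp, text in tweets:
        if needle in text.lower().split():
            hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
            counts[hour] += 1
    return dict(sorted(counts.items()))


# Hypothetical (timestamp, text) pairs; real input would come from HDFS.
sample = [
    ("2015-02-26T10:05:00", "hadoop is trending"),
    ("2015-02-26T10:40:00", "more Hadoop news"),
    ("2015-02-26T11:15:00", "hadoop again"),
    ("2015-02-26T11:20:00", "nothing relevant here"),
]

trend = keyword_trend(sample, "hadoop")
```

Each key of the returned dictionary is a point on the time axis and each value is the height of the trend at that point.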
B. Sentiment Analysis
The proposed system is used to estimate the sentiment
that people express on social media about a
specific product [5][8]. The sentiment is
calculated using the generalized formula:
\[ \mathrm{PN} = \mathrm{COUNT}(\text{positive words}) - \mathrm{COUNT}(\text{negative words}) \]
where PN is the sentiment count.
If $\mathrm{PN} \geq 1$, the sentiment about the product is positive.
If $\mathrm{PN} \leq -1$, the sentiment about the product is negative.
If $\mathrm{PN} = 0$, the sentiment about the product is neutral.
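A minimal Python sketch of this count-based rule follows. The two word lists are tiny hypothetical lexicons, not the ones used in the paper, and the sketch treats any positive difference as positive sentiment and any negative difference as negative sentiment; how a difference of exactly 1 is handled is an assumption here.

```python
# Hypothetical lexicons for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def pn_count(text):
    """Return COUNT(positive words) - COUNT(negative words) for one text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    return positive - negative


def classify(text):
    """Map the PN count to a sentiment label."""
    score = pn_count(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label = classify("I love this great phone")
```

A text with equal numbers of positive and negative words, such as "good camera but poor battery", scores 0 and is labeled neutral.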
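AGGREGATION QUERIES:
\[ \mathrm{SUM} = \mathbb{E}\Big[\sum_{i \in [m]:\, X_i \neq \bot} X_i\Big] \quad (1) \]
\[ \mathrm{COUNT} = \mathbb{E}\big[\,|\{i \in [m] : X_i \neq \bot\}|\,\big] \quad (2) \]
\[ \mathrm{MEAN} = \mathbb{E}\Big[\frac{\sum_{i \in [m]:\, X_i \neq \bot} X_i}{|\{i \in [m] : X_i \neq \bot\}|} \;\Big|\; |\{i \in [m] : X_i \neq \bot\}| \neq 0\Big] \quad (3) \]
\[ \mathrm{DISTINCT} = \mathbb{E}\big[\,|\{j \in [n] : \exists\, i \in [m],\ X_i = j\}|\,\big] \quad (4) \]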
C. Log Analysis
The proposed system performs log analysis, which can be used
for fraud detection and various other purposes. Log
analysis reveals many problems. The log data is first moved
to HDFS, and Hive or Pig queries are then run to obtain the
required results.
Suppose we want to assess whether a user is genuine
(that is, really intends to buy a product on a website). We
obtain the result by analyzing the time the user spends on the site:
\[ \text{Time spent on website} = \text{Time\_Out} - \text{Time\_In} \]
This is calculated from the log file. If the time the user
spent is less than the threshold set by that specific
website, the user is considered naive to that site.
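The threshold check can be sketched in a few lines of Python. The session tuples, the ISO timestamp format, and the 60-second threshold below are hypothetical; the paper does not give the log format or a threshold value.

```python
from datetime import datetime


def time_spent_seconds(time_in, time_out):
    """Return Time_Out - Time_In in seconds for ISO-formatted timestamps."""
    delta = datetime.fromisoformat(time_out) - datetime.fromisoformat(time_in)
    return delta.total_seconds()


def below_threshold_users(sessions, threshold_seconds):
    """Return the ids of users whose session is shorter than the threshold."""
    return [
        user
        for user, time_in, time_out in sessions
        if time_spent_seconds(time_in, time_out) < threshold_seconds
    ]


# Hypothetical (user, Time_In, Time_Out) records parsed from a log file.
sessions = [
    ("u1", "2015-02-26T10:00:00", "2015-02-26T10:00:20"),
    ("u2", "2015-02-26T10:00:00", "2015-02-26T10:05:00"),
]

flagged = below_threshold_users(sessions, threshold_seconds=60)
```

Here `u1` spent 20 seconds on the site and falls below the threshold, while `u2` spent 300 seconds and does not.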
CONCLUSION
The proposed work describes the problems with the tools and
systems currently available in the market and proposes an
effective solution for Twitter sentiment analysis that uses
Hadoop to reduce both time delay and cost. Various analytics
systems are available in the market, but they are costly, less
efficient, and less secure. The proposed system therefore uses
an efficient Apache open source product and presents a model
for Twitter trend analysis using Hadoop in which no extra work
such as scraping, cleansing, and data protection is required. It
also provides a speedy data downloading approach for
efficient Twitter trend analysis. The proposed work
concludes that open source software running on commodity
hardware will increase IT industry profit.
ACKNOWLEDGEMENT
I express my sincere gratitude to my project
guide, Prof. Rajeshwari M Goudar, Asst. Prof. in the
Computer Engineering Department, for her invaluable
cooperation and guidance during my project
study. I thank her once again for inspiring me and
providing all the lab facilities, which made this work
convenient and easy. I would also like to express
my appreciation and thanks to Prof. Uma Nagaraj, Head of the
Computer Engineering Department, Principal Dr. Y. J.
Bhalerao, and all my friends who knowingly and
unknowingly assisted me throughout my work.
Last but not least, heartfelt thanks to my family for being
there.
REFERENCES
[1] Claudio Cioffi-Revilla, "Computational social science," WILEY Interdisciplinary Reviews: Computational Statistics, Vol. 2, no. 3, May/June 2010, pp. 259–271.
[2] Andreas M. Kaplan, Michael Haenlein, "Users of the world, unite! The challenges and opportunities of Social Media," Business Horizons (2010) 53, 59–68, ELSEVIER.
[3] Michael Mathioudakis, Nick Koudas, "TwitterMonitor: Trend Detection over the Twitter Stream," SIGMOD'10, June 6–11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0032-2/10/06.
[4] Min Song, Meen Chul Kim, "RT2M: Real-time Twitter Trend Mining System," 978-0-7695-4998-9/13, 2013 IEEE International Conference on Social Intelligence and Technology.
[5] Saeideh Shahheidari, Hai Dong, Md Nor Ridzuan Bin Daud, "Twitter sentiment mining: A multi domain analysis," 978-0-7695-4992-7/13, 2013 IEEE Seventh International Conference on Complex, Intelligent, and Software Intensive Systems.
[6] Mutia N. Kurniati, Woo-Jong Ryu, Md. Hijbul Alam, SangKeun Lee, "Examining the Performance of Topic Modeling Techniques in Twitter Trends Extraction," 978-1-4799-3689-2/14, 2014 IEEE.
[7] Beiming Sun, Vincent TY Ng, "Analyzing Sentimental Influence of Posts on Social Networks," 978-1-4799-3776-9/14, 2014 IEEE.
[8] David Alfred Ostrowski, System Analytics Research and Innovation Center, Ford Motor Company, "Semantic Social Network Analysis for Trend Identification," 978-0-7695-4859-3/12, 2012 IEEE.
[9] Yang Lai, Shi ZhongZhi, "An Efficient Data Mining Framework on Hadoop using Java Persistence API," 978-0-7695-4108-2/10, 2010 IEEE.
[10] Kala Karun A., Chitharanjan K., Sree Chitra Thirunal College of Engineering, Thiruvananthapuram, "A Review on Hadoop HDFS Infrastructure Extensions," 978-1-4673-5758-6/13, 2013 IEEE.
[11] Bogdan Batrinca, Philip C. Treleaven, "Social media analytics: a survey of techniques, tools and platforms," AI & Soc, DOI 10.1007/s00146-014-0549-4. This article is published with open access at Springerlink.com.
[12] Daehoon Kim, Daeyong Kim, Eenjun Hwang, Seungmin Rho, "TwitterTrends: a spatio-temporal trend detection and related keywords recommendation scheme," Springer-Verlag Berlin Heidelberg 2013, Multimedia Systems, DOI 10.1007/s00530-013-0342-0.