A Survey On Bandwidth Aggregation For Speedy Data Uploading in Twitter Trend Analysis Using Hadoop

International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)

ISSN(P): 2249-6831; ISSN(E): 2249-7943
Vol. 4, Issue 6, Dec 2014, 63-68
TJPRC Pvt. Ltd.
A SURVEY ON BANDWIDTH AGGREGATION FOR SPEEDY DATA UPLOADING IN TWITTER TREND ANALYSIS USING HADOOP
GAURAV D. RAJURKAR & RAJESHWARI M. GOUDAR
Department of Computer Engineering, MIT AOE, Alandi, Pune, Maharashtra, India
ABSTRACT
Bandwidth aggregation plays an important role in wireless networks, where it helps real-time applications
that require high bandwidth for efficient processing. It can also be used for speedy data uploading in order to reduce
network delays and other latencies. Speedy data uploading is particularly important for analytics, and this
approach can prove beneficial for Twitter Trend Analysis using HADOOP. The Analytics tools and models currently
available in the market are very costly, unable to handle Big Data, and less secure. Traditional Analytics
systems take a long time to produce results, so they are not suitable for Real Time Analytics. The proposed
work addresses these problems by combining the Apache Open Source platform, which handles Real Time
Analytics using HADOOP, with a Bandwidth Aggregation solution that increases the speed of real-time data uploading
for Twitter Analysis and thus reduces network delays. It also
provides scalability and reduced analytics cost by using Open Source Software.
KEYWORDS: Bandwidth Aggregation, Social Media, Twitter Trend Analysis, Speedy Data Uploading, Apache
HADOOP, Apache Flume, Apache HBase
INTRODUCTION
Social media are web-based and mobile-based internet applications that allow the creation, access and
exchange of user-generated content that is ubiquitously accessible [3,10]. Besides social networking media like Twitter and
Facebook, the term social media also encompasses really simple syndication (RSS) feeds, blogs, wikis and news, all typically
yielding unstructured text and accessible through the web. Social media is especially important for research into
computational social science, which investigates questions using quantitative techniques (for example, computational statistics,
machine learning and complexity) and so-called big data for data mining and simulation modeling [2]. Social media has led
to numerous data services, tools and analytics platforms. The tools available to researchers give either superficial
access to the raw data or non-superficial access that requires researchers to program analytics in a language such as Java.
The proposed work improves on the available tools with respect to cost, efficient handling of Big Data and
scalability.
Analysts and businesses feel the need to gain new insights from social media; they require
analytics tools and expertise to transform this big data, with its large volume and variety, into
strategies from which conclusions can be drawn. Social media analytics is a useful tool for capturing customer
sentiment that is distributed across online sources.
www.tjprc.org
editor@tjprc.org
According to Gartner, a research and advisory firm, Social Analytics sits at the top of the decision-making and strategic
technologies that could have a great impact on businesses in 2016. Social analytics is the process of measuring, analyzing
and interpreting the results of interactions and associations among people. Internet users all over the world express their
opinions on products, brands and services using various social networking sites.
Social analytics collects and analyzes consumer opinions, converts them into insights, and helps businesses
identify areas of customer satisfaction or customer grievances about a product. It also provides quick feedback on
marketing campaigns, making it possible to assess which campaigns will be received well by consumers. Social analytics acts as a new
channel between consumers and industries [11], thereby helping companies review their influence in the market.
The proposed model fulfills the needs of companies by analyzing the data efficiently and delivering results. Many
analytical tools and dashboards are present in the market, such as SocialBro, TweetStats, Twentyfeet, Twtrland and Topsy
[14], but they are very costly and inefficient.
RELATED WORK AND AVAILABLE TOOLS FOR TWITTER ANALYTICS
Twitter Monitor: Trend Detection over the Twitter Stream

The system identifies emerging topics (i.e. trends) on Twitter in real time and provides meaningful analytics that
synthesize an accurate description of each topic [4]. Users interact with the system by ordering the identified trends using
different criteria and submitting their own description for each trend.
Twitter Monitor performs trend detection in two steps and analyzes trends in a third step. First, it identifies bursty
keywords, i.e. keywords that suddenly appear in tweets at an unusually high rate. Subsequently, it groups bursty keywords
into trends based on their co-occurrences. In other words, a trend is identified as a set of bursty keywords that occur
frequently together in tweets. After a trend is identified, Twitter Monitor extracts additional information from the tweets that
belong to the trend, aiming to discover interesting aspects of it [8]. The system has the following drawbacks:
The Twitter Monitor system relies on the bursty keyword phenomenon only; in the worst case it fails to predict the
trends of online data and to update its keyword lists.
The system is not purely a Trend Analysis System; it is also referred to as a Sentiment Analysis System.
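The two detection steps described above (finding bursty keywords, then grouping them by co-occurrence) can be illustrated with a minimal Python sketch. This is not the algorithm of Twitter Monitor [4]; the ratio test, the thresholds `ratio`, `min_count` and `min_pairs`, and both function names are assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from itertools import combinations


def bursty_keywords(history, current, ratio=3.0, min_count=3):
    """Return words whose rate in `current` tweets is unusually high.

    `history` and `current` are lists of tweet strings from an earlier
    window and the latest window. The simple ratio test is an assumed
    stand-in for whatever burst model a real system would use.
    """
    past = Counter(w for t in history for w in set(t.lower().split()))
    now = Counter(w for t in current for w in set(t.lower().split()))
    # Normalise by window size so windows of different length compare fairly.
    past_n, now_n = max(len(history), 1), max(len(current), 1)
    bursty = set()
    for word, count in now.items():
        past_rate = (past[word] + 1) / past_n  # +1 smoothing for unseen words
        if count >= min_count and (count / now_n) / past_rate >= ratio:
            bursty.add(word)
    return bursty


def group_into_trends(current, bursty, min_pairs=2):
    """Group bursty keywords that co-occur in at least `min_pairs` tweets."""
    pair_counts = Counter()
    for tweet in current:
        words = sorted(bursty & set(tweet.lower().split()))
        pair_counts.update(combinations(words, 2))
    # Merge keywords linked by frequent co-occurrence (simple union-find).
    parent = {w: w for w in bursty}

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for (a, b), n in pair_counts.items():
        if n >= min_pairs:
            parent[find(a)] = find(b)
    groups = {}
    for w in bursty:
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())
```

Because the sketch looks only at keyword counts in the latest window, it also shows the first drawback listed above: a topic that grows slowly, without a sudden burst, would never pass the ratio test.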
RT2M: Real-time Twitter Trend Mining System

The authors developed the Real-time Twitter Trend Mining (RT2M) system [5], which is designed to
crawl and store every tweet produced on Twitter in real time. It also keeps track of topical trends and visualizes mention-based user
networks. The major contribution of the system is making it possible to mine dynamic social trends and content-based
networks generated on Twitter through adequate integration of state-of-the-art techniques [10]. However, this system
has the following limitations:
Redis is an in-memory store, so all the data acquired from social media must fit in
memory. This speeds up analytics but creates a storage problem. Since Redis is not
scalable in this setting, a scalable and flexible framework is needed. The proposed work uses Hadoop, which is a highly
scalable and flexible framework, to resolve this problem.
Impact Factor (JCC): 6.8785
Index Copernicus Value (ICV): 3.0
Redis is a data structure server. No specific query language is available (only commands), and there is no support for
relational calculus. Ad-hoc queries cannot be submitted as they can with structured query language on a
relational database. More load falls on the developer, who has to design everything, so the system
becomes less flexible. In Hadoop, Hive or Pig can be used as the query language to issue ad-hoc
queries, so this drawback is also removed in the proposed system.
Redis has few options for persistence. None of them is as secure as a real transactional server providing
redo/undo logging, recovery and other capabilities.
Redis offers basic security in terms of authentication and access rights at the instance level, whereas with
Hadoop, data can be secured using SSH. So the proposed system provides security.
Other Existing Systems

There are many categories of social analytics tools, such as scientific programming tools, business toolkits, social media
monitoring tools, text analysis tools and data visualization tools, as well as products such as AeroText, Attensity,
Clarabridge, IBM LanguageWare, SPSS Text Analytics for Surveys, Language Computer Corporation, STATISTICA Text Miner
and WordStat [9,14].
A spatio-temporal trend detection scheme [13, 15], referred to as TwitterTrends, is based on the client-server
collaboration model. The client on the mobile device executes simple text filtering on each tweet to determine candidate
keywords for trends. It then sends those candidate keywords to the TP server together with metadata including GPS data
and language type. On the basis of candidate keywords and metadata collected from clients around the world, the TP server
performs a sequence of analyses [15]. The limitations are as follows:
These systems require extensive data cleaning, data scraping and integration strategies that ultimately
increase the overhead.
They fail for Real Time Analytics.
These systems involve time-consuming processes; the proposed work eliminates this drawback.
Methodology and Working of Proposed System

Social media has acquired immense popularity and interest among marketing teams. Twitter is an effective tool for
any company to learn how excited people are about its products and how they react to them. Companies can engage users on
Twitter and communicate directly with them and, in turn, users can provide word-of-mouth marketing for companies by
discussing the products. With limited resources, and knowing that one cannot directly target every intended
consumer, marketing departments can be more efficient in their marketing policy by being selective about the consumers
they reach out to. In this proposed work, Apache Pig, Apache Flume, Apache HDFS, Apache Oozie, and Apache
Hive can be used to design a direct data pipeline that enables analysis of Twitter data.
In order to find out who is prominent in social media, one should know the mechanism of Twitter, which works on
tweets and retweets. A retweet is a repost of an update, similar to forwarding an email. Querying Twitter data in a
traditional RDBMS is inefficient. Several Twitter APIs provide streaming of Twitter data. In the proposed
work, Apache Flume is used to source and sink the data. Once the data is collected via Flume, it is transferred to
the Hadoop Distributed File System (HDFS) [17]. The data is then processed by querying it with HIVE and PIG, as
shown in figure 1.
Figure 1: Pipeline of Data Flow in Proposed System
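As an illustration of how retweets indicate who is prominent, the following Python sketch ranks original authors by how often their tweets were retweeted. It assumes each tweet is a parsed JSON object in which a retweet carries a `retweeted_status` object with a nested `user.screen_name` field; the function name `most_retweeted_users` is hypothetical. In the proposed system the same aggregation would be expressed as a Hive or Pig query rather than in Python.

```python
from collections import Counter


def most_retweeted_users(tweets, top=3):
    """Rank original authors by how often their tweets were retweeted.

    Each tweet is a dict shaped like a tweet's JSON: a retweet is assumed
    to carry a `retweeted_status` object holding the original tweet.
    """
    counts = Counter()
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if original:  # plain tweets have no retweeted_status
            counts[original["user"]["screen_name"]] += 1
    return counts.most_common(top)
```

Counting by the author of the original tweet, rather than by the account that retweets, is the design choice that turns raw retweet events into a simple measure of influence.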

Bandwidth aggregation for speedy data transfer is advantageous for real-time applications, which need it to process
their tasks efficiently [1]. The Twitter Trend Analysis system proposed here also requires continuous availability of data.
The data, here tweets in JSON form, is taken from the Twitter server, where it is available in the public domain. The connection
between the Twitter API and the analytics system requires high bandwidth. The bandwidth can be aggregated at the application layer,
network layer or transport layer [1, 6, 7]. Link aggregation overcomes the problems of low data availability and data
loss, which helps industries where continuous analytics are carried out.
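A minimal Python sketch of application-layer aggregation is given here, under stated assumptions: the payload is split into sequence-numbered chunks, each chunk is scheduled on the link expected to finish it earliest, and the receiver reorders chunks by sequence number. The link names, the relative bandwidth values and the functions `schedule_chunks` and `reassemble` are hypothetical; the sketch simulates scheduling and reordering only and does not open real network connections.

```python
def schedule_chunks(payload, link_bandwidths, chunk_size=4):
    """Assign sequence-numbered chunks to links in proportion to bandwidth.

    `link_bandwidths` maps a link name to a relative bandwidth (bytes per
    time unit). Each chunk goes to the link that would finish it earliest.
    Returns the per-link plan and the estimated total transfer time.
    """
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    busy_until = {link: 0.0 for link in link_bandwidths}
    plan = {link: [] for link in link_bandwidths}
    for seq, chunk in enumerate(chunks):
        # Pick the link with the earliest estimated completion time.
        link = min(link_bandwidths,
                   key=lambda l: busy_until[l] + len(chunk) / link_bandwidths[l])
        busy_until[link] += len(chunk) / link_bandwidths[link]
        plan[link].append((seq, chunk))
    return plan, max(busy_until.values())


def reassemble(plan):
    """Reorder chunks arriving from all links by sequence number."""
    received = [item for items in plan.values() for item in items]
    return b"".join(chunk for _, chunk in sorted(received))
```

With two links of relative bandwidth 2 and 1, a 24-byte payload is estimated to finish in 8 time units instead of the 12 needed on the faster link alone, which is the gain that aggregation aims for. The sequence numbers are what allow the receiver to tolerate chunks arriving out of order across links.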
Apache Flume is used for gathering data. It is a data ingestion system configured by defining endpoints in a data
flow, called sources and sinks. In Flume, each individual piece of data (here, a tweet) is called an event; sources produce events
and send them through a channel, which connects the source to the sink. The sink then writes the events out to a
predefined location. Flume takes the data by establishing a connection via a network-based application and
provides data continuously, so that analytics becomes much easier. There are many Twitter APIs for bringing the data to the
local system [12]. One can write Python code that establishes connections with the Twitter API, but
this data is uploaded only periodically, so the approach fails for Real Time Analytics. The proposed work uses Flume, which
removes this problem.
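The source, channel and sink roles can be pictured with a toy Python model. This is not Flume's API or configuration format; it is a sketch that assumes a bounded in-memory queue as the channel and a Python list as the sink's destination, with the hypothetical function `run_pipeline` wiring the parts together.

```python
import queue
import threading


def run_pipeline(events, sink_output, capacity=100):
    """Move events from a source through a bounded channel to a sink."""
    channel = queue.Queue(maxsize=capacity)  # the channel buffers events
    done = object()  # sentinel marking the end of the stream

    def source():
        for event in events:  # the source produces events
            channel.put(event)  # blocks while the channel is full
        channel.put(done)

    def sink():
        while True:
            event = channel.get()
            if event is done:
                break
            sink_output.append(event)  # the sink writes to its destination

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink_output
```

The bounded channel is the point of the design: the source and the sink run independently, and when the sink falls behind, the source is slowed down instead of events being dropped.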
Apache Oozie is used for partition management; it helps design the workflow of jobs and takes care of
scheduling them. It also provides periodicity in data partitioning. Once the Twitter data is loaded into HDFS,
one can query it by creating an external table in Pig or Hive. The external table simplifies the task by
removing the extra work of always moving the data into HDFS. For scalability, when more data needs to be
added, the table is partitioned. Table partitioning lets queries skip unnecessary data, which again boosts
efficiency. The Twitter API continuously streams data via Flume into HDFS, which generates a huge amount of
data. To handle this, Oozie acts as a partition management system by periodically partitioning the
data. After getting the data, one must check whether it is in proper form. If it is not well parsed, it
has to be parsed using the Auto storage class delimiter functionality. Because the data is in JSON
(JavaScript Object Notation) form, it can be easily processed by Hive. Hive also has Serializer and
Deserializer functionality, generally referred to as SerDe. A SerDe provides the interface for interpreting what type of data is
loaded and tells Hive what form the data is translated into, so that processing becomes easier. One can write a
custom SerDe that reads the JSON data in and translates the objects for Hive. Once that is in place, one can start
querying. The proposed work therefore suggests designing a custom SerDe where possible in order to get highly efficient results.
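What a JSON deserializer and a periodic partitioning job do can be sketched in Python as follows. A real Hive SerDe is written against Hive's own interfaces, so this is only an analogy: the field names (`id`, `created_at`, `user.screen_name`, `text`, `retweeted_status`) and the timestamp layout follow the tweet JSON format as assumed here, the hourly granularity is an assumption, and the functions `deserialize_tweet` and `partition_key` are hypothetical.

```python
import json
from datetime import datetime


def deserialize_tweet(line):
    """Turn one raw JSON tweet into a flat row, roughly what a SerDe does."""
    obj = json.loads(line)
    # Assumed timestamp layout, e.g. "Mon Dec 01 10:15:00 +0000 2014".
    created = datetime.strptime(obj["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "id": obj["id"],
        "created_at": created,
        "screen_name": obj["user"]["screen_name"],
        "text": obj["text"],
        "is_retweet": "retweeted_status" in obj,
    }


def partition_key(row):
    """Hourly partition label that a scheduled job could attach to the row."""
    return row["created_at"].strftime("%Y%m%d%H")
```

Grouping rows by such a key is what allows a query restricted to one hour to read only that hour's partition instead of the whole table.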
Architecture of Proposed Work

The proposed system is based on combining Open Source Software [17] with
hardware. The proposed high-level architecture is shown in figure 2. It should prove efficient and beneficial for
Analytics. Bandwidth aggregation boosts data uploading from the Twitter server [10]. This approach is
useful for efficient Twitter Trend Analysis in Real Time. The data is then transferred to HDFS [16] via Apache
Flume. Processing is done using HIVE or PIG (Hadoop query languages). The raw data (in JSON form) is parsed via SerDe,
if required, to get minimum results. Oozie helps with time management, and the results are generated in the form
of reports that make analytics more efficient.
Figure 2: Architecture of Proposed Work
CONCLUSIONS
This work describes the problems with the tools and systems available in the market.
It proposes an effective solution for Twitter Trend Analysis that aggregates bandwidth (links) to reduce time delay and
uses HADOOP to reduce cost. Various Analytics systems are available in the market, but they
are very costly, less efficient and less secure. The proposed system therefore uses efficient Apache Open Source products and
presents a model that aggregates the available bandwidth for speedy data uploading, enabling Twitter Trend
Analysis using HADOOP with no extra work such as scraping, cleansing and data protection. The work
concludes that Open Source Software combined with Commodity Hardware will increase IT Industry
profit.
ACKNOWLEDGEMENTS
I express my true sense of gratitude towards my project guide, Prof. Rajeshwari M. Goudar, Asst. Prof. of the Computer
Engineering Department, for her invaluable co-operation and guidance throughout my project study. I would like to
thank her once again for inspiring me and providing all the lab facilities, which made this survey work very convenient
and easy. I would also like to express my appreciation and thanks to Prof. Uma Nagaraj, Head of the Computer Engineering
Department, Principal Dr. Y. J. Bhalerao, and all my friends who knowingly and unknowingly have assisted me
throughout my work. Last but not least, heartfelt thanks to my family for being there.
REFERENCES
1. Suhaimi A. Latif, Mosharrof H. Masud, Farhat Anwar and Md. Khorshed Alam, "An Investigation of Scheduling
and Packet Reordering Algorithms for Bandwidth Aggregation in Heterogeneous Wireless Networks," Middle-East
Journal of Scientific Research 18 (9): 1253-1263, 2013
2. Claudio Cioffi-Revilla, "Computational social science," WILEY Interdisciplinary Reviews: Computational
Statistics, Vol. 2, no. 3, May/June 2010, pp. 259-271
3. Andreas M. Kaplan, Michael Haenlein, "Users of the world, unite! The challenges and opportunities of Social
Media," Business Horizons (2010) 53, 59-68, Elsevier
4. Michael Mathioudakis, Nick Koudas, "TwitterMonitor: Trend Detection over the Twitter Stream," SIGMOD'10,
June 6-11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0032-2/10/06
5. Min Song, Meen Chul Kim, "RT2M: Real-time Twitter Trend Mining System," 2013
IEEE International Conference on Social Intelligence and Technology, 978-0-7695-4998-9/13
6. Juan Carlos Fernandez, Tarik Taleb, Mohsen Guizani, and Nei Kato, "Bandwidth Aggregation-Aware Dynamic
QoS Negotiation for Real-Time Video Streaming in Next-Generation Wireless Networks," 1520-9210, 2009 IEEE
7. Karim Habak, Moustafa Youssef, Khaled A. Harras, "An optimal deployable bandwidth aggregation system,"
1389-1286, 2013 Elsevier B.V.
8. Saeideh Shahheidari, Hai Dong, Md Nor Ridzuan Bin Daud, "Twitter sentiment mining: A multi domain analysis,"
2013 IEEE Seventh International Conference on Complex, Intelligent, and Software
Intensive Systems, 978-0-7695-4992-7/13
9. Mutia N. Kurniati, Woo-Jong Ryu, Md. Hijbul Alam, SangKeun Lee, "Examining the Performance of Topic
Modeling Techniques in Twitter Trends Extraction," 978-1-4799-3689-2/14, 2014 IEEE
10. Beiming Sun, Vincent TY Ng, "Analyzing Sentimental Influence of Posts on Social Networks,"
978-1-4799-3776-9/14, 2014 IEEE
11. David Alfred Ostrowski, System Analytics Research and Innovation Center, Ford Motor Company, "Semantic
Social Network Analysis for Trend Identification," 978-0-7695-4859-3/12, 2012 IEEE
12. Yang Lai, Shi ZhongZhi, "An Efficient Data Mining Framework on Hadoop using Java Persistence API,"
978-0-7695-4108-2/10, 2010 IEEE
13. Kala Karun A., Chitharanjan K., Sree Chitra Thirunal College of Engineering, Thiruvananthapuram, "A Review on
Hadoop HDFS Infrastructure Extensions," 978-1-4673-5758-6/13, 2013 IEEE
14. Bogdan Batrinca, Philip C. Treleaven, "Social media analytics: a survey of techniques, tools and platforms," AI &
Soc, DOI 10.1007/s00146-014-0549-4. This article is published with open access at Springerlink.com
15. Daehoon Kim, Daeyong Kim, Eenjun Hwang, Seungmin Rho, "TwitterTrends: a spatio-temporal trend detection
and related keywords recommendation scheme," Multimedia Systems, Springer-Verlag Berlin Heidelberg 2013,
DOI 10.1007/s00530-013-0342-0
16. www.cloudera.com
17. www.hadoop.apache.org
