Big Data can be intimidating even to the most seasoned IT professional. It's not simply the charged
nature of the term "Big" that is ominous; the underlying technology is also app-centric in a
very open-source way. If you are like most professionals who don't have a working knowledge
of MapReduce, JSON, Hive, or Flume, diving into the deep end of the Big Data technology
pool may seem like a time-consuming process. Even if you possess these skill sets, the prospect
of launching a Hadoop environment and deploying an application that streams Twitter data
into the environment in a way that is accessible through standard ODBC tools would seem like
a task measured in weeks, not days.
It may surprise most people looking to get hands-on with Big Data technology that each of us
can do so in short order: with the right approach, you can stream live social data to your
own Hadoop cluster and report on the information through Excel in less than one day. In an
instructive manner, this whitepaper series provides a fast-track approach to creating
your personal Big Data lab environment powered by Apache Hadoop. This first part in the
series will engage IT professionals with a passing interest in Big Data by providing:
Reasons to explore the world of Big Data and the Big Data skills gap
A practical, lightweight approach to getting hands-on with Big Data technology
A description of the use case and the supporting technical components in more detail
Step-by-step instructions for setting up the lab environment, with direction to Cloudera's
streaming Twitter agent tutorial.
We will enhance Cloudera's tutorial in the following ways:
-- Make the tutorial real-time.
-- Provide steps to establish ODBC connectivity and to execute Cloudera's sample queries in Excel.
-- Configure and register libraries at an overall environment level.
-- Provide sample code and troubleshooting tips.
skills and 70% of respondents describe finding qualified data scientists as challenging to very difficult1. Thus
Big Data introduces an opportunity for the business, but exposes a skills and technology gap for IT. This gap must
be filled quickly; otherwise, businesses will find themselves at a competitive disadvantage, and IT's ability to
support the business will be questioned.
The following lists the reasons why Hadoop is the preferred platform for learning Big Data and for implementing this
scenario.
Why Hadoop
There are many ways to deploy Apache Hadoop. Our example relies on Cloudera's distribution of Apache Hadoop
(CDH) running in a Linux VM image. The following lists the key considerations and why CDH was used.
Consideration: Building the Hadoop environment from scratch as opposed to using CDH.
We considered building our Hadoop environment from scratch through the Apache Hadoop projects. If your
learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you
need to tweak the source code, then you should include this step in your approach. Given the time commitment,
it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects.

Consideration: Using CDH over HortonWorks or MapR.
Alternatives from HortonWorks and MapR were considered, specifically Microsoft's HDInsight distribution that uses
HortonWorks. Ultimately, the deciding factors were Cloudera's software and support resources and its Twitter Feed
example, which are available for download and general use. Cloudera also has VM images, with a free edition of
Cloudera Manager and Hadoop, available for download with the entire set of Apache Hadoop projects required by
the scenario.

Consideration: Deploying Hadoop in the cloud.
Deploying the environment to the cloud was considered, and in some cases it may be preferred. An instance of
Microsoft HDInsight running in Azure was used, and would have been pursued at greater length, but unfortunately
the lease on the Azure instance expired and inquiries on how to extend the lease went unanswered.
The CDH stack (listed below) summarizes the core projects included with CDH, and the projects relevant to the use
case are captioned2.
It should be noted that this is a learning exercise, not a performance benchmark. Thus a single-node Hadoop
cluster running inside a Linux VM is deemed sufficient for those of us wanting to learn Hadoop. If performance tuning
is crucial to your learning objectives, then a more robust environment would be required. The business use case
would still be relevant, since streaming live social data will generate millions of transactions, depending on the key
words you have specified. The specifications of the VM image and the host machine are listed in the Appendix.

Finally, there are ways to make this use case more comprehensive. For instance, once the streaming data is
captured, you could use Apache Mahout to cluster and classify it using the various algorithms available.
Since much of the classification would depend on business input, it seems reasonable to take a first iteration
through the use case as presented, and then proceed with next steps in concert with more involvement or direction
from the business.
Figure 2: Streaming social media use case and supporting technical components
sudo su
yum install gcc
i. Step 3: We had to manually create the flume-ng-agent file with the following contents:
# FLUME_AGENT_NAME=kings-river-flume
FLUME_AGENT_NAME=TwitterAgent
ii. Step 4: If you are not familiar with the details of your Twitter app, this step may cause confusion. All
that is required is a Twitter account. Once you have a Twitter account, you need to register the Flume
Twitter agent with Twitter so that Twitter has a record of your agent and can govern the various third
parties that stream Twitter data.
1. To register your Twitter App, go to https://dev.twitter.com.
2. Sign in with your Twitter account.
3. Click Create a new Application.
4. Enter the following information:
5. Your new application will provide you with four security tokens that must be specified in the flume.conf
file, as shown in the next step.
6. Using the values from your Twitter application, enter the following parameters in
flume.conf. If flume.conf does not exist in /etc/flume-ng/conf, download it from the GitHub
project:
TwitterAgent.sources.Twitter.consumerKey = <consumer_key_from_twitter>
TwitterAgent.sources.Twitter.consumerSecret = <consumer_secret_from_twitter>
TwitterAgent.sources.Twitter.accessToken = <access_token_from_twitter>
TwitterAgent.sources.Twitter.accessTokenSecret = <access_token_secret_from_twitter>
7. In flume.conf, modify the following parameter according to the keywords on which you want to filter
tweets. Note that the default flume.conf provided by Cloudera misspells "data scientist"; the correct
spelling is shown in the example below:
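The keywords line below is illustrative; substitute the terms you want to track (note "data scientist" spelled correctly):
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, mapreduce, mahout, hbase, nosql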
8. At this point you probably realize the importance of flume.conf. In addition to containing the details
of the Twitter app and the keywords, it contains the parameters that govern how large the Flume
files grow before Flume rolls over to a new file. These parameters are significant because changing
them also changes the latency of the tweets. The complete listing of the Flume parameters can be
found on Cloudera's website.
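The relevant HDFS sink settings look like the following; the values shown are illustrative, and smaller rollCount or rollInterval values yield smaller files and fresher data at the cost of more files on HDFS:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600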
cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop
ii. After step 4, you'll want to create a new Java package using the following steps. No Java
programming knowledge is required; simply follow these instructions. It is necessary to create this Java
class and JAR it so that you can exclude the temporary Flume files created as tweets are streamed to
HDFS5.
mkdir com
mkdir com/twitter
mkdir com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.2.jar:hadoop-common.jar
vi com/twitter/util/FileFilterExcludeTmpFiles.java
Copy the Java source code in the appendix into the file and save it.
javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop
iii. Edit the file /etc/hive/conf/hive-site.xml and add the following properties. The first property ensures that you
won't have to add the JSON SerDe package and the new custom package that excludes temporary Flume
files for each Hive session; it becomes part of the overall Hive configuration available to every session.
The second property tells MapReduce the class name and location of the new Java class that we created
and compiled above.
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>
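For reference, the Hive table the tutorial queries is an external, partitioned table over the raw JSON, using the SerDe registered above. The following is a trimmed sketch; the full DDL, with many more tweet fields, ships with the cdh-twitter-example project, and the SerDe class name should be verified against the JAR you deployed:
-- Trimmed, illustrative DDL; see the GitHub project for the complete table definition.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  user STRUCT<screen_name:STRING, followers_count:INT>,
  retweeted_status STRUCT<text:STRING, user:STRUCT<screen_name:STRING>>
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';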
For all steps, download the Oozie files from the Cloudera GitHub site.
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
2. The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. Let's say Flume is streaming
the tweets to an HDFS folder, /user/flume/tweets/*. The initialDataset parameter tells the
workflow the earliest year, month, day, and hour for which you have data and can therefore
add a partition to the Hive tweets table. jobStart should be set to the initialDataset +/- the tzOffset.
Finally, jobEnd tells the Oozie workflow when to wind down, so it can be set well into the future.
In the following example, the parameters specify that the first set of tweets lives on HDFS under
/user/flume/tweets/2013/01/17/08, and once the directory is available the workflow will execute the Hive
Query Language script add-partition.q (sketched after the parameter listing below).
jobStart=2013-01-17T13:00Z
jobEnd=2013-12-12T23:00Z
initialDataset=2013-01-17T08:00Z
tzOffset=-5
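The add-partition.q script referenced above is, in essence, a single ALTER TABLE statement. The sketch below is illustrative only, with placeholder variable names; the actual script is included in the Oozie files downloaded from the Cloudera GitHub site:
-- Illustrative sketch; variable names are placeholders supplied by the Oozie workflow.
ALTER TABLE tweets ADD IF NOT EXISTS
  PARTITION (datehour = ${DATEHOUR})
  LOCATION '/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}';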
3. Edit coord-app.xml:
a. Change timezone from America/Los_Angeles to America/New_York (or the corresponding
timezone for your location):
initial-instance="${initialDataset}" timezone="America/New_York">
b. Remove the following tags. This is extremely important in making the tutorial as close to real-time as
possible. The default Oozie workflow defines a readyIndicator, which acts as a wait event:
it instructs the workflow to create a new partition only after an hour completes. Thus, if you leave
this configuration as-is, there will be a lag of up to one hour between when tweets arrive and when the
tweets can be queried. The reason for this default configuration is that the tutorial did not define
the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary
Flume files. Because we have deployed this custom package, we do not have to wait for a full
hour to complete before querying tweets.
<data-in name="readyIndicator" dataset="tweets">
  <instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
</data-in>
iii. If you haven't done so already, enable the Oozie web console according to the Cloudera documentation.
Doing so allows Oozie coordinator jobs and workflows to be accessed from the console located at
http://localhost.localdomain:11000/oozie/.
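If the console is not yet enabled, the typical steps under CDH4 are sketched below; the ExtJS version and the /var/lib/oozie path come from the Cloudera documentation and should be verified against your release:
# Sketch only: the Oozie web console requires the ExtJS library.
# Download ext-2.2.zip (see the Cloudera documentation for the link), then:
sudo cp ext-2.2.zip /var/lib/oozie/
cd /var/lib/oozie && sudo unzip ext-2.2.zip
sudo service oozie restart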
f. Once you have started the Flume agent (under "Starting the data pipeline"), you will see tweets streaming
into HDFS; a quick command-line check is shown after the items below.
i. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster. You
can also access the cluster from http://localhost.localdomain:50070/dfshealth.jsp.
ii. If you are experiencing technical issues, please reference the Troubleshooting Guide in the appendix.
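Before moving on to ODBC, a quick command-line check confirms that events are landing in HDFS; the path below assumes the tutorial's default Flume HDFS sink location:
# List the streamed tweet directories (organized by year/month/day/hour).
hadoop fs -ls /user/flume/tweets/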
5. Set up ODBC connectivity through Excel:
a. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.
i. There are several ODBC drivers for Hive, but many were not compatible with Excel (e.g., Cloudera's
ODBC driver for Tableau) or not compatible with Cloudera's environment (Microsoft's ODBC driver for
Hive, which only worked when connecting to Microsoft HDInsight).
ii. We successfully used MapR's ODBC driver for Windows located here. Since we are running 32-bit
Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a 64-bit driver as well.
iii. Download and install the appropriate ODBC driver from MapR's website.
iv. Configure an ODBC connection to the Hive database.
1. We recommend specifying an entry in your Windows hosts file (C:\Windows\System32\drivers\etc\hosts)
to alias the IP address of your VM. You can get the IP address from your VM by running the
ifconfig command.
192.168.198.130 cloudera-vm
5. Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR).
7. Select Finish.
8. Select Properties. This is the important part, because we must override the HQL in order for the
query to execute. At the time this article was written, the major ODBC drivers appended "default" to
the Hive query, and the MapR ODBC driver was the only one able to establish connectivity while
allowing us to override the HQL.
9. Select the Definition tab. Using one of the Hive queries provided in the Appendix, copy the HQL and
paste it into the Command Text field. Also save the password.
11. Repeat for the remaining queries in the appendix, and create as many queries as you see fit. HQL is
very SQL-like, and those of us who know SQL will find it easy to adapt the queries from the appendix
into other statements that provide the views you need (a representative query is sketched below).
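As an example, the following query, adapted from the Cloudera tutorial and assuming the tweets table described earlier, returns the most-retweeted users; pasting it into the Command Text field, as in step 9, turns it into a refreshable Excel data source:
SELECT t.retweeted_screen_name,
       SUM(retweets) AS total_retweets,
       COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;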
Summary
Once you have successfully completed this tutorial, you should have a clearer understanding of Hadoop, specifically:
1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting
through a standard ODBC connection.
2. An operational Hadoop sandbox that can be used for training, local development, and proof of concepts that
you can navigate and explore.
3. A real-world reference model for a use case illustrating the amazing streaming capabilities in Hadoop.
4. How to model semi-structured JSON data in Hive and query it in a conventional manner.
Lastly, this exercise should leave individuals wanting to take the Hadoop experience to the next level. Independently,
you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment
analysis. You may also want to layer geospatial data into the set to provide more advanced analytics. You could
consider streaming data from other social media sites (if so, we recommend starting here). Above all, you may want
to show someone from the business what this new technology can do. By demystifying Big Data technology,
you can take your understanding and your ability to support additional business use cases to these next levels.
Java source for com/twitter/util/FileFilterExcludeTmpFiles.java (referenced in the setup steps above):
package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Excludes the temporary files Flume writes while streaming (names starting with "_"
// or "." or ending in ".tmp") so Hive/MapReduce reads only completed files.
public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
Host
OS: Windows 7 Enterprise 64-bit
Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz (2.27GHz)
Memory: 8GB (7.9GB addressable)
Disk: 300GB
Software: VM Player 3.1.2 build-301548; Microsoft Office 32-bit

Guest
OS: CentOS 6.2 Linux 64-bit
Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
Memory: 2.98GB
Disk: 23.5GB
Software: Cloudera Manager Free Edition 4.1.1; CDH4.1.2
Cause / Resolution
1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop.
2. Permissions on /user/flume/*
3. Missing MySQL driver: cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib
Cause: Oozie attempts to add a partition that does not exist.
Error:
java.io.IOException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:350)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:229)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:210)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:195)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at java.security.AccessController.doPrivileged(Native Method)
References
1. http://blogs.hbr.org/cs/2012/11/the_big_data_talent_gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package
3. http://www.cloudera.com/content/cloudera/en/products/cdh.html
4. http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structureddata-with-hive/
https://github.com/cloudera/cdh-twitter-example
5. Known issue with Flume, see https://issues.apache.org/jira/browse/FLUME-1702
6. Adapted from http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
About Collaborative
Collaborative Consulting is dedicated to helping
companies optimize their existing business and
technology assets. The company is committed to
building long-term relationships and strives to be a
trusted partner with every client. Founded in 1999,
Collaborative Consulting serves clients from offices
across the United States, with headquarters in
Burlington, Massachusetts.