Big Data can be intimidating even to the most seasoned IT professional. It's not simply the charged
nature of the term "Big" that is ominous; the underlying technology is also app-centric in a
very open-source way. If you are like most professionals who don't have a working knowledge
of MapReduce, JSON, Hive, or Flume, diving into the deep end of the Big Data technology
pool may seem like a time-consuming process. Even if you possess these skill sets, the prospect
of launching a Hadoop environment and deploying an application that streams Twitter data
into the environment in a way that is accessible through standard ODBC tools would seem like
a task measured in weeks, not days.
It may surprise most people looking to get hands-on with Big Data technology that each of us
can do so in short order: with the right approach, you can stream live social data to your
own Hadoop cluster and report on the information through Excel in less than one day. In an
instructive manner, this whitepaper series provides a fast-track approach to creating
your personal Big Data lab environment powered by Apache Hadoop. This first part in the
series will engage IT professionals with a passing interest in Big Data by providing:
Reasons to explore the world of Big Data and the Big Data skills gap
A practical, lightweight approach to getting hands-on with Big Data technology
A description of the use case and the supporting technical components in more detail
Step-by-step instructions for setting up the lab environment, with direction to Cloudera's
streaming Twitter agent tutorial.
We will enhance Cloudera's tutorial in the following ways:
-- Make the tutorial real-time.
-- Provide steps to establish ODBC connectivity and to execute Cloudera's sample queries in Excel.
-- Configure and register libraries at an overall environment level.
-- Provide sample code and troubleshooting tips.
skills and 70% of respondents describe finding qualified data scientists as challenging to very difficult1. Thus
Big Data introduces an opportunity for the business, but exposes a skills and technology gap for IT. This gap must
be filled quickly; otherwise, businesses will find themselves at a competitive disadvantage, and IT's ability to
support the business will be questioned.
The following lists the reasons why Hadoop is the preferred platform for learning Big Data and for implementing this
scenario.
Why Hadoop
There are many ways to deploy Apache Hadoop. Our example relies on Cloudera's distribution of Apache Hadoop
(CDH) running in a Linux VM image. The following lists the key considerations and why CDH was used.
Consideration: Building the Hadoop environment from scratch as opposed to using CDH.
We considered building our Hadoop environment from scratch through the Apache Hadoop projects. If your
learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you
need to tweak the source code, then you should include this step in your approach. Given the time commitment,
it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects.

Consideration: Using CDH over HortonWorks or MapR.
Alternatives from HortonWorks and MapR were considered, specifically Microsoft's HDInsight distribution that uses
HortonWorks. Ultimately, the deciding factors were Cloudera's software and support resources and its Twitter Feed
example, which are available for download and general use. Cloudera also has VM images, with a free edition of
Cloudera Manager and Hadoop, available for download with the entire set of Apache Hadoop projects required by
the scenario.

Consideration: Deploying Hadoop in the cloud.
Deploying the environment to the cloud was considered, and in some cases it may be preferred. An instance of
Microsoft HDInsight running in Azure was used, and would have been pursued at greater length, but unfortunately
the lease on the Azure instance expired and inquiries on how to extend the lease went unanswered.
The CDH stack (listed below) summarizes the core projects included with CDH, and the projects relevant to the use
case are captioned2.
It should be noted that this is a learning exercise, not a performance benchmark. Thus a single-node Hadoop
cluster running inside a Linux VM is deemed sufficient for those of us wanting to learn Hadoop. If performance tuning
is crucial to your learning objectives, then a more robust environment would be required. The business use case
would still be relevant, since streaming live social data will generate millions of transactions, depending on the key
words you have specified. The specifications of the VM image and the host machine are listed in the Appendix.

Finally, there are ways to make this use case more comprehensive. For instance, once the streaming data is
captured, you could use Apache Mahout to cluster and classify it using the various algorithms available.
Since much of the classification would depend on business input, it seems reasonable to take a first iteration
through the use case as presented, and then proceed with next steps in concert with more involvement or direction
from the business.
Figure 2: Streaming social media use case and supporting technical components
sudo su
yum install gcc
i. Step 3: We had to manually create the flume-ng-agent file with the following contents:
# FLUME_AGENT_NAME=kings-river-flume
FLUME_AGENT_NAME=TwitterAgent
ii. Step 4: If you are not familiar with the details of your Twitter app, this step may cause confusion. All
that is required is a Twitter account. Once you have a Twitter account, you need to register the Flume
Twitter agent with Twitter so that Twitter has a record of your agent and can govern the various third
parties that stream Twitter data.
1. To register your Twitter App, go to https://dev.twitter.com.
2. Sign in with your Twitter account.
3. Click Create a new Application.
4. Enter the following information:
5. Your new application will provide you with four security tokens that must be specified in the flume.conf
file, as shown in the next step.
6. Using the values from your Twitter application, enter the following parameters in
flume.conf. If flume.conf does not exist in /etc/flume-ng/conf, download it from the GitHub
project:
TwitterAgent.sources.Twitter.consumerKey = <consumer_key_from_twitter>
TwitterAgent.sources.Twitter.consumerSecret = <consumer_secret_from_twitter>
TwitterAgent.sources.Twitter.accessToken = <access_token_from_twitter>
TwitterAgent.sources.Twitter.accessTokenSecret = <access_token_secret_from_twitter>
7. In flume.conf, modify the following parameter according to the keywords on which you want to filter
tweets. Note that the default flume.conf provided by Cloudera misspells "data scientist"; the correct
spelling is shown in the example below:
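The keywords line below is illustrative; substitute the terms you want to track (note "data scientist" spelled correctly):
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, mapreduce, mahout, hbase, nosql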
8. At this point you probably realize the importance of flume.conf. In addition to containing the details
of the Twitter app and the keywords, it contains the parameters that govern how large the Flume
files grow before Flume rolls over to a new file. These parameters are significant because changing
them also changes the latency of the tweets. The complete listing of the Flume parameters can be
found on Cloudera's website.
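The relevant HDFS sink settings look like the following; the values shown are illustrative, and smaller rollCount or rollInterval values yield smaller files and fresher data at the cost of more files on HDFS:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600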
cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop
ii. After step 4, you'll want to create a new Java package using the following steps. No Java
programming knowledge is required; simply follow these instructions. It is necessary to create this Java
class and JAR it so that you can exclude the temporary Flume files created as tweets are streamed to
HDFS5.
mkdir com
mkdir com/twitter
mkdir com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.2.jar:hadoop-common.jar
vi com/twitter/util/FileFilterExcludeTmpFiles.java
Copy the Java source code in the appendix into the file and save it.
javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop
iii. Edit the file /etc/hive/conf/hive-site.xml and add the following properties. The first property ensures that you
won't have to add the JSON SerDe package and the new custom package that excludes temporary Flume
files for each Hive session; it becomes part of the overall Hive configuration available to every session.
The second property tells MapReduce the class name and location of the new Java class that we created
and compiled above.
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>
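For reference, the Hive table the tutorial queries is an external, partitioned table over the raw JSON, using the SerDe registered above. The following is a trimmed sketch; the full DDL, with many more tweet fields, ships with the cdh-twitter-example project, and the SerDe class name should be verified against the JAR you deployed:
-- Trimmed, illustrative DDL; see the GitHub project for the complete table definition.
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweet_count INT,
  user STRUCT<screen_name:STRING, followers_count:INT>,
  retweeted_status STRUCT<text:STRING, user:STRUCT<screen_name:STRING>>
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';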
For all steps, download the Oozie files from the Cloudera GitHub site.
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
2. The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. Let's say Flume is streaming
the tweets to an HDFS folder, /user/flume/tweets/*. The initialDataset parameter tells the
workflow the earliest year, month, day, and hour for which you have data and can therefore
add a partition to the Hive tweets table. jobStart should be set to the initialDataset +/- the tzOffset.
Finally, jobEnd tells the Oozie workflow when to wind down, so it can be set well into the future.
In the following example, the parameters specify that the first set of tweets lives on HDFS under
/user/flume/tweets/2013/01/17/08, and once the directory is available the workflow will execute the Hive
Query Language script add-partition.q (sketched after the parameter listing below).
jobStart=2013-01-17T13:00Z
jobEnd=2013-12-12T23:00Z
initialDataset=2013-01-17T08:00Z
tzOffset=-5
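The add-partition.q script referenced above is, in essence, a single ALTER TABLE statement. The sketch below is illustrative only, with placeholder variable names; the actual script is included in the Oozie files downloaded from the Cloudera GitHub site:
-- Illustrative sketch; variable names are placeholders supplied by the Oozie workflow.
ALTER TABLE tweets ADD IF NOT EXISTS
  PARTITION (datehour = ${DATEHOUR})
  LOCATION '/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}';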
3. Edit coord-app.xml:
a. Change timezone from America/Los_Angeles to America/New_York (or the corresponding
timezone for your location):
initial-instance="${initialDataset}" timezone="America/New_York">
b. Remove the following tags. This is extremely important in making the tutorial as close to real-time as
possible. The default Oozie workflow defines a readyIndicator, which acts as a wait event:
it instructs the workflow to create a new partition only after an hour completes. Thus, if you leave
this configuration as-is, there will be a lag of up to one hour between when tweets arrive and when the
tweets can be queried. The reason for this default configuration is that the tutorial did not define
the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary
Flume files. Because we have deployed this custom package, we do not have to wait for a full
hour to complete before querying tweets.
<data-in name="readyIndicator" dataset="tweets">
  <instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
</data-in>
iii. If you haven't done so already, enable the Oozie web console according to the Cloudera documentation.
Doing so allows Oozie coordinator jobs and workflows to be accessed from the console located at
http://localhost.localdomain:11000/oozie/.
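If the console is not yet enabled, the typical steps under CDH4 are sketched below; the ExtJS version and the /var/lib/oozie path come from the Cloudera documentation and should be verified against your release:
# Sketch only: the Oozie web console requires the ExtJS library.
# Download ext-2.2.zip (see the Cloudera documentation for the link), then:
sudo cp ext-2.2.zip /var/lib/oozie/
cd /var/lib/oozie && sudo unzip ext-2.2.zip
sudo service oozie restart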
f. Once you have started the Flume agent (under "Starting the data pipeline"), you will see tweets streaming
into HDFS; a quick command-line check is shown after the items below.
i. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster. You
can also access the cluster from http://localhost.localdomain:50070/dfshealth.jsp.
ii. If you are experiencing technical issues, please reference the Troubleshooting Guide in the appendix.
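Before moving on to ODBC, a quick command-line check confirms that events are landing in HDFS; the path below assumes the tutorial's default Flume HDFS sink location:
# List the streamed tweet directories (organized by year/month/day/hour).
hadoop fs -ls /user/flume/tweets/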
5. Set up ODBC connectivity through Excel:
a. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.
i. There are several ODBC drivers for Hive, but many were not compatible with Excel (e.g., Cloudera's
ODBC driver for Tableau) or not compatible with Cloudera's environment (Microsoft's ODBC driver for
Hive, which only worked when connecting to Microsoft HDInsight).
ii. We successfully used MapR's ODBC driver for Windows located here. Since we are running 32-bit
Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a 64-bit driver as well.
iii. Download and install the appropriate ODBC driver from MapR's website.
iv. Configure an ODBC connection to the Hive database.
1. We recommend specifying an entry in your Windows hosts file (C:\Windows\System32\drivers\etc\hosts)
to alias the IP address of your VM. You can get the IP address from your VM by running the
ifconfig command.
192.168.198.130 cloudera-vm
5. Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR).
7. Select Finish.
8. Select Properties. This is the important part, because we must override the HQL in order for the
query to execute. At the time this article was written, the major ODBC drivers appended "default" to
the Hive query, and the MapR ODBC driver was the only one able to establish connectivity while
allowing us to override the HQL.
9. Select the Definition tab. Using one of the Hive queries provided in the Appendix, copy the HQL and
paste it into the Command Text field. Also save the password.
11. Repeat for the remaining queries in the appendix, and create as many queries as you see fit. HQL is
very SQL-like, and those of us who know SQL will find it easy to adapt the queries from the appendix
into other statements that provide the views you need (a representative query is sketched below).
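As an example, the following query, adapted from the Cloudera tutorial and assuming the tweets table described earlier, returns the most-retweeted users; pasting it into the Command Text field, as in step 9, turns it into a refreshable Excel data source:
SELECT t.retweeted_screen_name,
       SUM(retweets) AS total_retweets,
       COUNT(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
             retweeted_status.text,
             MAX(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;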
Summary
Once you have successfully completed this tutorial, you should have a clearer understanding of Hadoop, specifically:
1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting
through a standard ODBC connection.
2. An operational Hadoop sandbox that can be used for training, local development, and proof of concepts that
you can navigate and explore.
3. A real-world reference model for a use case illustrating the amazing streaming capabilities in Hadoop.
4. How to model semi-structured JSON data in Hive and query it in a conventional manner.
Lastly, this exercise should leave individuals wanting to take the Hadoop experience to the next level. Independently,
you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment
analysis. You may also want to layer geospatial data into the set to provide more advanced analytics. You could
consider streaming data from other social media sites (if so, we recommend starting here). Above all, you may want
to show someone from the business what this new technology can do. By demystifying Big Data technology,
you can take your understanding and your ability to support additional business use cases to these next levels.
Java source for com/twitter/util/FileFilterExcludeTmpFiles.java (referenced in the setup steps above):
package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Excludes the temporary files Flume writes while streaming (names starting with "_"
// or "." or ending in ".tmp") so Hive/MapReduce reads only completed files.
public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
Host
OS: Windows 7 Enterprise 64-bit
Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz (2.27GHz)
Memory: 8GB (7.9GB addressable)
Disk: 300GB
Software: VM Player 3.1.2 build-301548; Microsoft Office 32-bit

Guest
OS: CentOS 6.2 Linux 64-bit
Processor: Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz
Memory: 2.98GB
Disk: 23.5GB
Software: Cloudera Manager Free Edition 4.1.1; CDH4.1.2
Cause / Resolution
1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop.
2. Permissions on /user/flume/*
3. Missing MySQL driver: cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib
Cause: Oozie attempts to add a partition that does not exist.
Error:
java.io.IOException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:350)
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:229)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:210)
  at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:195)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:393)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
  at java.security.AccessController.doPrivileged(Native Method)
References
1. http://blogs.hbr.org/cs/2012/11/the_big_data_talent_gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package
3. http://www.cloudera.com/content/cloudera/en/products/cdh.html
4. http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/
http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structureddata-with-hive/
https://github.com/cloudera/cdh-twitter-example
5. Known issue with Flume, see https://issues.apache.org/jira/browse/FLUME-1702
6. Adapted from http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
About Collaborative
Collaborative Consulting is dedicated to helping
companies optimize their existing business and
technology assets. The company is committed to
building long-term relationships and strives to be a
trusted partner with every client. Founded in 1999,
Collaborative Consulting serves clients from offices
across the United States, with headquarters in
Burlington, Massachusetts.