
Practice: Process logs with Apache Hadoop

Extract useful data from logs using Hadoop on typical Linux systems
M. Tim Jones
Independent author
Consultant

30 May 2012

Logs are an essential part of any computing system, supporting capabilities from audits to
error management. As logs grow and the number of log sources increases (such as in cloud
environments), a scalable system is necessary to efficiently process logs. This practice session
explores processing logs with Apache Hadoop from a typical Linux system.
Logs come in all shapes, but as applications and infrastructures grow, the result is a massive
amount of distributed data that's useful to mine. From web and mail servers to kernel and boot
logs, modern servers hold a rich set of information. Massive amounts of distributed data are a
perfect application for Apache Hadoop, as are log files: time-ordered, structured, textual data.
You can use log processing to extract a variety of information. One of its most common uses is to
extract errors or count the occurrence of some event within a system (such as login failures). You
can also extract some types of performance data, such as connections or transactions per second.
Other useful information includes the extraction (map) and construction of site visits (reduce) from
a web log. This analysis can also support detection of unique user visits in addition to file access
statistics.
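As a concrete (if tiny) example of that kind of event counting, the following Python sketch tallies login failures from a log read on standard input. It is illustrative only and not part of the exercises: the script name countfailures.py and the OpenSSH-style "Failed password" pattern are assumptions that vary by distribution. The exercises later split exactly this kind of counting into map and reduce steps so that it scales across many distributed logs.

#!/usr/bin/env python
# Illustrative sketch only: count login-failure events in a log read from
# standard input, for example:  python countfailures.py < /var/log/auth.log
# The "Failed password" pattern is an assumption; adjust it for your system.
import sys

PATTERN = 'Failed password'

failures = 0
for line in sys.stdin:
    if PATTERN in line:
        failures += 1

print 'login failures: %d' % failures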

Overview
About this article
You may want to read these articles before working through the exercises:

Distributed computing with Linux and Hadoop


Distributed data processing with Hadoop, Part 1: Getting started
Distributed data processing with Hadoop, Part 2: Going further
Distributed data processing with Hadoop, Part 3: Application development
Data processing with Apache Pig

These exercises give you practice in:



Getting a simple Hadoop environment up and running


Interacting with the Hadoop file system (HDFS)
Writing a simple MapReduce application
Writing a filtering Apache Pig query
Writing an accumulating Pig query

Prerequisites
To get the most from these exercises, you should have a basic working knowledge of Linux.
Some knowledge of virtual appliances is also useful for bringing up a simple environment.

Exercise 1. Get a simple Hadoop environment up and running


There are two ways to get Hadoop up and running. The first is to install the Hadoop software,
and then configure it for your environment (the simplest case is a single-node instance, in which
all daemons run in a single node). See Distributed data processing with Hadoop, Part 1: Getting
started for details.
The second and simpler way is to use Cloudera's Hadoop Demo VM (which
contains a Linux image plus a preconfigured Hadoop instance). The Cloudera virtual machine
(VM) runs on VMware, Kernel-based Virtual Machine (KVM), or VirtualBox.
Choose a method, and complete the installation. Then, complete the following task:
Verify that Hadoop is running by issuing an HDFS ls command.

Exercise 2. Interact with the HDFS


The HDFS is a special-purpose file system that manages data and replicas within a Hadoop
cluster, distributing them to compute nodes for efficient processing. Even though HDFS is a
special-purpose file system, it implements many of the typical file system commands. To retrieve
help information for Hadoop, issue the command hadoop dfs. Perform the following tasks:
Create a test subdirectory within the HDFS.
Copy a file from the local file system into the HDFS subdirectory using copyFromLocal.
For extra credit, view the file within HDFS using a hadoop dfs command.

Exercise 3. Write a simple MapReduce application


As demonstrated in Distributed data processing with Hadoop, Part 3: Application development,
writing a word count map and reduce application is simple. Using the Ruby example demonstrated
in this article, develop a Python map and reduce application, and run them on a sample set of
data. Recall that Hadoop sorts the output of map so that like words are contiguous, which provides
a useful optimization for the reducer.
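To see why that sort matters, the short Python fragment below simulates it on a few invented map-output pairs (the words are made up for illustration, not taken from the exercise data). Because equal keys become adjacent, the reducer needs only a single running count that it emits whenever the key changes.

#!/usr/bin/env python
# Simulate Hadoop's sort of map output on a few invented (word, count) pairs.
pairs = [('kernel', 1), ('eth0', 1), ('kernel', 1), ('error', 1), ('kernel', 1)]

# Sorting by key makes equal words contiguous, which is what lets the
# streaming reducer keep just one running total per word.
for word, count in sorted(pairs):
    print '%s\t%d' % (word, count)

# Output:
# error   1
# eth0    1
# kernel  1
# kernel  1
# kernel  1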

Exercise 4. Write a simple Pig query


As you saw in Data processing with Apache Pig, Pig allows you to build simple scripts that are
translated into MapReduce applications. In this exercise, you extract all log entries (from
/var/log/messages) that contain both the word kernel: and the word terminating.

Create a script that extracts all log lines with the predefined criteria.

Exercise 5. Write an aggregating Pig query


Log messages are generated by a variety of sources within a Linux system (such as the kernel or
dhclient). In this example, you want to discover the various sources that generate log messages
and the number of log messages per source.
Create a script that counts the number of log messages for each log source.

Exercise solutions
The specific output depends on your particular Hadoop installation and configuration.

Solution for Exercise 1. Get a simple Hadoop environment up and running
In Exercise 1, you perform an ls command on the HDFS. Listing 1 illustrates the proper solution.

Listing 1. Performing an ls operation on the HDFS


$ hadoop dfs -ls /
drwxrwxrwx   - hue    supergroup          0 2011-12-10 06:56 /tmp
drwxr-xr-x   - hue    supergroup          0 2011-12-08 05:20 /user
drwxr-xr-x   - mapred supergroup          0 2011-12-08 10:06 /var
$

More or fewer files might be present depending on use.

Solution for Exercise 2. Interact with the HDFS


In Exercise 2, you create a subdirectory within HDFS and copy a file into it. Note that you create
test data by dumping the kernel message buffer into a file. For extra credit, view the file within the
HDFS using the cat command (see Listing 2).

Listing 2. Manipulating the HDFS


$ dmesg > kerndata
$ hadoop dfs -mkdir /test
$ hadoop dfs -ls /test
$ hadoop dfs -copyFromLocal kerndata /test/mydata
$ hadoop dfs -cat /test/mydata
Linux version 2.6.18-274-7.1.el5 (mockbuild@builder10.centos.org)...
...
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
$

Solution for Exercise 3. Write a simple MapReduce application


In Exercise 3, you create a simple word count MapReduce application in Python. Python is actually
a great language in which to implement the word count example. You can find a useful writeup on
Python MapReduce in Writing a Hadoop MapReduce Program in Python by Michael G. Noll.

This example assumes that you performed the steps of exercise 2 (to ingest data into the HDFS).
Listing 3 provides the map application.

Listing 3. Map application in Python


#!/usr/bin/env python

import sys

# Emit a "<word><tab>1" pair for every word read from standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t1' % word

Listing 4 provides the reduce application.

Listing 4. The reduce application in Python


#!/usr/bin/env python

import sys

last_word = None
last_count = 0
cur_word = None

# Map output arrives sorted by word, so counts for a given word are contiguous.
for line in sys.stdin:
    line = line.strip()
    cur_word, count = line.split('\t', 1)
    count = int(count)
    if last_word == cur_word:
        last_count += count
    else:
        if last_word:
            print '%s\t%s' % (last_word, last_count)
        last_count = count
        last_word = cur_word

# Emit the count for the final word.
if last_word == cur_word:
    print '%s\t%s' % (last_word, last_count)

Listing 5 illustrates the process of invoking the Python MapReduce example in Hadoop.

Listing 5. Testing Python MapReduce with Hadoop


$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-file pymap.py -mapper pymap.py -file pyreduce.py -reducer pyreduce.py \
-input /test/mydata -output /test/output
...
$ hadoop dfs -cat /test/output/part-00000
...
write 3
write-combining 2
wrong. 1
your 2
zone: 2
zonelists. 1
$
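Before submitting a streaming job like this one, it is often worth sanity-checking the scripts outside of Hadoop. Assuming pymap.py and pyreduce.py are marked executable, a local pipeline such as cat kerndata | ./pymap.py | sort | ./pyreduce.py exercises the same map, sort, and reduce stages on a single machine, which makes simple script errors far easier to spot than digging through task logs.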


Solution for Exercise 4. Write a simple Pig query


In Exercise 4, you extract /var/log/messages log entries that contain both the word kernel: and
the word terminating. In this case, you use Pig in local mode to query the local file (see Listing 6).
Load the file into a Pig relation (log), filter its contents to only kernel messages, and then filter that
resulting relation for terminating messages.

Listing 6. Extracting all kernel + terminating log messages


$ pig -x local
grunt> log = LOAD '/var/log/messages';
grunt> logkern = FILTER log BY $0 MATCHES '.*kernel:.*';
grunt> logkernterm = FILTER logkern BY $0 MATCHES '.*terminating.*';
grunt> dump logkernterm
...
(Dec 8 11:08:48 localhost kernel: Kernel log daemon terminating.)
grunt>
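Because both conditions test the same field, you could also combine them into a single FILTER whose two MATCHES expressions are joined with AND; splitting the query into two relations, as shown here, simply makes each intermediate step easy to inspect with dump.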

Solution for Exercise 5. Write an aggregating Pig query


In Exercise 5, you extract the log sources and log message counts from /var/log/messages. In this
case, you create a script for the query and execute it through Pig's local mode. In Listing 7, you load
the file and parse the input using a space as a delimiter. You then assign the delimited string fields
to your named elements. Use the GROUP operator to group the messages by their source, and then
use the FOREACH operator and COUNT to aggregate your data.

Listing 7. Log sources and counts script for /var/log/messages


log = LOAD '/var/log/messages' USING PigStorage(' ') AS (month:chararray,
    day:int, time:chararray, host:chararray, source:chararray);
sources = GROUP log BY source;
counts = FOREACH sources GENERATE group, COUNT(log);
dump counts;

Listing 8 shows the result of executing the script.

Listing 8. Executing your log sources script


$ pig -x local logsources.pig
...
(init:,1)
(gconfd,12)
(kernel:,505)
(syslogd,2)
(dhclient:,91)
(localhost,1168)
(gpm[2139]:,2)
(gpm[2168]:,2)
(NetworkManager:,292)
(avahi-daemon[3345]:,37)
(avahi-daemon[3362]:,44)
(nm-system-settings:,8)
$


Resources
Learn
Distributed computing with Linux and Hadoop (Ken Mann and M. Tim Jones,
developerWorks, December 2008): Discover Apache's Hadoop, a Linux-based software
framework that enables distributed manipulation of vast amounts of data, including parallel
indexing of internet web pages.
Distributed data processing with Hadoop, Part 1: Getting started (M. Tim Jones,
developerWorks, May 2010): Explore the Hadoop framework, including its fundamental
elements, such as the Hadoop file system (HDFS), common node types, and ways to monitor
and manage Hadoop using its core web interfaces. Learn to install and configure a single-node Hadoop cluster, and delve into the MapReduce application.
Distributed data processing with Hadoop, Part 2: Going further (M. Tim Jones,
developerWorks, June 2010): Configure a more advanced setup with Hadoop in a multi-node cluster for parallel processing. You'll work with MapReduce functionality in a parallel
environment and explore command line and web-based management aspects of Hadoop.
Distributed data processing with Hadoop, Part 3: Application development (M. Tim Jones,
developerWorks, July 2010): Explore the Hadoop APIs and data flow and learn to use them
with a simple mapper and reducer application.
Data processing with Apache Pig (M. Tim Jones, developerWorks, February 2012): Pigs
are known for rooting around and digging out anything they can consume. Apache Pig does
the same thing for big data. Learn more about this tool and how to put it to work in your
applications.
Writing a Hadoop MapReduce Program in Python (Michael G. Noll, updated October 2011,
published September 2007): Learn to write a simple MapReduce program for Hadoop in the
Python programming language in this tutorial.
IBM InfoSphere BigInsights Basic Edition offers a highly scalable and powerful analytics
platform that can handle data throughput rates of up to millions of events or messages per
second.
The Open Source developerWorks zone provides a wealth of information on open source
tools and using open source technologies.
developerWorks Web development specializes in articles covering various web-based
solutions.
Stay current with developerWorks technical events and webcasts focused on a variety of IBM
products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and
tools, as well as IT industry trends.
Watch developerWorks on-demand demos ranging from product installation and setup demos
for beginners, to advanced functionality for experienced developers.
Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.
Get products and technologies
Cloudera's Hadoop Demo VM (May 2012): Start using Apache Hadoop with a set of
virtual machines that include a Linux image and a preconfigured Hadoop instance.

IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated,
tested and pre-configured, no-charge download for anyone who wants to experiment with and
learn about Hadoop.
Evaluate IBM products in the way that suits you best: Download a product trial, try a product
online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox
learning how to implement Service Oriented Architecture efficiently.
Discuss
Check out developerWorks blogs and get involved in the developerWorks community.
Get involved in the developerWorks community. Connect with other developerWorks users
while exploring the developer-driven blogs, forums, groups, and wikis.


About the author


M. Tim Jones
M. Tim Jones is an embedded firmware architect and the author of Artificial
Intelligence: A Systems Approach, GNU/Linux Application Programming (now in
its second edition), AI Application Programming (in its second edition), and BSD
Sockets Programming from a Multilanguage Perspective. His engineering background
ranges from the development of kernels for geosynchronous spacecraft to embedded
systems architecture and networking protocols development. Tim is a platform
architect with Intel and author in Longmont, Colo.
Copyright IBM Corporation 2012
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)
