Sheeba Samuel
sheeba.samuel@iiitb.org
Shwetha Muralidharan
shwetha.m@iiitb.org
Shrisha Rao
shrao@ieee.org
each of the DataNodes in the cluster. This collected information is serialized and stored in the Hadoop Distributed
File System (HDFS) in the form of log files.
The Apache Pig tool, which is a data analytics tool for
the Hadoop framework, is used to parse and analyze the
hardware-related information [3]. The log files for each
of the DataNodes which are stored in HDFS are given
as inputs to the Pig Latin [4] script. These Pig Latin
scripts are used to get the most recent values for each
hardware component considered in each of the DataNodes,
and the output so obtained is again stored in the HDFS.
Now the NameNode takes these most recent values, parses
them, and compares them with the threshold values of
the corresponding hardware component. If any hardware component's value for a DataNode is found to exceed its threshold, a message about the suspected failure is sent to that DataNode. This signalling to the DataNode works on the assumption that there is a robust network connection between the NameNode and the DataNodes.
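The threshold comparison on the NameNode can be sketched as follows. This is a minimal illustration: the metric names, threshold values, port number, and message format are assumptions for this sketch, not taken from FailMon.

```python
import socket

# Hypothetical thresholds for monitored hardware metrics; real critical
# values are vendor-specific, as the smartmontools discussion shows.
THRESHOLDS = {
    "disk_temperature": 60,      # degrees Celsius
    "reallocated_sectors": 140,  # sector count
    "tx_errors": 100,            # transmit errors per interval
}

def exceeded_thresholds(latest_values):
    """Return the metrics whose most recent value (as parsed from the
    Pig output stored in HDFS) exceeds the configured threshold."""
    return [name for name, value in latest_values.items()
            if name in THRESHOLDS and value > THRESHOLDS[name]]

def signal_datanode(addr, alerts, port=9999):
    """Notify a DataNode of its suspected failures over a plain socket;
    this assumes a robust NameNode-to-DataNode connection, as noted above."""
    with socket.create_connection((addr, port)) as s:
        s.sendall(("SUSPECTED FAILURE: " + ",".join(alerts)).encode())
```

In this sketch the comparison and the signalling are kept separate, so the NameNode only opens a connection when at least one threshold has been crossed.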
In the context of high-volume data processing, the limiting factor is the rate at which data can be transferred between nodes [5], i.e., the bandwidth, which also depends on the physical topology of the network. In
order to configure Hadoop clusters to meet varying requirements in terms of size, processing power, reliability,
availability [6], etc., it is important to keep track of the
current topology of the Hadoop network and track the
status of the network services in Hadoop.
For the network topology discovery module, we obtain the IP addresses of all the live nodes in the cluster by observing which DataNodes the executor process is able to reach and storing the addresses of those nodes in a file. This file containing the addresses of the live
nodes is given as input to the network monitoring tools,
OpenNMS and Zenmap. Zenmap provides a topology
map of the network [7]. OpenNMS monitors network
services like ICMP, SMTP, SNMP, SSH, etc., of the
Hadoop cluster [8]. It can also detect outages among the
DataNodes of Hadoop.
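The live-node discovery step above can be sketched as follows. The use of ping, the one-second timeout, and the output file name are illustrative assumptions; the paper's own check observes which DataNodes the executor process can reach.

```python
import subprocess

def live_nodes(candidates, outfile="live_nodes.txt"):
    """Probe each candidate address once and record the reachable ones
    in a file, which is then fed to Zenmap and OpenNMS."""
    alive = []
    for addr in candidates:
        # One ICMP echo request with a 1-second timeout per candidate.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            alive.append(addr)
    with open(outfile, "w") as f:
        f.write("\n".join(alive) + "\n")
    return alive
```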
The hardware failure detection module can be used for a large number of nodes in a Hadoop cluster. Since the DataNode is signalled about the suspected failures in its hardware components, the DataNode can take appropri-
I. INTRODUCTION
Apache Hadoop is an open source framework [1] used to develop scalable distributed software systems. It can scale up to a large number of nodes, each offering local computation. Hadoop performs distributed processing of large data sets on clusters of commodity hardware.
We propose an extension to the Hadoop framework which signals the nodes in the Hadoop cluster about suspected hardware failures. This is extremely useful in large clusters, where failures are quite frequent because Hadoop uses inexpensive commodity hardware. Hardware failure detection is made possible by
collecting the status of the hardware components from
each of the DataNodes in the network. This information is
analyzed and compared with existing threshold values. In
case the values are exceeded, the corresponding DataNode
is informed by the NameNode or the master node. We also
discover the network topology by generating the network
topology map of the Hadoop cluster and monitoring the
network services. Thus the robustness of the network in
the Hadoop cluster can also be analyzed through our
extension.
FailMon [2], a hardware failure monitoring package built into Hadoop, is used to collect various hardware status information from
III. IMPLEMENTATION
A separate Hadoop user account is used to start Hadoop.
The user hduser and the group hadoop are added to
each local system. SSH access is required by Hadoop to
manage its nodes. For a single-node setup of Hadoop,
it is required to configure SSH access to localhost for
C. Failure Monitoring
A Hadoop cluster is started and a failure monitoring script is run on the NameNode. The script schedules log collection tasks on all the DataNodes in the cluster. A configuration file is maintained on each of the DataNodes. This file tells the executor process what local data it has to collect and from where, and also specifies the upload path in HDFS. The information is collected from
the system log files, Hadoop log files, and various OS
hardware diagnostic utilities.
Log entries about checksum errors imply either a hard
disk failure or a memory failure. Various operating system
utilities give information about the status of the hardware
components, and ifconfig provides information about
the identity and metrics of various network interfaces
present in the system. It also provides information about
the status of the network interface. By using txerrors
and rxerrors parameters, we can obtain the transmit
and receive error rates of packets respectively.
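A small sketch of extracting these error counts from ifconfig output is shown below. The exact layout of ifconfig's output varies between net-tools versions, so this assumes the common "RX errors N" / "TX errors N" form; the sample interface text is fabricated for illustration.

```python
import re

def error_counts(ifconfig_output):
    """Return (rx_errors, tx_errors) parsed from ifconfig-style output."""
    counts = {}
    for direction in ("RX", "TX"):
        # Match either "RX errors 2" or the older "RX errors:2" layout.
        m = re.search(direction + r"\s+errors[:\s]+(\d+)", ifconfig_output)
        counts[direction] = int(m.group(1)) if m else None
    return counts["RX"], counts["TX"]

# Fabricated sample of one interface's statistics.
sample = """eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        RX packets 1024  bytes 153600
        RX errors 2  dropped 0  overruns 0  frame 0
        TX packets 980  bytes 122880
        TX errors 5  dropped 0  overruns 0  carrier 0  collisions 0
"""
```

On a DataNode the same function would be applied to the live output of `ifconfig` rather than to a captured sample.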
The file /proc/cpuinfo lists all the processors
and cores present in the system, along with information
about their status. smartmontools and lm-sensors are two other utilities available on Linux that we used to collect status information about the hardware components.
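Reading /proc/cpuinfo can be sketched as below; each logical processor appears as a "key : value" stanza separated by a blank line. The sample text in the test is fabricated, but the parsing applies unchanged to the real file.

```python
def cpu_entries(text):
    """Parse /proc/cpuinfo-style text into one dict per logical CPU."""
    cpus = []
    for stanza in text.strip().split("\n\n"):
        entry = {}
        for line in stanza.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                entry[key.strip()] = value.strip()
        if entry:
            cpus.append(entry)
    return cpus

# On a Linux DataNode this would be applied to the real file:
# cpus = cpu_entries(open("/proc/cpuinfo").read())
```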
1) Smartmontools: SMART (Self-Monitoring, Analysis and Reporting Technology) is a system to monitor the health of computer hard disk drives. It detects and reports conditions that may indicate impending failures [16]. Depending on the statistics provided by SMART, the user is advised to replace the hard drive in order to avoid loss of data. It is also used by hard disk manufacturers to discover various failures in their devices.
A variety of metrics are reported, differing across disk types. These metrics are collected by the
FailMon package to monitor the hard disk status of the
DataNodes of the Hadoop cluster.
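As a sketch of how these attributes can be read, the following parses the attribute table printed by smartmontools' `smartctl -A`. The sample output here is abridged and fabricated, since the attributes reported differ between vendors, as Table I illustrates; on a DataNode the input would come from running the tool against a real device.

```python
def smart_attributes(smartctl_output):
    """Map attribute name -> normalized value from a `smartctl -A` table."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[3])
    return attrs

# Abridged, fabricated sample of the attribute table.
sample = """ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   062    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38
"""
```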
Table I gives some sample vendor-specific critical values for the hard disk parameters.
Table I
Critical Values for Hard Disk Parameters Using Smartmontools
B. SSH access
The user hduser on the master (also known as hduser@master) should be able to connect:
to its own user account on the master, i.e., ssh master; and
to the hduser user account on the slave (also known as hduser@slave) via a passwordless SSH login [13].
The hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) has to be added to the authorized keys file of hduser@slave, i.e., that user's $HOME/.ssh/authorized_keys. This is done manually or with the SSH command hduser@master:$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave [13].
Attribute                   Hitachi   Western Digital   Toshiba
Seek Error Rate                67            0             50
Raw Read Error Rate            62           51             50
Reallocated Sector Count        5          140             50
Temperature Celsius            60           65             51
Figure 3. Zenmap
Table III
Predicted Time Taken to Parse Log Files Using FailMon Package on Large Numbers of Nodes
is experienced by the applications. This minimum overhead exists because Pig scripts are internally scheduled as
MapReduce jobs, similar to Hadoop applications.
Number of Nodes   Time (s)
0                 0
50                51.93
100               454.23
500               2242.23
1000              4477.23
A. Response Time
In order to measure the responsiveness of our implementation, the time taken to parse the log files, the
execution time of Pig in the cluster, and the execution
time of the total system were measured separately. These
values were taken by varying the number of DataNodes
in the Hadoop cluster.
1) Time Taken To Parse The Log Files Using Failmon
Package: The NameNode schedules jobs on all DataNodes
available in the Hadoop cluster. The DataNodes collect
data from various sources, such as Hadoop log files, system log files, and operating system utilities, and periodically upload it to HDFS.
Table II shows the response times for collecting the information and storing it in HDFS, depending on the number of nodes.
Table IV
Time Taken to Run Pig on the Cluster
Table II
Time Taken to Parse Log Files Using FailMon Package
Number of Nodes   Time (s)
0                 0
1                 17.23
2                 18.78
3                 23.85
4                 24.98
5                 25.65
Number of Nodes   Time (s)
0                 0
1                 29.2
2                 35.64
3                 42.9
4                 50.81
5                 63.36
Number of Nodes: 0, 50, 100, 500, 1000
Table VIII
Temperature in NameNode and DataNode at Various Time Instances

Time instant   Temperature on NameNode   Temperature on DataNode
1              75                        61
2              72                        62
3              75                        61
4              60                        61
5              64                        62
6              71                        63
7              80                        62
8              72                        63
Number of Nodes   Time (s)
1                 64.85
2                 80.19
3                 99.26
4                 122.36
5                 145.89
The probability of hardware component failures increases as the number of nodes in a cluster increases. Since
Hadoop is widely used in scalable computing systems
where the network topology is dynamic and potentially
unknown, locating hardware failures is also a challenge.
In this scenario, the hardware failure detection scheme that we propose provides a predictive ability for detecting failures and improves reliability. It also paves the way for
future work directed towards developing mechanisms to
perform smooth handoff of data and running tasks from
the signalled DataNode to another working node in the
cluster. The network discovery module generates a topology map that displays only the live nodes in our Hadoop network, and it also helps track the network services and detect recent outages. This information helps
network administrators to ensure highly available services
in Hadoop.
Table VII
Predicted Time Taken for Execution of Entire System on Large Numbers of Nodes
Number of Nodes   Time (s)
0                 0
50                245.54
100               2084.24
500               10256.24
1000              20471.24
VI. ACKNOWLEDGMENT
The predicted response times in seconds for large numbers of nodes are shown in Table VII.
REFERENCES
[1] Hadoop, The Apache Software Foundation, Dec. 2011. [Online]. Available: http://hadoop.apache.org/
[2] I. Koltsidas, K. Gupta, and P. Sarkar, "Log collection and failure monitoring in large clusters running Hadoop," IBM Almaden Research Center, July 2008.
[3] Pig Tool. [Online]. Available: http://pig.apache.org
[4] Pig Latin. [Online]. Available: http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html
[5] T. White, Hadoop: The Definitive Guide. Yahoo Press, 2010.
[6] Hadoop Tutorial - YDN. [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module7.html
[7] A. M. Marques, Zenmap, 2007. [Online]. Available: http://nmap.org/book/zenmap.html
[8] OpenNMS, OpenNMS Official Website, Jul. 1999. [Online]. Available: http://www.opennms.org/wiki/Main_Page
[9] N. Nakka, A. Agrawal, and A. Choudhary, "Predicting node failure in high performance computing systems from failure and usage logs," in Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on, May 2011, pp. 1557-1566.
[10] E. W. Fulp, G. A. Fink, and J. N. Haack, "Predicting computer system failures using support vector machines."
[11] F. Sultan, A. Bohra, Y. Pan, S. Smaldone, I. Neamtiu, P. Gallard, and L. Iftode, "Nonintrusive failure detection and recovery for internet services using backdoors."
[12] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, May 2010, pp. 1-10.
[13] M. G. Noll, Running Hadoop on Ubuntu Linux (Multi-Node Cluster), Aug. 2007, My digital moleskine. [Online]. Available: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
[14] FailMon, Hadoop Releases, Aug. 2010. [Online]. Available: http://hadoop.apache.org/common/releases.html
[15] M. G. Noll, Running Hadoop on Ubuntu Linux (Single-Node Cluster), Aug. 2007, My digital moleskine. [Online]. Available: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[16] Smartmontools. [Online]. Available: https://help.ubuntu.com/community/Smartmontools
[17] lm-sensors. [Online]. Available: https://launchpad.net/ubuntu/+source/lm-sensors/1:3.3.12ubuntu2