Abstract
In this white paper, Impetus Technologies discusses Apache Hadoop™, an open source, Java-based software framework that enables the processing of huge amounts of data through distributed data processing. It explains how correct and effective provisioning and management is the key to a successful Hadoop cluster environment, and thus helps make an individual's Hadoop working experience a pleasant one. The white paper also discusses the challenges associated with cluster set up, sharing, and management, and focuses on the benefits of automated set up, centralized management of multiple Hadoop clusters, and quick provisioning of Cloud-based Hadoop clusters.
Table of Contents
Introduction
Understanding Hadoop™ cluster related challenges
    Manual operation
    Cluster set up
    Cluster management
    Cluster sharing
    Hadoop™ compatibility and others
What is missing?
Solutions space
    Addressing operational challenges
    Addressing cluster set up challenges
    Addressing cluster management challenges
    Addressing cluster sharing challenges
    Addressing Hadoop™ compatibility related challenges
Can Hadoop™ Cluster Management tools help?
The Impetus solution
Summary
Introduction
The Hadoop™ framework offers the required support for data-intensive distributed applications. It manages and engages multiple nodes for distributed processing of large amounts of data stored locally on the individual nodes; the results produced by the individual nodes are then consolidated to generate the final output. Hadoop provides Map/Reduce APIs and works on Hadoop-compatible distributed file systems.

Hadoop sub-components and related tools, such as HBase, Hive, Pig, ZooKeeper, Mahout, etc., have specific uses and benefits associated with them. These sub-components are normally used along with Hadoop and therefore also need set up and configuration.

Understanding Hadoop™ cluster related challenges

Setting up a standalone or a pseudo-distributed cluster, or even a relatively small localized cluster, is an easy task. On the other hand, manually setting up and managing a production-level cluster in a truly distributed environment requires significant effort, particularly in the areas of cluster set up, configuration and management. It is also tedious, time consuming and repetitive in nature. Factors such as the Hadoop vendor, version, bundle type and the target environment add to the existing set up and management complexities. Different cluster modes also call for different kinds of configuration: commands and settings change with the cluster mode, increasing the challenges related to Hadoop set up and management.
Manual operation
The manual mode of execution requires a full-time, fully interactive user session and consumes a lot of time. It is also error-prone: a single mistake or omission can require the entire activity to begin again from scratch.

Interface

Another factor is the console-based interface, which is the only interface available by default for interacting with the Hadoop™ cluster. It is therefore, to some extent, also responsible for the serial execution of activities.
Cluster set up
In their simplest form, Hadoop™ bundles are simple tar files: they need to be extracted, set up and initialized. Apache Hadoop bundles (especially the tarball) do not ship with any set up support, so the default way to set up the cluster is entirely manual, following a sequence of activities that depends on the cluster mode and the Hadoop version/vendor. Cluster set up involves many complexities and variations arising from factors such as the set up environment (on-premise versus the Cloud), cluster mode, component bundle type, vendor and version. On top of these complexities, the manual, interactive and attended mode of operation adds to the challenge.
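To illustrate, the manual sequence above can be scripted so it runs unattended. The sketch below (not part of any Hadoop distribution; the helper name, paths and mode handling are assumptions for illustration) builds the ordered command list for a basic tarball-based set up, varying it by cluster mode the way a manual operator would:

```python
# Illustrative sketch: generate the shell-command sequence that a manual
# tarball-based Hadoop set up would require, so it can run unattended.
# Paths and the helper itself are assumptions, not Hadoop tooling.

def setup_commands(tarball, install_dir, mode="pseudo-distributed"):
    """Return the ordered commands for a basic tarball-based Hadoop set up."""
    cmds = [
        f"tar -xzf {tarball} -C {install_dir}",      # extract the bundle
        f"export HADOOP_HOME={install_dir}/hadoop",  # point tools at it
    ]
    if mode != "standalone":
        # Distributed and pseudo-distributed modes need HDFS formatted and
        # the daemons started; standalone mode runs everything in-process.
        cmds += [
            "$HADOOP_HOME/bin/hadoop namenode -format",
            "$HADOOP_HOME/bin/start-dfs.sh",
            "$HADOOP_HOME/bin/start-mapred.sh",
        ]
    return cmds

for c in setup_commands("hadoop-1.0.3.tar.gz", "/opt"):
    print(c)
```

The point is not the specific commands but that the mode-dependent variation, which makes manual set up error-prone, is captured once in code.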
Cluster management
Current cluster management in Hadoop™ offers limited functionality, and operations need to be carried out manually from a console-based interface. There is no feature that enables the management of multiple clusters from a single location; one needs to change the interface, or log on to different machines, in order to manage different clusters.
Cluster sharing
With the current way of operation, sharing Hadoop™ clusters across users and user groups with altogether different requirements is not just challenging, tedious and time-consuming, but to some extent also insecure.
What is missing?
After examining the challenges, it is important to understand what is missing. Once we know the missing dimensions, it is possible to address most of the challenges.

Missing dimensions:

    Operational support
        o Automation
        o Alternate, user-friendly interface
        o Monitoring and notifications support
    Set up support
    Cluster management support
    Cluster sharing mechanism

When we compare Hadoop™ with other Big Data solutions (such as Greenplum, or commercial solutions such as Aster Data), we find that those solutions offer support around the above-mentioned dimensions, which appears to be missing in Hadoop. Today, there are tools in the market that address some or most of the challenges mentioned above. These solutions primarily build the missing dimensions around Hadoop and address the various pain points. Let us now look at how these dimensions can help deal with the various challenges.
Solutions space
Addressing operational challenges
The operational challenges can be addressed using a combination of methods: automation, plus an alternate user interface with support for updates and notifications.

Automation

Automation enables unattended, quick, and error-free execution of any activity. Smart automation can take care of the various associated factors and situations in a context-aware manner. Automation ensures that the right commands are submitted; the parameters, however, may or may not be correct, as they are keyed in by users. With an input-validating interface it is possible to validate user inputs and ensure that only the right parameters are used.

Using an alternate interface

As discussed earlier, the default console-based interface brings several limitations, such as serial execution and interactive working. It is possible to overcome this problem by adopting a user-friendly GUI as an alternate interface that additionally supports configuration, input validation and automation, and at the same time runs several activities in parallel. An alternate, friendly user interface helps in accessing Hadoop™ functionality and operations in a streamlined manner. Impetus strongly believes that the operational challenges associated with Hadoop clusters can be addressed to a great extent by using an alternate interface that supports automation and provides parallel working support. Thus automation and an alternate interface together offer an easier and better Hadoop working environment.
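The kind of input validation such an interface could perform might look like the following sketch. The checks and their thresholds are invented for the example; they are not rules imposed by Hadoop itself:

```python
# Minimal sketch of the up-front validation an alternate GUI could run
# before any command reaches the cluster. The rules shown (name required,
# replication range, hostname format) are illustrative assumptions.

def validate_inputs(params):
    """Return a list of problems; an empty list means the inputs look safe."""
    errors = []
    if not params.get("cluster_name", "").strip():
        errors.append("cluster name must not be empty")
    replication = params.get("replication", 3)
    if not (1 <= replication <= 10):
        errors.append("replication factor out of range (1-10)")
    for host in params.get("slaves", []):
        if " " in host:
            errors.append(f"invalid hostname: {host!r}")
    return errors
```

Rejecting bad parameters before submission is what turns "the right command" into "the right command with the right inputs".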
Provisioning the Cloud-based Hadoop™ cluster

The complexities of provisioning a Cloud-based Hadoop cluster arise primarily from manual operation, which involves steps such as accessing the Cloud provider's interface to launch the required number of nodes with the required hardware configurations, and providing inputs for key pairs, security settings, machine images, etc. One also needs to open or unblock the required ports, manually collect the individual node IPs, and add all these IP addresses/hostnames to the Hadoop slave files. After using the Cloud cluster, one needs to manually terminate all the machines sequentially, by individually selecting them on the Cloud interface.

If the cluster size is small, all these activities can be carried out easily. However, performing them manually on a large-sized cluster is cumbersome and may lead to errors, and one continuously needs to switch between the Cloud provider's interface and the Hadoop management interface.

Bringing automation into the picture can ease all these activities and help save time and effort. One can incorporate automation just by adding simple scripts, by using the Cloud provider's exposed APIs, or alternatively by using generic Cloud APIs such as JCloud, Simple Cloud, LibCloud, DeltaCloud, etc. Cloudera CDH-2 scripts can help launch instances on the Cloud and then enable setting up Hadoop over the launched nodes. The case of Whirr, which uses the JCloud API in the background, is somewhat similar.
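One concrete piece of the bookkeeping that automation removes is collecting the launched nodes' addresses and writing them into Hadoop's host files. In the sketch below, the node IPs stand in for what a Cloud API would return after launch; the masters/slaves file names follow the classic Hadoop convention, while the helper itself is hypothetical:

```python
# Illustrative sketch: given the addresses of freshly launched Cloud nodes
# (as returned by a provider API or a generic library), build the contents
# of Hadoop's masters and slaves files instead of collecting IPs by hand.

def hadoop_host_files(node_ips):
    """First node becomes the master; the remaining nodes are workers."""
    master, workers = node_ips[0], node_ips[1:]
    return {
        "masters": master + "\n",                # one master hostname/IP
        "slaves": "\n".join(workers) + "\n",     # one worker per line
    }

files = hadoop_host_files(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

For a 100-node cluster this step is identical to the 3-node case, which is exactly where manual collection breaks down.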
account may have some exclusive privileges that are now available to a broad category of all cluster users, regardless of their actual requirement. In the second approach, one needs to create separate OS-level user accounts on the system (in some cases, even on each node of the cluster) with restricted privileges. This is a complex as well as time-consuming task: you not only have to create and set up the accounts, but also need to maintain and update them as requirements change over time.

Impetus strongly suggests using role-based cluster sharing through the alternate UI. This offers a cleaner way to share clusters without compromising on security. Some solutions not only allow you to control role-based access to the various cluster management functionalities, they even offer a way to authenticate users and their roles against a valid existing external user authentication system. One benefit of this method is that users need not be created at the per-machine or OS level; rather, they can be created at the solution level, or even reused from existing domain-level users. Thus, it is relatively easy to manage and control users through the admin interface of the solution. Furthermore, based on requirements, specific roles can be created on-the-fly and assigned to specific user accounts in order to restrict or provide access to specified functionalities for individual users.

Cluster sharing has definite associated benefits. Furthermore, if multiple shared clusters can be managed from a single centralized location, without switching the interface or logging on to multiple machines, the entire task becomes even easier. It is easy to manage and fine-tune a single shared cluster; users and back-ups, too, are easy to manage. While working with a shared cluster, all cluster users get performance benefits.
When compared with non-shared clusters running on individual machines, one saves a lot of the time required to set up, manage and troubleshoot local clusters on individual machines. Performance figures obtained from local clusters running on individual user machines are also not a true measure of cluster performance, as the hardware of individual machines is not always the best configuration.
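The solution-level, role-based access described above can be reduced to a very small core. The roles, permissions and function names below are invented for the example; real products would load them from an admin interface or an external authentication system:

```python
# Toy illustration of solution-level role-based access control: roles and
# their permission sets live in the application, not as OS user accounts.
# The roles and action names here are assumptions made for the example.

ROLE_PERMISSIONS = {
    "admin":    {"start_cluster", "stop_cluster", "add_node", "run_job"},
    "operator": {"start_cluster", "stop_cluster", "run_job"},
    "analyst":  {"run_job"},
}

def is_allowed(role, action):
    """Check a requested action against the role's permission set."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Because the mapping is plain data, a new role can be added on-the-fly without touching any machine in the cluster, which is the practical advantage over per-node OS accounts.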
Addressing Hadoop™ compatibility related challenges

Sometimes, given configurations may not be supported in certain versions. Problems may also arise due to multiple vendors and vendor-specific features. There can also be complexities related to bundle formats (tarball and RPM), as their set up and folder locations (bin/conf) differ. Cluster modes are another factor that demands suitable changes in configuration and command execution. Security is available as a separate patch and needs customized configuration. Issues can also crop up due to vendor-specific solutions such as SPOF/HA and compatible file systems.

It is possible to find work-arounds that partially address the compatibility challenges; a complete solution may not be possible, as these problems are the result of the underlying technology. One can address compatibility problems by adopting Hadoop Cluster Management tools, which give you the option of replacing incompatible bundles with suitable ones. This primarily ensures that all nodes within the cluster have the same version of Hadoop and the respective components installed. Other factors, such as version, vendor and bundle format, can also be handled to a great extent by using any Hadoop cluster management tool that provides context-aware component set up and management support. Such a tool can take care of differences in file and folder names/locations, command changes, and configuration differences in accordance with the bundle format, cluster mode and vendor.
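"Context-aware" here largely means resolving the same logical resource to different locations depending on the bundle. A minimal sketch, assuming a tarball keeps its configuration inside the extracted tree while an RPM places it system-wide (the RPM path is a common packaging convention, used here as an assumption):

```python
# Sketch of the context-aware lookup a cluster-management tool performs:
# the same configuration directory lives in different places depending on
# the bundle format. The exact paths are illustrative assumptions.

def conf_dir(bundle_format, install_dir="/opt/hadoop"):
    """Resolve the configuration directory for a given bundle format."""
    if bundle_format == "tarball":
        return f"{install_dir}/conf"   # conf/ inside the extracted tree
    if bundle_format == "rpm":
        return "/etc/hadoop"           # system-wide location for packages
    raise ValueError(f"unknown bundle format: {bundle_format}")
```

Multiplied across commands, daemons, versions and vendors, tables of this kind are what let one tool manage heterogeneous bundles uniformly.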
Can Hadoop™ Cluster Management tools help?

Cluster management tools help in analyzing the impact of cluster size against different load patterns, and then enable launching and resizing the cluster on-the-fly. Among the tools currently available for the effective set up and management of Hadoop clusters are Amazon's Elastic MapReduce, Whirr, Cloudera SCM, and Impetus Ankush.
The Impetus solution

With Impetus Ankush, component set up and pre-dependencies, like Java and password-less SSH set up, are undertaken in an automated fashion.

Figure: Ankush (Impetus HCM Tool)
You can use Ankush to set up and manage local as well as Cloud-based clusters. A web application bundled as a war file, the solution is deployable even at the user level, and offers anytime-anywhere access to cluster functionalities through its web-based interface. Ankush is Cloud independent, giving you the option to launch the cluster on other compatible Clouds, and it offers a way to quickly apply configuration changes across all the clusters to leverage performance benefits. According to Impetus, Ankush helped the company reduce cluster set up time by 60 percent. Finally, the solution enables bundle set up optimization across cluster nodes using parallelism and bundle re-use.
Summary
In conclusion, for effective Hadoop™ cluster management, automation facilitates quick and error-free execution of activities. It can be applied to make execution non-interactive and free from human intervention, and it saves considerable time, effort and cost in cluster set up. For quickly setting up a cluster on the Cloud, all you need to do is add automation, either through simple scripts or by using Cloud APIs (provider-specific exposed APIs or generic APIs). Another important takeaway is that adopting a user-friendly GUI as an alternate interface can help address cluster-sharing problems; it also supports automation and helps execute activities in parallel. It must be reiterated that Hadoop is still evolving and is yet to reach maturity, so Hadoop compatibility issues can only be addressed partially. Lastly, using a suitable Hadoop Cluster Management tool can enable organizations to deal with the pain areas associated with cluster set up, management, and sharing.
About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: New Delhi, Bangalore, Indore, Hyderabad
Visit: www.impetus.com
Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.