Abstract
In this white paper, Impetus Technologies discusses Apache Hadoop™, an open source, Java-based software framework that enables the processing of huge amounts of data through distributed data processing. It explains how correct and effective provisioning and management is the key to a successful Hadoop cluster environment, and thus helps make an individual's Hadoop working experience a pleasant one. The white paper also discusses the challenges associated with cluster set up, sharing, and management, and focuses on the benefits of automated set up, centralized management of multiple Hadoop clusters, and quick provisioning of Cloud-based Hadoop clusters.
Table of Contents
Introduction
Understanding Hadoop™ cluster related challenges
    Manual operation
    Cluster set up
    Cluster management
    Cluster sharing
    Hadoop™ compatibility and others
What is missing?
Solutions space
    Addressing operational challenges
    Addressing cluster set up challenges
    Addressing cluster management challenges
    Addressing cluster sharing challenges
    Addressing Hadoop™ compatibility related challenges
Can Hadoop™ Cluster Management tools help?
The Impetus solution
Summary
Introduction
The Hadoop™ framework offers the required support for data-intensive distributed applications. It manages and engages multiple nodes for distributed processing of large amounts of data stored locally on the individual nodes; the results produced by the individual nodes are then consolidated to generate the final output. Hadoop provides Map/Reduce APIs and works on Hadoop-compatible distributed file systems.

Hadoop sub-components and related tools, such as HBase, Hive, Pig, ZooKeeper, Mahout, etc., have specific uses and benefits associated with them. These sub-components are normally used along with Hadoop and therefore also need set up and configuration.

Understanding Hadoop™ cluster related challenges

Setting up a standalone or a pseudo-distributed cluster, or even a relatively small localized cluster, is an easy task. On the other hand, manually setting up and managing a production-level cluster in a truly distributed environment requires significant effort, particularly in the areas of cluster set up, configuration and management. It is also tedious, time consuming and repetitive in nature. Factors such as the Hadoop vendor, version, bundle type and the target environment add to the existing set up and management complexities. Different cluster modes also call for different kinds of configuration: commands and settings change with the cluster mode, increasing the challenges related to Hadoop set up and management.
Manual operation
The manual mode of execution requires a full-time, fully interactive user session and consumes a lot of time. It is also error-prone: a single mistake or omission can require the entire activity to begin again from scratch.

Interface

Another factor is the console-based interface, which is the only interface available by default for interacting with the Hadoop™ cluster. It is therefore, to some extent, also responsible for the serial execution of activities.
Cluster set up
In their simplest form, Hadoop™ bundles are simple tar files: they need to be extracted, set up and initialized. Apache Hadoop bundles (especially the tarball) do not ship with any set up support, so the default way to set up the cluster is entirely manual, following a sequence of activities that depends on the cluster mode and the Hadoop version/vendor. Cluster set up involves many complexities and variations arising from factors such as the set up environment (on-premise versus the Cloud), cluster mode, component bundle type, vendor and version. On top of these complexities, the manual, interactive and attended mode of operation adds to the challenge.
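To illustrate, the manual sequence above can be scripted so it runs unattended. The sketch below (not part of any Hadoop distribution; the helper name, paths and mode handling are assumptions for illustration) builds the ordered command list for a basic tarball-based set up, varying it by cluster mode the way a manual operator would:

```python
# Illustrative sketch: generate the shell-command sequence that a manual
# tarball-based Hadoop set up would require, so it can run unattended.
# Paths and the helper itself are assumptions, not Hadoop tooling.

def setup_commands(tarball, install_dir, mode="pseudo-distributed"):
    """Return the ordered commands for a basic tarball-based Hadoop set up."""
    cmds = [
        f"tar -xzf {tarball} -C {install_dir}",      # extract the bundle
        f"export HADOOP_HOME={install_dir}/hadoop",  # point tools at it
    ]
    if mode != "standalone":
        # Distributed and pseudo-distributed modes need HDFS formatted and
        # the daemons started; standalone mode runs everything in-process.
        cmds += [
            "$HADOOP_HOME/bin/hadoop namenode -format",
            "$HADOOP_HOME/bin/start-dfs.sh",
            "$HADOOP_HOME/bin/start-mapred.sh",
        ]
    return cmds

for c in setup_commands("hadoop-1.0.3.tar.gz", "/opt"):
    print(c)
```

The point is not the specific commands but that the mode-dependent variation, which makes manual set up error-prone, is captured once in code.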
Cluster management
Current cluster management in Hadoop™ offers limited functionality, and operations need to be carried out manually from a console-based interface. There is no feature that enables the management of multiple clusters from a single location; one needs to change the interface, or log on to different machines, in order to manage different clusters.
Cluster sharing
With the current way of operation, sharing Hadoop™ clusters across users and user groups with altogether different requirements is not just challenging, tedious and time-consuming, but to some extent also insecure.
What is missing?
After examining the challenges, it is important to understand what is missing. Once we know the missing dimensions, it is possible to address most of the challenges.

Missing dimensions:

    Operational support
        o Automation
        o Alternate, user-friendly interface
        o Monitoring and notifications support
    Set up support
    Cluster management support
    Cluster sharing mechanism

When we compare Hadoop™ with other Big Data solutions (such as Greenplum, or commercial solutions such as Aster Data), we find that those solutions offer support around the above-mentioned dimensions, which appears to be missing in Hadoop. Today, there are tools in the market that address some or most of the challenges mentioned above. These solutions primarily build the missing dimensions around Hadoop and address the various pain points. Let us now look at how these dimensions can help deal with the various challenges.
Solutions space
Addressing operational challenges
The operational challenges can be addressed using a combination of methods: automation, plus an alternate user interface with support for updates and notifications.

Automation

Automation enables unattended, quick, and error-free execution of any activity. Smart automation can take care of the various associated factors and situations in a context-aware manner. Automation ensures that the right commands are submitted; the parameters, however, may or may not be correct, as they are keyed in by users. With an input-validating interface it is possible to validate user inputs and ensure that only the right parameters are used.

Using an alternate interface

As discussed earlier, the default console-based interface brings several limitations, such as serial execution and interactive working. It is possible to overcome this problem by adopting a user-friendly GUI as an alternate interface that additionally supports configuration, input validation and automation, and at the same time runs several activities in parallel. An alternate, friendly user interface helps in accessing Hadoop™ functionality and operations in a streamlined manner. Impetus strongly believes that the operational challenges associated with Hadoop clusters can be addressed to a great extent by using an alternate interface that supports automation and provides parallel working support. Thus automation and an alternate interface together offer an easier and better Hadoop working environment.
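The kind of input validation such an interface could perform might look like the following sketch. The checks and their thresholds are invented for the example; they are not rules imposed by Hadoop itself:

```python
# Minimal sketch of the up-front validation an alternate GUI could run
# before any command reaches the cluster. The rules shown (name required,
# replication range, hostname format) are illustrative assumptions.

def validate_inputs(params):
    """Return a list of problems; an empty list means the inputs look safe."""
    errors = []
    if not params.get("cluster_name", "").strip():
        errors.append("cluster name must not be empty")
    replication = params.get("replication", 3)
    if not (1 <= replication <= 10):
        errors.append("replication factor out of range (1-10)")
    for host in params.get("slaves", []):
        if " " in host:
            errors.append(f"invalid hostname: {host!r}")
    return errors
```

Rejecting bad parameters before submission is what turns "the right command" into "the right command with the right inputs".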
Provisioning the Cloud-based Hadoop™ cluster

The complexities of provisioning a Cloud-based Hadoop cluster arise primarily from manual operation, which involves steps such as accessing the Cloud provider's interface to launch the required number of nodes with the required hardware configurations, and providing inputs for key pairs, security settings, machine images, etc. One also needs to open or unblock the required ports, manually collect the individual node IPs, and add all these IP addresses/hostnames to the Hadoop slave files. After using the Cloud cluster, one needs to manually terminate all the machines sequentially, by individually selecting them on the Cloud interface.

If the cluster size is small, all these activities can be carried out easily. However, performing them manually on a large-sized cluster is cumbersome and may lead to errors, and one continuously needs to switch between the Cloud provider's interface and the Hadoop management interface.

Bringing automation into the picture can ease all these activities and help save time and effort. One can incorporate automation just by adding simple scripts, by using the Cloud provider's exposed APIs, or alternatively by using generic Cloud APIs such as JCloud, Simple Cloud, LibCloud, DeltaCloud, etc. Cloudera CDH-2 scripts can help launch instances on the Cloud and then enable setting up Hadoop over the launched nodes. The case of Whirr, which uses the JCloud API in the background, is somewhat similar.
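One concrete piece of the bookkeeping that automation removes is collecting the launched nodes' addresses and writing them into Hadoop's host files. In the sketch below, the node IPs stand in for what a Cloud API would return after launch; the masters/slaves file names follow the classic Hadoop convention, while the helper itself is hypothetical:

```python
# Illustrative sketch: given the addresses of freshly launched Cloud nodes
# (as returned by a provider API or a generic library), build the contents
# of Hadoop's masters and slaves files instead of collecting IPs by hand.

def hadoop_host_files(node_ips):
    """First node becomes the master; the remaining nodes are workers."""
    master, workers = node_ips[0], node_ips[1:]
    return {
        "masters": master + "\n",                # one master hostname/IP
        "slaves": "\n".join(workers) + "\n",     # one worker per line
    }

files = hadoop_host_files(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

For a 100-node cluster this step is identical to the 3-node case, which is exactly where manual collection breaks down.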
account may have some exclusive privileges that are now available to a broad category of all cluster users, regardless of their actual requirement. In the second approach, one needs to create separate OS-level user accounts on the system (in some cases, even on each node of the cluster) with restricted privileges. This is a complex as well as time-consuming task: you not only have to create and set up the accounts, but also need to maintain and update them as requirements change over time.

Impetus strongly suggests using role-based cluster sharing through the alternate UI. This offers a cleaner way to share clusters without compromising on security. Some solutions not only allow you to control role-based access to the various cluster management functionalities, they even offer a way to authenticate users and their roles against a valid existing external user authentication system. One benefit of this method is that users need not be created at the per-machine or OS level; rather, they can be created at the solution level, or even reused from existing domain-level users. Thus, it is relatively easy to manage and control users through the admin interface of the solution. Furthermore, based on requirements, specific roles can be created on-the-fly and assigned to specific user accounts in order to restrict or provide access to specified functionalities for individual users.

Cluster sharing has definite associated benefits. Furthermore, if multiple shared clusters can be managed from a single centralized location, without switching the interface or logging on to multiple machines, the entire task becomes even easier. It is easy to manage and fine-tune a single shared cluster; users and back-ups, too, are easy to manage. While working with a shared cluster, all cluster users get performance benefits.
When compared with non-shared clusters running on individual machines, one saves a lot of the time required to set up, manage and troubleshoot local clusters on individual machines. Performance figures obtained from local clusters running on individual user machines are also not a true measure of cluster performance, as the hardware of individual machines is not always the best configuration.
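The solution-level, role-based access described above can be reduced to a very small core. The roles, permissions and function names below are invented for the example; real products would load them from an admin interface or an external authentication system:

```python
# Toy illustration of solution-level role-based access control: roles and
# their permission sets live in the application, not as OS user accounts.
# The roles and action names here are assumptions made for the example.

ROLE_PERMISSIONS = {
    "admin":    {"start_cluster", "stop_cluster", "add_node", "run_job"},
    "operator": {"start_cluster", "stop_cluster", "run_job"},
    "analyst":  {"run_job"},
}

def is_allowed(role, action):
    """Check a requested action against the role's permission set."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Because the mapping is plain data, a new role can be added on-the-fly without touching any machine in the cluster, which is the practical advantage over per-node OS accounts.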
Addressing Hadoop™ compatibility related challenges

Sometimes, given configurations may not be supported in certain versions. Problems may also arise due to multiple vendors and vendor-specific features. There can also be complexities related to bundle formats (tarball and RPM), as their set up and folder locations (bin/conf) differ. Cluster modes are another factor that demands suitable changes in configuration and command execution. Security is available as a separate patch and needs customized configuration. Issues can also crop up due to vendor-specific solutions such as SPOF/HA and compatible file systems.

It is possible to find work-arounds that partially address the compatibility challenges; a complete solution may not be possible, as these problems are the result of the underlying technology. One can address compatibility problems by adopting Hadoop Cluster Management tools, which give you the option of replacing incompatible bundles with suitable ones. This primarily ensures that all nodes within the cluster have the same version of Hadoop and the respective components installed. Other factors, such as version, vendor and bundle format, can also be handled to a great extent by using any Hadoop cluster management tool that provides context-aware component set up and management support. Such a tool can take care of differences in file and folder names/locations, command changes, and configuration differences in accordance with the bundle format, cluster mode and vendor.
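"Context-aware" here largely means resolving the same logical resource to different locations depending on the bundle. A minimal sketch, assuming a tarball keeps its configuration inside the extracted tree while an RPM places it system-wide (the RPM path is a common packaging convention, used here as an assumption):

```python
# Sketch of the context-aware lookup a cluster-management tool performs:
# the same configuration directory lives in different places depending on
# the bundle format. The exact paths are illustrative assumptions.

def conf_dir(bundle_format, install_dir="/opt/hadoop"):
    """Resolve the configuration directory for a given bundle format."""
    if bundle_format == "tarball":
        return f"{install_dir}/conf"   # conf/ inside the extracted tree
    if bundle_format == "rpm":
        return "/etc/hadoop"           # system-wide location for packages
    raise ValueError(f"unknown bundle format: {bundle_format}")
```

Multiplied across commands, daemons, versions and vendors, tables of this kind are what let one tool manage heterogeneous bundles uniformly.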
Can Hadoop™ Cluster Management tools help?

Cluster management tools help in analyzing the impact of cluster size against different load patterns, and then enable launching and resizing the cluster on-the-fly. Among the tools currently available for the effective set up and management of Hadoop clusters are Amazon's Elastic MapReduce, Whirr, Cloudera SCM, and Impetus Ankush.
The Impetus solution

With Impetus Ankush, component set up and pre-dependencies, like Java and password-less SSH set up, are undertaken in an automated fashion.

Figure: Ankush (Impetus HCM Tool)
You can use Ankush to set up and manage local as well as Cloud-based clusters. A web application bundled as a war file, the solution is deployable even at the user level, and offers anytime-anywhere access to cluster functionalities through its web-based interface. Ankush is Cloud independent, giving you the option to launch the cluster on other compatible Clouds, and it offers a way to quickly apply configuration changes across all the clusters to leverage performance benefits. According to Impetus, Ankush helped the company reduce cluster set up time by 60 percent. Finally, the solution enables bundle set up optimization across cluster nodes using parallelism and bundle re-use.
Summary
In conclusion, for effective Hadoop™ cluster management, automation facilitates quick and error-free execution of activities. It can be applied to make execution non-interactive and free from human intervention, and it saves considerable time, effort and cost in cluster set up. For quickly setting up a cluster on the Cloud, all you need to do is add automation, either through simple scripts or by using Cloud APIs (provider-specific exposed APIs or generic APIs). Another important takeaway is that adopting a user-friendly GUI as an alternate interface can help address cluster-sharing problems; it also supports automation and helps execute activities in parallel. It must be reiterated that Hadoop is still evolving and is yet to reach maturity, so Hadoop compatibility issues can only be addressed partially. Lastly, using a suitable Hadoop Cluster Management tool can enable organizations to deal with the pain areas associated with cluster set up, management, and sharing.
About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media among others.

Impetus Technologies, Inc.
5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: New Delhi, Bangalore, Indore, Hyderabad
Visit: www.impetus.com
Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.