
White Paper

MapR: The Industry's Most Dependable Hadoop Platform

Hadoop in the Enterprise:


Maximizing Big Data Benefits with MapR and Informatica

Table of Contents
Introduction
Hadoop: A Strategic Data Analytics Platform
Informatica with MapR
A Better Hadoop: Additional Enhancements in MapR's Distribution
Summary

Introduction
The volume, velocity and variety of data are all growing relentlessly, and organizations are struggling to find the tools, talent and time to get value from data cost-effectively. Integrating Big Transaction Data with Big Interaction Data while leveraging Big Data processing technologies like Hadoop is particularly challenging. Informatica offers the industry's leading independent data integration platform, which enables organizations to maximize the return on Big Data and drive top business imperatives. Informatica is also integrated with Hadoop, which is purpose-built for processing Big Data effectively and affordably, and with MapR Technologies' distribution for Hadoop, which improves performance, scalability, reliability and ease of use. This white paper outlines how the combination of Informatica's Data Integration platform and MapR's distribution for Hadoop offers powerful new capabilities for integrating and processing Big Data more efficiently and cost-effectively than ever before.

2012 MapR, Inc. Confidential. All Rights Reserved.


Hadoop: A Strategic Data Analytics Platform


Hadoop provides a way to capture, organize, store, search, share, and analyze disparate data sources across a large cluster of commodity servers. Hadoop is designed to scale up from dozens to thousands of servers, each offering local computation and storage. MapR Technologies has advanced the Hadoop state of the art with major enhancements that overcome significant limitations of other Hadoop distributions, making Hadoop more enterprise-class in its operation, performance, scalability and reliability, as well as easier to integrate into the enterprise.

One major enhancement MapR has made involves re-architecting the Hadoop Distributed File System (HDFS) to provide full random read/write semantics, high availability, and direct access through NFS. These innovations overcome the many limitations of HDFS, including its batch-oriented data management and movement, its lack of random read/write file access by multiple users/processes, and the requirement to close files before new updates can be read.

In addition to overcoming HDFS limitations and improving data protection, Direct Access NFS affords some other significant advantages. Lockless storage with random reads and writes enables simultaneous access to data in near real-time, substantially improving performance. Any remote client can simply mount the cluster, and application servers can then write their data and log files directly into the cluster, rather than writing first to direct- or network-attached storage. Existing applications and workflows can use standard NFS to access the Hadoop cluster to manipulate data, and optionally take advantage of the MapReduce framework for parallel processing. Files in the cluster can also be modified directly using ordinary text editors, command-line tools, UNIX applications and utilities, or other development environments.
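As a minimal sketch of what access through a standard NFS mount implies for an application, the Python below appends log records and then updates one of them in place using nothing but ordinary file I/O. The mount location (read from a hypothetical CLUSTER_MOUNT environment variable), the file name and the record layout are all assumptions made for illustration; none of them come from MapR's documentation. When the variable is not set, the sketch falls back to a local temporary directory so that it runs anywhere.

```python
import os
import tempfile

# Hypothetical mount point: on a real deployment this would be the directory
# where the cluster has been NFS-mounted. The temp-dir fallback is only there
# so the sketch can run without a cluster.
MOUNT = os.environ.get("CLUSTER_MOUNT") or tempfile.mkdtemp()


def append_log(path, line):
    """Append one log record using ordinary file I/O."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def overwrite_at(path, offset, data):
    """Random write: replace bytes in place, starting at the given offset."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Random read: return `length` bytes starting at the given offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


log_path = os.path.join(MOUNT, "app.log")
append_log(log_path, "status=PENDING order=1001")
append_log(log_path, "status=PENDING order=1002")

# Update the first record in place; the new value is padded to the same
# length as "PENDING" so the rest of the file does not shift.
overwrite_at(log_path, len("status="), b"DONE   ")
first_record = read_at(log_path, 0, 25)
print(first_record)
```

The in-place update is the step the text contrasts with HDFS, where files are written in batch and must be closed before new updates can be read.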

Informatica with MapR


The combination of Informatica's Data Integration platform with MapR's Distribution for Hadoop enables organizations to access, ingest, parse and process the full range of structured and unstructured data (including messaging streams) with greater performance, scale and dependability than ever before.

Leveraging MapR's Direct Access NFS, Informatica's Ultra Messaging can stream messages directly into the MapR cluster to be retained and processed via MapReduce. Both Ultra Messaging and MapR feature parallel architectures with HA (no single points of failure) and best-in-class performance, making the combination ideal for production deployments. Because of the limitations of HDFS, other distributions cannot support Ultra Messaging streaming.

Informatica's Data Replication and FastClone provide high-performance transaction updates and data loading from different hardware platforms and data sources into the MapR cluster for analysis via MapReduce or Hive. The data is loaded into the MapR cluster in near real-time or on a scheduled basis, whereas other Hadoop distributions and database connectors provide much lower throughput and are limited to one-time table dumps and batch loading.

Commercial integration of the Informatica Data Integration platform with MapR's distribution for Hadoop includes:
Bi-directional data integration with Informatica PowerExchange
Near real-time and snapshot replication using Informatica Data Replication and Informatica FastClone
Parallel parsing and transformation on MapR using Informatica HParser
Data streaming using Informatica Ultra Messaging


Informatica's HParser Community Edition (included in the MapR distribution) helps create an easy-to-use integrated data environment (IDE) that enables customers to visually design data parsing transformations for industry-standard formats (e.g. FIX, SWIFT, ACORD, HL7, EDI, and many more) and popular document formats (e.g. MS Office, PDF, etc.), as well as complex files (e.g. logs, Omniture, XML and JSON), which can then be executed in parallel in the Hadoop cluster. The performance advantages of MapR, combined with the efficiency of HParser, allow users to perform data parsing and transformations with higher performance and lower hardware costs compared to other options.

PowerExchange for Hadoop makes it easier for non-programmers to move transaction and interaction data between a MapR cluster and other databases and data warehouses, without hand-coding. MapR's Direct Access NFS interface also enables users to leverage Informatica's full range of data sources and transformations with the Hadoop environment.
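The idea of running one parsing transformation in parallel over many pieces of input can be illustrated with a small stand-in written in plain Python. This is not HParser and does not use any Informatica or Hadoop API: the key=value log format, the function names and the thread pool that plays the role of the cluster's map tasks are all assumptions made for the sketch.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_record(line):
    """Parse one 'key=value key=value' log line into a dict; None if malformed."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            return None  # token without '=' or without a key: reject the line
        fields[key] = value
    return fields or None  # an empty line yields no record


def parse_split(lines):
    """Map step: parse every line of one input split, skipping malformed lines."""
    parsed = (parse_record(line) for line in lines)
    return [json.dumps(rec, sort_keys=True) for rec in parsed if rec is not None]


def parse_in_parallel(splits, workers=4):
    """Apply the same parser to each split independently, keeping split order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(parse_split, splits))
    return [row for part in parts for row in part]


# Two hypothetical input splits, as a cluster might hand them to separate tasks.
splits = [
    ["ts=2012-05-01T10:00:00 level=INFO msg=start", "garbage line"],
    ["ts=2012-05-01T10:00:05 level=ERROR msg=timeout"],
]
rows = parse_in_parallel(splits)
for row in rows:
    print(row)
```

Because each split is parsed without reference to any other, the same function can be applied to as many splits as there are workers, which is the property that lets a parsing transformation be spread across the nodes of a cluster.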

A Better Hadoop: Additional Enhancements in MapRs Distribution


Direct Access NFS also facilitates support for volumes, snapshots and mirroring for all data contained within the Hadoop cluster, further improving reliability without requiring any extraordinary measures. Volumes make clustered data easier to both access and manage by grouping related files and directories into a single tree structure that can be more readily organized, administered and secured. Snapshots can be taken periodically to create drag-and-drop recovery points, and mirroring extends data protection to satisfy recovery time objectives. Local mirroring provides high performance for highly accessed data, while remote mirroring provides business continuity across multiple data centers, as well as integration between on-premises and private clouds.

Another major enhancement MapR made was to eliminate single points of failure in the critical NameNode and JobTracker functions. MapR's Distributed NameNode HA (High Availability) distributes the file metadata on ordinary DataNodes throughout the cluster. In the extreme, all DataNodes might store and serve a portion of the file metadata. Every portion is then persisted to disk (with the node's data) and also replicated to at least two other nodes to increase tolerance to multiple simultaneous node failures. This eliminates the need, present in other distributions, to continuously back up the Primary NameNode to either a Checkpoint Node (previously called the Secondary NameNode) or a Backup Node. MapR's JobTracker offers similar resiliency, with the ability to continue all tasks with no interruption or data loss in the event of a failure. Without such transparent failover, it is necessary to restart the affected job(s) from the beginning.

MapR's Distributed NameNode HA architecture also improves scalability and performance compared to configurations with a single Primary NameNode. Even in a server configured with copious amounts of memory, a single NameNode is normally limited to only about 70 million files. With MapR's Distributed NameNode HA architecture, by contrast, the cluster scales in a linear fashion with the number of DataNodes, and can therefore contain a virtually unlimited number of files. The performance advantage derives from the elimination of a Primary NameNode, which can become a bottleneck even in relatively small clusters. By distributing the file metadata across multiple DataNodes throughout the cluster, performance also scales in a linear fashion with the size of the cluster.
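To make the replication argument concrete, the following Python sketch places the metadata for each file path on one node plus two others and then looks it up while some nodes are down. The hash-based placement, the node names and the helper functions are assumptions chosen for illustration; the paper does not describe how MapR actually assigns metadata to nodes.

```python
import hashlib

REPLICAS = 3  # one copy plus "at least two other nodes", as described in the text


def holders(path, nodes, replicas=REPLICAS):
    """Return the nodes holding the metadata for `path`, first holder first."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas, len(nodes))
    # Illustrative placement: the hashed node and its successors in the list.
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def lookup(path, nodes, failed):
    """Return the first live holder for `path`, or None if every copy is down."""
    for node in holders(path, nodes):
        if node not in failed:
            return node
    return None


nodes = [f"node{i:02d}" for i in range(10)]   # hypothetical 10-node cluster
path = "/data/logs/2012/05/app.log"           # hypothetical file path

copies = holders(path, nodes)
# Simulate two simultaneous failures among the nodes that hold this metadata.
survivor = lookup(path, nodes, failed=set(copies[:2]))
print(copies, survivor)
```

With three copies, any two simultaneous node failures still leave one reachable holder, and because different paths hash to different nodes, lookups are spread across the cluster instead of being funneled through a single server.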


Summary
By using Informatica with MapR's distribution for Hadoop, organizations are now able to achieve high-performance data integration, replication and messaging. Together the two companies are pushing the limits of high-performance networks to move many terabytes per hour of transaction, interaction and streaming data into the MapR cluster, as well as to parse and process a broad range of structured and unstructured data natively in Hadoop, all without coding. The combination also gives organizations a more affordable way to archive data held in applications, data warehouses and/or legacy systems to Hadoop's lower-cost storage.

Together, Informatica and MapR provide cost-effective, analytics-ready data storage and processing with enterprise-class high availability and business continuity. To learn more, please visit either company on the Web at www.mapr.com or www.informatica.com, or call 855-NOW-MAPR (855-669-6277).

MapR Technologies is the creator of the industry's fastest, most dependable and easiest-to-use distribution for Apache Hadoop. MapR Technologies is dedicated to advancing the Hadoop platform and ecosystem to enable more businesses to harness the power of big data analytics for competitive advantage. For more information, please visit www.mapr.com.
2012. MapR. Confidential. 05.12
