CASE STUDY ON HADOOP

What is Hadoop?
Apache Hadoop is a new way for enterprises to store and analyze data.
Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining and understanding large-scale data in order to better comprehend the data deluge.

Enterprises today collect and generate more data than ever before. Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop, however, was designed to solve a different problem: the fast, reliable analysis of both structured data and complex data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, which allows them to combine old data and new data sets in powerful new ways.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.

Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data and can run large-scale, high-performance processing jobs in spite of system changes or failures.

Originally developed and employed by dominant Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other markets with significant data. With Hadoop, enterprises can easily explore complex data using custom analyses tailored to their information and questions.

Cloudera is an active contributor to the Hadoop project and provides an enterprise-ready, commercial distribution of Hadoop. Cloudera's Distribution bundles the innovative work of a global open-source community: it takes critical bug fixes and important new features from the public development repository and applies them to a stable version of the source code. In short, Cloudera integrates the most popular projects related to Hadoop into a single package, which is run through a suite of rigorous tests to ensure reliability during production.

Hadoop Overview
Apache Hadoop is a scalable, fault-tolerant system for data storage and processing. Hadoop is economical and reliable, which makes it well suited to running data-intensive applications on commodity hardware. Hadoop excels at complex analyses, including detailed, special-purpose computation, across large collections of data. Hadoop handles search, log processing, recommendation systems, data warehousing and video/image analysis. Unlike traditional databases, Hadoop scales to address the needs of data-intensive distributed applications in a reliable, cost-effective manner.

HDFS and MapReduce


Hadoop coordinates work across clusters of machines. Clusters can be built and scaled out with inexpensive computers.

The Hadoop software package includes the robust, reliable Hadoop Distributed File System (HDFS), which splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.

Fault-tolerant Hadoop Distributed File System (HDFS)


Provides reliable, scalable, low-cost storage.

HDFS breaks incoming files into blocks and stores them redundantly across the cluster. In addition, Hadoop includes MapReduce, a parallel distributed processing system designed for clusters of commodity, shared-nothing hardware. No special parallel-programming techniques are required to run analyses with MapReduce; many existing algorithms can be expressed in its map and reduce phases without major changes. MapReduce takes advantage of the distribution and replication of data in HDFS to spread execution of a job across many nodes in a cluster.
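To make the storage model concrete, the sketch below writes a file to HDFS through the standard org.apache.hadoop.fs Java client; HDFS then splits the file into blocks and replicates them across the cluster automatically. The NameNode URI, file path and replication factor are illustrative placeholders, not details taken from this case study.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the cluster's NameNode (hostname and port are illustrative).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/example/events.log");
        // HDFS transparently splits this file into blocks and stores each
        // block redundantly on several nodes (the default replication is 3).
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("sample record\n");
        }

        // The replication factor can also be set explicitly per file.
        fs.setReplication(file, (short) 3);
    }
}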

MapReduce Software Framework


Offers a clean abstraction between data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation.

Processes large jobs in parallel across many nodes and combines the results, eliminating the bottlenecks imposed by monolithic storage systems. After each piece of a job has been analyzed, the results are collated and digested into a single output. If a machine fails, Hadoop keeps the cluster operating by shifting its work to the remaining machines, and it automatically creates an additional copy of the affected data from one of the replicas it manages. As a result, clusters are self-healing for both storage and computation, without requiring intervention by systems administrators.
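As an illustration of this map-and-combine model, the canonical word-count job below is written against the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs from each input block in parallel, and the reducer collates the counts for each word into a single total. Input and output paths come from the command line; the class names are conventional example names, not code from this case study.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map phase runs in parallel on each block of input,
    // emitting (word, 1) for every word it encounters.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce phase collates all counts for a given word into one total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}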

What can Hadoop do for you?


Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets. Examples include:

E-tailing

Recommendation engines increase average order size by recommending complementary products based on predictive analysis for cross-selling. Cross-channel analytics: sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion). Event analytics: what series of steps (the "golden path") led to a desired outcome (e.g., purchase, registration).
Financial Services

Compliance and regulatory reporting. Risk analysis and management. Fraud detection and security analytics. CRM and customer loyalty programs. Credit scoring and analysis. Trade surveillance.
Government

Fraud detection and cybersecurity. Compliance and regulatory analysis. Energy consumption and carbon footprint management.

Health & Life Sciences

Campaign and sales program optimization. Brand management. Patient care quality and program analysis. Supply-chain management. Drug discovery and development analysis.


Retail/CPG

Merchandising and market basket analysis. Campaign management and customer loyalty programs. Supply-chain management and analytics. Event- and behavior-based targeting. Market and consumer segmentation.
Telecommunications

Revenue assurance and price optimization. Customer churn prevention. Campaign management and customer loyalty. Call Detail Record (CDR) analysis. Network performance and optimization.
Web & Digital Media Services

Large-scale clickstream analytics. Ad targeting, analysis, forecasting and optimization. Abuse and click-fraud prevention. Social graph analysis and profile segmentation. Campaign management and loyalty programs.
