
Data-Intensive Information Processing Session #1

Introduction to Hadoop
Mani mmaniga@yahoo.co.uk Monday, June 15, 2011

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger and bigger computers, but for more systems of computers." - Grace Hopper

About this course


About large-scale data processing
Focus on problem solving
MapReduce and Hadoop
HBase
Hive
Less on administration, more on programming

What is Hadoop
Framework written in Java
Designed to solve problems that involve analyzing large data sets (petabytes)
Programming model based on Google's MapReduce
Infrastructure based on Google's Bigtable and the Google File System
Has functional programming roots (Haskell, Erlang, etc.)
Remember how merge sort works and try to apply the same logic to large data problems
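The merge-sort analogy can be made concrete. Below is a minimal sketch (the class name is illustrative) of the split-and-merge logic that MapReduce generalizes to cluster scale: halves are sorted independently, then combined.

```java
import java.util.Arrays;

// Merge sort: split the input in half, sort each half independently,
// then merge the sorted halves. MapReduce applies the same
// divide / process-in-parallel / combine idea to data too large for one machine.
public class MergeSortSketch {
    public static int[] sort(int[] a) {
        if (a.length <= 1) return a;                        // base case
        int mid = a.length / 2;
        int[] left  = sort(Arrays.copyOfRange(a, 0, mid));  // independent subproblem
        int[] right = sort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);                          // combine step
    }

    private static int[] merge(int[] l, int[] r) {
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++];
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[]{5, 2, 8, 1, 9, 3})));
        // [1, 2, 3, 5, 8, 9]
    }
}
```

The key property is that the two halves never communicate while being sorted; only the merge needs both results. That independence is what makes the pattern distributable.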

What is Hadoop
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Hadoop includes:
Hadoop Common: the common utilities
Avro: a data serialization system with bindings for scripting languages
Chukwa: a data collection system for managing large distributed systems
HBase: a scalable, distributed database for large tables
HDFS: a distributed file system
Hive: data summarization and ad hoc querying
MapReduce: distributed processing on compute clusters
Pig: a high-level data-flow language for parallel computation
ZooKeeper: a coordination service for distributed applications

What is Hadoop
Hadoop's key differentiating factors
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2
Robust: Because Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it handles such failures gracefully, and computation does not stop because of a few failed devices
Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster
Simple: It is easy to write efficient parallel programs with Hadoop

What is Hadoop
What it is not designed for
Hadoop is not designed for random reads and writes of a few records at a time, which is the kind of workload typical of online transaction processing.

What is it designed for


Hadoop is designed for offline processing of large-scale data. It is best used as a write-once, read-many data store, similar to a data warehouse.

Hadoop & Distributed Systems

Parallel Programming
The Challenge so far
No framework so far has provided the right level of abstraction: a programming model that expresses parallelism natively, so that the algorithms needed to design, implement, and maintain parallel applications can be realized effectively. In the past we had, as technologies:
MPI
PVM
Etc.

As architectures:
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Multiple Data (MIMD)

Parallel Programming
The Challenge so far
Identifying work that can be done concurrently
Mapping work to processing units
Distributing the work
Managing access to shared data
Synchronizing the various stages of activity

Algorithm
Sequential algorithm: a sequence of finite instructions, often used for calculation and data processing.
Parallel algorithm: an algorithm whose parts can be executed simultaneously on many different processing devices and then put back together at the end to get the correct result.

Parallel Programming Model


What is a model (not from an OO perspective)?
A way to structure a parallel algorithm by choosing decomposition and mapping techniques so as to minimize interactions.

Different models of parallel algorithms

Data parallel: independent processes are each assigned data to work on.
Task graph: tasks are mapped to nodes in a data-dependency graph; data moves from source to sink.
Work pool: a pool of workers takes up work as and when it becomes available.
Master-slave: a master process assigns work and sends data to a set of worker processes.
Pipeline
Hybrid
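The work-pool model can be sketched with Java's built-in `ExecutorService`, which maintains a fixed pool of worker threads pulling tasks from a shared queue. The class and method names here are illustrative, not part of any framework discussed in these slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Work-pool model: a fixed pool of workers pulls tasks from a shared
// queue as they become available; no worker is tied to specific data.
public class WorkPoolSketch {
    public static int sumOfSquares(List<Integer> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Each input becomes an independent task submitted to the pool.
            List<Future<Integer>> futures = new ArrayList<>();
            for (int n : inputs) {
                final int v = n;                     // capture for the lambda
                futures.add(pool.submit(() -> v * v));
            }
            int total = 0;
            for (Future<Integer> f : futures) total += f.get();  // gather results
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // 30
    }
}
```

Because idle workers immediately pick up the next queued task, the pool balances load automatically even when tasks take unequal time, which is the main appeal of this model.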

Parallel Programming Model


Different models of parallel programming

Message passing
Independent tasks encapsulate local data.
Tasks interact by exchanging messages.
The most widely used model (MPI, PVM); considered the most efficient form of parallel programming, and effective for many SIMD and MIMD problems, both data-intensive and compute-intensive.

Shared memory
Tasks share a common address space.
Tasks interact by reading and writing this space asynchronously.
Mostly used on symmetric multiprocessing machines (multicore chips): the threads of a single process share one address space.

Data parallelization
Tasks execute a sequence of independent operations.
Data is usually partitioned evenly across tasks.
Also referred to as "embarrassingly parallel."
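The embarrassingly parallel case has direct language support in Java: a parallel stream partitions the data across worker threads, applies the same independent operation to each element, and combines the results. A minimal sketch (the class name is illustrative):

```java
import java.util.stream.IntStream;

// Data-parallel / embarrassingly parallel: the same independent operation
// is applied to evenly partitioned data, with no communication between tasks.
public class DataParallelSketch {
    public static long countEvens(int[] data) {
        // parallel() splits the data across worker threads; each element
        // is tested independently, then the per-thread counts are combined.
        return IntStream.of(data).parallel().filter(n -> n % 2 == 0).count();
    }

    public static void main(String[] args) {
        System.out.println(countEvens(new int[]{1, 2, 3, 4, 5, 6})); // 3
    }
}
```

This only works cleanly because testing one element never depends on another; any shared mutable state would reintroduce the synchronization problems listed below.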

Problems with Parallel Computing


Synchronization
Efficiency
Reliability

What is Map Reduce


A programming model designed by Google, with which a subset of distributed computing problems can be solved by writing simple programs. It provides automatic data distribution and aggregation. It sidesteps many of the problems of parallel programming by adopting a shared-nothing architecture: to scale, systems should be stateless.

Map Reduce
"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs." (Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters")

Map Reduce
Partitions input data
Schedules execution across a set of machines
Handles machine failures
Manages inter-process communication
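The MapReduce flow can be sketched on a single machine. The following is a plain-Java simulation of the map, shuffle, and reduce phases for word counting, not Hadoop's actual API; the class name is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A single-machine sketch of the MapReduce flow: map each record to
// (key, value) pairs, group pairs by key (the "shuffle"), then reduce
// each key's values to a final result. Hadoop runs the same three
// phases across a cluster, adding partitioning and failure handling.
public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));

        // Shuffle phase: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: sum the counts for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

Note that the map step looks at one line at a time and the reduce step looks at one key at a time; neither needs global state, which is exactly why the framework can partition both phases across machines.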

Recap Hadoop
Hadoop is a framework for building large-scale distributed applications; it provides a reliable shared storage and analysis system. Storage is provided by the Hadoop Distributed File System (HDFS) and analysis by MapReduce, modeled respectively on the Google File System and Google MapReduce.

Hadoop is designed to store and process large volumes of data using commodity hardware.

Recap this course


Primarily about large data, more specifically data in textual form. The problems are statistical in nature: we want to infer regularities in the data. We deal with data, representations of data, and methods for capturing regularity in the data.
