
Data-Intensive Information Processing Session #1

Introduction to Hadoop
Mani mmaniga@yahoo.co.uk Monday, June 15, 2011

"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger and bigger computers, but for more systems of computers." - Grace Hopper

About this course


About large-scale data processing
Focus on problem solving
MapReduce and Hadoop
HBase
Hive
Less on administration, more on programming

What is Hadoop
Framework written in Java
Designed to solve problems that involve analyzing large data sets (petabytes)
Programming model based on Google's MapReduce
Infrastructure based on Google's Bigtable and the Google File System
Has functional programming roots (Haskell, Erlang, etc.)
Remember how merge sort works and try to apply the same logic to large data problems
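The merge-sort analogy can be made concrete. Below is a minimal sketch (the class name is illustrative) of the split-and-merge logic that MapReduce generalizes to cluster scale: halves are sorted independently, then combined.

```java
import java.util.Arrays;

// Merge sort: split the input in half, sort each half independently,
// then merge the sorted halves. MapReduce applies the same
// divide / process-in-parallel / combine idea to data too large for one machine.
public class MergeSortSketch {
    public static int[] sort(int[] a) {
        if (a.length <= 1) return a;                        // base case
        int mid = a.length / 2;
        int[] left  = sort(Arrays.copyOfRange(a, 0, mid));  // independent subproblem
        int[] right = sort(Arrays.copyOfRange(a, mid, a.length));
        return merge(left, right);                          // combine step
    }

    private static int[] merge(int[] l, int[] r) {
        int[] out = new int[l.length + r.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)
            out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++];
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[]{5, 2, 8, 1, 9, 3})));
        // [1, 2, 3, 5, 8, 9]
    }
}
```

The key property is that the two halves never communicate while being sorted; only the merge needs both results. That independence is what makes the pattern distributable.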

What is Hadoop
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Hadoop includes:
Hadoop Common: the common utilities
Avro: a data serialization system with bindings for scripting languages
Chukwa: a data collection system for managing large distributed systems
HBase: a scalable, distributed database for large tables
HDFS: a distributed file system
Hive: data summarization and ad hoc querying
MapReduce: distributed processing on compute clusters
Pig: a high-level data-flow language for parallel computation
ZooKeeper: a coordination service for distributed applications

What is Hadoop
Hadoop's key differentiating factors
Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon EC2
Robust: Because Hadoop runs on commodity clusters, it is designed with the assumption of frequent hardware failure; it handles such failures gracefully, and computation does not stop because of a few failed devices
Scalable: Hadoop scales linearly to handle larger data by adding more slave nodes to the cluster
Simple: It is easy to write efficient parallel programs with Hadoop

What is Hadoop
What it is not designed for
Hadoop is not designed for random reads and writes of a few records at a time, which is the kind of workload typical of online transaction processing.

What is it designed for


Hadoop is designed for offline processing of large-scale data. It is best used as a write-once, read-many data store, similar to a data warehouse.

Hadoop & Distributed Systems

Parallel Programming
The Challenge so far
No framework so far has provided the right level of abstraction: a programming model that expresses parallelism natively, so that the algorithms needed to design, implement, and maintain parallel applications can be realized effectively. In the past we had, as technologies:
MPI
PVM
Etc.

As architectures:
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Multiple Data (MIMD)

Parallel Programming
The Challenge so far
Identifying work that can be done concurrently
Mapping work to processing units
Distributing the work
Managing access to shared data
Synchronizing the various stages of activity

Algorithm
Sequential algorithm: a sequence of finite instructions, often used for calculation and data processing.
Parallel algorithm: an algorithm whose parts can be executed simultaneously on many different processing devices and then put back together at the end to get the correct result.

Parallel Programming Model


What is a model (not from an OO perspective)?
A way to structure a parallel algorithm by choosing decomposition and mapping techniques so as to minimize interactions.

Different models of parallel algorithms

Data parallel: independent processes are each assigned data to work on.
Task graph: tasks are mapped to nodes in a data-dependency graph; data moves from source to sink.
Work pool: a pool of workers takes up work as and when it becomes available.
Master-slave: a master process assigns work and sends data to a set of worker processes.
Pipeline
Hybrid
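The work-pool model can be sketched with Java's built-in `ExecutorService`, which maintains a fixed pool of worker threads pulling tasks from a shared queue. The class and method names here are illustrative, not part of any framework discussed in these slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Work-pool model: a fixed pool of workers pulls tasks from a shared
// queue as they become available; no worker is tied to specific data.
public class WorkPoolSketch {
    public static int sumOfSquares(List<Integer> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Each input becomes an independent task submitted to the pool.
            List<Future<Integer>> futures = new ArrayList<>();
            for (int n : inputs) {
                final int v = n;                     // capture for the lambda
                futures.add(pool.submit(() -> v * v));
            }
            int total = 0;
            for (Future<Integer> f : futures) total += f.get();  // gather results
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // 30
    }
}
```

Because idle workers immediately pick up the next queued task, the pool balances load automatically even when tasks take unequal time, which is the main appeal of this model.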

Parallel Programming Model


Different models of parallel programming

Message passing
Independent tasks encapsulate local data.
Tasks interact by exchanging messages.
The most widely used model (MPI, PVM); considered the most efficient form of parallel programming, and effective for many SIMD and MIMD problems, both data-intensive and compute-intensive.

Shared memory
Tasks share a common address space.
Tasks interact by reading and writing this space asynchronously.
Mostly used on symmetric multiprocessing machines (multicore chips): the threads of a single process share one address space.

Data parallelization
Tasks execute a sequence of independent operations.
Data is usually partitioned evenly across tasks.
Also referred to as "embarrassingly parallel."
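The embarrassingly parallel case has direct language support in Java: a parallel stream partitions the data across worker threads, applies the same independent operation to each element, and combines the results. A minimal sketch (the class name is illustrative):

```java
import java.util.stream.IntStream;

// Data-parallel / embarrassingly parallel: the same independent operation
// is applied to evenly partitioned data, with no communication between tasks.
public class DataParallelSketch {
    public static long countEvens(int[] data) {
        // parallel() splits the data across worker threads; each element
        // is tested independently, then the per-thread counts are combined.
        return IntStream.of(data).parallel().filter(n -> n % 2 == 0).count();
    }

    public static void main(String[] args) {
        System.out.println(countEvens(new int[]{1, 2, 3, 4, 5, 6})); // 3
    }
}
```

This only works cleanly because testing one element never depends on another; any shared mutable state would reintroduce the synchronization problems listed below.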

Problems with Parallel Computing


Synchronization
Efficiency
Reliability

What is Map Reduce


A programming model designed by Google, with which a subset of distributed computing problems can be solved by writing simple programs. It provides automatic data distribution and aggregation. It sidesteps many of the problems of parallel programming by adopting a shared-nothing architecture: to scale, systems should be stateless.

Map Reduce
"A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs." (Dean and Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters")

Map Reduce
Partitions input data
Schedules execution across a set of machines
Handles machine failures
Manages inter-process communication
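The MapReduce flow can be sketched on a single machine. The following is a plain-Java simulation of the map, shuffle, and reduce phases for word counting, not Hadoop's actual API; the class name is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A single-machine sketch of the MapReduce flow: map each record to
// (key, value) pairs, group pairs by key (the "shuffle"), then reduce
// each key's values to a final result. Hadoop runs the same three
// phases across a cluster, adding partitioning and failure handling.
public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));

        // Shuffle phase: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: sum the counts for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

Note that the map step looks at one line at a time and the reduce step looks at one key at a time; neither needs global state, which is exactly why the framework can partition both phases across machines.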

Recap Hadoop
Hadoop is a framework for building large-scale distributed applications; it provides a reliable shared storage and analysis system. Storage is provided by the Hadoop Distributed File System (HDFS) and analysis by MapReduce, modeled respectively on the Google File System and Google MapReduce.

Hadoop is designed to store and process large volumes of data using commodity hardware.

Recap this course


Primarily about large data, more specifically data in textual form. The problems are statistical in nature: we want to infer regularities in the data. We deal with data, representations of data, and methods for capturing regularity in the data.
