Week 1
– Understanding Big Data
– Introduction to HDFS
Week 2
– Playing around with Cluster
– Data loading Techniques
Week 3
– Map-Reduce Basics, types and formats
– Use-cases for Map-Reduce
Week 4
– Analytics using Pig
– Understanding Pig Latin
Week 5
– Analytics using Hive
– Understanding Hive QL
Week 6
– NoSQL Databases
– Understanding HBASE
Week 7
– Real world Datasets and Analysis
– Hadoop Project Environment
Week 8
– Project Reviews
– Planning a career in Big Data
How it works
Live classes
Class recordings
Module wise Quizzes, Coding Assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the Learning Management System
• Insurance
• Healthcare
• Retail
– Recommendations
– Groupings
• Genome Sequencing
• Utilities
Hadoop Users
http://wiki.apache.org/hadoop/PoweredBy
Data volume is growing exponentially
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
Unstructured Data is exploding
Why DFS?
Read 1 TB of Data
– 1 Machine: 4 I/O Channels, each channel at 100 MB/s – about 45 Minutes
– 10 Machines: 4 I/O Channels per machine, each channel at 100 MB/s – about 4.5 Minutes
Ten machines reading in parallel give ten times the aggregate I/O bandwidth, so the read time drops by a factor of ten.
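The arithmetic behind the slide can be sketched in a few lines (a minimal sketch; the 4-channel and 100 MB/s figures come from the slide, while the 1 TB = 1024^4 bytes convention and the function name are assumptions made here):

```python
# Back-of-the-envelope: time to read 1 TB sequentially.
# Assumptions: 4 I/O channels per machine, 100 MB/s per channel,
# and the data split evenly across machines.

def read_minutes(tb: float, machines: int, channels: int = 4,
                 mb_per_s: float = 100.0) -> float:
    """Minutes to read `tb` terabytes spread evenly over `machines`."""
    total_bytes = tb * 1024**4
    bandwidth = machines * channels * mb_per_s * 1024**2  # bytes/s, aggregate
    return total_bytes / bandwidth / 60

one = read_minutes(1, machines=1)    # ~43.7 min; the slide rounds to 45
ten = read_minutes(1, machines=10)   # ~4.4 min: 10x the aggregate bandwidth
```

This is the core motivation for a distributed file system: bandwidth scales with the number of machines reading in parallel.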
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters
of commodity computers using a simple programming model.
Two core components: HDFS (storage) and MapReduce (processing)
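As an illustration of the "simple programming model" mentioned above, a word count expressed as map and reduce phases can be sketched in plain Python (a toy in-process sketch, not the Hadoop Java API; the function names are invented for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word. (The shuffle
    step that groups pairs by key is folded into this dict for brevity.)"""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["the cat sat", "the cat ran"]))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In Hadoop proper, the map tasks run in parallel across the cluster and the framework shuffles each key to a reduce task, but the two-function shape is the same.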
What is HDFS?
– Highly fault-tolerant
– High throughput
NameNode:
– master of the system
– maintains and manages the blocks which are present on the DataNodes
DataNodes:
– slaves which are deployed on each machine and provide the actual storage
– responsible for serving read and write requests for the clients
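The master/slave split above can be pictured as a toy data structure (purely illustrative; real HDFS metadata is far richer, and the file, block, and node names below are invented):

```python
# Toy view of NameNode metadata: the NameNode never stores file
# contents, only which blocks make up a file and which DataNodes
# hold a replica of each block.
namenode_metadata = {
    "/logs/2014-01-01.log": ["blk_1", "blk_2"],        # file -> block list
}
block_locations = {
    "blk_1": ["datanode1", "datanode3", "datanode7"],  # 3 replicas each
    "blk_2": ["datanode2", "datanode3", "datanode5"],
}

def datanodes_for(path):
    """A client asks the NameNode where each block of `path` lives,
    then streams the bytes directly from those DataNodes."""
    return [block_locations[b] for b in namenode_metadata[path]]
```

The key design point: data traffic goes client-to-DataNode, so the NameNode only serves small metadata lookups and never becomes a bandwidth bottleneck.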
NameNode Failure:
The NameNode is a single point of failure; if it fails, the filesystem metadata is lost with it.
Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour*
– Housekeeping, backup of NameNode metadata
– Saved metadata can be used to rebuild a failed NameNode
(Slide caption: "You give me metadata every hour, I make it secure.")
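The hourly checkpoint cycle above can be simulated in a few lines (a toy sketch; real checkpointing merges the NameNode's edit log into the fsimage file rather than copying a dict, and the class and method names are invented):

```python
import copy

class NameNode:
    def __init__(self):
        self.metadata = {}          # file -> block list

class SecondaryNameNode:
    """Periodically copies the NameNode's metadata so a fresh
    NameNode can be rebuilt from the last checkpoint."""
    def __init__(self):
        self.checkpoint = {}

    def hourly_checkpoint(self, nn: NameNode):
        self.checkpoint = copy.deepcopy(nn.metadata)

    def rebuild(self) -> NameNode:
        nn = NameNode()
        nn.metadata = copy.deepcopy(self.checkpoint)
        return nn

nn, snn = NameNode(), SecondaryNameNode()
nn.metadata["/a"] = ["blk_1"]
snn.hourly_checkpoint(nn)           # checkpoint taken
nn.metadata["/b"] = ["blk_2"]       # written after the checkpoint...
restored = snn.rebuild()            # ...so it is lost on rebuild
# restored.metadata == {"/a": ["blk_1"]}
```

This is exactly why the Secondary NameNode is "not a hot standby": anything written after the last checkpoint is not recoverable from it.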
JobTracker and TaskTracker:
HDFS Architecture
Job Tracker
HDFS Client Creates a New File
Rack Awareness
Anatomy of a File Write:
Anatomy of a File Read:
Thank You
See You in Class Next Week