Hadoop Ecosystem
Introduction
“Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models”.
Appliance hardware:
Monitoring:
Metadata replication:
Data Replication
Data Resilience
Data Integrity

Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Data Replication
The default replication factor is 3.
HDFS typically places one replica of each block on the writer's local node, with the remaining replicas on other nodes for fault tolerance.
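
The replication factor can also be tuned per file. Below is a minimal Java sketch using the standard Hadoop FileSystem API; the cluster configuration and the file path /data/example.txt are hypothetical placeholders, not part of the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication sets the default factor for files created by this client
        conf.setInt("dfs.replication", 3);

        try (FileSystem fs = FileSystem.get(conf)) {
            // Hypothetical path; raise this one file's replication factor to 5
            fs.setReplication(new Path("/data/example.txt"), (short) 5);
        }
    }
}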
Data Integrity
Consider a situation: a block of data fetched from a DataNode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
HDFS guards against this by storing a checksum for each block when a file is written and verifying it when the data is read back; if the check fails, the client fetches the block from another replica.
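
As a rough illustration of the idea only (not HDFS's actual implementation, which keeps per-chunk checksums alongside each block), this self-contained Java sketch computes a checksum at "write time" and detects a simulated corruption at "read time":

import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC32 checksum over a block of bytes
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "example block contents".getBytes();
        long stored = checksum(block);           // computed at write time

        block[3] ^= 0x01;                        // simulate a single flipped bit in transit

        boolean ok = checksum(block) == stored;  // verified at read time
        System.out.println(ok ? "block OK"
                              : "block corrupted: fetch it from another replica");
    }
}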
The Mapper
Reads input data as key/value pairs
◦ The key is often discarded (with text input it is just the line's byte offset), as in the sketch below
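
For concreteness, here is the canonical word-count mapper written against the org.apache.hadoop.mapreduce API; with TextInputFormat, the input key is the line's byte offset and is simply ignored:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' (the input key) is never used: the typical "discarded key" case
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit (word, 1)
            }
        }
    }
}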
The Reducer task starts with the Shuffle and Sort step, which copies the mappers' intermediate key/value pairs to the machine where the Reducer is running and merge-sorts them so that all values for the same key arrive together.
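
The matching word-count reducer: by the time reduce() is called, shuffle and sort have already grouped every value emitted for a given key into a single Iterable:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All counts for this word arrive together thanks to shuffle and sort
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // emit (word, total)
    }
}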
Fault-Tolerance