Overview
Apache Hadoop is an open-source software framework for storage and
large-scale processing of data sets on clusters of commodity hardware.
There are mainly five building blocks inside this runtime environment
(from bottom to top):
YARN: cluster resource management (Resource Manager), application startup, and job submission.
GPFS
IBM developed GPFS (General Parallel File System) in 1998 as a SAN file system for use in HPC
applications.
GPFS is basically a storage file system developed as a SAN file system. Being a storage system, it
cannot be attached directly to the Hadoop system that makes up a cluster. This is where IBM FPO
(File Placement Optimization) comes into the picture and bridges the gap.
FPO is essentially an emulation of the key component of HDFS, which is moving the workload from
the application to the data: it moves the job to the data instead of moving the data to the job.
GPFS is POSIX compliant, which enables any other applications running on top of the Hadoop
cluster to access data stored in the file system in a straightforward manner. With HDFS, only
Hadoop applications can access the data, and they must go through the Java-based HDFS API.
So the major difference is framework versus file system: GPFS gives users the flexibility to access
stored data from both Hadoop and non-Hadoop systems, which frees them to create more flexible
workflows (big data, ETL, or online). For example, one can create a series of ETL processes with
multiple execution steps, local data, or Java processes to manipulate the data, and the ETL steps
can be combined with MapReduce jobs within the same workflow.
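As a point of comparison, the Java-based HDFS API mentioned above looks roughly as follows. This is a minimal sketch, assuming a Hadoop client configuration on the classpath; the class name and file path are purely illustrative.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Connect to the file system configured in core-site.xml (fs.defaultFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open an HDFS file; the path below is illustrative.
        Path file = new Path("/data/example.txt");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

A POSIX-compliant file system such as GPFS could instead be read with ordinary file I/O, with no Hadoop-specific client code, which is the flexibility described above.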
FEATURES:
1. Hierarchical storage management
Hadoop - HDFS
The Hadoop Distributed File System (HDFS) follows a distributed file system
design and runs on commodity hardware. Unlike other distributed systems, HDFS is
highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store
such huge data, the files are stored across multiple machines. These files
are stored in a redundant fashion to protect the system from possible data
loss in case of failure. HDFS also makes applications available for parallel
processing.
Features of HDFS
It is suitable for distributed storage and processing.
The built-in servers of the namenode and datanodes help users easily check the
status of the cluster (see the sketch after this list).
It provides streaming access to file system data.
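For instance, the aggregate capacity and usage reported by the namenode can be read programmatically through the FileSystem API. A minimal sketch, assuming a reachable cluster; the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterStatusExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Aggregate capacity and usage figures for the file system, in bytes.
        FsStatus status = fs.getStatus();
        System.out.println("Capacity  : " + status.getCapacity());
        System.out.println("Used      : " + status.getUsed());
        System.out.println("Remaining : " + status.getRemaining());
    }
}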
HDFS Architecture
HDFS follows a master-slave architecture and comprises the following components.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software. It is software that can be
run on commodity hardware. The system hosting the namenode acts as the
master server and performs the following tasks: it manages the file system
namespace, regulates client access to files, and executes file system
operations such as opening, closing, and renaming files and directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating
system and the datanode software. For every node (commodity
hardware/system) in a cluster, there will be a datanode. These nodes
manage the data storage of their system.
They serve read and write requests from the file system's clients, and they
also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
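The division of labour between the namenode (metadata) and the datanodes (block storage) can be seen from a client: asking for a file's block locations returns, for each block, the datanodes that hold a replica. A minimal sketch, assuming a reachable cluster; the class name and path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Illustrative path; replace with a real HDFS file.
        Path file = new Path("/data/example.txt");
        FileStatus status = fs.getFileStatus(file);

        // The namenode returns, for each block of the file, the datanodes
        // that currently hold a replica of that block.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}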
Block
Generally, user data is stored in the files of HDFS. A file in the file
system is divided into one or more segments, which are stored in
individual datanodes. These file segments are called blocks. In other
words, the minimum amount of data that HDFS can read or write is called a
block. The default block size is 64 MB, but it can be increased as needed
by changing the HDFS configuration.
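A block size other than the default can also be requested per client or per file. A minimal sketch, assuming a Hadoop 2.x-or-later client where the setting is named dfs.blocksize; the class name, path, and sizes are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Request 128 MB blocks for files created through this configuration;
        // hdfs-site.xml carries the cluster-wide default for the same setting.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/large-output.bin");   // illustrative path

        // The block size can also be passed explicitly when creating a file:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("example payload");
        }
    }
}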
Goals of HDFS
Fault detection and recovery : Since HDFS includes a large number of
commodity hardware components, failure of components is frequent. Therefore HDFS
should have mechanisms for quick and automatic fault detection and recovery (see
the sketch after this list).
Huge datasets : HDFS should scale to hundreds of nodes per cluster to manage
the applications having huge datasets.
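Recovery from datanode failures relies on the block replication described earlier; a client can inspect and change a file's replication factor through the same FileSystem API. A minimal sketch; the class name, path, and replication factor are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");   // illustrative path

        // Current replication factor recorded by the namenode for this file.
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication = " + current);

        // Request three replicas; the namenode schedules re-replication as needed.
        fs.setReplication(file, (short) 3);
    }
}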
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the
data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage,
and the reduce stage.
o Map stage : The map or mapper's job is to process the input data.
Generally the input data is in the form of a file or directory and is stored in
the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several
small chunks of data.
o Shuffle stage : The framework sorts the mapper output and redistributes it
across the cluster so that all values belonging to the same key reach the
same reducer.
o Reduce stage : The reducer processes the grouped data coming from the
mappers and produces a new, reduced set of output, which is stored in HDFS.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
Most of the computing takes place on nodes with data on local disks, which
reduces network traffic.
After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
The key and value classes have to be serializable by the framework and
hence need to implement the Writable interface. Additionally, the key
classes have to implement the WritableComparable interface to facilitate
sorting by the framework. The input and output types of a MapReduce job
are: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
         Input             Output
Map      <k1, v1>          list(<k2, v2>)
Reduce   <k2, list(v2)>    list(<k3, v3>)
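To make the stages and the type requirements concrete, here is a sketch of the classic word-count job. The class names are illustrative, and the built-in Text, IntWritable, and LongWritable types are used because they already implement Writable/WritableComparable:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: receives one line at a time (<k1, v1> = <byte offset, line>)
    // and emits <k2, v2> = <word, 1>.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: receives <k2, list(v2)> after the shuffle and emits
    // <k3, v3> = <word, total count>.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the stages together and submits the job to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run as an ordinary Hadoop jar, the framework splits the input, runs the mappers near the data, shuffles the <word, 1> pairs by key, and hands each key with its list of values to a reducer.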
JAQL
Jaql (JAQL) is a functional data processing and query language most commonly used
for JSON query processing on big data.
It started as an open-source project at Google[1], but the latest release was on 7/12/2010. IBM[2] took it
over as the primary data processing language for its Hadoop software package BigInsights.
Although it was developed for JSON, it supports a variety of other data sources
such as CSV, TSV, and XML.
JAQL supports[4] Lazy Evaluation, so expressions are only materialized when needed.
Data flows from a source through a chain of operators to a sink, where a sink can in turn be a
source for a downstream operator. A typical JAQL program therefore has the following structure
(shown schematically, with placeholder operator names), expressing a data processing graph:

source -> operator1(parameters) -> operator2(parameters) -> sink;

Most commonly, for readability, JAQL programs are line-broken after the arrow, as is also a
common idiom in Twitter SCALDING:

source ->
operator1(parameters) ->
operator2(parameters) ->
sink;
Core Operators
EXPAND
Use the EXPAND expression to flatten nested arrays. This expression takes as input an array of
nested arrays [ [ T ] ] and produces an output array [ T ], by promoting the elements of each nested
array to the top-level output array.
FILTER
Use the FILTER operator to filter away elements from the specified input array. This operator takes
as input an array of elements of type T and outputs an array of the same type, retaining those
elements for which a predicate evaluates to true. It is the Jaql equivalent of the SQL WHERE clause.
Example:
data = [
{name: "Jon Doe", income: 20000, mgr: false},
{name: "Vince Wayne", income: 32500, mgr: false},
{name: "Jane Dean", income: 72000, mgr: true},
{name: "Alex Smith", income: 25000, mgr: false}
];
For example, a query that keeps only the managers, such as

data -> filter $.mgr == true;

produces:

[
{
"income": 72000,
"mgr": true,
"name": "Jane Dean"
}
]
A query that instead keeps only the lower incomes, such as

data -> filter $.income < 30000;

produces:

[
{
"income": 20000,
"mgr": false,
"name": "Jon Doe"
},
{
"income": 25000,
"mgr": false,
"name": "Alex Smith"
}
]
GROUP
Use the GROUP expression to group one or more input arrays on a grouping key and apply an
aggregate function per group.
JOIN
Use the JOIN operator to express a join between two or more input arrays. This operator supports
multiple types of joins, including natural, left-outer, right-outer, and outer joins.
SORT
Use the SORT operator to sort the input array by one or more of its fields. The related TOP
expression selects the first k elements of its input; if a comparator is provided, the output is
semantically equivalent to sorting the input and then selecting the first k elements.
TRANSFORM
Use the TRANSFORM operator to realize a projection or to apply a function to all items of an array.
Pig
Apache Pig[1] is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin.[1] Pig can execute its Hadoop jobs in MapReduce.
-- Load the input data set, one line per record (the path is illustrative)
input_lines = LOAD 'input/pages' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- group by word and count the occurrences in each group
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
The above program will generate parallel executable tasks which can be distributed across multiple
machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages
on the internet.