
By: Hadeel Ahmed

BIG DATA IS BIG



Big data, as its name suggests, is very big
Datasets typically start at around 1 TB and can reach petabytes

Sources of data
Data from the internet
Data from military organizations
Hospital records
NASA mission data
And so on…
Types of data

• Unstructured data
Like images, videos, and social media content
• Semi-structured data
Like XML files
• Structured data
Like databases and SQL servers
What is the problem?!

• The problem is that with this unsorted, very large data we cannot analyze it, and moreover we cannot classify it; it becomes useless stored data

How to solve this?!


Say Hello to HADOOP ;)
HADOOP

• Hadoop is an open source project created by Doug Cutting, inspired by papers published by Google
• It provides very useful options for dealing with that data
• First, it is scalable: it can accept any size of data
• It analyzes this data at very high velocity; imagine being able to process 1 TB of data within 2.5 minutes
• One of the most important characteristics of Hadoop is that it can distribute any data as blocks across a number of files or servers
MapReduce

• MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, which is an Apache open-source framework.
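
To make the model concrete, below is the classic word-count job, a minimal sketch written against the standard Hadoop MapReduce Java API: the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word. The input and output paths are placeholders supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is typically launched with hadoop jar wordcount.jar WordCount <input> <output>, where the jar name and the paths are placeholders.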
HADOOP

• Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Characteristics of HADOOP

• It doesn’t deal with your data as one whole block; instead, it splits the data into a number of blocks, each stored on a different server or file. That is why it is scalable and accepts any size of data
• It has one central node called the NAME NODE, which controls all the other data nodes; it holds the location info for every block (like a broker); see the sketch after this list
• MapReduce is responsible for connecting all the nodes together and distributing tasks among them: it first acquires block and node info from the name node, then distributes the task between the nodes, which decreases the time needed to process the data
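
To illustrate the name node’s role as the broker that knows where every block lives, here is a minimal sketch using the HDFS FileSystem Java API. It assumes a cluster reachable through the default configuration files on the classpath, and the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    // Assumes core-site.xml / hdfs-site.xml on the classpath
    // point at a reachable HDFS cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used only for illustration.
    Path file = new Path("/data/input/sample.txt");
    FileStatus status = fs.getFileStatus(file);

    // Ask the name node where each block of the file is stored.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}

Each line of output maps one block of the file to the data nodes holding a copy of it, which is exactly the metadata MapReduce consults before scheduling tasks close to the data.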

Characteristics of HADOOP CONT.


• Fault tolerance
• Hadoop ensures there is a backup for every block: more than one copy of each block is kept across the node cluster (three replicas by default); see the sketch after this list
• It provides a write-once, read-many model for data
• PIG, HIVE, ZOOKEEPER
• These are ready-made projects dedicated to particular kinds of jobs
For example, Pig provides a high-level scripting language for analyzing large datasets, Hive offers SQL-like querying, and ZooKeeper handles coordination between nodes
• It supports writing your own MapReduce jobs in almost any language, not just Java (via Hadoop Streaming)
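
As a small illustration of the fault-tolerance bullet above, the sketch below queries and raises the replication factor of one file through the same FileSystem API; the path is hypothetical, and the name node performs the actual re-replication in the background.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path, used only for illustration.
    Path file = new Path("/data/input/sample.txt");

    // How many copies of each block does HDFS currently keep?
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("current replication factor: " + current);

    // Request one extra replica per block; the name node schedules
    // the additional copies on other data nodes in the background.
    fs.setReplication(file, (short) (current + 1));
    fs.close();
  }
}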
