
BIG DATA

It refers to very large sets of data, in
other words a large volume of input
(the input can be both ordered and
un-ordered).
Ordered input/data: This type of data is
written in a format that is easy for a
machine to understand and easy to
search by applying some basic
algorithms. Examples: phone numbers,
spreadsheets.
Un-ordered input/data: This type of
data, like human language, does not fit
neatly into a relational database such
as SQL. Examples: emails, text
documents (PDF, Word docs, etc.).
How big can data be?
Let me explain this with an example of
photos: suppose a family has 4
members, and in 2011 each member
used 2 GB (gigabytes) of storage; by
2017 the same family used 10 GB per
person just for photos. In the same
way, as per records, by 2013 the whole
world had reached 4.4 ZB (zettabytes),
where a zettabyte is 10^21 bytes.
Every month around 240 billion photos
are uploaded to Facebook, i.e. around
7 PB per month.
DATA STORAGE
Consider a storage device from 1990
that could store 1370 MB of data and
transfer it at a speed of four to five
MB/s. Twenty years later, a 1 TB drive
has a transfer speed of around 100
MB/s, so it takes more than 2-3 hours
to read all the data.
We can reduce this reading and writing
time by storing our data on multiple
drives and making them work in
parallel; then we could read the whole
data set in minutes rather than hours.
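The arithmetic above can be checked with a short calculation (the drive size, transfer speed, and drive count below are the approximate figures from the text plus an illustrative 100-drive cluster):

```python
# Approximate figures from the text: a 1 TB drive read at ~100 MB/s.
size_mb = 1_000_000          # 1 TB expressed in MB
speed_mb_s = 100             # sequential read speed in MB/s

single_drive_hours = size_mb / speed_mb_s / 3600
print(f"one drive: {single_drive_hours:.1f} hours")        # ≈ 2.8 hours

# Spreading the same data over 100 drives read in parallel:
drives = 100
parallel_minutes = size_mb / (speed_mb_s * drives) / 60
print(f"{drives} drives: {parallel_minutes:.1f} minutes")  # ≈ 1.7 minutes
```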

DATA ANALYSIS
Since we store our data on multiple
drives in order to read and write it in
minimum time, some problems can
arise:
The first problem is hardware failure:
when we store data across many pieces
of hardware, the chance that at least
one piece will fail is very high. To solve
this problem we can keep multiple
copies of our data in the system.
The second problem is combining data:
if data is stored on one drive and some
task requires combining it with data
from other drives, this can be a big
problem, but the solution is to use the
MapReduce technique.
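The benefit of keeping multiple copies can be sketched numerically; the per-drive failure probability below is a made-up illustrative number, not a figure from the text:

```python
# Hypothetical probability that a single drive fails (illustrative only).
p_fail = 0.05

# With r independent copies, data is lost only if all r copies fail,
# so the loss probability shrinks to p_fail ** r.
for r in (1, 2, 3):
    print(f"copies={r}: loss probability {p_fail ** r:.6f}")
```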

MAP-REDUCE
It is a programming model for
distributed computing, originally
implemented in Java.
Its algorithm contains two important
tasks: map and reduce.
Map: It takes a set of data and converts
it into another set of data, where
individual elements are broken down
into tuples (key-value pairs).
Reduce: Its job is to reduce the data
according to the problem, using the
map output. That is, it collects all the
relevant tuples for the given task and
then prepares the final output.
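The two tasks can be sketched in plain Python (a toy word-count, not Hadoop's actual Java API):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: break each input line into (word, 1) tuples.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: collapse all tuples for one key into a single result.
    return (word, sum(counts))

lines = ["big data", "big hadoop"]
mapped = [pair for line in lines for pair in map_fn(line)]
# Between the phases the framework sorts and groups the tuples by key.
mapped.sort(key=itemgetter(0))
result = [reduce_fn(key, [v for _, v in group])
          for key, group in groupby(mapped, key=itemgetter(0))]
print(result)   # [('big', 2), ('data', 1), ('hadoop', 1)]
```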

Hadoop
Hadoop is an open-source framework,
written in Java, which supports the
processing and storage of very large
volumes of data in a distributed
computing environment.
It is a project of the Apache Software
Foundation.

Analyzing the data with Hadoop
Let us take the example of climate
data: in order to get the temperature
for every year, we make use of
MapReduce.
MapReduce works by breaking the
processing into two parts, a map phase
and a reduce phase. Each phase has
key-value pairs as input and output.
The input to our map phase is raw
NCDC (National Climatic Data Center)
data. We choose a text input format
that gives us each line in the dataset as
a text value.
The map function is simple: it pulls out
only the year and the temperature,
because our focus is on these two
fields. Here the map function acts as a
preparation phase, setting up the data
in such a way that the reduce function
can do its job on it, i.e. finding the
maximum temperature for each year.
Let us take an example to better
understand how MapReduce works.
0057011990999991893051507004....99
99999N9+00002+999999999.....
0033011990999991893051512004....99
99999N9+00212+999999999.....
0033011990999991893051518004....99
99999N9-00101+999999999.....
0033012650999991899032412004....05
00001N9+01122+999999999.....
0033012650999991899032418004....05
00001N9+00678+999999999.....

These lines are presented to the map
function as key-value pairs:
(000,0067011990999991893051507004.
...9999999N9+00002+999999999.....)
(105,0043011990999991893051512004.
...9999999N9+00212+999999999.....)
(222,0043011990999991893051518004.
...9999999N9-00101+999999999.....)
(381,0043012650999991899032412004.
...0500001N9+01122+999999999.....)
(432,0043012650999991899032418004.
...0500001N9+00678+999999999.....)
The keys are the line offsets within the
file, which we can ignore in our map
function. The map function's job is to
extract the year and the temperature
and emit them as output:
(1893,2)
(1893,12)
(1893,-101)
(1899,22)
(1899,78)
This output is processed by the
MapReduce framework before it is sent
to the reduce function. This step sorts
and groups the key-value pairs by key,
so the data above becomes
(1893,[2,12,-101])
(1899,[22,78])
Now the reduce function comes into
the picture, as we have a list of all the
years with their temperature readings.
The reduce function reduces the data
by taking the maximum temperature
value for each year:
(1893,12)
(1899,78)
This is the final output, which gives us
the maximum temperature for each
year.
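The walk-through above can be simulated in a few lines of Python, starting from the pairs the map function emitted (a sketch of the shuffle and reduce steps, not the actual Hadoop job):

```python
from collections import defaultdict

# Map output from the example above: (year, temperature) pairs.
mapped = [("1893", 2), ("1893", 12), ("1893", -101),
          ("1899", 22), ("1899", 78)]

# Shuffle/sort: the framework groups values by key before reduce.
grouped = defaultdict(list)
for year, temp in mapped:
    grouped[year].append(temp)

# Reduce: take the maximum temperature for each year.
result = {year: max(temps) for year, temps in sorted(grouped.items())}
print(result)   # {'1893': 12, '1899': 78}
```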

HDFS
It stands for Hadoop Distributed
File System.
It is a filesystem designed for storing
very large files.
It is a distributed filesystem designed
to store very large amounts of data,
i.e. in the range of terabytes or
petabytes, and to provide streaming
access to the data sets. In the HDFS
architecture, files are stored across
multiple nodes to ensure high
availability for parallel applications.
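HDFS achieves this by splitting each file into fixed-size blocks and replicating every block across nodes. A small calculation shows the idea; the 128 MB block size and replication factor 3 are the defaults in recent Hadoop versions, and the 1 GB file is a hypothetical example:

```python
import math

block_size_mb = 128      # default HDFS block size (recent Hadoop versions)
replication = 3          # default HDFS replication factor

file_size_mb = 1_000     # a hypothetical 1 GB file
blocks = math.ceil(file_size_mb / block_size_mb)
copies_stored = blocks * replication

print(f"{blocks} blocks, {copies_stored} block copies in the cluster")
# 8 blocks, 24 block copies
```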

PIG
Let me tell you about Pig: it is a
technology for processing data in
order to analyze it.
It provides a high-level language used
to create programs that run on Apache
Hadoop. The Pig Latin language
supports user-defined functions, which
can be written in Java, Python, etc. and
then called directly from the language.
Basically it is made up of two things:-
1- Pig Latin
It is the language in which we express
data flows.
2- Execution Environment
This is the environment in which we
run our Pig Latin programs.

There are a number of advantages of
Apache Pig, like:-
~ It helps in reusing code.
~ Faster development.
~ It uses a minimum number of lines
of code.
~ It also helps with schema and type
checking.

It has the following features:-
~ Simple programming.
~ Better optimization.
~ Extensibility -> It is called so because
it can be extended to achieve highly
specific processing tasks.

APACHE HIVE
Hive is mainly used on structured data
in Hadoop; it is a data warehouse
infrastructure tool.
Hive is not a relational database: it is
not designed for online transaction
processing, nor is it a language for
real-time queries.

Hive has some features like:-
~ It is designed for OLAP (online
analytical processing).
~ Hive projects a schema onto the data
stored in HDFS and keeps that schema
in a database (the metastore).
~ Its language, called HiveQL (HQL),
supports queries similar to SQL.
YARN
YARN stands for Yet Another Resource
Negotiator.
It is used to manage the Hadoop cluster
and is a very efficient technology. It is
sometimes called MapReduce 2. In
2012 it became a sub-project of the
larger Apache Hadoop project.
For example, alongside MapReduce
batch jobs, Hadoop clusters can
simultaneously run interactive querying
and streaming data applications.

HOW YARN WORKS
YARN makes it possible for applications
to access data and run in a Hadoop
cluster on a consistent framework. We
can say that it is a platform for
providing consistent solutions, a high
level of security, and data governance.

BASIC FEATURES OF YARN
TECHNOLOGY
* High degree of compatibility
Applications that were created using
the MapReduce framework can easily
run on YARN.
* Better cluster utilization
* Multi-tenancy
* Utmost scalability
