DATA ANALYSIS
We store our data across multiple
drives so that it can be read and
written in minimum time, but this
creates some problems:
1st problem is hardware failure:
when we store data across many pieces
of hardware, the chance that at least
one of them will fail is high. To solve
this problem we can keep multiple
copies of our data in the system.
2nd problem is combining data:
if a task needs data that is stored on
one drive to be combined with data
from other drives, this can be a big
problem. The solution is to use
MapReduce technology.
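The reasoning behind the 1st problem and
its fix can be checked with a little
arithmetic. The sketch below assumes that
drives fail independently and uses a
made-up failure probability of 2% per
drive; both are assumptions for
illustration, not measured values.

```python
def prob_any_failure(p_single, n_drives):
    """Chance that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p_single) ** n_drives


def prob_all_copies_lost(p_single, copies):
    """Chance that every copy of one piece of data is lost,
    assuming each copy sits on a different, independent drive."""
    return p_single ** copies


if __name__ == "__main__":
    p = 0.02  # hypothetical per-drive failure probability
    for n in (1, 10, 100):
        print(n, "drives:", round(prob_any_failure(p, n), 3))
    for c in (1, 2, 3):
        print(c, "copies:", prob_all_copies_lost(p, c))
```

Adding drives raises the chance that
something fails, while adding copies
lowers the chance that the data itself
is lost.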
MAP-REDUCE
It is a programming model for
distributed computing. In Hadoop its
implementation is based on Java.
Its algorithm contains two important
tasks: map and reduce.
Map: It takes a set of data and
converts it into another set of data
in which individual elements are
broken down into tuples (key/value
pairs).
Reduce: Its job is to reduce the data
produced by the map according to the
problem, i.e. it collects the relevant
tuples for the given task, combines
them, and prepares the final output.
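The two tasks can be illustrated with the
classic word-count problem. This is a
minimal single-machine sketch in plain
Python, not Hadoop's Java API; the
function names (map_phase, shuffle,
reduce_phase, word_count) are
hypothetical and chosen only for this
example. The shuffle step stands in for
the grouping by key that a framework
performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single tuple."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data lake"]))
```

In a real cluster the map calls run in
parallel on the nodes that hold the
data, and only the grouped tuples travel
over the network to the reducers.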
Hadoop
Hadoop is an open source framework
based on Java programming, which
supports processing and storage of
large volumes of data in a distributed
computing environment.
It is a project of the Apache Software
Foundation.
HDFS
It stands for Hadoop Distributed
Filesystem.
It is a distributed file system
designed to store very large files and
very large amounts of data, i.e. in the
TB/PB range, and to provide streaming
access to the data sets. In the HDFS
structure, files are split into blocks
and stored redundantly across multiple
nodes to ensure high availability for
parallel applications.
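The idea of spreading one file over
several nodes can be sketched as follows.
This is a simplified illustration, not
the HDFS implementation: the helper
names are hypothetical, the round-robin
placement is an assumption (real HDFS
placement also takes racks into
account), and the 128 MB block size and
3 copies per block are used here because
they are the commonly cited HDFS
defaults.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file is cut into."""
    count = (file_size + block_size - 1) // block_size
    return [min(block_size, file_size - i * block_size)
            for i in range(count)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested copies")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


if __name__ == "__main__":
    MB = 1024 * 1024
    blocks = split_into_blocks(300 * MB, 128 * MB)   # a 300 MB file
    layout = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
    for block, holders in layout.items():
        print("block", block, blocks[block] // MB, "MB ->", holders)
```

Because every block lives on several
nodes, the file stays readable when one
node fails, and different blocks can be
read in parallel.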
PIG
Pig is a technology for working with
data in order to analyze it.
It provides a high-level language which
is used to create programs that run on
Apache Hadoop. Pig Latin programs can
use user-defined functions, which can
be written in Java, Python, etc. and
then called directly from the language.
Basically it is made up of two things:
1- Pig Latin
It is the language in which we express
data flows.
2- Execution Environment
This is the environment in which we
run our Pig Latin programs.
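A data flow here means a chain of steps
in which each step takes a set of
records and produces a new one. The
sketch below imitates such a flow in
plain Python; it is not Pig Latin and
uses no Pig API. The records and
function names are made up for
illustration, and the three steps are
only loosely modelled on Pig Latin's
FILTER, GROUP and FOREACH operators.

```python
from collections import defaultdict

# Hypothetical records: (user, page, seconds_on_page)
VISITS = [
    ("ann", "home", 30),
    ("bob", "home", 5),
    ("ann", "docs", 60),
    ("bob", "docs", 45),
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """Collect rows into a dict of lists, keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def for_each(groups, summarize):
    """Produce one output value per group."""
    return {name: summarize(rows) for name, rows in groups.items()}


def total_time_per_user(rows, min_seconds=10):
    """Flow: drop short visits, group by user, sum the seconds."""
    kept = filter_rows(rows, lambda r: r[2] >= min_seconds)
    grouped = group_by(kept, lambda r: r[0])
    return for_each(grouped, lambda rs: sum(r[2] for r in rs))


if __name__ == "__main__":
    print(total_time_per_user(VISITS))
```

In Pig the same kind of flow is written
in Pig Latin, and the execution
environment decides how to run it.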
APACHE HIVE
Hive is a data warehouse infrastructure
tool, mainly used on structured data
in Hadoop.
Hive is not a relational database, it
is not designed for online transaction
processing, and it is not a language
for real-time queries.