DATA ANALYSIS
Bogdan Oancea
"Nicolae Titulescu" University of Bucharest
Raluca Mariana Dragoescu
The Bucharest University of Economic Studies
BIG DATA
The term big data is defined as data sets of
increasing volume, velocity and variety (the 3Vs);
Big data sizes range from a few hundred
terabytes to many petabytes of data in a single
data set;
Processing them requires high computing power and
large storage devices.
Sources of big data:
Administrative data;
Commercial or transactional data;
Data provided by sensors;
Data provided by tracking devices;
Behavioral data (for example Internet searches);
Data provided by social media.
Challenges:
legislative issues;
maintaining the privacy of the data;
financial problems regarding the cost of sourcing
data;
data quality and suitability of statistical methods;
technological challenges.
R
R is a free software environment for statistical
computing and data visualization;
It is available for UNIX, Windows and macOS;
It is used as a computational platform for
regular statistics production in many official
statistics agencies;
It is also used in many other sectors, such as finance,
retail and manufacturing.
HADOOP
Hadoop is a free software framework for
distributed processing of large data sets on
clusters of commodity hardware;
It is written in Java;
Other languages can be used too: R, Python or
Ruby;
Available at http://hadoop.apache.org/.
HADOOP
High degree of scalability;
Cost effectiveness: it allows massively parallel
computing on commodity hardware;
Flexibility: it can use any type of data, structured
or not;
Fault tolerance.
R AND HADOOP
R and Streaming;
Rhipe;
RHadoop.
R AND STREAMING
Hadoop Streaming allows users to run Map/Reduce jobs with any
script or executable that can read standard
input and write standard output;
No client-side integration with R.
R AND HADOOP
Integrating R and Hadoop through
Streaming is an easy task;
It requires R to be installed on every
DataNode of the Hadoop cluster.
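To make the standard input/standard output contract concrete, here is a minimal word-count sketch in plain R. It is not taken from the slides: the function names map_lines and reduce_lines are hypothetical, and the tab-separated "key<TAB>value" line format is the usual Hadoop Streaming convention, assumed here.

```r
# Mapper logic: turn raw text lines into "word<TAB>1" lines.
map_lines <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic: input lines are "word<TAB>count"; Hadoop delivers them
# grouped by key, so summing per key gives the final counts.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts  <- strsplit(lines, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1)
  counts <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, keys, sum)    # one total per distinct key
  paste(names(totals), totals, sep = "\t")
}

# In a real streaming script each function would be wired to stdin/stdout:
#   writeLines(map_lines(readLines(file("stdin"))))

# Local dry run; sort() stands in for Hadoop's shuffle-and-sort step.
mapped  <- map_lines(c("R and Hadoop", "Hadoop and R"))
reduced <- reduce_lines(sort(mapped))
print(reduced)
```

On a cluster, the two scripts would be handed to the Hadoop Streaming jar through its -mapper and -reducer options; the local dry run only checks the logic.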
RHIPE
Rhipe = R and Hadoop Integrated
Programming Environment;
Provides a tight integration between R and
Hadoop;
Allows the user to carry out data analysis of big
data directly in R;
Available at www.datadr.org.
RHIPE
Rhipe is an R library which allows running a
MapReduce job within R;
Install requirements:
RHIPE: AN EXAMPLE
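library(Rhipe)
rhinit(TRUE, TRUE)   # initialise Rhipe

# Map: apply a user-defined function to every input value
map <- expression({
  lapply(map.values, function(v) {
    # process one input value; emit results with rhcollect(key, value)
  })
})

# Reduce: pre runs before the values of a key arrive, reduce runs on
# each batch of values (reduce.values), post runs after the last batch
reduce <- expression(
  pre    = { },
  reduce = { },
  post   = { }
)

# inputPath and outputPath are placeholders for HDFS folders
z <- rhmr(map = map, reduce = reduce,
          ifolder = inputPath,
          ofolder = outputPath,
          inout = c('text', 'text'),
          jobname = 'a job name')
rhex(z)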
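The pre / reduce / post structure of the reduce expression can be tried out without a cluster. The sketch below is a local emulation in base R, not Rhipe itself: run_reduce_locally is a hypothetical helper, its rhcollect is only a stand-in for the Rhipe function of the same name, and the batch size of 2 is an arbitrary assumption used to show that the reduce step may run several times per key.

```r
# A reduce expression with the same pre/reduce/post shape: sum all values of a key.
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

# Hypothetical local driver: evaluates the three phases for ONE key.
run_reduce_locally <- function(reduce, key, values, batch_size = 2) {
  out <- list()
  env <- new.env()
  # Stand-in for Rhipe's rhcollect(): record the emitted key/value pair.
  env$rhcollect <- function(k, v) {
    out[[length(out) + 1]] <<- list(key = k, value = v)
  }
  env$reduce.key <- key
  eval(reduce[["pre"]], envir = env)             # once, before any values
  n <- length(values)
  for (b in seq_len(ceiling(n / batch_size))) {  # once per batch of values
    idx <- ((b - 1) * batch_size + 1):min(b * batch_size, n)
    env$reduce.values <- values[idx]
    eval(reduce[["reduce"]], envir = env)
  }
  eval(reduce[["post"]], envir = env)            # once, after the last batch
  out
}

result <- run_reduce_locally(reduce, "hadoop", list(1, 1, 1, 1, 1))
print(result[[1]]$value)
```

Because the running total lives in a variable set up in pre, the result does not depend on how the values are split into batches.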
RHADOOP
RHadoop is an open source project developed by
Revolution Analytics;
It allows running MapReduce jobs within R, just like
Rhipe;
Consists of:
RHADOOP: AN EXAMPLE
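library(rmr)

# Map: called with an input key/value pair; returns keyval(key, value) pairs
map <- function(k, v) { }

# Reduce: called once per key with the list vv of all values for that key
reduce <- function(k, vv) { }

# "data.txt" and "output" are placeholder paths
mapreduce(input = "data.txt",
          output = "output",
          textinputformat = rawtextinputformat,
          map = map,
          reduce = reduce)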
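The map(k, v) / reduce(k, vv) contract can be illustrated with a small in-memory word count. This is a sketch in base R under stated assumptions, not the rmr implementation: kv is a hypothetical stand-in for rmr's keyval, and local_mapreduce is a hypothetical driver that groups mapped pairs by key before calling reduce.

```r
# Hypothetical stand-in for keyval(): a key/value pair stored as a list.
kv <- function(k, v) list(key = k, val = v)

# Map: one (word, 1) pair per word of the input line v; the key k is unused.
map <- function(k, v) {
  words <- strsplit(tolower(v), "[^a-z]+")[[1]]
  lapply(words[nzchar(words)], function(w) kv(w, 1))
}

# Reduce: vv holds every value emitted for key k.
reduce <- function(k, vv) kv(k, sum(unlist(vv)))

# Hypothetical local driver: map every line, group by key, reduce each group.
local_mapreduce <- function(lines, map, reduce) {
  pairs <- unlist(lapply(lines, function(l) map(NULL, l)), recursive = FALSE)
  keys  <- vapply(pairs, function(p) p$key, character(1))
  vals  <- lapply(pairs, function(p) p$val)
  lapply(sort(unique(keys)), function(k) reduce(k, vals[keys == k]))
}

res <- local_mapreduce(c("R and Hadoop", "Hadoop and R"), map, reduce)
counts <- setNames(vapply(res, function(p) p$val, numeric(1)),
                   vapply(res, function(p) p$key, character(1)))
print(counts)
```

Unlike the batched reduce phase in Rhipe, the reduce function here receives all values of a key in a single call.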
CONCLUSIONS