1. Introduction:
Big data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate. Challenges include analysis,
search, sharing, storage, transfer, visualization, and information privacy. The
term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set.
Analysis of data sets can reveal new correlations to "spot business trends,
prevent diseases, and combat crime," among other uses. Scientists, practitioners
of media and advertising, and governments alike regularly encounter difficulties
with large data sets in areas including Internet search, finance, and business
informatics. Scientists also face limitations in e-Science work, including
meteorology, genomics, complex physics simulations, and biological and
environmental research.
1.1. Advantages:
Scalable: Hadoop is a highly scalable storage platform, because it can
store and distribute very large data sets across hundreds of inexpensive servers
that operate in parallel. Unlike traditional relational database systems (RDBMS)
that cannot scale to process large amounts of data, Hadoop enables businesses to
run applications on thousands of nodes involving thousands of terabytes of data.
Cost Effective: Hadoop also offers a cost-effective storage solution for
businesses' exploding data sets. The problem with traditional relational
database management systems is that scaling them to process such massive volumes
of data is extremely cost-prohibitive. Hadoop, on the other hand, is designed as
a scale-out architecture that can affordably store all of a company's data for
later use.
Flexible: Hadoop enables businesses to easily access new data sources
and tap into different types of data (both structured and unstructured) to
generate value from that data. This means businesses can use Hadoop to derive
valuable business insights from data sources such as social media, email
conversations or clickstream data.
Figure: data from multiple sources (Source 1, Source 2, Source 3) flows into a
data warehouse, is processed by statistical algorithms, and is presented as
charts.
1.2. Disadvantages:
1. Data Sampling: The data comes from a large number of sources, and we
need to supply only specific attributes of the data for analysis purposes. If we
passed the whole dataset, the analysis would take much longer to process.
2. Data Cleaning: The data comes from various sources, so it is not
structured. First we have to clean the data, for example by removing nulls and
removing redundant attribute values.
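The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Hadoop pipeline; the record fields ("user_id", "age", "country", "session_token") are hypothetical examples.

```python
# Hypothetical raw records arriving from different sources.
raw_records = [
    {"user_id": 1, "age": 34, "country": "US", "session_token": "abc"},
    {"user_id": 2, "age": None, "country": "IN", "session_token": "def"},
    {"user_id": 3, "age": 29, "country": "US", "session_token": "ghi"},
    {"user_id": 3, "age": 29, "country": "US", "session_token": "ghi"},  # duplicate
]

# 1. Data sampling: keep only the specific attributes needed for the analysis.
wanted = ("user_id", "age", "country")
projected = [{k: r[k] for k in wanted} for r in raw_records]

# 2. Data cleaning: remove records with null values, then remove exact duplicates.
non_null = [r for r in projected if all(v is not None for v in r.values())]
seen, cleaned = set(), []
for r in non_null:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

print(cleaned)  # only user_id 1 and user_id 3 survive cleaning
```

In a real big-data setting these steps would run inside the distributed framework itself, but the logic is the same: project the needed attributes first, then drop null and redundant records before analysis.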
2. Literature Survey
1. MapReduce: Simplified Data Processing on Large Clusters:
This is the original paper, written by Google employees. It presents the
basic programming model for implementing Big Data processing and shows how
Hadoop and MapReduce work with each other.
Programming Model:
The computation takes a set of input key/value pairs, and produces a set
of output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes
them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate
key I and a set of values for that key. It merges together these values to form a
possibly smaller set of values. Typically just zero or one output value is
produced per Reduce invocation. The intermediate values are supplied to the
user's reduce function via an iterator.
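The programming model above can be sketched in plain Python using the canonical word-count example. This is an illustrative single-process sketch of the model described in the paper, not Google's actual library: `map_fn` emits intermediate (word, 1) pairs, the "library" groups values by intermediate key, and `reduce_fn` receives each key with an iterator over its values.

```python
from collections import defaultdict

# Map: takes an input pair (document name, contents) and produces a set of
# intermediate key/value pairs -- here, (word, 1) for each word.
def map_fn(name, contents):
    for word in contents.split():
        yield (word, 1)

# Reduce: accepts an intermediate key and an iterator over its values, and
# merges them into a possibly smaller set of values (here, a single sum).
def reduce_fn(word, counts):
    yield sum(counts)

# The "library" part: group all intermediate values by key, then call the
# user's reduce function once per key, passing the values via an iterator.
def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {key: list(reduce_fn(key, iter(values)))
            for key, values in groups.items()}

result = map_reduce(
    [("doc1", "big data big ideas"), ("doc2", "big clusters")],
    map_fn, reduce_fn,
)
print(result)  # {'big': [3], 'data': [1], 'ideas': [1], 'clusters': [1]}
```

In the real system the grouping step is the distributed shuffle: map tasks run in parallel across the cluster, intermediate pairs are partitioned by key, and reduce tasks each process a disjoint set of keys.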