
Big Data: Analysis of Large Data Sets

1. Introduction:
Big data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate. Challenges include analysis,
search, sharing, storage, transfer, visualization, and information privacy. The
term often refers simply to the use of predictive analytics or other
advanced methods to extract value from data, and seldom to a particular size of
data set.
Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on." Scientists, practitioners of
media and advertising, and governments alike regularly meet difficulties with
large data sets in areas including Internet search, finance and business
informatics. Scientists encounter limitations in e-Science work, including
meteorology, genomics, complex physics simulations and biological and
environmental research.
1.1. Advantages of Hadoop:
Scalable: Hadoop is a highly scalable storage platform because it can
store and distribute very large data sets across hundreds of inexpensive servers
that operate in parallel. Unlike traditional relational database management
systems (RDBMS), which cannot scale to process large amounts of data, Hadoop
enables businesses to run applications on thousands of nodes involving
thousands of terabytes of data.
Cost Effective: Hadoop also offers a cost-effective storage solution for
businesses' exploding data sets. The problem with traditional relational database
management systems is that it is extremely cost-prohibitive to scale to the
degree needed to process such massive volumes of data. Hadoop, on the other
hand, is designed as a scale-out architecture that can affordably store all of a
company's data for later use.
Flexible: Hadoop enables businesses to easily access new data sources
and tap into different types of data (both structured and unstructured) to
generate value from that data. This means businesses can use Hadoop to derive
valuable business insights from data sources such as social media, email
conversations or clickstream data.

Resilient to failure: A key advantage of using Hadoop is its fault
tolerance. When data is sent to an individual node, that data is also replicated to
other nodes in the cluster, which means that in the event of failure, there is
another copy available for use.
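To make the replication behaviour concrete, the short Java sketch below shows one way a client might request a replication factor of three using Hadoop's standard Configuration and FileSystem APIs. The factor of 3 and the file path are illustrative assumptions, not values taken from this project.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: requesting three replicas of every block through the
// standard HDFS client API. The property name "dfs.replication" is the
// usual HDFS setting; the factor and the file path are assumptions.
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // new files get 3 replicas per block

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing (hypothetical) file.
        fs.setReplication(new Path("/data/purchases.txt"), (short) 3);
        fs.close();
    }
}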
1.2 Disadvantages of Hadoop:
Not Fit for Small Data: Due to its high-capacity design, the Hadoop
Distributed File System (HDFS) lacks the ability to efficiently support the
random reading of small files. As a result, it is not recommended for
organizations with small quantities of data.
Security Concerns: A classic example can be seen in the Hadoop security
model, which is disabled by default due to its sheer complexity. If whoever is
managing the platform lacks the know-how to enable it, your data could be at
huge risk. Hadoop also lacks encryption at the storage and network levels.
Potential Stability Issues: Hadoop is an open-source platform, which
essentially means it is created by the contributions of the many developers who
continue to work on the project. While improvements are constantly being
made, like all open-source software, Hadoop has had its fair share of stability
issues. To avoid these issues, organizations are strongly recommended to make
sure they are running the latest stable version, or to run it under a third-party
vendor equipped to handle such problems.
1.3 Existing Technology:
Data Warehousing: In computing, a data warehouse (DW or DWH), also
known as an enterprise data warehouse (EDW), is a system used for reporting
and data analysis. DWs are central repositories of integrated data from one or
more disparate sources. They store current and historical data and are used for
creating trending reports for senior management, such as annual and
quarterly comparisons.
The data stored in the warehouse is uploaded from the operational
systems (such as marketing, sales, etc., shown in the figure below). The
data may pass through an operational data store for additional operations before
it is used in the DW for reporting.

[Figure: data warehouse pipeline with components Source 1, Source 2, Source 3, Data cleaning and processing, Data warehouse, Statistical Algorithms, and Charts.]

Disadvantages:
1. Data Sampling: The data comes from a large number of sources, and we
need to feed only specific attributes from the data into the analysis. If we
gave the whole dataset, the analysis would take much longer to process.
2. Data Cleaning: The data comes from various sources, so it is not
structured. First we have to clean the data, for example by removing nulls and
redundant attribute values (a minimal sketch in Java follows below).
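The sketch below is a minimal, hypothetical illustration of this cleaning step in Java: it drops records with empty or "null" attribute values and removes exact duplicates. The file names, the comma delimiter and the six-attribute layout are assumptions; in practice this work is usually done by the warehouse's ETL tooling.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical cleaning pass: drop records with missing ("null" or empty)
// attribute values and remove exact duplicate records. The file names, the
// comma delimiter and the six-attribute layout are assumptions.
public class DataCleaner {
    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Paths.get("raw_data.csv"));

        Set<String> cleaned = rows.stream()
                .filter(r -> !r.isBlank())                          // skip empty lines
                .filter(r -> {
                    String[] fields = r.split(",");
                    return fields.length == 6
                            && Arrays.stream(fields)
                                     .noneMatch(f -> f.isBlank()
                                             || f.trim().equalsIgnoreCase("null"));
                })
                .collect(Collectors.toCollection(LinkedHashSet::new)); // drops duplicates

        Files.write(Paths.get("cleaned_data.csv"), cleaned);
    }
}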

1.4. Methodology (DSS):


A Decision Support System (DSS) is a computer-based information
system that supports business or organizational decision-making activities.
DSSs serve the management, operations, and planning levels of an organization
(usually mid and higher management) and help to make decisions, which may
be rapidly changing and not easily specified in advance (unstructured and
semi-structured decision problems). Decision support systems can be fully
computerized, human-powered, or a combination of both.

DSS components may be classified as:


1. Inputs: Factors, numbers, and characteristics to analyse
2. User Knowledge and Expertise: Inputs requiring manual analysis by the
user
3. Outputs: Transformed data from which DSS "decisions" are generated
4. Decisions: Results generated by the DSS based on user criteria
Big Data as a Solution:
Big data is a collection of data sets so large and complex that it
becomes difficult to process them using existing database tools.
1. Big data technology solves the problem of handling large datasets; in
today's world, the size of data has grown from gigabytes to petabytes.
2. As the data comes from different sources, the format of the data is
not consistent every time, which is why it is complex.
3. Current database tools are not able to process such data within certain
time limits.

1.5 Objective to solve:


In this project, my objective is to take a large dataset and process it
in a considerably smaller time than traditional processing.
1. I am considering a file containing half a million lines of purchase
history recorded by the server. Each line has the date, time, location, item name,
purchase cost, and purchase method.
2. Using this dataset, my job is to calculate the total number of sales and the total
sales value from all the stores.
3. I will demonstrate this project in two steps:
A. Calculating the time required to process the data by traditional
processing, without using Hadoop and MapReduce.
B. In the real implementation part, calculating the time required to process
the data using Hadoop and MapReduce.
Observations:
Time required to compute the results without using Hadoop and MapReduce:
For traditional processing, the data is stored in an RDBMS; when the
stored data is large, it takes a significant amount of time to process.
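As a rough illustration of this baseline, the sketch below times a single aggregate query against an RDBMS using plain JDBC. The connection URL, credentials, table name and column name are hypothetical placeholders, not details from the actual implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Baseline (step A): total sales count and value computed in a single RDBMS,
// timed end to end. The JDBC URL, credentials, table name and column name
// are hypothetical placeholders.
public class RdbmsSalesTotal {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/salesdb";
        long start = System.currentTimeMillis();

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*), SUM(purchase_cost) FROM purchases")) {
            if (rs.next()) {
                System.out.printf("Total sales: %d, total value: %.2f%n",
                        rs.getLong(1), rs.getDouble(2));
            }
        }

        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
    }
}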
Time required to compute the results using Hadoop and MapReduce:
Here the data is stored in files, and Hadoop divides the files into
small chunks, each of which is processed in parallel.
The system I am using is a dual-core system with 2 GB of main memory. As
I am using the free version provided by Cloudera, the system is a single-cluster,
single-node setup, so the performance is not as good as that of a multi-cluster
system, but there is still a performance improvement over traditional
processing.
After the actual implementation, I will display the comparison
between the times taken by the two methods.

2. Literature Survey
1. MapReduce: Simplified Data Processing on Large Clusters:
This is the original paper, written by Google employees. It
demonstrates the basic model for implementing Big Data processing, and it
shows how Hadoop and MapReduce work with each other.
Programming Model:
The computation takes a set of input key/value pairs, and produces a set
of output key/value pairs. The user of the MapReduce library expresses the
computation as two functions: Map and Reduce.
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups together all
intermediate values associated with the same intermediate key I and passes
them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate
key I and a set of values for that key. It merges together these values to form a
possibly smaller set of values. Typically just zero or one output value is
produced per Reduce invocation. The intermediate values are supplied to the
user's reduce function via an iterator.
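To connect this model to the purchase data from section 1.5, the sketch below shows hypothetical Map and Reduce functions written against Hadoop's Java MapReduce API: the mapper emits (store location, purchase cost) pairs and the reducer sums the costs for each store. The field positions are assumptions, and the job driver (Job setup, input and output paths) is omitted for brevity.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of Map and Reduce for the purchase data described in section 1.5.
// Assumes comma-separated lines with the store location in field 3 and the
// purchase cost in field 5 (zero-based indices 2 and 4).
public class SalesTotals {

    public static class SalesMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 6) {
                String location = fields[2].trim();
                double cost = Double.parseDouble(fields[4].trim());
                // Intermediate pair: (store location, purchase cost)
                context.write(new Text(location), new DoubleWritable(cost));
            }
        }
    }

    public static class SalesReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text location, Iterable<DoubleWritable> costs,
                              Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable cost : costs) {   // values arrive via an iterator
                total += cost.get();
            }
            // One output value per store: its total sales value
            context.write(location, new DoubleWritable(total));
        }
    }
}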
