
BIG DATA MINING: A CHALLENGE AND HOW TO MANAGE IT

DINESH, CSA Deptt., PDMCE, dineshgomber@gmail.com
JITENDER, CSA Deptt., PDMCE, Jitender_13@rediffmail.com
ABSTRACT

Big Data is a term used to identify datasets that, due to their large size and complexity, cannot be managed with our current methodologies or data mining software tools. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which, due to their volume, variability, and velocity, was not possible before. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. We present a broad overview of the topic and of the current status of Big Data mining, and discuss the challenges and tools for managing heterogeneous information, a frontier in Big Data mining research.

INTRODUCTION

Data Mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

In many applications, data collection has grown tremendously and is beyond the capability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions. In many situations, the knowledge extraction process has to be very efficient and close to real time, because storing all observed data is nearly infeasible.

DATA MINING AND BIG DATA

Big data and data mining are two different things. Both relate to the use of large data sets to handle the collection or reporting of data that serves businesses or other recipients. However, the two terms are used for two different elements of this kind of operation.

Big data is a term for a large data set. Big data sets are those that outgrow the simple kinds of database and data-handling architectures that were used in earlier times, when big data was more expensive and less feasible. For example, sets of data that are too large to be easily handled in a Microsoft Excel spreadsheet could be referred to as big data sets.

Data mining refers to the activity of going through big data sets to look for relevant or pertinent information. This type of activity is a good example of the old axiom "looking for a needle in a haystack." The idea is that businesses collect massive sets of data that may be homogeneous or automatically collected. Decision-makers need access to smaller, more specific pieces of data from those large sets. They use data mining to uncover the pieces of information that will inform leadership and help chart the course for a business.

Data mining can involve the use of different kinds of software packages, such as analytics tools. It can be automated, or it can be largely labor-intensive, where individual workers send specific queries for information to an archive or database. Generally, data mining refers to operations that involve relatively sophisticated searches that return targeted and specific results. For example, a data mining tool may look through dozens of years of accounting information to find a specific column of expenses or accounts receivable for a specific operating year.

In short, big data is the asset, and data mining is the "handler" that is used to provide beneficial results.

DATA MINING CHALLENGES WITH BIG DATA

The origin of the term Big Data is due to the fact that we are creating a huge amount of data every day. Data is being produced at an ever-increasing rate. There has also been an acceleration in the proportion of machine-generated and unstructured data (photos, videos, social media feeds and so on) compared to structured data, such that 80% or more of all data holdings are now unstructured, and new approaches and technologies are required to access, link, manage and gain insight from these data sets. The main challenges are:

- Volume: there is more data than ever before and its size continues to increase, but not the percentage of data that our tools can process.

- Variety: there are many different types of data, such as text, sensor data, audio, video, graphs, and more.

- Velocity: data is arriving continuously as streams, and we are interested in obtaining useful information from it in real time.

- Visualization: a main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed.

- Hidden Big Data: large quantities of useful data are getting lost, since much new data is untagged, file-based and unstructured. The 2012 IDC study on Big Data [10] explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

TOOLS: OPEN SOURCE REVOLUTION

The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter and LinkedIn benefit from and contribute to work on open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:

- Apache Hadoop [1]: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the map tasks to obtain the final result of the job.

- Apache Hadoop related projects [2]: Apache Pig, Apache Hive, Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading, Scribe and many others.

- Apache S4 [3]: platform for processing continuous data streams. S4 is designed specifically for managing data streams. S4 apps are designed by combining streams and processing elements in real time.

- Storm [4]: software for streaming data-intensive distributed applications, similar to S4, developed by Nathan Marz at Twitter.

In Big Data mining, there are many open source initiatives. The most popular are the following:

- Apache Mahout [5]: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

- R [6]: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

- MOA [7]: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework [6] provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm. SAMOA [1] is a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
- Vowpal Wabbit [8]: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets. It can exceed the throughput of any single machine's network interface when doing linear learning, via parallel learning.

More specific to Big Graph mining, we found the following open source tools:

- Pegasus [9]: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs. See the paper by U. Kang and Christos Faloutsos in this issue.
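To make the MapReduce flow described for Hadoop concrete, here is a minimal single-process sketch in Python that imitates the map, shuffle and reduce phases of a word-count job. The function names and the toy input are our own illustration, not code from Hadoop or any of the tools above; a real job would run the phases on separate cluster nodes.

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: each input split is processed independently (in Hadoop,
    # by parallel map tasks) and emits intermediate (key, value) pairs.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle step: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce step: each group is combined into one final value per key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data mining finds patterns in data"]
counts = reduce_phase(map_phase(docs))
print(counts["data"])  # 3
print(counts["big"])   # 2
```

Because the map step has no shared state, the input splits can be mapped in any order or in parallel, which is what lets Hadoop scale the same logic across large clusters.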

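Several of the stream tools above (S4, Storm, MOA) mine data in a single pass because storing all observed data is infeasible. As a rough illustration of that style, and not code from any listed project, the following sketch implements the classic Misra-Gries frequent-items summary in plain Python: it approximates frequent item counts using bounded memory regardless of stream length.

```python
def misra_gries(stream, k):
    # Single-pass frequent-items summary keeping at most k - 1 counters.
    # Any item occurring more than len(stream) / k times is guaranteed
    # to remain among the surviving candidates.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = misra_gries(stream, k=3)
print(summary)  # "a" occurs 5 of 9 times, so it is guaranteed to survive
```

The memory bound (k - 1 counters) is fixed up front, which is the property that makes this family of algorithms suitable for the real-time, high-velocity setting discussed in the challenges section.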
CONCLUSION

Due to the increasing size of data day by day, Big Data is going to continue growing during the next years, becoming one of the most exciting opportunities of the future. Today we have an overwhelming growth of data in terms of volume, velocity and variety, so from a security and privacy standpoint, the threat landscape and the security and privacy risks have also seen unprecedented growth. We are now in a new era, where Big Data mining will help us discover knowledge that no one has discovered before. In order to explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data.

REFERENCES

[1] Apache Hadoop, http://hadoop.apache.org.

[2] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Companies, Incorporated, 2011.

[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170-177, 2010.

[4] Storm, http://storm-project.net.

[5] Apache Mahout, http://mahout.apache.org.

[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[7] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.

[8] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6, 2001.

[9] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.

[10] J. Gantz and D. Reinsel. IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
