
Big Data Mining Technologies

By Kushagra Trivedi

Contents
Introduction to big data and big data mining
Apache Hadoop for big data mining
Apache S4 for big data mining
Apache Mahout for machine learning
Some other tools of machine learning and data mining
Comparison of big data mining technologies
Conclusion
References

Introduction To Big Data And Big Data Mining


Data of large volume and high complexity
Definition of big data
Sources of data expansion
Definition of data mining
Why data mining is necessary
Some of the technologies used for data mining

Apache Hadoop
Data-intensive distributed architecture
Centralized server vs. distributed server
MapReduce and the Hadoop Distributed File System (HDFS)
HDFS divides data into blocks and distributes them among nodes
Lets developers write applications that rapidly process large amounts of data in parallel on large clusters of compute nodes
Applications: Yahoo, Facebook and other Fortune 50 companies use Apache Hadoop

Hadoop Distributed File System

Figure 1: HDFS architecture. The NameNode holds metadata; DataNodes 1-4 hold the actual data blocks (b1, b2, b3), with each block replicated on multiple DataNodes.

Cont.

NameNode maintains all metadata about the DataNodes
DataNodes contain the actual data blocks
HDFS distributes and replicates data blocks among the DataNodes
When a client executes a query, it goes to the NameNode, which locates the actual data by looking at the metadata
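The block-splitting and metadata flow above can be sketched in a few lines of Python. The block size, node names and round-robin placement policy here are illustrative assumptions, not HDFS's real placement algorithm:

```python
# Minimal sketch of HDFS-style block splitting and replica placement.
# BLOCK_SIZE, DATANODES and the round-robin policy are illustrative
# assumptions; HDFS uses 128 MB blocks and rack-aware placement.

BLOCK_SIZE = 4          # bytes, tiny for demonstration
DATANODES = ["dn1", "dn2", "dn3", "dn4"]
REPLICATION = 3

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes=DATANODES, replication: int = REPLICATION):
    """Assign each block id to `replication` distinct DataNodes.

    This mapping is the kind of metadata the NameNode keeps;
    the DataNodes hold the actual bytes.
    """
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
metadata = place_replicas(len(blocks))
```

A real NameNode also tracks DataNode heartbeats and re-replicates blocks when a node dies; this sketch only shows the metadata mapping a client would consult.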

MapReduce Algorithm

Figure 2: MapReduce distribution [2]. A word-count example: input sentences about Austin Powers and the league of evil villains are split across four MAP tasks; each MAP emits an intermediate (word, 1) pair per word, GROUP brings identical words together, and REDUCE sums the counts, giving Austin 4, Powers 4, defeated 4, the 4, league 4, of 4, evil 4, villains 4, If 1 and Wheres 1.
Cont.

Uses two functions: map and reduce
Data are fed into the map function to produce intermediate (key, value) pairs
The intermediate results are then given to the reduce function to produce the final result
TaskTracker: does the work assigned by the JobTracker
JobTracker: if a TaskTracker fails, its tasks are reallocated to another TaskTracker
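The map/group/reduce flow described above can be sketched as plain Python. This shows only the algorithm, not Hadoop's Java MapReduce API:

```python
# Word-count sketch of the map, group and reduce steps.
from itertools import groupby

def map_fn(line):
    # Emit an intermediate (key, value) pair for every word.
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    # Combine all intermediate values for one key into the final result.
    return word, sum(counts)

def mapreduce(lines):
    intermediate = [pair for line in lines for pair in map_fn(line)]
    intermediate.sort()  # the "group" step: bring equal keys together
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(intermediate, key=lambda kv: kv[0]))

counts = mapreduce(["Austin Powers defeated the league of evil villains",
                    "the league of evil villains"])
```

In Hadoop the map and reduce calls run on different cluster nodes and the sort/group step happens during the shuffle phase between them.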

Apache S4

S4 stands for Simple Scalable Streaming System
Uses MapReduce and the Actor model for computation
Data processing is done through processing elements
The S4 framework provides a way to create processing elements and route events to them as needed
Applications: Yahoo, LinkedIn, A9 and Quantbench are among the companies using Apache S4 for big data mining

Cont.

Figure 3: S4 word count sample [6]

Cont.
Processing elements are the basic computational units
A processing element executes only the events matching the key for which it was created
A special keyless processing element accepts any type of input
Processing nodes are logical hosts of processing elements
S4 routes events to processing nodes based on a hash value of the keyed attributes in those events
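The hash-based routing of keyed events can be sketched as follows. The node count and the use of `zlib.crc32` are illustrative assumptions, not S4's actual hash function:

```python
# Sketch of S4-style keyed routing: an event is routed to a processing
# node by hashing its keyed attribute, so all events sharing a key value
# reach the same keyed processing element.
import zlib

NUM_NODES = 4  # illustrative cluster size

def route(event: dict, key_attr: str) -> int:
    """Return the processing-node index for an event's keyed attribute."""
    value = str(event[key_attr]).encode()
    return zlib.crc32(value) % NUM_NODES

# Two events with the same key land on the same node.
e1 = {"word": "evil", "count": 1}
e2 = {"word": "evil", "count": 7}
assert route(e1, "word") == route(e2, "word")
```

This is why S4 needs no explicit data segmentation: the stream partitions itself by key as events arrive.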

Apache Mahout
An open-source project of the Apache Foundation that lets programmers write machine learning algorithms
Works on three families of algorithms: clustering, classification and collaborative filtering
Includes several distributed clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy
Applications: recommending products you might want to buy, people you might want to connect with, potential life partners and songs you might like

Cont.
1) Building a recommendation engine
Currently provides the Taste library for building recommendation engines
The library supports both user-based and item-based recommendations
Five preliminary components: DataModel, UserSimilarity, ItemSimilarity, Recommender, UserNeighborhood
Users can develop applications that give online and offline recommendations using these components
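The roles of the Taste components can be illustrated with a plain-Python user-based recommender. The ratings data and the cosine-similarity choice are illustrative assumptions; Taste itself is a Java library:

```python
# Plain-Python sketch of the user-based recommendation flow: a data model
# of user->item preferences, a user similarity, and a recommender.
import math

ratings = {  # plays the role of the DataModel component
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 5.0, "b": 3.0, "d": 4.0},
    "carol": {"b": 1.0, "d": 2.0},
}

def similarity(u, v):
    """UserSimilarity role: cosine similarity over co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user, k=1):
    """Recommender role: score unseen items by similarity-weighted ratings."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        s = similarity(user, other)  # a UserNeighborhood would cut this off
        for item, pref in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + s * pref
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

An item-based recommender works the same way but with an ItemSimilarity over columns instead of rows of the preference matrix.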

Cont.
2) Clustering with Apache Mahout
Clustering algorithms are written using MapReduce
Canopy, k-Means, Mean-Shift and Dirichlet are the available clustering algorithms
Select the data and convert it into a numerical representation
Select any one of the above algorithms
Evaluate the result
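The convert/cluster/evaluate steps above can be sketched with a tiny single-machine k-Means. Mahout runs the same algorithm as MapReduce jobs over HDFS; the 1-D data and k=2 here are illustrative:

```python
# Minimal 1-D k-Means: assign each point to its nearest center,
# then recompute each center as the mean of its cluster.
import random

def kmeans(points, k=2, iters=10, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[i].append(p)
        # Update step (keep the old center if a cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Numerical representation of the data ("convert it into numbers")
data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans(data)
```

In Mahout's MapReduce version, the assignment step is the map phase and the per-cluster mean computation is the reduce phase of each iteration.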

Cont.
3) Categorizing content with Mahout
Two approaches for categorizing: the Naive Bayes classifier and the complementary Naive Bayes classifier
One part of the Naive Bayes process keeps track of the words associated with a particular document and category
The second part predicts the category of new content using the information from part one
The complementary Naive Bayes classifier is similar to the Naive Bayes approach but simpler
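The two-part process can be sketched in Python. The training texts and the add-one smoothing are illustrative assumptions, not Mahout's implementation:

```python
# Part one counts words per category; part two predicts with those counts.
from collections import Counter
import math

train = [("sports", "goal match team win"),
         ("sports", "team score match"),
         ("tech", "code bug compile run"),
         ("tech", "compile code release")]

# Part one: track the words associated with each category.
word_counts = {}
doc_counts = Counter()
for cat, text in train:
    doc_counts[cat] += 1
    word_counts.setdefault(cat, Counter()).update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Part two: pick the category with the highest log-probability."""
    best, best_lp = None, -math.inf
    total_docs = sum(doc_counts.values())
    for cat, counts in word_counts.items():
        lp = math.log(doc_counts[cat] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for w in text.split():
            lp += math.log((counts[w] + 1) / denom)  # add-one smoothing
        if lp > best_lp:
            best, best_lp = cat, lp
    return best
```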

Some Other Tools of Machine Learning and Data Mining

Big Data R is used for high-performance statistical computing on big data
Massive Online Analysis (MOA) is a machine learning framework used for data stream mining
Massive Online Analysis supports classification, regression, clustering, frequent item set mining and frequent graph mining
Vowpal Wabbit is able to handle terabytes of data
Vowpal Wabbit can exceed the throughput of a single machine's network interface
Pegasus is a big-graph mining tool that finds patterns and anomalies in massive graphs
GraphLab is a high-level parallel data mining system built without using MapReduce

Comparison
Apache Hadoop is used for batch processing
Data is divided into large blocks, which makes it easy to handle
But segmentation puts extra overhead on the system
Apache S4 is used for streaming data
No segmentation of data is needed
Nodes cannot be added to or removed from running clusters
Apache Mahout is used for writing machine learning algorithms
There is no lack of community support, and documentation and examples are provided

Conclusion
Big data is a crucial concern, as the amount of data will keep increasing in the future
Different techniques are needed for mining this big data
Apache Mahout gives recommendations to users according to their past behavior
Hadoop is used for data mining using MapReduce and HDFS
Apache S4 is used for mining streams of data
All techniques have their own significance for different types of companies

References
[1] R. Natarajan. Apache Hadoop Fundamentals: HDFS and MapReduce Explained with a Diagram. January 4, 2012.
[2] Pros and Cons of Hadoop. Guruzon.com, June 1, 2013.
[3] HDFS: Facebook has the world's largest Hadoop cluster!
[4] S4 Distributed Stream Computing Platform: Overview.
[5] A. Bradic. S4 Distributed Stream Computing Platform.
[6] W. Zhou. Streaming Big Data. William Zhou's Blog, September 24, 2012.
[7] G. Ingersoll. Introducing Apache Mahout: Scalable, commercial-friendly machine learning for building intelligent applications. September 8, 2009.
[8] G. Ingersoll. Introduction to Scalable Machine Learning with Apache Mahout. September 15, 2010.
[9] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis.
[10] J. Langford. Vowpal Wabbit. 2011.
[11] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[12] R. Smolan and J. Erwitt. The Human Face of Big Data. Sterling Publishing Company, 2012.

Any Queries?