Brendan McAdams MongoDB Promises of Big Data Big Data Spain Talk

A Modest Proposal
for Taming and Clarifying the Promises of Big Data
and the Software Driven Future
Brendan McAdams
10gen, Inc.
brendan@10gen.com
@rit
Friday, November 16, 12

"In short, software is eating the world."
- Marc Andreesen
Wall Street Journal, Aug. 2011
http://on.wsj.com/XLwnmo

Software is Eating the World
Amazon.com (and .uk, .es, etc) started as a bookstore

Today, they sell just about everything - bicycles,
appliances, computers, TVs, etc.
In some cities in America, they even do home grocery
delivery
No longer as much of a physical goods company -
becoming fixated and surrounded by software
Pioneering the eBook revolution with Kindle
EC2 is running a huge percentage of the public
internet

Netflix started as a company to deliver DVDs to the home...

Netflix started as a company to deliver DVDs to the home...

But as theyve grown, business has shifted to an
online streaming service
They are now rolling out rapidly in many countries
including Ireland, the UK, Canada and the Nordics
No need for physical inventory or postal distribution ...
just servers and digital copies

Disney Found Itself Forced To Transform...
From This...

Disney Found Itself Forced To Transform...
... To This

But What Does All This Software Do?
Software always eats data be it text files, user form input,

emails, etc
All things that eat, must eventually excrete...

Ingestion = Excretion
+ =
Yeast Ingests Sugars,
and Excretes Ethanol

Ingestion = Excretion
Cows, er...
well, you get the point.

So What Does Software Eat?
Software always eats data be it text files, user form input,

emails, etc
But what does software excrete?

More Data, of course...
This data gets bigger and bigger
The solutions become narrower for storing &
processing this data
Data Fertilizes Software, in an endless cycle...

Theres a Big Market Here...
Lots of Solutions for Big Data

Data Warehouse Software
Operational Databases
Old style systems being upgraded to scale storage +
processing
NoSQL - Cassandra, MongoDB, etc
Platforms
Hadoop

Dont Tilt At Windmills...

Dont Tilt At Windmills...
It is easy to get distracted by all of these solutions

Keep it simple
Use tools you (and your team) can understand
Use tools and techniques that can scale
Try not to reinvent the wheel

... And Dont Bite Off More Than You Can Chew
Break it into smaller pieces

You cant fit a whole pig into your mouth...
... slice it into small parts that you can consume.

Big Data at a Glance
Large Dataset
Primary Key as username
Big Data can be gigabytes, terabytes, petabytes or exabytes

An ideal big data system scales up and down around various
data sizes while providing a uniform view
Major concerns
Can I read & write this data efficiently at different scale?
Can I run calculations on large portions of this data?

...
Large Dataset
Systems like Google File System (which inspired Hadoops

HDFS) and MongoDBs Sharding handle the scale problem by
chunking
Break up pieces of data into smaller chunks, spread across

many data nodes
Each data node contains many chunks
If a chunk gets too large or a node overloaded, data can be
rebalanced

Chunks Represent Ranges of Values
Initially, an empty
collection has a single
- + chunk, running the range
of minimum (-) to ...
INSERT {USERNAME: Bill} maximum (+)
As we add data, more

chunks are created of - B C +
new ranges
INSERT {USERNAME: Becky}

INSERT {USERNAME: Brendan}
Individual or partial letter

- Ba Be Br ranges are one possible
chunk value... but they
can get smaller!
INSERT {USERNAME: Brad}
The smallest possible

chunk value is not a Brad Brendan
range, but a single
possible value

a b c d e f g h
...
Large Dataset
s t u v w x y z
To simplify things, lets look at our dataset split into chunks by

letter
Each chunk is represented by a single letter marking its

contents
You could think of B as really being Ba Bz

a b c d e f g h
Large Dataset
s t u v w x y z

Large Dataset
x b v t d f z s
h e u c w a y g
MongoDB Sharding ( as well as HDFS ) breaks data into chunks (~64 mb)

Data Node 1 Data Node 2 Data Node 3 Data Node 4
Large Dataset
25% of chunks 25% of chunks 25% of chunks 25% of chunks
x b v t d f z s
h e u c w a y g
Representing data as chunks allows many levels of scale across n data nodes

Scaling
Data Node 1 Data Node 2 Data Node 3 Data
Data
Node
Node
4 5
x b v t d f z s
h e u c w a y g
The set of chunks can be evenly distributed across n data nodes

Add Nodes: Chunk Rebalancing
Data Node 1 Data Node 2 Data Node 3 Data Node 4 Data Node 5
x c b z t f v y
a s u g e w h d
The goal is equilibrium - an equal distribution.
As nodes are added (or even removed)
chunks can be redistributed for balance.

Dont Bite Off More Than You Can Chew...
The answer to calculating big data is much the same as

storing it
We need to break our data into bite sized pieces

Build functions which can be composed together
repeatedly on partitions of our data
Process portions of the data across multiple calculation
nodes
Aggregate the results into a final set of results

Bite Sized Pieces Are Easier to Swallow
These pieces are not chunks rather, the individual data

points that make up each chunk
Chunks make up a useful data transfer units for processing

as well
Transfer Chunks as Input Splits to calculation nodes,
allowing for scalable parallel processing

MapReduce the Pieces
The most common application of these techniques is

MapReduce
Based on a Google Whitepaper, works with two primary
functions map and reduce to calculate against large
datasets

MapReduce to Calculate Big Data
MapReduce is designed to effectively process data at varying

scales
Composable function units can be reused repeatedly for scaled

results

MapReduce to Calculate Big Data
In addition to the HDFS storage component, Hadoop is built

around MapReduce for calculation
MongoDB can be integrated to MapReduce data on Hadoop

No HDFS storage needed - data moves directly between
MongoDB and Hadoops MapReduce engine

What is MapReduce?
MapReduce made up of a series of phases, the primary of

which are
Map
Shuffle
Reduce
Lets look at a typical MapReduce job
Email records
Count # of times a particular user has received email

MapReducing Email
to: tyler
from: brendan
subject: Ruby Support
to: brendan
from: tyler
subject: Re: Ruby Support
to: mike
from: brendan
subject: Node Support
to: brendan
from: mike
subject: Re: Node Support
to: mike
from: tyler
subject: COBOL Support
to: tyler
from: mike
subject: Re: COBOL Support
(WTF?)

Map Step
map function breaks each document
to: tyler
into a key (grouping) & value
key: tyler
from: brendan value: {count: 1}
subject: Ruby Support
to: brendan
from: tyler key: brendan
subject: Re: Ruby Support value: {count: 1}
to: mike
from: brendan
subject: Node Support key: tyler
value: {count: 1}
map function
to: brendan emit(k, v)
from: mike
subject: Re: Node Support key: mike
value: {count: 1}
to: mike
from: tyler key: brendan
subject: COBOL Support value: {count: 1}
to: tyler
from: mike
subject: Re: COBOL Support key: mike
(WTF?) value: {count: 1}

Group/Shuffle Step
key: tyler
value: {count: 1}
key: brendan
Group like keys together, value: {count: 1}
creating an array of their key: tyler

value: {count: 1}
distinct values
(Automatically done by M/R frameworks)
key: mike
value: {count: 1}
key: brendan
value: {count: 1}
key: mike
value: {count: 1}

Group/Shuffle Step
Group like keys together,

key: tyler
creating an array of their

values: [{count: 1},
{count: 1}]
distinct values key: mike

{count: 1}]
(Automatically done by M/R frameworks)
key: brendan
{count: 1}]

Reduce Step
For each key reduce function
flattens the list of values to a single
result
key: tyler key: mike
values: [{count: 1}, value: {count: 2}
{count: 1}]
key: mike key: brendan

reduce function
values: [{count: 1}, value: {count: 2}
aggregate values
{count: 1}]
return (result)
key: brendan
key: tyler
value: {count: 2}
{count: 1}]

Processing Scalable Big Data
MapReduce provides an effective system for calculating

and processing our large datasets (from gigabytes through
exabytes and beyond)
MapReduce is supported in many places including

MongoDB & Hadoop
We have effective answers for both of our concerns.

Can I read & write this data efficiently at different scale?
Can I run calculations on large portions of this data?

Batch Isnt a Sustainable Answer
There are downsides here - fundamentally, MapReduce is a
batch process
Batch systems like Hadoop give us a Catch 22

You can get answers to questions from Petabytes of
Data
But you cant guarantee youll get them quickly
In some ways, this is a step backwards in our industry
Business Stakeholders tend to want answers now
We must evolve

Moving Away from Batch
The Big Data world is moving rapidly away from slow, batch
based processing solutions
Google moved forward from Batch into more Realtime over last
few years
Hadoop is replacing MapReduce as Assembly Language with

more flexible resource management in YARN
Now MapReduce is just a feature implemented on top of
YARN
Build anything we want
Newer systems like Spark & Storm provide platforms for
realtime processes

In Closing
The World IS Being Eaten By Software
All that software is leaving behind an awful lot of data

We must be careful not to step in it
More Data Means More Software Means More Data
Means...
Practical Solutions for Processing & Storing Data will save

us
We as Data Scientists & Technologists must always evolve

our strategies, thinking and tools

[Download the Hadoop Connector]
http://github.com/mongodb/mongo-hadoop
[Docs]
http://api.mongodb.org/hadoop/
QUESTIONS?
*Contact Me*
brendan@10gen.com
(twitter: @rit)

Brendan McAdams MongoDB Promises of Big Data Big Data Spain Talk

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Brendan McAdams MongoDB Promises of Big Data Big Data Spain Talk

Загружено:

Авторское право:

Доступные форматы

A Modest Proposal

for Taming and Clarifying the Promises of Big Data

and the Software Driven Future

Friday, November 16, 12

Friday, November 16, 12

Amazon.com (and .uk, .es, etc) started as a bookstore

Friday, November 16, 12

Netflix started as a company to deliver DVDs to the home...

Friday, November 16, 12

Netflix started as a company to deliver DVDs to the home...

Friday, November 16, 12

Friday, November 16, 12

Friday, November 16, 12

Software always eats data be it text files, user form input,

All things that eat, must eventually excrete...

Friday, November 16, 12

Yeast Ingests Sugars,

and Excretes Ethanol

Friday, November 16, 12

well, you get the point.

Software always eats data be it text files, user form input,

But what does software excrete?

Friday, November 16, 12

Lots of Solutions for Big Data

Friday, November 16, 12

Friday, November 16, 12

It is easy to get distracted by all of these solutions

Friday, November 16, 12

Break it into smaller pieces

Friday, November 16, 12

Big Data can be gigabytes, terabytes, petabytes or exabytes

Friday, November 16, 12

Systems like Google File System (which inspired Hadoops

Break up pieces of data into smaller chunks, spread across

Friday, November 16, 12

As we add data, more

INSERT {USERNAME: Becky}

Individual or partial letter

INSERT {USERNAME: Brad}

The smallest possible

Friday, November 16, 12

To simplify things, lets look at our dataset split into chunks by

Each chunk is represented by a single letter marking its

Friday, November 16, 12

Friday, November 16, 12

Friday, November 16, 12

Friday, November 16, 12

The set of chunks can be evenly distributed across n data nodes

Friday, November 16, 12

The goal is equilibrium - an equal distribution.

As nodes are added (or even removed)

chunks can be redistributed for balance.

Friday, November 16, 12

The answer to calculating big data is much the same as

We need to break our data into bite sized pieces

Friday, November 16, 12

These pieces are not chunks rather, the individual data

Chunks make up a useful data transfer units for processing

Friday, November 16, 12

The most common application of these techniques is

Friday, November 16, 12

MapReduce is designed to effectively process data at varying

Composable function units can be reused repeatedly for scaled

Friday, November 16, 12