
Shark: SQL and Rich Analytics at Scale
Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott
Shenker, Ion Stoica

AMPLab, UC Berkeley

June 25 @ SIGMOD 2013
Challenges
Data size growing
Processing has to scale out over large
clusters
Faults and stragglers complicate DB design

Complexity of analysis increasing
Massive ETL (web crawling)
Machine learning, graph processing
Leads to long running jobs
The Rise of MapReduce
What's good about MapReduce?
1. Scales out to thousands of nodes in a fault-
tolerant manner
2. Good for analyzing semi-structured data and
complex analytics
3. Elasticity (cloud computing)
4. Dynamic, multi-tenant resource sharing
"parallel relational database systems are
significantly faster than those that rely on the
use of MapReduce for their query engines"
I totally agree.
This Research
1. Shows the MapReduce model can be extended to
support SQL efficiently
Started from a powerful MR-like engine (Spark)
Extended the engine in various ways
2. The artifact: Shark, a fast engine on top of MR
Performant SQL
Complex analytics in the same engine
Maintains MR benefits, e.g. fault tolerance
MapReduce's Fundamental Properties
Data-parallel operations
Apply the same operations to a defined set of data

Fine-grained, deterministic tasks
Enables fault-tolerance & straggler mitigation
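A minimal Scala sketch of why these properties matter (the names and types are illustrative, not Spark's internals): because each task is a deterministic function of its input partition, a task lost to a worker failure can simply be re-executed elsewhere and is guaranteed to produce the same output.

```scala
object DeterministicTasks {
  // A task is a pure function from an input partition to an output partition.
  type Task[A, B] = Seq[A] => Seq[B]

  // If the first attempt fails (e.g. the worker died), run the task again;
  // determinism guarantees the re-execution yields an identical result.
  def runWithRetry[A, B](task: Task[A, B], partition: Seq[A]): Seq[B] =
    try task(partition)
    catch { case _: Exception => task(partition) }

  def main(args: Array[String]): Unit = {
    val double: Task[Int, Int] = xs => xs.map(_ * 2)
    println(runWithRetry(double, Seq(1, 2, 3)).mkString(","))  // 2,4,6
  }
}
```

The same mechanism handles stragglers: a speculative copy of a slow task can run in parallel, and whichever finishes first wins, since both must agree.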
Why Were Databases Faster?
Data representation
Schema-aware, column-oriented, etc.
Co-partition & co-location of data
Execution strategies
Scheduling/task launching overhead (~20s in Hadoop)
Cost-based optimization
Indexing
Lack of mid-query fault tolerance
MR's pull model costly compared to DBMS push
See Pavlo 2009, Xin 2013.
(Build of the previous slide: the annotations note that most of these advantages are not fundamental to MapReduce, while the remaining costs, such as task-launching overhead and mid-query fault tolerance, can be surprisingly cheap.)
Introducing Shark
MapReduce-based architecture
Uses Spark as the underlying execution engine
Scales out and tolerates worker failures
Performant
Low-latency, interactive queries
(Optionally) in-memory query processing
Expressive and flexible
Supports both SQL and complex analytics
Hive compatible (storage, UDFs, types, metadata, etc.)
Spark Engine
Fast MapReduce-like engine
In-memory storage for fast iterative computations
General execution graphs
Designed for low latency (~100ms jobs)
Compatible with Hadoop storage APIs
Read/write to any Hadoop-supported system, including
HDFS, HBase, SequenceFiles, etc.
Growing open source platform
17 companies contributing code
More Powerful MR Engine
General task DAG
Pipelines functions
within a stage
Cache-aware data
locality & reuse
Partitioning-aware
to avoid shuffles
[Figure: a general task DAG — map, union, groupBy, and join operators over datasets A–G, pipelined within Stages 1–3; shaded boxes mark previously computed partitions]
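The pipelining point above can be illustrated in plain Scala (a sketch, not Spark's scheduler): within a stage, consecutive narrow operators are composed into a single function, so each partition is traversed once with no intermediate collection materialized.

```scala
object StagePipelining {
  def main(args: Array[String]): Unit = {
    val parse: String => Int = _.trim.toInt
    val square: Int => Int   = x => x * x

    // Fuse the two functions: one pass over the partition, no intermediate
    // buffer of parsed values is ever built.
    val fused: String => Int = parse andThen square

    val partition = Seq(" 1", "2 ", "3")
    println(partition.map(fused).mkString(","))  // 1,4,9
  }
}
```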

Hive Architecture
[Diagram: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Physical Plan Execution) → MapReduce, with a Metastore beside the Driver, all on top of Hadoop Storage (HDFS, S3, …)]

Shark Architecture
[Diagram: as in the Hive architecture, except the Driver gains a Cache Mgr. and Physical Plan Execution runs on Spark instead of MapReduce: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer → Cache Mgr. → Physical Plan Execution) → Spark, with a Metastore, on top of Hadoop Storage (HDFS, S3, …)]
Extending Spark for SQL
Columnar memory store
Dynamic query optimization
Miscellaneous other optimizations (distributed
top-K, partition statistics & pruning a.k.a. coarse-
grained indexes, co-partitioned joins, )
Columnar Memory Store
Simply caching records as JVM objects is inefficient
(huge overhead in MR's record-oriented model).
Shark instead employs column-oriented storage: a
partition of columns is one MapReduce record.
Column Storage (one array per column):
[1, 2, 3]  [john, mike, sally]  [4.1, 3.5, 6.4]
Row Storage (one object per record):
(1, john, 4.1)  (2, mike, 3.5)  (3, sally, 6.4)
Benefit: compact representation, CPU-efficient
compression, cache locality.
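A hedged sketch of the layout (the table's field names are invented for illustration): the partition holds one array per column, and that whole partition, not each row, is what flows through the engine as a single record.

```scala
object ColumnarPartition {
  // One partition of an (id, name, score) table, stored column-wise.
  final case class Partition(ids: Array[Int], names: Array[String], scores: Array[Double]) {
    def row(i: Int): (Int, String, Double) = (ids(i), names(i), scores(i))
  }

  def fromRows(rows: Seq[(Int, String, Double)]): Partition =
    Partition(rows.map(_._1).toArray, rows.map(_._2).toArray, rows.map(_._3).toArray)

  def main(args: Array[String]): Unit = {
    val p = fromRows(Seq((1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)))
    // A column scan touches one contiguous primitive array — compact,
    // cache-friendly, and a good target for lightweight compression.
    println(p.scores.max)  // 6.4
    println(p.row(2)._2)   // sally
  }
}
```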
How do we optimize:

SELECT * FROM table1 a JOIN table2 b ON a.key = b.key
WHERE my_crazy_udf(b.field1, b.field2) = true;
Hard to estimate cardinality!
Partial DAG Execution (PDE)
Lack of statistics for fresh data and the prevalent
use of UDFs necessitate dynamic approaches to
query optimization.

PDE allows dynamic alteration of query plans
based on statistics collected at run-time.
Shuffle Join vs. Map Join (Broadcast Join)
[Figure: a shuffle join takes three stages — two map stages feeding a join stage — while a map (broadcast) join completes the join within a single stage, minimizing network traffic]
PDE Statistics
Gather customizable statistics at per-partition
granularities while materializing map output.
partition sizes, record counts (skew detection)
heavy hitters
approximate histograms
Can alter query plan based on such statistics
map join vs. shuffle join
symmetric vs non-symmetric hash join
skew handling
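A sketch of the first decision these statistics enable (the threshold and names here are made up, not Shark's actual values): once map-output sizes are known, the planner can pick a map (broadcast) join when one side turns out to be small, and fall back to a shuffle join otherwise.

```scala
object JoinStrategyChoice {
  sealed trait Strategy
  case object MapJoin     extends Strategy  // broadcast the small table
  case object ShuffleJoin extends Strategy

  // Decide using per-partition sizes gathered while materializing map output.
  def choose(leftPartitionBytes: Seq[Long],
             rightPartitionBytes: Seq[Long],
             broadcastThreshold: Long = 10L * 1024 * 1024): Strategy = {
    val smallerSide = math.min(leftPartitionBytes.sum, rightPartitionBytes.sum)
    if (smallerSide <= broadcastThreshold) MapJoin else ShuffleJoin
  }

  def main(args: Array[String]): Unit = {
    println(choose(Seq(1L << 20, 2L << 20), Seq(500L << 20)))  // MapJoin
    println(choose(Seq(200L << 20), Seq(500L << 20)))          // ShuffleJoin
  }
}
```

Because the choice is made per query at run time, it works even when UDFs make cardinalities impossible to estimate up front.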
Complex Analytics Integration
Unified system for SQL,
machine learning

Both share the same set
of workers and caches
def logRegress(points: RDD[Point]): Vector = {
  // Initialize w to a random D-dimensional vector.
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("""SELECT * FROM user u
  JOIN comment c ON c.uid = u.uid""")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}
val trainedVector = logRegress(features.cache())
Pavlo Benchmark
[Chart: Selection query — runtimes for Shark, Shark (disk), and Hive; Aggregation query with grouping — runtimes for Hive, Shark (disk), Shark, and Shark with copartitioned data; x-axis: runtime in seconds, 0–2000]
Machine Learning Performance
[Chart: runtime per iteration (secs) for K-Means Clustering and Logistic Regression, comparing Spark and Hadoop]
Real Warehouse Benchmark
[Chart: runtime in seconds (0–100) for queries Q1–Q4 on Shark, Shark (disk), and Hive]
1.7 TB Real Warehouse Data on 100 EC2 nodes
New Benchmark
[Chart: runtime in seconds (0–20) for Impala, Impala (mem), Redshift, Shark (disk), and Shark (mem)]
http://tinyurl.com/bigdata-benchmark
Other benefits of MapReduce
Elasticity
Query processing can scale up and down dynamically
Straggler Tolerance
Schema-on-read & Easier ETL
Engineering
MR handles task scheduling / dispatch / launch
Simpler query processing code base (~10k LOC)
Berkeley Data Analytics Stack
[Diagram: Shark (SQL), Spark Streaming, GraphX, and MLBase sit on top of Spark; Spark runs on the Mesos Resource Manager over HDFS / Hadoop Storage]
Community
3000 people attended
online training
800 meetup members
17 companies contributing
Conclusion
Leveraging a modern MapReduce engine and
techniques from databases, Shark supports both
SQL and complex analytics efficiently, while
maintaining fault tolerance.
Growing open source community
Users observe similar speedups in real use cases
http://shark.cs.berkeley.edu
http://www.spark-project.org

[Figure: Shark combines the strengths of MapReduce and DBMSs]
