Академический Документы
Профессиональный Документы
Культура Документы
Row-Stores
How Different are they Really?
Daniel J. Abadi, Samuel Madden, and Nabil Hachem,
SIGMOD 2008
Presented By,
Paresh
Modak(09305060)
Souman
1
Mandal(09305066)
Contents
Column-store introduction
Column-store data model
Emulation of Column Store in Row store
Column store optimization
Experiment and Results
Conclusion
2
Row Store and Column Store
3
Row Store and Column Store
Most of the queries does not process all the attributes of
a particular relation.
4
Row Store and Column Store
5
Why Column Stores?
Can be significantly faster than row stores for some
applications
Fetch only required columns for a query
Better cache effects
Better compression (similar attribute values within a
column)
But can be slower for other applications
OLTP with many row inserts, ..
Long war between the column store and row store
camps :-)
This paper tries to give a balanced picture of advantages
and disadvantages, after adding/ subtracting a number
of optimizations for each approach
6
Column Stores - Data Model
7
Column Stores - Data Model
8
Column Stores Data Model
Join Indexes
T1 and T2 are projections on T
M segments in T1 and N segments in T2
Join Index from T1 to T2 is a table of the form:
(s: Segment ID in T2, k: Storage key in Segment s)
Each row in join index matches corresponding row in T1
Join indexes are built such that T could be
efficiently reconstructed from T1 and T2
9
Column Stores Data Model
Construct EMP(name, age, salary) from EMP1
and EMP3 using join index on EMP3
10
Compression
Trades I/O for CPU
Increased column-store opportunities:
Higher data value locality in column stores
Techniques such as run length encoding far more
useful
Schemes
Null Suppression
Dictionary encoding
Run Length encoding
Bit-Vector encoding
Heavyweight schemes
11
Query Execution - Operators
12
Query Execution - Operators
Decompress: Converts compressed column
to uncompressed representation
Mask(Bitstring B, Projection Cs) => emit only
those values whose corresponding bits are 1
Concat: Combines one or more projections
sorted in the same order into a single
projection
Permute: Permutes a projection according to
the ordering defined by a join index
Bitstring operators: Band Bitwise AND,
Bor Bitwise OR, Bnot complement
13
Row Store Vs Column Store
14
Row Store Vs Column Store
Now the simplistic view about the difference
in storage layout leads to that one can obtain
the performance benefits of a column-store
using a row-store by making some changes to
the physical structure of the row store.
15
Vertical Partitioning
Process:
Full Vertical partitioning of each relation
Each column =1 Physical table
This can be achieved by adding integer position column
to every table
Adding integer position is better than adding primary key
Join on Position for multi column fetch
Problems:
Position - Space and disk bandwidth
Header for every tuple further space wastage
e.g. 24 byte overhead in PostgreSQL
16
Vertical Partitioning: Example
17
Index-only plans
Process:
Add B+Tree index for every Table.column
Plans never access the actual tuples on disk
Headers are not stored, so per tuple overhead is less
Problem:
Separate indices may require full index scan, which
is slower
Eg: SELECT AVG(salary)
FROM emp
WHERE age > 40
Composite index with (age, salary) key helps.
18
Index-only plans: Example
19
Materialized Views
Process:
Create optimal' set of MVs for given query workload
Objective:
Provide just the required data
Avoid overheads
Performs better
Expected to perform better than other two
approach
Problems:
Practical only in limited situation
Require knowledge of query workloads in advance
20
Materialized Views: Example
Select F.custID
from Facts as F
where F.price>20
21
Optimizing Column oriented Execution
22
Compression
23
Compression
25
Late Materialization
26
Late Materialization
Advantages
Unnecessary construction of tuple is avoided
Direct operation on compressed data
Cache performance is improved (PAX)
27
N-ary storage model (NSM)
28
Decomposition Storage Model(DSM)
High degree of
spatial locality for
sequential access
of single attribute
Performance
deteriorates
significantly for
queries that
involve multiple
attributes
29
PAX - Partition Attributes Across*
Cache utilization and performance is very
important
In-page data placement is the key to high cache
performance
PAX groups together all values of each attribute
within each page
Only affects layout inside the pages, incurs no
storage penalty and does not affect I/O behavior
Exhibits superior cache performance and utilization
over traditional methods
30
PAX Model
Maximizes inter-
record spatial
locality within each
column in the page
Incurs minimal
record
reconstruction cost
Orthogonal to other
design decisions
because it only
affects the layout of
data stored on a
single page
31
Block Iteration
32
Star Schema Benchmark
SSBM is a data warehousing benchmark
derived from TPC-H
It consist of a single fact table LINE-ORDER
There are four dimension table.
CUSTOMER
PART
SUPPLIER
DATE
LINEORDER table consist of 60,000,000 tuples
SSBM consist of thirteen queries divided into
four category
33
Star Schema Benchmark
Figure taken
form [1]
34
Invisible Join
Queries over data warehouse (particularly
modeled with star schema) often have
following structure
Restrict set of tuple in the fact table using
selection predicates on dimension table
Perform some aggregation on the restricted fact
table
Often grouping by other dimension table attribute
36
Invisible Join
Traditional plan for this type of query is to
pipeline join in order of predicate selectivity
Alternate plan is late materialized join
technique
37
Invisible Join
38
Invisible Join
Phase 1
Figure taken form [1]
39
Invisible Join
Between-Predicate rewriting
Use of range predicates instead of hash lookup in
phase 1
Useful if contiguous set of keys are valid after
applying a predicate
Dictionary encoding for key reassignment if not
contiguous
Query optimizer is not altered. Predicate is
rewritten at runtime
42
Between-predicate rewriting
Goal
Comparison of attempts to emulate a column
store in a row-store with baseline performance of
C-Store
Is it possible for an unmodified row-store to obtain
the benefits of column oriented design
Effect of different optimization technique in
column-store
44
Experiment setup
Environment
2.8GHz Dual Core Pentium(R) workstation
3 GB RAM
RHEL 5
4 disk array mapped as a single logical volume
Reported numbers are average of several runs
Warm buffer (30% improvement for both systems)
Data read exceeds the size of buffer pool
45
C-Store Vs Commercial row oriented DB
47
Column Store simulation in Row Store
48
Different configuration of System X
49
Different configuration of System X
50
Different configuration of System X
51
Different configuration of System X
T Traditional
T(B)
Traditional(bitmap)
MV materialized
views
VP vertical
partitioning
Better
AI Allperformance
indexes
of
traditional system is
because of
partitioning. Figure taken form [1]
Partitioning on
52
Different configuration of System X
53
Column Store simulation in Row Store:
Analysis
Tuple overheads:
LineOrder Table 60 million tuples, 17 columns
Compressed data
8 bytes of over head per row
4 bytes of record-id
1 column Whole table
RS 0.7-1.1 GB 4 GB
CS 240 MB 2.3 GB
54
Column Store simulation in Row Store:
Analysis
55
Column Store simulation in Row Store:
Analysis
Traditional
Scans the entire lineorder table
Hash joins with dwdate, part and supplier
Vertical Partitioning
Hash-joins partkey column with the filtered part table
Hash-joins suppkey column with filtered supplier table
Hash join the result of the two above join
Index-Only plans
Access all columns through unclustred B+Tree
indexes
56
Column Store Performence
57
Tuple overhead and Join costs
58
Breakdown of Column-Store
Advantages
59
Breakdown of Column-Store
Advantages
60
Conclusion
To emulate column store in row store, techniques like
Vertical portioning
Index only plan
does not yield good performance
High per-tuple overheads, high tuple reconstruction
cost are the reason
Where in column store
Late materialization
Compression
Block iteration
Invisible join
are the reason for good performance
61
Conclusion
62
References
63
Thank
You!
64