
DIMENSIONAL DATA IN
DISTRIBUTED HASH TABLES
fturg Mike Male

STRANGE LOOP 2010


Monday, October 18, 2010
SIMPLEGEO

We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn’t exist. So we took matters into our own hands and began building it ourselves.

Mt Gaig Joe Stump


CEO & co-founder CTO & co-founder

ABOUT ME

MIKE MALONE
INFRASTRUCTURE ENGINEER
mike@simplegeo.com
@mjmalone

For the last 10 months I’ve been developing a spatial database at the core of SimpleGeo’s infrastructure.

REQUIREMENTS & GOALS
[Word cloud: Multidimensional, Efficiency, Query Speed, Complex Queries, Decentralization, Locality, Spatial, Performance, Fault Tolerance, Availability, Distributedness, Replication, Reliability, Simplicity, Resilience, Conceptual Integrity, Scalability, Operational, Consistency, Data, Coherent, Concurrency, Query Volume, Durability, Linear]

CONFLICT
Integrity vs Availability

Locality vs Distributedness

Reductionism vs Emergence

DATABASE CANDIDATES
TRANSACTIONAL RELATIONAL DATABASES
They’re theoretically pure, well understood, and mostly
standardized behind a relatively clean abstraction
They provide robust contracts that make it easy to reason
about the structure and nature of the data they contain
They’re battle hardened, robust, durable, etc.
OTHER STRUCTURED STORAGE OPTIONS
“I see you, western youths, see you tramping with the
foremost, Pioneers! O pioneers!” (Walt Whitman)

INTEGRITY
- vs -
AVAILABILITY

ACID
These terms are not formally defined - they’re a
framework, not mathematical axioms

ATOMICITY
Either all of a transaction’s actions are visible to another transaction, or none are

CONSISTENCY
Application-specific constraints must be met for transaction to succeed

ISOLATION
Two concurrent transactions will not see one another’s changes while “in flight”

DURABILITY
The updates made to the database in a committed transaction will be visible to
future transactions

ACID HELPS
ACID is a sort-of-formal contract that makes it
easy to reason about your data, and that’s good

IT DOES SOMETHING HARD FOR YOU


With ACID, you’re guaranteed to maintain a persistent global
state as long as you’ve defined proper constraints and your
logical transactions result in a valid system state

CAP THEOREM
At PODC 2000 Eric Brewer told us there were three
desirable DB characteristics. But we can only have two.

CONSISTENCY
Every node in the system contains the same data (e.g., replicas are
never out of date)

AVAILABILITY
Every request to a non-failing node in the system returns a response

PARTITION TOLERANCE
System properties (consistency and/or availability) hold even when
the system is partitioned and data is lost

CAP THEOREM IN 30 SECONDS

[Diagram sequence: CLIENT → SERVER → REPLICA]
The client sends a write to the server
The server replicates the write to the replica
The replica acks the write
The server accepts the write and acks back to the client
If the replica fails and the server refuses the write, the system stays consistent but is UNAVAILABLE!
If the replica fails and the server accepts the write anyway, the system stays available but is INCONSISTENT!
ACID HURTS
Certain aspects of ACID encourage (require?)
implementors to do “bad things”

Unfortunately, ANSI SQL’s definition of isolation...


relies in subtle ways on an assumption that a locking scheme is
used for concurrency control, as opposed to an optimistic or
multi-version concurrency scheme. This implies that the
proposed semantics are ill-defined.
Joseph M. Hellerstein and Michael Stonebraker
Anatomy of a Database System

BALANCE
IT’S A QUESTION OF VALUES
For traditional databases CAP consistency is the holy grail: it’s
maximized at the expense of availability and partition
tolerance
At scale, failures happen: when you’re doing something a
million times a second a one-in-a-million failure happens every
second
We’re witnessing the birth of a new religion...
• CAP consistency is a luxury that must be sacrificed at scale in order to
maintain availability when faced with failures

APACHE CASSANDRA
A DISTRIBUTED HASH TABLE WITH SOME TRICKS
Peer-to-peer gossip supports fault tolerance and decentralization
with a simple operational model
Random hash-based partitioning provides auto-scaling and
efficient online rebalancing - read and write throughput increases
linearly when new nodes are added
Pluggable replication strategy for multi-datacenter replication
making the system resilient to failure, even at the datacenter level
Tunable consistency allows us to adjust durability to match the value of
the data being written
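
The “tunable consistency” knob is easiest to see with the Dynamo-style replica arithmetic (the slide doesn’t spell this out, but it is the standard framing): with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap when R + W > N. A minimal sketch:

```python
def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """Dynamo-style rule of thumb: a read overlaps the latest acked write when R + W > N."""
    return r + w > n

# With 3 replicas: QUORUM writes + QUORUM reads overlap; ONE/ONE trades consistency for latency.
print(read_sees_latest_write(n=3, w=2, r=2))  # True
print(read_sees_latest_write(n=3, w=1, r=1))  # False
```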

CASSANDRA
DATA MODEL
{
  “users”: {                                   ← column family
    “alice”: {                                 ← key
      “city”: [“St. Louis”, 1287040737182],    ← columns: (name, value, timestamp)
      “name”: [“Alice”, 1287080340940],
    },
    ...
  },
  “locations”: {
  },
  ...
}

DISTRIBUTED HASH TABLE
DESTROY LOCALITY (BY HASHING) TO ACHIEVE GOOD
BALANCE AND SIMPLIFY LINEAR SCALABILITY

[Diagram: the “users” column family from the previous slide, with row keys “alice” and “bob” hashed to unrelated tokens scattered around a ring that spans 0 to ffff...]
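
A rough sketch of what “destroy locality by hashing” means in practice: a random partitioner hashes each row key to a token and hands the row to whichever node owns that token range. The class and node names below are illustrative, not Cassandra’s actual code:

```python
import hashlib
from bisect import bisect_right

class RandomPartitioner:
    """Hash row keys onto a token ring; each node owns the range ending at its token (sketch)."""

    def __init__(self, node_tokens):
        # node_tokens: {node name: token}, e.g. three nodes spread evenly around a 2**128 ring
        self.ring = sorted((token, node) for node, token in node_tokens.items())

    def token(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        tokens = [token for token, _ in self.ring]
        i = bisect_right(tokens, self.token(key)) % len(self.ring)  # wrap around the ring
        return self.ring[i][1]

p = RandomPartitioner({"node-a": 0, "node-b": 2**128 // 3, "node-c": 2 * 2**128 // 3})
# "alice" and "bob" hash to unrelated tokens, so lexicographically close keys scatter evenly.
print(p.node_for("alice"), p.node_for("bob"))
```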
HASH TABLE
SUPPORTED QUERIES

EXACT MATCH
(not supported: RANGE, PROXIMITY, or anything else that’s not an exact match)

LOCALITY
- vs -
DISTRIBUTEDNESS

THE ORDER PRESERVING
PARTITIONER
CASSANDRA’S PARTITIONING
STRATEGY IS PLUGGABLE
Partitioner maps keys to nodes
Random partitioner destroys locality by hashing
Order preserving partitioner retains locality, storing keys in natural lexicographical order around the ring
[Ring diagram: keys “alice”, “bob”, and “sam” placed in order between ring positions a and z]
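
For contrast, a sketch of the order-preserving case: the “token” is effectively the key itself, so each node owns a contiguous lexicographic range and nearby keys stay together (names are again illustrative):

```python
from bisect import bisect_right

class OrderPreservingPartitioner:
    """The token is the key itself, so each node owns a contiguous lexicographic range (sketch)."""

    def __init__(self, node_tokens):
        # node_tokens: {node name: last key the node owns}, e.g. a..h, h..q, q..z
        self.ring = sorted((token, node) for node, token in node_tokens.items())

    def node_for(self, key: str) -> str:
        tokens = [token for token, _ in self.ring]
        return self.ring[bisect_right(tokens, key) % len(self.ring)][1]

p = OrderPreservingPartitioner({"node-a": "h", "node-b": "q", "node-c": "z"})
# "alice" and "bob" now land on the same node: range scans work, but hot spots become possible.
print(p.node_for("alice"), p.node_for("bob"), p.node_for("sam"))
```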
ORDER PRESERVING PARTITIONER
SUPPORTED QUERIES

EXACT MATCH
RANGE (on a single dimension)
PROXIMITY?

SPATIAL DATA
IT’S INHERENTLY MULTIDIMENSIONAL

[Diagram: a point x at coordinates (2, 2) on a two-dimensional grid]
DIMENSIONALITY REDUCTION
WITH SPACE-FILLING CURVES

[Diagram: the 2-D grid divided into quadrants 1, 2, 3, 4, visited in order by a space-filling curve]

Z-CURVE
SECOND ITERATION

Z-VALUE

[Diagram: the cell marked x on the second-iteration Z-curve has Z-value 14]

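The Z-value itself is just bit interleaving (a Morton code). A small sketch, using one common convention in which x occupies the even bit positions and y the odd ones; under that convention the slide’s Z-value of 14 is the cell at (x=2, y=3):

```python
def z_value(x: int, y: int) -> int:
    """Morton code: bit i of x goes to bit 2i, bit i of y goes to bit 2i + 1."""
    z = 0
    for i in range(max(x.bit_length(), y.bit_length())):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

print(z_value(2, 3))  # 14 -- one cell on the second-iteration (4x4) Z-curve
```
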
GEOHASH
SIMPLE TO COMPUTE
Interleave the bits of decimal coordinates
(equivalent to binary encoding of pre-order
traversal!)
Base32 encode the result
AWESOME CHARACTERISTICS
Arbitrary precision
Human readable
Sorts lexicographically

[Example: the 5-bit group 01101 (decimal 13) maps to the base32 character “e”]
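
A runnable sketch of the encoding described above: alternately bisect longitude and latitude to produce one bit per step, then map each 5-bit group to the geohash base32 alphabet. Fed the Moonrise Hotel coordinates from the next slide, it reproduces the “9yzgcjn0” key prefix:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat: float, lon: float, precision: int = 8) -> str:
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, use_lon = [], True                      # even bits refine longitude, odd bits latitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid                          # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid                          # keep the lower half
        use_lon = not use_lon
    chars = [BASE32[int("".join(map(str, bits[i:i + 5])), 2)] for i in range(0, len(bits), 5)]
    return "".join(chars)

print(geohash(38.6554420, -90.2992910))  # 9yzgcjn0
```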
DATA MODEL
{
  “record-index”: {
    “9yzgcjn0:moonrise hotel”: {               ← key is <geohash>:<id>
      “”: [“”, 1287040737182],
    },
    ...
  },
  “records”: {
    “moonrise hotel”: {
      “latitude”: [“38.6554420”, 1287040737182],
      “longitude”: [“-90.2992910”, 1287040737182],
      ...
    }
  }
}
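
To make the two column families concrete, a tiny sketch of the write path: the record row holds the attributes, and the index row is keyed by “<geohash>:<id>” with an empty column, so all of the useful information lives in the key itself. Here store.put is a hypothetical mutate call, timestamps are elided, and geohash() is the sketch from the previous slide:

```python
def save_record(store, record_id, lat, lon):
    # the canonical record, keyed by id
    store.put("records", record_id, {"latitude": str(lat), "longitude": str(lon)})
    # the index entry: an empty column value, because the key alone is what we query
    store.put("record-index", f"{geohash(lat, lon)}:{record_id}", {"": ""})

# save_record(store, "moonrise hotel", 38.6554420, -90.2992910)
# writes the index row "9yzgcjn0:moonrise hotel", matching the slide above
```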

BOUNDING BOX
E.G., MULTIDIMENSIONAL RANGE

Gie stuff  bg box! Gie 2  3

1 2

3 4

Gie 4  5
SPATIAL DATA
STILL MULTIDIMENSIONAL
DIMENSIONALITY REDUCTION ISN’T PERFECT
Clients must
• Pre-process to compose multiple queries
• Post-process to filter and merge results
Degenerate cases can be bad, particularly for nearest-neighbor
queries
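
A sketch of what that client-side pre- and post-processing can look like with a geohash index: find a prefix shared by the box’s corners, turn it into a single key-range scan, then filter out false positives. store.range_scan and the record fields are hypothetical, geohash() is the earlier sketch, and a real client would decompose the box into several tighter prefix ranges rather than one coarse one:

```python
def common_prefix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def bbox_query(store, south, west, north, east):
    # Pre-process: a geohash prefix shared by two opposite corners covers the whole box
    prefix = common_prefix(geohash(south, west), geohash(north, east))
    # One lexicographic range scan over the record-index ("~" sorts after every base32 char)
    candidates = store.range_scan("record-index", start=prefix, end=prefix + "~")
    # Post-process: drop rows that share the prefix but fall outside the box
    return [r for r in candidates
            if south <= r.latitude <= north and west <= r.longitude <= east]
```

When the box straddles a major cell boundary the shared prefix collapses to almost nothing, which is exactly the kind of degenerate case called out above.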

Z-CURVE LOCALITY

[Diagram sequence: two points marked x sit in adjacent cells of the grid, but the Z-curve passes through many unrelated cells (marked o) between them: neighbors in space are not necessarily neighbors on the curve]
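
The effect is easy to reproduce with the interleaving function from the Z-value sketch: cells that touch in space can sit far apart along the curve, and consecutive curve positions can belong to cells far apart in space:

```python
def z_value(x: int, y: int) -> int:
    """Same bit-interleaving convention as the earlier sketch (3 bits per axis, 8x8 grid)."""
    return sum((((x >> i) & 1) << (2 * i)) | (((y >> i) & 1) << (2 * i + 1)) for i in range(3))

# (7, 3) and (7, 4) are vertically adjacent cells on the grid...
print(z_value(7, 3), z_value(7, 4))  # 31 53 -- 21 other cells lie between them on the curve
# ...while curve positions 31 and 32 are adjacent but belong to cells (7, 3) and (0, 4).
```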
THE WORLD
IS NOT BALANCED

Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive

TOO MUCH LOCALITY

[Diagram sequence: the grid’s quadrants 1-4 map to four nodes; San Francisco falls entirely inside one quadrant, so that node takes all the traffic (“I’m sad.”) while the others sit idle (“I’m bored.” “Me too.” “Let’s play xbox.”)]
A TURNING POINT

HELLO, DRAWING BOARD
SURVEY OF DISTRIBUTED P2P INDEXING
An overlay-dependent index works directly with nodes of the
peer-to-peer network, defining its own overlay
An over-DHT index overlays a more sophisticated data
structure on top of a peer-to-peer distributed hash table

ANOTHER LOOK AT POSTGIS
MIGHT WORK, BUT
The relational transaction management system (which we’d
want to change) and access methods (which we’d have to
change) are tightly coupled (necessarily?) to other parts of
the system
Could work at a higher level and treat PostGIS as a black box
• Now we’re back to implementing a peer-to-peer network with failure
recovery, fault detection, etc... and Cassandra already had all that.
• It’s probably clear by now that I think these problems are more
difficult than actually storing structured data on disk

LET’S TAKE A STEP BACK

EARTH

EARTH, TREE, RING

DATA MODEL
{
  “record-index”: {
    “layer-name:37.875, -90:40.25, -101.25”: {
      “38.6554420, -90.2992910:moonrise hotel”: [“”, 1287040737182],
      ...
    },
  },
  “record-index-meta”: {
    “layer-name:37.875, -90:40.25, -101.25”: {
      “split”: [“false”, 1287040737182],
    },
    “layer-name:37.875, -90:42.265, -101.25”: {
      “split”: [“true”, 1287040737182],
      “child-left”: [“layer-name:37.875, -90:40.25, -101.25”, 1287040737182],
      “child-right”: [“layer-name:40.25, -90:42.265, -101.25”, 1287040737182],
    }
  }
}

SPLITTING
IT’S PRETTY MUCH JUST A CONCURRENT TREE
Splitting shouldn’t lock the tree for reads or writes and failures
shouldn’t cause corruption
• Splits are optimistic, idempotent, and fail-forward
• Instead of locking, writes are replicated to the splitting node and the
relevant child[ren] while a split operation is taking place
• Cleanup occurs after the split is completed and all interested nodes are
aware that the split has occurred
• Cassandra writes are idempotent, so splits are too - if a split fails, it is
simply retried
Split size: a tunable knob for balancing locality and distributedness
The other hard problem with concurrent trees is rebalancing - we
just don’t do it! (more on this later)
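
A sketch of the write path and the split, using hypothetical helpers (store.get_meta / put_meta, index_put / index_scan / index_cleanup, pick_child, make_children); the point is the ordering, not the API. Writes always go to the node that was found and, once the split flag is visible, to the covering child as well, so a half-finished or retried split never loses data:

```python
def insert(store, node, record):
    store.index_put(node, record)                    # always write where we landed
    meta = store.get_meta(node)
    if meta.get("split") == "true":                  # a split is in flight (or finished):
        insert(store, pick_child(meta, record), record)  # also write to the covering child

def maybe_split(store, node, max_size):
    records = store.index_scan(node)
    if len(records) <= max_size:
        return
    left, right = make_children(node)                # deterministic names, so retries are idempotent
    store.put_meta(node, split="true", child_left=left, child_right=right)
    for record in records:                           # copy existing records down; concurrent
        meta = store.get_meta(node)                  # writers are already writing to both (above)
        store.index_put(pick_child(meta, record), record)
    store.index_cleanup(node)                        # clean up only after the split is visible everywhere
```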

THE ROOT IS HOT
MIGHT BE A DEAL BREAKER
For a tree to be useful, it has to be traversed
• Typically, tree traversal starts at the root
• Root is the only discoverable node in our tree
Traversing through the root meant reading the root for every
read or write below it - unacceptable
• Lots of academic solutions - most promising was a skip graph, but
that required O(n log(n)) data - also unacceptable
• Minimum tree depth was proposed, but then you just get multiple hot-spots at your minimum-depth nodes

BACK TO THE BOOKS
LOTS OF ACADEMIC WORK ON THIS TOPIC
But academia is obsessed with provable, deterministic,
asymptotically optimal algorithms
And we only need something that is probably fast enough
most of the time (for some value of “probably” and “most of
the time”)
• And if the probably good enough algorithm is, you know... tractable...
one might even consider it qualitatively better!

REDUCTIONISM
- vs -
EMERGENCE

[Diagram: “We have” a flat hash ring; “We want” a tree we can traverse]

THINKING HOLISTICALLY
WE OBSERVED THAT
Once a node in the tree exists, it doesn’t go away
Node state may change, but that state only really matters
locally - thinking a node is a leaf when it really has children is
not fatal
SO... WHAT IF WE JUST CACHED NODES THAT
WERE OBSERVED IN THE SYSTEM!?

CACHE IT
STUPID SIMPLE SOLUTION
Keep an LRU cache of nodes that have been traversed
Start traversals at the most selective relevant node
If that node doesn’t satisfy you, traverse up the tree
Along with your result set, return a list of nodes that were
traversed so the caller can add them to its cache
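
A client-side sketch of that recipe, with a plain OrderedDict as the LRU; ROOT, area, contains, parent_of, and store.search are hypothetical stand-ins for the real tree operations:

```python
from collections import OrderedDict

ROOT = "layer-name:whole-world"                      # the only node that is always discoverable

class NodeCache:
    def __init__(self, capacity=10_000):
        self.capacity, self.nodes = capacity, OrderedDict()   # node key -> bounding box

    def remember(self, node, bbox):
        self.nodes[node] = bbox
        self.nodes.move_to_end(node)
        if len(self.nodes) > self.capacity:
            self.nodes.popitem(last=False)           # evict the least recently used node

    def most_selective(self, point):
        """Smallest cached node whose bounding box contains the query point."""
        hits = [(area(bbox), node) for node, bbox in self.nodes.items() if contains(bbox, point)]
        return min(hits)[1] if hits else ROOT

def query(store, cache, point):
    node = cache.most_selective(point)
    while True:
        result = store.search(node, point)           # read the index at this node
        for seen, bbox in result.traversed:          # the response lists every node it touched
            cache.remember(seen, bbox)
        if result.satisfied or node == ROOT:
            return result.records
        node = parent_of(node)                       # not satisfied yet: walk up the tree
```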

TRAVERSAL
NEAREST NEIGHBOR

[Diagram: a nearest-neighbor query point x surrounded by candidate points o; the traversal starts at the smallest cached node containing x and expands up the tree until the nearest neighbor is found]

KEY CHARACTERISTICS
PERFORMANCE
Best case on the happy path (everything cached) has zero
read overhead
Worst case, with nothing cached, O(log(n)) read overhead
RE-BALANCING SEEMS UNNECESSARY!
Makes worst case more worser, but so far so good

DISTRIBUTED TREE
SUPPORTED QUERIES

EXACT MATCH
RANGE
PROXIMITY
SOMETHING ELSE I HAVEN’T
EVEN HEARD OF

...IN MULTIPLE DIMENSIONS!

PEACE
Integrity and Availability

Locality and Distributedness

Reductionism and Emergence

QUESTIONS?

MIKE MALONE
INFRASTRUCTURE ENGINEER
mike@simplegeo.com
@mjmalone
