Академический Документы
Профессиональный Документы
Культура Документы
Application Roadmap
Reduction of cycle time for the document intake process. Currently, it can take anywhere from a few days to a few weeks from the time the documents are received to when they are available to the client.
New
York Times used Hadoop/MapReduce to convert pre-1980 articles that were TIFF images to PDF.
Agenda
Some
history What is NoSQL CAP Theorem What is lost Types of NoSQL Data Model Frameworks Demo Wrapup
begin to front RDBMS with memcache or integrate other caching mechanisms within the application (ie. Ehcache)
Scaling Up
Issues
big RDBMS were not designed to be distributed Began to look at multi-node database solutions Known as scaling out or horizontal scaling Different approaches include:
Master-slave Sharding
All writes are written to the master. All reads performed against the replicated slave databases Critical reads may be incorrect as writes may not have been propagated down Large data sets can pose problems as master needs to duplicate data to slaves
or sharding
Scales well for both reads and writes Not transparent, application needs to be partitionaware Can no longer have relationships/joins across partitions Loss of referential integrity across shards
replication INSERT only, not UPDATES/DELETES No JOINs, thereby reducing query time
This involves de-normalizing data
In-memory
databases
What is NoSQL?
Stands
for Not Only SQL Class of non-relational data storage systems Usually do not require a fixed table schema nor do they use the concept of joins All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)
Why NoSQL?
For
data storage, an RDBMS cannot be the beall/end-all Just as there are different programming languages, need to have other data storage tools in the toolbox A NoSQL solution is more acceptable to a client now than even a year ago
Think about proposing a Ruby/Rails or Groovy/Grails solution now versus a couple of years ago
10
of social media sites (Facebook, Twitter) with large data needs Rise of cloud-based solutions such as Amazon S3 (simple storage solution) Just as moving to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes Open-source community
11
Gossip protocol (discovery and error detection) Distributed key-value data store Eventual consistency
12
datasets, acceptance of alternatives, and dynamically-typed data has come together in a perfect storm Not a backlash/rebellion against RDBMS SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings
13
CAP Theorem
Three
properties of a system: consistency, availability and partitions You can have at most two of these three properties for any shared-data system To scale out, you have to partition. That leaves either consistency or availability to choose from
In almost all cases, you would choose availability over consistency
14
Availability
Traditionally,
thought of as the server/process available five 9s (99.999 %). However, for large node system, at almost any point in time theres a good chance that a node is either down or there is a network disruption among the nodes.
Want a system that is resilient in the face of network disruption
15
Consistency Model
A
consistency model determines rules for visibility and apparent order of updates. For example:
Row X is replicated on nodes M and N Client A writes row X to node N Some period of time t elapses. Client B reads row X from node M Does client B see the write from client A? Consistency is a continuum with tradeoffs For NoSQL, the answer would be: maybe CAP Theorem states: Strict Consistency can't be achieved at the same time as availability and partitiontolerance.
16
Eventual Consistency
When
no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
17
18
Key/Value
Pros:
very fast very scalable simple model able to distribute horizontally
Cons:
- many data structures (objects) can't be easily
19
Schema-Less
Pros:
Schema-less data model is richer than key/value pairs eventual consistency many are distributed still provide excellent performance and scalability
Cons:
- typically no ACID transactions or joins
20
Common Advantages
Cheap,
easy to implement (open source) Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned Down nodes easily replaced No single point of failure Easy to distribute Don't require a schema Can scale up and down Relax the data consistency requirement (CAP)
21
by order by ACID transactions SQL as a sometimes frustrating but still powerful query language easy integration with other applications that support SQL
22
Cassandra
Originally
developed at Facebook Follows the BigTable data model: column-oriented Uses the Dynamo Eventual Consistency model Written in Java Open-sourced and exists within the Apache family Uses Apache Thrift as its API
23
Thrift
Created
Is
a cross-language, service-generation framework Binary Protocol (like Google Protocol Buffers) Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...
24
Searching
Relational
SELECT `column` FROM `database`,`table` WHERE `id` = key; SELECT product_name FROM rockets WHERE id = 123;
Cassandra
(standard)
25
API access:
get(key) -- Extract the value given a key put(key, value) -- Create or update the value given its key delete(key) -- Remove the key and its associated value execute(key, operation, parameters) -- Invoke an operation to the value (given its key) which is a special data structure (e.g. List, Set, Map .... etc).
26
Data Model
Within
way:
:Rockets, '1' might return: {'name' => Rocket-Powered Roller Skates', toon' => Ready Set Zoom', inventoryQty' => 5, productUrl => rockets\1.gif}
27
Key: the permanent name of the record Keyspace: the outer-most level of organization. This is usually the name of the application. For example, Acme' (think database name).
28
29
30
Consistent Hashing
Partition using consistent hashing Keys hash to a point on a fixed circular space Ring is partitioned into a set of ordered slots and servers and keys hashed over these slots Nodes take positions on the circle. A, B, and D exists.
A V B C
C joins.
B, D split ranges. C gets BC from D.
R M
31
Domain Model
Design
your domain model first Create your Cassandra data store to fit your domain model
<Keyspace Name="Acme"> <ColumnFamily CompareWith="UTF8Type" Name="Rockets" /> <ColumnFamily CompareWith="UTF8Type" Name="OtherProducts" /> <ColumnFamily CompareWith="UTF8Type" Name="Explosives" /> </Keyspace>
32
Data Model
ColumnFamily: Rockets Key 1 Value
Name name toon inventoryQty brakes Value Rocket-Powered Roller Skates Ready, Set, Zoom 5 false Value Little Giant Do-It-Yourself Rocket-Sled Kit Beep Prepared 4 false Value Acme Jet Propelled Unicycle Hot Rod and Reel 1 1
33
Say the OtherProducts has inventory in categories. Querying (:OtherProducts, '174927') might return: {OtherProducts' => {'name' => Acme Instant Girl', ..}, foods': {...}, martian': {...}, animals': {...}} In the example, foods, martian, and animals are all super column names. They are defined on the fly, and there can be any number of them per row. :OtherProducts would be the name of the super column family.
Columns and SuperColumns are both tuples with a name & value. The key difference is that a standard Columns value is a string and in a SuperColumn the value is a Map of Columns.
34
Each
35
Hector
Leading
Java API for Cassandra Sits on top of Thrift Adds following capabilities
Load balancing JMX monitoring Connection-pooling Failover JNDI integration with application servers Additional methods on top of the standard get, update, delete methods.
Under
discussion
37
J2EE web.xml
<resource-env-ref> <description>Object factory for Cassandra clients.</description> <resource-env-ref-name>cassandra/CassandraClientFactory</resourceenv-ref-name> <resource-env-reftype>org.apache.naming.factory.BeanFactory</resource-env-ref-type> </resource-env-ref>
38
39
40
41
Some Statistics
Facebook
42
on Rails and Grails have ORM baked in. Would have to build your own ORM framework to work with NoSQL.
Some plugins exist.
Same
Support
43
performance problems Concurrency on non-key accesses Are the replicas working? No TOAD for Cassandra
though some NoSQL offerings have GUI tools have SQLPlus-like capabilities using Ruby IRB interpreter.
44
does not matter if the data is deployed on a NoSQL platform instead of an RDBMS. Still need to address:
Backups & recovery Capacity planning Performance monitoring Data integration Tuning & optimization
What
happens when things dont work as expected and nodes are out of sync or you have a data corruption occurring at 2am? Who you gonna call?
DBA and SysAdmin need to be on board
45
most of us, we work in corporate IT and a LinkedIn or Twitter is not in our future Where would I use a NoSQL database? Do you have somewhere a large set of uncontrolled, unstructured, data that you are trying to fit into a RDBMS?
Log Analysis Social Networking Feeds (many firms hooked in through Facebook or Twitter) External feeds from partners (EAI) Data that is not easily analyzed in a RDBMS such as time-based data Large data feeds that need to be massaged before entry into an RDBMS
46
Summary
Leading
users of NoSQL datastores are social networking sites such as Twitter, Facebook, LinkedIn, and Digg. To implement a single feature in Cassandra, Digg has a dataset that is 3 terabytes and 76 billion columns. Not every problem is a nail and not every solution is a hammer.
47
Questions
48
Resources
Cassandra
http://cassandra.apache.org
Hector
http://wiki.github.com/rantav/hector http://prettyprint.me
NoSQL
News websites
http://nosql.mypopescu.com http://www.nosqldatabases.com
High
Scalability
http://highscalability.com
Video
http://www.infoq.com/presentations/ProjectVoldemort-at-Gilt-Groupe
49