
Chapter - 2

Database Model
Key-Value data store
Document Databases
Column Databases
Graph Databases
Data models
• Key-Value data store
• Document Databases
• Column Databases
• Graph Databases

• Key/Value: Redis, Tokyo Cabinet, Memcached
• ColumnFamily: Cassandra, HBase
• Document: MongoDB, CouchDB
• Graph: Neo4j
Key-value Data store
• Why do we need keys for each value?
• Naming conventions
• Keys must be unique

Scenario-based question
• How would you track the closest warehouse to a customer that has the products listed in the shopping cart?
Naming convention
Values
• Key-value databases give developers a great
deal of flexibility when storing values
• Values can be stored as binary large objects (BLOBs)
• key-value databases allow virtually any data
type in values

Key-value data store
• Example: Redis, an in-memory database (a short usage sketch follows)
• Fast
• Commonly used as a cache layer
• Trade-off: memory is costly
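The following is a minimal sketch of how such a store is used, assuming a local Redis server and the redis-py client; the key names and values are illustrative and follow an entity:id:attribute naming convention.

# Minimal sketch: storing and retrieving values in Redis (assumes a local
# Redis server and the redis-py client; keys and values are illustrative).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Keys must be unique, so a common convention is entity:id:attribute.
r.set("customer:1982737:firstname", "Alice")
r.set("customer:1982737:shippingCity", "Chennai")

# Values are retrieved by key; Redis treats them as opaque strings/BLOBs.
print(r.get("customer:1982737:firstname"))   # -> Alice

# Expire a cached entry after one hour, a typical cache-layer pattern.
r.expire("customer:1982737:firstname", 3600)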
Difference Key-value vs RDBMS
• Key-value databases are modeled on minimal
principles for storing and retrieving data.
• Unlike in relational databases, there are no tables,
so there are no features associated with tables,
such as columns and constraints on columns.
• There is no need for joins in key-value databases, so there are no foreign keys.
• Key-value databases do not support a rich query
language such as SQL.
What is a Document?
• A word processing or spreadsheet file, or perhaps even a paper document
or
• A common type of document: an HTML document
• HTML, XML, JSON
– similarities and dissimilarities
• HTML – predefined meaning of tags, application, purpose or usage
• XML?
• JSON?
• Difference between key-value and JSON
– An advantage of documents over key-value databases is that related attributes are managed within a single object
Document Database
• A document database allows some form of data description without enforcing a schema
• A document database is, at its core, a key-value store
• A happy medium between the rigid schema of the relational database and the completely schema-less key-value stores
• Supports JSON or XML formats
– XML
• Content management systems
– JSON
• Web-based applications, operational database workloads

As noted, a key distinction between document and relational databases is that document databases do not require a fixed, predefined schema.
Another important difference is that documents can have embedded documents and lists of multiple values within a document.
JSON is much faster to parse and smaller than the equivalent XML.
A fundamental difference between XML and JSON is that XML is a meta-language whose tags carry no built-in meaning, whereas JSON syntax has specific semantics built in: stuff between {} is an object, stuff between [] is an array, and so on. A JSON parser therefore knows exactly what every JSON document means.

XML vs JSON
• XML enables applications and devices of all kinds to use, store, transmit, and display data.
• XML provides the capability to display data because it is a markup language; JSON files are more human readable than XML.
• Why JSON is often preferred over XML: for a while, XML (eXtensible Markup Language) was the only choice for open data interchange, but over the years there has been a lot of transformation in the world of open data sharing. JSON is faster; parsing XML is slow and cumbersome.
XML and XQuery
• XQuery is the language for querying XML data
• XQuery for XML is like SQL for databases
• XQuery is supported by all major databases
• XQuery is a W3C Recommendation
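Running real XQuery requires an XQuery processor, typically inside an XML database. As a rough analogue only, the sketch below uses the XPath subset supported by Python's standard-library ElementTree to pull values out of a small, invented XML document.

# Rough analogue of querying XML: XPath via the standard library.
# (Real XQuery, e.g. "for $b in //book return $b/title", needs an XQuery engine.)
import xml.etree.ElementTree as ET

xml_text = """
<catalog>
  <book><title>NoSQL Distilled</title><price>30</price></book>
  <book><title>Graph Databases</title><price>25</price></book>
</catalog>
"""

root = ET.fromstring(xml_text)

# Select every book title in the document.
for title in root.findall("./book/title"):
    print(title.text)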
JSON vs XML
Issues with XML
• XML is justifiably criticized as being wasteful of
space and computationally expensive to
process
• XML tags are verbose and repetitious, typically
increasing the amount of storage required by
several factors
• Partly as a result, the XML language format is
relatively expensive to parse
XML vs JSON
• XML and JSON databases are designed to support quite dissimilar use cases and significantly different applications.
• XML databases typically are used as content-management systems; that is,
organizing and maintaining collections of text files in XML format—academic
papers, business documents, and so on.
• JSON document databases, on the other hand, mostly support web-based
operational workloads—storing and modifying the dynamic content and
transactional data at the core of modern web-based applications.
• XML databases are built for querying large and complex tree structured data;
• XML - support in a wide array of languages and frameworks
• MongoDB is built for aggregating over large sets of small JSON documents; if there is no need for parallel data processing, there is no need for MongoDB (a short pymongo sketch follows this list).
• JSON is lightweight in comparison to XML and easier to handle with JavaScript if you need something for a web application.
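A minimal sketch of the JSON-document workload described above, assuming the pymongo client and a MongoDB instance on localhost; the moviedb database, the films collection, and its fields are illustrative.

# Minimal sketch: storing and querying small JSON-like documents in MongoDB
# (assumes pymongo is installed and mongod is running on localhost:27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["moviedb"]          # database name is illustrative

# Documents in a collection need not share an identical schema.
db.films.insert_one({"title": "Vertigo", "year": 1958,
                     "actors": ["James Stewart", "Kim Novak"]})

# Query by a field inside the document.
film = db.films.find_one({"title": "Vertigo"})
print(film["actors"])

# Simple aggregation over many small documents: count films per year.
for row in db.films.aggregate([{"$group": {"_id": "$year", "n": {"$sum": 1}}}]):
    print(row)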
JSON
• A document is the basic unit of storage, corresponding
approximately to a row in an RDBMS.
• A document comprises one or more key-value pairs, and may
also contain nested documents and arrays. Arrays may also
contain documents allowing for a complex hierarchical structure.
• A collection or data bucket is a set of documents sharing some
common purpose; this is roughly equivalent to a relational table.
• The documents in a collection don’t have to be of the same
type, though it is typical for documents in a collection to
represent a common category of information.
Structure
• The Structure of JSON Objects
• JSON objects are constructed using several simple syntax rules (a sample document follows this list):
• Data is organized in key-value pairs, similar to key-value databases.
• Documents consist of name-value pairs separated by commas.
• Documents start with a { and end with a }.
• Names are strings, such as "customer_id" and "address".
• Values can be numbers, strings, Booleans (true or false), arrays, objects, or the null value.
• The values of arrays are listed within square brackets, that is, [ and ].
• The values of objects are listed as key-value pairs within curly brackets, that is, { and }.
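The hypothetical customer document below illustrates these rules (the field names are invented); it is parsed with Python's standard json module so the object/array mapping is visible.

# A hypothetical customer document illustrating the syntax rules above,
# parsed with Python's standard json module.
import json

doc = """
{
  "customer_id": 1982737,
  "name": "Alice",
  "loyalty_member": true,
  "middle_name": null,
  "address": { "city": "Chennai", "zip": "600001" },
  "orders": [
    { "order_id": 1, "total": 43.50 },
    { "order_id": 2, "total": 12.00 }
  ]
}
"""

customer = json.loads(doc)             # stuff between {} becomes a dict (object)
print(customer["orders"][0]["total"])  # stuff between [] becomes a list (array)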
Managing Multiple Documents in Collections
• The full potential of document databases becomes apparent when you work with large numbers of documents.
• Documents are generally grouped into collections of similar documents.
• One of the key parts of modeling document databases is deciding how you will organize your documents into collections.
Example
• Example: RDBMS and Document DB (in the accompanying Word document)
• This approach results in "actors" being duplicated across multiple documents, and in a complex design this could lead to issues and possibly inconsistencies if any of the "actor" attributes need to be changed.
• The number of actors in a film is relatively small, but in other application scenarios, problems can also occur if the number of members in an embedded document increases without limit.
• For these reasons, a database designer might choose instead to link multiple documents using document identifiers, much in the way a relational database relates rows via foreign keys.
• We embed an array of actor IDs into the "films" document, which can be used to locate the actors who appear in a film (both designs are sketched below).
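A sketch of the two designs just described, written as Python dictionaries with illustrative field names: the first embeds actor documents inside the film, the second stores only an array of actor IDs that reference separate actor documents.

# Design 1: actors embedded inside the film document.
# Duplicates actor data across films and can grow without limit.
film_embedded = {
    "title": "Vertigo",
    "actors": [
        {"actor_id": 101, "name": "James Stewart"},
        {"actor_id": 102, "name": "Kim Novak"},
    ],
}

# Design 2: the film references actors by ID, much like foreign keys.
film_referenced = {"title": "Vertigo", "actor_ids": [101, 102]}
actors = {
    101: {"actor_id": 101, "name": "James Stewart"},
    102: {"actor_id": 102, "name": "Kim Novak"},
}

# Resolving the references requires a second lookup (the "join" done in code).
cast = [actors[a] for a in film_referenced["actor_ids"]]
print([a["name"] for a in cast])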
RDBMS vs NoSQL DB
NoSQL DBs Key Features
Polymorphic Schema – Doc DB
Vertical Partitioning
Horizontal Partitioning
Column Databases
• A column is a basic unit of storage in a column
family database. A column is a name and a value
Characteristics of Column DB
• When there are large numbers of columns, it can help to group
them into collections of related columns.
• For example, first and last name are often used together,
• Office numbers and office phone numbers are frequently needed
together.
• These can be grouped in collections called column families.
• As in document databases, column family databases do not require
a predefined fixed schema.
• Developers can add columns as needed. Also, rows can have
different sets of columns and super columns.
• Column family databases are designed for rows with many columns.
• Query languages for column family databases
may look similar to SQL.
• The query language can support SQL-like
terms such as SELECT, INSERT, UPDATE, and
DELETE
• as well as column family–specific operations,
such as CREATE COLUMNFAMILY
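A minimal sketch of such a query language in use, assuming the DataStax cassandra-driver and a Cassandra node on localhost; the shop keyspace, customers table, and columns are illustrative, and CREATE COLUMNFAMILY is the older CQL spelling of CREATE TABLE.

# Minimal sketch: SQL-like CQL statements against a column family database
# (assumes the DataStax cassandra-driver and a local Cassandra node;
#  keyspace, table, and column names are illustrative).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE COLUMNFAMILY IF NOT EXISTS shop.customers (
        customer_id int PRIMARY KEY,
        first_name  text,
        last_name   text
    )
""")

session.execute(
    "INSERT INTO shop.customers (customer_id, first_name, last_name) "
    "VALUES (%s, %s, %s)",
    (1982737, "Alice", "Kumar"))

for row in session.execute("SELECT first_name, last_name FROM shop.customers"):
    print(row.first_name, row.last_name)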
Column Databases
• “row formatted” - The first databases that
attempted to implement the relational model were
created during a period in which OLTP processing—
essentially, record at a time processing—was the
most important type of database workload.
• Analytic workloads - row-oriented physical
organization became less ideal. In a data warehouse
you rarely want to process all the columns of a
single row, but you often want to process the values
of a single column across all rows
• Column-oriented databases address this
requirement by storing columns physically
together on disk
• Examples
– Student DBs in School, COE, Admission office and
Hostel
Star Schema – Example 2
Advantages & Disadvantages

• Aggregate the values of specific columns - Adv


• Compression – Adv
• Disadv - Write penalty – poor choice for OLTP database
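The following is an illustration only, not any particular product's storage format: the same three rows stored row-wise and column-wise, showing why summing one column touches far less data in the columnar layout.

# Illustration only: the same data stored row-wise vs column-wise.
# Aggregating one column in the columnar layout reads a single list
# instead of scanning every field of every row.

rows = [  # row-oriented: one record at a time (good for OLTP)
    {"id": 1, "name": "Alice", "city": "Chennai", "sales": 120.0},
    {"id": 2, "name": "Bob",   "city": "Delhi",   "sales": 80.0},
    {"id": 3, "name": "Carol", "city": "Chennai", "sales": 200.0},
]

columns = {  # column-oriented: one column stored contiguously (good for analytics)
    "id":    [1, 2, 3],
    "name":  ["Alice", "Bob", "Carol"],
    "city":  ["Chennai", "Delhi", "Chennai"],
    "sales": [120.0, 80.0, 200.0],
}

print(sum(r["sales"] for r in rows))   # must walk every whole row
print(sum(columns["sales"]))           # reads only the "sales" column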
How many records need to be read into RAM?
Asynchronous Tuple Mover
• insert and update overhead for single rows is a key weakness of a columnar architecture.
• However, it became increasingly important for data warehouses to provide real-time “up-to-
the-minute” information.
• The simplistic columnar architecture is unable to cope with this constant stream of row-level modifications.
• To address this issue, columnar databases generally implement some form of write-
optimized delta store (we’ll call this the delta store for short).
• This area of the database is optimized for frequent writes.
• Regardless of the internal format of the data, the delta store is generally memory resident,
the data is generally uncompressed, and the store can accept high-frequency data
modifications.
• Data in the delta store is periodically merged with the main columnar-oriented store.
• In Vertica, this process is referred to as the Tuple Mover and in Sybase IQ as the RLV (Row
Level Versioned) Store Merge.
• The merge will occur periodically, or whenever the amount of data in the delta store
exceeds a threshold.
• Prior to the merge, queries might have needed to access both the delta store and the
column store in order to return complete and accurate results.
Asynchronous Tuple Mover
• Large-scale bulk sequential loads—such as nightly ETL
jobs—will generally be directed to the column store
(1).
• Incremental inserts and updates will be directed to
the write-optimized store (2).
• Queries may need to read from both stores in order to
get complete and consistent results (3).
• Periodically, or as required, a process will shift data
from the write-optimized store to the column store
(4).
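A conceptual sketch of steps (1) to (4) above, not the actual Vertica Tuple Mover or Sybase IQ RLV implementation: incremental writes land in a memory-resident delta store, queries read both stores, and the delta is periodically merged into the column store.

# Conceptual sketch of a write-optimized delta store in front of a column store.
column_store = {"id": [1, 2], "sales": [120.0, 80.0]}   # (1) bulk-loaded data
delta_store = []                                        # recent row-level writes

def insert(row):
    delta_store.append(row)          # (2) incremental writes hit the delta store

def total_sales():
    # (3) queries combine both stores for complete, consistent results
    return sum(column_store["sales"]) + sum(r["sales"] for r in delta_store)

def merge(threshold=1):
    # (4) periodically move delta rows into the columnar layout
    if len(delta_store) >= threshold:
        for row in delta_store:
            column_store["id"].append(row["id"])
            column_store["sales"].append(row["sales"])
        delta_store.clear()

insert({"id": 3, "sales": 200.0})
print(total_sales())              # 400.0, read from both stores
merge()
print(column_store["sales"])      # [120.0, 80.0, 200.0]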
Architecture
• Insert and update overhead for single rows is a key weakness of a columnar architecture.
Projection
Graph databases
• Proponents of key-value stores, document databases, and relational systems
disagree about practically every aspect of database design, but they do agree
in one respect: databases are about storing information about “things,” be
those things represented by JSON, tables, or binary values.
• But sometimes it’s the relationship between things, rather than the things
themselves, that are of primary interest.
• This is where graph database systems shine.
• Graph structures are most familiar to us from social networks such as
Facebook.
• In Facebook, the information about individuals is important, but it’s the
network between people—the social graph—that provides the unique power
of the platform.
• Similar graph-oriented datasets occur within network topologies,
• access-control systems, medical models, and elsewhere.
Graph
• Graph theory defines these major components of a graph:
– Vertices, or "nodes," represent distinct objects.
– Edges, or "relationships" or "arcs," connect these objects.
– Both vertices and edges can have properties.
Example
RDBMS
Facebook example – Graph Databases – Neo4j
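A minimal sketch of a social-graph query in Neo4j using the official Python driver; the connection URI, credentials, the Person label, and the FRIEND relationship type are illustrative.

# Minimal sketch: a tiny social graph in Neo4j via the official Python driver
# (assumes a local Neo4j instance; labels, relationship type, and credentials
#  are illustrative).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Vertices (nodes) with properties, connected by an edge (relationship).
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIEND]->(b)",
        a="Alice", b="Bob")

    # Traverse the graph: who are Alice's friends?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIEND]->(friend) "
        "RETURN friend.name AS name", name="Alice")
    print([record["name"] for record in result])

driver.close()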
Hash function and hash rings
Associative Arrays
• An associative array is a data structure, like an
array, but is not restricted to using integers as
indexes or limiting values to the same type.
• keys can be strings of characters or integers
exampleAssociativeArray['Pi'] = 3.1415
exampleAssociativeArray['CapitalFrance'] = 'Paris'
exampleAssociativeArray['ToDoList'] = { 'Alice' : 'run reports; meeting with Bob', 'Bob' : 'order inventory; meeting with Alice' }
exampleAssociativeArray[17234] = 34468
Cache and Associate Array
• An in-memory cache is an associative array.
• The values retrieved from the relational database could be stored in the cache by
creating a key for each value stored.
• One way to create a unique key for each piece of data for each customer is to
concatenate a unique identifier with the name of the data item.
• For example, the following stores the data retrieved from the database in an in-memory cache:
customerCache['1982737:firstname'] = firstName
customerCache['1982737:lastname'] = lastName
customerCache['1982737:shippingAddress'] = shippingAddress
customerCache['1982737:shippingCity'] = shippingCity
customerCache['1982737:shippingState'] = shippingState
customerCache['1982737:shippingZip'] = shippingZip
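A short sketch of the cache-aside pattern implied above, using a plain Python dict as the associative array; load_from_database is a hypothetical stand-in for a real SQL lookup.

# Sketch of the cache-aside pattern: the in-memory cache is an associative
# array keyed by "customer_id:attribute". load_from_database() is hypothetical.
customer_cache = {}

def load_from_database(customer_id, attribute):
    # Placeholder for e.g. SELECT <attribute> FROM customers WHERE id = ...
    return "value-from-db"

def get_customer_attribute(customer_id, attribute):
    key = f"{customer_id}:{attribute}"        # concatenated unique key
    if key not in customer_cache:             # cache miss: go to the database
        customer_cache[key] = load_from_database(customer_id, attribute)
    return customer_cache[key]                # cache hit on later calls

print(get_customer_attribute(1982737, "firstname"))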
Scalability
• Master-slave replication
• Masterless replication
An advantage of master-slave models is simplicity.
Except for the master, each node in the cluster
only needs to communicate with one other server:
the master. The master accepts all writes, so there
is no need to coordinate write operations or
resolve conflicts between multiple servers
accepting writes.

A disadvantage of the master-slave replication model is that if the master fails, the cluster cannot accept writes.
This can adversely impact the availability of the cluster. The master server is known as a single point of failure—that is, a single component in a system that, if it fails, causes the entire system to fail or at least lose a critical capacity, such as accepting writes.
Masterless
Issues
• In a masterless replication model, there is not
a single server that has the master copy of
updated data, so no single server can copy its
data to all other servers.
• Instead, servers in a masterless replication
model work in groups to help their neighbors.
• Addison Wesley Figure 3.10 and 3.11
Using Keys to Locate Values
• Using numbers to identify locations is a good
idea, but it is not flexible enough.
• The trick is to use a function that maps from
integers, character strings, or lists of objects to
a unique string or number.
• These functions that map from one type of
value to a number are known as hash
functions.
Hash function
• A hash function is a function that can take an arbitrary string of characters and produce a (usually) unique, fixed-length string of characters. (Sometimes two unrelated inputs can generate the same output; this is known as a collision.)
Hash function
• You can take advantage of the fact that the hash function returns a number.
• Because the write load should be evenly distributed across all eight servers, you can send
one eighth of all writes to each server.
• You could send the first write to Server 1, the second to Server 2, the third to Server 3, and
so on in a round-robin fashion, but this would not take advantage of the hash value.
• One way to take advantage of the hash value is to start by dividing the hash value by the
number of servers. Sometimes the hash value will divide evenly by the number of servers.
• (For this discussion, assume the hash function returns decimal numbers, not hexadecimal
• numbers, and that the number of digits in the number is not fixed.)
• If the hash function returns the number 32 and that number is divided by 8, then the
remainder is 0. If the hash function returns 41 and it is divided by 8, then the remainder is
1. If the hash function returns 67, division by 8 leaves a remainder of 3.
• As you can see, any division by 8 will have a remainder between 0 and 7. Each of the eight
servers can be assigned a number between 0 and 7.
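A small sketch of this remainder-based assignment, using SHA-1 from Python's hashlib. Since a real digest is hexadecimal, it is converted to an integer before taking the remainder; the eight servers are numbered 0 to 7 as in the discussion, and the keys are illustrative.

# Sketch: picking one of eight servers from the remainder of a hash value.
import hashlib

NUM_SERVERS = 8

def server_for_key(key: str) -> int:
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()  # 40 hex characters
    return int(digest, 16) % NUM_SERVERS                    # remainder 0..7

for key in ["customer:1982737", "customer:1982738", "order:41"]:
    print(key, "->", "server", server_for_key(key))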
Moving queries to the data, not data to the queries
• When a client wants to send a general query to all nodes that hold data, it's more efficient to send the query to each node than it is to transfer large datasets to a central processor.
• http://www.ques10.com/p/2828/explain-the-ways-that-nosql-system-handle-big-da-1/#2832
Moving queries to the data, not data to the queries

• With the exception of large graph databases, most NoSQL systems use commodity processors that each hold a subset of the data on their local shared-nothing drives.
• When a client wants to send a general query to all nodes that
hold data, it’s more efficient to send the query to each node
than it is to transfer large datasets to a central processor.
• This simple rule helps you understand how NoSQL databases
can have dramatic performance advantages over systems that
weren’t designed to distribute queries to the data nodes.
• Keeping all the data within each data node in the form of
logical documents means that only the query itself and the
final result need to be moved over a network. This keeps your
big data queries fast.
Hash rings to evenly distribute data on a
cluster
• Using a hash ring technique with randomly generated 40-character keys is a good way to evenly distribute big data loads, and the network load, over many servers.
• Partitioning keys into ranges and assigning
different key ranges to specific nodes is known
as keyspace management.
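A compact sketch of keyspace management on a hash ring, using bisect over sorted node positions; the node names and the choice of SHA-1 are illustrative, and production systems add virtual nodes and replication.

# Sketch of a hash ring: each node owns the range of key hashes up to its
# position on the ring. Real systems add virtual nodes and replicas.
import bisect
import hashlib

def ring_position(value: str) -> int:
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(n), n) for n in nodes)
positions = [pos for pos, _ in ring]

def node_for_key(key: str) -> str:
    pos = ring_position(key)
    idx = bisect.bisect_right(positions, pos) % len(ring)  # wrap around the ring
    return ring[idx][1]

for key in ["customer:1982737", "order:41", "cart:9"]:
    print(key, "->", node_for_key(key))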
Replication
• http://www.geekinterview.com/question_details/85235
Scenario Based Question
• Online shopping on Amazon during Diwali and Thanksgiving Day. Consider a few sets of items (such as mobiles, laptops, dress materials, ...).
• Fix location of servers and number of servers,
– Include Sharding criteria , Replication
– Distributed Databases
– Check Consistency in Read and Write operations
– ACID, CAP and BASE properties
– Suggest database models
– Draw Architecture diagram(Design)
– Discuss pros and cons of your design
