WHITEPAPER
User's Guide to the Emerging Database Landscape: Row vs. Columnar vs. NoSQL
Overview
Businesses today are challenged by the ongoing explosion of data. Organizations capture, track, analyze and store more information than ever before: everything from mass quantities of transactional, online and mobile data to growing amounts of machine-generated data. In fact, machine-generated data represents the fastest-growing category of Big Data. How can you effectively address the impact of data overload on application performance, speed and reliability? Where do newer technologies such as columnar databases and NoSQL come into play? The first thing to recognize is that, in the new data management paradigm, one size will not fit all data needs. The right IT solution may encompass one, two or even three technologies working together. Figuring out which of the several technologies (and even subvariants of these technologies) meets your needs while also fitting your IT staffing and budget parameters is no small issue. We hope this User's Guide will help clarify which data management approach is best for which of your company's data challenges.
INFOBRIGHT Corporate Headquarters 47 Colborne Street, Suite 403 Toronto, Ontario M5E1P8 Canada Tel. 416 596 2483 Toll Free 877 596 2483 info@infobright.com www.infobright.com Sales: North America Tel. 312-924-1695 EMEA Tel. +353 (0)87 743 7107
1 Gartner IT Infrastructure, Operations & Management Summit 2009 Post Event Brief.
2 Keeping Up with Ever-expanding Enterprise Data, Joseph McKendrick, Research Analyst, Unisphere Research, October 2010.
Transactional Powerhouse
Row-based Database
The Rise of the Columnar Database, Mike Vizard, IT BusinessEdge, June 14 2011.
data set increases in size, disk I/O becomes a substantial limiting factor since a row-oriented design forces the database to retrieve all column data for any query. As we mentioned above, many companies try to solve this I/O problem by creating indices to optimize queries. This may work for routine reports (e.g., you always want to know how many toasters you sold for the third week of a reporting period), but there is a point of diminishing returns: load speed degrades because indices need to be recreated as data is added. In addition, users are severely limited in their ability to quickly run ad-hoc queries (e.g., how many toasters did we sell through our first Groupon offer? Should we do it again?) that can't depend on indices to optimize results.
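The difference between a routine, indexed report and an ad-hoc question can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `sales` table, its columns, the sample rows and the index name `idx_sales_week` are hypothetical, and SQLite stands in for a generic row-based database rather than any product named in this guide.

```python
import sqlite3

# Hypothetical row-based table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (week INTEGER, product TEXT, promo TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "toaster", "none", 40),
        (2, "toaster", "groupon", 75),
        (3, "toaster", "none", 55),
        (3, "kettle", "none", 20),
    ],
)
# Index created in advance for the routine weekly report.
conn.execute("CREATE INDEX idx_sales_week ON sales (week)")


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Routine report: filters on the indexed column.
routine_plan = query_plan("SELECT SUM(units) FROM sales WHERE week = 3")
# Ad-hoc question: filters on a column nobody indexed in advance.
ad_hoc_plan = query_plan("SELECT SUM(units) FROM sales WHERE promo = 'groupon'")

print("routine:", routine_plan)
print("ad hoc: ", ad_hoc_plan)
```

The routine query can use the prepared index, while the ad-hoc query falls back to scanning the whole table. Adding an index for every possible question would slow down loading, which is the diminishing return described above.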
Lightning Analytics
Column-based Database
Column-oriented databases allow data to be stored column-by-column rather than row-by-row. This simple pivot in perspective, looking down rather than looking across, has profound implications for analytic speed. Column-oriented databases are better suited for analytics where, unlike transactions, only portions of each record are required. By grouping the data together this way, the database only needs to retrieve columns that are relevant to the query, greatly reducing the overall I/O. Returning to the example in the section above, we see that a columnar database would not only eliminate 43 days of data, it would also eliminate 28 columns of data. Returning only the columns for toasters and units sold, the columnar database would return only 14 million data elements, or 93% less data. By returning so much less data, columnar databases are much faster than row-based databases when analyzing large data sets.
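The 93% figure can be checked with a few lines of arithmetic. The sketch below assumes the table is 30 columns wide (the 28 eliminated plus the 2 returned) and that the 14 million elements are spread evenly over those 2 columns; both values are inferred from the numbers in the paragraph, not stated in it. The toy table at the end is hypothetical and only shows what "looking down rather than across" means for a query.

```python
# Figures taken from the paragraph above.
COLUMNS_ELIMINATED = 28
COLUMNS_RETURNED = 2            # toasters and units sold
ELEMENTS_RETURNED = 14_000_000

# Inferred, not stated in the text: table width and number of rows read.
total_columns = COLUMNS_ELIMINATED + COLUMNS_RETURNED
rows = ELEMENTS_RETURNED // COLUMNS_RETURNED

row_store_elements = rows * total_columns        # every column of every row
column_store_elements = rows * COLUMNS_RETURNED  # only the relevant columns
reduction = 1 - column_store_elements / row_store_elements

print(f"row store reads    {row_store_elements:,} elements")
print(f"column store reads {column_store_elements:,} elements")
print(f"reduction          {reduction:.0%}")

# Toy illustration (hypothetical data): the same records in both layouts.
row_layout = [("toaster", "east", 5), ("kettle", "west", 2), ("toaster", "west", 7)]
names = ("product", "region", "units")
column_layout = {name: list(values) for name, values in zip(names, zip(*row_layout))}

# The query touches two columns and never reads "region".
toaster_units = sum(
    units
    for product, units in zip(column_layout["product"], column_layout["units"])
    if product == "toaster"
)
print("toaster units:", toaster_units)
```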
In addition, some columnar databases (such as Infobright) compress data at high rates because each column stores a single data type (as opposed to rows, which typically contain several data types), which allows compression to be optimized for each particular data type. Row-based databases have multiple data types and a limitless range of values, thus making compression less efficient overall. Read the sidebar "Infobright: Putting Intelligence in Columns" to learn how Infobright improves query speed even more, while simplifying administration and lowering costs, with its Knowledge Grid and Domain Expertise™ capabilities.
Figure 3. Pivoting Data for Columnar View
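A rough way to see why single-type columns compress well is to run a simple run-length encoder over the same data in both layouts. This is a minimal sketch with hypothetical sample rows; run-length encoding is used here as a stand-in and is not a description of the compression Infobright or any other product actually applies.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sales records: (product, region, units).
rows = [
    ("toaster", "east", 5),
    ("toaster", "east", 7),
    ("toaster", "west", 3),
    ("kettle", "west", 4),
    ("kettle", "west", 9),
]

# Row layout: values of different types are interleaved.
row_stream = [value for row in rows for value in row]
# Column layout: one column is a contiguous sequence of a single type.
product_column = [row[0] for row in rows]

row_runs = run_length_encode(row_stream)
product_runs = run_length_encode(product_column)

print(len(row_stream), "values ->", len(row_runs), "runs in row layout")
print(len(product_column), "values ->", len(product_runs), "runs in the product column")
```

In the interleaved row stream no two neighboring values match, so the encoder saves nothing. In the product column the repeated values sit next to each other and collapse into two runs.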
While each technology addresses different problems, they all share certain attributes: huge volumes of data and high transaction rates, a distributed architecture, and often unstructured (or semi-structured) data with heavy read/write workloads. Unstructured information is typically text-heavy but may contain data such as dates and other numbers as well. The resulting irregularities and ambiguities make this data unsuitable for traditional row-based or column-based structured databases. In short, NoSQL solutions are typically beasts in terms of their data capacity, lookup speed and ability to handle streaming data, especially over highly scaled environments. On the other hand, they generally lack a SQL interface and often come with few or no programmatic interfaces, meaning that setup and administration may require some specialized skills. In addition, NoSQL solutions can be limited in their ability to execute complex queries, restricting the types of actionable analytics they can deliver. For example, queries that JOIN two tables or employ nested SELECTs are typically not possible using these technologies.
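For readers less familiar with the two query types just mentioned, the following sketch shows a JOIN and a nested SELECT running against a relational engine (SQLite, through Python's built-in sqlite3 module). The `customers` and `orders` tables and their contents are hypothetical and serve only to show the kind of query meant here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, for illustration only.
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, units INTEGER);
    INSERT INTO customers VALUES (1, 'Jane'), (2, 'Raj');
    INSERT INTO orders VALUES (10, 1, 3), (11, 1, 4), (12, 2, 1);
    """
)

# JOIN: combine rows from two tables on a shared key.
units_per_customer = conn.execute(
    """
    SELECT c.name, SUM(o.units)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

# Nested SELECT: the inner query feeds the outer query's filter.
above_average = conn.execute(
    """
    SELECT id FROM orders
    WHERE units > (SELECT AVG(units) FROM orders)
    ORDER BY id
    """
).fetchall()

print(units_per_customer)
print(above_average)
```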
Below, we go a bit deeper into each of three main NoSQL subvariants: key-value stores, document stores and column stores.
Wikipedia, http://en.wikipedia.org/wiki/NoSQL
Data Beasts
NoSQL Database
Key-value Store
A key-value store does what it sounds like it does: values are stored and indexed by a key, usually built on a hash or tree data structure.⁵ Key-value pairs are widely used in tables and configuration files. Key-value stores allow the application to store its data without predefining a schema; there is no need for a fixed data model. In a key-value store, for example, a record may look like:
12345 => img456.jpg,checkout.js,20
Companies turn to key-value stores when they require the functionality of key-values but do not require the technology overhead of a traditional RDBMS, either because they require more efficient, cost-effective scalability or because they are working with unstructured or semi-structured data. Key-value stores are great for unstructured data centered on a single object, and where data is stored in memory with some persistent backup. Consequently, they are typically used as a cache for data frequently requested by web applications such as online shopping carts or social-media sites. As these web pages are created on the fly, the static components are quickly retrieved and served up to the user.
Document Store
As with a key-value store, companies turn to NoSQL document stores when they are dealing with huge volumes of data and transactions requiring massive horizontal scaling or sharding. And, similarly, there is no need for a pre-set schema. However, the data in document stores can contain several keys, so queries aren't as limited as they are in key-value stores. For example, in a document data store an example record could read:
id => 12345, name => Jane, age => 22, email => jane@gmail.com
While multiple keys increase the types of possible queries, the data stored in these documents does not need to be predefined and can change from document to document. The tradeoff for the more complex query options is speed: queries with a key-value store are much simpler and often faster. Document stores are often deployed for web-traffic analysis, user-behavior/action analysis, or log-file analysis in real time. However, while document stores allow more query capabilities than key-value stores, there are still limitations given the non-relational basis of the document-store database.
Column Store
Column stores are an emerging NoSQL option, created in response to very specific database problems involving beyond-massive amounts of data across a hugely distributed system. Think Google. Think Facebook. Imagine the colossal amount of data that Google stores in its data farms. And then imagine how many permutations of data sets need to be compiled to respond to all possible Google
For more on hash functions see http://en.wikipedia.org/wiki/Hash_function. For more on tree data see http://en.wikipedia.org/wiki/Tree_%28data_structure%29.
searches. Clearly, this task could never be accomplished in any reasonable time frame with a traditional relational database. It requires the ability to handle massive amounts of data but with more query complexity than either key-value stores or document stores would deliver. Most column stores also use MapReduce, a fault-tolerant framework for processing huge datasets on certain kinds of distributable problems using a large number of computers. This technology is still emerging, and use cases may eventually overlap with document stores as both technologies mature. But at the moment, the use cases in production for column stores are generally limited to applications such as Google and Facebook.
A Column by Any Other Name...
It should go without saying, but we'll say it anyway: a column store is only similar to a column-based database in that they both have the word "column" in their names. A column-based database is still a structured relational database, albeit one optimized for analytics. A column store is still firmly in the NoSQL camp: this is a system for handling huge volumes of data and transactions, in a massively distributed manner, without the need to define the database structure up front, though it tends to have more SQL traits than either a key-value store or document store.
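Looking back at the key-value and document records shown earlier, the difference in query flexibility can be sketched with plain Python structures. This is a toy model, not the API of any NoSQL product: the helper names `get` and `find` are hypothetical, as is the second document, which is added only to show that fields may vary from record to record.

```python
# Key-value store: an opaque value that can only be addressed by its key.
key_value_store = {"12345": "img456.jpg,checkout.js,20"}

# Document store: each record carries its own named fields, with no fixed schema.
document_store = [
    {"id": 12345, "name": "Jane", "age": 22, "email": "jane@gmail.com"},
    {"id": 67890, "name": "Raj", "nickname": "RJ"},  # hypothetical second record
]


def get(key):
    """Key-value lookup: the only question the store can answer."""
    return key_value_store.get(key)


def find(**criteria):
    """Document lookup: match on any combination of fields."""
    return [
        doc
        for doc in document_store
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


cached = get("12345")
by_name = find(name="Jane")
by_nickname = find(nickname="RJ")

print(cached)
print(by_name)
print(by_nickname)
```

The key-value lookup is a single hash access, which is why it tends to be simpler and faster. The document lookup has to inspect fields, which is the price of being able to ask more kinds of questions.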
The data typically stored with Hadoop is complex, comes from multiple data sources and, well, there's always lots and lots of it. Beyond being a mass-storage system, Hadoop, through MapReduce, is also used for batch processing and computation done in parallel across a cluster of servers. While running MapReduce jobs is a common way to access data stored in Hadoop, technologies such as HBase and Hive, which sit on top of HDFS, are also used to query the data.
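The MapReduce pattern mentioned above can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps only; it does not use Hadoop's API, and the function names and the sample log lines are hypothetical. In a real cluster, the map and reduce calls would run in parallel on many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, 1) pair for every token in one input line."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


# Hypothetical log lines standing in for a large distributed data set.
log_lines = ["GET /cart", "GET /checkout", "POST /cart"]

mapped = [pair for line in log_lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```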
Row-Based
Basic Description
Data structured in rows
Columnar
Data is vertically striped and stored in columns
NoSQL Document Store
Persistent storage for unstructured or semi-structured data along with some SQL-like querying functionality
Web apps or any app that needs better performance and scalability without having to define columns in an RDBMS
NoSQL Column Store
Very large data storage, MapReduce support
Strengths
Persistent store with scalability features such as sharding built in, and better query support than key-value stores
Very high throughput for Big Data, strong partitioning support, random read-write access
Weaknesses
Not suited for transactions; import and export speed; heavy computing resource utilization
Usually all data must fit into memory, no complex query capabilities
Low-level API, inability to perform complex queries, high latency of response to queries
Typical Database Size Range
Key Players
MySQL, Oracle, SQL Server, Sybase ASE
Copyright 2011 Infobright Inc. Infobright is a registered trademark of Infobright Inc. All other trademarks and registered trademarks are the property of their respective owners.