WP05 SQLHadoopBuyerGuide 082514

Actian
SQL in Hadoop
Buyer's Guide

Contents
Introduction: Big Data and Hadoop
SQL on Hadoop Benefits
Approaches to SQL on Hadoop
The Top 10 SQL in Hadoop Capabilities
SQL in Hadoop Decision Criteria
Actian: The High-performance SQL in Hadoop Database
Summary
Learn more
Try it for free!

Actian SQL in Hadoop Buyer's Guide
Introduction: Big Data and Hadoop

When most people think of big data, they also think of Hadoop. Why? Because Hadoop has
emerged as the de facto data operating system platform to address the biggest challenge posed
by big data: ginormous amounts of data continuously being generated. Organizations have come
to recognize that they can use Hadoop to store massive data sets by distributing them across
inexpensive commodity servers at a tenth of the cost of storing data in a data warehouse.
All this Hadoop data, stored in an ever-growing reservoir, provides an incredible opportunity for
organizations that can mine it to create transformational business outcomes,
including:
- Delighting customers with exceptional service to increase revenue or reduce churn.
- Responding faster or using new insights to create a competitive advantage.
- Reducing risk and fraud, which impact the bottom line of every organization.
- Uncovering opportunities for new product or service offerings.

However, while Hadoop has helped address the big data challenge of how to inexpensively store
large amounts of data of all shapes and sizes, it has fallen short as an analytics platform to
deliver on the promise of big data. This is partly because of limitations in the Hadoop technology itself:
- Not a database: Hadoop consists of multiple components but fundamentally is a file
  system, the Hadoop Distributed File System (HDFS), with a specialized programming
  framework (MapReduce) to build programs that can process the data stored in HDFS. It
  is not easy to find individual records or record sets, and Hadoop lacks most of the
  capabilities of a typical database to organize, access, and manage data.
- Batch vs. low latency: Hadoop is designed for large batch processing where it can take
  hours or even days to analyze a single data set. Because Hadoop is a file system, it
  requires scanning all files just to find a single record. Batch processing also makes it
  difficult for data scientists to explore data and build analytic models. Moreover, Hadoop
  cannot support the interactive, ad hoc queries required by most business analysts and
  applications.
- Programming required: Hadoop requires the use of complex, specialized MapReduce
  programs to process data. MapReduce programmers are in short supply and rarely
  understand the business objectives for how the Hadoop data is to be used. Plus, writing,
  testing, and running a MapReduce job to prepare and analyze data often can take
  weeks. This can become a huge bottleneck as every data request from a data scientist or
  business analyst must go through a Java programmer.
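
To make the "programming required" point concrete, the following is a minimal sketch that computes the same per-customer totals twice: once with hand-written map and reduce steps, and once with a single SQL statement. It is only an illustration. It runs in plain Python with an in-memory SQLite database standing in for a SQL engine, not on Hadoop, and the sample data and function names are assumptions made for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (customer, order amount).
ORDERS = [("alice", 30), ("bob", 20), ("alice", 50), ("carol", 10), ("bob", 5)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for customer, amount in records:
        yield customer, amount


def reduce_phase(pairs):
    """Group the pairs by key and sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def totals_with_sql(records):
    """Compute the same totals with a single declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


mapreduce_totals = reduce_phase(map_phase(ORDERS))
sql_totals = totals_with_sql(ORDERS)
```

Both paths return the same totals. The difference is that the SQL version states what result is wanted and leaves the grouping logic to the engine, which is the kind of access a business analyst already knows how to write.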

Thus, the need for rare and expensive skillsets, long and error-prone implementation cycles, lack
of support for popular reporting and BI tools, and inadequate execution speed have led to a
search for an alternative that combines all of the benefits of Hadoop with SQL, the world's
most widely used data querying language.

This guide is designed to help organizations understand what capabilities are most important
when making a SQL on Hadoop solution purchase decision.
SQL on Hadoop Benefits

SQL is THE standard trade language for anyone who uses databases and works with data. There
are more than one million SQL-trained users and only 150,000 MapReduce programmers (with
an estimated shortage of 170,000). Almost every enterprise application and business analyst
tool uses SQL to manage data. It makes sense that SQL should be the way business users want to
access Hadoop data.
Here are six more reasons why it makes sense to consider a SQL on Hadoop solution:
1. Use existing SQL-trained users: no retraining needed for user access to Hadoop data.
2. Use existing BI tools: immediate productivity using existing SQL-based business
intelligence and visualization tools.
3. Use existing SQL apps: no need to rewrite queries to access Hadoop-based data.
4. No hunting for Hadoop skills: no searching for hard-to-find, expensive MapReduce
programmers.
5. No transferring of Hadoop data: no need to move the data out of Hadoop in order for
business users to be able to use SQL to access Hadoop data.
6. No managing duplicate Hadoop data: duplicating and maintaining another copy of
Hadoop data in a separate data management environment negates the cost-benefit of
Hadoop and also requires additional IT resources for support.
Approaches to SQL on Hadoop

The SQL on Hadoop marketplace is crowded with solutions that broadly can be viewed in three
main categories:
- Hadoop Connector: With this approach, the organization must deploy both a Hadoop
  cluster and a DBMS cluster, on the same or separate hardware, and use a connector to
  pass data back and forth between the two systems. The approach is expensive and hard
  to manage and, most often, is adopted by traditional data warehouse vendors. Solutions
  that fall into this category include HP Vertica, Teradata, and Oracle.
- SQL and Hadoop: In this approach, vendors have taken an existing SQL engine and
  modified it such that when generating a query execution plan, it can determine which
  parts of the query should be executed via MapReduce and which parts should be
  executed via SQL operators. Data that is processed via SQL operators is copied from
  HDFS into a local table structure. This again requires management of data in both HDFS
  and local tables. Solutions that fall into this category include Hadapt, RainStor, Citus
  Data, and Splice Machine.
- SQL on Hadoop: These vendors are building SQL engines from the ground up that
  enable native SQL processing of data in HDFS while avoiding the use of MapReduce.
  These products have limited SQL language support, rudimentary query optimizers that
  can require handcrafting queries to specify the join order, and no support for trickle
  updates. Product immaturity is reflected in the lack of support for internationalization,
  limited security features, and the lack of workload management. Solutions that fall into
  this category include Impala, Drill, Stinger, and HAWQ.
Unfortunately, all of these categories fall short of what is needed to deliver true SQL access to
Hadoop data. Thus, a new category called "SQL in Hadoop" is needed for solutions that provide
an industrialized, high-performance analytics database able to run natively in Hadoop on top
of HDFS.
The Top 10 SQL in Hadoop Capabilities

Actian has worked with hundreds of customers to understand their big data requirements,
including SQL access to Hadoop data. Here are the top 10 key capabilities required of a true SQL
in Hadoop solution.
1. Comprehensive: A SQL in Hadoop capability shouldn't be used in isolation.
Organizations need to view SQL in Hadoop as part of a complete analytics solution that
covers the end-to-end process from connecting to raw big data, analyzing data natively
on Hadoop, all the way through delivering analytics results that impact business.
Capabilities should include data integration; data blending and enrichment; discovery
analytics and data science workbench; high-performance analytical processing; and SQL
access to analytical results, all natively on Hadoop.
2. Data Quality: Organizations need to ensure that Hadoop data accessed by business
users has been curated for quality. This is especially true if businesses will use
the Hadoop data for operational decision making. The SQL in Hadoop solution should be
part of an analytics platform that supports data blending, enrichment, and quality
validation.
3. SQL: The solution needs to support standard ANSI SQL to make it easy and flexible to
work with SQL-based applications, as well as standard business intelligence or
visualization tools. This also ensures that any existing SQL queries will be able to access
Hadoop data without the need for modification. In addition, the SQL in Hadoop
capability should support advanced analytics, such as cubing and window functions.
4. Performance Optimized: The solution should be mature technology strongly rooted in
data management with published TPC benchmarks. It should include a proven cost-
based query planner and optimizer that make optimal use of all available resources from
the node, memory, cache, and all the way down to the CPU.
5. Security: The solution should include the native security expected of any enterprise-quality
database management system (DBMS), including authentication, user- and role-based
security, data protection, and encryption.
6. Read/Write: The ability to not only access but also update/append Hadoop data is
important. The SQL in Hadoop solution should be able to both read and write data to
the HDFS nodes. It also should be fully ACID-compliant, with multi-version read
consistency, plus system-wide failover protection.
7. Compression: Hadoop can reduce storage costs but it still requires replicating copies for
high availability. The SQL in Hadoop solution should compress data by at least a factor of
10 and store it in a native columnar format for faster SQL performance.
8. Manageability: The SQL in Hadoop solution should be certified as YARN-ready to ensure
it uses the Hadoop YARN capability to automatically manage resources and nodes on
the cluster. The solution also should include a Web-based systems management console
to monitor analytics and query processing.
9. Architecture: The SQL in Hadoop solution should be able to expand and scale as the size
of the cluster grows. It should be able to handle extremely complex queries, thousands
of nodes and SQL users, and petabytes and more of data.
10. Native: SQL in Hadoop should run natively without the need to move the data off the
HDFS nodes into a separate database or file system. It also should be flexible enough to support
the most commonly used Hadoop distributions, including Apache, Cloudera,
Hortonworks, and MapR.
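
As an illustration of why capability 7 pairs compression with columnar storage, here is a minimal sketch that pivots rows into columns and run-length encodes them. Run-length encoding is chosen only because it is easy to follow; this guide does not say which compression schemes any particular product uses, and the sample rows and function names are assumptions for this example.

```python
from itertools import groupby

# Hypothetical rows: (country, status, amount), already sorted by country.
ROWS = [
    ("US", "shipped", 10),
    ("US", "shipped", 12),
    ("US", "shipped", 7),
    ("US", "pending", 3),
    ("DE", "shipped", 9),
    ("DE", "shipped", 4),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


country, status, amount = to_columns(ROWS)
encoded_country = rle_encode(country)
encoded_status = rle_encode(status)
```

Because a column holds values of one type, repeated values sit next to each other and collapse into a few runs; the six country values above become two pairs. A query that only needs `amount` can also skip the other columns entirely.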
SQL in Hadoop Decision Criteria

Choosing a SQL in Hadoop solution is a critical decision that will have a long-term impact on your
application infrastructure and your ability to scale to support the needs of business users. While
you may not need a solution that meets all of these criteria immediately, it's important to
consider how your requirements may change over time.
1. Comprehensive: Is the solution part of a complete analytics platform with:
   - Connectors to all types of disparate data (DBMS, file systems, social media,
     machine/sensors, SaaS applications)?
   - Parallelized integration and loading of data into Hadoop?
   - A visual data science workbench to build reusable data and analytics workflows?
   - A comprehensive set of hundreds of analytics and transformation functions?
   - The ability to analyze massive amounts of structured and unstructured data without the need
     for sampling?
   - A high-performance data flow engine that is 10x faster than MapReduce for processing
     analytic workflows?
   - The ability to blend and enrich Hadoop data with non-Hadoop data to improve context
     and analytical accuracy?
2. Data Quality: If business users are to use the Hadoop data for operational decision
   making, is the solution part of a complete analytics platform with the ability to perform:
   - Profiling to assess the data and discover inconsistencies?
   - Data cleansing and validation to improve data quality and consistency?
3. SQL: Does the solution support:
   - Standard ANSI SQL, enabling the use of existing SQL without rewrite?
   - Advanced analytics, including cubing, grouping, and window functions?
   - Standard enterprise and desktop BI and visualization tools?
4. Performance Optimized: Does the solution support/use/have:
   - A mature and proven cost-based query planner?
   - Optimal use of all available resources, including node, memory, cache, and CPU?
   - Use cases requiring responses in milliseconds to seconds versus minutes to hours?
   - Fast lookups on small subsets of the data?
   - Fast performance on massive joins?
   - Published performance benchmarks?
5. Security: Does the solution support:
   - Role-based security?
   - Authentication through LDAP or Active Directory?
   - Column-level data encryption?
   - User-level and role-level data access?
6. Read/Write: Does the solution provide:
   - The ability to both read and append/update data in HDFS?
   - Concurrent reads and updates without locking up the database?
   - Full ACID compliance with multi-version read consistency?
7. Compression: Does the solution:
   - Compress the data by at least a factor of 10 to reduce the amount of
     Hadoop storage?
   - Store the data in a columnar format for faster access?
8. Manageability: Does the solution use/provide:
   - YARN for automated Hadoop cluster resource management?
   - Certification as YARN-ready?
   - A Web-based management console for monitoring analytic/query processing?
9. Architecture: Does the solution:
   - Build on a mature, proven data management platform?
   - Scale linearly to handle thousands of users, nodes, and petabytes of data?
   - Provide system-wide failover protection?
10. Native: Does the solution:
   - Run natively on Hadoop (HDFS) without the need to move the data or use a separate
     integration layer?
   - Run natively on the most common Hadoop distributions, such as Apache, Cloudera,
     Hortonworks, or MapR?
11. Vendor: Does the vendor:
   - Have years of deep data management and Hadoop experience?
   - Have solid financing and a stable business?
   - Have offices worldwide?
   - Have global customer support?
   - Have a strong customer base?
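
One way to put these criteria to work is a simple weighted scorecard. The sketch below is only an illustration of the arithmetic: the criterion weights, the 0 to 5 rating scale, the two unnamed solutions, and their ratings are all hypothetical and should be replaced with your own assessment.

```python
# Hypothetical weights (importance to your organization, summing to 1.0).
WEIGHTS = {
    "SQL": 0.30,
    "Performance Optimized": 0.30,
    "Security": 0.20,
    "Native": 0.20,
}

# Hypothetical ratings from 0 (criterion not met) to 5 (fully met).
RATINGS = {
    "Solution A": {"SQL": 5, "Performance Optimized": 4, "Security": 3, "Native": 5},
    "Solution B": {"SQL": 3, "Performance Optimized": 5, "Security": 4, "Native": 2},
}


def weighted_score(ratings, weights):
    """Return the weighted sum of a solution's ratings."""
    return sum(weights[criterion] * ratings[criterion] for criterion in weights)


def rank_solutions(all_ratings, weights):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scores = {name: weighted_score(r, weights) for name, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_solutions(RATINGS, WEIGHTS)
```

Only four of the eleven criteria are shown to keep the sketch short. Adding the rest is a matter of extending both dictionaries and rebalancing the weights as your requirements change over time.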

Actian: The High-performance SQL in Hadoop Database

Actian Vector in Hadoop uniquely offers the first true SQL in Hadoop solution. Actian Vector in
Hadoop, a key capability of the Actian Analytics Platform - Hadoop SQL Edition, represents the
next generation of innovation at Actian and employs a unique approach for bringing all of the
benefits of industrialized, high-performance SQL together with Hadoop.

Figure 1: Actian Analytics Platform - Hadoop SQL Edition Capabilities


Actian Vector in Hadoop contains a mature RDBMS engine that performs native SQL processing
of data in HDFS. It has rich SQL language support, an advanced query optimizer, support for
trickle updates, and certification for use with the most popular BI tools. It is built from
mature vector database technology that has been hardened in the enterprise and includes
support for localization and internationalization, advanced security features, and workload
management. Plus, it has been benchmarked to perform more than 30 times faster than other
approaches to SQL on Hadoop.
Figure 2: Actian Analytics Platform offers a fully functional analytics database in Hadoop

To understand what's special about Actian Vector in Hadoop, here are key performance features
that make it unique:
1. CPU Exploitation: Actian Vector was written from the ground up to take advantage
of performance features in modern CPUs, resulting in dramatically higher data
processing rates compared to other relational databases. Actian Vector in Hadoop
leverages these innovations and brings this unbridled processing power to the data
nodes in a Hadoop cluster.
2. Single Instruction, Multiple Data (SIMD): SIMD enables a single operation to be applied
to every entity in a set of data all at once. Actian Vector takes advantage of SIMD
instructions by processing vectors of data through the Streaming SIMD Extensions
instruction set. Because typical data analysis queries process large volumes of data,
SIMD processing results in the average computation against a single data value taking less than a
single CPU cycle.
3. Parallel Execution: Actian implements a flexible, adaptive parallel execution algorithm
and can be scaled up or scaled out to meet specific workload requirements. Actian
Vector can execute statements in parallel using any number of CPU cores on a server or
across any number of data nodes on a Hadoop cluster. Taking the raw power of the
Vector data processing engine to the HDFS data is what gives Actian Vector in Hadoop
its unique performance characteristics.
4. Updates via Positional Delta Trees (PDTs): Actian Vector in Hadoop implements a fully
ACID-compliant transactional database with multi-version read consistency. One of the
biggest challenges with HDFS is that it is not designed for incremental updates. Actian
addresses this challenge with high-performance, in-memory Positional Delta Trees
(PDTs), which are used to store small incremental changes (inserts that are not
appends), as well as updates and deletes.
5. Out-of-the-box YARN Support: YARN provides resource negotiation and management
for the entire Hadoop cluster and all applications running on it. Actian Vector is the first
SQL in Hadoop capability certified as YARN-ready. This means Actian query workloads
can run as first-class citizens on Hadoop clusters, sharing resources side-by-side with
MapReduce-based applications.
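
The idea behind item 4 can be sketched as follows: data already written stays immutable, small changes are held in memory, and the two are merged whenever the data is read. This is a deliberately simplified illustration and not Actian's implementation. It uses flat dictionaries rather than a tree, it ignores transactions and multi-version consistency, and the class and method names are hypothetical.

```python
class DeltaOverlay:
    """Immutable base column plus in-memory changes merged at read time."""

    def __init__(self, base):
        self.base = tuple(base)   # stands in for data already written to HDFS
        self.updates = {}         # base position -> replacement value
        self.deletes = set()      # base positions to skip
        self.inserts = {}         # base position -> values inserted before it

    def update(self, position, value):
        self.updates[position] = value

    def delete(self, position):
        self.deletes.add(position)

    def insert_before(self, position, value):
        self.inserts.setdefault(position, []).append(value)

    def scan(self):
        """Yield the current values without rewriting the base data."""
        for position, value in enumerate(self.base):
            yield from self.inserts.get(position, [])
            if position in self.deletes:
                continue
            yield self.updates.get(position, value)
        # Values inserted at position len(base) are appended at the end.
        yield from self.inserts.get(len(self.base), [])


column = DeltaOverlay([10, 20, 30, 40])
column.update(1, 25)          # change 20 to 25
column.delete(2)              # remove 30
column.insert_before(3, 35)   # insert 35 ahead of 40
current = list(column.scan())
```

After the three changes, a scan sees the updated data while `column.base` is untouched, which is why this style of update suits a file system that is not designed for in-place modification.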
Summary
If you need to analyze large volumes of data on Hadoop and deliver scalable, enterprise-grade
SQL access to Hadoop data to your business, you should strongly consider Actian for its
industrialized, high-performance SQL in Hadoop solution. Actian's SQL in Hadoop is the
foundation for revolutionary performance gains in database processing, gains that are so
game-changing that they appear on our competitors' long-term roadmaps. Implement an
easy-to-deploy, easy-to-use, ANSI-compliant solution and benefit from significantly better query
performance than any other SQL on Hadoop solution.
Learn more
Learn more about the Actian Analytics Platform - Hadoop SQL Edition, which includes Actian
Vector in Hadoop, our high-performance SQL in Hadoop capability: actian.com/Hadoop
Try it for free!

Experience the difference of having a true analytics database for high-performance SQL access
to Hadoop data. Download and try out Actian's SQL in Hadoop capability for 30 days for free:
bigdata.actian.com/sql-in-hadoop


About Actian: Accelerating Big Data 2.0
Actian transforms big data into business value for any organization, not just the privileged few.
sources of revenue, business opportunities, and ways of mitigating risk with high-performance,
in-database analytics complemented with extensive connectivity and data preparation.
Actian makes Hadoop enterprise-grade by providing high-performance data blending and
enrichment, visual design, and SQL analytics on Hadoop without the need for MapReduce skills.
Among the tens of thousands of organizations using Actian are innovators using analytics for a
competitive advantage in industries such as financial services, telecommunications, digital
media, healthcare, and retail. The company is headquartered in Silicon Valley and has offices
worldwide. www.actian.com

www.actian.com
500 Arguello Street, Ste. 200, Redwood City, CA 94063
+1.888.446.4737 [Toll Free] | +1.650.587.5500 [Tel]
© 2014 Actian Corporation. Actian, Big Data for the Rest of Us, Accelerating Big Data 2.0, and Actian
Analytics Platform are trademarks of Actian and its subsidiaries. All other trademarks, trade names,
service marks, and logos referenced herein belong to their respective companies. (WP05-0814)
