SQL in Hadoop Buyer's Guide

Contents

Introduction: Big Data and Hadoop
SQL on Hadoop Benefits
Approaches to SQL on Hadoop
The Top 10 SQL in Hadoop Capabilities
SQL in Hadoop Decision Criteria
Actian: The High-performance SQL in Hadoop Database
Summary
Learn more
Try it for free!

However, while Hadoop has helped address the big data challenge of how to inexpensively store large amounts of data of all shapes and sizes, it has fallen short as an analytics platform that delivers on the promise of big data. This is partly because of the limitations of the Hadoop technology:

Not a database: Hadoop consists of multiple components but fundamentally is a file system, the Hadoop Distributed File System (HDFS), with a specialized programming framework (MapReduce) to build programs that can process the data stored in HDFS. It is not easy to find individual records or record sets. And Hadoop lacks most of the capabilities of a typical database to organize, access, and manage data.

Batch vs. low latency: Hadoop is designed for large batch processing where it can take hours or even days to analyze a single data set. Because Hadoop is a file system, it requires scanning all files just to find a single record. Batch processing also makes it difficult for data scientists to explore data and build analytic models. Moreover, Hadoop cannot support the interactive, ad hoc queries required by most business analysts and applications.

Programming required: Hadoop requires the use of complex, specialized MapReduce programs to process data. MapReduce programmers are in short supply and rarely understand the business objectives for how the Hadoop data is to be used. In addition, writing, testing, and running a MapReduce job to prepare and analyze data can often take weeks. This can become a huge bottleneck, as every data request from a data scientist or business analyst must go through a Java programmer.

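As a rough illustration of the programming gap described above, the sketch below computes the same per-region total twice: once with hand-written map, shuffle, and reduce steps, and once with a single SQL statement. It is plain Python using the standard-library sqlite3 module, not Hadoop or any vendor's engine. The sales data, the table and column names, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale_amount) records.
SALES = [("east", 100), ("west", 250), ("east", 50), ("north", 75), ("west", 25)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce_style(records):
    """Per-region totals written as explicit map, shuffle, and reduce steps."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    """The same aggregation expressed as one declarative SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_mapreduce_style(SALES))
    print(totals_sql(SALES))
```

Both functions return the same totals. The difference is who writes the grouping logic: in the first version the programmer does, and in the second the SQL engine does, which is why SQL access lets analysts ask such questions without going through a programmer.
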
Thus, the need for rare and expensive skillsets, long and error-prone implementation cycles, lack of support for popular reporting and BI tools, and inadequate execution speed have led to a search for an alternative that combines all of the benefits of Hadoop with SQL, the world's most widely used data querying language.

This guide is designed to help organizations understand what capabilities are most important when making a SQL on Hadoop solution purchase decision.

Hadoop Connector: With this approach, the organization must deploy both a Hadoop cluster and a DBMS cluster, on the same or separate hardware, and use a connector to pass data back and forth between the two systems. The approach is expensive and hard to manage and, most often, is adopted by traditional data warehouse vendors. Solutions that fall into this category include HP Vertica, Teradata, and Oracle.

SQL and Hadoop: In this approach, vendors have taken an existing SQL engine and modified it so that, when generating a query execution plan, it can determine which parts of the query should be executed via MapReduce and which parts should be executed via SQL operators. Data that is processed via SQL operators is copied from HDFS into a local table structure. This again requires management of data in both HDFS and local tables. Solutions that fall into this category include Hadapt, RainStor, Citus Data, and Splice Machine.

SQL on Hadoop: These vendors are building SQL engines from the ground up that enable native SQL processing of data in HDFS while avoiding the use of MapReduce. These products have limited SQL language support, rudimentary query optimizers that can require handcrafting queries to specify the join order, and no support for trickle updates. Product immaturity is reflected in the lack of support for internationalization, limited security features, and the lack of workload management. Solutions that fall into this category include Impala, Drill, Stinger, and HAWQ.

Unfortunately, all of these categories fall short of what is needed to deliver true SQL access to Hadoop data. Thus, a new category called SQL in Hadoop is needed for solutions that provide an industrialized, high-performance analytics database able to run natively in Hadoop on top of HDFS.

Figure 1: Actian Analytics Platform - Hadoop SQL Edition Capabilities

Actian Vector in Hadoop contains a mature RDBMS engine that performs native SQL processing of data in HDFS. It offers rich SQL language support, an advanced query optimizer, and support for trickle updates, and it has been certified for use with the most popular BI tools. It is built from mature vector database technology that has been hardened in the enterprise and includes support for localization and internationalization, advanced security features, and workload management. In addition, it has been benchmarked to perform more than 30 times faster than other approaches to SQL on Hadoop.

Figure 2: Actian Analytics Platform offers a fully functional analytics database in Hadoop

To understand what's special about Actian Vector in Hadoop, here are key performance features that make it unique:

1. CPU Exploitation: Actian Vector was written from the ground up to take advantage of performance features in modern CPUs, resulting in dramatically higher data processing rates compared to other relational databases. Actian Vector in Hadoop leverages these innovations and brings this unbridled processing power to the data nodes in a Hadoop cluster.

2. Single Instruction, Multiple Data (SIMD): SIMD enables a single operation to be applied to every element in a set of data at once. Actian Vector takes advantage of SIMD instructions by processing vectors of data through the Streaming SIMD Extensions instruction set. Because typical data analysis queries process large volumes of data, the use of SIMD results in the average computation against a single data value taking less than a single CPU cycle.

3. Parallel Execution: Actian implements a flexible, adaptive parallel execution algorithm and can be scaled up or scaled out to meet specific workload requirements. Actian Vector can execute statements in parallel using any number of CPU cores on a server or across any number of data nodes on a Hadoop cluster. Taking the raw power of the Vector data processing engine to the HDFS data is what gives Actian Vector in Hadoop its unique performance characteristics.

4. Updates via Positional Delta Trees (PDTs): Actian Vector in Hadoop implements a fully ACID-compliant transactional database with multi-version read consistency. One of the biggest challenges with HDFS is that it is not designed for incremental updates. Actian addresses this challenge with high-performance, in-memory Positional Delta Trees (PDTs), which are used to store small incremental changes (inserts that are not appends), as well as updates and deletes.

5. Out-of-the-box YARN Support: YARN provides resource negotiation and management for the entire Hadoop cluster and all applications running on it. Actian Vector is the first SQL in Hadoop capability certified as YARN-ready. This means Actian query workloads can run as first-class citizens on Hadoop clusters, sharing resources side-by-side with MapReduce-based applications.

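The idea behind item 4, keeping small changes in memory instead of rewriting data that the underlying file system treats as append-only, can be sketched in a few lines. The example below is a deliberately simplified illustration and not Actian's implementation: it uses plain dictionaries and a set rather than a tree, ignores transactions and versioning, and all class and method names are hypothetical. It shows only the core mechanism of changes keyed by position in the stored data and merged in at read time.

```python
class DeltaOverlayTable:
    """Immutable base rows plus an in-memory overlay of pending changes.

    A simplified stand-in for a positional delta structure: every change
    is keyed by a position in the base data and applied during a scan.
    """

    def __init__(self, base_rows):
        self._base = tuple(base_rows)  # stands in for read-only blocks on disk
        self._updates = {}             # base position -> replacement row
        self._deletes = set()          # base positions hidden from readers
        self._inserts = {}             # base position -> rows placed before it

    def _check(self, pos):
        if not 0 <= pos < len(self._base):
            raise IndexError(pos)

    def update(self, pos, row):
        """Record a replacement for the base row at pos."""
        self._check(pos)
        self._updates[pos] = row

    def delete(self, pos):
        """Hide the base row at pos from subsequent scans."""
        self._check(pos)
        self._deletes.add(pos)

    def insert_before(self, pos, row):
        """Record a new row ahead of base position pos.

        pos == len(base) places the row after the last base row.
        """
        if not 0 <= pos <= len(self._base):
            raise IndexError(pos)
        self._inserts.setdefault(pos, []).append(row)

    def scan(self):
        """Yield the current rows by merging the base data with the overlay."""
        for pos, row in enumerate(self._base):
            yield from self._inserts.get(pos, [])
            if pos in self._deletes:
                continue
            yield self._updates.get(pos, row)
        yield from self._inserts.get(len(self._base), [])


if __name__ == "__main__":
    table = DeltaOverlayTable(["a", "b", "c"])
    table.update(1, "B")
    table.delete(2)
    table.insert_before(1, "x")
    print(list(table.scan()))
```

Readers see the merged view while the base rows are never modified. A real system would also need to fold the accumulated changes back into storage periodically, which this sketch leaves out.
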
Summary
If you need to analyze large volumes of data on Hadoop and deliver scalable, enterprise-grade SQL access to Hadoop data to your business, you should strongly consider Actian for its industrialized, high-performance SQL in Hadoop solution. Actian's SQL in Hadoop is the foundation for revolutionary performance gains in database processing, gains that are so game-changing that they appear on our competitors' long-term roadmaps. Implement an easy-to-deploy, easy-to-use, ANSI-compliant solution and benefit from significantly better query performance than any other SQL on Hadoop solution.

Learn more
Learn more about the Actian Analytics Platform - Hadoop SQL Edition, which includes Actian Vector in Hadoop, our high-performance SQL in Hadoop capability: actian.com/Hadoop

About Actian: Accelerating Big Data 2.0

Actian transforms big data into business value for any organization, not just the privileged few. [...] sources of revenue, business opportunities, and ways of mitigating risk with high-performance, in-database analytics complemented with extensive connectivity and data preparation. Actian makes Hadoop enterprise-grade by providing high-performance data blending and enrichment, visual design, and SQL analytics on Hadoop without the need for MapReduce skills. Among the tens of thousands of organizations using Actian are innovators using analytics for a competitive advantage in industries such as financial services, telecommunications, digital media, healthcare, and retail. The company is headquartered in Silicon Valley and has offices worldwide.

www.actian.com
500 Arguello Street, Ste. 200, Redwood City, CA 94063
+1.888.446.4737 [Toll Free] | +1.650.587.5500 [Tel]

© 2014 Actian Corporation. Actian, Big Data for the Rest of Us, Accelerating Big Data 2.0, and Actian Analytics Platform are trademarks of Actian and its subsidiaries. All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies. (WP05-0814)