Impala: Goals
should work both for analytical and transactional workloads
will support queries that take from milliseconds to hours
High performance:
C++ instead of Java
runtime code generation
completely new execution engine that doesn't build on MapReduce
User View of Impala: Overview
Runs as a distributed service in the cluster:
one Impala daemon on each node with data
SQL support:
patterned after Hive's version of SQL
limited to Select, Project, Join, Union, Subqueries, Aggregation and Insert
only equi-joins; no non-equi joins, no cross products
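The supported subset can be illustrated with an ordinary equi-join plus aggregation. The sketch below runs such a query against SQLite purely to show its shape; the table names and data are invented for illustration.

```python
import sqlite3

# Invented toy tables; SQLite stands in for Impala here only to show
# the query shape: Select, equi-join, Aggregation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales(region TEXT, amount INT);
    INSERT INTO sales VALUES ('east', 10), ('east', 5), ('west', 7);
    CREATE TABLE regions(region TEXT, manager TEXT);
    INSERT INTO regions VALUES ('east', 'ann'), ('west', 'bo');
""")
rows = conn.execute("""
    SELECT r.manager, SUM(s.amount) AS total
    FROM sales s JOIN regions r ON s.region = r.region  -- equi-join only
    GROUP BY r.manager
    ORDER BY r.manager
""").fetchall()
```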
Functional limitations:
no custom UDFs, file formats, SerDes
no beyond-SQL extensions (buckets, samples, transforms, arrays, structs, maps, xpath, json)
only hash joins; the joined table has to fit in memory
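The hash-join restriction above can be sketched in a few lines: the joined (right-hand) table is hashed entirely in memory, then the left table is streamed past it. Function and table names here are invented for illustration.

```python
# Minimal sketch of an in-memory hash equi-join: the build side must
# fit in memory, which is exactly the limitation described above.
def hash_join(left, right, key):
    # build phase: hash the joined table entirely in memory
    buckets = {}
    for row in right:
        buckets.setdefault(row[key], []).append(row)
    # probe phase: stream the left table and emit matching rows
    return [{**l, **r} for l in left for r in buckets.get(l[key], [])]
```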
HBase support:
uses Hive's mapping of an HBase table into a metastore table
predicates on rowkey columns are mapped into start/stop rows
predicates on other columns are mapped into SingleColumnValueFilters
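The rowkey mapping above can be sketched as follows; the predicate representation and function name are hypothetical, invented only to illustrate how range predicates become a scan's start/stop rows.

```python
# Hypothetical sketch: turn predicates on the rowkey column into an
# HBase scan range (start row, stop row), as described above.
def rowkey_scan_range(predicates):
    """predicates: list of (op, value) pairs on the rowkey column."""
    start, stop = None, None
    for op, value in predicates:
        if op in (">=", "="):
            start = value if start is None else max(start, value)
        if op in ("<=", "="):
            stop = value if stop is None else min(stop, value)
    return start, stop
```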
HBase functional limitations:
no nested-loop joins
all data stored as text
Impala Architecture
Two binaries: impalad and statestored
Impala daemon (impalad)
handles client requests and all internal requests related to query execution
accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC
parallelizes the queries and distributes work to other nodes in the Impala cluster; transmits intermediate query results back to the central coordinator node
exports Thrift services for these two roles
You can submit a query to the Impala daemon running on any node, and that node serves as the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query. When running experiments through the impala-shell command, you might always connect to the same Impala daemon for convenience.
The Impala daemons are in constant communication with the statestore, to confirm
which nodes are healthy and can accept new work.
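The coordinator role described above is a scatter/gather pattern; here is a toy sketch of it, with all names invented and a partial sum standing in for a real query fragment.

```python
# Toy scatter/gather sketch: the contacted node fans work out across
# the nodes, each computes a partial result, and the coordinator
# merges the partials into the final result.
def coordinate(rows, nodes):
    # scatter: split input rows across the worker nodes
    fragments = [rows[i::len(nodes)] for i in range(len(nodes))]
    # each worker computes a partial aggregate (a partial sum here)
    partials = [sum(frag) for frag in fragments]
    # gather: the coordinator combines partials into the final answer
    return sum(partials)
```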
Impala Statestore
checks on the health of Impala daemons on all the nodes in a cluster
If an Impala node goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other nodes so that future queries can avoid making requests to the unreachable node.
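The statestore's membership tracking can be sketched as a tiny toy class; the class and method names below are invented for illustration.

```python
# Toy sketch of the statestore's membership view: a node that misses
# heartbeats is marked offline so future queries avoid scheduling
# work on it.
class Statestore:
    def __init__(self, nodes):
        self.alive = set(nodes)

    def miss_heartbeat(self, node):
        # declare the node unreachable; stop offering it new work
        self.alive.discard(node)

    def schedulable(self):
        return sorted(self.alive)
```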
Impala Catalog Service
relays the metadata changes from Impala SQL statements to all the nodes in a cluster
It is physically represented by a daemon process named catalogd; you only need one such process on one node in the cluster.
Query execution:
request arrives via ODBC/Beeswax Thrift API
planner turns request into collections of plan fragments
coordinator initiates execution on remote impalad's
during execution, intermediate results are streamed between impalad's
Impala Architecture: Query Execution
Impala Architecture
Metadata handling:
utilizes Hive's metastore
caches metadata: no synchronous metastore API calls during query execution
beta: impalad's read metadata from the metastore at startup
GA: metadata distribution through the statestore
post-GA: HCatalog and AccessServer
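The caching point above can be made concrete with a toy sketch: metadata is fetched from the metastore once, then served from memory, so planning makes no synchronous metastore calls. Class and field names are invented.

```python
# Toy metadata cache: only the first lookup of a table hits the
# (here, dict-based stand-in for the) Hive metastore.
class MetadataCache:
    def __init__(self, metastore):
        self.metastore = metastore  # stand-in for the Hive metastore
        self.cache = {}
        self.fetches = 0

    def table(self, name):
        if name not in self.cache:
            self.fetches += 1  # synchronous fetch happens once per table
            self.cache[name] = self.metastore[name]
        return self.cache[name]
```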
Comparing Impala to Hive
Hive:
High latency, low throughput queries
Fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
Java runtime allows for easy late-binding of functionality: file formats and UDFs
Extensive layering imposes high runtime overhead
Impala:
direct, process-to-process data exchange
no fault tolerance
an execution engine designed for low runtime overhead
Comparing Impala to Hive
Impala's performance advantage over Hive: no hard numbers, but
Impala can get full disk throughput (~100 MB/sec/disk); I/O-bound workloads often faster by 3-4x
queries that require multiple map-reduce phases in Hive see a higher speedup
queries that run against in-memory data see a higher speedup (observed up to 100x)
Why Oozie
Installing Oozie
Running an Example
EXAMPLE
Word Count Example
A Workflow Application
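A minimal workflow application for the word-count example might look like the sketch below; the app name, class names, and paths are placeholders, not the deck's actual example.

```xml
<!-- Sketch of a workflow.xml for a word-count job; all names and
     paths are placeholders. -->
<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-wf">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.myorg.WordCount.Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.myorg.WordCount.Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Word count failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```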
Workflow Submission
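Submission is driven by a small properties file; the sketch below uses placeholder hosts and paths.

```
# Sketch of a job.properties file; hosts and paths are placeholders.
nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/${user.name}/wordcount
```

The packaged workflow is then submitted with the Oozie CLI, e.g. `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run` (host and port are placeholders).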
Workflow state transitions
Oozie Job Processing
Oozie-Hadoop Security
Oozie is a multi-tenant system
• Jobs can be scheduled to run later
• Oozie submits/maintains the Hadoop jobs
• Hadoop needs a security token for each request
User-Oozie Security
Why Oozie Security?
One user should not modify another user's job
Hadoop doesn't authenticate the end user
Oozie has to verify its user before passing the job to Hadoop
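The trust chain described above can be sketched as a toy: Oozie authenticates the end user itself, then hands the job to Hadoop on that user's behalf. All names below are invented for illustration.

```python
# Toy sketch: Oozie verifies the user first, then submits the job to
# Hadoop tagged with that user's identity (a proxy-user pattern).
class Oozie:
    def __init__(self, known_users):
        self.known_users = set(known_users)

    def submit(self, user, job):
        if user not in self.known_users:      # Oozie verifies the user...
            raise PermissionError("unknown user")
        return ("hadoop-job", user, job)      # ...then hands off to Hadoop
```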
Job Submission to Hadoop
Oozie is designed to handle thousands of jobs at the same time
Question: Should the Oozie server
– submit the Hadoop job directly?
– wait for it to finish?
Answer: No
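Why not? Blocking on each job would tie the server up. A toy sketch of the alternative: queue submissions, hand them off without waiting, and record completion via callbacks. All names are invented for illustration.

```python
from collections import deque

# Toy sketch of asynchronous job handling: the server returns to the
# client immediately, dispatches without blocking, and learns about
# completion from a callback, so thousands of jobs can be in flight.
class JobQueue:
    def __init__(self):
        self.pending = deque()
        self.running = {}

    def enqueue(self, job_id):
        self.pending.append(job_id)       # return to the client at once

    def dispatch(self):
        while self.pending:
            job = self.pending.popleft()  # hand off to Hadoop, don't wait
            self.running[job] = "RUNNING"

    def on_callback(self, job_id, status):
        self.running[job_id] = status     # Hadoop reports completion
```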
Time Line of an Oozie Job
Coordinator
Oozie (Bundle)
Layers of abstraction
Architecture
Key Features and Design Decisions
Oozie Security, Multi Tenancy and Scalability
Use Case 1: Time Triggers
Use Case 2: Data and Time Triggers
Use Case 3: Rolling Window