
Impala Overview: Goals

 General-purpose SQL query engine:
 – should work for both analytical and transactional workloads
 – will support queries that take from milliseconds to hours
 Runs directly within Hadoop:
 – reads widely used Hadoop file formats
 – talks to widely used Hadoop storage managers
 – runs on the same nodes that run Hadoop processes
 High performance:
 – C++ instead of Java
 – runtime code generation
 – completely new execution engine that doesn't build on MapReduce

User View of Impala: Overview
 Runs as a distributed service in the cluster:
 – one Impala daemon on each node with data
 The Impala Statestore
 The Impala Catalog Service
 User submits a query via the ODBC/Beeswax Thrift API to any of the daemons (JDBC sketch below)
 Query is distributed to all nodes with relevant data
 If any node fails, the query fails
 Impala uses Hive's metadata interface and connects to Hive's metastore
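
A minimal sketch of submitting a query to an Impala daemon over JDBC; whichever daemon you connect to becomes the coordinator for that query. The host name, port (21050, Impala's usual HiveServer2-compatible port), and table name are illustrative assumptions, and the Hive JDBC driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaQuery {
        public static void main(String[] args) throws Exception {
            // Any Impala daemon can coordinate the query; host/port are assumptions.
            String url = "jdbc:hive2://impalad-host:21050/;auth=noSasl";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1)); // single count row
                }
            }
        }
    }
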
User View of Impala: SQL
 SQL support:
 – patterned after Hive's version of SQL
 – limited to Select, Project, Join, Union, Subqueries, Aggregation, and Insert
 – only equi-joins; no non-equi joins, no cross products (example below)
 Functional limitations:
 – no custom UDFs, file formats, or SerDes
 – no beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
 – only hash joins; the joined table has to fit in memory:
   beta: in the memory of a single node
   GA: in the aggregate memory of all (executing) nodes
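
A query that stays inside this subset — an equi-join with aggregation — run through a JDBC Statement such as the one opened in the sketch above; table and column names are invented for illustration:

    public class RegionCounts {
        // An equi-join with aggregation: inside Impala's supported subset.
        // A non-equi join (e.g. ON o.ts < c.ts) or a cross product would be rejected.
        static void run(java.sql.Statement stmt) throws java.sql.SQLException {
            String sql = "SELECT c.region, COUNT(*) AS orders "
                       + "FROM orders o JOIN customers c ON o.cust_id = c.id "
                       + "GROUP BY c.region";
            try (java.sql.ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + "\t" + rs.getLong("orders"));
                }
            }
        }
    }
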
User View of Impala: Apache HBase
 HBase functionality:
 – uses Hive's mapping of an HBase table into a metastore table
 – predicates on rowkey columns are mapped into start/stop rows
 – predicates on other columns are mapped into SingleColumnValueFilters (sketch below)
 HBase functional limitations:
 – no nested-loop joins
 – all data stored as text
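
A minimal sketch of the two predicate mappings in the HBase client API of that era; the column family, qualifier, and values are assumptions for illustration:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PredicateMapping {
        static Scan scanFor() {
            // A rowkey predicate, e.g. WHERE key >= 'a' AND key < 'b',
            // becomes a start/stop row on the scan:
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("a"));
            scan.setStopRow(Bytes.toBytes("b"));
            // A predicate on a non-key column, e.g. WHERE info:status = 'open',
            // becomes a filter evaluated inside the region servers:
            scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("info"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("open")));
            return scan;
        }
    }
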
Impala Architecture

 Two binaries: impalad and statestored
 Impala daemon (impalad):
 – handles client requests and all internal requests related to query execution
 – accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC
 – parallelizes the queries, distributes work to other nodes in the Impala cluster, and transmits intermediate query results back to the central coordinator node
 – exports Thrift services for these two roles
 You can submit a query to the Impala daemon running on any node, and that node serves as the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for the query. When experimenting with functionality through the impala-shell command, you might always connect to the same Impala daemon for convenience.
 The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.

Impala Statestore

 State store daemon (statestored):
 – checks on the health of Impala daemons on all the nodes in a cluster
 – physically represented by a daemon process named statestored; you only need such a process on one node in the cluster
 If an Impala node goes offline due to hardware failure, network error, software issue, or any other reason, the statestore informs all the other nodes so that future queries can avoid making requests to the unreachable node.
 Because the statestore's purpose is to help when things go wrong, it is not critical to the normal operation of an Impala cluster. If the statestore is not running or becomes unreachable, the other nodes continue running and distributing work among themselves as usual; the cluster just becomes less robust if other nodes fail while the statestore is offline. When the statestore comes back online, it re-establishes communication with the other nodes and resumes its monitoring function.

Impala Catalog Service

 The Impala component known as the catalog service:
 – relays metadata changes from Impala SQL statements to all the nodes in a cluster (refresh sketch below)
 – physically represented by a daemon process named catalogd; you only need such a process on one node in the cluster
 Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same node.
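
Metadata changed outside Impala (for example through Hive) is not relayed automatically; the cache can be refreshed by hand with Impala's REFRESH and INVALIDATE METADATA statements. A minimal sketch, reusing a JDBC Statement as in the earlier sketches; the table name is invented:

    public class RefreshMetadata {
        // Hedged sketch: refresh Impala's cached metadata after out-of-band changes.
        static void run(java.sql.Statement stmt) throws java.sql.SQLException {
            // Reload the file/block metadata of a table whose data files changed:
            stmt.execute("REFRESH sales.orders");
            // Discard and reload cached metadata for a table created or altered
            // outside Impala (e.g. through Hive):
            stmt.execute("INVALIDATE METADATA sales.orders");
        }
    }
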
Impala Architecture
 Query execution phases:
 – request arrives via the ODBC/Beeswax Thrift API
 – planner turns the request into collections of plan fragments (EXPLAIN sketch below)
 – coordinator initiates execution on remote impalad's
 During execution:
 – intermediate results are streamed between executors
 – query results are streamed back to the client
 – subject to limitations imposed by blocking operators (top-n, aggregation)
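
The plan fragments can be inspected with Impala's EXPLAIN statement; a minimal sketch, again assuming a JDBC Statement like the one above and the invented table names from the earlier example:

    public class ShowPlan {
        // Hedged sketch: print the distributed plan (one plan line per result row).
        static void run(java.sql.Statement stmt) throws java.sql.SQLException {
            String sql = "EXPLAIN SELECT c.region, COUNT(*) "
                       + "FROM orders o JOIN customers c ON o.cust_id = c.id "
                       + "GROUP BY c.region";
            try (java.sql.ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
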
Impala Architecture: Query Execution

[Three diagram slides illustrating query execution; the figures are not preserved in this extraction.]

Impala Architecture
 Metadata handling:
 – utilizes Hive's metastore
 – caches metadata: no synchronous metastore API calls during query execution
 – beta: impalads read metadata from the metastore at startup
 – GA: metadata distribution through the statestore
 – post-GA: HCatalog and AccessServer

Comparing Impala to Hive

 Hive: MapReduce as an execution engine
 – high-latency, low-throughput queries
 – fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
 – Java runtime allows for easy late binding of functionality: file formats and UDFs
 – extensive layering imposes high runtime overhead
 Impala:
 – direct, process-to-process data exchange
 – no fault tolerance
 – an execution engine designed for low runtime overhead

Comparing Impala to Hive
 Impala's performance advantage over Hive: no hard numbers, but
 – Impala can get full disk throughput (~100 MB/sec/disk); I/O-bound workloads are often faster by 3-4x
 – queries that require multiple MapReduce phases in Hive see a higher speedup
 – queries that run against in-memory data see a higher speedup (observed up to 100x)

[The following Oozie slides are diagrams and screenshots; only their titles are preserved:]

Why Oozie

Installing Oozie

Running an Example

EXAMPLE

Word Count Example

A Workflow Application

Workflow Submission

Workflow state transitions

Oozie Job Processing

Oozie-Hadoop Security
 Oozie is a multi-tenant system
 – a job can be scheduled to run later
 – Oozie submits/maintains the Hadoop jobs
 – Hadoop needs a security token for each request
 Question: who should provide the security token to Hadoop, and how?
 Answer: Oozie. How?
 – Hadoop considers Oozie a super-user
 – Hadoop does not check the end-user credential
 – Hadoop only checks the credential of the Oozie process
 – BUT the Hadoop job is executed as the end user: Oozie utilizes the doAs() functionality of Hadoop (sketch below)
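
A minimal sketch of the doAs() mechanism the slide refers to: the service authenticates as itself, then impersonates the end user through a proxy UGI, so the work runs under the end user's identity. The user name "alice" and the action body are illustrative assumptions, and Hadoop must be configured to trust the service as a proxy user (hadoop.proxyuser.* settings):

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxySubmit {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The service (Oozie) is the authenticated "real" user;
            // "alice" stands in for the end user who owns the job.
            UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                "alice", UserGroupInformation.getLoginUser());
            proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
                // Everything in here runs with alice's identity, e.g.:
                FileSystem fs = FileSystem.get(conf);
                fs.mkdirs(new Path("/user/alice/oozie-temp"));
                return null;
            });
        }
    }
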


User-Oozie Security

Why Oozie Security?
 One user should not be able to modify another user's job
 Hadoop doesn't authenticate the end user
 Oozie has to verify its user before passing the job to Hadoop

Job Submission to Hadoop
 Oozie is designed to handle thousands of jobs at the same time
 Question: should the Oozie server
 – submit the Hadoop job directly?
 – wait for it to finish?
 Answer: no. Reasons:
 – Resource constraints: a single Oozie process can't simultaneously maintain thousands of threads, one for each Hadoop job (scaling limitation)
 – Isolation: running user code on the Oozie server might destabilize Oozie
 Design decision:
 – create a launcher Hadoop job
 – execute the actual user job from the launcher
 – wait asynchronously for the job to finish (client-side sketch below)
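
The same submit-then-poll pattern appears on the client side of Oozie's Java API: run() returns a job id immediately, and the caller polls for status instead of blocking. A minimal sketch, with the server URL, HDFS application path, and user name as illustrative assumptions:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://nn:8020/user/alice/wordcount-wf");
            conf.setProperty("user.name", "alice");

            String jobId = client.run(conf);  // returns immediately with a job id
            WorkflowJob.Status s = client.getJobInfo(jobId).getStatus();
            while (s == WorkflowJob.Status.PREP || s == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10_000);         // poll instead of holding a thread
                s = client.getJobInfo(jobId).getStatus();
            }
            System.out.println("Final status: " + s);
        }
    }
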
[The remaining slides are diagrams; only their titles are preserved:]

Oozie Security, Multi Tenancy and Scalability

Time Line of an Oozie Job

Coordinator

Oozie (Bundle)

Layers of abstraction

Architecture

Key Features and Design Decisions

Use Case 1: Time Triggers

Use Case 2: Data and Time Triggers

Use Case 3: Rolling Window
