
Querying Oracle Table from Hive
and Querying HDFS from PL/SQL
CON6359

Kuassi Mensah
Director, Product Management
Oracle Server Technologies

Nicholas Van Wyen
CTO
MTI

September 19, 2016

Copyright 2016, Oracle and/or its affiliates. All rights reserved.


Safe Harbor Statement
The following is intended to outline our general product direction. It is
intended for information purposes only, and may not be incorporated
into any contract. It is not a commitment to deliver any material, code,
or functionality, and should not be relied upon in making purchasing
decisions. The development, release, and timing of any features or
functionality described for Oracle's products remains at the sole
discretion of Oracle.



Program Agenda

1. Querying Oracle Table from Hadoop/Hive
2. Querying Hadoop/HDFS from PL/SQL



Querying Oracle Table from Hive:
Oracle Datasource for Hadoop

Kuassi Mensah
Director, Product Management
Oracle Server Technologies
September 19, 2016



Agenda

1. Big Data Analytics & Requirements
2. Oracle Datasource for Hadoop (OD4H)
3. Summary and Demo



Big Data Analytics
Goal: furnish actionable information to support business
decision making.
Example
Which of our products got a rating of four stars or higher,
on social media in the last quarter?



Big Data Analytics and Requirements
[Figure: Big Data (weblogs, scans, events, IoT) combined with Master Data (facts)]



Two Approaches for Accessing Master Data in Oracle

ETL Copy: Oracle -> Hadoop
  Preplanned/scheduled
  What to copy and when?
  Always behind
  Copy is protected using Hadoop file-level security
  Oracle CopyToBDA 2.0

Direct Access
  Ad-hoc queries, always current
  Hive SQL, Spark SQL, Impala*, other SQL engines
  Hadoop APIs
  Oracle database security
  Oracle Big Data SQL (not covered here)
  Oracle Datasource for Hadoop (OD4H)
  Hive-ODCI (part II of this presentation)
Direct Access From Hadoop

Example
Hive query for joining tables from Big Data and Oracle

SELECT HadoopT.First_Name, HadoopT.Last_Name, OracleT.bonus
FROM HadoopT JOIN OracleT ON (HadoopT.Emp_ID = OracleT.Emp_ID)
WHERE salary > 70000 AND bonus > 7000;



Hadoop 2.0 Architecture

[Diagram: Hive SQL, Batch (MapReduce), Big Data SQL, Spark (In-Memory), and Mahout (ML libs) run on top of YARN, which provides compute resources and the scheduler. The data layer exposes HCatalog, InputFormat, and StorageHandler over redundant storage: HDFS, NoSQL, and Oracle table(s) reached through External Table + Handler.]



Oracle Datasource for Hadoop (OD4H)

Direct, parallel, fast, secure and consistent access to master data (SCN)

[Diagram: Hive, Impala, Spark, Mahout, and other engines running on YARN reach the Oracle Database through HCatalog, StorageHandler, and InputFormat.]
Oracle Table as Hive External Table
DDL

CREATE EXTERNAL TABLE Hadoop_employees (
  EMPLOYEE_ID INT, FIRST_NAME STRING, LAST_NAME STRING, SALARY DOUBLE, ...)
STORED BY 'oracle.hcat.osh.storagehandler.OracleStorageHandler'
TBLPROPERTIES (
  ...
  'mapreduce.jdbc.input.table.name' = 'EMPLOYEES',
  ...
);
Parallel Access to Oracle Table: Splitter Patterns
SINGLE_SPLITTER
ROW_SPLITTER
  number of rows per split, set in oracle.hcat.osh.rowsPerSplit
BLOCK_SPLITTER
  maximum number of splits, set by oracle.hcat.osh.maxStorageBasedSplits
PARTITION_SPLITTER
CUSTOM_SPLITTER
  a user-defined SELECT statement, given in oracle.hcat.osh.chunkSQL, that emits
  the ROWIDs corresponding to the start and end of each split
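
As a rough illustration of the ROW_SPLITTER idea, the Python sketch below divides a table of known size into fixed-size row ranges. It is not OD4H code: the function name `row_splits` and the half-open range representation are assumptions made for this example; only the notion of a rows-per-split setting comes from the slide.

```python
from typing import List, Tuple


def row_splits(total_rows: int, rows_per_split: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges that cover total_rows rows.

    Hypothetical helper: rows_per_split plays the role that the slide
    assigns to oracle.hcat.osh.rowsPerSplit.
    """
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    if rows_per_split <= 0:
        raise ValueError("rows_per_split must be positive")
    return [
        (start, min(start + rows_per_split, total_rows))
        for start in range(0, total_rows, rows_per_split)
    ]


if __name__ == "__main__":
    # Each range could be handed to a separate Hadoop or Spark task.
    for start, end in row_splits(total_rows=10, rows_per_split=4):
        print(f"split covers rows [{start}, {end})")
```

The last split is allowed to be shorter than the others, so no row is dropped when the row count is not a multiple of the split size.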
OD4H Steps

1. Gets a secure connection to the database
2. Generates database splits (DLDL) with SCN
3. Rewrites HiveQL or Spark SQL into Oracle SQL for each split
4. Each split is processed by a Hadoop/Spark task
5. Matching rows are returned to the Hadoop/Spark query coordinator

[Diagram: in the Hadoop or Spark cluster, the Hive or Spark query execution plan (partial) is handed to OD4H.]
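
The rewrite step can be pictured with a small Python sketch that builds one Oracle query per split, all pinned to the same SCN so that every task reads a consistent snapshot. This is an assumption-laden illustration, not the SQL that OD4H actually generates: the function `rewrite_for_splits`, the use of a numeric key range per split, and the sample SCN value are all hypothetical. `AS OF SCN` is Oracle's flashback query clause.

```python
from typing import List, Sequence, Tuple


def rewrite_for_splits(
    table: str,
    columns: Sequence[str],
    predicate: str,
    key_column: str,
    key_ranges: Sequence[Tuple[int, int]],
    scn: int,
) -> List[str]:
    """Build one Oracle SELECT per split, each reading as of the same SCN.

    Only the requested columns are selected (column projection) and the
    caller's predicate is appended to every query (predicate pushdown).
    String formatting is used for readability only; real code should use
    bind variables rather than interpolating values.
    """
    projection = ", ".join(columns)
    queries = []
    for low, high in key_ranges:
        queries.append(
            f"SELECT {projection} FROM {table} AS OF SCN {scn} "
            f"WHERE {key_column} >= {low} AND {key_column} < {high} "
            f"AND ({predicate})"
        )
    return queries


if __name__ == "__main__":
    # Table and column names follow the EMPLOYEES example used in the DDL slide.
    for sql in rewrite_for_splits(
        table="EMPLOYEES",
        columns=["EMPLOYEE_ID", "SALARY"],
        predicate="SALARY > 70000",
        key_column="EMPLOYEE_ID",
        key_ranges=[(0, 100), (100, 200)],
        scn=12345,  # hypothetical SCN captured once, before the splits run
    ):
        print(sql)
```

Capturing the SCN once and reusing it in every split is what keeps the combined result consistent even if the table changes while the tasks run.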

Putting Everything Together
[Diagram: the Hive DDL registers the Oracle table in HCatalog. The Oracle Storage Handler receives the rewritten query and divides the Oracle table into granules; each granule becomes a split processed by a MapTask of the MapReduce job, coordinated by the Job Tracker.]



OD4H Summary
Support for Hadoop & Spark query engines: Hive SQL, Spark SQL, Impala*
Support for Hadoop programming models: Pig, MapReduce, etc.
Secure and reliable authentication: Kerberos authentication, SSL, Oracle Wallet
Efficient translation of HQL to Oracle SQL
Scalability: splits based on DB metadata
Column Projection Pushdown
Predicate Pushdown
Partition Pruning
Connection caching
OpenWorld 2016
Querying Hadoop/HDFS from PL/SQL
CON6359

Nicholas Van Wyen


MTI
September 18, 2016

Confidential: Oracle Internal/Restricted/Highly Restricted
Agenda

1. The Real-World
2. The Problem
3. The Solution
4. Considerations
5. Questions

The Real-World
Different solutions, for different requirements
[Diagram: Spark, Analytics, Sqoop, ES, C++, C, Pro*C, Java, PL/SQL, C#]

The Problem
Changes

[Diagram: a PL/SQL application ({;}) shown against available storage capacity]
The Problem
Example

Hive (beeline):

$ beeline -u jdbc:hive2://hive.corp.com:10000 \
          -n oracle -w welcome1.passwd
0: jdbc:hive2://localhost:10000> desc user_log;
+------------------+------------+---------+
| col_name         | data_type  | comment |
+------------------+------------+---------+
| stamp            | date       |         |
| account          | string     |         |
| message          | string     |         |
+------------------+------------+---------+

Oracle (SQL*Plus):

SQL> desc SCOTT.USER_LOG
Name       Null?    Type
---------- -------- ---------------
STAMP      NOT NULL DATE
ACCOUNT             VARCHAR2(30)
MESSAGE             VARCHAR2(4000)

create view user_log_monthly
as
select stamp,
       account,
       message
  from scott.user_log
 where stamp between sysdate - 30
                 and sysdate;

procedure user_report( p out xmltype ) is
begin
    for rec in ( select account,
                        message
                   from scott.user_log
                  order by account ) loop
    ...
end user_report;

The Solution
Introduction
Presenting Hive-ODCI
  Built on Oracle Data Cartridge Interface
  Inspired by DBPrism and internal projects using ODCI
Initial Requirements
  Dynamically access Hadoop/Hive within the Oracle 12c RDBMS
  Allow for First-Class Oracle objects
  Leverage existing RBAC
  Support active Bind variables
    User defined, Static or Saved
  Support Oracle SQL and PL/SQL
  Easy to use, for Developers and Administrators
The Solution
Overview
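[Diagram: a client application issues select/dml/ddl against the database. Inside the database, SQL uses hive_q and views, and PL/SQL uses hive_t, session, and binding, all on top of ODCI. In the server OJVM, hive.jar supplies the HiveDriver (org.apache.hive.jdbc.HiveDriver), which connects to Hive on Hadoop.]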
The Solution
Example

[Diagram: hive-odci exchanges pipelined, parallel, bidirectional data with the application through hive_t and hive_q, together with session, binding, and param.]

The Solution
Example

Step 1
param( 'hive_jdbc_url', 'jdbc:hive2://hive.corp.com:10000' );
param( 'hive_jdbc_url.1', 'user=oracle' );
param( 'hive_jdbc_url.2', 'password=welcome1' );

Step 2
create or replace view scott.user_log
( stamp, account, message )
as
select *
  from table( hive_q( q'[ select stamp,
                                 account,
                                 message
                            from user_log
                           order by stamp ]' ) )

Step 3
procedure user_report( p out xmltype ) is
begin
    for rec in ( select account,
                        message
                   from scott.user_log
                  order by account ) loop
    ...
end user_report;

SQL> alter procedure scott.user_report compile;

Procedure altered.

The Solution
Example
Step 4
create or replace view scott.user_log_monthly
(
    stamp,
    account,
    message
)
as
select *
  from table( hive_q( q'[ select stamp,
                                 account,
                                 message
                            from user_log
                           where stamp between ? and ? ]',
                      hive_binds( hive_bind( to_char( sysdate - 30,
                                                      'yyyy-mm-dd' ),
                                             1 /* type_date */,
                                             1 /* ref_in */ ),
                                  hive_bind( to_char( sysdate,
                                                      'yyyy-mm-dd' ),
                                             1 /* type_date */,
                                             1 /* ref_in */ ) ) ) )
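
The positional `?` placeholders in the view above behave like ordinary bind variables: the query text stays fixed and the two date strings are supplied separately, in order. The Python sketch below shows the same pattern with the standard-library `sqlite3` module standing in for Hive. It does not use the Hive-ODCI API; the helper names `demo_connection` and `monthly_log` and the sample rows are invented for illustration.

```python
import sqlite3
from datetime import date, timedelta


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory stand-in for the user_log table (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table user_log (stamp text, account text, message text)")
    conn.executemany(
        "insert into user_log values (?, ?, ?)",
        [
            ("2016-07-01", "scott", "old entry"),
            ("2016-09-01", "scott", "recent entry"),
            ("2016-09-18", "hr", "latest entry"),
        ],
    )
    return conn


def monthly_log(conn: sqlite3.Connection, today: date) -> list:
    """Return rows from the last 30 days, binding both dates positionally."""
    start = (today - timedelta(days=30)).isoformat()  # formatted as yyyy-mm-dd
    end = today.isoformat()
    cursor = conn.execute(
        "select stamp, account, message from user_log "
        "where stamp between ? and ? order by stamp",
        (start, end),  # first ? receives start, second ? receives end
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for row in monthly_log(demo_connection(), date(2016, 9, 19)):
        print(row)
```

Formatting the dates as `yyyy-mm-dd` strings, as the view does with `to_char`, keeps string comparison and date order in agreement.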

The Solution
Example

Step 5
create or replace trigger scott.user_log_dml
instead of insert or update or delete on scott.user_log
for each row
declare
    cmd varchar2( 4000 );
    bnd hive_binds := hive_binds();
begin

    if ( inserting ) then

        cmd := q'[ insert into user_log
                   ( stamp, account, message )
                   values
                   ( ?, ?, ? ) ]';

        bnd.extend;
        bnd( bnd.count ) := hive_bind( to_char( :new.stamp,
                                                'yyyy-mm-dd' ),
                                       hive_binding.type_date,
                                       hive_binding.ref_in );
        ...
    elsif ( updating ) then
        ...
    end if;

    hive_remote.dml( cmd, bnd );

end user_log_dml;

Considerations
In Oracle
Become familiar with the Hive-ODCI API
Read the documentation
Ask questions and test, test, test
Use session isolation whenever possible
Particularly authentication: set it at the session level, not the system level
Lean on your experience and your DBA Team
Keep signatures consistent
Change code if necessary
Become familiar with the DB wait events
Considerations
In Hive
Analytics over in-line views
Review queries and use common sense
Leverage the CBO and gather statistics
Use best practices
ORCFile - Optimized Row Columnar file format, highly efficient Hive data storage
Apache Tez - Extensible framework for high-performance batch and interactive
processing, coordinated by YARN. It improves on MapReduce by dramatically increasing
speed while maintaining the ability to scale
Vectorized queries - Hive feature that greatly reduces the CPU usage of query
operations like scans, filters, aggregates, and joins by processing rows in batches,
avoiding the metadata interpretation that row-at-a-time execution performs in the
inner loop of its code paths.

Lean on your experience and your BDA Team
Considerations
Reach out
If you have questions, concerns or comments
Feel free to contact me

Available on GitHub
https://github.com/nvanwyen/hive-odci
https://github.com/nvanwyen/hive-odci/releases/latest

Contact
nvanwyen@mtihq.com