
BIG DATA ANALYTICS WITH ORACLE



César Pérez

INTRODUCTION

Big data analytics is the process of examining big data to uncover hidden patterns,
unknown correlations and other useful information that can be used to make better
decisions. With big data analytics, data scientists and others can analyze huge volumes of
data that conventional analytics and business intelligence solutions can't touch. Consider
this: it's possible that your organization could accumulate (if it hasn't already) billions of
rows of data with hundreds of millions of data combinations in multiple data stores and
abundant formats. High-performance analytics is necessary to process that much data in
order to figure out what's important and what isn't. Enter big data analytics.
Big data is now a reality: The volume, variety and velocity of data coming into your
organization continue to reach unprecedented levels. This phenomenal growth means that
not only must you understand big data in order to decipher the information that truly
counts, but you also must understand the possibilities of what you can do with big data
analytics.
Using big data analytics you can extract only the relevant information from terabytes,
petabytes and exabytes, and analyze it to transform your business decisions for the future.
Becoming proactive with big data analytics isn't a one-time endeavor; it is more of a
culture change: a new way of gaining ground by freeing your analysts and decision
makers to meet the future with sound knowledge and insight.
On the other hand, business intelligence (BI) provides standard business reports, ad hoc
reports, OLAP and even alerts and notifications based on analytics. This ad hoc analysis
looks at the static past, which has its purpose in a limited number of situations.
Oracle supports big data implementations, including Hadoop. Through Oracle and
Hadoop it is possible to work in all steps of the analytical process: identify/formulate the
problem, prepare the data, explore the data, transform and select, build the model, validate
the model, deploy the model, and evaluate/monitor the results.
This book presents the work possibilities that Oracle offers in the modern sectors of Big
Data, Business Intelligence and Analytics. The most important tools of Oracle are
presented for processing and analyzing large volumes of data in an orderly manner. In
turn, these tools also allow you to extract the knowledge contained in the data.

INDEX

INTRODUCTION
BIG DATA CONCEPTS
1.1 DEFINING BIG DATA
1.2 THE IMPORTANCE OF BIG DATA
1.3 THE NEED FOR BIG DATA
1.4 KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA
1.4.1 Information Management for Big Data
HADOOP
2.1 BUILDING A BIG DATA PLATFORM
2.1.1 Acquire Big Data
2.1.2 Organize Big Data
2.1.3 Analyze Big Data
2.1.4 Solution Spectrum
2.2 HADOOP
2.3 HADOOP COMPONENTS
2.3.1 Benefits of Hadoop
2.3.2 Limitations of Hadoop
2.4 GET DATA INTO HADOOP
2.5 HADOOP USES
2.5.1 Prime Business Applications for Hadoop
2.6 HADOOP CHALLENGES
ORACLE BIG DATA APPLIANCE
3.1 INTRODUCTION
3.2 ORACLE BIG DATA APPLIANCE BASIC CONFIGURATION
3.3 AUTO SERVICE REQUEST (ASR)
3.4 ORACLE ENGINEERED SYSTEMS FOR BIG DATA
3.5 SOFTWARE FOR BIG DATA
3.5.1 Software Component Overview
3.6 ACQUIRING DATA FOR ANALYSIS
3.6.1 Hadoop Distributed File System
3.6.2 Apache Hive
3.6.3 Oracle NoSQL Database
3.7 ORGANIZING BIG DATA
3.8 MAPREDUCE
3.9 ORACLE BIG DATA CONNECTORS
3.9.1 Oracle SQL Connector for Hadoop Distributed File System
3.9.2 Oracle Loader for Hadoop
3.9.3 Oracle Data Integrator Application Adapter for Hadoop
3.9.4 Oracle XQuery for Hadoop
3.10 ORACLE R ADVANCED ANALYTICS FOR HADOOP
3.11 ORACLE R SUPPORT FOR BIG DATA
3.12 ANALYZING AND VISUALIZING BIG DATA
3.13 ORACLE BUSINESS INTELLIGENCE FOUNDATION SUITE

3.13.1 Enterprise BI Platform


3.13.2 OLAP Analytics
3.13.3 Scorecard and Strategy Management
3.13.4 Mobile BI
3.13.5 Enterprise Reporting
3.14 ORACLE BIG DATA LITE VIRTUAL MACHINE
ADMINISTERING ORACLE BIG DATA APPLIANCE
4.1 MONITORING MULTIPLE CLUSTERS USING ORACLE ENTERPRISE MANAGER
4.1.1 Using the Enterprise Manager Web Interface
4.1.2 Using the Enterprise Manager Command-Line Interface
4.2 MANAGING OPERATIONS USING CLOUDERA MANAGER
4.2.1 Monitoring the Status of Oracle Big Data Appliance
4.2.2 Performing Administrative Tasks
4.2.3 Managing CDH Services With Cloudera Manager
4.3 USING HADOOP MONITORING UTILITIES
4.3.1 Monitoring MapReduce Jobs
4.3.2 Monitoring the Health of HDFS
4.4 USING CLOUDERA HUE TO INTERACT WITH HADOOP
4.5 ABOUT THE ORACLE BIG DATA APPLIANCE SOFTWARE
4.5.1 Software Components
4.5.2 Unconfigured Software
4.5.3 Allocating Resources Among Services
4.6 STOPPING AND STARTING ORACLE BIG DATA APPLIANCE
4.6.1 Prerequisites
4.6.2 Stopping Oracle Big Data Appliance
4.6.3 Starting Oracle Big Data Appliance
4.7 MANAGING ORACLE BIG DATA SQL
4.7.1 Adding and Removing the Oracle Big Data SQL Service
4.7.2 Allocating Resources to Oracle Big Data SQL
4.8 SWITCHING FROM YARN TO MAPREDUCE 1
4.9 SECURITY ON ORACLE BIG DATA APPLIANCE
4.9.1 About Predefined Users and Groups
4.9.2 About User Authentication
4.9.3 About Fine-Grained Authorization
4.9.4 About On-Disk Encryption
4.9.5 Port Numbers Used on Oracle Big Data Appliance
4.9.6 About Puppet Security
4.10 AUDITING ORACLE BIG DATA APPLIANCE
4.10.1 About Oracle Audit Vault and Database Firewall
4.10.2 Setting Up the Oracle Big Data Appliance Plug-in
4.10.3 Monitoring Oracle Big Data Appliance
4.11 COLLECTING DIAGNOSTIC INFORMATION FOR ORACLE CUSTOMER SUPPORT
4.12 AUDITING DATA ACCESS ACROSS THE ENTERPRISE
4.12.1 Configuration
4.12.2 Capturing Activity
4.12.3 Ad Hoc Reporting
4.12.4 Summary
ORACLE BIG DATA SQL

5.1 INTRODUCTION
5.2 SQL ON HADOOP
5.3 SQL ON MORE THAN HADOOP
5.4 UNIFYING METADATA
5.5 OPTIMIZING PERFORMANCE
5.6 SMART SCAN FOR HADOOP
5.7 ORACLE SQL DEVELOPER & DATA MODELER SUPPORT FOR ORACLE BIG DATA SQL
5.7.1 Setting up Connections to Hive
5.7.2 Using the Hive Connection
5.7.3 Create Big Data SQL-enabled Tables Using Oracle Data Modeler
5.7.4 Edit the Table Definitions
5.7.5 Query All Your Data
5.8 USING ORACLE BIG DATA SQL FOR DATA ACCESS
5.8.1 About Oracle External Tables
5.8.2 About the Access Drivers for Oracle Big Data SQL
5.8.3 About Smart Scan Technology
5.8.4 About Data Security with Oracle Big Data SQL
5.9 INSTALLING ORACLE BIG DATA SQL
5.9.1 Prerequisites for Using Oracle Big Data SQL
5.9.2 Performing the Installation
5.9.3 Running the Post-Installation Script for Oracle Big Data SQL
5.9.4 Running the bds-exa-install Script
5.9.5 bds-exa-install Syntax
5.10 CREATING EXTERNAL TABLES FOR ACCESSING BIG DATA
5.10.1 About the Basic CREATE TABLE Syntax
5.10.2 Creating an External Table for a Hive Table
5.10.3 Obtaining Information About a Hive Table
5.10.4 Using the CREATE_EXTDDL_FOR_HIVE Function
5.10.5 Developing a CREATE TABLE Statement for ORACLE_HIVE
5.10.6 Creating an External Table for HDFS Files
5.10.7 Using the Default Access Parameters with ORACLE_HDFS
5.10.8 Overriding the Default ORACLE_HDFS Settings
5.10.9 Accessing Avro Container Files
5.11 ABOUT THE EXTERNAL TABLE CLAUSE
5.11.1 TYPE Clause
5.11.2 DEFAULT DIRECTORY Clause
5.11.3 LOCATION Clause
5.11.4 REJECT LIMIT Clause
5.11.5 ACCESS PARAMETERS Clause
5.12 ABOUT DATA TYPE CONVERSIONS
5.13 QUERYING EXTERNAL TABLES
5.14 ABOUT ORACLE BIG DATA SQL ON ORACLE EXADATA DATABASE MACHINE
5.14.1 Starting and Stopping the Big Data SQL Agent
5.14.2 About the Common Directory
5.14.3 Common Configuration Properties
5.14.4 bigdata.properties
5.14.5 bigdata-log4j.properties
5.14.6 About the Cluster Directory

5.14.7 About Permissions


HIVE USER DEFINED FUNCTIONS (UDFS)
6.1 INTRODUCTION
6.1.1 The Three Little UDFs
6.2 THREE LITTLE HIVE UDFS: EXAMPLE 1
6.2.1 Introduction
6.2.2 Extending UDF
6.3 THREE LITTLE HIVE UDFS: EXAMPLE 2
6.3.1 Introduction
6.3.2 Extending GenericUDTF
6.3.3 Using the UDTF
6.4 THREE LITTLE HIVE UDFS: EXAMPLE 3
6.4.1 Introduction
6.4.2 Prefix Sum: Moving Average without State
6.4.3 Orchestrating Partial Aggregation
6.4.4 Aggregation Buffers: Connecting Algorithms with Execution
6.4.5 Using the UDAF
6.4.6 Summary
ORACLE NOSQL
7.1 INTRODUCTION
7.2 DATA MODEL
7.3 API
7.4 CREATE, REMOVE, UPDATE, AND DELETE
7.5 ITERATION
7.6 BULK OPERATION API
7.7 ADMINISTRATION
7.8 ARCHITECTURE
7.9 IMPLEMENTATION
7.9.1 Storage Nodes
7.9.2 Client Driver
7.10 PERFORMANCE
7.11 CONCLUSION

Chapter 1.

BIG DATA CONCEPTS

1.1 DEFINING BIG DATA


Big data typically refers to the following types of data:

Traditional enterprise data includes customer information from CRM systems, transactional ERP data, web store transactions, and general ledger data.

Machine-generated/sensor data includes Call Detail Records (CDR), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.

Social data includes customer feedback streams, micro-blogging sites like Twitter, and social media platforms like Facebook.

The McKinsey Global Institute estimates that data volume is growing 40% per year, and
will grow 44x between 2009 and 2020. But while it's often the most visible parameter,
volume of data is not the only characteristic that matters. In fact, there are four key
characteristics that define big data (Figure 1-1):
Volume. Machine-generated data is produced in much larger quantities than
non-traditional data. For instance, a single jet engine can generate 10TB of data
in 30 minutes. With more than 25,000 airline flights per day, the daily volume of
just this single data source runs into the Petabytes. Smart meters and heavy
industrial equipment like oil refineries and drilling rigs generate similar data
volumes, compounding the problem.

Velocity. Social media data streams, while not as massive as machine-generated data, produce a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes (over 8 TB per day).

Variety. Traditional data formats tend to be relatively well defined by a data schema and change slowly. In contrast, non-traditional data formats exhibit a dizzying rate of change. As new services are added, new sensors deployed, or new marketing campaigns executed, new data types are needed to capture the resultant information.

Value. The economic value of different data varies significantly. Typically there is good information hidden amongst a larger body of non-traditional data; the challenge is identifying what is valuable and then transforming and extracting that data for analysis.

To make the most of big data, enterprises must evolve their IT infrastructures to handle
these new high-volume, high-velocity, high-variety sources of data and integrate them
with the pre-existing enterprise data to be analyzed.

Big data is a relative term describing a situation where the volume, velocity and variety of
data exceed an organization's storage or compute capacity for accurate and timely decision
making. Some of this data is held in transactional data stores, the byproduct of fast-growing online activity. Machine-to-machine interactions, such as metering, call detail
records, environmental sensing and RFID systems, generate their own tidal waves of data.
All these forms of data are expanding, and that is coupled with fast-growing streams of
unstructured and semistructured data from social media.
That's a lot of data, but it is the reality for many organizations. By some estimates,
organizations in all sectors have at least 100 terabytes of data, many with more than a
petabyte. "Even scarier, many predict this number to double every six months going
forward," said futurist Thornton May, speaking at a webinar in 2011.

Figure 1-1

1.2 THE IMPORTANCE OF BIG DATA


When big data is distilled and analyzed in combination with traditional enterprise data,
enterprises can develop a more thorough and insightful understanding of their business,
which can lead to enhanced productivity, a stronger competitive position and greater
innovation, all of which can have a significant impact on the bottom line.
For example, in the delivery of healthcare services, management of chronic or long-term
conditions is expensive. Use of in-home monitoring devices to measure vital signs and
monitor progress is just one way that sensor data can be used to improve patient health
and reduce both office visits and hospital admittance.
Manufacturing companies deploy sensors in their products to return a stream of telemetry.
In the automotive industry, systems such as General Motors' OnStar or Renault's R-Link deliver communications, security and navigation services. Perhaps more
importantly, this telemetry also reveals usage patterns, failure rates and other opportunities
for product improvement that can reduce development and assembly costs.
The proliferation of smart phones and other GPS devices offers advertisers an opportunity
to target consumers when they are in close proximity to a store, a coffee shop or a
restaurant. This opens up new revenue for service providers and offers many businesses a
chance to target new customers.
Retailers usually know who buys their products. Use of social media and web log files
from their ecommerce sites can help them understand who didn't buy and why they chose
not to, information not available to them today. This can enable much more effective
micro customer segmentation and targeted marketing campaigns, as well as improve
supply chain efficiencies through more accurate demand planning.
Finally, social media sites like Facebook and LinkedIn simply wouldn't exist without big
data. Their business model requires a personalized experience on the web, which can only
be delivered by capturing and using all the available data about a user or member.

1.3 THE NEED FOR BIG DATA


The term Big Data can be interpreted in many different ways. We defined Big Data as
conforming to the volume, velocity, and variety attributes that characterize it. Note that
Big Data solutions aren't a replacement for your existing warehouse solutions, and in our
humble opinion, any vendor suggesting otherwise likely doesn't have the full gamut of
experience or understanding of your investments in the traditional side of information
management.
We think it's best to start out this section with a couple of key Big Data principles we want
you to keep in mind, before outlining some considerations as to when you use Big Data
technologies, namely:
Big Data solutions are ideal for analyzing not only raw structured data, but
semistructured and unstructured data from a wide variety of sources.

Big Data solutions are ideal when all, or most, of the data needs to be
analyzed versus a sample of the data; or a sampling of data isn't nearly as
effective as a larger set of data from which to derive analysis.

Big Data solutions are ideal for iterative and exploratory analysis when
business measures on data are not predetermined.
When it comes to solving information management challenges using Big
Data technologies, we suggest you consider the following:

Is the reciprocal of the traditional analysis paradigm appropriate for the business task at hand? Better yet, can you see a Big Data platform complementing what you currently have in place for analysis and achieving synergy with existing solutions for better business outcomes?

For example, typically, data bound for the analytic warehouse has to be
cleansed, documented, and trusted before it's neatly placed into a strict
warehouse schema (and, of course, if it can't fit into a traditional row and
column format, it can't even get to the warehouse in most cases). In contrast, a
Big Data solution is not only going to leverage data not typically suitable for a
traditional warehouse environment, and in massive amounts of volume, but it's
going to give up some of the formalities and strictness of the data. The benefit
is that you can preserve the fidelity of data and gain access to mountains of
information for exploration and discovery of business insights before running it
through the due diligence that you're accustomed to; the data can then be
included as a participant in a cyclic system, enriching the models in the
warehouse.
Big Data is well suited for solving information challenges that don't natively
fit within a traditional relational database approach for handling the problem at
hand.

It's important that you understand that conventional database technologies are an
important, and relevant, part of an overall analytic solution. In fact, they become even
more vital when used in conjunction with your Big Data platform.
A good analogy here is your left and right hands; each offers individual strengths and
optimizations for a task at hand. For example, if you've ever played baseball, you know
that one hand is better at throwing and the other at catching. It's likely the case that each
hand could try to do the other task that it isn't a natural fit for, but it's very awkward (try
it; better yet, film yourself trying it and you will see what we mean). What's more, you
don't see baseball players catching with one hand, stopping, taking off their gloves, and
throwing with the same hand either. The left and right hands of a baseball player work in
unison to deliver the best results. This is a loose analogy to traditional database and Big
Data technologies: Your information platform shouldn't go into the future without these
two important entities working together, because the outcomes of a cohesive analytic
ecosystem deliver premium results in the same way your coordinated hands do for
baseball. There exists some class of problems that don't natively belong in traditional
databases, at least not at first. And there's data that we're not sure we want in the
warehouse, because perhaps we don't know if it's rich in value, it's unstructured, or it's
too voluminous. In many cases, we can't find out the value per byte of the data until after
we spend the effort and money to put it into the warehouse; but we want to be sure that
data is worth saving and has a high value per byte before investing in it.
Some organizations will need to rethink their data management strategies when they face
hundreds of gigabytes of data for the first time. Others may be fine until they reach tens or
hundreds of terabytes. But whenever an organization reaches the critical mass defined as
big data for itself, change is inevitable.
Organizations are moving away from viewing data integration as a standalone discipline to
a mindset where data integration, data quality, metadata management and data governance
are designed and used together. The traditional extract-transform-load (ETL) data
approach has been augmented with one that minimizes data movement and improves
processing power.
Organizations are also embracing a holistic, enterprise view that treats data as a
core enterprise asset. Finally, many organizations are retreating from reactive data
management in favor of a managed and ultimately more proactive and predictive approach
to managing information.
The true value of big data lies not just in having it, but in harvesting it for fast, fact-based
decisions that lead to real business value. For example, disasters such as the recent
financial meltdown and mortgage crisis might have been prevented with risk computation
on historical data at a massive scale. Financial institutions were essentially taking bundles
of thousands of loans and looking at them as one. We now have the computing power to
assess the probability of risk at the individual level. Every sector can benefit from this
type of analysis.
Big data provides gigantic statistical samples, which enhance analytic tool results. The
general rule is that the larger the data sample, the more accurate are the statistics and other
products of the analysis. However, organizations have been limited to using subsets of
their data, or they were constrained to simplistic analysis because the sheer volume of data
overwhelmed their IT platforms. What good is it to collect and store terabytes of data if
you can't analyze it in full context, or if you have to wait hours or days to get results to
urgent questions? On the other hand, not all business questions are better served by bigger
data. Now, you have choices to suit both scenarios:
Incorporate massive data volumes in analysis. If the business question is one
that will get better answers by analyzing all the data, go for it. The game-changing
technologies that extract real business value from big data, all of it, are here today. One
approach is to apply high-performance analytics to analyze massive amounts of data using
technologies such as grid computing, in-database processing and in-memory analytics.
Determine upfront which data is relevant. The traditional modus operandi has
been to store everything; only when you query it do you discover what is relevant.
Oracle provides the ability to apply analytics on the front end to determine data relevance
based on enterprise context. This analysis can be used to determine which data should be
included in analytical processes and which can be placed in
low-cost storage for later availability if needed.

1.4 KEY TECHNOLOGIES FOR EXTRACTING BUSINESS VALUE FROM BIG DATA

Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of data
by enabling high-velocity capture, discovery and/or analysis.
Furthermore, this analysis is needed in real time or near-real time, and it must be
affordable, secure and achievable.
Fortunately, a number of technology advancements have occurred or are under way that
make it possible to benefit from big data and big data analytics. For starters, storage,
server processing and memory capacity have become abundant and cheap. The cost of a
gigabyte of storage has dropped from approximately $16 in February 2000 to less than
$0.07 today. Storage and processing technologies have been designed specifically for large
data volumes. Computing models such as parallel processing, clustering, virtualization,
grid environments and cloud computing, coupled with high-speed connectivity, have
redefined what is possible.
Here are three key technologies that can help you get a handle on big data and, even
more importantly, extract meaningful business value from it.
Information management for big data. Manage data as a strategic, core asset,
with ongoing process control for big data analytics.
High-performance analytics for big data. Gain rapid insights from big data and
the ability to solve increasingly complex problems using more data.
Flexible deployment options for big data. Choose between options for on-premises or
hosted, software-as-a-service (SaaS) approaches for big data and big data analytics.

1.4.1 Information Management for Big Data


Many organizations already struggle to manage their existing data. Big data will only add
complexity to the issue. What data should be stored, and how long should we keep it?
What data should be included in analytical processing, and how do we properly prepare it
for analysis? What is the proper mix of traditional and emerging technologies?
Big data will also intensify the need for data quality and governance, for embedding
analytics into operational systems, and for issues of security, privacy and regulatory
compliance. Everything that was problematic before will just grow larger.
Oracle provides the management and governance capabilities that enable organizations to
effectively manage the entire life cycle of big data analytics, from data to decision. Oracle
provides a variety of these solutions, including data governance, metadata management,
analytical model management, run-time management and deployment management.
With Oracle, this governance is an ongoing process, not just a one-time project. Proven
methodology-driven approaches help organizations build processes based on their specific
data maturity model.
Oracle technology and implementation services enable organizations to fully exploit and
govern their information assets to achieve competitive differentiation and sustained
business success. Three key components work together in this realm:
Unified data management capabilities, including data governance, data
integration, data quality and metadata management.
Complete analytics management, including model management, model
deployment, monitoring and governance of the analytics information asset.
Effective decision management capabilities to easily embed information and
analytical results directly into business processes while managing the necessary
business rules, workflow and event logic.
High-performance, scalable solutions slash the time and effort required to filter,

aggregate and structure big data. By combining data integration, data quality and
master data management in a unified development and delivery environment,
organizations can maximize each stage of the data management process.
Oracle is unique for incorporating high-performance analytics and analytical intelligence
into the data management process for highly efficient modeling and faster results.
For instance, you can analyze all the information within an organization, such as
email, product catalogs, wiki articles and blogs, extract important concepts from that
information, and look at the links among them to identify and assign weights to millions
of terms and concepts. This organizational context is then used to assess data as it streams
into the organization, churns out of internal systems, or sits in offline data stores. This upfront analysis identifies the relevant data that should be pushed to the enterprise data
warehouse or to high-performance analytics.

Chapter 2.

HADOOP

2.1 BUILDING A BIG DATA PLATFORM


As with data warehousing, web stores or any IT platform, an infrastructure for big data has
unique requirements. In considering all the components of a big data platform, it is
important to remember that the end goal is to easily integrate your big data with your
enterprise data to allow you to conduct deep analytics on the combined data set.

The requirements in a big data infrastructure span data acquisition, data organization and
data analysis.

2.1.1 Acquire Big Data


The acquisition phase is one of the major changes in infrastructure from the days before
big data. Because big data refers to data streams of higher velocity and higher variety, the
infrastructure required to support the acquisition of big data must deliver low, predictable
latency in both capturing data and in executing short, simple queries; be able to handle
very high transaction volumes, often in a distributed environment; and support flexible,
dynamic data structures.

NoSQL databases are frequently used to acquire and store big data. They are well suited
for dynamic data structures and are highly scalable. The data stored in a NoSQL database
is typically of a high variety because the systems are intended to simply capture all data
without categorizing and parsing the data into a fixed schema.
For example, NoSQL databases are often used to collect and store social media data.
While customer facing applications frequently change, underlying storage structures are
kept simple. Instead of designing a schema with relationships between entities, these
simple structures often just contain a major key to identify the data point, and then a
content container holding the relevant data (such as a customer id and a customer profile).
This simple and dynamic structure allows changes to take place without costly
reorganizations at the storage layer (such as adding new fields to the customer profile).
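
To make that idea concrete, here is a minimal sketch in Java (the class and field names are hypothetical, chosen only for illustration): each record is addressed by a single identifying key, and the profile itself is an open-ended map, so a new attribute can be added later without any change to a storage schema.

import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value record: a major key identifies the data point,
// and a free-form container holds whatever attributes exist for it.
public class CustomerProfileStore {

    private final Map<String, Map<String, String>> store = new HashMap<>();

    // Store or update a profile under its identifying key.
    public void put(String customerId, Map<String, String> profile) {
        store.put(customerId, profile);
    }

    public Map<String, String> get(String customerId) {
        return store.get(customerId);
    }

    public static void main(String[] args) {
        CustomerProfileStore db = new CustomerProfileStore();

        Map<String, String> profile = new HashMap<>();
        profile.put("name", "Alice");
        profile.put("segment", "retail");
        db.put("cust-1001", profile);

        // Adding a new field later requires no schema change at the storage layer.
        db.get("cust-1001").put("twitter_handle", "@alice");

        System.out.println(db.get("cust-1001"));
    }
}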

2.1.2 Organize Big Data


In classical data warehousing terms, organizing data is called data integration. Because
there is such a high volume of big data, there is a tendency to organize data at its initial
destination location, thus saving both time and money by not moving around large
volumes of data. The infrastructure required for organizing big data must be able to
process and manipulate data in the original storage location; support very high throughput
(often in batch) to deal with large data processing steps; and handle a large variety of data
formats, from unstructured to structured.
Hadoop is a new technology that allows large data volumes to be organized and processed
while keeping the data on the original data storage cluster. Hadoop Distributed File
System (HDFS) is the long-term storage system for web logs, for example. These web
logs are turned into browsing behavior (sessions) by running MapReduce programs on the
cluster and generating aggregated results on the same cluster. These aggregated results are
then loaded into a Relational DBMS system.

2.1.3 Analyze Big Data


Since data is not always moved during the organization phase, the analysis may also be
done in a distributed environment, where some data will stay where it was originally
stored and be transparently accessed from a data warehouse. The infrastructure required
for analyzing big data must be able to support deeper analytics such as statistical analysis
and data mining, on a wider variety of data types stored in diverse systems; scale to
extreme data volumes; deliver faster response times driven by changes in behavior; and
automate decisions based on analytical models. Most importantly, the infrastructure must
be able to integrate analysis on the combination of big data and traditional enterprise data.
New insight comes not just from analyzing new data, but from analyzing it within the
context of the old to provide new perspectives on old problems.

For example, analyzing inventory data from a smart vending machine in combination with
the events calendar for the venue in which the vending machine is located, will dictate the
optimal product mix and replenishment schedule for the vending machine.

2.1.4 Solution Spectrum


Many new technologies have emerged to address the IT infrastructure requirements
outlined above. At last count, there were over 120 open source key-value databases for
acquiring and storing big data, while Hadoop has emerged as the primary system for
organizing big data and relational databases maintain their footprint as a data warehouse
and expand their reach into less structured data sets to analyze big data. These new
systems have created a divided solutions spectrum (Figure 2-1) composed of:
Not Only SQL (NoSQL) solutions: developer-centric specialized systems
SQL solutions: the world typically equated with the manageability, security and trusted
nature of relational database management systems (RDBMS)
NoSQL systems are designed to capture all data without categorizing and parsing it upon
entry into the system, and therefore the data is highly varied. SQL systems, on the other
hand, typically place data in well-defined structures and impose metadata on the data
captured to ensure consistency and validate data types.
Distributed file systems and transaction (key-value) stores are primarily used to capture
data and are generally in line with the requirements discussed earlier in this paper. To
interpret and distill information from the data in these solutions, a programming paradigm
called MapReduce is used. MapReduce programs are custom written programs that run in
parallel on the distributed data nodes.
The key-value stores or NoSQL databases are the OLTP databases of the big data world;
they are optimized for very fast data capture and simple query patterns. NoSQL databases
are able to provide very fast performance because the data that is captured is quickly
stored with a single identifying key rather than being interpreted and cast into a schema.
By doing so, NoSQL databases can rapidly store large numbers of transactions.
However, due to the changing nature of the data in the NoSQL database, any data
organization effort requires programming to interpret the storage logic used. This,
combined with the lack of support for complex query patterns, makes it difficult for end
users to distill value out of data in a NoSQL database.
To get the most from NoSQL solutions and turn them from specialized, developer-centric
solutions into solutions for the enterprise, they must be combined with SQL solutions into
a single proven infrastructure that meets the manageability and security requirements of
today's enterprises.

Figure 2-1

2.2 HADOOP
Hadoop is an open-source software framework for storing and processing big data in a
distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes
two tasks: massive data storage and faster processing.
For starters, let's take a quick look at some of those terms and what they mean.

Open-source software. Open-source software differs from commercial software due to the broad and open network of developers that create and manage the programs. Traditionally, it's free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.

Framework. In this case, it means everything you need to develop and run your software applications is provided: programs, tool sets, connections, etc.

Distributed. Data is divided and stored across multiple computers, and computations can be run in parallel across multiple connected machines.

Massive storage. The framework can store huge amounts of data by breaking the data into blocks and storing it on clusters of lower-cost commodity hardware.

Faster processing. How? Hadoop can process large amounts of data in parallel across clusters of tightly connected low-cost computers for quick results.

With the ability to economically store and process any kind of data (not just numerical or structured data), organizations of all sizes are taking cues from the corporate web giants that have used Hadoop to their advantage (Google, Yahoo, Etsy, eBay, Twitter, etc.), and they're asking, "What can Hadoop do for me?"
Since its inception, Hadoop has become one of the most talked about technologies. Why?
One of the top reasons (and why it was invented) is its ability to handle huge amounts of
data, any kind of data, quickly. With volumes and varieties of data growing each day,
especially from social media and automated sensors, that's a key consideration for most
organizations. Other reasons include:
Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

Computing power. Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.

Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.

Storage flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. And that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later.

Inherent data protection and self-healing capabilities. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.

2.3 HADOOP COMPONENTS


Hadoop components have funny names, which is sort of understandable knowing that
Hadoop was the name of a yellow toy elephant owned by the son of one of its inventors.
Here's a quick rundown on names you may hear. Currently three core components are
included with your basic download from the Apache Software Foundation (Figure 2-2).
HDFS: the Java-based distributed file system that can store all kinds of data without prior organization.

MapReduce: a software programming model for processing large sets of data in parallel.

YARN: a resource management framework for scheduling and handling resource requests from distributed applications.

Figure 2-2
Other components that have achieved top-level Apache project status and are available
include:

Pig: a platform for manipulating data stored in HDFS. It consists of a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.

Hive: a data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)

HBase: a nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.

ZooKeeper: an application that coordinates distributed processes.

Ambari: a web interface for managing, configuring and testing Hadoop services and components.

Flume: software that collects, aggregates and moves large amounts of streaming data into HDFS.

Sqoop: a connection and transfer mechanism that moves data between Hadoop and relational databases.

Oozie: a Hadoop job scheduler.


In addition, commercial software distributions of Hadoop are growing. Two of the most
prominent (Cloudera and Hortonworks) are startups formed by the framework's inventors.
And there are plenty of others entering the Hadoop sphere. With distributions from
software vendors, you pay for their version of the framework and receive additional
software components, tools, training, documentation and other services.

2.3.1 Benefits of Hadoop


There are several reasons that 88 percent of organizations consider Hadoop an
opportunity.

It's inexpensive. Hadoop uses lower-cost commodity hardware to reliably store large
quantities of data.

Hadoop provides flexibility to scale out by simply adding more nodes.


You can upload unstructured data without having to schematize it first. Dump any
type of data into Hadoop and apply structure as needed for consuming applications.

If capacity is available, Hadoop will start multiple copies of the same task for the same
block of data. If a node goes down, jobs are automatically redirected to other working
servers.

2.3.2 Limitations of Hadoop

Management and high-availability capabilities for rationalizing Hadoop clusters with data center infrastructure are only now starting to emerge.

Data security is fragmented, but new tools and technologies are surfacing.

MapReduce is very batch-oriented and not suitable for iterative, multi-step analytics
processing.

The Hadoop ecosystem does not have easy-to-use, full-feature tools for data
integration, data cleansing, governance and metadata. Especially lacking are tools for
data quality and standardization.

Skilled professionals with specialized Hadoop skills are in short supply and at a
premium.

MapReduce is file-intensive. And because the nodes don't intercommunicate except
through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce
phases to complete. This creates multiple files between MapReduce phases and is very
inefficient for advanced analytic computing.

Hadoop definitely provides economical data storage. But the next step is to manage the
data and use analytics to quickly identify previously unknown insights. Enter Oracle. More
on that later.

2.4 GET DATA INTO HADOOP


There are numerous ways to get data into Hadoop. Here are just a few:
You can load files to the file system using simple Java commands, and HDFS takes care
of making multiple copies of data blocks and distributing those blocks over multiple
nodes in the Hadoop system (see the Java sketch at the end of this list).

If you have a large number of files, a shell script that will run multiple put commands
in parallel will speed up the process. You don't have to write MapReduce code.

Create a cron job to scan a directory for new files and put them in HDFS as they
show up. This is useful for things like downloading email at regular intervals.

Mount HDFS as a file system and simply copy files or write files there.

Use Sqoop to import structured data from a relational database to HDFS, Hive and
HBase. It can also extract data from Hadoop and export it to relational databases and
data warehouses.

Use Flume to continuously load data from logs into Hadoop.


Use third-party vendor connectors.
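
As a sketch of the first option above, the following Java program uses the standard Hadoop FileSystem API to copy a local file into HDFS; the NameNode address and the paths are placeholders for your own environment, and HDFS handles block splitting, replication and distribution automatically.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local file into HDFS using the Hadoop FileSystem API.
// The fs.defaultFS value and the paths below are placeholders.
public class HdfsLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // HDFS takes care of block replication and placement transparently.
        fs.copyFromLocalFile(new Path("/tmp/weblogs/access.log"),
                             new Path("/user/oracle/weblogs/access.log"));

        fs.close();
    }
}

The same FileSystem API underlies the command-line put operation mentioned above, so a shell script running put commands in parallel achieves the same result without any Java code.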



2.5 HADOOP USES
Going beyond its original goal of searching millions (or billions) of web pages and
returning relevant results, many organizations are looking to Hadoop as their next big data
platform. Here are some of the more popular uses for the framework today.

Low-cost storage and active data archive. The modest cost of commodity hardware
makes Hadoop useful for storing and combining big data such as transactional, social
media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you
keep information that is not currently critical but could become useful later for business
analytics.
Staging area for a data warehouse and analytics store. One of the most prevalent
uses is to stage large amounts of raw data for loading into an enterprise data warehouse
(EDW) or an analytical store for activities such as advanced analytics, query and
reporting, etc. Organizations are looking at Hadoop to handle new types of data (e.g.,
unstructured), as well as to offload some historical data from their EDWs.
Data lake. Hadoop is often used to store large amounts of data without the constraints
introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in
parallel using hand-coded or commercial data management technologies. Refined
results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.
Sandbox for discovery and analysis. Because Hadoop was designed to deal with
volumes of data in a variety of shapes and forms, it can enable analytics. Big data
analytics on Hadoop will help run current business more efficiently, uncover new
opportunities and derive next-level competitive advantage. The sandbox setup provides
a quick and perfect opportunity to innovate with minimal investment.

Certainly Hadoop provides an economical platform for storing and processing large and
diverse data. The next logical step is to transform and manage the diverse data and use
analytics to quickly identify undiscovered insights.

2.5.1 Prime Business Applications for Hadoop


Hadoop is providing a data storage and analytical processing environment for a variety of
business uses, including:

Financial services: Insurance underwriting, fraud detection, risk mitigation and customer behavior analytics.

Retail: Location-based marketing, personalized recommendations and website optimization.

Telecommunications: Bandwidth allocation, network quality analysis and call detail records analysis.

Health and life sciences: Genomics data in medical trials and prescription adherence.

Manufacturing: Logistics and root cause analysis for production failures.

Oil and gas and other utilities: Predict asset failures, improve asset utilization and monitor equipment safety.

Government: Sentiment analysis, fraud detection and smart city initiatives.


2.6 HADOOP CHALLENGES


First of all, MapReduce is not a good match for all problems. It's good for simple requests
for information and problems that can be broken up into independent units. But it is
inefficient for iterative and interactive analytic tasks. MapReduce is file-intensive.
Because the nodes don't intercommunicate except through sorts and shuffles, iterative
algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates
multiple files between MapReduce phases and is very inefficient for advanced analytic
computing.
Second, there's a talent gap. Because it is a relatively new technology, it is difficult to find
entry-level programmers who have sufficient Java skills to be productive with
MapReduce. This talent gap is one reason distribution providers are racing to put
relational (SQL) technology on top of Hadoop. It is much easier to find programmers with
SQL skills than MapReduce skills. And, Hadoop administration seems part art and part
science, requiring low-level knowledge of operating systems, hardware and Hadoop
kernel settings.
Other challenges include fragmented data security, though new tools and technologies are
surfacing. And, Hadoop does not have easy-to-use, full-feature tools for data management,
data cleansing, governance and metadata. Especially lacking are tools for data quality and
standardization.

Chapter 3.

ORACLE BIG DATA APPLIANCE

3.1 INTRODUCTION

Oracle Big Data Appliance is an engineered system that provides a high-performance,
secure platform for running diverse workloads on Hadoop and NoSQL systems, while
integrating tightly with Oracle Database and Oracle Exadata Database Machine.
Companies have been making business decisions for decades based on transactional data
stored in relational databases. Beyond that critical data is a potential treasure trove of less
structured data: weblogs, social media, email, sensors, and photographs that can be mined
for useful information.
Oracle offers a broad and integrated portfolio of products to help you acquire and organize
these diverse data sources and analyze them alongside your existing data to find new
insights and capitalize on hidden relationships. Learn how Oracle helps you acquire,
organize, and analyze your big data.
Oracle Big Data Appliance is an engineered system of hardware and software optimized to
capture and analyze the massive volumes of unstructured data generated by social media
feeds, email, web logs, photographs, smart meters, sensors, and similar devices.
Oracle Big Data Appliance is engineered to work with Oracle Exadata Database Machine
and Oracle Exalytics In-Memory Machine to provide the most advanced analysis of all
data types, with enterprise-class performance, availability, supportability, and security.
The Oracle Linux operating system and Cloudera's Distribution including Apache Hadoop
(CDH) underlie all other software components installed on Oracle Big Data Appliance.

3.2 ORACLE BIG DATA APPLIANCE BASIC CONFIGURATION


Oracle Big Data Appliance Configuration Generation Utility acquires information from
you, such as IP addresses and software preferences, that are required for deploying Oracle
Big Data Appliance. After guiding you through a series of pages, the utility generates a set
of configuration files. These files help automate the deployment process and ensure that
Oracle Big Data Appliance is configured to your specifications.
Choose the option that describes the type of hardware installation you are configuring:
One or more new Big Data Appliance racks being installed: You enter all new data for
this choice.
One or more Big Data Appliance racks being added to an existing group of Big Data
Appliances: This choice activates the Import button, so that you can select the
BdaDeploy.json file that was used to configure the last rack in the group.
One or two in-rack expansion kits being added to a Big Data Appliance starter rack:
This choice activates the Import button, so that you can select the BdaDeploy.json file
that was last used to configure the rack (either the starter rack or one in-rack expansion
kit).
An in-process configuration using a saved master.xml configuration file: This choice
activates the Import button, so that you can select the master.xml file and continue the
configuration.
The next figure shows the Customer Details page of the Oracle Big Data Appliance
Configuration Generation Utility.

3.3 AUTO SERVICE REQUEST (ASR)


Auto Service Request (ASR) is designed to automatically open service requests when
specific Oracle Big Data Appliance hardware faults occur. ASR detects faults in the most
common server components, such as disks, fans, and power supplies, and automatically
opens a service request when a fault occurs. ASR monitors only server components and
does not detect all possible faults.
ASR is not a replacement for other monitoring mechanisms, such as SMTP and SNMP
alerts, within the customer data center. It is a complementary mechanism that expedites
and simplifies the delivery of replacement hardware. ASR should not be used for
downtime events in high-priority systems. For high-priority events, contact Oracle
Support Services directly.
When ASR detects a hardware problem, ASR Manager submits a service request to Oracle
Support Services. In many cases, Oracle Support Services can begin work on resolving the
issue before the administrator is even aware the problem exists.
An email message is sent to both the My Oracle Support email account and the technical
contact for Oracle Big Data Appliance to notify them of the creation of the service
request.
A service request may not be filed automatically on some occasions. This can happen
because of the unreliable nature of the SNMP protocol or a loss of connectivity to ASR
Manager. Oracle recommends that customers continue to monitor their systems for faults
and call Oracle Support Services if they do not receive notice that a service request has
been filed automatically.
The next figure shows the network connections between ASR and Oracle Big Data
Appliance.

3.4 ORACLE ENGINEERED SYSTEMS FOR BIG DATA


Oracle Big Data Appliance is an engineered system comprising both hardware and
software components. The hardware is optimized to run the enhanced big data software
components.
Oracle Big Data Appliance delivers:
A complete and optimized solution for big data
Single-vendor support for both hardware and software
An easy-to-deploy solution
Tight integration with Oracle Database and Oracle Exadata Database Machine

Oracle provides a big data platform that captures, organizes, and supports deep analytics
on extremely large, complex data streams flowing into your enterprise from many data
sources. You can choose the best storage and processing location for your data depending
on its structure, workload characteristics, and end-user requirements.
Oracle Database enables all data to be accessed and analyzed by a large user community
using identical methods. By adding Oracle Big Data Appliance in front of Oracle
Database, you can bring new sources of information to an existing data warehouse. Oracle
Big Data Appliance is the platform for acquiring and organizing big data so that the
relevant portions with true business value can be analyzed in Oracle Database.
For maximum speed and efficiency, Oracle Big Data Appliance can be connected to
Oracle Exadata Database Machine running Oracle Database. Oracle Exadata Database
Machine provides outstanding performance in hosting data warehouses and transaction
processing databases. Moreover, Oracle Exadata Database Machine can be connected to
Oracle Exalytics In-Memory Machine for the best performance of business intelligence
and planning applications. The InfiniBand connections between these engineered systems
provide high parallelism, which enables high-speed data transfer for batch or query
workloads.
The next figure shows the relationships among these engineered systems.

3.5 SOFTWARE FOR BIG DATA


The Oracle Linux operating system and Cloudera's Distribution including Apache
Hadoop (CDH) underlie all other software components installed on Oracle Big Data
Appliance. CDH is an integrated stack of components that have been tested and packaged
to work together.
CDH has a batch processing infrastructure that can store files and distribute work across a
set of computers. Data is processed on the same computer where it is stored. In a single
Oracle Big Data Appliance rack, CDH distributes the files and workload across 18 servers,
which compose a cluster. Each server is a node in the cluster.
The software framework consists of these primary components:
File system: The Hadoop Distributed File System (HDFS) is a highly scalable file system that stores large files across multiple servers. It achieves reliability by replicating data across multiple servers without RAID technology. It runs on top of the Linux file system on Oracle Big Data Appliance.

MapReduce engine: The MapReduce engine provides a platform for the massively parallel execution of algorithms written in Java. Oracle Big Data Appliance 3.0 runs YARN by default.

Administrative framework: Cloudera Manager is a comprehensive administrative tool for CDH. In addition, you can use Oracle Enterprise Manager to monitor both the hardware and software on Oracle Big Data Appliance.

Apache projects: CDH includes Apache projects for MapReduce and HDFS, such as Hive, Pig, Oozie, ZooKeeper, HBase, Sqoop, and Spark.

Cloudera applications: Oracle Big Data Appliance installs all products included in Cloudera Enterprise Data Hub Edition, including Impala, Search, and Navigator.
CDH is written in Java, and Java is the language for applications development. However,
several CDH utilities and other software available on Oracle Big Data Appliance provide
graphical, web-based, and other language interfaces for ease of use.

3.5.1 Software Component Overview


The major software components perform three basic tasks:
Acquire
Organize
Analyze and visualize

The best tool for each task depends on the density of the information and the degree of
structure. The next figure shows the relationships among the tools and identifies the tasks
that they perform.
The next figure shows the Oracle Big Data Appliance software structure.

3.6 ACQUIRING DATA FOR ANALYSIS


Databases used for online transaction processing (OLTP) are the traditional data sources
for data warehouses. The Oracle solution enables you to analyze traditional data stores
with big data in the same Oracle data warehouse. Relational data continues to be an
important source of business intelligence, although it runs on separate hardware from
Oracle Big Data Appliance.
Oracle Big Data Appliance provides these facilities for capturing and storing big data:
Hadoop Distributed File System
Apache Hive
Oracle NoSQL Database

3.6.1 Hadoop Distributed File System


Cloudera's Distribution including Apache Hadoop (CDH) on Oracle Big Data Appliance
uses the Hadoop Distributed File System (HDFS). HDFS stores extremely large files
containing record-oriented data. On Oracle Big Data Appliance, HDFS splits large data
files into chunks of 256 megabytes (MB), and replicates each chunk across three different
nodes in the cluster. The size of the chunks and the number of replications are
configurable.
Chunking enables HDFS to store files that are larger than the physical storage of one
server. It also allows the data to be processed in parallel across multiple computers with
multiple processors, all working on data that is stored locally. Replication ensures the high
availability of the data: if a server fails, the other servers automatically take over its
workload.
HDFS is typically used to store all types of big data.
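
For illustration, the short Java program below (a sketch using the standard HDFS FileSystem API; the file path is a placeholder) reports the block size, replication factor, and the data nodes holding each block replica of a file stored on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: show how an HDFS file is split into blocks and where each
// replica is stored. The path below is a placeholder.
public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/oracle/weblogs/access.log");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size: " + status.getBlockSize()
                + " bytes, replication: " + status.getReplication());

        // One entry per block, each listing the data nodes holding a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + block.getOffset() + ": "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}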

3.6.2 Apache Hive


Hive is an open-source data warehouse that supports data summarization, ad hoc querying,
and data analysis of data stored in HDFS. It uses a SQL-like language called HiveQL. An
interpreter generates MapReduce code from the HiveQL queries. By storing data in Hive,
you can avoid writing MapReduce programs in Java.
Hive is a component of CDH and is always installed on Oracle Big Data Appliance.
Oracle Big Data Connectors can access Hive tables.
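
As a brief illustration, the sketch below submits a HiveQL query from a Java application through the Hive JDBC driver (HiveServer2); the connection URL, credentials, and table name are placeholders, and Hive compiles the query into MapReduce jobs on the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: run a HiveQL query through HiveServer2 using the Hive JDBC driver.
// The connection URL, user, and table name are placeholders.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://bdanode01.example.com:10000/default", "oracle", "");
             Statement stmt = conn.createStatement();
             // Hive translates this HiveQL into MapReduce work on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}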

3.6.3 Oracle NoSQL Database


Oracle NoSQL Database is a distributed key-value database built on the proven storage
technology of Berkeley DB Java Edition. Whereas HDFS stores unstructured data in very
large files, Oracle NoSQL Database indexes the data and supports transactions. But unlike
Oracle Database, which stores highly structured data, Oracle NoSQL Database has relaxed
consistency rules, no schema structure, and only modest support for joins, particularly
across storage nodes.
NoSQL databases, or Not Only SQL databases, have been developed over the past decade
specifically for storing big data. However, they vary widely in implementation. Oracle
NoSQL Database has these characteristics:
Uses a system-defined, consistent hash index for data distribution

Supports high availability through replication

Provides single-record, single-operation transactions with relaxed consistency guarantees

Provides a Java API

Oracle NoSQL Database is designed to provide highly reliable, scalable, predictable, and
available data storage. The key-value pairs are stored in shards or partitions (that is,
subsets of data) based on a primary key. Data on each shard is replicated across multiple
storage nodes to ensure high availability. Oracle NoSQL Database supports fast querying
of the data, typically by key lookup.
An intelligent driver links the NoSQL database with client applications and provides
access to the requested key-value on the storage node with the lowest latency.
Oracle NoSQL Database includes hashing and balancing algorithms to ensure proper data
distribution and optimal load balancing, replication management components to handle
storage node failure and recovery, and an easy-to-use administrative interface to monitor
the state of the database.
Oracle NoSQL Database is typically used to store customer profiles and similar data for
identifying and analyzing big data. For example, you might log in to a website and see
advertisements based on your stored customer profile (a record in Oracle NoSQL
Database) and your recent activity on the site (web logs currently streaming into HDFS).
Oracle NoSQL Database is an optional component of Oracle Big Data Appliance and runs
on a separate cluster from CDH.

3.7 ORGANIZING BIG DATA


Oracle Big Data Appliance provides several ways of organizing, transforming, and
reducing big data for analysis:
MapReduce
Oracle Big Data Connectors
Oracle R Support for Big Data

3.8 MAPREDUCE
The MapReduce engine provides a platform for the massively parallel execution of
algorithms written in Java. MapReduce uses a parallel programming model for processing
data on a distributed system. It can process vast amounts of data quickly and can scale
linearly. It is particularly effective as a mechanism for batch processing of unstructured
and semistructured data. MapReduce abstracts lower-level operations into computations
over a set of keys and values.
Although big data is often described as unstructured, incoming data always has some
structure. However, it does not have a fixed, predefined structure when written to HDFS.
Instead, MapReduce creates the desired structure as it reads the data for a particular job.
The same data can have many different structures imposed by different MapReduce jobs.
A simplified description of a MapReduce job is the successive alternation of two phases:
the Map phase and the Reduce phase. Each Map phase applies a transform function over
each record in the input data to produce a set of records expressed as key-value pairs. The
output from the Map phase is input to the Reduce phase. In the Reduce phase, the Map
output records are sorted into key-value sets, so that all records in a set have the same key
value. A reducer function is applied to all the records in a set, and a set of output records is
produced as key-value pairs. The Map phase is logically run in parallel over each record,
whereas the Reduce phase is run in parallel over all key values.
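
As a concrete example, the classic word-count job that ships with the Hadoop distribution can be run from the shell as sketched below. The input and output paths are hypothetical, and the location of the examples JAR is an assumption that can differ between CDH releases and installation types.

# stage some text files in HDFS (paths are examples)
hadoop fs -mkdir -p /user/oracle/wordcount/input
hadoop fs -put *.txt /user/oracle/wordcount/input

# run the bundled word-count MapReduce job: the map phase emits (word, 1)
# pairs and the reduce phase sums the counts for each word
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
    /user/oracle/wordcount/input /user/oracle/wordcount/output

# inspect the reducer output
hadoop fs -cat /user/oracle/wordcount/output/part-r-00000 | head
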
Oracle Big Data Appliance uses the Yet Another Resource Negotiator (YARN)
implementation of MapReduce by default. You have the option of using classic
MapReduce (MRv1) instead. You cannot use both implementations in the same cluster;
you can activate either the MapReduce or the YARN service.

3.9 ORACLE BIG DATA CONNECTORS


Oracle Big Data Connectors facilitate data access between data stored in CDH and Oracle
Database. The connectors are licensed separately from Oracle Big Data Appliance and
include:
Oracle SQL Connector for Hadoop Distributed File System
Oracle Loader for Hadoop
Oracle XQuery for Hadoop
Oracle R Advanced Analytics for Hadoop
Oracle Data Integrator Application Adapter for Hadoop

3.9.1 Oracle SQL Connector for Hadoop Distributed File System


Oracle SQL Connector for Hadoop Distributed File System (Oracle SQL Connector for
HDFS) provides read access to HDFS from an Oracle database using external tables.
An external table is an Oracle Database object that identifies the location of data outside
of the database. Oracle Database accesses the data by using the metadata provided when
the external table was created. By querying the external tables, users can access data
stored in HDFS as if that data were stored in tables in the database. External tables are
often used to stage data to be transformed during a database load.
You can use Oracle SQL Connector for HDFS to:
Access data stored in HDFS files
Access Hive tables.
Access Data Pump files generated by Oracle Loader for Hadoop
Load data extracted and transformed by Oracle Data Integrator
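
Once the connector has defined an external table, querying it is ordinary SQL. The sketch below assumes a connector-generated external table named weblogs_ext already exists and that a database account (bduser here, a placeholder) and an ordinary customers table are available; none of these names come from the product itself.

sqlplus bduser/password <<'EOF'
-- the external table reads its rows from HDFS at query time
SELECT COUNT(*) FROM weblogs_ext;

-- join HDFS-resident data with an ordinary database table
SELECT c.cust_name, COUNT(*) AS page_views
FROM   weblogs_ext w JOIN customers c ON (w.cust_id = c.cust_id)
GROUP  BY c.cust_name;
EXIT
EOF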

3.9.2 Oracle Loader for Hadoop


Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement
of data from a Hadoop cluster into a table in an Oracle database. It can read and load data
from a wide variety of formats. Oracle Loader for Hadoop partitions the data and
transforms it into a database-ready format in Hadoop. It optionally sorts records by
primary key before loading the data or creating output files. The load runs as a
MapReduce job on the Hadoop cluster.
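
A hedged sketch of submitting a load job from the shell follows. The driver class and JAR location reflect the pattern documented for Oracle Loader for Hadoop, but the installation path and the configuration file (which names the input data, the target table, and the connection details) are assumptions for your environment; check your own installation before running anything like this.

# submit the load as a MapReduce job on the Hadoop cluster
hadoop jar ${OLH_HOME}/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf my_load_conf.xml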

3.9.3 Oracle Data Integrator Application Adapter for Hadoop


Oracle Data Integrator (ODI) extracts, transforms, and loads data into Oracle Database
from a wide range of sources.
In ODI, a knowledge module (KM) is a code template dedicated to a specific task in the
data integration process. You use Oracle Data Integrator Studio to load, select, and
configure the KMs for your particular application. More than 150 KMs are available to
help you acquire data from a wide range of third-party databases and other data
repositories. You only need to load a few KMs for any particular job.
Oracle Data Integrator Application Adapter for Hadoop contains the KMs specifically for
use with big data.

3.9.4 Oracle XQuery for Hadoop


Oracle XQuery for Hadoop runs transformations expressed in the XQuery language by
translating them into a series of MapReduce jobs, which are executed in parallel on the
Hadoop cluster. The input data can be located in HDFS or Oracle NoSQL Database.
Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL
Database, or Oracle Database.

3.10 ORACLE R ADVANCED ANALYTICS FOR HADOOP


Oracle R Advanced Analytics for Hadoop is a collection of R packages that provides:
Interfaces to work with Hive tables, Apache Hadoop compute infrastructure, the local R environment, and database tables
Predictive analytic techniques written in R or Java as Hadoop MapReduce jobs that can be applied to data in HDFS files
Using simple R functions, you can copy data between R memory, the local file system,
HDFS, and Hive. You can write mappers and reducers in R, schedule these R programs to
execute as Hadoop MapReduce jobs, and return the results to any of those locations.

3.11 ORACLE R SUPPORT FOR BIG DATA


R is an open-source language and environment for statistical analysis and graphing. It
provides linear and nonlinear modeling, standard statistical methods, time-series analysis,
classification, clustering, and graphical data displays. Thousands of open-source packages
are available in the Comprehensive R Archive Network (CRAN) for a spectrum of
applications, such as bioinformatics, spatial statistics, and financial and marketing
analysis. The popularity of R has increased as its functionality matured to rival that of
costly proprietary statistical packages.
Analysts typically use R on a PC, which limits the amount of data and the processing
power available for analysis. Oracle eliminates this restriction by extending the R platform
to directly leverage Oracle Big Data Appliance. Oracle R Distribution is installed on all
nodes of Oracle Big Data Appliance.
Oracle R Advanced Analytics for Hadoop provides R users with high-performance, native
access to HDFS and the MapReduce programming framework, which enables R programs
to run as MapReduce jobs on vast amounts of data. Oracle R Advanced Analytics for
Hadoop is included in the Oracle Big Data Connectors.
Oracle R Enterprise is a component of the Oracle Advanced Analytics option to Oracle
Database. It provides:
Transparent access to database data for data preparation and statistical analysis from R
Execution of R scripts at the database server, accessible from both R and SQL
A wide range of predictive and data mining in-database algorithms

Oracle R Enterprise enables you to store the results of your big data analysis in an
Oracle database, where they can be accessed for display in dashboards and applications.
Both Oracle R Advanced Analytics for Hadoop and Oracle R Enterprise make Oracle
Database and the Hadoop computational infrastructure available to statistical users
without requiring them to learn the native programming languages of either one.

3.12 ANALYZING AND VISUALIZING BIG DATA


After big data is transformed and loaded in Oracle Database, you can use the full spectrum
of Oracle business intelligence solutions and decision support products to further analyze
and visualize all your data.

3.13 ORACLE BUSINESS INTELLIGENCE FOUNDATION SUITE


Oracle Business Intelligence Foundation Suite, a comprehensive, modern, and market-leading BI platform, provides the industry's best-in-class platform for ad hoc query and
analysis, dashboards, enterprise reporting, mobile analytics, scorecards, multidimensional
OLAP, and predictive analytics on an architecturally integrated business intelligence
foundation. This enabling technology for custom and packaged business intelligence
applications helps organizations drive innovation and optimize processes while
delivering extreme performance.
Oracle Business Intelligence Foundation Suite includes the following capabilities:

3.13.1 Enterprise BI Platform


Transform IT from a cost center to a business asset by standardizing on a single, scalable
BI platform that empowers business users to easily create their own reports with
information relevant to them.

3.13.2 OLAP Analytics


The industry-leading multi-dimensional online analytical processing (OLAP) server,
designed to help business users forecast likely business performance levels and deliver
what-if analyses for varying conditions.

3.13.3 Scorecard and Strategy Management


Define strategic goals and objectives that can be cascaded to every level of the enterprise,
enabling employees to understand their impact on achieving success and align their
actions accordingly.

3.13.4 Mobile BI
Business doesn't stop just because you're on the go. Make sure critical information is
reaching you wherever you are.

3.13.5 Enterprise Reporting


Provides a single, Web-based platform for authoring, managing, and delivering interactive
reports, dashboards, and all types of highly formatted documents.

3.14 ORACLE BIG DATA LITE VIRTUAL MACHINE


Oracle Big Data Appliance Version 2.5 was released recently. This release includes some
great new features, including a continued security focus (on-disk encryption and automated
configuration of Sentry for data authorization) and updates to Cloudera's Distribution including
Apache Hadoop and Cloudera Manager.
With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.
Oracle Big Data Lite provides an integrated environment to help you get started with the
Oracle Big Data platform. Many Oracle Big Data platform components have been
installed and configured - allowing you to begin using the system right away. The
following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

Oracle Enterprise Linux 6.4
Oracle Database 12c Release 1 Enterprise Edition (12.1.0.1)
Cloudera's Distribution including Apache Hadoop (CDH4.6)
Cloudera Manager 4.8.2
Cloudera Enterprise Technology, including:

Cloudera RTQ (Impala 1.2.3)
Cloudera RTS (Search 1.2)
Oracle Big Data Connectors 2.5

Oracle SQL Connector for HDFS 2.3.0
Oracle Loader for Hadoop 2.3.1
Oracle Data Integrator 11g
Oracle R Advanced Analytics for Hadoop 2.3.1
Oracle XQuery for Hadoop 2.4.0
Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
Oracle JDeveloper 11g
Oracle SQL Developer 4.0
Oracle Data Integrator 12cR1

Oracle R Distribution 3.0.1


Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox appliance that contains many key
components of Oracle's big data platform, including Oracle Database 12c Enterprise
Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution
including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and
more. It has been configured to run on a developer-class computer; all Big Data Lite needs
is a couple of cores and about 5 GB of memory (which means your computer should
have at least 8 GB of total memory). With Big Data Lite, you can develop your big data
applications and then deploy them to the Oracle Big Data Appliance, or you can use Big
Data Lite as a client to the BDA during application development.
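
If you want to adjust the resources given to the virtual machine before starting it, the standard VirtualBox command line can be used. The VM name "BigDataLite" below is an assumption and should match whatever name you gave the imported appliance.

# list the imported virtual machines and check the current settings
VBoxManage list vms
VBoxManage showvminfo "BigDataLite" | grep -i -E 'memory|cpu'

# give the VM 2 CPUs and about 5 GB of memory, then start it headless
VBoxManage modifyvm "BigDataLite" --cpus 2 --memory 5120
VBoxManage startvm "BigDataLite" --type headless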

Chapter 4.

ADMINISTERING ORACLE BIG DATA APPLIANCE

4.1 MONITORING MULTIPLE CLUSTERS USING ORACLE ENTERPRISE MANAGER

An Oracle Enterprise Manager plug-in enables you to use the same system monitoring tool
for Oracle Big Data Appliance as you use for Oracle Exadata Database Machine or any
other Oracle Database installation. With the plug-in, you can view the status of the
installed software components in tabular or graphic presentations, and start and stop these
software services. You can also monitor the health of the network and the rack
components.
Oracle Enterprise Manager enables you to monitor all Oracle Big Data Appliance racks on
the same InfiniBand fabric. It provides summary views of both the rack hardware and the
software layout of the logical clusters.

4.1.1 Using the Enterprise Manager Web Interface


After opening the Oracle Enterprise Manager web interface, logging in, and selecting a target
cluster, you can drill down into these primary areas:

InfiniBand network: Network topology and status for InfiniBand switches and ports.
See the next figure.

Hadoop cluster: Software services for HDFS, MapReduce, and ZooKeeper.


Oracle Big Data Appliance rack: Hardware status including server hosts, Oracle
Integrated Lights Out Manager (Oracle ILOM) servers, power distribution units
(PDUs), and the Ethernet switch.

The next figure shows a small section of the cluster home page (the YARN page in Oracle
Enterprise Manager).

To monitor Oracle Big Data Appliance using Oracle Enterprise Manager:


1. Download and install the plug-in.

2. Log in to Oracle Enterprise Manager as a privileged user.


3. From the Targets menu, choose Big Data Appliance to view the Big Data page. You
can see the overall status of the targets already discovered by Oracle Enterprise
Manager.

4. Select a target cluster to view its detail pages.


5. Expand the target navigation tree to display the components. Information is available at
all levels.

6. Select a component in the tree to display its home page.


7. To change the display, choose an item from the drop-down menu at the top left of the
main display area.

4.1.2 Using the Enterprise Manager Command-Line Interface


The Enterprise Manager command-line interface (emcli) is installed on Oracle Big Data
Appliance along with all the other software. It provides the same functionality as the web
interface. You must provide credentials to connect to Oracle Management Server.
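
For example, a minimal hedged sketch of connecting and listing the discovered targets might look like the following; the SYSMAN user name is only an example, and you would use your own Enterprise Manager credentials.

# authenticate against the Oracle Management Server (prompts for the password)
emcli login -username=sysman

# synchronize the client with the server, then list the discovered targets
emcli sync
emcli get_targets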

4.2 MANAGING OPERATIONS USING CLOUDERA MANAGER


Cloudera Manager is installed on Oracle Big Data Appliance to help you with Cloudera's
Distribution including Apache Hadoop (CDH) operations. Cloudera Manager provides a
single administrative interface to all Oracle Big Data Appliance servers configured as part
of the Hadoop cluster.
Cloudera Manager simplifies the performance of these administrative tasks:
Monitor jobs and services

Start and stop services


Manage security and Kerberos credentials


Monitor user activity


Monitor the health of the system


Monitor performance metrics


Track hardware use (disk, CPU, and RAM)


Cloudera Manager runs on the ResourceManager node (node03) and is available on port
7180.
To use Cloudera Manager:

Open a browser and enter a URL like the following:
http://bda1node03.example.com:7180
In this example, bda1 is the name of the appliance, node03 is the name of the server,
example.com is the domain, and 7180 is the default port number for Cloudera Manager.
Log in with a user name and password for Cloudera Manager. Only a user with
administrative privileges can change the settings. Other Cloudera Manager users can view
the status of Oracle Big Data Appliance.

4.2.1 Monitoring the Status of Oracle Big Data Appliance


In Cloudera Manager, you can choose any of the following pages from the menu bar
across the top of the display:
Home: Provides a graphic overview of activities and links to all services controlled by
Cloudera Manager. See the next figure.

Clusters: Accesses the services on multiple clusters.


Hosts: Monitors the health, disk usage, load, physical memory, swap space, and other
statistics for all servers in the cluster.
Diagnostics: Accesses events and logs. Cloudera Manager collects historical
information about the systems and services. You can search for a particular phrase for a
selected server, service, and time period. You can also select the minimum severity
level of the logged messages included in the search: TRACE, DEBUG, INFO, WARN,
ERROR, or FATAL.
Audits: Displays the audit history log for a selected time range. You can filter the
results by user name, service, or other criteria, and download the log as a CSV file.

Charts: Enables you to view metrics from the Cloudera Manager time-series data store
in a variety of chart types, such as line and bar.

Backup: Accesses snapshot policies and scheduled replications.


Administration: Provides a variety of administrative options, including Settings,
Alerts, Users, and Kerberos.

The next figure shows the Cloudera Manager home page.

4.2.2 Performing Administrative Tasks


As a Cloudera Manager administrator, you can change various properties for monitoring
the health and use of Oracle Big Data Appliance, add users, and set up Kerberos security.
To access Cloudera Manager Administration:

Log in to Cloudera Manager with administrative privileges.

Click Administration, and select a task from the menu.


4.2.3 Managing CDH Services With Cloudera Manager


Cloudera Manager provides the interface for managing these services:
HDFS

Hive

Hue

Oozie

YARN

ZooKeeper

You can use Cloudera Manager to change the configuration of these services, and to stop and
restart them. Additional services are also available; they require configuration before
you can use them.

4.3 USING HADOOP MONITORING UTILITIES


You also have the option of using the native Hadoop utilities. These utilities are read-only
and do not require authentication.
Cloudera Manager provides an easy way to obtain the correct URLs for these utilities. On
the YARN service page, expand the Web UI submenu.

4.3.1 Monitoring MapReduce Jobs


You can monitor MapReduce jobs using the resource manager interface.
To monitor MapReduce jobs:
Open a browser and enter a URL like the following:
http://bda1node03.example.com:8088
In this example, bda1 is the name of the rack, node03 is the name of the server where the
YARN resource manager runs, and 8088 is the default port number for the user interface.
The next figure shows the YARN resource manager interface.
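
The same information is available from the command line on any node with the standard, read-only YARN utilities; the application ID below is a placeholder.

# list the applications currently known to the resource manager
yarn application -list

# show the status of one application (replace the ID with a real one)
yarn application -status application_1400000000000_0001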

4.3.2 Monitoring the Health of HDFS


You can monitor the health of the Hadoop file system by using the DFS health utility on
the first two nodes of a cluster.
To monitor HDFS:

Open a browser and enter a URL like the following:

http://bda1node01.example.com:50070
In this example, bda1 is the name of the rack, node01 is the name of the server where the
dfshealth utility runs, and 50070 is the default port number for the user interface.
The next figure shows the DFS health utility interface.
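
From the command line, similar health information is available with the standard HDFS administration utilities; run them as a user with HDFS administrative privileges if necessary, and note that the path below is only an example.

# summary of configured capacity, space used, and the state of each DataNode
hdfs dfsadmin -report

# check a directory tree for missing or under-replicated blocks
hdfs fsck /user/oracle -files -blocks | tail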

4.4 USING CLOUDERA HUE TO INTERACT WITH HADOOP


Hue runs in a browser and provides an easy-to-use interface to several applications to
support interaction with Hadoop and HDFS. You can use Hue to perform any of the
following tasks:
Query Hive data stores

Create, load, and delete Hive tables


Work with HDFS files and directories


Create, submit, and monitor MapReduce jobs


Monitor MapReduce jobs


Create, edit, and submit workflows using the Oozie dashboard


Manage users and groups


Hue is automatically installed and configured on Oracle Big Data Appliance. It runs on
port 8888 of the ResourceManager node (node03).
To use Hue:

Log in to Cloudera Manager and click the hue service on the Home page.

On the hue page, click Hue Web UI.


Bookmark the Hue URL, so that you can open Hue directly in your browser. The
following URL is an example:

http://bda1node03.example.com:8888
Log in with your Hue credentials.

Oracle Big Data Appliance is not configured initially with any Hue user accounts. The
first user who connects to Hue can log in with any user name and password, and
automatically becomes an administrator. This user can create other user and administrator
accounts.
The next figure shows the Hive Query Editor.

4.5 ABOUT THE ORACLE BIG DATA APPLIANCE SOFTWARE


The following sections identify the software installed on Oracle Big Data Appliance.
Some components operate with Oracle Database 11.2.0.2 and later releases.
This section contains the following topics:
Software Components

Unconfigured Software

Allocating Resources Among Services


4.5.1 Software Components


These software components are installed on all servers in the cluster. Oracle Linux,
required drivers, firmware, and hardware verification utilities are factory installed. All
other software is installed on site. The optional software components may not be
configured in your installation.
You do not need to install additional software on Oracle Big Data Appliance. Doing so
may result in a loss of warranty and support.
Base image software:

Oracle Linux 6.4 (upgrades stay at 5.8) with Oracle Unbreakable Enterprise Kernel
version 2 (UEK2)

Java HotSpot Virtual Machine 7 version 25 (JDK 7u25)


Oracle R Distribution 3.0.1-2


MySQL Database 5.5.35 Advanced Edition


Puppet, firmware, Oracle Big Data Appliance utilities


Oracle InfiniBand software


Mammoth installation:

Cloudera's Distribution including Apache Hadoop Release 5 (5.1.0) including:


Apache Hive 0.12


Apache HBase
Apache Sentry
Apache Spark
Cloudera Impala
Cloudera Search 1.2.0

Cloudera Manager Release 5 (5.1.1) including Cloudera Navigator


Oracle Database Instant Client 12.1


Oracle Big Data SQL (optional)


Oracle NoSQL Database Community Edition or Enterprise Edition 12c Release 1 Version 3.0.5 (optional)

Oracle Big Data Connectors 4.0 (optional):


Oracle SQL Connector for Hadoop Distributed File System (HDFS)


Oracle Loader for Hadoop
Oracle Data Integrator Agent 12.1.3.0
Oracle XQuery for Hadoop

Oracle R Advanced Analytics for Hadoop

The next figure shows the relationships among the major software components of
Oracle Big Data Appliance.

4.5.2 Unconfigured Software


Your Oracle Big Data Appliance license includes all components in Cloudera Enterprise
Data Hub Edition. All CDH components are installed automatically by the Mammoth
utility. Do not download them from the Cloudera website.
However, you must use Cloudera Manager to add the following services before you can
use them:
Apache Flume

Apache HBase

Apache Spark

Apache Sqoop

Cloudera Impala

Cloudera Navigator

Cloudera Search

To add a service:

1. Log in to Cloudera Manager as the admin user.

2. On the Home page, expand the cluster menu in the left panel and choose Add a
Service to open the Add Service wizard. The first page lists the services you can add.

3. Follow the steps of the wizard.


You can find the RPM files on the first server of each cluster in
/opt/oracle/BDAMammoth/bdarepo/RPMS/noarch.

4.5.3 Allocating Resources Among Services


You can allocate resources to each service (HDFS, YARN, Oracle Big Data SQL, Hive,
and so forth) as a percentage of the total resource pool. Cloudera Manager automatically
calculates the recommended resource management settings based on these percentages.
The static service pools isolate services on the cluster, so that a high load on one service
has a limited impact on the other services.
To allocate resources among services:

Log in as admin to Cloudera Manager.

Open the Clusters menu at the top of the page, then select Static Service Pools under
Resource Management.

Select Configuration.

Follow the steps of the wizard, or click Change Settings Directly to edit the current
settings.

4.6 STOPPING AND STARTING ORACLE BIG DATA APPLIANCE



This section describes how to shut down Oracle Big Data Appliance gracefully and restart
it.

Prerequisites

Stopping Oracle Big Data Appliance


Starting Oracle Big Data Appliance


4.6.1 Prerequisites
You must have root access. Passwordless SSH must be set up on the cluster, so that you
can use the dcli utility.
To ensure that passwordless-ssh is set up:

Log in to the first node of the cluster as root.

Use a dcli command to verify it is working. This command should return the IP address
and host name of every node in the cluster:

# dcli -C hostname
192.0.2.1: bda1node01.example.com
192.0.2.2: bda1node02.example.com
.
.
.
If you do not get these results, then set up dcli on the cluster:
# setup-root-ssh -C


4.6.2 Stopping Oracle Big Data Appliance
Follow these procedures to shut down all Oracle Big Data Appliance software and
hardware components.
Note:
The following services stop automatically when the system shuts down. You do not need
to take any action:
Oracle Enterprise Manager agent

Auto Service Request agents


Task 1 Stopping All Managed Services


Use Cloudera Manager to stop the services it manages, including flume,
hbase, hdfs, hive, hue, mapreduce, oozie, and zookeeper.
1. Log in to Cloudera Manager as the admin user.
2. In the Status pane of the opening page, expand the menu for the cluster and
click Stop, and then click Stop again when prompted to confirm. See the next
figure. To navigate to this page, click the Home tab, and then the Status subtab.
3. On the Command Details page, click Close when all processes are stopped.
4. In the same pane under Cloudera Management Services, expand the menu
for the mgmt service and click Stop.
5. Log out of Cloudera Manager.

The next figure shows stopping the HDFS services.

Task 2 Stopping Cloudera Manager Server

Follow this procedure to stop Cloudera Manager Server.


1. Log in as root to the node where Cloudera Manager runs (initially node03).
The remaining tasks presume that you are logged in to a server as root. You can
enter the commands from any server by using the dcli command. This example
runs the pwd command on node03 from any node in the cluster:

# dcli -c node03 pwd


2. Stop the Cloudera Manager server:


# service cloudera-scm-server stop


Stopping cloudera-scm-server: [ OK ]
Verify that the server is stopped:
# service cloudera-scm-server status
cloudera-scm-server is stopped

After stopping Cloudera Manager, you cannot access it using the web
console.

Task 3 Stopping Oracle Data Integrator Agent


If Oracle Data Integrator Application Adapter for Hadoop is installed on
the cluster, then stop the agent.
1. Check the status of the Oracle Data Integrator service:

# dcli -C service odi-agent status


2. Stop the Oracle Data Integrator agent, if it is running:



# dcli -C service odi-agent stop

3. Ensure that the Oracle Data Integrator service stopped running:



# dcli -C service odi-agent status

Task 4 Dismounting NFS Directories


All nodes share an NFS directory on node03, and additional directories
may also exist. If a server with the NFS directory (/opt/exportdir) is
unavailable, then the other servers hang when attempting to shut down.
Thus, you must dismount the NFS directories first.
1. Locate any mounted NFS directories:

# dcli -C mount | grep shareddir


192.0.2.1: bda1node03.example.com:/opt/exportdir on /opt/shareddir type nfs
(rw,tcp,soft,intr,timeo=10,retrans=10,addr=192.0.2.3)
192.0.2.2: bda1node03.example.com:/opt/exportdir on /opt/shareddir type nfs
(rw,tcp,soft,intr,timeo=10,retrans=10,addr=192.0.2.3)
192.0.2.3: /opt/exportdir on /opt/shareddir type none (rw,bind)
.
.

The sample output shows a shared directory on node03 (192.0.2.3).


2. Dismount the shared directory:

# dcli -C umount /opt/shareddir

3. Dismount any custom NFS directories.


Task 5 Stopping the Servers


The Linux shutdown -h command powers down individual servers. You
can use the dcli -g command to stop multiple servers.
1. Create a file that lists the names or IP addresses of the other servers in the
cluster, that is, not including the one you are logged in to.

2. Stop the other servers:



# dcli -g filename shutdown -h now

For filename, enter the name of the file that you created in step 1.
3. Stop the server you are logged in to:


# shutdown -h now

Task 6 Stopping the InfiniBand and Cisco Switches


The network switches do not have power buttons; they shut down only when power is
removed. To stop the switches, turn off all breakers in the two PDUs in the data center.

4.6.3 Starting Oracle Big Data Appliance


Follow these procedures to power up the hardware and start all services on Oracle Big
Data Appliance.
Task 1 Powering Up Oracle Big Data Appliance
1. Switch on all 12 breakers on both PDUs.
2. Allow 4 to 5 minutes for Oracle ILOM and the Linux operating system to
start on the servers.
3. If password-based, on-disk encryption is enabled, then log in and mount the
Hadoop directories on those servers:


$ mount-hadoop-dirs
Enter password to mount Hadoop directories: password

If the servers do not start automatically, then you can start them locally by
pressing the power button on the front of the servers, or remotely by using
Oracle ILOM. Oracle ILOM has several interfaces, including a command-line interface (CLI) and a web console. Use whichever interface you prefer.
For example, you can log in to the web interface as root and start the server
from the Remote Power Control page. The URL for Oracle ILOM is the
same as for the host, except that it typically has a -c or -ilom extension.
This URL connects to Oracle ILOM for bda1node04:
http://bda1node04-ilom.example.com

Task 2 Starting the HDFS Software Services


Use Cloudera Manager to start all the HDFS services that it controls.
1. Log in as root to the node where Cloudera Manager runs (initially node03).
Note:
The remaining tasks presume that you are logged in to a server as root.
You can enter the commands from any server by using the dcli command.
This example runs the pwd command on node03 from any node in the
cluster:

# dcli -c node03 pwd


2. Verify that the Cloudera Manager started automatically on node03:



# service cloudera-scm-server status
cloudera-scm-server (pid 11399) is running

3. If it is not running, then start it:



# service cloudera-scm-server start

4. Log in to Cloudera Manager as the admin user.


5. In the Status pane of the opening page, expand the menu for the cluster and
click Start, and then click Start again when prompted to confirm. See the next
figure.

To navigate to this page, click the Home tab, and then the Status subtab.
6. On the Command Details page, click Close when all processes are started.

7. In the same pane under Cloudera Management Services, expand the menu
for the mgmt service and click Start.

8. Log out of Cloudera Manager (optional).


Task 3 Starting Oracle Data Integrator Agent


If Oracle Data Integrator Application Adapter for Hadoop is used on this
cluster, then start the agent.
1. Check the status of the agent:

# /opt/oracle/odiagent/agent_standalone/oracledi/agent/bin/startcmd.sh OdiPingAgent [-AGENT_NAME=agent_name]

2. Start the agent:

# /opt/oracle/odiagent/agent_standalone/oracledi/agent/bin/agent.sh [-NAME=agent_name] [-PORT=port_number]


4.7 MANAGING ORACLE BIG DATA SQL
Oracle Big Data SQL is registered with Cloudera Manager as an add-on service. You can
use Cloudera Manager to start, stop, and restart the Oracle Big Data SQL service or
individual role instances, the same way as a CDH service.
Cloudera Manager also monitors the health of the Oracle Big Data SQL service, reports
service outages, and sends alerts if the service is not healthy.

4.7.1 Adding and Removing the Oracle Big Data SQL Service
Oracle Big Data SQL is an optional service on Oracle Big Data Appliance. It may be
installed with the other client software during the initial software installation or an
upgrade. Use Cloudera Manager to determine whether it is installed. A separate license is
required; Oracle Big Data SQL is not included with the Oracle Big Data Appliance
license.
You cannot use Cloudera Manager to add or remove the Oracle Big Data SQL service
from a CDH cluster on Oracle Big Data Appliance. Instead, log in to the server where
Mammoth is installed (usually the first node of the cluster) and use the following
commands in the bdacli utility:
To enable Oracle Big Data SQL:
bdacli enable big_data_sql

To disable Oracle Big Data SQL:


bdacli disable big_data_sql

4.7.2 Allocating Resources to Oracle Big Data SQL


You can modify the property values in a Linux kernel Control Group (Cgroup) to reserve
resources for Oracle Big Data SQL.
To modify the resource management configuration settings:

1. Log in as admin to Cloudera Manager.

2. On the Home page, click bigdatasql from the list of services.


3. On the bigdatasql page, click Configuration.


4. Under Category, expand BDS Server Default Group and click Resource
Management.

5. Modify the values of the following properties as required:


Cgroup CPU Shares


Cgroup I/O Weight
Cgroup Memory Soft Limit
Cgroup Memory Hard Limit
6. Click Save Changes.

7. From the Actions menu, click Restart.


The next figure shows the bigdatasql service configuration page.

4.8 SWITCHING FROM YARN TO MAPREDUCE 1


Oracle Big Data Appliance uses the Yet Another Resource Negotiator (YARN)
implementation of MapReduce by default. You have the option of using classic
MapReduce (MRv1) instead. You cannot use both implementations in the same cluster;
you can activate either the MapReduce or the YARN service.
To switch a cluster to MRv1:

1. Log in to Cloudera Manager as the admin user.

2. Stop the YARN service:


a. Locate YARN in the list of services on the Status tab of the Home page.
b. Expand the YARN menu and click Stop.

3. On the cluster menu, click Add a Service to start the Add Service wizard:

a. Select MapReduce for the type of service you want to add.


b. Select hdfs/zookeeper as a dependency (default).
c. Customize the role assignments:

JobTracker: Click the field to display a list of nodes in the cluster. Select the third node.
TaskTracker: For a six-node cluster, keep the TaskTrackers on all nodes (default). For
larger clusters, remove the TaskTrackers from the first two nodes.
d. On the Review Changes page, change the parameter values:

TaskTracker Local Data Directory List: Change the default group and group 1
to /u12/hadoop/mapred../u01/hadoop/mapred.
JobTracker Local Data Directory List: Change the default group to
/u12/hadoop/mapred../u01/hadoop/mapred.
e. Complete the steps of the wizard with no further changes. Click Finish to save
the configuration and exit.

4. Update the Hive service configuration:


a. On the Status tab of the Home page, click hive to display the hive page.
b. Expand the Configuration submenu and click View and Edit.
c. Select mapreduce as the value of the MapReduce Service property.
d. Click Save Changes.

5. Repeat step 4 to update the Oozie service configuration to use mapreduce.


6. On the Status tab of the Home page, expand the hive and oozie menus and choose
Restart.

7. Optional: Expand the yarn service menu and choose Delete.


If you retain the yarn service, then after every cluster restart, you will see Memory
overcommit validation warnings, and you must manually stop the yarn service.
8. Update the MapReduce service configuration:

a. On the Status tab of the Home page, click mapreduce to display the
mapreduce page.

b. Expand the Configuration submenu and click View and Edit.


c. Under Category, expand TaskTracker Default Group, and then click
Resource Management.

d. Set the following properties:


Java Heap Size of TaskTracker in Bytes: Reset to the default value of 1 GiB.
Maximum Number of Simultaneous Map Tasks: Set to either 15 for Sun Fire
X4270 M2 racks or 20 for all other racks.
Maximum Number of Simultaneous Reduce Tasks: Set to either 10 for Sun
Fire X4270 M2 racks or 13 for all other racks.
e. Click Save Changes.

9. Add overrides for nodes 3 and 4 (or nodes 1 and 2 in a six-node cluster).

10. Click the mapreduce1 service to display the mapreduce page:


11. Expand the Actions menu and select Enable High Availability to start the Enable
JobTracker High Availability wizard:

a. On the Assign Roles page, select the fourth node (node04) for the Standby
JobTracker.
b. Complete the steps of the wizard with no further changes. Click Finish to
save the configuration and exit.

12. Verify that all services in the cluster are healthy with no configuration issues.

13. Reconfigure Perfect Balance for the MRv1 cluster:


a. Log in as root to a node of the cluster.


b. Configure Perfect Balance on all nodes of the cluster:
$ dcli -C /opt/oracle/orabalancer-[version]/bin/configure.sh


4.9 SECURITY ON ORACLE BIG DATA APPLIANCE
You can take precautions to prevent unauthorized use of the software and data on Oracle
Big Data Appliance.
This section contains these topics:
1. About Predefined Users and Groups

2. About User Authentication


3. About Fine-Grained Authorization


4. About On-Disk Encryption


5. Port Numbers Used on Oracle Big Data Appliance


6. About Puppet Security


4.9.1 About Predefined Users and Groups


Every open-source package installed on Oracle Big Data Appliance creates one or more
users and groups. Most of these users do not have login privileges, shells, or home
directories. They are used by daemons and are not intended as an interface for individual
users. For example, Hadoop operates as the hdfs user, MapReduce operates as mapred,
and Hive operates as hive.
You can use the oracle identity to run Hadoop and Hive jobs immediately after the Oracle
Big Data Appliance software is installed. This user account has login privileges, a shell,
and a home directory.
Oracle NoSQL Database and Oracle Data Integrator run as the oracle user. Its primary
group is oinstall.
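
For example, immediately after installation you could verify the oracle account and run a quick job with it; the commands are standard Linux and Hadoop utilities, and the HDFS path and Hive statement are hypothetical.

# confirm the account, its groups, and its home directory, then switch to it
id oracle
su - oracle

# as the oracle user: a simple HDFS command and a simple Hive statement
hadoop fs -ls /user/oracle
hive -e "SHOW TABLES;"
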
Do not delete, re-create, or modify the users that are created during installation, because
they are required for the software to operate.
The next table identifies the operating system users and groups that are created
automatically during installation of Oracle Big Data Appliance software for use by CDH
components and other software packages.


User       Group           Used By                                              Login Rights
flume      flume           Apache Flume parent and nodes                        No
hbase      hbase           Apache HBase processes                               No
hdfs       hadoop          NameNode, DataNode                                   No
hive       hive            Hive metastore and server processes                  No
hue        hue             Hue processes                                        No
mapred     hadoop          ResourceManager, NodeManager, Hive Thrift daemon     Yes
mysql      mysql           MySQL server                                         Yes
oozie      oozie           Oozie server                                         No
oracle     dba, oinstall   Oracle NoSQL Database, Oracle Loader for Hadoop,     Yes
                           Oracle Data Integrator, and the Oracle DBA
puppet     puppet          Puppet parent (puppet nodes run as root)             No
sqoop      sqoop           Apache Sqoop metastore                               No
svctag                     Auto Service Request                                 No
zookeeper  zookeeper       ZooKeeper processes                                  No

4.9.2 About User Authentication


Oracle Big Data Appliance supports Kerberos security as a software installation option.

4.9.3 About Fine-Grained Authorization


The typical authorization model on Hadoop is at the HDFS file level, such that users either
have access to all of the data in the file or none. In contrast, Apache Sentry integrates with
the Hive and Impala SQL-query engines to provide fine-grained authorization to data and
metadata stored in Hadoop.
Oracle Big Data Appliance automatically configures Sentry during software installation,
beginning with Mammoth utility version 2.5.

4.9.4 About On-Disk Encryption


On-disk encryption protects data that is at rest on disk. When on-disk encryption is
enabled, Oracle Big Data Appliance automatically encrypts and decrypts data stored on
disk. On-disk encryption does not affect user access to Hadoop data, although it can have
a minor impact on performance.
Password-based encryption encodes Hadoop data based on a password, which is the same
for all servers in a cluster. You can change the password at any time by using the
mammoth-reconfig update command.
If a disk is removed from a server, then the encrypted data remains protected until you
install the disk in a server (the same server or a different one), start up the server, and
provide the password. If a server is powered off and removed from an Oracle Big Data
Appliance rack, then the encrypted data remains protected until you restart the server and
provide the password. You must enter the password after every startup of every server to
enable access to the data.
On-disk encryption is an option that you can select during the initial installation of the
software by the Mammoth utility. You can also enable or disable on-disk encryption at any
time by using either the mammoth-reconfig or bdacli utilities.

4.9.5 Port Numbers Used on Oracle Big Data Appliance


The next table identifies the port numbers that might be used in addition to those used by
CDH.
To view the ports used on a particular server:

1. In Cloudera Manager, click the Hosts tab at the top of the page to display the Hosts
page.

2. In the Name column, click a server link to see its detail page.

3. Scroll down to the Ports section.



Oracle Big Data Appliance Port Numbers

Service                                   Port
Automated Service Monitor (ASM)           30920
HBase master service (node01)             60010
MySQL Database                            3306
Oracle Data Integrator Agent              20910
Oracle NoSQL Database administration      5001
Oracle NoSQL Database processes           5010 to 5020
Oracle NoSQL Database registration        5000
Port map                                  111
Puppet master service                     8140
Puppet node service                       8139
rpc.statd                                 668
ssh                                       22
xinetd (service tag)                      6481
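
You can also check a port directly from the operating system on any server; for example, to confirm which process is listening on the Cloudera Manager port, a standard Linux tool is enough.

# show the listening socket and owning process for port 7180 (run as root)
netstat -tlnp | grep 7180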

4.9.6 About Puppet Security


The puppet node service (puppetd) runs continuously as root on all servers. It listens on
port 8139 for kick requests, which trigger it to request updates from the puppet master.
It does not receive updates on this port.
The puppet master service (puppetmasterd) runs continuously as the puppet user on the
first server of the primary Oracle Big Data Appliance rack. It listens on port 8140 for
requests to push updates to puppet nodes.
The puppet nodes generate and send certificates to the puppet master to register initially
during installation of the software. For updates to the software, the puppet master signals
(kicks) the puppet nodes, which then request all configuration changes from the puppet
master node that they are registered with.
The puppet master sends updates only to puppet nodes that have known, valid certificates.
Puppet nodes only accept updates from the puppet master host name they initially
registered with. Because Oracle Big Data Appliance uses an internal network for
communication within the rack, the puppet master host name resolves using /etc/hosts to
an internal, private IP address.

4.10 AUDITING ORACLE BIG DATA APPLIANCE


You can use Oracle Audit Vault and Database Firewall to create and monitor the audit
trails for HDFS and MapReduce on Oracle Big Data Appliance.
This section describes the Oracle Big Data Appliance plug-in:
1. About Oracle Audit Vault and Database Firewall

2. Setting Up the Oracle Big Data Appliance Plug-in


3. Monitoring Oracle Big Data Appliance


4.10.1 About Oracle Audit Vault and Database Firewall


Oracle Audit Vault and Database Firewall secures databases and other critical components
of IT infrastructure in these key ways:
1. Provides an integrated auditing platform for your enterprise.

2. Captures activity on Oracle Database, Oracle Big Data Appliance, operating systems,
directories, file systems, and so forth.

3. Makes the auditing information available in a single reporting framework so that you
can understand the activities across the enterprise. You do not need to monitor each
system individually; you can view your computer infrastructure as a whole.

Audit Vault Server provides a web-based, graphic user interface for both administrators
and auditors.
You can configure CDH/Hadoop clusters on Oracle Big Data Appliance as secured targets.
The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data
from these services:
1. HDFS: Who makes changes to the file system.

2. Hive DDL: Who makes Hive database changes.


3. MapReduce: Who runs MapReduce jobs that correspond to file access.


4. Oozie workflows: Who runs workflow activities.


The Audit Vault plug-in is an installation option. The Mammoth utility automatically
configures monitoring on Oracle Big Data Appliance as part of the software installation
process.

4.10.2 Setting Up the Oracle Big Data Appliance Plug-in


The Mammoth utility on Oracle Big Data Appliance performs all the steps needed to set up
the plug-in, using information that you provide.
To set up the Audit Vault plug-in for Oracle Big Data Appliance:
1. Ensure that Oracle Audit Vault and Database Firewall Server Release 12.1.1 is up and
running on the same network as Oracle Big Data Appliance.

2. Complete the Audit Vault Plug-in section of Oracle Big Data Appliance Configuration
Generation Utility.

3. Install the Oracle Big Data Appliance software using the Mammoth utility. An Oracle
representative typically performs this step.

You can also add the plug-in at a later time using either bdacli or mammoth-reconfig.
When the software installation is complete, the Audit Vault plug-in is installed on Oracle
Big Data Appliance, and Oracle Audit Vault and Database Firewall is collecting its audit
information. You do not need to perform any other installation steps.

4.10.3 Monitoring Oracle Big Data Appliance


After installing the plug-in, you can monitor Oracle Big Data Appliance the same as any
other secured target. Audit Vault Server collects activity reports automatically.
The following procedure describes one type of monitoring activity.
To view an Oracle Big Data Appliance activity report:
1. Log in to Audit Vault Server as an auditor.

2. Click the Reports tab.


3. Under Built-in Reports, click Audit Reports.


4. To browse all activities, in the Activity Reports list, click the Browse report data icon
for All Activity.

5. Add or remove the filters to list the events. Event names include ACCESS, CREATE,
DELETE, and OPEN.

6. Click the Single row view icon in the first column to see a detailed report.

The next figure shows the beginning of an activity report, which records access to a
Hadoop sequence file.

4.11 COLLECTING DIAGNOSTIC INFORMATION FOR ORACLE CUSTOMER SUPPORT

If you need help from Oracle Support to troubleshoot CDH issues, then you should first
collect diagnostic information using the bdadiag utility with the cm option.
To collect diagnostic information:
1. Log in to an Oracle Big Data Appliance server as root.

2. Run bdadiag with at least the cm option. You can include additional options on the
command line as appropriate.

# bdadiag cm

The command output identifies the name and the location of the diagnostic file.
3. Go to My Oracle Support at http://support.oracle.com.

4. Open a Service Request (SR) if you have not already done so.

5. Upload the bz2 file into the SR. If the file is too large, then upload it to
sftp.oracle.com, as described in the next procedure.

To upload the diagnostics to sftp.oracle.com:


6. Open an SFTP client and connect to sftp.oracle.com. Specify port 2021 and remote
directory /support/incoming/target, where target is the folder name given to you by
Oracle Support.

7. Log in with your Oracle Single Sign-on account and password.


8. Upload the diagnostic file to the new directory.


9. Update the SR with the full path and the file name.

4.12 AUDITING DATA ACCESS ACROSS THE ENTERPRISE


Security has been an important theme across recent Big Data Appliance releases. Our
most recent release includes encryption of data at rest and automatic configuration of
Sentry for data authorization. This is in addition to the security features previously added
to the BDA, including Kerberos-based authentication, network encryption and auditing.
Auditing data access across the enterprise - including databases, operating systems and
Hadoop - is critically important and oftentimes required for SOX, PCI and other
regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database
Firewall delivers comprehensive audit collection, alerting and reporting of activity on an
Oracle Big Data Appliance and Oracle Database 12c.

4.12.1 Configuration
In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.

The Audit Vault Server is deployed to its own secure server and serves as mission control
for auditing. It is used to administer audit policies, configure activities that are tracked on
the secured targets and provide robust audit reporting and alerting. In many ways, Audit
Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources
into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.
For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database
12c monitored targets; these agents are responsible for managing collectors that gather
activity data. This is a secure agent deployment; the Audit Vault Server has a trusted
relationship with each agent. To set up the trusted relationship, the agent makes an
activation request to the Audit Vault Server; this request is then activated (or approved)
by the AV Administrator. The monitored target then applies an AV Server generated Agent
Activation Key to complete the activation.

On the BDA, these installation and configuration steps have all been automated for you.
Using the BDA's Configuration Generation Utility, you simply specify that you would like
to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the
audit data. Mammoth - the BDA's installation tool - uses this information to configure the
audit processing. Specifically, it sets up audit trails across the following services:

HDFS: collects all file access activity
MapReduce: identifies who ran what jobs on the cluster
Oozie: audits who ran what as part of a workflow
Hive: captures changes that were made to the Hive metadata
There is much more flexibility when monitoring the Oracle Database. You can create audit
policies for SQL statements, schema objects, privileges and more. Check out the auditor's
guide for more details. In our demonstration, we kept it simple: we are capturing all select
statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user
and any unsuccessful attempts at selecting from any table in any schema.
Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up
an alert to fire whenever there is suspicious activity attempted over sensitive HR data in
Hadoop:

In the alert definition found above, a critical alert is defined as three unsuccessful attempts
from a given IP address to access data in the HR directory. Alert definitions are extremely
flexible - using any audited field as input into a conditional expression. And, they are
automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via
email to appropriate security administrators.
Now that auditing is configured, we'll generate activity by two different users: oracle and
DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and
how auditors can interrogate that data.

4.12.2 Capturing Activity


The demonstration is driven by a few scripts that generate different types of activity by
both the oracle and DrEvil users. These activities include:

An oozie workflow that removes salary data from HDFS
Numerous HDFS commands that upload files, change file access privileges, copy
files and list the contents of directories and files
Hive commands that query, create, alter and drop tables
Oracle Database commands that connect as different users, create and drop users,
select from tables and delete records from a table
After running the scripts, we log into the Audit Vault Server as an auditor. Immediately,
we see that our alert has been triggered by the users' activity.

Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive
data in HDFS:

Now that we see the alert triggered in the dashboard, let's see what other activity is taking
place on the BDA and in the Oracle Database.

4.12.3 Ad Hoc Reporting


Audit Vault Server delivers rich reporting capabilities that enable you to better
understand the activity that has taken place across the enterprise. In addition to the
numerous reports that are delivered out of box with Audit Vault, you can create your own
custom reports that meet your own personal needs. Here, we are looking at a BDA
monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

As you can see, the report tells you all of the key elements required to understand: 1)
when the activity took place, 2) the source service for the event, 3) what object was
referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the
IP address (or host) that initiated the event, and 7) how the object was modified or
accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed
attempts to open the sensitive salaries.txt file.
Notice, events may be related to one another. The Hive command ALTER TABLE
my_salarys RENAME TO my_salaries will generate two events. The first event is
sourced from the Metastore; the alter table command is captured and the metadata
definition is updated. The Hive command also impacts HDFS; the table name is
represented by an HDFS folder. Therefore, an HDFS event is logged that renames the
my_salarys folder to my_salaries.
Next, consider an Oozie workflow that performs a simple task: delete a file salaries2.txt
in HDFS. This Oozie workflow generates the following events:


1. First, an Oozie workflow event is generated indicating the start of the workflow.
2. The workflow definition is read from the workflow.xml file found in HDFS.
3. An Oozie working directory is created
4. The salaries2.txt file is deleted from HDFS
5. Oozie runs its clean-up process
The Audit Vault reports are able to reveal all of the underlying activity that is executed by
the Oozie workflow. Its flexible reporting allows you to sequence these independent
events into a logical series of related activities.
The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle
Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few
unsuccessful attempts to access sensitive salary data in HDFS. But, what other
unsuccessful events have occurred recently across our data platform? We'll use Audit
Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to
search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's
find all failed access attempts across both the BDA and the Oracle Database within the last
two hours:

Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access
sensitive salary data not only in HDFS - but also in the Oracle Database.

4.12.4 Summary
Security and integration with the rest of the Oracle ecosystem are two table stakes that are
critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's
auditing of data across the BDA, databases, and operating systems epitomizes this goal, providing a single repository and reporting environment for all your audit data.

Chapter 5.

ORACLE BIG DATA SQL

5.1 INTRODUCTION
Big Data SQL is Oracle's unique approach to providing unified query over data in Oracle
Database, Hadoop, and select NoSQL datastores. Oracle Big Data SQL supports queries
against vast amounts of big data stored in multiple data sources, including Hadoop. You
can view and analyze data from various data stores together, as if it were all stored in an
Oracle database.
Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the
complete SQL syntax. You can execute the most complex SQL SELECT statements
against data in Hadoop, either manually or using your existing applications, to tease out
the most significant insights.
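
For example, once Hadoop data has been exposed to the database as external tables, a single statement can join it with relational data. The account and table names below are hypothetical; the point is only that the syntax is ordinary Oracle SQL issued from a standard client.

sqlplus bduser/password <<'EOF'
-- movie_log_hdfs is an external table over data in Hadoop;
-- customers is an ordinary Oracle Database table
SELECT c.segment, COUNT(*) AS views
FROM   movie_log_hdfs m JOIN customers c ON (m.cust_id = c.cust_id)
WHERE  m.activity = 'PLAY'
GROUP  BY c.segment
ORDER  BY views DESC;
EXIT
EOF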

5.2 SQL ON HADOOP


As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a
proliferation of solutions in the last 18 months, and just as large a proliferation of press.
From good ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have
SQL-on-Hadoop any way you like it. It does, however, prompt the question: Why SQL?
There's an argument to be made for SQL simply being a form of skill reuse. If people and
tools already speak SQL, then give the people what they know. In truth, that argument
falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves. If
there were a better language for querying Big Data, the community would have turned it
up by now.
The reality is that the SQL language endures because it is uniquely suited to querying
datasets. Consider, SQL is a declarative language for operating on relations in data. Its a
domain-specific language where the domain is datasets. In and of itself, thats powerful:
having language elements like FROM, WHERE and GROUP BY make reasoning about
datasets simpler. Its set theory set into a programming language.
It goes beyond just the language itself. SQL is declarative, which means I only have to
reason about the shape of the result I want, not the data access mechanisms to get there,
the join algorithms to apply, how to serialize partial aggregations, and so on. SQL lets us
think about answers, which lets us get more done.
SQL on Hadoop, then, is somewhat obvious. As data gets bigger, we would prefer to only
have to reason about answers.

5.3 SQL ON MORE THAN HADOOP


For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.
Specifically, data rarely lives in a single place. Indeed, if Big Data is causing a
proliferation of new ways to store and process data, then there are likely more places to
store data than ever before. If SQL on Hadoop is separate from SQL on a DBMS, we run the
risk of constructing every IT architect's least favorite solution: the stovepipe.
If we want to avoid stovepipes, what we really need is the ability to run SQL queries that
work seamlessly across multiple datastores. Ideally, in a Big Data world, SQL should
"play data where it lies," using the declarative power of the language to provide answers
from all data.
This is why we think Oracle Big Data SQL is obvious too.
It's just a little more complicated than SQL on any one thing. To pull it off, we have to do
a few things:

Maintain the valuable characteristics of the system storing the data
Unify metadata to understand how to execute queries
Optimize execution to take advantage of the systems storing the data
For the case of a relational database, we might say that the valuable storage characteristics
include things like: straight-through processing, change-data logging, fine-grained access
controls, and a host of other things.
For Hadoop, the two most valuable storage characteristics are scalability and schema-on-read.
Cost-effective scalability is one of the first things that people look to HDFS for, so
any solution that does SQL over a relational database and Hadoop has to understand how
HDFS scales and distributes data. Schema-on-read is at least equally important, if not
more so. As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop
tremendous power: dump data into HDFS, and access it without having to convert it to a
specific format. So, then, any solution that does SQL over a relational database and
Hadoop is going to have to respect the schemas of the database, but also be able to apply
schema-on-read principles to data stored in Hadoop.
Oracle Big Data SQL maintains all of these valuable characteristics, and it does it
specifically through the approaches taken for unifying metadata and optimizing
performance.

5.4 UNIFYING METADATA


To unify metadata for planning and executing SQL queries, we require a catalog of some
sort. What tables do I have? What are their column names and types? Are there special
options defined on the tables? Who can see which data in these tables?
Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata
using Oracle Database: specifically as external tables. Tables in Hadoop or NoSQL
databases are defined as external tables in Oracle. This makes sense, given that the data is
external to the DBMS.
Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?
Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the
valuable characteristics of Hadoop. The difficulty with most external tables is that they are
designed to work on flat, fixed-definition files, not distributed data which is intended to be
consumed through dynamically invoked readers. That causes both poor parallelism and
removes the value of schema-on-read.
The external tables Big Data SQL presents are different. They leverage the Hive metastore
or user definitions to determine both parallelism and read semantics. That means that if a
file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be
read in parallel. If the data was stored in a SequenceFile using a binary SerDe, or as
Parquet data, or as Avro, that is how the data is read. Big Data SQL uses the exact same
InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data
from HDFS.
Once that data is read, we need only join it with internal data to provide SQL across
Hadoop and a relational database.

5.5 OPTIMIZING PERFORMANCE


Being able to join data from Hadoop with Oracle Database is a feat in and of itself.
However, given the size of data in Hadoop, it ends up being a lot of data to shift around. In
order to optimize performance, we must take advantage of what each system can do.
In the days before data was officially Big, Oracle faced a similar challenge when
optimizing Exadata, our then-new database appliance. Since many databases are
connected to shared storage, at some point database scan operations can become bound on
the network between the storage and the database, or on the shared storage system itself.
The solution the group proposed was remarkably similar to much of the ethos that infuses
MapReduce and Apache Spark: move the work to the data and minimize data movement.
The effect is striking: minimizing data movement by an order of magnitude often yields
performance increases of an order of magnitude.
Big Data SQL takes a play from both the Exadata and Hadoop books to optimize
performance: it moves work to the data and radically minimizes data movement. It does
this via something we call Smart Scan for Hadoop.
Moving the work to the data is straightforward. Smart Scan for Hadoop introduces a new
service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and
YARN NodeManagers. Queries from the new external tables are sent to these services to
ensure that reads are direct path and data-local. Reading close to the data speeds up I/O,
but minimizing data movement requires that Smart Scan do some things that are, well,
smart.

5.6 SMART SCAN FOR HADOOP


Consider this: most queries don't select all columns, and most queries have some kind of
predicate on them. Moving unneeded columns and rows is, by definition, excess data
movement that impedes performance. Smart Scan for Hadoop gets rid of this excess
movement, which in turn radically improves performance.
For example, suppose we were querying 100s of TB of JSON data stored in HDFS,
but only cared about a few fields (email and status) and only wanted results from the
state of Texas.
Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading. It
applies parsing functions to our JSON data and discards any documents which do not contain
"TX" for the state attribute. Then, for those documents which do match, it projects out only
the email and status attributes to merge with the rest of the data. Rather than moving every
field for every document, we're able to cut down 100s of TB to 100s of GB.
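
To make that concrete, here is a minimal sketch of such a query, assuming a Big Data SQL external table named movie_log_json over the JSON documents with email, status and state attributes (the names are purely illustrative):

SELECT email, status
FROM movie_log_json
WHERE state = 'TX';

With Smart Scan for Hadoop, the state predicate and the projection of email and status are applied on the DataNodes, so only the matching values ever cross the network to the database.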
The approach we take to optimizing performance with Big Data SQL makes Big Data
much slimmer.
So, there you have it: fast queries which join data in Oracle Database with data in Hadoop
while preserving what makes each system a valuable part of overall information
architectures. Big Data SQL unifies metadata, such that data sources can be queried with
the best possible parallelism and the correct read semantics. Big Data SQL optimizes
performance using approaches inspired by Exadata: filtering out irrelevant data before it
can become a bottleneck.
It's SQL that plays data where it lies, letting you place data where you think it belongs.

5.7 ORACLE SQL DEVELOPER & DATA MODELER SUPPORT FOR ORACLE BIG DATA SQL
Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle
Big Data SQL. The tools allow you to connect to Hive, use the SQL Worksheet to query,
create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle
external tables that dynamically access data sources defined in the Hive metastore.
Let's take a look at what it takes to get started and then preview this new capability.

5.7.1 Setting up Connections to Hive


The first thing you need to do is set up a JDBC connection to Hive. Follow these steps to
set up the connection:
DOWNLOAD AND UNZIP JDBC DRIVERS
Cloudera provides high performance JDBC drivers that are required for connectivity:

Download the Hive Drivers from the Cloudera Downloads page to a local directory
Unzip the archive

unzip Cloudera_HiveJDBC_2.5.4.1006.zip

Two zip files are contained within the archive. Unzip the JDBC4 archive to a target
directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below):

unzip Cloudera_HiveJDBC4_2.5.4.1006.zip -d /home/oracle/jdbc/

Now that the JDBC drivers have been extracted, update SQL Developer to use the new
drivers.
UPDATE SQL DEVELOPER TO USE THE CLOUDERA HIVE JDBC DRIVERS
Update the preferences in SQL Developer to leverage the new drivers:

Start SQL Developer
Go to Tools -> Preferences
Navigate to Database -> Third Party JDBC Drivers
Add all of the jar files contained in the zip to the Third-party JDBC Driver Path. It
should look like the picture below:


Restart SQL Developer
CREATE A CONNECTION
Now that SQL Developer is configured to access Hive, let's create a connection to
Hiveserver2. Click the New Connection button in the SQL Developer toolbar. You'll need
to have an ID, password and the port where Hiveserver2 is running:

The example above is creating a connection called hive which connects to Hiveserver2 on
localhost running on port 10000. The Database field is optional; here we are specifying the
default database.
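
If you want to verify the connection details outside of the dialog, the equivalent HiveServer2 JDBC URL generally takes the following form (the host, port and database are simply the values used in this example):

jdbc:hive2://localhost:10000/default

The Cloudera driver documentation describes additional URL properties (for example, for Kerberos authentication) that you can append if your cluster requires them.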

5.7.2 Using the Hive Connection


The Hive connection is now treated like any other connection in SQL Developer. The
tables are organized into Hive databases; you can review the table's data, properties,
partitions, indexes, details and DDL:

And you can use the SQL Worksheet to run custom queries and perform DDL operations - whatever is supported in Hive:

Here, we've altered the definition of a Hive table and then queried that table in the
worksheet.
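
As a small sketch of that workflow (the table and column names below are illustrative, not taken from the screenshots), a worksheet session might look like this:

ALTER TABLE web_logs ADD COLUMNS (referrer STRING);
SELECT referrer, COUNT(*) FROM web_logs GROUP BY referrer;

Both statements are ordinary HiveQL; SQL Developer simply passes them through the JDBC connection to Hiveserver2.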

5.7.3 Create Big Data SQL-enabled Tables Using Oracle Data Modeler
Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.
Let's create a few tables using the metadata from the Hive Metastore. Invoke the import
wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item. You
will see the same connections found in the SQL Developer connection navigator:

After selecting the hive connection and a database, select the tables to import:

There could be any number of tables here - in our case we will select three tables to
import. After completing the import, the logical table definitions appear in our palette:

You can update the logical table definitions - and in our case we will want to do so. For
example, the recommended column in Hive is defined as a string (i.e. there is no
precision) - which the Data Modeler casts as a varchar2(4000). We have domain
knowledge and understand that this field is really much smaller - so we'll update it to the
appropriate size:

Now that we're comfortable with the table definitions, let's generate the DDL and create
the tables in Oracle Database 12c. Use the Data Modeler DDL Preview to generate the
DDL for those tables - and then apply the definitions in the Oracle Database SQL
Worksheet:

5.7.4 Edit the Table Definitions


The SQL Developer table editor has been updated so that it now understands all of the
properties that control Big Data SQL external table processing. For example, edit table
movieapp_log_json:

You can update the source cluster for the data, how invalid records should be processed,
how to map Hive table columns to the corresponding Oracle table columns (if they don't
match), and much more.

5.7.5 Query All Your Data


You now have full Oracle SQL access to data across the platform. In our example, we can
combine data from Hadoop with data in our Oracle Database. The data in Hadoop can be
in any format - Avro, JSON, XML, CSV - as long as there is a SerDe that can parse the data,
Big Data SQL can access it! Below, we're combining click data from the JSON-based movie
application log with data in our Oracle Database tables to determine how the company's
customers rate blockbuster movies:
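
Such a query could look roughly like the following, assuming the Big Data SQL external table movieapp_log_json over the Hive click data and ordinary Oracle tables named movie and movie_rating (the join columns here are illustrative):

SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(*) AS clicks
FROM movieapp_log_json j, movie m, movie_rating r
WHERE j.movieid = m.movie_id
  AND r.movie_id = m.movie_id
GROUP BY m.title
ORDER BY avg_rating DESC;

Because movieapp_log_json is just another table to the optimizer, the join, grouping and ordering are written exactly as they would be for purely relational data.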

5.8 USING ORACLE BIG DATA SQL FOR DATA ACCESS


Oracle Big Data SQL supports queries against vast amounts of big data stored in multiple
data sources, including Hadoop. You can view and analyze data from various data stores
together, as if it were all stored in an Oracle database.
Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the
complete SQL syntax. You can execute the most complex SQL SELECT statements
against data in Hadoop, either manually or using your existing applications, to tease out
the most significant insights.

5.8.1 About Oracle External Tables


Oracle Big Data SQL provides external tables with next generation performance gains. An
external table is an Oracle Database object that identifies and describes the location of
data outside of a database. You can query an external table using the same SQL SELECT
syntax that you use for any other database tables.
External tables use access drivers to parse the data outside the database. Each type of
external data requires a unique access driver. This release of Oracle Big Data SQL
includes two access drivers for big data: one for accessing data stored in Apache Hive, and
the other for accessing data stored in Hadoop Distributed File System (HDFS) files.

5.8.2 About the Access Drivers for Oracle Big Data SQL
By querying external tables, you can access data stored in HDFS and Hive tables as if that
data was stored in tables in an Oracle database. Oracle Database accesses the data by
using the metadata provided when the external table was created.
Oracle Database 12.1.0.2 supports two new access drivers for Oracle Big Data SQL:
1. ORACLE_HIVE: Enables you to create Oracle external tables over Apache Hive data
sources. Use this access driver when you already have Hive tables defined for your
HDFS data sources. ORACLE_HIVE can also access data stored in other locations,
such as HBase, that have Hive tables defined for them.

2.

ORACLE_HDFS: Enables you to create Oracle external tables directly over files
stored in HDFS. This access driver uses Hive syntax to describe a data source,
assigning default column names of COL_1, COL_2, and so forth. You do not need to
create a Hive table manually as a separate step.

Instead of acquiring the metadata from a Hive metadata store the way that
ORACLE_HIVE does, the ORACLE_HDFS access driver acquires all of the necessary
information from the access parameters. The ORACLE_HDFS access parameters are
required to specify the metadata, and are stored as part of the external table definition in
Oracle Database.
Oracle Big Data SQL uses these access drivers to optimize query performance.

5.8.3 About Smart Scan Technology


External tables do not have traditional indexes, so queries against them typically
require a full table scan. However, Oracle Big Data SQL extends Smart Scan capabilities,
such as filter-predicate offloads, to Oracle external tables with the installation of Exadata
storage server software on Oracle Big Data Appliance. This technology enables Oracle
Big Data Appliance to discard a huge portion of irrelevant data (up to 99 percent of the
total) and return much smaller result sets to Oracle Exadata Database Machine. End users
obtain the results of their queries significantly faster, as the direct result of a reduced load
on Oracle Database and reduced traffic on the network.

5.8.4 About Data Security with Oracle Big Data SQL


Oracle Big Data Appliance already provides numerous security features to protect data
stored in a CDH cluster on Oracle Big Data Appliance:
1. Kerberos authentication: Requires users and client software to provide credentials before accessing the cluster.
2. Apache Sentry authorization: Provides fine-grained, role-based authorization to data and metadata.
3. On-disk encryption: Protects the data on disk and at rest. For normal user access, the data is automatically decrypted.
4. Oracle Audit Vault and Database Firewall monitoring: The Audit Vault plug-in on Oracle Big Data Appliance collects audit and logging data from MapReduce, HDFS, and Oozie services. You can then use Audit Vault Server to monitor these services on Oracle Big Data Appliance.

Oracle Big Data SQL adds the full range of Oracle Database security features to this list.
You can apply the same security policies and rules to your Hadoop data that you apply to
your relational data.

5.9 INSTALLING ORACLE BIG DATA SQL


Oracle Big Data SQL is available only on Oracle Exadata Database Machine connected to
Oracle Big Data Appliance. You must install the Oracle Big Data SQL software on both
systems.

5.9.1 Prerequisites for Using Oracle Big Data SQL


Oracle Exadata Database Machine must comply with the following requirements:
1. Compute servers run Oracle Database and Oracle Enterprise Manager Grid Control 12.1.0.2 or later.
2. Storage servers run Exadata storage server software 12.1.1.1 or 12.1.1.0.
3. Oracle Exadata Database Machine is configured on the same InfiniBand subnet as Oracle Big Data Appliance.
4. Oracle Exadata Database Machine is connected to Oracle Big Data Appliance by the InfiniBand network.

5.9.2 Performing the Installation


Take these steps to install the Oracle Big Data SQL software on Oracle Big Data
Appliance and Oracle Exadata Database Machine:
1. Download My Oracle Support one-off patch 19377855 for RDBMS 12.1.0.2.
2. On Oracle Exadata Database Machine, install patch 19377855 on:
   Oracle Enterprise Manager Grid Control home
   Oracle Database homes
3. On Oracle Big Data Appliance, install or upgrade the software to the latest version.
4. You can select Oracle Big Data SQL as an installation option when using the Oracle Big Data Appliance Configuration Generation Utility.
5. Download and install Mammoth patch 18064328 from Oracle Automated Release Updates.
6. If Oracle Big Data SQL is not enabled during the installation, then use the bdacli utility:

# bdacli enable big_data_sql

7. On Oracle Exadata Database Machine, run the post-installation script.
8. You can use Cloudera Manager to verify that Oracle Big Data SQL is up and running.

5.9.3 Running the Post-Installation Script for Oracle Big Data SQL
To run the Oracle Big Data SQL post-installation script:
1. On Oracle Exadata Database Machine, ensure that the Oracle Database listener is
running and listening on an interprocess communication (IPC) interface.

2. Verify the name of the Oracle installation owner. Typically, the oracle user owns the
installation.

3. Verify that the same user name (such as oracle) exists on Oracle Big Data Appliance.

4. Download the bds-exa-install.sh installation script from the node where Mammoth is
installed, typically the first node in the cluster. You can use a command such as wget or
curl. This example copies the script from bda1node07:

wget http://bda1node07/bda/bds-exa-install.sh

5. As root, run the script and pass it the system identifier (SID). In this example, the SID
is orcl:


./bds-exa-install.sh oracle_sid=orcl

6. Repeat step 5 for each database instance.


When the script completes, Oracle Big Data SQL is running on the database instance.
However, if events cause the Oracle Big Data SQL agent to stop, then you must restart it.

5.9.4 Running the bds-exa-install Script


The bds-exa-install script generates a custom installation script that is run by the owner of
the Oracle home directory. That secondary script installs all the files needed by Oracle Big
Data SQL into the $ORACLE_HOME/bigdatasql directory. It also creates the database
directory objects and the database links for the multithreaded Oracle Big Data SQL agent.
If the operating system user who owns the Oracle home is not named oracle, then use the
--install-user option to specify the owner.
Alternatively, you can use the --generate-only option to create the secondary script, and
then run it as the owner of $ORACLE_HOME.

5.9.5 bds-exa-install Syntax


The following is the bds-exa-install syntax:
./bds-exa-install.sh oracle_sid=name [option]

The option names are preceded by two hyphens (--):

--generate-only={true | false}
Set to true to generate the secondary script but not run it, or false to generate and run it in
one step (default).
--install-user=user_name
The operating system user who owns the Oracle Database installation. The default value
is oracle.

5.10 CREATING EXTERNAL TABLES FOR ACCESSING BIG DATA


The SQL CREATE TABLE statement has a clause specifically for creating external tables.
The information that you provide in this clause enables the access driver to read data from
an external source and prepare the data for the external table.

5.10.1 About the Basic CREATE TABLE Syntax


The following is the basic syntax of the CREATE TABLE statement for external tables:
CREATE TABLE table_name (column_name datatype,
                         column_name datatype[,...])
ORGANIZATION EXTERNAL (external_table_clause);

You specify the column names and data types the same as for any other table.
ORGANIZATION EXTERNAL identifies the table as an external table.
The external_table_clause identifies the access driver and provides the information that it
needs to load the data. See About the External Table Clause.

5.10.2 Creating an External Table for a Hive Table


You can easily create an Oracle external table for data in Apache Hive. Because the
metadata is available to Oracle Database, you can query the data dictionary for
information about Hive tables. Then you can use a PL/SQL function to generate a basic
SQL CREATE TABLE ... ORGANIZATION EXTERNAL statement. You can then modify
the statement before execution to customize the external table.

5.10.3 Obtaining Information About a Hive Table


The DBMS_HADOOP PL/SQL package contains a function named
CREATE_EXTDDL_FOR_HIVE, which returns the data definition language (DDL) for an
external table. This function requires you to provide basic information about the Hive table:
Name of the Hadoop cluster
Name of the Hive database
Name of the Hive table
Whether the Hive table is partitioned

You can obtain this information by querying the ALL_HIVE_TABLES data dictionary
view. It displays information about all Hive tables that you can access from Oracle
Database.
This example shows that the current user has access to an unpartitioned Hive table named
RATINGS_HIVE_TABLE in the default database. A user named JDOE is the owner.
SQL> SELECT cluster_id, database_name, owner, table_name, partitioned FROM all_hive_tables;

CLUSTER_ID  DATABASE_NAME  OWNER  TABLE_NAME          PARTITIONED
----------  -------------  -----  ------------------  --------------
hadoop1     default        jdoe   ratings_hive_table  UN-PARTITIONED

You can query these data dictionary views to discover more information about the Hive tables that are accessible from Oracle Database.
5.10.4 Using the CREATE_EXTDDL_FOR_HIVE Function


With the information from the data dictionary, you can use the
CREATE_EXTDDL_FOR_HIVE function of DBMS_HADOOP. This example specifies a
database table name of RATINGS_DB_TABLE in the current schema. The function
returns the text of the CREATE TABLE command in a local variable named DDLout, but
does not execute it.
DECLARE
   DDLout VARCHAR2(4000);
BEGIN
   dbms_hadoop.create_extddl_for_hive(
      CLUSTER_ID=>'hadoop1',
      DB_NAME=>'default',
      HIVE_TABLE_NAME=>'ratings_hive_table',
      HIVE_PARTITION=>FALSE,
      TABLE_NAME=>'ratings_db_table',
      PERFORM_DDL=>FALSE,
      TEXT_OF_DDL=>DDLout
   );
   dbms_output.put_line(DDLout);
END;
/

When this procedure runs, the PUT_LINE function displays the CREATE TABLE
command:
CREATE TABLE ratings_db_table (
c0 VARCHAR2(4000),
c1 VARCHAR2(4000),
c2 VARCHAR2(4000),
c3 VARCHAR2(4000),
c4 VARCHAR2(4000),
c5 VARCHAR2(4000),
c6 VARCHAR2(4000),
c7 VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.cluster=hadoop1
com.oracle.bigdata.tablename=default.ratings_hive_table
)
) PARALLEL 2 REJECT LIMIT UNLIMITED

You can capture this information in a SQL script, and use the access parameters to change
the Oracle table name, the column names, and the data types as desired before executing
it. You might also use access parameters to specify a date format mask.
The ALL_HIVE_COLUMNS view shows how the default column names and data types
are derived. This example shows that the Hive column names are C0 to C7, and that the
Hive STRING data type maps to VARCHAR2(4000):

SQL> SELECT table_name, column_name, hive_column_type, oracle_column_type FROM all_hive_columns;

TABLE_NAME          COLUMN_NAME  HIVE_COLUMN_TYPE  ORACLE_COLUMN_TYPE
------------------  -----------  ----------------  ------------------
ratings_hive_table  c0           string            VARCHAR2(4000)
ratings_hive_table  c1           string            VARCHAR2(4000)
ratings_hive_table  c2           string            VARCHAR2(4000)
ratings_hive_table  c3           string            VARCHAR2(4000)
ratings_hive_table  c4           string            VARCHAR2(4000)
ratings_hive_table  c5           string            VARCHAR2(4000)
ratings_hive_table  c6           string            VARCHAR2(4000)
ratings_hive_table  c7           string            VARCHAR2(4000)

8 rows selected.

5.10.5 Developing a CREATE TABLE Statement for ORACLE_HIVE


You can choose between using DBMS_HADOOP and developing a CREATE TABLE
statement from scratch. In either case, you may need to set some access parameters to
modify the default behavior of ORACLE_HIVE.
Using the Default ORACLE_HIVE Settings
The following statement creates an external table named ORDER to access Hive data:

CREATE TABLE order (cust_num VARCHAR2(10),


order_num VARCHAR2(20),
description VARCHAR2(100),
order_total NUMBER (8,2))
ORGANIZATION EXTERNAL (TYPE oracle_hive);

Because no access parameters are set in the statement, the ORACLE_HIVE access driver
uses the default settings to do the following:
Connects to the default Hadoop cluster.

Uses a Hive table named order. An error results if the Hive table does not have fields
named CUST_NUM, ORDER_NUM, DESCRIPTION, and ORDER_TOTAL.

Sets the value of a field to NULL if there is a conversion error, such as a CUST_NUM
value longer than 10 bytes.

Overriding the Default ORACLE_HIVE Settings

You can set properties in the ACCESS PARAMETERS clause of the external table clause,
which override the default behavior of the access driver. The following clause includes the
com.oracle.bigdata.overflow access parameter. When this clause is used in the previous
example, it truncates the data for the DESCRIPTION column that is longer than 100
characters, instead of throwing an error:
(TYPE oracle_hive
 ACCESS PARAMETERS (
   com.oracle.bigdata.overflow={"action":"truncate", "col":"DESCRIPTION"} ))

The next example sets most of the available parameters for ORACLE_HIVE:

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
ORGANIZATION EXTERNAL
(TYPE oracle_hive
 ACCESS PARAMETERS (
   com.oracle.bigdata.tablename: order_db.order_summary
   com.oracle.bigdata.colmap: {"col":"ITEM_CNT", \
                               "field":"order_line_item_count"}
   com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                 "col":"DESCRIPTION"}
   com.oracle.bigdata.erroropt: [{"action":"replace", \
                                  "value":"INVALID_NUM", \
                                  "col":["CUST_NUM","ORDER_NUM"]}, \
                                 {"action":"reject", \
                                  "col":"ORDER_TOTAL"}]
 )
);

The parameters make the following changes in the way that the ORACLE_HIVE access
driver locates the data and handles error conditions:

com.oracle.bigdata.tablename: Handles differences in table names. ORACLE_HIVE looks for a Hive table named ORDER_SUMMARY in the ORDER_DB database.

com.oracle.bigdata.colmap: Handles differences in column names. The Hive ORDER_LINE_ITEM_COUNT field maps to the Oracle ITEM_CNT column.

com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.

5.10.6 Creating an External Table for HDFS Files


The ORACLE_HDFS access driver enables you to access many types of data that are
stored in HDFS, but which do not have Hive metadata. You can define the record format
of text data, or you can specify a SerDe for a particular data format.
You must create the external table for HDFS files manually, and provide all the
information the access driver needs to locate the data, and parse the records and fields.
The following are some examples of CREATE TABLE ORGANIZATION EXTERNAL
statements.

5.10.7 Using the Default Access Parameters with ORACLE_HDFS


The following statement creates a table named ORDER to access the data in all files
stored in the /usr/cust/summary directory in HDFS:
CREATE TABLE ORDER (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_total NUMBER(8,2))
ORGANIZATION EXTERNAL (TYPE oracle_hdfs)
LOCATION ('hdfs:/usr/cust/summary/*');

Because no access parameters are set in the statement, the ORACLE_HDFS access driver
uses the default settings to do the following:
Connects to the default Hadoop cluster.

Reads the files as delimited text, and the fields as type STRING.

Assumes that the number of fields in the HDFS files matches the number of columns
(three in this example).

Assumes the fields are in the same order as the columns, so that CUST_NUM data is in
the first field, ORDER_NUM data is in the second field, and ORDER_TOTAL data is
in the third field.

Rejects any records in which the value causes a data conversion error: If the value for
CUST_NUM exceeds 10 characters, the value for ORDER_NUM exceeds 20
characters, or the value of ORDER_TOTAL cannot be converted to NUMBER.

5.10.8 Overriding the Default ORACLE_HDFS Settings


You can use many of the same access parameters with ORACLE_HDFS as with
ORACLE_HIVE.
Accessing a Delimited Text File

The following example is equivalent to the one shown in Overriding the Default
ORACLE_HIVE Settings. The external table accesses a delimited text file stored in HDFS.

CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
ORGANIZATION EXTERNAL
(TYPE oracle_hdfs
 ACCESS PARAMETERS (
   com.oracle.bigdata.colmap: {"col":"item_cnt", \
                               "field":"order_line_item_count"}
   com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                 "col":"DESCRIPTION"}
   com.oracle.bigdata.erroropt: [{"action":"replace", \
                                  "value":"INVALID_NUM", \
                                  "col":["CUST_NUM","ORDER_NUM"]}, \
                                 {"action":"reject", \
                                  "col":"ORDER_TOTAL"}]
 )
 LOCATION ('hdfs:/usr/cust/summary/*'));

The parameters make the following changes in the way that the ORACLE_HDFS access
driver locates the data and handles error conditions:

com.oracle.bigdata.colmap: Handles differences in column names. ORDER_LINE_ITEM_COUNT in the HDFS files matches the ITEM_CNT column in the external table.

com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters for the DESCRIPTION column are truncated.

com.oracle.bigdata.erroropt: Replaces bad data. Errors in the data for CUST_NUM or ORDER_NUM set the value to INVALID_NUM.

5.10.9 Accessing Avro Container Files


The next example uses a SerDe to access Avro container files.
CREATE TABLE order (cust_num VARCHAR2(10),
                    order_num VARCHAR2(20),
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total NUMBER(8,2))
ORGANIZATION EXTERNAL
(TYPE oracle_hdfs
 ACCESS PARAMETERS (
   com.oracle.bigdata.rowformat: \
     SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
   com.oracle.bigdata.fileformat: \
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
   com.oracle.bigdata.colmap: {"col":"item_cnt", \
                               "field":"order_line_item_count"}
   com.oracle.bigdata.overflow: {"action":"TRUNCATE", \
                                 "col":"DESCRIPTION"}
 )
 LOCATION ('hdfs:/usr/cust/summary/*'));

The access parameters provide the following information to the ORACLE_HDFS access
driver:

com.oracle.bigdata.rowformat: Identifies the SerDe that the access driver needs to use
to parse the records and fields. The files are not in delimited text format.

com.oracle.bigdata.fileformat: Identifies the Java classes that can extract records and
output them in the desired format.
com.oracle.bigdata.colmap: Handles differences in column names. ORACLE_HDFS
matches ORDER_LINE_ITEM_COUNT in the HDFS files with the ITEM_CNT
column in the external table.
com.oracle.bigdata.overflow: Truncates string data. Values longer than 100 characters
for the DESCRIPTION column are truncated.

5.11 ABOUT THE EXTERNAL TABLE CLAUSE


CREATE TABLE ORGANIZATION EXTERNAL takes the external_table_clause as its
argument. It has the following subclauses:
TYPE Clause

DEFAULT DIRECTORY Clause


LOCATION Clause

REJECT LIMIT Clause


ORACLE_HIVE Access Parameters


5.11.1 TYPE Clause


The TYPE clause identifies the access driver. The type of access driver determines how
the other parts of the external table definition are interpreted.
Specify one of the following values for Oracle Big Data SQL:
ORACLE_HDFS: Accesses files in an HDFS directory.

ORACLE_HIVE: Accesses a Hive table.


The ORACLE_DATAPUMP and ORACLE_LOADER access drivers are not associated with Oracle Big Data SQL.

5.11.2 DEFAULT DIRECTORY Clause


The DEFAULT DIRECTORY clause identifies an Oracle Database directory object. The
directory object identifies an operating system directory with files that the external table
reads and writes.
ORACLE_HDFS and ORACLE_HIVE use the default directory solely to write log files
on the Oracle Database system.
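
For example, a DBA might create and grant such a directory object as follows (the path and user name are placeholders, not values required by the product):

CREATE OR REPLACE DIRECTORY DEFAULT_DIR AS '/u01/app/oracle/bigdata_logs';
GRANT READ, WRITE ON DIRECTORY DEFAULT_DIR TO bdsql_user;

The generated DDL shown earlier in this chapter references this DEFAULT_DIR object in its DEFAULT DIRECTORY clause.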

5.11.3 LOCATION Clause


The LOCATION clause identifies the data source.
ORACLE_HDFS LOCATION Clause
The LOCATION clause for ORACLE_HDFS contains a comma-separated list of file
locations. The files must reside in the HDFS file system on the default cluster.
A location can be any of the following:

A fully qualified HDFS directory name, such as /user/hive/warehouse/hive_seed/hive_types. ORACLE_HDFS uses all files in the directory.

A fully qualified HDFS file name, such as /user/hive/warehouse/hive_seed/hive_types/hive_types.csv.

A URL for an HDFS file or a set of files, such as hdfs:/user/hive/warehouse/hive_seed/hive_types/*. Just a directory name is invalid.

The file names can contain any pattern-matching character described in the following table.

Pattern-Matching Characters

?                Matches any one character
*                Matches zero or more characters
[abc]            Matches one character in the set {a, b, c}
[a-b]            Matches one character in the range {a...b}. The character must be less than or equal to b.
[^a]             Matches one character that is not in the character set or range {a}. The carat (^) must immediately follow the left bracket, with no spaces.
\c               Removes any special meaning of c. The backslash is the escape character.
{ab\,cd}         Matches a string from the set {ab, cd}. The escape character (\) removes the meaning of the comma as a path separator.
{ab\,c{de\,fh}}  Matches a string from the set {ab, cde, cfh}. The escape character (\) removes the meaning of the comma as a path separator.

ORACLE_HIVE LOCATION Clause

Do not specify the LOCATION clause for ORACLE_HIVE; it raises an error. The data is
stored in Hive, and the access parameters and the metadata store provide the necessary
information.

5.11.4 REJECT LIMIT Clause


Limits the number of conversion errors permitted during a query of the external table
before Oracle Database stops the query and returns an error.
Any processing error that causes a row to be rejected counts against the limit. The reject
limit applies individually to each parallel query (PQ) process. It is not the total of all
rejected rows for all PQ processes.
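
As an illustration (the limit of 100 is arbitrary, and the table follows the pattern of the earlier ORACLE_HDFS examples), the clause is simply appended to the external table definition:

CREATE TABLE order_hdfs (cust_num VARCHAR2(10),
                         order_num VARCHAR2(20))
ORGANIZATION EXTERNAL (TYPE oracle_hdfs)
LOCATION ('hdfs:/usr/cust/summary/*')
REJECT LIMIT 100;

With this setting, a query fails as soon as any single PQ process rejects more than 100 rows; REJECT LIMIT UNLIMITED, as used in the generated DDL earlier, disables the check entirely.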

5.11.5 ACCESS PARAMETERS Clause


The ACCESS PARAMETERS clause provides information that the access driver needs to
load the data correctly into the external table. See CREATE TABLE ACCESS
PARAMETERS Clause.

5.12 ABOUT DATA TYPE CONVERSIONS


When the access driver loads data into an external table, it verifies that the Hive data can
be converted to the data type of the target column. If they are incompatible, then the
access driver returns an error. Otherwise, it makes the appropriate data conversion.
Hive typically provides a table abstraction layer over data stored elsewhere, such as in
HDFS files. Hive uses a serializer/deserializer (SerDe) to convert the data as needed from
its stored format into a Hive data type. The access driver then converts the data from its
Hive data type to an Oracle data type. For example, if a Hive table over a text file has a
BIGINT column, then the SerDe converts the data from text to BIGINT. The access driver
then converts the data from BIGINT (a Hive data type) to NUMBER (an Oracle data
type).
Performance is better when one data type conversion is performed instead of two. The
data types for the fields in the HDFS files should therefore indicate the data that is actually
stored on disk. For example, JSON is a clear text format, therefore all data in a JSON file
is text. If the Hive type for a field is DATE, then the SerDe converts the data from string
(in the data file) to a Hive date. Then the access driver converts the data from a Hive date
to an Oracle date. However, if the Hive type for the field is string, then the SerDe does not
perform a conversion, and the access driver converts the data from string to an Oracle date.
Queries against the external table are faster in the second example, because the access
driver performs the only data type conversion.
The next table identifies the data type conversions that ORACLE_HIVE can make when
loading data into an external table.
Supported Hive to Oracle Data Type Conversions

The Oracle target columns are grouped as follows:
(A) VARCHAR2, CHAR, NCHAR2, NCHAR, CLOB
(B) NUMBER, FLOAT, BINARY_NUMBER, BINARY_FLOAT
(C) BLOB
(D) RAW
(E) DATE, TIMESTAMP, TIMESTAMP WITH TZ, TIMESTAMP WITH LOCAL TZ
(F) INTERVAL YEAR TO MONTH, INTERVAL DAY TO SECOND

Hive Data Type                    (A)      (B)      (C)      (D)   (E)   (F)
INT, SMALLINT, TINYINT, BIGINT    yes      yes      yes      yes   no    no
DOUBLE, FLOAT                     yes      yes      yes      yes   no    no
DECIMAL                           yes      yes      no       no    no    no
BOOLEAN                           yes (1)  yes (2)  yes (2)  yes   no    no
BINARY                            yes      no       yes      yes   no    no
STRING                            yes      yes      yes      yes   yes   yes
TIMESTAMP                         yes      no       no       no    yes   no
STRUCT, ARRAY, UNIONTYPE, MAP     yes      no       no       no    no    no

Footnote 1: FALSE maps to the string FALSE, and TRUE maps to the string TRUE.
Footnote 2: FALSE maps to 0, and TRUE maps to 1.

5.13 QUERYING EXTERNAL TABLES


Users can query external tables using the SQL SELECT statement, the same as they query
any other table.
Granting User Access
Users who query the data on a Hadoop cluster must have READ access in Oracle
Database to the external table and to the database directory object that points to the cluster
directory.
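A minimal grant sequence for a querying user might therefore look like this (the user name and the directory object name are placeholders):

GRANT READ ON ratings_db_table TO analyst;
GRANT READ ON DIRECTORY my_cluster_dir TO analyst;

Once both grants are in place, the user queries the external table with an ordinary SELECT statement.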
About Error Handling
By default, a query returns no data if an error occurs while the value of a column is
calculated. Processing continues after most errors, particularly those thrown while the
column values are calculated.
Use the com.oracle.bigdata.erroropt parameter to determine how errors are handled.
About the Log Files
You can use these access parameters to customize the log files:
com.oracle.bigdata.log.exec
com.oracle.bigdata.log.qc

5.14 ABOUT ORACLE BIG DATA SQL ON ORACLE EXADATA DATABASE MACHINE
Oracle Big Data SQL runs exclusively on systems with Oracle Big Data Appliance
connected to Oracle Exadata Database Machine. The Oracle Exadata Storage Server
Software is deployed on a configurable number of Oracle Big Data Appliance servers.
These servers combine the functionality of a CDH node and an Oracle Exadata Storage
Server.
The Mammoth utility installs the Big Data SQL software on both Oracle Big Data
Appliance and Oracle Exadata Database Machine. The information in this section explains
the changes that Mammoth makes to the Oracle Database system.
Oracle SQL Connector for HDFS provides access to Hadoop data for all Oracle Big Data
Appliance racks, including those that are not connected to Oracle Exadata Database
Machine. However, it does not offer the performance benefits of Oracle Big Data SQL,
and it is not included under the Oracle Big Data Appliance license.

5.14.1 Starting and Stopping the Big Data SQL Agent


The agtctl utility starts and stops the multithreaded Big Data SQL agent. It has the
following syntax:
agtctl {startup | shutdown} bds_clustername

5.14.2 About the Common Directory


The common directory contains configuration information that is common to all Hadoop
clusters. This directory is located on the Oracle Database system under the Oracle home
directory. The oracle file system user (or whichever user owns the Oracle Database
instance) owns the common directory. A database directory named
ORACLE_BIGDATA_CONFIG points to the common directory.

5.14.3 Common Configuration Properties


The Mammoth installation process creates the following files and stores them in the
common directory:
bigdata.properties
bigdata-log4j.properties

The Oracle DBA can edit these configuration files as necessary.

5.14.4 bigdata.properties
The bigdata.properties file in the common directory contains property-value pairs that
define the Java class paths and native library paths required for accessing data in HDFS.
These properties must be set:
bigdata.cluster.default
java.classpath.hadoop
java.classpath.hive
java.classpath.oracle

The following list describes all properties permitted in bigdata.properties.


bigdata.properties

bigdata.cluster.default
The name of the default Hadoop cluster. The access driver uses this name when the access
parameters do not specify a cluster. Required.
Changing the default cluster name might break external tables that were created
previously without an explicit cluster name.
bigdata.cluster.list
A comma-separated list of Hadoop cluster names. Optional.
java.classpath.hadoop
The Hadoop class path. Required.
java.classpath.hive
The Hive class path. Required.
java.classpath.oracle
The path to the Oracle JXAD Java JAR file. Required.
java.classpath.user
The path to user JAR files. Optional.
java.libjvm.file
The full file path to the JVM shared library (such as libjvm.so). Required.
java.options
A comma-separated list of options to pass to the JVM. Optional.
This example sets the maximum heap size to 2 GB, and verbose logging for Java Native
Interface (JNI) calls:
-Xmx2048m,-verbose=jni

LD_LIBRARY_PATH
A colon-separated (:) list of directory paths to search for the Hadoop native libraries.
Recommended.
If you set this option, then do not set java.library.path in java.options.
The next example shows a sample bigdata.properties file.
# bigdata.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
# bigdata.properties - Big Data Properties File
#
# DESCRIPTION
# Properties file containing parameters for allowing access to Big Data
# Fixed value properties can be added here
#

java.libjvm.file=$ORACLE_HOME/jdk/jre/lib/amd64/server/libjvm.so
java.classpath.oracle=$ORACLE_HOME/hadoopcore/jlib/*:$ORACLE_HOME/hadoop/jlib/hver2/*:$ORACLE_HOME/dbjava/lib/*
java.classpath.hadoop=$HADOOP_HOME/*:$HADOOP_HOME/lib/*
java.classpath.hive=$HIVE_HOME/lib/*
LD_LIBRARY_PATH=$ORACLE_HOME/jdk/jre/lib
bigdata.cluster.default=hadoop_cl_1

5.14.5 bigdata-log4j.properties
The bigdata-log4j.properties file in the common directory defines the logging behavior of
queries against external tables in the Java code. Any log4j properties are allowed in this
file.
The next example shows a sample bigdata-log4j.properties file with the relevant log4j
properties.
# bigdata-log4j.properties
#
# Copyright (c) 2014, Oracle and/or its affiliates. All rights reserved.
#
# NAME
# bigdata-log4j.properties - Big Data Logging Properties File
#
# DESCRIPTION
# Properties file containing logging parameters for Big Data
# Fixed value properties can be added here

bigsql.rootlogger=INFO,console
log4j.rootlogger=DEBUG, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.logger.oracle.hadoop.sql=ALL, file

bigsql.log.dir=.
bigsql.log.file=bigsql.log
log4j.appender.file.File=$ORACLE_HOME/bigdatalogs/bigdata-log4j.log

5.14.6 About the Cluster Directory


The cluster directory contains configuration information for a CDH cluster. Each cluster
that Oracle Database will access using Oracle Big Data SQL has a cluster directory. This
directory is located on the Oracle Database system under the common directory. For
example, a cluster named bda1_cl_1 would have a directory by the same name
(bda1_cl_1) in the common directory.
The cluster directory contains the CDH client configuration files for accessing the cluster,
such as the following:
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml (optional)
log4j property files (such as hive-log4j.properties)

A database directory object points to the cluster directory. Users who want to access the
data in a cluster must have read access to the directory object.

5.14.7 About Permissions


The oracle operating system user (or whatever user owns the Oracle Database installation
directory) must have the following setup:
READ/WRITE access to the database directory that points to the log directory. These
permissions enable the access driver to create the log files, and for the user to read them.
A corresponding oracle operating system user defined on Oracle Big Data Appliance, with
READ access in the operating system to the HDFS directory where the source data is
stored.

Chapter 6.

HIVE USER DEFINED FUNCTIONS (UDFs)


6.1 INTRODUCTION
User-defined Functions (UDFs) have a long history of usefulness in SQL-derived
languages. While query languages can be rich in their expressiveness, there's just no way
they can anticipate all the things a developer wants to do. Thus, the custom UDF has
become commonplace in our data manipulation toolbox.
Apache Hive is no different in this respect from other SQL-like languages. Hive allows
extensibility via both Hadoop Streaming and compiled Java. However, largely because of
the underlying MapReduce paradigm, all Hive UDFs are not created equally. Some UDFs
are intended for map-side execution, while others are portable and can be run on the
reduce side. Moreover, UDF behavior via streaming requires that queries be formatted
so as to direct script execution where we desire it.
The intricacies of where and how a UDF executes may seem like minutiae, but we would
be disappointed if time spent coding a cumulative sum UDF produced a function that only
executed on single rows. To that end, the rest of this chapter dives into the three primary
types of Java-based UDFs in Hive.

6.1.1 The Three Little UDFs


Hive provides three classes of UDFs that most users are interested in: UDFs, UDTFs, and
UDAFs. Broken down simply, the three classes can be explained as follows:

UDFs: User Defined Functions; these operate row-wise, generally during map
execution. They're the simplest UDFs to write, but constrained in their functionality.
UDTFs: User Defined Table-Generating Functions; these also execute row-wise,
but they produce multiple rows of output (i.e., they generate a table). The most
common example of this is Hive's explode function.
UDAFs: User Defined Aggregating Functions; these can execute on either the
map side or the reduce side and are far more flexible than UDFs. The challenge,
however, is that in writing UDAFs we have to think not just about what to do with a
single row, or even a group of rows. Here, one has to consider partial aggregation and
serialization between the map and reduce processes.

6.2 THREE LITTLE HIVE UDFS: EXAMPLE 1


6.2.1 Introduction
In our ongoing exploration of Hive User Defined Functions, we're starting with the simplest
case. Of the three little UDFs, today's entry built a straw house: simple, easy to put together,
but limited in applicability. We'll walk through the important parts of the code; the complete
source is available on GitHub.

6.2.2 Extending UDF


The first few lines of interest are very straightforward:
@Description(name = "moving_avg", value = "_FUNC_(x, n) - Returns the moving mean of a set of numbers over a window of n observations")
@UDFType(deterministic = false, stateful = true)

public class UDFSimpleMovingAverage extends UDF

We're extending the UDF class with some decoration. The decoration is important for
usability and functionality. The description decorator allows us to give Hive some
information to show users about how to use our UDF and what its method signature will
be. The UDFType decoration tells Hive what sort of behavior to expect from our function.
A deterministic UDF will always return the same output given a particular input. A square-root
computing UDF will always return the same square root for 4, so we can say it is
deterministic; a call to get the system time would not be. The stateful annotation of the
UDFType decoration is relatively new to Hive (e.g., CDH4 and above). The stateful
directive allows Hive to keep some static variables available across rows. The simplest
example of this is a row-sequence, which maintains a static counter which increments
with each row processed.
Since square-root and row-counting aren't terribly interesting, we'll use the stateful
annotation to build a simple moving average function. We'll return to the notion of a
moving average later when we build a UDAF, so as to compare the two approaches.
private DoubleWritable result = new DoubleWritable();
private static ArrayDeque<Double> window;
int windowSize;

public UDFSimpleMovingAverage() {
  result.set(0);
}

The above code is basic initialization. We make a double in which to hold the result, but it
needs to be of class DoubleWritable so that MapReduce can properly serialize the data.
We use a deque to hold our sliding window, and we need to keep track of the window's
size. Finally, we implement a constructor for the UDF class.
public DoubleWritable evaluate(DoubleWritable v, IntWritable n) {
double sum = 0.0;
double moving_average;

double residual;
if (window == null)
{
window = new ArrayDeque<Double>();

Here's the meat of the class: the evaluate method. This method will be called on each row
by the map tasks. For any given row, we can't say whether or not our sliding window
exists, so we initialize it if it's null.
//slide the window
if (window.size() == n.get())
window.pop();

window.addLast(new Double(v.get()));

// compute the average
for (Iterator<Double> i = window.iterator(); i.hasNext();)

sum += i.next().doubleValue();

Here we deal with the deque and compute the sum of the window's elements. Deques are
essentially double-ended queues, so they make excellent sliding windows. If the window
is full, we pop the oldest element and add the current value.
moving_average = sum/window.size();
result.set(moving_average);

return result;

Computing the moving average without weighting is simply dividing the sum of our
window by its size. We then set that value in our Writable variable and return it. The value
is then emitted as part of the map task executing the UDF function.
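
Once the class is compiled and packaged into a jar, using it from Hive follows the same pattern as any other custom function (the jar path, package name and table below are illustrative, not taken from the original source):

add jar /tmp/moving_avg.jar;
CREATE temporary function moving_avg AS 'com.example.hive.udf.UDFSimpleMovingAverage';
SELECT ticker, moving_avg(price, 5) FROM stock_ticks;

Keep in mind that, because this is a plain UDF, the sliding window lives inside each map task: rows are averaged in whatever order that mapper reads them, which is precisely the limitation the UDAF version later in this chapter addresses.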

6.3 THREE LITTLE HIVE UDFS: EXAMPLE 2


6.3.1 Introduction
In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF. Today
we'll move to the UDTF, which generates multiple rows for every row processed. This
UDF built its house from sticks: it's slightly more complicated than the basic UDF and
allows us an opportunity to explore how Hive functions manage type checking.

6.3.2 Extending GenericUDTF


Our UDTF is going to produce pairwise combinations of elements in a comma-separated
string. So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:

Apples, Bananas
Apples, Carrots
Bananas, Carrots

As with the UDF, the first few lines are a simple class extension with a decorator so that
Hive can describe what the function does.
@Description(name = "pairwise", value = "_FUNC_(doc) - emits pairwise combinations of an input array")
public class PairwiseUDTF extends GenericUDTF {

private PrimitiveObjectInspector stringOI = null;

We also create an object of PrimitiveObjectInspector, which we'll use to ensure that the
input is a string. Once this is done, we need to override methods for initialization, row
processing, and cleanup.
@Override
public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException
{
  if (args.length != 1) {
    throw new UDFArgumentException("pairwise() takes exactly one argument");
  }

  if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
      && ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory() !=
         PrimitiveObjectInspector.PrimitiveCategory.STRING) {
    throw new UDFArgumentException("pairwise() takes a string as a parameter");
  }

  stringOI = (PrimitiveObjectInspector) args[0];

This UDTF is going to return an array of structs, so the initialize method needs to return
a StructObjectInspector object. Note that the arguments to the constructor come in as an
array of ObjectInspector objects. This allows us to handle arguments in a normal fashion
but with the benefit of methods to broadly inspect type. We only allow a single argument
(the string column to be processed), so we check the length of the array and validate
that the sole element is both a primitive and a string.
The second half of the initialize method is more interesting:
List<String> fieldNames = new ArrayList<String>(2);
List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
fieldNames.add("memberA");
fieldNames.add("memberB");
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);

Here we set up information about what the UDTF returns. We need this in place before we
start processing rows; otherwise, Hive can't correctly build execution plans before
submitting jobs to MapReduce. The structures we're returning will be two strings per
struct, which means we'll need ObjectInspector objects for both the values and the names
of the fields. We create two lists: one of strings for the names, the other of ObjectInspector
objects. We pack them manually and then use a factory to get the StructObjectInspector
which defines the actual return value.
Now we're ready to actually do some processing, so we override the process method.
@Override
public void process(Object[] record) throws HiveException {
  final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);

  if (document == null) {
    return;
  }
  String[] members = document.split(",");

  java.util.Arrays.sort(members);
  for (int i = 0; i < members.length - 1; i++)
    for (int j = 1; j < members.length; j++)
      if (!members[i].equals(members[j]))
        forward(new Object[] {members[i], members[j]});

This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop.
There are, though, some interesting things to note. First, to actually get a string object to
operate on, we have to use an ObjectInspector and some typecasting. This allows us to bail
out early if the column value is null. Once we have the string, splitting, sorting, and
looping is textbook stuff.
The last notable piece is that the process method does not return anything. Instead, we
call forward to emit our newly created structs. For those used to database
internals, this follows the producer-consumer model of most RDBMSs. For those used to
MapReduce semantics, this is equivalent to calling write on the Context object.
@Override
public void close() throws HiveException {
  // do nothing
}

If there were any cleanup to do, we'd take care of it here. But this is simple emission, so
our override doesn't need to do anything.

6.3.3 Using the UDTF


Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it
to a temporary function. However, mixing the results of a UDTF with other columns from
the base table requires that we use a LATERAL VIEW.
-- Add the jar
add jar /mnt/shared/market_basket_example/pairwise.jar;

-- Create a function
CREATE temporary function pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';

-- View the pairwise expansion output
SELECT m1, m2, COUNT(*) FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1, m2 GROUP BY m1, m2;

6.4 THREE LITTLE HIVE UDFS: EXAMPLE 3


6.4.1 Introduction
In the final installment in the series on Hive UDFs, we're going to tackle the least intuitive
of the three types: the User Defined Aggregating Function (UDAF). While they're challenging to
implement, UDAFs are necessary if we want functions for which the distinction between
map-side and reduce-side operations is opaque to the user. If a user is writing a query, most
would prefer to focus on the data they're trying to compute, not on which part of the plan is
running a given function.
The UDAF also provides a valuable opportunity to consider some of the nuances of
distributed programming and parallel database operations. Since each task in a
MapReduce job operates in a bit of a vacuum (e.g. Map task A does not know what data
Map task B has), a UDAF has to explicitly account for more operational states than a
simple UDF. We'll return to the notion of a simple moving average function, but ask
yourself: how do we compute a moving average if we don't have state or order around the
data?

6.4.2 Prefix Sum: Moving Average without State


In order to compute a moving average without state, we're going to need a specialized
parallel algorithm. For moving average, the trick is to use a prefix sum, effectively
keeping a table of running totals for quick computation (and recomputation) of our
moving average. A full discussion of prefix sums for moving averages is beyond the scope of
this section, but John Jenq provides an excellent discussion of the technique as applied to
CUDA implementations.
What we'll cover here is the necessary implementation of a pair of classes to store and
operate on our prefix sum entries within the UDAF.
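Before diving into those classes, a small numeric illustration may help. The following standalone sketch is not part of the UDAF; it simply shows, for an assumed toy series of four values and a window of two, how running totals turn each window sum into a single subtraction:

// Minimal illustration of the prefix-sum trick used by the moving average UDAF.
// For v = [2, 4, 6, 8] the prefix sums are [2, 6, 12, 20]; the total of any window
// ending at position i is prefixSum[i] - prefixSum[i - window].
public class PrefixSumDemo {
  public static void main(String[] args) {
    double[] v = {2, 4, 6, 8};
    int window = 2;

    double[] prefix = new double[v.length];
    double running = 0;
    for (int i = 0; i < v.length; i++) {
      running += v[i];
      prefix[i] = running;
    }

    for (int i = 0; i < v.length; i++) {
      // Before a full window is available, the running total itself is used,
      // mirroring the behavior of the add method shown below.
      double windowTotal = (i >= window) ? prefix[i] - prefix[i - window] : prefix[i];
      System.out.println("period " + i + ": moving average = " + windowTotal / window);
    }
  }
}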

public class PrefixSumMovingAverage {

  static class PrefixSumEntry implements Comparable
  {
    int period;
    double value;
    double prefixSum;
    double subsequenceTotal;
    double movingAverage;

    public int compareTo(Object other)
    {
      PrefixSumEntry o = (PrefixSumEntry) other;
      if (period < o.period)
        return -1;
      if (period > o.period)
        return 1;
      return 0;
    }
  }


Here we have the definition of our moving average class and the static inner class which
serves as an entry in our table. What's important here are some of the variables we define
for each entry in the table: the time index or period of the value (its order), the value itself,
the prefix sum, the subsequence total, and the moving average itself. Every entry in our
table requires not just the current value to compute the moving average, but also the sum of
the entries in our moving average window. It's this pair of values which allows
prefix sum methods to work their magic.
//class variables
private int windowSize;
private ArrayList<PrefixSumEntry> entries;

public PrefixSumMovingAverage()
{
windowSize = 0;
}

public void reset()
{
windowSize = 0;
entries = null;
}

public boolean isReady()
{
return (windowSize > 0);
}

The above are simple initialization routines: a constructor, a method to reset the table, and
a boolean method indicating whether or not the object has a prefix sum table on which to operate.
From here, there are three important methods to examine: add, merge, and serialize. The first
is intuitive: as we scan rows in Hive we want to add them to our prefix sum table. The
other two are important because of partial aggregation.
We cannot say ahead of time where this UDAF will run, and partial aggregation may be
required. That is, it's entirely possible that some values may run through the UDAF during
a map task, but then be passed to a reduce task to be combined with other values. The
serialize method allows Hive to pass the partial results from the map side to the reduce
side. The merge method allows reducers to combine the results of partial aggregations
from the map tasks.

@SuppressWarnings("unchecked")
public void add(int period, double v)
{
  // Add a new entry to the list and update the table
  PrefixSumEntry e = new PrefixSumEntry();
  e.period = period;
  e.value = v;
  entries.add(e);

  // do we need to ensure this is sorted?
  //if (needsSorting(entries))
  Collections.sort(entries);

  // update the table
  // prefixSums first
  double prefixSum = 0;
  for (int i = 0; i < entries.size(); i++)
  {
    PrefixSumEntry thisEntry = entries.get(i);
    prefixSum += thisEntry.value;
    thisEntry.prefixSum = prefixSum;
    entries.set(i, thisEntry);
  }


The first part of the add method is simple: we add the element to the list and update our
table's prefix sums.
  // now do the subsequence totals and moving averages
  for (int i = 0; i < entries.size(); i++)
  {
    double subsequenceTotal;
    double movingAverage;

    PrefixSumEntry thisEntry = entries.get(i);
    PrefixSumEntry backEntry = null;

    if (i >= windowSize)
      backEntry = entries.get(i - windowSize);

    if (backEntry != null)
    {
      subsequenceTotal = thisEntry.prefixSum - backEntry.prefixSum;
    }
    else
    {
      subsequenceTotal = thisEntry.prefixSum;
    }

    movingAverage = subsequenceTotal / (double) windowSize;
    thisEntry.subsequenceTotal = subsequenceTotal;
    thisEntry.movingAverage = movingAverage;
    entries.set(i, thisEntry);
  }
}


In the second half of the add method, we compute our moving averages based on the
prefix sums. It's here you can see the hinge on which the algorithm swings:
thisEntry.prefixSum - backEntry.prefixSum. That offset between the current table entry
and its nth predecessor makes the whole thing work.
public ArrayList<DoubleWritable> serialize()
{
ArrayList<DoubleWritable> result = new ArrayList<DoubleWritable>();

result.add(new DoubleWritable(windowSize));
if (entries != null)
{
for (PrefixSumEntry i : entries)

{
result.add(new DoubleWritable(i.period));
result.add(new DoubleWritable(i.value));
}
}
return result;

The serialize method needs to package the results of our algorithm to pass to another
instance of the same algorithm, and it needs to do so in a type that Hadoop can serialize.
In the case of a method like sum, this would be relatively simple: we would only need to
pass the sum up to this point. However, because we cannot be certain whether this
instance of our algorithm has seen all the values, or seen them in the correct order, we
actually need to serialize the whole table. To do this, we create a list of DoubleWritables,
pack the window size at its head, and then add each period and value. This gives us a structure
that's easy to unpack and merge with other lists of the same construction.
@SuppressWarnings("unchecked")
public void merge(List<DoubleWritable> other)
{
  if (other == null)
    return;

  // if this is an empty buffer, just copy in other
  // but deserialize the list
  if (windowSize == 0)
  {
    windowSize = (int) other.get(0).get();
    entries = new ArrayList<PrefixSumEntry>();
    // we're serialized as period, value, period, value
    for (int i = 1; i < other.size(); i += 2)
    {
      PrefixSumEntry e = new PrefixSumEntry();
      e.period = (int) other.get(i).get();
      e.value = other.get(i+1).get();
      entries.add(e);
    }
  }


Merging results is perhaps the most complicated thing we need to handle. First, we check
the case in which there was no partial result passed: we just return and continue. Second,
we check to see if this instance of PrefixSumMovingAverage already has a table. If it
doesn't, we can simply unpack the serialized result and treat it as our window.
  // if we already have a buffer, we need to add these entries
  else
  {
    // we're serialized as period, value, period, value
    for (int i = 1; i < other.size(); i += 2)
    {
      PrefixSumEntry e = new PrefixSumEntry();
      e.period = (int) other.get(i).get();
      e.value = other.get(i+1).get();
      entries.add(e);
    }
  }


The third case is the non-trivial one: if this instance has a table and receives a serialized
table, we must merge them together. Consider a Reduce task: as it receives outputs from
multiple Map tasks, it needs to merge all of them together to form a larger table. Thus,
merge will be called many times to add these results and reassemble a larger time series.
// sort and recompute
Collections.sort(entries);
// update the table
// prefixSums first
double prefixSum = 0;

for(int i = 0; i < entries.size(); i++)


{
PrefixSumEntry thisEntry = entries.get(i);
prefixSum += thisEntry.value;
thisEntry.prefixSum = prefixSum;
entries.set(i, thisEntry);

This part should look familiar; it's just like the add method. Now that we have new entries
in our table, we need to sort by period and recompute the moving averages. In fact, the
rest of the merge method is exactly like the add method, so we might consider putting the
sorting and recomputing in a separate method, as sketched below.
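A hypothetical refactoring along those lines (this helper is not part of the original code, only an illustration that assumes the same fields, entries and windowSize, used above) might look like this:

// Hypothetical helper consolidating the logic shared by add() and merge().
private void sortAndRecompute()
{
  Collections.sort(entries);

  double prefixSum = 0;
  for (int i = 0; i < entries.size(); i++)
  {
    PrefixSumEntry thisEntry = entries.get(i);
    prefixSum += thisEntry.value;
    thisEntry.prefixSum = prefixSum;

    // subtract the prefix sum of the entry just before the window, if there is one
    PrefixSumEntry backEntry = (i >= windowSize) ? entries.get(i - windowSize) : null;
    double subsequenceTotal = (backEntry != null)
        ? thisEntry.prefixSum - backEntry.prefixSum
        : thisEntry.prefixSum;

    thisEntry.subsequenceTotal = subsequenceTotal;
    thisEntry.movingAverage = subsequenceTotal / (double) windowSize;
    entries.set(i, thisEntry);
  }
}

Both add and merge could then call this single method after inserting their new entries.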

6.4.3 Orchestrating Partial Aggregation


We've got a clever little algorithm for computing a moving average in parallel, but Hive
can't do anything with it unless we create a UDAF that understands how to use our
algorithm. At this point, we need to start writing some real UDAF code. As before, we
extend a generic class, in this case GenericUDAFEvaluator.
public static class GenericUDAFMovingAverageEvaluator extends GenericUDAFEvaluator {

// input inspectors for PARTIAL1 and COMPLETE
private PrimitiveObjectInspector periodOI;
private PrimitiveObjectInspector inputOI;
private PrimitiveObjectInspector windowSizeOI;

// input inspectors for PARTIAL2 and FINAL
// list for MAs and one for residuals
private StandardListObjectInspector loi;

As in the case of a UDTF, we create ObjectInspectors to handle type checking. However,
notice that we have inspectors for different states: PARTIAL1, PARTIAL2, COMPLETE,
and FINAL. These correspond to the different states in which our UDAF may be
executing. Since our serialized prefix sum table isn't the same input type as the values our
add method takes, we need different type checking for each.
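As a quick orientation (this summary reflects standard Hive GenericUDAFEvaluator semantics rather than anything stated explicitly above), the four modes correspond roughly to the following phases of the query plan:

// Mode.PARTIAL1: map side      -- iterate() over raw rows, then terminatePartial()
// Mode.PARTIAL2: combiner-like -- merge() partial results, then terminatePartial()
// Mode.FINAL:    reduce side   -- merge() partial results, then terminate()
// Mode.COMPLETE: map-only plan -- iterate() over raw rows, then terminate()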
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {

super.init(m, parameters);

// initialize input inspectors
if (m == Mode.PARTIAL1 || m == Mode.COMPLETE)
{
assert(parameters.length == 3);
periodOI = (PrimitiveObjectInspector) parameters[0];

inputOI = (PrimitiveObjectInspector) parameters[1];


windowSizeOI = (PrimitiveObjectInspector) parameters[2];
}

Here's the beginning of our overridden initialization function. We check the parameters for
two modes, PARTIAL1 and COMPLETE. Here we assume that the arguments to our
UDAF are the same as the user passes in a query: the period, the input, and the size of the
window. If the UDAF instance is consuming the results of our partial aggregation, we
need a different ObjectInspector. Specifically, this one:
else
{
loi = (StandardListObjectInspector) parameters[0];

Similar to the UDTF, we also need type checking on the output types, but for both
partial and full aggregation. In the case of partial aggregation, we're returning lists of
DoubleWritables:

// init output object inspectors


if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
// The output of a partial aggregation is a list of doubles representing the
// moving average being constructed.
// the first element in the list will be the window size
//
return ObjectInspectorFactory.getStandardListObjectInspector(
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);

But in the case of FINAL or COMPLETE, we're dealing with the types that will be
returned to the Hive user, so we need to return a different output. We're going to return a
list of structs that contain the period, the moving average, and the residuals (since they're cheap
to compute).
else {
  // The output of FINAL and COMPLETE is a full aggregation, which is a
  // list of DoubleWritable structs representing the final result as
  // (period, moving_average, residual) entries.
  ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
  foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
  foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
  foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);

  ArrayList<String> fname = new ArrayList<String>();
  fname.add("period");
  fname.add("moving_average");
  fname.add("residual");

  return ObjectInspectorFactory.getStandardListObjectInspector(
      ObjectInspectorFactory.getStandardStructObjectInspector(fname, foi));
}

Next come methods to control what happens when a Map or Reduce task is finished with
its data. In the case of partial aggregation, we need to serialize the data. In the case of full
aggregation, we need to package the result for Hive users.
@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
// return an ArrayList where the first parameter is the window size
MaAgg myagg = (MaAgg) agg;
return myagg.prefixSum.serialize();
}

@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
  // final return value goes here
  MaAgg myagg = (MaAgg) agg;

  if (myagg.prefixSum.tableSize() < 1)
  {
    return null;
  }
  else
  {
    ArrayList<DoubleWritable[]> result = new ArrayList<DoubleWritable[]>();
    for (int i = 0; i < myagg.prefixSum.tableSize(); i++)
    {
      double residual = myagg.prefixSum.getEntry(i).value
          - myagg.prefixSum.getEntry(i).movingAverage;

      DoubleWritable[] entry = new DoubleWritable[3];
      entry[0] = new DoubleWritable(myagg.prefixSum.getEntry(i).period);
      entry[1] = new DoubleWritable(myagg.prefixSum.getEntry(i).movingAverage);
      entry[2] = new DoubleWritable(residual);
      result.add(entry);
    }
    return result;
  }
}

We also need to provide instruction on how Hive should merge the results of partial
aggregation. Fortunately, we already handled this in our PrefixSumMovingAverage class,
so we can just call that.
@SuppressWarnings("unchecked")
@Override
public void merge(AggregationBuffer agg, Object partial) throws HiveException {
  // if we're merging two separate sets we're creating one table that's doubly long
  if (partial != null)
  {
    MaAgg myagg = (MaAgg) agg;
    List<DoubleWritable> partialMovingAverage = (List<DoubleWritable>) loi.getList(partial);
    myagg.prefixSum.merge(partialMovingAverage);
  }
}


Of course, merging and serializing isn't very useful unless the UDAF has logic for
iterating over values. The iterate method handles this and, as one would expect, relies
entirely on the PrefixSumMovingAverage class we created.
@Override
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
  assert (parameters.length == 3);

  if (parameters[0] == null || parameters[1] == null || parameters[2] == null)
  {
    return;
  }

  MaAgg myagg = (MaAgg) agg;

  // Parse out the window size just once if we haven't done so before.
  // We need a window of at least 1, otherwise there's no window.
  if (!myagg.prefixSum.isReady())
  {
    int windowSize = PrimitiveObjectInspectorUtils.getInt(parameters[2], windowSizeOI);
    if (windowSize < 1)
    {
      throw new HiveException(getClass().getSimpleName() + " needs a window size >= 1");
    }
    myagg.prefixSum.allocate(windowSize);
  }

  // Add the current data point and compute the average
  int p = PrimitiveObjectInspectorUtils.getInt(parameters[0], periodOI);
  double v = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI);
  myagg.prefixSum.add(p, v);
}

6.4.4 Aggregation Buffers: Connecting Algorithms with Execution


One might notice that the code for our UDAF references an object of type
AggregationBuffer quite a lot. This is because the AggregationBuffer is the interface
which allows us to connect our custom PrefixSumMovingAverage class to Hive's
execution framework. While it doesn't constitute a great deal of code, it's the glue that binds
our logic to Hive's execution framework. We implement it as follows:
// Aggregation buffer definition and manipulation methods
static class MaAgg implements AggregationBuffer {
PrefixSumMovingAverage prefixSum;
};

@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
  MaAgg result = new MaAgg();
  reset(result);
  return result;
}
6.4.5 Using the UDAF


The goal of a good UDAF is that, no matter how complicated it was for us to implement,
it should be simple for our users. For all that code and parallel thinking, usage of the
UDAF is very straightforward:
ADD JAR /mnt/shared/hive_udfs/dist/lib/moving_average_udf.jar;

CREATE TEMPORARY FUNCTION moving_avg AS 'com.oracle.hadoop.hive.ql.udf.generic.GenericUDAFMovingAverage';

#get the moving average for a single tail number
SELECT TailNum, moving_avg(timestring, delay, 4) FROM ts_example
WHERE TailNum='N967CA' GROUP BY TailNum LIMIT 100;


Here we're applying the UDAF to get the moving average of arrival delay for a
particular flight. It's a really simple query for all the work we did underneath. We can do
a bit more and leverage Hive's ability to handle complex types as columns; here's a
query which creates a table of time series as arrays.
#create a set of moving averages for every plane starting with N
#Note: this UDAF blows up unpleasantly in heap; there will be data volumes for which you
#need to throw excessive amounts of memory at the problem
CREATE TABLE moving_averages AS
SELECT TailNum, moving_avg(timestring, delay, 4) AS timeseries FROM ts_example
WHERE TailNum LIKE 'N%' GROUP BY TailNum;


6.4.6 Summary
We've covered all manner of UDFs: from simple class extensions which can be written
very easily, to very complicated UDAFs which require us to think about distributed
execution and the plan orchestration done by query engines. With any luck, the discussion has
provided you with the confidence to go out and implement your own UDFs, or at least to
pay some attention to the complexities of the ones in use every day.

Chapter 7.

ORACLE NoSQL

7.1 INTRODUCTION

NoSQL databases represent a recent evolution in enterprise application architecture,


continuing the evolution of the past twenty years. In the 1990s, vertically integrated
applications gave way to client-server architectures, and more recently, client-server
architectures gave way to three-tier web application architectures. In parallel, the demands
of web-scale data analysis added map-reduce processing into the mix and data architects
started eschewing transactional consistency in exchange for incremental scalability and
large-scale distribution. The NoSQL movement emerged out of this second ecosystem.
NoSQL is often characterized by what it's not; depending on whom you ask, it's either
"not only a SQL-based relational database management system" or simply "not a SQL-based RDBMS". While those definitions explain what NoSQL is not, they do little to
explain what NoSQL is. Consider the fundamentals that have guided data management for
the past forty years. RDBMS systems and large-scale data management have been
characterized by the transactional ACID properties of Atomicity, Consistency, Isolation,
and Durability. In contrast, NoSQL is sometimes characterized by the BASE acronym:
Basically Available: Use replication to reduce the likelihood of data unavailability and
use sharding, or partitioning the data among many different storage servers, to make any
remaining failures partial. The result is a system that is always available, even if subsets of
the data become unavailable for short periods of time.
Soft state: While ACID systems assume that data consistency is a hard requirement,
NoSQL systems allow data to be inconsistent and relegate designing around such
inconsistencies to application developers.
Eventually consistent: Although applications must cope with the absence of instantaneous consistency,
NoSQL systems ensure that at some future point in time the data assumes a consistent
state. In contrast to ACID systems that enforce consistency at transaction commit, NoSQL
guarantees consistency only at some undefined future time.
NoSQL emerged as companies such as Amazon, Google, LinkedIn and Twitter struggled
to deal with unprecedented data and operation volumes under tight latency constraints.
Analyzing high-volume, real-time data, such as web-site click streams, provides
significant business advantage by harnessing unstructured and semi-structured data
sources to create more business value. Traditional relational databases were not up to the
task, so enterprises built upon a decade of research on distributed hash tables (DHTs) and
either conventional relational database systems or embedded key/value stores, such as
Oracle's Berkeley DB, to develop highly available, distributed key-value stores.
Although some of the early NoSQL solutions built their systems atop existing relational
database engines, they quickly realized that such systems were designed for SQL-based
access patterns and latency demands that are quite different from those of NoSQL
systems, so these same organizations began to develop brand new storage layers. In
contrast, Oracle's Berkeley DB product line was the original key/value store; Oracle
Berkeley DB Java Edition has been in commercial use for over eight years. By using
Oracle Berkeley DB Java Edition as the underlying storage engine beneath a NoSQL
system, Oracle brings enterprise robustness and stability to the NoSQL landscape.
Furthermore, until recently, integrating NoSQL solutions with an enterprise application
architecture required manual integration and custom development.
Oracle's NoSQL Database provides all the desirable features of NoSQL solutions
necessary for seamless integration into an enterprise application architecture. The next
figure shows a canonical acquire-organize-analyze data cycle, demonstrating how Oracle's
NoSQL Database fits into such an ecosystem. Oracle-provided adapters allow the Oracle
NoSQL Database to integrate with a Hadoop MapReduce framework or with the Oracle
Database's in-database MapReduce, Data Mining, R-based analytics, or whatever business
needs demand.

The Oracle NoSQL Database, with its "No Single Point of Failure" architecture, is the
right solution when data access is simple in nature and application demands exceed the
volume or latency capability of traditional data management solutions. For example, click-stream
data from high-volume web sites, high-throughput event processing, and social
networking communications all represent application domains that produce extraordinary
volumes of simple keyed data. Monitoring online retail behavior, accessing customer
profiles, pulling up appropriate customer ads, and storing and forwarding real-time
communication are examples of domains requiring the ultimate in low-latency access.
Highly distributed applications such as real-time sensor aggregation and scalable
authentication also represent domains well-suited to Oracle NoSQL Database.

7.2 DATA MODEL


Oracle NoSQL Database leverages the Oracle Berkeley DB Java Edition High Availability
storage engine to provide distributed, highly-available key/value storage for large-volume,
latency-sensitive applications or web services. It can also provide fast, reliable, distributed
storage to applications that need to integrate with ETL processing.
In its simplest form, Oracle NoSQL Database implements a map from user-defined keys
(Strings) to opaque data items. It records version numbers for key/data pairs, but maintains
the single latest version in the store. Applications need never worry about reconciling
incompatible versions because Oracle NoSQL Database uses single-master replication; the
master node always has the most up-todate value for a given key, while read-only replicas
might have slightly older versions. Applications can use version numbers to ensure
consistency for read-modify-write operations.
Oracle NoSQL Database hashes keys to provide good distribution over a collection of
computers that provide storage for the database. However, applications can take advantage
of subkey capabilities to achieve data locality. A key is the concatenation of a Major Key
Path and a Minor Key Path, both of which are specified by the application. All records
sharing a Major Key Path are co-located to achieve data locality.
Within a co-located collection of Major Key Paths, the full key, comprised of both the
Major and Minor Key Paths, provides fast, indexed lookup. For example, an application
storing user profiles might use the profile-name as a Major Key Path and then have several
Minor Key Paths for different components of that profile such as email address, name,
phone number, etc. Because applications have complete control over the composition and
interpretation of keys, different Major Key Paths can have entirely different Minor Key
Path structures. Continuing our previous example, one might store user profiles and
application profiles in the same store and maintain different Minor Key Paths for each.
Prefix key compression makes storage of key groups efficient.
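To make the key structure concrete, the following is a hedged sketch using the oracle.kv Java client API; the class and method names reflect the standard Oracle NoSQL Database driver, while the specific paths are illustrative assumptions rather than an example from the product documentation.

import oracle.kv.Key;
import java.util.Arrays;

public class ProfileKeys {
  public static void main(String[] args) {
    // Major Key Path: the profile name; Minor Key Path: one component of the profile.
    Key emailKey = Key.createKey(Arrays.asList("users", "jsmith"),
                                 Arrays.asList("email"));
    Key phoneKey = Key.createKey(Arrays.asList("users", "jsmith"),
                                 Arrays.asList("phone"));

    // Both keys share the same Major Key Path, so their records are co-located.
    System.out.println(emailKey + " and " + phoneKey);
  }
}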
While many NoSQL databases state that they provide eventual consistency, Oracle
NoSQL Database provides several different consistency policies. At one end of the
spectrum, applications can specify absolute consistency, which guarantees that all reads
return the most recently written value for a designated key. At the other end of the
spectrum, applications capable of tolerating inconsistent data can specify weak
consistency, allowing the database to return a value efficiently even if it is not entirely up
to date. In between these two extremes, applications can specify time-based consistency to
constrain how old a record might be or version-based consistency to support both
atomicity for read-modify-write operations and reads that are at least as recent as the
specified version. The next figure shows how the range of flexible consistency policies
enables developers to easily create business solutions providing data guarantees while
meeting application latency and scalability requirements.
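As an illustration of how these read consistency choices might appear in application code, here is a hedged sketch using the oracle.kv API; the class names come from the standard Java driver, while the key and timeout values are assumptions made for the example.

import oracle.kv.Consistency;
import oracle.kv.KVStore;
import oracle.kv.Key;
import oracle.kv.ValueVersion;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class ConsistencyExample {
  // Reads the same key under three different consistency policies.
  public static void readExamples(KVStore store) {
    Key key = Key.createKey(Arrays.asList("users", "jsmith"), Arrays.asList("email"));

    // Absolute consistency: served by the master, guaranteed latest value.
    ValueVersion latest = store.get(key, Consistency.ABSOLUTE, 5, TimeUnit.SECONDS);

    // Weak consistency: any replica may answer, possibly returning a slightly stale value.
    ValueVersion possiblyStale = store.get(key, Consistency.NONE_REQUIRED, 5, TimeUnit.SECONDS);

    // Time-based consistency: the answering replica must lag the master by at most 2 seconds.
    Consistency.Time within2s = new Consistency.Time(2, TimeUnit.SECONDS, 5, TimeUnit.SECONDS);
    ValueVersion recentEnough = store.get(key, within2s, 5, TimeUnit.SECONDS);
  }
}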

Oracle NoSQL Database also provides a range of durability policies that specify what
guarantees the system makes after a crash. At one extreme, applications can request that
write requests block until the record has been written to stable storage on all copies. This
has obvious performance and availability implications, but ensures that if the application
successfully writes data, that data will persist and can be recovered even if all the copies
become temporarily unavailable due to multiple simultaneous failures.
At the other extreme, applications can request that write operations return as soon as the
system has recorded the existence of the write, even if the data is not persistent anywhere.
Such a policy provides the best write performance, but provides no durability guarantees.
By specifying when the database writes records to disk and what fraction of the copies of
the record must be persistent (none, all, or a simple majority), applications can enforce a
wide range of durability policies.
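A durability policy can be expressed in a similar way. The following hedged sketch uses the Durability class from the standard oracle.kv API; the specific key, value, and timeout are illustrative assumptions.

import oracle.kv.Durability;
import oracle.kv.KVStore;
import oracle.kv.Key;
import oracle.kv.Value;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

public class DurabilityExample {
  // Writes a record under the strictest policy: sync to disk on master and replicas,
  // and wait for all replicas to acknowledge.
  public static void strictWrite(KVStore store) {
    Durability strict = new Durability(Durability.SyncPolicy.SYNC,        // master sync
                                       Durability.SyncPolicy.SYNC,        // replica sync
                                       Durability.ReplicaAckPolicy.ALL);  // acknowledgement policy

    Key key = Key.createKey(Arrays.asList("users", "jsmith"), Arrays.asList("email"));
    Value value = Value.createValue("jsmith@example.com".getBytes());

    store.put(key, value, null, strict, 5, TimeUnit.SECONDS);
  }
}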

7.3 API
Incorporating Oracle NoSQL Database into applications is straightforward. APIs for basic
Create, Read, Update and Delete (CRUD) operations and a collection of iterators are
packaged in a single jar file. Applications can use the APIs from one or more client
processes that access a stand-alone Oracle NoSQL Database server process, alleviating the
need to set up multi-system configurations for initial development and testing.

7.4 CREATE, REMOVE, UPDATE, AND DELETE


Data create and update operations are provided by several put methods. The putIfAbsent
method implements creation while the putIfPresent method implements update. The put
method does both, adding a new key/value pair if the key is not currently present in the
database or updating the value if the key does exist. Updating a key/value pair generates a
new version of the pair, so the API also includes a conditional putIfVersion method that
allows applications to implement consistent readmodify-write semantics.
The delete and deleteIfVersion methods unconditionally and conditionally remove
key/value pairs from the database, respectively. Just as putIfVersion ensures read-modifywrite semantics, deleteIfVersion provides deletion of a specific version.
The get method retrieves items from the database.
The code sample below demonstrates the use of the various CRUD APIs. All code
samples assume that you have already opened an Oracle NoSQL Database, referenced by
the variable store.
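The original listing is not reproduced here, so the following is a hedged sketch of the CRUD calls using the standard oracle.kv Java API; the key paths and values are illustrative assumptions rather than the book's original sample.

import oracle.kv.KVStore;
import oracle.kv.Key;
import oracle.kv.Value;
import oracle.kv.ValueVersion;
import oracle.kv.Version;
import java.util.Arrays;

public class CrudExample {
  public static void crud(KVStore store) {
    Key key = Key.createKey(Arrays.asList("users", "jsmith"), Arrays.asList("email"));

    // Create: succeeds only if the key is not already present.
    store.putIfAbsent(key, Value.createValue("jsmith@example.com".getBytes()));

    // Update: succeeds only if the key is already present.
    store.putIfPresent(key, Value.createValue("john.smith@example.com".getBytes()));

    // Unconditional put: creates or updates regardless of prior state.
    store.put(key, Value.createValue("js@example.com".getBytes()));

    // Read: returns the value together with its version.
    ValueVersion vv = store.get(key);

    // Conditional update: applies only if the stored version still matches.
    Version updated =
        store.putIfVersion(key, Value.createValue("j.smith@example.com".getBytes()), vv.getVersion());

    // Conditional delete against the version we just wrote, then an unconditional delete.
    if (updated != null) {
      store.deleteIfVersion(key, updated);
    }
    store.delete(key);
  }
}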

7.5 ITERATION
In addition to basic CRUD operations, Oracle NoSQL Database supports two types of
iteration: unordered iteration over records and ordered iteration within a Major Key set.
In the case of unordered iteration over the entire store, the result is not transactional; the
iteration runs at an isolation level of read-committed, which means that the result set will
contain only key/value pairs that have been persistently written to the database, but there
are no guarantees of semantic consistency across key/value pairs.
The API supports both individual key/value returns using several storeIterator methods
and bulk key/value returns within a Major Key Path via the various multiGetIterator
methods. The example below demonstrates iterating over an entire store, returning each
key/value pair individually. Note that although the iterator returns only a single key/value
pair at a time, the storeIterator method takes a second parameter of batchSize, indicating
how many key/value pairs to fetch per network round trip.
This allows applications to simultaneously use network bandwidth efficiently, while
maintaining the simplicity of key-at-a-time access in the API.
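Again, the original example is not reproduced here; a hedged sketch of store-wide iteration with the oracle.kv API might look like the following, where the batch size and key handling are illustrative assumptions.

import oracle.kv.Direction;
import oracle.kv.KVStore;
import oracle.kv.KeyValueVersion;
import java.util.Iterator;

public class IterationExample {
  // Unordered, read-committed iteration over every key/value pair in the store,
  // fetching 100 pairs per network round trip.
  public static void scanStore(KVStore store) {
    Iterator<KeyValueVersion> it =
        store.storeIterator(Direction.UNORDERED, 100 /* batchSize */);
    while (it.hasNext()) {
      KeyValueVersion kvv = it.next();
      System.out.println(kvv.getKey() + " -> " + kvv.getValue());
    }
  }
}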
Using Oracle Big Data SQL, you can query data stored in a Hadoop cluster using the
complete SQL syntax. You can execute the most complex SQL SELECT statements
against data in Hadoop, either manually or using your existing applications, to tease out
the most significant insights.

7.6 BULK OPERATION API


In addition to providing single-record operations, Oracle NoSQL Database supports the
ability to bundle a collection of operations together using the execute method, providing
transactional semantics across multiple updates on records with the same Major Key Path.
For example, let's assume that we have the Major Key Path "Katana" from the previous
example, with several different Minor Key Paths containing attributes of the Katana, such
as length and year of construction. Imagine that we discover that we have an incorrect
length and year of construction currently in our store. We can update multiple records
atomically using a sequence of operations as shown below.
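As with the earlier samples, the original listing is not included here; the following hedged sketch shows the general shape of such a multi-operation update using the oracle.kv OperationFactory, with the key paths and replacement values being illustrative assumptions.

import oracle.kv.KVStore;
import oracle.kv.Key;
import oracle.kv.Operation;
import oracle.kv.OperationFactory;
import oracle.kv.Value;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BulkUpdateExample {
  // Atomically corrects two attributes that share the Major Key Path "Katana".
  public static void fixKatana(KVStore store) throws Exception {
    OperationFactory factory = store.getOperationFactory();
    List<Operation> ops = new ArrayList<Operation>();

    ops.add(factory.createPut(
        Key.createKey(Arrays.asList("Katana"), Arrays.asList("length")),
        Value.createValue("37".getBytes())));
    ops.add(factory.createPut(
        Key.createKey(Arrays.asList("Katana"), Arrays.asList("yearOfConstruction")),
        Value.createValue("1454".getBytes())));

    // Either both puts are applied or neither is.
    store.execute(ops);
  }
}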

7.7 ADMINISTRATION
Oracle NoSQL Database comes with an Administration Service, accessible from both a
command line interface and a web console. Using the service, administrators can
configure a database instance, start it, stop it, and monitor system performance, without
manually editing configuration files, writing shell scripts, or performing explicit database
operations.
The Administration Service is itself a highly available service, but consistent with the
Oracle NoSQL Database "No Single Point of Failure" philosophy, the ongoing operation
of an installation is not dependent upon the availability of the Administration Service.
Thus, both the database and the Administration Service remain available during
configuration changes.
In addition to facilitating configuration changes, the Administration Service also collects
and maintains performance statistics and logs important system events, providing online
monitoring and input to performance tuning.

7.8 ARCHITECTURE
We present the Oracle NoSQL Database architecture by following the execution of an
operation through the logical components of the system and then discussing how those
components map to actual hardware and software operation. We will create the key/value
pair "Katana" and "sword". The next figure depicts the method invocation
putIfAbsent("Katana", "sword").

The application issues the putIfAbsent method to the Client Driver (step 1). The client
driver hashes the key "Katana" to select one of a fixed number of partitions (step 2).
The number of partitions is fixed and set by an administrator at system configuration time
and is chosen to be significantly larger than the maximum number of storage nodes
expected in the store. In this example, our store contains 25 storage nodes, so we might
have configured the system to have 25,000 partitions. Each partition is assigned to a
particular replication group. The driver consults the partition table (step 3) to map the
partition number to a replication group.
A replication group consists of some (configurable) number of replication nodes. Every
replication group consists of the same number of replication nodes. The number of
replication nodes in a replication group dictates the number of failures from which the
system is resilient; a system with three nodes per group can withstand two failures while
continuing to service read requests. Its ability to withstand failures on writes is based upon
the configured durability policy. If the application does not require a majority of
participants to acknowledge a write, then the system can also withstand up to two failures
for writes. A five-node group can withstand up to four failures for reads and up to two
failures for writes, even if the application demands a durability policy requiring a majority
of sites to acknowledge a write operation.

Given a replication group, the Client Driver next consults the Replication Group State
Table (RGST) (step 4). For each replication group, the RGST contains information about
each replication node comprising the group (step 5). Based upon information in the RGST,
such as the identity of the master and the load on the various nodes in a replication group,
the Client Driver selects the node to which to send the request and forwards the request to
the appropriate node (step 6). In this case, since we are issuing a write operation, the
request must go to the master node.
The replication node then applies the operation. In the case of a putIfAbsent, if the key
exists, the operation has no effect and returns an error, indicating that the specified entry is
already present in the store. If the key does not exist, the replication node adds the
key/value pair to the store and then propagates the new key/value pair to the other nodes
in the replication group (step 7).

7.9 IMPLEMENTATION
An Oracle NoSQL Database installation consists of two major pieces: a Client Driver and
a collection of Storage Nodes. As shown in the next figure, the client driver implements the
partition map and the RGST, while storage nodes implement the replication nodes
comprising replication groups. In this section, we'll take a closer look at each of these
components.

7.9.1 Storage Nodes


A storage node (SN) is typically a physical machine with its own local persistent storage,
either disk or solid state, a CPU with one or more cores, memory, and an IP address. A
system with more storage nodes will provide greater aggregate throughput or storage
capacity than one with fewer nodes, and systems with a greater degree of replication in
replication groups can provide decreased request latency over installations with smaller
degrees of replication.
A Storage Node Agent (SNA) runs on each storage node, monitoring that node's behavior.
The SNA both receives configuration from and reports monitoring information to the
Administration Service, which interfaces to the Oracle NoSQL Database monitoring
dashboard. The SNA collects operational data from the storage node on an ongoing basis
and then delivers it to the Administration Service when asked for it.
A storage node serves one or more replication nodes. Each replication node belongs to a
single replication group. The nodes in a single replication group all serve the same data.
Each group has a designated master node that handles all data modification operations
(create, update, and delete). The other nodes are read-only replicas, but may assume the
role of master should the master node fail. A typical installation uses a replication factor of
three in the replication groups, to ensure that the system can survive at least two
simultaneous faults and still continue to service read operations. Applications requiring
greater or lesser reliability can adjust this parameter accordingly.
The next figure shows an installation with 30 replication groups (0-29). Each replication
group has a replication factor of 3 (one master and two replicas) spread across two data
centers. Note that we place two of the replication nodes in the larger of the two data
centers and the last replication node in the smaller one. This sort of arrangement might be
appropriate for an application that uses the larger data center for its primary data access,
maintaining the smaller data center in case of catastrophic failure of the primary data
center. The 30 replication groups are stored on 30 storage nodes, spread across the two
data centers.

Replication nodes support the Oracle NoSQL Database API via RMI calls from the client
and obtain data directly from or write data directly to the log-structured storage system,
which provides outstanding write performance, while maintaining index structures that
provide low-latency read performance as well. The Oracle NoSQL Database storage
engine has pioneered the use of log-structured storage in key/value databases since its initial
deployment in 2003 and has been proven in several open-source NoSQL solutions, such as
Dynamo, Voldemort, and GenieDB, as well as in Enterprise deployments.
Oracle NoSQL Database uses replication to ensure data availability in the case of failure.
Its single-master architecture requires that writes be applied at the master node and then
propagated to the replicas. In the case of failure of the master node, the nodes in a
replication group automatically hold a reliable election (using the Paxos protocol),
electing one of the remaining nodes to be the master. The new master then assumes write
responsibility.

7.9.2 Client Driver


The client driver is a Java jar file that exports the API to applications. In addition, the
client driver maintains a copy of the Topology and the Replication Group State Table
(RGST). The Topology efficiently maps keys to partitions and from partitions to
replication groups. For each replication group, it includes the host name of the storage
node hosting each replication node in the group, the service name associated with the
replication nodes, and the data center in which each storage node resides.
The client then uses the RGST for two primary purposes: identifying the master node of a
replication group, so that it can send write requests to the master, and load balancing
across all the nodes in a replication group for reads. Since the RGST is a critical shared
data structure, each client and replication node maintains its own copy, thus avoiding any
single point of failure. Both clients and replication nodes run a RequestDispatcher that uses
the RGST to (re)direct write requests to the master and read requests to the appropriate
member of a replication group.
The Topology is loaded during client or replication node initialization and can
subsequently be updated by the administrator if there are Topology changes. The RGST is
dynamic, requiring ongoing maintenance. Each replication node runs a thread, called the
Replication Node State Update thread, that is responsible for ongoing maintenance of the
RGST. The update thread, as well as the RequestDispatcher, opportunistically collect
information on remote replication nodes including the current state of the node in its
replication group, an indication of how up-to-date the node is, the time of the last
successful interaction with the node, the node's trailing average response time, and the
current length of its outstanding request queue. In addition, the update thread maintains
network connections and reestablishes broken ones. This maintenance is done outside the
RequestDispatcher's request/response cycle to minimize the impact of broken connections
on latency.

7.10 PERFORMANCE
We have experimented with various Oracle NoSQL Database configurations and present a
few performance results from the Yahoo! Cloud Serving Benchmark (YCSB), demonstrating
how the system scales with the number of nodes in the system. As with all performance
measurements, your mileage may vary. We applied a constant YCSB load per storage
node to configurations of varying sizes. Each storage node was a 2.93 GHz
Westmere 5670 dual-socket machine with 6 cores per socket and 24GB of memory. Each
machine had a 300GB local disk and ran RedHat 2.6.18-164.11.1.el5.crt1. At 300GB, the
disk size is the scale-limiting resource on each node, dictating the overall configuration, so
we configured each node to hold 100M records, with an average key size of 13 bytes and
data size of 1108 bytes.
The next graph shows the raw insert performance of Oracle NoSQL Database for
configurations ranging from a single replication group system with three nodes storing
100 million records to a system with 32 replication groups on 96 nodes storing 2.1 billion
records (the YCSB benchmark is limited to a maximum of 2.1 billion records). The graph
shows both the throughput in operations per second (blue line and left axis) and the
response time in milliseconds (red line and right axis). Throughput of the system scales
almost linearly as the database size and number of replication groups grows, with only a
modest increase in response time.

The next graph shows the throughput and response time for a workload of 50% reads and
50% updates. As the system grows in size (both data size and number of replication
groups), we see both the update and read latency decline, while throughput scales almost
linearly, delivering the scalability needed for today's demanding applications.

7.11 CONCLUSION
Oracle's NoSQL Database brings enterprise-quality storage and performance to the highly available, widely distributed NoSQL environment. Its commercially proven, write-optimized storage system delivers outstanding performance as well as robustness and
reliability, and its "No Single Point of Failure" design ensures that the system continues to
run and data remain available after any failure.
