Teradata Aster
Database Administration
Here is the proposed 3-day class schedule. Note that this may change based on student
interest in the topics.
Every student will have their own VMware cloud environment where they will have:
Aster Cluster
Servers
We will be using VMware images for our lab environment. Here's the big picture
1. From a Web browser (IE preferred): It is a good idea to bookmark this site now in your Web
browser, since you'll need to go back to it
every day of class
https://teradata.hostedtraining.com
Setting up a ReadyTech VMware environment is a custom process for each class. This
is a one-time setup and may require IT support from your company to complete
successfully. The following instructions are generic in nature and are not all-inclusive.
The goal is to get all those green checks. (OK if VNC is not found)
Next, click Continue to ActiveX Download or JAVA Download
Follow along with the Instructor as we set up the ReadyTech VMware environment.
Two things can stop you here. You must have Admin privileges to install this Plugin. If you repeatedly get a
RETRY message, log out and log back into Windows as someone with permissions to install apps.
Secondly, you may need to manually configure IP Binding as follows:
Username: student
Password: training
Housekeeping chores
Once all the images are started, we will SUSPEND HDP 2.1 since we won't need
this image until tomorrow. Click on the VMware Workstation icon, then right-click on
HDP 2.1, then select Power > SUSPEND
From Project Explorer tab, open 3)DBA folder > Testing 1-2-3 folder and double-click on
00-a-Aster-test-SQL. The code will appear in the upper right-hand pane
Highlight code, then right-click and select EXECUTE SELECTED TEXT
Confirm you get back a Result Set in the pane. Then repeat for 00-b-Teradata-test-SQL
3. If you get a 'Mule Exception Error', just go to the Data Source Explorer tab and Disconnect, then
Reconnect. Then run the query again
Run the simple SQL statement and confirm you get an Answer set.
Tomorrow when you re-connect, you will be back to the same exact place you left the
day before.
When you login the next day (via https://teradata.hostedtraining.com) you will
return right where you left off.
Did I mention you should Bookmark this URL right now ???
Mod 01
Aster Architecture
Next we'll talk about the Fault Tolerance functionality built into Aster, such as Replication Factor
and RAID. And we'll discuss sizing a Cluster based on your requirements.
Aster clients
Easy to manage:
Self-managing:
Active Queen
One or more Worker Nodes
One or more Loader Nodes
(Diagram: client Queries/Answers flow to the Queen, which coordinates the Worker Nodes)
1. Designed for Big Data using clusters of hardware boxes managed as a single database
2. Four independent (share-nothing) tiers for management, query processing, loading, and backup
The Queen is roughly analogous to a combination of the Teradata BYNET and Parsing Engine.
The Queen is a single point of failure, but recovery time can be minutes with a redundant Queen.
(Diagram: client Queries/Answers flow to the Queen; Worker-1 through Worker-6 are connected by the Intra Cluster Express (ICE))
Worker Nodes:
- Store data and interact with the Queen and other Workers
Using Loader nodes reduces the stream of data to the Queen during the load process. In
addition, if loading distributed tables, the Loader node also hashes those rows to the
correct vWorker.
(Diagram: Queries/Answers flow to the Queen; Data flows through the Loader Nodes to the Worker Nodes)
Loader Nodes:
- Run specialized software for high-speed bulk load of data
- Receive data from loader clients, segment the data into partitions, and move partitions directly to the appropriate Worker/vWorker, bypassing the Queen
- Fully parallel; take advantage of multi-core CPUs
Number of Workers = ( 3 x D ) / C
Where D is the amount of uncompressed user data and C is the storage capacity per Worker node (the multiplier of 3 accounts for RF=2 replication plus free-space headroom)
The different data types take up different amounts of storage, and have different ranges, as shown in the table below:
(If free space is < 30%, queries may fail and failover may not work)
Increasing the number of Workers with the same amount of data increases the ratio of
processing power to data, thus increasing performance, since you now have more
CPU processing power
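The sizing formula above can be sketched as a shell calculation. The values of D (uncompressed user data) and C (per-Worker capacity) below are purely hypothetical examples, not figures from the course:

```shell
# Hypothetical sizing example: D TB of uncompressed user data, C TB usable per Worker.
# The multiplier of 3 leaves room for RF=2 replication plus free-space headroom.
D=12   # example: 12 TB of uncompressed user data
C=2    # example: 2 TB of usable storage per Worker node
workers=$(( (3 * D) / C ))
echo "Workers needed: $workers"
```

With these example figures the cluster would need 18 Workers; doubling C to 4 TB per node would halve that.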
(Diagram: 1 Queen, 2 Worker Nodes, 0 Loaders)
In an 8-CPU Worker node, Aster comes preconfigured with 6 v-Workers per node
The replication of a given piece of data occurs in the background as soon as the transaction is
committed on the primary data. Txman ensures that the replica v-Worker gets a record of the
change.
Switch
(10 GbE or InfiniBand)
The slide shows general RAID recommendations for the Aster Data system node types.
The asterisk notes that Aster Data does not support Queens without redundant storage.
RAID is not really important for the Loader Nodes as data just passes through them.
The choice of RAID level determines how the disks will be combined
into a storage layer. It impacts both performance and reliability
through tradeoffs in capacity, cost, and risk
The Teradata Aster Big Analytics Appliance features a complete Aster Database, including the
patented Aster SQL-MapReduce framework, the Hortonworks Data Platform, and the Aster
MapReduce Analytics Portfolio with more than 50 analytical functions. It runs on proven
Teradata hardware, leverages the most current Intel processor technology, the SUSE Linux
operating system, and market-leading enterprise-class storage. It can be configured to store a
maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of
uncompressed user data for Hadoop.
1. Make sure that you are running Python version 2.5.5 or higher.
2. Run the Aster Database installer from a command shell on the queen. The command
will be similar to the following (the installer file name below is just an example; replace it with
the appropriate name for your operating system):
# ./AsterInstaller__6-00-xx.R_xxxxx.bin
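As a sanity check before step 2, the Python prerequisite from step 1 can be verified with a generic shell sketch. This check is not part of the installer, and the current version string below is a hardcoded example rather than live output of `python -V`:

```shell
# Compare an installed Python version against the 2.5.5 minimum using sort -V.
required="2.5.5"
current="2.7.18"   # example value; in practice capture this from `python -V`
lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
if [ "$lowest" = "$required" ]; then
  echo "Python $current meets the $required minimum"
else
  echo "Python $required or higher is required (found $current)"
fi
```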
Queen
Number of Workers
Number of Virtual Workers
Number of CPUs
Other parameters
ACT commands
SQL commands
SQL-MR commands
Putty login
Putty password
ACT logon
ACT password
ACT Provides SQL, SQL-
MR and ACT commands
SQL command
ACT command
Closes window
If you copy the ACT software onto your PC, after initially getting your RSA signature, you can then
log on to ACT via a Windows command prompt (i.e.: CMD) using the following syntax:
Whenever you are at a prompt that does not have =>, and you cannot
get back to a => prompt, you can escape back to the => prompt by
typing:
CTRL + C
When you have more than one page of results from a query, hit the SPACE bar
to go to the next page.
# export PAGER=less
(scroll down- scroll up via keyboard)
To break out of the final page and return to a prompt, hit the q key
To change from one database to another, instead of logging out (\q) of the old
database (prod) and logging back in to the new database
(i.e.: act -d retail_sales -U beehive -w beehive), you can use the code below:
\c retail_sales beehive
where \c = connect, retail_sales = new database, beehive = password
If you wish to paste code into ACT, just right-click at the ACT prompt
Note: In the hands-on labs, we will typically be using ACT when an ACT
command needs to be run. To get a list of ACT commands, type: \?
All ACT commands start with: \ (backslash)
BI tools:
- Tableau
- MicroStrategy
- SAS
- Cognos
Mod 02
Management Consoles -
Aster and Viewpoint
The Aster Management Console (AMC) is a web-based interface that lets you manage,
configure, and monitor Aster Database activity. The AMC provides administrators with an
authoritative view of the system and mechanisms for invoking administrative actions. AMC
provides developers and other users with insight into Aster Database activity, such as details on
currently executing SQL statements and statement histories.
Portlets enable users across an enterprise to customize tasks and display options to their
specific business needs. You can view current data, run queries, and make timely business
decisions, reducing the database administrator workload by allowing you to manage your
work independently. Portlets are added to a portal page from a menu. The Teradata
Viewpoint Administrator configures access to portlets based on your role.
Teradata Viewpoint portlets let you monitor not only Teradata systems, but also Aster and
Hadoop systems. In the future the AMC functionalities will all migrate to Teradata Viewpoint.
1 Dashboard: summarizes
cluster status and activity
Top
Processes
Nodes
As shown in the image above, the top of the Dashboard consists of the following items.
Clockwise from the upper left, they are:
Status Lamp: The status lamp lights green to show the cluster is running correctly. The
legend next to the status lamp shows the name of the cluster and its status, and the current queen
time, converted to browser-local time.
Resource Center: Click this link to open the Teradata Aster Resource Center, a web
page where you can find documentation, videos, and downloadable client software for
various operating systems.
Help Link: Click this link to open an HTML page containing information about the
AMC page you are currently viewing.
Login Details: In the top right of the window is the Teradata Aster logo. Directly below that
is your current, logged-in AMC user account name. Your user account determines what
actions you can perform in the AMC.
Status Summary: In the upper right of the Dashboard tab is the status box. This box is a
fixture not only of the Dashboard, but of all AMC windows. The status box notifies you of
important events in Aster Database.
Message Board: In the upper left of the Dashboard tab is the message board. Here, you and
other Aster Database administrators can post messages to all AMC users. To add a
message, click the pencil icon, type the message in the dialog box that appears, and click
OK to post it. All AMC users on this cluster will see your message immediately on the
message board in their AMC session.
The Processes section of the dashboard shows an overview of the current and recent jobs in
the cluster, as well as statistics including the Most Active Users rankings and the Process
Execution Time graph. The Active Applications box shows currently installed applications
that run on the cluster. The Processes section corresponds to the Processes tab, and clicking
most labels in this section will take you to the Processes tab.
Top of Dashboard
Processes
The green summary box lists the counts of nodes in your cluster and summarizes the status of
the nodes.
This section shows the following (click any label to show its details):
Queen(s): Count of queen nodes in this cluster. The Active count is the number of active
queen nodes in this cluster. This can only be 1 or zero. The Passive count is the number of
passive (backup) queens in this cluster.
Loader(s): Count of the loader nodes in the cluster.
Worker Nodes: Count of worker machines in the cluster. Note this is the count of worker
machines, not the count of virtual workers. Below this are listed the counts of Active, New,
Suspect, and Failed nodes.
The center panel of the Nodes section shows the current replication factor of Aster Database. If
the current replication factor is below your target replication factor (your Aster Database
administrator specified this when installing Aster Database), a warning appears at the top of
this section.
The Replication Factor section shows, first, the cluster-wide current replication factor. Below
that, it shows how many virtual workers are at RF=2 (these are workers that have a valid
backup worker stored in Aster Database) and how many are lacking a backup (RF=1).
Teradata Aster's recommended setting is to maintain the cluster at RF=2.
The bottom of this section is the Hardware Statistics panel, showing current and recent CPU
usage, memory usage, network bandwidth usage, and disk I/O usage. Click the Nodes:
Hardware Stats tab for more hardware statistics.
The right side of the Nodes panel of the AMC Dashboard shows the Data Payload Panel. This
panel provides a cluster-wide view of the data capacity of your cluster and shows how much
disk space is currently being occupied by data and other system files.
(Callouts: # of Nodes (not counting Backup Nodes); Replication Factor; Hardware statistics; Cluster-wide disk capacity/usage)
Each process is a SQL command or a block of SQL statements (BEGIN ... END). The
statements can contain SQL-MapReduce functions.
The Processes list is useful for monitoring activity on your cluster, checking on the progress of
queries you have submitted, and finding performance issues such as statements that take
much longer to run than others.
To display processes:
2 To filter the display of processes in the Query Timeline, use the Change Filter button.
3 To display summary information about a process, move the mouse over the process ID.
4 To display detailed information about a connected process, click its ID.
Sometimes, you may need to cancel a running process on the cluster. For example, suppose a
user runs the query SELECT * from events. If the events table is large, the query could
easily take far too long to complete. Another operation that can be time-consuming is a
CREATE TABLE that inserts a large number of rows.
Some SQL statements are not cancellable. These are transaction-related SQL statements,
such as COMMIT, ROLLBACK, CLOSE cursor, and COPY-in SQL.
To cancel a running process, do one of the following:
In the Processes tab (Processes > Processes), if a Cancel icon is displayed for a process, click
the icon in the Cancel column (right-most column), then click OK when prompted.
In the Process Details tab, you can cancel the statement by clicking the Cancel Process
button.
Either action will place the process in Cancelling mode, which indicates that the cancellation
request has been received. Statement cancellation in Aster Database is an asynchronous,
best-effort operation. While executing a statement, the Aster Database back-end checks
periodically to see whether a cancellation request has been issued. If requested, the back-end
acknowledges the cancellation and triggers a best-effort service to cancel the ongoing
execution.
Processes
Tab/Page
Filter
Query /
process
details
A Process Detail tab for that process is displayed. In addition to the columns displayed in the
process list, this tab shows the following additional information.
Statement
Execution
steps
To see more detail, click the SHOW ALL STEPS hotlink (not shown)
Can find most
expensive v-Worker
2. To filter the display of processes in the Query Timeline, use the Change Filter button.
3. To display details about a process, move the mouse over it. A popup message appears
with additional information.
Session status
User
Database
Login Time
Login Duration
Node Failures
The queen node in Aster Database actively monitors all nodes participating in the system. If it
observes a node behaving in an unexpected or inappropriate manner, it will consider that node to
be suspicious and change its status to Suspect, and the node will appear yellow in the AMC.
A Suspect node status does not necessarily imply that the node has experienced a failure, only
that the queen is examining it in order to determine whether one has occurred. If the node
continues to demonstrate suspicious behavior while in Suspect status, the queen will consider
it to be Failed and change its status accordingly.
The presence of a Suspect node does not necessarily imply a decrease in performance, but it
typically means the cluster has fallen from RF=2 to RF=1, meaning that one or more vworkers
may not have a backup vworker. While a node is in the Suspect state, the queen monitors the
node's behavior and only considers it to be Failed if it continues to demonstrate such behavior.
If the behavior that was originally observed was a one-time event (e.g. a transient network
error between the queen and the node), the node will remain an active participant while being
considered Suspect.
In Aster Database, the queen will not automatically transition a node from Suspect to Active.
Instead, a node will be returned to Active status on the next activation or load balancing activity.
If the system continues to operate for a reasonable length of time after the node was originally
marked as Suspect, Teradata Aster recommends that the node be returned to Active status by
clicking the Balance Data button in the AMC. Allowing a node that is performing normally
(e.g. one that has continued to operate for at least 24 hours without transitioning to Failed) to
remain in a Suspect status for a lengthy period of time increases the chance that the node will
eventually be considered Failed, triggered by an event such as an unrelated transient error.
To see detailed descriptions of how and to what extent the disks are being used on individual
nodes, click the Nodes tab and click the Node Overview tab. Disk usage details appear in these
columns:
Uncompressed Active Data Size: This column shows the amount of data currently stored on
the node. The term active data refers to the raw, uncompressed data size before it is
stored on disk.
Storage (GB): This column shows a graph of the current usage of the node's disk, by
type of data stored (user data, replica data, and free space), and lists the amount of disk
space currently occupied by user and replica data, expressed in GB. This shows the actual
on-disk space that is used and free on the node. Hover your mouse cursor over the graph to
see these statistics for the node:
User Data is the amount of space occupied by primary copies of your data on the node.
Replica Data is the amount of space occupied by the replica copies of your data on the
node.
System represents the amount of the node's disk space consumed by operating system
files, Aster Database software files, and other files that do not contain your Aster
Database-stored data.
Available represents the amount of unused storage currently available on the node.
Total Space shows the total amount of disk space on the node.
% Full: This column indicates how much space has been used on this node. This graph
turns orange to indicate that more than 70% of the node's disk space has been used, and it
turns red to indicate that more than 90% has been used. If this graph is displayed in
orange or red, you must take action to free up disk space by calling Teradata Support.
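The orange/red thresholds can be made concrete with a small illustrative calculation; the usage numbers below are hypothetical:

```shell
# Mirror the AMC % Full coloring: orange above 70%, red above 90%.
used_gb=780
total_gb=1000
pct=$(( used_gb * 100 / total_gb ))
if [ "$pct" -gt 90 ]; then status="red"
elif [ "$pct" -gt 70 ]; then status="orange"
else status="normal"; fi
echo "${pct}% full -> $status"
```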
Replication Factor
The replication factor (RF) is the number of copies of your data that are stored in Aster
Database to provide tolerance against failures. Maintaining an RF of two ensures Aster
Database is resilient to node and queen failures. While you can run Aster Database at an RF of
one, Teradata Aster strongly recommends that you run with an RF of two. During operation
of the cluster, hardware failures can cause the RF to fall below two, at which point you must
take action to restore the RF.
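A side effect of RF=2 worth keeping in mind when sizing: each piece of data is stored twice, so usable capacity is roughly half the raw disk. A rough sketch with hypothetical numbers:

```shell
# With RF=2 every vworker's data has one replica, so divide raw capacity by the RF.
raw_tb=100   # example: 100 TB of raw cluster storage
rf=2
usable_tb=$(( raw_tb / rf ))
echo "Usable capacity at RF=${rf}: ${usable_tb} TB (before free-space headroom)"
```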
Overview of use of
database nodes in
a rack
Not Bal Prepared Failed Suspect
Summary of virtual
workers on node
Prepared
Replication Factor
Listing of each Node and their vWorkers
A Failed node means its v-Workers are not participating in the Cluster, so this screen shot may not be refreshed.
The Balance Process (see upcoming slide) has not been run, since 1 Worker shown has all Primary v-Workers (no Secondaries).
A Prepared node means either a New Worker with no v-Workers, or an existing Worker whose v-Workers are all Secondary.
A Suspect node typically means some v-Workers are Suspect but still participate in queries.
1. Cluster Management
2. Events
3. Executables
4. Backup
5. Configuration
6. Logs
1. Soft Restart* - Boots the Aster software only (* Queries cannot be processed during these activities)
2. Hard Restart* - Boots the operating system and then the Aster software
3. Activate Cluster* - Run when you add a New node or an existing Worker reboots
4. Balance Data - Secondary v-Workers are copied. Goal: ensure Primary/Secondary vWorkers are on different Workers
5. Balance Process* - Optimally locates vWorkers. Decides Primary/Secondary v-Workers (i.e.:
which v-Workers will be Active/Passive). Goal is an even v-Worker distribution across Workers
Can bounce individual nodes:
When a node is first added to Aster Database, or registered, it is considered to be a New node.
At this point, Aster Database is aware of the node's existence, but the node has not yet
contacted the queen in order to be prepared, or loaded with the Aster Database software.
Nodes are also shown as New immediately following a restart of Aster Database, before their
state can be determined.
Preparing
After the node contacts the queen to be prepared, its status changes to Preparing. While in this
status, it is loading the Aster Database software and preparing itself to become a participant in
Aster Database.
Prepared
Once the node completes preparation, its status becomes Prepared. At this point, the node is
ready to be incorporated into Aster Database so that it can host vworkers.
Active
Active and Passive are the acceptable states for nodes in a running cluster. Active nodes are
nodes that are available immediately to process queries in Aster Database.
Passive
Active and Passive are the acceptable states for nodes in a running cluster. A Passive node is a
standby that holds frequently updated copies of vworkers' data and can later be made Active to
take on query processing work as needed.
Suspect
Suspect nodes are nodes that have exhibited unusual behavior and are participating in the
Aster Database in a limited capacity while being investigated for potential failures by the
queen.
Failed
Failed nodes are nodes that are no longer participating in the Aster Database.
The Event Engine resides on the queen. It monitors and generates notification on states and
activities on each node. You create subscriptions to specific types of events in order to be
notified when they occur. These subscriptions are created through ncli, and may be viewed in
ncli or the AMC. When certain events occur, Aster Database will perform a remediation, such
as a soft shutdown automatically.
The Event Engine uses a subscription model to send notifications within the system
You can configure separate subscriptions to be notified of events based on various filters.
Some examples of filters you can create include:
To assist Administrators in detecting and managing situations where the cluster is running out
of disk space, a node is suspect or failed, a user is initiating actions in the AMC, or replication
factor issues exist, Aster Database provides the following subscribable events:
Partial listing
The events section provides commands to view and configure event subscriptions in the Aster
Database Event Engine. See Monitor Events with the Event Engine on page 140 for
information about event subscriptions.
When you set up event subscriptions, you're setting up a subscription to be notified via SNMP
or email whenever events of a particular type occur. The ncli is the only way to add and
manage subscriptions.
The commands in the events section will run against the queen, even if executed from a
worker node. The syntax to run a command in the events section looks like this example:
$ ncli events listsubscriptions
Event Subscriptions
+--------+------------+--------------+--------------+----------------
| Sub ID | Notif Type | Min Priority | Min Severity | Component Type
+--------+------------+--------------+--------------+----------------
| 9 | snmp | High | FATAL |
| 8 | snmp | Medium | ERROR |
| 7 | snmp | High | FATAL |
| 6 | snmp | High | FATAL |
+--------+------------+--------------+--------------+----------------
4 rows
table continued...
+-----------+---------------+----------------------+
| Event IDs | Throttle Secs | Notification Details |
+-----------+---------------+----------------------+
| ST0001 | 0 | manager=10.60.11.5 |
| SY0002 | 0 | manager=10.60.11.5 |
| SY0001 | 0 | manager=10.60.11.5 |
| ST0002 | 0 | manager=10.60.11.5 |
+-----------+---------------+----------------------+
To add a new event subscription, issue a command like:
$ ncli events addsubscription --eventIds ST0003 --type snmp --manager
10.60.11.5 --minPriority high --minSeverity fatal
This displays the event subscription that was added, returning a result like:
Event Subscriptions
+--------+------------+--------------+--------------+----------------+-----------+
| Sub ID | Notif Type | Min Priority | Min Severity | Component Type | Event IDs |
+--------+------------+--------------+--------------+----------------+-----------+
| 5 | snmp | High | FATAL | | ST0003 |
+--------+------------+--------------+--------------+----------------+-----------+
Events are configured from a command line interface using NCLI commands
To view all existing Subscriptions (and their subid): ncli events listsubscriptions
To view an existing Subscription: ncli events listsubscriptions <subid>
To delete an existing Subscription: ncli events deletesubscription <subid>
To create an email subscription when a CANCEL occurs, you would type the following:
The above code will send an E-mail to aster@freemail.com when a User attempts to cancel
a process from the AMC by clicking Cancel from the Processes list
4. From ReadyTech Desktop, under Apps caption, open Thunderbird Mail. Click on
Get Mail and you should receive an E-mail concerning the Cancelled query
ncli events listsubscriptions -- to view events
ncli events deletesubscription <sub id> -- to delete event
Aster provides five out-of-the-box scripts, which install automatically with a clean
install or upon upgrading. These scripts perform cluster administration tasks, such as
finding data skew and determining table information such as size. These scripts
cannot be modified or deleted, but they serve as a useful reference when creating
your own custom scripts. Many of the scripts cascade, which means that if they are
acting on a parent table, they will automatically act on all of its descendants as well.
Run the Table Info script on the clicks table using the following parameters, then go to the
Executables > Jobs tab to view the result via the Output hotlink. Note it may take a few
minutes to run
To make the Cluster aware of the Backup Manager, you add the IP address of
the Backup Manager here
When Backup Manager starts a Backup, those Backups will be recorded here
As administrator, you create rules that group database operations into workloads using criteria
such as:
Physical backups
ETL operations
All queries generated by members of the Sales department
Reports against the table daily_summary
Administrative operations
You create rules, known as service classes, to assign to each workload with:
a priority - a first-level control on admission to the queue for processing and resource
usage (CPU and disk I/O),
a weight - a second-level control on admission to the queue and resource usage, and
soft and hard memory limits that control the memory to be allocated for the workload.
These rules instruct Aster Database to run each type of job with the right level of urgency.
Based on your rules, Aster Database assigns an initial level of importance to each job and, if
warranted, re-ranks the job while it is running. For example, your rules can ensure high
resource allocation for a newly added query of a given type but throttle down resources for
that query if it runs so long that it is suspected of being a runaway query.
The Admin > Configuration > Workload panel lets you control Aster Database's workload
management rules to ensure proper allocation of the cluster's computing resources. In this
panel, you create the rules that allow Aster Database to identify higher- and lower-
importance jobs and run them with the right level of urgency.
Workloads will be covered in a later module
1 Log into the AMC as an amc_admin user. This is typically the db_superuser account in a
new Aster Database installation.
3 In the Roles & Privileges tab, the available AMC Roles (amc_admin, process_admin,
process_viewer, process_runner, node_admin, and node_viewer) are listed on the horizontal
axis of the table, and the individual privileges are listed on the vertical axis. Each privilege
is a combination of a section of the AMC and an action the user can perform there.
Roles and Privileges will be covered in a later module
You can set up host entries on all the nodes of an Aster Database cluster by editing the /etc/
hosts file on each Aster Database node manually (for UMOS clusters) or through the AMC
(for AMOS and UMOS clusters) by performing the following steps.
4. Create a host entry for each host you want to add by clicking on New Host Entry and filling
in the web form with its IP address and alias.
5. When you are finished adding entries for each node, click Save and Apply Changes.
6. Your changes will be written to the hosts file on each Aster Database node.
You can set up host entries on all the nodes of an Aster Database cluster by editing the
/etc/hosts file on each Aster Database node manually or through the AMC
HOSTS are commonly used to point to other Databases (ie: Teradata, Hadoop) so when
using the Connectors to these databases, the host names can be resolved to an IP address
You assign Aster Database functions to their own subnets by using the AMC Network settings.
To view and/or edit the Network settings:
1. Select the Admin tab, and then choose Configuration and Network from the drop-down
options.
2. The AMC Network Overview screen will appear, showing each node and its current settings.
For each node, you can assign an IP address or NIC for each of the following functions.
Note that if you do not assign an IP address or NIC for backups or loads, the default
(queries) setting will be used.
3. Click the Configure button on the far right hand side for the node whose network settings
you want to configure. In the network configuration window for the node, you will see
three tabs for AMOS: Current State, Edit Configuration, and Network Assignments. For UMOS,
the Edit Configuration tab does not appear.
Allows you to test Network Connectivity of your Backup and Loader nodes
You need to fill out the Server name (host name or IP address) and which version of Hadoop you
are using from a particular vendor (i.e.: Hortonworks or Cloudera).
You must configure the Aster SQL-H connector within the AMC first, or you will get the
following error message when using the connector
When an issue arises on a cluster, one of the first steps in finding the cause is to retrieve the
relevant log files. Aster Database is made up of a large array of distinct services, and it
produces more than 60 different logs spread across every node in the cluster. The AMC
provides an easy way for you to deal with all these different logs by creating diagnostic log
bundles. A diagnostic log bundle is a compressed tarball containing data used to determine the
system context and diagnose Aster Database issues. This data may come from system logs on
the queen and subordinate nodes (worker and loader).
(Source: Teradata Aster Big Analytics Appliance 3H, Aster Database Administrator Guide, page 213)
By using diagnostic log bundles, you can more easily send information to Teradata Aster tech
support for analysis, reducing the time and effort required to diagnose system problems.
Only AMC users with administrative privileges can create, download, and send diagnostic log
Bundles.
You can also add more Backup nodes although Backup nodes are not considered part of the
cluster. There is a separate process for doing this which is outside the AMC.
Prerequisites
Warning! If you wish to re-deploy a node that previously served as an Aster Database node,
make sure the machine does not contain any data you need, since you must delete all its Aster-stored
data before you re-deploy it. As a guideline, if your cluster is currently running at RF=2
(after removing the node that you will re-deploy), then it is probably safe to delete the node's
data as explained below.
1. Ensured that the operating system and any required patches are installed on the node;
4. If the prospective node machine has been previously used as an Aster Database node,
then you may wish to clean its file system. Alternatively, you can leave the old data in
place and tick the Clean Node checkbox to allow Aster Database to delete the old data
when adding the machine as a new node.
Partition splitting is an Aster Database feature that helps you add vworkers so that you can
maintain an optimal ratio of CPU cores to vworkers as your cluster grows.
To scale out your cluster, you add worker nodes. As you add worker nodes to the cluster, Aster Database
does not automatically increase the number of vworkers. In other words, the number of vworkers stays
constant as you add worker nodes (machines). This means that, as you add nodes to the cluster, the ratio
of CPU cores to vworkers will increase, and eventually your CPUs may become under-utilized. If this
happens, you can improve performance by increasing the number of vworkers (also known as splitting
partitions).
Teradata Aster recommends that you manage your cluster so that you have approximately two
CPU cores per vworker. For example, an 8-core node should typically host 4 to 6 vworkers. In order to
avoid having to split partitions, you may elect to set up your cluster with 6 vworkers per 8-core node and
then add nodes as your data grows, until your ratio falls below 4 vworkers per 8-core node. Once the ratio
falls below this point, it's a good idea to split partitions to make better use of the processing power of
your nodes.
Worker Nodes: 24

To increase the parallel processing power of your cluster, you add (Primary) v-Workers. This
is called partition splitting. It is done at the queen from a Unix command prompt. You
must have a quiet system, so set Concurrency=0 so no one can log on.

Once you increase the number of v-Workers, the data is re-hashed and re-shuffled among all v-Workers,
which requires an exclusive lock on the cluster. Afterwards, set Concurrency=100.

A count of 24 means there are 24 PRI and 24 SEC v-Workers in the cluster. The queen's PRI-SEC v-Worker pair is not counted.
System Health
Query Monitor
Capacity Heatmap
Metrics Graph/Analysis
Space Usage
Overview:
Mod 03
Databases and Schemas
Cluster Aster-Queen
Database - beehive
Schema - public
Tables/Views
Database - <name>
Schema - <xxx>
<Tables/Views>
2 Users/passwords
db_superuser/db_superuser - Can access all database objects
beehive/beehive - Has no admin rights, but owns beehive database
1 Database
beehive - Default database in Aster cluster
2 Schemas
public - All users have read/write access to public schema
nc_system - Houses data dictionary (system tables) for that database
To create a database, you must be a superuser or have the special db_admin privilege.
When you create a database, no other users have the right to use it. You must manage user
privileges as follows:
To grant users the right to use the new database, you must GRANT at least the CONNECT
privilege on the database to the users or roles who will use it.
To grant users the right to create tables in the new database, you must grant them at least
the CREATE privilege on one of the schemas in the database.
The user who created the database is the owner of the new database. It is solely the privilege of
the owner of a database to drop it later. Removing a database removes all the objects (e.g.
tables) within it, even if the individual object has a different owner than the database owner.
You need to be connected to the database server to execute the CREATE DATABASE
command. The first database is always created when Aster Database is initialized. This default
database is called beehive. To create the first ordinary database, you can connect to
beehive.
A database name may contain only alphanumeric characters and the underscore character
(A-Z, a-z, 0-9, and _).
The name must not start with the prefix "_bee" which is reserved for use in naming Aster
Database system objects.
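The steps above can be sketched in SQL. The database and user names (webstore, analyst1) are hypothetical, and the GRANT forms follow the PostgreSQL-style syntax Aster inherits, so verify them against your release's SQL reference:

```sql
-- As a superuser (or a user with the db_admin privilege), while connected to beehive:
CREATE DATABASE webstore;

-- Grant a user the right to connect to the new database...
GRANT CONNECT ON DATABASE webstore TO analyst1;

-- ...and the right to create tables in one of its schemas:
GRANT CREATE ON SCHEMA public TO analyst1;
```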
Drop a Database
Only the owner of the database can drop a database. Dropping a database removes all objects
contained within the database. The destruction of a database cannot be undone.
Out of the box, 2 different Aster databases within same Cluster cannot JOIN tables by
default. You will need AnyDatabase2Aster connector to do so
By default, it is possible for Aster to JOIN to both Teradata and Hadoop tables/views
Note that although you will have created one logical database, you actually create as many
physical databases as you have Primary v-Workers.
Note there are no space limitations on a database. It can consume as much hard drive space as is
available on the Workers.
It is also important to note there is no hierarchy to databases. In other words, a new database is
not created from a Parent database.
In most cases, an Aster DBA will create only 1 additional logical database (beehive being the
first database) and use Schemas to enforce security and permissions.
Note that if you want to join tables in different schemas in different databases in the same cluster, it is
possible using the AnyDatabase2Aster connector. However, for performance reasons, this should
be used as a last resort.
Users can join tables from one schema with tables in another
schema (within same Aster database) if they have proper privileges
for the schemas/tables
2 different schemas
From ACT (since you are pointed to the beehive database, you can only see tables in this database):

beehive=> SELECT * FROM sales_schema.sales_fact f
          INNER JOIN public.calendar cal
          ON cal.cal_date = f.sales_date;
You can specify a schema search path for a transaction or session (using SET), or as the default
for a user (using ALTER USER).
The first schema named in the search path is called the current schema. Aside from being the
first schema searched, it is also the schema in which new tables will be created if the CREATE
TABLE command does not specify a schema name.
To put a new schema in the search path, use the SET search_path command, as shown in
this example:
CREATE SCHEMA myschema;
CREATE TABLE myschema.mytable (
...
);
SET search_path TO myschema,public;
After doing this, we can access the table without schema qualification:
For unqualified queries, the schema search path determines the schema accessed. The
default schema search path for all users is the public schema.

When using ACT commands, \dt (display tables) shows only tables in the first schema of the
SET search_path command. To view tables in other schemas, use the \dt <SCHEMA>.* command.
2. From ACT, type this command to see which schema the system will
use if you don't specify one in your query:
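A minimal ACT session for inspecting and changing the search path. SHOW search_path is the underlying PostgreSQL-style command that Aster's ACT client inherits, so confirm it against your release:

```sql
beehive=> SHOW search_path;                      -- which schema unqualified names resolve to
beehive=> SET search_path TO sales_schema, public;
beehive=> \dt sales_schema.*                     -- list tables in a schema other than the first
```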
Goal:
Create new database
(retail_sales) and 4 schemas
(meta, views, stage, prod)
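One way to sketch the lab goal above, assuming a superuser session in ACT; note that you must reconnect to the new database before creating schemas inside it:

```sql
CREATE DATABASE retail_sales;
-- reconnect to the new database first, e.g. \c retail_sales
CREATE SCHEMA meta;
CREATE SCHEMA views;
CREATE SCHEMA stage;
CREATE SCHEMA prod;
```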
Mod 4
Data Modeling
To achieve these results, though, you must take the time to properly design your data model, so
that it suits the characteristics of your data and your queries. This does not mean you are
designing your data model around pre-canned queries! Instead, it means that you are providing
Aster Database with clues as to which data should be collocated with which other data, and
which data is likely to be queried more often, or used more often for joining or filtering results.
Project Initiation
Activity Modeling
Activity Volume
Modeling Usage Extended Logical Data Model (ELDM)
Frequency
Integrity
Physical
Modeling Physical Database Design & Creation CREATE TABLE, Indexes
Production Release
The following form is used for customer purchases. Based on the attributes
(columns) below, how many entities (tables) do we think we will initially need?
Customer ID __________
Last Name ___________ First Name __________ Middle Initial ___
Gender _____ Born days ago ______ City _______
Date | Product ID | Product Name | Product Category | Retail Price | Unit Cost | Sales Quantity | Discount Amt | Store Id | Store Name | Region Id | Store Sqft | Basket Id
(blank rows for five sample purchases)
Primary Key
One or more attributes that uniquely identifies an entity
nCluster does not require Primary Keys
Foreign Key
An attribute in common between entities making a relationship
nCluster uses the link but does not enforce referential integrity
Constraint Options
Null / Not Null (Not Null eliminates the instances of null values)
Check Values (Range constraints for logically partition tables)
Default Values (Default values set column to predefined value)
1. Do a CREATE TABLE statement with the SERIAL data type and the GLOBAL argument on a
DIMENSION table. This ensures all rows inserted into the table will have a unique
value.
2. If you need to put the contents of the DIMENSION table into a FACT table, do an
INSERT...SELECT that copies the contents of the DIMENSION
table into the new FACT table.
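A sketch of the two steps above. The SERIAL(GLOBAL) spelling is an assumption based on the description here (SERIAL with a GLOBAL argument), so check your Aster release's CREATE TABLE reference; the table and column names are illustrative:

```sql
-- Step 1: dimension table whose key is generated uniquely cluster-wide
CREATE TABLE store_dim (
    store_id   SERIAL(GLOBAL),   -- assumed syntax for SERIAL with the GLOBAL argument
    store_name VARCHAR(100)
)
DISTRIBUTE BY REPLICATION;

-- Step 2: copy the dimension contents into a fact table with INSERT...SELECT
INSERT INTO sales_fact (store_id, store_name)
SELECT store_id, store_name FROM store_dim;
```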
Use the star schema model (also known as dimensionalizing data) to put your most frequently
read columns into skinny fact tables, and to relegate the less frequently read columns to separate
dimension tables. The goal is to make your fact tables skinny. This lets your queries run faster,
because your queries don't have to read the less relevant dimension information. Put more
precisely, a query can scan more rows at a time, since each row is smaller, and this makes
lookups faster.
Now we need to decide which Schema model to use for our tables (Sales, Customer, Product, Store)
Star Schema
Classifies the attributes of an event into facts (measured numeric/time data), and
descriptive dimension attributes (product ID)
Care is taken to minimize the number and size of attributes in order to constrain the
overall table size and maintain performance
Star schemas are designed to optimize user ease-of-use and retrieval performance by
minimizing the number of tables to join to materialize a transaction
Snowflake schema
The snowflake schema is similar to the star schema. However, in the snowflake
schema, dimensions are normalized into multiple related tables, whereas the star
schema's dimensions are normalized with each dimension within a single table
Star and Snowflake Schemas are optimal for big data analytics:
The star schema approach is commonly used in data warehousing for databases that contain
large amounts of data. The star schema is efficient for large data sets because it relies on a
single, narrow table (called the fact table) that avoids storing descriptive values and repeated
values. Such columns are instead moved to helper tables called dimension tables. Time-sensitive
queries are run against the fact table only and can run very quickly because the
narrowness of the table allows fast scanning. Queries that have to join against dimension tables take
slightly longer to run (but note that Aster Database supports a number of
techniques, outlined earlier in this document, for making these joins run fast).
You can apply dimensionalization with verticalization to speed up query performance by further reducing
the size of tables that must be scanned to find the desired rows. If you're running SQL-MapReduce
functions, a dimensionalized schema provides a much smaller memory footprint for SQL-MapReduce
operations that do not need the dimension data, because in most cases these operations need
only deal with an integer ID number, rather than the dimension data.
For more information and examples on how to dimensionalize data, refer to The Data
Warehouse Toolkit: The Complete Guide to Dimensional Modeling by Ralph Kimball and Margy
Ross.
Use a Star schema to make your Fact table(s) skinny. Skinny tables let your
queries run faster because they have less data to read. With a properly
dimensionalized schema, queries can run up to 20x faster. Alternative database
models include Snowflake and Third Normal Form.
Fact tables are usually very large (e.g. millions or billions of rows). These tables contain two
types of columns: the columns that contain facts and the columns that refer to the dimension
tables. Fact tables require a distribution key column to be declared.
FACT tables hold the metric values recorded for a specific event. Which columns would you
choose? We have the following columns:

Customer_id: 100
Last_name: Nimitz
First_name: Juli
Middle_initial: A
Gender: F
Born_days_ago: 20075
City_id: 456
Date: 2012-07-04
Store_id: 567
Store_name: NickNack
Region_id: 5
Store_Sq_ft: 50000
Product_id: 692
Product_name: Widget
Product_category: Home
Retail_price: 10.50
Unit_cost: 4.00
Basket_id: 125
Sales_quantity: 3
Discount_amount: .10
Fact Tables
Fact tables are usually very large (i.e. millions or billions of rows), with each row containing a
set of dimension values and a set of measures. These tables contain two types of columns: the
columns that contain the facts (the raw data you're tracking, such as units sold or pages
clicked) and the columns that are foreign keys to the dimension tables. (Note that Aster
Database does not enforce referential constraints; foreign keys are used mainly for joining
tables.) In Aster Database you must declare a distribution key column for each fact table using
DISTRIBUTE BY HASH. The distribution key tells Aster Database how to divide up the table's
contents so that they can be physically distributed across the vworkers. Distributing the
data in this way is called distribution or physical distribution in Aster Database. (Historically, it
was referred to as physical partitioning.) When creating a table, if DISTRIBUTE BY
HASH is used, the table is a FACT table by default, but may optionally be specified as
a DIMENSION table.
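A sketch of a fact-table definition per the description above; the table and column names are illustrative, not from a real schema:

```sql
CREATE TABLE sales_fact (
    sales_date     DATE,
    product_id     INTEGER,      -- foreign key to a product dimension (not enforced)
    store_id       INTEGER,
    customer_id    INTEGER,
    sales_quantity INTEGER,
    discount_amt   NUMERIC(8,2)
)
DISTRIBUTE BY HASH(customer_id);  -- the required distribution key for a fact table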
Dimension Tables
Dimension tables are usually smaller than fact tables (that is, a typical dimension table holds
only thousands of rows, rather than millions). Each dimension table specifies a set of known
values for a particular dimension. For example, a customers table is a dimension table that
contains detailed information about each customer (for example, customer_id, name,
address, and phone_number). Most dimension tables are replicated in Aster Database, meaning
that a copy of the table exists on every node in the cluster. Having a local copy on every node
makes it more likely that joins can run locally on each node, providing faster query results. To
declare your table as a replicated dimension table, include the DISTRIBUTE BY
REPLICATION clause in the CREATE TABLE statement.
Optionally, you can declare your dimension table as distributed, by declaring a distribution
key column using DISTRIBUTE BY HASH. In that case the table will be distributed across
nodes using the distribution key specified, rather than replicated on every node. There may be
a couple of advantages to having a distributed dimension table:
If the fact table is distributed on the column that will be used to perform joins to a
dimension table, it can be very practical to distribute that dimension table on the same
column, too. Joins between the fact and dimension tables on their respective distribution
key fields will be fast because the lookups will be local.
Additions and updates to a distributed dimension table will be faster because those
changes only need to be made in one place, rather than to every instance, as is the case
with a replicated table.
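The two dimension-table options above can be sketched as follows; the table names are hypothetical:

```sql
-- Replicated dimension: a full copy on every node, good for small tables;
-- no distribution key is declared
CREATE TABLE product_dim (
    product_id   INTEGER,
    product_name VARCHAR(100)
)
DISTRIBUTE BY REPLICATION;

-- Distributed dimension: hashed on the same column as the fact table it joins,
-- so fact-to-dimension joins on customer_id stay local to each v-Worker
CREATE TABLE customer_dim (
    customer_id INTEGER,
    last_name   VARCHAR(50)
)
DISTRIBUTE BY HASH(customer_id);
```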
In certain queries where only a small portion of data (in bytes) is retrieved from the
table, your queries will probably run faster if you create the table with a
columnar storage layout rather than the traditional row-wise layout.
The obvious benefit of columnar tables is the fact that, for a given query, only the required
columns will be fetched from disk. In some situations, this can substantially reduce the
amount of I/O required. So if a majority of your queries on a table access a low percentage of
its columns, then it may be a good candidate for columnar.
The performance gains you'll see will depend on the cost associated with transposing values
from a column-wise layout to a row-wise layout during retrieval. This cost goes up with the
total width of columns selected. As the total width of the selected columns increases, the cost
of I/O and transposition can exceed that of straight selection from a row table. Note that we
say column width, not number of columns.
For example, consider a table that consists of five columns, of which four are integer-typed
columns, and the other is a very wide, varchar-typed column. If most of your queries select
only the integer columns (and even if they select all of the integer columns), then it makes
sense to have the table be a columnar table. Doing so allows Aster Database to store the wide,
varchar values separately from the other columns, so that queries can load the other columns
without paying the price of scanning over the wide values.
Note that columnar storage tables are optimized for read operations, not for update
operations. They have more overhead on INSERT/UPDATE than row-based tables do. It is
preferable to perform updates to columnar tables in batch rather than updating only a few
tuples at a time. Also, DELETE operations may show degraded performance if there are many
TOAST entries for a table.
To summarize, it makes sense to consider using a columnar table if the majority of the queries
going against this table in your workload access only a few of the columns and the table is not
expected to have a lot of incremental updates.
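A sketch of the five-column example above, assuming the columnar layout is selected with a STORAGE COLUMN clause on CREATE TABLE; that clause spelling is an assumption to verify against your release's SQL reference:

```sql
CREATE TABLE pageviews (
    userid INTEGER,
    ip     VARCHAR(15),
    ts     TIMESTAMP,
    page   VARCHAR(100),
    notes  VARCHAR(10000)   -- the very wide column most queries skip
)
DISTRIBUTE BY HASH(userid)
STORAGE COLUMN;             -- assumed clause selecting the column-wise layout
```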
[Diagram: the Pageviews table stored row-wise versus column-wise, with column
projections for userid, ip, ts, page, domain, qs, and ref. The row store reads
1 MB of I/O per block regardless of which columns are needed; the column store
reads only the projections required, e.g. 500 KB of I/O.]

Row Storage - Must scan the entire row, so consuming I/O on all columns
Column Storage - Only scan the columns you need; I/O is minimized
Verticalization can be used for row-formatted tables to achieve some of the similar
performance benefits of columnar storage. The only caveat is that a verticalized table requires
more manual effort to maintain as the data changes.
If your schema contains a wide fact table that must remain so to satisfy some users, but you
wish to narrow the table to allow other queries to run more quickly, then you can improve
performance by creating materialized projection tables that include only those columns
needed by your high-performance queries. We call this verticalization because the
projections are more vertical in nature than the wide fact tables they represent. Verticalization
is useful, for example, if you have wide tables on which you run daily reporting queries that
select only a few of the columns and must run quickly.
Tip: In most cases, 60% is the magic number: if most of your queries hit less than 60% of the
fact table's columns, then you should consider creating a verticalized projection of the fact
table.
Costs: When you weigh the usefulness of Verticalization, bear in mind the costs:
You must build a two-step loading process to create both your Fact table and
its materialized projection(s). This typically means a standard load to the main
Fact table and a CREATE TABLE AS SELECT to create the projection
You must rewrite some of your existing queries to run against the materialized
projection, rather than the main Fact table
How to Verticalize: To verticalize your schema, you create new tables that
contain copies of only those columns that are frequently queried together. We
refer to such copy-tables as materialized projections
Maintaining materialized projections requires that you periodically update or
recreate the materialized projection with data from the source Fact table.
Your queries that use only the projected columns will now run faster, while
other queries that need data from the columns not in the projection can
continue to use the original, wider table
For example, in a clickstream tracking database that logs users' views of
website pages, you might precompute page_view summary statistics for
every combination of user, page_view, and domain.
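The two-step load described above might look like this; pageviews_fact and the projected column names are hypothetical, and the CREATE TABLE AS SELECT form follows the CTAS syntax shown elsewhere in this module:

```sql
-- Step 1: standard load into the wide fact table pageviews_fact (not shown)

-- Step 2: materialized projection holding only the frequently queried columns
CREATE TABLE pageviews_proj
DISTRIBUTE BY HASH(userid) AS
SELECT userid, ts, page
FROM pageviews_fact;
```

Queries that touch only userid, ts, and page are then rewritten to run against pageviews_proj; the projection must be refreshed or recreated as the fact table changes.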
When choosing the distribution key, you should also, as a secondary concern, bear in mind
the evenness of your data distribution. That is, make sure the distribution key column
contains enough unique values to allow Aster Database to distribute data so that no single
worker holds significantly more data than any other. When a worker holds too much or too
little data, relative to other workers, we say there is data skew. Avoiding data skew is important,
but it is really a secondary concern because matching your schema to your joins is more
important.
The rule is to keep data movement across the network to a minimum. Any time
you have to move data from one v-Worker to another (e.g., for a JOIN or GROUP BY),
you incur a performance penalty.
Choose join columns that match the hash column; then there is no need to copy
data between v-Workers, since matching rows are guaranteed to be hashed to the same v-Worker.
Some skew may be acceptable on a v-Worker if the column chosen as the
hash column is joined frequently to other tables.
Follow these guidelines to pick the distribution key column for a table:
1 Consider your joins. Choose the column that will be used in your most performance-critical
joins. When picking your distribution key, choose the column that is most
frequently used in joins or aggregations (GROUP BY or DISTINCT), in that order. Since
Aster Database is optimized for joins, it is cost effective to design table schemas so that as
many joins happen on the distribution key as possible. Aster Database is also optimized for
aggregation on the column specified as the distribution key. Therefore, when there are no
joins but only aggregations (via GROUP BY or DISTINCT), then using the column most
frequently involved in the aggregation as the distribution key provides better performance.
2 Data skew is secondary. Consider using a distribution key that avoids data skew, but
remember that when you're optimizing for performance, it's more important to pick a
distribution key that matches your joins than to pick one that avoids data skew. Data skew
occurs when the distribution key causes a disproportionately large number of rows to be
routed to a single worker in the cluster. As a result, one worker has a very large amount of
data, and all other workers have correspondingly smaller amounts of data. The one worker
with the majority of records performs slowly, as expected, and this can slow down queries
that need to access the skewed data.
3 What if there's no good candidate? If no appropriate distribution key column exists, you
may need to create a surrogate distribution key during loading. To do this, look for a
column or columns whose values might be transformed or combined to create more
useful distribution key values. As discussed in point 1 above, let your users' actual join
predicates guide you to find useful values to distribute on. You can use SQL-MapReduce
functions to perform the needed transformations during loading. If you have no existing
join predicates to guide you, then you can create distribution key values that just minimize
data skew. For example, you might define an id column of type UUID, and then, in
your data loading code, include an SQL-MapReduce function that uses a utility like
java.util.UUID to create a fairly universal identifier for every row. The broad distribution
of these values ensures good data distribution in Aster Database.
EXPLAIN SELECT f.c1 FROM factjoin f INNER JOIN repljoin r ON f.c1 = r.c1;

SELECT c.clickid, v.userid FROM click c, vendor v
WHERE c.userid = v.userid;

SELECT product_id, sum(sales_quantity) FROM sales_fact GROUP BY 1;

(Slide annotation: no shuffle needed; only one v-Worker gets involved.)

Any time you have a Replicated table in a JOIN condition, a shuffle of data will not be required
for that JOIN regardless of the other table type (Replicated or Distributed).
EXPLAIN SELECT f.c1 FROM factjoin f INNER JOIN repljoin r ON f.c1 = r.c1;

SELECT clickid, zip FROM click, zip
WHERE zip.userid = click.userid;

SELECT product_id, sum(sales_quantity) FROM sales_fact GROUP BY 1;

(Slide annotation: no shuffle needed.)
AdHeight INTEGER,
AdStatus CHAR(10),
CONSTRAINT DimAdPK PRIMARY KEY (AdID)
)
DISTRIBUTE BY REPLICATION;
Note there is no distribution key declaration for Replicated tables.
CREATE TABLE web_clicks
(customer_id INTEGER, session_id INTEGER,
 page VARCHAR(100), visitdate DATE)
DISTRIBUTE BY REPLICATION;

insert into web_clicks values (100, 1, 200, 'home', '2012-01-01');
insert into web_clicks values (100, 1, 300, 'mortgage', '2012-01-02');
insert into web_clicks values (200, 1, 400, 'fraud', '2012-01-03');
insert into web_clicks values (200, 1, 500, 'savings', '2012-01-04');
insert into web_clicks values (300, 1, 600, 'checking', '2012-01-05');
insert into web_clicks values (300, 1, 700, 'cd', '2012-01-06');
Making good use of analytic tables where appropriate can speed up query performance and
make multiple explorations on a specific set of data easier to perform. Analytic tables are not
replicated, so for very large tables that are based on derived data, they reduce the load on the
cluster. They have the benefit over temporary tables of not having to worry about losing the
data if a session or transaction is terminated before the user has finished doing the analysis.
Some common use cases for analytic tables are:
Create an analytic table to hold the output of a SQL-MR function, such as sessionize,
attribution or nPath. Then use the analytic table as input to other SQL-MR functions or
SQL queries. For example, nPath is sometimes used to filter web sessions based on the
behavior of shoppers in an online store (e.g. browsers, cherry pickers, price-sensitive
shoppers, etc.). Then further analysis can be done on just the sessions that fit that behavior
profile.
Use an analytic table to hold the results of a resource-intensive JOIN operation, so further
exploration can be done on the data without having to perform the JOIN again.
Employ analytic tables for a complex multistep process for which you need the highest
performance and want to keep the end results, but not the intermediate steps. In this case,
you can do most of the processing using analytic tables, and then write to a regular
(persistent) table at the very end of that process.
The reason these operations invalidate analytic tables has to do with replication. Analytic
tables are effectively unreplicated (RF=1) tables, because although their metadata is replicated,
the data itself is not. After a worker failover operation, the partition that was previously a
Secondary becomes a Primary. Since the data in the Analytic Tables was never replicated to the
Secondary, that Secondary (which is now the new Primary) does not have a copy of the data
rows - just an empty table. Therefore, a worker failover must invalidate the Analytic Tables to
force the user to recognize that the Analytic Tables don't have any data. Similarly, other
operations which may cause the Secondary to be used (e.g. balance data, node failover, etc.) will
also have the side effect of invalidating the analytic tables.
BEGIN;
CREATE TEMP TABLE temp1_hash
DISTRIBUTE BY HASH(emp) AS
SELECT e.emp, d.dept FROM emp e, dept d
WHERE e.dept = d.dept and e.mgr = 801;
SELECT * from temp1_hash;
END;
BEGIN;
CREATE TEMP TABLE temp1_repl
DISTRIBUTE BY REPLICATION AS
SELECT e.emp, d.dept FROM emp e, dept d
WHERE e.dept = d.dept and e.mgr = 801;
SELECT * from temp1_repl;
END;
Fact tables are usually very large (e.g. millions or billions of rows). These tables contain
two types of columns: the columns that contain facts and the columns that refer to the
dimension tables. Fact tables require a distribution key column to be declared.
Dimension tables are usually much smaller (e.g. tens to thousands of rows) than fact
tables. Each dimension table specifies a set of known descriptive values for a particular
dimension. For example, a customers table can be a dimension table that contains
detailed information about each customer: for example, a customer ID, name, address,
and phone number. A distribution key column is optional for dimension tables.
It is also possible to query the NC_ALL_TABLES view to determine the table type as well.

To copy the result set from an SQL-MR query to a permanent table, use the syntax shown above.
There is no change in query syntax for compressed tables. For all query purposes, a
compressed table will be treated the same as a normal table. Compression is currently not
supported for temporary tables. Compressed tables are replicated in their compressed form.
Before you alter existing table compression properties (compression levels, initial
compression of a table, decompression of a table), you should ensure that there is sufficient
disk space available for the operation.
Table compression occurs in an online fashion without disruption to Aster Database. One
useful application of compression is to combine it with Aster Database's logical partitioning
feature for information lifecycle management. As you recall, logical partitioning enables
creation of a hierarchy such that a large table can have partitions, which in turn can have their
own partitions, and so on. If the child partitions are range-partitioned (e.g. monthly
partitions), compression can be used to compress the monthly child partitions over time, as
they become less frequently accessed.
For example, assume it is November. You may leave the October and November child
partitions uncompressed as they are more frequently accessed. However, older data can be
compressed at increasing levels since query frequency may drop as data gets stale. For
example, Q3 data (July-Sept.) may be compressed LOW, Q2 data (April-June) may be
compressed MEDIUM, and Q1 data (Jan.-Mar.) may be compressed HIGH.
Realized compression ratios depend on the compression level selected by the user and the data
characteristics. While realized compression rates vary, typical ratios range from 3x to 12x.
At slight cost of CPU, get better Disk I/O since have more rows per Data block
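The quarterly scheme above might be sketched as follows. The LOW/MEDIUM/HIGH levels come from the text, but the ALTER TABLE ... SET COMPRESS spelling and the partition table names are assumptions to verify against your release's SQL reference:

```sql
-- Older child partitions compressed at increasing levels as queries become rarer
ALTER TABLE sales_fact_2012_q1 SET COMPRESS HIGH;    -- Jan.-Mar.
ALTER TABLE sales_fact_2012_q2 SET COMPRESS MEDIUM;  -- April-June
ALTER TABLE sales_fact_2012_q3 SET COMPRESS LOW;     -- July-Sept.
-- The October and November child partitions stay uncompressed
```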
Use Logical Partitioning to give the effect of smaller tables, which shortens
query runtimes. By using logical partitioning, query performance improves
nearly linearly with the number of partitions added
Logically Partitioned tables can take advantage of Partition Pruning. Provided
that the child partition structure matches the user's predicate in the WHERE
clause, Aster Database reads only the relevant child partitions, resulting in
fast query runtimes
2 Update performance is improved, since each partition of the table has indexes smaller than
an index on the entire data set would be.
3 You can effectively do a bulk delete by dropping a child partition, as long as your
partitioning design plan allows for it. ALTER TABLE ... DROP PARTITION is far faster
than a bulk DELETE.
4 Removing a large segment of data does not leave a big hole in the table as it would when
using only one large table.
With Logical Partitioning we can reap both access-performance and data-manageability
benefits in a Cluster environment:
2 ways to PARTITION BY
LIST
RANGE
PARTITION BY LIST is when you provide the list of values that belong to each partition
- If NULL is in the list, then the partition will include NULL values
[ e.g.: VALUES(1, 2, NULL) ]
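A minimal LIST sketch, with a hypothetical table and values, showing NULL placed in one partition's list:

```sql
-- Illustrative: rows with region_code 1, 2, or NULL land in partition east
CREATE TABLE orders (order_id int, region_code int)
DISTRIBUTE BY HASH(order_id)
PARTITION BY LIST(region_code)
( PARTITION east ( VALUES (1, 2, NULL) ),
  PARTITION west ( VALUES (3, 4) ) );
```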
START Value
Required? No, declaring a START value is not required. If you omit it, Aster Database will
create a default START value for you.
If you omit the START value from the first range partition, it is equivalent to declaring
START MINVALUE.
If you omit the START value from any subsequent range partition, that partition uses the
END value of the preceding partition as its start value.
Allowed values: For the END value, you may specify a constant or the keyword,
MAXVALUE. Specifying END MAXVALUE says that there is no upper bound in the
partition. MAXVALUE does not correspond to a real value. Conceptually, MAXVALUE is
greater than all possible values (including NULL).
The range partition can also specify NULLS FIRST or NULLS LAST. This says that the NULL
value is ordered before (NULLS FIRST) or after (NULLS LAST) all the other values in the
partition. Regardless of which option for NULL ordering is designated, any NULL values will
come after MINVALUE and before MAXVALUE. The default is NULLS LAST.
In range partitions, using NULLS FIRST or NULLS LAST only affects the interpretation of the
START and END values in a range partition.
If NULLS LAST is chosen (that is the default), then the ordering will be:
MINVALUE, <actual values, e.g. Albania, Zambia>, NULL, MAXVALUE
If you say NULLS FIRST:
(START MINVALUE END 0) will include NULL
(START NULL END 0) will be a valid range
(START 0 END NULL) is not valid and will be rejected by the system
(START 0 END MAXVALUE) will not contain NULL

Similarly, if you say NULLS LAST:
(START MINVALUE END 0) will not include NULL
(START NULL END 0) is not valid and will be rejected by the system
(START 0 END NULL) will be a valid range
(START 0 END MAXVALUE) will contain NULL
CREATE TABLE customer
( customer_id int,
  zipcode     int )
DISTRIBUTE BY HASH(customer_id)
PARTITION BY RANGE(zipcode NULLS FIRST)
( PARTITION zipcode_is_NULL      (START NULL END NULL INCLUSIVE),
  PARTITION zipcodes_00000_09999 (START 0 END 10000 EXCLUSIVE),
  PARTITION zipcodes_10000_19999 (END 20000 EXCLUSIVE),
  PARTITION zipcodes_20000_29999 (END 30000 EXCLUSIVE) );

The same scheme, demonstrated with data:

CREATE TABLE lp_null_first
( customer_id int,
  zipcode     int )
DISTRIBUTE BY HASH(customer_id)
PARTITION BY RANGE(zipcode NULLS FIRST)
( PARTITION zipcode_is_NULL      (START NULL END NULL INCLUSIVE),
  PARTITION zipcodes_00000_09999 (START 0 END 10000 EXCLUSIVE),
  PARTITION zipcodes_10000_19999 (END 20000 EXCLUSIVE),
  PARTITION zipcodes_20000_29999 (END 30000 EXCLUSIVE) );

insert into lp_null_first values (1, null);
insert into lp_null_first values (2, 00000);
insert into lp_null_first values (3, 10000);
insert into lp_null_first values (4, 20000);

select * from lp_null_first;
Confidential and proprietary. Copyright 2009 Aster Data Systems
When using ranges, ensure that ranges do not overlap. When using values, ensure that
each value is assigned to only one partition. If you attempt to insert rows with values that
do not fall within the defined list of ranges or values for any partition, the insert will fail.
Note that you can create multilevel partitioned tables in one CREATE TABLE command
by nesting PARTITION BY statements.
Compression is supported at every level in the logical partition hierarchy. That is, you may
compress the table itself, its index, and/or any of its child partitions. If compression is
specified for the table or one of its partitions, the compression will cascade to any partitions
below it in the hierarchy, unless they have their own compression explicitly specified.
CREATE FACT TABLE records(id int, country varchar, ts timestamp)
DISTRIBUTE BY HASH(id)
PARTITION BY RANGE(ts)
( PARTITION oldrecords( END '2010-01-01' COMPRESS LOW),
  PARTITION jan01_2010( END '2010-01-02' COMPRESS LOW),
  PARTITION jan02_2010( END '2010-01-03' COMPRESS LOW),
  PARTITION jan03_2010( END '2010-01-04' COMPRESS LOW),
  ...
  PARTITION dec31_2010( END '2011-01-01' COMPRESS LOW),
  PARTITION jan01_2011( END '2011-01-02'
    PARTITION BY LIST(country)
    ( PARTITION na ( VALUES ('usa', 'canada', 'mexico') ),
      PARTITION eu ( VALUES ('germany', 'spain') ) ) ) );
- This can improve query performance when the predicate for each
level is given in the query (in the WHERE clause)
- This can cause full table scans to run longer because now a
SELECT * FROM parent_table; query must open/read MULTIPLE
child tables!
So LP tables can provide increased performance. However, without a WHERE
clause on the LP columns, an LP table can be slower than a non-LP table,
since the query has to walk through each partition:

PARTITION BY RANGE (sales_date)                      -- 1 level deep
( partition Sales_Old     (END '2012-07-01')
 ,partition Sales_2012_07 (END '2012-08-01'          -- 2 levels deep
    PARTITION BY LIST (region)
    ( partition East (VALUES ('E'))
     ,partition West (VALUES ('W')) ) ) );

SELECT * FROM sales    -- must read every partition
The ALTER TABLE...ADD PARTITION operation allows adding a new partition to a logically
partitioned table. The new partition will have the same columns, indexes, permissions and
distribution key as the logically partitioned table to which it will be added.
Examples
The following example adds a partition south_america to an existing table of distributors
partitioned by a list of country names:
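The statement itself did not survive extraction; a sketch, with hypothetical country values, would be:

```sql
-- Illustrative: the list of VALUES is an assumption
ALTER TABLE distributors
  ADD PARTITION south_america ( VALUES ('brazil', 'argentina', 'chile') );
```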
The ALTER TABLE...DROP PARTITION operation drops an existing partition and all of its
data from a logically partitioned table. If the partition to be dropped is a subpartition, the
reference to it must include references to all partitions above it in the hierarchy (i.e.
partition_name.subpartition_name). If the partition to be dropped includes
subpartitions, they will be deleted as well, along with their data.
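A sketch of the corresponding drop, assuming a south_america partition exists on the distributors table:

```sql
-- Drops the partition and all of its data (and any subpartitions)
ALTER TABLE distributors DROP PARTITION south_america;
```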
Suppose we have the following LP table and want to detach the post_2008 partition and add a 200901 partition.
Users are complaining about the performance of a large fact table named
page_view_fact. You find they are doing queries based on dates
Use PROCESS tab in AMC to record times: Non-LP: _______ LP: _______
Most queries from the SALES_FACT table will be queried by MONTH so create
this table as a Logically Partitioned table for each Month of 2008
Indexes are created using the CREATE INDEX command, specifying the name of the index,
the underlying table to index, and the column(s) to index.
Indexes are useful for quickly accessing selective amounts of data. For example, suppose we
want to filter a table called pageviews by a highly selective criteria: the sessionid in (11, 12).
Then an index on the sessionid column of the pageviews table would speed up this query. In
contrast, suppose we have a non-selective criteria: find the average age of all users in the North
America region. Then a sequential scan is likely more efficient than index-based access which
results in many random I/Os.
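The selective case above can be sketched as:

```sql
-- Index on the highly selective column
CREATE INDEX idx_pageviews_sessionid ON pageviews (sessionid);

-- This filter can now use the index instead of a sequential scan
SELECT * FROM pageviews WHERE sessionid IN (11, 12);
```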
Teradata recommends the following guidelines for vworker-local indexes. (Note: Aster
Database does not support cluster-global indexes. Instead, indexes are per-vworker, which
means each index includes all rows stored on that vworker.)
It is more efficient to create indexes on fact tables after data has been loaded. If done the
other way, i.e., if an index exists before the load begins, the database will need to maintain
the index for every inserted row, slowing loads tremendously.
Low row selection: If a workload has queries that frequently access less than 10-15% of
the rows in a large table, an index might be appropriate. Of course, such a percentage
value depends largely on the relative speed of table-scan and the distribution of the row
data in relation to the order of the index key. The faster the table-scan, the lower the
above percentage; the more clustered the row data, the higher the percentage.
JOINs on multiple tables: Performance of JOINs across multiple tables could improve
with indexes, as the execution plan avoids sequential scans of all tables in the JOIN.
Suitability of columns for indexing: If a column contains many NULLs, and the workload has
frequent queries that access the non-NULL values, an index might be appropriate.
When creating an index with a composite (multi-column) key, the order of the columns
should be based on the general rule that the most frequently occurring columns in queries
should be placed first. For example, an index over columns <c1, c2, c3> will be used by
queries that access either c1, c1 and c2, or c1 and c2 and c3. Queries that access c2, c3, or
c2 and c3 will not leverage the index.
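A sketch of the composite key described above, on a hypothetical table t:

```sql
-- Most frequently queried column (c1) leads; usable by filters on
-- c1, (c1,c2), or (c1,c2,c3), but not by filters on c2 or c3 alone
CREATE INDEX idx_t_c1c2c3 ON t (c1, c2, c3);
```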
An index will speed up SELECTs, but will slow down DMLs such as INSERTs, DELETEs,
and UPDATEs, because index maintenance is a per-row operation and entails random I/O.
This trade-off between SELECTs and DMLs must be kept in mind when deciding how
many indexes to create for a table. Tables that are primarily read-only will benefit from
indexes. Tables that get modified very often will do better with fewer indexes.
Aster performs an index scan only if all of the following criteria are met:
Indexes are available; and
The query filters on a column that has an index; and
The optimizer thinks the query's filtering on the indexed column will remove
at least 80-90% of the rows from the result. Please note that what matters
is what the optimizer thinks, so always make sure you run ANALYZE on the
table after any significant change to the table's contents
Other considerations:
For many slightly complicated queries, the planner doesn't estimate the
cost of an operation exactly right. In such scenarios, the planner might
mistakenly use an index scan because it assumed high selectivity, when in
actuality the selectivity is not high enough
When your table contains a large number of rows but your typical queries
only select a small portion at any point, indexes are usually helpful
because they can reduce the amount of scanning needed to find a
query's results, but indexes are not the only way. In these situations,
using logical partitioning may also reduce scanning and should be
considered as an alternative to indexing
The key field(s) for the index are specified as column names. Multiple fields can be specified.
An index field can be an expression computed from the values of one or more columns of the
table row. This feature can be used to obtain fast access to data based on some transformation
of the basic data. For example, an index computed on upper(col) would allow the clause
WHERE upper(col) = 'JIM' to use an index.
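The upper(col) case reads as follows; the table name t is hypothetical:

```sql
-- Expression index: stores upper(col) so equality on the
-- transformed value can use the index
CREATE INDEX idx_t_upper_col ON t (upper(col));

SELECT * FROM t WHERE upper(col) = 'JIM';
```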
Aster Database supports the B-tree index method (btree) and the GiST method (gist).
Note that GiST indexes (used for columns of the IP4range datatype) are not supported for analytic
tables. So if you want to index a table with an IP4range column, you should create the table as
a regular or persistent table.
The keyword ASC (ascending) or DESC (descending) is optional. It specifies the ordering of the
values in the column. If not specified, ASC is assumed by default.
If NULLS LAST is specified, null values sort after all non-null values in the index; if NULLS
FIRST is specified, null values sort before all non-null values in the index. If neither is
specified, the default behavior is NULLS LAST when ASC is specified or implied, and NULLS
FIRST when DESC is specified (thus, the default is to act as though nulls are larger than
nonnulls).
When the WHERE clause is present, a partial index is created. A partial index is an index that
contains entries for only a portion of a table, usually a portion that is more useful for indexing
than the rest of the table. For example, if you have a table that contains both billed and
unbilled orders, where the unbilled orders take up a small fraction of the total table and yet
that is an often used section, you can improve performance by creating an index on just that
portion.
The expression used in the WHERE clause may refer only to columns of the underlying table,
but it can use all columns, not just the ones being indexed. Presently, subqueries and aggregate
expressions are also forbidden in WHERE. The same restrictions apply to index fields that are
expressions.
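The billed/unbilled example above might look like this; the orders table and its columns are assumptions:

```sql
-- Partial index covering only the small, frequently queried unbilled slice
CREATE INDEX idx_orders_unbilled ON orders (order_nr)
WHERE billed = false;
```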
Note we could have done this just as easily using a Logically Partitioned table
4. You can define a different Distribution Key when using the CREATE
TABLE AS syntax (T or F)
Mod 5
Loading
Now is a good time to RESUME
the HDP 2.1 VMware image
To load data into Aster Database, you can use the Aster Loader Tool, the COPY command, the
INSERT command, or a custom-defined SQL-MapReduce data loading function you have
written. This section provides tips for efficient loading and shows how to load using the Aster
Loader Tool.
This module will focus mainly on the Aster NCLUSTER_LOADER utility.
Input files are typically gathered together on a server called a Staging machine where the
ncluster_loader.exe utility is installed. Using the ncluster_loader syntax, a command can be
issued to load data using a Loader node if desired. By specifying a Loader node, this relieves the
Queen from doing the loading.
Rather than load a single, larger data file with a single instance of the loader, the input
file can be split into smaller chunks using an OS command such as split or csvgrep on
Linux. Then, to run multiple copies of ncluster_loader on Linux, put nohup before each
ncluster_loader command to cause it to run in the background.
To load table(s) from multiple files you must map source files to target
tables in a Mapping file
Can invoke multiple times on the same staging machine (i.e. nohup
ncluster_loader) so can load in parallel
where
arguments are the command-line flags that control how the loader runs. The flags are
explained in Argument Flags, below, or you can display the help by typing:
$ ncluster_loader -?
tablename is the name of the destination table (See Case-Sensitive Handling for
Table Names on page 180 if you wish to have Aster Database evaluate table names in a
case-sensitive manner.);
filename or dirname indicates the file or directory of files to be loaded. Files to be loaded
can optionally be compressed gzip or bzip2 files. These are extracted and their contents are
then loaded.
filename Qualified path of the file containing the data to be loaded. The contents of
the file must be in either CSV or text format, as described for the COPY statement.
Details of the encoding used (such as non-default values for null or delimiter) are
specified using the appropriate options, as described below; or
dirname Qualified path of the directory containing one or more data files to be
loaded. All data files found within this directory are expected to be in the same format
and will be loaded as a single transaction. Subdirectories will not be processed.
Generic syntax
$ ncluster_loader [flags] tablename { filename | dirname }
You can invoke the ncluster_loader utility multiple times on the same staging machine
by adding nohup before each ncluster_loader command.
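The split-then-parallel-load pattern can be sketched as follows; the table name, file names, and chunk size are assumptions, and connection flags are omitted:

```shell
# Split a large TSV into ~1M-line chunks, then load them in parallel
split -l 1000000 pageviews.tsv chunk_

for f in chunk_*; do
  nohup ncluster_loader prod.pageviews "$f" > "$f.log" 2>&1 &
done
wait   # block until all background loads finish
```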
In the table that follows, the argument flags are sorted based on the long-form, command-line
flag:
The left column lists the flag you use at the command line
The middle column lists the flag you can use in a map file
If no value appears in the middle column, then the argument is one that can only be passed at the
command line.
Note that you can always specify a loader node using its IP address. If you wish to specify it by
hostname, you first need to add the loader node to the Aster Database hosts file on all nodes
through the Hosts tab in the AMC.
See Teradata Aster Client User Guide for more information on loader arguments and their use.
The Aster Loader Tool supports the use of many Aster Loader nodes. For most loading tasks,
the queen is sufficient to handle all loading, but for high volume loading, you can add dedicated
loader nodes to your cluster.
To use a loader node, you invoke one or more ncluster_loader instances that will load through
that loader node. You may run many ncluster_loader sessions in parallel against one loader node,
and you may use many loader nodes in parallel (with each node handling loads from a number of
ncluster_loader instances).
To do this, you invoke each ncluster_loader instance with the -l (and optionally -f) argument to
specify the loader node. The required flags are:
the --loader flag (-l) provides the IP address of the desired loader node; and
Optionally, the --force-loader flag (-f) forces the use of the desired loader node.
Loader Node: optional (for small data volumes you can load via the Queen).
A loader node not only relieves the Queen of hashing rows for distributed tables, it
also offloads the work of receiving the stream of rows from the client and
distributing them to the v-workers.
Since you can invoke ncluster_loader multiple times on the staging machine, you can load in
parallel to speed the process.
[Diagram: Queen, Loaders, and Workers in the Cluster] Point ncluster_loader to the
loaders via the -l argument. Note each Cluster Loader Tool instance still has to point
to the Queen so jobs can be assigned, but the loaders now do the hashing and stream
the data to the workers.
Files are assumed to be in tab-separated value (TSV) format. If this is not the case, you can use
the -c flag to denote comma-separated value (CSV) format or the -D flag to explicitly spell out
the delimiter.
The column delimiter character to use when interpreting the input file (must be a string that
represents a valid single character, such as 'd' or '\n'). The default is a tab character ('\t').
When loading a very large amount of data, you may choose to create multiple map files that
each load their data files using a different loader. This can help speed up the process of loading
a large amount of data.
The map file is a text file containing a set of logical text blocks, each surrounded by curly braces.
Each block represents a file or directory to be loaded. The format is like this:
{
  "dbname" : "beehive",
  "username" : "beehive",
  "password" : "beehive_pwd",
  "loader" : "141.206.66.28",
  "force-loader" : true,
  "timeout" : 5,
  "loadconfig" :
  [
    {
      "table" : "schema1.targettable1",
      "file" : "data/insert1.txt",
      "errorlogging" : { "enabled" : true }
    },
    {
      "table" : "schema1.targettable1",
      "file" : "data/insert2.txt",
      "begin-script" : "input/mapfile/begin-script.sql",
      "end-script" : "input/mapfile/end-script.sql",
      "errorlogging" :
      {
        "enabled" : true,
        "discard-errors" : true,
        "label" : "insert2_log",
        "schema" : "nc_system",
        "table" : "nc_errortable_part"
      }
    }
  ]
}
In the above example, we assume the current directory (from which we invoke ncluster_loader)
contains a subdirectory, data, which has two files, insert1.txt and insert2.txt, and we load these
both into table targettable1 in schema1. Error logging is turned on. For the second table,
additional error logging parameters are supplied to log to a system table, label each row and skip
errors.
See Teradata Aster Client User Guide for more information on loader arguments and their use.
With the Cluster Loader tool you can load multiple data files into multiple tables in a
single invocation, using a map file passed via the corresponding flag:
  ...
  "errorlogging" : {"enabled":true,"label":"page200801","limit":100000,"schema":"prod","table":"load_page_err"}},
  {"table" : "prod.page_view_fact",    <-- could have pointed to a different table here
   "file" : "/home/lab07/pageviewdata200802.tsv",
   "errorlogging" : {"enabled":true,"label":"page200802","limit":100000,"schema":"prod","table":"load_page_err"}}
]}

MAP FILEs load in serial, so to achieve parallelism, invoke multiple ncluster_loader.exe instances, each with its own MAP FILE
--columns username,orderqty,timestamp
Must not have spaces after the commas (,) or else the load fails
Useful for:
-Shuffling columns around to fix any mismatches in the column ordering
between the input file and the destination table
-Telling the loader which columns will not receive data when an input file lacks
data for some columns in the destination table. In other words, omitting a column
name from the above command means NULL values for that column's rows (see next
slide for example)
-Alternatively, if you have 4 values in the file and 3 columns in the table, the
load will fail. The loader cannot ignore columns in input files; for that you need
(see two slides from here). A workaround is using a pre-script to load to a
staging table
By using -C, you can specify that the ordering of the columns in the input file is different from
the ordering in the table. You can also use -C to specify that this input file contains values for
only some of the columns in the table.
The input data is assumed to contain values for the columns in the order specified here. For
example, to load data into columns 'col1' and 'col2', one could specify "col1, col2" as the value
for this option. Column names not specified here are expected to get NULL values.
When using the -C option where the column list has any uppercase or special
characters, you must put the column list in double quotes. On Windows, this
additionally requires escaping the double quotes.
Example on Linux:
Example on Windows:
--columns c1,c2
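Putting the quoting rule into full invocations (table, file, and column names are illustrative):

```shell
# Linux: plain double quotes keep the list intact through the shell
ncluster_loader -C "c1,UpperCol" mytable data.tsv

# Windows: the double quotes themselves must be escaped
ncluster_loader -C \"c1,UpperCol\" mytable data.tsv
```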
Specifies the qualified path of the file containing SQL commands that should be executed when
the transaction starts, i.e. immediately after the BEGIN command is issued to Aster Database.
Note: data-returning statements such as SELECT are not allowed in scripts executed
by ncluster_loader.
Qualified path of the file containing SQL commands that should be executed when the
transaction is about to commit, i.e., immediately before the END/COMMIT command is issued
to Aster Database.
Goal: Show how to LOAD data from Hadoop into a new Aster table

01b (1 of 4):
  DELETE from ASTER_TARGET;
  SELECT count(*) from ASTER_TARGET;    -- 0 rows

03b:
  CREATE TABLE Aster_from_Hadoop distribute by hash(department_number) AS
  SELECT * FROM load_from_hcatalog (...);    -- Aster Function
--skip-rows arg
Specifies how many rows of the file to skip before starting to load data. If combined with
--header-included, --skip-rows will not start counting until after the first line in the
file. The default is 0.

--verbose
Run in verbose mode.

-z [ --auto-analyze ]
Runs ANALYZE on the table(s) after loading the data and sets the hint bits on the table(s). By
default, this is disabled. If data was loaded to child tables via autopartitioning, they will be
analyzed as well. Note that analyzing a columnar table may be slow if there are many columns.
To improve the speed of statistics collection, execute a separate ANALYZE command after the
load that only processes the columns involved in query row filters or grouping.
--el-enabled (map-file flag: errorlogging)
If present, turns on error logging for this invocation of the ncluster_loader. This needs to be
enabled for any other error logging option to be accepted. The default is disabled.
Use the --el-enabled flag (or the errorlogging flag inside a map file) to run the Aster
Loader Tool in a mode in which it tolerates poorly formatted input rows and logs each bad
row to a table. This differs from Aster Loader's normal mode of running:
Running normally, Aster Loader aborts the load immediately if it encounters a bad input
row, and it does not log the malformed input row to a table.
Running in --el-enabled mode, Aster Loader logs each malformed input row (that is,
any row it cannot interpret for loading or cannot load due to datatype mismatch or check
constraint violation) to an error table and continues to load the remaining rows in the
load job. We refer to this as error logging.
The --el-enabled flag is a master flag that operates with a set of sub-flags
(--el-discard-errors, --el-errfile, --el-label, --el-limit, --el-schema, and --el-table)
that fine-tune your handling of malformed rows. To use any of the sub-flags, you must first
have specified the --el-enabled flag. If you're using a map file, the syntax is different:
the master flag is errorlogging, and the sub-flags are discard-errors, errfile, label,
limit, schema, and table.
The --el-discard-errors flag discards all malformed rows, the --el-label tags failed
row data, the --el-limit flag sets a maximum number of allowed failed rows for the job, and
the --el-table flag specifies a custom error logging table.
To perform error logging, the Aster Loader Tool relies on the error handling features of the
Aster Database COPY command in SQL.
If data being loaded will cause duplicate values of a UNIQUE or PRIMARY KEY constraint on
the target table, it is considered an error. This particular error cannot be handled by error
logging, so the loading transaction will be aborted if any record causes a unique or primary
key constraint violation, even when error logging is enabled.
Error logging is turned off by default; you must enable it via the --el-enabled argument.
If error logging is not enabled, the load job will abort and roll back the data upon the
first error encountered.
If no error table is defined (you have --el-enabled but not --el-table <tablename>), then
a default error table is used. From ACT, run \dt nc_system.* to view these 2 default
error tables.
--el-label <labelname> is optional and adds an extra column to the error table. This is
so that when multiple loads insert into the same error table, you can recognize which
rows belong to which load job.
--el-limit <#> is optional. If it is not defined (but --el-enabled is), the load succeeds
even in the presence of malformed rows. If a limit is set (e.g. 100) and it is exceeded,
the transaction aborts and no rows are inserted into the error table or the target table.
Error logging is more general than just handling malformed rows. Essentially, any error
related to an individual row can be recorded by error logging and the load can continue,
except for the special case of UNIQUE/PRIMARY KEY violations. If there is such a violation,
the loader will abort the load. Also, any error not related to an individual row (such as
insufficient privileges) will abort the load operation.
--el-errfile is a sub-flag that can accompany --el-enabled. It introduces the pathname of
the optional error logging file. If you use error logging, you must have an error logging
table, and you can also have an error logging file. Upon completion of the load,
ncluster_loader writes the contents of the error logging table's rawdata column (and no
other columns) to the error logging file. Only the contents of this column are written to
the file. The filename will have a numeral appended to it in the form _0.
For this option to work, you must have also specified an --el-table.
Regardless of whether or not you specify that an error logging file should be used, the error
logging table will still contain all error rows upon completion of each load.
One way to correct errors, and get the data loaded into the Cluster, is
to use the --el-errfile flag. Upon completion of the load, ncluster_loader
writes the contents of the error logging table's rawdata column (and no
other columns) to the designated error logging file; the filename will
have a numeral appended to it in the form _0.
Inspect the contents of the error logging table to find the cause of
the errors, fix the problems you find there in the file, and finally
reload the fixed data from the file.
Make sure that even in the presence of malformed rows a given load operation succeeds:
This can be accomplished by enabling error logging but not setting an error logging limit
(set --el-enabled but do not set an --el-limit). If you are not interested in what errors
are present, malformed input rows can be discarded so that they are not stored in the
cluster (--el-enabled --el-discard-errors).
Abort the data load operation in the presence of too many malformed rows:
This is particularly useful if you want a given load operation to abort if too many
malformed rows are present in the input data (--el-enabled --el-limit 100). In order to
preserve atomicity for bulk load operations, the load operation fails as a transaction
when the error limit is exceeded. When the operation fails, any rows already written by
the transaction to the target table and error logging table are deleted.
If there is a violation of either constraint when loading, the load will always abort,
regardless of any error settings in ncluster_loader. You must find and fix these before
the load.
Making a load succeed despite malformed rows can be accomplished by enabling error
logging but not setting an error logging limit (set --el-enabled but do not set an
--el-limit).
If you are not interested in what errors are present, malformed input rows can be
discarded using --el-discard-errors.
If you want an exclusive error table for each load, you must define the error table
prior to loading and grant any necessary permissions.
When the operation fails, any rows already written by the transaction to the target
table and error logging table are deleted. With --el-enabled --el-limit <INT>, if the
limit is, say, 3 and 4 errors occur during the load, the load will fail and the good
rows are rolled back.
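An illustrative invocation combining the error-logging flags discussed above (table and file names are assumptions):

```shell
# Tolerate up to 100 malformed rows, tag them, and log to a custom error table
ncluster_loader --el-enabled --el-label page200801 --el-limit 100 \
  --el-schema prod --el-table load_page_err \
  prod.page_view_fact pageviewdata200801.tsv
```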
Quotes can be used on any portion of a data field, typically around special
characters. For example, with the default CSV mode, this is the usual way to
handle commas within a string:
The example below can introduce problems when working with varchar columns,
because many people put a space between the comma and the quote, and that
space is considered significant. For example:
In this example, the third column will be loaded with a space before the 'r'
character. Note you can only see the space in ACT; you will not see it in TD Studio.
Daily child tables offer a good example: When you load to daily child tables, run
ANALYZE on each child table after you load the days data into that table
If either the schema name, the table name or both names include
capital letters, you must surround each name in escaped quotation
marks, individually
Also note that a carriage return at the end of the source file will cause an error
SPLIT is a Linux command that can split a large file into smaller ones.
Use Linux commands to split larger files into multiple smaller ones, then
load in parallel via ncluster_loader
Load Speeds
From 10 GB/hour (replicated tables, especially if there are load errors) to 500 GB/hour for distributed tables.
(LP tables are in the 10-20 GB/hr range.) Loads tend to slow down on bigger files and on columnar tables
It is important to point out that no built-in error table is created by default when
loading into an Aster table. In other words, unless error logging is enabled, this is an
all-or-nothing proposition: any malformed row loading into an Aster table will cause an
ABORT and ROLLBACK of that data.
[Diagram: Teradata (Data Warehouse) and Teradata Aster (MapReduce Platform) workers,
with a connector that supports terabytes of data transfer per hour.]
Goal: Show how to LOAD data from Teradata into a new Aster table

01b (1 of 4):
  DELETE from ASTER_TARGET;
  SELECT count(*) from ASTER_TARGET;    -- 0 rows

  CREATE TABLE Aster_from_TD distribute by hash(employee_number) compress low AS
  SELECT * FROM load_from_teradata    -- Aster Function
    (... QUERY ('SELECT * FROM sql00.teradata_source'));
  -- The first error rolls back the Aster table contents
QueryGrid: Aster-Hadoop is an
intelligent connector to Hadoop that
selectively pulls data from Hadoop into
Aster for analytics
01b (2 of 4):
SELECT * from sql00.teradata_source;   -- 26 rows (source table on Teradata)

01c (3 of 4):
INSERT into ASTER_TARGET
SELECT * FROM load_from_teradata       -- Aster function
  (ON mr_driver
   tdpid ('dbc') username ('sql00') password ('sql00')   -- TD credentials used to logon
   QUERY ('SELECT * FROM sql00.teradata_source'));

01d (4 of 4):
SELECT * from ASTER_TARGET;            -- 26 rows

Goal: Show how to LOAD data from Hadoop into new Aster table
-- load_from_hcatalog arguments: server ('192.168.100.21'), dbname ('default'),
-- username ('hive'), tablename ('department')   -- source table on Hadoop
03c: SELECT * FROM Aster_from_Hadoop;  -- 9 rows
This transferred the contents of a table to a different cluster, running a different version of Aster,
and put it into a table with a different storage type (columnar vs. row). It was 17M rows
Objectives:
5a Bulk loading
5b Data validation (error logging)
5c Loading small datasets
1. Open your lab manuals to Page 22
2. Perform the steps in the labs
3. Notify the instructor when you are finished
4. Be prepared to discuss the labs
Before you begin:
SUSPEND the HDP 2.1 VMware image
When prompted later in the lab you will need to POWER ON the
LOADER. Do not do this step until prompted
Note in the following labs, all the files to be loaded are on the Queen
and we will invoke NCLUSTER_LOADER from the Queen. Typically
we would load from a Staging machine instead of using the Queen
Mod 06
Managing Tables
How to TRUNCATE
How to ANALYZE
[Slide graphic: an Employee table stored in three 32-KB datablocks. The table
grows in 32-KB datablock increments to accommodate the row size, which may
create some free space. After new INSERTs (e.g. rows 114 Ann and 115 Dave),
the table grew to 4 datablocks and free space went down. In addition, a hidden
column is created that flags the new rows as being able to be queried.]
The main difference between multiversion and lock models is that in MVCC locks acquired for
querying (reading) data don't conflict with locks acquired for writing data and so reading never
blocks writing and writing never blocks reading.
[Slide graphic: DELETE FROM employees. Rows are not removed when DML
operations are applied; they are simply marked as not visible (e.g. rows 108 Su,
203 Esther, 204 Jan, and 205 Mike flip from True to False in the hidden
visibility column).]
The TRUNCATE command empties a table or set of tables. TRUNCATE is a faster alternative
to performing an unqualified DELETE on a table. DELETE operates more slowly because it
does a full scan of each table before deleting the rows. TRUNCATE deletes the rows without
performing a scan.
If your table has child tables created through inheritance and you want to delete rows from the
entire hierarchy, remember to include the CASCADE option. Note that this usage of the
CASCADE option is different from the meaning of TRUNCATE CASCADE in Postgres.
If the table is a logically partitioned table, TRUNCATE automatically acts on the whole
hierarchy unless the ONLY keyword is used.
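A short sketch of the command; the table name is illustrative, with CASCADE included per the note above:

```sql
-- Empty a staging table and all of its inheritance children
TRUNCATE stage_sales CASCADE;
```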
Synopsis
Description
TRUNCATE removes all rows from a set of tables. It reclaims disk space, rather than requiring
a subsequent VACUUM operation. The reclaimed disk space may be available immediately, or
availability may be deferred. This is most useful on large tables.
TRUNCATE is typically used for STAGING tables and takes Table size to 0 kb
DELETE increases
DEAD_TUPLE_LEN. Large
values here are to be avoided
when possible
We will do an INSERT, UPDATE, and DELETE and confirm the table size on the
ASTER_TARGET table using the code below. Record the table size after each by
running the SQL-MR function NCLUSTER_STORAGESTAT
Dead Tuples
TRUNCATE aster_target;
SELECT * from ncluster_storagestat('aaf.aster_target'); 0
_________________
If your table has child tables created through inheritance, don't forget to include the CASCADE
option. If the table is a logically partitioned table, VACUUM automatically acts on the whole
hierarchy, unless you specify a partition using partitionname.
VACUUM ANALYZE performs a VACUUM and then an ANALYZE on the specified table or
partition. This updates the table's or partition's statistics for proper query planning. See
ANALYZE for details.
Synopsis
Run the VACUUM command on the queen, from the Aster Database Cluster Terminal (ACT).
The default Aster Database behavior requires that you pass a tablename argument:
VACUUM [ FULL ] tablename [ partition_reference ] [ CASCADE ]
When you run ANALYZE during a vacuum, you can also pass one or more columnname
arguments if you wish to update statistics for only those columns. You may also pass
an optional partition_reference argument if you wish to VACUUM a partition of a logically
partitioned table. Note that the partition reference always comes after the column list when
present. It is always okay to use a partition reference without a column list or vice versa:
Optional Aster Database behavior allows you to omit the tablename to VACUUM the whole
database. This behavior is not allowed in a default Aster Database installation; contact
Teradata Global Technical Support (GTS) if you wish to enable it. See Optional: Running
VACUUM on a Database on page 718. With this feature enabled, the following synopsis
applies in addition to the two above:
Parameters
FULL Selects "full" vacuum, which may reclaim more space, but takes much longer and
exclusively locks the table.
ANALYZE Updates statistics used by the planner to determine the most efficient way to
execute a query.
columnname The name of a column to ANALYZE. If omitted, all columns are ANALYZEd.
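A short sketch of the forms above; the schema-qualified table name aaf.aster_target is taken from the labs:

```sql
-- Reclaim dead space; takes longer and exclusively locks the table
VACUUM FULL aaf.aster_target;

-- Vacuum and refresh planner statistics in one pass
VACUUM ANALYZE aaf.aster_target;
```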
[Slide graphic: the Employee table (columns ID, Name, Visible) after we
DELETEd 3 rows. The deleted rows (e.g. 108 Su, 104 Pente) are marked
Visible = False, leaving free space; new INSERTs (e.g. 206 Al) can occupy the
newly freed space. Panels show the table after UPDATE and after DELETE.]
To find out whether a table would benefit from a VACUUM operation, you can check its dead
row percentages (as well as its uncompressed table size) using the nc_relationstats function.
You can use the function to find the number of dead rows, live rows, size of dead rows, size of
live rows, etc., so you can decide whether to run VACUUM FULL.
Note that in the case of a dimension table that is distributed by replication, the tuple count
returned reflects the total number of tuples on all workers.
Usage Recommendations
Before initiating a VACUUM FULL request on a database, use the ncli catalog
checklocks command to check for catalog locks in order to avoid conflicts with other
processes.
VACUUM generates a large amount of I/O traffic, which can slow other queries.
After adding or deleting a large number of rows, it's a good idea to issue a VACUUM
ANALYZE command for the affected table. This updates the system catalogs so that the query
planner can plan more efficient queries.
VACUUM recovered 35 MB of
dead space from DEL/UPD rows.
This space can now be used for
new INSERT rows. But notice
Table size is still >125 MB
Using ACT exclusively, log in to the BEEHIVE database and type the following:
SELECT * from ncluster_storagestat('aaf.clicks5');
The above requires re-assigning permissions to the table, since it has a new object ID.
You must also rebuild any indexes on the table. And if the table you want to DROP has
dependent views, you have to remove those dependent views first
- If the catalogs are not vacuumed regularly, Cluster system
performance will degrade over time
Vacuuming Indexes
Synopsis
Description
ANALYZE collects statistics about the contents of the specified table or partition in the database
and stores the results in internal tables. Subsequently, the query planner uses these statistics to
help determine the most efficient execution plans for queries. Note that the "partition
reference" always comes after the column list when present. It is always okay to use a partition
reference without a column list or vice versa.
You have the option of specifying one or more column names, in which case only the statistics
for those columns are collected. If your table has child tables created through inheritance,
don't forget to include the CASCADE option. If the table is a logically partitioned table,
ANALYZE automatically acts on the whole hierarchy, unless a partition is specified.
It is a good idea to run ANALYZE periodically, or just after making major changes in the
contents of a table. Accurate statistics will help the planner to choose the most appropriate
query plan, and thereby improve the speed of query processing. Also, the information
provided by the EXPLAIN command is only as current as the last running of ANALYZE.
Teradata recommends that you run ANALYZE after every batch of writes so that the statistics
are refreshed in bulk. You should run ANALYZE after any running of a CREATE TABLE AS
SELECT, INSERT, UPDATE, DELETE, or ALTER TABLE statement. A common strategy is
to run VACUUM and ANALYZE once a day during a low-usage time of day.
Unlike VACUUM FULL, the ANALYZE command requires only a read lock on the target table,
so it can run in parallel with other activity on the table.
The statistics collected by ANALYZE usually include a list of some of the most common values
in each column and a histogram showing the approximate data distribution in each column.
One or both of these may be omitted if ANALYZE deems them uninteresting (for example, in a
unique-key column, there are no common values) or if the column datatype does not support
the appropriate operators.
Best Practices:
VACUUM useful for slowly changing DIMENSION tables where
constant stream of UPDATES and INSERTS that could potentially
reuse the FREE SPACE
For tables loaded only once, no need to VACUUM
In Aster Database, the following SQL operations result in dead space being created on disk in
the data files for a given database table:
Dead space occurs when data rows are marked invisible but the space they take up is not
compacted or reused. For example, the SQL DELETE command is executed by marking all
qualifying rows as invisible. The SQL UPDATE command operates as follows:
when you update a row, the existing row is marked as invisible, and the updated row is
appended at the end of the table's data file. Dead space, such as the invisible row in this
example, is not automatically marked to be reclaimed or compacted. Reclaiming such space
requires the administrator to run specific commands, which we will discuss below.
It is important to be vigilant about dead space and proactively reuse or compact dead space.
Even though dead space does not contain live data for a table, it affects your cluster in these
ways:
Too much dead space may result in node failures: Excessive dead space may result in a full
disk on a worker node.
Too much dead space may result in slow query/commit performance: A sequential scan on
a table requires scanning through all data files for the table, including the dead space.
Cluster-wide replication at transaction commit time is performed at the file level, so dead
space also needs to be replicated over the network.
Dead tuples no longer in use (either due to a DELETE or an UPDATE) are not physically
deleted. This dead tuple space is only reclaimed after a VACUUM operation. In order to decide
if you need to perform a VACUUM operation, first determine the exact dead tuple count.
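The dead-tuple check can be run with the ncluster_storagestat SQL-MR function used in the labs (table name from the lab environment); large DEAD_TUPLE_LEN values suggest a VACUUM is worthwhile:

```sql
SELECT * FROM ncluster_storagestat('aaf.aster_target');
-- Inspect the DEAD_TUPLE_LEN column in the result
```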
There is no change in query syntax for compressed tables. For all query purposes, a
compressed table will be treated the same as a normal table. Compression is currently not
supported for temporary tables. Compressed tables are replicated in their compressed form.
Before you alter existing table compression properties (compression levels, initial
compression of a table, decompression of a table), you should ensure that there is sufficient
disk space available for the operation.
Table compression occurs in an online fashion without disruption to Aster Database. One
useful application of compression is to combine it with Aster Database's logical partitioning
feature for information lifecycle management. As you recall, logical partitioning enables
creation of a hierarchy such that a large table can have partitions, which in turn can have their
own partitions, and so on. If the child partitions are range-partitioned (e.g. monthly
partitions), compression can be used to compress the monthly child partitions over time, as
they become less frequently accessed.
When using the CREATE TABLE statement, specify LOW, since LOW-compressed tables are typically good
performers at a modest cost in CPU cycles. From ACT, to check a table's compression, execute \d <schema.tablename>
Using SQL Assistant, create 3 tables with the only change being different
Table names and Compression levels
Run the following to check the Disk space of each of the 3 compressed tables and the 1 Uncompressed table
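A sketch of that lab setup; COMPRESS LOW appears in the notes above, while MEDIUM and HIGH are assumed to be the other two levels, and the table and column names are illustrative:

```sql
-- Three copies of the same data at different compression levels
CREATE TABLE sales_low    DISTRIBUTE BY HASH(order_id) COMPRESS LOW    AS SELECT * FROM sales;
CREATE TABLE sales_medium DISTRIBUTE BY HASH(order_id) COMPRESS MEDIUM AS SELECT * FROM sales;
CREATE TABLE sales_high   DISTRIBUTE BY HASH(order_id) COMPRESS HIGH   AS SELECT * FROM sales;

-- Compare on-disk sizes against the uncompressed original
SELECT * FROM ncluster_storagestat('public.sales_low');
```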
Neither ACT (using \d <tablename>) nor TD Studio lets you see the
logical partitions for a table. Note TD Studio 14.02 has a rudimentary wizard
that attempts to show DDL, but it is not yet a complete solution
See next page for 2 solutions to view Logical Partitions of a Parent Table
Create an analytic table to hold the output of a SQL-MR function, such as sessionize,
attribution or nPath. Then use the analytic table as input to other SQL-MR functions or
SQL queries. For example, nPath is sometimes used to filter web sessions based on the
behavior of shoppers in an online store (i.e. browsers, cherry pickers, price-sensitive
shoppers, etc.). Then further analysis can be done on just the sessions that fit that behavior
profile.
Use an analytic table to hold the results of a resource-intensive JOIN operation, so further
exploration can be done on the data without having to perform the JOIN again.
Employ analytic tables for a complex multistep process for which you need the highest
performance and want to keep the end results, but not the intermediate steps. In this case,
you can do most of the processing using analytic tables, and then write to a regular
(persistent) table at the very end of that process.
This special table type was created to hold data for operations that span
several transactions, sessions, or days. Its persistence is between that
of a regular table and a temporary table
These tables are not replicated and will not survive a System restart
Should only be used for derived data and never for Source data
Analytic tables will not survive events like a soft/hard restart, node failover,
Balance Data, Balance Process, Activate, etc. If one of these occurs, it is recommended
you either TRUNCATE or DROP the table, then repopulate it
Views are read-only in Aster Database: the system will not allow an insert, update, or delete on
a view. Also, Aster Database does not provide session-level temporary views, which you may
be accustomed to using on other database platforms.
If an object is referenced by a view, and that object is renamed, then the view will continue to
reference that object using the old name. Even if a new object is created with the old name, the
view will continue to reference the original object.
schema
table
column
another view
Typically Views point to another table (or view). This makes the View a dependent object
which means you cannot DROP the table unless you first drop the dependent View which can be
a time-consuming process. The CASCADE argument can be used in the DROP TABLE
statement to drop any dependent Views.
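A minimal sketch of that dependency (the table and view names are illustrative):

```sql
CREATE VIEW v_sales AS SELECT * FROM sales;

DROP TABLE sales;          -- fails: v_sales depends on the table
DROP TABLE sales CASCADE;  -- drops the table and its dependent view
```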
In-line Lab:
create table sales_repl distribute by replication as
SQL-MR function
Mod 07
Unified Data Architecture (UDA)
and Teradata QueryGrid connectors
Before continuing, now is a good time to RESUME
the HADOOP nodes via the VMware Workstation icon
Teradata has been in the big data market longer than anyone, so we've leveraged our expertise to
tackle the "what's new" part of the big data phenomenon with five parallel engineering activities.
Beginning with our core product, the Teradata Integrated Data Warehouse, we:
1. Defined a new architecture called the Teradata Unified Data Architecture that adds a
discovery platform and a data platform to complement the Teradata Integrated Data
Warehouse. In the Teradata advocated solution, the discovery platform is Teradata Aster, while
the data platform can either be Hadoop or a Teradata Integrated Big Data Platform for large,
cost-effective storage and processing of big data.
2. Engineered new access connectors between the platforms and to external sources of big data,
and extended our hardware platform portfolio and interconnect options to provide even faster
data transfers.
3. Created a library of pre-built Teradata Aster analytic modules to speed up and simplify
discovery, all in an easy-to-use Teradata Aster SQL-MapReduce programming paradigm. We
also continue to add more technical partner analytics to the Teradata Aster and Teradata engines.
4. Extended our traditional systems management products (Viewpoint, Studio) for similar use
with Hadoop and Teradata Aster.
5. Developed new Professional and Customer Service offers to help customers quickly design
and deploy enterprise data architecture and analytics projects.
Informatica
SAS
SQLServer
Netezza
etc ..
"To deliver value from big data, customers should create an architecture that allows the
orchestration of analytic processes across parallel databases rather than federated
servers. Teradata QueryGrid is the most flexible solution with innovative software that gets the
job done," said Scott Gnau, president, Teradata Labs. "After the user selects an analytic engine
and a file system, Teradata software seamlessly orchestrates analytic processing across systems
with a single SQL query, without moving the data. In addition, Teradata allows for multiple file
systems and engines in the same workload."
"Teradata pioneered integration with Hadoop and HCatalog with Aster SQL-H to empower
customers to run advanced analytics directly on vast amounts of data stored in Hadoop," said Ari
Zilka, CTO, Hortonworks. "Now they are taking it to the next level with pushdown processing
into Hadoop, leveraging the Hive performance improvements from Hortonworks Stinger
initiative, delivering results at unprecedented speed and scale."
Teradata QueryGrid changes the rules of the game by giving users seamless, self-service access
to data and analytic processing across different systems from within a single Teradata Database
or Aster Database query. Teradata QueryGrid uses analytic engines and file systems to
concentrate their power on accessing and analyzing data without special tools or IT intervention.
It minimizes data movement and duplication by processing data where it resides.
[Slide diagram: QueryGrid connections — Teradata Database and Aster Database
each reach out to Hadoop (remote, push-down processing), Aster Database (Aster
functions such as SQL-MapReduce and graph), Teradata Database (Teradata RDBMS),
other databases, and other languages (leverage languages such as SAS, Perl,
Python, Ruby, R).]
When fully implemented, the QueryGrid will be able to intelligently use the functionality
and data of multiple heterogeneous processing engines
[Slide graphic: Hadoop (Ingest, Transform, Archive) serves roughly 5 concurrent
users, Aster roughly 25, and Teradata 100 or more.]
Hadoop serves as the data store for capturing and refining bulk data, the Aster Database acts
as the discovery platform, and the Teradata Database acts as the data warehouse
[Slide table: choosing an engine by schema stability —
Stable schema: Teradata or Teradata/Hadoop (SQL analytics)
Evolving schema: Aster or Aster/Hadoop (SQL + MapReduce analytics)
No schema (raw format): Hadoop, with Aster for MapReduce analytics]
Other options include creating an Aster view that points to the remote data store's table. The
Aster user can then perform analysis using that view.
1. Load Data — the table will be persisted in Aster Database; query or
analyze it as needed
CREATE TABLE aster_movieratings DISTRIBUTE BY HASH(userid) AS
(SELECT * FROM load_from_hcatalog (ON mr_driver SERVER ('hadoop1')
USERNAME ('huser') DBNAME ('default') TABLENAME
('hadoop_movieratings')));
2. Query Data via SQL — the table (or a subset via a WHERE clause) is
brought into Aster via an Aster view for the duration of the query
3. Analyze Data via SQL-MR — using Aster SQL-MR functions, pull in data on-
the-fly (from a table or view) for the duration of the transaction
via View: SELECT * FROM npath (ON (SELECT * FROM v_movieratings ..
via Table: SELECT * FROM npath (ON (SELECT * FROM load_from_hcatalog ..
On the Teradata side, you need connectivity for all nodes and an assigned database name for
any Teradata database(s) you will be accessing through the Connector. All the other network
configurations in this section apply to Aster Database only.
Because the Teradata Import and Export operations are executed in parallel across Aster
Database, every node in Aster Database will need to be able to access the source or target
Teradata database(s) by name using DNS. Depending on the network configuration in use,
this may require that the /etc/hosts and/or the /etc/resolv.conf files on each Aster
Database node be edited to include the necessary entries to access these gateways. It is
recommended that you manage these configurations centrally using the AMC, as described in
the Teradata Aster Big Analytics Appliance 3H Database Administrator Guide. You will make
the settings once in the AMC, and they will be copied to all Aster Database nodes
automatically.
The Teradata TPT client uses DNS to discover gateways to the Teradata database. Teradata
calls these gateways Communication Processors or cops. Each Teradata database is given a
database name, and all the cops have DNS names which use a very specific naming convention
(e.g. dbnamecop1, dbnamecop2, ..., dbnamecopn).
For example, because the Teradata Import and Export operations are
executed in parallel across an Aster Database, every node in Aster
Database will need to be able to access the source or target Teradata
database(s) by name using DNS. It is recommended that you manage
these configurations centrally using the AMC
The load_to_teradata SQL-MR function copies data from Aster Database to Teradata. The
function is invoked on Aster Database. A SELECT statement is supplied in the ON clause, to
specify the data to be loaded. The function outputs information about the data copied and any
errors.
You must first create a table in Teradata to hold the data being loaded. This table must exist, be
empty, and have a schema that's compatible with the data being exported from Aster
Database. Note that because the table must be empty, you cannot make two consecutive
load_to_teradata calls to the same target table. If you are making multiple consecutive calls to
the function, you must use a different target table for each one. After the data is loaded into
the target tables, it can be consolidated into a single Teradata table in a separate SQL operation
on Teradata.
If a datatype specified in the originating Aster Database schema does not match the datatype
in the target Teradata table, implicit datatype conversion will be performed by Teradata. For
Teradata, conversion rules are listed in the Teradata document SQL Functions, Operators,
Expressions, and Predicates which may be found at http://www.info.teradata.com/
edownload.cfm?itemid=102320046. After this table has been created, you can execute the
load_to_teradata function.
Why not just keep the data in Aster for users to query?
Because Aster is for Analytics by a limited number of Data Scientists.
You put the table on Teradata so thousands of users can access data
For larger result sets, it's a good idea to capture the output from load_from_teradata to a table
in Aster Database, to avoid the need to repeat the load if a query must be run again.
Goal: This lab demonstrates how to copy data from Teradata to Aster
Of course, to make it even easier for the Aster end-users, it is a common practice to hide the
complexities of the SQL-MR code by creating a View. Now the end-user can use common
ANSI standard SQL statements.
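A minimal sketch of such a view, reusing the load_from_teradata call and credentials shown in this module (the view name is illustrative):

```sql
-- Hide the SQL-MR complexity behind a view
CREATE VIEW v_td_source AS
  SELECT * FROM load_from_teradata
    (ON mr_driver
     tdpid ('dbc') username ('sql00') password ('sql00')
     QUERY ('SELECT * FROM sql00.teradata_source'));

-- End users then write plain ANSI SQL:
SELECT count(*) FROM v_td_source;
```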
SQL-MR code
Common arguments
TDPID('tdpid')
[USERNAME(' username ')]
[PASSWORD(' password ')]
[LOGON_MECHANISM('TD2' | 'LDAP')]
[LOGON_DATA(' mechanism-specific logon data')]
[ACCOUNT_ID(' account-id')]
[TRACE_LEVEL('trace-level ')]
[MAX_SESSIONS('max-sessions-number')]
[QUERY_TIMEOUT(' timeout_in_seconds ')]);
load_from_teradata only:
ON mr_driver
QUERY ('query')
NUM_INSTANCES ('instances-count')
PRESERVE_COLUMN_CASE ('Yes|No')
SPOOLMODE ('NoSpool | Spool')
SKIP_ERROR_RECORDS ('yes'|'no')

load_to_teradata only:
ON source_query
TARGET_TABLE ('Tablename')
ERROR_TABLES ('error table')
LOG_TABLE ('Tablename')
START_INSTANCE ('Instance')
NUM_INSTANCES ('Instance')
Teradata datatype | Aster datatype
bigint bigint
byte[(n)] bytea
byteint smallint
char[(n)] char[(n)]
date date
decimal[(s[, p])] numeric(s, p)
float double precision
integer integer
long varchar varchar
smallint smallint
time time
time with time zone time with time zone
timestamp timestamp
varbyte(n) bytea
varchar(n) varchar(n)
loaded_row_count
The loaded_row_count output indicates the total number of rows that were loaded into
the target Teradata table. Only one row will have the total row count, and the other rows
will have a value of 0. If the connector succeeded in loading rows into Teradata, but failed
to get statistics, this column will have the value of -1. Use sum(loaded_row_count) to
obtain the total number of rows loaded.
The value returned is equal to (actual number of rows returned) modulo 2^32. If the
number of rows to be loaded is expected to be greater than 2^32 (4,294,967,295) or the
row count is -1, please check the row count in the Teradata database by issuing a SELECT
COUNT(*) on the target table.
error_row_count
The error_row_count returns the number of rows in both of the Teradata error tables. To
see a total number of errors, issue sum(error_row_count).
The following code example includes a SELECT statement to access the output of the function
and provide 1) a count of rows successfully loaded, and 2) a count of rows with errors. In the
ON clause, there is a SELECT statement to indicate which rows are to be copied to Teradata.
SELECT sum(loaded_row_count), sum(error_row_count)
FROM load_to_teradata
(ON (SELECT * FROM ASTER_SOURCE)
tdpid ('dbc') username ('td01') password ('td01')
target_table ('td01.teradata_target'));
This example shows how to load data into Teradata when the number of Aster Database
virtual workers exceeds the number of AMPs in Teradata. To find out the number of AMPs in
Teradata:
$bteq
.logon dbc/UserID
password:
Select Count( distinct vproc) from dbc.AmpUsage;
*** Query completed. One row found. One column returned.
*** Total elapsed time was 1 second.
Count(Distinct(Vproc))
----------------------
2
quit;
The output displays the number of Teradata AMPs. In the above example the number of
Teradata AMPs is 2.
The load_to_teradata function is invoked as many times as needed, to balance the data
transfer between the Aster Database vworkers and the Teradata AMPs. Note that as a best
practice, the functions should be run within the context of a single transaction to maintain
data integrity, as shown in the example.
Perform the following steps when the number of vworkers exceeds the number of AMPs:
1 Determine the number of load_to_teradata() calls to make, using the following formula:
CallCount = vworkerCount / AMPCount
If the result is not an integer, round up to the nearest integer.
2. In Teradata, create the required number of target tables (e.g. if CallCount is 2, then create 2
target tables with names like target_table1, target_table2)
3. Create and execute the load_to_teradata() statements, as follows:
BEGIN;
-- Example: # of v-Workers = 64, # of AMPs = 32
SELECT * FROM load_to_teradata (ON (SELECT * FROM aster_source_table)
TARGET_TABLE ('schema.target_table1') TDPID('...') USERNAME('...') PASSWORD('...')
START_INSTANCE('0') NUM_INSTANCES('64'));
Later, run: INSERT INTO target_table SELECT * FROM target_table1; INSERT INTO target_table SELECT * FROM target_table2;
The load_from_teradata SQL-MR function must be invoked on a fact table in Aster Database.
This is usually done by using a dummy table, which will be referred to as 'mr_driver', created
as follows: CREATE TABLE mr_driver (c1 INT) DISTRIBUTE BY HASH (c1)
Handling Upper Case Letters in Imported Column Names
In Aster Database, you must surround any uppercase or mixed-case name in quotation marks
(use double quotes), or it will be treated as if you had typed it in all lowercase characters.
In Teradata, no such quoting is needed. This difference can create confusion when you
retrieve tables and columns from Teradata and use them in an Aster query, because if you
specify PRESERVE_COLUMN_CASE ( ' YES ' ) the case of the Teradata table and column
names is preserved. As a result, if a retrieved table or column name contains any uppercase
characters, you must double-quote that name in your Aster Database query. To avoid
confusion, you should make a habit of enclosing in double quotes all table and column
names that you retrieve using load_from_teradata. For example:
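A hypothetical example (the table and column names are invented): with PRESERVE_COLUMN_CASE ('yes'), mixed-case names retrieved from Teradata must be double-quoted in the Aster query:

```sql
-- "EmployeeNumber" and "LastName" retain their Teradata casing,
-- so they must be double-quoted in Aster
SELECT "EmployeeNumber", "LastName"
FROM td_import
WHERE "LastName" = 'Smith';
```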
For both QueryGrid: Aster-Teradata functions, the most common problem is an
inconsistent schema mapping between the two tables
Ensure that both tables have the same structure (columns, data types, etc.)
Data inconsistencies (constraint violations, hidden characters, failure of data type
conversion) can also cause problems, so check your data when in doubt
In case of a Teradata database shutdown, a pre-existing query using the Teradata
Connector on Aster Database will continue to wait for up to four hours for a reply
from Teradata. The query will resume and run to completion upon a restart of the
Teradata database. During the wait period, the query will show a status of Running
on Aster Database
During execution of the Connector functions, an error message may be returned from
either Aster Database or Teradata. First, determine whether an error message is likely
to be based on a problem on the Aster Database side or on the Teradata side
(QueryGrid: Aster-Hadoop)
LOAD_FROM_HCATALOG
All required Hadoop packages and HCatalog jars for certified distributions are installed during
the normal Aster Database installation or upgrade. However, you do need to set up SQL-H
Configuration for each Hadoop cluster you will access.
QueryGrid: Aster-Hadoop is an
intelligent connector to Hadoop that
selectively pulls data from Hadoop
into Aster for analytics
Using ACT, you can query HCatalog directly without using the QueryGrid connectors. Here are
examples to view databases, tables and columns of a table.
Apache Hadoop is an open source platform for storing and managing big
data. Teradata Aster SQL-H is a software access method which provides a
bridge that enables users to easily analyze data stored in Hadoop through
standard ANSI SQL and Aster's SQL-MapReduce framework. The table and
storage management service for data stored in Apache Hadoop is HCatalog.
SQL-H provides deep metadata layer integration with the Apache Hadoop
HCatalog project
1. You can use Aster Database to access the HCatalog metadata directly.
For example, you can list all databases and tables in HCatalog from
Aster using ACT
2. Aster Database supports fetching data from HCatalog, and the
automatic mapping of HCatalog datatype value to the Aster datatype
value
3. You can query HCatalog from Aster Database. There is support for
partitions and Partition Pruning on HCatalog, to improve query
performance
Each Hadoop cluster you wish to access must have a SQL-H configuration. You can either use
the AMC or ncli to create a SQL-H configuration. To use the AMC:
Server: The hostname of your Hive server. Note that if the Hive server and namenode
have different hostnames, you must specify the Hive server hostname, not the
namenode hostname.
Port: The port on which the Hive server listens. This is generally port 9083.
6 You will see the entry you just made in the SQL-H Configuration list.
You must configure the QueryGrid connector within the AMC first, or you will get the following
error message when using the connector (note the message uses the older name 'SQL-H')
SELECT * FROM load_from_hcatalog
(on mr_driver
server('192.168.100.21')
dbname('default')
tablename('department')
username('hive')) limit 5;
It is a good idea to configure HOSTS in the AMC as well if you use names instead of IP addresses
Supported datatype mappings (HCatalog Datatype to Aster Datatype)
To view databases in HCatalog via ACT / To view tables in HCatalog via ACT
SELECT * FROM load_from_hcatalog
(on mr_driver
server('192.168.100.15')
dbname('default')
tablename('sales_fact')
username('hive'))
limit 5;
The WHERE clause in the SELECT statement must point to the partitioning columns that were
declared in the CREATE TABLE of the Hive table.
Hive table carpricedata on Hadoop. Notice this table is partitioned:
CREATE TABLE carpricedata (Price DOUBLE, Mileage BIGINT, Make STRING,
Model STRING, Trim STRING, Type STRING, Cylinder INT, Liter FLOAT, Doors INT,
Cruise TINYINT, Sound TINYINT, Leather TINYINT)
PARTITIONED BY (country STRING, brand STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
The vH_carpricedata view hides the load_from_hcatalog call to Hadoop. An Aster user can perform ANSI
SQL queries as if the data were on Aster, or run SQL-MR code to do analysis on this data
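A sketch of how such a view might be defined. The server IP and argument values are carried over from the earlier load_from_hcatalog example and are assumptions for illustration:

```sql
-- Hide the Hadoop connector call behind an ordinary view (sketch).
CREATE VIEW vH_carpricedata AS
SELECT * FROM load_from_hcatalog (
    ON mr_driver
    SERVER('192.168.100.15')   -- Hive server from the earlier example
    DBNAME('default')
    TABLENAME('carpricedata')
    USERNAME('hive')
);
-- Users can now query it with plain ANSI SQL, for example:
-- SELECT make, model FROM vH_carpricedata WHERE country = 'USA';
```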
1. If you want total connectivity to all 3 query platforms today, you must initiate
from the Aster side (T or F)
2. This connector tells you how many good rows were loaded and how
many malformed rows were not loaded
3. The GUI Smart Loader can only pull entire contents to/from TD and
Hadoop
Objectives:
Module 8
Users, Privileges, Functions and
Security
From the VMware Workstation icon, SUSPEND the TD-box and Hadoop nodes
External Security
Teradata Wallet
Lightweight Directory Access Protocol (LDAP)
Secure Socket Layer (SSL)
Access Control
Limits on users' ability to read from and write to databases are governed as follows:
Aster Database security on database objects is managed through GRANT and REVOKE
statements.
GRANT statements grant privileges on database objects to one or more roles or individual
users.
Object-level security authorizations are stored locally in system tables on the coordinator.
Users' rights to use the AMC are also managed with GRANT and REVOKE.
Role
CREATE ROLE mktRole;
1 GRANT CONNECT on DATABASE beehive TO mktRole;
2 GRANT USAGE on aaf TO mktRole;
3 GRANT SELECT on aaf.employee TO mktRole;
New User assigned with Role
CREATE USER moe PASSWORD 'stooge' IN ROLE mktRole;
Add Role to existing User
GRANT mktRole TO curly;
Privileges assigned directly to a User
GRANT SELECT on aaf.employee TO larry;
Default users:
beehive: The user beehive (default password: beehive) owns the default database, also
called beehive. By default, the beehive user has no administrative rights.
Important! Immediately after you install Aster Database, you should change the default
password of db_superuser to one that is more secure.
db_superuser
- Has the powerful db_admin role and can access all database objects
in every way without restriction.
See CREATE USER in the Teradata Aster Big Analytics Appliance 3H SQL and Function
Reference for the list of options.
The following example demonstrates how to add a new user to Aster Database with the name
ryan in the group marketing with specified password ryan123.
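The statement might look like this. The IN GROUP clause is assumed from the PostgreSQL-style CREATE USER options that Aster Database derives from; see the SQL reference for the supported option list:

```sql
-- Create user ryan with a password, as a member of group marketing (sketch)
CREATE USER ryan PASSWORD 'ryan123' IN GROUP marketing;
```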
Database users are global across an Aster Database installation (and not per individual
database). To create user theadon with password 5t4g0l33, use the SQL command CREATE
USER:
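Using the names given above:

```sql
CREATE USER theadon PASSWORD '5t4g0l33';
```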
The user name must follow the rules for SQL identifiers: either double-quoted or without
special characters. The password must follow the Password Rules.
A User is an entity that can login into the database, can own database
objects, and can have database privileges
These password rules apply regardless of the authentication method. If you are automatically
generating passwords, ensure that only passwords that follow these rules can be generated. If
you use any tools which automatically generate passwords, you may need to modify them to
choose only Aster-supported passwords.
Local password authentication: Aster Database validates the username and password
against the user's record in its local repository on the queen (with backup on the cluster).
Passwords stored here are masked. The Aster Database user accounts are not shared with
the operating system user accounts and vice versa. This is the default. If you have activated
LDAP or AD authentication and wish to switch back to password authentication, you can
revert to the default configuration (described later in this module).
Lightweight Directory Access Protocol (LDAP) authentication: Aster Database passes the
username and password to the LDAP server for authentication. Your user accounts must
be stored in an LDAP-compatible directory server such as Active Directory or OpenLDAP.
Active Directory authentication: This uses the LDAP mechanism discussed in the
preceding bullet point.
Note that Active Directory can also be supported via LDAP support
The best practice for certain table access is to use fully qualified
schema-name.table-name queries
For unqualified queries the Schema Search Path will determine the
schema accessed. The default schema search path for all users is
simply the 'public' schema
To view your current schema search paths, from ACT type: show search_path;
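For example, using the aaf schema from this course's labs (the SET form mirrors the ALTER USER ... SET search_path used in the extra-credit lab later in this module):

```sql
-- View the current search path
show search_path;
-- Put schema aaf ahead of public for unqualified table and function names
SET search_path TO 'aaf', 'public';
```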
From TD Studio, open nc_all_users
If you are unable to drop a role, you may need to revoke privileges before dropping the role.
Any memberships in the group role are automatically revoked, but the individual members
(users or roles) are not otherwise affected. So if DROP ROLE admin occurred, user jstrummer
would no longer be a member of the group role admin since it was dropped. But jstrummer
as a user would still exist and not be affected.
If a USER or ROLE owns or has privileges to database objects, you will not be
able to drop it easily
Best Practice:
Find which objects a USER or ROLE owns or has access to in the system tables
(more on this in a few slides)
Again in the system tables, find all the roles this USER or ROLE has a
relationship with and REVOKE all of the related privileges. Also REVOKE
any access to any objects for that user
You may find it convenient to group users to ease management of privileges. That way,
privileges can be granted to, or revoked from, a group as a whole. In Aster Database this is
done by creating a role that represents the group, and then granting membership in the group
role to individual user roles.
New roles in Aster Database are defined through the CREATE ROLE command:
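For example, to create the group role used earlier in this module:

```sql
CREATE ROLE mktRole;
```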
See CREATE ROLE in the Teradata Aster Big Analytics Appliance 3H SQL and Function
Reference for a list of options.
Once the group role exists, you can add and remove members using the GRANT and
REVOKE commands:
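For example, using the role and user names from earlier in this module:

```sql
-- Add curly to the group role, then remove him again
GRANT mktRole TO curly;
REVOKE mktRole FROM curly;
```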
There isn't any real distinction between group roles and non-group roles, so you can grant
membership to other group roles, too. The only restriction is that you can't set up circular
membership loops. Member roles automatically have the privileges of the group roles to
which they belong.
You destroy a group role in the same manner with which you destroy any other role, using the
DROP ROLE command:
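For example:

```sql
DROP ROLE mktRole;
```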
If you are unable to drop a role, you may need to revoke privileges before dropping the role.
Users in Aster Database are authenticated before they can access the
various database components. A user has access to the various database
components based on the privileges they have been granted
Access Control limits on users' ability to read from and write to databases
are governed as follows:
Sometimes Roles are called 'Groups' or 'Group Roles'. All 3 terms are synonymous and used
interchangeably. For ease of use, we will use Role exclusively unless otherwise noted
catalog_admin is Aster Database's standard administrative role. This role has a minimal
set of administrative rights. The catalog_admin role has the privilege to view all system
tables. (However, the catalog_admin role does not have unrestricted access to everything
in the database. For example, the catalog administrator cannot arbitrarily modify user
tables unless explicitly granted permission to do so.) The catalog_admin role does not
provide AMC access.
amc_admin and similar roles determine what actions the user can undertake in the AMC.
The default administrative roles db_admin and catalog_admin, as well as the default
administrative user db_superuser cannot be dropped or altered in any way.
Note there are other default Roles that can be viewed from the AMC as well.
db_admin
- Members of this role are Cluster Database Administrators with
unrestricted access to everything within the database
catalog_admin
- Users with this role are privileged to view all system tables
- This role has a minimal set of administrative rights; it does not have
unrestricted access to everything in the database
PUBLIC
- All Users are automatically assigned to this Role by system
The above Roles cannot be Dropped. To view other Roles, type: SELECT *
from nc_all_roles; or use AMC
Another Default Role, process_runner can be given to all Users so they can
have access to the AMC in order to Cancel their queries
To assign privileges, use the GRANT command. In Aster Database, privileges can be granted
only at the database or table level. Note that rolename can be either a role group or a user. Here's
a subset of the GRANT syntax supported in Aster Database:
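A sketch of that subset, modeled on the PostgreSQL-style GRANT that Aster Database derives from. Treat the exact option list as an assumption and check the SQL and Function Reference:

```sql
GRANT { { SELECT | INSERT | UPDATE | DELETE } [, ...] | ALL [ PRIVILEGES ] }
    ON [ TABLE ] tablename [, ...]
    TO { username | GROUP rolename | PUBLIC } [, ...]
    [ WITH GRANT OPTION ]
```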
Assign Privileges
to Objects
When an object (such as a database, schema, or table) is created, it is assigned an owner. The
owner is the user that created the object. For most kinds of objects, only the owner can do
anything with the object initially. To allow other users and roles to use the object, privileges
must be granted. There are many different privileges, including: SELECT, INSERT, UPDATE,
DELETE, and CREATE. For more details, see the Teradata Aster Big Analytics Appliance 3H
SQL and Function Reference.
To achieve this, grant specific privileges to all the tables and views, and
use the WITH GRANT OPTION clause sparingly
If WITH GRANT OPTION is specified, the recipient of the privilege may in turn grant
it to others. Without a grant option, the recipient cannot do that. Grant options
cannot be granted to PUBLIC role
Can grant a Role to another Role (nesting Roles; can nest at least 9 levels deep).
Example: role mkt_report_writer inherits the privileges of role mkt_readonly:
GRANT mkt_readonly TO mkt_report_writer;
GRANT CREATE on a database gives the user/role the right to create new schemas in the
database. Granting CREATE on a database does not confer the right to create tables. To do
that, you must do the following:
GRANT CREATE on a schema gives the user or role the right to create new tables and
objects in the schema.
GRANT USAGE on a schema gives the user or role the right to access objects in the schema.
GRANT CREATE
- For Databases, gives the user/role the right to create new Schemas in the
database. Granting CREATE on a database does not confer the right to
create tables. To do that, you must grant the user CREATE on a schema in
the database
- For Schemas, granting CREATE on a Schema gives the user/role the right to
create new Tables and objects in the schema
GRANT CONNECT - Gives a User/Role the ability to connect to a Database.
This privilege is checked at connection startup. For new databases you create,
only you have the CONNECT privilege
GRANT USAGE - Gives a user/role the ability to access data objects stored in a
Schema. It is checked on schema access
GRANT INSTALL FILE, GRANT CREATE FUNCTION - Allows a user to upload
and install files and SQL-MR functions in the schema
GRANT EXECUTE - Allows a user to run a SQL-MR function
ALL [PRIVILEGES] - Grants all privileges at once
The keyword PUBLIC specifies that the privileges are to be granted to all
USERs, including those that might be created later. PUBLIC can be
thought of as an implicitly defined Role that always includes all USERs.
Any particular role will have the sum of privileges granted directly to it,
privileges granted to any role it is presently a member of, and privileges
granted to PUBLIC
4. Logon to another ACT session as joe (act -d beehive -U joe) and then:
SELECT * from aaf.employee; Were you successful? No
5. What command do you type (from the 1st ACT session) so Joe can do Step 4?
GRANT SELECT ON aaf.employee to joe;
Note: The special privileges of an object's owner (the right to modify or destroy the object)
are always implicit and cannot be granted or revoked.
Note: Some types of objects can be assigned to a new owner with an object-appropriate ALTER
command. A user/role can reassign ownership of an object only if she is both the current owner
of the object (or a member of the owning role) and a member of the new owning role.
You grant and revoke users' rights to functions using the commands shown below. For more
complete descriptions of these commands, see the reference documentation for GRANT and
REVOKE in the Teradata Aster Big Analytics Appliance 3H SQL and Function Reference.
To give a user or group the right to install files and create functions in Aster Database, an Aster
Database administrator must use one of GRANT commands that gives the user privileges in
the context of a schema:
GRANT { INSTALL FILE | CREATE FUNCTION } [, ...] [ PRIVILEGE ]
ON SCHEMA schemaname [, ...]
TO { username | GROUP rolename | PUBLIC } [, ...]
Note that there is no support for delegating privilege management, because the WITH
GRANT OPTION clause is not supported.
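A concrete instance of this syntax, using the schema and role names from this module's examples:

```sql
-- Let members of mktRole upload files and create SQL-MR functions in schema aaf
GRANT INSTALL FILE, CREATE FUNCTION ON SCHEMA aaf TO GROUP mktRole;
```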
To give a user or group the right to run a function in Aster Database, an Aster Database
administrator or the function's owner must use the GRANT EXECUTE command that gives
the user the privilege for the specific function:
GRANT EXECUTE [ PRIVILEGE ]
ON FUNCTION [schemaname.]funcname
TO { username | GROUP rolename | PUBLIC } [, ...]
Note that there is no support for delegating privilege management, because the WITH
GRANT OPTION clause is not supported.
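A concrete instance, using the user and function names that appear in the lab exercise in this module:

```sql
-- Allow joe to run the antiselect function installed in schema aaf
GRANT EXECUTE ON FUNCTION aaf.antiselect TO joe;
```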
REVOKE INSTALL
To deny a user or group the right to install functions and files in Aster Database, an Aster
Database administrator must use the REVOKE INSTALL command:
REVOKE [ GRANT OPTION FOR ]
INSTALL { FILE | FUNCTION } [,...] [ PRIVILEGES ]
ON SCHEMA schemaname [, ...]
FROM { [ GROUP ] rolename | PUBLIC } [, ...]
REVOKE EXECUTE
To deny a user or group the right to run a function in Aster Database, an Aster Database
administrator or the function's owner must use the REVOKE EXECUTE command:
REVOKE [ GRANT OPTION FOR ]
EXECUTE [ PRIVILEGES ]
ON FUNCTION [schemaname.]funcname
FROM{ [ GROUP ] rolename | PUBLIC } [, ...]
Once you INSTALL, the Queen distributes a copy of the file to all the
v-Workers. When you execute a SQL-MR statement, that function is
copied into the JVM along with table rows for processing
2. Open another ACT and logon as joe and attempt to run the ANTISELECT function:
SELECT * from aaf.ANTISELECT (on aaf.employee exclude ('job_code')); Fails
EXTRA CREDIT: What do you do so Joe doesn't have to qualify both the
function and the table? Hint: Need to do two things
From db_superuser ACT:
ALTER USER moe SET search_path to 'aaf', 'public';
Then have moe log back in as moe
If you want a user group in a deployment to be able to SELECT but not add or alter data or
tables in the public schema, you can revoke the default rights to the public schema and grant
limited rights as shown here. Let's assume the users we want to give read-only access to are in
a group called "ANALYSTS":
By default, all users in Aster Database have read/write access to schema public, so first we must
revoke that:
Next, we grant back the rights we want the group to have. For example, let's assume we want
them to be able to select from table1:
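The statements might look like this. The privilege lists are a sketch, and table1 is the example table named above:

```sql
-- Revoke the default read/write access to schema public
REVOKE ALL ON SCHEMA public FROM PUBLIC;
-- Grant back read-only access for the ANALYSTS group
GRANT USAGE ON SCHEMA public TO GROUP "ANALYSTS";
GRANT SELECT ON public.table1 TO GROUP "ANALYSTS";
```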
You can REVOKE the default rights to the PUBLIC schema and grant limited
rights
Assume you have a Role = Analysts and 10 users are in this Role. To setup
for read-only access to PUBLIC schema for Analysts Role:
Above will revoke all privileges from all Users in the default PUBLIC role for
the PUBLIC schema. It then GRANTS back SELECT permissions on the 2
tables within the PUBLIC schema
(these accounts will be used later for the Workload Policy labs)
1. etl
2. strategic
3. tactical
2. Teradata Wallet
Rather than using the passwords in scripts, you can use their corresponding names defined in
your wallet.
ACT
Loader Tool
Export
JDBC
ODBC
First ensure there is an Aster user named 'trm' with password 'trm'.
If not, create this user
TD Wallet Commands
Open TD Wallet, create a master password (tdwallet1) and a <name, value> pair for the
trm user account (name = trm, value = trm's password, which is trm)
In this example, we export data from the Aster Database table mytable to the file
mydata.txt as the user mjones. TD Wallet replaces $tdwallet(mjones) with the
corresponding password stored on the client TD Wallet.
$ ./ncluster_export -U mjones -w $tdwallet(mjones) -h 10.50.52.100 -d
mydb mytable mydata.txt
In this example, we connect to Aster Database as the user beehive. TD Wallet replaces
$tdwallet(beehive) with the corresponding password stored on the client TD Wallet.
$ act -d beehive -h 10.42.52.100 -U beehive -w $tdwallet(beehive)
In this query, TD Wallet replaces $tdwallet(abc) with the corresponding password stored on the
client TD Wallet.
select * from cfilter(
on (select 1)
partition by 1
database('beehive')
userid('beehive')
LDAP is specified in a series of Internet Engineering Task Force (IETF) Standard Track
publications called Request for Comments (RFCs), using the description language ASN.1. The
latest specification is Version 3, published as RFC 4511. For example, here is an LDAP search
translated into plain English: "Search in the company email directory for all people located in
Nashville whose name contains 'Jesse' that have an email address. Please return their full name,
email, title, and description."[3]
A common usage of LDAP is to provide a "single sign on" where one password for a user is
shared between many services, such as applying a company login code to web pages (so that
staff log in only once to company computers, and then are automatically logged into the
company intranet).[3]
To use LDAP authentication in Aster Database, each user must have two corresponding user
accounts: one in LDAP and one in Aster Database. The usernames must match, and the user
account in Aster Database must have connect privileges to the databases the user will use. If
the user will use the AMC, you must grant an AMC-capable role to his or her Aster Database
user account. Stated more explicitly, this means:
1 If an existing Aster Database user does not also have an account in LDAP, he or she will not
be able to log in to Aster Database after LDAP is enabled! Create user accounts in LDAP for
all Aster Database users.
2 If an LDAP user does not also have an account in Aster Database, he or she will not be able
to log in to Aster Database! Even with LDAP authentication enabled, every Aster Database
user must also have an account in Aster Database.
3 If the user's Aster Database user name does not match his or her global LDAP user name,
he or she cannot connect to Aster Database. To fix this without creating new user accounts,
create an alias in the global LDAP server that matches the Aster Database user name.
To revert the LDAP configuration to the default (authentication through Aster Database
only):
1 Perform a soft shutdown on the cluster. Log in to the queen as root and run the following
command:
2 Change the working directory to where the Aster Database configuration utility is located:
# cd /home/beehive/bin/lib/configure
#./ConfigureNCluster.py --auth_type=PASSWD
Module 9
Backup and Restore
Data from Aster Database can be backed up to two types of storage targets:
Backup Cluster: The Backup Nodes themselves can store backups using their
direct-attached storage, with hardware-level disk mirroring (e.g. RAID 1) for enhanced data
protection. Storage-heavy servers can provide high-density storage at a low cost per
terabyte, an important consideration for data volumes associated with data warehousing
applications. The Backup Cluster design follows incremental scalability principles similar
to those of the Aster Database architecture: more servers can be incrementally
provisioned to add backup capacity. Thus, backup storage costs can be managed in a
granular manner as data volumes grow. Backup files can subsequently be moved to tapes
or VTLs (virtual tape libraries) from Backup Nodes if required by IT or corporate
governance policies.
Network Storage: Aster Database Backup also provides the flexibility to store backups on
network storage (SAN/NAS) if required by an organization's IT policies. If this option is
chosen, Backup Nodes can run backup and restore processes in a massively-parallel
manner while using a networked storage array to store backup data.
These are the steps that occur when performing a backup operation:
1 The Backup Manager first contacts the queen node in Aster Database.
2 The queen coordinates the backup request with the worker nodes.
3 Worker nodes connect directly to the Backup Nodes and stream data in a massively
parallel manner to the Backup Nodes.
The archive blob(s) on each node are named automatically, using this convention:
AsterArchive_<backup_id>_Node<node_id>_<BLOBIndexNumber>_<BLOBCount>.tar
By looking at the name of any archive file, you can infer details such as which backup object
this blob belongs to, which backup node the blob was created on, the index number of this
blob, and the total number of blobs that make up the archive for this backup object.
Storing a copy of the archive off site helps in implementing a disaster recovery plan. After the
archive has been created, you may move the archive blobs to offline storage or tape. When
moving the files, ensure that the owner, group and permissions remain the same by using the
mv instead of the cp command. The permissions on the archive files should remain as:
The backups and archive files on the Aster Backup cluster can then be removed to free up disk
space.
rm -rf .metadata
1. On all Backup servers, create the UNIX account 'beehive' with the home directory
/home/beehive
2. Create UNIX 'beehive' group and add UNIX 'beehive' user to the 'beehive' group
3. Create the directory /home/beehive/data on the Backup Manager and Backup
Nodes. Give full access to 'beehive' user on the /home/beehive/data directory
4. Set up passwordless SSH for user 'beehive' and user 'root' among Backup Nodes
You can perform this installation from any Linux workstation that has Python
2.5.2 or a later version of Python 2.x installed and has access to the machines
that will form your Backup Cluster. In these instructions we will assume you
are working in a command shell on the machine that will be the Backup
Manager. After logging in as ROOT user:
1. Create the directory /home/beehive/data (this folder must remain empty prior to install)
This produces the 5 files you will need (put in /home directory):
- install_backup.py
- install_backup_node.py and install_backup_node.pyc
- backup-sw.tar.gz
- tc_backup_x86_64.tar.gz
4. Save these five files to a single directory on the Backup Manager. The
machine you place them on must have network connectivity to all Backup
Nodes
5. Log in as root, change to the directory where the installer files
are located, and run the script install_backup.py with the following options:
1. Open VMware image for Backup node. If needed login as: root/root
2. Click on Computer button in lower-left hand corner, and open up both the
Nautilus (file browser) and GNOME Terminal
3. From the File Browser, in the left-hand pane click on File System, then double-click
on the Home folder. Confirm you have the files needed to install (backup-sw.tar.gz,
install_backup_node.py, install_backup_node.pyc, tc_backup_x86_64.tar.gz)
4. Minimize the VMware Workstation icon
5. From PUTTY, login to Backup-Mgr. Use root / root to login. From prompt,
type: cd /home, then type:
To launch ncluster_backup:
1 Open a command shell on any machine that has Aster Backup installed.
2 Log in as beehive.
3 If ncluster_backup is not in your path, add it now, or change the working directory to
the executable directory. By default, the executables are installed in:
$ cd /home/beehive/bin/exec
4 Type ncluster_backup, passing the -h flag followed by the IP address or DNS name of
the Backup Manager:
$ ncluster_backup -h <mgr_IP>
For example, if you are working on the manager itself, you will type
$ ncluster_backup -h localhost
ncluster_backup command
From the Unix prompt, to start the software, type:
/home/beehive/bin/exec/ncluster_backup -h <Backup Mgr IP addr>
1 Log in to the AMC for the Aster Database cluster you want to back up.
5 Click OK.
Once the Backup Manager has been added, you will see a confirmation message in the upper
right part of the window. You won't see any entries in the Cluster Backups table until a backup
has been started.
The output shows all backup nodes along with their used/total storage capacity and current
status.
show storage
Shows all backup nodes along with their used/total storage capacity
and their current status.
After a DELETE, USE% may not immediately reflect the actual disk space used
Register/unregister node
- The Backup Manager needs to be registered with itself
Note that during the original install, the installer automatically REGISTERs any nodes defined
after the -n argument. So the commands above are for after you have installed the Backup
software and want to add/delete Backup nodes
Physical Backup (Incremental) backs up anything that has changed in the Aster Database
cluster since the last physical backup, saving space and bandwidth compared to a full
physical backup. When restored, an incremental backup does a full physical restore,
automatically using the last full physical backup and any incremental backups taken up to
the point of the incremental backup you have chosen to restore.
All types of backups are online operations, meaning your cluster remains up and can service
queries while the backup runs. All may be scheduled to occur automatically at a specified
interval.
Online
Full physical backups
Incremental physical backups
Physical restore
Restore FULL and any required INCREMENTALS
Compression at Backup node, not Worker node for both FULL and
INCREMENTAL (for cost and performance)
Backups Can
Pause/Resume
Cancel
Full Physical Backup
1 Backup command issued via CLI to Backup Manager
2 Backup Manager communicates with Queen
3 Queen returns list of v-Workers
During RESTORE, files are uncompressed on Backup nodes and then transferred to Worker nodes
(Diagram: a Full Backup stores every file from each v-Worker; an Incremental Backup
stores only changed or new files, keeping unchanged files as links back to the copies in
the Full Backup.)
During Restore, point to FULL. The INCREMENTALS are automatically applied in proper Order
Logical backup restoration is an online operation, meaning that it does not interrupt the
operation of Aster Database. Physical backup restoration, on the other hand, requires a restart
of Aster Database.
2 Type the start backup command, specifying physical full and the IP address of the
Aster Database queen:
3 Use the show backups command to check the progress of your backup:
2 Type the start backup command, specifying physical incremental and the IP
address of the Aster Database queen:
3 Use the show backups command to check the progress of your backup:
Queen
Starting a Full Physical Backup:
Can execute from PUTTY or from the VMware image via the UNIX prompt. Note you
execute commands from the Backup Mgr, but always point to the Queen in the command
2 Type the start backup command, specifying logical, the table name, and the IP
address of the Aster Database queen:
3 Use the show backups command to check the progress of your backup:
2 Type the start backup command, specifying logical, the table name, the partition
reference, and the IP address of the Aster Database queen:
3 Use the show backups command to check the progress of your backup:
The above example schedules a physical backup of the Aster Database with queen IP address
10.50.200.100. The scheduled backup will start at 6:30 AM on July 31, 2008 UTC
(Coordinated Universal Time). The repeat keyword is used to indicate if the backup activity is
recurring. In the above example, a full backup will be taken at the start time and would be
automatically taken once every two weeks (represented by 2w). An incremental backup will
be taken every other day (represented by 2d).
To schedule a physical backup without incrementals, specify repeat time for incremental
backups as 0 days:
The above example schedules a logical backup of table testTable from database TestDB on
queen 10.50.200.100 starting at 1:00 AM on July 31, 2008 (UTC). A logical backup will be
taken at that time and will be taken every day (represented by repeat 1d).
The REPEAT keyword indicates if the backup is Recurring. So a Full backup will occur
on Oct 23, 2013 starting at 2:30 AM and will automatically be taken once every week.
An Incremental will be taken every day thereafter
Here we take a FULL backup each week starting on Oct 22nd and have no Incremental
backups whatsoever
A Full backup will occur on Jan 1, 2013 starting at 2:30 AM and will
automatically be taken once every week. No incremental backups will be taken
A Logical backup will occur on Jan 2, 2013 starting at 1:00 AM and will automatically
be taken once every day
While backups can execute concurrently with queries and loads, running the backup still
consumes system resources. In some cases, you may want to pause the current backup so the
system can allocate all its resources to queries or data loading. The pause and resume
commands let you do this. Note that only physical backups (either full or incremental) can be
paused. Logical backups cannot be paused.
All of these operations require that you know the unique ID of the backup you wish to act
upon. If you do not know the backup ID, you can use the show backups command to find
the backup you wish to pause, resume, or cancel.
Pausing a Backup
Pausing a backup stops a backup that is in progress, until you resume it. Any data that has
already been transferred from the subject Aster Database to the Backup cluster will not be
transferred again. In other words, any incremental work that the backup performed prior to
being paused is preserved.
2 Type the pause backup command, specifying the ID of the backup to pause:
nCluster Backup> pause backup <backup_id>
3 Use the show backups command to check that your backup has been paused:
nCluster Backup> show backups
Resuming a Backup
Resuming a backup starts a paused backup process from where it left off.
To resume a backup that was paused:
2 Type the resume backup command, specifying the ID of the backup to resume:
nCluster Backup> resume backup <backup_id>
3 Use the show backups command to check that your backup has been resumed:
nCluster Backup> show backups
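The pause/resume semantics above can be sketched as a small state machine. This is a hypothetical model for illustration only, not the real Backup Manager: the point is that work transferred before a pause is preserved, and resume picks up where the backup left off.

```python
# Minimal sketch (illustrative names, not an Aster API) of pause/resume:
# chunks already transferred to the Backup cluster survive a pause.
class PhysicalBackup:
    def __init__(self, total_chunks):
        self.total_chunks = total_chunks
        self.transferred = 0          # chunks already on the Backup cluster
        self.state = "running"

    def transfer(self, chunks):
        if self.state != "running":
            return                    # paused: no new work is done
        self.transferred = min(self.total_chunks, self.transferred + chunks)
        if self.transferred == self.total_chunks:
            self.state = "succeeded"

    def pause(self):
        if self.state == "running":
            self.state = "paused"     # transferred chunks are NOT discarded

    def resume(self):
        if self.state == "paused":
            self.state = "running"    # continues from where it left off
```

For example, a backup of 10 chunks that is paused after 4 chunks keeps those 4 chunks, ignores transfer attempts while paused, and needs only the remaining 6 after resume.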
Cancelling a Backup
2 Type the cancel backup command, specifying the ID of the backup to cancel:
nCluster Backup> cancel backup <backup_id>
3 Use the show backups command to check that your backup has been cancelled:
nCluster Backup> show backups
The storage used for a backup or archive can be reclaimed by issuing a delete backup or
delete archive command. After the delete command finishes, the data associated with
the given backup or archive will have been removed.
Before you can delete a physical backup, you must delete all its subsequent incremental
backups. That is, if there is an incremental backup with a timestamp later than the backup you
are trying to delete, your attempt to delete will fail.
2 Type the delete command, specifying the ID of the backup you want to delete:
To delete a backup:
To delete an archive:
Note that it is possible to delete an archive, but leave its corresponding backup intact and
vice versa. The two are stored and managed independently of one another.
Before you can DELETE a Full Physical backup, you must delete its subsequent Incremental
backups. That is, if there is an Incremental backup with a timestamp later than the backup
you are trying to delete, your attempt to delete will fail.
To do so, first run SHOW BACKUPS to view the backup IDs, then run the DELETE commands
in the proper timestamp order
Delete backups newest-first (3, then 2, then 1). Deleting a backup removes it from
the AMC and removes its directories and files.
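The required ordering can be sketched as a small helper. This is a hypothetical utility, not part of the Aster tooling: given the (backup_id, timestamp) pairs read from SHOW BACKUPS, it emits the DELETE commands newest-first, so every later incremental is removed before the earlier backup it depends on.

```python
# Hypothetical helper: compute a safe DELETE order from SHOW BACKUPS output.
def delete_order(backups):
    """backups: list of (backup_id, iso_timestamp) tuples."""
    # Newest timestamps first: later incrementals go before earlier backups.
    return [bid for bid, ts in sorted(backups, key=lambda b: b[1], reverse=True)]

chain = [("full-1", "2013-10-23T02:30"),
         ("incr-2", "2013-10-24T01:00"),
         ("incr-3", "2013-10-25T01:00")]
for backup_id in delete_order(chain):
    print("delete backup " + backup_id)   # incr-3, then incr-2, then full-1
```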
6 Make a note of the Restore ID that appears when the restore begins.
7 Check the status of the restore job: nCluster Backup> show restores
Once your restore shows a status of Succeeded, proceed to the next step.
This will bring up the cluster and replicate the data to restore the original replication factor.
1 Make sure your Aster Database on the target cluster (i.e. the cluster to which you wish to
apply the restore) has the same version number as the cluster from which the backup was
made. Cross-version restores are not supported.
2 Make sure your target Aster Database has the same partition count as the backup you will
restore. If the cluster is active and no administrator has performed a partition split since
the backup was taken, then the partition count is the same. You can check the partition
count as shown in Prepare for Partition Splitting on page 436.
3 Because physical restore is an offline operation, you should notify users that the system
will be unavailable during the restore operation.
4 Log in as root on the target Aster Database queen.
5 Perform a soft shutdown of Aster Database: # ncli system softshutdown
6 Clean all data off of the target Aster Database by running the following command:
# ConfigureNCluster.py --clean_data
7 Perform a soft startup of Aster Database: # ncli system softstartup
8 Activate the Aster Database cluster: # ncli system activate
9 Next, perform the restore.
Restoring from a logical backup is an online operation that can be performed while Aster
Database is running. Logical restore takes longer than logical backup because of the replication
overhead. A logical restore operation is executed as a single modifying transaction, with
replication of the data included as part of the transaction commit.
There is some overhead associated with restoration of formatting from logical backups. While
data blocks associated with a physical backup can be readily used after being transferred back
into Aster Database, data in a logical backup needs to be reloaded. The data has to be
converted back from the text format to the native format used by Aster Database, which causes
some performance overhead. This process is not executed by the queen or loader nodes. It occurs
automatically on the worker nodes after the data has been transferred.
1 Make sure your Aster Database has the same version number as the cluster from which the
backup was made. Cross-version restores are not supported.
2 The table (or partition) you are restoring must not exist in the database.
If you are restoring a table that already exists, drop it before attempting to restore.
If you are restoring a partition, you can either drop it or detach it from its parent table
before attempting to restore.
3 If the table you are restoring is a child table or partition, its parent table or partition must exist
in the database. If the parent is missing, create it before attempting to restore the child.
4 If you are restoring a partition, its parent must have the same partition format (i.e. LIST/
RANGE option and partition key) that it had at the time of the backup operation. After
the partition is restored, it will have the same structure and be attached at the same point
in the hierarchy as it was before the backup.
5 If you are restoring a partition, the parent table should have the same structure as it had
when the partition was backed up.
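The precondition checklist above can be expressed as a small preflight function. This is a hedged sketch with illustrative names, not an Aster API: it only shows the shape of the checks (version match, target absent, parent present).

```python
# Hypothetical preflight for a logical restore; names are illustrative.
def restore_preflight(cluster_version, backup_version,
                      target_exists, parent_required, parent_exists):
    errors = []
    if cluster_version != backup_version:
        errors.append("cross-version restores are not supported")
    if target_exists:
        errors.append("target table/partition exists; drop or detach it first")
    if parent_required and not parent_exists:
        errors.append("parent table/partition is missing; create it first")
    return errors   # an empty list means the restore may proceed
```

A restore of a child partition into a matching-version cluster with the parent in place returns no errors; restoring into a mismatched cluster with the target still present fails on multiple counts.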
On-line operation
Target table must not exist. DROP it if it exists
Ensure the target cluster has the same Aster version number as the one from the backup
Parent of Target should exist (if using parent/child inheritance)
Includes restore of metadata
Running
Failed since table existed
Succeeded
Workload Management
Verify priority of the Backup class in AMC
Backup nodes
Verify incoming network throughput (iperf)
Verify IO writes (atop, iostat -m -d -x 60)
Verify CPU usage (sar -u 60)
Possible Actions
Add more network cards
Enable or correct NIC bonding
Check routing
Setup dedicated backup network (Network Assignments
feature)
Change RAID level
Enable write back cache
Replace faulty disks
Optimize TCP parameters
Add Backup Nodes
Collect the following log files (from time of error) from the Backup
Manager node:
- /home/beehive/data/logs/backupExec.log
- /home/beehive/data/logs/cluster.log
Collect the following log files (from time of error) from the Queen node:
- /primary/logs/sysmanExec.log
- /primary/logs/cluster.log
- /primary/logs/queenExec.log
Many of the same tasks that you can perform from the Aster backup command
line can also be performed from the AMC
6. You can define 3 different levels of Backup Compression (Hi, Med, Lo)
Mod 10
Workload Management and
Admission Controls
Now is a good time to SUSPEND
the Backup VMware image
Report Writers
Frontline
Applications
BI Analysts
Management
Ad Hoc Users
Workload Management
Resource prioritization based on user-provided guidelines
Workload: Set of SQL statements and activities with shared
properties from the user's perspective
Main objective: predictable running times
More resources given to higher-priority Workloads at the expense
of lower-priority ones
Administrators have the option to enforce admission control by setting Admission Limits
to determine when and if tasks (transactions, jobs, or queries) are allowed to be admitted into
the system for processing. This is especially important if you have a particular type of query
upon which other transactions depend. As an example, if you have call center or point of sale
transactions that depend on other transactions, the administrator has the ability to control:
These rules instruct Aster Database to run each type of job with the right level of urgency.
Based on your rules, Aster Database assigns an initial level of importance to each job and, if
warranted, re-ranks the job while it is running. For example, your rules can ensure high
resource allocation for a newly added query of a given type but throttle down resources for
that query if it runs so long that it is suspected of being a runaway query.
It consists of 2 parts:
These rules instruct Aster to run each type of job with the right level of
urgency. Based on your rules, Aster Database assigns an initial level of
importance to each job and, if warranted, re-ranks the job while it is running
Statements are optimally executed by controlling the running of, and resources allocated to, all tasks,
including: loading, reporting, mining, applications, compression, backup, scale-out
1 2 3 4
Priority 0 (zero) indicates a job that will not be allowed to run, priority 1 a very low-importance
job, priority 2 a medium-priority job, and priority 3 a high-importance job. You
can prevent a query from starting by having it map to a priority-zero service class at the outset,
but you cannot use priority zero to stop an already-running query. Priority levels 1, 2, and 3
can be applied to a running query.
Reasons for using priority 0 (deny) might be that you wish to disallow any queries against a
particular table during certain hours when the daily sales reports are run. Or you may wish to
block certain categories of users from running queries during peak hours.
Priority: 1st-level setting that governs admission to the queue. The
priority value is an integer between 0 and 3, inclusive, that establishes,
at the coarsest level, how important a job is. A higher priority value
indicates a job of higher importance.
Priority 0 = Deny (a job that will not be allowed to run), Priority 1 = Low
importance, Priority 2 = Medium, and Priority 3 = High importance
Reasons for using Priority 0 (deny) might be that you wish to disallow
any queries against a particular table during certain hours when the
daily sales reports are run. Or you may wish to block certain categories
of users from running queries during peak hours.
Within a priority level, the weight value dictates the ratio of resource allocation. For example,
if two statements execute with the same priority, but with weight values of 80 and 20, the
system will aim to allocate resources in a 4:1 ratio, with most of the resources allocated to the
statement with higher weight. Allocation of I/O-related resources is less accurate than
allocation of CPU shares, so in this example, the CPU share ratio would be very close to 4:1,
while the disk I/O shares cannot be guaranteed as precisely.
Weight: 2nd-level setting that governs admission to the queue. After
being evaluated by Priority, statements are then ranked by Weight. Taken
together, these 2 settings map to a per-node CPU and Disk I/O share
The Weight value is an integer between 1 and 100, inclusive, that
establishes how important a job is, relative to other jobs executing with
the same priority
Within a Service class, the Weight value dictates the ratio of resource
allocation. For example, if two statements execute with the same
priority, but with weight values of 80 and 20, the system will aim to
allocate resources in a 4:1 ratio, with most of the resources allocated to
the statement with higher weight. Allocation of I/O-related resources is
less precise, so the exact ratio cannot be guaranteed
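The 80/20 example above amounts to a simple proportion. A minimal sketch, assuming only the behavior described here (within a single priority level, CPU shares track weight; disk I/O is only approximated):

```python
# Sketch of weight-proportional CPU shares within one priority level.
def cpu_shares(weights):
    """weights: {statement_name: weight (1..100)} at the same priority."""
    total = sum(weights.values())
    return {stmt: w / total for stmt, w in weights.items()}

shares = cpu_shares({"stmt_a": 80, "stmt_b": 20})
# stmt_a receives 4x the CPU share of stmt_b (0.8 vs 0.2);
# disk I/O shares only approximate this ratio, per the note above.
```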
Under extreme memory pressure, service classes using more memory than their soft limits are
considered to be over quota. In such situations, queries and other activities executing under
that service class may be canceled
Under extreme memory pressure, service classes using more memory than
their soft limits are considered to be over quota. In such situations, queries
and other activities executing under that service class may be canceled.
This will be discussed in 2 more slides
Like a soft limit, a hard limit is defined as a percentage of physical memory (RAM) on a per
physical node basis. Again because of swap, hard limits and/or their sum can be higher than
100%.
Queries in service classes with a hard limit are canceled when the service class swap usage
goes above 1 GB.
Queries in service classes without a hard limit are canceled when the service class swap
usage goes above 10 GB.
Queries without an assigned service class fall under the default service class, which is
required for WLM.
Queries in a given Service Class will use swap space when the service
class hard limit is reached, or when the node is under memory pressure.
Queries are automatically cancelled by the system when they use
substantial amounts of swap space. Here are the rules:
1. Queries in service classes with a Hard limit are canceled when the
service class swap usage goes above 1 GB
1 GB: If a HARD LIMIT is defined and the query exceeds it, the query spills
into SWAP and is cancelled if SWAP > 1 GB
10 GB: If no HARD LIMIT is defined and the query exceeds its Soft limit, the
query spills into SWAP and is cancelled if SWAP > 10 GB
10 GB: If there is no SOFT or HARD LIMIT, the query can use all the available
RAM (at that point in time) on that Worker, and if the query spills into SWAP,
it will be cancelled if SWAP > 10 GB
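The cancellation rule reduces to one threshold check. A minimal sketch of the rules as stated above (thresholds in GB; function name is illustrative):

```python
# Sketch of the swap-based cancellation rules: >1 GB of swap cancels a
# query in a service class with a hard limit, >10 GB cancels otherwise.
def should_cancel(swap_used_gb, has_hard_limit):
    threshold_gb = 1 if has_hard_limit else 10
    return swap_used_gb > threshold_gb
```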
For example, consider the simple service class configuration shown in the slide.
Despite the fact that they have the same priority and weight settings, we have separated
interactive queries from the statements issued by the administrator. If these are the only two
service classes active in the system at a given point in time (in other words, no active statement
maps to any other service class), the configuration above stipulates that each gets the same
share of resources, or 50% each. For the allocation of CPU resources this is the case even if the
administrator is issuing a single SQL statement while 99 concurrent interactive queries are
executing. Instead of receiving only 1% of the available resources, that single admin query will
in fact receive roughly the same share as all the interactive queries put together! Note that I/O
resource allocation is not as fine-grained as CPU time allocation, so Aster Database performs
this in a best-effort manner.
Similarly, if only the CEO and the administrator have active statements executing in the
system, all admin statements would collectively receive 3 times more resources than all CEO
queries put together.
If these are the only 2 ACTIVE Service Classes (ie: InteractiveQueries and
AdminStmts) in the system at a given point in time, the configuration above
stipulates that each gets the same share of resources, or 50% each
For the allocation of CPU resources this is the case even if:
InteractiveQueries: 99 active queries
AdminStmts: 1 active query
Instead of receiving only 1% of the available resources, that single Admin
query will in fact receive roughly the same share as all the Interactive queries
put together
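The 99-vs-1 arithmetic above can be sketched directly. This is an illustrative model, not the scheduler itself: with equal priority and weight, CPU is split evenly between the active service classes first, then among the queries inside each class.

```python
# Sketch: per-query CPU share when equal-weight classes split resources
# at the class level first, then per query within each class.
def per_query_share(active_queries):
    """active_queries: {service_class: number_of_active_queries}"""
    active = {c: n for c, n in active_queries.items() if n > 0}
    class_share = 1.0 / len(active)          # equal split between classes
    return {c: class_share / n for c, n in active.items()}

shares = per_query_share({"InteractiveQueries": 99, "AdminStmts": 1})
# the single admin statement gets ~50% of CPU; each interactive query ~0.5%
```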
Before any statement starts executing, it is mapped to a workload policy. Workload policies
contain a predicate attribute that specifies a boolean clause that evaluates to true for all
activities that are to be mapped to that workload. You define this predicate to evaluate the
attributes of the statement and its context. For instance, a workload containing all queries
issued by user beehive could be defined with the predicate username='beehive'.
The workload policy then specifies the service class under which the statement will be
executed. In the examples in this document we focus on SQL statements, but the mechanism
described here also applies to other activities including physical backup and restore
operations.
The Workload Policy then specifies the Service Class under which the
statement will be executed
When you install Aster Database, no workload policies are provided. Before you define any
other workload policy, you must first define the default workload policy.
Workload Policies are enforced against all User accounts, including db_superuser
User-based policies
Many customers may rely mostly on userName, roles
Time-based policies
Variable currentTime can be used to implement different policies
for business hours and during the night
Object-based policies
Using table and database names
IP-based policies
May be useful for large companies; branches, office vs. home
office
Periodic Re-evaluation
Allows the same statement/activity to be mapped to different
service classes over time
Dynamic statement reprioritization
Example: change workloads based on statement execution time
You build the workload predicate using the pre-defined WLM attributes listed in the table
below. WLM attributes are SQL-typed values that contain information about the query itself
or about the user or session that ran the query. When you write a predicate, the datatype of
your test value(s) must match the datatype of the attribute being tested.
The WLM attributes and their associated values are assigned to the query when it is planned
by the system. For example, the attribute 'userName' has its value assigned during connection
establishment and contains the username provided by the user, while the attribute 'stmtType'
contains the type of statement being executed (e.g., a SELECT or an INSERT statement).
You build the workload predicate using the pre-defined WLM attributes
listed in the table below. WLM attributes are SQL-typed values that
contain information about the query itself or about the user or session
that ran the query. When you write a predicate, the datatype of your test
value(s) must match the datatype of the attribute being tested
Allows SQL-MR
functions to be run
at different WLM
returns true for any statement executed by any user on the analyticsTeam group but only after
the statement has been running for more than thirty minutes.
The attributes you can evaluate in your predicate, listed in the following table, are also listed in
the Aster Database system table, nc_qos_workload_variables.
Like in a WHERE clause, any construct that is compatible with the attributes
type is accepted. WLM predicates may include AND and OR conjunctions,
with as much complexity as desired. For example, the predicate:
returns true for any statement executed by any user in the analyticsTeam
role, but only after the statement has been running for more than 30 minutes
It is a best practice to immediately SAVE your Workload Policy after creating
it. If there is a syntax error, you will get an Error message
Missing Parens
At evaluation time, the first match wins. In other words, when a user submits a query, the first
workload policy whose predicate matches the user's query will be used. Because of this, you
will always want the default policy to appear last in the evaluation order. Follow the steps
below to set the workload policy evaluation order:
1 In the Workload Policies tab, click on a policys row and drag it up or down.
2 Repeat for the other policies, dragging each row to the right place in the order.
At evaluation time, the Query walks the Workload Policy Tree and the
first match wins. Because of this, you will always want the Default Policy
to appear last in the evaluation order as the Catch-all Policy for queries
that match no other Policy
Simply Drag and Drop the Workload Policies to the order you wish them
Query
Re-order
Queries can match more than 1 Policy so Policy placement order is important
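The first-match-wins evaluation above can be sketched as an ordered scan over policies. Policy names and predicates here are illustrative, not the real workload tree; note the catch-all default sits last, and that VARCHAR attributes are case sensitive (as covered under Case Sensitivity above), so a lowercase 'alter' does not match the 'Alter' constant.

```python
# Sketch of first-match-wins policy evaluation against a statement context.
def map_to_service_class(policies, ctx):
    for name, predicate, service_class in policies:
        if predicate(ctx):
            return name, service_class      # first matching policy wins

# Illustrative policies; the default (catch-all) policy must come last.
policies = [
    ("adminPolicy",   lambda c: c["userName"] == "beehive", "AdminStmts"),
    ("alterPolicy",   lambda c: c["stmtType"] == "Alter",   "ddlClass"),
    ("defaultPolicy", lambda c: True,                       "defaultClass"),
]

# Lowercase 'alter' does NOT match 'Alter', so this falls to the default.
ctx = {"userName": "jsmith", "stmtType": "alter"}
```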
You cannot force re-prioritization of a running query, but you can write workload rules that
will re-prioritize running and queued jobs.
From the AMC, go to the Processes tab to confirm your Queries are picking up
the desired Workload Policy
Overlapping Workloads
It is possible for a single statement to match multiple workloads, so sometimes the mapping may
not happen as you expect (and still be correct). To ensure that mapping is working as intended,
you can use the nc_qos_active_workloads view to see the ordered list of active workloads.
Invalid Predicates
The type of each QoS context attribute defines the expressions and operators that are allowed
by the system. For instance, given that userName is of type VARCHAR, an expression such as
userName like 'daniel%' is valid and should be accepted, while something like userName
< 100 will be rejected by the system with a message as shown below.
beehive=> insert into nc_qos_workload values (10, 'newWorkload', 't', 'userName < 100',
'defaultClass');
ERROR: Predicate error: operator does not exist: character varying < integer (SQLSTATE:
2883)
beehive=>
Improper Quoting
The predicate column in the nc_qos_workload table is of type VARCHAR, so you need to
properly quote constants when inserting new entries into the table. For example, note how the
given username is quoted below:
beehive=> insert into nc_qos_workload values (10, 'newWorkload', 't',
'userName=''jsmith''', 'defaultClass');
INSERT 1
Case Sensitivity
Values of type VARCHAR are case sensitive! Keep this in mind when defining predicates that
use attributes of this type. For example, although the predicate stmtType='alter' will be
accepted as a valid predicate (that is, it's a valid expression using a VARCHAR attribute), it will
not match any ALTER ... SQL statements because Aster Database recognizes each operation
only by the Aster Database constant used to represent it, which in this case is Alter with a
capital A instead of the all-lowercase alter.
3. Overlapping Workloads
Query can match multiple Workloads, so be careful when defining
4. Invalid Predicates
userName predicate is VARCHAR, so 'userName < 100' is invalid
Admission limits are configured using either the AMC or the command line to set specific
admission limits and to set the global admission threshold or limit.
Admission limits are created using an arbitrarily ordered list of rules to apply admission limits
to a particular transaction, executable, or query. These rules define the maximum number of
queries of a specific type (those that match a specific predicate) that are allowed to run
concurrently. Every admission limit is tied directly to one predicate (which must be a valid
SQL WHERE clause) and requires each task (transaction, job, or query) to pass all predicates
and admission limit counts before being admitted into the system.
The side effect of this is that a global admission threshold or limit can be set. The ncli qos
setconcurrency <concurrency> command sets and then displays the maximum query
concurrency.
This setting can be used as a global admission threshold to hold all tasks under a certain
concurrency limit. If the number of running tasks is at or above the set nc_qos_concurrency
value, no new tasks are admitted to the system. These tasks are not denied, but rather queued.
If this global admission threshold is not reached, admission limits determine if and when a
task is admitted into the system.
Admission limits are created using an arbitrarily ordered list of rules to apply
admission limits to a particular transaction, task, job, or query. These rules
define the maximum number of queries of a specific type (those that match a
specific predicate) that are allowed to run concurrently. Every admission
limit is tied directly to one predicate (which must be a valid SQL WHERE
clause) and requires each task (transaction, job, or query) to pass all
predicates and admission limit counts before being admitted into the system
For example, when administrators want to restrict certain users from submitting queries
(during specific business hours) the limit for that user would be set to 0 and all queries for that
user would be denied.
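The two-stage check described above can be sketched as follows. This is a hedged model of the described behavior, not the QosManager implementation: the global concurrency threshold is consulted first, then every matching admission limit; a limit of 0 denies outright, while a full queue holds the task rather than rejecting it.

```python
# Sketch of admission control: global threshold first, then per-predicate
# admission limits for every limit whose predicate matches the task.
def admit(running, global_limit, matching_limits):
    """matching_limits: (current_count, max_count) per matching limit."""
    if running >= global_limit:
        return "queued"          # held under the global threshold, not denied
    for current, maximum in matching_limits:
        if maximum == 0:
            return "denied"      # e.g. a user blocked during business hours
        if current >= maximum:
            return "queued"      # limit full: wait for a slot
    return "admitted"
```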
Admission control is performed at the transaction level, but is based on the context at that
particular moment in time for that session. When a user makes a connection, it is a
continuous session until that connection is ended. Within a session, one or more transactions
may happen.
The context of all three (session, transaction, and statement) at that particular moment in
time constitutes the context against which predicates are evaluated.
A transaction that starts with a BEGIN can only be evaluated against that statement and it
implies one or more statements may follow before the transaction progresses to the COMMIT,
ABORT, or END phase. However, a single statement is still its own transaction and can be
evaluated based on attributes within the statement when it is outside of an explicit transaction.
Statements in the same transaction will all execute once the transaction is admitted; however,
multiple transactions in the same session must each pass their own admission routine.
Note! Because predicate evaluation uses the context at that particular moment in time for
that session (which includes some values from the statement) a statement of BEGIN or
END may not work as one would expect against the table name checks or SQL-MR function
names. This means that in the context of admission limits, transactions with multiple
statements may not be evaluated as expected.
1. etl
2. strategic
3. tactical
a. etl can only run from midnight to 2:00 AM as Medium, else Deny
b. tactical runs High 1st minute, then Low thereafter
c. strategic has no limitations (High)
Mod 11
DD, scripts, ncli and ganglia
Using Ganglia
This view tells you which users have USAGE permissions to the schemas.
The system view nc_user_schemas will display all of the schemas for
which the currently logged in user has the USAGE privilege
This view tells you table information such as the table owner, table type, compression type
(if any), and access privileges.
You can validate the constraint definitions for your tables (ie:
CHECK, PK) using:
This set of system views, sometimes referred to as the Stats DB, maintains
information and statistics about various activities in the database cluster. This set of
system views is accessible only to members of the roles catalog_admin and
db_admin and all are read-only
This system view contains information about current and past sessions
This table contains information about all transactions. Individual statements are implemented as
stand-alone transactions, while everything in a BEGIN ... END block forms a single explicit transaction.
This table contains information about the phases of transactions (wait for admission,
executing, worker-to-worker (shuffle) transfer)
- check_table_last_vacuum_analyze.sh
- vacuum_catalogs.sh
- check_catalogs_deadspace_v2.sh
- nc_relationstats replaces the following:
nc_tablesize
nc_tablesize_details
getTableSize
ncluster_storagestat
- gen_drop_sql.sh
At this time you must contact a Teradata Aster consultant to purge the dead space for the PG
tables.
Output
When you run this script, you can point it at an object (such as a table) and it will return all the
views that are dependent on that table. The script creates a DROP script of all the dependent
objects, which must be executed before you can DROP that object.
1. Tables
2. Views
3. Users
Open gen_drop_sql.sh and edit the following variables for your Cluster
(this has already been done for you):
#-------------------------------------------------------------------------------------
# The following variables must be set for your environment.
#-------------------------------------------------------------------------------------
ASTER_DB_HOSTNAME="10.XXX.XXX.100"
ASTER_DB_DATABASE="dbname"
ASTER_DB_LOGON="db_superuser"
ASTER_DB_PASSWORD="XXXXX" # password for account
# ASTER_DB_PASSWORD="\$tdwallet(db_superuser_passwd)" # tdwallet
ASTER_DBA_SCHEMA="dba_maint" # working DBA schema
FULL_ASTER_CLIENT_DIR="/home/beehive/clients"
DepTable
v_depTable1
v_depTable10
v_depTable2
But when you attempt to DROP DepTable, you get the following ERROR
Using Bruno's script, point to the object type (-t) that has dependencies on it
(a table in our case) and to the object name (-n), which in our case is the table
named 'aaf.deptable'
bash gen_drop_sql.sh -t table -n aaf.deptable
DepTable
v_depTable1
v_depTable10
v_depTable2
Syntax
For more information on this function, see the Aster Database User Guide.
ncli allows you to generate output (such as cluster system statistics) in a format that you can
later analyze. ncli functionality includes a way to look at node status, vworker configuration,
I/O configuration, replication status, and process management job status. Operations may be
performed on one, a group, or all of the nodes. Output may be formatted in tables for screen
viewing, piped to another UNIX command, or saved to a file.
Partial listing:
See Appendix for
NCLI examples.
Plus go to Aster
Database Guide
for more details
To run most ncli commands, you should log in as the UNIX user, beehive.
This is in contrast to the AMC (Aster Database Management Console), which is focused on
setting up, managing, and scaling out Aster Database. The AMC is used by your in-house
Aster Database administrators and DBAs.
ncli works even when the Cluster is down. So when the AMC is unavailable because the
Cluster is down, NCLI is still functional
$ ncli --help
The capabilities of ncli are divided into sections, which are groups of commands with related
functions. The following table lists the sections:
A command line interface that can perform many of the AMC's duties, plus others that
the AMC cannot do
When you set up event subscriptions, you're setting up a subscription to be notified via SNMP
or email whenever events of a particular type occur. The ncli is the only way to add and
manage subscriptions.
The commands in the events section will run against the queen, even if executed from a
worker node. The syntax to run a command in the events section looks like this example:
It provides general tools for reporting and running UNIX commands on one or
more nodes in the cluster
Shows how much free memory and free swap are available on the Queen and all
Worker nodes
The Workload Management and Admission Limits commands enable you to connect with the
QosManager to access (to set, edit, remove, or show) the statistics, settings, and rules for
concurrency, workload management, and admission limits. This allows you to query the
admission queue to show why a particular task is still queued and not yet admitted.
The following workload management and admission limit commands are available:
The syntax to run a command in the qos section looks like this example:
Note that if you run the following commands as beehive, your session will be dropped:
The tables section provides general tools for returning information about tables
On the worker that seems to be the bottleneck, you can use Linux utilities such as the 'top'
command to find the process with high CPU usage or high I/O wait times
http://192.168.100.100/ganglia
Goal
Use Data Dictionary tables
Use Ganglia
Mod 12
Explain Plans, Joins and Table
Scans
1 EXPLAIN prints columns of output that are too wide to read easily on most screens.
A quick way to view more readable output is to switch the formatting of ACT before you
run EXPLAIN. To do this, type \x at the ACT prompt. This turns on 'expanded output,'
which pivots the results of the display to show each column on a new row, providing a
cleaner view. See Formatting-related commands in ACT in the Aster Database User's
Guide.
2 Always invoke ACT with the -h <queen hostname> argument. (Use the queen's actual
hostname or IP address; never use "localhost" as the queen hostname, even when running
EXPLAIN from the queen node.) If you fail to provide the queen's hostname or IP address,
you get a very verbose explain plan. With -h, it gives you a local plan, which provides a
better picture of the real plan.
3 When you generate an EXPLAIN plan in an ACT session that is running directly on the
queen, it provides more detail than you would get in an ACT session running on your local
workstation.
Query:
EXPLAIN SELECT * from EMP;
EXPLAIN shows the estimated cost (startup cost and total cost) for each of the phases in the
query's execution plan. Phases in the plan are organized into tasks and sub-tasks called
nodes. Cost values are shown for each node. The cost shown for a parent node includes all
the costs of its children, grandchildren, and so on.
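The parent-includes-children rule can be sketched as a recursive sum. This is an illustrative reading aid, not the planner's actual cost model; the numbers mirror the Aggregate over Seq Scan example shown later in this module.

```python
# Sketch of how nested EXPLAIN costs roll up: a parent node's total cost
# is its own work plus the total costs of all of its children.
def total_cost(node):
    own_cost, children = node
    return own_cost + sum(total_cost(child) for child in children)

seq_scan  = (96.10, [])            # leaf: Seq Scan on _tmp_0
aggregate = (21.54, [seq_scan])    # Aggregate adds its own 21.54
# total_cost(aggregate) rolls up to 117.64, the top-level EXPLAIN cost
```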
EXPLAIN also estimates the size of the result set and shows the network data transfers
involved (if any), and the purpose of each. Pay close attention to the estimated result set sizes.
Misestimating these is one of the planner's most common errors.
There are two levels of EXPLAIN used with Aster Database. The main plan is the parallel
execution plan, often referred to as the queen explain plan. This is the plan that the queen
will execute to complete the query. Next are the Postgres execution plans, often referred to as
the vworker explain plans. These are the plans the vworkers will follow in order to execute
the queries that have been sent to them by the queen. You can examine the queen-level plans
yourself. If you suspect a problem in an individual worker, your Teradata support
representative can check vworker-level explain plans for possible inefficiencies in processing.
When you run a query in Aster Database, the cluster components do the following:
1 Queen parses the query and does syntax checking.
2 Queen creates the parallel execution plan.
3 Queen sends the SQL to the vworkers to process.
4 vworkers send back the results.
5 Queen assembles and returns the finished query results.
Cost-based Optimizer
Rule-based Optimizer
COST is a logical unit; a combination of CPU cost, Memory cost and Disk cost
Two slide annotations accompany this record: (1) this is the query that the Queen will
execute to get the result set back to the user — it takes the intermediate result sets
from all the v-Workers and SUMs them up; (2) the plan is executed from the bottom up,
so first Append the child partitions of the partitioned table together, then Aggregate.
-[ RECORD 7 ]---------------------+-----------------------------------------------------------------
Number                            | 7
Operation Type                    | Actual statement execution
Location                          | Queen
Statement Type                    | Query
Result Table                      |
Finish Action Type                |
Final Result?                     | Y
Transaction Id                    |
Transfer Type                     |
Partition Attribute Offsets       |
Target Table Attributes           |
Statement                         | SELECT sum( "_c5" ) AS "count(1)" FROM "_tmp_0" AS "aggregateInp_"
Replacement Parameters            |
Query Plan and Estimates          | localCost=117.63..117.64 rows=1 width=8 networkCost=0.00
  : Aggregate (cost=117.63..117.64 rows=1 width=8)
  :   -> Seq Scan on _tmp_0 "aggregateInp_" (cost=0.00..96.10 rows=8610
  :   -> Seq Scan on _bee_p576_sales_fact_may_2008 "projInp_" (cost=0.00..536.77 rows=44377 width=0)
  :   -> Seq Scan on _bee_p577_sales_fact_jun_2008 "projInp_" (cost=0.00..529.79 rows=43779 width=0)
  :   -> Seq Scan on _bee_p578_sales_fact_jul_2008 "projInp_" (cost=0.00..1079.04 rows=89304 width=0)
  :   -> Seq Scan on _bee_p579_sales_fact_aug_2008 "projInp_" (cost=0.00..1093.60 rows=90460 width=0)
  :   -> Seq Scan on _bee_p580_sales_fact_sep_2008 "projInp_" (cost=0.00..1042.34 rows=86234 width=0)
  :   -> Seq Scan on _bee_p581_sales_fact_oct_2008 "projInp_" (cost=0.00..539.38 rows=44638 width=0)
  :   -> Seq Scan on _bee_p582_sales_fact_nov_2008 "projInp_" (cost=0.00..521.40 rows=43140 width=0)
  :   -> Seq Scan on _bee_p583_sales_fact_dec_2008 "projInp_" (cost=0.00..544.97 rows=45097 width=0)
  :   -> Seq Scan on _bee_p597_sales_fact_pre_2008 "projInp_" (cost=0.00..62.80 rows=5280 width=0)
  :   -> Seq Scan on _bee_p599_sales_fact_post_2008 "projInp_" (cost=0.00..62.80 rows=5280 width=0)
  :
Data Size Distribution (in bytes) | mean size=0 standard deviation=0
Data Size Distribution (in bytes) | mean size=0 standard deviation=0
3 Operation Type: Repartition tuples (Column_name(s)) -- This indicates that rows
(tuples) are being shuffled across different workers. The column name(s) are the
column(s) on which the data is being partitioned. Here are the exact reasons for
repartitioning data:
4 Operation Type: Broadcast tuples <reason for broadcast> This indicates that
the result set of rows (tuples) is being sent to all the workers. The following are the cases
that could cause broadcast of rows:
Location: This column indicates where this phase is being executed. This can be
Queen, Workers, or AnyWorker.
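Both Repartition and Broadcast move tuples between v-workers; repartitioning routes each row by hashing its partition column so that equal values land on the same v-worker. A toy sketch of that routing (the `cksum % NWORKERS` hash and the worker count are illustrative only, not Aster's actual distribution function):

```shell
# Toy repartition: route each row to a v-worker by hashing its key column.
# The hash (cksum % NWORKERS) stands in for Aster's internal hash function.
NWORKERS=4
route() {   # route <key>  -> prints the destination v-worker number
  printf '%s' "$1" | cksum | awk -v n="$NWORKERS" '{print $1 % n}'
}
# Rows with equal keys always land on the same v-worker, which is what
# makes a distributed GROUP BY or JOIN on that column possible.
route "dept_101"
route "dept_101"    # same bucket as the previous call
route "dept_202"
```

Because the hash is deterministic, every row carrying `dept_101` arrives at the same destination, no matter which v-worker it started on.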
Active Transports
+------------+---------------------+--------------------+
| Node | SessionId | TransportId |
+------------+---------------------+--------------------+
| 10.60.11.5 | 2327674903724048181 | 387487523833891928 |
| 10.60.11.6 | 2327674903724048181 | 387487523833891928 |
| 10.60.11.7 | 2327674903724048181 | 387487523833891928 |
+------------+---------------------+--------------------+
3 rows
[Diagram: the queenExec process on the queen (queendb) connects via ICE to v-Workers
v-W1 and v-W2 on Worker-1 and v-W3 and v-W4 on Worker-2.]
ICE is always a DataTransfer operation, but not all DataTransfer operations are ICE.
Suppose you have the following two tables (one Distributed and one Replicated) and
do a JOIN between them.
An Operation of Data Transfer means a shuffle is occurring. In this case we are copying
(actually hashing) rows from one v-Worker to another v-Worker. The table being shuffled
is the CLICKS_PAGE table.
No table is being shuffled (note that only columns are shown), so the Data Transfer
(ICE) destination will be the Queen.

create table users (userid int) distribute by replication;
create table users2 (user_id int) distribute by replication;

Step 3 declares the replicated table. Step 4 shows the Queen executing the final query.
In the top example, Repartitioning in the Explain Plan means there is a shuffle going on. In this
case, it's shuffling data in order to do an aggregation. That's because the table being
aggregated is a FACT table, which means like values must be copied to the same v-Worker in order
to aggregate. So although we save space using a FACT table (as opposed to a DIMENSION
table), the shuffle must be performed, which slows performance somewhat.
In the bottom example, there is no Repartitioning because the aggregation is being done on a
DIMENSION table so there is no need to copy data since every v-Worker has a copy of the
complete table. The bad news is you lose parallelism for the aggregation since only one v-
Worker will do the task.
The above shows 2 Repartitions (shuffles), resulting in network traffic. If the SALES_FACT table
were a Replicated table instead of a Distributed table, the Repartitioning could be eliminated.
When you JOIN 2 hashed tables and the JOIN columns don't match the hash columns, one of the tables
must be shuffled (via hash). The d.dept rows will be hashed and copied to get them onto the same
v-Workers as the e.dept rows.
This table is going to be shuffled.
A Repartition is a potential network bottleneck: Workers are sending their rows to other Workers.
Materialize = true means the data is stored in a TEMP (tmp) table and then passed to the next
node.
True: we materialize the intermediate result set into a TEMP table, where it can then be
queried by the next process.
EXPLAIN select e.last_name, d.dept from employee e FULL OUTER JOIN
dept d on e.department_number = d.dept;
The Explain Plan at the vWorker level is read from the bottom up
EXPLAIN select count(*) from sales_fact_lp;
Query Plan and Estimates | localCost=9733.10..9875.15 rows=2 width=0 networkCost=0
  : Aggregate (cost=9875.14..9875.15 rows=1 width=0)
  :
  :   -> Append (cost=0.00..8182.31 rows=677131 width=0)
  :     -> Seq Scan on sales_fact "projInp_" (cost=0.00..62.80 rows=5280 width=0)
  :     -> Seq Scan on _bee_p572_sales_fact_jan_2008 "projInp_" (cost=0.00..542.08 rows=44808 width=0)
  :     -> Seq Scan on _bee_p573_sales_fact_feb_2008 "projInp_" (cost=0.00..491.52 rows=40652 width=0)
  :     -> Seq Scan on _bee_p574_sales_fact_mar_2008 "projInp_" (cost=0.00..545.02 rows=45102 width=0)
  :     -> Seq Scan on _bee_p575_sales_fact_apr_2008 "projInp_" (cost=0.00..528.00 rows=43700 width=0)
  :     -> Seq Scan on _bee_p576_sales_fact_may_2008 "projInp_" (cost=0.00..536.77 rows=44377 width=0)
  :     -> Seq Scan on _bee_p577_sales_fact_jun_2008 "projInp_" (cost=0.00..529.79 rows=43779 width=0)
  :     -> Seq Scan on _bee_p578_sales_fact_jul_2008 "projInp_" (cost=0.00..1079.04 rows=89304 width=0)
  :     -> Seq Scan on _bee_p579_sales_fact_aug_2008 "projInp_" (cost=0.00..1093.60 rows=90460 width=0)
  :     -> Seq Scan on _bee_p580_sales_fact_sep_2008 "projInp_" (cost=0.00..1042.34 rows=86234 width=0)
  :     -> Seq Scan on _bee_p581_sales_fact_oct_2008 "projInp_" (cost=0.00..539.38 rows=44638 width=0)
  :     -> Seq Scan on _bee_p582_sales_fact_nov_2008 "projInp_" (cost=0.00..521.40 rows=43140 width=0)
  :     -> Seq Scan on _bee_p583_sales_fact_dec_2008 "projInp_" (cost=0.00..544.97 rows=45097 width=0)
  :     -> Seq Scan on _bee_p597_sales_fact_pre_2008 "projInp_" (cost=0.00..62.80 rows=5280 width=0)
  :     -> Seq Scan on _bee_p599_sales_fact_post_2008 "projInp_" (cost=0.00..62.80 rows=5280 width=0)
The Append node sums the rows of its child partitions.
(A second plan was shown alongside this one on the slide; only fragments survive:
Seq Scan on sales_fact (cost=0.00..23.10 rows=1310 width=0),
Seq Scan on sales_fact_200801 (cost=0.00..806.50 rows=43950 width=0), and
Seq Scan on sales_fact_200812 (cost=0.00..812.84 rows=44284 width=0).)
Here we see the vWorker is going to perform full table (sequential) scans of
all of the logical partitions of the sales_fact table
The total cost for scanning the July 2008 logical partition is 0.00 to read the
first row, while the cost to read all rows is 1079.04. We will return 89,304
rows (based on ANALYZE). Width = 0 tells us that we are not returning any
columns. (Width = # of bytes returned)
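A quick sanity check on the plan above: the Append node's row estimate is exactly the sum of its children's estimates. The fifteen Seq Scan estimates total 677,131, matching the Append's rows=677131. A small awk check, with the row counts copied from the plan output:

```shell
# Sum the per-partition row estimates from the EXPLAIN output above and
# confirm they equal the Append node's estimate (rows=677131).
total=$(printf '%s\n' 5280 44808 40652 45102 43700 44377 43779 \
                      89304 90460 86234 44638 43140 45097 5280 5280 |
        awk '{ s += $1 } END { print s }')
echo "sum of child estimates: $total"   # 677131
```

If the children do not sum to the parent's estimate, you are probably reading two different plans (or a truncated one).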
The cost of getting the first row is 0. The cost to read all the rows is 34.00. The number of rows
returned will be 9600. The average width of the row is 4, and the network cost to transfer these
rows from the workers to the queen is 37.
From the second line onward, the output is from the node where the query is going to be
executed. In case the query executes at multiple workers, output from the slowest worker is
displayed to the user. One can see which low-level algorithms will be applied at the nodes. In
phase 1 for example, the slowest worker will perform a sequential scan on explain_t1 to
satisfy the query.
Data Size Distribution (in bytes): This column gives an estimate on the mean and standard
deviation of data coming from worker nodes. This is only applicable for queries that require
transfer of data from either the worker nodes to the queen node or amongst the workers.
The first line of the output displays the cost summary of the entire query on
the vWorker. Here is how to interpret the query plan and estimates in the
EXPLAIN output from the vWorkers:
- The cost of getting the first row is 9581.10.
- The cost to read all the rows is 9723.15. (There are costs for Join, Sort,
  Append, Aggregate, etc.)
- The number of rows returned will be 2 (the # of vWorkers).
- The average width of the row is 0 (bytes).
- The network cost to transfer these rows from the Workers to the Queen is 0.
The basic unit of cost is a decimal value where one unit represents the cost
of fetching 8 KB of sequential data from disk. The cost of transferring 1
KB of network data is also set to 1 unit.
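The per-scan numbers decompose the same way Postgres prices a sequential scan: total cost = (pages × seq_page_cost) + (rows × cpu_tuple_cost), with defaults seq_page_cost = 1.0 and cpu_tuple_cost = 0.01. Taking the July 2008 partition above (total cost 1079.04, 89,304 rows), the page component comes out to a whole number of pages, which is a handy way to confirm you are reading the numbers correctly:

```shell
# Decompose a Seq Scan cost into its page and per-tuple components:
#   total = pages*seq_page_cost + rows*cpu_tuple_cost  (defaults 1.0, 0.01)
pages=$(awk 'BEGIN { total = 1079.04; rows = 89304
                     printf "%.2f\n", total - rows * 0.01 }')
echo "estimated pages: $pages"   # 186.00 -> 186 pages of 8 KB each
```

Try the same arithmetic on the other partitions in the plan; each should also yield an integral page count.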
The statement that is executed on the Queen is shown. This is the query
that will be executed to return results to the User
Take a look at the examples when executing the same query against a non-LP table and an LP
table. Performance improves (as denoted by a lower cost) where Partition Pruning occurs. But when
there is no WHERE clause, the non-LP table is the better performer.
Here is what to look for when examining the Explain plan output:
For Scan type of operations, there are only 3 possible operation types by
which a table can be accessed:
1 Sequential scan Performs full table scan, visits all data blocks
2 Index scan (only if indices are available AND the query specifies
the column being indexed AND the optimizer thinks that the
query's filtering on the column will select out < 10-20% of the
base fact table) (note: again, what matters is what the Optimizer
thinks, and so ANALYZE ANALYZE ANALYZE)
3 Bitmap Index scan (if multiple indices are available AND the query
filters on columns on which these indices have been constructed
AND the optimizer thinks that the query's filtering on these
columns will select out <20% of the base fact table)
Most queries in Aster will be Sequential scans. All data blocks are read once.
- Fast to startup
- Sequential I/O is much faster than random access if more than 80% of the
records are fetched from the table
- Only has to read each data block once
- Produces Unordered output
begin;
set enable_seqscan to 'on';      -- hints to force the Optimizer
set enable_indexscan to 'off';   -- to do our bidding
set enable_bitmapscan to 'off';
explain select customer_id, product_id from sales_fact where
customer_id = 467 and product_id = 42;
end;
- Since I was only looking for a single row from the table, there might be a
better way to retrieve this row than a Full Table Scan. See next page
Index scans can be very fast. If the query has the indexed column in its WHERE clause and
Aster Database estimates that the predicate will select less than 20% of the rows in the table
(based on Aster Databases table statistics) then Aster Database is likely to choose an index
scan rather than a sequential scan.
begin;
set enable_seqscan to 'off';
set enable_indexscan to 'on';
set enable_bitmapscan to 'off';
explain select customer_id, product_id from sales_fact where
customer_id = 467 and product_id = 42;
end;
begin;
set enable_seqscan to 'off';
set enable_indexscan to 'off';
set enable_bitmapscan to 'on';
explain select customer_id, product_id from sales_fact where
customer_id = 467 and product_id = 42;
end;
1- HASH JOIN
2- MERGE JOIN
A Hash join is typically faster than a merge join, given enough memory. This is the join
method most typically observed. The Postgres optimizer picks the smaller table for
constructing the hash table by looking at the unique distribution of values in the smaller table.
If the table hasn't been analyzed, then the Postgres optimizer could mistakenly assume that the
hash table will fit in memory when, in fact, the table could require more memory than the
available RAM. In such cases, there will be Swap activity. That is the surest sign
that the estimates are wrong.
EXPLAIN select e.last_name, d.dept from employee e INNER JOIN dept d on e.department_number = d.dept;
Hash Joins are efficient because they are single-pass, whereas sorting in a
Merge join may be multi-pass.
Works great when the joining column of the smaller of the two tables will
fit into a Hash table in work_mem amount of memory.
It might be that the Optimizer thinks it will fit but, because ANALYZE has not
been run, the Hash table actually doesn't fit; in that case the Hash table
spills to virtual memory (the swap file) and performance will deteriorate.
A hash join tends to be the more efficient join type to use when the joining column of the
smaller of the two tables will fit into a hash table in the allocated (work_mem) amount of
memory. A hash join is efficient because it takes place in a single pass, whereas sorting in a
merge join may require more than one pass.
The optimizer picks the smaller table for constructing the hash table by looking at the
unique distribution of values in the smaller table.
It's important that table statistics are up to date. Make sure you run ANALYZE after any
significant change to a table's contents. Otherwise, the optimizer may mistakenly guess
that a column's contents will fit into an in-memory hash table when, in fact, they will not.
In such cases the hash table spills to virtual memory, and performance is likely to be poor.
If this happens, you will see disk swap activity on the worker node. That is a sign the plan's
estimates might be wrong.
- The Optimizer picks the smaller table for constructing the Hash table by
looking at the unique distribution of values in the smaller table
- If the table hasn't been analyzed, then the Optimizer could mistakenly
assume that the Hash table will fit in memory when, in fact, the table
could require more memory than the available RAM. In such cases,
there will be Swap activity happening. That is the surest sign that the
estimates are wrong (e.g., check with vmstat 5 5)
EXPLAIN select e.last_name, d.dept from employee e inner join dept d on e.department_number = d.dept;
In this case, DEPT is the Inner table and will be the HASH table
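The build-then-probe pattern behind a hash join can be sketched in awk, which makes it easy to see why the smaller (inner) table becomes the hash table: it is held in memory once, and the larger table streams past it in a single pass. Table contents here are invented for illustration:

```shell
# Hash join sketch: build an in-memory hash on the smaller table (dept),
# then stream the larger table (employee) past it, probing row by row.
# The CSV contents are made-up sample data.
cat > /tmp/dept.csv <<'EOF'
401,Marketing
402,Engineering
EOF
cat > /tmp/emp.csv <<'EOF'
Smith,401
Jones,402
Brown,401
EOF
awk -F, '
  NR == FNR  { hash[$1] = $2; next }        # build phase: hash the inner table
  $2 in hash { print $1 "," hash[$2] }      # probe phase: one pass over outer
' /tmp/dept.csv /tmp/emp.csv
```

The build side is touched once and the probe side once, which is the single-pass property the notes above credit for the hash join's speed.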
A Merge join is the most scalable and widely usable method of joining. That also typically
makes it the slowest.
The Merge join is implemented by sorting both the tables on the columns being joined and
then streaming the top few rows from each table to do the join. This makes for a very low
memory utilization during the join operation. But the sorting step requires memory, and
again misestimation causes problems. If Postgres thinks a table sorting will fit in memory,
then it will use a quicksort algorithm. Else, it will use an external disk-based sorting algorithm
(using at most work_mem amount of memory at any point during the sort).
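The sort-then-stream pattern described above is exactly what the coreutils `join` command implements, so it makes a convenient toy illustration: both inputs must be sorted on the join column, after which `join` streams the top rows of each. Table contents are invented:

```shell
# Merge join sketch with coreutils: sort both inputs on the join key,
# then `join` streams matching rows from the top of each sorted file.
printf '402,Jones\n401,Smith\n'            | sort -t, -k1,1 > /tmp/emp_sorted.csv
printf '401,Marketing\n402,Engineering\n'  | sort -t, -k1,1 > /tmp/dept_sorted.csv
join -t, /tmp/emp_sorted.csv /tmp/dept_sorted.csv
# -> 401,Smith,Marketing
#    402,Jones,Engineering
```

Note that the expensive part is the two sorts; the merge itself only ever holds a few rows from each side in memory, which is the low-memory property the notes describe.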
Merge Join is the most scalable and widely usable way of joining. That also
typically makes it the slowest
The Merge join is implemented by Sorting both the tables on the columns
being joined and then streaming the top few rows from each table to do the
join
It is the only practical way to do Joins on large tables that do not fit in memory.
Works great for all cases where the tables are large and the Optimizer
thinks they cannot fit in work_mem amount of memory.
begin;
set enable_hashjoin to 'off';
EXPLAIN select p1.userid, p2.page from page_view_fact p1 inner join
page_view_fact p2 on p1.refdomain = p2.refdomain;
end;
Nested loop joins are the fastest but can't always be used. Since the default statistics for a table
that has not been analyzed tend to produce table size estimates that are too small, and since
nested loop joins are the joins of choice for smaller tables, the Postgres planner, by default,
tends to pick nested loops. For this reason, all Aster Database deployments ship with the
enable_nestedloop parameter set to off. Turn it on with care, and only after you have done
an EXPLAIN ANALYZE directly on the worker Postgres instances! (Aster Database does not
support EXPLAIN ANALYZE, but Postgres does, so you can run it by connecting directly to a
Postgres instance.)
Nested loop joins are very useful for performing a star-schema join between a large fact table
and a very small dimension table (either the dimension table is itself small or the dimension
table is being filtered down to a small set of rows). In such cases, consider creating an index on
the large fact table on the column being joined with the very small dimension table. Turn on
nested loops, and then check to see that the Postgres optimizer is scanning the very small
dimension table first and then using its joining column values as probes into the large fact
table using the index.
Nested Loops Join can be fastest but only where it makes sense
Since the default statistics for a table that has NOT been analyzed produce
very small table size estimates, the Optimizer by default has a tendency to
pick nested loops. That's why all Cluster deployments today ship with
enable_nestedloop set to off
Works great when the Optimizer thinks both the tables are really small OR
One table is much, much smaller than the other AND
the larger table has an Index on the joining column
Nested loop joins come in really handy when you are doing a star-schema
join between a large Fact table and very small Dimension table
In such cases, consider creating an Index on the large Fact table on the
column being joined with the very small Dimension table
Turn on nested loops, and then check to see that the Optimizer is scanning
the very small dimension table first and then using its joining column values
as it probes the large fact table using the Index
begin;
set enable_mergejoin to 'off';
set enable_hashjoin to 'off';
set enable_nestloop to 'on';
EXPLAIN select p1.userid, p2.page from page_view_fact p1 inner join
page_view_fact p2 on p1.refdomain = p2.refdomain;
end;
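The probe pattern the nested loop uses can be sketched directly in shell: for every row of the outer table, scan the inner table looking for matches. Without an index that is O(outer × inner) work, which is why it only pays off when one side is tiny. Data invented for illustration:

```shell
# Nested loop join sketch: for every row of the outer table, scan every
# row of the inner table (O(outer x inner) without an index).
cat > /tmp/outer.csv <<'EOF'
Smith,401
Jones,402
EOF
cat > /tmp/inner.csv <<'EOF'
401,Marketing
402,Engineering
EOF
result=$(
  while IFS=, read -r name dept; do          # outer loop: employee rows
    while IFS=, read -r d dname; do          # inner loop: dept rows
      [ "$dept" = "$d" ] && echo "$name,$dname"
    done < /tmp/inner.csv
  done < /tmp/outer.csv
)
echo "$result"
```

With an index on the inner table, the inner loop collapses to a lookup per outer row, which is the fast star-schema case the notes describe.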
- Slowest Join in theory
- But fast to produce the first record
- In practice, it's usually desirable for OLTP queries
- Performs poorly if the second child is slow
- Only Join capable of executing a CROSS JOIN
- Only Join capable of non-equi JOIN conditions
- Used when the HASH table can't fit in memory
For all Join operations, the most important cluster parameter to think
about is amount of physical memory in the machine (and the number of
v-Workers in that physical worker)
Mod 13
Bottlenecks and Tuning
Proactive
Reactive
1. Viewpoint
2. AMC
3. NCLI
4. Linux utilities
Bottleneck
Bottleneck
Bottleneck
Measuring your system is the first step in determining where the bottleneck exists.
Several bottlenecks may exist simultaneously.
If you need to move big data, make it small first, and then move small data.
Prepare the data model in advance to ensure that queries touch the least amount of data.
Prepare your queries such that each computation is done exactly once, and never again.
Hardware resources
Software
SQL
CPU bottlenecks can be viewed via: AMC, Ganglia, and Linux utilities.
Typical symptoms:
All CPUs are 100% busy
One CPU is constantly 100% busy but others are idle
[CPU utilization gauge: >90% / >70% / Normal]
https://192.168.100.100/ganglia
It can display system summary information as well as a list of tasks currently being
managed by the Linux kernel.
The types of system summary information shown and the types, order and size of
information displayed for tasks are all user configurable and that configuration can be
made persistent across restarts.
Run 'top' from the Linux command line. Press 'h' at any time to toggle online help.
The 'top' command displays a variety of information about the processor state. The display is
updated every five seconds by default, but you can change that with the 'd' command-line
option or the interactive command, 's'.
When you run top, it displays the following information in the console:
Up (uptime) This line displays the time the system has been up, and the three load averages
for the system. The load averages are the average number of processes ready to run during the
last 1, 5, and 15 minutes. This line is just like the output of the 'uptime' command. The uptime
display may be toggled by the interactive command, 'l'.
Tasks / processes Shows the total number of processes running at the time of the last update.
This is also broken down into the number of tasks which are running, sleeping, stopped, or
undead. The processes and states display may be toggled by the interactive command, 't'.
Cpu(s) Shows the percentage of CPU time in user mode, system mode, 'niced' tasks, iowait
and idle. ('Niced' tasks are only those whose nice value is positive.) Time spent in niced tasks
will also be counted in system and user time, so the total will be more than 100%. The processes
and states display may be toggled by the interactive command, 't'.
Mem Statistics on memory usage, including total available memory, free memory, used
memory, shared memory, and memory used for buffers. The display of memory information may
be toggled by the interactive command, 'm'.
Swap Statistics on swap space, including total swap space, available swap space, and used
swap space. The contents of this field and the Mem field are just like the output of the 'free'
command.
top -07:01:51 up 5 days, 12:17, 1 user, load average: 0.38, 0.29, 0.16
Tasks: 178 total, 2 running, 176 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2% us, 0.2% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.3% si
Mem: 6046832k total, 4075068k used, 1971764k free, 156340k buffers
Swap: 11807348k total, 0k used, 11807348k free, 3510632k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22485 beehive 20 0 654m 13m 11m S 74 0.2 0:04.58 postgres
22486 beehive 20 0 654m 13m 11m R 54 0.2 0:03.72 postgres
22495 beehive 20 0 78360 11m 1656 S 23 0.2 0:00.62 IWTServerExec
5968 beehive 20 0 31900 1404 572 S 4 0.0 0:20.92 postgres
22497 beehive 20 0 75288 8772 1188 S 4 0.1 0:00.02 IWTServerExec
22509 beehive 20 0 10668 1296 892 R 4 0.0 0:00.02 top
5293 beehive 20 0 268m 24m 1628 S 2 0.4 142:59.20 python
1 root 20 0 2612 580 492 S 0 0.0 0:04.98 init
On the worker node, look for the Postgres processes. There should be one for each vworker on
that worker node. When you view the 'top' output during query execution, seeing fewer
running Postgres processes than you have vworkers on the node may indicate processing
skew: some vworkers have completed processing the query before the others.
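Counting the Postgres backends against the node's v-worker count is easy to script. A sketch using process-listing lines pasted from the sample top output above (on a live node you would pipe `ps` or `top -b -n 1` instead; the v-worker count of 3 is a hypothetical value for this node):

```shell
# Count postgres processes in a process listing and compare to the
# expected v-worker count. Sample lines pasted from the top output above.
cat > /tmp/top_sample.txt <<'EOF'
22485 beehive 20 0 654m 13m 11m S 74 0.2 0:04.58 postgres
22486 beehive 20 0 654m 13m 11m R 54 0.2 0:03.72 postgres
22495 beehive 20 0 78360 11m 1656 S 23 0.2 0:00.62 IWTServerExec
5968 beehive 20 0 31900 1404 572 S 4 0.0 0:20.92 postgres
EOF
VWORKERS=3   # hypothetical v-worker count for this node
running=$(awk '$NF == "postgres" { n++ } END { print n+0 }' /tmp/top_sample.txt)
echo "postgres processes: $running (expected $VWORKERS)"
```

A count persistently below the v-worker total during a long query is the skew signal described above.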
Taken together, the Postgres processes should consume almost all available CPU. A properly
configured Aster Database should be CPU bound when processing its typical large queries,
rather than I/O bound or network bound.
Look for the IceServer process on current versions of Aster Database, or the IWTServerExec
process on pre-4.5 versions of Aster Database. This is the process that shuffles data between
vworkers.
Look for high amounts of swap activity. The 'Swap' field appears near the top of your console:
Mem: 6046832k total, 6006724k used, 40108k free, 786540k buffers
Swap: 12096936k total, 228k used, 12096708k free, 3591048k cached
While running 'top', pressing the '1' key will toggle the per-processor CPU usage display.
top -08:10:29 up 5 days, 13:26, 1 user, load average: 0.00, 0.02, 0.00
Tasks: 167 total, 1 running, 166 sleeping, 0 stopped, 0 zombie
Cpu0: 0.3% us, 1.0% sy, 0.0% ni, 95.7% id, 0.0% wa, 0.0% hi, 3.0% si
Cpu1: 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2: 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3: 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 6046832k total, 4066752k used, 1980080k free, 156516k buffers
Swap: 11807348k total, 0k used, 11807348k free, 3518552k cached
Look at the wait (wa) percentage. Long wait times can indicate swapping. One CPU running
at a higher use percent ('us') than the rest can indicate processing skew.
On the Worker node, look for the Postgres processes. There should be one for
each v-Worker on that worker node. If you see:
Fewer running Postgres processes than you have v-workers on the node,
this may indicate processing skew
Taken together, the Postgres processes should consume almost all available
CPU. A properly configured Database should be CPU bound when processing
its large queries, rather than I/O bound or Network bound
vmstat syntax
vmstat -a -n 5 4
-a = Active/Inactive Memory switch
-n = Display the header 1 time
5 = Delay between system output
4 = Number of iterations
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free inact active si so bi bo in cs us sy id wa st
0 0 368160 84528 403120 346772 3 9 101 150 159 343 1 1 95 2 0
0 0 368160 84288 403148 346816 0 0 0 258 95 136 0 1 96 3 0
0 0 368160 83164 403156 348264 0 0 0 0 75 155 1 0 99 0 0
0 0 368160 82660 403156 348264 0 0 0 4 62 139 0 0 99 0 0
Procs
r: The number of processes waiting for run time.
b: The number of processes in uninterruptible sleep.
Memory
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)
Swap
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).
IO
bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).
System
in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.
CPU These are percentages of total CPU time.
us: Time spent running non-kernel code. (user time, including nice time)
sy: Time spent running kernel code. (system time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, shown as zero.
The UNIX vmstat command is useful for finding out how busy a worker node is.
Specifically, vmstat prints statistics showing memory usage, disk paging, I/O wait
times, and CPU activity
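Sustained nonzero si/so columns are the swapping signal to watch for. A small awk filter over vmstat output flags them; the sample rows are pasted from the output above, and the first vmstat row is skipped because it reports averages since boot:

```shell
# Flag vmstat samples that show swap activity (nonzero si or so).
# With this output layout, si is field 7 and so is field 8.
cat > /tmp/vmstat_sample.txt <<'EOF'
0 0 368160 84528 403120 346772 3 9 101 150 159 343 1 1 95 2 0
0 0 368160 84288 403148 346816 0 0 0 258 95 136 0 1 96 3 0
0 0 368160 83164 403156 348264 0 0 0 0 75 155 1 0 99 0 0
EOF
swapping=$(awk 'NR > 1 && ($7 > 0 || $8 > 0) { n++ } END { print n+0 }' \
           /tmp/vmstat_sample.txt)
echo "samples (after the since-boot row) with swap activity: $swapping"
```

Here the post-boot samples show no swapping; on a memory-starved worker you would see the count climb with every interval.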
[Two vmstat captures were shown: one at rest and one under load.]
For example, take three big tables T1, T2, T3, each 100GB in size.
Sequential: one user runs the following:
BEGIN; Select * from T1; Select * from T2; Select * from T3; END;
This is an all-or-nothing query: one error causes a ROLLBACK.
It is highly possible for one v-Worker to use only one CPU
(this assumes all 3 tables are REPL tables).
Ensure you have 100-GB bandwidth and high-end switches that can handle the increased bandwidth.
If possible, ensure only query traffic is on the link: configure Virtual LANs (VLANs) so that only
query traffic is allowed on the link.
It is also possible to enable NIC bonding. See the Aster Database User Guide for more details.
Potential solution: NIC bonding ties 2 NICs together for double the
throughput, assuming the network wire is not over-consumed.
https://192.168.100.100/ganglia
Typical symptom:
Utilization (%util) shown by iostat is high (> 90%)
https://192.168.100.100/ganglia
Table skew
Join skew
GROUP BY (aggregate) skew
For example, if your statement has GROUP BY city, gender, then only two v-Workers would
hold the two genders. If you instead did GROUP BY gender, city, then you have a much
better chance of distributing the rows across more v-Workers.
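Whatever column set the rows end up being redistributed on, its number of distinct values caps how many v-workers can participate. A toy demonstration (the `cksum % NWORKERS` hash is illustrative only, not Aster's real distribution function):

```shell
# Toy demo: hashing on a 2-value column (gender) can use at most 2 of the
# available buckets, so most v-workers sit idle during the aggregation.
# A compound key with more distinct values can spread across more buckets.
NWORKERS=8
buckets_used() {   # buckets_used <values...> -> distinct buckets hit
  for v in "$@"; do
    printf '%s' "$v" | cksum | awk -v n="$NWORKERS" '{print $1 % n}'
  done | sort -u | awk 'END { print NR }'
}
echo "gender alone:         $(buckets_used M F) bucket(s) used"
echo "gender+city compound: $(buckets_used M,NYC F,NYC M,LA F,LA) bucket(s) used"
```

No matter how many v-workers the cluster has, the two-valued key can never engage more than two of them.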
Start from the Cluster EXPLAIN Plan step whose disk characteristics you want to
understand. Typically this is the first phase, which scans the base fact tables that are
being queried.
The Linux du command is used to look for data skew on a worker node. Its syntax is:
Data on the Aster Database workers is stored in directories with names like /primary/w5z
(vworker number 5), /primary/w12z (vworker number 12), and so on. Collectively, we refer
to these directories as the "w*z" directories. We can examine these directories to check for data
skew. For example, to show space usage for all the virtual workers on a node, you type:
du -sh /primary/w*z
Note that vworker number 38 has 60 MB more data than the lowest vworker!
Skew on one v-Worker can cause disk I/O bottlenecks as the v-Worker
on that Worker node must process more data compared to others
See left-hand page for how to use Linux du and SQL-MR function nc_relationstats to find skew
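The du check can be scripted to report the spread between the largest and smallest v-worker directly. A sketch using pasted `du -sk`-style sample sizes (in KB, invented to match the roughly 60 MB spread noted above); on a live node you would feed it `du -sk /primary/w*z`:

```shell
# Detect data skew across v-worker directories: report min, max, and the
# spread. The sample sizes below are illustrative stand-ins for `du -sk`.
cat > /tmp/du_sample.txt <<'EOF'
102400 /primary/w5z
104448 /primary/w12z
163840 /primary/w38z
EOF
report=$(awk '
  NR == 1 || $1 < min { min = $1 }
  $1 > max            { max = $1; big = $2 }
  END { printf "min=%dKB max=%dKB spread=%dKB (largest: %s)\n", min, max, max - min, big }
' /tmp/du_sample.txt)
echo "$report"
```

A large spread (here about 60 MB) flags the v-worker that will finish last on every scan of that data.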
How does the Optimizer estimate how much data a filtering condition would
filter out?
- Given a condition like 'WHERE a>10', how does the Optimizer estimate
that <10% of the base Fact data set is selected and so it is time to use
the Index on a?
WHERE clause to reduce the size of the Intermediate Result set before further processing
Logically Partitioned tables to reduce I/O on certain queries
Columnar tables when warranted
Datablock size = 32 KB. So if you can fit more rows into a datablock, you
naturally read fewer datablocks.
If the /primary file system is filling up there are several possible reasons.
We need to check them one by one:
The ncli node runonall command may be used to run any executable on multiple nodes.
It can also be used to run a command from a file. The executable must exist on all nodes prior
to the command being run. For some commands (like df), the command already exists on all
nodes. If a user-written script is being executed, then it must be copied to all nodes using ncli
node clonefile or a similar mechanism. This effectively allows you to run commands in
parallel over SSH on the cluster.
You can pass the following runtime parameter names as the name clause of a SET statement.
For each, we list the set or range of allowed values. When assigning a value, enclose the value
in single quotes, unless otherwise specified.
enable_bitmapscan
Enable or disable the Local Planner's use of bitmap-scan plan types. Default value is 'on'.
enable_seqscan
Enable or disable Local Planner's use of sequential-scan plan types. Default value is 'on'.
random_page_cost
Sets the Local Planner's estimate of the cost of a disk page fetched non-sequentially from
disk. The default value is 40. Reducing this value relative to 'seq_page_cost' will cause the local
planner to prefer index scans over sequential scans. Increasing this value will lead to sequential
scans being preferred over index scans.
effective_cache_size
Sets the Local Planner's assumption about the effective size of the disk cache that is available to
a single query. This is factored into estimates of the costs of specific plans, i.e. whether to use an
index or not. A higher value makes it more likely to use an index scan, whereas a lower value
makes it more likely that sequential scans will be used. The default value is 2GB.
Tip: you would rarely turn off 'seqscan', and you would
rarely tune 'random_page_cost' or 'effective_cache_size'
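As a sketch of the SET syntax and quoting convention described above (values in single quotes; shown only to illustrate the statement form, since turning off seqscan is rarely advisable):

```sql
-- Discourage sequential scans for the current session
SET enable_seqscan = 'off';

-- Restore the default
SET enable_seqscan = 'on';
```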
work_mem
Specifies the amount of memory to be used by internal sort operations and hash tables before
switching to temporary on-disk files. The default value, per virtual worker, is 64MB. Note that for
a complex query, several sort or hash operations might be running in parallel; each one will be
allowed to use as much memory as this value specifies before it starts to put data into temporary
files. Also, several running sessions could be doing such operations concurrently. Lastly, there
are multiple virtual workers per worker node. So the total memory used could be many times the
value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort
operations are used for ORDER BY, DISTINCT, and merge joins.
vmstat syntax
In-line lab: ncli node runonall "vmstat 5 5" shows vmstat output from the Queen
and all Worker and Loader nodes
Query tuning: in general, keep the default. Instead, use transaction-scoped SET
commands like the following to increase work_mem in an effort to increase the chance of a
Hash Join instead of a Merge Join
begin;
set work_mem = '196000'; (about 191 MB, but only for the life of the transaction)
SELECT * FROM <table>; (table name is a placeholder)
end;
1. Check for bottlenecks. Use AMC and Ganglia to find the overworked
worker
2. Run ANALYZE regularly to ensure the most optimal query plans
3. Run VACUUM to ensure queries achieve best performance
4. Make sure you're using transactions. When UPDATING or INSERTING
data into Aster Database, wrap statements in a transaction
5. Make sure you've written your queries in a way that allows Aster
Database to parallelize the work as much as possible
6. Check the EXPLAIN plan to make sure Aster Database has chosen a
good plan for running the query
7. If you suspect Aster Database has chosen the wrong join technique,
you can drop hints via SET ENABLE commands
8. If you suspect too much data is being shuffled among workers, refer to
Aster User Guide
9. If none of the techniques listed above solve the problem, try to isolate
the problem using the Linux tools
Tuning is only as good as the data model: get that right and optimized for
Aster!
If you can express the same results using either GROUP BY or DISTINCT (hash
to v-Worker, then dedup), use the GROUP BY. This gives the Optimizer
more options to choose from.
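For example, these two queries return the same result set (table and column names are illustrative), but the GROUP BY form gives the Optimizer more latitude:

```sql
-- DISTINCT form: hash to v-Worker, then dedup
SELECT DISTINCT user_id FROM clicks;

-- Equivalent GROUP BY form, which the Optimizer can plan more flexibly
SELECT user_id FROM clicks GROUP BY user_id;
```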
less /primary/tmp/iostat.write.directio.log
Time: 10:23:47
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 2.82 0.00 0.00 97.18
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 10.56 0.00 2.82 0.00 0.05 38.00 0.00 0.00 0.00 0.00
iostat displays a single history-since-boot report for all CPUs/devices
############################
Wed Sep 23 13:12:10 PDT 2009
Write test directio
1857+0 records in
1857+0 records out
1947205632 bytes (1.9 GB) copied, 7.36374 s, 264 MB/s
How to find queries that bottleneck the system (on less busy systems the
AMC can be used to identify slow queries)
1. Identify the point in time the cluster is really busy. You could look at
Ganglia to figure this out
2. Identify one of the nodes where the resource utilization is heavy
3. ssh to that node and run atop -r /var/log/atop.log.1 (atop stores
logs for about 7 days)
4. Travel back to the point in time when the cluster was busy
5. Get one of the postgres backend PIDs whose resource utilization
(CPU, Disk, RAM) is heavy.
6. Associate this postgres backend PID on the worker with the queenExec
PID on the queen
You can tie the postgres pid ( from atop output ) to the queenExec by searching
queenExec logs for backendPid=<pid>
The queenExec PID will be listed as: [fromPid:32653]
7. Get the query from the queenExec PID. (displayed in queenExec.log)
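Step 6 above can be done with a grep over the queenExec log. The sample log line below is illustrative only, built from the field names given above (backendPid=, [fromPid:...]); it is not the exact queenExec.log layout:

```shell
# Illustrative only: a queenExec.log-style line tying a worker backend PID
# to the queenExec PID shown as [fromPid:...]
printf 'queenExec [fromPid:32653] backendPid=18042 ...\n' > /tmp/queenExec.sample

# On a real cluster you would grep /primary/logs/queenExec.log instead
grep -o 'fromPid:[0-9]*' /tmp/queenExec.sample
# prints fromPid:32653
```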
Solution: Add battery for cache, add NIC card, take advantage of
InfiniBand by configuring Backup node in cabinet, enable NIC
bonding
Distributed table created with a distribution key that is not distributed evenly
among the v-Workers (column = gender)
nc_tablesize_details
1. From the AMC, go to the Worker node that had skew, then go to Node Hardware Stats
and determine which component spiked (CPU, Memory, Network I/O, Disk I/O).
Note: you may have to wait a few minutes for it to register in the graph
Product_id ________
DROP table clicks_skew;
________
3. One more thing. This table will be joined to the Product_dim table at the end of
each month to determine month-end sales figures per Product
4. Another query will be to find any NULL customers at month end
5. What are your recommendations to prevent bottlenecks at all costs?
Create new CLICKS_PROD table with PRODUCT_ID as Distribution key
Create INDEX on USER_ID in CLICKS_Skew table
ANALYZE both columns
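The three recommendations above might look like this in SQL. Table and column names follow the lab; the DISTRIBUTE BY HASH form is a sketch of Aster's CREATE TABLE syntax, and table-level ANALYZE is shown as an approximation of "ANALYZE both columns":

```sql
-- Redistribute on a higher-cardinality key to avoid skew
CREATE TABLE clicks_prod
  DISTRIBUTE BY HASH(product_id)
AS SELECT * FROM clicks_skew;

-- Index the alternate join/filter column
CREATE INDEX idx_clicks_user ON clicks_skew (user_id);

-- Refresh statistics so the Optimizer sees the new layout
ANALYZE clicks_prod;
ANALYZE clicks_skew;
```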
2. Name two Linux utilities that can help you diagnose bottlenecks
4. This cost tends to be the most expensive (Disk I/O, Network, Memory,
CPU)
Module 14
Errors and Logs
Monitoring Replication
What to do when ...
3. If client software is involved, collect the name and version number of the client software and the OS the client
package is running on.
4. Is the above problem related to a partner software application (SAS, Informatica, Tableau, MicroStrategy, R, etc.)?
5. If partner software is involved, collect the name and version of the partner application along with the OS the
partner application is running on.
6. Gather any relevant logs that have the error messages and post the error message statement(s) into the incident.
By using diagnostic log bundles, you can more easily send information to Teradata Global
Technical Support (GTS) for analysis, reducing the time and effort required to diagnose
system problems.
Only AMC users with administrative privileges can create, download, and send diagnostic log
bundles.
Aster Database automatically tracks its activity in a variety of log files. The log files
are useful when you need to find the cause of an error or unexpected behavior, or
when you just want to confirm that an operation has taken place
There are two ways to access logs in Aster Database:
preparation log (the log of events related to the process of preparing a node for
participation in Aster Database);
system log (the contents of the Linux syslog file /var/log/messages); and
kernel log (the contents of the Linux kernel buffer provided through dmesg).
The log appears, showing the latest 1000 lines. Click Refresh at any time to load the latest
1000 lines.
When an issue arises on a cluster, one of the first steps in finding the cause is to retrieve the
relevant log files. Aster Database is made up of a large array of distinct services, and it
produces more than 60 different logs spread across every node in the cluster. The AMC
provides an easy way for you to deal with all these different logs by creating diagnostic log
bundles. A diagnostic log bundle is a compressed tarball containing data used to determine the
system context and diagnose Aster Database issues. This data may come in system logs from
the queen and subordinate nodes (worker and loader).
By default, a diagnostic log bundle contains only system logs from the
Queen. If you want to create a complete bundle that includes logs from the
other nodes as well, you can create what is called a cluster bundle by
clicking the Prepare link
Type: Queen or Cluster. A queen-type bundle includes only log files and information
from the queen. A cluster-type bundle includes log files and information from all
nodes, including the queen.
Submitted by: Tells what initiated the job. System means the job was run automatically by the
AMC. If the job was manually initiated, the username of the person who
submitted the job is displayed.
Start Time: Start time of the log content. That is, the time of the first logged event included in
the bundle.
Filename: Name of the log bundle file. The name indicates the time the bundle creation job
was initiated.
Prepare Cluster Bundle: Click Prepare to create a complete bundle that includes logs from the
other nodes as well.
Send to Aster Support: Click Send to send the log bundle to Teradata Global
Technical Support (GTS).
Open an ssh session to the Queen machine and look in the directory
/primary/diagbundles . Look for a file with the same name shown in the
list of diagnostic log bundle jobs. The name will look similar to
YYYYMMDD_
Another way to include all nodes in a bundle is to click the Manually
Initiate Diagnostic Bundle button. This displays a dialog that provides
many more choices, including the choice to include queen and cluster
nodes in the bundle, set a time window, and add custom commands
Your SQL-MapReduce functions can emit debugging messages written to the standard output
or the standard error. During or after execution, access to the standard output (stdout) and
standard error (stderr) is provided through the AMC. To see this:
1 Open the AMC in a browser window by typing http://<IP address of the queen>
2 Go to the Processes tab.
3 Find your query in the Processes list. To do this, it may be helpful to sort based on Type or
User. Click a column to sort based on that column.
4 Click the ID of the query you wish to view. In the Process Detail tab that appears, click the
View Logs button.
According to the standard, the first two characters of an error code denote a class of errors,
while the last three characters indicate a specific condition within that class. Thus, an
application that does not recognize the specific error code can still infer what to do
from the error class.
Partial listing
http://<QueenIP>:1990
[Process architecture diagram: Queen and Worker node processes]
- procman: job creator; creates the job for a SQL-MR function and tasks a TEMP directory
- txman / qos executor: tasks the QosSlave for Workload Management
- queenExec: statement executor process on the Queen
- workerd: slave for txman, sysman, and procman; hosts the JVM-based Java MR execution engines
- SQL vworkers (a..N): the v-worker databases on each Worker node
- IOInterceptor: FUSE server for compression (ICE)
- Tuple Mover
- replicationd: Replication Server; maintains passive v-workers holding up-to-date copies of filesystem state
Collect the following log files (from time of error) from Queen node:
1 - /primary/logs/sysmanExec.log
2 - /primary/logs/cluster.log
3 - /primary/logs/queenExec.log
4 - /primary/logs/alerts.log
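A quick way to grab all four at once (a hypothetical one-liner; adjust the output name and path as needed):

```shell
# Bundle the four Queen-side logs into one tarball for support
tar czf /tmp/queen_logs_$(date +%Y%m%d).tgz \
    /primary/logs/sysmanExec.log \
    /primary/logs/cluster.log \
    /primary/logs/queenExec.log \
    /primary/logs/alerts.log
```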
All warning, error, and fatal messages from all the nodes in the cluster are
collected in the file /primary/logs/cluster.log
Note that each log entry has an associated session id that can be used to
put together the messages for a specific session
queenExec
This process handles the planning, preprocessing and execution of
all statements. Each statement in the AMC has a corresponding
queenExec process active on the queen. This queenExec
communicates with the PostGres processes in each Worker node
Note: for data loads through a loader node, the queenExec process
will reside on the loader node; there will not be a queenExec process
active on the Queen
StatsServer
This process collects statistics and populates the NC_ views such
as nc_all_statements, nc_all_statement_phases
WorkerDaemon
This process receives commands from sysMan on the queen to
perform certain PostGres related tasks: vworker
activate/deactivate, commit/rollback of transactions
ReplicationDaemon
This process initiates replication of data from the primary v-Workers to
the secondary v-Workers, including replication from the queenDb to the
secondary queenDb
IoInterceptor
This process handles the compressed data in Aster. Since PostGres
does not support high levels of compression, Aster built a module
that sits between the file system and PostGres
IceServer
- This process mainly handles repartitioning of data.
- Each v-Worker has a bridge (.so file linked into PG) to extract data and
send it through ICE (InterConnect Exchange, which reads/writes in raw PG
format) to another v-Worker
Process URL
queenExec http://<queen_ip>:1990
Workload Mapper http://<queen_ip>:2011/workloadmapper
StatsServer http://<worker_ip>:6543
sysMan http://<queen_ip>:2105
FileStreamServer http://<worker_ip>:2113 (On Worker nodes)
Backups http://<backup manager IP address>:1991/stats
IceServer http://<worker_ip>:2115
URL : http://<queen_ip>:2105/asyncActivityStats
- This page shows any pending activities related to replication. Any time a
user performs a commit, the AMC will show the step 'Prepare
Transaction', which generates WAL (Write-Ahead Logs) that are
replicated to the secondary v-Workers. If replication fails, the Prepare
Transaction step will be marked Failed
- Under normal conditions the activities listed on this web page should
process (and disappear from the list) every 15-30 seconds. If activities
remain in this list for more than 30 min, there should be several errors
reported in /primary/logs/sysmanExec.log on the Queen
The bottom of the page shows replication stats. The 'Replications Failed'
field should be zero or close to zero
Replication diagnostics
- File Stream Service: transfers files from one node to another via AFTP
http://<QueenIP>:2113
- Replication Helper Service: tracks files that have been modified since the
last replication http://<QueenIP>:2111
When a node is marked Suspect in the AMC, it means that one or more of
the v-workers on that node have failed, but one or more are still running.
When the node is marked Failed, it means that all the v-workers have
failed (often as a result of the node itself having crashed, gone off the
network, or otherwise failed)
Collect the following log files (from time of error) from the Backup
Manager node:
- /home/beehive/data/logs/backupExec.log
- /home/beehive/data/logs/cluster.log
Collect the following log files (from time of error) from the Queen node:
- /primary/logs/sysmanExec.log
- /primary/logs/cluster.log
- /primary/logs/queenExec.log
3. You do a RESTORE on your Aster nCluster. But now the AMC won't
display. What command(s) would you type?
The conclusion slides that were presented in class are omitted in this book.
Thank you for attending the Teradata Aster Database Administration class!