
Query Evaluation

Chs 12 - 14

[Figure: DBMS architecture. Components shown: Web Forms, Application Front Ends, and SQL Interface; the Query Evaluation Engine (Parser, Optimizer, Plan Executor, Operator Evaluator); File/Access Methods, Buffer Manager, and Disk Space Manager; Concurrency Control (Transaction Manager, Lock Manager) and Recovery Manager; Index Files, Data Files, and System Catalog.]

CISC 432/832



Lectures in Module 2
Overview
Sort
Join
Other operations


Query Evaluation Overview

Steps in Query Processing


[Figure: Query → Query Parser → Parsed query → Query Optimizer (Plan Generator and Plan Cost Estimator, with the Catalog Manager) → Evaluation plan → Query Plan Evaluator]


Overview of Query Evaluation


Evaluation plan: a tree of relational algebra operators, with an algorithm chosen for each operator.

Each operator is typically implemented using a "pull" interface: when an operator is "pulled" for its next output tuples, it "pulls" on its inputs and computes them.

Two main issues in query optimization:


For a given query, what plans are considered?


Need to search plan space for cheapest (estimated) plan.

How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans!
The System R approach is discussed in the text.

Evaluation of Expressions
Materialization: generate the result of an expression whose inputs are relations or are already computed, and materialize (store) it on disk. Repeat.
Pipelining: pass tuples on to parent operations even as an operation is being executed.

Materialization
Materialized evaluation: evaluate one operation at a time, starting at the lowest level. Use intermediate results, materialized into temporary relations, to evaluate next-level operations.
E.g., to evaluate π_name(σ_{building = "Watson"}(department) ⋈ instructor), compute and store σ_{building = "Watson"}(department), then compute and store its join with instructor, and finally compute the projection on name.

Pipelining
The result of one operator is pipelined to another without creating a temporary table.
Pipelines can be executed in two ways: demand driven and producer driven.

[Figure: Pipelined Evaluation (nodes A, B, C, D)]

Pipelining (Cont.)
In demand-driven or lazy evaluation:
The system repeatedly requests the next tuple from the top-level operation.
Each operation requests the next tuple from its child operations as required, in order to output its next tuple.
In between calls, an operation has to maintain state so it knows what to return next.

In producer-driven or eager pipelining:
Operators produce tuples eagerly and pass them up to their parents.
A buffer is maintained between operators: the child puts tuples in the buffer and the parent removes tuples from it.
If the buffer is full, the child waits until there is space in the buffer, and then generates more tuples.
The system schedules operations that have space in their output buffer and can process more input tuples.

Alternative names: pull and push models of pipelining.
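Producer-driven pipelining can be sketched with bounded buffers between operators. This is a minimal sketch, assuming one thread per operator and a `DONE` sentinel that ends each tuple stream; the names `scan`, `select`, and `run_pipeline` are hypothetical, and a real DBMS would schedule operators itself rather than rely on threads.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def scan(rows, out):
    """Producer: eagerly pushes every row into its output buffer."""
    for row in rows:
        out.put(row)  # blocks while the buffer is full
    out.put(DONE)


def select(pred, inp, out):
    """Removes tuples from its input buffer and pushes matches upward."""
    while (row := inp.get()) is not DONE:
        if pred(row):
            out.put(row)
    out.put(DONE)


def run_pipeline(rows, pred, buffer_size=2):
    """Runs scan -> select with a bounded buffer between each pair."""
    scan_out = queue.Queue(maxsize=buffer_size)
    select_out = queue.Queue(maxsize=buffer_size)
    workers = [
        threading.Thread(target=scan, args=(rows, scan_out)),
        threading.Thread(target=select, args=(pred, scan_out, select_out)),
    ]
    for worker in workers:
        worker.start()
    result = []
    while (row := select_out.get()) is not DONE:
        result.append(row)
    for worker in workers:
        worker.join()
    return result
```

A small `buffer_size` forces the child to wait for the parent, which is the back-pressure described above.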

Other Common Techniques


Indexing: can use WHERE conditions to retrieve a small set of tuples (selections, joins).
Iteration: sometimes it is faster to scan all tuples even if there is an index. (And sometimes, we can scan the data entries in an index instead of the table itself.)
Partitioning: by using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
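The partitioning idea can be illustrated with a join. This is a rough sketch, assuming in-memory lists of tuples and Python's built-in `hash`; the function names and the partition count `k` are hypothetical. Tuples with equal join keys always land in the same partition, so one large nested-loops join becomes `k` smaller ones.

```python
from collections import defaultdict


def partition(rows, key, k):
    """Splits rows into up to k buckets by hashing the column at index key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % k].append(row)
    return parts


def nested_loops_join(left, right, lkey, rkey):
    """Compares every pair of tuples; cost grows with len(left) * len(right)."""
    return [(l, r) for l in left for r in right if l[lkey] == r[rkey]]


def partitioned_join(left, right, lkey, rkey, k=4):
    """Joins matching partitions only, so each comparison set is smaller."""
    lparts = partition(left, lkey, k)
    rparts = partition(right, rkey, k)
    out = []
    for p in range(k):
        out.extend(
            nested_loops_join(lparts.get(p, []), rparts.get(p, []), lkey, rkey)
        )
    return out
```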

Iterator Interface
Relational operators at nodes in a plan tree support a uniform iterator interface:
Open: initializes state by allocating input and output buffers, and passes arguments to the operator.
Get_next: calls operator-specific code to process input tuples and generate output tuples.
Close: deallocates state information when all output has been produced.

The interface hides whether an operator pipelines or materializes its input tuples.
It is also used to encapsulate access methods such as B+ tree and hash indexes.
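The iterator interface might look as follows. This is a minimal sketch, assuming in-memory rows stored as dictionaries and `None` as the end-of-input signal; the class names `Scan`, `Select`, and `Project` and the helper `run` are hypothetical, not taken from any particular DBMS.

```python
class Scan:
    """Leaf operator: returns stored rows one at a time."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0  # initialize the cursor state

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None  # release state


class Select:
    """Pulls from its child until a row satisfies the predicate."""

    def __init__(self, child, pred):
        self.child, self.pred = child, pred

    def open(self):
        self.child.open()

    def get_next(self):
        while (row := self.child.get_next()) is not None:
            if self.pred(row):
                return row
        return None

    def close(self):
        self.child.close()


class Project:
    """Keeps only the named columns of each row pulled from its child."""

    def __init__(self, child, cols):
        self.child, self.cols = child, cols

    def open(self):
        self.child.open()

    def get_next(self):
        row = self.child.get_next()
        return None if row is None else {c: row[c] for c in self.cols}

    def close(self):
        self.child.close()


def run(plan):
    """Drives the plan by pulling on the root operator."""
    plan.open()
    out = []
    while (row := plan.get_next()) is not None:
        out.append(row)
    plan.close()
    return out
```

For example, `run(Project(Select(Scan(sailors), lambda r: r["rating"] > 5), ["sname"]))` pulls one tuple at a time through the whole tree, and the caller never sees how each operator produces its output.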

Statistics and Catalogs


Need information about the relations and indexes involved. Catalogs typically contain at least:

# tuples (NTuples) and # pages (NPages) for each relation.
# distinct key values (NKeys) and NPages for each index.
Index height and low/high key values (Low/High) for each tree index.

Catalogs are updated periodically.
Updating whenever data changes is too expensive; there is lots of approximation anyway, so slight inconsistency is OK.

More detailed information (e.g., histograms of the values in some field) is sometimes stored.

Highlights of System R Optimizer


Impact: most widely used currently; works well for < 10 joins.

Cost estimation: approximate art at best.
Statistics, maintained in system catalogs, are used to estimate the cost of operations and result sizes.
Considers a combination of CPU and I/O costs.

Plan space: too large, must be pruned.
Only the space of left-deep plans is considered. Left-deep plans allow the output of each operator to be pipelined into the next operator without storing it in a temporary relation.
Cartesian products are avoided.

Cost Estimation
For each plan considered, must estimate cost:

Must estimate the cost of each operation in the plan tree.
Depends on input cardinalities.
Mainly based on the number of I/Os required.

Must also estimate the size of the result for each operation in the tree!
Use information about the input relations.
For selections and joins, assume independence of predicates.
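Result-size estimation under the independence assumption can be sketched as a product of per-predicate reduction factors. This is an illustrative sketch only: the function names are hypothetical, and the 1/NKeys factor for an equality predicate assumes values are uniformly distributed.

```python
from fractions import Fraction


def equality_reduction_factor(nkeys):
    """Fraction of tuples kept by `column = value`, assuming uniform values."""
    return Fraction(1, nkeys)


def estimate_result_size(ntuples, reduction_factors):
    """Multiplies the factors together, which is valid only if the
    predicates are independent of one another."""
    size = Fraction(ntuples)
    for factor in reduction_factors:
        size *= factor
    return size
```

With 100,000 Reserves tuples and 100 distinct boats, `estimate_result_size(100_000, [equality_reduction_factor(100)])` gives 1000 tuples for bid=100.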

Quick Review - Relational Algebra


Theory on which SQL is (loosely) based.
Operational: queries are composed using a collection of operators, describing a step-by-step procedure for computing an answer.
Inputs and outputs are relations.

[Figure: example expression producing Result, with a projection on a, b over R and P and a selection t = 5 on T]

Quick Review - Relational Algebra (cont)


Basic operations:

Selection (σ): selects a subset of rows from a relation.
Projection (π): selects desired columns from a relation.
Cross-product (×): combines two relations.
Set-difference (-): tuples in one relation but not the other.
Union (∪): all tuples in either relation.

Additional operations:

Intersection (∩), join (⋈), outer joins.
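The basic operators can be tried out on small in-memory relations. This is a sketch under simplifying assumptions: a relation is a Python set of tuples, columns are addressed by position, and the function names are hypothetical. Union and set-difference map directly onto Python's `|` and `-` on sets.

```python
def select(rel, pred):
    """Selection: keeps the rows that satisfy the predicate."""
    return {t for t in rel if pred(t)}


def project(rel, cols):
    """Projection: keeps the listed column positions; duplicates collapse."""
    return {tuple(t[c] for c in cols) for t in rel}


def cross(r, s):
    """Cross-product: concatenates every row of r with every row of s."""
    return {a + b for a in r for b in s}


def names_for_boat(reserves, sailors, bid, min_rating):
    """Join expressed as a selection over a cross-product.

    Assumes reserves rows are (sid, bid) and sailors rows are
    (sid, sname, rating), so a combined row is
    (r.sid, bid, s.sid, sname, rating).
    """
    joined = select(cross(reserves, sailors), lambda t: t[0] == t[2])
    kept = select(joined, lambda t: t[1] == bid and t[4] > min_rating)
    return project(kept, [3])
```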

Quick Review - Relational Algebra (cont)


Joins
Joins are frequently used operations in database systems.
They can be very expensive! Consider a join between two tables that have 10 attributes each and millions of tuples.
Joins have been the focus of database research for years, and many different join variants are implemented in a DBMS to be used under different circumstances (as determined by the optimizer).
Natural join, equijoin, outer joins.

Schema for Examples


Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

What is the equivalent relational algebra query?

Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

RA tree: π_sname(σ_{bid=100}(σ_{rating>5}(Sailors ⋈_{sid=sid} Reserves)))

Plan: page-oriented nested loops join on sid=sid over Sailors and Reserves, with the selections bid=100 and rating > 5 and the projection on sname applied on the fly.

Cost: 500 + 500*1000 I/Os.
By no means the worst plan!
Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
Goal of optimization: to find more efficient plans that compute the same answer.
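The cost figure for the motivating plan follows from the usual page-oriented nested loops formula: read the outer relation once, and scan the whole inner relation once per outer page. A minimal sketch, with a hypothetical function name:

```python
def page_nested_loops_cost(outer_pages, inner_pages):
    """I/O cost: one pass over the outer, plus one full scan of the
    inner for every page of the outer."""
    return outer_pages + outer_pages * inner_pages
```

With Sailors (500 pages) as the outer and Reserves (1000 pages) as the inner, this gives 500 + 500*1000 = 500,500 I/Os. Swapping the two gives 501,000, so the choice of outer matters little here; the real savings come from pushing selections and using indexes.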

Alternative Plan 1 (Push Selects)


[Figure: plan tree. π_sname (on-the-fly) over ⋈ sid=sid (page-oriented nested loops, with pipelining), whose inputs are σ_{bid=100}(Reserves) (scan) and σ_{rating>5}(Sailors) (scan, write to Temp 1).]

Cost of plan:

Scan Reserves (1000): produces 10 pages, if we have 100 boats and a uniform distribution.
Scan Sailors (500) + write temp T1 (250 pages, if we have 10 ratings).
PONL: 10 * 250 = 2500.
Total: 1000 + 500 + 250 + 2500 = 4250 page I/Os.
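The arithmetic for this plan can be written out step by step. This is a sketch using the slide's numbers as defaults; the function and parameter names are hypothetical, and the selectivities (1 boat in 100, half of the ratings above 5) assume uniform distributions.

```python
def plan1_cost(reserves_pages=1000, sailors_pages=500, boats=100):
    # bid=100 keeps 1 boat in 100: 1000 / 100 = 10 pages, pipelined as the outer
    outer_pages = reserves_pages // boats
    # rating > 5 keeps half of the 10 ratings: 500 / 2 = 250 pages, written to T1
    temp_pages = sailors_pages // 2
    scan_cost = reserves_pages + sailors_pages  # read both base relations once
    write_cost = temp_pages                     # materialize T1
    join_cost = outer_pages * temp_pages        # scan T1 once per outer page
    return scan_cost + write_cost + join_cost
```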

Alternative Plan 2 (With Indexes)


With a clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
INL with pipelining (outer is not materialized). Projecting out unnecessary fields from the outer doesn't help.
Join column sid is a key for Sailors. At most one matching tuple, so an unclustered index on sid is OK.
The decision not to push rating>5 before the join is based on the availability of the sid index on Sailors.
Cost: selection of Reserves tuples (10 I/Os); for each, must get the matching Sailors tuple (1000*1.2 I/Os); total 1210 I/Os.

[Figure: plan tree. π_sname (on-the-fly) over σ_{rating>5} (on-the-fly) over ⋈ sid=sid (index nested loops, with pipelining), whose inputs are σ_{bid=100}(Reserves) (use hash index; do not write result to temp) and Sailors (hash index on sid).]
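The index-based plan's cost can be checked the same way. This is a sketch with the slide's numbers as defaults; the function and parameter names are hypothetical, and `Fraction` is used only to keep the 1.2 I/Os per hash-index probe exact.

```python
from fractions import Fraction


def plan2_cost(reserves_tuples=100_000, tuples_per_page=100, boats=100,
               probe_cost=Fraction(12, 10)):
    # bid=100 matches 100,000 / 100 = 1000 Reserves tuples
    matching = reserves_tuples // boats
    # a clustered index stores them contiguously: 1000 / 100 = 10 pages
    select_cost = matching // tuples_per_page
    # one hash-index probe into Sailors per matching Reserves tuple
    join_cost = matching * probe_cost
    return select_cost + join_cost
```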

Summary
There are several alternative evaluation algorithms for each relational operator.
A query is evaluated by converting it to a tree of operators and evaluating the operators in the tree.
Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:

Consider a set of alternative plans.
Must prune the search space; typically, left-deep plans only.

Must estimate the cost of each plan that is considered.
Must estimate the size of the result and the cost for each plan node.
Key issues: statistics, indexes, operator implementations.
