
CS2604:

Data Structures and File Processing

C++ Edition

Clifford A. Shaffer

Department of Computer Science

Virginia Tech

Copyright © 1995, 1996, 1998

The Need for Data Structures

[A primary concern of this course is efficiency.]

Data structures organize data

⇒ more efficient programs. [You might

believe that faster computers make it unnecessary to be

concerned with efficiency. However...]

• More powerful computers ⇒ more complex

applications.

• More complex applications demand more

calculations.

• Complex computing tasks are unlike our

everyday experience. [So we need special

training]

Any organization for a collection of records can

be searched, processed in any order, or

modified. [If you are willing to pay enough in time delay.

Ex: Simple unordered array of records.]

• The choice of data structure and algorithm

can make the difference between a program

running in a few seconds or many days.

1

Efficiency

A solution is said to be efficient if it solves the

problem within its resource constraints. [Alt:

Better than known alternatives ("relatively" efficient)]

• space [These are typical constraints for programs]

• time

[This does not mean always strive for the most efficient

program. If the program operates well within resource

constraints, there is no benefit to making it faster or smaller.]

The cost of a solution is the amount of

resources that the solution consumes.

2

Selecting a Data Structure

Select a data structure as follows:

1. Analyze the problem to determine the

resource constraints a solution must meet.

2. Determine the basic operations that must

be supported. Quantify the resource

constraints for each operation.

3. Select the data structure that best meets

these requirements.

[Typically want the "simplest" data structure that will meet

requirements.]

Some questions to ask: [These questions often help

to narrow the possibilities]

• Are all data inserted into the data structure

at the beginning, or are insertions

interspersed with other operations?

• Can data be deleted? [If so, a more complex

representation is typically required]

• Are all data processed in some well-defined

order, or is random access allowed?

3

Data Structure Philosophy

Each data structure has costs and benets.

Rarely is one data structure better than

another in all situations.

A data structure requires:

• space for each data item it stores, [Data +

Overhead]

• time to perform each basic operation,

• programming effort. [Some data

structures/algorithms more complicated than others]

Each problem has constraints on available

space and time.

Only after a careful analysis of problem

characteristics can we know the best data

structure for the task.

Bank example:

• Start account: a few minutes

• Transactions: a few seconds

• Close account: overnight

4

Goals of this Course

1. Reinforce the concept that there are costs

and benets for every data structure. [A

worldview to adopt]

2. Learn the commonly used data structures.

These form a programmer's basic data

structure "toolkit." [The "nuts and bolts" of the

course]

3. Understand how to measure the

effectiveness of a data structure or

program.

• These techniques also allow you to judge

the merits of new data structures that

you or others might invent. [To prepare

you for the future]

5

Definitions

A type is a set of values.

[Ex: Integer, Boolean, Float]

A data type is a type and a collection of

operations that manipulate the type.

[Ex: Addition]

A data item or element is a piece of

information or a record.

[Physical instantiation]

A data item is said to be a member of a data

type.


A simple data item contains no subparts.

[Ex: Integer]

An aggregate data item may contain several

pieces of information.

[Ex: Payroll record, city database record]

6

Abstract Data Types

Abstract Data Type (ADT): a definition for a

data type solely in terms of a set of values and

a set of operations on that data type.

Each ADT operation is defined by its inputs

and outputs.

Encapsulation: hide implementation details

A data structure is the physical

implementation of an ADT.

• Each operation associated with the ADT is

implemented by one or more subroutines in

the implementation.

Data structure usually refers to an

organization for data in main memory.

File structure: an organization for data on

peripheral storage, such as a disk drive or tape.

An ADT manages complexity through

abstraction: metaphor. [Hierarchies of labels]

[Ex: transistors → gates → CPU. In a program, implement an

ADT, then think only about the ADT, not its implementation]

7

Logical vs. Physical Form

Data items have both a logical and a physical

form.

Logical form: definition of the data item within

an ADT. [Ex: Integers in mathematical sense: +, −]

Physical form: implementation of the data item

within a data structure. [16/32 bit integers: overflow]

[Diagram: a Data Type. Above the line, the ADT (Data Items: Type, Operations): the Logical Form. Below the line, Storage Space and Subroutines: the Physical Form.]

[In this class, we frequently move above and below "the line"

separating logical and physical forms.]

8

Problems

Problem: a task to be performed.

• Best thought of as inputs and matching

outputs.

• Problem definition should include

constraints on the resources that may be

consumed by any acceptable solution. [But

NO constraints on HOW the problem is solved]

Problems ⇔ mathematical functions

• A function is a matching between inputs

(the domain) and outputs (the range).

• An input to a function may be single

number, or a collection of information.

• The values making up an input are called

the parameters of the function.

• A particular input must always result in the

same output every time the function is

computed.

9

Algorithms and Programs

Algorithm: a method or a process followed to

solve a problem. [A recipe]

An algorithm takes the input to a problem

(function) and transforms it to the output. [A

mapping of input to output]

A problem can have many algorithms.

An algorithm possesses the following properties:

1. It must be correct. [Computes proper function]

2. It must be composed of a series of

concrete steps. [Executable by that machine]

3. There can be no ambiguity as to which

step will be performed next.

4. It must be composed of a finite number of

steps.

5. It must terminate.

A computer program is an instance, or

concrete representation, for an algorithm in

some programming language.

[We frequently interchange use of "algorithm" and "program"

though they are actually different concepts]

10

Mathematical Background

[Look over Chapter 2, read as needed depending on your

familiarity with this material.]

Set concepts and notation [Set has no duplicates,

sequence may]

Recursion

Induction proofs

Logarithms [Almost always use log to base 2. That is our

default base.]

Summations

11

Estimation Techniques

Known as "back of the envelope" or "back of

the napkin" calculation.

1. Determine the major parameters that affect

the problem.

2. Derive an equation that relates the

parameters to the problem.

3. Select values for the parameters, and apply

the equation to yield an estimated solution.

Example:

How many library bookcases does it take to

store books totaling one million pages?

Estimate:

• pages/inch [guess 500]

• feet/shelf [guess 4 (actually, 3)]

• shelves/bookcase [guess 5 (actually, 7)]

[500 pages/inch × 48 inches/shelf × 5 shelves/bookcase = 120,000 pages/bkcase]
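The arithmetic above can be checked with a few lines of code (a sketch; the helper names are invented, and the constants are the guesses from the list):

```cpp
#include <cassert>

// Pages one bookcase holds, using the guessed parameters:
// 500 pages/inch, 48 inches (4 feet) per shelf, 5 shelves/bookcase.
long pagesPerBookcase(long pagesPerInch, long inchesPerShelf,
                      long shelvesPerBookcase) {
    return pagesPerInch * inchesPerShelf * shelvesPerBookcase;
}

// Bookcases needed for totalPages, rounding up.
long bookcasesNeeded(long totalPages, long perBookcase) {
    return (totalPages + perBookcase - 1) / perBookcase;
}
```

With these guesses, one bookcase holds 500 × 48 × 5 = 120,000 pages, so one million pages need about 9 bookcases.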

12

Algorithm Eciency

There are often many approaches (algorithms)

to solve a problem. How do we choose between

them?

At the heart of computer program design are

two (sometimes conflicting) goals:

1. To design an algorithm that is easy to

understand, code and debug.

2. To design an algorithm that makes efficient

use of the computer's resources.

Goal (1) is the concern of Software

Engineering.

Goal (2) is the concern of data structures and

algorithm analysis.

When goal (2) is important, how do we

measure an algorithm's cost?

13

How to Measure Eciency?

1. Empirical comparison (run programs).

[Difficult to do "fairly." Time consuming.]

2. Asymptotic Algorithm Analysis.

Critical resources:

[Time. Space (disk, RAM). Programmer's effort. Ease of use

(user's effort).]

Factors affecting running time:

[Machine load. OS. Compiler. Problem size. Specific input

values for given problem size.]

For most algorithms, running time depends on the

"size" of the input.

Running time is expressed as T(n) for some

function T on input size n.

14

Examples of Growth Rate

Example 1: [As n grows, how does T(n) grow?]

int largest(int* array, int n) { // Find largest value

int currlarge = array[0]; // Store largest seen

for (int i=1; i<n; i++) // For each element

if (array[i] > currlarge) // If largest

currlarge = array[i]; // Remember it

return currlarge; // Return largest

}

[T(n) = c1n + c2: as n grows, T(n) grows linearly.]

Example 3:

sum = 0;

for (i=1; i<=n; i++)

for (j=1; j<=n; j++)

sum++;

[The sum++ statement executes exactly n2 times: T(n) = c1n2 + c2, i.e., n2 increments.]

15

Growth Rate Graph

[2^n is an exponential algorithm. 10n and 20n differ only by a

constant.]

[Graph: growth rates of 2^n, 2n^2, 5n log n, 20n, and 10n plotted

against input size n; a second panel zooms in on small n.]

Best, Worst and Average Cases

Not all inputs of a given size take the same

time.

Sequential search for K in an array of n

integers:

• Begin at first element in array and look at

each element in turn until K is found.

Best Case: [Find at first position: 1 compare]

Worst Case: [Find at last position: n compares]

Average Case: [(n + 1)/2 compares]

While average time seems to be the fairest

measure, it may be difficult to determine.

[Depends on distribution. Assumption for above analysis:

Equally likely at any position.]
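The sequential search analyzed above can be sketched as follows (a minimal standalone version, not the course's list code):

```cpp
#include <cassert>

static const int demo[] = {11, 13, 21, 26, 29}; // sample keys

// Return position of K in array[0..n-1], or -1 if not found.
// Best case: 1 compare. Worst case: n compares. Average: (n+1)/2.
int seqsearch(const int* array, int n, int K) {
    for (int i = 0; i < n; i++)      // look at each element in turn
        if (array[i] == K) return i; // stop as soon as K is found
    return -1;                       // K is not in the array
}
```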

When is worst case time important?

[Real time algorithms]

17

Faster Computer or Algorithm?

What happens when we buy a computer 10

times faster? [How much speedup? 10 times. More

important: How much increase in problem size for same time?

Depends on growth rate.]

n: Size of input that can be processed in one

hour (10,000 steps).

n': Size of input that can be processed in one

hour on the new machine (100,000 steps).

T(n)       n       n'      Change              n'/n
10n        1,000   10,000  n' = 10n            10
20n        500     5,000   n' = 10n            10
5n log n   250     1,842   √10 n < n' < 10n    7.37
2n^2       70      223     n' = √10 n          3.16
2^n        13      16      n' = n + 3          --

[For 2^n, if n = 1000, then n' would be 1003.]

[Compare T(n) = n^2 to T(n) = n log n. For n > 58, it is faster to

have the n log n algorithm than to have a computer

10 times faster.]

18

Asymptotic Analysis: Big-oh

Definition: For T(n) a non-negatively valued

function, T(n) is in the set O(f (n)) if there

exist two positive constants c and n0 such that

T(n) ≤ cf (n) for all n > n0.

Usage: the algorithm is in O(n2) in the [best, average, worst] case.

Meaning: For all data sets big enough (i.e.,

n > n0), the algorithm always executes in less

than cf (n) steps [in best, average or worst

case].

[Must pick one of these to complete the statement. Big-oh

notation applies to some set of inputs.]

Upper Bound.

Example: if T(n) = 3n2 then T(n) is in O(n2).

Wish tightest upper bound:

While T(n) = 3n2 is in O(n3), we prefer O(n2).

[It provides more information to say O(n2) than O(n3).]
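The definition can be spot-checked by brute force: pick candidate witnesses c and n0 and test the inequality over a finite range (a sanity check, not a proof; the function names here are illustrative):

```cpp
#include <cassert>

// Check T(n) <= c*f(n) for all n with n0 < n <= limit.
bool withinBound(long (*T)(long), long (*f)(long),
                 long c, long n0, long limit) {
    for (long n = n0 + 1; n <= limit; n++)
        if (T(n) > c * f(n)) return false;
    return true;
}

long T1(long n)  { return 3 * n * n; } // T(n) = 3n^2 from the example
long fn2(long n) { return n * n; }     // f(n) = n^2
long fn3(long n) { return n * n * n; } // f(n) = n^3 (looser bound)
```

With c = 3 and n0 = 1 the bound holds against both n^2 and n^3; with c = 2 it fails against n^2, so 2 is not a valid witness constant.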

19

Big-oh Example

Example 1. Finding value X in an array. [Average

case]

T(n) = csn/2. [cs is a constant. Actual value is irrelevant]

For all values of n > 1, csn/2 ≤ csn.

Therefore, by the definition, T(n) is in O(n) for

n0 = 1 and c = cs.

Example 2: T(n) = c1n2 + c2n.

c1n2 + c2n ≤ c1n2 + c2n2 ≤ (c1 + c2)n2 for all

n > 1.

Therefore, T(n) is in O(n2) for c = c1 + c2 and n0 = 1.

Example 3: T(n) = c. We say this is in O(1).

[Rather than O(c)]

20

Big-Omega

Definition: For T(n) a non-negatively valued

function, T(n) is in the set Ω(g(n)) if there

exist two positive constants c and n0 such that

T(n) ≥ cg(n) for all n > n0.

Meaning: For all data sets big enough (i.e.,

n > n0), the algorithm always executes in more

than cg(n) steps.

Lower Bound.

Example: T(n) = c1n2 + c2n.

c1n2 + c2 n ≥ c1n2 for all n > 1.

T(n) ≥ cn2 for c = c1 and n0 = 1.

Therefore, T(n) is in Ω(n2) by the definition.

Want greatest lower bound.

21

Theta Notation

When big-Oh and Ω meet, we indicate this by

using Θ (big-Theta) notation.

Definition: An algorithm is said to be Θ(h(n))

if it is in O(h(n)) and it is in Ω(h(n)).

[For polynomial equations on T(n), we always have Θ. There

is no uncertainty, a "complete" analysis.]

Simplifying Rules:

1. If f (n) is in O(g(n)) and g(n) is in O(h(n)),

then f (n) is in O(h(n)).

2. If f (n) is in O(kg(n)) for any constant

k > 0, then f (n) is in O(g (n)). [No constant]

3. If f1(n) is in O(g1(n)) and f2(n) is in

O(g2(n)), then (f1 + f2)(n) is in

O(max(g1(n), g2(n))). [Drop low order terms]

4. If f1(n) is in O(g1(n)) and f2(n) is in

O(g2(n)) then f1(n)f2(n) is in

O(g1(n)g2(n)). [Loops]

22

Running Time of a Program

[Asymptotic analysis is defined for equations. Need to convert

program to an equation.]

Example 1: a = b;

This assignment takes constant time, so it is

Θ(1). [Not Θ(c); notation by tradition]

Example 2:

sum = 0;

for (i=1; i<=n; i++)

sum += n;

[Θ(n), even though the final value of sum is n2]

Example 3:

sum = 0;

for (j=1; j<=n; j++) // First for loop

for (i=1; i<=j; i++) // is a double loop

sum++;

for (k=0; k<n; k++) // Second for loop

A[k] = k;

[First statement is Θ(1). Double for loop is Σ i = Θ(n2). Final

for loop is Θ(n). Total: Θ(n2).]

23

More Examples

Example 4.

sum1 = 0;

for (i=1; i<=n; i++) // First double loop

for (j=1; j<=n; j++) // do n times

sum1++;

sum2 = 0;

for (i=1; i<=n; i++) // Second double loop

for (j=1; j<=i; j++) // do i times

sum2++;

[First loop, sum1 is n2. Second loop, sum2 is (n + 1)(n)/2. Both

are Θ(n2).]

Example 5.

sum1 = 0;

for (k=1; k<=n; k*=2)

for (j=1; j<=n; j++)

sum1++;

sum2 = 0;

for (k=1; k<=n; k*=2)

for (j=1; j<=k; j++)

sum2++;

[First is Σ_{k=1}^{log n} n = Θ(n log n). Second is

Σ_{k=0}^{log n − 1} 2^k = Θ(n).]
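The closed forms claimed for Examples 4 and 5 can be confirmed by simply running the loops and counting (a quick check; function names are invented):

```cpp
#include <cassert>

long fullSquare(long n) {   // Example 4, first double loop: n * n
    long sum1 = 0;
    for (long i = 1; i <= n; i++)
        for (long j = 1; j <= n; j++) sum1++;
    return sum1;
}

long triangle(long n) {     // Example 4, second double loop: n(n+1)/2
    long sum2 = 0;
    for (long i = 1; i <= n; i++)
        for (long j = 1; j <= i; j++) sum2++;
    return sum2;
}

long doubling(long n) {     // Example 5, second pair of loops
    long sum2 = 0;
    for (long k = 1; k <= n; k *= 2)
        for (long j = 1; j <= k; j++) sum2++;
    return sum2;
}
```

For n a power of 2, doubling(n) returns 1 + 2 + 4 + ... + n = 2n − 1, which is Θ(n).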

24

Binary Search

Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

// Return position of element (if any) with value K

int binary(int K, int* array, int left, int right) { // signature reconstructed

int l = left-1;

int r = right+1; // l, r are beyond array bounds

while (l+1 != r) { // Stop when l and r meet

int i = (l+r)/2; // Look at middle of subarray

if (K < array[i]) r = i; // In left half

if (K == array[i]) return i; // Found it

if (K > array[i]) l = i; // In right half

}

return UNSUCCESSFUL; // Search value not in array

}

How many elements are examined in the worst case? [Θ(log n)]

25

Other Control Statements

while loop: analyze like a for loop.

if statement: Take greater complexity of

then/else clauses.

[If probabilities are independent of n.]

switch statement: Take complexity of most

expensive case.

[If probabilities are independent of n.]

Subroutine call: Complexity of the subroutine.

26

Analyzing Problems

[Typically do a lot of this in a senior algorithms course.]

Upper bound: Upper bound of best known

algorithm.

Lower bound: Lower bound for every possible

algorithm.

[The examples so far have been easy in that exact equations

always yield Θ. Thus, it was hard to distinguish Ω and O.

The following example should help to explain the difference:

bounds are used to describe our level of uncertainty about an

algorithm.]

Example: Sorting

1. Cost of I/O: Ω(n)

2. Bubble or insertion sort: O(n2)

3. A better sort (Quicksort, Mergesort,

Heapsort, etc.): O(n log n)

4. We prove later that sorting is Ω(n log n)

27

Multiple Parameters

[Ex: 256 colors (8 bits), 1000 × 1000 pixels]

Compute the rank ordering for all C pixel

values in a picture of P pixels.

for (i=0; i<C; i++) // Initialize count

count[i] = 0;

for (i=0; i<P; i++) // Look at all of the pixels

count[value(i)]++; // Increment proper value count

sort(count); // Sort pixel value counts

Using P alone, the analysis gives Θ(P log P).

More accurate is Θ(P + C log C).

28

Space Bounds

Space bounds can also be analyzed with

asymptotic complexity analysis.

Time: Algorithm

Space: Data Structure

Space/Time Tradeoff Principle:

One can often achieve a reduction in time if

one is willing to sacrifice space, or vice versa.

• Encoding or packing information: Boolean flags

• Table lookup: Factorials
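As a concrete instance of table lookup, factorials can be precomputed once and then answered with a single array access (a sketch; 12! is the largest factorial that fits in 32 bits, so the table stops there):

```cpp
#include <cassert>

long fact[13]; // fact[i] holds i!; space traded for lookup speed

// Fill the table once; every later query is a single array access.
void buildFactTable() {
    fact[0] = 1;
    for (int i = 1; i <= 12; i++)
        fact[i] = fact[i - 1] * i;
}

// Convenience wrapper for demonstration.
long factLookup(int i) { buildFactTable(); return fact[i]; }
```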

Disk Based Space/Time Tradeoff Principle:

The smaller you can make your disk storage

requirements, the faster your program will run.

29

Lists

[Students should already be familiar with lists. Objectives: use

alg analysis in familiar context, compare implementations.]

A list is a finite, ordered sequence of data

items called elements.

[The positions are ordered, NOT the values.]

Each list element has a data type.

The empty list contains no elements.

The length of the list is the number of

elements currently stored.

The beginning of the list is called the head,

the end of the list is called the tail.

Sorted lists have their elements positioned in

ascending order of value, while unsorted lists

have no necessary relationship between element

values and positions.

Notation: ( a0, a1, ..., an−1 )

What operations should we implement?

[Add/delete elem anywhere, find, next, prev, test for empty.]

30

List ADT

class List { // List class ADT

public:

List(int =LIST_SIZE); // Constructor

~List(); // Destructor

void clear(); // Remove all Elems

void insert(const Elem); // Insert Elem at curr

void append(const Elem); // Insert Elem at tail

Elem remove(); // Remove and return Elem

void setFirst(); // Set curr to first pos

void prev(); // Move curr to prev pos

void next(); // Move curr to next pos

int length() const; // Return current length

void setPos(int); // Set curr to position

void setValue(const Elem); // Set current value

Elem currValue() const; // Return current value

bool isEmpty() const; // TRUE if list is empty

bool isInList() const; // TRUE if curr in list

bool find(int); // Find value

};

[This is an example of an ADT. Our list implementations will

match. Note that the generic type "Elem" is being used for

the element type.]

31

List ADT Examples

List: ( 12, 32, 15 )

MyList.insert(99);

[MyList is an assumed parameter; 99 is another parameter.]

Assume MyList has 32 as current element:

[Put 99 before current element, yielding (12, 99, 32, 15).]

Process an entire list:

for (MyList.setFirst(); MyList.isInList(); MyList.next())

DoSomething(MyList.currValue());

32

Array-Based List Insert

Insert 23:

[(a) original array: 13 12 20 8 3 in positions 0-4;

(b) elements shifted up one position;

(c) result: 23 13 12 20 8 3.]

33

Array-Based List Class

class List { // Array-based list class

private:

int msize; // Maximum size of list

int numinlist; // Actual number of Elems

int curr; // Position of "current"

Elem* listarray; // Array of list Elems

public:

List(int =LIST_SIZE); // Constructor

~List(); // Destructor

void clear(); // Remove all Elems

void insert(const Elem); // Insert Elem at curr

void append(const Elem); // Insert Elem at tail

Elem remove(); // Remove and return Elem

void setFirst(); // Set curr to first pos

void prev(); // Move curr to prev pos

void next(); // Move curr to next pos

int length() const; // Return current length

void setPos(int); // Set curr to position

void setValue(const Elem); // Set current value

Elem currValue() const; // Return current value

bool isEmpty() const; // TRUE if list is empty

bool isInList() const; // TRUE if curr in list

bool find(int); // Find value

};

34

Array-Based List Implementation

List::List(int sz) // Constructor

{ msize = sz; numinlist = 0; curr = 0;

listarray = new Elem[sz]; }

void List::insert(const Elem item) {

assert((numinlist < msize) && (curr >=0) &&

(curr <= numinlist));

for(int i=numinlist; i>curr; i--) // Shift Elems up

listarray[i] = listarray[i-1];

listarray[curr] = item;

numinlist++; // Increment size

}

void List::append(const Elem item) { // Insert Elem at tail

assert(numinlist < msize); // Can't be full

listarray[numinlist++] = item; // Increment size

}

Elem List::remove() { // Remove/return current Elem

assert(!isEmpty() && isInList()); // Must have Elem

Elem temp = listarray[curr]; // Store Elem

for(int i=curr; i<numinlist-1; i++) // Shift down

listarray[i] = listarray[i+1];

numinlist--; return temp;

}

35

List Implementation (Cont)

void List::setFirst() { curr = 0; } // Set to first pos

void List::setValue(const Elem val) // Set current value

{ assert(isInList()); listarray[curr] = val; }

Elem List::currValue() const // Return current value

{ assert(isInList()); return listarray[curr]; }

bool List::isInList() const // TRUE if curr in list

{ return (curr >= 0) && (curr < numinlist); }

bool List::find(int val) { // Find value

while (isInList()) // Stop if reach end

if (key(currValue()) == val) return TRUE; // Found

else next();

return FALSE; // Not found

}

36

Link Class

Dynamic allocation of new list elements.

class Link { // Singly-linked node

public:

Elem element; // Elem value for node

Link *next; // Pointer to next node

Link(const Elem elemval, Link* nextval =NULL)

{ element = elemval; next = nextval; }

Link(Link* nextval =NULL) { next = nextval; }

};

37

Linked List Position

[Figure (a): list (20, 23, 12, 15) with head, curr, tail; (b): the same list after inserting 10 before 12.]

[Naive approach: point curr at the current node. Current is 12. Want

to insert a node with 10. No access is available to the node with 23.

How can we do the insert?]

[Solution: let curr point to the node PRECEDING the actual current

node. Now we can do the insert. Also note use of a header

node.]

38

Linked List Class

class List { // Linked list class

private:

Link* head; // Pointer to list header

Link* tail; // Pointer to last Elem

Link* curr; // Pos of "current" Elem

public:

List(int =LIST_SIZE); // Constructor

~List(); // Destructor

void clear(); // Remove all Elems

void insert(const Elem); // Insert at current pos

void append(const Elem); // Insert at tail of list

Elem remove(); // Remove/return Elem

void setFirst(); // Set curr to first pos

void prev(); // Move curr to prev pos

void next(); // Move curr to next pos

int length() const; // Return length

void setPos(int); // Set current pos

void setValue(const Elem); // Set current value

Elem currValue() const; // Return current value

bool isEmpty() const; // TRUE if list is empty

bool isInList() const; // TRUE if now in list

bool find(int); // Find value

};

39

Linked List Insertion

// Insert Elem at current position

void List::insert(const Elem item) {

assert(curr != NULL); // Must be pointing to Elem

curr->next = new Link(item, curr->next);

if (tail == curr) // Appended new Elem

tail = curr->next;

}

[Figure: inserting 10 after the node curr points to, between 23 and

12: (1) create the new link holding 10, (2) point its next at 12,

(3) set curr->next to the new link.]

40

Linked List Remove

Elem List::remove() { // Remove/return Elem

assert(isInList()); // Must be valid pos

Elem temp = curr->next->element; // Remember value

Link* ltemp = curr->next; // Remember link

curr->next = ltemp->next; // Remove from list

if (tail == ltemp) tail = curr; // Set tail

delete ltemp; // Free link

return temp; // Return value

}

[Figure: removing the node with 10 that follows curr (list ... 23 10

15 ...): (1) remember the link, (2) route curr->next around it.]

41

Freelists

System new and delete are slow.

class Link { // Singly-linked node

public: // with freelist

Elem element; // Elem value for node

Link* next; // Pointer to next node

static Link* freelist; // Link class freelist

Link(const Elem elemval, Link* nextval =NULL)

{ element = elemval; next = nextval; }

Link(Link* nextval =NULL) { next = nextval; }

void* operator new(size_t); // Overloaded new

void operator delete(void*); // Overloaded delete

};

void* Link::operator new(size_t) { // Overloaded new

if (freelist == NULL) return ::new Link; // New space

Link* temp = freelist; // Or get from freelist

freelist = freelist->next;

return temp; // Return the link

}

void Link::operator delete(void* ptr) { // Overloaded delete

((Link*)ptr)->next = freelist; // Put on freelist

freelist = (Link*)ptr;

}

42

Comparison of List Implementations

Array-Based Lists: [Average and worst cases]

• Insertion and deletion are Θ(n).

• Array must be allocated in advance.

• No overhead if all array positions are full.

Linked Lists:

• Insertion and deletion are Θ(1);

prev and direct access are Θ(n).

• Space grows with number of elements.

• Every element requires overhead.

DE = n(P + E ); n = +

P

DE

E

[Estimation question: What are the dimensions of these

variables?]

E: Space for data value

P: Space for pointer

D: Number of elements in array
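Plugging sample numbers into the break-even equation (the values below are illustrative, not from the text):

```cpp
#include <cassert>

// n = DE / (P + E): with fewer than n elements stored, the linked
// list needs less space than a D-element array.
double breakEven(double D, double E, double P) {
    return D * E / (P + E);
}
```

For a 1000-element array with 8-byte data values and 8-byte pointers, the break-even point is 500 elements; below that, the linked list wins on space.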

43

Doubly Linked Lists

Simplify insertion and deletion: Add a prev

pointer.

class Link { // Doubly-linked node

public: // with freelist

Elem element; // Node Elem value

Link* next; // Pointer to next node

Link* prev; // Pointer to prev node

static Link* freelist; // Link class freelist

Link(const Elem Elemval, Link* nextp =NULL,

Link* prevp =NULL)

{ element = Elemval; next = nextp; prev = prevp;}

Link(Link* nextp =NULL, Link* prevp = NULL)

{ next = nextp; prev = prevp; }

void* operator new(size_t); // Overloaded new

void operator delete(void*); // Overloaded delete

};

[Figure: doubly linked list (20, 23, 12, 15) with head, curr, tail pointers.]

44

Doubly Linked List Operations

[Figure: inserting 10 after curr, between 23 and 12; five pointer

assignments set the new node's next and prev and relink its

neighbors.]

// Insert Elem at current position

void List::insert(const Elem item) {

assert(curr != NULL);

curr->next = new Link(item, curr->next, curr);

if (curr->next->next != NULL)

curr->next->next->prev = curr->next;

if (tail == curr) tail = curr->next;

}

Elem List::remove() { // Remove/return Elem

assert(isInList()); // Must be valid position

Elem temp = curr->next->element;

Link* ltemp = curr->next;

if (ltemp->next != NULL) ltemp->next->prev = curr;

else tail = curr; // Removed tail Elem - change tail

curr->next = ltemp->next;

delete ltemp;

return temp;

}

45

Stacks

LIFO: Last In, First Out

Restricted form of list: Insert and remove only

at front of list.

Notation:

• Insert: PUSH

• Remove: POP

• The accessible element is called TOP.

46

Array-Based Stack

Define top as first free position.

class Stack { // Array-based stack class

private:

int size; // Maximum size of stack

int top; // Index for top Elem

Elem *listarray; // Array holding stack Elems

public:

Stack(int sz =LIST_SIZE) // Constructor: initialize

{ size = sz; top = 0; listarray = new Elem[sz]; }

~Stack() // Destructor: free array

{ delete [] listarray; }

void clear() // Remove all Elems

{ top = 0; }

void push(const Elem item) // Push Elem onto stack

{ assert(top < size); listarray[top++] = item; }

Elem pop() // Pop Elem from stack top

{ assert(!isEmpty()); return listarray[--top]; }

Elem topValue() const // Return value of top Elem

{ assert(!isEmpty()); return listarray[top-1]; }

bool isEmpty() const // Return TRUE if empty

{ return top == 0; }

};

[Figure: two stacks can share one array, growing from opposite ends toward the middle, with tops top1 and top2.]
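Usage of an array-based stack, with Elem assumed to be int (a compact sketch mirroring the class above, demonstrating LIFO order):

```cpp
#include <cassert>

struct IntStack {                 // simplified: Elem = int, fixed size
    int data[10];
    int top = 0;                  // top is the first free position
    void push(int item) { assert(top < 10); data[top++] = item; }
    int pop()            { assert(top > 0);  return data[--top]; }
    bool isEmpty() const { return top == 0; }
};

// Push 1, 2, 3; they must come back as 3, 2, 1 (last in, first out).
bool lifoDemo() {
    IntStack s;
    s.push(1); s.push(2); s.push(3);
    return s.pop() == 3 && s.pop() == 2 && s.pop() == 1 && s.isEmpty();
}
```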

47

Linked Stack

class Stack { // Linked stack class

private:

Link *top; // Pointer to top Elem

public:

Stack(int sz =LIST_SIZE) // Constructor:

{ top = NULL; } // initialize

~Stack() { clear(); } // Destructor

void clear(); // Remove stack Elems

void push(const Elem item) // Push Elem onto stack

{ top = new Link(item, top); }

Elem pop(); // Pop Elem from stack

Elem topValue() const // Get value of top Elem

{ assert(!isEmpty()); return top->element; }

bool isEmpty() const // Return TRUE if empty

{ return top == NULL; }

};

void Stack::clear() { // Remove stack Elems

while(top != NULL) // Free link nodes

{ Link* temp = top; top = top->next; delete temp; }

}

Elem Stack::pop() { // Pop Elem from stack

assert(!isEmpty());

Elem temp = top->element;

Link* ltemp = top->next;

delete top; top = ltemp;

return temp;

}

48

Queues

FIFO: First In, First Out

Restricted form of list:

Insert at one end, remove from other.

Notation:

• Insert: Enqueue

• Delete: Dequeue

• First element: FRONT

• Last element: REAR

49

Queue Implementations

Array-Based Queue

front rear

20 5 12 17

(a)

front rear

12 17 3 30 4

(b)

[Given a fixed array of n positions, the queue has n + 1 states (0

through n elements in queue) and only n positions for rear. So,

something must be done to distinguish between two of the states.]

[Figure: circular queue, states (a) and (b); front and rear indices

wrap around the end of the array.]

["Drifting" queue: use modular arithmetic.]
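The modular-arithmetic fix can be sketched as follows (one common convention: keep a separate size counter so all n + 1 states are distinguishable; names are illustrative):

```cpp
#include <cassert>

// Circular queue: front and rear wrap around with % so the
// queue never "drifts" off the end of the array.
struct CircQueue {
    static const int CAP = 4;
    int data[CAP];
    int front = 0;   // index of first element
    int count = 0;   // number of elements (distinguishes full/empty)
    bool enqueue(int v) {
        if (count == CAP) return false;      // full
        data[(front + count) % CAP] = v;     // rear = (front+count) % CAP
        count++;
        return true;
    }
    int dequeue() {
        assert(count > 0);
        int v = data[front];
        front = (front + 1) % CAP;           // advance with wraparound
        count--;
        return v;
    }
};

bool fifoDemo() {
    CircQueue q;
    q.enqueue(20); q.enqueue(5); q.enqueue(12); q.enqueue(17);
    bool full = !q.enqueue(99);              // 5th enqueue must fail
    bool order = (q.dequeue() == 20) && (q.dequeue() == 5);
    q.enqueue(3);                            // wraps into a freed slot
    return full && order && q.dequeue() == 12;
}
```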

Linked Queue: modified linked list. [Operations

are Θ(1)]

50

Binary Trees

A binary tree is made up of a finite set of

nodes that is either empty or consists of a

node called the root together with two binary

trees, called the left and right subtrees, which

are disjoint from each other and from the root.

Notation: Node, Children, Edge, Parent

Ancestor, Descendant, Path, Depth,

Height, Level, Leaf Node, Internal Node,

Subtree.

[Figure: example binary tree; root A, children B and C, then D, E,

F, with leaves G, H, I.]

[Height = max depth + 1.]

51

Full and Complete Binary Trees

Full binary tree: each node either is a leaf or is

an internal node with exactly two non-empty

children.

Complete binary tree: If the height of the tree

is d, then all levels except possibly level d are

completely full. The bottom level has all nodes

to the left side.

[(a) a full binary tree; (b) a complete binary tree. Take care to

remember which is which.]

52

Full Binary Tree Theorem

Theorem: The number of leaves in a

non-empty full binary tree is one more than the

number of internal nodes.

[Relevant since it helps us calculate space requirements.]

Proof (by Mathematical Induction):

• Base Case: A full binary tree with 1

internal node must have two leaf nodes.

• Induction Hypothesis: Assume any full

binary tree T containing n − 1 internal

nodes has n leaves.

• Induction Step: Given tree T with n

internal nodes, pick internal node I with

two leaf children. Remove I 's children, call

resulting tree T'. By induction hypothesis,

T' is a full binary tree with n leaves.

Restore I's two children. The number of

internal nodes has now gone up by 1 to

reach n. The number of leaves has also

gone up by 1.

53

Full Binary Tree Theorem Corollary

Theorem: The number of NULL pointers in a

non-empty binary tree is one more than the

number of nodes in the tree.

Proof: Replace all null pointers with a pointer

to an empty leaf node. This is a full binary tree.
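The corollary can be verified directly on any small tree by counting nodes and NULL pointers (a sketch with a minimal node type, not the course's BinNode):

```cpp
#include <cassert>
#include <cstddef>

struct Node { Node *left = NULL, *right = NULL; };

int nodes(const Node* rt) {                   // count nodes
    if (rt == NULL) return 0;
    return 1 + nodes(rt->left) + nodes(rt->right);
}
int nullPointers(const Node* rt) {            // count NULL pointers
    if (rt == NULL) return 1;                 // this pointer is NULL
    return nullPointers(rt->left) + nullPointers(rt->right);
}

bool corollaryHolds() {
    Node a, b, c, d;                          // small arbitrary tree
    a.left = &b; a.right = &c; b.left = &d;   // 4 nodes, 5 NULLs
    return nullPointers(&a) == nodes(&a) + 1;
}
```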

54

Binary Tree Node Class

class BinNode { // Binary tree node class

public:

Belem element; // The node’s value

BinNode* left; // Pointer to left child

BinNode* right; // Pointer to right child

static BinNode* freelist;

// Two constructors: with and without initial values

BinNode() { left = right = NULL; }

BinNode(Belem e, BinNode* l =NULL, BinNode* r =NULL)

{ element = e; left = l; right = r; }

~BinNode() { } // Destructor

BinNode* leftchild() const { return left; }

BinNode* rightchild() const { return right; }

Belem value() const { return element; };

void setValue(Belem val) { element = val; }

bool isLeaf() const // TRUE if is a leaf

{ return (left == NULL) && (right == NULL); }

void* operator new(size_t); // Overload new

void operator delete(void*);// Overload delete

};

55

Traversals

Any process for visiting the nodes in some

order is called a traversal.

Any traversal that lists every node in the tree

exactly once is called an enumeration of the

tree's nodes.

Preorder traversal: Visit each node before

visiting its children.

Postorder traversal: Visit each node after

visiting its children.

Inorder traversal: Visit the left subtree, then

the node, then the right subtree.

void preorder(BinNode* rt) // rt is root of a subtree

{

if (rt == NULL) return; // Empty subtree

visit(rt); // visit performs desired action

preorder(rt->leftchild());

preorder(rt->rightchild());

}
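Inorder and postorder follow the same recursive pattern as the preorder code above, with the visit moved (a sketch using a pared-down node type and a string to record the visit order):

```cpp
#include <cassert>
#include <string>

struct BinNode {
    char element;
    BinNode *left, *right;
    BinNode(char e, BinNode* l = 0, BinNode* r = 0)
        : element(e), left(l), right(r) {}
};

void inorder(BinNode* rt, std::string& out) {
    if (rt == 0) return;         // empty subtree
    inorder(rt->left, out);
    out += rt->element;          // visit between the subtrees
    inorder(rt->right, out);
}
void postorder(BinNode* rt, std::string& out) {
    if (rt == 0) return;         // empty subtree
    postorder(rt->left, out);
    postorder(rt->right, out);
    out += rt->element;          // visit after both subtrees
}

// Tree: A with children B, C; B with children D, E.
std::string inorderOf() {
    BinNode d('D'), e('E'), b('B', &d, &e), c('C'), a('A', &b, &c);
    std::string s; inorder(&a, s); return s;
}
std::string postorderOf() {
    BinNode d('D'), e('E'), b('B', &d, &e), c('C'), a('A', &b, &c);
    std::string s; postorder(&a, s); return s;
}
```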

56

Binary Tree Implementation

[Figure: the example tree (A; B, C; D, E, F; G, H, I) with

pointer-based nodes. Leaves are the same as internal nodes. Lots

of wasted space.]

[Figure: expression tree for 4x(2x + a) − c. Leaves are

different from internal nodes.]

57

Union Implementation

enum Nodetype {leaf, internal}; // Enumerate node types

class VarBinNode { // Generic node class

public:

Nodetype mytype; // Stores type for this node

union {

struct { // Structure for internal node

VarBinNode* left; VarBinNode* right; // Children

Operator opx; // Internal node value

} intl;

Operand var; // Leaves just store a value

};

VarBinNode(const Operand& val) // Constructor: leaf

{ mytype = leaf; var = val; }

// Constructor: Internal

VarBinNode(const Operator& op,

VarBinNode* l, VarBinNode* r) {

mytype = internal; intl.opx = op;

intl.left = l; intl.right = r; }

bool isLeaf() { return mytype == leaf; }

VarBinNode* leftchild() { return intl.left; }

VarBinNode* rightchild() { return intl.right; }

};

void traverse(VarBinNode* rt) { // Preorder traversal

if (rt == NULL) return;

if (rt->isLeaf()) cout << "Leaf: " << rt->var <<"\n";

else {

cout << "Internal: " << rt->intl.opx << "\n";

traverse(rt->leftchild());

traverse(rt->rightchild());

}}

58

Inheritance

class VarBinNode { // Node base class

public:

// isLeaf is a "pure" virtual function.

// Each derived class must define isLeaf.

virtual bool isLeaf() = 0;

};

class LeafNode : public VarBinNode { // Leaf node
private:

Operand var; // Operand value

public:

LeafNode(const Operand& val) { var = val; }

bool isLeaf() { return TRUE; } // Subclass version

Operand value() { return var; } // Return value

};

class IntlNode : public VarBinNode { // Internal node
private:

VarBinNode* left; // Left child

VarBinNode* right; // Right child

Operator opx; // Operator value

public:

IntlNode(const Operator& op,

VarBinNode* l, VarBinNode* r)

{ opx = op; left = l; right = r; } // Constructor

bool isLeaf() { return FALSE; } // Subclass version

VarBinNode* leftch() { return left; } // Left

VarBinNode* rightch() { return right; } // Right

Operator value() { return opx; } // Value

};

59

Inheritance (cont)

void traverse(VarBinNode *rt) { // Preorder traversal

if (rt == NULL) return; // Nothing to visit

if (rt->isLeaf()) // Do leaf node

cout << "Leaf: " << ((LeafNode *)rt)->value()

<< "\n";

else { // Do internal node

cout << "Internal: " << ((IntlNode *)rt)->value()

<< "\n";

traverse(((IntlNode *)rt)->leftch());

traverse(((IntlNode *)rt)->rightch());

}

}

60

Space Overhead

From the Full Binary Tree Theorem:

Half of the pointers are NULL.

If leaves only store information, then overhead depends on whether the tree is full.

All nodes the same, with two pointers to children:

Total space required is (2p + d)n.

Overhead: 2pn.

If p = d, this means 2p/(2p + d) = 2/3 overhead.

[The following is for full binary trees:]

Eliminate pointers from leaf nodes:

Overhead: ((n/2)(2p)) / ((n/2)(2p) + dn) = p/(p + d).

[Half the nodes have 2 pointers, which is overhead.]

This is 1/2 if p = d.

2p/(2p + d) if data is stored only at leaves ⇒ 2/3 overhead.

Some method is needed to distinguish leaves from internal nodes. [This adds overhead.]

61

Array Implementation

[This is a good example of logical representation vs. physical

implementation.]

For complete binary trees.

[Figure: (a) a complete binary tree of 12 nodes, numbered 0 through 11 level by level; (b) the same nodes laid out in an array at positions 0 through 11.]

• Parent(r) = [(r − 1)/2 if r ≠ 0 and r < n.]

• Leftchild(r) = [2r + 1 if 2r + 1 < n.]

• Rightchild(r) = [2r + 2 if 2r + 2 < n.]

• Leftsibling(r) = [r − 1 if r is even, r > 0 and r < n.]

• Rightsibling(r) = [r + 1 if r is odd, r + 1 < n.]

[Since the complete binary tree is so limited in its shape,

(only one shape for tree of n nodes), it is reasonable to

expect that space efficiency can be achieved.]
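These formulas translate directly into index arithmetic. A sketch (the function names and the -1 "no such node" return convention are assumptions, not from the slides):

```cpp
// Index relations for a complete binary tree of n nodes stored in an
// array at positions 0..n-1. Returns -1 when the relative does not exist.
int parent(int r, int n)       { return (r > 0 && r < n) ? (r - 1) / 2 : -1; }
int leftchild(int r, int n)    { return (2*r + 1 < n) ? 2*r + 1 : -1; }
int rightchild(int r, int n)   { return (2*r + 2 < n) ? 2*r + 2 : -1; }
int leftsibling(int r, int n)  { return (r % 2 == 0 && r > 0 && r < n) ? r - 1 : -1; }
int rightsibling(int r, int n) { return (r % 2 == 1 && r + 1 < n) ? r + 1 : -1; }
```

For the 12-node tree above, for example, node 4's parent is 1, its children are 9 and 10, and its left sibling is 3.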

62

Huffman Coding Trees

ASCII Codes: 8 bits per character.

Fixed length coding.

Can take advantage of relative frequency of

letters to save space.

Variable length coding.

Letter:    Z K F  C  U  D  L  E
Frequency: 2 7 24 32 37 42 42 120

Build the tree with minimal external path

weight.

63

Huffman Tree Construction

Step 1: 2(Z) 7(K) 24(F) 32(C) 37(U) 42(D) 42(L) 120(E)

Step 2: merge Z and K → 9: 9 24(F) 32(C) 37(U) 42(D) 42(L) 120(E)

Step 3: merge 9 and F → 33: 32(C) 33 37(U) 42(D) 42(L) 120(E)

Step 4: merge C and 33 → 65: 37(U) 42(D) 42(L) 65 120(E)

Step 5: merge U and D → 79: 42(L) 65 79 120(E)

[Figure: the forest of partial Huffman trees after each step; at each step the two trees of lowest weight are combined.]

64

Assigning Codes

[Figure: the final Huffman tree, total weight 306. Left branches are labeled 0 and right branches 1; a letter's code is the sequence of branch labels on the path from the root to its leaf.]

Letter Freq
C 32
D 42
E 120
F 24
K 7
L 42
U 37
Z 2

[C code: 1110; D code: 101; E code: 0; F code: 11111; K code: 111101; L code: 110; U code: 100; Z code: 111100.]

65

Coding and Decoding

A set of codes are said to meet the

prex property if no code in the set is the

prex of another.

Code for DEED: [101 0 0 101]

Decode 1011001110111101: [DUCK]

Expected cost per letter:

[(1·120 + 3·121 + 4·32 + 5·24 + 6·9)/306 = 785/306 ≈ 2.57 bits.]

Letter Freq Code Bits
C 32 1110 4

D 42 101 3

E 120 0 1

F 24 11111 5

K 7 111101 6

L 42 110 3

U 37 100 3

Z 2 111100 6
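Because the code set has the prefix property, a bit string can be decoded greedily against the code table: accumulate bits until they match some code, emit that letter, and restart. A sketch (the decode function and its map-based table are hypothetical, not from the slides):

```cpp
#include <map>
#include <string>

// Greedy decoder for a prefix-free code table mapping code -> letter.
// The prefix property guarantees that once the buffered bits equal some
// code, no other code can extend them, so emitting immediately is safe.
std::string decode(const std::string& bits,
                   const std::map<std::string, char>& table) {
    std::string out, buf;
    for (std::string::size_type i = 0; i < bits.size(); i++) {
        buf += bits[i];
        std::map<std::string, char>::const_iterator it = table.find(buf);
        if (it != table.end()) { out += it->second; buf.clear(); }
    }
    return out;
}
```

With the table above, decoding "10100101" yields DEED and decoding "1011001110111101" yields DUCK, matching the examples.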

66

Binary Search Trees

Binary Search Tree (BST) Property

All elements stored in the left subtree of a node

whose value is K have values less than K . All

elements stored in the right subtree of a node

whose value is K have values greater than or

equal to K .

[Problem with lists: either insert/delete or search must be Θ(n) time. How can we make both update and search efficient? Answer: Use a new data structure.]

[Figure: two BSTs storing the same values 2, 7, 24, 32, 37, 40, 42, 42, 120: (a) a balanced tree rooted at 37; (b) an unbalanced tree rooted at 120.]

67

BST Search

class BST {

private:

BinNode* root;

void clearhelp(BinNode*); // Private

void inserthelp(BinNode*&, const Belem); // functions

BinNode* deletemin(BinNode*&);

void removehelp(BinNode*&, int);

Belem findhelp(BinNode*, int) const;

void printhelp(const BinNode*, int) const;

public:

BST() { root = NULL; }

~BST() { clearhelp(root); }

void clear() { clearhelp(root); root = NULL; }

void insert(const Belem val) {inserthelp(root, val);}

void remove(const int val) { removehelp(root, val); }

Belem find(const int val) const

{ return findhelp(root, val); }

bool isEmpty() const { return root == NULL; }

void print() const {

if (root == NULL) cout << "The BST is empty.\n";

else printhelp(root, 0);

}

};

Belem BST::findhelp(BinNode* rt, int val) const {
if (rt == NULL) return UNSUCCESSFUL; // Empty tree

else if (val < key(rt->value())) // Left

return findhelp(rt->leftchild(), val);

else if (val == key(rt->value())) return rt->value();

else return findhelp(rt->rightchild(), val); // Right

}

68

BST Insert

void BST::inserthelp(BinNode*& rt, const Belem val) {

if (rt == NULL) // Empty tree: create node

rt = new BinNode(val, NULL, NULL);

else if (key(val) < key(rt->value()))

inserthelp(rt->left, val); // Check left

else inserthelp(rt->right, val); // Check right

}

[Figure: the example BST after inserting 35, which becomes the right child of 32.]

69

Alternate Approach

BinNode* BST::inserthelp(BinNode* rt, const Belem val) {

if (rt == NULL)

return new BinNode(val, NULL, NULL);

if (key(val) < key(rt->value()))

rt->left = inserthelp(rt->left, val);

else rt->right = inserthelp(rt->right, val);

return rt;

}

70

Remove Minimum Value

BinNode* BST::deletemin(BinNode*& rt) {

assert(rt != NULL); // Must be a node to delete

if (rt->left != NULL) // Continue left

return deletemin(rt->left);

else // Found it

{ BinNode* temp = rt; rt = rt->right; return temp; }

}

[Figure: deletemin on a subtree whose root rt is 10, with children 5 and 20; the minimum node 5 is unlinked and returned.]

71

BST Remove

void BST::removehelp(BinNode*& rt, int val) {

if (rt == NULL) cout << val << " is not in tree.\n";

else if (val < key(rt->value())) // Check left

removehelp(rt->left, val);

else if (val > key(rt->value())) // Check right

removehelp(rt->right, val);

else { // Found it: remove

BinNode* temp = rt;

if (rt->left == NULL) // Only a right -

rt = rt->right; // point to right

else if (rt->right == NULL) // Only a left -

rt = rt->left; // point to left

else { // Both non-empty

temp = deletemin(rt->right); // Replace with min

rt->setValue(temp->value()); // in right subtree

}

delete temp; // Free up space

}

}

[Figure: removing 37 from the example BST; it is replaced by 40, the smallest value in its right subtree.]

72

Cost of BST Operations

Find:

Insert:

Remove:

[All cost the depth of the node in question. Worst case: Θ(n). Average case: Θ(log n).]

73

Heaps

Heap: Complete binary tree with the

Heap Property:

• Min-heap: all values less than child values.

• Max-heap: all values greater than child

values.

The values in a heap are partially ordered.

Heap representation: normally the array based

complete binary tree representation.

74

Building the Heap

[Max Heap]

[Figure: two complete binary trees (a) and (b) over the values 1 through 7, to be turned into max-heaps.]

(a) requires exchanges (5-2), (5-4), (6-3), (6-5), (7-5), (7-6).

(b) requires exchanges (5-2), (7-3), (7-1), (6-1).

[How to get a good number of exchanges? By induction.

Heapify the root's subtrees, then push the root to the correct

level.]

75

Heap ADT

class heap { // Max-heap class

private:

Elem* Heap; // Pointer to heap array

int size; // Maximum size of heap

int n; // Number of ELEMs in heap

void siftdown(int); // Put ELEM in place

public:

heap(Elem*, int, int); // Constructor

int heapsize() const; // Return current size

bool isLeaf(int) const; // TRUE if pos is a leaf

int leftchild(int) const; // Return L child position

int rightchild(int) const; // Return R child position

int parent(int) const; // Return parent position

void insert(const Elem); // Insert value into heap

Elem removemax(); // Remove maximum value

Elem remove(int); // Remove specified value

void buildheap(); // Heapify contents

};

76

Siftdown

For fast heap construction:

• Work from high end of array to low end.

• Call siftdown for each item.

• Don't need to call siftdown on leaf nodes.

void heap::buildheap() // Heapify contents

{ for (int i=n/2-1; i>=0; i--) siftdown(i); }

void heap::siftdown(int pos) { // Sift ELEM down to its place
assert((pos >= 0) && (pos < n));

while (!isLeaf(pos)) {

int j = leftchild(pos);

if ((j<(n-1)) && (key(Heap[j]) < key(Heap[j+1])))

j++; // j now index of child with greater value

if (key(Heap[pos]) >= key(Heap[j])) return; // Done

swap(Heap, pos, j);

pos = j; // Move down

}

}
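The same siftdown/buildheap pair can be written over a plain int array so the construction can be tried directly. A standalone sketch (not the slides' heap class):

```cpp
#include <vector>

// Sift the value at pos down a max-heap stored in h[0..h.size()-1].
void siftdown(std::vector<int>& h, int pos) {
    int n = (int)h.size();
    while (2*pos + 1 < n) {                  // While pos is not a leaf
        int j = 2*pos + 1;                   // Left child
        if (j < n - 1 && h[j] < h[j+1]) j++; // j = child with greater value
        if (h[pos] >= h[j]) return;          // Heap property restored
        int t = h[pos]; h[pos] = h[j]; h[j] = t;
        pos = j;                             // Continue down
    }
}

// Heapify: call siftdown on every internal node, high index to low.
void buildheap(std::vector<int>& h) {
    for (int i = (int)h.size()/2 - 1; i >= 0; i--) siftdown(h, i);
}
```

After buildheap, the maximum value sits at index 0 and every node is at least as large as its children.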

Cost for heap construction:

Σ_{i=1}^{log n} (i − 1) · (n / 2^i) ≈ n.

[(i − 1) is the number of steps down; n/2^i is the number of nodes at that level.]

77

Priority Queues

A priority queue stores objects, and on request

releases the object with greatest value.

Example: Scheduling jobs in a multi-tasking

operating system.

The priority of a job may change, requiring

some reordering of the jobs.

Implementation: use a heap to store the

priority queue.

To support priority reordering, delete and

re-insert. Need to know index for the object.

// Remove value at specified position

Elem heap::remove(int pos) {

assert((pos >= 0) && (pos < n));

swap(Heap, pos, --n); // Swap with last value

while ((pos != 0) && (key(Heap[pos]) > key(Heap[parent(pos)])))

swap(Heap, pos, parent(pos)); // Push up if large

if (n != 0) siftdown(pos); // Push down if small

return Heap[n];

}

78

General Trees

A tree T is a finite set of one or more nodes

such that there is one designated node r called

the root of T , and the remaining nodes in

(T − {r}) are partitioned into k ≥ 0 disjoint

subsets T1, T2, ..., Tk, each of which is a tree,

and whose roots r1, r2, ..., rk, respectively, are

children of r.

[Note: disjoint because a node cannot have two parents.]

[Figure: a tree with root R. P is the parent of V; R and P are ancestors of V; S1 and S2 are siblings of V; C1 and C2 are the children of V, which form the subtree rooted at V.]

79

General Tree ADT

class GTNode {

public:

GTNode(const Elem); // Constructor

~GTNode(); // Destructor

Elem value(); // Return node’s value

bool isLeaf(); // TRUE if is a leaf

GTNode* parent(); // Return parent

GTNode* leftmost_child(); // Return first child

GTNode* right_sibling(); // Return right sibling

void setValue(Elem); // Set node’s value

void insert_first(GTNode* n); // Insert first child

void insert_next(GTNode* n); // Insert right sibling

void remove_first(); // Remove first child

void remove_next(); // Remove right sibling

};

class GenTree {

public:

GenTree(); // Constructor

~GenTree(); // Destructor

void clear(); // Free nodes

GTNode* root(); // Return root

void newroot(Elem, GTNode*, GTNode*); // Combine

};

80

General Tree Traversal

void print(GTNode* rt) { // Preorder traverse from root

if (rt->isLeaf()) cout << "Leaf: ";

else cout << "Internal: ";

cout << rt->value() << "\n"; // Print or take action

GTNode* temp = rt->leftmost_child();

while (temp != NULL)

{ print(temp); temp = temp->right_sibling(); }

}

[Figure: a general tree with root R; R's children are A and B, A's children are C, D, and E, and B's child is F.]

[Preorder visit order: RACDEBF]

81

Parent Pointer Implementation

[Figure: two trees. R's children are A and B; A's children are C, D, and E; B's child is F. W's children are X, Y, and Z.]

Node Index:     0 1 2 3 4 5 6 7 8 9 10
Label:          R A B C D E F W X Y Z
Parent's Index: - 0 0 1 1 1 2 - 7 7 7

The parent pointer representation is good for answering:

Are two elements in the same tree?

bool Gentree::differ(int a, int b) {

// Are nodes a and b in different trees?

GTNode* root1 = FIND(&array[a]); // Find a’s root

GTNode* root2 = FIND(&array[b]); // Find b’s root

return root1 != root2; // Compare roots

}

[Examples of equivalence classes: Connected components in

graphs; point clustering.]

82

Equivalence Classes

When joining equivalence classes, want to keep

depth small.

Weighted Union Rule: join the tree with fewer

nodes to the tree with more nodes.

Limits depth to log n for n nodes.

[Less than half of the nodes increase depth by 1.]

Path Compression: Make all nodes visited point

to root. [Nearly constant cost.]

class GTNode { // General tree node

public:

GTNode* par; // Parent pointer

GTNode() { par = NULL; } // Constuctor

GTNode* parent() { return par; } // Return parent

};

class Gentree { // General tree for UNION/FIND
private:

GTNode* array; // Node array

int size; // Size of node array

GTNode* FIND(GTNode*) const; // Find root

public:

Gentree(int); // Constructor

~Gentree(); // Destructor

void UNION(int, int); // Merge equivalences

bool differ(int, int); // TRUE if not in same tree

};

83

UNION-FIND

Gentree::Gentree(int sz) { // Constructor

size = sz;

array = new GTNode[sz]; // Create node array

}

Gentree::~Gentree() { // Destructor

delete [] array; // Free node array

}

bool Gentree::differ(int a, int b) {
// Are nodes a and b in different trees?

GTNode* root1 = FIND(&array[a]); // Find a’s root

GTNode* root2 = FIND(&array[b]); // Find b’s root

return root1 != root2; // Compare roots

}

void Gentree::UNION(int a, int b) { // Merge equivalences
GTNode* root1 = FIND(&array[a]); // Find a’s root

GTNode* root2 = FIND(&array[b]); // Find b’s root

if (root1 != root2) root2->par = root1; // Merge

}

84

Equivalence Processing Example

[Figure: the parent pointer array and the corresponding forest after each stage of processing, over nodes A through J.]

(a) Initial: ten single-node trees. [Process (A,B), (C,H), (G,F), (D,E), (I,F)]

(b) [Process (H,A), (E,G)]

(c) [Process (H,E)]

(d) The resulting forest.

85

Path Compression Example

[Figure: the array and forest after a FIND on node H: every node on H's path to the root now points directly at the root.]

GTNode* Gentree::FIND(GTNode* curr) const { // Find root
while (curr->parent() != NULL) curr = curr->par;

return curr; // At root

}
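Path compression can be folded into FIND with a second pass over the same path. A standalone sketch using integer parent indices in place of the slides' GTNode pointers (the vector layout, with -1 marking a root, is an assumption for illustration):

```cpp
#include <vector>

// FIND with path compression: locate the root, then repoint every node
// on the traversed path directly at that root.
int FIND(std::vector<int>& par, int x) {
    int root = x;
    while (par[root] != -1) root = par[root];  // First pass: find root
    while (par[x] != -1) {                     // Second pass: compress
        int next = par[x];
        par[x] = root;                         // Point straight at root
        x = next;
    }
    return root;
}

// UNION: merge the equivalence classes containing a and b.
void UNION(std::vector<int>& par, int a, int b) {
    int ra = FIND(par, a), rb = FIND(par, b);
    if (ra != rb) par[rb] = ra;   // Attach one root below the other
}
```

After one FIND on the deepest node of a chain, every node on that chain points directly at the root, so later FINDs on those nodes take one step.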

86

Lists of Children

Index Val Par Children list
0     R   -   → 1 → 3
1     A   0   → 2 → 4 → 6
2     C   1
3     B   0   → 5
4     D   1
5     F   3
6     E   1
7

87

Leftmost Child/Right Sibling

[Figure: leftmost child/right sibling arrays for the trees R(A(C,D,E), B(F)) and R'(X). Each array entry stores the node's leftmost-child index, value, parent index, and right-sibling index.]

[Note: Two trees share the same array. A second variant links the tree roots together through the array so that a forest can be represented.]

88

Linked Implementations

[Figure: two linked implementations of the tree R(A(C,D,E), B(F)): (a) each node stores its value, its number of children ("size"), and an array of child pointers; (b) each node stores a linked list of child pointers.]

89

Sequential Implementations

List node values in the order they would be

visited by a preorder traversal.

Saves space, but allows only sequential access.

Need to retain tree structure for reconstruction.

AB/D//CEG///FH//I//

[Need NULL mark since this tree is not full.]

A′B′/DC′E′G/F′HI

RAC)D)E))BF)))
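The first sequential form can be decoded by a recursive preorder reconstruction: read a symbol, and if it is not '/', build a node and then reconstruct its two subtrees. A sketch (the SeqNode struct and function names are hypothetical):

```cpp
#include <cstddef>
#include <string>

struct SeqNode { char val; SeqNode* left; SeqNode* right; };

// Rebuild a binary tree from a preorder listing with '/' for NULL.
SeqNode* rebuild(const std::string& s, std::string::size_type& pos) {
    if (pos >= s.size() || s[pos] == '/') { pos++; return NULL; }
    SeqNode* rt = new SeqNode;
    rt->val = s[pos++];
    rt->left = rebuild(s, pos);    // Preorder: node, left subtree, right
    rt->right = rebuild(s, pos);
    return rt;
}

// Re-serialize, to verify the round trip.
std::string dump(SeqNode* rt) {
    if (rt == NULL) return "/";
    return std::string(1, rt->val) + dump(rt->left) + dump(rt->right);
}
```

Feeding the representation above back through rebuild and dump reproduces it exactly, confirming that the NULL marks preserve the tree structure.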

90

Convert to Binary Tree

Left Child/Right Sibling representation

essentially stores a binary tree.

Use this process to convert any general tree to

a binary tree.

A forest is a collection of one or more general

trees.

[Figure: (a) a forest; (b) the binary tree that results from the leftmost child/right sibling conversion.]

91

Graphs

A graph G = (V, E) consists of a set of

vertices V, and a set of edges E, such that

each edge in E is a connection between a pair

of vertices in V.

The number of vertices is written |V|, and the

number of edges is written |E|.

A sequence of vertices v1, v2, ..., vn forms a path

of length n − 1 if there exist edges from vi to

vi+1 for 1 ≤ i < n.

A path is simple if all vertices on the path are

distinct.

A cycle is a path of length 3 or more that

connects vi to itself.

A cycle is simple if the path is simple, except

for the rst and last vertices being the same.

92

Graph Denitions (Cont)

An undirected graph is connected if there is at

least one path from any vertex to any other.

The maximal connected subgraphs of an

undirected graph are called

connected components.

A graph without cycles is acyclic.

A directed graph without cycles is a

directed acyclic graph or DAG.

A free tree is a connected, undirected graph

with no simple cycles. Equivalently, a free tree

is connected and has |V| − 1 edges.

[Figure: three example graphs on numbered vertices: (a), (b), and (c).]

93

Connected Components

[Figure: an undirected graph on vertices 0 through 7 that has several connected components.]

94

Graph Representations

Adjacency Matrix: Θ(|V|²).

Adjacency List: Θ(|V| + |E|).

[Figure: a directed graph on vertices 0 through 4 with (b) its adjacency matrix and (c) its adjacency list; then the undirected version of the same graph with its symmetric adjacency matrix and its adjacency list.]

95

Implementation: Adjacency Matrix

class EdgeClass {

public:

int v1, v2;

EdgeClass(int in1, int in2) { v1 = in1; v2 = in2; }

};

typedef EdgeClass* Edge;

class Graph { // Graph class: adjacency matrix
private:

int** matrix; // The edge matrix

int numVertex; // Number of vertices

int numEdge; // Number of edges

bool* Mark; // The mark array

public:

Graph(); // Constructor

~Graph(); // Destructor

int n(); // Number of graph vertices

int e(); // Number of edges for graph

Edge first(int); // Get vertex first edge

bool isEdge(Edge); // TRUE if this is an edge

Edge next(Edge); // Get vertex next edge

int v1(Edge); // Vertex edge comes from

int v2(Edge); // Vertex edge goes to

int weight(int, int); // Weight of edge

int weight(Edge); // Weight of edge

bool getMark(int); // Return a Mark value

void setMark(int, bool); // Set a Mark value

};

[For completeness, add setedge and deledge.]

96

Adjacency Matrix Functions

Edge Graph::first(int v) { // Get vertex first edge

for (int i=0; i<numVertex; i++)

if (matrix[v][i] != NOEDGE)

return new EdgeClass(v, i);

return NULL;

}

bool Graph::isEdge(Edge w) // TRUE if this is an edge
{ return (w != NULL) &&

(matrix[w->v1][w->v2] != NOEDGE); }

Edge Graph::next(Edge w) { // Get vertex’s next edge
for (int i=w->v2+1; i<numVertex; i++)

if (matrix[w->v1][i] != NOEDGE)

return new EdgeClass(w->v1, i);

return NULL;

}

int Graph::weight(int i, int j) { // Return edge weight
if (matrix[i][j] == NOEDGE) return INFINITY;

else return matrix[i][j];

}

int Graph::weight(Edge w) { // Return weight of edge

if ((w == NULL) || (matrix[w->v1][w->v2] == NOEDGE))

return INFINITY;

else return matrix[w->v1][w->v2];

}

int Graph::v1(Edge w) { return w->v1; } // Comes from

int Graph::v2(Edge w) { return w->v2; } // Goes to

bool Graph::getMark(int v) { return Mark[v]; }

void Graph::setMark(int v, bool val) { Mark[v] = val; }

97

Implementation: Adjacency List

class EdgeLink { // Singly-linked list node

public:

int weight; // Edge weight

int v1, v2; // Edge vertices

EdgeLink* next; // Pointer to next list edge

EdgeLink(int vt1, int vt2, int w, EdgeLink* nxt=NULL)

{ v1 = vt1; v2 = vt2; weight = w; next = nxt; }

EdgeLink(EdgeLink* nxt =NULL) { next = nxt; }

};

typedef EdgeLink* Edge;

class Graph { // Graph class: adjacency list
private:

Edge* list; // The vertex list

int numVertex, numEdge; // Number of vertices, edges

bool* Mark; // The mark array

public:

Graph(); // Constructor

~Graph(); // Destructor

int n(); // Number of vertices

int e(); // Number of edges

Edge first(int); // Get vertex first edge

bool isEdge(Edge); // TRUE if this is an edge

Edge next(Edge); // Get vertex next edge

int v1(Edge); // Vertex edge comes from

int v2(Edge); // Vertex edge goes to

int weight(int, int); // Weight of edge

int weight(Edge); // Weight of edge

bool getMark(int); // Return a Mark value

void setMark(int, bool); // Set a Mark value

};

98

Adjacency List Functions

Edge Graph::first(int v) // Get first edge for vertex

{ return list[v]; }

bool Graph::isEdge(Edge w) // TRUE if this is an edge
{ return w != NULL; }

Edge Graph::next(Edge w) { // Get vertex’s next edge
if (w == NULL) return NULL;

else return w->next;

}

int Graph::v1(Edge w) // Vertex edge comes from
{ return w->v1; }

int Graph::v2(Edge w) // Vertex edge goes to
{ return w->v2; }

int Graph::weight(int i, int j) { // Return edge weight
for (Edge curr = list[i]; curr != NULL;

curr = curr->next)

if (curr->v2 == j) return curr->weight;

return INFINITY;

}

int Graph::weight(Edge w) { // Return edge weight
if (w == NULL) return INFINITY;

else return w->weight;

}

99

Graph Traversals

Some applications require visiting every vertex

in the graph exactly once.

Application may require that vertices be visited

in some special order based on graph topology.

Example: Artificial Intelligence

• Problem domain consists of many "states."

• Need to get from Start State to Goal State.

• Start and Goal are typically not directly

connected.

To ensure visiting all vertices:

void graph_traverse(Graph& G) {

for (v=0; v<G.n(); v++)

G.Mark[v] = UNVISITED; // Initialize mark bits

for (v=0; v<G.n(); v++)

if (G.Mark[v] == UNVISITED)

do_traverse(G, v);

}

[Two traversals we will talk about: DFS, BFS.]

100

Depth First Search

void DFS(Graph& G, int v) { // Depth first search

PreVisit(G, v); // Take appropriate action

G.setMark(v, VISITED);

for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))

if (G.getMark(G.v2(w)) == UNVISITED)

DFS(G, G.v2(w));

PostVisit(G, v); // Take appropriate action

}

[Figure: (a) a graph on vertices A through F; (b) its Depth First Search Tree.]

101

Breadth First Search

Like DFS, but replace stack with a queue.

Visit the vertex's neighbors before continuing

deeper in the tree.

void BFS(Graph& G, int start) {

Queue Q(G.n());

Q.enqueue(start);

G.setMark(start, VISITED);

while (!Q.isEmpty()) {

int v = Q.dequeue();

PreVisit(G, v); // Take appropriate action

for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))

if (G.getMark(G.v2(w)) == UNVISITED) {

G.setMark(G.v2(w), VISITED);

Q.enqueue(G.v2(w));

}

PostVisit(G, v); // Take appropriate action

}}

[Figure: (a) a graph on vertices A through F; (b) its Breadth First Search Tree.]

102

Topological Sort

Problem: Given a set of jobs, courses, etc.

with prerequisite constraints, output the jobs in

an order that does not violate any of the

prerequisites.

[Figure: a DAG of seven jobs J1 through J7 with prerequisite edges.]

void topsort(Graph& G) { // Recursive topological sort
for (int i=0; i<G.n(); i++) // Initialize Mark array

G.setMark(i, UNVISITED);

for (i=0; i<G.n(); i++) // Process all vertices

if (G.getMark(i) == UNVISITED)

tophelp(G, i); // Call helper function

}

void tophelp(Graph& G, int v) { // Process vertex v
G.setMark(v, VISITED);

// No PreVisit operation

for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))

if (G.getMark(G.v2(w)) == UNVISITED)

tophelp(G, G.v2(w));

printout(v); // PostVisit for Vertex v

}

[Prints in reverse order.]

103

Queue-based Topological Sort

void topsort(Graph& G) { // Topological sort: Queue

Queue Q(G.n());

int Count[G.n()];
for (v=0; v<G.n(); v++) Count[v] = 0; // Initialize

for (v=0; v<G.n(); v++) // Process every edge

for (Edge w=G.first(v); G.isEdge(w); w=G.next(w))

Count[G.v2(w)]++; // Add to v2’s prereq count

for (v=0; v<G.n(); v++) // Initialize Queue

if (Count[v] == 0) // Vertex has no prereqs

Q.enqueue(v);

while (!Q.isEmpty()) { // Process the vertices

int v = Q.dequeue();

printout(v); // PreVisit for Vertex V

for (Edge w=G.first(v); G.isEdge(w); w=G.next(w)) {

Count[G.v2(w)]--; // One less prerequisite

if (Count[G.v2(w)] == 0) // Vertex is now free

Q.enqueue(G.v2(w));

}

}

}

104

Shortest Paths Problems

Input: A graph with weights or costs

associated with each edge.

Output: The list of edges forming the shortest

path.

Sample problems:

• Find the shortest path between two

specied vertices.

• Find the shortest path from vertex S to all

other vertices.

• Find the shortest path between all pairs of

vertices.

Our algorithms will actually calculate only

distances.

105

Shortest Paths Denitions

d(A, B) is the shortest distance from vertex A

to B.

w(A, B) is the weight of the edge connecting

A to B.

• If there is no such edge, then w(A, B) = ∞.

[Figure: a directed graph on vertices A through E with weighted edges, used in the shortest-paths examples.]

106

Single Source Shortest Paths

Given start vertex s, nd the shortest path

from s to all other vertices.

Try 1: Visit all vertices in some order, compute

shortest paths for all vertices seen so far, then

add the shortest path to next vertex x.

Problem: Shortest path to a vertex already

processed might go through x.

Solution: Process vertices in order of distance

from s.

107

Dijkstra's Algorithm Example

A B C D E

Initial 0 ∞ ∞ ∞ ∞

Process A 0 10 3 ∞ 20

Process C 0 5 3 20 18

Process B 0 5 3 10 18

Process D 0 5 3 10 18

Process E 0 5 3 10 18

[Figure: the example weighted graph repeated.]

108

Dijkstra's Algorithm: Array

void Dijkstra(Graph& G, int s) { // Use array

int D[G.n()];

for (int i=0; i<G.n(); i++) // Initialize

D[i] = INFINITY;

D[s] = 0;

for (i=0; i<G.n(); i++) { // Process vertices

int v = minVertex(G, D);

if (D[v] == INFINITY) return; // Unreachable

G.setMark(v, VISITED);

for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))

if (D[G.v2(w)] > (D[v] + G.weight(w)))

D[G.v2(w)] = D[v] + G.weight(w);

}

}

int minVertex(Graph& G, int* D) { // Get mincost vertex
int v; // Initialize v to any unvisited vertex

for (int i=0; i<G.n(); i++)

if (G.getMark(i) == UNVISITED) { v = i; break; }

for (i++; i<G.n(); i++) // Now find smallest D value

if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))

v = i;

return v;

}

[Approach 1: Scan the distance array on each pass for the closest vertex.]

Total cost: Θ(|V|² + |E|) = Θ(|V|²).

109

Dijkstra's Algorithm: Priority Queue

class Elem { public: int vertex, dist; };

int key(Elem x) { return x.dist; }

void Dijkstra(Graph& G, int s) { // W/ priority queue

int v; // The current vertex

int D[G.n()]; // Distance array

Elem temp;

Elem E[G.e()]; // Heap array

temp.dist = 0; temp.vertex = s;

E[0] = temp; // Initialize heap

heap H(E, 1, G.e()); // Create the heap

for (int i=0; i<G.n(); i++) D[i] = INFINITY;

D[s] = 0;

for (i=0; i<G.n(); i++) { // Now, get distances

do { temp = H.removemin(); v = temp.vertex; }

while (G.getMark(v) == VISITED);

G.setMark(v, VISITED);

if (D[v] == INFINITY) return; // Rest unreachable

for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))

if (D[G.v2(w)] > (D[v] + G.weight(w))) {

D[G.v2(w)] = D[v] + G.weight(w); // Update D

temp.dist = D[G.v2(w)]; temp.vertex = G.v2(w);

H.insert(temp); // Insert new distance in heap

}}}

Approach 2: Store unprocessed vertices using a

min-heap to implement a priority queue ordered

by D value. Must update priority queue for

each edge.

Total cost: Θ((|V| + |E|) log |V|).

110

All Pairs Shortest Paths

For every vertex u, v ∈ V, calculate d(u, v).

Could run Dijkstra's Algorithm |V| times.

[Cost: |V| · |E| log |V| = |V|³ log |V| for a dense graph.]

[The issue is how to eciently check all the paths without

computing any path more than once.]

Dene a k-path from u to v to be any path

whose intermediate vertices all have indices less

than k.

[Figure: a directed graph on vertices 0 through 3 with weighted edges.]

[0,3 is a 0-path. 2,0,3 is a 1-path. 0,2,3 is a 3-path, but not a 2-path or a 1-path. Everything is a 4-path.]

111

Floyd's Algorithm

void Floyd(Graph& G) { // All-pairs shortest paths

int D[G.n()][G.n()]; // Store distances

for (int i=0; i<G.n(); i++) // Initialize D

for (int j=0; j<G.n(); j++)

D[i][j] = G.weight(i, j);

for (int k=0; k<G.n(); k++) // Compute all k paths

for (int i=0; i<G.n(); i++)

for (int j=0; j<G.n(); j++)

if (D[i][j] > (D[i][k] + D[k][j]))

D[i][j] = D[i][k] + D[k][j];

}
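The triple loop can also be run over a plain matrix, with a large INF constant for "no edge"; guarding against INF + INF overflow is the only subtlety. A standalone sketch (not the slides' Graph-based code; the names are assumptions):

```cpp
#include <vector>

const int INF = 1000000;  // Stands in for "no edge" / INFINITY

// Floyd's all-pairs shortest paths over an n x n distance matrix D,
// where D[i][i] == 0 and missing edges hold INF.
void floyd(std::vector<std::vector<int> >& D) {
    int n = (int)D.size();
    for (int k = 0; k < n; k++)           // Allow k as an intermediate
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (D[i][k] < INF && D[k][j] < INF &&
                    D[i][j] > D[i][k] + D[k][j])
                    D[i][j] = D[i][k] + D[k][j];
}
```

After the k-th iteration of the outer loop, D[i][j] holds the cost of the cheapest (k+1)-path from i to j, which is the induction the slides describe.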

112

Minimum Cost Spanning Trees

Minimum Cost Spanning Tree (MST) Problem:

• Input: An undirected, connected graph G.

• Output: The subgraph of G that 1) has

minimum total cost as measured by

summing the values for all of the edges in

the subset, and 2) keeps the vertices

connected.

[Figure: a weighted, undirected graph on vertices A through F, with its minimum-cost spanning tree highlighted.]

113

Prim's MST Algorithm

void Prim(Graph& G, int s) { // Prim’s MST alg

int D[G.n()]; // Distance vertex

int V[G.n()]; // Who’s closest

for (int i=0; i<G.n(); i++) // Initialize

D[i] = INFINITY;

D[s] = 0;

for (i=0; i<G.n(); i++) { // Process vertices

int v = minVertex(G, D);

G.setMark(v, VISITED);

if (v != s) AddEdgetoMST(V[v], v); // Add to MST

if (D[v] == INFINITY) return; // Rest unreachable

for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))

if (D[G.v2(w)] > G.weight(w)) {

D[G.v2(w)] = G.weight(w); // Update distance,

V[G.v2(w)] = v; // who came from

}}}

int minVertex(Graph& G, int* D) { // Get mincost vertex
int v; // Initialize v to any unvisited vertex

for (int i=0; i<G.n(); i++)

if (G.getMark(i) == UNVISITED) { v = i; break; }

for (i=0; i<G.n(); i++) // Now find smallest value

if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))

v = i;

return v;

}

114

Alternative Prim's Implementation

Like Dijkstra's algorithm, we can implement

Prim's algorithm with a priority queue.

void Prim(Graph& G, int s) { // W/ priority queue

int v; // The current vertex

int D[G.n()]; // Distance array

int V[G.n()]; // Who’s closest

Elem temp;

Elem E[G.e()]; // Heap array

temp.dist = 0; temp.vertex = s;

E[0] = temp; // Initialize heap array

heap H(E, 1, G.e()); // Create the heap

for (int i=0; i<G.n(); i++) D[i] = INFINITY; // Init

D[s] = 0;

for (i=0; i<G.n(); i++) { // Now build MST

do { temp = H.removemin(); v = temp.vertex; }

while (G.getMark(v) == VISITED);

G.setMark(v, VISITED);

if (v != s) AddEdgetoMST(V[v], v); // Add to MST

if (D[v] == INFINITY) return; // Rest unreachable

for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))

if (D[G.v2(w)] > G.weight(w)) { // Update D

D[G.v2(w)] = G.weight(w);

V[G.v2(w)] = v; // Who came from

temp.dist = D[G.v2(w)];

temp.vertex = G.v2(w);

H.insert(temp); // Insert distance in heap

}

}}

115

Proof of Prim's MST Algorithm

Theorem 14.1 Prim's algorithm produces a

minimum cost spanning tree.

Proof by contradiction:

Order vertices by how they are added to the

MST by Prim's algorithm. v1, v2, ..., vn.

Let edge ei connect (vx, vi+1), x < i.

Let ej be the lowest numbered (rst) edge

added by the algorithm such that the set of

edges selected so far cannot be extended to

form an MST for G.

Let V1 = {v1, ..., vj}. Let V2 = {vj+1, ..., vn}.

[Figure: the marked vertices vi, i < j, form V1; the unmarked vertices vi, i ≥ j, form V2. Prim's edge ej crosses from V1 to V2, as does the "correct" edge e′.]

116

Kruskal's MST Algorithm

void Kruskal(Graph& G) { // Kruskal’s MST algorithm

Gentree A(G.n()); // Equivalence class array

Elem E[G.e()]; // Array of edges for min-heap

int edgecnt = 0;

for (int i=0; i<G.n(); i++) // Put edges on array

for (Edge w = G.first(i);

G.isEdge(w); w = G.next(w)) {

E[edgecnt].weight = G.weight(w);

E[edgecnt++].edge = w;

}

heap H(E, edgecnt, edgecnt); // Heapify the edges

int numMST = G.n(); // Init w/ n equiv classes

for (i=0; numMST>1; i++) { // Combine equiv classes

Elem temp = H.removemin(); // Get next cheap edge

Edge w = temp.edge;

int v = G.v1(w); int u = G.v2(w);

if (A.differ(v, u)) { // If different equiv classes

A.UNION(v, u); // Combine equiv classes

AddEdgetoMST(G.v1(w), G.v2(w)); // Add to MST

numMST--; // One less MST

}

}

}

Solution: Use Parent Pointer representation to

merge equivalance classes.

117

Kruskal's Algorithm Example

Time dominated by cost for initial edge sort.

[Alternative: Use a min heap, quit when only one set left.

\Kth-smallest" implementation.]

Total cost: Θ(|V| + |E| log |E|).

Initial: each of A, B, C, D, E, F is its own equivalence class.

Step 1: Process edge (C, D), weight 1; merge C and D.

Step 2: Process edge (E, F), weight 1; merge E and F.

Step 3: Process edge (C, F), weight 2; merge {C, D} with {E, F}.

[Figure: the forest of equivalence classes after each step.]

118

Sorting

Each record contains a field called the key.

Linear order: comparison.

[a < b and b < c ⇒ a < c.]

The Sorting Problem

Given a sequence of records R1, R2, ..., Rn with

key values k1, k2, ..., kn, respectively, arrange the

records into any order s such that records

Rs1 , Rs2 , ..., Rsn have keys obeying the property

ks1 ≤ ks2 ≤ ... ≤ ksn .

[Put keys in ascending order.]

Measures of cost:

• Comparisons

• Swaps

119

Insertion Sort

void inssort(Elem* array, int n) { // Insertion Sort

for (int i=1; i<n; i++) // Insert i’th record

for (int j=i; (j>0) &&

(key(array[j])<key(array[j-1])); j--)

swap(array, j, j-1);

}

i=1 2 3 4 5 6 7

42 20 17 13 13 13 13 13

20 42 20 17 17 14 14 14

17 17 42 20 20 17 17 15

13 13 13 42 28 20 20 17

28 28 28 28 42 28 23 20

14 14 14 14 14 42 28 23

23 23 23 23 23 23 42 28

15 15 15 15 15 15 15 42

Worst Case: [n²/2 swaps and compares.]
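The slide's code relies on Elem, key, and swap helpers. The same algorithm on a plain int vector (a standalone sketch) reproduces the trace above:

```cpp
#include <vector>

// Insertion sort: repeatedly insert the i'th value into the sorted
// prefix a[0..i-1] by exchanging it leftward into place.
void inssort(std::vector<int>& a) {
    for (std::vector<int>::size_type i = 1; i < a.size(); i++)
        for (std::vector<int>::size_type j = i;
             j > 0 && a[j] < a[j-1]; j--) {
            int t = a[j]; a[j] = a[j-1]; a[j-1] = t;  // Exchange
        }
}
```

Running it on the column of input values 42 20 17 13 28 14 23 15 yields the sorted final column 13 14 15 17 20 23 28 42.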

120

Bubble Sort

void bubsort(Elem* array, int n) { // Bubble Sort

for (int i=0; i<n-1; i++) // Bubble up i’th record

for (int j=n-1; j>i; j--)

if (key(array[j]) < key(array[j-1]))

swap(array, j, j-1);

}

[Using test \j > i" saves a factor of 2 over \j > 0".]

i=0 1 2 3 4 5 6

42 13 13 13 13 13 13 13

20 42 14 14 14 14 14 14

17 20 42 20 15 15 15 15

13 17 20 42 20 17 17 17

28 14 17 15 42 20 20 20

14 28 15 17 17 42 23 23

23 15 28 23 23 23 42 28

15 23 23 28 28 28 28 42


121

Selection Sort

void selsort(Elem* array, int n) { // Selection Sort

for (int i=0; i<n-1; i++) { // Select i’th record

int lowindex = i; // Remember its index

for (int j=n-1; j>i; j--) // Find the least value

if (key(array[j]) < key(array[lowindex]))

lowindex = j; // Put it in place

swap(array, i, lowindex);

}

}

[Select the value to go in the ith position.]

i=0 1 2 3 4 5 6

42 13 13 13 13 13 13 13

20 20 14 14 14 14 14 14

17 17 17 15 15 15 15 15

13 42 42 42 17 17 17 17

28 28 28 28 28 20 20 20

14 14 20 20 20 28 23 23

23 23 23 23 23 23 28 28

15 15 15 17 42 42 42 42

Best Case: [0 swaps (n − 1 as written), n²/2 compares.]

122

Pointer Swapping

Key = 42 Key = 42

Key = 5 Key = 5

(a) (b)

123

Exchange Sorting

Summary

              Insertion   Bubble    Selection
Comparisons:
 Best Case     Θ(n)       Θ(n²)     Θ(n²)
 Average Case  Θ(n²)      Θ(n²)     Θ(n²)
 Worst Case    Θ(n²)      Θ(n²)     Θ(n²)
Swaps:
 Best Case     0          0         Θ(n)
 Average Case  Θ(n²)      Θ(n²)     Θ(n)
 Worst Case    Θ(n²)      Θ(n²)     Θ(n)

All of these sorts rely on exchanges of

adjacent records.

What is the average number of exchanges

required?

[n²/4 – the average distance from a record to its sorted position.]

124

Shellsort

void shellsort(Elem* array, int n) { // Shellsort

for (int i=n/2; i>2; i/=2) // For each increment

for (int j=0; j<i; j++) // Sort each sublist

inssort2(&array[j], n-j, i);

inssort2(array, n, 1);

}

void inssort2(Elem* A, int n, int incr) {

for (int i=incr; i<n; i+=incr)

for (int j=i; (j>=incr) &&

(key(A[j])<key(A[j-incr])); j-=incr)

swap(A, j, j-incr);

}

59 20 17 13 28 14 23 83 36 98 11 70 65 41 42 15

[8 lists of length 2]

36 20 11 13 28 14 23 15 59 98 17 70 65 41 42 83

[4 lists of length 4]

28 14 11 13 36 20 17 15 59 41 23 70 65 98 42 83

[2 lists of length 8]

11 13 17 14 23 15 28 20 36 41 42 70 59 83 65 98

[1 list of length 16]

11 13 14 15 17 20 23 28 36 41 42 59 65 70 83 98

O(n^1.5) [Any increments will work, provided the last is 1. Shellsort takes advantage of Insertion Sort's best-case performance.]

125

Quicksort

Divide and Conquer: divide list into values less

than pivot and values greater than pivot.

[Initial call: ] qsort(array, 0, n-1);

void qsort(Elem* array, int i, int j) { // Quicksort
  int pivotindex = findpivot(array, i, j);
  swap(array, pivotindex, j); // Swap to end
  // k will be the first position in the right subarray
  int k = partition(array, i-1, j, key(array[j]));
  swap(array, k, j); // Put pivot in place
  if ((k-i) > 1) qsort(array, i, k-1); // Sort left
  if ((j-k) > 1) qsort(array, k+1, j); // Sort right
}

int findpivot(Elem* array, int i, int j)
  { return (i+j)/2; }

126

Quicksort Partition

int partition(Elem* array, int l, int r, int pivot) {

do { // Move the bounds inward until they meet

while (key(array[++l]) < pivot); // Move right

while (r && (key(array[--r]) > pivot));// Move left

swap(array, l, r); // Swap out-of-place vals

} while (l < r); // Stop when they cross

swap(array, l, r); // Reverse wasted swap

return l; // Return first pos in right partition

}

Initial 72 6 57 88 85 42 83 73 48 60

l r

Pass 1 72 6 57 88 85 42 83 73 48 60

l r

Swap 1 48 6 57 88 85 42 83 73 72 60

l r

Pass 2 48 6 57 88 85 42 83 73 72 60

l r

Swap 2 48 6 57 42 85 88 83 73 72 60

l r

Pass 3 48 6 57 42 85 88 83 73 72 60

r l

Swap 3 48 6 57 85 42 88 83 73 72 60

r l

Reverse Swap 48 6 57 42 85 88 83 73 72 60

r l

The cost for Partition is Θ(n).

127

Quicksort Example

72 6 57 88 60 42 83 73 48 85

Pivot = 60

48 6 57 42 60 88 83 73 72 85

Pivot = 6 Pivot = 73

6 42 57 48 72 73 85 88 83

Pivot = 57 Pivot = 88

Pivot = 42 42 48 57 85 83 88 Pivot = 85

42 48 83 85

6 42 48 57 60 72 73 83 85 88

Final Sorted Array

128

Cost for Quicksort

Best Case: Always partition in half.

Worst Case: Bad partition.

Average Case:

T(n) = n + 1 + (1/(n−1)) · Σ_{k=1}^{n−1} (T(k) + T(n−k)) = Θ(n log n)

• Better pivot.

• Use better algorithm for small sublists.

• Eliminate recursion.

129

Mergesort

List mergesort(List inlist) {

if (inlist.length() <= 1) return inlist;

List l1 = half of the items from inlist;

List l2 = other half of the items from inlist;

return merge(mergesort(l1), mergesort(l2));

}

36 20 17 13 28 14 23 15

20 36 13 17 14 28 15 23

13 17 20 36 14 15 23 28

13 14 15 17 20 23 28 36

130

Mergesort Implementation

Mergesort is tricky to implement.

void mergesort(Elem* array, Elem* temp,

int left, int right) {

int mid = (left+right)/2;

if (left == right) return; // List of one ELEM

mergesort(array, temp, left, mid); // Sort 1st half

mergesort(array, temp, mid+1, right);// Sort 2nd half

for (int i=left; i<=right; i++) // Copy to temp

temp[i] = array[i];

// Do the merge operation back to array

int i1 = left; int i2 = mid + 1;

for (int curr=left; curr<=right; curr++) {

if (i1 == mid+1) // Left sublist exhausted

array[curr] = temp[i2++];

else if (i2 > right) // Right sublist exhausted

array[curr] = temp[i1++];

else if (key(temp[i1]) < key(temp[i2]))

array[curr] = temp[i1++];

else array[curr] = temp[i2++];

}}

[Note: This requires a second array.]

Mergesort cost: [Θ(n log n)]

Mergesort is good for sorting linked lists.

[Send records to alternating linked lists, mergesort each, then

merge.]

131

Optimized Mergesort

void mergesort(Elem* array, Elem* temp,

int left, int right) {

int i, j, k, mid = (left+right)/2;

if (left == right) return;

mergesort(array, temp, left, mid); // Sort 1st half

mergesort(array, temp, mid+1, right);// Sort 2nd half

for (i=left; i<=mid; i++) temp[i] = array[i];

for (j=1; j<=right-mid; j++)

temp[right-j+1] = array[j+mid];

// Merge sublists back to array

for (i=left,j=right,k=left; k<=right; k++)

if (key(temp[i]) < key(temp[j]))

array[k] = temp[i++];

else array[k] = temp[j--];

}

[The right sublist is reversed. So, each list increases toward the middle. This means that each list serves as a sentinel for the other list, so no test for end-of-list is required.]

132

Heapsort

Heapsort uses a max-heap.

void heapsort(Elem* array, int n) { // Heapsort

heap H(array, n, n); // Build the heap

for (int i=0; i<n; i++) // Now sort

H.removemax(); // Value placed at end of heap

}

Cost of Heapsort: [Θ(n log n)]

Cost of finding the k largest elements: [Θ(k log n + n).

Time to build heap: Θ(n).

Time to remove the maximum element: Θ(log n).]

[Compare to sorting with BST: this is expensive in space

(overhead), potential bad balance, BST does not take

advantage of having all records available in advance.]

[Heap is space efficient, balanced, and building the initial heap is efficient.]

133

Heapsort Example

Original numbers: 73 6 57 88 60 42 83 72 48 85
Build heap:       88 85 83 72 73 42 57 6 48 60
Remove 88:        85 73 83 72 60 42 57 6 48 | 88
Remove 85:        83 73 57 72 60 42 48 6 | 85 88
Remove 83:        73 72 57 6 60 42 48 | 83 85 88

[Figure: the heap tree corresponding to each array state is omitted.]

134

Binsort

A simple, efficient sort:

for (i=0; i<n; i++)

B[key(A[i])] = A[i];

Ways to generalize:

• Make each bin the head of a list. [Support

duplicates]

• Allow more keys than records. [I.e., larger key

space]

void binsort(ELEM *A, int n) {

list B[MaxKeyValue];

for (i=0; i<n; i++) B[key(A[i])].append(A[i]);

for (i=0; i<MaxKeyValue; i++)

for (B[i].first(); B[i].isInList(); B[i].next())

output(B[i].currValue());

}

Cost: [Θ(n).]

[Oops! Actually, Θ(n ∗ MaxKeyValue).]

[MaxKeyValue could be O(n²) or worse.]

135

Radix Sort

Initial List: 27 91 1 97 17 23 84 28 72 5 67 25

First pass Second pass

(on right digit) (on left digit)

0 0 1 5

1 91 1 1 17

2 72 2 23 25 27 28

3 23 3

4 84 4

5 5 25 5

6 6 67

7 27 97 17 67 7 72

8 28 8 84

9 9 91 97

Result of second pass: 1 5 17 23 25 27 28 67 72 84 91 97

136

Cost of Radix Sort

void radix(Elem* A, Elem* B, int n, int k, int r,
           int* count) {
  // Count[i] stores number of records in bin[i]
  for (int i=0, rtok=1; i<k; i++, rtok*=r) { // For k digits
    for (int j=0; j<r; j++) count[j] = 0; // Init
    // Count number of records for each bin on this pass
    for (j=0; j<n; j++) count[(key(A[j])/rtok)%r]++;
    for (j=1; j<r; j++) count[j] = count[j-1]+count[j];
    // Since bins fill from bottom, j counts downwards
    for (j=n-1; j>=0; j--)
      B[--count[(key(A[j])/rtok)%r]] = A[j];
    for (j=0; j<n; j++) A[j] = B[j]; // Copy B back to A
  }
}

How do n, k and r relate?

[r can be viewed as a constant. k ≥ log n if there are n distinct keys.]

137

Radix Sort Example

Initial Input: Array A 27 91 1 97 17 23 84 28 72 5 67 25

0 1 2 3 4 5 6 7 8 9

First pass values for Count. 0 2 1 1 1 2 0 4 1 0

rtok = 1.

0 1 2 3 4 5 6 7 8 9

Count array: 0 2 3 4 5 7 7 11 12 12

Index positions for Array B.

0 1 2 3 4 5 6 7 8 9

Second pass values for Count. 2 1 4 0 0 0 1 1 1 2

rtok = 10.

0 1 2 3 4 5 6 7 8 9

Count array: 2 3 7 7 7 7 8 9 10 12

Index positions for Array B.

138

Empirical Comparison

Algorithm 10 100 1,000 10K 30K

Insert. Sort .00200 .1833 18.13 1847.0 16544

Bubble Sort .00233 .2267 22.47 2274.0 20452

Selec. Sort .00167 .0967 2.17 900.3 8142

Shellsort .00233 .0600 1.00 17.0 59

Shellsort/O .00233 .0500 .93 16.3 65

QSort .00367 .0500 .63 7.3 24

QSort/O .00200 .0300 .43 5.7 18

Merge .00700 .0700 .87 10.7 35

Merge/O .00133 .0267 .37 5.0 16

Heapsort .00900 .1767 2.67 36.3 122

Rad Sort/1 .02433 .2333 2.30 23.3 69

Rad Sort/4 .00700 .0600 .60 6.0 17

Rad Sort/8 .00967 .0333 .30 3.0 8

Algorithm 10 100 1,000 10K 100K

Insert. Sort .0002 .0170 1.68 168.8 23382

Bubble Sort .0003 .0257 2.55 257.2 41874

Selec. Sort .0003 .0273 2.65 267.5 40393

Shellsort .0003 .0027 0.12 1.9 40

Shellsort/O .0003 .0060 0.11 1.8 33

QSort .0004 .0057 0.08 0.9 12

QSort/O .0002 .0040 0.06 0.8 10

Merge .0009 .0130 0.17 2.3 30

Merge/O .0003 .0067 0.11 1.5 21

Heapsort .0010 .0173 0.26 3.5 49

Rad Sort/1 .0123 .1197 1.21 12.5 135

Rad Sort/4 .0035 .0305 0.30 3.2 34

Rad Sort/8 .0047 .0183 0.16 1.6 18

139

Sorting Lower Bound

Want to prove a lower bound for all possible

sorting algorithms.

Sorting is O(n log n).

Sorting I/O takes Ω(n) time.

Will now prove an Ω(n log n) lower bound.

Form of proof:

• Comparison-based sorting can be modeled by a binary tree.

• The tree must have Ω(n!) leaves.

• The tree must be Ω(n log n) levels deep.

140

Decision Trees

[Figure: decision tree for sorting three values X, Y, Z. The root compares A[1] and A[0] (Y < X?); each internal node compares two array values, and each of the 3! = 6 leaves corresponds to one permutation: XYZ, XZY, YXZ, YZX, ZXY, ZYX.]

There are n! permutations, and at least 1 node

for each permutation.

A tree with n nodes has at least log n levels.

Where is the worst case in the decision tree?

log n! = Ω(n log n).
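As a quick numerical check of this bound (a sketch added here, not from the original notes), log₂ n! can be computed by summing log₂ i:

```cpp
#include <cmath>

// Sketch: compute log2(n!) by summing log2(i) for i = 2..n, to compare
// the decision-tree depth bound against n * log2(n).
double log2Factorial(int n) {
    double s = 0.0;
    for (int i = 2; i <= n; i++) s += std::log2(i);
    return s;
}
```

For n = 10, log₂(10!) ≈ 21.8 while 10 log₂ 10 ≈ 33.2 — the two differ only by a constant factor asymptotically.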

141

Primary vs. Secondary Storage

Primary Storage: Main memory (RAM)

Secondary Storage: Peripheral devices

• Disk Drives

• Tape Drives

Medium              Price   Price per Mbyte
32MB RAM            $225    $7.00/MB
1.4MB floppy disk   $.50    $0.36/MB
2.1GB disk drive    $210    $0.10/MB
1GB JAZ cassette    $100    $0.10/MB
2GB cartridge tape  $20     $0.01/MB

RAM is usually volatile.

RAM is about 1/4 million times faster than

disk.

142

Golden Rule of File Processing

Minimize the number of disk accesses!

1. Arrange information so that you get what

you want with few disk accesses.

2. Arrange information so as to minimize future disk accesses.

An organization for data on disk is often called a file structure.

Disk-based space/time trade-off: Compress information to save processing time by reducing disk accesses.

143

Disk Drives

[Figure: (a) side view of a disk drive — platters on a spindle, read/write heads on a boom (arm); (b) top view of one platter — tracks, sectors, intersector gaps, and bits of data.]

[CD-ROM: Spiral with equally spaced dots, variable speed rotation.]

144

Sectors

[Figure: two orderings of 8 sectors on a track — (a) sectors 1–8 in physical order (no interleaving); (b) an interleaved layout, so that logically adjacent sectors are physically separated.]

Interleaving factor: Physical distance between

logically adjacent sectors on a track, to allow

for processing of sector data.

Locality of Reference: If a record is read from disk, the next request is likely to come from near the same place in the file.

Cluster: Smallest unit of file allocation – several sectors.

Extent: A group of physically contiguous clusters.

Internal Fragmentation: Wasted space within a sector if record size does not match sector size, or wasted space within a cluster if file size is not a multiple of cluster size.

145

Factors affecting disk access time

1. Seek time: time for I/O head to reach

desired track. Largely determined by

distance between I/O head and desired

track. [If seeking at random, n = disk width/3.]

Seek time: f (n) = t ∗ n + s where t is time

to traverse one track and s is startup time

for the I/O head.

2. Rotational delay or latency: time for data

to rotate to I/O head position.

At 3600 RPM, one half rotation of the disk:

16.7/2 msec = 8.3 msec.

[Newer disks are running at 7200 RPM.]

3. Transfer time: time for data to move

under the I/O head.

Number of sectors read / Number of

sectors per track * 16.7 msec

[One disk rotation.]

146

Disk Access Cost Example

675 Mbyte disk drive

• 15 platters ⇒ 45 Mbyte/platter

• 612 tracks/platter

• 150 sectors/track ⇒ 512 bytes/sector

• 8 sectors/cluster (4K bytes/cluster) ⇒ 18

clusters/track

• Interleaving factor of 3 ⇒ 3 revolutions to

read one track (50.1 msec)

How long to read a file of 128 Kbytes divided into 256 records of 512 bytes?

Number of Clusters: [128K / 4K = 32.]

If the file fills the minimum number of tracks: 150 sectors of one track, 106 of the next.

Total time:

612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 + 0.08 + 3 + 3.5 ∗ 16.7 = 139.3 msec.

If clusters are spread randomly across disk:

32 ∗ (612/3 ∗ 0.08 + 3 + 16.7/2 + 24/150 ∗ 16.7)
= 32 ∗ 30.3 = 969.6 msec.
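The arithmetic of this example can be reproduced in a few lines (a sketch; the constants — 0.08 msec per track, 3 msec startup, 16.7 msec per rotation — come from the example above):

```cpp
// Sketch of the example's cost model: seek time f(n) = 0.08*n + 3 msec.
double seekTime(double tracks) { return tracks * 0.08 + 3.0; }

// Sequential case: average seek, 3.5 rotations for track 1 (interleave
// factor 3 plus half a rotation of latency), a one-track seek, 3.5 more.
double sequentialRead() {
    return seekTime(612.0 / 3) + 3.5 * 16.7 + 0.08 + 3 + 3.5 * 16.7;
}

// Random case: 32 clusters, each costing an average seek, half a rotation
// of latency, and 24 sector times of transfer (8 sectors at interleave 3).
double randomRead() {
    return 32 * (seekTime(612.0 / 3) + 16.7 / 2 + 24.0 / 150 * 16.7);
}
```

This reproduces the 139.3 msec and roughly 970 msec figures above; the hundredfold penalty comes almost entirely from the extra seeks and latencies.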

147

Magnetic Tape

Example: 9 track tape at 6250 bytes per inch

(bpi).

At 2400 feet, this yields 170 Mbytes for $20, or

$0.12/Mbyte.

Workstation/PC cartridge tape is similar.

Magnetic tape requires sequential access.

Magnetic tape has two speeds:

• High speed for "skipping."

• Low speed for "reading."

An inter-record gap is required to recognize the beginning of a record at high speed. This is a significant amount of space compared to typical record size.

Blocking Factor: Number of records in a

block between gaps.

148

Buffers

Read time for one track:

612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 = 77.8 msec.

Read time for one sector:

612/3∗0.08+3+16.7/2+16.7/150 = 27.8 msec.

Read time for one byte:

612/3 ∗ 0.08 + 3 + 16.7/2 = 27.7 msec.

Nearly all disk drives read/write one sector at

every I/O access.

• Also called a page.

• The contents of a sector are kept in main memory in a buffer or cache.

If the next I/O access is to the same buffer, then there is no need to go to disk.

There are usually one or more input buffers and one or more output buffers.

149

Buffer Pools

A series of buffers used by an application to cache disk data is called a buffer pool.

Virtual memory uses a buffer pool to imitate greater RAM memory by actually storing information on disk and "swapping" between disk and RAM.

Organization for buffer pools: which one to use next?

• First-in, First-out: Use the first one on the queue.

• Least Frequently Used (LFU): Count buffer accesses, pick the least used.

• Least Recently Used (LRU): Keep buffers on a linked list. When a buffer is accessed, bring it to the front. Reuse the one at the end.
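A minimal sketch of the LRU ordering (the names here are illustrative, not from the notes):

```cpp
#include <list>

// LRU buffer-pool ordering over buffer ids: accessing a buffer moves it
// to the front of the list; the buffer to reuse is always at the back.
struct LRUPool {
    std::list<int> order;     // front = most recently used
    void access(int id) {
        order.remove(id);     // drop the old position, if present
        order.push_front(id); // bring to front
    }
    int victim() const { return order.back(); }
};
```

After accesses to buffers 1, 2, 3, 1 in that order, the reuse victim is buffer 2.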

Double Buffering: Read data from disk for the next buffer while the CPU is processing the previous buffer.

150

Programmer's View of Files

Logical view of files:

• An array of bytes.

• A file pointer marks the current position.

• Read bytes from current position (move file pointer).

• Write bytes to current position (move file pointer).

• Set file pointer to specified byte position.

C/C++ File Functions

FILE *fopen(char *filename, char *mode);

void fclose(FILE *stream);

Mode examples:

• "rb": open a binary le, read-only.

• "w+t": create a text le for reading and

writing.

size_t fread(void *ptr, size_t size, size_t n,
             FILE *stream);

if (numrec !=
    fread(recarr, sizeof rec, numrec, myfile))
  its_an_error();

size_t fwrite(void *ptr, size_t size, size_t n,
              FILE *stream);

int fseek(FILE *stream, long offset, int whence);

where whence is one of:

• SEEK_SET: Offset from file beginning.

• SEEK_CUR: Offset from current position.

• SEEK_END: Offset from end of file.
152

External Sorting

Problem: Sorting data sets too large to t in

main memory.

• Assume data stored on a disk drive.

Sections of the file are read into main memory, processed, and returned to disk.

An external sort should minimize disk accesses.

153

Model of External Computation

Secondary memory is divided into equal-sized

blocks (512, 2048, 4096 or 8192 bytes are

typical sizes).

The basic I/O operation transfers the contents

of one disk block to/from main memory.

Under certain circumstances, reading blocks of a file in sequential order is more efficient. (When?) [1) Adjacent logical blocks of the file are physically adjacent on disk. 2) No competition for the I/O head.]

Typically, the time to perform a single block I/O operation is sufficient to Quicksort the contents of the block.

Thus, our primary goal is to minimize the number of block I/O operations.

Most workstations today must do all sorting on

a single disk drive.

[So, the algorithm presented here is general for these

conditions.]

154

Key Sorting

Often records are large while keys are small.

• Ex: Payroll entries keyed on ID number.

Approach 1: Read in the records, sort them, then write them out again.

Approach 2: Read only the key values, store

with each key the location on disk of its

associated record.

If necessary, after the keys are sorted the

records can be read and re-written in sorted

order.

[But, this is not usually done. (1) It is expensive (random

access to all records). (2) If there are multiple keys, there is

no "correct" order.]

155

External Sort: Simple Mergesort

Quicksort requires random access to the entire

set of records.

Better: Modified Mergesort algorithm

• Process n elements in Θ(log n) passes.

1. Split the file into two files.

2. Read in a block from each file.

3. Take the first record from each block, output them in sorted order.

4. Take the next record from each block, output them to a second file in sorted order.

5. Repeat until finished, alternating between output files. Read new input blocks as needed.

6. Repeat steps 2-5, except this time the input files have groups of two sorted records that are merged together.

7. Each pass through the files provides larger and larger groups of sorted records.

A group of sorted records is called a run.

156

Problems with Simple Mergesort

36 17 28 23 20 36 14 28 13 17 20 36

20 13 14 15 13 17 15 23 14 15 23 28

Is each pass through the input and output files sequential? [yes]

What happens if all work is done on a single

disk drive? [Competition for I/O head eliminates

advantage of sequential processing.]

How can we reduce the number of Mergesort

passes? [Read in a block (or several blocks) and do an

in-memory sort to generate large initial runs.]

In general, external sorting consists of two

phases:

1. Break the file into initial runs.

2. Merge the runs together into a single sorted

run.

157

Breaking a file into runs

General approach:

• Read as much of the file into memory as possible.

• Perform an in-memory sort.

• Output this group of records as a single run.

158

Replacement Selection

1. Break available memory into an array for the heap, an input buffer, and an output buffer.

2. Fill the array from disk.

3. Make a min-heap.

4. Send the smallest value (root) to the output buffer.

5. If the next key in the file is greater than the last value output, then
     Replace the root with this key.
   else
     Replace the root with the last key in the array.
     Add the next record in the file to a new heap (actually, stick it at the end of the array).

[Figure: data flow — input file → input buffer → RAM (heap) → output buffer → run file.]

159
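The steps above can be sketched in memory with two heaps, one for the current run and one for deferred keys (a simplification: the input vector stands in for the disk file, and the heap-within-an-array trick is replaced by a second heap):

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sketch of replacement selection over an in-memory "file" (a vector).
// memSize plays the role of the heap array; keys smaller than the last
// value output are deferred to the next run. Returns the list of runs.
std::vector<std::vector<int>> replacementSelection(
        const std::vector<int>& input, std::size_t memSize) {
    using MinHeap =
        std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap current, next;           // current run's heap, deferred keys
    std::size_t pos = 0;
    while (pos < input.size() && current.size() < memSize)
        current.push(input[pos++]);  // fill the array from "disk"
    std::vector<std::vector<int>> runs(1);
    while (!current.empty()) {
        int smallest = current.top(); current.pop();
        runs.back().push_back(smallest);        // root -> output buffer
        if (pos < input.size()) {
            int k = input[pos++];
            if (k >= smallest) current.push(k); // can extend current run
            else next.push(k);                  // wait for the next run
        }
        if (current.empty() && !next.empty()) { // start a new run
            runs.emplace_back();
            std::swap(current, next);
        }
    }
    return runs;
}
```

With memory for 2 records, the input 4 2 9 1 7 produces the runs (2 4 9) and (1 7): key 1 arrives after 4 has been output, so it must wait for the second run.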

Example of Replacement Selection

[Figure: replacement selection trace showing, at each step, the next input key, the heap contents in memory, and the value sent to the output. When an input key (e.g. 14) is smaller than the last value output, it is placed at the end of the array to await the next run.]

160

Benet from Replacement Selection

Use double buffering to overlap input,

processing and output.

How many disk drives for greatest advantage?

Snowplow argument:

• A snowplow moves around a circular track

onto which snow falls at a steady rate.

• At any instant, there is a certain amount of

snow S on the track. Some falling snow

comes in front of the plow, some behind.

• During the next revolution of the snowplow,

all of this is removed, plus 1/2 of what falls

during that revolution.

• Thus, the plow removes 2S amount of

snow. Falling Snow

Existing snow

Snowplow Movement

Start time T

161

Simple Mergesort may not be Best

Simple Mergesort: Place the runs into two files.

• Merge the first two runs to the output file, then the next two runs, etc.

This process is repeated until only one run

remains.

• How many passes for r initial runs? [log r]

• Is the access pattern sequential? [Not if all on one disk drive.]

Is working memory well used? [No – only 2 blocks are used.]

Need a way to reduce the number of passes. [And use memory better.]

162

Multiway Merge

With replacement selection, each initial run is several blocks long.

Assume that each run is placed in a separate disk file.

We could then read the first block from each file into memory and perform an r-way merge.

When a buffer becomes empty, read a block from the appropriate run file.

Each record is read only once from disk during the merge process.

In practice, use only one file and seek to the appropriate block.
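An in-memory sketch of the r-way merge (each sorted vector stands in for a sorted run file; a real implementation would refill block buffers from disk instead):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// r-way merge driven by a min-heap of (value, run#, position) entries.
std::vector<int> multiwayMerge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t r = 0; r < runs.size(); r++)
        if (!runs[r].empty())
            heap.push(std::make_tuple(runs[r][0], r, (std::size_t)0));
    std::vector<int> out;
    while (!heap.empty()) {
        Entry e = heap.top(); heap.pop();
        int v = std::get<0>(e);
        std::size_t r = std::get<1>(e), i = std::get<2>(e);
        out.push_back(v);                    // smallest remaining record
        if (i + 1 < runs[r].size())          // refill from the same run
            heap.push(std::make_tuple(runs[r][i + 1], r, i + 1));
    }
    return out;
}
```

Merging the runs from the figure below — (5 10 15), (6 7 23), (12 18 20) — yields 5 6 7 10 12 15 18 20 23.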

[Figure: three input runs (5 10 15 ..., 6 7 23 ..., 12 18 20 ...) merged through an output buffer into 5 6 7 10 12 ...]

163

Limits to Single Pass Multiway

Merge

Assume working memory is b blocks in size.

How many runs can be processed at one time?

The runs are 2b blocks long (on average).

[Because of replacement selection.]

How big a file can be merged in one pass? [2b²]

Larger files can be merged in multiple passes, and the mergeable file size grows quickly! [In k merge passes, process 2b^(k+1) blocks.]

This approach trades Θ(log b) (possibly) sequential passes for a single or a very few random (block) access passes.

[Example: 128K → 32 4K blocks.

With replacement selection, get 256K-length runs.

One merge pass: 8MB. Two merge passes: 256MB.

Three merge passes: 8GB.]

164

General Principals of External

Sorting

In summary, a good external sorting algorithm

will seek to do the following:

• Make the initial runs as long as possible.

• At all stages, overlap input, processing and

output as much as possible.

• Use as much working memory as possible.

Applying more memory usually speeds

processing.

• If possible, use additional disk drives for more overlapping of processing with I/O, and allow for more sequential file processing.

165

Search

Given: Distinct keys k1, k2, ... kn and

collection T of n records of the form

(k1, I1), (k2, I2), ..., (kn, In)

where Ij is information associated with key kj

for 1 ≤ j ≤ n.

Search Problem: For key value K , locate the

record (kj , Ij ) in T such that kj = K .

Searching is a systematic method for locating

the record (or records) with key value kj = K .

A successful search is one in which a record

with key kj = K is found.

An unsuccessful search is one in which no

record with kj = K is found (and presumably

no such record exists).

166

Approaches to Search

1. Sequential and list methods (lists, tables,

arrays).

2. Direct access by key value (hashing).

3. Tree indexing methods.

167

Searching Ordered Arrays

Sequential Search

Binary Search

int binary(int K, int* array, int left, int right) {

// Return pos of ELEM in array (if any) with value K

int l = left-1;

int r = right+1; // l, r beyond bounds of array

while (l+1 != r) { // Stop when l and r meet

int i = (l+r)/2; // Look at middle of subarray

if (K < array[i]) r = i; // In left half

if (K == array[i]) return i; // Found it

if (K > array[i]) l = i; // In right half

}

return UNSUCCESSFUL; // Search value not in array

}

Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

Dictionary Search

168

Lists Ordered by Frequency

Order lists by (expected) frequency of occurrence.

• Perform sequential search.

Cost to access the first record: 1. Cost to access the second record: 2.

Expected search cost: C̄n = 1p1 + 2p2 + ... + npn. [pi is the probability of the ith record being accessed.]

Example: all records have equal frequency:

C̄n = Σ_{i=1}^{n} i/n = (n + 1)/2.

Example: Exponential frequency

pi = 1/2^i for 1 ≤ i ≤ n − 1, and pn = 1/2^(n−1).

[The last case makes the probabilities sum to 1.]

C̄n ≈ Σ_{i=1}^{n} (i/2^i) ≈ 2.

169

Zipf Distributions

Applications:

• Distribution for frequency of word usage in

natural languages.

• Distribution for populations of cities, etc.

Define the Zipf frequency distribution for n records as pi = 1/(i·Hn), where Hn = Σ_{i=1}^{n} 1/i ≈ log_e n.

C̄n = Σ_{i=1}^{n} i/(i·Hn) = n/Hn ≈ n/log_e n.
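A quick way to see the size of this saving (a sketch, not from the notes) is to evaluate n/Hn directly:

```cpp
// Sketch: expected sequential-search cost under the Zipf distribution,
// computed as n / Hn where Hn is the nth harmonic number.
double zipfCost(int n) {
    double Hn = 0.0;
    for (int i = 1; i <= n; i++) Hn += 1.0 / i;
    return n / Hn;
}
```

For n = 1000 this gives about 134, versus (n + 1)/2 = 500.5 for equal frequencies.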

80/20 rule: 80% of the accesses are to 20% of

the records.

For distributions following the 80/20 rule, C̄n ≈ 0.122n.

170

Self-Organizing Lists

Self-organizing lists modify the order of records

within the list based on the actual pattern of

record access.

Self-organizing lists use a rule called a heuristic for deciding how to reorder the list. These heuristics are similar to the rules for managing buffer pools.

[Buffer pools can be viewed as a form of self-organizing list.]

• Order by actual historical frequency of access. (Similar to the LFU buffer pool replacement strategy.)

• When a record is found, swap it with the first record on the list.

• Move-to-Front: When a record is found,

move it to the front of the list. [Not worse

than twice \best arrangement."]

• Transpose: When a record is found, swap

it with the record ahead of it. [A bad example:

keep swapping last two elements.]
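The move-to-front heuristic can be sketched on a linked list in a few lines (an illustrative sketch; the function name is not from the notes):

```cpp
#include <list>

// Sketch of the move-to-front heuristic. Returns the number of
// comparisons used to find key, or 0 if the key is absent.
int mtfSearch(std::list<int>& lst, int key) {
    int comparisons = 0;
    for (auto it = lst.begin(); it != lst.end(); ++it) {
        comparisons++;
        if (*it == key) {
            lst.erase(it);
            lst.push_front(key);  // found: move the record to the front
            return comparisons;
        }
    }
    return 0; // not found
}
```

On the list 1 2 3 4, searching for 3 costs 3 comparisons and leaves the list as 3 1 2 4, so a repeated search for 3 costs only 1.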

171

Example of Self-Organizing Tables

Application: Text compression.

Keep a table of words already seen, organized

via Move-to-Front Heuristic.

If a word has not yet been seen, send the word.

Otherwise, send the (current) index in the

table.

The car on the left hit the car I left.

The car on 3 left hit 3 5 I 5.

This is similar in spirit to Ziv-Lempel coding.

172

Searching in Sets

For dense sets (small range, many elements in the set): use a bit vector. [Bit i is 1 if element i is in the set, 0 otherwise.]

Example: To find all primes that are odd numbers, compute

0011010100010100 & 0101010101010101

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0
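The same intersection can be expressed with std::bitset (a sketch of the example above; the function name is illustrative):

```cpp
#include <bitset>

// Bit i is 1 when i is in the set. ANDing the prime set with the odd
// set yields the odd primes in the range 0..15.
std::bitset<16> oddPrimes() {
    std::bitset<16> primes, odds;
    for (int p : {2, 3, 5, 7, 11, 13}) primes.set(p);
    for (int i = 1; i < 16; i += 2) odds.set(i);
    return primes & odds;  // 3, 5, 7, 11, 13
}
```

The result has exactly the bits 3, 5, 7, 11, and 13 set, matching the row above.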

173

Hashing

Hashing: The process of mapping a key value

to a position in a table.

A hash function maps key values to positions. It is denoted by h.

A hash table is an array that holds the

records. It is denoted by T .

The hash table has M slots, indexed from 0 to

M − 1.

For a key value K and hash function h, h(K) = i, 0 ≤ i < M, such that key(T[i]) = K.

174

Hashing (continued)

Hashing is appropriate only for sets (no

duplicates).

Good for both in-memory and disk based

applications.

Answers the question \What record, if any, has

key value K ?"

[Not good for range queries.]

Example: Store the n records with keys in

range 0 to n − 1.

• Store the record with key i in slot i.

• Use hash function h(K ) = K.

175

Collisions

More reasonable example:

• Store about 1000 records with keys in

range 0 to 16,383.

• Impractical to keep a hash table with

16,384 slots.

• We must devise a hash function to map the

key range to a smaller table.

Given: hash function h and keys k1 and k2.

β is a slot in the hash table.

If h(k1) = β = h(k2), then k1 and k2 have a

collision at β under h.

Search for the record with key K :

1. Compute the table location h(K ).

2. Starting with slot h(K ), locate the record

containing key K using (if necessary) a

collision resolution policy.

Collisions are inevitable in most applications.

• Example: 23 people are likely to share a

birthday.

176

Hash Functions

A hash function MUST return a value within

the hash table range.

To be practical, a hash function SHOULD

evenly distribute the records stored among the

hash table slots.

Ideally, the hash function should distribute

records with equal probability to all hash table

slots. In practice, success depends on the

distribution of the actual records stored.

If we know nothing about the incoming key

distribution, evenly distribute the key range

over the hash table slots while avoiding obvious

opportunities for clustering.

If we have knowledge of the incoming distribution, use a distribution-dependent hash function.

177

Example Hash Functions

int h(int x) {

return(x % 16);

}

[This function looks only at the bottom 4 bits of the key.]

Mid-square method: square the key value,

take the middle r bits from the result for a

hash table of 2r slots.

[All bits contribute to the result.]
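One way to realize the mid-square method for 16-bit keys (an illustrative sketch; the bit positions chosen here are an assumption, not from the notes):

```cpp
// Mid-square sketch for 16-bit keys and a table of 2^r slots: square
// the key (the square fits in 32 bits) and extract the middle r bits.
unsigned midSquare(unsigned key, unsigned r) {
    unsigned sq = key * key;         // assumes key < 2^16
    unsigned shift = (32 - r) / 2;   // center an r-bit window
    return (sq >> shift) & ((1u << r) - 1);
}
```

Every bit of the key can influence the middle bits of the square, which is the point of the method.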

Sum the ASCII values of the letters and take

results modulo M .

int h(char x[10]) {

int i, sum;

for (sum=0, i=0; i<10; i++)

sum += (int) x[i];

return(sum % M);

}

178

ELF Hash

From Executable and Linking Format (ELF),

UNIX System V Release 4.

int ELFhash(char* key) {

unsigned long h = 0;

while(*key) {

h = (h << 4) + *key++;

unsigned long g = h & 0xF0000000L;

if (g) h ^= g >> 24;

h &= ~g;

}

return h % M;

}

179

Open Hashing

What to do when collisions occur?

Open hashing treats each hash table slot as a

bin.

0 1000 9530

1

2

3 3013

4

5

6

7 9877 2007 1057

8

9 9879

180

Bucket Hashing

Divide the hash table slots into buckets.

• Example: 8 slots/bucket.

Records hash to the first slot of the bucket, and fill the bucket. Go to overflow if necessary.

When searching, first check the proper bucket. Then check the overflow.

181

Closed Hashing

Closed hashing stores all records directly in the

hash table.

Each record i has a home position h(ki).

If i is to be inserted and another record already

occupies i's home position, then another slot

must be found to store i.

The new slot is found by a

collision resolution policy.

Search must follow the same policy to find records not in their home slots.

182

Collision Resolution

During insertion, the goal of collision resolution is to find a free slot in the table.

Probe Sequence: the series of slots visited

during insert/search by following a collision

resolution policy.

Let β0 = h(K ). Let (β0, β1, ...) be the series of

slots making up the probe sequence.

void hashInsert(Elem R) { // Insert R into hash table T

int home; // Home position for R

int pos = home = h(key(R));// Initial pos on sequence

for (int i=1; key(T[pos]) != EMPTY; i++) {

pos = (home + p(key(R), i)) % M; // Next slot

if (key(T[pos]) == key(R)) ERROR; //No duplicates

}

T[pos] = R; // Insert R

}

Elem hashSearch(int K) { // Search for record with key K
  int home; // Home position for K
  int pos = home = h(K); // Initial pos on sequence
  for (int i = 1; (key(T[pos]) != K) &&
                  (key(T[pos]) != EMPTY); i++)
    pos = (home + p(K, i)) % M; // Next pos on sequence
  if (key(T[pos]) == K) return T[pos]; // Found it
  else return UNSUCCESSFUL; // K not in hash table
}

183

Linear Probing

Use the probe function

int p(int K, int i) { return i; }

Linear probing simply goes to the next slot in

the table.

If the bottom is reached, wrap around to the

top.

To avoid an infinite loop, one slot in the table must always be empty.

184

Linear Probing Example

0 1001 0 1001

1 9537 1 9537

2 3016 2 3016

3 3

4 4

5 5

6 6

7 9874 7 9874

8 2009 8 2009

9 9875 9 9875

10 10 1052

(a) (b)

Records tend to cluster in the table under linear probing, since the probabilities for which slot to use next are not the same.

[For (a): prob(3) = 4/11, prob(4) = 1/11, prob(5) = 1/11, prob(6) = 1/11, prob(10) = 4/11.]

[For (b): prob(3) = 8/11, prob(4,5,6) = 1/11 each.]

185

Improved Linear Probing

Instead of going to the next slot, skip by some

constant c.

Warning: Pick M and c carefully.

The probe sequence SHOULD cycle through all

slots of the table.

[If M = 10 with c = 2, then we effectively have created 2 hash tables (evens vs. odds).]

Pick c to be relatively prime to M .

There is still some clustering.

• Example: c = 2. h(k1) = 3. h(k2) = 5.

• The probe sequences for k1 and k2 are

linked together.

186

Pseudo Random Probing

The ideal probe function would select the next

slot on the probe sequence at random.

An actual probe function cannot operate

randomly. (Why?)

Pseudo random probing:

• Select a (random) permutation of the

numbers from 1 to M − 1:

r1, r2, ..., rM−1

• All insertions and searches use the same

permutation.

Example: Hash table of size M = 101

• r1 = 2, r2 = 5, r3 = 32.

• h(k1) = 30, h(k2) = 28.

• Probe sequence for k1 is: [30, 32, 35, 62]

• Probe sequence for k2 is: [28, 30, 33, 60]

187

Quadratic Probing

Set the i'th value in the probe sequence as

(h(K) + i²) mod M.

Example: M = 101.

• h(k1) = 30, h(k2) = 29.

• Probe sequence for k1 is: [30, 31, 34, 39]

• Probe sequence for k2 is: [29, 30, 33, 38]

188

Double Hashing

Pseudo random probing eliminates primary

clustering.

If two keys hash to same slot, they follow the

same probe sequence. This is called

secondary clustering.

To avoid secondary clustering, need a probe

sequence to be a function of the original key

value, not just the home position.

Double hashing:

p(K, i) = i ∗ h2(K ) for 0 ≤ i ≤ M − 1.

[A good h2 returns values relatively prime to M.]

Example: Hash table of size M = 101

• h(k1) = 30, h(k2) = 28, h(k3) = 30.

• h2(k1) = 2, h2(k2) = 5, h2(k3) = 5.

• Probe sequence for k1 is: [30, 32, 34, 36]

• Probe sequence for k2 is: [28, 33, 38, 43]

• Probe sequence for k3 is: [30, 35, 40, 45]
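A sketch of the probe computation (illustrative code, not from the notes). Note that k1 and k3 share home slot 30 but diverge immediately because their h2 values differ, which is exactly what defeats secondary clustering:

```cpp
// Double hashing: probe i offsets the home slot h(K) by i * h2(K),
// where h2(K) returns a value relatively prime to M.
int doubleHashProbe(int home, int h2, int i, int M) {
    return (home + i * h2) % M;
}
```

With M = 101: home 30 and h2 = 2 gives 30, 32, 34, 36, while home 30 and h2 = 5 gives 30, 35, 40, 45, as in the example above.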

189

Analysis of Closed Hashing

The load factor is α = N/M where N is the

number of records currently in the table.

[Graph: expected number of record accesses for
Insert and Delete as a function of the load
factor α, for α from 0 to 1.0; the cost grows
sharply as α approaches 1.]

190

Deletion

1. Deleting a record must not hinder later

searches.

2. We do not want to make positions in the

hash table unusable because of deletion.

Both of these problems can be resolved by

placing a special mark in place of the deleted

record, called a tombstone.

A tombstone will not stop a search, but that

slot can be used for future insertions.

Unfortunately, tombstones do add to the

average path length.

Solutions:

1. Local reorganizations to try to shorten the

average path length.

2. Periodically rehash the table (by order of

most frequently accessed record).
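The tombstone rules above (a tombstone never stops a search, but an insertion may reuse the slot) can be sketched with a minimal closed hash table using linear probing. This is illustrative code, not from the course; the class and member names are made up:

```cpp
#include <vector>

// Minimal closed hash table with linear probing and tombstones.
// EMPTY stops a search; TOMB does not, but insert can reuse it.
class HashTable {
    enum State { EMPTY, FULL, TOMB };
    std::vector<int> key;
    std::vector<State> state;
    int M;
public:
    explicit HashTable(int m) : key(m), state(m, EMPTY), M(m) {}
    int h(int k) const { return k % M; }

    bool insert(int k) {
        int firstTomb = -1;
        for (int i = 0; i < M; i++) {
            int slot = (h(k) + i) % M;
            if (state[slot] == FULL && key[slot] == k) return false; // duplicate
            if (state[slot] == TOMB && firstTomb < 0) firstTomb = slot;
            if (state[slot] == EMPTY) {
                int target = (firstTomb >= 0) ? firstTomb : slot; // reuse tombstone
                key[target] = k; state[target] = FULL;
                return true;
            }
        }
        if (firstTomb >= 0) { key[firstTomb] = k; state[firstTomb] = FULL; return true; }
        return false; // table full
    }

    bool search(int k) const {
        for (int i = 0; i < M; i++) {
            int slot = (h(k) + i) % M;
            if (state[slot] == EMPTY) return false;  // empty slot stops the search
            if (state[slot] == FULL && key[slot] == k) return true;
            // TOMB: keep probing
        }
        return false;
    }

    bool remove(int k) {
        for (int i = 0; i < M; i++) {
            int slot = (h(k) + i) % M;
            if (state[slot] == EMPTY) return false;
            if (state[slot] == FULL && key[slot] == k) { state[slot] = TOMB; return true; }
        }
        return false;
    }
};
```

For example, with M = 11: insert 1 and 12 (which collides and probes to slot 2), delete 1, and a search for 12 still succeeds because the tombstone in slot 1 does not stop the probe; a later insert of 23 reuses that slot.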

191

Indexing

Goals:

• Store large files.

• Support multiple search keys.

• Support efficient insert, delete and range

queries.

Entry sequenced file: Order records by time

of insertion. [Not practical as a database organization.]

Use sequential search.

Index file: Organized, stores pointers to actual

records. [Could be a tree or other data structure.]

Primary key: A unique identier for records.

May be inconvenient for search.

Secondary key: an alternate search key, often

not unique for each record. Often used as the

search key.

192

Linear Indexing

Linear Index: an index le organized as a

simple sequence of key/record pointer pairs

where the key values are in sorted order.

If the index is too large to t in main memory,

a second level index may be used.

Linear indexing is good for searching variable

length records. [Also good for indexing an entry

sequenced file.]

Linear indexing is poor for insert/delete.
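Searching a linear index is a binary search over the sorted key/pointer pairs. A sketch (illustrative code, not from the notes), using the key and record positions from the figure below:

```cpp
#include <utility>
#include <vector>

// Linear index: key/record-position pairs sorted by key.
// Returns the record position for the key, or -1 if absent.
long findRecord(const std::vector<std::pair<int, long>>& index, int key) {
    int lo = 0, hi = (int)index.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (index[mid].first == key)
            return index[mid].second;
        if (index[mid].first < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

For the entry sequenced file 73 52 98 37 42, the index pairs are (37,3), (42,4), (52,1), (73,0), (98,2), so a search for key 52 returns record position 1.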

Linear Index:     37 42 52 73 98
Database Records: 73 52 98 37 42

[This is an entry sequenced file; each sorted
index key points to its record's position.]

[Second figure: a Second Level Index, kept in
main memory, holding the first key on each
disk page of the Linear Index.]

193

Inverted List

Inverted list is another term for a secondary

index. A secondary key is associated with a

primary key, which in turn locates the record.

Jones AA10 AB12 AB39 FF37

Smith AX33 AX35 ZX45

Zukowski ZQ99

[An improvement over 1D list: 2D table can save space of

duplicating secondary keys. But, may waste space in rows.

Easier to update than linear list, but still requires some

shifting of values.]

Secondary   Primary Keys
Key
Jones       AA10  AB12  AB39  FF37
Smith       AX33  AX35  ZX45  -
Zukowski    ZQ99  -     -     -

194

Inverted List (Continued)

Secondary        Index        Primary
Key                           Key    Next

Jones            0        0   AA10   4
Smith            1        1   AX33   6
Zukowski         3        2   ZX45   -
                          3   ZQ99   -
                          4   AB12   5
                          5   AB39   7
                          6   AX35   2
                          7   FF37   -

[A linked list.]
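Following a secondary key's chain in this representation is a plain linked-list traversal over the array. A sketch (illustrative code, not from the notes), using the table's data:

```cpp
#include <string>
#include <vector>

// One entry of the primary-key array: a primary key plus the index of
// the next entry with the same secondary key (-1 ends the chain).
struct Entry {
    std::string primary;
    int next;
};

// Collect every primary key on the chain starting at `head`.
std::vector<std::string> primariesFor(int head, const std::vector<Entry>& a) {
    std::vector<std::string> out;
    for (int i = head; i != -1; i = a[i].next)
        out.push_back(a[i].primary);
    return out;
}
```

Starting from index 0 (Jones) the chain visits AA10, AB12, AB39, FF37, matching the table above.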

195

Tree Indexing

Linear index is poor for insertion/deletion.

Tree index can efficiently support all desired

operations:

• Insert/delete

• Multiple search keys [Multiple tree indices.]

• Key range search

Problems:

1. Tree must be balanced. [Minimize disk accesses.]

2. Each path from root to a leaf should cover

few disk pages.

[Figure: two height balanced BSTs. (a) root 5
with children 3 and 7 and leaves 2, 4, 6;
(b) root 4 with children 2 and 6 and leaves
1, 3, 5, 7.]

196

2-3 Tree

A 2-3 Tree has the following properties:

1. A node contains one or two keys.

2. Every internal node has either two children

(if it contains one key) or three children (if

it contains two keys).

3. All leaves are at the same level in the tree,

so the tree is always height balanced.

The 2-3 Tree also has a search tree property

analogous to the BST.

The advantage of the 2-3 Tree over the BST is

that it can be updated at low cost.

18 33

12 23 30 48

10 15 20 21 24 31 45 47 50 52

197

2-3 Tree Insertion

[Insert 14: the leaf containing only 15 has
room, so it simply becomes "14 15".]

[Insert 55 (into the original tree): the leaf
"50 52" is full. It splits into leaves "50" and
"55", and 52 is promoted into the parent, which
becomes "48 52":]

18 33
12 | 23 30 | 48 52
10 | 15 | 20 21 | 24 | 31 | 45 47 | 50 | 55

198

2-3 Tree Splitting

[Insert 19: the leaf "20 21" is full. Adding 19
gives "19 20 21", which splits into leaves "19"
and "21", promoting 20 into the parent "23 30".
That node overflows as "20 23 30" (a) and
splits into "20" and "30", promoting 23 (b).
Promoting 23 into the root "18 33" overflows it
as well, so the root splits and 23 becomes the
new root (c):]

23
18 | 33
12 | 20 | 30 | 48
10 15 | 19 | 21 | 24 31 | 45 47 | 50 52

199

B-Trees

The B-Tree is an extension of the 2-3 Tree.

The B-Tree is now the standard file

organization for applications requiring insertion,

deletion and key range searches.

1. B-Trees are always balanced.

2. B-Trees keep related records on a disk

page, which takes advantage of locality of

reference.

3. B-Trees guarantee that every node in the

tree will be full at least to a certain

minimum percentage. This improves space

efficiency while reducing the typical number

of disk fetches necessary during a search or

update operation.

200

B-Trees (Continued)

A B-Tree of order m has the following

properties.

• The root is either a leaf or has at least two

children.

• Each node, except for the root and the

leaves, has between ⌈m/2⌉ and m children.

• All leaves are at the same level in the tree,

so the tree is always height balanced.

A B-Tree node is usually selected to match the

size of a disk block.

A B-Tree node could have hundreds of children.

201

B-Tree Example

Search in a B-Tree is a generalization of search

in a 2-3 Tree.

1. Perform a binary search on the keys in the

current node. If the search key is found,

then return the record. If the current node

is a leaf node and the key is not found,

then report an unsuccessful search.

2. Otherwise, follow the proper branch and

repeat the process.

24

15 20 33 45 48

10 12 18 21 23 30 31 38 47 50 52 60
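The two search steps can be sketched as follows (an illustrative in-memory version, not the course's code; a real implementation would read nodes from disk pages). The demo tree mirrors the figure above:

```cpp
#include <vector>

// A B-Tree node: sorted keys, plus one child pointer per gap between
// keys. `children` is empty for a leaf.
struct BNode {
    std::vector<int> keys;
    std::vector<BNode*> children;
};

bool btreeSearch(const BNode* node, int key) {
    if (node == nullptr) return false;
    // Step 1: binary search on the keys in the current node.
    int lo = 0, hi = (int)node->keys.size();
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (node->keys[mid] == key) return true;
        if (node->keys[mid] < key) lo = mid + 1; else hi = mid;
    }
    if (node->children.empty()) return false;  // leaf: unsuccessful search
    // Step 2: follow the proper branch and repeat the process.
    return btreeSearch(node->children[lo], key);
}

// Build the example tree from the figure:
// root 24; internal nodes 15 20 and 33 45 48.
BNode* demoTree() {
    auto leaf = [](std::vector<int> ks) { return new BNode{ks, {}}; };
    BNode* left = new BNode{{15, 20},
        {leaf({10, 12}), leaf({18}), leaf({21, 23})}};
    BNode* right = new BNode{{33, 45, 48},
        {leaf({30, 31}), leaf({38}), leaf({47}), leaf({50, 52, 60})}};
    return new BNode{{24}, {left, right}};
}
```

A search for 47 compares against 24 at the root, descends right, brackets 47 between 45 and 48, and finds it in the third leaf of that subtree.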

202

B+-Trees

The most commonly implemented form of the

B-Tree is the B+-Tree.

Internal nodes of the B+-Tree do not store

records, only key values to guide the search.

Leaf nodes store records or pointers to records.

A leaf node may store more or fewer records than

an internal node stores keys.

[Assume leaves can store 5 values, internal nodes 3 keys (4

children).]

33

18 23 48

10 12 15 18 19 20 21 22 23 30 31 33 45 47 48 50 52

203

B+-Tree Insertion

[Note special rule for root: May have only two children.]

(a)  10 12 23 33 48            [A single leaf.]

[Add 50: the leaf overflows and splits:]

(b)  33
     10 12 23 | 33 48 50

[Add 45, 52, 47 (split), 18, 15, 31 (split),
21, 20:]

(c)  18 33 48
     10 12 15 | 18 20 21 23 31 | 33 45 47 | 48 50 52

[Add 30: the second leaf splits, and promoting
23 overflows the root, which splits as well:]

(d)  33
     18 23 | 48
     10 12 15 | 18 20 21 | 23 30 31 | 33 45 47 | 48 50 52

204

B+-Tree Deletion

[Simple delete: delete 18 from the original
example. The key 18 in the internal node may
remain, since internal keys only direct the
search:]

33
18 23 | 48
10 12 15 | 19 20 21 22 | 23 30 31 | 33 45 47 | 48 50 52

[Delete 12 instead: the leaf "10 12 15" drops
below minimum occupancy, so it borrows 18 from
its right sibling, and the separating key in
the parent becomes 19:]

33
19 23 | 48
10 15 18 | 19 20 21 22 | 23 30 31 | 33 45 47 | 48 50 52

205

B-Tree Space Analysis

B+-Tree nodes are always at least half full.

The B∗-Tree splits two pages into three, and

combines three pages into two. In this way,

nodes are always at least 2/3 full.

Asymptotic cost of search, insertion and

deletion of records from B-Trees, B+-Trees and

B∗-Trees is Θ(log n). (The base of the log is

the (average) branching factor of the tree.)

Example: Consider a B+-Tree of order 100

with leaf nodes containing 100 records.

1 level B+-Tree: [Max: 100 records.]

2 level B+-Tree: [Min: 2 leaves of 50 for 100
records. Max: 100 leaves of 100 for 10,000
records.]

3 level B+-Tree: [Min: 2 × 50 leaves of 50 for
5,000 records. Max: 100² leaves of 100 for
1,000,000 records.]

4 level B+-Tree: [Min: 2 × 50 × 50 leaves of 50
for 250,000 records. Max: 100³ leaves of 100
for 100 million records (100 × 100 × 100 × 100).]
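The level counts above can be reproduced with a small calculation (an illustrative sketch, not from the notes). It assumes internal nodes have at most 100 children and at least 50, except the root with at least 2, and leaves hold at most 100 records, at least half full:

```cpp
// Max records in a B+-Tree of `levels` levels: full fan-out at every
// internal level, and every leaf full.
long long maxRecords(int levels, long long order, long long leafCap) {
    long long leaves = 1;
    for (int i = 1; i < levels; i++)
        leaves *= order;               // each internal level multiplies fan-out
    return leaves * leafCap;
}

// Min records: the root has only 2 children, every other internal node
// order/2 children, and every leaf is half full. (A 1-level tree is a
// lone root leaf, which may hold as little as one record.)
long long minRecords(int levels, long long order, long long leafCap) {
    if (levels == 1) return 1;
    long long leaves = 2;              // root: at least two children
    for (int i = 2; i < levels; i++)
        leaves *= order / 2;           // other internal levels: half full
    return leaves * (leafCap / 2);     // leaves: half full
}
```

With order 100 and leaf capacity 100 this reproduces the figures above: a 3-level tree holds between 5,000 and 1,000,000 records, a 4-level tree between 250,000 and 100 million.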

Ways to reduce the number of disk fetches:

• Keep the upper levels in memory.

• Manage B+-Tree pages with a buffer pool.

206
