You are on page 1of 207

COURSENOTES

CS2604:
Data Structures and File Processing
C++ Edition
Cli ord A. Sha er
Department of Computer Science
Virginia Tech
Copyright c 1995, 1996, 1998
The Need for Data Structures
[A primary concern of this course is eciency.]
Data structures organize data
⇒ more ecient programs. [You might
believe that faster computers make it unnecessary to be
concerned with eciency. However...]
• More powerful computers ⇒ more complex
applications.
• More complex applications demand more
calculations.
• Complex computing tasks are unlike our
everyday experience. [So we need special
training]
Any organization for a collection of records can
be searched, processed in any order, or
modi ed. [If you are willing to pay enough in time delay.
Ex: Simple unordered array of records.]
• The choice of data structure and algorithm
can make the di erence between a program
running in a few seconds or many days.

1
Eciency
A solution is said to be ecient if it solves the
problem within its resource constraints. [Alt:
Better than known alternatives (\relatively" ecient)]
• space [These are typical contraints for programs]
• time
[This does not mean always strive for the most ecient
program. If the program operates well within resource
constraints, there is no bene t to making it faster or smaller.]
The cost of a solution is the amount of
resources that the solution consumes.

2
Selecting a Data Structure
Select a data structure as follows:
1. Analyze the problem to determine the
resource constraints a solution must meet.
2. Determine the basic operations that must
be supported. Quantify the resource
constraints for each operation.
3. Select the data structure that best meets
these requirements.
[Typically want the \simplest" data struture that will meet
requirements.]
Some questions to ask: [These questions often help
to narrow the possibilities]
• Are all data inserted into the data structure
at the beginning, or are insertions
interspersed with other operations?
• Can data be deleted? [If so, a more complex
representation is typically required]
• Are all data processed in some well-de ned
order, or is random access allowed?

3
Data Structure Philosophy
Each data structure has costs and bene ts.
Rarely is one data structure better than
another in all situations.
A data structure requires:
• space for each data item it stores, [Data +
Overhead]
• time to perform each basic operation,
• programming e ort. [Some data
structures/algorithms more complicated than others]
Each problem has constraints on available
space and time.
Only after a careful analysis of problem
characteristics can we know the best data
structure for the task.
Bank example:
• Start account: a few minutes
• Transactions: a few seconds
• Close account: overnight

4
Goals of this Course
1. Reinforce the concept that there are costs
and bene ts for every data structure. [A
worldview to adopt]
2. Learn the commonly used data structures.
These form a programmer's basic data
structure \toolkit." [The \nuts and bolts" of the
course]
3. Understand how to measure the
e ectiveness of a data structure or
program.
• These techniques also allow you to judge
the merits of new data structures that
you or others might invent. [To prepare
you for the future]

5
De nitions
A type is a set of values.
[Ex: Integer, Boolean, Float]
A data type is a type and a collection of
operations that manipulate the type.
[Ex: Addition]
A data item or element is a piece of
information or a record.
[Physical instantiation]
A data item is said to be a member of a data
type.
[]
A simple data item contains no subparts.
[Ex: Integer]
An aggregate data item may contain several
pieces of information.
[Ex: Payroll record, city database record]

6
Abstract Data Types
Abstract Data Type (ADT): a de nition for a
data type solely in terms of a set of values and
a set of operations on that data type.
Each ADT operation is de ned by its inputs
and outputs.
Encapsulation: hide implementation details
A data structure is the physical
implementation of an ADT.
• Each operation associated with the ADT is
implemented by one or more subroutines in
the implementation.
Data structure usually refers to an
organization for data in main memory.
File structure: an organization for data on
peripheral storage, such as a disk drive or tape.
An ADT manages complexity through
abstraction: metaphor. [Hierarchies of labels]
[Ex: transistors → gates → CPU. In a program, implement an
ADT, then think only about the ADT, not its implementation]
7
Logical vs. Physical Form
Data items have both a logical and a physical
form.
Logical form: de nition of the data item within
an ADT. [Ex: Integers in mathematical sense: +, −]
Physical form: implementation of the data item
within a data structure. [16/32 bit integers: over ow]
Data Type
ADT: Data Items:
 Type
 Operations
Logical Form

Data Structure: Data Items:


{ Storage Space Physical Form
{ Subroutines
[In this class, we frequently move above and below \the line"
separating logical and physical forms.]
8
Problems
Problem: a task to be performed.
• Best thought of as inputs and matching
outputs.
• Problem de nition should include
constraints on the resources that may be
consumed by any acceptable solution. [But
NO constraints on HOW the problem is solved]
Problems ⇔ mathematical functions
• A function is a matching between inputs
(the domain) and outputs (the range).
• An input to a function may be single
number, or a collection of information.
• The values making up an input are called
the parameters of the function.
• A particular input must always result in the
same output every time the function is
computed.

9
Algorithms and Programs
Algorithm: a method or a process followed to
solve a problem. [A recipe]
An algorithm takes the input to a problem
(function) and transforms it to the output. [A
mapping of input to output]
A problem can have many algorithms.
An algorithm possesses the following properties:
1. It must be correct. [Computes proper function]
2. It must be composed of a series of
concrete steps. [Executable by that machine]
3. There can be no ambiguity as to which
step will be performed next.
4. It must be composed of a nite number of
steps.
5. It must terminate.
A computer program is an instance, or
concrete representation, for an algorithm in
some programming language.
[We frequently interchange use of \algorithm" and \program"
though they are actually di erent concepts]
10
Mathematical Background
[Look over Chapter 2, read as needed depending on your
familiarity with this material.]
Set concepts and notation [Set has no duplicates,
sequence may]
Recursion
Induction proofs
Logarithms [Almost always use log to base 2. That is our
default base.]
Summations

11
Estimation Techniques
Known as \back of the envelope" or \back of
the napkin" calculation.
1. Determine the major parameters that a ect
the problem.
2. Derive an equation that relates the
parameters to the problem.
3. Select values for the parameters, and apply
the equation to yield an estimated solution.
Example:
How many library bookcases does it take to
store books totaling one million pages?
Estimate:
• pages/inch [guess 500]
• feet/shelf [guess 4 (actually, 3)]
• shelves/bookcase [guess 5 (actually, 7)]

[Units check: pages/in × ft/shelf × shelf/bkcase ⇒


pages/bkcase]

12
Algorithm Eciency
There are often many approaches (algorithms)
to solve a problem. How do we choose between
them?
At the heart of computer program design are
two (sometimes con icting) goals:
1. To design an algorithm that is easy to
understand, code and debug.
2. To design an algorithm that makes ecient
use of the computer's resources.
Goal (1) is the concern of Software
Engineering.
Goal (2) is the concern of data structures and
algorithm analysis.
When goal (2) is important, how do we
measure an algorithm's cost?

13
How to Measure Eciency?
1. Empirical comparison (run programs).
[Dicult to do \fairly." Time consuming.]
2. Asymptotic Algorithm Analysis.
Critical resources:
[Time. Space (disk, RAM). Programmer's e ort. Ease of use
(user's e ort).]

Factors a ecting running time:


[Machine load. OS. Compiler. Problem size. Speci c input
values for given problem size.]

For most algorithms, running time depends on


\size" of the input.
Running time is expressed as T(n) for some
function T on input size n.
14
Examples of Growth Rate
Example 1: [As n grows, how does T(n) grow?]
int largest(int* array, int n) { // Find largest value
int currlarge = array[0]; // Store largest seen
for (int i=1; i<n; i++) // For each element
if (array[i] > currlarge) // If largest
currlarge = array[i]; // Remember it
return currlarge; // Return largest
}

[Cost: T(n) = c n + c steps]


1 2

Example 2: Assignment statement [Constant cost]


Example 3:
sum = 0;
for (i=1; i<=n; i++)
for (j=1; j<=n; j++)
sum++;

[Cost: T(n) = c n + c Roughly n steps, with sum being n at


1
2
2
2 2

the end. Ignore various overhead such as loop counter


increments.]

15
Growth Rate Graph
[2n is an exponential algorithm. 10n and 20n di er only by a
constant.]
2n 2n2 5n log n
1400

1200
20n
1000

800

600 10n
400

200

0
0 10 20 30 40 50

2n 2n2
400

20n
300

5n log n
200

10n
100

0
0 5 10 1615
Input size n
Best, Worst and Average Cases
Not all inputs of a given size take the same
time.
Sequential search for K in an array of n
integers:
• Begin at rst element in array and look at
each element in turn until K is found.
Best Case: [Find at rst position: 1 compare]
Worst Case: [Find at last position: n compares]
Average Case: [(n + 1)/2 compares]
While average time seems to be the fairest
measure, it may be dicult to determine.
[Depends on distribution. Assumption for above analysis:
Equally likely at any position.]
When is worst case time important?
[Real time algorithms]

17
Faster Computer or Algorithm?
What happens when we buy a computer 10
times faster? [How much speedup? 10 times. More
important: How much increase in problem size for same time?
Depends on growth rate.]
T(n) n n0 Change n0/n
10n 1, 000 10 000, n0 = 10n 10
20n 500 5 000 , n

0 = 10n 10
5n log n 250 1 842 , 10n√< n0 < 10n 7.37
2n2 70 223 n0 = 10n 3.16
2n 13 16 n0 = n + 3 −−
[For n , if n = 1000, then n0 would be 1003]
2

n: Size of input that can be processed in one


hour (10,000 steps).
n0: Size of input that can be processed in one
hour on the new machine (100,000 steps).
[Compare T(n) = n to T(n) = n log n. For n > 58, it is faster to
2

have the (n log n) algorithm than to have a computer that is


10 times faster.]

18
Asymptotic Analysis: Big-oh
De nition: For T(n) a non-negatively valued
function, T(n) is in the set O(f (n)) if there
exist two positive constants c and n0 such that
T(n) ≤ cf (n) for all n > n0.

Usage: The algorithm is in O(n2) in [best,


average, worst] case.
Meaning: For all data sets big enough (i.e.,
n > n0), the algorithm always executes in less
than cf (n) steps [in best, average or worst
case].
[Must pick one of these to complete the statement. Big-oh
notation applies to some set of inputs.]
Upper Bound.
Example: if T(n) = 3n2 then T(n) is in O(n2).
Wish tightest upper bound:
While T(n) = 3n2 is in O(n3), we prefer O(n2).
[It provides more information to say O(n ) than O(n )]
2 3

19
Big-oh Example
Example 1. Finding value X in an array. [Average
case]
T(n) = csn/2. [cs is a constant. Actual value is irrelevant]
For all values of n > 1, csn/2 ≤ csn.
Therefore, by the de nition, T(n) is in O(n) for
n0 = 1 and c = cs.

Example 2. T(n) = c1n2 + c2n in average case


c1n2 + c2 n ≤ c1n2 + c2n2 ≤ (c1 + c2)n2 for all
n > 1.

T(n) ≤ cn2 for c = c1 + c2 and n0 = 1.

Therefore, T(n) is in O(n2) by the de nition.


Example 3: T(n) = c. We say this is in O(1).
[Rather than O(c)]

20
Big-Omega
De nition: For T(n) a non-negatively valued
function, T(n) is in the set
(g(n)) if there
exist two positive constants c and n0 such that
T(n) ≥ cg (n) for all n > n0.

Meaning: For all data sets big enough (i.e.,


n > n0), the algorithm always executes in more
than cg(n) steps.
Lower Bound.
Example: T(n) = c1n2 + c2n.
c1n2 + c2 n ≥ c1n2 for all n > 1.
T(n) ≥ cn2 for c = c1 and n0 = 1.

Therefore, T(n) is in
(n2) by the de nition.
Want greatest lower bound.

21
Theta Notation
When big-Oh and
meet, we indicate this by
using  (big-Theta) notation.
De nition: An algorithm is said to be (h(n))
if it is in O(h(n)) and it is in
(h(n)).
[For polynomial equations on T(n), we always have . There
is no uncertainty, a \complete" analysis.]
Simplifying Rules:
1. If f (n) is in O(g(n)) and g(n) is in O(h(n)),
then f (n) is in O(h(n)).
2. If f (n) is in O(kg(n)) for any constant
k > 0, then f (n) is in O(g (n)). [No constant]
3. If f1(n) is in O(g1(n)) and f2(n) is in
O(g2(n)), then (f1 + f2)(n) is in
O(max(g1(n), g2(n))). [Drop low order terms]
4. If f1(n) is in O(g1(n)) and f2(n) is in
O(g2(n)) then f1(n)f2(n) is in
O(g1(n)g2(n)). [Loops]

22
Running Time of a Program
[Asymptotic analysis is de ned for equations. Need to convert
program to an equation.]
Example 1: a = b;
This assignment takes constant time, so it is
(1). [Not (c) { notation by tradition]
Example 2:
sum = 0;
for (i=1; i<=n; i++)
sum += n;
[(n) (even though sum is n )]
2

Example 3:
sum = 0;
for (j=1; j<=n; j++) // First for loop
for (i=1; i<=j; i++) // is a double loop
sum++;
for (k=0; k<n; k++) // Second for loop
A[k] = k;
P
[First statement is (1). Double for loop is i = (n ). Final
2

for loop is (n). Result: (n ).]


2

23
More Examples
Example 4.
sum1 = 0;
for (i=1; i<=n; i++) // First double loop
for (j=1; j<=n; j++) // do n times
sum1++;

sum2 = 0;
for (i=1; i<=n; i++) // Second double loop
for (j=1; j<=i; j++) // do i times
sum2++;
[First loop, sum is n . Second loop, sum is (n + 1)(n)/2. Both
2

are (n ).]


2

Example 5.
sum1 = 0;
for (k=1; k<=n; k*=2)
for (j=1; j<=n; j++)
sum1++;

sum2 = 0;
for (k=1; k<=n; k*=2)
for (j=1; j<=k; j++)
sum2++;
Plog n
[First is k=1
n = (n log n). Second is P log n−1
k=0
2 = (n).]
k

24
Binary Search
Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

int binary(int K, int* array, int left, int right) {


// Return position of element (if any) with value K
int l = left-1;
int r = right+1; // l, r are beyond array bounds
while (l+1 != r) { // Stop when l and r meet
int i = (l+r)/2; // Look at middle of subarray
if (K < array[i]) r = i; // In left half
if (K == array[i]) return i; // Found it
if (K > array[i]) l = i; // In right half
}
return UNSUCCESSFUL; // Search value not in array
}

Analysis: How many elements can be examined


in the worst case? [(log n)]

25
Other Control Statements
while loop: analyze like a for loop.
if statement: Take greater complexity of
then/else clauses.
[If probabilities are independent of n.]
switch statement: Take complexity of most
expensive case.
[If probabilities are independent of n.]
Subroutine call: Complexity of the subroutine.

26
Analyzing Problems
[Typically do a lot of this in a senior algorithms course.]
Upper bound: Upper bound of best known
algorithm.
Lower bound: Lower bound for every possible
algorithm.
[The examples so far have been easy in that exact equations
always yield . Thus, it was hard to distinguish
and O.
Following example should help to explain the di erence {
bounds are used to describe our level of uncertainty about an
algorithm.]
Example: Sorting
1. Cost of I/O:
(n)
2. Bubble or insertion sort: O(n2)
3. A better sort (Quicksort, Mergesort,
Heapsort, etc.): O(n log n)
4. We prove later that sorting is
(n log n)

27
Multiple Parameters
[Ex: 256 colors (8 bits), 1000 × 1000 pixels]
Compute the rank ordering for all C pixel
values in a picture of P pixels.
for (i=0; i<C; i++) // Initialize count
count[i] = 0;
for (i=0; i<P; i++) // Look at all of the pixels
count[value(i)]++; // Increment proper value count
sort(count); // Sort pixel value counts

If we use P as the measure, then time is


(P log P ).
More accurate is (P + C log C ).

28
Space Bounds
Space bounds can also be analyzed with
asymptotic complexity analysis.
Time: Algorithm
Space: Data Structure
Space/Time Tradeo Principle:
One can often achieve a reduction in time is
one is willing to sacri ce space, or vice versa.
• Encoding or packing information
Boolean ags
• Table lookup
Factorials
Disk Based Space/Time Tradeo Principle:
The smaller you can make your disk storage
requirements, the faster your program will run.

29
Lists
[Students should already be familiar with lists. Objectives: use
alg analysis in familiar context, compare implementations.]
A list is a nite, ordered sequence of data
items called elements.
[The positions are ordered, NOT the values.]
Each list element has a data type.
The empty list contains no elements.
The length of the list is the number of
elements currently stored.
The beginning of the list is called the head,
the end of the list is called the tail.
Sorted lists have their elements positioned in
ascending order of value, while unsorted lists
have no necessary relationship between element
values and positions.
Notation: ( a0, a1, ..., an−1 )
What operations should we implement?
[Add/delete elem anywhere, nd, next, prev, test for empty.]
30
List ADT
class List { // List class ADT
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert Elem at curr
void append(const Elem); // Insert Elem at tail
Elem remove(); // Remove and return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return current length
void setPos(int); // Set curr to position
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if curr in list
bool find(int); // Find value
};
[This is an example of an ADT. Our list implementations will
match. Note that the generic type \Elem" is being used for
the element type.]

31
List ADT Examples
List: ( 12, 32, 15 )
MyList.insert(99);

[The above is an example use of the insert function. MyList is


an assumed parameter, 99 is another parameter.]
Assume MyPos has 32 as current element:
[Put 99 before current element, yielding (12, 99, 32, 15).]
Process an entire list:
for (MyList.first(); MyList.isInList(); MyList.next())
DoSomething(MyList.currValue());

32
Array-Based List Insert

[Push items up/down. Cost: (n).]


Insert 23:

13 12 20 8 3 13 12 20 8 3
0 1 2 3 4 5 0 1 2 3 4 5
(a) (b)

23 13 12 20 8 3
0 1 2 3 4 5
(c)

33
Array-Based List Class
class List { // Array-based list class
private:
int msize; // Maximum size of list
int numinlist; // Actual number of Elems
int curr; // Position of "current"
Elem* listarray; // Array of list Elems
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert Elem at curr
void append(const Elem); // Insert Elem at tail
Elem remove(); // Remove and return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return current length
void setPos(int); // Set curr to position
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if curr in list
bool find(int); // Find value
};

34
Array-Based List Implementation
List::List(int sz) // Constructor
{ msize = sz; numinlist = 0; curr = 0;
listarray = new Elem[sz]; }

List::~List() { delete [] listarray; } // Destructor

void List::clear() { numinlist = 0; curr = 0; }

// Insert Elem at current position


void List::insert(const Elem item) {
assert((numinlist < msize) && (curr >=0) &&
(curr <= numinlist));
for(int i=numinlist; i>curr; i--) // Shift Elems up
listarray[i] = listarray[i-1];
listarray[curr] = item;
numinlist++; // Increment size
}

void List::append(const Elem item) { // Insert at tail


assert(numinlist < msize); // Can’t be full
listarray[numinlist++] = item; // Increment size
}

Elem List::remove() { // Remove/return current Elem


assert(!isEmpty() && isInList()); // Must have Elem
Elem temp = listarray[curr]; // Store Elem
for(int i=curr; i<numinlist-1; i++) // Shift down
listarray[i] = listarray[i+1];
numinlist--; return temp;
}
35
List Implementation (Cont)
void List::setFirst() { curr = 0; } // Set to first pos

void List::prev() { curr--; } // Move curr to prev pos

void List::next() { curr++; } // Move curr to next pos

int List::length() const { return numinlist; }

void List::setPos(int pos) {curr = pos; }

void List::setValue(const Elem val) // Set curr value


{ assert(isInList()); listarray[curr] = val; }

Elem List::currValue() const // Return current value


{ assert(isInList()); return listarray[curr]; }

bool List::isEmpty() const { return numinlist == 0; }

bool List::isInList() const // TRUE if in list


{ return (curr >= 0) && (curr < numinlist); }

bool List::find(int val) { // Find value


while (isInList()) // Stop if reach end
if (key(currValue()) == val) return TRUE; // Found
else next();
return FALSE; // Not found
}

36
Link Class
Dynamic allocation of new list elements.
class Link { // Singly-linked node
public:
Elem element; // Elem value for node
Link *next; // Pointer to next node
Link(const Elem elemval, Link* nextval =NULL)
{ element = elemval; next = nextval; }
Link(Link* nextval =NULL) { next = nextval; }
};

37
Linked List Position
head curr tail

20 23 12 15

(a)

head curr tail

20 23 10 12 15
[Naive approach: Point to current
(b) node. Current is 12. Want
to insert node with 10. No access available to node with 23.
How can we do the insert?]
head curr tail

20 23 12 15

(a)

head curr tail

20 23 10 12 15

(b)

[Alt implementation: Point to node preceding actual current


node. Now we can do the insert. Also note use of header
node.]

38
Linked List Class
class List { // Linked list class
private:
Link* head; // Pointer to list header
Link* tail; // Pointer to last Elem
Link* curr; // Pos of "current" Elem
public:
List(int =LIST_SIZE); // Constructor
~List(); // Destructor
void clear(); // Remove all Elems
void insert(const Elem); // Insert at current pos
void append(const Elem); // Insert at tail of list
Elem remove(); // Remove/return Elem
void setFirst(); // Set curr to first pos
void prev(); // Move curr to prev pos
void next(); // Move curr to next pos
int length() const; // Return length
void setPos(int); // Set current pos
void setValue(const Elem); // Set current value
Elem currValue() const; // Return current value
bool isEmpty() const; // TRUE if list is empty
bool isInList() const; // TRUE if now in list
bool find(int); // Find value
};

39
Linked List Insertion
// Insert Elem at current position
void List::insert(const Elem item) {
assert(curr != NULL); // Must be pointing to Elem
curr->next = new Link(item, curr->next);
if (tail == curr) // Appended new Elem
tail = curr->next;
}

curr

... 23 12 ...

Insert 10: 10

(a)

curr

... 23 12 ...

3
10
1 2
(b)

40
Linked List Remove
Elem List::remove() { // Remove/return Elem
assert(isInList()); // Must be valid pos
Elem temp = curr->next->element; // Remember value
Link* ltemp = curr->next; // Remember link
curr->next = ltemp->next; // Remove from list
if (tail == ltemp) tail = curr; // Set tail
delete ltemp; // Free link
return temp; // Return value
}

curr

... 23 10 15 ...

(a)

curr 2

... 23 10 15 ...

it 1
(b)

41
Freelists
System new and delete are slow.
class Link { // Singly-linked node
public: // with freelist
Elem element; // Elem value for node
Link* next; // Pointer to next node
static Link* freelist; // Link class freelist
Link(const Elem elemval, Link* nextval =NULL)
{ element = elemval; next = nextval; }
Link(Link* nextval =NULL) { next = nextval; }
void* operator new(size_t); // Overloaded new
void operator delete(void*); // Overloaded delete
};

Link* Link::freelist = NULL; // Create static variable

void* Link::operator new(size_t) { // Overload new


if (freelist == NULL) return ::new Link; // New space
Link* temp = freelist; // Or get from freelist
freelist = freelist->next;
return temp; // Return the link
}

void Link::operator delete(void* ptr) { // Overload


((Link*)ptr)->next = freelist; // Put on freelist
freelist = (Link*)ptr;
}

42
Comparison of List Implementations
Array-Based Lists: [Average and worst cases]
• Insertion and deletion are (n).
• Array must be allocated in advance.
• No overhead if all array positions are full.

Linked Lists:
• Insertion and deletion (1);
prev and direct access are (n).
• Space grows with number of elements.
• Every element requires overhead.

Space \break-even" point:


DE = n(P + E ); n = +
P
DE
E
[Estimation question: What are the dimensions of these
variables?]
E: Space for data value
P: Space for pointer
D: Number of elements in array

43
Doubly Linked Lists
Simplify insertion and deletion: Add a prev
pointer.
class Link { // Doubly-linked node
public: // with freelist
Elem element; // Node Elem value
Link* next; // Pointer to next node
Link* prev; // Pointer to prev node
static Link* freelist; // Link class freelist
Link(const Elem Elemval, Link* nextp =NULL,
Link* prevp =NULL)
{ element = Elemval; next = nextp; prev = prevp;}
Link(Link* nextp =NULL, Link* prevp = NULL)
{ next = nextp; prev = prevp; }
void* operator new(size_t); // Overloaded new
void operator delete(void*); // Overloaded delete
};
head curr tail

20 23 12 15

44
Doubly Linked List Operations
curr

... 20 23 12 ...

Insert 10: 10

(a)

curr
4 5
... 20 23 12 ...
10
3 1 2
(b)
// Insert Elem at current position
void List::insert(const Elem item) {
assert(curr != NULL);
curr->next = new Link(item, curr->next, curr);
if (curr->next->next != NULL)
curr->next->next->prev = curr->next;
if (tail == curr) tail = curr->next;
}

Elem List::remove() { // Remove current Elem


assert(isInList()); // Must be valid position
Elem temp = curr->next->element;
Link* ltemp = curr->next;
if (ltemp->next != NULL) ltemp->next->prev = curr;
else tail = curr; // Removed tail Elem - change tail
curr->next = ltemp->next;
delete ltemp;
return temp;
}

45
Stacks
LIFO: Last In, First Out
Restricted form of list: Insert and remove only
at front of list.
Notation:
• Insert: PUSH
• Remove: POP
• The accessible element is called TOP.

46
Array-Based Stack
De ne top as rst free position.
class Stack { // Array-based stack class
private:
int size; // Maximum size of stack
int top; // Index for top Elem
Elem *listarray; // Array holding stack Elems

public:
Stack(int sz =LIST_SIZE) // Constructor: initialize
{ size = sz; top = 0; listarray = new Elem[sz]; }
~Stack() // Destructor: free array
{ delete [] listarray; }
void clear() // Remove all Elems
{ top = 0; }
void push(const Elem item) // Push Elem onto stack
{ assert(top < size); listarray[top++] = item; }
Elem pop() // Pop Elem from stack top
{ assert(!isEmpty()); return listarray[--top]; }
Elem topValue() const // Return value of top Elem
{ assert(!isEmpty()); return listarray[top-1]; }
bool isEmpty() const // Return TRUE if empty
{ return top == 0; }
};
top1 top2

47
Linked Stack
class Stack { // Linked stack class
private:
Link *top; // Pointer to top Elem
public:
Stack(int sz =LIST_SIZE) // Constructor:
{ top = NULL; } // initialize
~Stack() { clear(); } // Destructor
void clear(); // Remove stack Elems
void push(const Elem item) // Push Elem onto stack
{ top = new Link(item, top); }
Elem pop(); // Pop Elem from stack
Elem topValue() const // Get value of top Elem
{ assert(!isEmpty()); return top->element; }
bool isEmpty() const // Return TRUE if empty
{ return top == NULL; }
};

void Stack::clear() { // Remove Elems


while(top != NULL) // Free link nodes
{ Link* temp = top; top = top->next; delete temp; }
}

Elem Stack::pop() { // Pop Elem from stack


assert(!isEmpty());
Elem temp = top->element;
Link* ltemp = top->next;
delete top; top = ltemp;
return temp;
}

48
Queues
FIFO: First In, First Out
Restricted form of list:
Insert at one end, remove from other.
Notation:
• Insert: Enqueue
• Delete: Dequeue
• First element: FRONT
• Last element: REAR

49
Queue Implementations
Array-Based Queue
front rear

20 5 12 17

(a)

front rear

12 17 3 30 4

(b)

[Application of Pigeonhole Principle: Given a xed (arbitrary)


position for front, there are n − 1 states (0 through n elements
in queue) and only n positions for rear. So, something must
be done to distinguish between two of the states.]
front
front
20 5
12 12
17 17
rear
3
30
4

rear
(a) (b)

[Choices: Leave one slot empty, or use a separate ag.]


[\Drifting" queue { use modular arithmetic.]
Linked Queue: modi ed linked list. [Operations
are (1)]
50
Binary Trees
A binary tree is made up of a nite set of
nodes that is either empty or consists of a
node called the root together with two binary
trees, called the left and right subtrees, which
are disjoint from each other and from the root.
Notation: Node, Children, Edge, Parent
Ancestor, Descendant, Path, Depth,
Height, Level, Leaf Node, Internal Node,
Subtree.
A

B C

D E F

G H I

[A has depth 0. B and C form level 1. The tree has height 4.


Height = max depth + 1.]

51
Full and Complete Binary Trees
Full binary tree: each node either is a leaf or is
an internal node with exactly two non-empty
children.
Complete binary tree: If the height of the tree
is d, then all levels except possibly level d are
completely full. The bottom level has all nodes
to the left side.

(a) (b)

[These terms can be hard to distinguish. Students will need to


remember which is which.]

52
Full Binary Tree Theorem
Theorem: The number of leaves in a
non-empty full binary tree is one more than the
number of internal nodes.
[Relevant since it helps us calculate space requirements.]
Proof (by Mathematical Induction):
• Base Case: A full binary tree with 1
internal node must have two leaf nodes.
• Induction Hypothesis: Assume any full
binary tree T containing n − 1 internal
nodes has n leaves.
• Induction Step: Given tree T with n
internal nodes, pick internal node I with
two leaf children. Remove I 's children, call
resulting tree T'. By induction hypothesis,
T' is a full binary tree with n leaves.
Restore i's two children. The number of
internal nodes has now gone up by 1 to
reach n. The number of leaves has also
gone up by 1.

53
Full Binary Tree Theorem Corollary
Theorem: The number of NULL pointers in a
non-empty binary tree is one more than the
number of nodes in the tree.
Proof: Replace all null pointers with a pointer
to an empty leaf node. This is a full binary tree.

54
Binary Tree Node Class
class BinNode { // Binary tree node class
public:
Belem element; // The node’s value
BinNode* left; // Pointer to left child
BinNode* right; // Pointer to right child
static BinNode* freelist;
// Two constructors: with and without initial values
BinNode() { left = right = NULL; }
BinNode(Belem e, BinNode* l =NULL, BinNode* r =NULL)
{ element = e; left = l; right = r; }
~BinNode() { } // Destructor
BinNode* leftchild() const { return left; }
BinNode* rightchild() const { return right; }
Belem value() const { return element; };
void setValue(Belem val) { element = val; }
bool isLeaf() const // TRUE if is a leaf
{ return (left == NULL) && (right == NULL); }
void* operator new(size_t); // Overload new
void operator delete(void*);// Overload delete
};

55
Traversals
Any process for visiting the nodes in some
order is called a traversal.
Any traversal that lists every node in the tree
exactly once is called an enumeration of the
tree's nodes.
Preorder traversal: Visit each node before
visiting its children.
Postorder traversal: Visit each node after
visiting its children.
Inorder traversal: Visit the left subtree, then
the node, then the right subtree.
void preorder(BinNode* rt) // rt is root of a subtree
{
if (rt == NULL) return; // Empty subtree
visit(rt); // visit performs desired action
preorder(rt->leftchild());
preorder(rt->rightchild());
}

56
Binary Tree Implementation
A

B C

D E F

G H I
[Leaves are the same as internal nodes. Lots of wasted
space.]
,

 c

 +

4 x  a

2 x

[Example of expression tree: (4x ∗ (2x + a)) − c. Leaves are


di erent from internal nodes.]

57
Union Implementation
enum Nodetype {leaf, internal}; // Enumerate node types
class VarBinNode { // Generic node class
public:
Nodetype mytype; // Stores type for this node
union {
struct { // Structure for internal node
VarBinNode* left; VarBinNode* right; // Children
Operator opx; // Internal node value
} intl;
Operand var; // Leaves just store a value
};
VarBinNode(const Operand& val) // Constructor: leaf
{ mytype = leaf; var = val; }
// Constructor: Internal
VarBinNode(const Operator& op,
VarBinNode* l, VarBinNode* r) {
mytype = internal; intl.opx = op;
intl.left = l; intl.right = r; }
bool isLeaf() { return mytype == leaf; }
VarBinNode* leftchild() { return intl.left; }
VarBinNode* rightchild() { return intl.right; }
};

void traverse(VarBinNode* rt) { // Preorder traversal


if (rt == NULL) return;
if (rt->isLeaf()) cout << "Leaf: " << rt->var <<"\n";
else {
cout << "Internal: " << rt->intl.opx << "\n";
traverse(rt->leftchild());
traverse(rt->rightchild());
}}
58
Inheritance
class VarBinNode { // Node base class
public:
// isLeaf is a "pure" virtual function.
// Each derived class must define isLeaf.
virtual bool isLeaf() = 0;
};

class LeafNode : public VarBinNode { // Leaf subclass


private:
Operand var; // Operand value
public:
LeafNode(const Operand& val) { var = val; }
bool isLeaf() { return TRUE; } // Subclass version
Operand value() { return var; } // Return value
};

class IntlNode : public VarBinNode { // Internal node


private:
VarBinNode* left; // Left child
VarBinNode* right; // Right child
Operator opx; // Operator value
public:
IntlNode(const Operator& op,
VarBinNode* l, VarBinNode* r)
{ opx = op; left = l; right = r; } // Constructor
bool isLeaf() { return FALSE; } // Subclass version
VarBinNode* leftch() { return left; } // Left
VarBinNode* rightch() { return right; } // Right
Operator value() { return opx; } // Value
};
59
Inheritance (cont)
void traverse(VarBinNode *rt) { // Preorder traversal
if (rt == NULL) return; // Nothing to visit
if (rt->isLeaf()) // Do leaf node
cout << "Leaf: " << ((LeafNode *)rt)->value()
<< "\n";
else { // Do internal node
cout << "Internal: " << ((IntlNode *)rt)->value()
<< "\n";
traverse(((IntlNode *)rt)->leftch());
traverse(((IntlNode *)rt)->rightch());
}
}

60
Space Overhead
From Full Binary Tree Theorem:
Half of pointers are NULL.
If leaves only store information, then overhead
depends on whether tree is full.
All nodes the same, with two pointers to
children:
Total space required is (2p + d)n.
Overhead: 2pn.
If p = d, this means 2p/(2p + d) = 2/3 overhead.
[The following is for full binary trees:]
Eliminate pointers from leaf nodes:
n (2p)
2 = p
n (2p) + dn p+d
2
[Half the nodes have 2 pointers, which is overhead.]
This is 1/2 if p = d.
2p/(2p + d) if data only at leaves ⇒ 2/3
overhead.
Some method is needed to distinguish leaves
from internal nodes. [This adds overhead.]
61
Array Implementation
[This is a good example of logical representation vs. physical
implementation.]
For complete binary trees.
0

1 2

3 4 5 6

7 8 9 10 11

(a)

Node 0 1 2 3 4 5 6 7 8 9 10 11
• Parent(r) = [(r − 1)/2 if r 6= 0 and r < n.]
• Leftchild(r) = [2r + 1 if 2r + 1 < n.]
• Rightchild(r) = [2r + 2 if 2r + 2 < n.]
• Leftsibling(r) = [r − 1 if r is even, r > 0 and r < n.]
• Rightsibling(r) = [r + 1 if r is odd, r + 1 < n.]
[Since the complete binary tree is so limited in its shape,
(only one shape for tree of n nodes), it is reasonable to
expect that space eciency can be achieved.]
62
Hu man Coding Trees
ASCII Codes: 8 bits per character.
Fixed length coding.
Can take advantage of relative frequency of
letters to save space.
Variable length coding.
Z K F C U D L E
2 7 24 32 37 42 42 120
Build the tree with minimal external path
weight.

63
Hu man Tree Construction
Step 1: 2 7 24 32 37 42 42 120
Z K F C U D L E

9 24 32 37 42 42 120
Step 2: F C U D L E
2 7
Z K

32 33 37 42 42 120
C U D L E
Step 3: 9 24
F
2 7
Z K

37 42 42 65 120
U D L E
32 33
Step 4: C
9 24
F
2 7
Z K

42 65 79 120
L E
32 33 37 42
C U D
Step 5:
9 24
F
2 7
Z K
64
Assigning Codes
306
0 1
120 186
E 0 1
79 107
0 1 0 1
37 42 42 65
U D L 0 1
32 33
C 0 1
9 24
0 1 F
2 7
Z K

Letter Freq Code Bits


C 32
D 42
E 120
F 24
K 7
L 42
U 37
Z 2
[C code: 1110; D code: 101; E code: 0; F code 1111; K code
111101; L code: 110; U code 100; Z code 111100.]
65
Coding and Decoding
A set of codes are said to meet the
pre x property if no code in the set is the
pre x of another.
Code for DEED: [101 0 0 101]
Decode 1011001110111101: [DUCK]
Expected cost per letter:
[ 1∗120+3∗121+4∗32+5∗24+6∗9
306
= 785
306
= 2.57.]

Letter Freq Code Bits


C 32 1110 4
D 42 101 3
E 120 0 1
F 24 11111 5
K 7 111101 6
L 42 110 3
U 37 100 3
Z 2 111100 6
66
Binary Search Trees
Binary Search Tree (BST) Property
All elements stored in the left subtree of a node
whose value is K have values less than K . All
elements stored in the right subtree of a node
whose value is K have values greater than or
equal to K .
[Problem with lists: either insert/delete or search must be
(n) time. How can we make both update and search
ecient? Answer: Use a new data structure.] 120

37 42

24 42 7 42

7 32 40 42 2 32

2 120 24 37

40

(a) (b)

67
BST Search
class BST {
private:
BinNode* root;
void clearhelp(BinNode*); // Private
void inserthelp(BinNode*&, const Belem); // functions
BinNode* deletemin(BinNode*&);
void removehelp(BinNode*&, int);
Belem findhelp(BinNode*, int) const;
void printhelp(const BinNode*, int) const;
public:
BST() { root = NULL; }
~BST() { clearhelp(root); }
void clear() { clearhelp(root); root = NULL; }
void insert(const Belem val) {inserthelp(root, val);}
void remove(const val) { removehelp(root, val); }
Belem find(const val) const
{ return findhelp(root, val); }
bool isEmpty() const { return root == NULL; }
void print() const {
if (root == NULL) cout << "The BST is empty.\n";
else printhelp(root, 0);
}
};

Belem BST::findhelp(BinNode* rt, int val) const {


if (rt == NULL) return UNSUCCESSFUL; // Empty tree
else if (val < key(rt->value())) // Left
return findhelp(rt->leftchild(), val);
else if (val == key(rt->value())) return rt->value();
else return findhelp(rt->rightchild(), val); // Right
}
68
BST Insert
void BST::inserthelp(BinNode*& rt, const Belem val) {
if (rt == NULL) // Empty tree: create node
rt = new BinNode(val, NULL, NULL);
else if (key(val) < key(rt->value()))
inserthelp(rt->left, val); // Check left
else inserthelp(rt->right, val); // Check right
}

Note that rt is declared \by reference."

37

24 42

7 32 40 42

2 35 120

69
Alternate Approach
void BST::inserthelp(BinNode* rt, const Belem val) {
if (rt == NULL)
return new BinNode(val, NULL, NULL);
if (key(val) < key(rt->value()))
rt->left = inserthelp(rt->left, val);
else rt->right = inserthelp(rt->right, val);
return rt;
}

70
Remove Minimum Value
BinNode* BST::deletemin(BinNode*& rt) {
assert(rt != NULL); // Must be a node to delete
if (rt->left != NULL) // Continue left
return deletemin(rt->left);
else // Found it
{ BinNode* temp = rt; rt = rt->right; return temp; }
}

10
rt

5 20

71
BST Remove
void BST::removehelp(BinNode*& rt, int val) {
if (rt == NULL) cout << val << " is not in tree.\n";
else if (val < key(rt->value())) // Check left
removehelp(rt->left, val);
else if (val > key(rt->value())) // Check right
removehelp(rt->right, val);
else { // Found it: remove
BinNode* temp = rt;
if (rt->left == NULL) // Only a right -
rt = rt->right; // point to right
else if (rt->right == NULL) // Only a left -
rt = rt->left; // point to left
else { // Both non-empty
temp = deletemin(rt->right); // Replace with min
rt->setValue(temp->value()); // in right subtree
}
delete temp; // Free up space
}
}

37 40

24 42

7 32 40 42

2 120

72
Cost of BST Operations
Find:
Insert:
Remove:
[All cost depth of the node in question. Worst case:(n).
Average case:(n log n).]

73
Heaps
Heap: Complete binary tree with the
Heap Property:
• Min-heap: all values less than child values.
• Max-heap: all values greater than child
values.
The values in a heap are partially ordered.
Heap representation: normally the array based
complete binary tree representation.

74
Building the Heap
[Max Heap]

1 7

2 3 4 6

4 5 6 7 1 2 3 5

(a)

1 7

2 3 5 6

4 5 6 7 4 2 1 3

(b)

(a) requires exchanges (4-2), (4-1), (2-1),


(5-2), (5-4), (6-3), (6-5), (7-5), (7-6).
(b) requires exchanges (5-2), (7-3), (7-1),
(6-1).
[How to get a good number of exchanges? By induction.
Heapify the root's subtrees, then push the root to the correct
level.]
75
Heap ADT
class heap { // Max-heap class
private:
Elem* Heap; // Pointer to heap array
int size; // Maximum size of heap
int n; // Number of ELEMs in heap
void siftdown(int); // Put ELEM in place
public:
heap(Elem*, int, int); // Constructor
int heapsize() const; // Return current size
bool isLeaf(int) const; // TRUE if pos is a leaf
int leftchild(int) const; // Return L child position
int rightchild(int) const; // Return R child position
int parent(int) const; // Return parent position
void insert(const Elem); // Insert value into heap
Elem removemax(); // Remove maximum value
Elem remove(int); // Remove specified value
void buildheap(); // Heapify contents
};

76
Siftdown
For fast heap construction:
• Work from high end of array to low end.
• Call siftdown for each item.
• Don't need to call siftdown on leaf nodes.
void heap::buildheap() // Heapify contents
{ for (int i=n/2-1; i>=0; i--) siftdown(i); }

void heap::siftdown(int pos) { // Put ELEM in place


assert((pos >= 0) && (pos < n));
while (!isLeaf(pos)) {
int j = leftchild(pos);
if ((j<(n-1)) && (key(Heap[j]) < key(Heap[j+1])))
j++; // j now index of child with greater value
if (key(Heap[pos]) >= key(Heap[j])) return; // Done
swap(Heap, pos, j);
pos = j; // Move down
}
}
Cost for heap construction:
log
Xn
(i − 1) 2i ≈ n.
n
i =1
[(i − 1) is number of steps down, n/2i is number of nodes at
that level.]
77
Priority Queues
A priority queue stores objects, and on request
releases the object with greatest value.
Example: Scheduling jobs in a multi-tasking
operating system.
The priority of a job may change, requiring
some reordering of the jobs.
Implementation: use a heap to store the
priority queue.
To support priority reordering, delete and
re-insert. Need to know index for the object.
// Remove value at specified position
Elem heap::remove(int pos) {
assert((pos > 0) && (pos < n));
swap(Heap, pos, --n); // Swap with last value
while (key(Heap[pos]) > key(Heap[parent(pos)]))
swap(Heap, pos, parent(pos)); // Push up if large
if (n != 0) siftdown(pos); // Push down if small
return Heap[n];
}

78
General Trees
A tree T is a nite set of one or more nodes
such that there is one designated node r called
the root of T , and the remaining nodes in
(T − {r}) are partitioned into n ≥ 0 disjoint
subsets T1, T2, ..., Tk, each of which is a tree,
and whose roots r1, r2, ..., rk, respectively, are
children of r.
[Note: disjoint because a node cannot have two parents.]
Root R
Ancestors of V
Parent of V P

V
S1 S2
C1 C2 Siblings of V

Subtree rooted at V

Children of V

79
General Tree ADT
class GTNode {
public:
GTNode(const Elem); // Constructor
~GTNode(); // Destructor
Elem value(); // Return node’s value
bool isLeaf(); // TRUE if is a leaf
GTNode* parent(); // Return parent
GTNode* leftmost_child(); // Return first child
GTNode* right_sibling(); // Return right sibling
void setValue(Elem); // Set node’s value
void insert_first(GTNode* n); // Insert first child
void insert_next(GTNode* n); // Insert right sibling
void remove_first(); // Remove first child
void remove_next(); // Remove right sibling
};

class GenTree {
public:
GenTree(); // Constructor
~GenTree(); // Destructor
void clear(); // Free nodes
GTNode* root(); // Return root
void newroot(Elem, GTNode*, GTNode*); // Combine
};

80
General Tree Traversal
void print(GTNode* rt) { // Preorder traverse from root
if (rt->isLeaf()) cout << "Leaf: ";
else cout << "Internal: ";
cout << rt->value() << "\n"; // Print or take action
GTNode* temp = rt->leftmost_child();
while (temp != NULL)
{ print(temp); temp = temp->right_sibling(); }
}
R

A B

C D E F

[RACDEBF]

81
Parent Pointer Implementation
R W

A B X Y Z

C D E F

Parent's Index 0 0 1 1 1 2 7 7 7
Label R A B C D E F W X Y Z
Node Index 0 1 2 3 4 5 6 7 8 9 10

Parent Pointer representation is good for


answering:
Are two elements in the same tree?
bool Gentree::differ(int a, int b) {
// Are nodes a and b in different trees?
GTNode* root1 = FIND(&array[a]); // Find a’s root
GTNode* root2 = FIND(&array[b]); // Find b’s root
return root1 != root2; // Compare roots
}
[Examples of equivalence classes: Connected components in
graphs; point clustering.]

82
Equivalence Classes
When joining equivalence classes, want to keep
depth small.
Weighted Union Rule: join the tree with fewer
nodes to the tree with more nodes.
Limits depth to log n for n nodes.
[Less than half of the nodes increase depth by 1.]
Path Compression: Make all nodes visited point
to root. [Nearly constant cost.]
class GTNode { // General tree node
public:
GTNode* par; // Parent pointer
GTNode() { par = NULL; } // Constuctor
GTNode* parent() { return par; } // Return parent
};

class Gentree { // General tree: UNION/FIND


private:
GTNode* array; // Node array
int size; // Size of node array
GTNode* FIND(GTNode*) const; // Find root
public:
Gentree(int); // Constructor
~Gentree(); // Destructor
void UNION(int, int); // Merge equivalences
bool differ(int, int); // TRUE if not in same tree
};

83
UNION-FIND
Gentree::Gentree(int sz) { // Constructor
size = sz;
array = new GTNode[sz]; // Create node array
}

Gentree::~Gentree() { // Destructor
delete [] array; // Free node array
}

bool Gentree::differ(int a, int b) {


// Are nodes a and b in different trees?
GTNode* root1 = FIND(&array[a]); // Find a’s root
GTNode* root2 = FIND(&array[b]); // Find b’s root
return root1 != root2; // Compare roots
}

void Gentree::UNION(int a, int b) { // Merge subtrees


GTNode* root1 = FIND(&array[a]); // Find a’s root
GTNode* root2 = FIND(&array[b]); // Find b’s root
if (root1 != root2) root2->par = root1; // Merge
}

84
Equivalence Processing Example
A B C D E
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 F G H I J
[Process (A,B), (C, H), (G,F), (D, E), (I, F)]
(a)

0 3 5 2 5 A C F J D
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 B H G I E
[Process (H, A), (E, G)] (b)

0 0 5 3 5 2 5 A F J
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 B C G I D
[Process (H, E)]
H E
(c)

5 0 0 5 3 5 2 5 F J
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 A G I D

B C E

H
(d)
85
Path Compression Example
5 0 0 5 5 5 0 5 F J
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 A G I D E

B C H

GTNode* Gentree::FIND(GTNode* curr) const {


while (curr->parent() != NULL) curr = curr->par;
return curr; // At root
}

86
Lists of Children
Index Val Par
0 R 1 3
1 A 0 2 4 6
2 C 1
3 B 0 5
4 D 1
5 F 3
6 E 1
7

[Hard to nd right sibling.]

87
Leftmost Child/Right Sibling
Left Val Par Right
1 R
R 0
3 A 0 2
6 B 0
R X C 1 4
D 1 5
A B E 1
F 2
C D E F 8 R 0

X 7
[Note: Two trees share same array.]
Left Val Par Right
1 R 7 8
R0
3 A 0 2
6 B 0
R X C 1 4
D 1 5
A B E 1
F 2
C D E F 0 R 0
-1
X 7

88
Linked Implementations
Val Size
R 2
R

A 3 B 1
A B

C D E F C 0 D 0 E 0 F 0
(a) (b)

[Allocate child pointer space when node is created.]

R A B

A B

C D E F C D E F
(a) (b)

89
Sequential Implementations
List node values in the order they would be
visited by a preorder traversal.
Saves space, but allows only sequential access.
Need to retain tree structure for reconstruction.

For binary trees: Use symbol to mark NULL links.


AB/D//CEG///F H//I//

Full binary trees: Mark leaf or internal.


[Need NULL mark since this tree is not full.]
A0B 0 /DC 0E 0G/F 0HI

General trees: Mark end of each subtree.


RAC )D)E ))BF )))

90
Convert to Binary Tree
Left Child/Right Sibling representation
essentially stores a binary tree.
Use this process to convert any general tree to
a binary tree.
A forest is a collection of one or more general
trees.
root

(a) (b)

[Dynamic implementation of \Left child/right sibling."]

91
Graphs
A graph G = (V, E) consists of a set of
vertices V, and a set of edges E, such that
each edge in E is a connection between a pair
of vertices in V.
The number of vertices is written |V|, and the
number of edges is written |E|.
A sequence of vertices v1, v2, ..., vn forms a path
of length n − 1 if there exist edges from vi to
vi+1 for 1 ≤ i < n.
A path is simple if all vertices on the path are
distinct.
A cycle is a path of length 3 or more that
connects vi to itself.
A cycle is simple if the path is simple, except
for the rst and last vertices being the same.

92
Graph De nitions (Cont)
An undirected graph is connected if there is at
least one path from any vertex to any other.
The maximal connected subgraphs of an
undirected graph are called
connected components.
A graph without cycles is acyclic.
A directed graph without cycles is a
directed acyclic graph or DAG.
A free tree is a connected, undirected graph
with no simple cycles. Equivalently, a free tree
is connected and has |V − 1| edges.
0 2
4 1

3 4 7
1
2
1 3
(a) (b) (c)

93
Connected Components

0 2 6 7

1 3 5

94
Graph Representations
Adjacency Matrix: (|V|2).
Adjacency List: (|V| + |E|).
0 1 2 3 4
0 2 0 1 1
1 1
4 2 1
3 1
1 3 4 1

(a) (b)

0 1 4
1 3
2 4
3 2
4 1

(c)
0 1 2 3 4
0 2 0 1 1
1 1 1 1
4 2 1 1
3 1 1
1 3 4 1 1 1

(a) (b)

0 1 4
1 0 3 4
2 3 4
3 1 2
4 0 1 2
(c)

[Instead of bits, the graph could store edge, weights.]


95
Implementation: Adjacency Matrix
class EdgeClass {
public:
int v1, v2;
EdgeClass(int in1, int in2) { v1 = in1; v2 = in2; }
};
typedef EdgeClass* Edge;

class Graph { // Adjacency matrix


private:
int** matrix; // The edge matrix
int numVertex; // Number of vertices
int numEdge; // Number of edges
bool* Mark; // The mark array
public:
Graph(); // Constructor
~Graph(); // Destructor
int n(); // Number of graph vertices
int e(); // Number of edges for graph
Edge first(int); // Get vertex first edge
bool isEdge(Edge); // TRUE if this is an edge
Edge next(Edge); // Get vertex next edge
int v1(Edge); // Vertex edge comes from
int v2(Edge); // Vertex edge goes to
int weight(int, int); // Weight of edge
int weight(Edge); // Weight of edge
bool getMark(int); // Return a Mark value
void setMark(int, bool); // Set a Mark value
};
[For completeness, add setedge and deledge.]
96
Adjacency Matrix Functions
Edge Graph::first(int v) { // Get vertex first edge
for (int i=0; i<numVertex; i++)
if (matrix[v][i] != NOEDGE)
return new EdgeClass(v, i);
return NULL;
}

bool Graph::isEdge(Edge w) // TRUE if this is an edge


{ return (w != NULL) &&
(matrix[w->v1][w->v2] != NOEDGE); }

Edge Graph::next(Edge w) { // Get vertex next edge


for (int i=w->v2+1; i<numVertex; i++)
if (matrix[w->v1][i] != NOEDGE)
return new EdgeClass(w->v1, i);
return NULL;
}

int Graph::weight(int i, int j) { // Return edge weight


if (matrix[i][j] == NOEDGE) return INFINITY;
else return matrix[i][j];
}
int Graph::weight(Edge w) { // Return weight of edge
if ((w == NULL) || (matrix[w->v1][w->v2] == NOEDGE))
return INFINITY;
else return matrix[w->v1][w->v2];
}
int Graph::v1(Edge w) { return w->v1; } // Comes from
int Graph::v2(Edge w) { return w->v2; } // Goes to
bool Graph::getMark(int v) { return Mark[v]; }
void Graph::setMark(int v, bool val) { Mark[v] = val; }
97
Implementation: Adjacency List
class EdgeLink { // Singly-linked list node
public:
int weight; // Edge weight
int v1, v2; // Edge vertices
EdgeLink* next; // Pointer to next list edge
EdgeLink(int vt1, int vt2, int w, EdgeLink* nxt=NULL)
{ v1 = vt1; v2 = vt2; weight = w; next = nxt; }
EdgeLink(EdgeLink* nxt =NULL) { next = nxt; }
};
typedef EdgeLink* Edge;

class Graph { // Graph class: Adjacency list


private:
Edge* list; // The vertex list
int numVertex, numEdge // Number of vertices, edges
bool* Mark; // The mark array
public:
Graph(); // Constructor
~Graph(); // Destructor
int n(); // Number of vertices
int e(); // Number of edges
Edge first(int); // Get vertex first edge
bool isEdge(Edge); // TRUE if this is an edge
Edge next(Edge); // Get vertex next edge
int v1(Edge); // Vertex edge comes from
int v2(Edge); // Vertex edge goes to
int weight(int, int); // Weight of edge
int weight(Edge); // Weight of edge
bool getMark(int); // Return a Mark value
void setMark(int, bool); // Set a Mark value
};
98
Adjacency List Functions
Edge Graph::first(int v) // Get first edge for vertex
{ return list[v]; }

bool Graph::isEdge(Edge w) // TRUE if this is an edge


{ return w != NULL; }

Edge Graph::next(Edge w) { // Get next edge for vertex


if (w == NULL) return NULL;
else return w->next;
}

int Graph::v1(Edge w) // Return vertex it comes from


{ return w->v1; }

int Graph::v2(Edge w) // Return vertex it goes to


{ return w->v2; }

int Graph::weight(int i, int j) { // Return edge weight


for (Edge curr = list[i]; curr != NULL;
curr = curr->next)
if (curr->v2 == j) return curr->weight;
return INFINITY;
}

int Graph::weight(Edge w) { // Return weight of edge


if (w == NULL) return INFINITY;
else return w->weight;
}

99
Graph Traversals
Some applications require visiting every vertex
in the graph exactly once.
Application may require that vertices be visited
in some special order based on graph topology.
Example: Arti cial Intelligence
• Problem domain consists of many \states."
• Need to get from Start State to Goal State.
• Start and Goal are typically not directly
connected.
To insure visiting all vertices:
void graph_traverse(Graph& G) {
for (v=0; v<G.n(); v++)
G.Mark[v] = UNVISITED; // Initialize mark bits
for (v=0; v<G.n(); v++)
if (G.Mark[v] == UNVISITED)
do_traverse(G, v);
}
[Two traversals we will talk about: DFS, BFS.]

100
Depth First Search
void DFS(Graph& G, int v) { // Depth first search
PreVisit(G, v); // Take appropriate action
G.setMark(v, VISITED);
for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
if (G.getMark(G.v2(w)) == UNVISITED)
DFS(G, G.v2(w));
PostVisit(G, v); // Take appropriate action
}

Cost: (|V| + |E|).

A B A B

C C

D D
F F
E E
(a) (b)

[The directions are imposed by the traversal. This is the


Depth First Search Tree.]

101
Breadth First Search
Like DFS, but replace stack with a queue.
Visit the vertex's neighbors before continuing
deeper in the tree.
void BFS(Graph& G, int start) {
Queue Q(G.n());
Q.enqueue(start);
G.setMark(start, VISITED);
while (!Q.isEmpty()) {
int v = Q.dequeue();
PreVisit(G, v); // Take appropriate action
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (G.getMark(G.v2(w)) == UNVISITED) {
G.setMark(G.v2(w), VISITED);
Q.enqueue(G.v2(w));
}
PostVisit(G, v); // Take appropriate action
}}
A B A B

C C

D D
F F
E E
(a) (b)
102
Topological Sort
Problem: Given a set of jobs, courses, etc.
with prerequisite constraints, output the jobs in
an order that does not violate any of the
prerequisites.
J6
J1 J2 J5 J7

J3 J4

void topsort(Graph& G) { // Topological sort: recursive


for (int i=0; i<G.n(); i++) // Initialize Mark array
G.setMark(i, UNVISITED);
for (i=0; i<G.n(); i++) // Process all vertices
if (G.getMark(i) == UNVISITED)
tophelp(G, i); // Call helper function
}

void tophelp(Graph& G, int v) { // Helper function


G.setMark(v, VISITED);
// No PreVisit operation
for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
if (G.getMark(G.v2(w)) == UNVISITED)
tophelp(G, G.v2(w));
printout(v); // PostVisit for Vertex v
}
[Prints in reverse order.]
103
Queue-based Topological Sort
void topsort(Graph& G) { // Topological sort: Queue
Queue Q(G.n());
int Count[G.n()];

for (int v=0; v<G.n(); v++) Count[v] = 0; // Init


for (v=0; v<G.n(); v++) // Process every edge
for (Edge w=G.first(v); G.isEdge(w); w=G.next(w))
Count[G.v2(w)]++; // Add to v2’s prereq count
for (v=0; v<G.n(); v++) // Initialize Queue
if (Count[v] == 0) // Vertex has no prereqs
Q.enqueue(v);
while (!Q.isEmpty()) { // Process the vertices
int v = Q.dequeue();
printout(v); // PreVisit for Vertex V
for (Edge w=G.first(v); G.isEdge(w); w=G.next(w)) {
Count[G.v2(w)]--; // One less prerequisite
if (Count[G.v2(w)] == 0) // Vertex is now free
Q.enqueue(G.v2(w));
}
}
}

104
Shortest Paths Problems
Input: A graph with weights or costs
associated with each edge.
Output: The list of edges forming the shortest
path.
Sample problems:
• Find the shortest path between two
speci ed vertices.
• Find the shortest path from vertex S to all
other vertices.
• Find the shortest path between all pairs of
vertices.
Our algorithms will actually calculate only
distances.

105
Shortest Paths De nitions
d(A, B) is the shortest distance from vertex A
to B.
w(A, B) is the weight of the edge connecting
A to B.
• If there is no such edge, then w(A, B) = ∞.

B 5
10 D
20
A
2 11
3
15
C E

[w(A, D) = 20; d(A, D) = 10 (through ACBD).]

106
Single Source Shortest Paths
Given start vertex s, nd the shortest path
from s to all other vertices.
Try 1: Visit all vertices in some order, compute
shortest paths for all vertices seen so far, then
add the shortest path to next vertex x.
Problem: Shortest path to a vertex already
processed might go through x.
Solution: Process vertices in order of distance
from s.

107
Dijkstra's Algorithm Example
A B C D E
Initial 0 ∞ ∞ ∞ ∞
Process A 0 10 3∞ 20
Process C 0 5 3 20 18
Process B 0 5 3 10 18
Process D 0 5 3 10 18
Process E 0 5 3 10 18
B 5
10 D
20
A
2 11
3
15
C E

108
Dijkstra's Algorithm: Array
void Dijkstra(Graph& G, int s) { // Use array
int D[G.n()];
for (int i=0; i<G.n(); i++) // Initialize
D[i] = INFINITY;
D[s] = 0;
for (i=0; i<G.n(); i++) { // Process vertices
int v = minVertex(G, D);
if (D[v] == INFINITY) return; // Unreachable
G.setMark(v, VISITED);
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > (D[v] + G.weight(w)))
D[G.v2(w)] = D[v] + G.weight(w);
}
}

int minVertex(Graph& G, int* D) { // Get mincost vertex


int v; // Initialize v to any unvisited vertex;
for (int i=0; i<G.n(); i++)
if (G.getMark(i) == UNVISITED) { v = i; break; }
for (i++; i<G.n(); i++) // Now find smallest D value
if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))
v = i;
return v;
}

Approach 1: Scan the table on each pass for


closest vertex.
Total cost: (|V|2 + |E|) = (|V|2 ).
109
Dijkstra's Algorithm: Priority Queue
class Elem { public: int vertex, dist; };
int key(Elem x) { return x.dist; }
void Dijkstra(Graph& G, int s) { // W/ priority queue
int v; // The current vertex
int D[G.n()]; // Distance array
Elem temp;
Elem E[G.e()]; // Heap array
temp.dist = 0; temp.vertex = s;
E[0] = temp; // Initialize heap
heap H(E, 1, G.e()); // Create the heap
for (int i=0; i<G.n(); i++) D[i] = INFINITY;
D[s] = 0;
for (i=0; i<G.n(); i++) { // Now, get distances
do { temp = H.removemin(); v = temp.vertex; }
while (G.getMark(v) == VISITED);
G.setMark(v, VISITED);
if (D[v] == INFINITY) return; // Rest unreachable
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > (D[v] + G.weight(w))) {
D[G.v2(w)] = D[v] + G.weight(w); // Update D
temp.dist = D[G.v2(w)]; temp.vertex = G.v2(w);
H.insert(temp); // Insert new distance in heap
}}}
Approach 2: Store unprocessed vertices using a
min-heap to implement a priority queue ordered
by D value. Must update priority queue for
each edge.
Total cost: ((|V| + |E|) log |V|).
110
All Pairs Shortest Paths
For every vertex u, v ∈ V, calculate d(u, v).
Could run Dijkstra's Algorithm V times.
[Cost: |V ||E| log |V | = |V | log |V | for dense graph.]
3

Better is Floyd's Algorithm.


[The issue is how to eciently check all the paths without
computing any path more than once.]
De ne a k-path from u to v to be any path
whose intermediate vertices all have indices less
than k.
1
1 7
4
5 1 3

0 1 1 3
2 2 11

12

1
[0,3 is a 0-path. 2,0,3 is a 1-path. 0,2,3 is a 3-path, but not a
2 or 1 path. Everything is a 4 path.]

111
Floyd's Algorithm
void Floyd(Graph& G) { // All-pairs shortest paths
int D[G.n()][G.n()]; // Store distances
for (int i=0; i<G.n(); i++) // Initialize D
for (int j=0; j<G.n(); j++)
D[i][j] = G.weight(i, j);
for (int k=0; k<G.n(); k++) // Compute all k paths
for (int i=0; i<G.n(); i++)
for (int j=0; j<G.n(); j++)
if (D[i][j] > (D[i][k] + D[k][j]))
D[i][j] = D[i][k] + D[k][j];
}

112
Minimum Cost Spanning Trees
Minimum Cost Spanning Tree (MST) Problem:
• Input: An undirected, connected graph G.
• Output: The subgraph of G that 1) has
minimum total cost as measured by
summing the values for all of the edges in
the subset, and 2) keeps the vertices
connected.

A B
7 5
C
9 1 2 6

D 2
F
E 1

113
Prim's MST Algorithm
void Prim(Graph& G, int s) { // Prim’s MST alg
int D[G.n()]; // Distance vertex
int V[G.n()]; // Who’s closest
for (int i=0; i<G.n(); i++) // Initialize
D[i] = INFINITY;
D[s] = 0;
for (i=0; i<G.n(); i++) { // Process vertices
int v = minVertex(G, D);
G.setMark(v, VISITED);
if (v != s) AddEdgetoMST(V[v], v); // Add to MST
if (D[v] == INFINITY) return; // Rest unreachable
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > G.weight(w)) {
D[G.v2(w)] = G.weight(w); // Update distance,
V[G.v2(w)] = v; // who came from
}}}

int minVertex(Graph& G, int* D) { // Min cost vertex


int v; // Initialize v to any unvisited vertex;
for (int i=0; i<G.n(); i++)
if (G.getMark(i) == UNVISITED) { v = i; break; }
for (i=0; i<G.n(); i++) // Now find smallest value
if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))
v = i;
return v;
}

This is an example of a greedy algorithm.


114
Alternative Prim's Implementation
Like Dijkstra's algorithm, we can implement
Prim's algorithm with a priority queue.
void Prim(Graph& G, int s) { // W/ priority queue
int v; // The current vertex
int D[G.n()]; // Distance array
int V[G.n()]; // Who’s closest
Elem temp;
Elem E[G.e()]; // Heap array
temp.distance = 0; temp.vertex = s;
E[0] = temp; // Initialize heap array
heap H(E, 1, G.e()); // Create the heap
for (int i=0; i<G.n(); i++) D[i] = INFINITY; // Init
D[s] = 0;
for (i=0; i<G.n(); i++) { // Now build MST
do { temp = H.removemin(); v = temp.vertex; }
while (G.getMark(v) == VISITED);
G.setMark(v, VISITED);
if (v != s) AddEdgetoMST(V[v], v); // Add to MST
if (D[v] == INFINITY) return; // Rest unreachable
for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
if (D[G.v2(w)] > G.weight(w)) { // Update D
D[G.v2(w)] = G.weight(w);
V[G.v2(w)] = v; // Who came from
temp.distance = D[G.v2(w)];
temp.vertex = G.v2(w);
H.insert(temp); // Insert distance in heap
}
}}

115
Proof of Prim's MST Algorithm
Theorem 14.1 Prim's algorithm produces a
minimum cost spanning tree.
Proof by contradiction:
Order vertices by how they are added to the
MST by Prim's algorithm. v1, v2, ..., vn.
Let edge ei connect (vx, vi+1), x < i.
Let ej be the lowest numbered ( rst) edge
added by the algorithm such that the set of
edges selected so far cannot be extended to
form an MST for G.
Let V1 = (v1, ..., vj ). Let V2 = (vj +1, ..., vn).
Marked Unmarked
Vertices v , i < j
i Vertices v , i  j
i

\correct" edge
e
0

v u v w

vp v j

e
116
j

Prim's edge
Kruskal's MST Algorithm
Kruskel(Graph& G) { // Kruskal’s MST algorithm
Gentree A(G.n()); // Equivalence class array
Elem E[G.e()]; // Array of edges for min-heap
int edgecnt = 0;
for (int i=0; i<G.n(); i++) // Put edges on array
for (Edge w = G.first(i);
G.isEdge(w); w = G.next(w)) {
E[edgecnt].weight = G.weight(w);
E[edgecnt++].edge = w;
}
heap H(E, edgecnt, edgecnt); // Heapify the edges
int numMST = G.n(); // Init w/ n equiv classes
for (i=0; numMST>1; i++) { // Combine equiv classes
Elem temp = H.removemin(); // Get next cheap edge
Edge w = temp.edge;
int v = G.v1(w); int u = G.v2(w);
if (A.differ(v, u)) { // If different equiv classes
A.UNION(v, u); // Combine equiv classes
AddEdgetoMST(G.v1(w), G.v2(w)); // Add to MST
numMST--; // One less MST
}
}
}

How do we compute function MSTof(v)?


Solution: Use Parent Pointer representation to
merge equivalance classes.
117
Kruskal's Algorithm Example
Time dominated by cost for initial edge sort.
[Alternative: Use a min heap, quit when only one set left.
\Kth-smallest" implementation.]
Total cost: (|V| + |E| log |E|).
Initial A B C D E F

C
Step 1 A B E F
1
Process edge (C. D)
D

C F
Step 2 A B
1 E 1
Process edge (E, F)
D

C
Step 3 A B
1 2
Process edge (C, F)
D
F
E 1
118
Sorting
Each record contains a eld called the key.
Linear order: comparison.
[a < b and b < c ⇒ a < c.]
The Sorting Problem
Given a sequence of records R1, R2, ..., Rn with
key values k1, k2, ..., kn, respectively, arrange the
records into any order s such that records
Rs1 , Rs2 , ..., Rsn have keys obeying the property
ks1 ≤ ks2 ≤ ... ≤ ksn .
[Put keys in ascending order.]
Measures of cost:
• Comparisons
• Swaps

119
Insertion Sort
void inssort(Elem* array, int n) { // Insertion Sort
for (int i=1; i<n; i++) // Insert i’th record
for (int j=i; (j>0) &&
(key(array[j])<key(array[j-1])); j--)
swap(array, j, j-1);
}

i=1 2 3 4 5 6 7
42 20 17 13 13 13 13 13
20 42 20 17 17 14 14 14
17 17 42 20 20 17 17 15
13 13 13 42 28 20 20 17
28 28 28 28 42 28 23 20
14 14 14 14 14 42 28 23
23 23 23 23 23 23 42 28
15 15 15 15 15 15 15 42

Best Case: [0 swaps, n − 1 comparisons]


Worst Case: [n /2 swaps and compares]
2

Average Case: [n /4 swaps and compares]


2

[Good best case performance.]


120
Bubble Sort
void bubsort(Elem* array, int n) { // Bubble Sort
for (int i=0; i<n-1; i++) // Bubble up i’th record
for (int j=n-1; j>i; j--)
if (key(array[j]) < key(array[j-1]))
swap(array, j, j-1);
}
[Using test \j > i" saves a factor of 2 over \j > 0".]
i=0 1 2 3 4 5 6
42 13 13 13 13 13 13 13
20 42 14 14 14 14 14 14
17 20 42 20 15 15 15 15
13 17 20 42 20 17 17 17
28 14 17 15 42 20 20 20
14 28 15 17 17 42 23 23
23 15 28 23 23 23 42 28
15 23 23 28 28 28 28 42

Best Case: [n /2 compares, 0 swaps]


2

Worst Case: [n /2 compares, n /2 swaps]


2 2

Average Case: [n /2 compares, n /4 swaps]


2 2

[No redeeming features to this sort.]


121
Selection Sort
void selsort(Elem* array, int n) { // Selection Sort
for (int i=0; i<n-1; i++) { // Select i’th record
int lowindex = i; // Remember its index
for (int j=n-1; j>i; j--) // Find the least value
if (key(array[j]) < key(array[lowindex]))
lowindex = j; // Put it in place
swap(array, i, lowindex);
}
}
[Select the value to go in the ith position.]
i=0 1 2 3 4 5 6
42 13 13 13 13 13 13 13
20 20 14 14 14 14 14 14
17 17 17 15 15 15 15 15
13 42 42 42 17 17 17 17
28 28 28 28 28 20 20 20
14 14 20 20 20 28 23 23
23 23 23 23 23 23 28 28
15 15 15 17 42 42 42 42
Best Case: [0 swaps (n − 1 as written), n /2 compares.]
2

Worst Case: [n − 1 swaps, n /2 compares]


2

Average Case: [O(n) swaps, n /2 compares]


2

122
Pointer Swapping

Key = 42 Key = 42
Key = 5 Key = 5

(a) (b)

[For large records.]

123
Exchange Sorting
Summary
Insertion Bubble Selection
Comparisons:
Best Case (n) (n2) (n2)
Average Case (n2) (n2) (n2)
Worst Case (n2) (n2) (n2)
Swaps:
Best Case 0 0 (n)
Average Case (n2) (n2) (n)
Worst Case (n2) (n2) (n)
All of these sorts rely on exchanges of
adjacent records.
What is the average number of exchanges
required?
[n /4 { average distance from a record to its sorted position.]
2

124
Shellsort
void shellsort(Elem* array, int n) { // Shellsort
for (int i=n/2; i>2; i/=2) // For each increment
for (int j=0; j<i; j++) // Sort each sublist
inssort2(&array[j], n-j, i);
inssort2(array, n, 1);
}

// Version of Insertion Sort for varying increments


void inssort2(Elem* A, int n, int incr) {
for (int i=incr; i<n; i+=incr)
for (int j=i; (j>=incr) &&
(key(A[j])<key(A[j-incr])); j-=incr)
swap(A, j, j-incr);
}
59 20 17 13 28 14 23 83 36 98 11 70 65 41 42 15

[8 lists of length 2]
36 20 11 13 28 14 23 15 59 98 17 70 65 41 42 83

[4 lists of length 4]
28 14 11 13 36 20 17 15 59 41 23 70 65 98 42 83
[2 lists of length 8]
11 13 17 14 23 15 28 20 36 41 42 70 59 83 65 98
[1 list of length 16]
11 13 14 15 17 20 23 28 36 41 42 59 65 70 83 98
O(n1.5) [Any increments will work, provided the last is 1.
Shellsort takes advantage of inssort's best case performance.]
125
Quicksort
Divide and Conquer: divide list into values less
than pivot and values greater than pivot.
[Initial call: ]
qsort(array, 0, n-1);

void qsort(Elem* array, int i, int j) { // Quicksort


int pivotindex = findpivot(array, i, j);
swap(array, pivotindex, j); // Swap to end
// k will be the first position in the right subarray
int k = partition(array, i-1, j, key(array[j]));
swap(array, k, j); // Put pivot in place
if ((k-i) > 1) qsort(array, i, k-1); // Sort left
if ((j-k) > 1) qsort(array, k+1, j); // Sort right
}

int findpivot(Elem* array, int i, int j)


{ return (i+j)/2; }

126
Quicksort Partition
int partition(Elem* array, int l, int r, int pivot) {
do { // Move the bounds inward until they meet
while (key(array[++l]) < pivot); // Move right
while (r && (key(array[--r]) > pivot));// Move left
swap(array, l, r); // Swap out-of-place vals
} while (l < r); // Stop when they cross
swap(array, l, r); // Reverse wasted swap
return l; // Return first pos in right partition
}
Initial 72 6 57 88 85 42 83 73 48 60
l r

Pass 1 72 6 57 88 85 42 83 73 48 60
l r

Swap 1 48 6 57 88 85 42 83 73 72 60
l r

Pass 2 48 6 57 88 85 42 83 73 72 60
l r

Swap 2 48 6 57 42 85 88 83 73 72 60
l r

Pass 3 48 6 57 42 85 88 83 73 72 60
r l

Swap 3 48 6 57 85 42 88 83 73 72 60
r l

Reverse Swap 48 6 57 42 85 88 83 73 72 60
r l
The cost for Partition is (n).
127
Quicksort Example
72 6 57 88 60 42 83 73 48 85
Pivot = 60
48 6 57 42 60 88 83 73 72 85
Pivot = 6 Pivot = 73
6 42 57 48 72 73 85 88 83
Pivot = 57 Pivot = 88
Pivot = 42 42 48 57 85 83 88 Pivot = 85

42 48 83 85

6 42 48 57 60 72 73 83 85 88
Final Sorted Array

128
Cost for Quicksort
Best Case: Always partition in half.
Worst Case: Bad partition.
Average Case:

1 X1
n−
T (n) = n + 1 + (T (k) + T (n − k))
n − 1 k=1
= (n log n)

Optimizations for Quicksort:


• Better pivot.
• Use better algorithm for small sublists.
• Eliminate recursion.

129
Mergesort
List mergesort(List inlist) {
if (inlist.length() <= 1) return inlist;;
List l1 = half of the items from inlist;
List l2 = other half of the items from inlist;
return merge(mergesort(l1), mergesort(l2));
}

36 20 17 13 28 14 23 15

20 36 13 17 14 28 15 23

13 17 20 36 14 15 23 28

13 14 15 17 20 23 28 36

130
Mergesort Implementation
Mergesort is tricky to implement.
void mergesort(Elem* array, Elem* temp,
int left, int right) {
int mid = (left+right)/2;
if (left == right) return; // List of one ELEM
mergesort(array, temp, left, mid); // Sort 1st half
mergesort(array, temp, mid+1, right);// Sort 2nd half
for (int i=left; i<=right; i++) // Copy to temp
temp[i] = array[i];
// Do the merge operation back to array
int i1 = left; int i2 = mid + 1;
for (int curr=left; curr<=right; curr++) {
if (i1 == mid+1) // Left sublist exhausted
array[curr] = temp[i2++];
else if (i2 > right) // Right sublist exhausted
array[curr] = temp[i1++];
else if (key(temp[i1]) < key(temp[i2]))
array[curr] = temp[i1++];
else array[curr] = temp[i2++];
}}
[Note: This requires a second array.]
Mergesort cost: [(n log n)]
Mergesort is good for sorting linked lists.
[Send records to alternating linked lists, mergesort each, then
merge.]
131
Optimized Mergesort
void mergesort(Elem* array, Elem* temp,
int left, int right) {
int i, j, k, mid = (left+right)/2;
if (left == right) return;
mergesort(array, temp, left, mid); // Sort 1st half
mergesort(array, temp, mid+1, right);// Sort 2nd half

// Do merge operation. First, copy halves to temp.


for (i=left; i<=mid; i++) temp[i] = array[i];
for (j=1; j<=right-mid; j++)
temp[right-j+1] = array[j+mid];
// Merge sublists back to array
for (i=left,j=right,k=left; k<=right; k++)
if (key(temp[i]) < key(temp[j]))
array[k] = temp[i++];
else array[k] = temp[j--];
}
[The right sublist is reversed. So, each list increases towards
the middle. This means that each list serves as a sentinal for
the other list, so no test for end-of-list is required.]

132
Heapsort
Heapsort uses a max-heap.
void heapsort(Elem* array, int n) { // Heapsort
heap H(array, n, n); // Build the heap
for (int i=0; i<n; i++) // Now sort
H.removemax(); // Value placed at end of heap
}
Cost of Heapsort: [(n log n)]
Cost of nding k largest elements: [(k log n + n).
Time to build heap: (n).
Time to remove least element: (log n).]
[Compare to sorting with BST: this is expensive in space
(overhead), potential bad balance, BST does not take
advantage of having all records available in advance.]
[Heap is space ecient, balanced, and building initial heap is
ecient.]

133
Heapsort Example
Original Numbers 73
73 6 57 88 60 42 83 72 48 85 6 57
88 60 42 83
72 48 85

Build Heap 88
88 85 83 72 73 42 57 6 48 60 85 83
72 73 42 57
6 48 60

Remove 88 85
85 73 83 72 60 42 57 6 48 88 73 83
72 60 42 57
6 48

Remove 85 83
83 73 57 72 60 42 48 6 85 88 73 57
72 60 42 48
6

Remove 83 73
73 72 57 6 60 42 48 83 85 88 72 57
6 60 42 48

134
Binsort
A simple, ecient sort:
for (i=0; i<n; i++)
B[key(A[i])] = A[i];

[Only works for a permutation.]


Ways to generalize:
• Make each bin the head of a list. [Support
duplicates]
• Allow more keys than records. [I.e., larger key
space]
void binsort(ELEM *A, int n) {
list B[MaxKeyValue];
for (i=0; i<n; i++) B[key(A[i])].append(A[i]);
for (i=0; i<MaxKeyValue; i++)
for (B[i].first(); B[i].isInList(); B[i].next())
output(B[i].currValue());
}
Cost: [(n).]
[Oops! Actually, (n ∗ Maxkeyvalue)]
[Maxkeyvalue could be O(n ) or worse.]
2

135
Radix Sort
Initial List: 27 91 1 97 17 23 84 28 72 5 67 25
First pass Second pass
(on right digit) (on left digit)
0 0 1 5
1 91 1 1 17
2 72 2 23 25 27 28
3 23 3
4 84 4
5 5 25 5
6 6 67
7 27 97 17 67 7 72
8 28 8 84
9 9 91 97

Result of rst pass: 91 1 72 23 84 5 25 27 97 17 67 28


Result of second pass: 1 5 17 23 25 27 28 67 72 84 91 97

136
Cost of Radix Sort
void radix(Elem* A, Elem* B, int n, int k, int r,
int* count) {
// Count[i] stores number of records in bin[i]

for (int i=0, rtok=1; i<k; i++, rtok*=r) {// k digits


for (int j=0; j<r; j++) count[j] = 0; // Init

// Count number of records for each bin this pass


for (j=0; j<n; j++) count[(key(A[j])/rtok)%r]++;

// Index B: count[j] is index for last slot of j.


for (j=1; j<r; j++) count[j] = count[j-1]+count[j];

// Put recs into bins working from bottom of bin.


// Since bins fill from bottom, j counts downwards
for (j=n-1; j>=0; j--)
B[--count[(key(A[j])/rtok)%r]] = A[j];

for (j=0; j<n; j++) A[j] = B[j]; // Copy B to A


}
}

Cost: (nk + rk).


How do n, k and r relate?
[r can be viewed as a constant. k≥ log n if there are n distinct
keys.]

137
Radix Sort Example
Initial Input: Array A 27 91 1 97 17 23 84 28 72 5 67 25

0 1 2 3 4 5 6 7 8 9
First pass values for Count. 0 2 1 1 1 2 0 4 1 0
rtok = 1.

0 1 2 3 4 5 6 7 8 9
Count array: 0 2 3 4 5 7 7 11 12 12
Index positions for Array B.

End of Pass 1: Array A. 91 1 72 23 84 5 25 27 97 17 67 28

0 1 2 3 4 5 6 7 8 9
Second pass values for Count. 2 1 4 0 0 0 1 1 1 2
rtok = 10.

0 1 2 3 4 5 6 7 8 9
Count array: 2 3 7 7 7 7 8 9 10 12
Index positions for Array B.

End of Pass 2: Array A. 1 5 17 23 25 27 28 67 72 84 91 97

138
Empirical Comparison
Algorithm 10 100 1,000 10K 30K
Insert. Sort .00200 .1833 18.13 1847.0 16544
Bubble Sort .00233 .2267 22.47 2274.0 20452
Selec. Sort .00167 .0967 2.17 900.3 8142
Shellsort .00233 .0600 1.00 17.0 59
Shellsort/O .00233 .0500 .93 16.3 65
QSort .00367 .0500 .63 7.3 24
QSort/O .00200 .0300 .43 5.7 18
Merge .00700 .0700 .87 10.7 35
Merge/O .00133 .0267 .37 5.0 16
Heapsort .00900 .1767 2.67 36.3 122
Rad Sort/1 .02433 .2333 2.30 23.3 69
Rad Sort/4 .00700 .0600 .60 6.0 17
Rad Sort/8 .00967 .0333 .30 3.0 8
Algorithm 10 100 1,000 10K 100K
Insert. Sort .0002 .0170 1.68 168.8 23382
Bubble Sort .0003 .0257 2.55 257.2 41874
Selec. Sort .0003 .0273 2.65 267.5 40393
Shellsort .0003 .0027 0.12 1.9 40
Shellsort/O .0003 .0060 0.11 1.8 33
QSort .0004 .0057 0.08 0.9 12
QSort/O .0002 .0040 0.06 0.8 10
Merge .0009 .0130 0.17 2.3 30
Merge/O .0003 .0067 0.11 1.5 21
Heapsort .0010 .0173 0.26 3.5 49
Rad Sort/1 .0123 .1197 1.21 12.5 135
Rad Sort/4 .0035 .0305 0.30 3.2 34
Rad Sort/8 .0047 .0183 0.16 1.6 18
139
Sorting Lower Bound
Want to prove a lower bound for all possible
sorting algorithms.
Sorting is O(n log n).
Sorting I/O takes
(n) time.
Will now prove
(n log n) lower bound.
Form of proof:
• Comparison based sorting can be modeled
by a binary tree.
• The tree must have
(n!) leaves.
• The tree must be
(n log n) levels deep.

140
Decision Trees
XYZ
XYZ YZX
XZY ZXY
YXZ ZYX
Yes A[1]<A[0]? No
(Y<X?)
YXZ XYZ
YXZ XYZ
YZX XZY
ZYX ZXY
Yes No Yes No
A[2]<A[1]? A[2]<A[1]?
YZX (Z<X?) YXZ XZY (Z<Y?) XYZ
YZX XZY
ZYX ZXY
Yes No Yes No
A[1]<A[0]? A[1]<A[0]?
ZYX (Z<Y?) YZX ZXY (Z<X?) XZY
There are n! permutations, and at least 1 node
for each permutation.
A tree with n nodes has at least log n levels.
Where is the worst case in the decision tree?
log n! =
(n log n).
141
Primary vs. Secondary Storage
Primary Storage: Main memory (RAM)
Secondary Storage: Peripheral devices
• Disk Drives
• Tape Drives
Medium Price Price per Mbyte
32MB RAM $225 $7.00/MB
1.4MB oppy disk $.50 $0.36/MB
2.1GB disk drive $210 $0.10/MB
1GB JAZ cassette $100 $0.10/MB
2GB cartridge tape $20 $0.01/MB
RAM is usually volatile.
RAM is about 1/4 million times faster than
disk.

142
Golden Rule of File Processing
Minimize the number of disk accesses!
1. Arrange information so that you get what
you want with few disk accesses.
2. Arrange information so minimize future disk
accesses.
An organization for data on disk is often called
a le structure.
Disk based space/time tradeo : Compress
information to save processing time by reducing
disk accesses.

143
Disk Drives

Boom
(arm)

Platters
Read/Write Track
Spindle Heads
(a) (b)

[Spacing loses 1/2 of potential data density.]

Sectors

Intersector
Gaps Bits of data
[CD-ROM: Spiral with equally spaced dots, variable speed
rotation.]

144
Sectors
8 1 6 1

7 2 3 4

6 3 8 7

5 4 5 2

(a) (b)

A sector is the basic unit of I/O.


Interleaving factor: Physical distance between
logically adjacent sectors on a track, to allow
for processing of sector data.
Locality of Reference: If a record is read
from disk, the next request is likely to come
from near the same place in the le.
Cluster: Smallest unit of le allocation {
several sectors.
Extent: A group of physically contiguous
clusters.
Internal Fragmentation: Wasted space within
a sector if record size does not match sector
size, or wasted space within a cluster if le size
is not a multiple of cluster size.
145
Factors a ecting disk access time
1. Seek time: time for I/O head to reach
desired track. Largely determined by
distance between I/O head and desired
track. [If seeking at random, n = disk width/3.]
Seek time: f (n) = t ∗ n + s where t is time
to traverse one track and s is startup time
for the I/O head.
2. Rotational delay or latency: time for data
to rotate to I/O head position.
At 3600 RPM, one half rotation of the disk:
16.7/2 msec = 8.3 msec.
[Newer disks are running at 7200 RPM.]
3. Transfer time: time for data to move
under the I/O head.
Number of sectors read / Number of
sectors per track * 16.7 msec
[One disk rotation.]

146
Disk Access Cost Example
675 Mbyte disk drive
• 15 platters ⇒ 45 Mbyte/platter
• 612 tracks/platter
• 150 sectors/track ⇒ 512 bytes/sector
• 8 sectors/cluster (4K bytes/cluster) ⇒ 18
clusters/track
• Interleaving factor of 3 ⇒ 3 revolutions to
read one track (50.1 msec)
How long to read a le of 128 Kbytes divided
into 256 records of 512 bytes?
Number of Clusters:
If le lls minimum number of tracks:
150 sectors of one track, 106 of the next
Total time:
612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 + 0.08 + 3+
3.5 ∗ 16.7 = 139.3 msec.
If clusters are spread randomly across disk:
612 24
32 ∗ ( 3 ∗ 0.08 + 3 + 16.7/2 + 150 ∗ 16.7)
= 32 ∗ 30.3 = 969.6 msec.
147
Magnetic Tape
Example: 9 track tape at 6250 bytes per inch
(bpi).
At 2400 feet, this yields 170 Mbytes for $20, or
$0.12/Mbyte.
Workstation/PC cartridge tape is similar.
Magnetic tape requires sequential access.
Magnetic tape has two speeds:
• High speed for \skipping."
• Low speed for \reading."

Interblock Gap is space required for I/O head


to recognize beginning of a record at high
speed. This is a signi cant amount of space
compared to typical record size.
Blocking Factor: Number of records in a
block between gaps.

148
Bu ers
Read time for one track:
612/3 ∗ 0.08 + 3 + 3.5 ∗ 16.7 = 77.8 msec.
Read time for one sector:
612/3∗0.08+3+16.7/2+16.7/150 = 27.8 msec.
Read time for one byte:
612/3 ∗ 0.08 + 3 + 16.7/2 = 27.7 msec.
Nearly all disk drives read/write one sector at
every I/O access.
• Also called a page.

The information in a sector is stored in a


bu er or cache.
If next I/O access is to same bu er, then no
need to go to disk.
There are usually one or more input bu ers and
one or more output bu ers.
149
Bu er Pools
A series of bu ers used by an application to
cache disk data is called a bu er pool.
Virtual memory uses a bu er pool to imitate
greater RAM memory by actually storing
information on disk and \swapping" between
disk and RAM.
Organization for bu er pools: which one to use
next?
• First-in, First-out: Use the rst one on the
queue.
• Least Frequently Used (LFU): Count bu er
accesses, pick the least used.
• Least Recently Used (LRU):
Keep bu ers on linked list.
When a bu er is accessed, bring to front.
Reuse the one at the end.
Double Bu ering: Read data from disk for
next bu er while CPU is processing previous
bu er.
150
Programmer's View of Files
Logical view of les:
• An array of bytes.
• A le pointer marks the current position.

Three fundamental operations:


• Read bytes from current position (move le
pointer).
• Write bytes to current position (move le
pointer).
• Set le pointer to speci ed byte position.

151
C/C++File Functions
FILE *fopen(char *filename, char *mode);
void fclose(FILE *stream);

Mode examples:
• "rb": open a binary le, read-only.
• "w+t": create a text le for reading and
writing.
size t fread(void *ptr, size t size, size t n,
FILE *stream);
if(numrec !=
fread(recarr, sizeof rec, numrec, myfile))
its_an_error();

size t fwrite(void *ptr, size t size, size t


n, FILE *stream);

int fseek(FILE *stream, long offset, int


whence);
where whence is one of:
• SEEK SET: O set from le beginning.
• SEEK CUR: O set from current position.
• SEEK END: O set from end of le.
152
External Sorting
Problem: Sorting data sets too large to t in
main memory.
• Assume data stored on disk drive.

To sort, portions of the data must be brought


into main memory, processed, and returned to
disk.
An external sort should minimize disk accesses.

153
Model of External Computation
Secondary memory is divided into equal-sized
blocks (512, 2048, 4096 or 8192 bytes are
typical sizes).
The basic I/O operation transfers the contents
of one disk block to/from main memory.
Under certain circumstances, reading blocks of
a le in sequential order is more ecient.
(When?) [1) Adjacent logical blocks of le are physically
adjacent on disk. 2) No competition for I/O head.]
Typically, the time to perform a single block
I/O operation is sucient to Quicksort the
contents of the block.
Thus, our primary goal is to minimize the
number fo block I/O operations.
Most workstations today must do all sorting on
a single disk drive.
[So, the algorithm presented here is general for these
conditions.]

154
Key Sorting
Often records are large while keys are small.
• Ex: Payroll entries keyed on ID number.

Approach 1: Read in entire records, sort them,


then write them out again.
Approach 2: Read only the key values, store
with each key the location on disk of its
associated record.
If necessary, after the keys are sorted the
records can be read and re-written in sorted
order.
[But, this is not usually done. (1) It is expensive (random
access to all records). (2) If there are multiple keys, there is
no \correct" order.]

155
External Sort: Simple Mergesort
Quicksort requires random access to the entire
set of records.
Better: Modi ed Mergesort algorithm
• Process n elements in (log n) passes.

1. Split the le into two les.


2. Read in a block from each le.
3. Take rst record from each block, output
them in sorted order.
4. Take next record from each block, output
them to a second le in sorted order.
5. Repeat until nished, alternating between
output les. Read new input blocks as
needed.
6. Repeat steps 2-5, except this time the
input les have groups of two sorted
records that are merged together.
7. Each pass through the les provides larger
and larger groups of sorted records.
A group of sorted records is called a run.
156
Problems with Simple Mergesort
36 17 28 23 20 36 14 28 13 17 20 36

20 13 14 15 13 17 15 23 14 15 23 28

Runs of length 1 Runs of length 2 Runs of length 4

Is each pass through input and output les


sequential? [yes]
What happens if all work is done on a single
disk drive? [Competition for I/O head eliminates
advantage of sequential processing.]
How can we reduce the number of Mergesort
passes? [Read in a block (or several blocks) and do an
in-memory sort to generate large initial runs.]
In general, external sorting consists of two
phases:
1. Break the le into initial runs.
2. Merge the runs together into a single sorted
run.

157
Breaking a le into runs
General approach:
• Read as much of the le into memory as
possible.
• Perform and in-memory sort.
• Output this group of records as a single run.

158
Replacement Selection
1. Break available memory into an array for
the heap, an input bu er and an output
bu er.
2. Fill the array from disk.
3. Make a min-heap.
4. Send the smallest value (root) to the
output bu er.
5. If the next key in the le is greater than the
last value output, then
Replace the root with this key.
else
Replace the root with the last key in the
array.
Add the next record in the le to a new
heap (actually, stick it at the end of the
array).
Input Output
File Input Bu er RAM Output Bu er Run File

159
Example of Replacement Selection
Input Memory Output
12
16 19 31 12
25 21 56 40
16
29 19 31 16
25 21 56 40
29
19 31
25 21 56 40
19
14 21 31 19
25 29 56 40
40
21 31
25 29 56 14
21
35 25 31 21
40 29 56 14

160
Bene t from Replacement Selection
Use double bu ering to overlap input,
processing and output.
How many disk drives for greatest advantage?
Snowplow argument:
• A snowplow moves around a circular track
onto which snow falls at a steady rate.
• At any instant, there is a certain amount of
snow S on the track. Some falling snow
comes in front of the plow, some behind.
• During the next revolution of the snowplow,
all of this is removed, plus 1/2 of what falls
during that revolution.
• Thus, the plow removes 2S amount of
snow. Falling Snow

Is this always true? Future snow


Existing snow

Snowplow Movement
Start time T

161
Simple Mergesort may not be Best
Simple Mergesort: Place the runs into two les.
• Merge the rst two runs to output le, then
next two runs, etc.
This process is repeated until only one run
remains.
• How many passes for r initial runs? [log r]

Is there bene t from sequential reading?


[Not if all on one disk drive.]
Is working memory well used?
[No { only 2 blocks are used.]
Need a way to reduce the number of passes.
[And use memory better.]

162
Multiway Merge
With replacement selection, each initial run is
several blocks long.
Assume that each run is placed in a separate
disk le.
We could then read the rst block from each
le into memory and perform an r-way merge.
When a bu er becomes empty, read a block
from the appropriate run le.
Each record is read only once from disk during
the merge process.
In practice, use only one le and seek to
appropriate block.
Input Runs
5 10 15 ...
Output Bu er
6 7 23 ... 5 6 7 10 12 ...

12 18 20 ...
163
Limits to Single Pass Multiway
Merge
Assume working memory is b blocks in size.
How many runs can be processed at one time?
The runs are 2b blocks long (on average).
[Because of replacement selection.]
How big a le can be merged in one pass?
[2b ]
2

Larger les will need more passes { but the run


size grows quickly!
[In K merge passes, process 2b k
( +1)
blocks.]
This approach trades (log b) (possibly)
sequential passes for a single or a very few
random (block) access passes.
[Example: 128K → 32 4K blocks.
With replacement selection, get 256K-length runs.
One merge pass: 8MB. Two merge passes: 256MB.
Three merge passes: 8GB.]
164
General Principals of External
Sorting
In summary, a good external sorting algorithm
will seek to do the following:
• Make the initial runs as long as possible.
• At all stages, overlap input, processing and
output as much as possible.
• Use as much working memory as possible.
Applying more memory usually speeds
processing.
• If possible, use additional disk drives for
more overlapping of processing with I/O,
and allow for more sequential le
processing.

165
Search
Given: Distinct keys k1, k2, ... kn and
collection T of n records of the form
(k1, I1), (k2, I2), ..., (kn, In)
where Ij is information associated with key kj
for 1 ≤ j ≤ n.
Search Problem: For key value K , locate the
record (kj , Ij ) in T such that kj = K .
Searching is a systematic method for locating
the record (or records) with key value kj = K .
A successful search is one in which a record
with key kj = K is found.
An unsuccessful search is one in which no
record with kj = K is found (and presumably
no such record exists).

166
Approaches to Search
1. Sequential and list methods (lists, tables,
arrays).
2. Direct access by key value (hashing).
3. Tree indexing methods.

167
Searching Ordered Arrays
Sequential Search
Binary Search
int binary(int K, int* array, int left, int right) {
// Return pos of ELEM in array (if any) with value K
int l = left-1;
int r = right+1; // l, r beyond bounds of array
while (l+1 != r) { // Stop when l and r meet
int i = (l+r)/2; // Look at middle of subarray
if (K < array[i]) r = i; // In left half
if (K == array[i]) return i; // Found it
if (K > array[i]) l = i; // In right half
}
return UNSUCCESSFUL; // Search value not in array
}

Position 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Key 11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

Dictionary Search
168
Lists Ordered by Frequency
Order lists by (expected) frequency of
occurrance.
• Perform sequential search.

Cost to access rst record: 1


Cost to access second record: 2
Expected search cost:
C n = 1p1 + 2p2 + ... + npn
[pi is probability of ith record being accessed.]
Example: all records have equal frequency
n
X
Cn = i/n = (n + 1)/2.
i =1
Example: Exponential frequency
(
pi =
1 /2i if 1 ≤ i ≤ n − 1
1/2n−1 if i = n
[Second line is to make proabilities sum to 1.]
n
X
Cn ≈ (i/2i) ≈ 2.
i =1
169
Zipf Distributions
Applications:
• Distribution for frequency of word usage in
natural languages.
• Distribution for populations of cities, etc.

De nition: Zipf frequency for item i in the


distribution for n records as 1/iHn.
[Hn = Pni 1
=1 i
≈ log n.]
e

n
X
Cn = i/iHn = n/Hn ≈ n/ loge n
i =1
80/20 rule: 80% of the accesses are to 20% of
the records.
For distributions following the 80/20 rule,
C n ≈ 0.122n.

170
Self-Organizing Lists
Self-organizing lists modify the order of records
within the list based on the actual pattern of
record access.
Self-organizing lists use a rule called a heuristic
for deciding how to to reorder the list. These
heuristics are similar to the rules for managing
bu er pools.
[Bu er pools can be viewed as a form of self-organizing list.]
• Order by actual historical frequency of
access. (Similar to LFU bu er pool
replacement stratagy.)
• When a record is found, swap it with the
rst record on list.
• Move-to-Front: When a record is found,
move it to the front of the list. [Not worse
than twice \best arrangement."]
• Transpose: When a record is found, swap
it with the record ahead of it. [A bad example:
keep swapping last two elements.]

171
Example of Self-Organizing Tables
Application: Text compression.
Keep a table of words already seen, organized
via Move-to-Front Heuristic.
If a word not yet seen, send the word.
Otherwise, send the (current) index in the
table.
The car on the left hit the car I left.
The car on 3 left hit 3 5 I 5.
This is similar in spirit to Ziv-Lempel coding.

172
Searching in Sets
For dense sets (small range, many elements in
set):

Can use logical bit operators.


[1 if element in set, 0 otherwise.]
Example: To nd all primes that are odd
numbers, compute
0011010100010100 & 0101010101010101

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0

173
Hashing
Hashing: The process of mapping a key value
to a position in a table.
A hash function maps key values to positions
It is denoted by h.
A hash table is an array that holds the
records. It is denoted by T .
The hash table has M slots, indexed from 0 to
M − 1.

For any value K in the key range and some


hash function h,
h(K ) = i, 0 ≤ i < M , such that key(T [i]) = K .

174
Hashing (continued)
Hashing is appropriate only for sets (no
duplicates).
Good for both in-memory and disk based
applications.
Answers the question \What record, if any, has
key value K ?"
[Not good for range queries.]
Example: Store the n records with keys in
range 0 to n − 1.
• Store the record with key i in slot i.
• Use hash function h(K ) = K.

175
Collisions
More reasonable example:
• Store about 1000 records with keys in
range 0 to 16,383.
• Impractical to keep a hash table with
16,384 slots.
• We must devise a hash function to map the
key range to a smaller table.
Given: hash function h and keys k1 and k2.
β is a slot in the hash table.
If h(k1) = β = h(k2), then k1 and k2 have a
collision at β under h.
Search for the record with key K :
1. Compute the table location h(K ).
2. Starting with slot h(K ), locate the record
containing key K using (if necessary) a
collision resolution policy.
Collisions are inevitable in most applications.
• Example: 23 people are likely to share a
birthday.
176
Hash Functions
A hash function MUST return a value within
the hash table range.
To be practical, a hash function SHOULD
evenly distribute the records stored among the
hash table slots.
Ideally, the hash function should distribute
records with equal probability to all hash table
slots. In practice, success depends on the
distribution of the actual records stored.
If we know nothing about the incoming key
distribution, evenly distribute the key range
over the hash table slots while avoiding obvious
opportunities for clustering.
If we have knowlege of the incoming
distribution, use a distribution-dependant hash
function.

177
Example Hash Functions
int h(int x) {
return(x % 16);
}

This function is entirely dependant on the lower


4 bits of the key.
Mid-square method: square the key value,
take the middle r bits from the result for a
hash table of 2r slots.
[All bits contribute to the result.]
Sum the ASCII values of the letters and take
results modulo M .
int h(char x[10]) {
int i, sum;
for (sum=0, i=0; i<10; i++)
sum += (int) x[i];
return(sum % M);
}

[Only good if the sum is large compared to M .]

178
ELF Hash
From Executable and Linking Format (ELF),
UNIX System V Release 4.
int ELFhash(char* key) {
unsigned long h = 0;
while(*key) {
h = (h << 4) + *key++;
unsigned long g = h & 0xF0000000L;
if (g) h ^= g >> 24;
h &= ~g;
}
return h % M;
}

179
Open Hashing
What to do when collisions occur?
Open hashing treats each hash table slot as a
bin.

0 1000 9530
1
2
3 3013
4
5
6
7 9877 2007 1057
8
9 9879

180
Bucket Hashing
Divide the hash table slots into buckets.
• Example: 8 slots/bucket.

Include an over ow bucket.


Records hash to the rst slot of the bucket,
and ll bucket. Go to over ow if necessary.
When searching, rst check the proper bucket.
Then check the over ow.

181
Closed Hashing
Closed hashing stores all records directly in the
hash table.
Each record i has a home position h(ki).
If i is to be inserted and another record already
occupies i's home position, then another slot
must be found to store i.
The new slot is found by a
collision resolution policy.
Search must follow the same policy to nd
records not in their home slots.

182
Collision Resolution
During insertion, the goal of collision resolution
is to nd a free slot in the table.
Probe Sequence: the series of slots visited
during insert/search by following a collision
resolution policy.
Let β0 = h(K ). Let (β0, β1, ...) be the series of
slots making up the probe sequence.
void hashInsert(Elem R) { // Insert R into hash table T
int home; // Home position for R
int pos = home = h(key(R));// Initial pos on sequence
for (int i=1; key(T[pos]) != EMPTY; i++) {
pos = (home + p(key(R), i)) % M; // Next slot
if (key(T[pos]) == key(R)) ERROR; //No duplicates
}
T[pos] = R; // Insert R
}

Elem hashSearch(int K) { // Search for record w/ key K


int home; // Home position for K
int pos = home = h(K); // Initial pos on sequence
for (int i = 1; (key(T[pos]) != K) &&
(key(T[pos]) != EMPTY); i++)
pos = (home + p(K, i)) % M; // Next pos on sequence
if (key(T[pos]) == K) return T[pos]; // Found it
else return UNSUCCESSFUL; // K not in hash table
}

183
Linear Probing
Use the probe function
int p(int K, int i) { return i; }

This is called linear probing.


Linear probing simply goes to the next slot in
the table.
If the bottom is reached, wrap around to the
top.
To avoid an in nite loop, one slot in the table
must always be empty.

184
Linear Probing Example
0 1001 0 1001
1 9537 1 9537
2 3016 2 3016
3 3
4 4
5 5
6 6
7 9874 7 9874
8 2009 8 2009
9 9875 9 9875
10 10 1052

(a) (b)

Primary Clustering: Records tend to cluster


in the table under linear probing since the
probabilities for which slot to use next are not
the same.
[For (a): prob(3) = 4/11, prob(4) = 1/11, prob(5) = 1/11,
prob(6) = 1/11, prob(10) = 4/11.]
[For (b): prob(3) = 8/11, prob(4,5,6) = 1/11 each.]
185
Improved Linear Probing
Instead of going to the next slot, skip by some
constant c.
Warning: Pick M and c carefully.
The probe sequence SHOULD cycle through all
slots of the table.
[If M = 10 with c = 2, then we e ectively have created 2 hash
tables (evens vs. odds).]
Pick c to be relatively prime to M .
There is still some clustering.
• Example: c = 2. h(k1) = 3. h(k2) = 5.
• The probe sequences for k1 and k2 are
linked together.

186
Pseudo Random Probing
The ideal probe function would select the next
slot on the probe sequence at random.
An actual probe function cannot operate
randomly. (Why?)
Pseudo random probing:
• Select a (random) permutation of the
numbers from 1 to M − 1:
r1, r2, ..., rM−1
• All insertions and searches use the same
permutation.
Example: Hash table of size M = 101
• r1 = 2, r2 = 5, r3 = 32.
• h(k1) = 30, h(k2) = 28.
• Probe sequence for k1 is: [30, 32, 35, 62]
• Probe sequence for k2 is: [28, 30, 33, 60]

187
Quadratic Probing
Set the i'th value in the probe sequence as
(h(K ) + i2) mod M.
Example: M = 101.
• h(k1) = 30, h(k2) = 29.
• Probe sequence for k1 is: [30, 31, 34, 39]
• Probe sequence for k2 is: [21, 30, 33, 38]

188
Double Hashing
Pseudo random probing eliminates primary
clustering.
If two keys hash to same slot, they follow the
same probe sequence. This is called
secondary clustering.
To avoid secondary clustering, need a probe
sequence to be a function of the original key
value, not just the home position.
Double hashing:
p(K, i) = i ∗ h2(K ) for 0 ≤ i ≤ M − 1.

Be sure that all probe sequence constants are


relatively prime to M .
Example: Hash table of size M = 101
• h(k1) = 30, h(k2) = 28, h(k3) = 30.
• h2(k1) = 2, h2(k2) = 5, h2(k3) = 5.
• Probe sequence for k1 is: [30, 32, 34, 36]
• Probe sequence for k2 is: [28, 33, 38, 43]
• Probe sequence for k3 is: [30, 35, 40, 45]
189
Analysis of Closed Hashing
The load factor is α = N/M where N is the
number of records currently in the table.

Insert Delete
4

1
0 .2 .4 .6 .8 1.0

190
Deletion
1. Deleting a record must not hinder later
searches.
2. We do not want to make positions in the
hash table unusable because of deletion.
Both of these problems can be resolved by
placing a special mark in place of the deleted
record, called a tombstone.
A tombstone will not stop a search, but that
slot can be used for future insertions.
Unfortunately, tombstones do add to the
average path length.
Solutions:
1. Local reorganizations to try to shorten the
average path length.
2. Periodically rehash the table (by order of
most frequently accessed record).

191
Indexing
Goals:
• Store large les.
• Support multiple search keys.
• Support ecient insert, delete and range
queries.
Entry sequenced le: Order records by time
of insertion. [Not practical as a database organization.]
Use sequential search.
Index le: Organized, stores pointers to actual
records. [Could be a tree or other data structure.]
Primary key: A unique identi er for records.
May be inconvenient for search.
Secondary key: an alternate search key, often
not unique for each record. Often used for
search key.

192
Linear Indexing
Linear Index: an index le organized as a
simple sequence of key/record pointer pairs
where the key values are in sorted order.
If the index is too large to t in main memory,
a second level index may be used.
Linear indexing is good for searching variable
length records. [Also good for indexing an entry
sequenced le.]
Linear indexing is poor for insert/delete.
Linear Index

37 42 52 73 98

73 52 98 37 42
Database Records [This is an entry sequenced le.]

1 2003 5894 10528 [First key on each disk page.]


Second Level Index

1 2001 2003 5688 5894 9942 10528 10984


Linear Index: Disk Pages
193
Inverted List
Inverted list is another term for a secondary
index. A secondary key is associated with a
primary key, which in turn locates the record.
Jones AA10 AB12 AB39 FF37
Smith AX33 AX35 ZX45
Zukowski ZQ99
[An improvement over 1D list: 2D table can save space of
duplicating secondary keys. But, may waste space in rows.
Easier to update than linear list, but still requires some
shifting of values.]
Secondary Primary
Key Key
Jones AA10
Smith AB12
Zukowski AB39
FF37

AX33
AX35
ZX45
ZQ99

194
Inverted List (Continued)
Secondary Primary
Key Index Key Next
Jones 0 0 AA10 4
Smith 1 1 AX33 6
Zukowski 3 2 ZX45
3 ZQ99
4 AB12 5
5 AB39 7
6 AX35 2
7 FF37

[A linked list.]

195
Tree Indexing
Linear index is poor for insertion/deletion.
Tree index can eciently support all desired
operations:
• Insert/delete
• Multiple search keys [Multiple tree indices.]
• Key range search

Storing a tree index on disk causes additional


problems:
1. Tree must be balanced. [Minimize disk accesses.]
2. Each path from root to a leaf should cover
few disk pages.

5 4

3 7 2 6

2 4 6 1 3 5 7

(a) (b)
196
2-3 Tree
A 2-3 Tree has the following properties:
1. A node contains one or two keys.
2. Every internal node has either two children
(if it contains one key) or three children (if
it contains two keys).
3. All leaves are at the same level in the tree,
so the tree is always height balanced.
The 2-3 Tree also has a search tree property
analogous to the BST.
The advantage of the 2-3 Treeover the BST is
that it can be updated at low cost.
18 33

12 23 30 48

10 15 20 21 24 31 45 47 50 52

197
2-3 Tree Insertion
18 33

12 23 30 48

10 15 15 20 21 24 31 45 47 50 52
14 [Insert 14]

18 33

12 23 30 48 52

10 15 20 21 24 31 45 47 50 55

[Insert 55. Always insert at leaf node.]

198
2-3 Tree Splitting

23
20 23 30 20 30

19 21 24 31 19 21 24 31

(a)
[Insert 19] (b)

23

18 33

12 20 30 48

10 15 19 21 24 31 45 47 50 52

(c)

[All operations are local to original search path.]

199
B-Trees
The B-Tree is an extension of the 2-3 Tree.
The B-Tree is now the standard le
organization for applications requiring insertion,
deletion and key range searches.
1. B-Trees are always balanced.
2. B-Trees keep related records on a disk
page, which takes advantage of locality of
reference.
3. B-Trees guarantee that every node in the
tree will be full at least to a certain
minimum percentage. This improves space
eciency while reducing the typical number
of disk fetches necessary during a search or
update operation.

200
B-Trees (Continued)
A B-Tree of order m has the following
properties.
• The root is either a leaf or has at least two
children.
• Each node, except for the root and the
leaves, has between dm/2e and m children.
• All leaves are at the same level in the tree,
so the tree is always height balanced.
A B-Tree node is usually selected to match the
size of a disk block.
A B-Tree node could have hundreds of children.

201
B-Tree Example
Search in a B-Tree is a generalization of search
in a 2-3 Tree.
1. Perform a binary search on the keys in the
current node. If the search key is found,
then return the record. If the current node
is a leaf node and the key is not found,
then report an unsuccessful search.
2. Otherwise, follow the proper branch and
repeat the process.
24

15 20 33 45 48

10 12 18 21 23 30 31 38 47 50 52 60

202
B+-Trees
The most commonly implemented form of the
B-Tree is the B+-Tree.
Internal nodes of the B+-Tree do not store
records { only key values to guide the search.
Leaf nodes store records or pointers to records.
A leaf node may store more or less records than
an internal node stores keys.
[Assume leaves can store 5 values, internal notes 3 (4
children).]
33

18 23 48

10 12 15 18 19 20 21 22 23 30 31 33 45 47 48 50 52

[Link leaf nodes in a linked list.]

203
B+-Tree Insertion
[Note special rule for root: May have only two children.]

33

10 12 23 33 48 10 12 23 33 48 50
[(b) Add (a)
50.] [Add 45, 52, 47 (split),(b)18, 15, 31 (split), 21,
20.]
18 33 48

10 12 15 18 20 21 23 31 33 45 47 48 50 52

[Add 30 (split).] (c)

33

18 23 48

10 12 15 18 20 21 23 30 31 33 45 47 48 50 52
(d)

204
B+-Tree Deletion
[Simple delete { delete 18 from original example.]
33

18 23 48

10 12 15 19 20 21 22 23 30 31 33 45 47 48 50 52

[Delete of 12 form original example: Borrow from sibling.]

33

19 23 48

10 15 18 19 20 21 22 23 30 31 33 45 47 48 50 52

205
B-Tree Space Analysis
B+-Tree nodes are always at least half full.
The B∗-Tree splits two pages for three, and
combines three pages into two. In this way,
nodes are always 2/3 full.
Asymptotic cost of search, insertion and
deletion of records from B-Trees, B+-Trees and
B∗-Trees is (log n). (The base of the log is
the (average) branching factor of the tree.)
Example: Consider a B+-Tree of order 100
with leaf nodes containing 100 records.
1 level B+-Tree: [Max: 100]
2 level B+-Tree: [Min: 2 leaves of 50 for 100 records.
Max: 100 leaves with 100 for 10,000 records.]
3 level B+-Tree: [Min: 2 × 50 nodes of leaves for 5000
records. Max: 100 = 1, 000, 000 records.]
3

4 level B+-Tree: [Min: 250,000 records (2 * 50 * 50 *


50). Max: 100 million records (100 * 100 * 100 * 100).]
Ways to reduce the number of disk fetches:
• Keep the upper levels in memory.
• Manage B+-Tree pages with a bu er pool.
206