
Useful external references:

http://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/lec20-amortized/amortized.htm
https://anh.cs.luc.edu/363/notes/06A_Amortizing.html
https://courses.engr.illinois.edu/cs225/sp2019/notes/disjoint-sets/ (very important notes)
Course outline

Arrays and Linked Lists
  Video: Arrays (7 min)
  Video: Singly-Linked Lists (9 min)
  Video: Doubly-Linked Lists (4 min)
  Reading: Slides and External References (10 min)
Stacks and Queues
  Video: Stacks (10 min)
  Video: Queues (7 min)
  Reading: Slides and External References (10 min)
Trees
  Video: Trees (11 min)
  Video: Tree Traversal (10 min)
  Reading: Slides and External References (10 min)
Practice Quiz: Basic Data Structures (5 questions)
Programming Assignment 1
  Reading: Available Programming Languages (10 min)
  Reading: FAQ on Programming Assignments (10 min)
  Programming Assignment: Basic Data Structures (2h)
Acknowledgements (Optional)
  Reading: Acknowledgements (10 min)
Arrays
So in this lecture we're talking about arrays and linked lists.
In this video, we're going to talk about arrays.
So here are some examples of array declarations in a couple of different languages. Alongside them we can see a one-dimensional array laid out with five elements in it, and then a two-dimensional array with two rows and five columns.
So what's the definition of an array? Well, we've got basically a contiguous chunk of memory. That is, one chunk of memory. It can either be on the stack or in the heap; it doesn't really matter where it is.
It is broken down into equal-sized elements, and each of those elements is indexed by contiguous integers. All three of these things are important for defining an array.
Here, in this particular example, we have an array whose indices are from 1 to 7. In many languages,
the same indices for this particular array would be from zero to six. So it would be zero based
indexing, but one based indexing is also possible in some languages. And other languages allow you to
actually specify what the initial index is.
What's so special about arrays? Well, the key point about an array is we have random access. That is,
we have constant time access to any particular element in an array. Constant time access to read,
constant time access to write.
How does that actually work? Well basically what that means is we can just do arithmetic to figure
out the address of a particular array element.
So the first thing we need to do is start with the address of the array.
So we take the address of the array and add to it the element size times (i minus the first_index), where i is the index of interest. This is where the key fact that every element is the same size matters: it lets us do a simple multiplication. If the array elements were of different sizes, we'd have to sum the sizes of all the preceding elements, and summing n items would take O(n) time.
If we're doing zero based indexing, that first index isn't really necessary. I like this example because it
really shows a more general case where we do have a first index.
Let's say for instance we're looking at the address for index four. We would take four minus the first
index, which is one, which would give us three. Multiply that by whatever our element size is, and
then add that to our array address. Now of course, we don't have to do this work, the compiler or
interpreter does this work for us, but we can see how it is that it works in constant-time.
Many languages also support multi-dimensional arrays; if not, you can actually roll your own by doing the arithmetic yourself, as in the example I'll show you here. So let's say that the top left element is at index (1, 1), and we want the element at index (3, 4), meaning row 3, column 4. How do we find the address of that element? Well, first we need to skip the full rows that come before it. That is, we skip 3 minus 1 rows, the row index minus the initial row index, which gives us 2 times 6 or 12 elements to skip in order to get to row 3. Then we've got to skip the elements before (3, 4) in the same row. There are three of them: we take the column index, which is 4, and subtract the initial column index, which is 1. So this gives us 15 elements in total.
Six for the first row, six for the second row and then three for the third row before this particular
element. We take that 15 and multiply it by our element size and then add it to our array address.
And that will give us the address of our element (3,4).
Now we made kind of a supposition here. And that was that the way this was laid out is we laid out all
the elements of the first row, followed by all of the elements of the second row, and so on. That's
called row-major ordering or row-major indexing. And what we do is basically, we lay out, (1, 1), (1,
2), (1, 3), (1, 4), (1, 5), (1, 6). And then right after that in memory (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2,
6). So the column index is changing most rapidly as we look at successive elements, and that's the hallmark of row-major indexing.
We could lay out arrays differently, and some languages or compilers actually do that, where they
would lay out each column in order, so you'd have the first column, then the second column, and
then the third column. And so that, then, the successive elements would be (1, 1), (2, 1), (3, 1),
followed by (1, 2), (2, 2), (3, 2), and so on.
So there we see that the row index is changing most rapidly, and this is called column-major ordering.
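As a sketch, the two orderings just described reduce to the following offset formulas; multiply the returned offset by the element size and add the array's base address to get the actual address. The function names and default 1-based indices are illustrative choices matching the lecture's example:

```python
def row_major_offset(row, col, n_cols, first_row=1, first_col=1):
    # Skip the full rows before this one, then the elements before (row, col).
    return (row - first_row) * n_cols + (col - first_col)

def column_major_offset(row, col, n_rows, first_row=1, first_col=1):
    # Same idea with the roles of rows and columns swapped.
    return (col - first_col) * n_rows + (row - first_row)

row_major_offset(3, 4, 6)  # (3 - 1) * 6 + (4 - 1) = 15, as in the example
```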
How long does it take to perform operations? We already said that reading any element is O(1) and writing any element is O(1). That is a standard feature of arrays. What happens if we want to add an element at the end of an array? So let's say we have allocated seven elements for an array but we're only using four of them, and we have kept track that we're using four. We want to add a fifth element, and again there's room for seven. Then all we need to do is add it and update the number of elements that are in use. That's an O(1) operation. If we want to remove the last element, that's also O(1), because we just update the number of elements that are in use.
Where it gets to be expensive, is if we want to, for instance, remove the first element. So we remove
the five here, and what we've got to do then, we don't want to have holes left in it. So we need to
move the 8 down, move the 3 down, move the 12 down. That's an O(n) operation.
The same thing would happen if we wanted to insert at the beginning: we would need to move the 12, move the 3, and move the 8 to make space for our new element. So that also would be O(n).
And if we want to add or remove somewhere in the middle, again that's an O(n) operation. If we want to add exactly in the middle, we have to move n/2 items, which is O(n); same thing for removal. So arrays are great if you want to add or remove at the end, but it's expensive if you want to add or remove in the middle or at the beginning.
However, remember, a huge advantage for arrays is that we have this constant time access to
elements, either read or write.
In summary then, an array consists of a contiguous area of memory, because if it were non-contiguous we couldn't do this simple arithmetic to get where we're going. It has equal-sized elements, again so our arithmetic works, and it is indexed by contiguous integers, again so our arithmetic works.
We have constant time access to any element, constant time to add or remove at the end and linear
time to add and remove at any arbitrary location.
In our next video we're going to talk about linked lists.
Singly-Linked Lists
Now let's talk about linked lists.
So linked lists, it's named kind of like links in a chain, right, so we've got a head pointer that points to
a node that then has some data and points to another node, points to another node and eventually
points to one that doesn't point any farther. So here in our top diagram we show head points to the
node containing 7, points to the node containing 10, points to the node containing 4, points to the
node containing 13 doesn't point anywhere. How this actually works is that a node contains a key
which in this case is these integers, and a next pointer. The diagram below shows more detail of
what's going on. So head is a pointer that points to a node, and that node contains two elements, the
value 7. And then a pointer that points off to the next node that contains a key 10, and a pointer that
points off to the next node 4, points off to the next node 13, and 13's next pointer is just nil.
What are the operations that can be done on a linked list? There's several of them, and the names of
these sometimes are different, in different environments and different libraries. But normally the
operations provided are roughly these. So we can add an element to the front of the list, and that
we're calling PushFront. So that takes a key, adds it to the front of the list. We can return the front
element of the list. We're calling that TopFront. Or we can remove the front element of the list, called
PopFront. The same things that we can do at the front of the list, we can also do at the end of the list.
With PushBack, later on in a later module, we'll actually use the word Append for that, or TopBack, or
PopBack.
These seem uniform in there, but there is a difference in that the runtimes are going to be different
between those, and we're going to talk about that.
You can find whether an element is in the list, and that's as simple as running down the list looking for a matching key.
You can erase an element: again, run down the list until you find the matching key, and then remove that element. These latter two operations are both O(n) time.
Is the list empty or not? That's as simple as checking is the head equal to nil.
We can also splice a particular key into the list: we can add a key either before a given node or after a given node.
So lets look at the times for some common operations.
We've got here our list with four elements in it: 7, 10, 4, and 13. Now we go ahead and push an
element to the front. So we push 26 to the front of the list. So the first thing we do, create a node
that contains the 26 as its key. And then we update our next pointer of that node to point to the
head, which is the 7 element, and then update the head pointer to point to our new node, and that's
it we're done. So it's O(1). Allocate, update one pointer, update another pointer, constant time.
If we want to pop the front element, clearly finding the front element is very cheap, right? You can just look at the first element and return it, so TopFront is O(1). PopFront also turns out to be O(1): first update the head pointer, then remove the node. That's an O(1) operation.
If we want to push at the back, and we don't have a tail pointer, we're going to talk about a tail
pointer in a moment, then it's going to be a fairly expensive operation. We're going to have to start at
the head and walk our way down the list until we get to the end, and add a node there, so that's going
to be O(n) time.
Similarly if we want to TopBack or PopBack, we're going to also have to start at the head, walk our
way down to the last element. Those are all going to be O(n) time.
If we had a tail pointer, some of these will become simpler. Okay, so, we're going to have both a head
pointer that points to the head element and a tail pointer that points to the tail element. So, that
way, getting the first element is cheap. Getting the last element is cheap.
Let's look at what happens when we try an insert when we have a tail. We allocate a node, put in our new key, update the next pointer of the current tail to point to this new node, and then update the tail pointer itself. An O(1) operation.
Retrieving the last element, a TopBack, is also an O(1) operation: we just go to the tail, find the element, and return its key.
If we want to pop the back, however, that's a more expensive operation. We need to move the tail pointer from 8 to 13: we're at 8 right now and we want to get to 13. The problem is, how do we get to 13?
We don't have a pointer from 8 to 13; we have a pointer from 13 to 8, and that pointer doesn't help us going backwards. So what we've got to do is, again, start at the head and walk our way down until we find the node 13 that points to the current tail, then update our tail pointer to point to it, set its next pointer to nil, and remove the old tail. That's an O(n) operation, because we've got to walk all the way down there: even though we have a tail pointer, we don't have a pointer to the next-to-last element.
The head is different because our pointers point forwards: from the head it's also cheap to get the second element, and one more step gives the third. But the tail pointer doesn't help us get to the next-to-last element.
Let's look at some of the code for this, so for PushFront we have a singly linked list: we're going to
allocate a new node, set its key, set its next to point to the old head and then we'll update the current
head pointer.
If the tail is equal to nil, that meant that before the insertion, the head and the tail were nil, it was an
empty list. So we've got to update the tail to point to the same thing the head points to.
Popping the front, well, if we're asked to pop the front on an empty list, that's an error. So that's the
first check we do here and then we just update the head to point now to the head's next. And just in
case that there was only one element in the list and now there are no elements, we check if our new
head is nil and if so update our tail to also be nil. Pushing in the back: allocate a new node, set its key,
set its next pointer, and then check the current tail. If the current tail is nil again, it's an empty list.
Update the head and the tail to point to that new node. Otherwise update the old tail's next to point
to our new node, and then update the tail to point to that new node.
Popping the back.
More difficult, right. If it's an empty list and we're trying to pop, that's an error. If the head is equal to
tail, that means we have one element. So we need to just update the head and the tail to nil.
Otherwise we've got to start at the head, and start working our way down, trying to find the next to
the last element. When we exit the while loop, p will be the next to last element, and we then update
its next pointer to nil.
And set our tail equal to that element.
Adding after a node? Fairly simple in a singly linked list. Allocate a new node,
set its next pointer to the next pointer of the node we're adding after. So we splice it in, and then we need to update the next pointer of the node we're adding after so that it now points to our new node. And just in case the node we're adding after was the tail, we've got to update the tail to that new node.
Adding before, we have the same problem we had in terms of PopBack in that we don't have a link
back to the previous element. So we have no way of updating its next pointer other than going back
to the beginning of the head and moving our way down until we find it. So AddBefore would be an
O(n) operation.
So let's summarize what the cost of things are. PushFront, O(1).
TopFront, PopFront: all O(1). Pushing at the back: O(n), unless we have a tail pointer, in which case it's O(1).
TopBack: O(n), again unless we have a tail pointer, in which case it's O(1). Popping the back: an O(n) operation, with or without a tail.
Finding a key is O(n): we just walk through the list trying to find the particular element. Erasing is also O(n). Checking whether the list is empty is as simple as checking whether the head is nil. Adding before: O(n), because finding the previous element takes O(n) since we have to walk all the way from the head. AddAfter: constant time.
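A minimal singly-linked list along the lines of the pseudocode above might look like this in Python. This is an illustrative sketch, with method names mirroring the lecture's PushFront, PopFront, PushBack, and PopBack:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None

class SinglyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def push_front(self, key):            # O(1)
        node = Node(key)
        node.next = self.head
        self.head = node
        if self.tail is None:             # list was empty
            self.tail = node

    def pop_front(self):                  # O(1)
        if self.head is None:
            raise IndexError("empty list")
        key = self.head.key
        self.head = self.head.next
        if self.head is None:             # list is now empty
            self.tail = None
        return key

    def push_back(self, key):             # O(1) with a tail pointer
        node = Node(key)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node

    def pop_back(self):                   # O(n): must find the next-to-last node
        if self.head is None:
            raise IndexError("empty list")
        if self.head is self.tail:        # single element
            key = self.head.key
            self.head = self.tail = None
            return key
        p = self.head
        while p.next is not self.tail:    # walk to the next-to-last node
            p = p.next
        key = self.tail.key
        p.next = None
        self.tail = p
        return key
```

Note how pop_back has to walk the whole list even though we keep a tail pointer; that is exactly the O(n) cost discussed above.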
Doubly-Linked Lists
There is a way to make popping the back and adding before cheap. 
Our problem was that although we had a way to get from 
a previous element to the next element, we had no way to get back. 
And what a doubly-linked list says is, well, let's go ahead and 
add a way to get back. 
So we'll have two pointers, forward and back pointers. 
That's the bidirectional arrow we're showing here conceptually. 
And the way we would actually implement this is, 
with a node that adds an extra pointer. 
So we have not only a next pointer, we have a previous pointer. 
So this shows for example that the 10 element has a next 
pointer that points to 4 but a previous pointer that points to 7. 
So at any node we can either go forward or we can go backwards.
So that means if we're trying to pop the back, that's going to work pretty well. 
What we're going to do is update the tail pointer to point to the previous element 
because now we can get there in an O(1) operation. 
And then update its next pointer to be nil and then finally remove the node. 
So that's O(1). 
So if we have a doubly-linked list, our code is slightly more complicated, because we've got to make sure to manage the prev pointers as well as the next pointers. 
So if we're pushing something in the back, we'll allocate a new node. 
If the tail is nil, which means it's empty, then we just have a single node 
whose prev and next pointers are both nil and then head and tail both point to it. 
Otherwise, we need to update the tail's next pointer for 
this new node, because we're pushing at the end and 
then go update the prev pointer of this new node to point to the old tail and 
then finally update the tail pointer itself.
Popping the back, also pretty straightforward. 
We're going to again check to see whether this is first an empty list, 
in which case it's an error. 
A list with only one element, in which case it's simple. 
Otherwise we're going to go ahead and 
update our tail to be the prev tail, and the next of that node to be nil. 
Adding after is fairly simple again; we just need to maintain the prev pointer. But adding before also now works: we allocate our new node, and its prev pointer will be the prev pointer of the existing node we're adding before. We splice it in that way and then update the next pointer of that previous node to point to our new node.
And finally, just in case we're adding before the head, 
we need to update the head. 
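A sketch of those doubly-linked operations in Python (illustrative names, not the lecture's exact code). With prev pointers available, pop_back and add_before no longer need to walk the list:

```python
class DNode:
    def __init__(self, key):
        self.key = key
        self.next = None
        self.prev = None

class DoublyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def push_back(self, key):
        node = DNode(key)
        if self.tail is None:             # empty list: single node, head = tail
            self.head = self.tail = node
        else:
            self.tail.next = node
            node.prev = self.tail
            self.tail = node

    def pop_back(self):                   # O(1) now: tail.prev replaces the walk
        if self.tail is None:
            raise IndexError("empty list")
        key = self.tail.key
        if self.head is self.tail:        # single element
            self.head = self.tail = None
        else:
            self.tail = self.tail.prev
            self.tail.next = None
        return key

    def add_before(self, node, key):      # O(1): splice in using node.prev
        new = DNode(key)
        new.next = node
        new.prev = node.prev
        node.prev = new
        if new.prev is not None:
            new.prev.next = new
        else:                             # we added before the head
            self.head = new
```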
So in a singly-linked list, we saw the cost of things. 
Working with the front of the list was cheap, 
working with the back of the list with no tail, was all linear time. 
If we added a tail, it was easy to push something at the end, 
easy to retrieve something at the end, but hard to remove something at the end. 
By switching to a doubly linked list, 
removing from the end (a PopBack) becomes now an O(1) operation, 
as does adding before which used to be a linear time operation.
One thing to point out as we contrast arrays versus linked lists. 
So in arrays, we have random access, 
in a sense that it's constant time to access any element. 
That makes things like binary search very simple: if we have a sorted array, we start searching in the middle, decide which side of the array our target is on, and then go to one side or the other.
For a linked list, that doesn't work. 
Finding the middle element is an expensive operation. 
because you've got to start either at the head or the tail and 
work your way into the middle. 
So that's an O(n) operation to get to any particular element. 
Big difference in between that and an array.
However, linked lists are constant time to insert at or remove from the front, 
unlike arrays. 
What we saw in arrays, if you want to insert from the front, or 
remove from the front, it's going to take you O(n) time because you're going to have to move 
a bunch of elements.
If you have a tail pointer and a doubly-linked list, it is also constant time to work at the end of the list, so you can add at or remove from there.
It's linear time to find an arbitrary element, because the list elements are not contiguous as they are in an array.
You have separately allocated locations of memory and 
then there are pointers between them. 
And then, with a doubly-linked list it's also constant time to insert 
between nodes or to remove a node.
Slides and External References

Slides
Download the slides on arrays and linked lists:

05_1_arrays_and_lists.pdf

References
See Section 10.2 in [CLRS]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

Stacks and Queues


Stacks
So now we're going to start talking about two very important data structures, stacks and queues. In
this video we're going to talk about stacks.
So, what is a stack? It's an abstract data type, and here are the operations we have. We can push a key onto our collection of values.
We can find the most recently added key with Top. And we can Pop, which returns and removes the
most recently added. So, the way to think of it is as if you have a stack of books. You can put a book
on top of the stack, or you can take a book from the top of the stack. But you can't take the element
at the bottom of the stack of books without taking off all of the previous elements. So it's really pretty
simple: push items on, find what the top one is, pop off the top one, and intermingle these operations. Last one: you can ask whether the stack is empty.
This turns out to be really useful for lots and lots of things where you need to keep track of what has happened, in this particular order. Let's look at an example. So let's say we've got a balanced
brackets problem. So, here we have a string. And it has left parens, right parens, left square brackets,
and right square brackets. And we want to determine whether or not that string of parentheses and
square brackets, whether they're balanced.
Balanced meaning there's a matching left paren for every right paren, they're in the right order, and they don't cross over. Let's look at some examples of unbalanced and balanced. So, the
first string here left paren, square bracket, matching square bracket, matching right paren, square
bracket, matching square bracket, left paren, matching paren, balanced. The second one also
balanced. In the first unbalanced example, we have a left paren with no matching right paren, assuming that the two parens to its right match each other. The square brackets match, but then
we have got an unmatched right bracket. In the last case we've got a square left bracket and a square
right bracket, but the problem is that they are in the wrong order. It is the square right bracket
followed by the square left bracket.
How do we keep track of that? And the problem is that in some cases we have to kind of keep track of
a lot of information. For instance, in the second example, here, we've got our opening left paren.
Doesn't get matched with the right paren for quite a while. There's a lot of intervening stuff, and we
have to sort of keep track that we've got a left paren whose right paren we need to match, even as all
this other stuff happens. And it turns out a stack is a good way to keep track of it, so here's what we'll
do. We'll create a stack and then we'll go through every character in the string.
If we have an opening paren or an opening square bracket, we'll go ahead and push it on the stack. So
the stack is going to represent the parens that are still open, the parens and brackets which have yet
to be matched and the order in which they need to be matched, so the outermost ones will be at the
bottom of the stack and the last one we saw (the innermost one) would be at the top of the stack.
Then if it's not one of these opening characters: if our stack is empty, that's a problem, because basically we've got a closing paren or bracket and there's no matching element. So if the stack is empty, no, we're not balanced. Otherwise we'll pop the top element off and check whether it matches the character we've got. So, if the top was a
left paren, did we just read a right paren. If so, great. They match. And now those two match, the next
one we need to match is still the newly top one on the stack or similarly if we have a square bracket
on the stack and a square bracket we read, those match as well. If they don't match, then we've got a
problem, right? If we've got a left paren on the stack and a right bracket that we just read, those don't match, for example. So then we return false.
Once we've run through the whole string, are we matched? No, not necessarily. Imagine we have the string: left paren, left paren, left paren. We go through, we push left paren, we push left paren, we push left paren, and then we'd be done. We never return false, but it's no good, because we didn't actually match what's on the stack. So, once we go through all of the characters in the string,
then we're going to have to check and make sure, is our stack empty? Did we successfully match
everything? This is only one example of the use of stacks. Stacks are used in lots of other places.
They're used for compilers. They're used in a lot of algorithms that you'll be seeing throughout this
course.
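The balanced-brackets procedure just described might be sketched like this in Python, handling only the two bracket kinds from the lecture; any other characters are ignored:

```python
def is_balanced(text):
    matching = {')': '(', ']': '['}
    stack = []
    for ch in text:
        if ch in '([':
            stack.append(ch)            # still-unmatched openers, innermost on top
        elif ch in matching:
            if not stack:               # closing bracket with nothing open
                return False
            if stack.pop() != matching[ch]:
                return False            # wrong kind, e.g. '[' closed by ')'
    return not stack                    # leftover openers mean unbalanced

is_balanced('([])[]()')  # True
is_balanced('(((')       # False: the openers never get matched
```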
So how do you actually implement a stack? Well, let's see. You can implement a stack with an array
fairly easily, so allocate an array of some maximum stack size.
So, in this case, we decided it's five, just for the sake of example, and we're going to keep a variable,
which is the number of elements that are actually in the stack.
When we push, in this case pushing a, we're going to put it at the end of the elements we have so far; that is, we append it to them. In this case we put it at the beginning of the array, because we haven't used any elements yet, and we keep track of the number of elements as well. We push b, put it in the next spot, and now our number of elements is two. Fairly straightforward, right? We're just appending to the array, and these are clearly O(1) operations.
What's the top element? Well, that's really simple: if the number of elements is two, that means we need the second element from the array, which in this case is b. Again, a constant-time operation.
We push c, our number of elements is three, and now let's say we pop. Well, what do we do? We
need to go get the third element, which is c, and erase it, and then adjust numElements so it's now 2.
Now we can push an element, push another element,
push a fifth element, and now if we try to push again, that's an error, right? Because we don't have
any more space. So, that wouldn't be allowed.
Is it empty? No, how do we know? Because the number of elements is greater than zero.
Again, an O(1) operation.
And now we can start popping, which will be returning the appropriate element of the array based on
the numElements value, and keep popping until we get down to no elements.
If we're at no elements and we ask whether it's empty? Yes, that's true.
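Putting the array-based stack together, here's a rough Python sketch of what was just described; the class and method names are my own:

```python
class ArrayStack:
    """Bounded stack on a fixed-size array, as in the lecture."""
    def __init__(self, max_size=5):
        self.data = [None] * max_size   # allocate the maximum stack size up front
        self.num_elements = 0           # how many slots are actually in use

    def push(self, key):
        if self.num_elements == len(self.data):
            raise OverflowError("stack is full")   # no more space: error
        self.data[self.num_elements] = key
        self.num_elements += 1

    def top(self):
        return self.data[self.num_elements - 1]

    def pop(self):
        key = self.data[self.num_elements - 1]
        self.data[self.num_elements - 1] = None    # erase the slot
        self.num_elements -= 1
        return key

    def empty(self):
        return self.num_elements == 0
```

Every operation just reads or writes one array slot and adjusts the counter, so each is O(1).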
We can also implement a stack with a linked list. One limitation of the array is that we have a maximum size, based on the array we initially allocated.
But all of the operations are O(1), which is good.
The other potential problem is wasted space. If we allocated a very large array to allow a possibly large stack, but didn't actually use much of it, all the rest of it is wasted. If we
have a linked list, what we do then is every element in the list of course will represent a particular
element in the stack. And so we'll push a, and then if we push b, we're going to go ahead and push b
at the front. So basically, Push will turn into PushFront. If we want to get the top element, we just get the head element, so Top will really just be TopFront. Pushing at the front or popping at the front, all of those are O(1) operations. We can keep pushing, and the nice thing about this is
there's no a priori limit as to the number of elements you can add. As long as you have available
memory, you can keep adding. There's an overhead, though: in the array, each element is just big enough to store our key, while here we've got the overhead of storing a pointer as well. On
the other hand there's no wasted space in terms of allocated space that isn't actually being used. So
we keep pushing, is it empty? No, because the head is not nil.
And then we can go ahead and pop. So, if we have a linked list, it's very simple to implement the stack
operations in terms of the linked list operations. Push becomes PushFront.
Top becomes TopFront, and Pop, which is supposed to both return and remove the top element, becomes a combination of a TopFront followed by a PopFront. Empty is just Empty. We keep popping
and then eventually we pop the last element and now if we ask whether it's empty, the answer will be
true.
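A sketch of that linked-list-backed stack, with each stack operation expressed via the list operations named above (a minimal singly-linked node is assumed; the names are my own):

```python
class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ListStack:
    """Unbounded stack on a singly-linked list: all work happens at the head."""
    def __init__(self):
        self.head = None

    def push(self, key):                 # Push becomes PushFront
        self.head = Node(key, self.head)

    def top(self):                       # Top becomes TopFront
        return self.head.key

    def pop(self):                       # Pop = TopFront + PopFront
        key = self.head.key
        self.head = self.head.next
        return key

    def empty(self):
        return self.head is None         # empty exactly when the head is nil
```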
Okay, so that's our stack implementation. Stacks can be implemented with either arrays or linked lists,
I talked a little bit about the pros and cons of each of those: the linked list has a fixed amount of overhead, that is, for every element you are pushing, you have an additional pointer. For arrays, you have, potentially, space that you've over-allocated to allow for the stack to grow to its maximum size. For arrays, stacks do have a maximum size; for linked lists, they don't. Each stack operation is constant time for either one of these implementations. Sometimes we know stacks as LIFO queues, LIFO meaning Last In First Out: the last one that was inserted is the first one that comes out. This
reminds me sometimes of also what's known as GIGO, Garbage In Garbage Out. That if you input
garbage into a system, you get garbage out. But of course this is different.
So that is stacks. In the next video we're going to go ahead and look at queues. Thanks.
Now let's talk about queues.
So, a queue has some similarities with a stack. 
But in a fundamental way is different. 
So it's an abstract data type and these are the operations that it has. 
You can Enqueue a key, it adds the key to the collection. 
And then when you Dequeue, that gives you back a key and removes it from the queue.
It removes and returns the least recently added key, 
rather than in the case of a stack, the most recently added key. 
So that's the fundamental difference. 
If you think about queues as like queuing up in line or 
waiting in line, this is a first come first serve situation. 
So the longer you've been waiting in line, so 
the longest person waiting in line is the next person to be served. 
Makes sense. 
So you can imagine if you had a grocery store that had 
a stack that it used for serving people, people would be pretty annoyed, right? 
Because if you've been waiting in line ten minutes and a person just arrives, they'd get served before you, and that would not make you happy. 
So, queues are very useful for instance, for things like servers. 
Where you've got a bunch of operations coming in and 
you want to service the one that's been waiting the longest.
The other operation is you can find out whether the queue is empty or not. 
So these are often called FIFO, first in, first out, 
and this distinguishes them from stacks which are LIFO. 
Last in, first out. 
First in first out, or first come first serve, same thing.
How can you implement a queue? 
Well, one way is with a linked list, where you have a head and a tail pointer. 
So let's say we start out with an empty linked list. 
We can go ahead and Enqueue, and what we're going to do basically in an Enqueue, 
is we are going to push to the back of the linked list, so 
that's how we'll implement Enqueue. 
So here, we Enqueue (a), it's now at the back of the linked list. 
If we Enqueue (b), it's going to be then added, again, 
at the end of the linked list.
Is it empty? No. 
How do we know it's not empty? 
Well the simplest thing is we would just call to the underlying list implementation 
and say hey, list are you empty? 
It would say no. 
And so Empty for the queue is no. 
Under the covers, that's really just checking whether the head is nil or not. 
If we Enqueue(c) then, again it goes to the tail of the list. 
And if I now Dequeue, which one is going to be removed? 
Again this is not a stack, in a stack c would be removed. 
In our case, (a) is going to be removed because it's been there longest. 
That's just an implementation of popping from the front.
So that would return (a).
We can now do some more Enqueueing, Enqueue(d), Enqueue(e), 
Enqueue(f), and now if we start Dequeueing, we Dequeue from the front.
So Dequeuing (b), Dequeuing (c), Dequeuing (d), Dequeuing (e), and 
finally Dequeuing (f). 
If we ask whether the queue is empty now, the answer is yes. 
Again, because the head is nil. 
So Enqueue uses list's PushBack call and Dequeue uses both the list's 
TopFront to get the front element as well as PopFront to remove that front element.
And Empty just uses the list's Empty method.
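A possible Python sketch of this queue built on a linked list with head and tail pointers; the names are my own. Enqueue is the list's PushBack, and Dequeue is TopFront plus PopFront:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.next = None

class ListQueue:
    """FIFO queue on a singly-linked list with head and tail pointers."""
    def __init__(self):
        self.head = None   # least recently added key: next to dequeue
        self.tail = None   # most recently added key

    def enqueue(self, key):              # the list's PushBack
        node = Node(key)
        if self.tail is None:            # empty list: node is head and tail
            self.head = self.tail = node
        else:
            self.tail.next = node        # append at the back, O(1) via tail
            self.tail = node

    def dequeue(self):                   # TopFront + PopFront
        key = self.head.key
        self.head = self.head.next
        if self.head is None:            # list became empty: clear the tail too
            self.tail = None
        return key

    def empty(self):
        return self.head is None
```

The tail pointer is what makes Enqueue O(1); without it, PushBack would have to walk the whole list.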
What about with an array? 
We could think of doing something similar. 
That is, we could add at the end and then pop from the front. 
But you can imagine, so, 
we said the front of the array is the beginning of the queue.
Then enqueuing is easy, but dequeuing would be an expensive O(n) operation, and we want every queue operation to be O(1). We can do that, in a fashion I'll show you right now, which is basically treating the array as a circular array. So we're going to go ahead and Enqueue (a), and we have a write index. The write index tells us where the next Enqueue operation should happen, and the read index tells us where the next Dequeue operation should happen. So we Enqueue a, we Enqueue b, and now update our write index.
If we ask whether we're empty? 
No, we're not empty because read is not equal to write. 
That is we have something to Dequeue that has been Enqueued.
So Empty would be false. 
We Enqueue (c), we Dequeue, again we're going to Dequeue (a), so 
we Dequeue from the read index. 
So we basically read what's at the read index and then increment the read index.
If we now Dequeue again, we read what's at the read index which is (b) and 
we increment the read index. 
Now we will do some more Enqueueing. Notice at this point that when we Enqueue(d), the write index becomes 4; that's the next place we're going to write. Enqueueing (e) then writes to index 4, and the write index wraps back around to the initial element. And here it's important to note that we're using zero-based indexing with this array, so the first element is at index 0.
We Enqueue again, Enqueue (f), and 
now if we try Enqueue (g), it's not going to allow us to do that. 
So that will be an error. 
The reason it would be an error is that if we did Enqueue(g), the read and the write index would both be 2, and we couldn't distinguish read and write both being 2 because the queue is full from read and write both being 2 because the queue is empty. So we keep a buffer of at least one element that can't be written to, to make sure read and write stay separate and distinct whenever the queue's not empty.
Now we'll Dequeue, so we'll Dequeue (c), basically reading from 
the read index and updating it. 
Dequeue (d), read from the read index and update it. 
Dequeue (e) and here again, the read index wraps around back to 0. 
And now finally, we do our final Dequeue, and now the read and write index are both 1, which means if we ask whether the queue is empty, the answer is yes, it is empty.
So what we see here is that the cost for doing a Dequeue and 
an Enqueue, as well of course as Empty, are all O(1) operations this way.
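The circular-array queue above might be sketched like this in Python, including the one unused buffer slot that disambiguates full from empty (the class name and the `full` helper are my own):

```python
class CircularQueue:
    """Queue on a circular buffer. One slot is always left unused, so
    read == write means empty and never means full."""
    def __init__(self, capacity=5):
        self.data = [None] * capacity
        self.read = 0    # index of the next Dequeue
        self.write = 0   # index of the next Enqueue

    def empty(self):
        return self.read == self.write

    def full(self):
        # writing one more element would make write catch up to read
        return (self.write + 1) % len(self.data) == self.read

    def enqueue(self, key):
        if self.full():
            raise OverflowError("queue is full")
        self.data[self.write] = key
        self.write = (self.write + 1) % len(self.data)  # wrap around

    def dequeue(self):
        key = self.data[self.read]
        self.read = (self.read + 1) % len(self.data)    # wrap around
        return key
```

The modulo arithmetic is what makes both indices wrap back to 0, exactly as in the walkthrough, and every operation is O(1).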
So we can use either a linked list, 
although we have to have a tail pointer, so that PushBack is cheap, or an array.
Each of the operations is O(1). 
One distinction between the array and the linked list implementation, is 
that in the array implementation, we have a maximum size that the queue can grow to. 
So it's bounded.
Maybe you want that, in which case it's fine. But if you don't know a priori how large the queue needs to be, an array is a bad choice, and any amount that is unused is wasted space.
In a queue that's implemented with a linked list, 
it can get arbitrarily large as long as there's available memory. 
The downside is, every element you have to pay for another pointer.

Slides and External References

Slides
Download the slides for stacks and queues here:

05_2_stacks_and_queues.pdf (PDF file)

References
See chapter 10.1 in [CLRS]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

See these visualizations: array-based stack, list-based stack, array-based queue, list-based queue.

Trees
In this lecture, we're going to talk about trees. 
Let's look at some example trees. 
So here we have a sentence, "I ate the cake". 
Now, we're going to look at a syntax tree for that, 
which shows the structure of the sentence. 
So it's similar to sentence diagramming that you may have done in grade school.
So we have at the top of the tree, the S for 
sentence and then children: a noun phrase and a verb phrase. 
The child of the noun phrase is the word I from the sentence. 
And the child of the verb phrase is a verb and noun phrase, where the verb is ate, 
and the noun phrase is a determiner and a noun, the and cake. 
So along the bottom of the tree, we have the words from the sentence, 
"I ate the cake", and the rest of the tree reflects the structure of that sentence. 
We can look here at a syntax tree for an expression 2sin(3z-7), 
we can break that up into the structure. 
So at the top level, we have a multiplication, 
that's really the last thing that's done, multiplying the 2 and the sine.
Within the sine, what we're applying the sine to is 3z-7, 
so we have the minus that's happening last with a 7 and then this 3z, 3 times z. 
So this shows again the structure of the expression and 
the order in which you might evaluate it. 
So from the bottom, you would first do 3 times z, and then you would subtract 7 
from that, you'd apply the sine to that, and then you multiply that by 2.
Trees are also used to reflect hierarchy. 
So this reflects hierarchy of geography where we have at the left hand side 
the top level of the hierarchy, the world. 
And then below that, 
entities in the world, United States, all sorts of other things, United Kingdom. 
And then below that, various subcomponents of the geography. 
So we've got, for the case of the United States, states, and 
then within those states, cities.
Another example of a hierarchy is the animal kingdom. 
This is part of it where we've got animals, and then below that, different 
types of animals, so invertebrates, reptiles, mammals, and so on. 
And then within each of these, we have various subcategorizations. 
So this shows this entire hierarchy. 
We also use trees in computer science for code. 
So in order to represent code, we will do that with an abstract syntax tree. 
So our code here is a while loop. 
While x is less than 0, x is x+2, f of x. 
So we reflect that at the top, we have while, which is our while loop. 
And the children of the while loop are the condition that needs to be met for 
the while loop to continue and then the statement to execute. 
So the condition is x less than 0, so comparison operation, the variable x and 
the constant 0. 
And then the statement to execute, well, it's actually multiple statements so 
we have a block. 
And in those blocks, we have two different statements, an assignment statement and 
a procedure call. 
The assignment statement, the left child is the variable we're assigning to, 
which is x, and the right child is an expression, in this case, x+2. 
The procedure call, the left child is the name of the procedure, and 
subsequent children are the arguments to that procedure. 
In our case, we just have one argument x. 
Binary search tree is a very common type of a tree used in computer science. 
The binary search tree is defined by the fact that it's binary, so 
that means it has at most two children at each node. 
And we have the property that at the root node, 
the value of that root node is greater than or 
equal to all of the nodes in the left child, and 
it's less than the nodes in the right child. 
So here less than or greater than, we're talking about alphabetically. 
So Les is greater than Alex, Cathy, and Frank, but 
is less than Nancy, Sam, Violet, Tony, and Wendy. 
And then that same thing is true for every node in the tree has the same thing. 
For instance, Violet is greater than or equal to Tony and 
strictly less than Wendy.
The binary search tree allows you to search quickly. 
For instance, if we wanted to search in this tree for Tony, we could start at Les. 
Notice that we are greater than Les, so therefore, we're going to go right. 
We're greater than Sam so we'll go right. 
We're less than Violet so we'll go left and then we find Tony. 
And we do that in just four comparisons. 
It's a lot like a binary search in a sorted array.
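That search can be sketched as a simple walk down the tree. The node class and function name below are my own, using the names from the example:

```python
class BSTNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def bst_find(node, key):
    """Walk down from the root, going left or right by comparison."""
    while node is not None:
        if key == node.key:
            return node
        # smaller keys live in the left subtree, larger keys in the right
        node = node.left if key < node.key else node.right
    return None

# The tree from the example, with Les at the root
root = BSTNode("Les",
    BSTNode("Cathy", BSTNode("Alex"), BSTNode("Frank")),
    BSTNode("Sam", BSTNode("Nancy"),
        BSTNode("Violet", BSTNode("Tony"), BSTNode("Wendy"))))
```

Searching for Tony follows exactly the path described: right at Les, right at Sam, left at Violet, found.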
So with all these examples of trees, what's the actual definition of a tree? 
Well a tree is, this is a recursive definition. 
A tree is either empty or it's a node that has a key and 
it has a list of child trees.
So if we go back to our example here, Les is a node that 
has the key Les and two child trees, the Cathy child tree and the Sam child tree. 
The Cathy child tree is a node with a key Cathy and 
two child trees, the Alex child tree and the Frank child tree. 
Let's look at the Frank child tree. 
It's a node with a key Frank and two, well, does it have any child trees? 
No, it has no child trees.
So let's look at some other examples. 
An empty tree, well, we don't really have a good representation for that, 
it's just empty. 
A tree with one node is the Fred tree, and it has no children. 
A tree with two nodes is a Fred with a single child Sally, 
that in itself has no children.
In computer science commonly, trees grow down, so parents are above their children. 
So that's why we have Fred above Sally.
So let's look at some other terminology for trees. 
So here, we have a tree, Fred is the root of the tree. 
So it's the top node in the tree.
And here, the children of Fred are Kate, Sally, and Jim. 
We are actually showing that with arrows, commonly, when you show trees, 
you don't actually show the arrows. 
We just assume that if a node is above another node, 
that it's a parent of that node.
A child has a line down directly from a parent, so 
Kate is a parent of Sam, and Sam is a child of Kate.
An ancestor is a parent or parent's parents and so on. 
So Sam's ancestors are Kate and Fred. 
Hugh's ancestors are also Kate and Fred. 
Sally's ancestors are just Fred.
The descendant is an inverse of the ancestor, so it's the child or 
child of child and so on. 
So the descendants of Fred are all of the other nodes since it's the root, Sam, 
Hugh, Kate, Sally and Jim. 
The descendants of Kate would just be Sam and Hugh.
Siblings: two nodes sharing the same parent. So Kate, Sally, and Jim are all siblings.
Sam and Hugh are also siblings.
A leaf is a node that has no children. 
So that's Sam, Hugh, Sally, and Jim. 
An interior node are all nodes that aren't leaves. 
So this is Kate and Fred. 
Another way to describe it is all nodes that do have children. 
A level: 1 plus the number of edges between the root and 
a node, let's think about that. 
Fred, how many edges are there between the root and the Fred node? 
Well, since the Fred node is the root, there are no edges. 
So its level would be 1. 
Kate has one edge between Fred and Kate, 
so its level would be 2, along with its siblings, Sally and Jim.
And Sam and Hugh are level 3.
The height of a node: 1 plus the maximum number of edges on a path from that node down to a leaf. So here, for instance, if we want to look at the height of Fred, we want to look at its farthest-down descendant, which would be either Sam or Hugh. Its height would be 3. So the leaf heights are 1.
Kate has height 2. 
Fred has height 3. 
We also have the idea of a forest, extending the tree metaphor: it's a collection of trees. 
So we have here two trees with a root Kate and a root Sally, and those form a forest.
So a node has a key, children, 
which is a list of children nodes, and then it may or may not have a parent.
Probably the most common representation of trees is without the parent. 
But it's possible to also have parent pointers, and that can be useful as a way 
to traverse from anywhere in a tree to anywhere else by going up and then down, 
following parent nodes and then child nodes. 
On rare occasions, 
you could have a tree that's represented just with parent pointers. 
Okay, but that's unusual because a lot of times, kind of the way you get access 
to a tree is via its root and you want to go down from there. 
There are other less commonly used representations of trees as well, 
we're not going to get into here.
Binary trees are very commonly used. 
So a binary tree has, at most, two children. 
Rather than having in this general list of children, for a binary tree, 
we normally have an explicit left and right child, either of which can be nil.
As with the normal tree, the general form of a tree, you may or 
may not have a parent pointer.
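A general tree node along these lines might be sketched in Python like this; the field and method names are my own, and the parent pointer is kept here for illustration even though it's optional:

```python
class TreeNode:
    """A node with a key, a list of child nodes, and an optional parent."""
    def __init__(self, key, parent=None):
        self.key = key
        self.children = []       # child subtrees, in order
        self.parent = parent     # optional back-pointer; may stay None

    def add_child(self, key):
        """Create a child node, wire up its parent pointer, and return it."""
        child = TreeNode(key, parent=self)
        self.children.append(child)
        return child
```

With parent pointers you can traverse from anywhere to anywhere by going up and then down, as described above.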
Let's look at a couple of procedures operating on trees.
Since trees are recursively defined, it's very common to write 
routines that operate on trees that are themselves recursive. 
So for instance, 
if we want to calculate the height of a tree, that is, the height of its root node, 
we can go ahead and recursively do that, going through the tree. 
So we can say, for instance, if we have a nil tree, then its height is 0.
Otherwise, the height is 1 plus the maximum of the heights of the left and right child trees. So if we look at a leaf, for example, its height would be 1, because the height of the nil left child is 0 and the height of the nil right child is also 0. So the max of those is 0, and 1 plus 0 is 1. We could also look at calculating the size of a tree, that is, the number of nodes.
Again, if we have a nil tree, we have zero nodes. 
Otherwise, we have the number of nodes in the left child plus 1 for 
ourselves plus the number of nodes in the right child. 
So 1 plus the size of the left tree plus the size of the right tree.
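The two recursive procedures can be sketched directly from their definitions; a binary node class is assumed here for illustration:

```python
class BinaryNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(tree):
    if tree is None:                 # a nil tree has height 0
        return 0
    # 1 for this node, plus the taller of the two child trees
    return 1 + max(height(tree.left), height(tree.right))

def size(tree):
    if tree is None:                 # a nil tree has zero nodes
        return 0
    # nodes in the left tree, plus 1 for ourselves, plus the right tree
    return size(tree.left) + 1 + size(tree.right)
```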
In the next video, we're going to look at different ways to traverse a tree.
Tree Traversal
In this video, we're going to continue talking about trees. And in particular, look at walking a tree, or
visiting the elements of a tree, or traversing the elements of a tree. So often we want to go through
the nodes of a tree in a particular order. We talked earlier, when we were looking at the syntax tree
of an expression, how we could evaluate the expression by working our way up from the leaves. So
that would be one way of walking through a tree in a particular order so we could evaluate. Another
example might be printing the nodes of a tree. If we had a binary search tree, we might want to get
the elements of a tree in sorted order.
There are two main ways to traverse a tree. One, is depth-first. So there, we completely traverse one
sub-tree before we go on to a sibling sub-tree. Alternatively, in breadth-first search we traverse all the
nodes at one level before we go to the next level. So in that case, we would traverse all of our siblings
before we visited any of the children of any of the siblings. We'll see some code examples of these. In
depth-first search, we're going to look here at an in-order traversal, and that's really defined best for a binary tree. This InOrderTraversal is what we might use to print all the nodes of a binary search tree in alphabetical order.
So, we're going to have a recursive implementation, where if we have a nil tree, we do nothing,
otherwise, we traverse the left sub-tree, and then do whatever we're going to do with the key, visit it,
in this case, we're going to print it. But often there's just some operation you want to carry out, and
then traverse the right sub-tree. So let's look at an example of this. We've got our binary search tree.
And we're going to look at how these nodes get printed out if we do an in-order traversal. So to begin
with, we go to the Les node. And from there, since it's not nil, we're going to do an in-order traversal
of its left child, which is Cathy. Similarly now we're going to do an in-order traversal of its left child,
which is Alex.
We do an in-order traversal of its left child which is nil, so it does nothing. So we come back to Alex,
and then print out Alex, and then traverse its right sub-tree which is nil and does nothing. We come
back to Alex. And then we're finished with Alex and we go back to Cathy. So, we have successfully
completed Cathy's left sub-tree. So we did an in-order traversal of that, so now we're going to print
Cathy, and then do an in-order traversal of its right sub-tree, which is Frank.
So we go to Frank, similarly now we're going to print out Frank.
We've finished with Frank and go back to Cathy, and now we've completed Cathy totally, so we go
back to Les. We completed Les' left sub-tree, so we're now going to print Les and then traverse Les'
right sub-tree. So that is Sam, traverse its left sub-tree which is Nancy. Print it out, go back to Sam,
we've completed Sam's left sub-tree, so we print Sam, and then go ahead and do Sam's right sub-tree
which is Violet, which will end up printing Tony, Violet, and then Wendy. We're completed with
Wendy. We go back to Violet. We completed her right sub-tree, so we go back to Sam, completed his
right sub-tree, go back to Les, completed his right sub-tree, and we're done. So we see we get the
elements out in sorted order. And again, we do the left child. And then the node and then the right
child. And by our definition of a binary search tree, that then gives them to us in order because we
know all the elements in the left child are in fact less than or equal to the node itself.
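A Python sketch of this in-order traversal on the example tree; the node class and names are my own, and `visit` stands in for whatever operation you want to carry out, such as printing:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def in_order(node, visit):
    if node is None:
        return                   # a nil tree: do nothing
    in_order(node.left, visit)   # traverse the left subtree...
    visit(node.key)              # ...then visit the node itself...
    in_order(node.right, visit)  # ...then traverse the right subtree

# The binary search tree from the example
root = Node("Les",
    Node("Cathy", Node("Alex"), Node("Frank")),
    Node("Sam", Node("Nancy"),
        Node("Violet", Node("Tony"), Node("Wendy"))))

out = []
in_order(root, out.append)   # keys come out in sorted order
```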
The next depth-first traversal is a pre-order traversal. Now the in-order traversal really is only defined
for a binary tree because we talk about doing the left child and then the node and then the right child.
And so it's not clear, if you had, let's say, three children, where you'd actually put the node itself. You might do the first child, then print the node, then the second and third child; or the first and second child, print the node, and then the third child. It's just not well-defined.
However, these next two, the pre-order and post-order traversals, are well-defined not just for binary trees, but for general trees with an arbitrary number of children.
So here the pre-order traversal says: if the tree is nil, we return. Otherwise, we print the key first, that is, we visit the node itself and then its children. So we're going to, in this case, go to
the Les tree and then print out its key and then go to its children. So we're going to first go to its left
child which is Cathy, and for Cathy, we then print Cathy, and then go to its left child which is Alex,
print Alex, we go back to Cathy.
And we finished its left child, so then we go do its right child, which is Frank. We finished Frank. We
finished Cathy. We go back up to Les. We've already printed Les. We've already visited or traversed
Les' left child. Now we can traverse Les' right child, so it'll be Sam, which we'll print out. And then
we'll go to Nancy, which we'll print out, we'll go back up to Sam and then to Violet, and we will print
Violet, and then print Violet's children, which will be Tony and Wendy and then return back.
A post-order traversal is like a pre-order traversal, except instead of printing the node itself first, which is the pre, we print it last, which is the post. So all we've really done is move where the print statement is.
And here then, what's the last of these nodes that's going to be printed? Well, it's actually going to be Les, because we're not going to be able to print Les until we've finished
completely dealing with Les' left sub-tree and right sub-tree. So we'll visit Les, and then visit Cathy,
and then Alex, and then we'll actually print out Alex. Once we're done with Alex, we'll go back up to
Cathy and down to Frank, and then print out Frank, and then once we're done with both Alex and
Frank we can then print Cathy.
We go back up to Les, and we now need to go deal with Les' right child which is Sam. In order to deal
with Sam we go to Nancy, print Nancy, go back up to Sam and down to Violet, and deal with the
Violet tree, which will print out Tony, and then Wendy, and then Violet. And on our way back up,
then, when we get up to Sam, we have finished its children, so we can print out Sam. When we get up
to Les, we've finished its children, so we can print out Les. One thing to note about the recursive
traversal is we do have sort of under the covers, a stack that's being used. Because in a recursive call,
every time we make a call back to a procedure, we are invoking another frame on the stack. So we are
saving implicitly our information of where we are on the stack.
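The pre-order and post-order walks just traced can be written as short recursive functions. This is a minimal sketch in Python (no code appears in the lecture, so the Node class and function names here are assumptions):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def pre_order(node, visit):
    # Pre-order: visit the node first, then its left and right subtrees.
    if node is None:
        return
    visit(node.key)
    pre_order(node.left, visit)
    pre_order(node.right, visit)

def post_order(node, visit):
    # Post-order: finish both subtrees first; the node itself comes last.
    if node is None:
        return
    post_order(node.left, visit)
    post_order(node.right, visit)
    visit(node.key)

# The example tree from the lecture, with Les at the root.
tree = Node("Les",
            Node("Cathy", Node("Alex"), Node("Frank")),
            Node("Sam", Node("Nancy"),
                 Node("Violet", Node("Tony"), Node("Wendy"))))
```

Calling pre_order(tree, print) prints Les, Cathy, Alex, Frank, Sam, Nancy, Violet, Tony, Wendy, while post_order(tree, print) prints Alex, Frank, Cathy, Nancy, Tony, Wendy, Violet, Sam, Les, matching the walks above.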
Breadth-first, we're going to actually use a queue instead of a stack. So in the breadth-first, we are
going to call it level traversal here, we're going to go ahead and instantiate a queue, and on the queue
first put the root of the tree. So we put that in the queue and then while the queue is not empty,
we're going to dequeue, so pull a node off, deal with that by printing it and then if it's got a left child,
enqueue the left child, if it's got a right child, enqueue the right child. And so this will have the effect
of going through and processing the elements in level order. We see the example here, and we're
going to show the queue. So here let's say we're just before the while loop, the queue contains Les.
And we're going to now dequeue Les from the queue, output it by printing it, and then enqueue Les'
children which are Cathy and Sam.
Now, we visit those in order, so first we're going to dequeue Cathy, print it out and then enqueue its
children. Remember when we're enqueuing we go at the end of the line, so Alex and Frank go after
Sam. So now we're going to dequeue Sam, print it, and then enqueue its children Nancy and Violet. So
we can see what we've done then is, we first printed Les, that's level one and then we printed the
elements of level two, which are Cathy and Sam, and now we're going to go on to the elements at
level three. So notice, all the elements in level three, Alex, Frank, Nancy, and Violet are in the queue
already.
And they're all going to be processed before any of the level four nodes
are processed. So even though they'll be pushed in the queue, since the level three nodes got there
first that they're all going to be processed before we process the level four ones. So here, we dequeue
Alex, print it out, and we're done. Dequeue Frank, print it out, we're done with Frank. Dequeue
Nancy, print it out, we're done with Nancy. And Violet, we print it out, but then also enqueue Tony
and Wendy, and then dequeue those and print them out. So this is a breadth-first search with an
explicit queue. You can also do depth-first searches iteratively rather than recursively, but you will need
an additional data structure, a stack, to keep track of the work still to be done.
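The level traversal, and the iterative depth-first variant mentioned at the end, might be sketched like this (an illustrative sketch; the Node class and function names are assumptions, since the lecture shows no code):

```python
from collections import deque

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def level_order(root, visit):
    # Breadth-first: a queue guarantees every node on one level is
    # processed before any node on the next level.
    if root is None:
        return
    queue = deque([root])
    while queue:
        node = queue.popleft()          # dequeue the front node
        visit(node.key)
        if node.left:
            queue.append(node.left)     # enqueue left child
        if node.right:
            queue.append(node.right)    # enqueue right child

def pre_order_iterative(root, visit):
    # Depth-first without recursion: an explicit stack replaces the
    # implicit call stack. Push right before left so left is popped first.
    stack = [root] if root else []
    while stack:
        node = stack.pop()
        visit(node.key)
        if node.right:
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
```

On the lecture's example tree, level_order visits Les, then Cathy and Sam (level two), then Alex, Frank, Nancy, Violet (level three), then Tony and Wendy.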
So in summary, trees are used for lots of different things in computer science.
We've seen that trees have a key and normally have children, although there are alternative
representations of trees.
The tree walks that are normally done are traversals: DFS (depth-first search) and BFS (breadth-first
search). There are different types of depth-first search traversals: pre-order, in-order, and post-order.
When you work with a tree, it's common to use recursive algorithms, although note that we didn't for
the breadth-first search, where we needed to go through the elements of the tree in level order, which
doesn't fit a recursive traversal. And finally, in computer science, trees grow down.

Slides and External References

Slides
Download the slides for trees here:

05_3_trees.pdf

References
See chapter 10.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Programming assignment 1

Available Programming Languages

To solve programming assignments, you can use any of the following programming languages:

 C
 C++
 C#
 Haskell
 Java
 JavaScript
 Python2
 Python3
 Ruby
 Scala

However, we will only be providing starter solution files for C++, Java, and Python3. Your
submission's programming language is detected automatically based on its file extension.

We have reference solutions in C++, Java and Python3, which solve the problem correctly under
the given restrictions, and in most cases spend at most 1/3 of the time limit and at most 1/2 of the
memory limit. You can also use other languages, and we've estimated the time limit multipliers for
them; however, we do not guarantee that there exists a correct solution in those languages for
every problem running under the given time and memory constraints.

Your solution will be compiled as follows. We recommend that when testing your solution locally,
you use the same compiler flags for compiling. This will increase the chances that your program
behaves in the same way on your machine and on the testing machine (note that a buggy
program may behave differently when compiled by different compilers, or even by the same
compiler with different flags).

C (gcc 5.2.1). File extensions: .c. Flags:

gcc -pipe -O2 -std=c11 <filename> -lm


C++ (g++ 5.2.1). File extensions: .cc, .cpp. Flags

g++ -pipe -O2 -std=c++11 <filename> -lm


If your C/C++ compiler does not recognize the "-std=c++11" flag, try replacing it with the
"-std=c++0x" flag or compiling without this flag at all (all starter solutions can be compiled without it).
On Linux and MacOS, you probably have the required compiler. On Windows, you may use your
favorite compiler or install an environment such as Cygwin.

C# (mono 3.2.8). File extensions: .cs. No flags:

mcs
Haskell (GHC 7.8.4). File extensions: .hs. Flags:

ghc -O
Java (Open JDK 8). File extensions: .java. Flags:

javac -encoding UTF-8

java -Xmx1024m
JavaScript (Node v6.3.0). File extensions: .js. No flags:

nodejs
Python 2 (CPython 2.7). File extensions: .py2 or .py (a file ending in .py needs to have a first line
which is a comment containing 'python2'). No flags:

python2
Python 3 (CPython 3.4). File extensions: .py3 or .py (a file ending in .py needs to have a first line
which is a comment containing 'python3'). No flags:

python3
Ruby (Ruby 2.1.5). File extensions: .rb. No flags:

ruby
Scala (Scala 2.11.6). File extensions: .scala. No flags:

scalac

FAQ on Programming Assignments

I submit the program, but nothing happens

You need to create a submission and upload the file with your solution in one of the programming
languages C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, or Scala. Make sure
that after uploading the file with your solution you press the blue "Submit" button at the bottom.
After that, the grading starts, and the submission being graded is enclosed in an orange rectangle.
After the testing is finished, the rectangle disappears, and the results of the testing of all problems
are shown to you.

I submit the solution for only one problem, but all the problems in the
assignment are graded

Each time you submit any solution, the last uploaded solution for each problem is tested. Don't
worry: this doesn't affect your score even if the submissions for the other problems are wrong. As
soon as you pass a sufficient number of problems in the assignment (see the pdf with
instructions), you pass the assignment. After that, you can improve your result if you successfully
pass more problems from the assignment. We recommend working on one problem at a time,
checking whether your solution for any given problem passes in the system as soon as you are
confident in it. However, it is better to test it first; please refer to this reading from the "Algorithmic
Toolbox" course.

What are the possible grading outcomes, and how to read them?

Your solution can either pass or not. To pass, it must work without crashing and return the correct
answers on all the test cases we prepared for you, and do so under the time limit and memory
limit constraints specified in the problem statement. If your solution passes, you get the
corresponding feedback "Good job!" and get a point for the problem. If your solution fails, it can be
because it crashes, returns wrong answer, works for too long or uses too much memory for some
test case. The feedback will contain the number of the test case on which your solution fails and
the total number of test cases in the system. The tests for the problem are numbered from 1 to the
total number of test cases for the problem, and the program is always tested on all the tests in the
order from the test number 1 to the test with the biggest number.

Here are the possible outcomes:

1. Good job! - Hurrah! Your solution passed, and you get a point!
2. Wrong answer. - Your solution has output an incorrect answer for some test case. If it is a
sample test case from the problem statement, or if you are solving Programming Assignment
1, you will also see the input data, the output of your program and the correct answer.
Otherwise, you won't know the input, the output and the correct answer. Check that you
consider all the cases correctly, avoid integer overflow, output the required whitespace,
output the floating point numbers with the required precision, don't output anything in addition
to what you are asked to output in the output specification of the problem statement. See
this reading on testing from the "Algorithmic Toolbox" course.
3. Time limit exceeded. - Your solution worked longer than the allowed time limit for some
test case. If it is a sample test case from the problem statement, or if you are solving
Programming Assignment 1, you will also see the input data and the correct answer.
Otherwise, you won't know the input and the correct answer. Check again that your algorithm
has good enough running time estimate. Test your program locally on the test of maximum
size allowed by the problem statement and see how long it works. Check that your program
doesn't wait for some input from the user, which makes it wait forever. See this reading on
testing from the "Algorithmic Toolbox" course.
4. Memory limit exceeded. - Your solution used more than the allowed memory limit for
some test case. If it is a sample test case from the problem statement, or if you are solving
Programming Assignment 1, you will also see the input data and the correct answer.
Otherwise, you won't know the input and the correct answer. Estimate the amount of memory
that your program is going to use in the worst case and check that it is less than the memory
limit. Check that you don't create too large arrays or data structures. Check that you don't
create large arrays or lists or vectors consisting of empty arrays or empty strings, since those
in some cases still eat up memory. Test your program locally on the test of maximum size
allowed by the problem statement and look at its memory consumption in the system.
5. Cannot check answer. Perhaps output format is wrong. - This happens when you
output something completely different than expected. For example, you are required to output
word "Yes" or "No", but you output number 1 or 0, or vice versa. Or your program has empty
output. Or your program outputs not only the correct answer, but also some additional
information (this is not allowed, so please follow exactly the output format specified in the
problem statement). Maybe your program doesn't output anything, because it crashes.
6. Unknown signal 6 (or 7, or 8, or 11, or some other). - This happens when your program
crashes. It can be because of division by zero, accessing memory outside of the array
bounds, using uninitialized variables, too deep recursion that triggers stack overflow, sorting
with contradictory comparator, removing elements from an empty data structure, trying to
allocate too much memory, and many other reasons. Look at your code and think about all
those possibilities. Make sure that you use the same compilers and the same compiler
options as we do. Try different testing techniques from this reading from the "Algorithmic
Toolbox" course.
7. Grading failed. - Something very wrong happened with the system. Contact Coursera for
help or write in the forums to let us know.

How to understand why my program fails and to fix it?

If your program works incorrectly, it gets a feedback from the grader. For the Programming
Assignment 1, when your solution fails, you will see the input data, the correct answer and the
output of your program in case it didn't crash, finished under the time limit and memory limit
constraints. If the program crashed, worked too long or used too much memory, the system stops
it, so you won't see the output of your program or will see just part of the whole output. We show
you all this information so that you get used to the algorithmic problems in general and get some
experience debugging your programs while knowing exactly on which tests they fail.

However, in the following Programming Assignments throughout the Specialization you will only
get so much information for the test cases from the problem statement. For the next tests you will
only get the result: passed, time limit exceeded, memory limit exceeded, wrong answer, wrong
output format or some form of crash. We hide the test cases, because it is crucial for you to learn
to test and fix your program even without knowing exactly the test on which it fails. In the real life,
often there will be no or only partial information about the failure of your program or service. You
will need to find the failing test case yourself. Stress testing is one powerful technique that allows
you to do that. You should apply it after using the other testing techniques covered in
this reading from the "Algorithmic Toolbox" course.

Why do you hide the test on which my program fails?

Often beginner programmers think by default that their programs work. Experienced programmers
know, however, that their programs almost never work initially. Everyone who wants to become a
better programmer needs to go through this realization.

When you are sure that your program works by default, you just throw a few random test cases
against it, and if the answers look reasonable, you consider your work done. However, mostly this
is not enough. To make one's programs work, one must test them really well. Sometimes, the
programs still don't work although you tried really hard to test them, and you need to be both
skilled and creative to fix your bugs. Solutions to algorithmic problems are one of the hardest to
implement correctly. That's why in this Specialization you will gain this important experience which
will be invaluable in the future when you write programs which you really need to get right.

It is crucial for you to learn to test and fix your programs yourself. In the real life, often there will be
no or only partial information about the failure of your program or service. Still, you will have to
reproduce the failure to fix it (or just guess what it is, but that's rare, and you will still need to
reproduce the failure to make sure you have really fixed it). When you solve algorithmic problems,
it is very frequent to make subtle mistakes. That's why you should apply the testing techniques
described in this reading from the "Algorithmic Toolbox" course to find the failing test case and fix
your program.

My solution does not pass the tests? May I post it in the forum and ask for a
help?

No, please do not post any solutions in the forum or anywhere on the web, even if a solution does
not pass the tests (as in this case you are still revealing parts of a correct solution). Recall the
third item of the Coursera Honor Code: "I will not make solutions to homework, quizzes, exams,
projects, and other assignments available to anyone else (except to the extent an assignment
explicitly permits sharing solutions). This includes both solutions written by me, as well as any
solutions provided by the course staff or others".

Are you going to support my favorite language in programming assignments?

Currently, we are going to support C++, Java, and Python only, but we may add other
programming languages later if there appears a huge need. To express your interest in a
particular programming language, please post its name in this thread (in the forum of the
"Algorithmic Toolbox" course) or upvote the corresponding option if it is already there.

My implementation always fails in the grader, though I already tested and stress
tested it a lot. Wouldn’t it be better if you give me a solution to this problem or
at least the test cases that you use? I will then be able to fix my code and will
learn how to avoid making mistakes. Otherwise, I don’t feel that I learn anything
from solving this problem. I’m just stuck.

First of all, it is just not true that you do not learn by trying to fix your implementation.

The process of trying to invent new test cases that might fail your program and proving them
wrong is often enlightening. This thinking about the invariants which you expect your loops, ifs,
etc. to keep and proving them wrong (or right) makes you understand what happens inside your
program and in the general algorithm you're studying much more.

Also, it is important to be able to find a bug in your implementation without knowing a test case
and without having a reference solution. Assume that you designed an application and an
annoyed user reports that it crashed. Most probably, the user will not tell you the exact sequence
of operations that led to a crash. Moreover, there will be no reference application. Hence, once
again, it is important to be able to locate a bug in your implementation yourself, without a magic
oracle giving you either a test case that your program fails or a reference solution. We encourage
you to use programming assignments in this class as a way of practicing this important skill.

If you’ve already tested a lot (considered all corner cases that you can imagine, constructed a set
of manual test cases, applied stress testing), but your program still fails and you are stuck, try to
ask for help on the forum. We encourage you to do this by first explaining what kind of corner
cases you have already considered (it may happen that when writing such a post you will realize
that you missed some corner cases!) and only then asking other learners to give you more ideas
for test cases.

Week 2
Data Structures
Dynamic Arrays and Amortized Analysis

In this module, we discuss Dynamic Arrays: a way of using arrays when it is unknown ahead-of-
time how many elements will be needed. Here, we also discuss amortized analysis: a method of
determining the amortized cost of an operation over a sequence of operations. Amortized analysis
is very often used to analyse performance of algorithms when the straightforward analysis
produces unsatisfactory results, but amortized analysis helps to show that the algorithm is actually
efficient. It is used both for Dynamic Arrays analysis and will also be used at the end of this course
to analyze Splay trees.
Key Concepts
 Describe how dynamic arrays work
 Calculate amortized running time of operations
 List the methods for amortized analysis

Dynamic Arrays and Amortized Analysis

Video: LectureDynamic Arrays

8 min


Video: LectureAmortized Analysis: Aggregate Method

5 min

Video: LectureAmortized Analysis: Banker's Method

6 min

Video: LectureAmortized Analysis: Physicist's Method

7 min

Video: LectureAmortized Analysis: Summary

2 min


Quiz: Dynamic Arrays and Amortized Analysis

4 questions

Due Jul 12, 11:59 PM PDT

Reading: Slides and External References

10 min
Dynamic Arrays
So in this lecture, we're going to talk about dynamic arrays and amortized analysis.
In this video we're going to talk about dynamic arrays.
So the problem with static arrays is, well, they're static.
Once you declare them, they don't change size, and you have to determine that size at compile time.
So one solution is what are called dynamically-allocated arrays. There you can actually allocate the
array, determining the size of that array at runtime. So that gets allocated from dynamic memory. So
that's an advantage. The problem is, what if you don't know the maximum size at the time you're
allocating the array?
A simple example, you're reading a bunch of numbers. You need to put them in an array. But you
don't know how many numbers there'll be. You just know there'll be some mark at the end that says
we're done with the numbers.
So, how big do you make it? Do you make it 1,000 big? But then what if there are 2,000 elements?
Make it 10,000 big? But what if there are 20,000 elements? So, a solution to this. There's a saying that
says all problems in computer science can be solved by another level of indirection. And that's the
idea here. We use a level of indirection. Rather than directly storing a reference to either a static or
dynamically allocated array, we're going to store a pointer to our dynamically allocated array. And
that allows us then to update that pointer. So if we start adding more and more elements, when we
add too many, we can go ahead and allocate a new array, copy over the old elements, get rid of the
old array, and then update our pointer to that new array. So these are called dynamic arrays or
sometimes they're called resizable arrays. And this is distinct from dynamically allocated arrays.
Where we allocate an array, but once it's allocated it doesn't change size.
So a dynamic array is an abstract data type, and basically you want it to look kind of like an array. So it
has the following operations, at a minimum. It has a Get operation, that takes an index and returns
you the element at that index, and a Set operation, that sets an element at a particular index to a
particular value.
Both of those operations have to be constant time. Because that's kind of what it means to be an array:
we have random access in constant time to the elements. We can PushBack, so that adds a
new element to the array at the end of the array.
We can remove an element at a particular index. And that'll shuffle down all the succeeding ones. And
finally, we can find out how many elements are in the array. How do we implement this? Well, we're
going to store arr, which is our dynamically-allocated array. We're going to store capacity, which is the
size of that dynamically-allocated array, how large it is. And then size is the number of elements that
we're currently using in the array. Let's look at an example. So let's say our dynamically allocated
array has a capacity of 2. But we're not using any elements in it yet, so it's of size 0. And arr then
points to that dynamically allocated array.
If we do a PushBack of a, that's going to go ahead and put a into the array and update the size.
We now push b, it's going to put b into the array and update the size.
Notice now the size is equal to the capacity which means this dynamically allocated array is full. So if
we get asked to do another PushBack, we've got to go allocate a new dynamically-allocated array.
We're going to make that larger, in this case it's of size 4. And then we copy over each of the elements
from the old array to the new array.
Once we've copied them over, we can go ahead and update our array pointer to point to this new
dynamically allocated array, and then dispose of the old array.
At this point now we finally have our new dynamically allocated array, that has room to push another
element, so we push in c.
We push in d, if there is room we put it in, update the size. And now if we try and push another
element, again we have a problem, we're too big. We can allocate a new array. In this case, we're
going to make it of size 8. We'll talk about how you determine that size somewhat later.
And then copy over a, b, c, and d, update the array pointer, de-allocate the old array, and now we
have room we can push in e. So that's how dynamic arrays work. Let's look at some of the
implementations of the particular API methods.
Get is fairly simple. So we just check and see, we're going to assume for the sake of argument,
that we are doing 0-based indexing here. So if we want to Get(i), we first check and make sure, is i in
range? That is, is it non-negative and within the range from 0 to size minus 1? Because if it's less
than 0 or greater than or equal to size, it's out of range, and that will be an error.
If we're in range then we just return index i from the dynamically allocated array.
Set is very similar. Check to make sure our index is in bounds, and then if it is, update index i of the
array to val. PushBack is a little more complicated. So, let's actually skip the if statement for now and
just say, let's say that there is empty space in our dynamic array. In that case, we just set array at size
to val and then increment size.
If, however, we're full, we're not going to do that yet, if size is equal to capacity, then we go ahead
and allocate a new array. We're going to make it twice the capacity, and then we go through a for
loop, copying over every one of the elements from the existing array to the new array.
We free up the old array and then set array to the new one.
At that point then, we've got space and we go ahead and set the size element and then increment
size.
Remove's fairly simple. Check that our index is in bounds and then go ahead through a loop, basically
copying over successive elements and then decrementing the size.
Size is simple, we'll just return size.
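Putting those methods together, the dynamic array described here might look like the following. This is a sketch for illustration only: Python's built-in list is itself a dynamic array, and the class and method names are assumptions, not the lecture's code.

```python
class DynamicArray:
    # A dynamic (resizable) array that doubles its capacity when full,
    # using 0-based indexing, as described in the lecture.
    def __init__(self):
        self.capacity = 2
        self.size = 0
        self.arr = [None] * self.capacity

    def get(self, i):
        # Constant-time read, with a bounds check.
        if i < 0 or i >= self.size:
            raise IndexError("index out of range")
        return self.arr[i]

    def set(self, i, val):
        # Constant-time write, with a bounds check.
        if i < 0 or i >= self.size:
            raise IndexError("index out of range")
        self.arr[i] = val

    def push_back(self, val):
        if self.size == self.capacity:
            # Full: allocate a new array twice as large and copy over
            # every existing element.
            new_arr = [None] * (2 * self.capacity)
            for j in range(self.size):
                new_arr[j] = self.arr[j]
            self.arr = new_arr
            self.capacity *= 2
        self.arr[self.size] = val
        self.size += 1

    def remove(self, i):
        if i < 0 or i >= self.size:
            raise IndexError("index out of range")
        # Shuffle all succeeding elements down by one.
        for j in range(i, self.size - 1):
            self.arr[j] = self.arr[j + 1]
        self.size -= 1

    def length(self):
        return self.size
```

Pushing a, b, c, d, e onto a fresh array reproduces the walkthrough above: the capacity grows from 2 to 4 when c is pushed, and from 4 to 8 when e is pushed.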
There are common implementations for these dynamic arrays and C++'s vector class is an example of
a dynamic array. And there, notice it uses C++ operator overloading, so you can use the standard
array syntax of left brackets, to either read from or write to an element. Java has an ArrayList. Python
has the list. And there are no static arrays in Python; all of them are dynamic. What's the runtime? We
saw Get and Set are O(1), as they should be. PushBack is O(n). Although we're going to see that's only
the worst case. And most of the time actually, when you call PushBack, it's not having to do the
expensive operation, that is, the size is not equal to capacity. For now, though, we're just going to say
that it's O(n). We'll look at a more detailed analysis when we get into aggregate analysis in our next
video.
Removing is O(n), because we've gotta move all those elements. Size is O(1).
So in summary, unlike static arrays, dynamic arrays are dynamic. That is, they can be resized.
Appending a new element to a dynamic array is often constant time, but it can take O(n). We're going
to look at a more nuanced analysis in the next video. And some space is wasted. In our case, if we're
resizing by a factor of two, at most half the space is wasted. If we were making our new array three
times as big, then we can waste two-thirds of our space. If we're only making it 1.5 as big, then we
would waste less space. It's worth noting a dynamic array can also be resized smaller; that's possible
too. It's worth thinking about what would happen if we resized our array to a smaller dynamic array as
soon as we got under one-half utilization. It turns out we can come up with a sequence of operations that
gets to be quite expensive.
In the next video, we're going to talk about amortized analysis. And in particular, we're going to look
at one method called the aggregate method.
Amortized Analysis: Aggregate Method
So we'll discuss now what amortized analysis is and look at a particular method for doing such
analysis.
Sometimes, looking at an individual worst case may be too severe. In particular, we
may want to know the total worst-case cost for a sequence of operations, where some of those
operations are cheap and only certain ones are expensive. So if we take the worst case
of any one operation and multiply it by the number of operations, we may overstate the total cost.
As an example, for a dynamic array, we only resize every so often. Most of the time, we're doing a
constant-time operation, just adding an element. It's only when we reach full capacity that we
have to resize. So the question is, what's the total cost if you have to insert a bunch of items?
So here's the definition of amortized cost. You have a sequence of n operations, the amortized cost is
the cost of those n operations divided by n.
This is similar in spirit to let's say you buy a car for, I don't know, $6,000. And you figure it's going to
last you five years.
Now, you have two possibilities. One, you pay the $6,000 and then five years later you have to pony
up another $6,000. Another option would be to put aside money every month. So five years is 60
months. So if you put away $100 a month, once the five years is over, then when it's time to buy a
new car for $6000, you'll have $6000 in your bank account. And so there that amortized cost (monthly
cost) is $100 a month, whereas the worst case monthly cost is actually 6,000, it's 0 for 59 months and
then it's 6,000 after one month, so you can see that, that amortized cost gives you a more balanced
understanding. If you really want to know what's the most I spend in any month, the answer is $6,000. But if you want to know, on average, what am I spending, $100 is a more reasonable
number. So that's why we do this amortized analysis, to get a more nuanced picture of what it looks
like for a succession of operations.
So let's look at the aggregate method of doing amortized analysis. And the aggregate method really
says, let's look at the definition of what an amortized cost is, and use that to directly calculate.
So we're going to look at an example of dynamic array and we're going to do n calls to PushBack. So
we're going to start with an empty array and n times call PushBack.
And then we'll find out what the amortized cost is of a single call to PushBack. We know the worst
case time is O(n). Let's define c sub i as the cost of the i'th insertion. So we're interested in c1 to cn.
So c sub i always includes 1, because what we're counting here is writing into the array: we have to write in this i'th element that we're adding, regardless of whether or not we need to resize.
If we need to resize, the first question is when do we need to resize? We need to resize if our capacity
is used up, that is, if the size is equal to the capacity. When does that happen? It happens if the previous insertion filled the array, that is, made the size a power of 2, because in our case we're always doubling. So on the i'th insertion we have to resize if the (i-1)'th insertion filled the array, that is, if i-1 is a power of 2.
And if we don't have to resize, there's no additional cost, it's just zero.
So the total amortized cost is really the sum of the n actual costs divided by n, that is, the summation from i = 1 to n of c sub i, where c sub i is the cost of the i'th insertion. Every c sub i contributes a base cost of 1, and summing that n times gives n. To that we add the summation from j = 0 to the floor of log base 2 of (n-1) of 2 to the j, which is just the powers of 2 all the way up to n-1. To give an example, if n is 100, the powers of 2 are 1, 2, 4, 8, 16, 32, and 64, and it's the summation of all of those. Well, that summation is just order n.
Right. We basically take powers of 2 up to but not including n. And that is going to be no more than
2n. So we've got n plus something no more than 2n, that's clearly O(n) divided by n, and that's just
O(1). So what we've determined then is that we have a amortized cost for each insertion of order 1.
Our worst case cost is still order n, so if we want to know how long it's going to take in the worst case
for any particular insertion is O(n), but the amortized cost is O(1).
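As a sanity check on this aggregate analysis, here is a small Python simulation of the lecture's cost model (1 per write, plus one unit per element moved during a resize); the function name is illustrative:

```python
def pushback_costs(n):
    """Per-operation costs of n PushBacks into a dynamic array
    that doubles its capacity whenever it is full."""
    size, capacity = 0, 0
    costs = []
    for _ in range(n):
        if size == capacity:           # full: allocate larger array, move elements
            capacity = max(1, 2 * capacity)
            cost = 1 + size            # move `size` old elements, write the new one
        else:
            cost = 1                   # just write the new element
        size += 1
        costs.append(cost)
    return costs

costs = pushback_costs(100)
print(max(costs))               # 65: the worst single insertion is O(n)
print(sum(costs) / len(costs))  # 2.27: the amortized cost stays below 3
```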
In the next video, we're going to look at an alternative way to do this amortized analysis.
In this video, we're going to talk about a second way to do Amortized Analysis, what we call the
Banker's Method.
The idea here is that we're going to charge extra for each cheap operation. So it's sort of like we're
taking the example where we looked at saving money for a car. We're going to actually take that $100
and put it in the bank. And then we save those charges somewhere, in the case of the bank we put it
in the bank. In our case we're going to conceptually save it in our data structure. We're not actually
changing our code, this is strictly an analysis. But we're conceptually thinking about putting our saved
extra cost as sort of tokens in our data structure that later on we'll be able to use to pay for the expensive operations. This will make more sense when we see an example.
So it's kind of like an amortizing loan or this case I talked about where we're saving $100 a month
towards a $6000 car, because we know our current car is going to run out.
Let's look at this same example where we have a dynamic array and n calls to PushBack starting with
an empty array. The idea is we're going to charge 3 for every insertion. So every PushBack, we're
going to charge 3. One is the raw cost for actually
moving in this new item into the array, and the other two are going to be saved.
So if we need to do a resize in order to pay for moving the elements, we're going to use tokens we've
already saved in order to pay for the moving. And then, we're going to place 1 token, once we've
actually added our item. 1 token on the item we added and then 1 token on an item prior to this in
the array. It'll be easier when we look at a particular example.
Let's look at an example we have an empty array. And we're going to start with size 0, capacity 0. We
PushBack(a), what happens? Well we have to allocate our array of size one, point to it, and then we
put a into the array. And now we're going to put a little token on a and this token is what we use to
pay later on to moving a. In this particular example for the very first element there's no other element
to put a token on. So we're just going to waste that other, that third token. We push in b. There's no
space for b so we've got to allocate a larger array and then move a. How are we going to pay for that
moving a? Well with the token the token that's already on it. So we prepaid this moving a. When we
actually initially put a into the array, we put a token on it that would pay for moving it into a new
array. So that's how we pay for moving a and then we update the array, delete the old one, and now
we actually put b in. So we put b in at the cost of one, we still have two more tokens to pay. So we're
going to put one on b, and we're going to put one on the element capacity over two, that is one position, earlier, so we're going to put one on a. So we've spent three now: one for real and two as deferred payment
that we're going to use later in the form of these tokens.
Remember these tokens are not actually stored in the data structure. There's nothing actually in the
array. This is just something we're using for mental accounting in order to do our analysis.
When we push in c, we're going to allocate a new array. We copy over a and we pay for that with our
pre-paid token. We copy over b, paying for that with our pre-paid token. And now we push in c.
That's one; the second payment we have to make is to put a token on c, and then we put a token on a, which is capacity divided by two (four divided by two), that is, two elements prior.
We push in d, we don't have to do any resizing, finally. Okay, so we just put in d and that's the cost of
one. Second, put a token on d. Third, put a token capacity over two or two elements prior to that. So
notice what we've got now is a full array and everything has tokens on it which means when we need
to resize, we have prepaid for all of that movement. So we push in e, allocate a new array. And now
we use those prepaid tokens to pay for moving a, b, c, and d. Get rid of the old array, and now push in
e. And again, put a token on e, and a token on a.
So, what we've got here then is O(1) amortized cost for each PushBack, and in particular, a cost of three per operation, as we have clearly seen. So let's look back at how we did this.
For this dynamic array we decided we had to charge three; other data structures with other operations might require a different charge. We have to figure out what amount will be sufficient, in our case three was sufficient, and we decided that we would go ahead and store these tokens on the elements that needed to be moved. So it's a very physical way to keep track
of the saved, prepaid work that we have done. So we charge 3, of which 1 is the raw cost of insertion. And we've arranged things such that whenever the array gets full, enough PushBacks have occurred that every element has a token on it: each element added since the previous resize got its own token, and each of those additions also prepaid a token for a prior element.
So, we pay our one insertion, we pay one for the element we're adding now and we pay one for sort
of a buddy element earlier.
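The banker's argument can be checked mechanically. This sketch (my own accounting, not code from the course) charges a fixed amount per PushBack, banks everything beyond the raw write cost, and reports whether the bank ever goes negative:

```python
def bank_never_negative(n, charge):
    """Simulate n PushBacks into a doubling dynamic array, charging
    `charge` per operation: 1 pays for the write, the rest is banked.
    Each element moved on a resize costs 1 banked token."""
    size, capacity, bank = 0, 0, 0
    for _ in range(n):
        bank += charge - 1          # 1 is the raw cost of the write
        if size == capacity:        # resize: pay one token per moved element
            bank -= size
            capacity = max(1, 2 * capacity)
        if bank < 0:
            return False
        size += 1
    return True

print(bank_never_negative(1000, charge=3))  # True: 3 per operation suffices
print(bank_never_negative(1000, charge=2))  # False: not enough savings to pay for moves
```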
In the next video we're going to look at a third way of doing Amortized Analysis, which is the
Physicist's Method.
Now, let's talk about the final way to do amortized analysis, which is the physicist's method. The idea
of the physicist's method is to define a potential function, which is a function that takes a state of a
data structure and maps it to an integer which is its potential.
This is similar in spirit to what you may have learned in high school physics, the idea of potential
energy. For instance, if you have a ball and you take it up to the top of a hill, you've increased its
potential energy. If you then let the ball roll down the hill, its potential energy decreases and gets
converted into kinetic energy which increases.
We do the same sort of thing for our data structure, storing in it the potential to do future work.
A couple of rules about the potential function. First, consider phi of h sub 0: phi is the potential function, and h sub 0 is the data structure h at time 0, that is, its initial state, and that has to have a potential of 0.
Second rule is that potential is never negative. So, at any point in time, phi of h sub t is greater than or
equal to 0. So, once we've defined the potential function, we can then say what amortized cost is. The
amortized cost of an operation t is c sub t, the true cost, plus the change in potential, between, before
doing the operation and after doing the operation. So, before doing the operation, we have phi(h sub
t-1) after we have phi(h sub t), so it's c sub t plus phi(h sub t)- phi(h sub t-1).
What we need to do is choose a function phi, such that,
if the actual cost is small, then we want the potential to increase. So that we're saving up some
potential for doing later work. And if c sub t is large, then we want the potential to decrease. In a way
to sort of pay for that work. So, the cost of in operations is the sum of the true costs which is a
summation from i goes from one to n of c sub i. And, what we want to do is relate the sum of the true
costs to the sum of the amortized costs. So, the sum of the amortized costs is the summation from i
equals 1 to n of the definition of the amortized cost. Which is (c sub i + phi(hsub i) - phi(h sub i-1)).
Or, we could just rewrite that. So, removing the summation is c sub 1 + phi of (h sub 1)- phi of (h sub
0), + c sub 2 + phi of (h sub 2)- phi of (h sub 1) and so on. What's important to note is that we have a
phi of h sub 1 in the first line and then a minus phi of h sub 1 in the second line, so those two cancel
out. Similarly, we have a phi of h sub 2 in the second line, and we have a phi of h sub 3 when we look
at the amortized cost at time three. And, that goes on and on until at time n-1, we would have a phi of
h sub n-1 positive and a negative phi of h sub n-1 negative. So, if all these cancellations and all we're
left with is the very first term phi of h sub 0, negative phi of h sub 0, and the very last term in the last
line which is phi of h sub n. So, this really just equals phi of h sub n minus phi of h sub 0 because all
the other phis cancel, plus the summation from i equals 1 to n of c sub i, that is the true costs. Since
phi of h sub n is non negative and phi of h sub 0 is 0, this value is greater than or equal to just the
summation of the true costs.
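Written out, the telescoping argument from this passage is:

```latex
\sum_{i=1}^{n} a_i
  = \sum_{i=1}^{n} \bigl( c_i + \Phi(h_i) - \Phi(h_{i-1}) \bigr)
  = \Phi(h_n) - \Phi(h_0) + \sum_{i=1}^{n} c_i
  \ge \sum_{i=1}^{n} c_i
```

Every intermediate \Phi(h_t) appears once with a plus sign and once with a minus sign, so only \Phi(h_n) and \Phi(h_0) survive; the inequality uses \Phi(h_n) \ge 0 and \Phi(h_0) = 0.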
What that means then is we've come up with a lower bound on the sum of the amortized costs, namely the sum of the true costs. So therefore, if we look at the sum of the amortized costs for an entire sequence of operations,
we know it's at least the sum of the true costs.
So, let's look at applying this physicist's method to the dynamic array. So, we're going to look at n calls
to PushBack.
Phi of h, the potential of the data structure at any given time, is going to be two times the size minus the capacity.
So, as the size increases, the potential's going to be increasing for a given fixed capacity.
We want to make sure that our phi function satisfies the requirements. First, phi of h sub 0 is 2 x 0 - 0, assuming we have an initial array of size 0 and capacity 0, and that's just 0. Also, phi of h sub i is 2 x size - capacity; we know that size is at least capacity over 2, so 2 x size - capacity is greater than or equal to 0.
Now, let's look at our amortized cost. So, we're going to assume we don't have to do a resize and let's
look at the amortized cost. So, we add a particular element i and the amortized cost is the cost of
insertion plus phi(h sub i) - phi(h sub i-1). So, the cost of insertion is just going to be 1 because we're
adding an element and we don't have to do any moving of elements. Phi of h sub i is 2 x size of i - the
capacity of i, and phi of h sub i- 1 is 2 x size i- 1 - capacity i- 1.
Well, what do we know? Since we're not resizing, the capacities don't change, so they cancel out. We are left with 2 times the difference in sizes, and the difference in size is just 1, because we added one element, so the amortized cost is 1 + 2 x 1, or 3.
It's no accident that this 3 is the same value that we saw when we used the banker's method.
And then, let's look at the cost when we have to do a resize. So, we're going to define here k is size
sub i-1, which is the same thing as capacity sub i-1. Why is it the same? because we're about to do a
resize. So, that means that after the previous operation, we must have made the dynamic array full.
And then, phi(h sub i-1) is just 2 times the old size minus the old capacity, and that's just 2 x k - k, or k.
Phi(h sub i) is 2 times the size of i - capacity of i, and that's 2(k + 1), because the size sub i is one more
than the size of i-1, minus 2k. Why 2k? Because we double the capacity each time. So, that's just
equal to 2. So, the amortized cost of adding the element is c sub i + phi(h sub i) - phi(h sub i-1). Here c sub i is just size sub i, because we have to move the size sub i-1 old elements and then add the one new element. So we have (size sub i) + 2 - k, which is just (k+1) + 2 - k, which is 3.
So, what we have seen now is that the amortized cost using the physicist's method of adding
elements is 3.
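This calculation can be verified operation by operation. The sketch below (illustrative, not course code) uses the lecture's potential Phi(h) = 2 x size - capacity and computes c_i + Phi(h_i) - Phi(h_{i-1}) for each PushBack. Note that the very first insertion, which grows the capacity from 0 to 1 rather than doubling it, comes out below 3:

```python
def physicist_amortized_costs(n):
    """Amortized cost c_i + Phi(h_i) - Phi(h_{i-1}) of each of n
    PushBacks, with potential Phi(h) = 2 * size - capacity."""
    size, capacity = 0, 0
    amortized = []
    for _ in range(n):
        phi_before = 2 * size - capacity
        if size == capacity:               # resize: move `size` elements, then write
            true_cost = size + 1
            capacity = max(1, 2 * capacity)
        else:
            true_cost = 1                  # just write the new element
        size += 1
        phi_after = 2 * size - capacity
        amortized.append(true_cost + phi_after - phi_before)
    return amortized

print(physicist_amortized_costs(10))
# first entry is 2 (capacity grows 0 -> 1), every later entry is exactly 3
```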
Let's go back to the dynamic array. 
So are there alternatives to doubling the array size? 
Right, we doubled each time. 
What happens if we didn't double? 
Well we could use some different growth factor. 
So for instance, we could use 2.5. 
So grow the array by more than two, or grow the array by less than two. 
As long as we used some constant multiplicative factor, we'd be fine. 
The question is can we use a constant amount? 
Can we add by a particular amount, like, let's say, 10 each time? 
And the answer is really, no. 
And the reason is that as the array gets bigger and bigger, and we have to resize every ten insertions, we just don't have enough time to accumulate work in order to actually pay for the movement.
Let's look at another way. 
Let's look at an aggregate method. 
Let's say c sub i is the cost of the i'th insertion. 
We're going to define that as one, for putting in the i'th element, 
plus either i-1 if the i-1'th insertion makes the 
dynamic array full. 
So that is if i-1 is a multiple of 10 and it's 0 otherwise.
By the definition of the aggregate method, the amortized cost is just the sum of the total costs divided by n. The 1 summed n times gives n, plus the summation from j = 1 to (n-1)/10 of 10j, which is just the multiples of 10 all the way up to but not including n: 10, 20, 30, 40, and so on.
All that divided by n. 
Well, we can pull the 10 out of that summation so 
it's just 10 x the summation j = 1 to (n- 1)/10 of j. 
So that's just numbers 1, 2, 3, 4, and so on, all the way up to (n- 1)/10.
That summation is O(n squared). So we've got (n + 10 x O(n^2)) / n = O(n^2) / n = O(n).
So this shows that if we use a constant amount to grow the dynamic 
array each time that we end up with an amortized cost for push back of O(n) 
rather than O(1). 
So it's extremely important to use a constant factor.
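The gap between multiplicative and additive growth is easy to measure. A sketch under the same cost model as before (1 per write plus one unit per moved element; the `grow` parameter is my own abstraction):

```python
def total_cost(n, grow):
    """Total cost of n PushBacks; `grow` maps the old capacity to the
    new capacity when the array is full."""
    size, capacity, total = 0, 0, 0
    for _ in range(n):
        if size == capacity:
            total += size               # move all existing elements
            capacity = grow(capacity)
        total += 1                      # write the new element
        size += 1
    return total

n = 10_000
print(total_cost(n, lambda c: max(1, 2 * c)))  # 26383: doubling gives O(n) total
print(total_cost(n, lambda c: c + 10))         # 5005000: +10 growth gives O(n^2) total
```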
So in summary we can calculate the amortized cost
in the context of a sequence of operations. 
Rather than looking at a single operation in its worst case we look at a totality 
of a sequence of operations. 
We have three ways to do the analysis. 
The aggregate method, 
where we just do the brute-force sum based on the definition of the amortized cost. 
We can use the banker's method where we actually use tokens and 
we're saving them conceptually in the data structure. 
Or the physicist's method where we define a potential function, 
and look at the change in that potential.
Nothing changes in the code. 
We're only doing runtime analysis, so 
the code doesn't actually store any tokens at all. 
That's an important thing to remember.
That is dynamic arrays and amortized analysis.
QUIZ • 30 MIN

Dynamic Arrays and Amortized Analysis


Slides and External References

Slides
Download the slides for dynamic arrays and amortized analysis here:

05_4_dynamic_arrays_and_amortized_analysis.pdf (PDF file)
References
See Chapter 17 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford
Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Additional Video
This external video may be useful to give another perspective on amortized analysis in general,
and the banker's method in particular.

Week 3
Data Structures


Priority Queues and Disjoint Sets


We start this module by considering priority queues which are used to efficiently schedule jobs,
either in the context of a computer operating system or in real life, to sort huge files, which is the
most important building block for any Big Data processing algorithm, and to efficiently compute
shortest paths in graphs, which is a topic we will cover in our next course. For this reason, priority
queues have built-in implementations in many programming languages, including C++, Java, and
Python. We will see that these implementations are based on a beautiful idea of storing a complete
binary tree in an array that allows us to implement all priority queue methods in just a few lines of code.
We will then switch to the disjoint sets data structure, which is used, for example, in dynamic graph
connectivity and image processing. We will see again how simple and natural ideas lead to an
implementation that is both easy to code and very efficient. By completing this module, you will be
able to implement both these data structures efficiently from scratch.

Less

Key Concepts

 Describe how heaps and priority queues work


 Describe how disjoint set union data structure works
 Analyze the running time of operations with heaps
 List the heuristics that speed up disjoint set union
 Apply priority queues to schedule jobs on processors
 Apply disjoint set union to merge tables in a database

Less

Priority Queues: Introduction

Video: LectureIntroduction

6 min


Video: LectureNaive Implementations of Priority Queues

5 min


Reading: Slides

10 min

Priority Queues: Heaps

Video: LectureBinary Trees

1 min

Reading: Tree Height Remark

10 min

Video: LectureBasic Operations

12 min

Video: LectureComplete Binary Trees

9 min

Video: LecturePseudocode

8 min

Reading: Slides and External References

10 min

Priority Queues: Heap Sort

Video: LectureHeap Sort

10 min

Video: LectureBuilding a Heap

10 min

Video: LectureFinal Remarks

4 min


Quiz: Priority Queues: Quiz

6 questions



Reading: Slides and External References

10 min

Disjoint Sets: Naive Implementations

Video: LectureOverview

7 min

Video: LectureNaive Implementations

10 min

Reading: Slides and External References

10 min

Disjoint Sets: Efficient Implementation

Video: LectureTrees for Disjoint Sets

7 min

Video: LectureUnion by Rank

9 min

Video: LecturePath Compression

6 min

Video: LectureAnalysis (Optional)

18 min


Quiz: Quiz: Disjoint Sets

4 questions


Reading: Slides and External References

10 min
Programming Assignment 2

Practice Quiz: Priority Queues and Disjoint Sets

3 questions


Programming Assignment: Programming Assignment 2: Priority Queues and Disjoint Sets

2h


Survey

Survey

10 min
Introduction
Hello everybody. Welcome back.
Today, I'm going to be talking about priority queues.
This popular data structure has built-in implementations in many programming languages. For
example in C++, Java, and Python. And in this lesson, we will learn what is going on inside these
implementations. We will see beautiful combinatorial ideas that allow us to store the contents of a priority queue in a complete binary tree, which is in turn stored just as an array. This gives an implementation which is both time and space efficient. Also, it can be implemented in just a few lines
of code. A priority queue data structure is a generalization of the standard queue data structure.
Recall that the queue data structure supports the following two main operations. So we have a queue
and when a new element arrives, we put it to the end of this queue by calling the method
PushBack(e). And when we need to process the next element we extract it from the beginning of the
queue by calling the method PopFront(). In the priority queue data structure, there is no such thing as
the beginning or the end of a queue. Instead we have just a bag of elements, but each element is
assigned a priority. When a new element arrives, we just put it inside this bag by calling the method
Insert. However, when we need to process the next element from this bag, we call the method
ExtractMax which is supposed to find an element inside this bag whose priority is currently maximum.
A typical use case for priority queues is the following. Assume that we have a machine and we would
like to use this machine for processing jobs. It takes time to process a job and when we are processing
the current job, a new job may arrive.
So we would like to be able to quickly perform the following operations. First of all, when a new job arrives, we would like to insert it into the pool of waiting jobs quickly, right? And when we are
done with the current job, we would like to be able to quickly find the next job. That is, the job with
the maximum priority.
Okay, and now we are ready to state the definition of priority queue formally. Formally a priority
queue is an abstract data type which supports the two main operations, Insert and ExtractMax.
Consider a toy example. We have a priority queue which is initially empty. We then insert element 5
in it, we then insert 7, then insert 1, and then insert 4.
So we put these elements in random places inside this box on the left, just to emphasize, once again,
that there is no such thing as the beginning or the end of a priority queue. So it is not important how
the elements are stored inside the priority queue. What is important for us now is that if we call
ExtractMax() for this priority queue, then an element with the currently highest priority should be
extracted. In our toy example it is 7. So if we call ExtractMax for this priority queue, then 7 is taken
out of the priority queue. Then, well let's insert 3 into our priority queue and now let's call
ExtractMax(). The currently highest priority is 5, so we extract 5. Then we ExtractMax() once again,
and now it is 4, okay? Some additional operations that we might expect from a particular
implementation of a priority queue data structure are the following. So first of all, we might want to
remove an element. I mean, not to extract an element with a maximum priority, but to remove a
particular element given by an iterator, for example. Also, we might want to find the maximum
priority without extracting an element with a maximum priority. So GetMax is an operation which is
responsible for this. And also, we might want to change the priority of a given element. I mean, to
increase or to decrease its priority. So ChangePriority(it,p) is the operation responsible for this. Let us
conclude this introductory video by mentioning a few examples of famous algorithms that use priority
queues essentially.
Dijkstra's algorithm uses priority queues to efficiently find the shortest path from point a to point b on a map or in a graph.
Prim's algorithm uses priority queues to find an optimum spanning tree in a graph, this might be
useful for example in the following case. Assume that you have a set of computers and you would like
to connect them in a network by putting wires between some pairs of them. And you would like to
minimize the total price or the total length of all the wires.
Huffman's algorithm computes an optimum prefix-free encoding of a string or a file. It is used, for
example, in MP3 audio format encoding algorithms.
Finally, the heap sort algorithm uses priority queues to efficiently sort n given objects. It is a comparison-based algorithm whose running time is O(n log n), even in the worst case. Another advantage of this algorithm is that it is in place: it uses no extra memory for sorting the input data. We will go through all these algorithms in this specialization, and the heap sort algorithm will be covered in this lesson, in the forthcoming videos.
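The toy example from this video can be reproduced with Python's built-in heapq module, one of the library implementations mentioned above. heapq is a min-heap, so a common trick (used here, not prescribed by the lecture) is to store negated priorities to get ExtractMax:

```python
import heapq

pq = []                            # the "bag" of elements
for x in (5, 7, 1, 4):
    heapq.heappush(pq, -x)         # Insert(x), negated for max-priority behaviour

print(-heapq.heappop(pq))          # ExtractMax() -> 7
heapq.heappush(pq, -3)             # Insert(3)
print(-heapq.heappop(pq))          # ExtractMax() -> 5
print(-heapq.heappop(pq))          # ExtractMax() -> 4
```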
Naive Implementations of Priority Queues
As usual before going into the details of efficient implementation let's check what is wrong with naive
implementations? For example, what if we store the contents of a priority queue just in an unsorted
array or in an unsorted list? In this example on the slide, we use a doubly linked list. Well, in this case
inserting a new element is very easy. We just append the new element to the end of our array or list.
For example, if our new element is 7, we can just put it into the next available cell of our array, or we can just append it to the end of the list. So we put 7 at the end. We say that the
previous element of 7 is 2 and that there is no next element, right? So it is easy and it takes constant
time, okay? Now, what about extracting the maximum element in this case? Well, unfortunately we
need to scan the whole array to find the maximum element.
And we need to scan the whole list to find the maximum element, which gives us a linear running time, that is, O(n). In our previous naive implementation using an unsorted array or list, the running time of the extract max operation is linear. Well, a reasonable approach to try to improve this is to keep the contents of our container, for example an array, sorted. Well, what are the advantages of
this approach? Well, of course, in this case, extract max is very easy. So, the maximum element, is just
the last element of our array. Right? Which means that the running time of ExtractMax in this case is
just constant. However, the disadvantage is that now the insertion operation takes linear time, and
this is why. Well, to find the right position for the new element we can use the binary search. This is
actually good, well it can be done in logarithmic time. For example, if we need to insert 7 in our
priority queue, then in logarithmic time we will find out that it should be inserted between 3 and 9 in
this for example. However unfortunately after finding this right position, we need to shift everything
to the right of this position by one.
Right, just to create a vacant position for 7. For this, we first shift 16 to the next cell, then we move 10, then 9, and finally we put 7 into the vacant cell, and the array is sorted again. So in the worst case, we need to shift a linear number of elements, which gives us a linear running time for the insertion operation. As we've just seen, inserting an element into a sorted array is expensive, because to insert an element into the middle, we need to shift all elements to the right of this position by one. This makes the running time of the insertion procedure linear. However, if we use a doubly linked list, then inserting into the middle of this list is actually a constant time operation. So let's try to use a sorted list. Well, the first advantage is that the ExtractMax operation still takes constant time. This is just because the maximum element in our list is just the last element, right? So for this reason, we have constant time for ExtractMax. Also, another advantage is that inserting into the middle of this list actually takes a constant amount of work, not linear, and this is why. Again, let's try to insert 7 into our list. Well, this can be done as follows.
We know that 7 should be inserted between 3 and 9. So we do just the following. We remove the pointers between 3 and 9. We say that now the next element after 3 is 7 and the previous element before 7 is 3, and also that the next element after 7 is 9 and the previous element before 9 is 7. So inserting an element just involves changing four pointers, right? So it is a constant time operation. However, everything is not so easy, unfortunately, and this is because just finding the right position for inserting this new element takes a linear amount of work. This is, in particular, because we cannot use binary search for lists. Given the first element of this list and the last element of this list, we cannot find the middle element of this list, because this is not an array: we cannot just compute the middle index. So for this reason, just finding the right position for the new element, I mean, to keep the list sorted, already takes a linear amount of work. And for this reason, inserting into a sorted list still takes a linear amount of work.
To conclude, if you implement a priority queue using a list or an array, sorted or not, then one of the operations Insert and ExtractMax takes a linear amount of work. In the next video, we will show a data structure called a binary heap, which allows us to implement a priority queue so that both of these operations can be performed in a logarithmic amount of work.
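As a sketch (not code from the course; the class and method names are my own), the two naive implementations discussed above might look as follows in Python. Note how each version makes one operation cheap and the other linear:

```python
import bisect

class UnsortedArrayPQ:
    """Naive priority queue: O(1) insert, O(n) extract_max."""
    def __init__(self):
        self.data = []

    def insert(self, p):
        self.data.append(p)                  # constant time: append at the end

    def extract_max(self):
        i = self.data.index(max(self.data))  # linear scan for the maximum
        return self.data.pop(i)

class SortedArrayPQ:
    """Sorted-array priority queue: O(n) insert, O(1) extract_max."""
    def __init__(self):
        self.data = []

    def insert(self, p):
        # binary search finds the position in O(log n),
        # but shifting elements makes the whole insert O(n)
        bisect.insort(self.data, p)

    def extract_max(self):
        return self.data.pop()               # maximum is the last element
```

For example, inserting 2, 9, 3, 16 into either queue and then calling `extract_max()` twice returns 16 and then 9.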

Slides

Download the slides for this lesson:

06_1_priority_queues_1_intro.pdf (PDF File)

Priority Queues: Heaps


Binary Trees
Hello. 
In this lesson we will consider binary heaps in full detail. 
A binary heap is one of the most common ways of implementing a priority queue. By definition, a max binary heap is a binary tree where each node has zero, one, or two children and where the following property is satisfied for each node: the value of the node is at least the value of each of its children. Or, to put it otherwise, if you take any edge in this tree, then the value of its top end is at least the value of its bottom end. So this is an example of a binary max heap, and it can be easily checked that this property is satisfied on all the edges of this tree.
On the other hand, this is an example of a binary tree which is not a binary max heap. And this is why: here, we have five edges where the property is violated. For example, on this top edge, the value of the top end is 10 while the value of the bottom end is 25. And there are four other edges where the property is violated.
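To make the edge-by-edge definition concrete, here is a small sketch (not from the lecture; the `Node` class is my own) that checks the max-heap property on a node-based binary tree:

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def is_max_heap(node):
    """Check that on every edge the top value is >= the bottom value."""
    if node is None:
        return True
    for child in (node.left, node.right):
        if child is not None and child.value > node.value:
            return False  # property violated on this edge
    return is_max_heap(node.left) and is_max_heap(node.right)
```

A tree rooted at 10 with a child 25, as in the counterexample above, fails this check on the very first edge.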
Tree Height Remark

In the previous module, we defined the height of a tree as the number of nodes on a longest path from the root to a leaf. In this module, we use a slightly different definition: we define the height to be the number of edges on a longest path from the root to a leaf. In particular, the height of a tree that consists of one node is equal to 0, and the height of the tree shown below is equal to 3.

Both definitions of height are frequently used in practice, so it is always a matter of context. If there is no definition in the text, you should look at the examples discussed to understand which of the two definitions is used.
Basic Operations
Let's see how basic operations work with binary max heaps.
What is particularly easy for binary max heaps is finding the maximum value without extracting it, I mean, implementing the GetMax operation. Well, recall that the main property of the binary max heap tree is the following: for each edge, its top value is greater than or equal to its bottom value. This means that if we go from bottom to top in our tree, the values can only increase. This, in particular, means that the maximum value is stored at the root of our tree. So to implement GetMax, we just return the value at the root of our tree, and this takes just constant time, of course. Now let's see how inserting a new element into the max binary heap works. First of all, a new element should be attached somewhere to our tree. We cannot attach it to the root in this case, for example, because the root already has two children. Therefore, we just attach it to some leaf. Let's select, for example, the leaf 7 and attach a new node to it. The new node in this case has value 32. Well, it is still a binary tree, right? Because 7, before attaching, had zero children, and now it has just one child. So it is still a binary tree. However, the heap property might potentially be violated, and it is actually violated in this case, which is shown by this red edge. So for this red edge, the value of the parent, which is 7, is less than the value of its child, which is 32. So we need to fix it somehow. To fix it, we just let the new element sift up. This new element has value 32, which is relatively large with respect to all other elements in this tree, so we need to move it somewhere closer to the root. The process of moving it closer to the root is called sifting up.
So the first thing to do is to fix this problematic edge. To fix it, we perform the following simple operation: we just swap the corresponding two elements. In this case, we swap 7 and 32. After the swap, there is no problem on this edge. However, it might be the case that the new element 32 is still greater than its parent, and this is the case in our toy example. The parent of 32 is now 29, which is smaller than 32, so we still need to fix this red problem. And we just repeat this process: we again swap the new element with its parent. So we swap it, and now we see that the property is satisfied for all edges in this binary tree.
So what we've just done is let the new element sift up. And what is important to note here is that we maintained the following invariant: at any point of time during sifting the new element up, the heap property is violated on at most one edge of our binary tree. If we see that there is a problematic edge, we just swap its two elements, right? And each time during this process, the problematic node gets closer to the root. This, in particular, implies that the number of swaps required is at most the height of this tree, which in turn means that the running time of the insertion procedure, as well as the running time of the sifting up procedure, is big O of the tree height.
Now let's see how the ExtractMax procedure works for binary max heaps. First of all, recall that we already know that the maximum value is stored at the root of the tree. However, we cannot just detach the root node, because that would leave two subtrees, right? We need to somehow preserve the structure of the tree. What is easy to detach from a binary tree is any leaf. So let's do the following: let's select any leaf of our tree and replace the root with this leaf. In this case, this produces the following tree.
This potentially might violate the heap property, and in this case, it does. The new root, 12, is less than both its children, so the property is violated on two edges. 12 is a relatively small number in this case, so we need to move it down towards the leaves. For this, we will implement a new procedure, which is called SiftDown. Similarly to SiftUp, we are going to swap the new element with one of its children. In this case, we actually have a choice: we can swap it either with its left child or with its right child. By thinking a little, we realize that it makes more sense to swap it with the left child in this case, because the left child is larger than the right child: after we replace 12 with 29, the right problematic edge will be fixed automatically, right? So this is how we are going to perform the SiftDown procedure: we select the larger of the two children and swap the problematic node with this larger child. As you can see, the right problematic edge is fixed automatically, and the left edge is also fixed, just because we swapped the two elements. However, the swap might introduce new problems closer to the bottom of the tree. Now we see that there is still one problematic edge: 12 is smaller than 14, but it is greater than 7, so we are safe in the right subtree. In this case, we swap 14 with 12, and after that, we get a tree where the property is satisfied on all edges. So once again, we maintain the following invariant: at each point of time, we have at most one problematic node, and we always swap the problematic node with the larger of its two children, so as to fix both problematic edges, right? And the problematic node always gets closer to the leaves, which means that the total running time of the ExtractMax procedure, as well as of the SiftDown procedure, is proportional to the tree height.
Now that we have implemented both procedures, sifting up and sifting down, it is not so difficult to also implement the ChangePriority procedure. Assume that we have an element whose priority we would like to change, meaning that we are going either to decrease or to increase its priority. To fix the potential problems that might be introduced by changing its priority, we are going to call either sifting up or sifting down.
Let me illustrate this again on the toy example. Assume that we are going to change the priority of this leaf, 12. So we've just changed it: we increased the priority of this element to 35. In this case, we potentially introduced some problems, and we need to fix them.
Well, we see that 35 is a relatively large number, which means that we need to sift it up, that is, to move it closer to the root. To do this, we just call the SiftUp procedure, which repeatedly swaps the problematic node with its parent. In this case, this produces the following sequence of swaps.
First, we swap 35 with 18. This gives us the following picture: we see there is still a problem, 35 is still larger than its parent, so we swap it again. Now we see that 35 is smaller than its parent, and actually, the heap property is satisfied for all edges. Once again, what is important in this case is that at each point of time, the heap property is violated on at most one edge of our tree. Since our problematic node gets closer to the root after each swap, we conclude that the running time of the ChangePriority procedure is also at most big O of the tree height. There is an elegant way of removing an element from the binary max heap. Namely, it can be done just by calling two procedures that we already have. So assume that we have a particular element that we are going to remove.
The first step is to change its priority to plus infinity, that is, to a number which is definitely larger than all the elements in our binary max heap. When we call it, the ChangePriority procedure will sift this element up to the top of our tree, namely, to the root. Then, to remove this element, it is enough to call the ExtractMax procedure. In this particular example, it works as follows. Assume that we are going to remove the element 18, which is highlighted here on this slide. We first change its priority to infinity. Then the ChangePriority procedure calls the SiftUp procedure. This procedure realizes that the property is violated on this edge and swaps these two elements. Then it swaps the next two elements, and at this point, the node that we are going to remove is at the root. Well, to remove this node, we just call the ExtractMax procedure. Recall that the first step of ExtractMax is to replace the root node with some leaf. So let's select, for example, 11. We replace the root with 11. Then we need to call SiftDown, just to let this new root go down closer to the leaves.
In this case, 11 will be replaced first by 42, then there is still a problem on the edge from 11 to 18, so we swap 11 with 18, and finally we swap 11 with 12. Once again, since everything boils down to just two procedures, ChangePriority and ExtractMax, and they both work in time proportional to the tree height, we conclude that the running time of the Remove procedure is also, at most, big O of the tree height. To summarize, we were able to implement all max binary heap operations in time proportional to the tree height, and the GetMax procedure even works in constant time in our current implementation. So we definitely would like to keep our trees shallow, and this will be the subject of our next video.

Complete Binary Trees


Our goal in this video is to design a way of keeping our binary max heap shallow. What is a natural approach to creating a tree out of n given nodes whose height is as small as possible? Well, it is natural to require that all the levels are fully packed. This leads us to the notion of a complete binary tree. By definition, a binary tree is called complete if all its levels are filled completely, except possibly the last one, where we additionally require that all the nodes at this last level are in the leftmost positions. Let me illustrate this with a few small examples. So this is a complete binary tree. This is also a complete binary tree, and this is also a complete binary tree. And this is a complete binary tree too.
And this is our first example of a binary tree which is not complete. It is not complete because on the last level, the two nodes shown here are not in the leftmost positions. This is also not a complete binary tree. This binary tree is also not complete, because this child is missing here, right? And this is also an example of a binary tree which is not complete.

The first advantage of complete binary trees is straightforward, and it is exactly what we need. Namely, the height of any complete binary tree with n nodes is O(log n). Intuitively, this is clear: a complete binary tree with n nodes has the minimum possible height over all binary trees with n nodes, just because all the levels of this tree, except possibly the last one, are fully packed. Still, let me give you a formal proof.

For this, consider our complete binary tree, and let me show a small example. Assume that this is our complete binary tree. In this case, n = 10 and the number of levels l = 4. Well, let's first do the following thing: let's complete the last level, and let's denote the resulting number of nodes by n'. In this case, in particular, the number of nodes in the new tree is equal to 15. The first thing to note is that n' is at most 2n. This is just because, in such a tree where all levels including the last one are fully packed, the number of nodes on each level is one more than the number of nodes on all the previous levels combined. For example, here the number of nodes on the last level is 8, and the number of nodes on all previous levels is 7. So we added at most seven vertices. Now, when we have such a tree where all the levels are packed completely, it is easy to relate the number of levels to the number of nodes. Namely, n' = 2^l − 1. This allows us to conclude that l = log₂(n' + 1). Now, recall that n' is at most 2n, which allows us to write that l is at most log₂(2n + 1), which is, of course, O(log n).
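This bound can be checked numerically. For a complete binary tree with n nodes, the height in edges works out to exactly ⌊log₂ n⌋. A quick sketch (the helper name is mine, not from the lecture), using exact integer arithmetic rather than floating-point logarithms:

```python
def complete_tree_height(n):
    """Height (number of edges on a longest root-to-leaf path)
    of a complete binary tree with n nodes, n >= 1.
    n.bit_length() - 1 is exactly floor(log2(n))."""
    return n.bit_length() - 1
```

For the 10-node, 4-level example above, this gives height 3, and a fully packed tree with 2^4 − 1 = 15 nodes has the same height.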
The second advantage of complete binary trees is not so straightforward, but fortunately, it is still easy to describe. To explain it, let's consider again a toy example, I mean, the complete binary tree shown here on this slide. Let's enumerate all its nodes, going from top to bottom, and on each level from left to right.
This way, the root receives number 1, its two children receive numbers 2 and 3, and so on. It turns out that such a numbering allows us, for each vertex number i, to compute the number of its parent and the numbers of its two children using the following simple formulas. Once again, if we have a node number i, then its parent has number i divided by 2 and rounded down, while its two children have numbers 2i and 2i + 1. To give a specific example, assume that i = 4, which means that we are speaking about this node. Then, to find out the number of its parent, we need to divide i by 2, which gives us 2.
And indeed, vertex number 2 is the parent of vertex number 4. To find out the numbers of the two children of this node, we multiply i by 2, which gives us this node, and multiply i by 2 and add 1, which gives us this node. And these two nodes are indeed the children of vertex number 4, right? This is very convenient: it allows us to store the whole complete binary tree just in an array. We do not need to store, for each vertex, any links to its parent and its two children; these links can be computed on the fly. Again, to give a concrete example, assume that we are talking about vertex number 3, so in this case i = 3. To find out the number of its parent, we divide i by 2 and round down. This gives us vertex number 1, and indeed, vertex number 1 is the parent of vertex number 3.
And to find out the numbers of its two children, we multiply i by 2, and also multiply i by 2 and add 1. This gives us, in this case, vertex number 6 and vertex number 7, and we know their indices in the array.
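These three formulas translate directly into code. A minimal sketch, using the same 1-based numbering as the lecture:

```python
def parent(i):
    return i // 2        # i divided by 2, rounded down

def left_child(i):
    return 2 * i

def right_child(i):
    return 2 * i + 1
```

For example, parent(4) is 2, and the children of node 3 are nodes 6 and 7, matching the examples above.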
Okay, we have just discussed two advantages of complete binary trees, and it would be too optimistic to expect that these advantages come to us at no cost. We need to pay something, and our cost is that we need to keep the tree complete: we need to ensure that at each point of time, our binary tree is complete. To ensure this, let's just ask ourselves which operations change the shape of our tree. Essentially, these are only two operations, namely Insert and ExtractMax. The two operations SiftUp and SiftDown actually do not change the shape of the tree; they just swap some two elements inside the tree. Another operation which actually changes the shape is removing an element; however, it does so by calling the ExtractMax procedure. So on the next slide, we will explain how to modify our Insert and ExtractMax operations so that they preserve the completeness of our tree.



To keep a tree complete, when we insert something into a complete binary tree, we just insert the new element as a leaf into the leftmost vacant position on the last level. An example is given here: we insert element 30 just to the right of the last element on the last level. Then we need to let this new element sift up, so we perform a number of swaps. 30 is swapped with 14; then there is still a problem, since 30 is greater than 29, so we swap it again. Now the heap property is satisfied for all the edges.
When we need to extract the maximum value, recall that we first replace the root by some leaf. In this case, to keep the tree complete, let's just select the last leaf on the last level, which here is 14. We replace 42 with 14 and then, again, perform a number of swaps required to satisfy the heap property. In this case, 14 is swapped with 30, and then 14 is swapped with 29. This gives us a correct heap whose underlying tree is a complete binary tree.
So far, so good: we now know how to keep the tree complete and how to store it in an array. In the next video, we will show the full pseudocode of the binary max heap data structure.
Pseudocode

In this video, we provide the full pseudocode of the binary max heap data structure.
Here, we will maintain the following three variables: H is an array where our heap is stored; maxSize is the size of this array and, at the same time, the maximum number of nodes in our heap; and size is the actual size of our heap. So size is always at most maxSize.

Let me give you an example. In this case, we are given a heap of size 9, and it is stored in the first nine cells of our array H, whose size is 13. In particular, you may notice that there are some values to the right, and they are actually garbage: we just don't care about any values that lie to the right of position number 9. So our heap occupies the first nine positions of the array. Also, let me emphasize once again that we store just the array H and the variables size and maxSize. The tree is given to us implicitly: for any node, we can compute the number of its parent and the numbers of its two children, and access the corresponding values in this array. For example, if we have node number 3, then we can compute the index of its left child, which is 2 multiplied by 3, and read the value stored there; the value of its right child here is 18. The implementations shown find, given a node i, the index of the parent of i and of the two children of i; they just implement our formulas in a straightforward way.
To sift element i up, we do the following. While this element is not the root, namely, while i is greater than 1, and while the value of this node is greater than the value of its parent, we swap this element with its parent; this is done on this line. Then we proceed with the new position: we assign i to be equal to Parent(i) and go back to the while loop. We do this until the heap property is satisfied.
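In Python, the SiftUp pseudocode might be sketched as follows. To match the lecture's 1-based indexing, cell H[0] is left unused, which is a choice of this sketch rather than something the course prescribes:

```python
def sift_up(H, i):
    """Move element H[i] towards the root while it exceeds its parent.
    H is 1-based: H[0] is a placeholder and is ignored."""
    while i > 1 and H[i // 2] < H[i]:
        H[i // 2], H[i] = H[i], H[i // 2]  # swap with the parent
        i = i // 2                         # continue from the parent's position
```

For instance, appending 32 as a new leaf of a valid heap and sifting it up moves it two levels towards the root, exactly as in the insertion example from the earlier video.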



To sift element number i down, we first need to select the direction of sifting. Namely, if element number i is smaller than one or both of its children, we first need to select the larger of its two children. This is done here: initially, we assign to the variable maxIndex the value of i. Then we compute the index of the left child of node number i. In the next if statement, we first check whether i indeed has a left child; this is done by checking whether l is at most size, namely, whether l is an index inside our heap. Then, if H[l] is greater than H[maxIndex], we assign maxIndex to be equal to l. Then we do the same with the right child: we first compute its index, then check whether this index is indeed in our heap, then check whether the value of this node is greater than the value at our current maximum index, and if it is, we update maxIndex. Finally, if node i is not the largest one among itself and its two children, namely, if i is not equal to maxIndex, we swap element number i with element number maxIndex. This is done here. Then we continue sifting down the just-swapped element. This is done recursively; however, it is not so difficult to avoid recursion here by introducing a while loop instead.
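A Python sketch of this procedure, again with 1-based indexing (H[0] unused, my convention), and with the recursion replaced by a while loop as the lecture suggests:

```python
def sift_down(H, size, i):
    """Move element H[i] towards the leaves while it is smaller than a child.
    H is 1-based (H[0] unused); size is the number of elements in the heap."""
    while True:
        max_index = i
        l = 2 * i                                  # left child
        if l <= size and H[l] > H[max_index]:
            max_index = l
        r = 2 * i + 1                              # right child
        if r <= size and H[r] > H[max_index]:
            max_index = r
        if max_index == i:
            break                                  # heap property restored
        H[i], H[max_index] = H[max_index], H[i]    # swap with the larger child
        i = max_index
```

Running it on the ExtractMax example, where the root has just been replaced by the small value 12, pushes 12 down past its larger children.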

To insert a new element with priority p in our binary max heap, we do the following. We first check
whether we still have room for a new element, namely whether size is equal to maxSize. If it is equal,
then we just return an error. Otherwise, we do the following. We increase size by 1, then we assign H
of size to be equal to p. At this point we add a new leaf in our implicit tree to the last level, to the
leftmost position on the last level. And finally, we call SiftUp to sift this element up if needed.

To extract the maximum value from our binary max heap, we first store the value of the root of our tree in the variable result: result is assigned to be equal to H[1]. Then we replace the root by the last leaf, the rightmost leaf on the last level; this is done by assigning H[1] to be equal to H[size]. Then we decrease the value of size by 1, just to indicate that the last leaf is not in our tree anymore. And finally, we call SiftDown for the root, because it was replaced by the last leaf, which is potentially quite small and needs to be sifted down. The last instruction in our pseudocode returns the result, that is, the value which was initially at the root of our tree.

Removing an element, as we've discussed already, boils down to calling two procedures that we already have. Once again, to remove element number i, we do the following. First, we change its priority to be equal to infinity: we assign H[i] to be plus infinity. Then we sift this node up, which moves it to the root of our tree. And then we just call the ExtractMax procedure, which removes the root from the tree and makes the necessary changes to restore its shape.
Finally, to change the priority of a given node i to a given value p, we do the following. We first assign H[i] to p. Then we check whether the new priority is greater or smaller than the old priority. If it is greater, then we potentially need to sift the node up, so we just call SiftUp. If the new priority is smaller, then we call SiftDown for this node.
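Putting all the pieces of the pseudocode together, the whole data structure can be sketched in Python as follows. This is my own translation, not code from the course: it keeps the lecture's 1-based indexing (H[0] unused) and fixed-capacity array, and Remove additionally returns the removed value for convenience:

```python
class BinaryMaxHeap:
    def __init__(self, max_size):
        self.max_size = max_size
        self.size = 0
        self.H = [None] * (max_size + 1)   # H[0] is unused (1-based indexing)

    def _sift_up(self, i):
        while i > 1 and self.H[i // 2] < self.H[i]:
            self.H[i // 2], self.H[i] = self.H[i], self.H[i // 2]
            i //= 2

    def _sift_down(self, i):
        while True:
            max_index = i
            l, r = 2 * i, 2 * i + 1
            if l <= self.size and self.H[l] > self.H[max_index]:
                max_index = l
            if r <= self.size and self.H[r] > self.H[max_index]:
                max_index = r
            if max_index == i:
                break
            self.H[i], self.H[max_index] = self.H[max_index], self.H[i]
            i = max_index

    def get_max(self):
        return self.H[1]                    # maximum is always at the root

    def insert(self, p):
        if self.size == self.max_size:
            raise OverflowError("heap is full")
        self.size += 1
        self.H[self.size] = p               # new leftmost vacant leaf
        self._sift_up(self.size)

    def extract_max(self):
        result = self.H[1]
        self.H[1] = self.H[self.size]       # replace the root by the last leaf
        self.size -= 1
        self._sift_down(1)
        return result

    def change_priority(self, i, p):
        old = self.H[i]
        self.H[i] = p
        if p > old:
            self._sift_up(i)
        else:
            self._sift_down(i)

    def remove(self, i):
        removed = self.H[i]
        self.H[i] = float("inf")            # priority "plus infinity"
        self._sift_up(i)                    # moves the node to the root
        self.extract_max()                  # then remove the root
        return removed
```

For example, after inserting 5, 42, 29, 18, 14, 7, 18, 12, two calls to extract_max return 42 and then 29.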



Time to summarize. In this sequence of videos, we considered the binary max heap data structure, which is a popular way of implementing the priority queue data type. The considered implementation is quite fast: all operations work in logarithmic time, and the GetMax procedure even works in constant time. It is also space efficient: in this data structure, we store a tree, but the tree is stored implicitly. Namely, for each node, we do not store links to its parent and its two children; instead, we compute the indices of the corresponding nodes on the fly. In this case, we store just n cells in an array, nothing more. Another advantage of this implementation is that it is really easy to code: as you've seen, the pseudocode of each operation is just a few lines. In the next video, we will show how to use a binary heap to sort data efficiently.

Slides and External References

Slides
Download the slides for this lesson:

06_1_priority_queues_2_heaps.pdf (PDF File)

References
See Chapter 6 in [CLRS]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

Priority queues: Heap Sort


Heap Sort

>> In this video we will use binary heaps to design the heap sort algorithm, which is a fast and space
efficient sorting algorithm. In fact, using priority queues, or using binary heaps, it is not so difficult to
come up with an algorithm that sorts a given array in time O(n log n). Indeed, given an array A of size
n, we do the following. First, just create an empty priority queue. Then, insert all the elements from
our array into the priority queue. Then, extract the maximum one by one from the priority queue.
Namely, we first extract the maximum. This is the maximum of our array, so we put it to the last position.
Then, extract the next maximum and put it to the left of the last position, and so on. This clearly gives
us a sorted array, right?
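As a quick sketch of this idea in Python (the course lets you pick your own language): the standard heapq module is a min-heap rather than a max-heap, so extracting the minimum repeatedly yields ascending order directly, at the cost of O(n) extra space for the queue.

```python
import heapq

def heap_sort_with_extra_space(a):
    """Sort by inserting all elements into a priority queue, then
    extracting them one by one. heapq is a min-heap, so repeated
    extraction of the minimum yields ascending order directly."""
    pq = []
    for x in a:                      # n inserts, O(log n) each
        heapq.heappush(pq, x)
    return [heapq.heappop(pq) for _ in range(len(pq))]  # n extractions

print(heap_sort_with_extra_space([5, 1, 4, 2, 3]))  # → [1, 2, 3, 4, 5]
```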

So, we know that if we use binary heaps as an implementation for priority queues, then all operations
work in logarithmic time. So, this gives us an algorithm with a running time O(n log n). And recall
that this is asymptotically optimal for algorithms that are comparison-based. And this algorithm is
clearly comparison-based, all right? Also, note that this is a natural generalization of the selection sort
algorithm. Recall that in the selection sort algorithm, we proceed as follows. Given an array, we first scan
the whole array to find the maximum value. We then take this maximum value and swap it with the
last element. Then, we forget about this last element and consider only the first n-1 elements.
Again, by scanning this array, we find the maximum value, and we swap it with the last element in this
region, and so on. So, in the heap sort algorithm, instead of scanning the array to find the maximum
value at each iteration, we use a smart data structure, namely a binary heap, right? So, the only
disadvantage of our current algorithm is that it uses additional space: it uses additional space to
store the priority queue.
Okay? So, in this lesson we will show how to avoid this disadvantage. Namely, given an array A, we
will first permute its elements somehow, so that the resulting array is actually a binary heap. So, it
satisfies the binary heap property. And then, we will sort this array just by calling ExtractMax
n minus 1 times.



Building a heap out of a given array turns out to be surprisingly simple. Namely, given an array A of size n,
we do the following. We first assign the value of n to the variable size, just to indicate that we
have a heap of size n. Then, for each i going down from n over two, rounded
down, to one, we just sift down the i-th element. Let me explain on a small picture how we
do this. So, consider the following.
The following heap, which, let me remind you, we do not store explicitly. We have just an
array, in this case, of size 15. So, this is the first node, the second one, the third one, four, five, six,
seven.
So, we just consider the corresponding array of size 15, and then imagine this complete binary tree.
Then, the heap property might potentially be violated on many edges. However, in this tree we have
15 nodes, so we have 15 subtrees. And for the subtrees rooted at the leaves of this tree, the heap
property is satisfied for an obvious reason: there are no edges in these subtrees.
So, the first node where the property might be violated is node number seven.
So, potentially, there might be a problem in this subtree. To fix this problem, we just call SiftDown for
node number seven. Okay? After this call, this small subtree must be fine, right? Then, we do the
same for node number six. After this call, there are no problems in the subtree rooted at node
number six. Then, we do the same for i equal to 5 and i equal to 4. Then, we proceed to node
number three. Note that, at this point, everything is fine in this subtree,
and in this subtree, right? We already fixed everything in these two subtrees. So, to satisfy
the heap property in the subtree rooted at node number three, it is enough to call SiftDown for node
number three. Okay? Then, we proceed to node number two. Again, by calling SiftDown
for node number two, we fix the heap property in this subtree, and so on. So, in this example,
the last thing that needs to be done is to call SiftDown for node number one. When we are done with
this, we are sure that what we have in array A is actually a heap. So, it corresponds to a complete
binary tree where the heap property is satisfied on every edge.
Let me repeat slowly what just happened. To turn a given array into a heap, we start repairing
the heap property in larger and larger subtrees. So, we start from the leaves and go to the root.
Initially, our induction base is that the heap property is satisfied in all
subtrees rooted at the leaves, for an obvious reason: any subtree rooted at a leaf has just one node,
so the property cannot be violated, right? Then, we gradually go up and fix the heap property by
sifting down the current vertex. And, finally, when we reach the root, the property is satisfied in the
whole tree, right? So, this is just a link for an online visualization. You can download the slides
and play with this visualization if something is unclear to you in this process.
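The procedure just described can be sketched as follows. This is an illustrative Python version; it uses 0-based array indexing, so the loop runs from n/2 - 1 down to 0 rather than from n/2 down to 1 as in the lecture's 1-based pseudocode.

```python
def sift_down(a, i, size):
    """Swap a[i] with its largest child while the max-heap
    property is violated (0-based indexing)."""
    while True:
        largest = i
        left, right = 2 * i + 1, 2 * i + 2
        if left < size and a[left] > a[largest]:
            largest = left
        if right < size and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest

def build_heap(a):
    """Turn the array into a max-heap by sifting down every node
    from the last non-leaf up to the root."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):
        sift_down(a, i, n)

a = [4, 1, 3, 2, 16, 9, 10]
build_heap(a)
print(a[0])  # → 16 (the maximum is now at the root)
```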
Let me now mention that the running time of this procedure is O(n log n), again for an obvious reason:
we call SiftDown roughly n over two times, and each call takes O(log n) time.

So, we have O(n log n) running time. We now have everything we need to present the in-place heap
sort algorithm. Given an array A of size n, we first build a heap out of it. Namely, we permute its
elements so that the resulting array corresponds to a complete binary tree which satisfies the heap
property on every edge. We do this just by calling the BuildHeap procedure. In particular, this
BuildHeap procedure assigns the value n to the variable size. Then, we repeat the following process n
minus 1 times. Recall that just after the call to BuildHeap, the first element of our array is
a maximum element, right? So, we would like to put it to the last position of our array. So, we just
swap A[1] with A[n]. And currently, size is equal to n, okay? So, we swap them. And then, we
forget about the last element by decreasing size by one. So, we say that now our heap occupies
the first n-1 positions. And since we swapped the last element with the first element, we potentially
need to sift down the first element. So, we just call SiftDown for element number one. And we
proceed in a similar fashion. I mean, now the heap occupies the first n-1 positions, so the largest
element among the first n-1 elements is the first element. So, we swap it with element n-1, forget
about element n-1 by reducing size by 1, and then sift down the first element. Okay? So, we
repeat this procedure n-1 times, each time finding the currently largest element. So, once again, this
is an improvement of the selection sort algorithm. And this is an in-place algorithm.
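Put together, the in-place algorithm described above looks roughly like the following sketch (again 0-based Python, so "element number one" becomes index 0):

```python
def heap_sort(a):
    """In-place heap sort: build a max-heap, then repeatedly swap the
    maximum (root) with the last element of the shrinking heap and
    sift the new root down."""
    def sift_down(i, size):
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < size and a[c] > a[largest]:
                    largest = c
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # BuildHeap
        sift_down(i, n)
    for size in range(n - 1, 0, -1):      # repeat n-1 times
        a[0], a[size] = a[size], a[0]     # move current max to the end
        sift_down(0, size)                # restore the heap on the prefix

data = [9, 4, 7, 1, 3, 8]
heap_sort(data)
print(data)  # → [1, 3, 4, 7, 8, 9]
```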
So, once again, let me state some properties of the resulting algorithm, which is called HeapSort. It is
in place: it doesn't need any additional memory. Everything happens inside the given array A. So
this is an advantage of this algorithm. Another advantage is that its running time is O(n log n). It is as
simple as it is optimal. So, this makes it a good alternative to the quick sort algorithm. In practice,
quick sort is usually a little faster. However, the heap sort algorithm has worst case running time
O(n log n), while the quick sort algorithm has average case running time O(n log n). For this reason, a
popular approach in practice is the following. It is called the IntroSort algorithm. You first run the quick sort
algorithm. If it turns out to be slow, I mean, if the recursion depth exceeds c log n for some constant
c, then you stop the current call to the quick sort algorithm and switch to the heap sort algorithm, which is
guaranteed to have running time O(n log n). So, in this implementation, your algorithm
in most cases works like the quick sort algorithm. And even in the unfortunate cases where quick sort
has running time larger than n log n, you stop it at the right point in time
and switch to HeapSort. So, this gives an algorithm which in many cases behaves like the quick sort
algorithm, and it has worst case running time O(n log n).
Building a Heap

In this video, we are going to refine our analysis of the BuildHeap procedure. Recall that we estimated
the running time of the BuildHeap procedure as O(n log n), because it consists of roughly n over
2 calls to the SiftDown procedure, whose running time is O(log n). So we get n over 2 multiplied by log n,
which is of course O(n log n). Note, however, the following thing. If we call SiftDown for a node which
is already quite close to the leaves, then the running time of sifting it down is much less than log n,
right? Because it is already close to the leaves, the number of swaps until it reaches the leaves
cannot be larger than the height of the corresponding subtree, okay? Note also the following thing.
In our tree, we actually have many nodes that are close to the leaves and few that are close to the
root. We have just one node, which is exactly the root, whose height is log n. We have two nodes whose
height is log n minus 1, we have four nodes whose height is log n minus 2, and so on. And we have
roughly n over 4 nodes whose height is just 1. Okay? So it raises the question whether our estimate
of the running time of the BuildHeap procedure was too pessimistic.

We will see on the next slide. Let's just estimate the running time of the BuildHeap procedure a little
bit more accurately. Okay, so this is our heap, shown schematically.
This is the last level, which is probably not completely filled,
but all the leaves on the last level are in the leftmost positions. So, on the very top level, we have just one
node, and sifting down this node costs logarithmic time. At the same time, on the last level, we have
at most n over 2 nodes, and sifting them down makes at most one swap. Actually, we do not need
even one swap, just zero swaps, but let's be just generous enough and let's allow one swap. On the
next level, we have at most n over 4 nodes, and sifting down for them costs at most two swaps, and
so on. So if we just compute the sum of everything, so we have n over 2 nodes, for which the cost of
the SiftDown procedure is 1. We have n over 4 nodes on the next level, for which sifting them down
makes at most two swaps. On the next level, we have n over 8.
For each of them, sifting down costs at most three swaps, and so on. So now let's do the
following. Let's take the multiplier n out of this sum. What is left is the following:
1 over 2 + 2 over 4 + 3 over 8 + 4 over 16 + 5 over
32, and so on, right? This can be upper-bounded by the following sum: the sum from i
equal to 1 to infinity of the fraction i divided by 2 to the i. Once again, in the
running time of BuildHeap, this sum is finite.
The maximum value of i is log n; we do not have any nodes at height larger than log n. But we just
upper-bound the finite sum by an infinite one which considers all possible values of i. Even for
this infinite sum, we will show that its value is equal to 2, which gives us that the running time of
the BuildHeap procedure is actually at most 2n.



To estimate the required sum, we start with a simpler and more well-known sum. It is given by
the following picture: 1 over 2 + 1 over 4 + 1 over 8 + 1 over 16 and so
on. It is equal to 1, and this can be proved geometrically as follows. Consider a segment of length 1.
Now, above the segment, let's draw a segment of length 1 over 2. Okay? This is half
of our initial segment. What remains is also a segment of length one half. So when we add a segment of
length 1 over 4 here, we get 2 times closer to this vertical line, right? When we add the next
one, 1 over 8, we again get 2 times closer to the vertical line than we were before adding this
segment. When we add 1 over 16, our distance to the vertical line becomes 1 over 16, and so on. So
if we go to infinity, we get infinitely close to this vertical line, which means that this sum is equal to 1.
Now, what about the sum that we need to estimate? Well, to estimate it, let's first do the following
strange thing. Let's consider all the segments shown above, and let's align them by their right ends.
So consider the segment of length 1 shown here. Now consider the segment of length 1 over 2, the
segment of length 1 over 4, the segment of length 1 over 8, and so on.
We continue this process to infinity. And we know that the sum of the lengths of all these segments
is equal to 2, of course.
Now, why are we doing this? Well, for the following reason. First, consider the following vertical lines.
What we need to estimate is the following sum: 1 over 2 + 2 over 4 + 3 over 8 + 4 over 16 and so on.
Let's see, so this is a segment of size 1 over 2, okay. This is two segments of size 1 over 4, okay. So this
is three segments of length 1 over 8, and so on. So if we put another vertical line, we will get four
segments of size 1 over 16, and so on. So if we do this to infinity, we will cover all our segments that
are shown here, which proves that this sum is equal to 2. Which in turn proves that the running time
of our BuildHeap procedure is actually just linear, it is bounded above by 2n.
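The geometric argument can also be written as a short algebraic sketch: each term i/2^i counts the quantity 1/2^i exactly i times, so swapping the order of summation turns the sum into a sum of geometric tails, each of which is easy to evaluate:

```latex
\sum_{i=1}^{\infty} \frac{i}{2^i}
  = \sum_{j=1}^{\infty} \sum_{i=j}^{\infty} \frac{1}{2^i}
  = \sum_{j=1}^{\infty} \frac{1}{2^{j-1}}
  = 2.
```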
Our new estimate for the running time of the BuildHeap procedure does not actually improve the
running time of the HeapSort algorithm. Because the HeapSort algorithm first builds a heap, and now
we know that it can be done in linear time, but then we need to extract max n minus 1 times.
So we still have n log n time, and actually we cannot do better than n log n asymptotically. We already
know this, because it is a comparison-based algorithm. However, this new estimate helps to solve a
different problem faster than naively. So assume that we're given an array and a parameter k, which is
an integer between 1 and n. And what we would like to do is not to sort the given array, but to find
the k largest elements in this array. Put otherwise, we need to output the last k elements of
the sorted version of our array.
Using the new estimate for the BuildHeap procedure, we can actually solve this problem in linear
time when k is not too large, namely, when k is O(n / log n). For example, if you
have an array of size n and you would like to find the square root of n largest elements, then you can
solve this in linear time. So you do not need to sort the whole array in time n log n to solve this
problem; linear time is enough.
And this is how it can be done. Given an array A of size n and parameter k, you just build a heap out of
a given array, and then you extract the maximum value k times. Right? So easy. The running time of
this procedure is the following. First, you need to build a heap, so you spend a linear amount of work
for this, then you need to extract max k times. For this, you spend k multiplied by log n. So if k is
indeed smaller than n divided by log n, so let me write it. So if k is smaller than n divided by log n,
then this is at most n. So the whole running time is at most linear.
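A sketch of this procedure using Python's heapq module, whose heapify performs linear-time heap construction; since heapq is a min-heap, values are negated here to simulate a max-heap:

```python
import heapq

def k_largest(a, k):
    """Return the k largest elements in descending order:
    build a heap in O(n), then extract the maximum k times,
    for a total of O(n + k log n)."""
    heap = [-x for x in a]          # negate to turn min-heap into max-heap
    heapq.heapify(heap)             # BuildHeap: linear time
    return [-heapq.heappop(heap) for _ in range(k)]

print(k_largest([5, 1, 9, 3, 7, 2], 3))  # → [9, 7, 5]
```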



So to conclude, what we've presented here is a heap sort algorithm which actually sorts a given array
in place without using any additional memory and in optimal time n log n. We also discussed that to
build a heap out of a given array, it's actually enough to spend a linear amount of work.

We'll conclude the lesson with a few remarks.


First of all, for implementing a binary heap, we can as well use zero-based arrays. In this
case, the formulas for computing the index of the parent and of the two children of a given
node change as follows. The Parent of i is given by (i-1)/2, rounded down. The
LeftChild is given by 2i + 1 and the RightChild is given by 2i + 2. The next remark is that
you can implement a binary min-heap in exactly the same way. A binary min-heap is a heap
where, on each edge, the value of the parent is at most the value of the child. This is useful for the
case when, in your priority queue, you need to extract an element not with
maximum priority, but with minimum priority.
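The zero-based formulas can be sketched directly, with a quick sanity check that the parent formula inverts both child formulas:

```python
def parent(i):
    """Index of the parent of node i in a 0-based binary heap."""
    return (i - 1) // 2

def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

# Sanity check: the parent formula inverts both child formulas.
for i in range(100):
    assert parent(left_child(i)) == i
    assert parent(right_child(i)) == i

print(left_child(0), right_child(0))  # → 1 2
```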



The final remark is that binary heaps can be easily generalized to d-ary heaps. In such a heap,
each node has at most d children. To call a d-ary heap complete, we again require
that all the levels are completely filled, except for possibly the last one, where all the nodes are in
the leftmost positions, okay? So the height of this tree is, in this case, log base d of n, not the binary
log of n.
This in particular means that the running time of the SiftUp procedure is at most O(log base d of n),
right? Just because the height of the tree is at most log base d of n, and the element just goes
up: if needed, we swap an element with its parent. However, the running time of the SiftDown
procedure increases to O(d multiplied by log base d of n).
This is because, when we go down, we always need to find the direction in which to go. That is,
when we need to swap a node with one of its children, we first need to select which of these
children is the largest one. So, for this reason, the running time
of the SiftDown procedure in this case is O(d multiplied by log base d of n).
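For a 0-based d-ary heap, the analogous index formulas would be as sketched below. Note this is an extrapolation: the lecture only states the formulas for the binary case, so treat these as an illustrative generalization.

```python
def parent(i, d):
    """Parent of node i in a 0-based d-ary heap (assumed formula)."""
    return (i - 1) // d

def children(i, d):
    """Indices of the up to d children of node i (assumed formula)."""
    return list(range(d * i + 1, d * i + d + 1))

# For d = 2 these reduce to the binary formulas 2i+1, 2i+2 and (i-1)/2.
print(children(0, 3))  # → [1, 2, 3]
```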
Okay, time to conclude. In this sequence of lessons we started by introducing the abstract data type
called priority queue. This abstract data type supports the following two main operations: insert an
element, and extract an element with the highest priority.
Priority queues find a lot of applications; we will see many other places where this data type is used
efficiently. Then we explained that if implemented naively using an array or a list, sorted or not,
one of these two operations takes a linear amount of work in the worst case. Then we presented
binary heaps. This is a way of implementing priority queues that gives worst case running time
O(log n) for all operations. And, finally, we explained that this implementation can also be made space
efficient. Namely, a binary heap is a tree; however, to store this tree we do not need to store links
to a parent and to children. It is enough to store everything in an array. Again, this makes binary
heaps both time and space efficient.
QUIZ • 30 MIN

Priority Queues: Quiz


Slides and External References

Slides
Download the slides for this lesson:

06_1_priority_queues_2_heaps.pdf PDF File


References
See chapter 6.4 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

See this min-heap visualization.

Disjoint Sets: Naïve Implementations

Overview
Hello and welcome to the next lesson of the data structures class. It is devoted to disjoint sets.
As a first motivating example, consider the following maze shown on the slide. It is basically just a grid
of cells, with walls between some pairs of adjacent cells. A natural question for such a maze is: given
two cells, is there a path between them? For example, for the two cells shown on the slide, there is a
path, and it is not too difficult to construct it. Let's do this together. So we can go as follows.
And there is actually another path here; we can also go this way.
Great. On the other hand, there is no path between the two points shown on the slide, and to show
this we can construct the set of all points that are reachable from B. Let's do this: let's just mark all
the points that are reachable from B.
So it is not difficult to see that we marked every single point which is reachable from B. And we now
see that A does not belong to this set, which justifies that A is not reachable from B in this case.



The maze problem can be easily solved with the help of the disjoint set data structure which supports
the following three operations. The first operation is called MakeSet. It takes one argument x and it
creates just a set of size one containing this element x.
The second operation is called Find. It takes an argument x and returns the ID of the set that contains
this element x. We expect this ID to satisfy the following two natural properties. If two elements x
and y lie in the same set, then we expect the operation Find to return the same ID for x and y, just
because x and y lie in the same set, and Find returns some identifier of this set, right? If, on the other
hand, x and y lie in different sets, then Find(x) should not be equal to Find(y), right?
Play video starting at 2 minutes 45 seconds and follow transcript2:45
The last operation is called Union. It takes two arguments, x and y, considers the two sets
containing x and y, and merges them. In particular, if we call Union(x,y) and right after this
we call Find(x) and Find(y), then these two calls to the Find operation should return exactly the same
identifier, just because after the call to Union, x and y lie in the same merged set.
Recall that in our maze example, to show that a particular point A is not reachable from a particular
point B, we did the following. We first constructed the region of all cells reachable from point B in
our maze. We then just checked that point A does not belong to this region. This was a justification of
the fact that A is not reachable from B in our maze. And in fact, any maze can be partitioned into
disjoint regions, where in each region there is a path between any two cells, right? And using the
disjoint sets data structure it is easy to partition any maze into such disjoint regions. We can do this
by preprocessing the maze as follows. We first call MakeSet for every cell c in our maze. This creates a
separate region for each cell. So initially we have as many regions as there are cells in our maze. Then
we do the following. We go through all possible cells in our maze. When a cell c is fixed, we also go
through all possible neighbors of this cell. We say that n is a neighbor of c if n is an
adjacent cell of c and there is no wall between them.
At this point, c belongs to some region and n belongs to some region, and we just discovered the
fact that there is a path between c and n. Which means that, actually, any cell from the first region is
reachable from any cell from the second region, right? To reflect this fact, we just call Union(c,n). This
merges these two regions, right? So after the call to this preprocessing procedure, each region in the
maze receives a unique ID. Then, to check whether a particular cell is reachable from another cell,
we just need to check whether the Find operation returns the same ID for them or not.
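The preprocessing loop above can be sketched with a minimal, unoptimized union-find; the cell numbering and the edge-list format here are illustrative assumptions, not part of the lecture:

```python
def maze_regions(n_cells, open_edges):
    """Partition cells 0..n_cells-1 into connected regions.
    open_edges lists pairs of adjacent cells with no wall between them.
    Returns a find function mapping a cell to its region's ID."""
    leader = list(range(n_cells))   # MakeSet for every cell

    def find(x):                    # follow leaders to the set's ID
        while leader[x] != x:
            x = leader[x]
        return x

    def union(x, y):                # merge the regions of x and y
        rx, ry = find(x), find(y)
        if rx != ry:
            leader[rx] = ry

    for c, n in open_edges:         # Union(c, n) for every neighbor pair
        union(c, n)
    return find

find = maze_regions(4, [(0, 1), (2, 3)])
print(find(0) == find(1))  # → True  (same region)
print(find(0) == find(2))  # → False (no path between them)
```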
To give another example of using the disjoint sets data structure, assume that we're building a
network.
Initially we have four machines. We call MakeSet(1) for the first machine, MakeSet(2) for the
second machine, and so on, to reflect the fact that initially each machine lies in a separate set. In
particular, if we now check whether Find(1) is equal to Find(2), then it is false, just because 1 and 2 lie
in different sets. Now, let's add a wire between the third machine and the fourth machine. To notify
our data structure that now 3 and 4 belong to the same set, we call Union(3,4), okay? Now let's
introduce the fifth machine; we do this by calling MakeSet(5). Then let's add another wire between
the second machine and the third machine. To notify our data structure about this event we call
Union(3,2). If we now call Find(1) and Find(2), these two calls should return different values, because
machines 1 and 2 still belong to different sets. Okay, now we add a wire between machines 1 and 4.
To notify our data structure about this event we call Union(1,4), and now if we check whether Find(1)
is equal to Find(2), it should return true, just because now 1 and 2 lie in the same set, the one that
contains machines 1, 2, 3, and 4. Later in this specialization, we will learn the Kruskal algorithm,
which builds a network on a given set of machines in an optimal way, and it uses the disjoint set data
structure essentially.
Now that we've seen a few examples of using the disjoint set data structure, and have formally
defined it, let's start to think about possible ways of implementing it. As usual, we will start
with a few naive implementations that will turn out to be slow on one hand, but on the other hand
they will help us to come up with an efficient implementation.
First of all, let us simplify our life by assuming that our n objects are just integers from 1 to n.
This, in particular, will allow us to use these objects as indices in arrays.
Okay, so our sets are just sets of integers. And we need to come up with a notion of a unique
ID for each set. So, let's use the smallest element in each set as its ID. In particular, since our objects
are integers, for each element we can store the ID of the set this element belongs to in an array
called smallest. For example, if we have the following three sets,
then in the first set the ID, namely the smallest element is two. In the second set the smallest element
and the only element is five. In the third set the smallest element is one. Then this information is
stored in the array called smallest of size nine.
The operations MakeSet and Find can be implemented in just one line of code each in this case. Namely, to
create a singleton set consisting of just the element i, we set smallest[i] to be equal to i, right. To find
the ID of the set containing the element i, we just return the value of smallest[i]. The running time of
both of these operations is constant.
Everything is not so easy with the Union operation, unfortunately. To merge two sets containing
elements i and j, we do the following. First we find the ID of the set containing the element i. We
do this by calling Find(i) and store the result in the variable i_id. Then we do the same for the
element j: we call Find(j) and store the resulting ID in the variable j_id. Then we check whether
i_id is equal to j_id. If they are equal, this means that i and j already lie in the same set, which
means that nothing needs to be done, which means, in turn, that we can just return. If i_id and j_id
are different, we need to merge the two sets. The smallest element in one set
is i_id, and the smallest element in the other is j_id, which means that the smallest element in the
merged set should be the minimum of i_id and j_id. We store this minimum in the variable m. Then
we need to scan all n objects and update the value in the smallest array for each object whose
ID is i_id or j_id. This is done in a loop where k ranges from 1 to n: we check whether smallest[k] is
i_id or j_id, and if so, we update it to be equal to m. The running time of this operation is linear, of
course, just because essentially what we have here is a single loop that goes over all n objects.
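The pseudocode above translates almost line by line into a sketch like the following (Python used for illustration; index 0 is left unused so the code matches the lecture's 1-based objects):

```python
class DisjointSetsSmallest:
    """Naive implementation: smallest[i] holds the ID (smallest element)
    of the set containing object i, for objects 1..n."""
    def __init__(self, n):
        self.smallest = [None] * (n + 1)   # index 0 unused

    def make_set(self, i):
        self.smallest[i] = i               # O(1)

    def find(self, i):
        return self.smallest[i]            # O(1)

    def union(self, i, j):                 # O(n): rewrite IDs of both sets
        i_id, j_id = self.find(i), self.find(j)
        if i_id == j_id:
            return                         # already in the same set
        m = min(i_id, j_id)                # new ID is the smaller of the two
        for k in range(1, len(self.smallest)):
            if self.smallest[k] in (i_id, j_id):
                self.smallest[k] = m

ds = DisjointSetsSmallest(5)
for i in range(1, 6):
    ds.make_set(i)
ds.union(2, 4)
ds.union(4, 5)
print(ds.find(5))  # → 2 (smallest element of the merged set {2, 4, 5})
```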
The bottleneck of our current implementation is the Union operation, whose running time is linear, as
opposed to the Find and MakeSet operations, whose running time is just constant. So we definitely need
another data structure for storing sets, one which allows for more efficient merging. And one such data
structure is a linked list.
So let's try the following idea. Let's represent each set as a linked list, and let's use the tail
of the list as the ID of the corresponding set. Let me illustrate this with two examples. In this case we
have two sets, each of them organized as a linked list, and we treat the tail of a
list as the ID of the corresponding set. For example, in this case, 7 is the ID of the first set and 8 is the
ID of the second set. Now, to find the ID of the set that contains the element three, for example, we
just follow the pointers until we reach the tail of the corresponding list. So in this case the ID is well
defined. What is also nice is that we can merge two sets very efficiently. Since
they are organized as lists, we just need to append one list to the other, and this requires changing just one
pointer. What is very nice in this case is that the ID of the merged set is updated automatically. So
after the merging, the tail of the resulting list is 8, so the ID is updated for all the elements of the two sets
automatically.
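A toy sketch of this representation, matching the example of merging the set with ID 7 into the set with ID 8 (node values are illustrative):

```python
class Node:
    """One element of a set; `next` points toward the tail of the list."""
    def __init__(self, value):
        self.value = value
        self.next = None

def find_tail(node):
    """Find: follow pointers until the tail; the tail is the set's ID.
    Worst case O(n), since the list may contain a linear number of nodes."""
    while node.next is not None:
        node = node.next
    return node

def union(tail_x, head_y):
    """Union in O(1): append y's list to x's list by changing one pointer.
    Hidden assumption (discussed below): we must already know the tail
    of x's list and the head of y's list."""
    tail_x.next = head_y

# Set {3, 7} with tail (ID) 7, and set {8} with ID 8.
three, seven, eight = Node(3), Node(7), Node(8)
three.next = seven
union(seven, eight)             # merge: the new tail (and ID) is 8
print(find_tail(three).value)   # → 8
```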



As we've just discussed, there are at least two advantages of the current implementation where we
store each set as a linked list. First of all, the running time of the union operation is, in this case, just
constant. This is because to merge two linked lists, we just append one of them to the other one. And
for this we need to change just one pointer.
Another advantage here is that we have a well-defined ID in this case. Namely, if two elements lie in
the same set, then find will return the same tail element from the corresponding list, right? And also, if
two elements lie in different sets, then the tails of the corresponding two lists are different. And this
is exactly what we want. There are, however, also two disadvantages. The first disadvantage is that
now the running time of the find operation is linear in the worst case. This is because, given an element,
we would like to find the tail of the corresponding list, and for this we need to follow the pointers
until we reach the tail. We might need to traverse a linear number of elements, because the list might
contain a linear number of elements. So in the worst case, the running time of the Find operation is
linear, which is not so good.
The second disadvantage is that actually implementing the Union operation in constant time is not as
easy as it was shown in our previous two examples. Namely, we implicitly assumed in this example that,
given two elements x and y, we can find the beginning of the list containing x and the end of the list
containing y in constant time.
So to be able to do this, we might need to store some additional information. And this, in turn, will
require us to update this information when we merge two sets. So once again, this means that to
implement the union procedure in constant time we need to store some additional information, and
not just pointers from a particular element to the next element.
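To make this concrete, here is a minimal Python sketch of the linked-list representation (the class and field names are my own, not from the lecture). Find follows the next pointers to the tail, which serves as the set ID; Union appends one list to the other by changing one pointer, and the "additional information" mentioned above is kept as a map from each tail to the head of its list.

```python
# A sketch of disjoint sets as linked lists (illustrative names).
# For each element we store a next pointer; the tail of a list is the set ID.
# To make Union constant time, we additionally keep, for every tail, the
# head of its list -- exactly the extra information the lecture mentions.

class LinkedListDisjointSets:
    def __init__(self, n):
        self.next_elem = [None] * n           # next pointer in the list
        self.head = {i: i for i in range(n)}  # head of each list, keyed by its tail

    def find(self, i):
        # Follow next pointers until we reach the tail (the set ID).
        # Worst case: linear in the size of the set.
        while self.next_elem[i] is not None:
            i = self.next_elem[i]
        return i

    def union(self, i, j):
        ti, tj = self.find(i), self.find(j)
        if ti == tj:
            return
        # Append the list ending at ti onto the head of the list ending at tj:
        # one pointer change, and the merged set's ID becomes tj automatically.
        self.next_elem[ti] = self.head[tj]
        self.head[tj] = self.head.pop(ti)

ds = LinkedListDisjointSets(4)
ds.union(0, 1)  # {0, 1} with tail (ID) 1
ds.union(2, 3)  # {2, 3} with tail (ID) 3
ds.union(0, 2)  # one set; every element now reaches tail 3
```

Note how the worst-case find is linear here, which is precisely the disadvantage discussed above.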



In search of inspiration for improving our current implementation, let's review our previous two
examples. We've discussed that merging the two sets shown on this slide is good because it requires
just constant time and it updates the ID of the resulting set automatically. On the other hand, it is bad
because it creates a long list. This in particular requires Find(9) to traverse the whole list, and this
makes the find operation a linear time operation. So let's try to think about a different way of merging
these two lists. For example, what about the following one?
In this case, first of all, the resulting structure is not a list; it looks strange, right? However, the
merging is still constant time. And 7 can still be considered as the ID of the resulting set, because 7 is
reachable from any other element. So what about this structure? It is not a list, but it is a tree: a tree
whose root is 7, and that has two branches. In the next video we will develop this idea to get a very
efficient implementation of the disjoint sets data structure. Namely, we will represent each set as a
tree, and we will show that in this case the amortized running time of each operation is nearly
constant.

Slides and External References

Slides
Download the slides for this lesson:

06_2_disjoint_sets_1_naive.pdf (PDF file)

References
See sections 21.1 and 21.2 in [CLRS]: Thomas H. Cormen, Charles E. Leiserson, Ronald L.
Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

Disjoint sets: Efficient implementations


Trees for Disjoint Sets

Hi. In the previous video we considered a few naive implementations of the disjoint sets data
structure. In one of them, we represented each set as a linked list. Let me give you a small example.
Play video starting at 17 seconds and follow transcript0:17
So these four elements are organized into a linked list, and we treat the tail of this list, this last
element, as the ID of the corresponding set. This is a well-defined ID, because it is unique for any list,
and it can easily be reached from any other element in the corresponding set. So if we need to find
the ID of the set that contains this element,
Play video starting at 45 seconds and follow transcript0:45
we just follow the next pointers shown here until we reach the tail of this list. Another advantage is
that merging two sets is very easy in this case. So assume that this is our first set and the second set
looks as follows.
Then, to merge these two sets, we just append one of the lists to the other one. Like this.
The first advantage of this merging is that it is clearly constant time: we just change one pointer.
Another advantage is that it updates the ID of the resulting list automatically. Now we treat this
element as the ID of the resulting list, and it can still be reached from any other element of this list,
just by following these pointers. The main disadvantage of this approach is that over time, lists get
longer and longer, which in turn implies that the find operation gets slower and slower.



Well, we then discussed another possibility to merge two lists, namely we can do the following.
Again, consider the same two lists.
And now assume that instead of just appending one list to the other one, we do the following
strange thing: we just change this pointer as follows.
Well, as you see, what we get is not actually a list; however, it is a tree whose root is this element and
which has two branches. So we do not get a long list, but instead we get a tree. And in this tree we can
still treat this last element, this root element, as the ID of the corresponding set, because it is unique
for this tree, and also it can be reached from any other element. So in this lesson, we're going
to further develop this idea. By doing this we will eventually get a very efficient implementation of the
disjoint sets data structure.

The general setting is the following. Each set is going to be represented as a rooted tree. We will treat
the root of each tree as the ID of the corresponding set. For each element, we will need to know its
parent, and this will be stored in an array parent of size n. Namely, parent[i] will be equal to j if
element j is the parent of i, or, in case i is a root, parent[i] will be equal to i. So this is a toy
example. Here we have three trees, and there are three roots: 5, 6, and 4. These three
trees are stored in the parent array as follows: for example, to indicate that 4 is the root, we store
4 in the fourth cell of this array. To indicate that 9 is the parent of 7, we put 9 into the 7th cell.

Recall that MakeSet(i) creates a singleton set consisting of just the element i. To do
this, we just assign parent[i] to be equal to i. This creates a tree whose only element is i, and this
element is the root of the tree; this is why we assign parent[i] to be equal to i. The running
time of this operation is, of course, constant.
To find the root of the tree that contains a given element i, we just follow the parent links from the
node i until we reach the root. This can be done as follows. While i is not equal to parent[i], namely
while i is not the root of the corresponding tree,

we replace i by its parent. So each time, we get closer to the root, and eventually we will reach
the root. At this point, we return the resulting element. The running time of this operation is, of
course, at most the height of the corresponding tree.
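The MakeSet and Find operations described above can be sketched directly in Python (a sketch following the lecture's pseudocode; the specific tree below, with 9 the parent of 7 and 4 a root, mirrors the toy example):

```python
n = 10
parent = [0] * n  # parent[i] == i means i is a root

def make_set(i):
    # A singleton set: i is the root of its own one-node tree.
    parent[i] = i

def find(i):
    # Follow parent links until we reach the root.
    # Running time: at most the height of the tree containing i.
    while i != parent[i]:
        i = parent[i]
    return i

for i in range(n):
    make_set(i)
parent[7] = 9  # 9 is the parent of 7
parent[9] = 4  # 4 is a root, so parent[4] stays 4
print(find(7))  # prints 4: the path 7 -> 9 -> 4 ends at the root
```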
Now we need to design a way of merging two trees, and there is a very natural idea for doing this.
We have two trees; let's just take one of them and hang it under the root of the other one. Let me
illustrate this with a small example.
Assume that this is our first tree. It contains just three nodes and this is the root of this tree so it
points to itself, and this is our second tree.
This is the root, so it points to itself again. To merge these two trees, we just change one pointer.
Namely, we say that this node is not the root anymore; its parent is now this node. So we hang the
left tree under the root of the right tree.
Once again, this node is not the root anymore, while this node is the root of the resulting tree. And at
this point there is a natural question. We can hang the left tree under the root of the right tree, but
also, vice versa, we can hang the right tree under the root of the left tree. So which one to choose?
After thinking a little bit, we realize that it makes sense to hang the tree whose height is smaller under
the root of the tree whose height is larger. The reason for this is that we would like to keep our
trees shallow, and the reason for that, in turn, is that the height of the trees in our forest influences the
running time of the find operation. Namely, the worst case running time of the find operation is
at most the maximal height of a tree in our forest.
To give a specific example, let's consider the two trees shown on the slide. In this case, we
have a tree of height one and a tree of height two. Assume that we call Union(3, 8); in this case
we need to merge these two trees, and as we discussed, there are two possibilities for doing this: either
we hang the left tree under the root of the right tree, or, vice versa, we hang the right tree under the
root of the left tree. The results of these two cases are shown here on the slide, and you see that in the
last case the height of the tree increased. This is not something that we want, because, as we've
discussed, the height of the tree influences the worst case running time of the find operation. So this
illustrates that to keep our trees shallow, when merging two trees we would like to hang the tree
whose height is smaller under the root of the tree whose height is larger.
Union by Rank

Okay, when merging two trees we're going to hang the shorter one under the root of the taller one. This
means that when merging two trees we need a way to quickly find the height of both trees. Instead of
just computing them, we're going to keep the height of each possible subtree in our forest in a
separate array called rank. Namely, rank[i] is equal to the height of the subtree rooted at i. The
reason we call it rank will become clear a little bit later.
Let me also mention that this way of merging two trees, based on the height is called the union by
rank heuristic.
To maintain the rank, we need a small addition to our MakeSet implementation: namely, when creating a
singleton set, we also set rank[i] to be equal to zero. This reflects the fact that it is currently just a
tree containing one node, that is, a tree of height zero.
We do not need to change Find: the Find operation doesn't change rank, and it also
doesn't use rank in any way. To merge the two trees containing the given elements i and j, we do the
following. We first find the roots of the corresponding two trees by calling the Find operation two times.
We store these roots in variables i_id and j_id. We then check whether i_id is equal to j_id. If they are
equal, this means that elements i and j already lie in the same set, so we just return in this case. This
is done in the first if statement. We then check whether the height of the tree containing element
i is larger than the height of the tree containing element j. If it is larger, then we hang the tree rooted
at j_id under the root i_id. This is done as follows: parent of j_id is set to i_id.
Otherwise, we do the opposite thing: we assign parent of i_id to be equal to j_id.
The last thing is that we need to check whether the heights of the corresponding two trees are
equal. Let me illustrate this again with a small example. Assume that we are merging the following
two trees.
In this case the heights of these elements are zero, while the height of this element is 1. So in this
case the ranks of the corresponding roots are equal. To merge these
two trees we do the following.
We just hang the left tree under the root of the right tree. As you can see, in this case the height of
the resulting tree actually increases, and this is the only case when the union operation increases the
height of a tree. Initially, the longest path contained just one edge; now we get a path that contains
two edges. So we need to update the rank, and this is done in the last check: if the ranks of the two
trees that are going to be merged are initially equal, we hang one of
them under the root of the other one and increase the rank of the resulting tree by one.
Let's consider a small example. In this case we have six elements. Let's call MakeSet for each of these
elements. This fills the data structure as follows. Currently, each element is its own parent, so each
set is just a singleton set. Also, the height of each subtree in our data structure
is currently equal to 0. Now let's call Union(2, 4). In this case, the rank of the subtree rooted at 2 is
equal to 0, and the height of the subtree rooted at 4 is equal to 0. So it doesn't matter which one to hang
under the root of the other one; let's hang 2 under 4. This changes the data structure as follows:
now the parent of 2 is 4, and the rank of the subtree rooted at 4 is equal to 1. Okay, now let's call
Union(5, 2). In this case the height of the tree that contains the element 2 is equal to 1, while
the height of the tree that contains element 5 is equal to 0. So in this case we're going
to hang 5 under 4. This changes only one cell of the data structure: now 4 is the parent of 5, and it
doesn't change any rank in our forest.
Okay, now let's call Union(3, 1). This is done as follows: now 1 has rank 1, and the parent of 3 is
equal to 1, okay? Now let's call Union(2, 3). In this case, 2 lies in the tree whose root is 4, and
currently the rank of 4 is equal to 1. Also, 3 lies in the tree whose root is 1, and currently the rank
of 1 is equal to 1. This means that after merging these two trees we will get a tree of height 2.
So we do this as follows: now 1 is the root of the resulting tree, and its rank is equal to 2.
Finally, we call Union(2, 6), and this just attaches 6 to 1, as follows.
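The whole walkthrough above can be reproduced with a short union by rank implementation (a Python sketch; one extra array cell is allocated so the 1-based element numbers from the example can be used directly):

```python
n = 7  # elements 1..6; cell 0 is unused padding for 1-based indexing
parent = list(range(n))  # MakeSet for every element
rank = [0] * n           # every one-node tree has height 0

def find(i):
    while i != parent[i]:
        i = parent[i]
    return i

def union(i, j):
    i_id, j_id = find(i), find(j)
    if i_id == j_id:
        return
    if rank[i_id] > rank[j_id]:
        parent[j_id] = i_id
    else:
        parent[i_id] = j_id
        if rank[i_id] == rank[j_id]:
            rank[j_id] += 1  # only in this case does a tree grow taller

union(2, 4)  # equal ranks: 2 hangs under 4, rank[4] becomes 1
union(5, 2)  # 4's tree is taller: 5 hangs under 4
union(3, 1)  # equal ranks: 3 hangs under 1, rank[1] becomes 1
union(2, 3)  # roots 4 and 1 have equal rank: 4 hangs under 1, rank[1] becomes 2
union(2, 6)  # rank[1] = 2 > rank[6] = 0: 6 attaches directly to 1
print(find(6), rank[1])  # prints: 1 2
```

After the whole sequence, every element's root is 1, matching the final picture in the example.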
In our current implementation, we maintain the following important invariant: at any point in time
and for any node i, rank[i] is equal to the height of the subtree rooted at this node i, right?
We will use this invariant to prove the following lemma. The height of any tree in our forest is at most
binary logarithm of n.
This will immediately imply that the running time of all operations with our data structure is at most
logarithmic, right? To prove this lemma we will prove another lemma shown here on this slide.
We're going to prove that if we have a tree in our forest whose height is k then this tree contains at
least two to the k nodes.
This will imply the first lemma as follows. Assume that some tree has height strictly greater than the
binary logarithm of n. Using the second lemma, it is possible to show that this tree contains more
than n nodes, which leads to a contradiction with the fact that we only have n objects in our data
structure.
We are going to prove the second lemma by induction on k. Recall that we need to prove that any tree
of height k in our forest contains at least 2 to the k nodes. When k is equal to zero, we have a tree of
height 0, which means that it contains just one node, so in this case the statement clearly holds. Now,
to prove the induction step, recall that the only way to get a tree of height k is to merge two trees
whose heights are both equal to k - 1. By the induction hypothesis, both of these two trees contain at
least 2 to the k - 1 nodes, which means that the resulting tree contains at least 2 to the k - 1 plus
2 to the k - 1 nodes, which is exactly equal to 2 to the k, right? This means the lemma is proved.
To conclude, the running time of both the Union and Find operations in our current implementation is at
most logarithmic. Why so? Just because we keep our trees shallow, so that the height of any
tree in our forest is at most logarithmic. This immediately implies that the running time of any Find
operation is also big O of log n. Recall also that the Union operation consists of two calls to the Find
operation plus a few constant time operations, which means that the running time of Union
is also big O of log n. In the next video, we will see another beautiful heuristic which will decrease
the running time of both these operations to nearly a constant.
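The logarithmic bound is easy to check empirically. The sketch below (my own test harness, not from the lecture) performs random unions with union by rank and verifies that no rank, and hence no tree height, ever exceeds log2 of n:

```python
import math
import random

n = 1024
parent = list(range(n))
rank = [0] * n

def find(i):
    while i != parent[i]:
        i = parent[i]
    return i

def union(i, j):
    i_id, j_id = find(i), find(j)
    if i_id == j_id:
        return
    if rank[i_id] > rank[j_id]:
        parent[j_id] = i_id
    else:
        parent[i_id] = j_id
        if rank[i_id] == rank[j_id]:
            rank[j_id] += 1

random.seed(0)
for _ in range(5 * n):
    union(random.randrange(n), random.randrange(n))

# Without path compression, rank equals height, and the lemma says
# every tree in the forest has height at most log2(n).
assert max(rank) <= math.log2(n)
print("max height:", max(rank), "  log2(n):", int(math.log2(n)))
```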

Path Compression



To build up our intuition, let's consider a second heuristic for the disjoint sets data structure. Let's again
consider the example shown here on the slide. Assume that we call Find(6). This will traverse the
following path from 6 to the root of this tree.
Note that in this case we find the root of the tree that contains element 6, but we also find the
root of the tree that contains element 12 and element 3. In general, by traversing this
path we find the root for all the elements on this path. So why lose this information? Let's just store
it somehow. One way to do this is to re-attach all these nodes directly to the root. We
can do this as follows: now, as you can see, the parent of element 12, for example, is 5, and the
parent of element 6 is also 5. We've just attached them directly to the root. And this can save us
time in the future, for future calls of the find operation. This heuristic is called path compression.
Implementing the path compression heuristic turns out to be surprisingly simple: it is actually only
three lines of code.
Here we do the following. We first check whether i is equal to parent[i]. If it is equal, that is, if i is
the root, then we just return the result. If i is not the root, we do the following: we call find
recursively for the parent of the node i. It will return the root of the corresponding tree, and then we
set parent[i] to be equal to the returned root. That is, we attach the node i directly to the root, and
we do this recursively for all the elements on this path. Finally, we return the new parent of i, which
is now just the root of the corresponding tree.
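Those three lines translate almost directly into Python (a sketch; the chain-building code is just for illustration):

```python
import sys
sys.setrecursionlimit(10**6)  # the recursion is as deep as the path

# Build a chain 0 <- 1 <- 2 <- ... <- 7 by hand, the worst case for Find.
parent = [max(i - 1, 0) for i in range(8)]

def find(i):
    if i != parent[i]:
        # Recursively find the root, then attach i directly to it.
        parent[i] = find(parent[i])
    return parent[i]

find(7)
# After one call, every node on the traversed path points straight at
# the root, so later finds on these nodes take constant time.
print(parent)  # [0, 0, 0, 0, 0, 0, 0, 0]
```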
Before stating an upper bound on the running time of the operations of our current implementation, we
need to introduce the so-called iterated logarithm function, which is denoted by log star of n. By
definition, the iterated logarithm of n is the number of times the binary logarithm should be applied to
n before we get a number which is at most one. Let me illustrate this with an example. For n equal to
one, the iterated logarithm of n is equal to zero: we do not need to apply the binary logarithm at all,
because n is already at most one in this case. For n equal to two, we need to apply the binary logarithm
once to get a number which is at most one; namely, if we apply the binary logarithm to two, we just get
the number one. For n equal to three and four, the iterated logarithm is equal to two, and so on. And,
for example, for the number shown here, 2 to the 65536: if we apply the binary logarithm once, then
just by definition of the binary logarithm we get 65536, which is 2 to the 16, okay? If we apply the
binary logarithm once again, we get 16; 16 is 2 to the 4. If we apply the binary logarithm once again,
we get 4. If we apply it again, we get 2, and if we finally apply it once again, we get 1. And at this
point, we stop.
So we applied the binary logarithm five times to get a number which is at most one, which shows
that for this number, 2 to the 65536, log star is equal to five, okay? This shows that for
any practical value of n, the log star function is at most five, because this number is
extremely huge: we will never see a value of n which is greater than this number in practice. We
will never get an input sequence that consists of so many elements. So, theoretically speaking,
the log star function is not bounded; it is not constant. There are extremely huge numbers for which
log star is equal to ten or twenty or 100 and so on. However, for all practical values of n, log star of n
is at most five.
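The iterated logarithm is easy to compute directly (a Python sketch; note that `math.log2` accepts arbitrarily large integers, so even 2 to the 65536 works):

```python
import math

def log_star(n):
    # Number of times log2 must be applied before the value drops to <= 1.
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

print(log_star(1))           # 0
print(log_star(2))           # 1
print(log_star(4))           # 2
print(log_star(65536))       # 4
print(log_star(2 ** 65536))  # 5: at most 5 for any practical n
```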
We're now ready to state an upper bound. Assume that we use both the union by rank heuristic and
the path compression heuristic, and assume that we make m operations with our data structure, of which
n are calls to the MakeSet operation. Namely, we have n objects, and we make m
operations with them. Then the total running time of all these calls is O(m log*n). Put otherwise, the
amortized running time of a single operation is O(log*n). And recall that for all practical values of n,
log star of n is at most five, which means that we have a constant average time for a single operation
for all practical values of n. So once again, theoretically speaking, the log star function is not bounded;
however, it is at most five for all practical values of n, which makes our data structure very efficient in
practice. We will prove this upper bound in the next video.
Analysis (Optional)
Our goal in this video is to show that if we use both path compression and union by rank heuristics
then the average running time of a single operation is upper bounded by big O of log star of N.
Before going into the details of the proof, let's note a few important facts. First of all, note that
since we are now using path compression, it is no longer the case that the rank of node i is equal to
the height of the tree rooted at vertex i. However, it is still true that the rank is an upper bound on
the corresponding height. Let me illustrate this with a toy example. Assume that we have a tree whose
root is, say, vertex 5, with a child 3, which in turn has a child 6. Assume that currently the rank of
the root is 2, the rank of 3 is 1, and the rank of 6 is 0. We now call Find(6). This will re-attach 6
directly to the current root of this tree. So we see that the height of this tree is now equal to 1;
however, the rank of the root is still equal to 2. Recall that find doesn't use and doesn't change any
rank values. Also, for node 3, the height of the tree rooted at element 3 is now equal to 0; however,
its rank is equal to 1.
Intuitively it is clear that path compression can only decrease the height. So for this reason, rank is
no longer equal to the height of the corresponding tree; however, the height is at most the rank.
Okay. Another important thing is the following: it is still true that for any root node i of rank k, the
corresponding subtree contains at least 2 to the k elements. This can be seen by realizing that
path compression does not affect any root nodes. I mean, if we have a root node whose rank is k,
then no matter how many times and for which nodes in its subtree we call find with path
compression, all these nodes remain in the subtree rooted at this node, exactly because this node is a
root: no node from this subtree can be attached to some other vertex in some other subtree. So once
again, if we have a node which is still a root and whose rank is k, then the corresponding subtree
contains at least 2 to the k elements, 2 to the k nodes. On this slide we discuss a few more important
properties. The first one of them says that we have, in our forest, at most n divided by 2 to the k
nodes of rank k. Why is that? Well, recall that if we create a new node of rank k,
then it was created by merging two nodes of rank k - 1. Okay. So we know that at that moment this
node is a root, and at the same time we know that the corresponding subtree contains at least 2 to
the k nodes. If we have another node of rank k, then its subtree also contains at least 2 to the k
nodes, and these subtrees are disjoint. This means that if we have too many such nodes, I mean too
many nodes of rank k (by saying too many, I mean that their number is greater than n divided by
2 to the k), then overall we have more than n elements, which is a contradiction, right?
The second property is that when we go up a tree, the rank strictly increases. This is clear if we do
not use path compression: if rank is equal to the height of the corresponding subtree, then this is
completely clear. For example, in a tree of height two, the height of the whole tree is two, the height
of some subtree is one, and the height of a leaf is zero; say these are elements 5, 4, and 8. Now we
have path compression, so we need to check what happens when we compress some paths. If we call
Find on node 8, for example, then it will be re-attached to the current root, but the rank of its new
parent is 2, which is still strictly greater than the rank of 8. So the rank of the parent is strictly
greater than the rank of a child.
The last property says that when a node becomes an internal node, it will be an internal node
forever; it will not have a chance to become a root again. This is just because the find operation
doesn't change any roots in our forest, while the union operation takes two roots and makes one of
them a child of the other one. So it takes two roots and leaves only one root. Okay. So once again,
when a vertex becomes an internal vertex, a non-root vertex, it will be a non-root vertex forever.
We now start to estimate the running time of m operations. First of all, note that the union operation
boils down to two calls to the Find operation, plus some constant time operations: namely, when we
have the two roots that were found by the two calls to the Find operation, we need to hang one of them
below the other one, which is a constant time operation (we just need to change one parent, and
possibly also a rank value). Okay. So for this reason, when estimating the total running time, we
will just assume that we have m calls to the Find operation. Note that each Find operation traverses
some path from a node to the root of the corresponding tree, so it traverses some number of
edges. The total running time of all the calls to the Find operation is therefore just the total number of
edges traversed. So this is what is written here: we just need to count the number of edges from a
node i to its parent j traversed in these calls. For technical reasons, we will split this number into
three terms. The first term accounts for all the edges that lead from a node i to its parent j such that
j is the root of the corresponding tree. The second term includes all the remaining edges where we go
from i to j such that log* of rank of i is strictly smaller than log* of rank of j, okay? And the remaining
term accounts for everything else, namely for all the edges where we go from i to j such that j is not
the root and log*(rank[i]) = log*(rank[j]). We're going to show separately for each of these terms
that it is upper bounded by big O of m multiplied by log star of n.
Let me show this on a small example, as usual. Assume that we have a path that we're going to
traverse, from the node at the bottom to the root of the corresponding tree, and the numbers
shown here indicate the ranks of the corresponding nodes. Then these two edges will be accounted
for in the first term, just because they lead to the root of the corresponding tree. This edge, for
example, will be accounted for in the last term, because the rank of this node is 17 and the rank of
this node is 21, and the log stars of these numbers are equal. At the same time, here we have rank
14, and here we have rank 17, and the log stars of these two numbers are different; for this reason
this edge will be accounted for in the second term, okay? On the next sequence of slides, we are
going to estimate each of these three terms separately, and for each of them we are going to show
that it is at most big O of m multiplied by log* of n.
The first term is easy to estimate. Recall that in this term we account for all the edges traversed by
the find operation where we go from node i to its parent j such that j is the root. Clearly, for each
call to the find operation there are at most two such edges, right? Which means that we have an upper
bound of big O of m. In the second term, we need to estimate the total number of edges traversed
during all m calls to the find operation such that we go from node i to its parent j, where j is not
the root and log star of rank of i is strictly less than log star of rank of j. We're going to prove
that this is upper bounded by big O of m multiplied by log star of n.
Play video starting at 11 minutes 20 seconds and follow transcript11:20
And this is just because, when we go up from some node to
Play video starting at 11 minutes 27 seconds and follow transcript11:27
the root of the corresponding tree, the rank always increases. However, the rank of the root is at
most log n, which means that during one call to find procedure the lock star of rank can only increase
log star of n times. Okay. This is just because we've had an upper bound for the rank of the root. It is
upper bounded by log m which means that there are only log star of m different possibilities for log
star of rank of folds and nodes on this. Which means, that these can only increase, at most, log star of
m times. And we have at most, m calls to find the operations. Which gives us, an upper bound m, to
get, m multiplied by log star of m.

Now it remains to estimate the last term, where we account for all the edges traversed during m
calls to the find operation where we go from node i to its parent j such that j is not the root, first of
all, and log star of rank of i is equal to log star of rank of j. What we're going to show is that the
total number of such edges is upper bounded by big O of n multiplied by log star of n. Note that this
is even better than what we need. What we need is the required upper bound, which is m log star of
n. Recall that n is at most m, just because m is the total number of operations, while n is the number
of calls to the MakeSet operation.
To estimate the required term, consider a particular node i and assume for concreteness that its
rank lies in an interval from k plus one to two to the k. Recall that this was the form of the intervals
on which the log star function is constant, okay?
Now let's compute the total number of nodes whose rank lies in such an interval. We know that the
total number of nodes whose rank is equal to k plus one is at most n divided by two to the k plus
one. The total number of nodes whose rank is equal to k plus two is at most n divided by two to the
k plus two, and so on. So the total number of nodes whose rank lies in this interval is at most n
divided by two to the k.
The next observation is that each time we call Find(i), node i is adopted by a new parent, and this
new parent has a strictly larger rank. At this point we know that if we have a node i, then its parent
j is not the root; yes, this is essential. Which means that when we go up during this call to Find(i),
we find the root, and at this point we reattach node i directly to this root. And this root has a
strictly larger rank than the previous parent of i.
And this in turn means that after at most 2 to the k calls to Find(i), node i will be adopted by a new
parent whose rank, for sure, does not lie in this interval, just because the ranks in this interval are
at most 2 to the k. So if we increase the rank of the parent of i at least 2 to the k times, it will be
greater than 2 to the k for sure.
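As a summary of the data structure being analyzed, here is a small Python sketch (my own, not the course's reference implementation) of disjoint sets with both heuristics: union by rank and path compression.

```python
# Disjoint sets with union by rank and path compression -- a sketch,
# not the course's reference code.
class DisjointSets:
    def __init__(self, n):
        self.parent = list(range(n))  # every node starts as its own root
        self.rank = [0] * n           # rank: upper bound on subtree height

    def find(self, i):
        # Path compression: reattach every node on the path directly to
        # the root, so the node's new parent has a strictly larger rank --
        # exactly the effect the third term of the analysis relies on.
        if self.parent[i] != i:
            self.parent[i] = self.find(self.parent[i])
        return self.parent[i]

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        # Union by rank: hang the root of smaller rank under the other,
        # which keeps every rank bounded by log2(n).
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1

# Example: merge 0..4 into a single set.
ds = DisjointSets(5)
for k in range(4):
    ds.union(k, k + 1)
assert ds.find(0) == ds.find(4)
```

The recursive find can be rewritten as an iterative two-pass loop if stack depth is a concern for very large inputs.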

QUIZ • 30 MIN

Quiz: Disjoint Sets


Slides and External References

Slides
Download the slides for this lesson:
06_2_disjoint_sets_2_efficient.pdf (PDF file)

References
See section 5.1.4 of Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani. Algorithms
(1st Edition). McGraw-Hill Higher Education. 2008.

Also see this tutorial on Disjoint Sets data structures.

Also see this visualization of Disjoint Sets with and without Path Compression and Union by Rank
heuristics.

PRACTICE QUIZ • 30 MIN

Priority Queues and Disjoint Sets


Programming Assignment: Programming Assignment 2:
Priority Queues and Disjoint Sets
Week 4
Data Structures


Hash Tables
In this module you will learn about a very powerful and widely used technique called hashing. Its
applications include implementation of programming languages, file systems, pattern search,
distributed key-value storage and many more. You will learn how to implement data structures to
store and modify sets of objects and mappings from one type of objects to another one. You will see
that naive implementations either consume a huge amount of memory or are slow, and then you will
learn to implement hash tables that use linear memory and work in O(1) on average! In the end, you
will learn how hash functions are used in modern distributed systems and how they are used to
optimize storage of services like Dropbox, Google Drive and Yandex Disk!


Key Concepts

 List applications of hashing


 Apply direct addressing to retrieve names by phone numbers
 Develop a hash table based on chaining scheme
 Apply hashing to find patterns in text
 Describe how Dropbox, Google Drive and Yandex Disk save space
 Describe the principles on which distributed hash tables are built


Introduction, Direct Addressing and Chaining

Video: LectureApplications of Hashing

2 min


Video: LectureAnalysing Service Access Logs

7 min

Video: LectureDirect Addressing
7 min

Video: LectureList-based Mapping

8 min

Video: LectureHash Functions

3 min

Video: LectureChaining Scheme

6 min

Video: LectureChaining Implementation and Analysis

5 min

Video: LectureHash Tables

6 min


Reading: Slides and External References

10 min

Hash Functions

Video: LecturePhone Book Problem

4 min

Video: LecturePhone Book Problem - Continued

6 min

Video: LectureUniversal Family

9 min

Video: LectureHashing Integers

9 min

Video: LectureProof: Upper Bound for Chain Length (Optional)

8 min

Video: LectureProof: Universal Family for Integers (Optional)

11 min

Video: LectureHashing Strings

9 min

Video: LectureHashing Strings - Cardinality Fix

7 min

Reading: Slides and External References

10 min


Quiz: Hash Tables and Hash Functions

4 questions



Searching Patterns

Video: LectureSearch Pattern in Text

7 min

Video: LectureRabin-Karp's Algorithm

9 min

Video: LectureOptimization: Precomputation

9 min

Video: LectureOptimization: Implementation and Analysis

5 min

Reading: Slides and External References

10 min
Distributed Hash Tables (Optional)

Video: LectureInstant Uploads and Storage Optimization in Dropbox

10 min

Video: LectureDistributed Hash Tables

12 min

Reading: Slides and External References

10 min

Programming Assignment 3

Practice Quiz: Hashing

3 questions


Programming Assignment: Programming Assignment 3: Hash Tables

2h



Introduction, Direct addressing and chaining

Applications of Hashing
Hi. 
In this module, we'll study hashing, and hash tables. 
Hashing is a powerful technique with a wide range of applications. 
In this video, we will learn about some examples of those applications, 
just to have a taste of it. 
The first example that comes to mind is, of course, programming languages. 
In most of the programming languages, there are built-in data types or 
data structures in the standard library that are based on hash tables. 
For example, dict or dictionary in Python, or HashMap in Java. 
Another case is keywords of the language itself. 
When you need to highlight them in a text editor, or when the compiler needs to 
separate keywords from other identifiers in the program to compile it, 
it needs to store all the keywords in a set. 
And that set is usually implemented using a hash table.
Another example is file systems. 
When you interact with a file system as a user, you see the file name, 
maybe the path to the file. 
But to actually store the correspondence between the file name and path and 
the physical location of that file on the disk, 
the system uses a map, and that map is usually implemented as a hash table.
Another example is password verification. 
When you use some web service and you log into it and you type your password, 
if it is a good service, it won't send your password in clear text through 
the network to the server to check whether it's the correct password or not, 
because that message could be intercepted and then someone would know your password. 
Instead, a hash value of your password is computed 
on your client side and then sent to the server, and 
the server compares that hash value with the hash value of the stored password. 
And if those coincide, you get authenticated.
Special cryptographic hash functions are used for that. 
It means that it is very hard to try and 
find another string which has the same hash value as your password. 
So you are secure. 
Nobody can actually construct a different string which has the same hash value as 
your password and then log in as you in the system, even if he intercepted 
the message with the hash value of your password going to the server.
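As a toy illustration of this scheme (a sketch only: real services use salted, deliberately slow password hashes such as bcrypt or Argon2, not a bare SHA-256):

```python
import hashlib

def hash_password(password: str) -> str:
    # SHA-256 stands in for a proper password hash in this demo.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# The server stores only the hash of the password, never the password itself.
stored_hash = hash_password("correct horse battery staple")

def verify(attempt: str) -> bool:
    # Only hash values are compared; the clear-text password never travels.
    return hash_password(attempt) == stored_hash

assert verify("correct horse battery staple")
assert not verify("guess123")
```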
Another example, storage optimization for online cloud storages, 
such as Dropbox, Google Drive or Yandex.Disk. 
Those use a huge amount of space to store all the user files and 
that can actually be optimized using hashing. 
We will discuss this example further in the lectures of this module.
Hi, in this video, we will introduce a problem about a web service and the IP addresses of its clients.
We will use this problem to illustrate different approaches throughout the whole lesson. Suppose you
have a web service with many, many clients who access your service through the Internet from
different computers. In the Internet, there is a system which assigns a unique address to each
computer in the network, just like every house in the city has its own address. Those addresses of
computers are called IP addresses, or just IPs. Every IP address looks like this: four integers separated
by dots. Each of the four integers is from 0 to 255, so that it can be stored in eight bits of memory.
And the whole IP address can be stored in 32 bits of memory, as the standard integer type in C++ or
Java. So there are 2 to the power of 32 different IP addresses, which is roughly 4 billion.
Recently, the Internet became so big that 4 billion is no longer enough for all of the computers in the
network. That's why people designed the new address system called IPv6, where the number of
addresses is 2 to the power of 128, which is a number with 39 digits. And it will be sufficient for
a long time.
In this problem, we will talk about the old system called IPv4, which is still in use and which
contains only 2 to the power of 32 different IP addresses.
When somebody accesses your web service, you know from which IP address he or she accessed it,
and you store this information in a special file called an access log. You want to analyze all the activity,
for example, to defend yourself from attacks. An adversary can try to kill your service by sending lots
and lots of requests from his computer to your service, so that it doesn't survive the load and fails. This
is called a Denial of Service attack. And you want to be able to quickly notice the pattern that there is
an unusually high number of requests from the same IP address during some period of time, for example,
the last hour. And to do that, you want to analyze your access log.
You can think of your access log as a simple text file with many, many lines. In each line, you
have the date and time of the access and the IP address from which the client accessed your service.
And you want to be able to quickly answer queries like: did anybody access my service from this
particular IP address during the last hour? And how many times did they access my service? And how
many different IPs were used to access the service during the last hour?
To answer those questions, we'll need to do some log processing. But of course, we don't want to
process a whole hour of logs each time we want to answer such a simple question, because one
hour of logs can easily contain tens of thousands or hundreds of thousands or even millions of lines,
depending on the load of your web service. We want to do that much faster.
So to do that we'll keep counters. For each IP address, we'll keep a counter that says how many times
exactly that IP address appears in the last one hour of the access log, or how many times during the
last hour clients accessed your service from that particular IP address.
And we'll store it in some data structure C, which is basically some data structure to store the
mapping from IP addresses to counters. We don't know yet how to implement that data structure C;
we will discuss that further. We will update the counters corresponding to IP addresses every second.
For example, if now is 1 hour 45 minutes and 13 seconds from the start of the day (we'll ignore
the date field in the access log for the sake of simplicity), then we need to increment the counters
corresponding to the IP addresses in the last two lines of the log, because those are new lines. We
also need to remember to decrement the counters corresponding to the IP addresses in the old lines
of the log. For that we'll look at the lines exactly 1 hour ago in the log. The lines which are
older than that, we've already decremented the counters for in the previous seconds. And the
lines which are more recent than that, we still don't need to decrement the counters for, because the
IPs in those lines are still in the 1 hour window ending in the current second. So we'll decrement the
counters corresponding to the lines which are exactly 1 hour ago from the current moment.
Now let's look at the pseudo code. In the main loop we have the following variables. log represents
the access log; we will think of it as an array of log lines, and each log line has two fields: time and IP
address. C is some mapping from IPs to counters. We still don't know how to implement it, but we
suppose that we have some data structure for that.
i is an index in the log which points to the first unprocessed log line. So when a new second starts,
we'll need to start incrementing counters corresponding to lines starting from i and further in the log.
j is the first, or the oldest, line in the current 1 hour window, so that when the next second starts we'll
need to decrement counters for some of the lines starting from line number j. We initialize i and j
with 0, and C with an empty mapping, because there is nothing to store at the start. And then each
second, we call procedure UpdateAccessList, and we pass there the access log to read data from. We
also pass i and j, which we will use inside and also update. And we pass the data structure C, which
it is our goal to update.
So now let's look at the pseudo code for UpdateAccessList. It consists of two parts: the first part deals
with the new lines and the second part deals with the old lines.
New lines start from line number i, which is the first unprocessed line. We look at this line, we
increase the counter corresponding to the IP in this line using our data structure C, and then we go on
to the next line. We proceed with this while the time written in log line i is still less than or equal
to the time when UpdateAccessList was launched, and then we stop processing new lines. Now we
want to process the old lines. How do we determine that a line is old enough to decrement the counter?
We compute the time now, which we assume is measured in seconds. Then we need to subtract
exactly one hour from that, and that is 3600 seconds. And if the time written in line j is less than or
equal to that, we need to decrement the corresponding counter. So we'll start with line number j,
which is the first line in our 1 hour window. We check that it is old enough to decrement the counter,
we decrement the counter if that's the case, and then we move on to the next line. In the end, when
we stop in this while loop, j will again point to the first, or oldest, line in the current 1 hour window.
So we've implemented the updating procedure correctly. Now, how do we answer the question whether
this particular IP was or was not used to access our service during the last hour? That is really easy. If
the counter corresponding to that IP is more than 0, then this IP was used during the last hour.
Otherwise the counter will be 0.

So, we've implemented all the procedures necessary to answer the questions, but for one small detail:
we don't know how to implement the data structure C.
We will discuss that in the next lectures.
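The whole scheme from this video can be sketched in Python as follows; the function names follow the lecture's pseudo code, but the details (tuples for log lines, a dictionary standing in for the data structure C) are my own assumptions.

```python
from collections import defaultdict

HOUR = 3600  # one hour in seconds

def update_access_list(log, i, j, C, now):
    # First part: new lines -- count every access up to the current second.
    while i < len(log) and log[i][0] <= now:
        C[log[i][1]] += 1
        i += 1
    # Second part: old lines -- uncount accesses older than one hour.
    while j < len(log) and log[j][0] <= now - HOUR:
        C[log[j][1]] -= 1
        j += 1
    return i, j

def accessed_last_hour(C, ip):
    # An IP was active in the window iff its counter is positive.
    return C[ip] > 0

# Log lines as (time_in_seconds, ip) pairs, in increasing time order.
log = [(10, "1.2.3.4"), (20, "5.6.7.8"), (4000, "1.2.3.4")]
C = defaultdict(int)
i, j = update_access_list(log, 0, 0, C, now=4000)
assert C["1.2.3.4"] == 1            # the access at t=10 fell out of the window
assert not accessed_last_hour(C, "5.6.7.8")
```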
Direct Addressing

Hi. In this video we will talk about direct addressing, which is the first step on the way to hashing.
Remember this pseudo code from the last video: we implemented procedure UpdateAccessList
using a data structure C, which stores a counter for any IP address.
Now the question is how to implement the data structure C itself.
The idea here is that there are 2 to the power of 32 different IP addresses according to the IPv4
format.
And we can actually convert each IP to a 32-bit integer, and it will be a one-to-one correspondence
between all possible IPs and all numbers between zero and two to the power of 32 minus one.
Thus, we can create an array A of size exactly two to the power of 32, with indexes zero to two to the
power of 32 minus one. Then for each IP, there will be exactly one position in this array corresponding
to this IP, and we will be able to use the corresponding entry of array A as the counter for this IP.
Now, how do we actually convert IP addresses to integers?
If you look at this picture, you will see that any IP address actually consists of 4 integer numbers,
which are all at most 255, and each of them corresponds to 8 bits, or 1 byte, in the total 4-byte or
32-bit integer number. Basically, if you just concatenate the 8 bits corresponding to the first number
with the 8 bits corresponding to the second number, and then the third number and the fourth
number, you will get 32 bits. And if you then convert this string of 32 bits into decimal form,
you will get an integer number in the form which we are used to. For example, if you take a very
simple IP address, 0.0.0.1, it will convert to integer 1, because all the higher bits are zeroes, and in the
lowest byte the only bit set is the lowest bit, which corresponds to number 1. If we convert the
number in the picture to decimal form, we will get 2886794753. Now, what do you think will be
the integer number corresponding to this IP? The correct answer is 1168893508.
Now, here is the formula and the code to convert an IP address to an integer number. Why is that?
Well, the lowest eight bits are in the fourth number of the IP address, so we use them without
changing.
The next eight bits are in the third number of the IP. But to use them, we need to move them to the left
by eight positions in the binary form, and to do that, we need to multiply the corresponding integer
number by two to the power of eight.
The next eight bits are in the second number of the IP,
and to use them we need to move them to the left by 16 positions in the binary form. To do that, we
multiply the corresponding integer number by two to the power of 16, and so on. This gives us a one-to-one
correspondence between IP addresses and integer numbers.
Now we can rewrite the code for UpdateAccessList using array A instead of the mysterious data
structure C.
The only thing that changes is the incrementing and decrementing of the counters. When we
need to increment the counter corresponding to the IP in the i-th line, we first convert this IP to an
integer number from 0 to 2 to the power of 32 minus 1, and then we increase the entry of the array A
at this index. Note that each IP is converted to its own integer number, so there will be no
collisions between different IP addresses, where we try to increment a counter for one IP address and
by chance increment the counter corresponding to another IP address. All IP addresses are uniquely
mapped into integers from zero to two to the power of 32 minus one. We do the same thing when we
need to decrement a counter. So basically, in the position of array A corresponding to any IP
address, we will store the counter which measures how many times this particular IP was used to
access your service during the last hour.
Now, how do we answer the question whether this IP was or was not used during the last hour to
access your service? This is very easy. We first convert the IP to the corresponding position in the
array A, and then we look at the counter at this position.
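The conversion and the counter update can be sketched as follows (a real array of 2^32 counters is what the lecture describes; in this Python demo a dictionary stands in for A so the example stays small and runnable):

```python
def ip_to_int(ip: str) -> int:
    # Each of the four numbers occupies 8 bits of the 32-bit result:
    # a * 2^24 + b * 2^16 + c * 2^8 + d.
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

A = {}  # stands in for the array A of size 2**32 from the lecture

def increment(ip):
    key = ip_to_int(ip)
    A[key] = A.get(key, 0) + 1

assert ip_to_int("0.0.0.1") == 1
assert ip_to_int("172.16.254.1") == 2886794753  # matches the lecture's value
increment("172.16.254.1")
assert A[2886794753] == 1
```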
If the IP was used, then the counter will be more than zero. Otherwise it will be exactly zero.
So now let's look at the asymptotics of this implementation. UpdateAccessList is as fast as we can do:
it is constant time per log line, because for each log line we only look at some position in the array
and increment or decrement the counter there.
AccessedLastHour is also constant time, because the only thing we do is look at some position in
the array, which is a constant time operation, and compare it with zero. But there is a drawback.
Even if during the last hour, for example in the night, there are only five, or 10, or 100 IPs from
which your clients use the service, you will still need 2 to the power of 32 memory cells to store that
information.
And in general, if you have, for example, the new IP protocol, IPv6, it already contains 2 to the power of
128 different IP addresses. And if you create an array of that size, it won't fit in the memory of your
computer.
In the general case, we need O(N) memory, where N is the size of our universe. The universe is the set of
all objects that we might possibly want to store in our data structure. It doesn't mean that every one
of them will be stored in our data structure, but if we at least at some point might want to store it, we
have to count it. So, for example, if some of the IP addresses never access your service, you will still
have to have a cell in your array for those particular IPs in the direct addressing method. So this method
only works when the universe is somewhat small, and we need to invent something else to work with
universes which are bigger than that, or even infinite, such as, for example, the universe of all
possible words, all possible strings, or all possible files on your computer. And we will talk about
that in the next videos.
List-based Mapping
Hi, in this video we will study another approach to the IP addresses problem. 
In the last video we understood that the direct addressing scheme sometimes 
requires too much memory. 
And why is that? 
Because it tries to store something for each possible IP address, 
while we're only interested in the active IP addresses: 
those from which at least some user has accessed our service during the last hour. 
So the first idea for improving the memory consumption 
is: let's just store the active IPs and nothing else. 
Another idea is that if our array-based approach from the last video has failed, 
then let's try to use a list instead of an array. 
So let's store all the IP addresses which are active in a list, 
sorted by the time of access, 
so that the first element in the list corresponds to the oldest access time 
during the last hour, and the last element in the list corresponds to the latest, 
newest access from some IP address to our service. 
Let's jump from here right into the pseudo code, because it's pretty simple. 
We're going to have our procedure UpdateAccessList, which takes in 
the log file log. 
It also takes in i, which is the index of the first 
log line which hasn't been processed yet. 
And instead of the abstract data structure C from the first videos, and 
instead of the array A from the direct addressing scheme, 
we pass a parameter L into this procedure, 
and this is the list with the active IP addresses. 
Our code again has two parts: the first deals with new lines and the second with old lines. 
We just go, starting from the first unprocessed line, 
and if we need to add it to our list because the access happened 
during the last hour, we just append it to the end of the list. 
And now again, the last element of the list corresponds to the latest, 
newest access from some IP address. 
Note that in our list we store not just the IP address but 
both the IP address and the time of the access.
And then we will go to the next element in the log file, and go on while we 
still have some log lines which we need to add to the end of our list. 
In the second part we just look at the oldest event during the last hour, 
which corresponds to the first element of the list. 
And if that is actually before the start of the last hour, 
then we need to remove it from the list, 
and so we just do L.Pop. 
And we do that while the head of the list is still too old.
When we stop, it means that all the elements in the list have access times 
within the last hour. 
Why is that? 
Because the list is always kept in order of increasing time of access. 
When we add new log lines to the list, 
we add only those whose time is even greater than the current last element of the list, 
and when we remove something from the list, 
we remove the oldest entries. 
So all the entries are always sorted, and as soon as we have removed everything from 
the start which is too old, all the entries in the list are not too old: 
they were made during the last hour.
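A Python sketch of this list-based scheme (my own names; a collections.deque plays the role of the list L, since it supports constant-time Append at the tail and Pop from the head):

```python
from collections import deque

HOUR = 3600

def update_access_list(log, i, window, now):
    # First part: append new lines (they arrive in increasing time order).
    while i < len(log) and log[i][0] <= now:
        window.append(log[i])
        i += 1
    # Second part: pop entries older than one hour from the head.
    while window and window[0][0] <= now - HOUR:
        window.popleft()
    return i

def count_ip(window, ip):
    # Linear scan by the IP field -- linear in the number of active entries.
    return sum(1 for _, entry_ip in window if entry_ip == ip)

log = [(10, "1.2.3.4"), (3900, "1.2.3.4"), (3950, "5.6.7.8")]
window = deque()
i = update_access_list(log, 0, window, now=4000)
assert count_ip(window, "1.2.3.4") == 1   # the t=10 access expired
assert count_ip(window, "5.6.7.8") == 1
```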
So this is pretty simple, and now we need to answer questions like: was my IP 
address used during the last hour to access the service, and how many times? 
To answer the first one, we just need to 
find out whether there is an element in our list with the given IP address. 
That is done by FindByIP, which differs from the standard 
Find procedure of the list by the fact that we search not by the whole object, 
which is a log line containing both IP address and time, 
but just by the first field, the IP address. 
So our list contains tuples of IP addresses and 
times of access, and we only look by IP address. 
But the implementation will be the same. 
We just go from the head of the list to the end of the list, and 
compare the IP field of the log lines with the IP address given as input. 
If it coincides, we return this element; 
otherwise we return a special value, null, meaning there is nothing with this IP address in the list. 
Then, in AccessedLastHour, we just compare the result with null. 
If it's not null, then this IP address is in the list; otherwise it's not.
And to count the number of times 
our service was accessed from a particular IP address, we just need to 
count the number of log lines in the list which have the same IP address. 
That can be done by procedure CountIP of the list, which 
again differs from the standard Count procedure of the list by the fact that it 
counts by the first field, not by the whole object, which is a log line. 
It just goes from the head to the end of the list, 
compares the IP field with the given IP, and if they coincide, 
it increases the counter by 1, 
and returns the counter in the end. 
So this is all the implementation. 
Now let's analyze it. 
Let n be the number of currently active IPs; 
then the memory consumption is big O of n, 
because we only store the active IP addresses and the corresponding times 
of access, and the times of access only add constant memory per active IP. 
So it's all linear in the number of active IPs, which is much better than 
the direct addressing scheme, because that required an amount of memory 
proportional to the number of all possible IP addresses, 
and here we only require an amount of memory proportional to the number of 
currently active IP addresses.
What about running time? 
We know the standards list procedures such as Append, Top and 
Pop all working constant time and 
that's why the UpdateAccessList works in constant time per log line. 
Of course, any particular call to UpdateAccessList could take more than 
constant number of operations if we need to add more new lines to the end of 
the list or remove many many old lines from the start of the list. 
But for each log line we will only append it at most once and 
we will only removed from the beginning at most once. 
So it's constant time per log line plus constant time per each call of 
UpdateAccessList just to check whether we need to append something and 
whether we need to remove something from the beginning. 
But this amount of operations can be controlled by how 
often do we actually call Update Access List.
Play video starting at 6 minutes 44 seconds and follow transcript6:44
What about answering the questions? 
We know that Find By IP and 
Count IP have to go through the whole list in the worst case and actually count. 
IP has to go through the whole list all the time to find out how many 
log lines have the same IP as the given one and so AccessLastHour and 
AccessCountLastHour are both linear in the number of active IPs. 
And that is actually now good because even without introducing 
any additional data structures, we could just take the log file, 
take the last line in it before the current time, and go back from it. 
And just look through each log line and 
compare its IP address with the IP address in the question. 
And count how many times it occurs during the last hour and 
just stop as soon as we go through the border of the last hour. 
And that will take the same time without any additional data structure. 
So this solution is not more clever than the trio approach.
Play video starting at 7 minutes 43 seconds and follow transcript7:43
So, we failed somewhat with direct addressing scheme and 
we failed with this list based approach. 
It is overall a failure? 
Well no, in the next videos we'll combine the ideas from direct 
addressing scheme with the list based approach. 
And we'll come up with solution which is both good in terms of memory consumption 
and is much faster than the trivial approach in terms of the running time.
List-based Mapping

Hi, in this video we will study another approach to the IP addresses problem. In the last video we
understood that the direct addressing scheme sometimes requires too much memory. And why is
that? Because it tries to store something for each possible IP address while we're only interested in
the active IP addresses. Those from which at least some user has accessed our service during the last
hour. So the first idea for improving the memory consumption is: let's just store the active IPs
and nothing else. Another idea is: if our array-based approach from the last video has failed, then
let's try to use a list instead of an array. So let's store all the IP addresses which are active in a list.
Sorted by the time of access. So that the first element in the list corresponds to the oldest access time
during the last hour, and the last element in the list corresponds to the latest, newest access from
some IP address to our service. Let's jump from here right into the pseudo code, because it's pretty
simple. We're going to have our procedure update access list which takes in the log file log. It also
takes in i which is the index of the first log line which hasn't been processed yet. And also it has input
L which is the list, instead of some abstract data structure C from the first videos and instead of
the array A from the direct addressing scheme. We put the parameter L, which is a list, into this procedure,
and this is the list with active IP addresses. So our code has two parts: the first deals with new lines and
the second deals with old lines. We just go starting from the first unprocessed line. And if we need to
add it to our list because the access happened during the last hour, we just append it to the end of the
list. And now again, the last element of the list corresponds to the latest, newest access from some IP
address. And note that in our list we will store not just the IP address, but both the IP address and the
time of the access.
Play video starting at 2 minutes 7 seconds and follow transcript2:07
And then we will go to the next element in the log file and go and go while we still have some log lines
which we need to add to the end of our list. And then the second part we just look at the oldest event
during the last hour, which is corresponding to the first element of the list. And if that is actually
before the start of the last hour, then we need to remove it from the list. And so we just do L.Pop.
And we do that while the head of the list is still too old.
Play video starting at 2 minutes 39 seconds and follow transcript2:39
And when we stop, it means that all the elements in the list are actually with time during the last
hour. Why is that? Because the list is always kept in order of increasing access time. When we
add new log lines to the list, we add only those whose time is no earlier than that of the current last element
of the list, and when we remove something from the list, we remove the oldest entries. So, all the
entries are always sorted, and as soon as we have removed everything from the start which is too old, all
the entries in the list are not too old. They are made during the last hour.
Play video starting at 3 minutes 19 seconds and follow transcript3:19
So this is pretty simple and now we need to answer questions like, whether my IP address was used
during the last hour to access the service and how many times. To answer the first one we just need
to
Play video starting at 3 minutes 31 seconds and follow transcript3:31
find out whether there is an element in our list with the given IP address. And that is done by FindByIP,
which differs from the standard find procedure of the list in that we search not by
the whole object, which is a log line containing both the IP address and the time. But we search just by
the first field, by the IP address. So our list contains tuples of IP addresses and times of access, and we
only look by IP address. But the implementation will be the same. We'll just go from the head of the
list to the end of the list, and compare the IP field of the log lines with the IP address given as the
input. And if it coincides, we will return this element; otherwise we'll return that there is nothing with
this IP address in the list, by returning some special value, null. Then, in
AccessedLastHour, we just compare the result with null. If it's not null, then this IP address is in the list;
otherwise it's not.
Play video starting at 4 minutes 36 seconds and follow transcript4:36
And to count the number of times
Play video starting at 4 minutes 40 seconds and follow transcript4:40
our service was accessed from a particular IP address, we just need to count the number of log lines in
the list which have the same IP address. And that can be done by the procedure CountIP of the list, which
again differs from the standard count procedure of the list in that it counts by the first field,
not by the whole object, which is a log line. It just goes from the head to the end of the list, compares the IP
field with the given IP, and if they coincide, it increases the counter by 1. And returns the counter in
the end. So this is all the implementation.
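The list-based scheme just described can be sketched in Python. This is only an illustrative reconstruction of the transcript's UpdateAccessList, FindByIP and CountIP, not the course's reference code: the log is modeled as a list of (ip, time) tuples, and a deque plays the role of the list so that popping old entries from the head is cheap.

```python
from collections import deque

WINDOW = 3600  # one hour, in seconds (an assumption of this sketch)

def update_access_list(log, i, L, now):
    """Append unprocessed log lines and drop entries older than one hour.

    log -- list of (ip, access_time) tuples, ordered by access time
    i   -- index of the first log line that hasn't been processed yet
    L   -- deque of (ip, access_time) pairs, oldest access first
    Returns the new value of i.
    """
    # First part: new lines are appended to the end of the list.
    while i < len(log) and log[i][1] <= now:
        L.append(log[i])
        i += 1
    # Second part: lines older than one hour are popped from the head.
    while L and L[0][1] <= now - WINDOW:
        L.popleft()
    return i

def find_by_ip(L, ip):
    """Search by the first field only; return the (ip, time) entry or None."""
    for entry in L:
        if entry[0] == ip:
            return entry
    return None

def accessed_last_hour(L, ip):
    return find_by_ip(L, ip) is not None

def count_ip(L, ip):
    """Count log lines in L whose IP field equals the given IP."""
    return sum(1 for entry in L if entry[0] == ip)
```

Note that both find_by_ip and count_ip scan the whole list of active entries, which is exactly the linear-time behavior analyzed next.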
Now let's analyze it. Let n be the number of currently active IPs; then the memory consumption is
big O of n. Because we only store the active IP addresses and the corresponding times of access, and the
times of access only add constant memory per active IP. So it's all linear in the number of active
IPs, which is much better than the direct addressing scheme, because that required an amount of memory
proportional to the number of all possible IP addresses. And here we only require an amount of memory
proportional to the number of currently active IP addresses. What about running time? We know that the
standard list procedures such as Append, Top and Pop all work in constant time, and that's why
UpdateAccessList works in constant time per log line. Of course, any particular call to
UpdateAccessList could take more than a constant number of operations if we need to add many new
lines to the end of the list or remove many old lines from the start of the list. But each log
line will only be appended at most once and will only be removed from the beginning at most once.
So it's constant time per log line, plus constant time per call of UpdateAccessList just to check
whether we need to append something and whether we need to remove something from the
beginning. And this amount of operations can be controlled by how often we actually call
UpdateAccessList.
Play video starting at 6 minutes 44 seconds and follow transcript6:44
What about answering the questions? We know that FindByIP and CountIP have to go through the
whole list in the worst case, and CountIP actually has to go through the whole list every time to find
out how many log lines have the same IP as the given one. So AccessedLastHour and
AccessCountLastHour are both linear in the number of active IPs. And that is actually not good,
because even without introducing any additional data structures, we could just take the log file, take
the last line in it before the current time, and go back from it. And just look through each log line and
compare its IP address with the IP address in the question. And count how many times it occurs
during the last hour, and just stop as soon as we cross the border of the last hour. And that will
take the same time without any additional data structure. So this solution is not more clever than the
trivial approach.
Play video starting at 7 minutes 43 seconds and follow transcript7:43
So, we failed somewhat with the direct addressing scheme, and we failed with this list-based approach. Is
it an overall failure? Well, no: in the next videos we'll combine the ideas from the direct addressing
scheme with the list-based approach. And we'll come up with a solution which is both good in terms of
memory consumption and much faster than the trivial approach in terms of running time.

Hash Functions

Hi, in this video, you will learn what a hash function is, how we could apply it to solve our problem
with IP addresses, and why it is not straightforward to make it work. Remember, the direct
addressing approach worked particularly fast, but it used a lot of memory. That's because it encoded IPs
as numbers, and those numbers were sometimes huge. So we had to create an array of size 2 to the
power of 32 just to store all those numbers. What if we could encode our IP addresses with smaller
numbers, for example, numbers from 0 to 999? We'll still need the codes for different IP addresses
which are currently active to be different, because we want a separate counter for each IP in our
solution.
Play video starting at 47 seconds and follow transcript0:47
Let's define a hash function. Suppose you have a universe of objects S: for example, the set of all IP
addresses, or the set of all files stored on your computer, or the set of all keywords in a programming
language. That is our universe, and we will call it the set S. And now we want to encode each object from
that universe with a small number: a number from 0 to m - 1, where m is a positive integer. Then
any function which encodes an object from S as a number from 0 to m - 1 is called a hash function.
Play video starting at 1 minute 26 seconds and follow transcript1:26
And m is called the cardinality of hash function h.
Play video starting at 1 minute 31 seconds and follow transcript1:31
So what are the desirable properties of the hash function in our problem? First, h should be fast to
compute because we need to encode some object for each query. Second, we want different values
for different objects, because we want a separate counter for each IP address in our
problem.
Play video starting at 1 minute 51 seconds and follow transcript1:51
And also, we want to use the direct addressing scheme because it was very fast, but we want to use the
direct addressing scheme with a small amount of memory. And it's only logical in this case to use the
direct addressing scheme with O(m) memory: just create an array A of size m, then encode each
IP with some value from 0 to m - 1, and store the corresponding counter in that cell of the array.
Play video starting at 2 minutes 18 seconds and follow transcript2:18
The problem is that we want small cardinality m and it won't work if m is smaller than the number of
different objects in the universe. Because if we have, for example, 25 objects in the universe and m is
only 10, then at least two objects will have the same code from 0 to 9, because there are only 10
different codes and there are 25 different objects. So that won't work for all possible universes and
for small m.
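To make the pigeonhole argument concrete, here is a small toy example (our own, not from the lecture): a hash function of cardinality m = 10 that maps an IPv4 address to its 32-bit integer value modulo 10. With 25 addresses and only 10 possible codes, collisions are unavoidable.

```python
def ip_to_int(ip):
    """Encode a dotted-quad IPv4 address as a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def h(ip, m=10):
    """A toy hash function of cardinality m: the IP's integer value modulo m."""
    return ip_to_int(ip) % m

ips = ["192.168.0.%d" % k for k in range(25)]  # 25 distinct addresses
codes = [h(ip) for ip in ips]                  # only 10 possible codes
```

By the pigeonhole principle, at least two of the 25 codes must coincide; here, for example, 192.168.0.3 and 192.168.0.13 hash to the same value.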

Play video starting at 2 minutes 52 seconds and follow transcript2:52


This situation, when the values of the hash function are the same but the objects which are being
encoded are different, is called a collision. So collisions cause us problems. Because of collisions, we
cannot just directly apply the scheme called direct addressing with O(m) memory. And in the next
lecture, we will see how to overcome this problem.
Chaining Scheme

In this video, we will study chaining, which is one of the most frequently used techniques for using
hashing to store mappings from one type of object to another type of object.
Play video starting at 12 seconds and follow transcript0:12
So, let us define a map.
Play video starting at 15 seconds and follow transcript0:15
We often want to store mapping from some objects to some other object. For example, I'm mapping
from IP addresses to integer numbers. Or from filenames to the physical location of those files on the
disk. From student ID to the name of the student. Or from contact name in your phone book to the
contact phone number.
Play video starting at 40 seconds and follow transcript0:40
The general definition of a map from set of objects S to the set of values V is a data structure which
has three methods. HasKey, which tells us whether there is an entry in the map corresponding to
object O from set S. Method Get, which returns to us the value corresponding to the object O, if there
is one. If there is no such value, it returns a special value telling us that there is no entry
corresponding to this object O in the map.
Play video starting at 1 minute 13 seconds and follow transcript1:13
And the last method is set, the most important method, which sets the value corresponding to object
O to V.
Play video starting at 1 minute 22 seconds and follow transcript1:22
Here, objects O are all from the set S and values V are from the set big V.
Play video starting at 1 minute 30 seconds and follow transcript1:30
We want to implement a map using a hash function and some combination of ideas from the direct
addressing and list-based solutions from the previous videos. So what we'll do is called
chaining.
We will create an array of size m, where m is the cardinality of the hash function, and in this case, let
m be eight. This won't be an array of integers, though. This will be an array of lists. So in each cell of
this array, we will store a list. And this will be a list of pairs. And each pair will consist of an object, O.
And a value V, corresponding to this object. Let's look at an example.
Play video starting at 2 minutes 14 seconds and follow transcript2:14
For example, our objects are IP addresses, and the values are the corresponding counters, as in our
initial problem about a web service and the IP addresses of its clients. Now we're processing the log, and
we see an IP address starting with 173. And it so happens that the value of the hash function on this IP
address is four. Then we look at cell four; the list there is empty for now. So we append the pair
of our IP address and the corresponding counter, one, to this list. The value is one because this is the
first time that we encounter this IP.
Play video starting at 2 minutes 54 seconds and follow transcript2:54
Now we'll look at the next IP in the log. It starts with 69, and the hash value for this IP is one. So we'll
look at the cell number one, and we append the pair of this IP address and the corresponding counter
one to the list. Again the counter is one because this is the first time we see this IP address.
Play video starting at 3 minutes 15 seconds and follow transcript3:15
Now let's look at the next IP address in the log, and we see that it again starts with 173, and actually it
coincides with the first IP that we've already seen. And the hash value is again four, because the hash
function is deterministic: it always returns the same number for the same object. So we'll look at the cell
number four, we'll look through the whole list and we find out that there is already a pair containing
this IP address as the key. So instead of appending this IP address again to the list, we will increase
the value of the counter by one because this is the second time we've seen our IP address. Of course
in the interface of a general map, there is no method for incrementing a counter, there is a method to
set so we will need to first use method get to get the value corresponding to this IP address, we will
get one. We will then increase it by one ourselves, get two. And then we will call set for this IP
address and value two. And it will just rewrite the value from one to two in this list element.
Play video starting at 4 minutes 27 seconds and follow transcript4:27
Then, we'll look at the next line in our log, and we see that this is IP starting from 91. And it so
happens that the hash value for this IP address, again, is four, although this is a different IP address.
And that has to happen at some point, because there are many, many different IP addresses, and only
eight entries in our array.
Play video starting at 4 minutes 51 seconds and follow transcript4:51
So what do we do? If we look at the cell number four, there is a non-empty list there. We go through
the whole list, but we see that our new IP address starting from 91 is not in the list. So we add our
new IP address to the end of this list
Play video starting at 5 minutes 9 seconds and follow transcript5:09
along with the corresponding counter of one. And these two IP addresses in the list for cell number
four already make a chain together. And if we go further and further through the log, and we add
some IP addresses to this map, some of the chains will become longer. If at some point we need
to remove some IP address from the list, we can do that, and the chain can become shorter. But
anyway, you see the general structure: a chain, maybe empty, maybe non-empty, starts in every cell
of the array. The array size is m, which is equal to the cardinality of the hash function. And for each
such cell we store a list with all the IP addresses which occurred before and which have hash value
the same as the number of the cell.
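The walk-through above can be condensed into a short Python sketch of a chaining-based map. This is our own illustrative code, not the course's; Python's built-in hash, reduced modulo m, stands in for a hash function of cardinality m = 8, and the full IP addresses are made up for the demo.

```python
class ChainedMap:
    """A map implemented with chaining: an array of m chains,
    each chain a list of [object, value] pairs."""

    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def _h(self, key):
        # Deterministic hash with values in 0..m-1.
        return hash(key) % self.m

    def get(self, key):
        for k, v in self.chains[self._h(key)]:
            if k == key:
                return v
        return None  # special value: no entry for this key

    def set(self, key, value):
        chain = self.chains[self._h(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value  # key already present: rewrite the value
                return
        chain.append([key, value])  # key not found: extend the chain

# Counting accesses per IP, as in the example above:
counts = ChainedMap(m=8)
for ip in ["173.252.120.6", "69.171.230.68", "173.252.120.6", "91.198.174.192"]:
    counts.set(ip, (counts.get(ip) or 0) + 1)
```

Note that incrementing a counter is done exactly as in the transcript: get the current value, add one, then set it back.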


Hash Tables
Hi, in this video, we will finally start talking about hash tables. We will define what a hash table is and
what we can do with it. In the last video, we've introduced the notion of map, and now we'll
introduce a very similar and natural notion of a set. By definition, a set is a data structure which has at
least three methods, to add an object to the set, to remove an object from the set, and to find out
whether a given object is already in the set or not.
Play video starting at 29 seconds and follow transcript0:29
One of the examples we already know very well: the set of all IPs from which clients accessed your
service during the last hour. This is the example with which we've worked for the last few videos.
Another example would be to store the set of all students currently on campus. And another one is to
store all the key words of a given programming language so that we can quickly highlight them in the
text editor which you use to code. There are two ways to implement a set. One of them: when you
already have an implementation of a map, you can base your implementation of a set on the map.
Basically, you can make a map from all the objects S that you need to store in the set to a set of
values V which contains only two values, true and false.
Play video starting at 1 minute 22 seconds and follow transcript1:22
If the object is in the set, then the corresponding value to this object will be true. If the object is not in
the set, it is either not in the map or the corresponding value to it in the map is false. But that is not a
very efficient way, because we will have to store twice as many objects and values as we need. And
also, when we remove objects from the set, it will be hard to remove them from the map; we will
probably have to store them with value false. So there's a better way. We can again use chaining. But
instead of storing pairs of objects and corresponding values in the chains, we'll just store objects
themselves.
Let's see how we can implement that in code. Again, we'll have a hash function from all the
objects S to the set of integers from 0 to m-1. We denote by O and O' objects from the set
S, and we initialize A with an array of size m which consists of lists, or chains. And each chain
consists of objects O. Initially, all the chains are empty.
Play video starting at 2 minutes 32 seconds and follow transcript2:32
When we need to find an object inside a set, we first compute the hash value of our object, we look at
the corresponding cell in the array A. We take the list of objects from there, and then we go through
the whole list and try to find object O there. If we find it, return true. Otherwise, return false because
our object O can be only in the list corresponding to the cell in the array A, number h(O).
Play video starting at 3 minutes 0 seconds and follow transcript3:00
To implement add, we again compute value of hash function on object O, we take the list
corresponding to this cell. And we go through this list, if we find our object O on this list, then we
don't need to do anything because our object O is already in the set. Otherwise, we append our
object to the list corresponding to cell number h(O).
Play video starting at 3 minutes 24 seconds and follow transcript3:24
To remove an object from the set, we first try to find it in the set. If it's not in the set, we don't
need to do anything. Otherwise, we again compute the hash value of our object, take the
corresponding list, and erase our object from that list.
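The three set operations just described, Find, Add and Remove, can be sketched like this (illustrative Python, with Python's built-in hash modulo m standing in for the hash function h):

```python
class ChainedSet:
    """A set implemented with chaining: the chains store the objects
    themselves, not (object, value) pairs."""

    def __init__(self, m=8):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def _h(self, obj):
        return hash(obj) % self.m

    def find(self, obj):
        # The object can only be in the chain of cell number h(obj).
        return obj in self.chains[self._h(obj)]

    def add(self, obj):
        chain = self.chains[self._h(obj)]
        if obj not in chain:   # already in the set: nothing to do
            chain.append(obj)

    def remove(self, obj):
        chain = self.chains[self._h(obj)]
        if obj in chain:       # not in the set: nothing to do
            chain.remove(obj)
```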
Play video starting at 3 minutes 42 seconds and follow transcript3:42
So now we are ready to say what a hash table is. A hash table is any implementation of a set or a
map which uses hashing, that is, hash functions. It doesn't even have to use chaining. There are different ways to
use hash functions to store a set or a map in memory. But chaining is one of the most frequently used
methods to implement a hash table.
Play video starting at 4 minutes 7 seconds and follow transcript4:07
We have a few examples of hash tables already implemented as built-in standard library types
in programming languages. Set is implemented as unordered_set in C++, as HashSet in
Java, as set in Python. And map is implemented as unordered_map in C++, as HashMap in Java, and as
dict, or dictionary in Python.
Play video starting at 4 minutes 29 seconds and follow transcript4:29
Why those types are called unordered in C++, you will learn in one of the next modules about data
structures. For now, just know that hash tables are already implemented in the main languages
we use for this specialization.
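For instance, in Python the built-in hash-based containers mentioned above look like this (the sample addresses are made up):

```python
# set is a hash set, dict is a hash map.
active_ips = {"173.252.120.6", "69.171.230.68"}  # hash set of active IPs
active_ips.add("91.198.174.192")

access_counts = {}  # dictionary: maps IPs to counters
for ip in ["173.252.120.6", "173.252.120.6", "69.171.230.68"]:
    access_counts[ip] = access_counts.get(ip, 0) + 1
```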
Play video starting at 4 minutes 47 seconds and follow transcript4:47
In conclusion, we've learned what chaining is and what a hash table is. And now we know
that chaining is a technique that can be used to implement a hash table. We know that the memory
consumption for the chaining technique is O(n + m), where n is the number of objects currently
stored in the hash table. And m is the cardinality of the hash function.
Play video starting at 5 minutes 9 seconds and follow transcript5:09
We also know that the operations on such a hash table implemented using chaining work in time
O(c + 1), where c is the length of the longest chain.
Play video starting at 5 minutes 19 seconds and follow transcript5:19
Now the question is, how do we make both m and c small? Why do we need that? Because we want both
small memory consumption and fast operations. For example, if m is very big, then we can use direct
addressing, or something like it. But for some universes, some sets of objects, we will use too much
memory, or we will just have too much overhead
Play video starting at 5 minutes 45 seconds and follow transcript5:45
on top of the O(n) memory which is needed to store the n objects anyway. If m is small but c is big, well,
that's no different from the list-based approach, where we used only O(n) memory to store
the list, to store only the active IPs, but then we had to spend O(n) time to look through the whole
list every time we wanted to make a query. So we want both m and c to be relatively small. How can
we do that? Well, we can do that by a clever selection of the hash function, and we will discuss
this topic in the next lessons.
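The trade-off between m and c is easy to see empirically. In this sketch (our own, with Python's built-in hash standing in for h), m = 1 degenerates into the plain list-based approach with a single chain of length n, while the pigeonhole principle forces c to be at least n/m:

```python
def longest_chain(keys, m):
    """Chain length c of the fullest cell when hashing keys into m cells."""
    cells = [0] * m
    for k in keys:
        cells[hash(k) % m] += 1
    return max(cells)

keys = ["10.0.%d.%d" % (i, j) for i in range(10) for j in range(10)]  # n = 100 keys
```

With a good hash function and m comparable to n, the chains stay short on average; making that precise is exactly the subject of the next lessons.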
Chaining Implementation and Analysis

How do we implement this in code?


Play video starting at 3 seconds and follow transcript0:03
Well, let's just assume that we have a hash function h from the set of all possible objects S to the set
of numbers from 0 to m-1. And let us denote by O and O prime objects from set S and by v and v
prime values from set big V.
Play video starting at 22 seconds and follow transcript0:22
And let us have an array A, which consists of m lists, where m is the cardinality of the hash function.
And those lists we'll call also chains, and those chains consist of pairs of objects O and values v.
Play video starting at 38 seconds and follow transcript0:38
Now let us implement the first method, HasKey, which would return whether there is an entry in our
table or in our map for the object O. First, we compute the value of hash function on the object O.
We'll look at the corresponding cell in the array A, and we'll take the list out from there. Then we go
through this list and when we go through it, we'll look at pairs O prime, v prime that are elements of
this list. If for some pair O prime is the same as the object O for which we are looking, we return true
because it means that there is an entry in our map corresponding to the object O.
Play video starting at 1 minute 21 seconds and follow transcript1:21
If we don't find any corresponding pair in the list, return false because that means there is no such
object and there is no key corresponding to this object in our map. Because it only could be in the list
corresponding to the cell with number h of O and we didn't find it there.
Play video starting at 1 minute 40 seconds and follow transcript1:40
Next let's implement method Get, which should return the value corresponding to object O if there is
one. Otherwise, return some special value telling us that there is no entry corresponding to object O.
Again, we start with computing value of hash function on object O and looking at the cell number h of
O in the array A and take the list, which is stored in that cell. Then we again go through all the pairs in
that list L, pairs O prime, v prime, and if for some of the pairs, O prime is the same as the object O for
which we are looking, then we'll return the corresponding value v prime as the value corresponding
to that object O. If we go through the whole list and we don't find a corresponding pair, we'll return
special value n/a, which means that there is no value corresponding to object O in our map. Why is
that? Because if there was some value, it has to be in the list corresponding to the cell number h of O
because that's the way we store our chains, and if we didn't find it there, then there is no entry
corresponding to the object O.
Play video starting at 2 minutes 52 seconds and follow transcript2:52
Now the last, most interesting method, Set, which accepts two arguments, object O and the value v,
which we need to set corresponding to this object. We need to either rewrite this value if there was
already an entry corresponding to the object O with different value. Or we need to create a new value
in the map corresponding to the object O if it didn't happen to be in the map before. We again start
with computing the hash function on the object O and looking at the corresponding cell in the array A
and we'll take the list, we just start there. Now we go through all the pairs p in that list L, and each
pair p contains two fields, first field is p.O, which is the object of that pair, and p.v, which is the value
of that pair. If for some pair, we see that the object of that pair is the same as object O for which we
need to set the value v, then we just assign the value to the p.v, the new value. We will write the old
value with the new value for that object O and then we return, we exit from the function. Because
we've already done everything we need.
Play video starting at 4 minutes 3 seconds and follow transcript4:03
If we go through the whole list and we don't find any pair corresponding to our object O, it means
that there was no entry in our map corresponding to the object O previously. And it means that we
need to add a new pair to our list, and we just append a new pair containing object O and value v to
the list L, corresponding to the cell number h of O.
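The three methods described above can be sketched in Python. This is a minimal illustration of the chaining scheme, not the course's reference code; the hash function here (Python's built-in hash reduced modulo m) is an assumed stand-in, since choosing a good hash function is the topic of the next lessons.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m                        # cardinality of the hash function
        self.A = [[] for _ in range(m)]   # m chains: lists of (object, value) pairs

    def _h(self, obj):
        return hash(obj) % self.m         # assumed stand-in hash function, values 0..m-1

    def has_key(self, obj):
        # Scan only the chain in cell h(obj).
        chain = self.A[self._h(obj)]
        return any(o == obj for o, v in chain)

    def get(self, obj):
        chain = self.A[self._h(obj)]
        for o, v in chain:
            if o == obj:
                return v
        return None                       # "n/a": no entry corresponding to obj

    def set(self, obj, value):
        chain = self.A[self._h(obj)]
        for i, (o, v) in enumerate(chain):
            if o == obj:
                chain[i] = (obj, value)   # overwrite the old value
                return
        chain.append((obj, value))        # no entry yet: append a new pair
```

For example, `t = ChainedHashTable(8); t.set("Steve", "223-23-23")` stores a pair, and a later `t.set("Steve", ...)` overwrites it rather than adding a second pair to the chain.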
Now let's look at the asymptotics of the chaining scheme. The first lemma says that if c is the length of
the longest chain in A, then the running time of all three methods is theta of c+1.
Play video starting at 4 minutes 40 seconds and follow transcript4:40
First, if we look at the list corresponding to some object O, the list in the cell number h of O, then the
length of this list can be c, this can be the longest list itself. And if object O is not in this list and would
call some of the methods for this object, we will need to scan the full list so we'll need to scan all c
items in this list.
Play video starting at 5 minutes 4 seconds and follow transcript5:04
Also, if c is 0, so that our map is empty, our array A consists of m empty lists. We still need constant
time to check that. So that's why it is c+1 and not just c.
Play video starting at 5 minutes 18 seconds and follow transcript5:18
Another lemma is talking about memory consumption. So let n be the number of different keys that
we're storing in the map and m is the cardinality of the hash function. Then the memory consumption
is theta of n+m. That is very easy to prove. First, we need to store n pairs of objects and
corresponding values in the map. That's where we get theta of n. And we get additional theta of m to
store the array of m lists. Although those lists can be empty, we'll still need to use some memory to
store the pointers to the heads of those lists, and that's why memory consumption is theta of n+m.

Slides and External References

Slides
Download the slides for this lesson:
07_hash_tables_1_intro.pdfPDF File

References
See the chapter 1.5.1 in [DPV] Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani.
Algorithms (1st Edition). McGraw-Hill Higher Education. 2008.

See the chapters 11.1 and 11.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L.
Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Hash Functions

Phone Book Problem


Hi, in the previous lesson you've learned what is a hash function, what is a hash table, and how to use
those to implement data structures for storing sets of objects and mappings from one type of object
to another one. However, the speed of this data structure depends a lot on the choice of hash
function, and in this lesson you will learn how to choose a good hash function. You will learn how to
implement an efficient contact book. And you will also learn how hashing of String objects in Java is
implemented.
Play video starting at 31 seconds and follow transcript0:31
We will start with the phone book problem. When you use your phone you want to be able to quickly
look up a phone number of a person by name to be able to call him. And to determine who is calling
you and to see not their phone number but their name if it's in your contact book. So, you need a
data structure that is able to efficiently add and delete contacts from your phone book. To look up
phone number by name and to do the reverse, look up the name given the phone number.
Play video starting at 1 minute 6 seconds and follow transcript1:06
To do that we will need two mappings, one from phone numbers to names, and another one from
names to phone numbers. We will implement both of those maps as hash tables and we will start
from the mapping from phone numbers to names.
Play video starting at 1 minute 23 seconds and follow transcript1:23
One approach that we know from the previous lesson is direct addressing. First, we'll need to convert
phone numbers to integers, and that is very easy to do. We'll implement a simple function called int,
as in integer, that just deletes all characters of the phone number other than digits. And then you are
left with an integer number, like in this example.
Play video starting at 1 minute 47 seconds and follow transcript1:47
Then we'll create an array called Name, which will contain 10 to the power L cells, where L is the
maximum allowed length of the phone number. That way it will be able to store a cell for each integer
number from 0 to the number consisting of L nines,
Play video starting at 2 minutes 9 seconds and follow transcript2:09
where L is the maximum length of a phone number. So it will be basically enough to store each phone
number of allowed length.
Play video starting at 2 minutes 21 seconds and follow transcript2:21
And in this array, we'll store the names corresponding to the phone number. So to store a name
corresponding to some phone number P, we will first convert P to an integer, and the store the name
in the cell with this number.
Play video starting at 2 minutes 38 seconds and follow transcript2:38
And if there is no contact with some particular phone number P, we'll just store a default value N/A in
the corresponding cell.
This is how it will look. On the right is our array Name, and on the left, we have two contacts.
Natalie with number 123-45-67 which is converted to 1,234,567 and is stored in the cell with this
number in the array. It is somewhere in the middle of the array, there are a lot of cells before that. A
few cells next to it are probably filled with default value N/A. Because of course we have much less
phone numbers in your phone book than 10 to the power of 7, which is 10 million.
Play video starting at 3 minutes 23 seconds and follow transcript3:23
And then there is another contact of Steve which is stored at position 2232323. And of course, there
are more N/As in this array. So as we know operations in the direct addressing scheme work in
constant time. However, the memory consumption is exponential in this case. It is big O of 10 to the
power of L, where L is the maximum allowed phone number length. And that is problematic, because
with international phone numbers, which can contain 12 digits or more for European countries, for
example, we will need one terabyte, just to store one phone book, of one person. No smart phone is
able to store a phone book of size one terabyte. And in the next video, we will suggest a scheme that
avoids this problem with memory consumption.
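The direct-addressing scheme above can be sketched as follows. The helper names (to_int, add_contact, lookup) are illustrative, not from the lecture's code, and L is kept at 7 as in the example, so the array alone takes 10 million cells, which is exactly the memory problem being discussed.

```python
# Direct addressing for phone numbers of length up to L = 7.

L = 7

def to_int(phone):
    # Delete all characters other than digits, then read as an integer.
    return int("".join(ch for ch in phone if ch.isdigit()))

name = ["N/A"] * 10**L          # one cell per possible phone number

def add_contact(phone, person):
    name[to_int(phone)] = person

def lookup(phone):
    return name[to_int(phone)]  # constant time, but O(10^L) memory

add_contact("123-45-67", "Natalie")   # stored in cell 1,234,567
add_contact("223-23-23", "Steve")     # stored in cell 2,232,323
```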
Phone Book Problem - Continued

Another scheme that we know from the previous lesson is chaining. 


To use that we first select the hash function with some cardinality m, 
then we create an array name again of size m. 
But instead of storing the names themselves in the array, 
we store chains or lists. 
And in these lists, we'll store both names and the phone numbers. 
And to determine where to put name and phone number, we first convert the phone 
number to integer, then we'll apply hash function to it and get a hash value. 
And put both name and 
phone number in the chain corresponding to the cell with such index. 
Here is how it looks. For example, we have a contact of Steve and his phone number is 223-23-23.
We first convert it to the number 2 million and a few thousand, and then we compute the hash value
of this integer number and it turns out to be 1. Then we put both Steve's name and his phone number
in the chain corresponding to the cell number 1 in our array Name. Then we do the same for Natalie
and her phone number. And it turns out that the hash value of the integer corresponding to her
phone number is 6, so we put her contact in the 6th cell. And then we do the same for Sasha, and the
hash value of his phone number turns out again to be 1. So Sasha gets in the same cell as Steve. There
are the following parameters of the chaining scheme. First, n is the total number of phone numbers
stored in our phone book. m is the cardinality of the selected hash function, which is the same as the
size of our array Name, which works as a hash table. c is the length of the longest chain in our hash
table. We use big O(n + m) memory to store the phone book.
Play video starting at 1 minute 56 seconds and follow transcript1:56
And also alpha, which is n/m, the number of phone numbers stored divided by the size of the hash
table, which measures how filled up our hash table is, is called the load factor. And we will need it
later.
Play video starting at 2 minutes 11 seconds and follow transcript2:11
So we know that the operations with a hash table run in time big O(c + 1). And so we want both small
m to use less memory and small c so that everything works faster. And here's a good example. We
see a hash table of size 8 with a few chains and we see that the length of the chains are relatively the
same, with the longest chain being of length just 2. So everything will work fast. And here's a bad
example. When we again have hash table of size 8, but all the keys fell in the same cell 1. And they
make up a very long chain of size m, in this case, it is 8. So this is what we want to avoid.
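A sketch of this chaining phone book. The lecture does not specify the hash function that sent Steve and Sasha to cell 1, so a simple stand-in (the integer modulo m) is assumed here just to show the mechanics of storing (name, number) pairs in chains.

```python
m = 8                            # size of the hash table
A = [[] for _ in range(m)]       # m chains

def to_int(phone):
    return int("".join(ch for ch in phone if ch.isdigit()))

def h(x):
    return x % m                 # assumed stand-in hash function

def add_contact(phone, person):
    # Put both the name and the phone number in the chain of cell h(int(phone)).
    A[h(to_int(phone))].append((person, phone))

def lookup(phone):
    for person, p in A[h(to_int(phone))]:   # scan one chain only
        if p == phone:
            return person
    return "N/A"

add_contact("223-23-23", "Steve")
add_contact("123-45-67", "Natalie")
```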

Let's try a few hash functions that come to our mind. First, let's select cardinality of 1000. And choose
the first three digits as the hash value for the phone number. For example for this phone number it
will be 800, because the first three digits are 800.
Play video starting at 3 minutes 16 seconds and follow transcript3:16
However there is a problem with this hash function because the area code, which is the first three
digits will be the same for many, many people in your phone book. Probably because they live In the
same city with you, and so they will have the same area code. And the hash values for their phone
numbers will be the same, and they will make up a very long chain.
Another idea is to take the last digits, again the cardinality is 1,000, and we take the last three digits
as the hash value. So for this number, it will be 567. But still, there can be a problem if there are many
phone numbers in your phone book, which, for example, end in three zeros, or in some other
combinations of three digits. So, another approach is to just select a random value as the hash
function, a random number between 0 and 999. And then the distribution of hash values will be very
good, probably the longest chain will be short. However, we cannot use such hash function actually
because when we'll call the hash function again to look up the phone number we stored in the phone
book we won't find it because we are looking in the wrong place. Because the value of the hash
function changed because it's not deterministic. So we learned that the hash function must be
deterministic, that is return the same value if given the same phone number as the input each time.
So good hash functions are deterministic. Fast to compute because we do that every time we need to
store something or modify something or find something in our hash table. And they should distribute
the keys well in different cells and have few collisions.
Play video starting at 5 minutes 0 seconds and follow transcript5:00
Unfortunately, there is no universal hash function. More specifically, the Lemma says that if the number
of all possible keys, the size of the universe of keys, is large enough, much larger than the
cardinality of the hash function that we want to use to save memory, then for any specific
deterministic hash function there is a bad input, which results in many, many collisions. Why is that?
Well, let's look at the universe U and select some cardinality for example, 3.
Play video starting at 5 minutes 31 seconds and follow transcript5:31
Then, our universe will be divided into three groups. All the keys that have hash value 0, all the keys
that have hash value 1, and all the keys that have hash value of 2. Now, let's select the biggest of
those groups. In this case, it's the group with hash value of 1. This group will definitely be of size at
least one-third of the whole universe and it can be even bigger. In this case for example, around 42%.
And then, if we take all these keys or a significant part of these keys as an input, they will have the
same hash value. And so, all of them will make collisions between themselves and they will form a
very long chain in the hash table and everything will work very slowly. Of course, if we change the
hash function for this particular input, it will distribute the keys more uniformly among hash values.
But for this particular hash function, this will be a bad input. And for any specific hash function with
any cardinality, we'll be able to select a bad input this way. And in the next video, you will learn how
to solve this problem.
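The pigeonhole argument above is easy to check in code. Here the fixed deterministic hash function is the key modulo 3 (an assumed toy example with cardinality m = 3): picking keys only from the group with hash value 1 produces an input where every key collides with every other.

```python
m = 3

def h(x):
    return x % m        # a fixed deterministic hash function

# All of these keys land in the same group, so in a hash table they
# would all fall into cell 1 and form one long chain.
bad_input = [3 * k + 1 for k in range(10)]   # 1, 4, 7, 10, ...
assert all(h(x) == 1 for x in bad_input)
```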

Universal Family
Hi, in the previous video you learned that for any deterministic hash function, there is a bad input on
which it will have a lot of collisions. And in this video, you will learn to solve that problem. And the
idea starts from, remember when you studied the QuickSort algorithm? At first, you learned that it can
work as slow as n squared time. But then you learned that adding a random pivot to the partition
procedure helps, because now you know that QuickSort works on average in n log n time. And in
practice, it works usually faster than the other sorting algorithms. So we want to use the same
randomization idea here for hash functions. But we already know that we cannot just use a random
hash function because it must be deterministic. So instead, we will first create a whole set of hash
functions called a family of hash functions. And we'll choose a random function from this family to use
in our algorithm. Not all families of hash functions are good, however, and so we will need a concept
of universal family of hash functions. So let U be the universe, the set of all possible keys that we want
to hash. And then a set of hash functions denoted by calligraphic letter H, set of functions from U to
numbers between 0 and m- 1. So hash functions with the same cardinality. Such set is called a
universal family if for
Play video starting at 1 minute 29 seconds and follow transcript1:29
any two keys in the universe the probability of collision is small. So, what does that mean? Our hash
function is a deterministic function, so for any two keys it either has a collision for those two keys or
not. So, what does it mean that the probability of collision for two different keys is small?
It means that if we look at our family calligraphic H, then at most 1/m part of all hash functions in this
family, at most 1/m of them have a collision for these two different keys. And if we select a random
hash function from the family with probability at least one minus one over m, which is very close to
one, there will be no collision for this hash function and these two keys. And of course it is essential
that the keys are different. Because if keys are equal then any deterministic hash function will have
the same value on these two keys. So, this collision property with small probability is only for two
different keys in the universe, but for any two different keys in the universe this property should be
satisfied. It might seem that it is impossible, but later you will learn how to build a universal family of
hash functions in practice.
So how does the randomization idea work in practice? One approach would be to just make one hash
function which returns a random value between 0 and m-1, each value with the same probability.
Then the probability of collision for any two keys is exactly 1/m. But that is not a universal family.
Actually we cannot use this family at all because the hash function is not deterministic and we can
only use deterministic hash functions.
Play video starting at 3 minutes 21 seconds and follow transcript3:21
So instead, we need to have some set of hash functions such that all the hash functions in the set are
deterministic. And then, we will select a random function h from this set of hash functions, and we
will use the same fixed function h throughout the whole algorithm. So that we can correctly find all
the objects that we store in the hash table, for example.
So, there is a Lemma about running time of operations with hash table if we use universal family. If
hash function h is chosen at random from a universal family then on average the length of the longest
chain in our hash table will be bounded by O(1 + alpha), where alpha is the load factor. Load factor is
the ratio of number of keys that we store in our hash table to the size of the hash table allocated.
Play video starting at 4 minutes 22 seconds and follow transcript4:22
Which is the same as the cardinality of the hash functions in the universal family that we use. So, it
makes sense. If the load factor is small it means that we only store a few keys in a large hash table,
and so longest chain will be short.
Play video starting at 4 minutes 38 seconds and follow transcript4:38
But as our table gets filled up, the chains grow. This Lemma says, however, that if we choose a random
function from a universal family, they won't grow too much. On average, the longest chain will still be
of length just O(1 + alpha). And probably that is just a small number, because alpha
Play video starting at 4 minutes 59 seconds and follow transcript4:59
is usually below one, you don't want to store more keys in the hash table than the size of the hash
table allocated. So alpha will be below 1 most of the time and then (1+ alpha) is just two, so this is a
constant actually. So, the corollary is that if h is chosen at random from the universal family, then
operations with hash table will run on average in a constant time.
Play video starting at 5 minutes 24 seconds and follow transcript5:24
Now the question is, how to choose the size of your hash table? Of course, you control the amount of
memory used with m, which is the cardinality of the hash functions and which is equal to the size of
the hash table. But you also control the speed of the operations. So ideally, in practice, you want your
load factor alpha to be between 0.5 and 1. You want it to be below 1 because otherwise you store too
many keys in the same hash table and then everything becomes slow. But also you don't want
alpha to be too small because that way you will waste a lot of memory. If alpha is at least one-half,
then you basically use linear memory to store your n keys and your memory overhead is small. And
operations still run in time, O(1 + alpha) which is a constant time, on average if alpha is between 0.5
and 1.
The question is what to do if you don't know in advance how many keys you want to store in your
hash table. Of course, there is a solution to start with a very big hash table, so that definitely all the
keys will fit. But this way you will waste a lot of memory. So, what we can do is copy the idea you
learned in the lesson about dynamic arrays. You start with a small hash table and then you grow it
organically as you put in more and more keys. Basically, you resize the hash table and make it twice
bigger as soon as alpha becomes too large. And then, you need to do what is called a rehash. You
need to copy all the keys from the current hash table to the new bigger hash table. And of course, you
will need a new hash function with twice the cardinality to do that. So here is the code which tries to
keep the loadFactor below 0.9. And 0.9 is just a number I selected, you could put 1 here or 0.8, that
doesn't really matter. So first we compute the current loadFactor, which is the ratio of the number of
keys stored in the table to the size of the hash table. And if that loadFactor just became bigger than
0.9, we create a new hash table of twice the size of our current hash table. We also choose a new
random hash function from the universal family with twice the cardinality corresponding to the new
hash table size. And then we take each object from our current hash table, and we insert it in the new
hash table using the new hash function. So we basically copy all the keys to the new hash table. And
then we substitute our current hash table with the bigger one and the current hash function with the
hash function corresponding to the new hash table. That way, the loadFactor decreases roughly
twice. Because we added, probably just added one new element, the loadFactor became just a little
more than 0.9. And then we increase the size of the hash table twice while the number of keys stayed
the same, so the loadFactor became roughly 0.45, which is below 0.9, which is what we wanted.
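The rehash procedure just described might be sketched like this. The Table class and the choose_random_hash_function helper are assumptions standing in for the lecture's pseudocode; the hash family used is the integer family ((a·x + b) mod p) mod m introduced later in this lesson.

```python
import random

def choose_random_hash_function(m, p=10_000_019):
    # Pick a random member of the universal family for integers below p.
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

class Table:
    def __init__(self, size):
        self.size = size
        self.chains = [[] for _ in range(size)]
        self.num_keys = 0
        self.h = choose_random_hash_function(size)

    def insert(self, x):
        self.chains[self.h(x)].append(x)
        self.num_keys += 1
        self.rehash()                        # called after every insertion

    def rehash(self):
        load_factor = self.num_keys / self.size
        if load_factor > 0.9:
            new = Table(2 * self.size)       # twice the size, new hash function
            for chain in self.chains:
                for x in chain:              # copy every key to the new table
                    new.chains[new.h(x)].append(x)
            # Substitute the bigger table and its hash function for the current ones.
            self.size, self.chains, self.h = new.size, new.chains, new.h
```

Each individual rehash takes linear time, but since it doubles the table, it happens rarely, which is why the amortized cost per insertion stays constant on average.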

So to achieve that, you need to call this procedure rehash after each operation which inserts
something in your hash table. And it could work slowly when this happens because the rehash
procedure needs to copy all the keys from your current hash table to the new big hash table, and that
works in linear time. But similarly to dynamic arrays, the amortized running time will still be constant
on average because the rehash will happen only rarely. So you reach a certain level of load factor and
you increase the size of our table twice. And then it will take twice longer to again reach too high
value of load factor. And then you'll again increase your hash table twice. So the more keys you put in,
the longer it takes until the next rehash. So rehashes will be really rare, and that's why they won't
influence the running time of your operations significantly.
Hashing Integers

Hi, in the previous video, 


you've learned the concept of universal family of hash functions and you learned 
how to use it to make operations with your hash table really fast.
Play video starting at 11 seconds and follow transcript0:11
However, now we need to actually build a universal family and you will 
start with a universal family for the most important object which is integer number. 
Because any object on your computer is represented as a series of bits or 
bytes, and so you can think of it as a sequence of integer numbers. 
And so first, we need to learn to hash integers efficiently. 
So we will build a universal family for hashing integers. 
But we will look at our example with phone numbers because 
we need to store contacts in our phone. 
So first, we will consider only phone numbers up to length seven and for 
example we will consider phone number 148-2567. 
And again, we'll convert all of those phone numbers 
into integers from zero to the number consisting of seven nines. 
And for example, our selected phone number 
will convert to 1,482,567. 
And then we will hash those integers to which we convert our phone numbers. 
So to hash them, we will need to also choose a big prime number, 
bigger than 10 to the power of 7, for 
example, 10,000,019 is a suitable prime number. 
And we will also need to choose the hash table size which is the same as 
the cardinality of the hash function that we need. 
So now that we selected p and m, we are ready to define universal family for 
integers between 0 and 10 to the power of 7 minus 1. 
So the Lemma says that the following family of hash functions is a universal family.
Play video starting at 2 minutes 0 seconds and follow transcript2:00
What is this family? It is indexed by p, p is the prime number, 10,000,019, in this case that we choose.
Play video starting at 2 minutes 9 seconds and follow transcript2:09
And it also has parameters a and b, so those parameters are different for different hash functions in
these family. Basically, if you fix a and b, you fix a hash function from this hash functions family,
calligraphic H with index p. And x is the key, it is the integer number that we want to hash, and it is
required that x is less than p, that is, x is from 0 to p minus 1. So, to compute the hash value of this
integer x with some hash function, we first make a linear
transform of this x. We multiply it by a, corresponding to this hash function, and add b, corresponding
to this hash function. Then we take the result, modulo our big prime number p.
Play video starting at 3 minutes 9 seconds and follow transcript3:09
And after that, we again take the result modulo the size of our hash table, or the cardinality of the
hash functions that we need. So all these hash functions indexed by a and b will have the same
cardinality m.
Play video starting at 3 minutes 25 seconds and follow transcript3:25
And the size of this hash family, what do you think it is?
Play video starting at 3 minutes 31 seconds and follow transcript3:31
Well, it is equal to p multiplied by p minus 1. Why is that? Because there are p minus 1 variants for a,
and independently from that, there are p variants for b. So the total number of pairs, a and b, is p
multiplied by p minus 1, and that is the size of our universal family. And the Lemma states that it really will
be a universal family for integers between 0 and p minus 1. We will prove this Lemma in a separate,
optional video. And here, we'll look at an example of how this universal family works.

So, for example, we selected hash function corresponding to a = 34 and b = 2, so this hash function h
is h index by p, 34, and 2.
Play video starting at 4 minutes 23 seconds and follow transcript4:23
And we will compute the value of this hash function on number 1,482,567 because this integer
number corresponds to the phone number we're interested in, which is 148-2567. Well,
remember that the p we chose is the prime number 10,000,019. So first, we multiply our number x by
34 and add 2, and after that, we take the result modulo p, that is, modulo 10,000,019, and the result is
407,185. Then we take this result and take it again modulo 1,000, and the result is 185. And so the
value for our selected hash function on number x is 185. And for any other number x, you would do
the same, you would multiply x by 34, add 2, take the result modulo p, then take the result modulo
1,000. And so any value of our hash function is a number between 0 and 999, as we want.
Play video starting at 5 minutes 35 seconds and follow transcript5:35
And if we choose different a and b, instead of 34 and 2, we'll just multiply x by the different a, add the
different b, take the result modulo p, then take the result modulo m, and get the value for our hash function.
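The family from the Lemma can be checked on the worked example above; all the numbers here (p, m, a, b, x, and both intermediate results) are taken directly from the lecture.

```python
p = 10_000_019   # prime number bigger than 10^7
m = 1_000        # cardinality of the family / hash table size

def h(a, b, x):
    # h_{p,a,b}(x) = ((a*x + b) mod p) mod m
    return ((a * x + b) % p) % m

x = 1_482_567                         # phone number 148-2567 as an integer
assert (34 * x + 2) % p == 407_185    # intermediate value from the lecture
assert h(34, 2, x) == 185             # final hash value from the lecture
```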

Hashing Integers

Hi, in the previous video, you've learned the concept of universal family of hash functions and you
learned how to use it to make operations with your hash table really fast.
Play video starting at 11 seconds and follow transcript0:11
However, now we need to actually build a universal family and you will start with a universal family
for the most important object which is integer number. Because any object on your computer is
represented as a series of bits or bytes, and so you can think of it as a sequence of integer numbers.
And so first, we need to learn to hash integers efficiently. So we will build a universal family for
hashing integers. But we will look at our example with phone numbers because we need to store
contacts in our phone. So first, we will consider only phone numbers up to length seven and for
example we will consider phone number 148-2567. And again, we'll convert all of those phone
numbers, we want to start from integers from zero to the number consisting of seven nines. And for
example, our selected phone number will convert to 1,482,567. And then we will hash those integers
to which we convert our phone numbers. So to hash them, we will need to also choose a big prime
number, bigger than 10 to the power of 7, for example, 10,000,019 is a suitable prime number. And
we will also need to choose the hash table size which is the same as the chronology of the hash
function that we need. So now that we selected p and m, we are ready to define universal family for
integers between 0 and 10 to the power of 7 minus 1.
So the Lemma says that the following family of hash functions is a universal family.
What is this family? It is indexed by p, the prime number we chose, 10,000,019 in this case. It also has parameters a and b, and those parameters are different for different hash functions in this family. Basically, if you fix a and b, you fix one hash function from this family, calligraphic H with index p. And x is the key, the integer number that we want to hash, and it is required that x is less than p: it is between 0 and p minus 1. So, to compute the value of this hash function on an integer x, we first make a linear transform of x: we multiply it by the a corresponding to this hash function and add the b corresponding to this hash function. Then we take the result modulo our big prime number p.
After that, we again take the result modulo the size m of our hash table, which is the cardinality of the hash functions that we need. So all the hash functions indexed by a and b have the same cardinality m.
And the size of this hash family, what do you think it is?
Well, it is equal to p multiplied by p minus 1. Why is that? Because there are p minus 1 choices for a, and independently from that, there are p choices for b. So the total number of pairs (a, b) is p multiplied by p minus 1, and that is the size of our universal family. And the Lemma states that this really is a universal family for integers between 0 and p minus 1. We will prove this Lemma in a separate, optional video. Here, we will look at an example of how this universal family works.

So, for example, we selected the hash function corresponding to a = 34 and b = 2; this hash function h is indexed by p, 34, and 2. We will compute the value of this hash function on the number 1,482,567, because this integer corresponds to the phone number we are interested in, 148-2567. Remember that the p we chose is the prime number 10,000,019. So first, we multiply our number x by 34 and add 2, and after that we take the result modulo p, modulo 10,000,019, and the result is 407,185. Then we take this result modulo 1,000, and the result is 185. So the value of our selected hash function on the number x is 185. For any other number x you would do the same: multiply x by 34, add 2, take the result modulo p, then take the result modulo 1,000. And so any value of our hash function is a number between 0 and 999, as we want. And for different a and b, instead of 34 and 2, we would just multiply x by the different a, add the different b, take the result modulo p, take that modulo m, and get the value of the hash function.
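The computation above can be checked with a few lines of Python (the numbers p = 10,000,019, m = 1,000, a = 34, b = 2 are the ones from the lecture; the function name h is just for illustration):

```python
# Universal-family hash for integers: h(x) = ((a*x + b) mod p) mod m
p = 10_000_019   # prime bigger than 10^7, as chosen in the lecture
m = 1_000        # hash table size (cardinality of the hash function)
a, b = 34, 2     # parameters selecting one function from the family

def h(x):
    return ((a * x + b) % p) % m

x = 1_482_567            # phone number 148-2567 as an integer
print((a * x + b) % p)   # 407185, the intermediate value
print(h(x))              # 185, the final hash value
```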

So in the general case, when phone numbers can be longer than seven digits, we first define the maximum allowed length L of a phone number, and again convert all phone numbers to integers, which will range from 0 to 10 to the power of L minus 1, and then hash those integers. To hash them, we choose a sufficiently large prime number p: p must be more than 10 to the power of L for the family to be universal. Otherwise, if we take some p less than 10 to the power of L, there will exist two different integers between 0 and 10 to the power of L minus 1 which differ by exactly p. Then, when we compute the value of some hash function on both these numbers and take the linear transformation of these keys modulo p, the values of those transformations will be the same. And when we then take the result modulo m, the values will again be the same. That means that for any hash function from our family, its value on these two keys will be the same. So there will be a collision for every hash function from the family, but that contradicts the definition of a universal family: for a universal family and two fixed different keys, no more than a 1/m fraction of all hash functions can have a collision on these two keys. In our case, all hash functions have a collision on these two keys, so this is definitely not a universal family. So we must take p greater than 10 to the power of L, and in fact, that is sufficient. Then we choose a hash table of size m, and we use our universal family, calligraphic H with index p. We choose a random hash function from this universal family, and to do that we need to choose two numbers, a and b: a should be a random number between 1 and p minus 1, and b an independent random number from 0 to p minus 1. Once we have selected these two numbers, our hash function is defined completely.
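The random selection described above can be sketched in Python (the function name random_hash_from_family is an illustration, not the course's own code):

```python
import random

def random_hash_from_family(p, m):
    """Pick a random function h_{a,b}(x) = ((a*x + b) mod p) mod m
    from the universal family: a from [1, p-1], b from [0, p-1]."""
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

# For phone numbers of length L = 7, p must exceed 10^7
p, m = 10_000_019, 1_000
h = random_hash_from_family(p, m)
# whichever (a, b) was drawn, every key lands in a bucket 0..m-1
print(all(0 <= h(x) < m for x in range(10**5)))  # True
```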
So now we know how to solve the phone book problem in the direction from phone numbers to names. We first define the longest allowed length L of a phone number. We convert all phone numbers to integers from 0 to 10 to the power of L minus 1. We choose a big prime number, bigger than 10 to the power of L. We choose the size m of the hash table that we want, based on the techniques you learned in the previous video, and then we add the contacts to our phone book, stored as a hash table of size m, hashing them by a hash function randomly selected from the universal family, calligraphic H with index p. That is the solution in the direction from phone numbers to names. This solution takes big O of m memory, and you can control m, and it works on average in constant time if you select m wisely using the techniques from the previous video.
And now we also need to solve our phone book problem in the other direction, from names to phone numbers. That we will do in the next video.

Hashing Strings
Hi. In the previous videos, you learned how to quickly look up a name in your phone book given the phone number. Now we want to learn to solve the reverse problem: given a name, look up the phone number of the corresponding person. To do that, we need to implement a map from names to phone numbers. We can again use hash tables, and we can again use chaining as in the previous sections, but we need to design a hash function that is defined on names. More generally, we want to learn to hash arbitrary strings of characters. By the way, in this video you will also learn how hashing of strings is implemented in the Java programming language. But first, let's introduce a new notation: denote by |S|, the string S enclosed in vertical lines, the length of S. For example, |"a"| is 1, |"ab"| is 2, and |"abcde"| is 5. So now, how do we hash strings? When we are given a string, we are actually given a sequence of characters from S[0] to S[|S| - 1]. We number the characters of a string from 0 in this lecture, and S[i] is the individual character in the i-th position of the string.
I say that we should use all the characters when we compute our hash function of a string. Indeed, if
we don't use the first character, there will be many collisions. For example, if the first symbol of the
string is not used, then the hash value of strings ("aa"), ("ba") and so on, up to ("za") will be the same.
Because however we compute the value of the hash function, it doesn't use the value of the first
character. And if everything else in the strings stays the same, and we only change the first character
that doesn't influence the value of the hash function then the value of the hash function must be the
same. And so there will be a lot of collisions and we want to avoid collisions. So we need to use value
of each of the characters.
Now, we could do a lot of things with the characters; for example, sum the values of all the characters, or multiply them. But we will do something different: a polynomial sum, where we multiply the integer code corresponding to the i-th character of S, which we denote by S[i], the same as the character itself, by x to the power of i. We sum all these terms up and take the value modulo p. So this is a family of hash functions, and the cardinality of all these hash functions is p: any such hash function returns a value from 0 to p minus 1. And how many hash functions are there in this family? There are exactly p minus 1 different hash functions, because to define a hash function from this family you just need to choose the value of x, and x is an integer ranging from 1 to p minus 1.
So how can we implement a hash function from this family?
The procedure PolyHash, which takes as input a string S, a prime number p, and a parameter x, implements a hash function from our family. It starts by assigning the value 0 to the result, the hash value we will return in the end. Then it goes from right to left through the string and computes a new value based on the value of the corresponding character; there is a formula in the code that does exactly that. I will show you by example that what we get in the end by applying this formula is exactly what we want. Basically, we start with a hash value of 0, and if the length of our string S is 3, we start with i equal to the length of S minus 1, which is 2. The current value of hash is 0, so we multiply 0 by x and get 0, then we add the value of S[i], which is S[2], and take it mod p. So after the first iteration of the for loop, we get S[2] mod p.
In the next iteration, i is decreased to 1. We multiply the current value, S[2] mod p, by x, add S[1], and take everything modulo p; what we get is the same as S[1] + S[2] multiplied by x, modulo p. In the last iteration, i is decreased to 0. We multiply the current value by x, which gives S[1] multiplied by x plus S[2] multiplied by x squared; then we also add S[0] to the sum and take everything modulo p. The result is S[0] + S[1] multiplied by x + S[2] multiplied by x squared, exactly as we wanted: a polynomial hash function with prime p and parameter x.
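The procedure just described can be sketched in Python as follows (a sketch, assuming characters are converted to their integer codes with ord; the function name poly_hash mirrors the lecture's PolyHash but the code itself is an illustration):

```python
def poly_hash(s, p, x):
    """Polynomial hash: (S[0] + S[1]*x + ... + S[n-1]*x^(n-1)) mod p,
    computed right to left by Horner's rule, as in the lecture."""
    h = 0
    for c in reversed(s):
        h = (h * x + ord(c)) % p
    return h

p, x = 1_000_000_007, 263
# the result matches the direct polynomial sum for "abc"
direct = (ord('a') + ord('b') * x + ord('c') * x * x) % p
print(poly_hash("abc", p, x) == direct)  # True
```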
And by the way, the implementation of the built-in hashCode method of the class String in Java is very similar to our procedure PolyHash. The only difference is that it always uses x = 31, and for some technical reasons it avoids the modulo-p operation; it just computes the polynomial sum without any modular division. So now you know how a function that is used probably trillions of times a day, by many thousands of different programs, is implemented.
So now about the efficiency of our polynomial family.
First, the Lemma says that for any two different strings s1 and s2 of length at most L + 1, if you choose a random hash function from the polynomial family, by selecting a random value of the parameter x from 1 to p minus 1, then the probability of a collision on these two different strings is at most L divided by p.
That doesn't seem like a good estimate, because L can be big, but it is in your power to choose p: if you choose a very big prime number p, then L over p will be very small. And note that this won't influence the running time of the PolyHash procedure, because the running time of that procedure is big O of the length of S: it depends only on the length of the string, and more or less not on the size of the number p. So if you select a really big prime p, then the probability of collision will be very small and the hash function will still be computed very fast. The idea of the proof of this Lemma is that a polynomial equation of degree L modulo a prime number p has at most L different solutions x. Basically, when we consider two strings S1 and S2, the fact that the hash value of some hash function from the polynomial family is the same for these two strings means that the x corresponding to our hash function is a solution of such an equation. And the fact that the strings are different ensures that at least one of the coefficients of this equation is different from 0, which is essential: if the strings were the same, of course the value of any hash function on them would be the same. But if they are different, then the probability of collision is at most L over p, because there are at most L different values of x for which the hash function can give the same value on these two strings.

Hashing Strings - Cardinality Fix


>> Now we know a polynomial hash family for hashing strings. But there is a problem with that family: all the hash functions in it have cardinality p, where p is a very big prime number, while we want the cardinality of our hash functions to be the same as the size of our hash table. So we want a small cardinality, and we won't be able to use the polynomial hashing family directly in our hash tables; we need to somehow fix the cardinality of the functions in the polynomial family. A good way to do that is the following. We design a new, composite transformation from strings to numbers from zero to m minus one. So we select the cardinality m, and we want to design a function from strings to numbers between zero and m minus one. To get that, we first apply a random hash function from the polynomial family to the string, and we get some integer number modulo p. Then we apply a random hash function from the universal family for integers less than p, selected from a universal family of cardinality m, and get a number between 0 and m minus 1. So we now have a composite transformation which is two-stage: first, take a string and apply a random function from the polynomial family, and then apply a random function from the universal family for integers to the result, and you get a number from zero to m minus 1 from the string. Note that it is very important that we first select both the random function from the polynomial family and the random function from the universal family for integers, and then fix them and use the same pair of functions for the whole algorithm. Then the whole function from a string to an integer between zero and m minus one is a deterministic hash function.
And it can be shown that the family of functions defined this way is a very good family. It is not a universal family, but it is a very good family with [INAUDIBLE]. More specifically, take any two different strings S1 and S2 of length at most L + 1, choose a cardinality m, and apply the process described above to build a hash family from strings of length at most L + 1 to integer numbers between zero and m minus 1.

Then the probability of collision for a random function from that family is at most 1 over m plus L over p. So that is not a universal family, because for a universal family there shouldn't be any summand L over p; the probability of collision should be at most 1 over m. But we can be very, very close to a universal family, because we can control p: we can make p very big, and then L over p will be very small. So the probability of collision will be at most 1 over m plus some very small number, and so it will be either even less than 1 over m or very close to it. So a polynomial hash followed by a universal hash for integers is a good construction of a family of hash functions.
A Corollary of the previous Lemma is that if we specifically select the prime number p to be bigger than m multiplied by L, then the probability of collision will be big O of 1 over m: it won't be less than 1 over m itself, but it will be at most 1 over m multiplied by some constant. Why is that? Because if we rewrite 1 over m plus L over p as 1 over m plus L over mL, the second expression is bigger, since p is bigger than mL, and it is equal to 2 over m, which is big O of 1 over m. So we have proved that the combination of polynomial hashing with universal hashing for integers is a really good family of hash functions.
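The arithmetic behind the corollary (for p > mL) can be written out as:

```latex
\Pr[\text{collision}] \;\le\; \frac{1}{m} + \frac{L}{p}
  \;\le\; \frac{1}{m} + \frac{L}{mL}   % since p > mL
  \;=\; \frac{1}{m} + \frac{1}{m}
  \;=\; \frac{2}{m} \;=\; O\!\left(\frac{1}{m}\right)
```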
Now what if we take this new family of hash functions and apply it to build a hash table?
Well, I say that for a big enough prime number p, we will again have average running time O(1 + alpha): the length of the longest chain will be O(1 + alpha) on average, where alpha is the load factor of our hash table. So by wisely controlling the size of the hash table and the load factor, as we learned in the previous videos, we can control the running time and the memory consumption.
Of course, computing the hash function itself on some string s is not a constant-time operation, because the string can be very long, and we need to go through the whole string to compute the hash function. But in the case when the lengths of the strings in question are bounded, like for example with names, where there are certainly no names longer than a few hundred characters, they are all bounded by some constant L. So computing the hash function on a name takes, of course, time proportional to the length of the string, but that is also constant time, because L is itself a constant. So we can implement a map from names to phone numbers using chaining, using the newly created family of hash functions, which is composite: it first applies polynomial hashing to the string, to the name, and then applies a universal family for integers to the result. We choose a random hash function from this two-stage family and store our names and phone numbers in the hash table using this hash function.
In conclusion, you learned how to hash integers and strings really well, so that the probability of collision is small. You learned that a phone book can be implemented as two maps, as two hash tables: one from phone numbers to names, and another one back, from names to phone numbers. And if you manage to do that in such a way that you don't waste too much memory, with the load factor of your hash table between 0.5 and 1, then search and modification work, on average, in constant time, which is great. In the next lesson, we'll learn to apply hash functions to different problems, such as searching for patterns in text.
Slides and External References

Slides
Download the slides for this lesson:

07_hash_tables_2_hashfunctions.pdf
07_hash_tables_2_proof_universal_family.pdf

References
See the chapter 1.5 in [DPV] Sanjoy Dasgupta, Christos Papadimitriou, and Umesh Vazirani.
Algorithms (1st Edition). McGraw-Hill Higher Education. 2008.

See the chapter 11.3 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

QUIZ • 30 MIN

Hash Tables and Hash Functions


Searching Patterns
Search Pattern in Text

Hi. In this lesson, you will learn about applications of hashing to problems regarding strings and texts. We will consider the problem of finding patterns in text. The problem is: given a long text T, for example a book, a website, or a Facebook profile, and some pattern P, which can be a word, a phrase, or a sentence, find all occurrences of the pattern in the text. For example, you may want to find all occurrences of your name on a website, or find all the Twitter messages about your company to analyze the reviews of your new product. Or you could want to detect all the files on your computer which are infected by a specific computer virus; in that case you won't be looking for letters in text, but for code patterns in the binary code of programs.
Anyway the algorithm will be the same.
First we introduce some new notation, substring notation: we denote by S[i..j] the substring of string S starting in position i and ending in position j; both i and j are included in the substring. For example, if S is the string "abcde", then S[0..4] is the same string "abcde", because we index our characters from zero, so 'a' is the character number zero and 'e' is the character number four. S[1..3] is "bcd", because 'b' is the character with index one and 'd' is the character with index three. And S[2..2] is also allowed; it is a substring of length one, the string "c". And of course i shouldn't be more than j, because otherwise there is no substring from i to j.
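In Python, where slice end indices are exclusive, the lecture's S[i..j] notation corresponds to the slice s[i:j+1]:

```python
s = "abcde"
print(s[0:5])  # "abcde"  (S[0..4], the whole string)
print(s[1:4])  # "bcd"    (S[1..3])
print(s[2:3])  # "c"      (S[2..2], a substring of length one)
```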
So, the formal version of our problem of finding a pattern in text is this: you are given strings T and P as input, and you need to find all positions i in the text T such that pattern P occurs in T starting from position i. That is the same as saying that the substring of T from i to i plus length of P minus one, the substring of T starting from i with length equal to the length of the pattern, is equal to the pattern. So we want to find all such positions i, and of course i can be from zero to length of text minus length of pattern. It cannot be bigger, because otherwise the pattern just won't fit in the text; it would be ending to the right of the end of the text.
So we start with a naive algorithm to solve this problem. Basically, we go through all possible positions i from zero to the difference of the lengths of the text and the pattern, and then for each such position i we just check, character by character, whether the corresponding substring of T starting in position i is equal to the pattern or not. If it is equal to the pattern, we append position i to the result.
First we need to implement a function to compare two strings, and we start by checking whether their lengths are the same or not; of course, if the lengths of the strings are different, then the strings are definitely different. If that's not the case, then the lengths of the strings are equal, and we go through all the positions in both strings, with i going from zero to the length of the first string minus one. If the corresponding symbols in the i-th position differ, then the strings are different; otherwise they are the same.
Now we will use this function to find all occurrences of the pattern in the text.
The procedure FindPatternNaive implements our naive algorithm. We start with an empty list in the variable result, and then we go through all the possible positions where the pattern could start, with i from zero to length of the text minus length of the pattern. We check whether the substring starting at i, with length equal to the length of the pattern, is equal to the pattern itself. If it is, then we append position i to the result, because this is a position where the pattern occurs in the text. In the end, we just return the list that we collected by going through all possible positions of the pattern in the text. I'd say that the running time of this naive algorithm is big O of the length of the text multiplied by the length of the pattern.
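The two procedures just described can be sketched in Python (snake_case names are used in place of the lecture's AreEqual and FindPatternNaive; the logic is the same):

```python
def are_equal(s1, s2):
    # strings of different lengths are definitely different
    if len(s1) != len(s2):
        return False
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False
    return True

def find_pattern_naive(text, pattern):
    result = []
    # try every position where the pattern could fit
    for i in range(len(text) - len(pattern) + 1):
        if are_equal(text[i:i + len(pattern)], pattern):
            result.append(i)
    return result

print(find_pattern_naive("abracadabra", "abra"))  # [0, 7]
```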
Why is that? Well, each call to the function AreEqual runs in time big O of the length of the pattern, because both strings we pass there are of length equal to the length of the pattern, and the running time of AreEqual is linear.
And then we have exactly length of T minus length of P plus one calls of this function, which total to big O of length of T multiplied by length of P, because we always consider that the length of the text is bigger than the length of the pattern, and so this is the upper bound for our running time.
Actually, this is not just an upper bound, it's also a lower bound.
For example, consider a text T which consists of many, many letters 'a', and a pattern P which consists of many, many letters 'a' and then a letter 'b' at the end, and we choose the text to be much longer than the pattern, which is basically almost always true in practical problems. For each position i in T which we try, the call to AreEqual has to make the maximum possible number of comparisons, which is equal to the length of the pattern P. Why is that? Because when we call AreEqual for the substring of T starting in position i and for the pattern P, we see that they differ only in the last character, so AreEqual has to check all of the previous characters until it comes to the last character of P and determines that the pattern is actually different from the corresponding substring of T. Thus, in this case, the naive algorithm will do a number of operations at least proportional to the length of T multiplied by the length of P. So our estimate is not just big O, it is big Theta, which means that it is not only an upper bound but also a lower bound on the running time of the naive algorithm. In the next video we will introduce an algorithm based on hashing which has a better running time.
Rabin-Karp's Algorithm
Hi. In this video, we'll introduce the Rabin-Karp algorithm for finding all occurrences of a pattern in a text. At first it will have the same running time as the naive algorithm from the previous video, but then we'll be able to improve it significantly for practical purposes. So we need to compare our pattern to all substrings S of text T with length the same as the length of the pattern. In the naive algorithm, we just did that by checking, character by character, whether the pattern is equal to the corresponding substring. The idea is that we could use hashing to quickly compare P with substrings of T. How? Let's introduce some hash function h; it is a deterministic hash function. If the value of the hash function on the pattern P is different from the value of this hash function on some string S, then definitely P is not equal to S, because h is deterministic.
However, if the value of the hash function on P is equal to the value of the hash function on S, then P can be equal to S, or it can be different from S if there is a collision. So to check exactly whether P is equal to S or not, we still need to call our function AreEqual(P, S), and this doesn't yet save us any time. But we hope to call AreEqual less frequently, because there will be only a few collisions. We'll use the polynomial hashing family, calligraphic P with index p, with some big prime number p. If the pattern P is not equal to a substring S of the text, then the probability that the value of the hash function on the pattern is the same as its value on the substring is at most the length of the pattern divided by our big prime number p. And we'll choose a prime number p big enough so that this probability is very small. So here is the code of our algorithm RabinKarp; it takes as input a text T and a pattern P.
It starts by initializing the hash function from the polynomial family: we first choose a very big prime number p (we'll talk later about how big it should be), and we also choose a random number x between 1 and p minus 1 to select the specific hash function from the polynomial family.
We initialize our list of positions where the pattern occurs in the text with an empty list.
We also precompute the hash value of our pattern, and we call the PolyHash function to do that.
And then we again need to go through all possible starting positions of the pattern in the text, so we go with i from zero to the difference of the lengths of the text and the pattern.
For each i, we take the substring starting in position i with length equal to the length of the pattern, which is T from i to i plus length of the pattern minus 1, and we compute the hash value of this substring. Then we compare the hash of the pattern and the hash of the substring. If they are different, then definitely P is not equal to this substring, so P doesn't occur in position i, and we don't need to do anything in this iteration; we just continue to the next iteration of the loop without calling AreEqual. However, if the hash values pHash and tHash are equal, then we need to check whether P is really equal to the substring of T starting in position i, or it is just a collision of our hash function. To do that, we call AreEqual and pass it the substring and the pattern. If AreEqual returns true, it means that the pattern is really equal to the corresponding substring of the text, and we append position i to the result, because pattern P occurs in position i in the text T. Otherwise we just continue to the next iteration of our for loop. So this is more or less the same as the naive algorithm, but we have an additional check of hash values, so we are not always calling AreEqual: we call AreEqual only if P is equal to the corresponding substring of T, or if there is a collision. Let's estimate the running time of this algorithm.
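The loop just described can be sketched in Python (a sketch only: the prime p = 10^9 + 7 is an arbitrary illustration, while the lecture will argue that p should be much bigger than the product of the lengths of the text and the pattern; poly_hash is the polynomial hash from earlier, and this version still recomputes each substring hash from scratch, so the speed-up comes later):

```python
import random

def poly_hash(s, p, x):
    h = 0
    for c in reversed(s):
        h = (h * x + ord(c)) % p
    return h

def rabin_karp(text, pattern):
    p = 1_000_000_007                  # a big prime (illustrative choice)
    x = random.randint(1, p - 1)       # random function from the family
    result = []
    p_hash = poly_hash(pattern, p, x)  # precompute the pattern's hash
    for i in range(len(text) - len(pattern) + 1):
        sub = text[i:i + len(pattern)]
        if poly_hash(sub, p, x) != p_hash:
            continue                   # hashes differ: definitely no match
        if sub == pattern:             # equal hashes: verify, collision check
            result.append(i)
    return result

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```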
So first we need to talk about false alarms. We'll call a false alarm the event when P is compared with the substring of T from i to i plus length of P minus 1 inside the AreEqual procedure, but pattern P is actually not equal to this substring. There is a false alarm in the sense that P doesn't occur in the text T starting from position i, but we still called the AreEqual function, and we need to go character by character through P and the substring to establish that they are actually not equal. The probability of a false alarm, as we know from the previous lesson, is at most the length of the pattern over the prime number p which we choose. So, on average, the total number of false alarms will be the number of iterations of our for loop multiplied by this probability.
And so this total number of false alarms can be made very small if we choose a prime number p much bigger than the product of the length of the text and the length of the pattern. Now let's estimate the running time of everything in our code except for the calls to the AreEqual function. The hash value of the pattern is computed in time big O of the length of the pattern.
Hash of the substring corresponding to the pattern is computed in the same big O of length of the
pattern time. And this is done length of text minus length of the pattern plus 1 times because that is
the number of iterations of the for loop.
So the total time to compute all those hash values is big O of length of text multiplied by the length of
the pattern.
Now what about the running time of all calls to AreEqual? Each call to AreEqual takes big O of length of the pattern, because we pass it two strings of length equal to the length of the pattern.
However, AreEqual is called only when the hash value of the pattern is the same as the hash value of the corresponding substring of T, and that means that either P occurs in position i in the text T or there was a false alarm.
And by selecting the prime number to be very big, much bigger than the product of the length of the text and the length of the pattern, we can make the number of false alarms negligible, at least on average. So if q is the number of times that pattern P is actually found in different positions in the text T, then the number of calls to AreEqual is, on average, big O of q, which is the number of times P is really found, plus the quantity length of T minus length of P plus 1, multiplied by length of P and divided by the prime p, which is the average number of times that a false alarm happens. So q plus the number of false alarms is the number of times that we need to actually call the function AreEqual, and the time spent inside each call to AreEqual is proportional to the length of the pattern.
So this is the same as big O of q multiplied by the length of the pattern, because the second summand can be made pretty small, less than 1, if we choose a big enough prime number p, and we are left with only the first summand multiplied by the length of the pattern.
And now the total running time of the Rabin-Karp algorithm in this variant is big O of length of the text multiplied by length of the pattern, plus q multiplied by length of the pattern. But of course, we know that the number of times the pattern occurs in the text is not bigger than the number of characters in the text, because there are only so many different positions where the pattern could start. So this sum is dominated by big O of length of the text multiplied by length of the pattern.
So, this is basically the same running time as our estimate for the naive algorithm. So we haven't
improved anything yet, but this time can be improved for this algorithm with a clever trick. And you
will learn it in the next video.
Optimization: Precomputation

Hi, in this video you will learn to significantly improve the running time of the Rabin-Karp algorithm. And to do so, we'll need to look closer into polynomial hashing and its properties. Recall that to compute a polynomial hash of a string S, we first choose a big prime number p for the polynomial family, then we choose a random integer x from 1 to p minus 1 to select a random hash function from the family. And then the value of this hash function is the polynomial in x whose coefficients are the characters of the string S.
And to compute this hash function on a substring of the text T starting in position i and having the same length as the pattern for which we are looking in the text, we also need to compute a similar polynomial sum. It goes from character number i to character number i plus length of the pattern minus 1, and we need to multiply each character by the corresponding power of x. For example, T[i] will be multiplied by x to the power of zero, because this is the first character of the substring, and the last character will be multiplied by x to the power of length of the pattern minus 1; here is the formula on the slide. And the idea for improving the running time is that the polynomial hash values for two consecutive substrings of the text with length equal to the length of the pattern are very similar, and one of them can be computed from the other in constant time. We introduce a new notation: we denote by H[i] the hash value for the substring of the text starting in position i and having the same length as the pattern.
Now let's look at the example, our text is a, b, c, b, d.
And we need to convert the characters to their integer codes. And let's assume for simplicity that the
code for a is zero, for b is one, for c is two, and for d is three. Then our text is actually 0, 1, 2, 1, 3.
Also, we will assume in this example, that the length of the pattern is three. We don't need to know
the pattern itself, we just fix its length.
So we will need to compute hash values for the substrings of the text of length three. There are three of them: abc, bcb, and cbd. We start with the last one, cbd. To compute its hash value, we first need to write down the powers of x under the corresponding characters of the text.
Then we need to multiply each power of x by the corresponding integer code of the character, and we get 2, x, and 3x squared. Then we need to sum them; we also need to take the value modulo p, but on this slide we will just omit the modulo p, it is assumed in each expression. Now let's look at the hash value for the previous substring of length three, which is bcb. We again need to write down the powers of x under the corresponding integer codes of the characters, and again multiply the powers of x by the corresponding integer codes, getting 1, 2x, and x squared, and we need to sum them. Now note the similarity between the hash value for the last substring of length three and the previous substring of length three: to get the last two terms for bcb, we can multiply the first two terms for cbd by x.
And we will use this similarity to compute the hash for bcb given the hash for cbd. So again, H[2] is the same as the hash value of cbd, because that substring starts at the character with index two, and it's equal to 2 + x + 3x squared.
Now let's compute H[1]. This is the hash value of bcb, and we know it's equal to 1 + 2x + x squared, modulo p.
Now let's rewrite this using this property of multiplication by x of the terms for cbd.
So it's equal to 1 + x multiplied by the first two terms for cbd, which are 2 + x. Now we don't want to use just the first two terms for cbd, we want to use the whole cbd, so we write this as follows: 1 + x multiplied by the whole expression for cbd, but now we need to subtract something to make the equality true. And that something is the last term: x multiplied by 3x squared, which is the same as 3x cubed, so we subtract 3x cubed.
Now we regroup the summands, and we see that this is equal to x multiplied by the hash value for cbd, which is H[2], plus 1, minus 3x cubed.
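The regrouping above can be checked numerically. With the codes a = 0, b = 1, c = 2, d = 3 and a concrete, purely illustrative choice of x and p, H[1] computed directly matches 1 + x·H[2] - 3x cubed:

```python
p, x = 1_000_000_007, 10       # hypothetical values, chosen only for illustration
T = [0, 1, 2, 1, 3]            # codes of "abcbd" with a=0, b=1, c=2, d=3

H2 = (2 + 1 * x + 3 * x**2) % p            # hash of "cbd"
H1_direct = (1 + 2 * x + 1 * x**2) % p     # hash of "bcb", computed directly
H1_recur = (x * H2 + T[1] - 3 * x**3) % p  # 1 + x*H[2] - 3x^3, with T[1] = 1

print(H1_direct == H1_recur)  # -> True
```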
In the general case, there is a very similar formula. Here is the expression for H[i + 1]: notice that the powers of x are in each case j - i - 1, because the substring starts in position i + 1, so we subtract i + 1 from each j in the sum. The expression for H[i] is very similar, but for each power of x we subtract just i from j, because the substring starts in position i. Now let's rewrite this expression so that it is more similar to the expression for H[i + 1]. To do that, we start the summation not from i but from i + 1, and also end it one position later. The first sum is now very similar to the expression for H[i + 1], except that the powers of x are always bigger by one. We also need to add T[i], which is not accounted for in the sum, and we need to subtract the last term, because it is not in the expression for H[i]: that is T[i + length of the pattern] multiplied by x to the power of length of the pattern. Now we notice that the first sum is the same as x multiplied by the value of the hash function for the next substring, H[i + 1], and the second and third terms stay as they are. So we get this recurrence: to compute H[i] if we already know H[i + 1], we multiply it by x, add T[i], and subtract T[i + length of the pattern] multiplied by x to the power of length of the pattern. Notice that T[i] and T[i + length of the pattern] we just know, and x to the power of length of the pattern is a multiplier that we can precompute once and use for each i.
Now let's use this in the pseudocode. Here is the function to precompute all the values of our polynomial hash function on the substrings of the text T with length equal to the length of the pattern, with prime number p and selected integer x. We initialize our answer, H, as an array of length: length of text minus length of pattern plus one, which is the number of substrings of the text with length equal to the length of the pattern. We also initialize S with the last substring of the text of length equal to the length of the pattern, and we compute the hash value for this last substring directly by calling our implementation of the polynomial hash with this substring, prime number p and integer x.
Then we also need to precompute the value of x to the power of length of the pattern and store it in the variable y. To do that, we initialize it with 1 and then multiply it by x, length of the pattern times, taking the result modulo p each time. And then the main for loop, the second for loop, goes from right to left and computes the hash values for all the substrings of the text, except for the last one, for which we already know the answer. So to compute H[i] given H[i + 1], we multiply it by x, then we add T[i], and we subtract y, which is x to the power of length of the pattern, multiplied by T[i + length of the pattern]. And we take the expression modulo p.
And then we just return the array with the precomputed values.
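The pseudocode just described might look like this in Python (a sketch; converting characters to integers with ord is an assumption, and Python's % operator already returns a non-negative result, so no extra correction is needed after the subtraction):

```python
def poly_hash(s, p, x):
    # value of the polynomial hash: sum over j of ord(s[j]) * x^j mod p
    h = 0
    for c in reversed(s):
        h = (h * x + ord(c)) % p
    return h

def precompute_hashes(text, pat_len, p, x):
    n = len(text)
    H = [0] * (n - pat_len + 1)
    # hash of the last substring is computed directly
    H[-1] = poly_hash(text[n - pat_len:], p, x)
    # y = x^pat_len mod p
    y = 1
    for _ in range(pat_len):
        y = (y * x) % p
    # rolling from right to left: H[i] = x*H[i+1] + T[i] - y*T[i+pat_len]
    for i in range(n - pat_len - 1, -1, -1):
        H[i] = (x * H[i + 1] + ord(text[i]) - y * ord(text[i + pat_len])) % p
    return H
```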
So to analyze its running time: the initialization of the array H and of the string S, along with the computation of the hash value of the last substring, takes time proportional to the length of the pattern. The precomputation of x to the power of length of the pattern also takes time proportional to the length of the pattern. And the second for loop takes time proportional to length of the text minus length of the pattern. All in all, it's big O of length of the text plus length of the pattern.



Now again: the polynomial hash is computed in time proportional to the length of the pattern; the first for loop, computing the power of x, also takes time proportional to the length of the pattern; and the second for loop, which goes through all the substrings of the text with length equal to the length of the pattern, makes length of text minus length of pattern iterations. And the total precomputation time is proportional to the sum of the lengths of the text and the pattern. In the next video we'll use these precomputed values to actually improve the running time of the Rabin-Karp algorithm.
Optimization: Implementation and Analysis

Hi, in this video we'll use the precomputed hashes from the previous video to improve the running time of the Rabin-Karp algorithm. And here is the pseudocode. Actually, it is very similar to the pseudocode of the initial Rabin-Karp algorithm, and only a few lines changed. So again, we choose a very big prime number p, and we choose a random number x from 1 to p - 1 to select a random hash function from the polynomial family. We initialize the result with an empty list of positions.
And we compute the hash of the pattern in the variable pHash 
directly using our implementation of polynomial hash.
And then we call the PrecomputeHashes function from the previous video to precompute H, an array with the hash values of all substrings of the text with length equal to the length of the pattern. We need them to check whether it makes sense to compare the pattern to a substring: if their hashes are different, there is no point comparing them character by character, because it means that the pattern is definitely different from the substring.
So then our main for loop goes over all i, the starting positions for the pattern, from 0 to length of text minus length of pattern, as in the previous version of the Rabin-Karp algorithm. And the main thing that changed is that we compare the hash of the pattern not with a hash value computed on the fly, but with the precomputed value of the hash function for the substring starting in position i, H[i]. If they are different, it means that the pattern is definitely different from the substring starting in position i, and we don't need to compare them character by character, so we just continue to the next iteration of the for loop. Otherwise, if the hash value of the pattern is the same as the hash value of the substring, we need to actually compare them for equality, character by character. And to do that, we call the function AreEqual for the substring and the pattern. If they are actually equal, we append position i to the result, the list of all the occurrences of the pattern in the text. Otherwise, we proceed to the next iteration. And in the end, we return the result, the list of all positions in which the pattern occurs in the text.
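The whole optimized algorithm can be sketched in Python (a minimal illustration: the fixed prime p is an assumption, and Python's built-in string comparison plays the role of AreEqual):

```python
import random

def poly_hash(s, p, x):
    # sum over j of ord(s[j]) * x^j mod p, via Horner's rule
    h = 0
    for c in reversed(s):
        h = (h * x + ord(c)) % p
    return h

def precompute_hashes(text, pat_len, p, x):
    n = len(text)
    H = [0] * (n - pat_len + 1)
    H[-1] = poly_hash(text[n - pat_len:], p, x)   # last substring, directly
    y = pow(x, pat_len, p)                        # x^|P| mod p
    for i in range(n - pat_len - 1, -1, -1):
        # H[i] = x*H[i+1] + T[i] - x^|P| * T[i+|P|]  (mod p)
        H[i] = (x * H[i + 1] + ord(text[i]) - y * ord(text[i + pat_len])) % p
    return H

def rabin_karp_fast(text, pattern):
    p = 1_000_000_007                  # big prime (assumed fixed for illustration)
    x = random.randint(1, p - 1)
    result = []
    p_hash = poly_hash(pattern, p, x)
    H = precompute_hashes(text, len(pattern), p, x)
    for i in range(len(text) - len(pattern) + 1):
        if p_hash != H[i]:
            continue                   # definitely no match at position i
        if text[i:i + len(pattern)] == pattern:   # AreEqual
            result.append(i)
    return result

print(rabin_karp_fast("abracadabra", "abra"))  # -> [0, 7]
```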

Let's analyze the running time of this version of the Rabin-Karp algorithm. First, we compute the hash value of the pattern in time proportional to its length. Then we call the PrecomputeHashes function which, as we estimated in the previous video, runs in time proportional to the sum of the lengths of the text and the pattern.
And then the only other thing that we do is compare the hashes, and for some of the substrings we call the function AreEqual. And we already know from the previous videos that the total time spent in AreEqual is, on average, proportional to q multiplied by the length of the pattern, where q is the number of occurrences of the pattern in the text. Why is that? Because we only compare the pattern to a substring if they are equal or if there is a collision. We can choose such a big prime p that collisions have very low probability, and on average they won't influence the running time. So, on average, the total time spent in AreEqual is proportional to q multiplied by the length of the pattern. And then the total average running time is proportional to the length of the text, plus q plus 1 multiplied by the length of the pattern.
And this is actually much better than the time for the naive algorithm, because usually q is very small: q is the number of times you actually found the pattern in the text. If you are, for example, searching for your name on a website, or for an infected code pattern in the binary code of a program, there will be no or only a few places where you actually find it. Their number is q, and it is usually much, much less than the total number of positions in the text, which is the length of the text. So the second summand, q plus 1 multiplied by the length of the pattern, is much smaller than length of the text multiplied by length of the pattern. And if the pattern is sufficiently long, then the first summand is also much smaller than length of the text multiplied by length of the pattern. So we improved our running time, for most practical purposes, very significantly. Of course, it's only on average, but in practice this will work really well.



And to conclude: in this module we studied hash tables and hash functions, and we learned that hash tables are useful for storing sets of objects and mappings from one type of object to another. And we managed to do it in such a way that you can search and modify keys and values of the hash tables in constant time on average. And to do so, you must use good hash families, and you must select random hash functions from those good hash families.
And you also learned that hashes are not only useful for storing something; they are also useful when working with strings and texts, for finding patterns in long texts. And actually there are a lot more applications of hashing, in distributed systems, for example, and in data science. And I'll tell you about some applications in distributed systems in the next few optional videos.

Slides and External References


Slides
Download the slides for this lesson:

07_hash_tables_3_search_substring.pdf (PDF file)

References
See chapters 32.1 and 32.2 in [CLRS]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill, 2009.

Distributed Hash Tables (Optional)


Instant Uploads and Storage Optimization in Dropbox
Hi, in this optional lesson we will learn a bit more about distributed systems. And we will start with some interesting inner workings of online storage services which you probably use, such as Dropbox, Google Drive and Yandex Disk. Have you wondered how a very big file of tens or hundreds of megabytes can be uploaded almost instantly to your Dropbox account? Or maybe you're interested in how Dropbox, Google Drive and Yandex Disk save petabytes of storage space using the ideas from this module on hash tables and hash functions. Or maybe you're interested in distributed systems and distributed storage in general. Then this lecture is for you.
So services like Dropbox and Google Drive use exabytes of storage to store the data of millions and millions of users worldwide.
And there's a very simple idea on how to actually save some of that space and some of the cost. It sometimes happens that users upload the same files. The first user liked a video with cats and uploaded it to his Dropbox account, just to save it and to show his friends. Then another user also loved this video file; he may have named it differently, but still uploaded exactly the same video to his Dropbox account. And then another user also uploaded this video, because this was a viral video and many, many people liked it, and some of them decided to upload it to their user accounts in Dropbox. And then what we can do on the level of the whole Dropbox service is, instead of storing all three copies of the same video, save just one copy and have links from the users' files to this actual, physically stored file. We've just saved 66% of the storage space, because we basically reduced it three times. And if there are large videos which are also very popular, you can save this way a significant portion of the storage space which all the users collectively use in Dropbox to store their files.
So the question is how to actually implement that. When a new file upload comes, you need to determine whether the same file is already in the system or not, and if it is, you just ignore the upload and store a link to the already registered file in the user's account, instead of the real file. There are a few ways to do that, and we'll start with a really simple one: naive comparison. You take the new file that the user wants to upload. You actually upload it to a temporary storage, then you go through all the stored files and compare the new file with each of the stored files, byte by byte. And if there is exactly the same file, you store a link to this file instead of the new file that the user wants to upload.
So there are a few drawbacks to this approach. First, you have to upload the file anyway, so you won't see this miraculous instant upload of large files of hundreds of megabytes. And second, to compare a file of size S with N other files takes time proportional to the product of N and S, and that can be huge, because the number of files in Dropbox or Google Drive is probably on the order of hundreds of billions or even trillions, and the uploaded files are often also very large, like gigabytes. And also, if we use this strategy then, as N grows, as the online storage service grows, the total running time of all uploads will grow as N squared, because each new upload is big O(N), and it takes longer and longer as the number of files increases. So this approach won't work long-term anyway.
So what can we do? The first idea is, instead of comparing the files themselves, to compare their hashes, as in the Rabin-Karp algorithm: compare the hashes of the files first. If the hashes are different, then the files are definitely different. And if there is a file with the same hash, then upload the new file that the user wants to upload to his account, and compare the new file with the old file with the same
hash directly, byte by byte. Still, there are problems with this approach. First, there can be collisions, so we cannot just say that if two files have the same hash value then they are equal and we don't need to store the new file: sometimes different files can have the same hash value, and we will still have to compare the two files. Also, we still have to upload the file to compare it directly, even if the same file is already stored. And we still have to compare with all N already stored files. So what can we do? Another idea is to use several hash functions: if we have two equal files, then even if we compute five different hash functions, their values on these two files will be the same.
So the idea is to choose several different hash functions independently: for example, take functions from the polynomial family with different multipliers x or with different prime numbers p.
And then we compute all the hashes for each file, and if there is a file which is already stored and has all the same hash values, then the new file is probably the same as the file already stored. In this case, we might want to not even upload the new file at all, to save time and make the upload seem immediate. And to do that, we just compute the hashes locally before the upload and only send through the network, which can be slow, the values of the hash functions, which take much, much less space than the initial huge file. So we compute the hash values locally, we send those three or five hash values over the network to the service, they are compared to the hash values of the files already stored, and if there is a file with the same set of hash values, we don't upload our new file. And this is how the instant upload works sometimes, when you try to upload a file which is already stored, but by someone else. Of course, there is a problem with collisions, because collisions can happen even if you have several different hash functions: still, there can be two different files which have the same set of hash values, even for several hash functions. And there are even algorithms which on purpose find two different files that have the same value of a given hash function, if you know for which hash function you are trying to find a collision. However, for hash functions used in practice, collisions are extremely rare and hard to find, and if you use more than one hash function, if you use three or even five, then you probably won't see a collision in a lifetime. So this is actually done in practice: you compute several different hash functions which nobody knows, and then it is so hard to find two files for which all the hash functions have the same values, that a new file is considered to be equal to the old stored file if all its hash values coincide with the hash values of the file already stored.
So we still have an unsolved problem: we need to do N comparisons with all the already stored files. How can we solve this problem? Well, we can precompute the hashes, because when a file is submitted for upload, the hash values for this file are computed anyway. So we can store the addresses of the files
which are already stored in the service in a hash table, and along with the addresses of the files we will store those hash values for each file. So we precompute them and store them, and when we need to search for a new file, we actually only need to search in the hash table, and we need only the values of the hash functions on this file to search for it; we don't need to provide the file itself. So we search for the hash values in the hash table, and if we find some file stored in this hash table with the same set of hash values, then we know that there is already such a file stored in the system.
So the final solution is the following. We choose from three to five different good hash functions, for which it is hard to find collisions, so that we don't see collisions in practice. We store the addresses of the files and the hashes of those files in a hash table, and before we upload a new file, we compute its hashes locally and send them over the network to the service. We check whether there is a file in the hash table with the same hash values, and if all the hashes of some stored file coincide with the hashes of the new file, then the search is successful. In this case we don't even upload the file; we just store a link in the user's account to the existing, already stored file.
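A toy version of this scheme can be sketched as follows (all names here are hypothetical, and standard digests from hashlib stand in for the service's secret hash functions):

```python
import hashlib

# Stand-ins for the "three to five different hash functions" described above.
HASHES = ("md5", "sha1", "sha256")

def fingerprint(data: bytes):
    # tuple of several independent digests, computed on the client side
    return tuple(hashlib.new(name, data).hexdigest() for name in HASHES)

storage = {}   # fingerprint -> address of the stored file (the hash table)

def upload(data: bytes, address: str):
    """Return the address the user's link should point to."""
    fp = fingerprint(data)       # only fp needs to travel over the network
    if fp in storage:
        return storage[fp]       # instant "upload": file already stored
    storage[fp] = address        # genuinely new file: store it
    return address

a = upload(b"cat video bytes", "/files/0001")
b = upload(b"cat video bytes", "/files/0002")   # same content, different user
print(a == b)  # -> True: second user just gets a link to the first copy
```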
So this is how we can do instant upload to Dropbox or Google Drive, and this is actually how they save a lot of space, probably petabytes of space, in their services. However, there is more to it, because it turns out that billions of files are uploaded daily, for example, into Dropbox, and that means that probably around trillions are already stored there. That is just too big for a simple hash table on one computer. And also, millions of users upload their files simultaneously, and this is also too many requests for a single hash table. So you need some more sophisticated solution to cope with those two things; see our next lecture to understand how that problem is solved.

Distributed Hash Tables


Hi. In this video we will learn how to store a whole lot of objects, how to store big data, using
distributed hash tables.
So big data is when you need to store trillions or more objects: for example, trillions of file addresses in Dropbox, or user profiles, or emails in user accounts, for example in Gmail or similar services. And you need fast search and fast access to that data. Hash tables in general are a good solution for that problem, because they give constant-time search and access on average. But for a number of keys on the order of ten to the power of 12, the amount of memory that a single hash table needs becomes too big to fit in one computer, so we need to do something else; we probably need to use more computers. And the solution to this is distributed hash tables. So the first idea for a distributed hash table is the following: just get more computers. Get 1,000 computers; if you are Google or Dropbox, you can do that. Then you will store your data on many computers, and you will do the following. You create a hash table on each of those computers, and then you will partition the data between those computers, so each computer will store its own part of the data. And you need to determine quickly, automatically, and deterministically which computer should store some object O.
And there is a simple way, just compute some hash function of this object, modular 1000, so we get
basically a value from 0 to 999 for each object and that will be the number of the computer which
should store this object. And then you send a request to that computer and search or modify what
you need to do with that object in the local hash table of that computer. And that seems to already
solve our problem because if a new request comes you quickly compute the hash function on the
object and you know where to send your request. And then that computer just looks up in its local
hash table. Each of the local hash tables can be 1,000 times less than the total amount of data stored,
and so it is scalable. If you need more data, you just get more computers and everything works. Still
there are problems with this approach. and the main problem is that computers sometimes break.
And especially if you have a lot of computers, then they break pretty often. For example if a computer
breaks once in two years on average, then if you have 1,000 computers, on average, more than one
computer breaks every day. Because there are less than 1,000 days in two years, and you have 1,000
computers. So what do you do in that case? You don't want to lose your user's data. So you need to
store several copies of the data. So basically you can do it in a way that every computer stores each
part of data. Each part of data should be stored on several computers. And what happens then when
some computer breaks? Well, luckily the data which is stored on this computer is also stored
somewhere else. But if that's the only copy left after this computer broke, you also need to also copy
that data to some other computer, so that it is again stored in several places. And you need to
relocate the data from the broken computer and also sometimes your service grows and you want to
buy more computers. You want to reply faster to your clients and new computers are added to the
cluster. And then this formula, take the hash value of the object modulo 1,000 as the number of the
computer on which your object is stored, no longer works, because the number of computers always
changes. New computers come in, broken computers go out. And so you need something
else.
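The trouble with the hash-modulo-n scheme is easy to see in code. Here is a small sketch (my own illustration, not code from the lecture; the object names and counts are made up) that counts how many objects would have to move to a different computer if the cluster grows from 1,000 to 1,001 machines:

```python
# Sketch: how many objects change computers when "hash mod n"
# goes from n = 1000 to n = 1001?
import hashlib

def server(key: str, n: int) -> int:
    # A stable hash; Python's built-in hash() is randomized per process.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"object-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if server(k, 1000) != server(k, 1001))
print(f"{moved}/{len(keys)} objects would have to be relocated")
```

On a run of this sketch, nearly all objects land on a different computer, which is exactly why the naive formula breaks down and something like consistent hashing is needed.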
And one way to solve this is called consistent hashing. So first, we choose some hash function with
some cardinality m. And we take a circle, a regular circle, and put the numbers from 0 to m minus 1
on the circle in clockwise order. And then each object O is mapped to the point on the circle
corresponding to the hash value of this object, which is from 0 to m - 1, so it always maps to some of
the numbers on the circle. And also, each computer ID is mapped to the same circle: we hash the ID
of the computer, and we get the number of the point to which this computer is mapped. So let's look
at the picture. Here's our circle. And, for
example m is 12. Then we put 12 points around the circle. And we put numbers from 0 to 11 around
the circle. And then objects, such as, for example, the name Steve, can be mapped to some of those 12
points. And if the hash value of Steve is 9, then Steve is mapped to the point with number 9. And also
computers can be mapped to points: for example, this computer with ID 253. If the hash value of
253 is 5, then this computer is mapped to the point 5. So what do we do then?
We make a rule that each object is stored on the so-called closest computer, closest in terms of the
distance along the circle. And in this case, each computer stores all objects falling on some arc, which
consists of all objects which are closer to this computer than to any other computer. Let's again look
at the picture. This is the circle and there are six computers and these computers mapped to some
points on this circle. And then the arcs of the same color as the computers near them, are the sets of
points, which are closer to the corresponding computer than to any other computer. And so each
computer is responsible for some arc of this circle. For all the keys that are mapped to this arc.
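To make the arc rule concrete, here is a minimal consistent-hashing sketch in Python (my own illustration, not the lecture's code; note that, like most real implementations, it assigns each key to the first computer clockwise from it rather than the lecture's closest-in-either-direction rule, but the behavior under churn is the same):

```python
# Minimal consistent-hashing ring (illustrative sketch).
# Node IDs and object keys are hashed onto the same circle [0, 2**32);
# an object is stored on the first node at or after it, clockwise.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted hash points of the nodes on the circle
        self._owner = {}    # hash point -> node id
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        p = ring_hash(node)
        bisect.insort(self._points, p)
        self._owner[p] = node

    def remove_node(self, node: str):
        p = ring_hash(node)
        self._points.remove(p)
        del self._owner[p]

    def get_node(self, key: str) -> str:
        # First node point clockwise from the key (wrapping around);
        # assumes the ring holds at least one node.
        i = bisect.bisect_left(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = HashRing(["computer-1", "computer-2", "computer-3"])
owner = ring.get_node("Steve")   # the computer whose arc "Steve" falls on
ring.remove_node(owner)          # that computer breaks...
backup = ring.get_node("Steve")  # ...and a neighbor's arc takes over the key
```

When a node is removed, only the keys on its own arc move to a neighbor; all other keys stay where they were, which is the whole point of the scheme.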
And so, what happens when computers come in, because new computers are bought, or when
computers are broken? When a computer goes off because it is broken, its neighbors take its data. So
it has two neighbors, and its arc is divided into two parts: one part goes to the right neighbor, and
another part goes to the left neighbor. And when a new computer is added, it takes data from its
neighbors. So it comes in between some two already existing computers, and it takes a part of the arc
of one of them and a part of the arc of another one, and that becomes its arc. So let's look at an example. For
example, the yellow computer breaks and it goes away. And then the green and the blue computer
will take its arc and divide it between themselves. So that's what happens. Another problem which
still needs to be solved is that when some computer breaks, we need to copy or relocate the data.
And how will a node, a computer, know where to send the data that is stored?
Well, we need another rule for that.
We cannot store the addresses of all the other computers on each of the computers because that is
inconvenient. We will have to constantly update that information.
But each node, each computer, will be, so to say, acquainted with a few neighbors. So it will store the
network addresses of some of its neighbors. The rule is that, for any key,
each node will either store this key itself, or it will be acquainted with some other computer
which is closer to this key in terms of the distance on the circle. And that way, if a request comes to
some node, any node in the network, about some key, it can either find this key inside its own
storage, or it will redirect the request to another node which is closer to this key. And that node
will either store the key, or redirect the request to the next node, which is even closer to that key. And in a
finite number of iterations, the request will come to the node that actually stores the key. So
that's the idea. And in practice, what we can do is put the computers, the nodes, on the circle.
And then each node will know its immediate neighbors, its neighbors at distance 2, its neighbors
at distance 4, at distance 8, at distance 16, and so on: for all powers of 2, of course less than half
of n, it will know its neighbors to the right and to the left at that distance.
And it's easier to see on the picture again. So suppose we have many, many nodes. Then the
upper node will have links to its right and left neighbors, to its right and left neighbors at distance
two, to its right and left neighbors at distance four, and so on. So each node will contain a
logarithmic number of links to other nodes, which is much better than storing all the other nodes.
And if we need to get to some key from some node that doesn't contain it, we'll first jump in the
direction where the distance to the key decreases. And we will jump as much as we can: if the
computer at distance eight is closer than our computer to the key, we will jump at least by eight. If
the computer at distance 16 is closer, we'll jump at least by 16. If the computer at distance 32 is farther,
then we'll jump just by 16. In this way, we will always jump by at least half of the distance which
divides us from the computer that stores the key itself. And so, in a logarithmic number of steps, we will
actually get from the current computer to the computer that actually stores our key.
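The power-of-2 jumping can be checked with a toy model (my own sketch, not the lecture's code): put n nodes at positions 0 to n-1 on a circle, let each node link to neighbors at distances 1, 2, 4, 8, ... in both directions, and route greedily:

```python
def route(src: int, dst: int, n: int) -> int:
    """Greedy hops from node src to node dst on a circle of n nodes,
    using only jumps whose length is a power of two."""
    hops, cur = 0, src
    while cur != dst:
        # Signed distance from cur to dst along the circle, in (-n/2, n/2].
        d = (dst - cur) % n
        if d > n // 2:
            d -= n
        # Largest power-of-two jump that does not overshoot the target.
        step = 1
        while step * 2 <= abs(d):
            step *= 2
        cur = (cur + (step if d > 0 else -step)) % n
        hops += 1
    return hops

n = 1024
worst = max(route(0, t, n) for t in range(n))
print(worst)  # stays at or below log2(1024) = 10 hops for any target
```

Each hop covers at least half of the remaining distance, so the hop count is logarithmic in n, matching the argument above.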
And this network of nodes, each of which is acquainted with some of its neighbors, is
called an overlay network. So, in conclusion, distributed hash tables are a way to store Big Data on many,
many computers and access it fast, as if it were on one computer.
Consistent hashing is one way to determine which computer actually owns the data, which computer
stores this particular object. And to do that, consistent hashing uses a mapping of keys and computer
IDs onto a circle. And each computer stores the range of keys on the arc which is closest to this computer
in terms of distance along the circle. And also, an overlay network is used to route the data to and from
the right computer. So when a computer is broken,
first, its data needs to be copied to some other computer, and its neighbors take its data. So the
computer disappears, but its arc is actually divided between the two neighboring
computers. Each of their arcs increases a bit, so together they cover the whole data, and then we
proceed. If a new computer appears, it takes some data from its right neighbor, some data from its
left neighbor, and assembles an arc for itself.
And I hope that after this lecture, you understand how important the data structures we study in this
course are to the modern technological industry, distributed systems, and big data.
Slides and External References
Slides
Download the slides for this lesson:

07_hash_tables_4_distributed_hash_tables.pdf PDF File

References
https://en.wikipedia.org/wiki/Distributed_hash_table

http://stackoverflow.com/questions/144360/simple-basic-explanation-of-a-distributed-hash-table-
dht

https://www.cs.cmu.edu/~dga/15-744/S07/lectures/16-dht.pdf

https://en.wikipedia.org/wiki/Consistent_hashing
Practice Quiz: Hashing

30 min

Programming Assignment: Programming Assignment 3: Hash Tables
Week 5
Data Structures

Binary Search Trees


In this module we study binary search trees, which are a data structure for doing searches on
dynamically changing ordered sets. You will learn about many of the difficulties in accomplishing
this task and the ways in which we can overcome them. In order to do this you will need to learn
the basic structure of binary search trees, how to insert and delete without destroying this
structure, and how to ensure that the tree remains balanced.
Less
Key Concepts
 Describe how balanced binary search trees work
 Analyze the running time of operations with binary search trees
 List the capabilities of binary search trees
 Compare balanced binary search trees with arrays and lists

Less

Binary Search Trees

Video: LectureIntroduction

7 min


Video: LectureSearch Trees

5 min


Video: LectureBasic Operations

10 min

Video: LectureBalance

5 min

Reading: Slides and External References

10 min

AVL Trees

Video: LectureAVL Trees

5 min

Video: LectureAVL Tree Implementation

9 min

Video: LectureSplit and Merge


9 min

Reading: Slides and External References

10 min

Practice Quiz: Binary Search Trees

4 questions

Binary Search Tree


Introduction
Hello everybody, welcome back. Today, we're going to be starting with a new data structures topic. In
particular, we are going to be talking about Binary Search Trees. And today, we're going to be giving
some introduction to the topic and really going to try and do two things. One is to sort of motivate
the types of problems that we want to be able to solve with this new data structure. And secondly,
we'll talk a little bit about why the data structures we already know about are not up to this task and
why we really do need something new. So to begin with let's talk about a few problems that you
might want to solve. So one is you want to search a dictionary. You've got a dictionary and you want
to find all the words that start with some given string of letters. Or similarly you've got a bunch of
emails and you'd like to find all the emails that were sent or received during a given period.
Or maybe you've got a bunch of friends or class or something and you'd like to find the other person
in this class whose height is closest to yours.
Now all of these are examples of what we might call a local search problem. What you want for them
is a data structure that stores a bunch of elements. Each of them has some key that comes
from a linearly ordered set. Something like a word sorted by alphabetical order, or a date, or a height,
or something like that. And we want this data structure to support some operations. Things like range
search, which should return all of the elements whose keys are between two numbers x and y. Or
nearest neighbors, where given another key z, you want to find the things closest to z on either side in
this data structure.
So, for example, if we have such a data structure storing the following numbers, and we wanted to do a
range search for 5 to 12, it should return 6, 7 and 10, the three stored numbers that are
between 5 and 12. If we want the nearest neighbors of 3, we should return 1 and 4, since those are the
closest things we have to 3 on either side.
Now if we just wanted to do that, it turns out you can do it, but in practice, you really want these data
structures to be dynamic. You want it to be possible to modify them. So, two more operations that we
would like to be able to implement are insert and delete. Insert(x) adds a new element with key x, and
Delete(x) removes an element with key x. Fine. So, for example, we have this array. If we want to
insert 3 into it, we do whatever we need to, 3 is now stored in this data structure in addition to
everything else. And then we can delete 10, remove that and we've got slightly different elements
that we're storing.
Which number(s) are returned by the sequence of queries below?

Insert(3)

Insert(8)

Insert(5)

Insert(10)

Delete(8)

Insert(12)

NearestNeighbors(7)

8
10

12

3 + 12
So just to make sure we're on the same page, if you start with such a data structure and it's empty
and you insert 3, and then insert 8, and then insert 5, then insert 10, then delete 8, then insert 12 and
ask for the nearest neighbors of 7 what are you going to return?
Well, if you figure out what the data structure stores at the end of the day: you've inserted 3, 5, 8, 10
and 12, 8 got deleted, so you only have the other four left over. And you want the things closest to
7 among the remaining ones, which would be 5 and 10. So that should be the answer.
Okay, so this is the data structure that we're trying to implement. What can we say about being able
to do it? We've seen a bunch of data structures, maybe one of them will work.
For example, we could try implementing this by a hash table. Hash tables are good at storing and
looking up elements very, very quickly.
Unfortunately, they're hard to search.
You can't really search for all the elements of a hash table in a given range at all. In
some sense, all the hash table lets you do is check whether or not a given element is stored there. You
can't find the elements in a range.
Similarly, nearest neighbors is not really a thing you can do with hash tables. But they are good at
inserts: insertion into a hash table is O(1), as is deletion. But the searching aspect doesn't work
here, so maybe we need something else.
Well, the next thing we can try is an array. And in an array you can do the searches, but they're a little
bit slow. If you want to do a range search on an array, the best you can do is scan through the entire
array, figure out which elements are in the range you want and return those.
Similarly, you can do a nearest neighbors search in O(n) time by scanning through the entire array,
keeping track of the closest things on either side of the query, and then returning the best ones at
the end.
On the other hand, arrays, at least if they're expandable arrays, are still fine with insert and delete. To
insert a new element, you just add it in the next cell at the end. To delete, you can't just
remove an element from the array, because then you'd leave a gap; but you can take the last element and
move it over to fill the gap. Once again, this delete operation is O(1) time. Perhaps more interestingly than
just any array though is a sorted array. Here, we're storing all of our elements in an array, but we're
going to store them in a sorted order.
And the critical thing about this is it allows us to do binary search.
If we want to do a range search, we can do a binary search to find the left end of the range in our
array and that takes logarithmic time. And then scan through until we hit the right end of the range
we want and return everything in the middle. So, the range search here is basically log n time at least
assuming the number of things we actually want to return is small.
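This sorted-array range search is easy to sketch with Python's bisect module (my own sketch of the idea, using the same example numbers as in this lecture): two binary searches find the boundaries in O(log n), and the slice in between is the answer.

```python
import bisect

def range_search(sorted_arr, x, y):
    lo = bisect.bisect_left(sorted_arr, x)   # first index with value >= x
    hi = bisect.bisect_right(sorted_arr, y)  # first index with value > y
    return sorted_arr[lo:hi]

print(range_search([1, 4, 6, 7, 10, 13, 15], 5, 12))  # [6, 7, 10]
```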
Similarly, nearest neighbors is logarithmic time: we do a binary search to find the thing that we're
looking for and return the elements on either side.
Unfortunately, updates to a sorted array are hard. You can't just insert a new element at the end of
the array, because the array needs to remain sorted, and this will generally destroy the
sorted order. If you want to insert 3, it really needs to go between 1 and 4. But you can't really do
that; you can't just add a new cell in the middle of an array. The only way to actually do this is
to put 3 in that slot, and then everything from 4 onwards needs to shift over one cell to make
room.
And so, insertion here is O(n) time which is a lot longer than we want.
Similarly deletions are going to be hard. If you delete an element, you can't just leave a gap, you need
to fill it somehow. You can't just bring an element over from one of the ends to fill the gap, because
that would destroy your sorted structure. So the only way to fill the gap is to sort of take everything
and shift it back over 1 in order to fill things up, and that again takes O(n) time.
A final thing to look at are linked lists. Now here you can do a RangeSearch in O(n) time, you just scan
through the list and find everything in the range. Similarly, nearest neighbors are going to be O(n). Of
course, with linked lists, insertions and deletions are very fast, O(1), at least if you've got a doubly linked list.
These things are very good. Unfortunately, our searches are slow. And even if you make this a sorted
linked list, if you sort of guarantee that everything comes in sorted order, you still can't do better than
linear time for your searches. Because you can't binary search a linked list, even if it's sorted, because
there's no way to sort of jump to the middle of the list and sort of do comparisons.
And so the moral here really is that the data structures we've seen up to this point don't give us this
sort of local search data structure. And so, we're going to need something new. And that's what we're
going to start talking about in the next lecture.

Search Trees

Hello everybody, welcome back. Today, we're going to start talking about binary search trees. In
particular, we're going to talk about what the binary search tree data structure is, how it's
constructed, and the basic properties that need to be maintained.
So last time we came up with this idea of a local search problem, we wanted a data structure to be
able to solve it. And we know that none of the data structures we had seen up till this point were
sufficient to solve the problems that we wanted. But one maybe came closer than the others.
Sorted arrays were okay, in that you could actually do searches efficiently on them. But unfortunately,
you couldn't do updates in any reasonable way. But the fact that these things allowed for efficient
binary searches sort of maybe gives us a good starting point for what we're looking for.
So, what we should look at is, we should really see this operation of binary search. What does it
entail, and what exactly makes it work? And so we all know how a binary search works, right? So
you've got your list of numbers, you pick the one in the middle. You ask: is the thing I am looking for
bigger or less than this? If it's smaller, I look at the middle of the first half of the array
and ask, is it bigger or less than that? If it's larger, I look at the middle of the second half of the
array and ask the same. And I keep on asking these questions, and each time it narrows
down my search space until I get an answer.
But as you'll note, sort of associated to this sort of binary search procedure is a search tree. If you sort
of consider which questions you ask. First, I ask about, is it bigger or less than seven? If it's smaller, I
ask about four. If it's bigger, I ask about 13. If I got four and said it was bigger than four, I'd then ask
about six. And I have this sort of whole tree of possibilities. Every time I ask a question it sort of splits
into two different cases.
And maybe the key idea here is that if you want to do a binary search, instead of doing it on the array,
you could just have this search tree. You start at the top of the tree, at seven. And then you head
down to 4 or 13, depending on where you go, and then you keep going down until you find your
answer.
And so in some sense, the search tree is as good as the array. But while a sorted array, as we saw, was
hard to insert into, the tree is actually a lot easier to work with in that way. And it turns out this
search tree is going to be the thing that allows us to implement these operations in a much better way.
Okay, so what do we need to be the case for the search tree? Well, like all trees, it should have a
root node, and each node can have up to two children. A node has a left side, which is sort of where
you're going to go when you find out that things are smaller than it, and a right side,
which is where you go when things are bigger than it. So, to be a little bit more formal, the tree is
constructed out of a bunch of nodes. Each node is sort of a data type that stores a bunch of things.
Importantly, it stores a key, it stores a value that you're comparing things to. It also should have a
pointer to the parent node and a pointer to the left child and a pointer to the right child.
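In Python, such a node might look like this (a sketch; the field names are my own choices, not mandated by the lecture):

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key        # the value we compare against
        self.parent = parent  # pointer to the parent node (None at the root)
        self.left = None      # pointer to the left child (smaller keys)
        self.right = None     # pointer to the right child (larger keys)
```

All the operations below just walk these pointers, which is what makes the tree so much easier to update than a sorted array.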
And to be a search tree, it needs to satisfy one very critical property. If you look at the key of a node
X, then, well, the stuff on the left should be where you're going if you do a comparison and find the
thing you're looking for is smaller than X. And that means that, all the keys stored on all the nodes in
the left subtree of x, all the descendants of its left child, need to have a smaller key than X does.
And similarly, if you found that something was bigger than X and go to the right, it had better actually
be on the right. And so, the things whose keys are larger than X need to be on the right subtree of X.
So just to review this: we have the following three trees, A, B, and C. Which of these trees
satisfy the Search Tree Property?
Well, it turns out the only correct one is B; B works out. A has this issue that up at the top you've got
this node 4, and on the left side it has everything bigger than 4, and on the right side it has everything
smaller than 4. It's supposed to be the other way around, but if you switched 4's left and right sides,
everything would work out there. Now, case C is a little bit more subtle. There's really only one
problem here. And that's that you have this root node which is a 5. And there's another 4, but 4 is
part of 5's right subtree. And remember, everything on the right subtree of any node has to be larger
than it. And this 4 is smaller. And so other than that one mistake, things are okay there as well. Okay,
so this is the structure. Next time, we're going to talk about how to do basic operations on binary
search trees, give a little bit of pseudocode for how to do these things, and then we'll
have a basic start for this project.
Basic Operations
Hello everybody, welcome back. We're continuing to talk about binary search trees. And today, we're
going to talk about how to implement the basic operations of a binary search tree. So we're going to
talk about this and talk about a few of the difficulties that show up when you're trying it. Okay, so let's
start with searching. And this is sort of the key thing that you want to be able to do on the binary
search tree. And the primary operation that we're going to look at for how to do this is what we're
going to call Find.
Now Find is a function. What it takes is a key k and the root R of a tree. And what it's going to return is
the node in the subtree with R as the root whose key is equal to k.
Okay, that's the goal and the idea is pretty easy. I mean the search tree is set up to do binary searches
on. So what we're going to do is we're going to sort of start at the top of the tree. We're going to
compare 6, the thing we're searching for, to 7. 6 is less than 7 and that means since everything less
than 7 is in its left subtree, we should look in the left subtree.
So we go left. We now compare 6 to 4, the root of this left subtree. 6 is bigger than 4. Everything
bigger than 4 in the place that we're looking is going to be in 4's right subtree, so we head down that
way. We now compare 6 to 6. They're equal, and so we are done with our search.
And so this algorithm is actually very easy to implement recursively. If R.key = k, then we're
done; we just return the node R at the root, and that's all there is to it.
Otherwise, if R.key > k, we need something less than R, so the thing we're looking for should be in the
left subtree, and we recursively run Find on k and R's left child.
On the other hand, if R.key < k, we have to look in the right subtree, so we find k in the subtree of R's
right child.
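Those three cases translate directly into Python (a sketch; this basic version assumes the key is actually present, which is exactly the gap the lecture discusses next):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def find(k, r):
    if r.key == k:
        return r                 # found the key at this node
    if r.key > k:
        return find(k, r.left)   # everything smaller is in the left subtree
    return find(k, r.right)      # everything larger is in the right subtree

# The tree from the example: 7 at the root, 4 and 13 below it, 1 and 6 below 4.
root = Node(7, Node(4, Node(1), Node(6)), Node(13))
assert find(6, root).key == 6
```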
Okay, this works fine as long as the thing we're looking for is in the tree, but what happens if we're
looking for a key that isn't there? Say we're trying to find 5 in this tree. We check: it's less than 7. It's
more than 4. It's less than 6. 6 doesn't have a left child. We have a null pointer; what do we do here?
Well, in some sense we could just return some error saying the thing you were looking for wasn't
there, but we did actually find something useful. We didn't find 5 in the tree, because it's not there,
but we sort of figured out where 5 should have been in the tree if it were there.
And so, if you stop your search right before you hit the null pointer, you can actually return something
useful: you find the place where k would fit in the tree. So it makes a little bit of sense to modify
this Find procedure so that if, say, R.key > k, then instead of just chasing the left pointer, we first
check to see if R actually has a left child. If R's left child isn't null, we can recursively try to find
k in the left subtree. But otherwise, if it is null, we'll just stop early and return R, and do something
similar for the other case. And this sort of means that if we're searching for something that's not in
the tree, we at least give something close to it. Okay, so that's one thing we can do. Another thing
that we might want to do is work with adjacent elements. If
we've got some element in the tree, we might want to find the next element.
And so, in particular, another function we might want, which we will call Next, takes a node N and
outputs the node in the same tree with the next largest key.
And maybe one way to think about this is, instead of searching for the key N has, we should search
the tree for something just a tiny bit bigger than that. Now, if N has a right child, this is kind of
easy. The first bunch of steps lead you to the node N, and then you want to go right, because the
successor is bigger than N. But after you do that, you just keep going left, because the things there,
while a little bit bigger than N, should be as small as possible. And you just keep going until you hit
a node where you can't go left any further. Its left pointer is null, and that's going to be the successor.
Now, this doesn't work if N has no right child, because you can't go right from N. Looking on the left
side of N doesn't work either; everything is going to be smaller there. So instead what you have
to do is go up the tree. You check N's parent, and if its parent is smaller than N as well, you
have to check the grandparent. You just keep going up until you find the first ancestor that's bigger
than N. And once you have that, that will actually be the successor.
So, the algorithm for Next involves a little bit of case analysis. If N does not have a right child, we're
going to run this procedure we call RightAncestor, which goes up until it takes the first step to the right.
Otherwise, we are going to return what we're going to call the LeftDescendant of N's right child, which
means you sort of go left until you can't go left anymore.
Now both of these are easy to implement recursively. For LeftDescendant, if you don't have a left child,
you're done; you return N. Otherwise you take one step left and repeat.
For RightAncestor, you check to see if your parent has a larger key than you; if so, you return your
parent. Otherwise you go up a level and repeat, and just keep going until you find it. And so, putting
these together computes Next.
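Putting the two helpers together, Next might look like this in Python (a sketch; the node fields key, left, right, and parent are the ones from the earlier description, written iteratively rather than recursively for brevity):

```python
class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def left_descendant(n):
    while n.left is not None:     # keep stepping left as far as possible
        n = n.left
    return n

def right_ancestor(n):
    while n.parent is not None and n.key > n.parent.key:
        n = n.parent              # climb while the parent is still smaller
    return n.parent               # None if n already holds the largest key

def next_node(n):
    if n.right is not None:
        return left_descendant(n.right)
    return right_ancestor(n)

# Small tree: 7 at the root, 4 and 13 below it, 1 and 6 below 4.
root = Node(7)
root.left, root.right = Node(4, root), Node(13, root)
root.left.left, root.left.right = Node(1, root.left), Node(6, root.left)
assert next_node(root.left).key == 6        # successor of 4: right, then left
assert next_node(root.left.right).key == 7  # successor of 6: first larger ancestor
```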
Now, it turns out that this range search operation that we talked about before, where you're given two
numbers x and y and the root of the tree and you'd like to return a list of all the nodes whose keys are
between x and y, can be implemented pretty easily using what we already have.
So the idea is, say you want to RangeSearch for everything between 5 and 12. The first thing
you do is search for the first element in that range; in this case, it will be 6. Then you find the
next element, which is 7, and the next element, which is 10, and the next element is 13. It's too big, so
you stop.
So the implementation is pretty easy. We create a list L that's going to store everything that we find,
and we let N be what we get when we try to find the left endpoint x within our tree.
And then, while the key of this node N that we're working at is less than y: as long as the key is bigger
than x, we add this node to our list, and then we replace N by Next(N). We just
iterate through these nodes until they're too big, and then we return L. Okay, so that's how
you do range search.
And nearest neighbors, you can figure it out, it's a similar idea.
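Putting Find and Next together, a range search sketch might look like this. The node class and helper names are my own minimal versions, and I check key >= x so that the search still works when Find lands on a node just below x:

```python
# A sketch of RangeSearch(x, y, root) built from Find and Next,
# following the loop described above.

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def find(k, r):
    # Returns the node with key k, or the node where k would attach.
    if k < r.key and r.left is not None:
        return find(k, r.left)
    if k > r.key and r.right is not None:
        return find(k, r.right)
    return r

def next_node(n):
    if n.right is not None:          # LeftDescendant of the right child
        n = n.right
        while n.left is not None:
            n = n.left
        return n
    while n.parent is not None and n.key > n.parent.key:
        n = n.parent                 # RightAncestor
    return n.parent

def range_search(x, y, root):
    result = []
    n = find(x, root)                # may land on a node just below x
    while n is not None and n.key <= y:
        if n.key >= x:
            result.append(n)
        n = next_node(n)
    return result
```

On the lecture's example shape, searching 5 through 12 visits 6, 7, 10, and then stops when Next produces 13.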
Play video starting at 6 minutes 50 seconds and follow transcript6:50
Now, the interesting things are how do we do inserts and deletes? So, for insertion we want
to be given the key k and the root R of the tree and we'd like to add a node with key equal to k to our
tree.
Play video starting at 7 minutes 3 seconds and follow transcript7:03
And the basic idea is that, unlike with a sorted array, with the tree we can just take a new element and have it hanging off one of our leaves. And this works perfectly well.
Play video starting at 7 minutes 13 seconds and follow transcript7:13
There is a bit of a technical problem here, though. We can't just have it hang off anywhere. I mean, the 3 that we're inserting is smaller than 7, so it needs to be on the left side of 7. And furthermore, there is a whole bunch of other things it needs to satisfy to keep the search property working out.
Play video starting at 7 minutes 30 seconds and follow transcript7:30
But, fortunately for us, this find operation, if we tried to find a node that wasn't in our tree, actually did tell us where that node should belong.
Play video starting at 7 minutes 41 seconds and follow transcript7:41
So to insert we just find our key within R, and that gives us P, and the new node that we want
should be a child of P, on the appropriate right or left side, depending on the comparison between
things.
Play video starting at 7 minutes 55 seconds and follow transcript7:55
And that's that.
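A sketch of this insert, reusing a Find that returns the would-be parent P. Again, the names here are my own:

```python
# A sketch of Insert: Find tells us the would-be parent P, and the
# new key hangs off P on the appropriate left or right side.

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def find(k, r):
    # Returns the node with key k, or the node where k would attach.
    if k < r.key and r.left is not None:
        return find(k, r.left)
    if k > r.key and r.right is not None:
        return find(k, r.right)
    return r

def insert(k, root):
    p = find(k, root)            # P: the node the new key hangs off
    node = Node(k, parent=p)
    if k < p.key:
        p.left = node
    else:
        p.right = node
    return node

def in_order(n):
    # Small checking utility: keys come back in sorted order.
    if n is None:
        return []
    return in_order(n.left) + [n.key] + in_order(n.right)
```

After any sequence of inserts, an in-order traversal should come back sorted, which is a handy way to test that the search property survived.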
Play video starting at 7 minutes 58 seconds and follow transcript7:58
A little bit more difficult is Delete.
Play video starting at 8 minutes 0 seconds and follow transcript8:00
So, here we are just given a node N, and we should remove it from our tree.
Play video starting at 8 minutes 5 seconds and follow transcript8:05
Now there's a problem: we can't just delete the node, because then its parent doesn't have a child, its children don't have a parent, and it breaks things apart. So we need to find some way to fill the gap.
Play video starting at 8 minutes 17 seconds and follow transcript8:17
And there is a natural way to fill this gap. The point is you want to fill the gap with something
nearby in the sorted order, so you try and find the next element, X, and maybe you just take X and fill
the gap that you created by deleting this.
Play video starting at 8 minutes 32 seconds and follow transcript8:32
Unfortunately, there could be a problem. Now, X, because it's the next element, is not going to have a left child, because a left child would be even closer.
Play video starting at 8 minutes 44 seconds and follow transcript8:44
But it might have a right child, and if it does have this right child, then by moving X out of the way, its right child is now going to be orphaned. It's not going to have a proper parent. So in addition to moving X to fill that gap, you have to move X's right child, Y, up to fill the gap that you made by moving X out of the way. But once you do that it's actually perfectly good. You've done a reasonable rearrangement of the tree and removed the node you wanted.
So the implementation takes a little bit of work.
Play video starting at 9 minutes 13 seconds and follow transcript9:13
First you check to see if N has a right child. If its right child is null, then it turns out we're not in this other case, and you can just remove N and promote N's left child, if it has one. So N's left child should now become the child of N's parent instead. Otherwise, we're going to let X be Next(N), and note that X does not have a left child.
Play video starting at 9 minutes 41 seconds and follow transcript9:41
And then we're going to replace N by X, and promote X's right child to fill the gap that we made by moving X out of the way.
Play video starting at 9 minutes 50 seconds and follow transcript9:50
But this all works. Just to review it: if we have the following tree and we're deleting the highlighted node, which of the following three trees do we end up with? Well, the answer here is C. The point is that we deleted 1, so we want to replace it with the next element, which is 2. So we took 2 and put it in the place where 1 was. Now 2's child, 4, needs to be promoted, so 4 now becomes the new child of 6, and everything works nicely in this tree.
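A sketch of this delete in Python. One simplification compared to the lecture's description: instead of physically relinking the successor node X, this version copies X's key into N's slot and then splices X out, which produces the same tree shape:

```python
# A sketch of Delete.  The lecture "moves X"; here we copy X's key
# into N and splice X out, which has the same effect on the keys.

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def replace_in_parent(node, child):
    # Hook `child` into `node`'s place under node's parent.
    if child is not None:
        child.parent = node.parent
    p = node.parent
    if p is not None:
        if p.left is node:
            p.left = child
        else:
            p.right = child

def delete(n, root):
    # Returns the (possibly new) root of the tree.
    if n.right is None:
        replace_in_parent(n, n.left)     # promote N's left child
        return n.left if n is root else root
    x = n.right                          # X = Next(N): leftmost node
    while x.left is not None:            # of the right subtree, so it
        x = x.left                       # has no left child
    n.key = x.key                        # X's key fills the gap left by N
    replace_in_parent(x, x.right)        # promote X's right child
    return root

def in_order(n):
    if n is None:
        return []
    return in_order(n.left) + [n.key] + in_order(n.right)
```

After a delete, an in-order traversal should still be sorted and simply be missing the deleted key.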
Play video starting at 10 minutes 26 seconds and follow transcript10:26
Okay, so that's how to implement some basic operations for binary search trees. Next time we're going to talk about the run time of these operations, which is going to lead us to some interesting ideas about the balance of these trees.
Balance
Hello, everybody, welcome back. 
We're continuing to talk about binary search trees and today, 
we're going to talk about balance.
Play video starting at 8 seconds and follow transcript0:08
In particular, we're actually going to look at sort of the basic runtime 
of these operations that we talked about in the last lecture and from that, 
we're going to notice that they'll sometimes be a little bit slow. 
And to combat this, we want to make sure that our trees are balanced and 
well, doing that's a bit tough and 
we're going to talk a little bit about how we're going to do this with rotations. 
Okay, so first off we've got this great operation, it can do local searches, 
but how long do these operations take?
Play video starting at 39 seconds and follow transcript0:39
And maybe a key example is the Find operation. 
So we'd like to find 5 in the following tree. 
We compare it to 7, and it's less than 7, and it's bigger than 2, and 
it's bigger than 4, and it's less than 6, and it's equal to 5, and we found it. 
And you'll note that the amount of work that we did we sort of had to traverse 
all the way down the tree from the root to the node that we're searching for and 
we had to do a constant amount of work at every level. 
So the number of operations that we had to perform was O of the Depth of the tree.
Now, just to sort of make sure we get this: if we have the following tree, we could be searching for different nodes A, B, C or D. Which ones are faster to search for than which others? Well, A is the fastest. It's up at the root. Then D has depth only three, B has depth four, and C has depth five, and so A is faster than D, which is faster than B, which is faster than C, probably.
Play video starting at 1 minute 37 seconds and follow transcript1:37
So the runtime is the depth of the node that you're looking for. But it's unfortunate, the depth can
actually be pretty bad. In this example, we have only ten nodes in the tree, but this 4 at the bottom
has depth 6. In fact, if you think about it, things could be even worse. A tree on n nodes could have
depth n if the nodes are just sort of strung out in some long chain.
Play video starting at 2 minutes 1 second and follow transcript2:01
And so this is maybe a problem: if our searches only work in O(n) time, we're not any better than any of these other data structures that didn't really work. On the other hand, even though the depth can be very bad, it can also be much smaller. This example has the same ten nodes in it, but the maximum depth is only four. And so by rearranging your tree, maybe you can make the depth a lot smaller.
Play video starting at 2 minutes 28 seconds and follow transcript2:28
And in particular, what you realize is, well, in binary search, in order for the questions that we asked to be efficient, we wanted to guess the thing in the middle. Because then, no matter which answer we got, we cut our search space in two. And what this means for a binary search tree is that at any node, you're asking that one question: the things on the left and the things on the right, those two subtrees, should have approximately the same size.
Play video starting at 2 minutes 53 seconds and follow transcript2:53
And this is what we mean by balance. And if you're balanced, suppose that you're perfectly balanced,
everything is exactly the same size, then this is really good for us. Because it means that each subtree
has half the size of sort of the subtree of its parent.
Play video starting at 3 minutes 9 seconds and follow transcript3:09
And that means after you go down, logarithmically, many levels the subtrees have size one and you're
just done.
Play video starting at 3 minutes 16 seconds and follow transcript3:16
And so, if your tree is well balanced, operations should run in O(log(n)) time, which is really what we
want.
Play video starting at 3 minutes 24 seconds and follow transcript3:24
But there's a problem with this, that if you make insertions they can destroy your balance properties.
We start with this tree, it's perfectly well balanced and just has one node I guess but we insert two
and then we insert three and then we insert five and then we insert four and you'll note that suddenly
we've got a very, very unbalanced tree. But all we did were updates. So, somehow we need a way to
get around this. We need a way to do updates without unbalancing the tree.
Play video starting at 3 minutes 55 seconds and follow transcript3:55
And the basic idea for how we're going to do this, is we're going to want to have some mechanism by
which we can rearrange the trees in order to maintain balance.
Play video starting at 4 minutes 4 seconds and follow transcript4:04
And there's one problem with this, which is that however we rearrange the tree, we have to maintain
the sorting property. We have to make sure that it's still sorting correctly or none of our other
operations will work. And well, there's a key way to do this, and this is what's known as rotation.
Play video starting at 4 minutes 24 seconds and follow transcript4:24
The idea is you've got two nodes, X and Y, where X is Y's parent, and there's a way to switch them so that instead Y is X's parent. And the sub-trees A, B, and C that hang off of X and Y, you need to rearrange a little bit to keep everything still sorted. But this is a very local rearrangement; you can go back and forth, and it keeps the sorting structure working while rearranging the tree in some hopefully useful way.
Play video starting at 4 minutes 53 seconds and follow transcript4:53
So just to be clear about how this works, it takes a little bit of bookkeeping, but that's about it. You let P be the parent of X, Y be X's left child, and B be Y's right child.
Play video starting at 5 minutes 3 seconds and follow transcript5:03
And then what we're going to do is just reassign some pointers. P is the new parent of Y, and Y is its child; Y is the new parent of X, and X is its right child; X is the new parent of B, and B is X's new left child. And once you've rearranged all those pointers, everything actually works. This is a nice constant-time operation, and it does some useful rearrangement.
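The pointer bookkeeping for this right rotation can be sketched as follows. The node class is my own minimal version:

```python
# A sketch of RotateRight: P is X's parent, Y is X's left child,
# and B is Y's right child, exactly as in the description above.

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None

def rotate_right(x):
    p, y = x.parent, x.left
    b = y.right
    # P is the new parent of Y
    y.parent = p
    if p is not None:
        if p.left is x:
            p.left = y
        else:
            p.right = y
    # Y is the new parent of X; X is Y's right child
    y.right, x.parent = x, y
    # X is the new parent of B; B is X's new left child
    x.left = b
    if b is not None:
        b.parent = x
    return y          # the new root of this subtree

def in_order(n):
    if n is None:
        return []
    return in_order(n.left) + [n.key] + in_order(n.right)
```

A good sanity check is that the in-order traversal, and hence the sorting property, is unchanged by the rotation.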
Play video starting at 5 minutes 27 seconds and follow transcript5:27
So what we really need to do though is we need to have a way to sort of use these operations to
actually keep our tree balanced and we're going to start talking about how to do that next time when
we discuss AVL trees.

Slides and External References

Slides
Download the slides for this lesson:

08_binary_search_trees_1_intro.pdfPDF File
08_binary_search_trees_2_binary_search_trees.pdf PDF File
08_binary_search_trees_3_basic_ops.pdf PDF File
08_binary_search_trees_4_balance.pdfPDF File

References
See the chapter 12 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford
Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.
AVL Trees
Hello everybody, welcome back. 
We're still talking about binary search trees but 
today we're going to talk about AVL trees. 
And AVL trees are just sort of a specific way of maintaining balance in your 
binary search tree. 
And we're just going to talk about sort of the basic idea. 
Next lecture we're going to talk about how to actually implement them.
Play video starting at 20 seconds and follow transcript0:20
But okay, what's the idea? 
We learned last lecture that in order for 
our search operations to be fast, we need to maintain balance of the tree. 
But before we can do that we first need a way to measure the balance of the tree so 
that we can know if we're unbalanced and know how to fix it.
Play video starting at 38 seconds and follow transcript0:38
And a natural way to do this is by what's called the height of a node. 
So if you have a node in the tree, 
its height is the maximum length of a path from that node to a leaf of your tree.
Play video starting at 51 seconds and follow transcript0:51
Fair enough. 
So for 
example if we have the following tree, what's the height of the highlighted node?
Play video starting at 59 seconds and follow transcript0:59
Well, this node has height six: the following path of length six leads down from this node, and it turns out there is nothing longer.

So, we can define height recursively in a very easy way. If you have a leaf, its height is one, because you're just there and you can't go any further.
Play video starting at 1 minute 18 seconds and follow transcript1:18
Otherwise, well, the longest path downwards you can have is either through the longest path on your left side or the longest path on your right side. So you want to take the maximum of the height of your left child and the height of your right child, and then you need to add one to that, because N itself gets added to this path.
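That recursive definition translates almost directly into code. Here a missing child contributes 0, so a leaf comes out with height 1, matching the lecture's convention:

```python
# The recursive height definition, sketched directly.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None

def height(n):
    # A missing subtree has height 0; a leaf therefore has height 1.
    if n is None:
        return 0
    return 1 + max(height(n.left), height(n.right))
```
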
Play video starting at 1 minute 37 seconds and follow transcript1:37
Okay, that's fine. Now, in order to actually make use of this height, we are going to want to add a new field to our nodes. So, the nodes that made up our tree previously stored a key and pointers to the parent, the left child, and the right child, and now they also need to store another piece of data: the height of the node.
Play video starting at 1 minute 55 seconds and follow transcript1:55
And note that we are actually going to have to do some work, and we'll talk a little bit about how to
do this later, to insure that this height field is actually kept up to date. We can't just store it as a
number and leave it there forever. If we rearrange the tree, we might need to change its heights.
In any case, back to balance. Height is a very rough measure of the size of a sub-tree.
Play video starting at 2 minutes 18 seconds and follow transcript2:18
For things to be balanced, we want the size of the two sub-trees of the left and right children of any
given node to be roughly the same.
Play video starting at 2 minutes 25 seconds and follow transcript2:25
And so there's an obvious way to do this. We'd like to force the heights of these children to be
roughly the same. So the AVL property is the following.
Play video starting at 2 minutes 35 seconds and follow transcript2:35
For all nodes N in our tree, we would like it to be the case that the difference between the height of
the left child and the height of the right child is at most one.
Play video starting at 2 minutes 45 seconds and follow transcript2:45
And we claim that if you can maintain this property for all nodes in your tree, this actually ensures
that your tree is reasonably well balanced.
Play video starting at 2 minutes 55 seconds and follow transcript2:55

Okay and so, really what we'd like to know is that if you have the AVL property on all nodes, then the
total height of the tree should be logarithmic. It should be O(log(n)). So basically what we want to say
is that you have an AVL tree and it doesn't have too many nodes. Then the height is not too big.
Play video starting at 3 minutes 15 seconds and follow transcript3:15
But it turns out that the easier way to get at this is to turn this on its head. We want to show instead
that if you have an AVL tree and the height isn't too big, then you can't have too many nodes.
Play video starting at 3 minutes 28 seconds and follow transcript3:28
And this we can do.
So we're going to prove the following theorem. Suppose that you have an AVL tree, a tree satisfying
the AVL property, and N is a node of this tree with height h.
Play video starting at 3 minutes 41 seconds and follow transcript3:41
Then the claim is that the sub-tree of N has to have size at least the Fibonacci number F_h. And so just to review, we talked about Fibonacci numbers back in the introductory unit of the previous course in this sequence. This is just a sequence of numbers: the zeroth one is zero, the first is one, and thereafter each Fibonacci number is the sum of the previous two. Now, these are a nice predictable sequence; they grow pretty fast, and the nth Fibonacci number is at least two to the n over two for all n at least 6.
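The stated growth fact can at least be spot-checked numerically. This is a sanity check, not a proof:

```python
# Spot-check that F_n >= 2^(n/2) for n >= 6, using the standard
# recurrence F_0 = 0, F_1 = 1, F_n = F_{n-1} + F_{n-2}.

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in range(6, 60):
    assert fib(n) >= 2 ** (n / 2)
```

The bound is tight at n = 6, where F_6 = 8 = 2^3.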
Play video starting at 4 minutes 17 seconds and follow transcript4:17
Okay so let's look at the proof, we're going to do this by induction on the height of our node.
Play video starting at 4 minutes 22 seconds and follow transcript4:22
If the node we're looking at has height one, it's a leaf, and its sub-tree has one node, which is the first Fibonacci number. Great.
Play video starting at 4 minutes 32 seconds and follow transcript4:32
Next, the inductive step: if you've got some node of height h, then by the definition of height, at least one of your two children needs to have height h minus one. And by the AVL property, your other child needs to have height at least h minus two. So by the inductive hypothesis, the total number of nodes in this tree is at least the sum of the (h-1)st Fibonacci number plus the (h-2)nd Fibonacci number, which equals the hth Fibonacci number. And so that completes the proof.
Play video starting at 5 minutes 6 seconds and follow transcript5:06
So what does this mean? It means that if a node in our tree has height h, the sub-tree of that node has size at least two to the h over two. But if our tree only has n nodes, two to the h over two can't be more than n. So the height can't be more than two log base two of n, which is O(log n).
Play video starting at 5 minutes 28 seconds and follow transcript5:28
And so the conclusion is: if we can maintain the AVL property, you can perform all of your find and related operations in O(log(n)) time. And so next lecture we're going to talk about how to maintain this property, but this is the key idea. If we can maintain this property, we have a balanced tree and things should be fast. So I'll see you next time as we discuss how to ensure that this happens.

AVL Tree Implementation


Hello, everybody, welcome back. Today, we're going to continue talking about AVL Trees, and in
particular, we're going to talk about the actual implementation and what goes into that. So, as you
recall, the AVL Tree was this sort of property that we wanted our binary search tree to have, where
we needed to ensure that for any given node, its two children have nearly the same height. So the
following is an ideal tree everything's labelled by their height, it all works out. Now, there's a problem
that if we update this tree it can destroy this property.
Play video starting at 36 seconds and follow transcript0:36
So if we try to add a new node where the blue node is, then what happens is, a bunch of nodes in the
tree, their heights change because now they have a longer path which leads to this new node. And
now there are a couple locations at which the AVL property fails to hold. So, in other words, we need
a way to correct this issue.
Play video starting at 59 seconds and follow transcript0:59
And there is one thing that actually helps a little bit here, which is that when we do an insertion, the only heights of nodes that change are along the insertion path. The only time a height can get bigger is because a new path from that node to a leaf ends at the leaf you just added. So we only need to worry about nodes on this path, but we do actually need to worry. Okay, just to sort of review what this is: we have this AVL tree, and we want to insert a new node at either A, B, C or D.
Play video starting at 1 minute 32 seconds and follow transcript1:32
Which one of these will require us to do some rebalancing?
Play video starting at 1 minute 38 seconds and follow transcript1:38
It turns out that D is the only place where we have a problem. If you insert at D, it changes a bunch of these heights, and that destroys the AVL property. The other inserts, it turns out, would be fine.
Play video starting at 1 minute 50 seconds and follow transcript1:50
Okay, so let's actually talk about how to do this. So we need a new insertion algorithm that involves
some rebalancing of the tree in order to maintain our AVL property.
Play video starting at 2 minutes 1 second and follow transcript2:01
And the basic idea of the algorithm is pretty simple. First you just insert your node as you would
before. You then find the node that you just inserted and then you want to run some rebalance
operation. And this operation should start down at N and should probably work its way all the way up
to the root, sort of following parent pointers as you go. Just to sort of make sure that everything that
could have been made unbalanced has been fixed, and we're all good.
Play video starting at 2 minutes 29 seconds and follow transcript2:29
So the question is how do we actually do this rebalancing?
Play video starting at 2 minutes 34 seconds and follow transcript2:34
And, well, the idea is the following.
Play video starting at 2 minutes 37 seconds and follow transcript2:37
At any given node, if the height of your left child and the height of your right child differ by at most 1,
you're fine, you're already satisfied the AVL property.
Play video starting at 2 minutes 48 seconds and follow transcript2:48
On the other hand it could be the case that your children's heights differ by more than one. In that
case you actually do need to do some rearranging. If your left child is two taller than your right, you
need to fix things and probably what you need to do is move your left child higher up in the tree
relative to your right to compensate for the fact that it's bigger. Fortunately for us, you can actually show that with these inserts, the height difference is never going to be more than 2, and that simplifies things a little bit. Okay, so the basic idea is the following. In order to rebalance N, first we
need to store P, the parent of N just because we're going to, after we fix N, we're going to want to fix
things at P, and so on recursively.
Play video starting at 3 minutes 31 seconds and follow transcript3:31
Now, if the height of N's left child is bigger than the height of its right child by more than one, we
need to rebalance right-wards. If the height of the right child is bigger than the height of the left child
by more than one we need to rebalance left-wards.
Play video starting at 3 minutes 46 seconds and follow transcript3:46
Then after that, no matter what happens, we maybe need to readjust the height of N, because the
height field that was stored might be inaccurate if we inserted things below it.
Play video starting at 3 minutes 56 seconds and follow transcript3:56
And then, if the parent that we stored wasn't the null pointer, that is, if we weren't already at the root, we need to go back up and rebalance the parent recursively.
Play video starting at 4 minutes 8 seconds and follow transcript4:08
So quickly, this AdjustHeight function, this sort of just fixes the number that we're storing in the
height field. All it does is we sort of set the height to be one plus the maximum of the height of the
left child and the height of the right child. Just given by this recursive formula we had for the height.
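AdjustHeight is small enough to sketch directly; it just recomputes the stored field from the children's stored fields, with a missing child counting as height 0. The names here are my own:

```python
# AdjustHeight: recompute the stored height field from the children.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.height = 1           # a fresh node is a leaf

def stored_height(n):
    # A missing child contributes height 0.
    return n.height if n is not None else 0

def adjust_height(n):
    n.height = 1 + max(stored_height(n.left), stored_height(n.right))
```
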

Okay! But there's a key thing we still haven't really touched: we need to figure out how to do the rebalancing. So you have a node whose left child is heavier than its right child; its left child's height is exactly two more.
Play video starting at 4 minutes 39 seconds and follow transcript4:39
And the basic idea is the left child is bigger, it needs to be higher up, so we should just rotate
everything right.
Play video starting at 4 minutes 46 seconds and follow transcript4:46
And it turns out that in a lot of cases this is actually enough to solve the problem. There is one case
where it doesn't work. So B is the node we're trying to rebalance. A is its left child which is too heavy,
and we're going to assume that A is too heavy because its right child has some large height, n+1. The
problem is that if we just rotate B to the right, then this thing of height, n+1, switches sides of the tree
when we perform this rotation. Switches from being A's child to being B's child. And when we do this
we've switched our tree from being unbalanced at B to being unbalanced at A now, in the other
direction.
Play video starting at 5 minutes 28 seconds and follow transcript5:28
And so, just performing this one rotation doesn't help here.
Play video starting at 5 minutes 32 seconds and follow transcript5:32
In this case the problem is that A's right child, which we'll call X, was too heavy. So the first thing we
need to do is make X higher up. So what you can do is, instead of just doing this rotation at B, first we
rotate A to the left once, then we rotate B to the right once. And then you can do some case analysis
and you figure out after you do this you've actually fixed all the problems that you have.
Play video starting at 5 minutes 58 seconds and follow transcript5:58
And it's good.
Play video starting at 6 minutes 0 seconds and follow transcript6:00
The operation for rebalancing right is: you let M be the left child of N, and then you check to see if we're in this other case. If M's right child has height more than M's left child, then you rotate M to the left. Then, no matter what you did, you rotate N to the right. And then, no matter what you did, all the nodes that you rearranged in this procedure, you need to adjust their heights to make sure that everything works out. Once you do this, it rebalances things at that node properly, sets all the heights to what they should be, and it's good.
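A sketch of this rebalancing logic in Python, with my own naming. The rotation helpers follow the usual pointer bookkeeping and readjust the stored heights of the two nodes they move:

```python
# RebalanceRight with the zig-zag case: if M = N.left is itself
# right-heavy, rotate M left first, then rotate N right.

class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None
        self.height = 1

def h(n):
    return n.height if n is not None else 0

def adjust_height(n):
    n.height = 1 + max(h(n.left), h(n.right))

def _relink(p, old, new):
    # Make `new` take `old`'s place under parent p.
    if new is not None:
        new.parent = p
    if p is not None:
        if p.left is old:
            p.left = new
        else:
            p.right = new

def rotate_right(x):
    y = x.left
    _relink(x.parent, x, y)
    x.left = y.right                  # Y's right subtree moves to X
    if x.left is not None:
        x.left.parent = x
    y.right, x.parent = x, y
    adjust_height(x)
    adjust_height(y)

def rotate_left(x):                   # mirror image of rotate_right
    y = x.right
    _relink(x.parent, x, y)
    x.right = y.left
    if x.right is not None:
        x.right.parent = x
    y.left, x.parent = x, y
    adjust_height(x)
    adjust_height(y)

def rebalance_right(n):
    m = n.left
    if h(m.right) > h(m.left):
        rotate_left(m)   # zig-zag: lift M's heavy right child first
    rotate_right(n)
```

On the classic zig-zag chain 3 → 1 → 2, a single right rotation would not help, but rotate-left-then-right leaves 2 on top with 1 and 3 as balanced children.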
Play video starting at 6 minutes 36 seconds and follow transcript6:36
Okay, so that's how insert works. Next, we need to talk about delete. And the thing is, deletions can also change the balance of the tree. Remember, generally what we do for deletions is we remove the node, but we replace it by its successor and then promote the successor's child.
Play video starting at 6 minutes 54 seconds and follow transcript6:54
And the thing to note is that when you do this, in the spot in the tree where the successor was, the height of the node at that location decreased by one. Because instead of having the successor and then its child and so on, you just have the child and so on.
Play video starting at 7 minutes 9 seconds and follow transcript7:09
And this of course can cause your tree to become unbalanced, even if it were balanced beforehand. So we of course need a way to fix this, but there's a simple solution. You delete the node N as before. You then let M be the left child of the node that replaced N, the thing that might have unbalanced the tree. And then you run the same rebalance operation that we did for our insertions, starting at M and then filtering its way up the tree. And once you've done that, everything works.
Play video starting at 7 minutes 41 seconds and follow transcript7:41
And so what we've done is we've shown that you can maintain this AVL property, and you can do it pretty efficiently; all of our rebalancing work was only O(1) work per level of the tree. And so if you can do all of this, we can actually perform all of our basic binary search tree operations in O(log n) time per operation, using AVL trees. And this is great. We really do have a good data structure now for these local search problems.
Play video starting at 8 minutes 8 seconds and follow transcript8:08
So that's all for today. Coming next lecture, we are going to talk about a couple of other useful operations that you can perform on binary search trees.

Split and Merge


Hello everybody, welcome back. Last time, we talked about AVL trees and showed that they can
perform your basic binary search tree operations. But now we're going to introduce a couple of new
operations and talk about how to implement them. So, another useful feature of binary search trees
is in addition to being able to search them, there's also a bunch of sort of interesting ways that you
could recombine them.
Play video starting at 25 seconds and follow transcript0:25
And so we're going to discuss here two of these operations. One of them is merge, which takes two
binary search trees and combines them into a single one. And the other one is split, which takes one
binary search tree and breaks it into two.
Play video starting at 39 seconds and follow transcript0:39
So, let's start with merge. And the idea is, in general, if you have two sorted lists and want to combine them into a single sorted list, this is actually pretty slow; it's going to take O(n) time, because you need to figure out how the lists interweave with each other. This is the thing you do in merge sort.
Play video starting at 57 seconds and follow transcript0:57
However, there is a case where you can merge them a lot faster. And that's the case where they're
separated from each other, where they're sort of, one's on one side, one's on the other side. And so,
this is the case that merge is going to work with. So you're given two trees, R1 and R2, or the roots of
these trees. And they're going to need to have the property that all the keys in R1's tree are smaller
than all of the keys in R2's tree.
Play video starting at 1 minute 25 seconds and follow transcript1:25
And if we're given this, we should then return the root of a new tree that has all of the elements of both of the trees.
Play video starting at 1 minute 31 seconds and follow transcript1:31
So just to review this condition that we have, of the three trees below, which of them can be properly
merged with the one above?
Play video starting at 1 minute 42 seconds and follow transcript1:42
The answer is that only A can because all of the elements of A are less than all of the elements of the
guy above. B has this problem that it's got a 9, and 9 is between 8 and 10 so it can't really be
separated from them. And C has the problem that it has both 2, which is smaller than everything
above, and 12 which is bigger than everything above. So, A is the only guy that actually works here.
Okay, so how do you do this merge operation?
Play video starting at 2 minutes 9 seconds and follow transcript2:09
Well, there's actually a case where it's super easy to merge trees and that's if, instead of just being
given the two trees, we also have an extra node that we can use as the root.
Play video starting at 2 minutes 19 seconds and follow transcript2:19
Because then you just have the node up top, you have the guys in R1, the small guys on the left of it,
the guys in R2, the big guys on the right of it. That's your tree.
Play video starting at 2 minutes 30 seconds and follow transcript2:30
So, the implementation, we'll call this function MergeWithRoot: you let R1 be the left child of T, R2 be the right child of T, and let T be the parent of these two. If you need to do things involving storing heights, you adjust those as appropriate, and then you return T.
Play video starting at 2 minutes 49 seconds and follow transcript2:49
This takes O(1) time, very simple.
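As a sketch, MergeWithRoot might look like this in Python. The Node class and field names here are illustrative assumptions, not the course's reference implementation:

```python
# Minimal sketch of MergeWithRoot, assuming a simple Node with
# left/right/parent pointers and a height field (illustrative names).

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.parent = None
        self.height = 1

def height(node):
    return node.height if node else 0

def merge_with_root(r1, r2, t):
    """Merge the trees rooted at r1 and r2 using the extra node t as root.
    Assumes every key in r1's tree < t.key < every key in r2's tree.
    Runs in O(1) time."""
    t.left = r1
    t.right = r2
    if r1: r1.parent = t
    if r2: r2.parent = t
    t.height = 1 + max(height(r1), height(r2))  # restore the height field
    return t
```

Note the only bookkeeping beyond the three pointer assignments is restoring t's height from its new children.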
Play video starting at 2 minutes 52 seconds and follow transcript2:52
The problem is, well, what if we're not given this extra node? Then there's actually a pretty simple
solution. You find the largest element, of say, the left subtree. You remove that, and then turn it into
the root of the new guy. And that works. So now if we want to merge R1 and R2, we're going to find
the largest element of R1, call that T. You're going to delete T from its subtree, and then you're going
to merge with root. And that works. The run time's a little bit worse. You have to spend O of the
height time to find this node T, but other than that it's pretty efficient.
Play video starting at 3 minutes 31 seconds and follow transcript3:31
So, just to review, we've got this tree, we find the biggest element to the left tree 8. We then delete
it, and then we merge with root.
Play video starting at 3 minutes 40 seconds and follow transcript3:40
That's all there is to it.
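The whole merge could then be sketched as follows (same illustrative Node idea, parent pointers omitted for brevity). Deleting the largest element is done by walking down the right spine while remembering the parent:

```python
# Sketch of Merge: find and detach the largest node of the left tree,
# then use it as the root via merge_with_root. O(height) time.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def merge_with_root(r1, r2, t):
    t.left, t.right = r1, r2
    return t

def merge(r1, r2):
    """Merge two BSTs where all keys in r1 are smaller than all keys in r2."""
    if r1 is None:
        return r2
    # walk down the right spine of r1 to its largest element
    parent, t = None, r1
    while t.right is not None:
        parent, t = t, t.right
    # detach t: its left subtree (if any) takes its place
    if parent is None:
        r1 = t.left          # t was the root of r1 itself
    else:
        parent.right = t.left
    t.left = t.right = None
    return merge_with_root(r1, r2, t)
```

For the example in the lecture, merging a left tree whose maximum is 8 with a right tree makes 8 the new root.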

Now if we didn't care about balance, that would be all we'd have to say about this operation.
Unfortunately, this merge operation doesn't preserve balance properties. And the problem is that,
well, the two trees, you didn't really touch them very much, so they stay balanced. But when you stick
them both together under the same root, well, if one tree is much, much bigger than the other,
suddenly the root is very, very unbalanced, and this is a problem.

Play video starting at 4 minutes 12 seconds and follow transcript4:12


So we need a way to fix this. And there's actually a not very difficult way to do that. The idea is that
we can't merge the, say, left subtree with the right subtree because the left one is way too big.
Play video starting at 4 minutes 25 seconds and follow transcript4:25
So what we're going to do is not merge the whole guy with the whole guy here. We're going to have
to find some node of approximately the same height as this guy on the right so that we can merge
that.
Play video starting at 4 minutes 37 seconds and follow transcript4:37
And so what we're going to do is, we're going to sort of climb down the sort of right edge of the
bigger tree until we find a sub tree of the right height that we can merge with our guy on the other
side. Okay, so how do we implement this? AVLTreeMergeWithRoot.
Play video starting at 4 minutes 55 seconds and follow transcript4:55
What we're going to do is, if the heights of the two trees differ by at most 1, we can just merge with
root as before. We then figure out what the height of T needs to be and return it. It's that simple.
Otherwise though, what happens if say R1's height is bigger than R2's height?
Play video starting at 5 minutes 14 seconds and follow transcript5:14
Well, what we want to do is step down: instead of merging R1 with R2, we merge R1's
right child with R2. So, we use merge with root to merge the right child with R2 at this T, and we get
some new tree with root R prime. R prime we set back to be the right child of R1, and similarly set R1
to be the parent. And then we need to rebalance this at R1 because things might be off by a little bit,
but it turns out not more than about one. We knew how to deal with that with our old rebalance
operation. And then we return the root of the tree.
Play video starting at 5 minutes 53 seconds and follow transcript5:53
If, on the other hand, R1's height is smaller than R2's height, we sort of do the same operation but on
the opposite side.
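A hedged Python sketch of AVLTreeMergeWithRoot follows. The names, and the compact rebalance built from standard AVL rotations, are assumptions for illustration, not the course's reference code:

```python
# Sketch of AVL merge-with-root: descend the taller tree's spine until the
# height difference is at most 1, merge there, rebalance on the way back up.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1

def h(n): return n.height if n else 0

def update(n): n.height = 1 + max(h(n.left), h(n.right))

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    update(n); update(l)
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    update(n); update(r)
    return r

def rebalance(n):
    """Standard AVL rebalancing at n; heights may be off by about one."""
    update(n)
    if h(n.left) > h(n.right) + 1:
        if h(n.left.left) < h(n.left.right):
            n.left = rotate_left(n.left)     # left-right case
        n = rotate_right(n)
    elif h(n.right) > h(n.left) + 1:
        if h(n.right.right) < h(n.right.left):
            n.right = rotate_right(n.right)  # right-left case
        n = rotate_left(n)
    return n

def avl_merge_with_root(r1, r2, t):
    """Merge AVL trees r1 (all keys < t.key) and r2 (all keys > t.key).
    Takes O(|h(r1) - h(r2)|) time."""
    if abs(h(r1) - h(r2)) <= 1:
        t.left, t.right = r1, r2
        update(t)
        return t
    if h(r1) > h(r2):
        r1.right = avl_merge_with_root(r1.right, r2, t)
        return rebalance(r1)
    else:
        r2.left = avl_merge_with_root(r1, r2.left, t)
        return rebalance(r2)
```

The recursion mirrors the lecture: each step down one side of the taller tree shrinks the height gap, and each return rebalances one node.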
Play video starting at 6 minutes 1 second and follow transcript6:01
Okay, so, let's analyze this. Every step we take down the side of the tree decreases the height
difference by either 1 or 2. So we're just going to keep decreasing the height difference until we're off
by at most 1. Then we merge, and then we have to go back up the chain. But the amount of time this
takes isn't the depth of the tree, it's sort of the number of steps we need to take down the side. And
that's approximately the difference in the two heights. And this will actually be important in a bit.
Play video starting at 6 minutes 34 seconds and follow transcript6:34
Okay, so that was the merge operation. Next, we're going to talk about sort of the opposite
operation, split. Merge takes two trees and turns them into one. Split breaks one tree into two.
Play video starting at 6 minutes 45 seconds and follow transcript6:45
What you do is you pick an element, say 6, and then you've got the tree consisting of all the elements
less than 6, and then another one from all the elements bigger than 6. So split should take the root of
a tree and the key x, and the output should be two different trees, one with all the elements less than
x in your tree, and one with all of the elements bigger than x. Okay now, the idea is actually not so
hard. What we're going to do is we're going to search for x, and then the search path, well it's got a
few nodes on the path. And then it has a bunch of trees hanging off to the left of the path. These
things are all going to be smaller than x. And it's going to have a bunch of trees hanging off to the
right that are all going to be bigger than x. And so all we have to do is take these trees that are smaller
than x, merge them all together, take these things bigger than x, merge them all together, and then
we have two trees. So let's see how this works. So we're going to do this recursively. We're going to
split at R, at this point x. If our root is null, if we just have the null tree, we just return a pair of null
trees. Next, if x is less than the key at the root, what that means is that everything on the
right is bigger than x. That's sort of all on the right. But the left side of the root, we need to split that
in two. So we're going to recursively run split on R.Left, and that gives us two trees, R1 and R2. Now
R1 is actually everything in the whole tree that's smaller than x. That half is done, but R2 needs to be
combined with the whole right subtree of our original tree.
Play video starting at 8 minutes 31 seconds and follow transcript8:31
So we run MergeWithRoot on R2 and R.Right with R as the root. Fortunately, we have this extra node
to use as the root. And this gives us an R3. And we can return R1 and R3 as our split.
Play video starting at 8 minutes 45 seconds and follow transcript8:45
If X is bigger than the key, we can do the same thing on the opposite side. Hopefully you can figure
that out.
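The recursive split can be sketched as follows, reusing each node on the search path as the extra root for MergeWithRoot. One detail the lecture leaves open is where a key equal to x goes; this sketch sends it to the left tree, which is one of the two reasonable conventions:

```python
# Sketch of Split: walk the search path for x; subtrees hanging left of
# the path are merged into the "small" result, subtrees hanging right
# into the "big" result, with path nodes reused as roots.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def merge_with_root(r1, r2, t):
    t.left, t.right = r1, r2
    return t

def split(r, x):
    """Split the tree rooted at r into (keys <= x, keys > x)."""
    if r is None:
        return None, None
    if x < r.key:
        r1, r2 = split(r.left, x)
        # everything right of the root is > x; glue it under r
        return r1, merge_with_root(r2, r.right, r)
    else:
        r1, r2 = split(r.right, x)
        # the root and its left subtree are all <= x; glue them under r
        return merge_with_root(r.left, r1, r), r2
```

Swapping merge_with_root for the AVL version above would keep both output trees balanced, as the lecture notes.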

Play video starting at 8 minutes 52 seconds and follow transcript8:52


Okay, so the first thing to note is that if we, instead of just doing a MergeWithRoot, we used
AVLMergeWithRoot. This insures that the two trees we produce are both balanced, which is good.
Play video starting at 9 minutes 3 seconds and follow transcript9:03
Also if you look at the run time of this algorithm, well, the run time of AVLMergeWithRoot is O, the
difference in the two heights.
Play video starting at 9 minutes 13 seconds and follow transcript9:13
So we have to look at the difference between the biggest height and the next biggest, and the next
biggest and the one after that, and so on. This sum actually telescopes, and the total run time is O of
the maximum height of one of these trees we're trying to merge, which is just O(log(n)). And so we've
got these two operation, merge combines trees, split breaks them in two, and both operations can be
implemented in O(log(n)) time using AVL trees.
Play video starting at 9 minutes 41 seconds and follow transcript9:41
So that's these two operations. Next time we're going to talk about a couple of applications. One of
them's going to sort of talk about a way that we can make use of these split and merge in an
interesting way.

Slides and External References

Slides
Download the slides for this lesson:

08_binary_search_trees_5_avl.pdf

08_binary_search_trees_6_avl2.pdf

08_binary_search_trees_7_split_merge.pdf PDF File

References
See the chapters 5.11.1, 5.11.2 here.

https://en.wikipedia.org/wiki/AVL_tree

See this visualization. Play with this AVL tree by adding and deleting elements to see how it
manages to keep being balanced.

PRACTICE QUIZ • 20 MIN


Binary Search Trees
Week 6
Data Structures

Binary Search Trees 2


In this module we continue studying binary search trees. We study a few non-trivial applications.
We then study a new kind of balanced search tree - Splay Trees. They adapt to the queries
dynamically and are optimal in many ways....
Key Concepts
 Describe how to implement advanced operations using balanced binary search trees
 Describe how splay trees work
 Analyze the running time of operations with splay trees
 Apply amortized analysis to splay trees
 Apply binary search trees in programming challenges
 Develop a balanced binary search tree

Less

Applications

Video: LectureApplications

10 min


Reading: Slides and External References

10 min

Splay Trees


Video: LectureSplay Trees: Introduction

6 min

Video: LectureSplay Trees: Implementation

7 min

Video: Lecture(Optional) Splay Trees: Analysis

10 min

Reading: Slides and External References

10 min

Programming Assignment 4

Practice Quiz: Splay Trees

3 questions


Programming Assignment: Programming Assignment 4: Binary Search Trees

3h


Applications
Hello everybody, welcome back. Today we're going to talk more about binary searches, in particular
we're going to give a couple of applications to computing order statistics. And then a sort of
additional use of binary search trees, to store sorted lists.
Play video starting at 21 seconds and follow transcript0:21
Okay, so, there's some questions that you might want to ask. You've got a bunch of elements that are
stored in this binary search tree data structure. They're sorted by some ordering. Things we might
want to do. We might want to find the 7th largest element. Or maybe we want the median element,
or the 25th percentile element. Now, these are all instances of the order statistic problem: given the
root of a tree T and a number k, we should be able to return the kth smallest element stored in the
tree T.
Play video starting at 53 seconds and follow transcript0:53
So, the basic idea is that this is sort of a search problem. We sort of should treat it like one. But to do
that we need to know which subtree to search in.
Play video starting at 1 minute 3 seconds and follow transcript1:03
So I mean, is the kth smallest element in the left subtree? Well, the left subtree does store a bunch
of the smallest elements. But the real question is, does it store k of them? If it stores at least k, the
kth smallest element should be in there; otherwise it won't be. So the thing that we need to know, is how
many elements are in the left subtree?
Play video starting at 1 minute 26 seconds and follow transcript1:26


And so, we really need a tree where we can easily tell how many elements are in each subtree.
Play video starting at 1 minute 32 seconds and follow transcript1:32
Well, there's an easy fix for that. You just add that as a new field. So N.Size should return the number
of elements in the subtree of N.
Play video starting at 1 minute 42 seconds and follow transcript1:42
It's a new field, it stores that, it should satisfy the property, that the size of N is the size of the left
subtree, and the size of the right subtree, plus one, all added together, where if you have null
pointers, these have size zero.
Play video starting at 1 minute 55 seconds and follow transcript1:55
Okay? Now, we have to be a little bit careful, just like with the height, we couldn't just define this field
and hope that it always has the right value. You actually need to do some work to make sure this stays
correct.
Play video starting at 2 minutes 8 seconds and follow transcript2:08
For example, when you perform a rotation, the sizes of various subtrees will change. Fortunately, only
the sort of two nodes that you moved around, will actually have their subtree sizes changed, but you
do need to deal with those. And so, we're going to have an operation called RecomputeSize, which
just recomputes the size as the sum of the size of its left child, and the size of its right child, and one.
And then to do a rotation, we should do this as we did before. But then we need to recompute the
sizes of the two nodes that we rotated. And you need to make sure to do this in the right order,
actually, because the size of the parent depends on the size of the child. So you need to recompute
the child first.
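Here is how this might look in Python for a right rotation (illustrative names). Note that the child's size is recomputed before the parent's, for exactly the reason given above:

```python
# Sketch of a rotation that maintains subtree sizes. Only the two nodes
# that move, x and the node y that replaces it, change their subtree sizes.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(n): return n.size if n else 0

def recompute_size(n):
    n.size = size(n.left) + size(n.right) + 1

def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x   # the usual pointer surgery
    recompute_size(x)              # x is now y's child: recompute it first,
    recompute_size(y)              # since y's size depends on x's
    return y
```

A left rotation is symmetric, with the same child-before-parent ordering.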
Play video starting at 2 minutes 54 seconds and follow transcript2:54
Okay, but once you got all the sizes of nodes stored, it's actually not hard to compute order statistics.
Play video starting at 3 minutes 1 second and follow transcript3:01
So to compute the kth smallest element in your tree, R, what you do is we let s be the size of the left
sub tree, R.Left.Size.
Play video starting at 3 minutes 12 seconds and follow transcript3:12
Now, if k = s+1, there are exactly s things smaller than the root. The root is the (s+1)st smallest
element, so we return R. Otherwise, if k < s + 1, the kth smallest element is in the left subtree, so
recursively return the kth smallest element of the R.Left subtree. On the other hand, if k > s+1, we
need to look in the right subtree. Now unfortunately, it's no longer the kth smallest element of the
right subtree, because there were already s+1 smaller elements.
Play video starting at 3 minutes 51 seconds and follow transcript3:51
So what we actually need to compute is the k-s-1st smallest element in the right subtree.
Play video starting at 3 minutes 58 seconds and follow transcript3:58
But with this minor adjustment, this algorithm works. The runtime is O of the height of the node that
you search for, it's basically just a binary search. But we need to use these sizes of trees in order to
just figure out which side to look at, instead of comparing the actual keys, but this works perfectly
well.
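The order-statistic search above might be sketched like this, with k 1-indexed and each node storing its subtree size (names are illustrative):

```python
# Sketch of OrderStatistic: a binary search steered by subtree sizes
# instead of key comparisons.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)

def order_statistic(r, k):
    """Return the kth smallest node (1-indexed) of the tree rooted at r."""
    s = r.left.size if r.left else 0
    if k == s + 1:
        return r                    # exactly s keys are smaller, so r is it
    elif k < s + 1:
        return order_statistic(r.left, k)
    else:
        # s + 1 smaller elements lie outside the right subtree
        return order_statistic(r.right, k - s - 1)
```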
Play video starting at 4 minutes 20 seconds and follow transcript4:20
Now here's a puzzle that I'm going to give you. We're not going to do homework on this, but it's
something you should think about. What we've shown you how to do is, given a rank, k, we can find
the kth smallest node. How would you do the opposite? Given a node, figure out how many nodes in
the tree are smaller than it. You can do it using this kind of thing, but it takes a little bit of thinking.
Play video starting at 4 minutes 45 seconds and follow transcript4:45
Okay that was one problem. We're going to talk about one more application of binary search trees.
And, this problem's a little bit weird, but it's going to introduce some very important ideas. So we
have an array of squares, they're each colored black or white. And we want to be able to perform this
flip operation, which what it does is it sort of points at some square, x, and every square after that
index, it flips them from black to white, and white to black.
Play video starting at 5 minutes 12 seconds and follow transcript5:12
So, in the array pictured, we're flipping the last four guys, and they go from being white, white, black,
white, to being black, black, white, black. Okay. So, formally, we sort of want a data structure that
maintains this. You can create an array of size n, you can ask for the color of the mth square, or you
can run this flip operation, which flips the color of all the squares after index x.
Play video starting at 5 minutes 39 seconds and follow transcript5:39
Okay, those are the things we want to support. Now, we could do this by just having an array and
storing all the colors. The problem is that the flip would be pretty slow. Because if you wanted to flip
all of the things with index bigger than x,
Play video starting at 5 minutes 53 seconds and follow transcript5:53
then there's no good way to do this. You just have to sort of go through every square of index bigger
than x, and flip them, and that could take linear time.
Play video starting at 6 minutes 7 seconds and follow transcript6:07
So it turns out that there's a nice way to use binary search trees to solve this.

And this requires sort of a slightly different way of thinking about binary search trees.
Play video starting at 6 minutes 17 seconds and follow transcript6:17
Up until now we thought of a search tree as sort of having a bunch of elements stored there, and they
allow you to do searches on them. And so somehow you're given the keys, and the search tree allows
you to find them.
Play video starting at 6 minutes 29 seconds and follow transcript6:29
But there's another way to do it. And maybe a good way to illustrate it, is by looking at the logo for
our course. So this is a binary search tree. Every node has a letter in it. But you'll note that this isn't
sorted in terms of these letters. For example, O is the left child of I, but O comes after I in alphabetical
order. These things aren't sorted alphabetically.
Play video starting at 6 minutes 53 seconds and follow transcript6:53
On the other hand, you'll note that the binary search tree structure actually does tell you what order
these letters are supposed to be going in.
Play video starting at 7 minutes 1 second and follow transcript7:01
I mean, the smallest thing here should be A, because it's the left child, of the left child, of the left
child, of the left child.
Play video starting at 7 minutes 7 seconds and follow transcript7:07
Then L is the next smallest, then G, then O, R, I, then T, H, M, S. And so, it tells us what order these
letters are supposed to be in, and they spell ALGORITHMS.
Play video starting at 7 minutes 20 seconds and follow transcript7:20
And that's the basic idea, that you can use a tree to store some sort of sorted list of things, in a
convenient way.
So, for example we have the following tree. There are no actual keys stored on them, but of this A, B,
C, D, and E, one of these is the 5th smallest element in the tree. Which one is it?
Play video starting at 7 minutes 42 seconds and follow transcript7:42
Well, it's D. I mean, you sort of count: the smallest is on the far left, then B, then this other thing, and D is the
5th smallest.
Play video starting at 7 minutes 50 seconds and follow transcript7:50
Okay, so what's the point? How are we going to use this to do our flip arrays? What we're going to do
is, instead of storing the sequence of black and white cells as an array, we're going to store it as a list.
And in this list we're going to have a bunch of nodes, they're going to be colored black or white, fine.
Play video starting at 8 minutes 9 seconds and follow transcript8:09
Actually, there's a bit of a clever thing. We're actually going to want two trees, one with the normal
colors, and one with the opposite colors. And the reason for this is that when we want to do flips, we
want to be able to replace things with their opposite colors. So, it helps to have everything with its
opposite color already stored somewhere.
Play video starting at 8 minutes 28 seconds and follow transcript8:28
But now comes the really clever bit. If we wanted to do this flip operation, say we wanted to take the
last three elements of our tree and flip all of their colors, well this second tree, this dual tree, the last
three elements of that tree, have the opposite colors. So all that we need to do is swap the last three
elements of the tree on the left, the last three elements of the tree on the right, and we have
effectively swapped those colors.
Play video starting at 8 minutes 56 seconds and follow transcript8:56
And what's even better is that using these merge and split operations from last time, we can actually
do this.
Play video starting at 9 minutes 4 seconds and follow transcript9:04
So let's see how this is implemented.
Firstly, to create this thing, we just build two trees, where T1, all of the things are colored white, and
T2, they're all colored black. Great. To find the color of the mth node, you just find the mth node
within T1 and return its color.
Play video starting at 9 minutes 23 seconds and follow transcript9:23
Great. The flip operation is the interesting bit. If we wanted to Flip(x), what you do is, you split T1 at x
into two halves, and you split T2 at x, and then you merge them back together. But you merge the left
half of T1, with the right half of T2, and you merge the left half of T2, with the right half of T1. And
that effectively did move around the sort of last N bits, and it works. And so, as the moral, trees can
actually be used for more than just performing searches on things. We can use them to store these
sorted lists, and merge and split then become very interesting operations, in that they can allow us to
recombine these lists in useful ways. Okay, so that's all for these applications. Next time we're going
to sort of talk, we're going to give an optional lecture, and it's going to talk about an alternative way
to implement many of these useful binary search tree operations. So, please come to that, it'll be
interesting, but it's not really required. It's another way to do some of this stuff. But, I hope I'll see you
there.
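To make the idea concrete, here is a hedged Python sketch of the whole construction. For brevity it uses plain unbalanced merge and split on size-augmented trees rather than the AVL versions from the previous lesson, so it demonstrates only the color bookkeeping, not the O(log n) bound; all names are illustrative:

```python
# Sketch of the flip array: two trees storing the sequence by position,
# one with the true colors and one (the dual) with the opposite colors.
# Flip(x) swaps everything after position x between the two trees.

def size(n):
    return n.size if n else 0

class Node:
    def __init__(self, color, left=None, right=None):
        self.color, self.left, self.right = color, left, right
        self.size = 1 + size(left) + size(right)

def build(colors):
    """Build a balanced tree storing the list of colors, in order."""
    if not colors:
        return None
    mid = len(colors) // 2
    return Node(colors[mid], build(colors[:mid]), build(colors[mid + 1:]))

def kth(r, k):
    """Return the kth node (1-indexed), navigating by subtree sizes."""
    s = size(r.left)
    if k == s + 1:
        return r
    return kth(r.left, k) if k < s + 1 else kth(r.right, k - s - 1)

def merge_with_root(r1, r2, t):
    t.left, t.right = r1, r2
    t.size = 1 + size(r1) + size(r2)
    return t

def pop_last(r):
    """Detach the last node of r; return (remaining tree, that node)."""
    if r.right is None:
        rest, r.left, r.size = r.left, None, 1
        return rest, r
    rest, t = pop_last(r.right)
    r.right = rest
    r.size = 1 + size(r.left) + size(rest)
    return r, t

def merge(r1, r2):
    if r1 is None:
        return r2
    if r2 is None:
        return r1
    r1, t = pop_last(r1)
    return merge_with_root(r1, r2, t)

def split_at(r, k):
    """Split into (first k elements, the rest), by position."""
    if r is None:
        return None, None
    s = size(r.left)
    if k <= s:
        l1, l2 = split_at(r.left, k)
        return l1, merge_with_root(l2, r.right, r)
    l1, l2 = split_at(r.right, k - s - 1)
    return merge_with_root(r.left, l1, r), l2

class FlipArray:
    def __init__(self, n):
        self.t1 = build(['W'] * n)   # true colors: everything starts white
        self.t2 = build(['B'] * n)   # dual tree with the opposite colors

    def color(self, m):
        return kth(self.t1, m).color

    def flip(self, x):
        # swap the tails after position x between the two trees
        a1, b1 = split_at(self.t1, x)
        a2, b2 = split_at(self.t2, x)
        self.t1 = merge(a1, b2)
        self.t2 = merge(a2, b1)
```

Each flip is two splits and two merges; with the AVL versions of merge and split, every operation would run in O(log n).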

Slides and External References

Slides
Download the slides for this lesson:

08_binary_search_trees_8_applications.pdf PDF File

References
See the chapters 14.1, 14.2 in [CLRS] Thomas H. Cormen, Charles E. Leiserson, Ronald L.
Rivest, Clifford Stein. Introduction to Algorithms (3rd Edition). MIT Press and McGraw-Hill. 2009.

Splay Tree
Splay Trees: Introduction
Hello everybody. Welcome back. Today we are going to talk about something a little different.
Play video starting at 6 seconds and follow transcript0:06
Up until this point, we've talked about AVL trees, we've talked about how to keep them balanced, and
how to use them to implement all of our binary search tree operations in O(log n) time per operation.
But it turns out that there are a wide number of different binary search tree structures that give you
different ways to ensure that your trees are balanced. There are treaps, there are red-black trees, and
today we're going to talk about splay trees as sort of another example of the types of
things that you can do.
Play video starting at 35 seconds and follow transcript0:35
And to motivate this suppose that, well, if you're searching for random elements, one after the other,
you can actually show that no matter what data structure you use, it will always take at least log n
time per operation. That's actually the best you can do.
Play video starting at 54 seconds and follow transcript0:54
However, if some items are searched for more frequently than others, you might be able to do better.
If you take the frequently queried items and put them close to the root, those items will be faster to
search for. And some of the other items might be a little bit slower, but you should still be okay.
Play video starting at 1 minute 13 seconds and follow transcript1:13
To compare for example, we've got the unbalanced tree and the balanced one with the same entries.
But you'll note that 1, 5 and 7 are much higher up in the unbalanced tree. Now if we search for
random things, if we search for 11, 11 is much higher up in the balanced tree than the unbalanced
one, so the unbalanced one is slower there. But when we search for 1, it takes a lot less time in the
unbalanced case, we search for one again, we search for seven, it's again a lot cheaper in the
unbalanced case. And if we do some sequence of searches, well, it might turn out that it's actually
cheaper to use the unbalanced tree than the balanced one if these elements that tend to be higher up
in the unbalanced tree are searched for more frequently than other elements.
Play video starting at 2 minutes 0 seconds and follow transcript2:00


So the idea here is that we'd like to have a search tree where we can put the commonly searched
nodes near the root of the tree so that they are cheaper to look up.
Play video starting at 2 minutes 9 seconds and follow transcript2:09
However, very often it will be the case that we won't know ahead of time which those commonly
searched for nodes will be.
Play video starting at 2 minutes 17 seconds and follow transcript2:17
And so in this situation, we'll need an adaptive way to bring them close to the root. And one
natural way to do it is every time you find a node in your tree you do something to rearrange the tree
and bring that node up to the root.
Play video starting at 2 minutes 32 seconds and follow transcript2:32
And that way at least heuristically, if there are elements that you search for frequently, then since you
keep bringing them up to the root, they'll usually stay somewhat close to the root. And they'll be very
cheap to access.
Play video starting at 2 minutes 46 seconds and follow transcript2:46
So if we want to phrase this in a nice simple way, one thing that you could do is if you have your tree
and you've got this node that you searched for, you could just bring it to the root by just rotating to
the top. You rotate it up and again and again and again, and again until it ends up being the root.
Play video starting at 3 minutes 4 seconds and follow transcript3:04
Unfortunately, this simple idea is actually not very good. As you'll note, we started with an
unbalanced tree, but after we did this operation, the tree is still unbalanced.

Play video starting at 3 minutes 15 seconds and follow transcript3:15


And in fact, there's a bad sequence of inputs: you can note that there's this loop here, these five
rearrangements of the tree, where if you keep doing the appropriate search and then you rotate the
searched-for node all the way to the top, they just go in this loop. And when this happens, if you
count the total time it takes to perform the sequence of operations, it takes O(n²) time to perform n
operations. And so the average time per operation is linear rather than logarithmic, which is far too
slow. So this rotate-to-the-top idea doesn't actually work very well; we
need something better.

Play video starting at 3 minutes 56 seconds and follow transcript3:56


And for this we're going to make just a slight modification. The rotate to top algorithm basically says
you look at the node and its parent and you rearrange them and then you, again, look at the node and
its parent and rearrange them, and keep going until you get to the root.
Play video starting at 4 minutes 11 seconds and follow transcript4:11
The modification, instead of just looking at the node and its parent, you're going to look at the node
and its parent and its grandparent.
Play video starting at 4 minutes 19 seconds and follow transcript4:19
And there are a few cases here. Firstly, there's the case where the node's parent and grandparent
are on the same side of it. This is called the zig-zig case.
Play video starting at 4 minutes 29 seconds and follow transcript4:29
Then what we're going to do is we're going to elevate the node up so that it's now the parent of what
was its old parent, and that's on top of what was the old grandparent.
Play video starting at 4 minutes 40 seconds and follow transcript4:40
On the other hand, it could be that the parent and grandparent are on opposite sides of the node,
what's known as the zig-zag case. Then you rearrange the tree as follows, so that the node is now the
new parent of its old parent and grandparent.
Play video starting at 4 minutes 54 seconds and follow transcript4:54
And finally there's one more case where the node's parent is actually the root node, so it doesn't have
a grandparent.

And then you actually just rotate the node up. And so if we combine these operations together, we get
what's called the splay operation. If you want to splay a node N, and this is a way of bringing the node
N to the root of the tree, you determine which case you are in, the Zig-Zig, the Zig-Zag, or the Zig case.
You apply the appropriate local operation to rearrange the tree.
Play video starting at 5 minutes 24 seconds and follow transcript5:24
And then if the parent of the node is not null, i.e., if the node is not the root of the tree yet, you splay N
again. And you just keep splaying until it gets to become the root. Okay so to make sure that we're on
the same page with this.
If we take the tree up top and we splay the highlighted node number 4, which of these three trees, A,
B, or C, do we end up with afterwards?
Play video starting at 5 minutes 53 seconds and follow transcript5:53
Well, the answer here is A. So the point is, we start in this configuration, and we note that we're
originally in the zig-zig case: two, three, and then four. So we elevate four, such that three and then
two come down from it as children. And then we're in the zig-zag case. One and five are on opposite
sides of four, so we elevate four; one and five are its new children, and three and two now hang off
of one. And that is exactly what we were supposed to end up with, and so that is the answer to this
question.
Play video starting at 6 minutes 27 seconds and follow transcript6:27
Okay, so that's what the splay operation is. Next time we're going to talk about how to use the splay
operation to rebalance your search tree, and how to use it to perform all the basic binary search tree
operations efficiently. So I'll see you then.
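The zig-zig, zig-zag, and zig steps might be coded like this: a Python sketch with parent pointers, where all names are illustrative:

```python
# Sketch of the splay operation built from a single rotation primitive.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate(x):
    """Rotate x above its parent, preserving the BST order."""
    p = x.parent
    g = p.parent
    if p.left is x:
        p.left, x.right = x.right, p
        if p.left: p.left.parent = p
    else:
        p.right, x.left = x.left, p
        if p.right: p.right.parent = p
    p.parent, x.parent = x, g
    if g:                      # reattach x where p used to hang
        if g.left is p: g.left = x
        else: g.right = x

def splay(x):
    """Bring x to the root using zig-zig, zig-zag and zig steps."""
    while x.parent is not None:
        p, g = x.parent, x.parent.parent
        if g is None:
            rotate(x)                        # zig: parent is the root
        elif (g.left is p) == (p.left is x):
            rotate(p); rotate(x)             # zig-zig: rotate the parent first
        else:
            rotate(x); rotate(x)             # zig-zag: rotate x twice
    return x
```

Running this on a right-spine tree consistent with the quiz's description (5 at the root, then 1, 2, 3, 4 down one path) ends with 4 at the root, 1 and 5 as its children, and 3 and 2 hanging off 1, matching answer A.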

Splay Trees: Implementation


Hello everybody. Welcome back. Today we're going to talk more about splay trees. In particular, we'll
see how to implement your basic search tree operations using a splay tree. So remember, last
time, we had this idea to design a binary search tree, where every time you queried a node, you
brought it to the root. And we saw that simple ways of doing this didn't quite work out so well, so
we introduced the splay operation, which is a little bit better.
Play video starting at 26 seconds and follow transcript0:26
Now, there's this problem with the splay operation that the way that the splay trees are built, you
don't actually guarantee the tree is always balanced. Sometimes you'll end up with very unbalanced
trees. And when that happens, your splay operation will actually be very slow because you have to
sort of bring your node up to the root one or two steps at a time, and it will actually take a while.
However, you'll note that if I have this long stretched out tree, and I splay this leaf all the way to the
root, we have rearranged the tree, it's now a little bit more balanced than it was before. And so, when
you use the splay operation rather than to sort of rotate to top operation, it's actually the case that
you can't have a long sequence of expensive splay operations. Because every time you have an
expensive splay operation, it will rebalance the tree and make it more balanced. And so, if you keep
picking really unbalanced nodes, pretty quickly, the tree will balance itself out, and then you'll have
nice, short O(log n) time operations.
Play video starting at 1 minute 31 seconds and follow transcript1:31
But this does mean that we're no longer dealing with worst-case time. Instead, we need to talk about
amortized analysis, average-case time.
And the big theorem that we're not going to prove today is that the amortized cost of first doing O(D)
work, and then splaying a node of depth D is actually O(log(n)), where n is the total number of nodes
in the tree.
Play video starting at 1 minute 55 seconds and follow transcript1:55
And we'll prove this later, but using it, we can analyze our splay tree operations. And the basic idea is
that, if you have to do a lot of work and then splay a very deep node, we're going to be able to pay for
that work by the fact that the splay operation will rebalance our tree in some useful way. And that
will pay for it and so amortized cost will only be O(log(n)). Okay, using this, how do we implement our
operations? So a splay tree find is actually very simple. First we find the node in the way we normally
would. We then splay the node that we found and then return it.
Play video starting at 2 minutes 34 seconds and follow transcript2:34
Pretty simple. So how does the analysis work? Now the node, remember, might not be at small depth.
It could be at some depth D, and D could be as big as n.
Play video starting at 2 minutes 45 seconds and follow transcript2:45
We then do O(D) work to find the node because that's how long a find operation takes. We then run a
splay, so we did O(D) work, and then we splayed a node of depth D. And so the amortized cost is
O(log(n)) for this operation, which is what we want.
Play video starting at 3 minutes 4 seconds and follow transcript3:04
Now, the idea here is that you're paying for the work of finding this node N by splaying it back to the
root, which rebalances the tree. And so, if the node was really deep, you did do a lot of work. But you also
did some useful rebalancing work, which means you're not going to have to keep doing a lot of work on later operations.
Play video starting at 3 minutes 22 seconds and follow transcript3:22
Now, there's a very important point to note here, that it could be that we were doing this search, you
fail to find a node with exactly that key that you were looking for.
Play video starting at 3 minutes 32 seconds and follow transcript3:32
But when this happens, you still have to splay the closest node that you found in this operation.
Because otherwise, what's happening is your operation did O(D) work, but since you're not doing a
splay, there's nothing to amortize against. You actually just spent O(D) work. What you need to do is if
you're doing this big, deep search, you have to pay for it by rebalancing the tree. And you have to,
therefore, splay whatever node you found, even if it does not have the right key.
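To make this concrete, here is an illustrative Python sketch of a splay-tree find (all the names here are mine, not taken from the course's assignment code). The `splay` helper implements the zig, zig-zig and zig-zag cases, and `find` does an ordinary BST descent and then splays the last node visited, even on an unsuccessful search, exactly as just described:

```python
# Illustrative sketch of a splay-tree find; names are not from any starter code.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate(n):
    """Rotate n above its parent (one local rotation)."""
    p, g = n.parent, n.parent.parent
    if p.left is n:                      # right rotation
        p.left, n.right = n.right, p
        if p.left: p.left.parent = p
    else:                                # left rotation
        p.right, n.left = n.left, p
        if p.right: p.right.parent = p
    n.parent, p.parent = g, n
    if g:
        if g.left is p: g.left = n
        else:           g.right = n

def splay(n):
    """Bring n to the root using zig, zig-zig and zig-zag steps."""
    while n.parent:
        p, g = n.parent, n.parent.parent
        if g is None:
            rotate(n)                    # zig: parent is the root
        elif (g.left is p) == (p.left is n):
            rotate(p); rotate(n)         # zig-zig: rotate the parent first
        else:
            rotate(n); rotate(n)         # zig-zag: rotate n twice
    return n

def find(root, key):
    """Return (new_root, found_node_or_None). The last node on the search
    path is splayed even when the key is absent, so the O(D) descent is
    always paid for by rebalancing."""
    node, last = root, None
    while node:
        last = node
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    if last is None:
        return root, None
    root = splay(last)
    return root, root if root.key == key else None
```

Note that the zig-zig case rotates the parent before the node; doing two single rotations on the node itself would give exactly the rotate-to-the-root behavior that, as discussed earlier, does not have the amortized guarantee.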
Play video starting at 4 minutes 3 seconds and follow transcript4:03
Okay, so that's fine. Let's talk about Insert. Insert, it turns out, is also really easy. First, you insert a
node in the same way that you would in any binary search tree. And that's O(D) work, where D is the depth of the new node.
Play video starting at 4 minutes 14 seconds and follow transcript4:14
And then you run the splay tree find. You find the node again, and you splay it to the top. It all works.
Play video starting at 4 minutes 20 seconds and follow transcript4:20
Now to get deletes to work, there's actually a pretty clever way to do it. If you splay your node and
successor both to the top of the tree, you end up with this sort of third diagram in this picture. And
you'll note that if we want to get rid of the red node, all we have to do is sort of promote the blue
node, its successor, into its place. Because of the way this works out, the blue node will never have a
left child, and things will just work. So the code for delete is: you splay the successor of N to the root,
you then splay N to the root, and then you just need to remove N and promote its successor. So we let
L and R be the left and right children of N, and what we have to do is make R become L's new parent
and L become R's new left child, and then set R to be the root of the tree. And once we've
rearranged a few pointers, everything works. Now, there is one special case here: what if N is
the largest key in the entire tree? Then there is no successor, and you need to do something slightly different. I'll
leave that to you to figure out.
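As an illustrative Python sketch of this delete recipe (my own names, not the assignment's code): for brevity the `splay` below is a simplified stand-in that brings a node up with single rotations, which loses the amortized guarantee but still produces a root; the shape invariant delete relies on, that after splaying the successor and then N, N's right child is the successor and has no left child, holds for any rotate-to-the-root scheme. It also shows one possible way to handle the no-successor special case.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate(n):
    """Rotate n above its parent."""
    p, g = n.parent, n.parent.parent
    if p.left is n:
        p.left, n.right = n.right, p
        if p.left: p.left.parent = p
    else:
        p.right, n.left = n.left, p
        if p.right: p.right.parent = p
    n.parent, p.parent = g, n
    if g:
        if g.left is p: g.left = n
        else:           g.right = n

def splay(n):
    # Stand-in: single rotations to the root (see the caveat above).
    while n.parent:
        rotate(n)
    return n

def successor(n):
    """Next node in key order, or None if n holds the largest key."""
    if n.right:
        m = n.right
        while m.left:
            m = m.left
        return m
    while n.parent and n.parent.right is n:
        n = n.parent
    return n.parent

def delete(n):
    """Remove node n from its tree; return the new root."""
    succ = successor(n)
    if succ is None:            # special case: n holds the largest key
        splay(n)
        left = n.left
        if left: left.parent = None
        return left
    splay(succ)
    splay(n)                    # n is root; succ = n.right, succ.left is None
    left, right = n.left, n.right
    right.parent = None
    right.left = left           # promote the successor into n's place
    if left: left.parent = right
    return right
```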
Play video starting at 5 minutes 28 seconds and follow transcript5:28
Finally, let's talk about the split operation. Now, the split is actually also very nice with splay trees.
The point is there's one case where splitting is really easy. It's if the key at which you're supposed to
split is right at the root. Because then all you need to do is you need to split things into two subtrees
by just breaking them off the root.
Play video starting at 5 minutes 50 seconds and follow transcript5:50
But with a splay tree, it's really easy to make any node that we want be the root.
Play video starting at 5 minutes 56 seconds and follow transcript5:56
So what we're going to do is we're going to do a search to find the place at which we are supposed to
do our split. Take whatever node we found, splay it to the root, and then we're just going to break our
tree into two pieces right at the root.
Play video starting at 6 minutes 9 seconds and follow transcript6:09
So to see pseudocode for this, we're going to let N be what we find when we search for the x that
we're trying to split at. We then splay N to the root. Now if N's key is bigger than x, we have to cut off
the left subtree. If the key is less than x, we cut off the right subtree. And if the key is actually equal to
x, well, the x that we're trying to split at is actually in the tree. So you might cut off one side or
the other, depending on whether you actually want to keep that node in the tree or throw it
away, and then we just return the left subtree and the right subtree. Now just to be clear, if we
want to say, cut off the left subtree, all we have to do is we let L be the left child, and we just have to
sort of break the pointer between our node and in its left child, so that they're now separate trees.
And we just return L and N as the two roots. So that's how we do a split.
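The split just described might look like this in Python (an illustrative sketch with names of my own; the `splay` here is a simplified stand-in that rotates the node to the root one step at a time, which yields the same final shape as a true splay, though not the amortized bound; the tie-breaking choice for an exact key match, which the lecture leaves open, here keeps the matching node in the left tree):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate(n):
    """Rotate n above its parent."""
    p, g = n.parent, n.parent.parent
    if p.left is n:
        p.left, n.right = n.right, p
        if p.left: p.left.parent = p
    else:
        p.right, n.left = n.left, p
        if p.right: p.right.parent = p
    n.parent, p.parent = g, n
    if g:
        if g.left is p: g.left = n
        else:           g.right = n

def splay(n):
    # Stand-in: single rotations to the root (see the caveat above).
    while n.parent:
        rotate(n)
    return n

def split(root, x):
    """Split into (tree with keys <= x, tree with keys > x)."""
    node, last = root, None
    while node:                      # ordinary BST search for x
        last = node
        if x == node.key:
            break
        node = node.left if x < node.key else node.right
    if last is None:
        return None, None
    n = splay(last)                  # the closest node is now the root
    if n.key > x:                    # cut off the left subtree
        left, n.left = n.left, None
        if left: left.parent = None
        return left, n
    right, n.right = n.right, None   # key <= x: cut off the right subtree
    if right: right.parent = None
    return n, right
```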
To do a merge, we basically have to do the opposite of this. And the idea is that it's very easy to
merge two trees together when you sort of have this element that's in between them right up there
at the root. And once again, there's an easy way to do this with splay trees. You just find the largest
element of the left subtree, you splay it to the top, and then just attach the right subtree as a child of
that node. And then you're done.
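A matching Python sketch of merge (again illustrative, with a simplified rotate-to-the-root `splay` standing in for the real zig-zig/zig-zag version; `merge` assumes every key in the left tree is smaller than every key in the right tree, and the helpers are repeated so the snippet stands alone):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate(n):
    """Rotate n above its parent."""
    p, g = n.parent, n.parent.parent
    if p.left is n:
        p.left, n.right = n.right, p
        if p.left: p.left.parent = p
    else:
        p.right, n.left = n.left, p
        if p.right: p.right.parent = p
    n.parent, p.parent = g, n
    if g:
        if g.left is p: g.left = n
        else:           g.right = n

def splay(n):
    # Stand-in: single rotations to the root (see the caveat above).
    while n.parent:
        rotate(n)
    return n

def merge(left, right):
    """Merge two trees; every key in `left` must be below every key in `right`."""
    if left is None:
        return right
    m = left
    while m.right:            # largest element of the left tree
        m = m.right
    m = splay(m)              # now at the root, with no right child
    m.right = right           # attach the right tree below it
    if right: right.parent = m
    return m
```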
Play video starting at 7 minutes 35 seconds and follow transcript7:35
So, in summary, splay trees. Using these, we can actually perform all the operations that we wanted,
very simply, in O(log(n)) amortized time per operation.
Play video starting at 7 minutes 46 seconds and follow transcript7:46
And so this provides a very clean way to do this. We left out some things in the analysis though. So if
you'd like to see really what the math behind how we can show all of these things work, please come
back for the next lecture.

(Optional) Splay Trees: Analysis

PPT Slides
Hello everybody. Welcome back. Today we're still talking about splay trees and we're actually going to
go into a little bit of the math behind analyzing their run times.
Play video starting at 10 seconds and follow transcript0:10
So remember last time we analyzed splay trees and in order to do so we needed the following
important result, that the amortized cost of doing O(D) work and then splaying a node of depth D is
actually O(log(n)) where n is the total number of nodes in the tree.
Play video starting at 28 seconds and follow transcript0:28
And today we're going to prove that.
Play video starting at 30 seconds and follow transcript0:30
So to do this of course we need to amortize, we need to pay for this extra work by doing something to
make the tree simpler. And the way we talk about this being simple is we're going to pick a potential
function, and so that if we do a lot of work it's going to pay for itself by decreasing this potential.
Play video starting at 49 seconds and follow transcript0:49
And it takes some cleverness to find the right one and it turns out more or less the right potential
function is the following. We define the rank of a node N to be the log of the size of its subtree,
where the size of its subtree is just the number of nodes that are descendants of N in that tree (including N itself).
Play video starting at 1 minute 8 seconds and follow transcript1:08
Then the potential function P for the tree is just the sum over all nodes N in the tree of the rank of
N. Now to get a feel for what this means if your tree is balanced, or even approximately balanced,
potential function should be approximately linear in the total number of the nodes. But if on the
other hand, it's incredibly unbalanced, say just one big chain of nodes, then the potential could be as
big as n log(n). And so, a very large potential function means that your tree is very unbalanced. And
so, if you are decreasing the potential, it means that you're rebalancing the tree.
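To get a quantitative feel for this, here is a small illustrative Python experiment (all names are mine). It computes the potential, the sum of log2(subtree size) over all nodes, from a list of subtree sizes, for a chain and for a perfectly balanced tree on the same number of nodes; the chain's potential comes out around n·log2(n), while the balanced tree's is linear in n:

```python
import math

def potential(subtree_sizes):
    """Phi = sum over all nodes of log2(size of that node's subtree)."""
    return sum(math.log2(s) for s in subtree_sizes)

def chain_sizes(n):
    """In a chain of n nodes, the subtree sizes are exactly 1, 2, ..., n."""
    return list(range(1, n + 1))

def balanced_sizes(n):
    """Subtree sizes of a perfectly balanced tree on n = 2^k - 1 nodes."""
    if n == 0:
        return []
    half = (n - 1) // 2
    return [n] + balanced_sizes(half) + balanced_sizes(half)

n = 1023
balanced = potential(balanced_sizes(n))   # grows linearly in n
chain = potential(chain_sizes(n))         # grows like n * log2(n)
```

Running this shows the chain's potential is several times larger than the balanced tree's, which is exactly the sense in which a large potential means an unbalanced tree.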
Play video starting at 1 minute 46 seconds and follow transcript1:46
So what we need to do is we need to see what happens when you perform a splay operation, what
does it do to the potential function.
Play video starting at 1 minute 54 seconds and follow transcript1:54
Now, to do that, note that the splay operation is composed of a bunch of these little operations, zigs,
zig-zigs and zig-zags, and we want to know, for each operation, what it does to the potential.
Play video starting at 2 minutes 7 seconds and follow transcript2:07
So for example when you perform a zig operation how does the potential function change?
Play video starting at 2 minutes 14 seconds and follow transcript2:14
Well you'll note that other than N and P, these two nodes that were directly affected, none of the
nodes have their subtrees change at all. And therefore the ranks of all the other nodes stay exactly
the same.
Play video starting at 2 minutes 28 seconds and follow transcript2:28
So the change in potential function is just the new rank of N plus the new rank of P, minus the old
rank of N and the old rank of P.
Play video starting at 2 minutes 37 seconds and follow transcript2:37
Now, the new rank of N and the old rank of P are actually the same, because each of those had
subtrees that were just the entire tree.
Play video starting at 2 minutes 46 seconds and follow transcript2:46
And so this is just the new rank of P minus the old rank of N, and it's easy to see that's at most the
new rank of N minus the old rank of N.
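In symbols (this is just the argument above in notation, writing rk for a rank before the zig and rk′ for a rank after it):

```latex
\begin{align*}
\Delta\Phi &= rk'(N) + rk'(P) - rk(N) - rk(P) \\
           &= rk'(P) - rk(N)
              && \text{since } rk'(N) = rk(P) \text{ (both subtrees are the whole tree)} \\
           &\le rk'(N) - rk(N)
              && \text{since $P$ is now below $N$, so } rk'(P) \le rk'(N).
\end{align*}
```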
Play video starting at 2 minutes 55 seconds and follow transcript2:55
That's not so bad.
Play video starting at 2 minutes 57 seconds and follow transcript2:57
Now let's look at the zig-zig analysis which is a little bit trickier.
Play video starting at 3 minutes 1 second and follow transcript3:01
So here the change in the potential is the new rank of N plus the new rank of P plus the new rank of Q
minus the old rank of N, and the old rank of P and the old rank of Q. So the new ranks minus all the
old ranks. Now the claim here is that this is at most 3 times the new rank of N minus the old rank of N
minus 2. And to prove this we need a few observations.
Play video starting at 3 minutes 27 seconds and follow transcript3:27
The first thing is that the new rank of N is equal to the old rank of Q, and that this term is actually at least as big
as any other term in our expression. And that's simply because, for both of these nodes, what
are their subtrees? Well, it's N, P and Q, and then the red, green, blue and black subtrees. They're the
same subtrees, the same size. They've got the same rank. But the next thing to note is that the old size
of N's subtree and the new size of Q's subtree, when you add them together, it's going to be the red
subtree, the green subtree, the blue subtree, and the black subtree plus two more nodes.
Play video starting at 4 minutes 7 seconds and follow transcript4:07
And that's actually one less than the size of either of these two big terms.
Play video starting at 4 minutes 14 seconds and follow transcript4:14
And what that says, when you take logarithms, is that the old rank of N plus the new rank of Q is at
most twice the new rank of N, minus 2.
Play video starting at 4 minutes 28 seconds and follow transcript4:28
Because they're sort of half the size each.
Play video starting at 4 minutes 31 seconds and follow transcript4:31
And therefore, if you combine these inequalities together, you can actually get the one that we
wanted on the previous slide. Now, the zig-zag analysis is pretty similar to this. Here, you can show
the change in potential is at most twice the new rank of N minus the old rank of N minus 2. Okay,
great. So now we perform an entire splay operation. So we splay once, and then again, and then
again, and then again, all the way up until we finally have the final version of N that's the root.
Play video starting at 5 minutes 3 seconds and follow transcript5:03
And we want to know what the total change in the potential function is over all of these little teeny
steps.
Play video starting at 5 minutes 10 seconds and follow transcript5:10
Well, what is it? Well, it's at most the sum of the changes in potential from each step. From the last step
you get three times the final rank of N minus the rank of N one step before that, minus two.
Play video starting at 5 minutes 25 seconds and follow transcript5:25
You add to that three times the rank of N one step before the end minus the rank of N two steps before the end, minus
2. Then you add three times the rank of N two steps before the end minus the rank of N three steps before
the end, minus 2, and so on and so forth. And this sum actually telescopes. The rank of N one step before
the end shows up in two terms that cancel. The rank two steps before the end shows up in two
terms that cancel, and so on and so forth. And the only terms that are left are: three
times the rank of N at the very end of your splay operation, minus three times the rank of N at the very beginning
of your splay operation, and then, for each two levels that N went up the tree,
one copy of -2. So that's minus the depth of the node, up to a constant.
Play video starting at 6 minutes 16 seconds and follow transcript6:16
And so the change in potential is just O of log(n) minus the depth, which is minus the work that you
did.
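In symbols, writing $r_i(N)$ for the rank of N after the $i$-th zig-zig or zig-zag step, with $k$ such steps in total (so $2k$ is roughly the depth $D$), the telescoping sum just described is:

```latex
\begin{align*}
\Delta\Phi &\le \sum_{i=1}^{k} \Bigl( 3\bigl(r_i(N) - r_{i-1}(N)\bigr) - 2 \Bigr)
            = 3\bigl(r_k(N) - r_0(N)\bigr) - 2k \\
           &\le 3\log_2 n - D + O(1).
\end{align*}
```

Adding this to the O(D) work performed gives O(D) + 3 log₂(n) − D + O(1) = O(log(n)) amortized. (A final zig step, if there is one, contributes at most rk′(N) − rk(N) with no −2, which is absorbed into the O(1).)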
Play video starting at 6 minutes 23 seconds and follow transcript6:23
And note it's O of log(n) because the rank of n is at most log of the total number of nodes in the tree.
Play video starting at 6 minutes 30 seconds and follow transcript6:30
And so if you add the change in potential to the amount of work that you did, you get out O of log(n).
And so the amortized cost of your O of D work plus your splay operation is just O of log(n).
Play video starting at 6 minutes 46 seconds and follow transcript6:46
Now, that shows that our splay trees run in O(log(n)) amortized time per operation.
Play video starting at 6 minutes 52 seconds and follow transcript6:52
And if that were all you could say, there'd be nothing really to be too excited about. I mean, it gets about
the same run time,
Play video starting at 7 minutes 0 seconds and follow transcript7:00
maybe it's a little bit easier to implement. It's a little bit more annoying because it's only amortized
rather than worst case. Some operations will be much more expensive than log(n) even if on average
the operations are pretty cheap.
Play video starting at 7 minutes 15 seconds and follow transcript7:15
But another great thing is that splay tree has also have many other wonderful properties.
Play video starting at 7 minutes 22 seconds and follow transcript7:22
For example, if you assign weights to the nodes in any way such that the sum of all nodes of their
weight is equal to one, the amortized cost of accessing a node N is actually just O(log(1/wt(N))). And
that means that if you spend most of your time accessing high-weight nodes, it might actually be much
quicker than log(n) time per operation.
Play video starting at 7 minutes 46 seconds and follow transcript7:46
And note that this run time bound holds no matter what weights you assign. You don't need
to change the algorithm based on the weights. This bound happens automatically. And so if there are
certain nodes that get accessed much more frequently than others, you could just sort of artificially
assign them very high weights and then that actually means that your splay tree automatically runs
faster than log(n) per operation. Another bound is the dynamic finger bound. The amortized cost of
accessing a node is O(log(D + 1)), where here D is the distance, in key order, between the last node
accessed and the current one. So if, say, you want to list all the nodes in order, or search for all the
nodes in order, it's actually pretty fast in a splay tree because D is 1. It's a
constant per operation rather than O(log(n)).
Play video starting at 8 minutes 43 seconds and follow transcript8:43
Another bound is the working set bound. The amortized cost of accessing a node N is O(log(t+1))
where t is the amount of time that has elapsed since that node N was last accessed.
Play video starting at 8 minutes 56 seconds and follow transcript8:56
And what that means, for example, is that if you tend to access nodes that you've accessed recently a
lot. So you access one node pretty frequently and then you move to accessing a different node pretty
frequently, then this actually does a lot better again than O of log(n) per operation.
Play video starting at 9 minutes 19 seconds and follow transcript9:19
Finally we've got what's known as the dynamic optimality conjecture.
Play video starting at 9 minutes 24 seconds and follow transcript9:24
And this says if you give me any list of splay tree operations, inserts, finds, deletes whatever.
Play video starting at 9 minutes 32 seconds and follow transcript9:32
And then you build the best possible dynamic search tree for that particular sequence of operations.
You can have it completely optimized to perform those operations as best possible.
Play video starting at 9 minutes 46 seconds and follow transcript9:46
The conjecture says that if you run a splay tree on those operations it does worse by at most a
constant factor.
Play video starting at 9 minutes 53 seconds and follow transcript9:53
And that's pretty amazing. It would say that if there is any binary search tree that does particularly
well on a sequence of operations, then, at least conjecturally, a splay tree does too. So the conclusion here
is that splay trees, they're pretty fast, they require only O of log(n) amortized time per operation
which, remember, it can be a problem if you're worried that the occasional operation might take a
long time.
Play video starting at 10 minutes 18 seconds and follow transcript10:18
But in addition to this, splay trees can actually be much better if your input queries have extra
structure, if you access some nodes more frequently than others, or you tend to access nodes near the
ones that you most recently accessed, and things like that. But that's actually it for today. That's the
splay tree, and that's why they're considered to be useful.
Play video starting at 10 minutes 42 seconds and follow transcript10:42
And that's it for this course. I really hope that you've enjoyed it, I hope you'll come back for our next course,
and best of luck.
Slides and External References

Slides
Download the slides for this lesson:

08_binary_search_trees_9_splay.pdf PDF File

References
See chapter 5.11.6 here.

Also see this visualization. Play with it by adding and erasing keys from it, and see how it can be
unbalanced, in contrast with AVL tree, but pulls the keys it works with to the top.

Also see this answer about comparison of AVL trees and Splay trees.

Also see the original paper on Splay trees.

PRACTICE QUIZ • 30 MIN

Splay Trees
Programming Assignment: Programming Assignment 4:
Binary Search Trees
