
ICP 2027 - Data Structures and Algorithms Assignment 2 Report

Callum Scott Innes <eeu46f>

Abstract
In this report I will go over a few key topics that were considered before,
during and after the creation of this project. Namely, I'd like to cover my
initial thoughts, research into finding a node in a BST, research into finding
an element in a sorted array, a comparison of the two methods and, finally,
a conclusion.
Initial Thoughts and Rationale
Searching through data in any data structure is a frequent task, so it makes
sense to want the fastest, most efficient way of doing so. You could tackle
the problem in a number of different ways, but you generally have to consider
these three constraints:
Type of data to be searched - is it comparable to other elements?
Order of data - is the data structured, is it sorted, etc.?
Limitations of the data structure - is there something unique about the
data structure which changes the way in which data must be stored?

Because of these limitations there isn't a single method of searching any
data structure generically; we must find the option best suited to each
specific situation.
Searching an Array
An array is a very primitive and simple data structure, but there are still
a number of ways to search one. Notably:


Linear Search
The linear search algorithm consists of traversing every index in the array
and comparing it against the needed value. The time complexity of this
algorithm with unsorted data is O(n), because in the worst case we potentially
have to compare every element in the entire array in order to find the element
we're searching for.
In the best case, the element is at the first index of the array, meaning
only a single comparison is needed to find it. On average, the linear search
algorithm performs (n + 1)/2 comparisons, since on average we'd have to
traverse half the array to find the required element. This still yields a
time complexity of O(n), since n is the fastest-growing term.
A linear search would be used in the case that our array is not sorted
and/or we don't have much other information to exploit, as in the sketch below.
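A minimal sketch of a linear search in Java (my own illustration, not part
of the assignment code); it returns the index of the value, or -1 if the
value is absent:

private static int linearSearch(int[] array, int value) {
    // Scan left to right until the value is found: at most n comparisons.
    for (int i = 0; i < array.length; i++) {
        if (array[i] == value) {
            return i; // found after i + 1 comparisons
        }
    }
    return -1; // value is not in the array
}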
Binary Search
A binary search (or logarithmic search) requires the data to be sorted.
Rather than checking elements consecutively, the search range is split in
half and the middle element is compared against the required value to decide
in which half it must reside. This halving process continues until the value
is found.
Binary search has a time complexity of O(log(n)), since there are
approximately log2(n) + 1 comparisons on average. At every subdivision the
problem size is halved, which corresponds to log2(n): log2(n) is the inverse
of 2^n, which describes the problem size doubling at each step.
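To spell out that reasoning: each comparison halves the remaining search
space, so after k comparisons roughly n/2^k candidates remain, and the search
ends once a single candidate is left:

\[ \frac{n}{2^k} = 1 \;\Longrightarrow\; 2^k = n \;\Longrightarrow\; k = \log_2(n) \]

The extra term in log2(n) + 1 accounts for the final comparison against the
last remaining element.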
Comparison
Clearly, binary search is superior to a standard linear search if we're able
to access sorted data. Here's a table showing the average number of
comparisons made at different array lengths:

Sorted array length: n    Linear Search: ⌊(n + 1)/2⌋    Binary Search: ⌊log2(n) + 1⌋
1                         1                             1
10                        5                             4
100                       50                            7
1000                      500                           10
10000                     5000                          14
100000                    50000                         17
1000000                   500000                        20
...                       ...                           ...
Note the growth rates of these two algorithms: linear search's comparison
count grows in direct proportion to n, whereas binary search's comparisons
grow in much smaller increments. This illustrates how much more efficient
binary search is.
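As a quick check (a standalone sketch of my own, not assignment code), the
table's values can be reproduced directly:

// Prints the average comparison counts from the table above:
// floor((n + 1) / 2) for linear search, floor(log2(n)) + 1 for binary search.
for (int n = 1; n <= 1000000; n *= 10) {
    int linear = (n + 1) / 2;
    int binary = (int) (Math.log(n) / Math.log(2)) + 1;
    System.out.println(n + ": " + linear + " vs " + binary);
}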
Finding an Element in a Sorted Array with Binary Search
So far we only have the theoretical time complexity of the binary search,
so to get something empirical I've written a program that will test it for
me. I decided that a suitable test would be to run a binary search algorithm
on a sorted array a number of times, attempting to find a randomly generated
element from 0 to n - 1, and to compare the actual average number of
comparisons against the theoretical expected amount.
The program works by performing 1000 trials at each array size n and finding
the average number of comparisons for each n. Here's the code I used to
perform this test:
private static final int ARRAY_SIZE_START = 100;
private static final int ARRAY_SIZE_END = 10000;
private static final int TRIAL_COUNT = 1000;
private static Random rand = new Random();

private float[] comparisons;


public SortedArrayTest() {
    comparisons = new float[(ARRAY_SIZE_END - ARRAY_SIZE_START) + 1];

    for (int n = ARRAY_SIZE_START; n <= ARRAY_SIZE_END; n++) {
        // Build a sorted array containing 0 .. n - 1, so that every
        // possible random target is guaranteed to be present.
        int[] sortedArray = new int[n];
        for (int i = 0; i < n; i++) {
            sortedArray[i] = i;
        }
        // Run TRIAL_COUNT searches for random targets, totalling the comparisons.
        int sum = 0;
        for (int j = 0; j < TRIAL_COUNT; j++) {
            int randomNumber = rand.nextInt(n);
            sum += findBinarySearchComparisons(sortedArray, randomNumber);
        }
        // Record the average comparison count for this array size.
        comparisons[n - ARRAY_SIZE_START] = sum / (float) TRIAL_COUNT;
    }
}

The helper method findBinarySearchComparisons counts the number of
comparisons made in each trial; it is defined below:
private static int findBinarySearchComparisons(int[] array, int value) {
    int comparisons = 0;
    int left = 0;
    int right = array.length - 1;
    while (left <= right) {
        // Probe the midpoint of the remaining range.
        int midpoint = left + ((right - left) / 2);
        comparisons++;
        if (array[midpoint] == value) {
            return comparisons; // found the value
        } else if (value < array[midpoint]) {
            right = midpoint - 1; // discard the upper half
        } else {
            left = midpoint + 1; // discard the lower half
        }
    }
    return comparisons; // value not present (cannot happen in this test)
}
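As a quick sanity check of the helper (my own example, not part of the test
run), searching a seven-element array for its middle value takes one
comparison, while an end value needs the full ⌊log2(7)⌋ + 1 = 3:

int[] sorted = {0, 1, 2, 3, 4, 5, 6};
// The middle element is hit on the first probe:
System.out.println(findBinarySearchComparisons(sorted, 3)); // prints 1
// An end element needs three probes:
System.out.println(findBinarySearchComparisons(sorted, 0)); // prints 3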

The number of comparisons for each n is saved in the array comparisons. After
all of the trials completed, I wrote an extension to output the comparisons
for each n to a CSV file. I then loaded this information into MATLAB and
plotted it next to log2(n) to see how they compare. Here's what I found:

[Figure: recorded average comparison counts plotted against log2(n)]

As you can see, the recorded number of comparisons is actually very close to
our expected number of comparisons, log2(n). It doesn't produce a perfectly
identical curve because each point in the graph is an average taken over
completely random targets, so the number of comparisons for a single search
can fluctuate between 1 (O(1)) and log2(n) (O(log(n))) depending on the
random number itself, which subsequently affects the average.
If we introduced more trials into the test, we'd see the average fitting
closer and closer to log2(n) due to the law of large numbers:

    lim(t -> infinity) C(n)/t = log2(n)

Here C(n) represents the total number of comparisons over all trials on a
sorted array of size n, and t represents the number of trials. This shows
that as the number of trials tends toward infinity, the average number of
comparisons per trial tends toward approximately log2(n); at n = 1000, for
example, we'd expect the average to settle near log2(1000) ≈ 9.97.
Again, though, the problem with binary search is that it must be used on a
sorted array. This means that although searching is extremely efficient,
insertion and deletion remain quite inefficient. Here's a table showing the
relevant time complexities:
Algorithm (sorted array)    Time complexity
Insertion                   O(n)
Deletion                    O(n)
Searching                   O(log(n))

This is because of the way data is modified in an array. If we insert an
element into the array at an arbitrary index, we'd first need to shift all of
the elements to the right of that index one place to the right. With
deletion, we'd have to shift all of the elements after the deleted index one
place to the left, as in the sketch below.
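To illustrate that shifting cost, here is a sketch of my own (not assignment
code) of inserting into a sorted array, using the standard System.arraycopy
utility; the array is assumed to have a spare slot at the end:

// Inserts value into sortedArray, shifting everything right of the
// insertion point along by one: O(n) in the worst case.
private static void insertSorted(int[] sortedArray, int usedLength, int value) {
    int index = 0;
    while (index < usedLength && sortedArray[index] < value) {
        index++; // find the insertion point (a binary search would also work)
    }
    // Shift the tail one place to the right to make room.
    System.arraycopy(sortedArray, index, sortedArray, index + 1, usedLength - index);
    sortedArray[index] = value;
}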
Finding a Node in a Binary Search Tree
Implementing the Binary Search Tree (BST) data structure is one way of
tackling the poor insertion and deletion complexity of binary search on a
sorted array.
A binary search tree is a binary tree with an ordering constraint enforced on
its data: every node's left child holds a value less than the node's own
value, and its right child holds a value greater than it. This provides a
type of ordering of the data from left to right.


For every subtree in the tree, the right child can be thought of as the
midpoint of the right half of a sorted array, and the left child as the
midpoint of the left half. This is a recursive property: each subtree links
to at most two further subtrees, which in turn link to at most two more, and
so on.
Because of this, searching a binary search tree has an average time
complexity of O(log(n)) with an average of log2(n) comparisons, the same as
the theoretical complexity and comparison count of the binary search
algorithm. But in a BST we don't have to shift values when inserting or
deleting, so inserting and deleting cost O(log(n)) + O(1) = O(log(n)) on
average, as the sketch below illustrates.
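To show why no shifting is needed, here is a minimal sketch of BST insertion
(my own illustration, assuming a Node class with data, leftNode and rightNode
fields like the one used in the path-length code below, a Node constructor
taking the data, and a comparable element type T):

// Walks down the tree (O(log n) on average) and attaches a new leaf;
// no existing node has to move, unlike insertion into a sorted array.
public void insert(T value) {
    Node newNode = new Node(value);
    if (rootNode == null) {
        rootNode = newNode;
        return;
    }
    Node current = rootNode;
    while (true) {
        if (value.compareTo(current.data) < 0) {
            if (current.leftNode == null) {
                current.leftNode = newNode; // attach as a left leaf
                return;
            }
            current = current.leftNode;
        } else {
            if (current.rightNode == null) {
                current.rightNode = newNode; // attach as a right leaf
                return;
            }
            current = current.rightNode;
        }
    }
}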
The problem with the binary search tree is that the constraints we place on
each subtree are too loose. They allow a BST to become lopsided if elements
are inserted in sequential order (for example, inserting 1, 2, 3, ..., n in
order produces a tree that is effectively a linked list), in which case
searching it may actually take longer than a linear search of an unsorted
array. In the worst case the time complexity of searching a BST is O(n),
since we may have to go through every single node to find the element we're
looking for.
We can attempt to mitigate this by shuffling the elements before inserting
them into the tree, which lessens the probability of lopsidedness. To test
whether this fix is adequate I've created a program similar to my sorted
array tester to see how the BST matches up. The program is pretty much
exactly the same, except I've adapted the algorithm to use a binary search
tree instead. In this case, the number of comparisons is the length of the
path to the node. Like the previous test, I've run 1000 trials per value of
n, from 100 to 10000 inclusive.
Here's a snippet of code showing how I calculated the distance to each
random node:
public int findLengthOfPathToNode(T value) {
    Node currentRoot = rootNode;
    int comparisonCount = 0;
    while (true) {
        comparisonCount++;
        if (currentRoot == null) {
            // Fell off the tree: the value is not present.
            return comparisonCount;
        }
        int comparedValue = value.compareTo(currentRoot.data);
        if (comparedValue == 0) {
            return comparisonCount; // found the node
        } else if (comparedValue < 0) {
            currentRoot = currentRoot.leftNode; // descend left
        } else {
            currentRoot = currentRoot.rightNode; // descend right
        }
    }
}
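As a quick illustration (my own usage sketch, assuming the BST class's addAll
method, shown in the test below, inserts the elements in list order):

BST<Integer> tree = new BST<Integer>();
tree.addAll(java.util.Arrays.asList(2, 1, 3));
// 2 was inserted first, so it is the root: path length 1.
System.out.println(tree.findLengthOfPathToNode(2)); // prints 1
// 1 and 3 hang directly off the root: path length 2.
System.out.println(tree.findLengthOfPathToNode(3)); // prints 2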

Here's a snippet of code showing how I ran the test:


private static final int TREE_START_SIZE = 100;
private static final int TREE_END_SIZE = 10000;
private static final int TRIAL_COUNT = 1000;
private static Random rand = new Random();

private float[] comparisons;

public BinarySearchTest() {
    comparisons = new float[(TREE_END_SIZE - TREE_START_SIZE) + 1];
    BST<Integer> tree = new BST<Integer>();
    for (int n = TREE_START_SIZE; n <= TREE_END_SIZE; n++) {
        // Build the keys 0 .. n - 1, then shuffle them so the tree
        // is constructed from a random insertion order.
        List<Integer> keys = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) {
            keys.add(i);
        }
        Collections.shuffle(keys);
        tree.addAll(keys);
        // Measure the path length to TRIAL_COUNT random nodes.
        int sum = 0;
        for (int j = 0; j < TRIAL_COUNT; j++) {
            int randomNumber = rand.nextInt(n);
            sum += tree.findLengthOfPathToNode(randomNumber);
        }
        // Empty the tree ready for the next size, and record the average.
        tree.removeAll();
        comparisons[n - TREE_START_SIZE] = sum / (float) TRIAL_COUNT;
    }
}
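Both tests write their comparisons array out for plotting. The report does
not show that extension, but a minimal sketch of such a CSV export
(hypothetical method name and file handling of my own) might look like this:

// Writes one "n,averageComparisons" row per tested size.
private void writeComparisonsToCsv(String fileName, int startSize) throws java.io.IOException {
    try (java.io.PrintWriter out = new java.io.PrintWriter(new java.io.FileWriter(fileName))) {
        for (int i = 0; i < comparisons.length; i++) {
            out.println((startSize + i) + "," + comparisons[i]);
        }
    }
}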

As previously, I output the comparisons to a CSV file and plotted them in
MATLAB. Here's my result:

[Figure: average path length per tree size plotted against log2(n)]

The grey points on the graph represent the average path length to a node at a
given tree size n, and the blue line represents the theoretical average
number of comparisons, log2(n).
Note how the blue line and the grey points start very close together, and
how the grey points slowly break away and scatter from the blue line as the
tree size increases. This suggests that the binary search tree's structure
gradually becomes more lopsided as we increase the number of elements in the
tree, even with shuffled data.
The theoretical and practical comparison counts aren't too different, though,
and the result is still very far from O(n), so shuffling data before
insertion appears to be a good way to minimise the comparison count when
searching a binary search tree.
Comparing the Two Methods
I wrote a simple script in MATLAB to overlay the two graphs so we can
compare them:

[Figure: overlaid average comparison counts for the sorted array and the BST]
In this graph, the green crosses represent the average number of comparisons
on a sorted array of a given size n, the grey crosses represent the average
number of comparisons on a binary search tree of the same size, and the
orange line is log2(n).
Since the structure of a sorted array is constant, the green points
accurately follow log2(n); there isn't much fluctuation in the comparison
count.
In a sorted array the data is already correctly ordered, so the order in
which we insert data doesn't affect the time complexity of searching it. In
a BST, however, the order in which we insert data does affect the search
time, since we cannot guarantee that the position an element ends up in is
the best possible one.
By default, a BST is never restructured throughout its lifetime, so whichever
element is inserted first remains the root forever. In a binary search of a
sorted array, by contrast, the midpoint is computed at search time, so the
most efficient subdivision is always chosen. In a BST the best root is not
necessarily in place, and we cannot guarantee whether or not the tree is
lopsided, because the order of insertion dictates the layout of the tree.

Conclusion
As the comparison shows, the sorted array is the more efficient structure for
searching: its average number of comparisons stays close to log2(n) as the
number of elements n increases, whereas the binary search tree data shows the
average number of comparisons increasing and drifting away from log2(n) as
the tree size grows.
The only area in which the BST beats binary search on a sorted array is
insertion and deletion, where its time complexity is better.
The binary search algorithm also only works when the data is sorted. This is
a major limitation: given an unsorted array, you'd first have to apply some
sorting algorithm, which would increase the overall time complexity and
perhaps make the BST more efficient.
Therefore a BST would be more suitable in cases where we need to continuously
insert, delete and search for data.
Improvements and Criticisms
If I had more time to work on this report, I'd definitely revise the second
half, as I had to rush it due to time constraints and I feel I've left out a
lot of detail. I could have given examples of how the different algorithms
work, explained my code step by step, and given further mathematical proof.
I would be interested in attempting to calculate the growth in deviation
from the best/worst case scenarios, but this is far beyond the scope of this
assignment.
