
Radix Sorting

LSD radix sort
MSD radix sort
3-way radix quicksort
Suffix sorting

Reference: Chapter 13, Algorithms in Java, 3rd Edition, Robert Sedgewick.

Princeton University, COS 226, Algorithms and Data Structures, Spring 2004, Kevin Wayne, http://www.Princeton.EDU/~cos226

Radix Sorting

Radix sorting.
- Specialized sorting solution for strings.
- Same ideas for bits, digits, etc.

Applications.
- Sorting strings.
- Full text indexing.
- Plagiarism detection.
- Burrows-Wheeler transform. (stay tuned)
- Computational molecular biology.

An Application: Redundancy Detector

Longest repeated substring.
- Given a string of N characters, find the longest repeated substring.
- Ex: a a c a a g t t t a c a a g c
- Application: computational molecular biology.

Dumb brute force.
- Try all indices i and j, and all match lengths k, and check.
- O(W N^3) time, where W is the length of the longest match.

Brute force.
- Try all indices i and j for the start of a possible match, and check.
- O(W N^2) time, where W is the length of the longest match.

[Figure: the example string with start indices i and j and match length k marked.]
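The second brute-force strategy can be sketched in a few lines of Java. This is an illustrative sketch, not code from the slides; the class name `BruteForceLrs` and its method names are assumptions made for the example. For every pair of start positions i < j it extends the match as far as it goes, which gives the O(W N^2) bound.

```java
final class BruteForceLrs {
    // Length of the longest common prefix of s[i..] and s[j..].
    static int matchLength(String s, int i, int j) {
        int k = 0;
        while (i + k < s.length() && j + k < s.length()
                && s.charAt(i + k) == s.charAt(j + k)) {
            k++;
        }
        return k;
    }

    // Tries every pair of start positions i < j and keeps the longest match.
    static String longestRepeatedSubstring(String s) {
        int bestStart = 0, bestLen = 0;
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j < s.length(); j++) {
                int k = matchLength(s, i, j);
                if (k > bestLen) { bestLen = k; bestStart = i; }
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }
}
```

On the example string (written without spaces), `BruteForceLrs.longestRepeatedSubstring("aacaagtttacaagc")` returns "acaag". Overlapping occurrences count as repeats in this sketch.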
A Sorting Solution

Suffix sort.
- Form N suffixes of original string.
- Sort to bring longest repeated substrings together.

    Suffixes                         Sorted suffixes
    a a c a a g t t t a c a a g c    a a c a a g t t t a c a a g c
    a c a a g t t t a c a a g c      a a g c
    c a a g t t t a c a a g c        a a g t t t a c a a g c
    a a g t t t a c a a g c          a c a a g c
    a g t t t a c a a g c            a c a a g t t t a c a a g c
    g t t t a c a a g c              a g c
    t t t a c a a g c                a g t t t a c a a g c
    t t a c a a g c                  c
    t a c a a g c                    c a a g c
    a c a a g c                      c a a g t t t a c a a g c
    c a a g c                        g c
    a a g c                          g t t t a c a a g c
    a g c                            t a c a a g c
    g c                              t t a c a a g c
    c                                t t t a c a a g c

Suffix Sorting: Java Implementation

Java implementation.
- We use Java String library functions to simplify code.
- Could use byte array to store ASCII string, and array of pointers into the byte array to save memory.

    import java.util.Arrays;

    public class SuffixSorter {
        public static void main(String[] args) {
            In stdin = new In();                  // read input (In is the course I/O class)
            String s = stdin.readAll();
            int N = s.length();

            String[] suffixes = new String[N];    // create suffixes (linear time)
            for (int i = 0; i < N; i++)
                suffixes[i] = s.substring(i, N);

            Arrays.sort(suffixes);                // sort (bottleneck)
            findLongestMatch(suffixes);           // find longest match (method not shown)
        }
    }
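The slide leaves `findLongestMatch` undefined. One plausible way to fill it in, sketched below under that assumption, is to scan adjacent pairs of sorted suffixes and keep the longest common prefix, since sorting brings the two occurrences of the longest repeat next to each other. The class name `SuffixLrs` and the helper `commonPrefixLength` are hypothetical. Note that on current JVMs `substring` copies characters rather than sharing the backing array, so this version can use memory quadratic in N.

```java
import java.util.Arrays;

final class SuffixLrs {
    // Number of leading characters that a and b share.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int k = 0; k < n; k++) {
            if (a.charAt(k) != b.charAt(k)) return k;
        }
        return n;
    }

    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);
        Arrays.sort(suffixes);

        // The longest repeat is a common prefix of two adjacent sorted suffixes.
        String best = "";
        for (int i = 0; i + 1 < n; i++) {
            int k = commonPrefixLength(suffixes[i], suffixes[i + 1]);
            if (k > best.length()) best = suffixes[i].substring(0, k);
        }
        return best;
    }
}
```

For the example string this finds "acaag", the common prefix of the adjacent sorted suffixes "acaagc" and "acaagtttacaagc".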

Diversion: String Implementation in Java

Java implementation of String.
- Immutability: use as Key in symbol table, fast substring.
- Memory for a newly created string: 28 + 2N bytes (!)

    public final class String implements Comparable {
        private char[] value;    // characters
        private int offset;      // index of first char into array
        private int count;       // length of string
        private int hash;        // cache of hashCode

        private String(int offset, int count, char[] value) {
            this.offset = offset;
            this.count  = count;
            this.value  = value;
        }
        public String substring(int from, int to) {
            return new String(offset + from, to - from, value);
        }
        . . .
    }

String Sorting Performance

    String Sort    Worst Case    Suffix Sort, Moby Dick (sec)
    Brute          W N^2         36,000
    Quicksort      W N log N     9.5

N = number of strings.
1.2 million for Moby Dick.
191 thousand for Aesop's Fables.
Footnote markers: estimate; probabilistic guarantee.
String Sorting

Notation.
- String = variable length sequence of characters.
- W = max # characters per string.
- N = # input strings.
- R = radix (256 for extended ASCII, 65,536 for UNICODE).

Java syntax.
- Array of strings: String[] a;
- The ith string: a[i]
- The dth character of the ith string: a[i].charAt(d)
- Strings to be sorted: a[lo], ..., a[hi]

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. (0th character)

    // d = 0
    int[] count = new int[256+1];
    for (int i = lo; i <= hi; i++) {
        char c = a[i].charAt(d);
        count[c+1]++;                 // frequencies
    }

     i   a[i]     count
     0   d a b    a  0
     1   a d d    b  2
     2   c a b    c  3
     3   f a d    d  1
     4   f e e    e  2
     5   b a d    f  1
     6   d a d    g  3
     7   b e e
     8   f e d
     9   b e d
    10   e b b
    11   a c e

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. (0th character)
- Compute cumulative frequencies.

    for (int i = 1; i < 256; i++)
        count[i] += count[i-1];       // cumulative counts

     i   a[i]     frequencies   cumulative
     0   d a b    a  0          a  0
     1   a d d    b  2          b  2
     2   c a b    c  3          c  5
     3   f a d    d  1          d  6
     4   f e e    e  2          e  8
     5   b a d    f  1          f  9
     6   d a d    g  3          g 12
     7   b e e
     8   f e d
     9   b e d
    10   e b b
    11   a c e

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. (0th character)
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

    // d = 0
    for (int i = lo; i <= hi; i++) {
        char c = a[i].charAt(d);
        temp[count[c]++] = a[i];      // rearrange
    }

     i   a[i]     count (before)   temp
     0   d a b    a  0             a d d
     1   a d d    b  2             a c e
     2   c a b    c  5             b a d
     3   f a d    d  6             b e e
     4   f e e    e  8             b e d
     5   b a d    f  9             c a b
     6   d a d    g 12             d a b
     7   b e e                     d a d
     8   f e d                     e b b
     9   b e d                     f a d
    10   e b b                     f e e
    11   a c e                     f e d
Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. (0th character)
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

    // d = 0
    for (int i = lo; i <= hi; i++) {
        char c = a[i].charAt(d);
        temp[count[c]++] = a[i];      // rearrange
    }

Each count[c] advances by one for every string placed. For example, count[d] goes from 6 to 7 when "dab" is placed and reaches 8 after "dad".

     i   a[i]     count (before)   count (after)   temp
     0   d a b    a  0             a  2            a d d
     1   a d d    b  2             b  5            a c e
     2   c a b    c  5             c  6            b a d
     3   f a d    d  6             d  8            b e e
     4   f e e    e  8             e  9            b e d
     5   b a d    f  9             f 12            c a b
     6   d a d    g 12             g 12            d a b
     7   b e e                                     d a d
     8   f e d                                     e b b
     9   b e d                                     f a d
    10   e b b                                     f e e
    11   a c e                                     f e d
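The counting, cumulative-sum, and rearrange steps fit together as one short method. The following is a minimal sketch rather than the course's code: the class and method names are assumptions, it includes the final copy from temp back into a, and it assumes every character code is below 256 (extended ASCII).

```java
final class KeyIndexedCounting {
    static final int RADIX = 256;   // assumes extended ASCII characters

    // Stably sorts a[lo..hi] by the character at position d.
    static void sortByChar(String[] a, int lo, int hi, int d) {
        int[] count = new int[RADIX + 1];
        String[] temp = new String[hi - lo + 1];

        // 1. Count frequencies, shifted by one slot.
        for (int i = lo; i <= hi; i++) count[a[i].charAt(d) + 1]++;

        // 2. Cumulative counts: count[c] = number of keys smaller than c.
        for (int r = 1; r <= RADIX; r++) count[r] += count[r - 1];

        // 3. Rearrange: each string goes to the next free slot of its block.
        for (int i = lo; i <= hi; i++) temp[count[a[i].charAt(d)]++] = a[i];

        // 4. Copy back.
        for (int i = lo; i <= hi; i++) a[i] = temp[i - lo];
    }
}
```

Because strings are scanned left to right and each block is filled front to back, strings with the same character keep their original relative order, which is the stability that LSD radix sort relies on.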

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. (0th character)
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

    for (int i = lo; i <= hi; i++)
        a[i] = temp[i - lo];          // copy back

LSD Radix Sort

Least significant digit radix sort.
- Ancient method used for card-sorting.
- Consider digits from right to left: use key-indexed counting to STABLE sort by character.

     i   input    d = 2    d = 1    d = 0
     0   d a b    d a b    d a b    a c e
     1   a d d    c a b    c a b    a d d
     2   c a b    e b b    f a d    b a d
     3   f a d    a d d    b a d    b e d
     4   f e e    f a d    d a d    b e e
     5   b a d    b a d    e b b    c a b
     6   d a d    d a d    a c e    d a b
     7   b e e    f e d    a d d    d a d
     8   f e d    b e d    f e d    e b b
     9   b e d    f e e    b e d    f a d
    10   e b b    b e e    f e e    f e d
    11   a c e    a c e    b e e    f e e
LSD Radix Sort

Least significant digit radix sort.
- Ancient method used for card-sorting.
- Consider digits from right to left: use key-indexed counting to STABLE sort by character.

    // Fixed length strings (length = W)
    public static void lsd(String[] a, int lo, int hi) {
        for (int d = W-1; d >= 0; d--) {
            // do key-indexed counting sort on digit d
            ...
        }
    }

LSD Radix Sort: Correctness

Proof 1. (left-to-right)
- If two strings differ on first character, key-indexed sort puts them in proper relative order.
- If two strings agree on first character, stability keeps them in proper relative order.

Proof 2. (right-to-left)
- If the characters not yet examined differ, it doesn't matter what we do now.
- If the characters not yet examined agree, later passes won't affect order.
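The elided body of the LSD loop is one key-indexed counting pass per character position. A minimal sketch, assuming all strings have the same length w and only extended ASCII characters; the class name and the signature `sort(String[] a, int w)` are choices made for this example and differ from the `lsd(a, lo, hi)` outline on the slide.

```java
final class LsdRadixSort {
    static final int RADIX = 256;   // assumes extended ASCII characters

    // Sorts an array of strings that all have length w.
    static void sort(String[] a, int w) {
        int n = a.length;
        String[] temp = new String[n];
        for (int d = w - 1; d >= 0; d--) {            // right to left
            int[] count = new int[RADIX + 1];
            for (int i = 0; i < n; i++) count[a[i].charAt(d) + 1]++;
            for (int r = 1; r <= RADIX; r++) count[r] += count[r - 1];
            for (int i = 0; i < n; i++) temp[count[a[i].charAt(d)]++] = a[i];
            System.arraycopy(temp, 0, a, 0, n);       // copy back
        }
    }
}
```

Each pass costs time proportional to N + R and there are W passes, which matches the W(N + R) running time quoted for LSD radix sort.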

LSD Radix Sort: Analysis

Running time. Θ(W(N + R)).
- Why doesn't it violate the N log N lower bound?

Advantage. Fastest sorting method for random fixed length strings.

Disadvantages.
- Accesses memory "randomly."
- Inner loop has a lot of instructions.
- Wastes time on low-order characters.
- Doesn't work for variable-length strings.
- Not much semblance of order until very last pass.

Goal: find fast algorithm for variable length strings.

MSD Radix Sort

Most significant digit radix sort.
- Partition file into 256 pieces according to first character.
- Recursively sort all strings that start with the same character, etc.

How to sort on dth character?
- Use key-indexed counting.
MSD Radix Sort Implementation

    public static void msd(String[] a, int lo, int hi) {
        msd(a, lo, hi, 0);
    }

    private static void msd(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;

        // do key-indexed counting sort on digit d
        int[] count = new int[256+1];
        ...

        // recursively sort 255 subfiles (assumes '\0' terminated)
        for (int i = 0; i < 255; i++)
            msd(a, lo + count[i], lo + count[i+1] - 1, d+1);
    }

String Sorting Performance

    String Sort        Worst Case    Suffix Sort, Moby Dick (sec)
    Brute              W N^2         36,000
    Quicksort          W N log N     9.5
    LSD *              W(N + R)      -
    MSD                W(N + R)      395
    MSD with cutoff    W(N + R)      6.8

R = radix.
W = max length of string.
N = number of strings.
* assumes fixed length strings.
Footnote markers: estimate; probabilistic guarantee.
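The MSD outline above elides the counting pass and assumes '\0'-terminated strings. The sketch below fills in those steps under one stated assumption: since Java strings carry no terminator, the hypothetical helper `charAt(s, d)` returns 0 for positions past the end, which emulates the terminator as long as no input string contains the character '\0'. Class and helper names are illustrative, not from the course code.

```java
final class MsdRadixSort {
    static final int RADIX = 256;   // assumes extended ASCII characters

    // Character d of s, or 0 past the end (emulates a '\0' terminator).
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : 0;
    }

    static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    private static void sort(String[] a, String[] temp, int lo, int hi, int d) {
        if (hi <= lo) return;

        // Key-indexed counting sort of a[lo..hi] on character d.
        int[] count = new int[RADIX + 1];
        for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 1]++;
        for (int r = 1; r <= RADIX; r++) count[r] += count[r - 1];
        for (int i = lo; i <= hi; i++) temp[count[charAt(a[i], d)]++] = a[i];
        for (int i = lo; i <= hi; i++) a[i] = temp[i - lo];

        // After the rearrange step count[r] marks the end of block r, so block
        // r+1 occupies a[lo + count[r] .. lo + count[r+1] - 1]. Block 0 holds
        // strings that have ended and needs no further sorting.
        for (int r = 0; r < RADIX - 1; r++)
            sort(a, temp, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }
}
```

Every call allocates and scans a 257-entry count array even for a two-element subfile, which is the overhead that motivates a cutoff for small subfiles.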

MSD Radix Sort Analysis

Disadvantages.
- Too slow for small files.
    ASCII: 100x slower than insertion sort for N = 2
    UNICODE: 30,000x slower for N = 2
- Huge number of recursive calls on small files.

Solution: cutoff to insertion sort for small N.
- Competitive with quicksort for string keys.

Recursive Structure of MSD Radix Sort

Trie structure to describe recursive calls in MSD radix sort.

Problem: algorithm touches lots of empty nodes ala R-way tries.
- Tree can be as much as 256 times bigger than it appears!
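The cutoff to insertion sort mentioned under MSD Radix Sort Analysis could look like the sketch below. It is an assumption about how such a cutoff might be written, not the course's code: an insertion sort over a[lo..hi] that compares strings starting at character d, relying on the fact that all strings in an MSD subfile already agree on their first d characters. An MSD routine would call it and return whenever hi - lo falls below some small threshold; the slides do not state a threshold value.

```java
final class StringInsertionSort {
    // True if v < w when both are compared from character d onward.
    // Assumes v and w agree on their first d characters.
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int k = d; k < n; k++) {
            if (v.charAt(k) != w.charAt(k)) return v.charAt(k) < w.charAt(k);
        }
        return v.length() < w.length();
    }

    // Insertion sort of a[lo..hi], skipping the first d characters.
    static void sort(String[] a, int lo, int hi, int d) {
        for (int i = lo + 1; i <= hi; i++) {
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
        }
    }
}
```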
Correspondence With Sorting Algorithms

Correspondence between trees and sorting algorithms.
- BSTs correspond to quicksort recursive partitioning structure.
- R-way tries correspond to MSD radix sort.
- What corresponds to ternary search tries?

3-Way Radix Quicksort

Idea 1. Use dth character to "sort" into 3 pieces instead of 256, and sort each piece recursively.
Idea 2. Keep all duplicates together in partitioning step.

[Figure: tree over the keys by, sea, sells, shells, shore, the. Label: Partition Algorithm.]
Recursive Structure of MSD Radix Sort vs. 3-Way Quicksort

3-way radix quicksort collapses empty links in MSD tree.

[Figure: MSD Recursion Tree]
[Figure: 3-Way Radix Quicksort Recursion Tree]

3-Way Partitioning

3-way partitioning.
- Natural way to deal with equal keys.
- Partition elements into 3 parts:
    elements between i and j equal to partition element v
    no larger elements to left of i
    no smaller elements to right of j

    | less than v | equal to v | greater than v |
    lo            i            j               hi

Dutch national flag problem.
- Not easy to implement efficiently. (Try it!)
- Not done in practical sorts before mid-1990s.
- Incorporated into Java system sort, C qsort.
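To make the three-part invariant concrete, here is a sketch of the simplest way to establish it, the single-pass scheme usually attributed to Dijkstra. It is offered as an illustration on an int array; it is not the more efficient four-part scheme, and the class name and the use of a[lo] as the partition element are choices made for this example.

```java
final class ThreeWayPartition {
    // Partitions a[lo..hi] around v = a[lo]. Returns {lt, gt} such that
    // a[lo..lt-1] < v, a[lt..gt] == v, and a[gt+1..hi] > v.
    static int[] partition(int[] a, int lo, int hi) {
        int v = a[lo];
        int lt = lo, i = lo + 1, gt = hi;
        while (i <= gt) {
            if (a[i] < v)      swap(a, lt++, i++);   // grow the "less" region
            else if (a[i] > v) swap(a, i, gt--);     // grow the "greater" region
            else               i++;                  // equal: leave in the middle
        }
        return new int[] { lt, gt };
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

This version does one swap for every element that differs from v, which is part of why it is not considered the efficient solution when equal keys are rare.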
3-Way Partitioning

Elegant solution to Dutch national flag problem.
- Partition elements into 4 parts:
    no larger elements to left of m
    no smaller elements to right of m
    equal elements to left of p
    equal elements to right of q

    | equal to v | less than v | greater than v | equal to v |
    lo           p             m                q           hi

- Afterwards, swap equal keys into center.

All the right properties.
- Not much code.
- In-place.
- Linear if keys are all equal.
- Small overhead if no equal keys.

3-Way Radix Quicksort

    private static void quicksortX(String a[], int lo, int hi, int d) {
        if (hi - lo <= 0) return;
        int i = lo-1, j = hi, p = lo-1, q = hi;
        char v = a[hi].charAt(d);

        while (i < j) {                                // repeat until pointers cross
            // find i on left and j on right to swap
            while (a[++i].charAt(d) < v) ;
            while (v < a[--j].charAt(d))
                if (j == lo) break;
            if (i > j) break;
            exch(a, i, j);
            // swap equal chars to left or right
            if (a[i].charAt(d) == v) { p++; exch(a, p, i); }
            if (a[j].charAt(d) == v) { q--; exch(a, j, q); }
        }

        if (p == q) {                                  // special case for all equal chars
            if (v != '\0') quicksortX(a, lo, hi, d+1);
            return;
        }

        // swap equal ones back to middle
        if (a[i].charAt(d) < v) i++;
        for (int k = lo; k <= p; k++, j--) exch(a, k, j);
        for (int k = hi; k >= q; k--, i++) exch(a, k, i);

        // sort 3 pieces recursively
        quicksortX(a, lo, j, d);
        if ((i == hi) && (a[i].charAt(d) == v)) i++;
        if (v != '\0') quicksortX(a, j+1, i-1, d+1);
        quicksortX(a, i, hi, d);
    }
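For readers who want something to run, here is a shorter variant of 3-way radix quicksort. It is deliberately not the quicksortX code above: it uses the simpler single-pass three-part partition instead of the four-part scheme, and instead of assuming '\0'-terminated strings it uses a hypothetical helper `charAt(s, d)` that returns -1 past the end of a string. Treat it as a sketch of the same idea under those assumptions.

```java
final class ThreeWayRadixQuicksort {
    // Character d of s, or -1 if s has fewer than d+1 characters.
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int lt = lo, i = lo + 1, gt = hi;
        int v = charAt(a[lo], d);
        while (i <= gt) {                      // 3-way partition on character d
            int c = charAt(a[i], d);
            if (c < v)      exch(a, lt++, i++);
            else if (c > v) exch(a, i, gt--);
            else            i++;
        }
        sort(a, lo, lt - 1, d);                // strings with a smaller character
        if (v >= 0) sort(a, lt, gt, d + 1);    // equal character: move to the next one
        sort(a, gt + 1, hi, d);                // strings with a larger character
    }

    private static void exch(String[] a, int i, int j) {
        String t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

Only the middle piece advances to character d+1, so the shared prefix of a group of strings is examined once rather than on every comparison.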

Significance of 3-Way Partitioning

Equal keys omnipresent in applications when purpose of sort is to bring records with equal keys together.
- Finding collinear points.
- Sort population by age.
- Remove duplicates from mailing list.
- Sort job applicants by college attended.

Typical application.
- Huge file.
- Small number of key values.
- Randomized 3-way quicksort is LINEAR time. (Try it!)

Theorem. Quicksort with 3-way partitioning is OPTIMAL.
Proof. Ties cost to entropy. Beyond scope of 226.

Quicksort vs. 3-Way Radix Quicksort

Quicksort.
- 2N ln N string comparisons on average.
- Long keys are costly to compare if they differ only at the end, and this is the common case!
- Absolutism, absolut, absolutely, absolute.

3-way radix quicksort.
- Avoids re-comparing initial parts of the string.
- Uses just "enough" characters to resolve order.
- 2N ln N character comparisons on average for random strings.
- Sub-linear sort for large W since input is of size NW.
String Sorting Performance

    String Sort              Worst Case    Suffix Sort, Moby Dick (sec)
    Brute                    W N^2         36,000
    Quicksort                W N log N     9.5
    LSD *                    W(N + R)      -
    MSD                      W(N + R)      395
    MSD with cutoff          W(N + R)      6.8
    3-Way Radix Quicksort    W N log N     2.8

R = radix.
W = max length of string.
N = number of strings.
* fixed length strings only.
Footnote markers: estimate; probabilistic guarantee.

Suffix Sorting: Worst Case Input

Length of longest match small.
- 3-way radix quicksort rules!

Length of longest match very long.
- 3-way radix quicksort is quadratic.
- Two copies of Moby Dick.

Can we do better?
- Θ(N log N) ?
- Θ(N) ?

Observation. Must find longest repeated substring WHILE suffix sorting to beat N^2.

Input: "abcdefghiabcdefghi"

    abcdefghi
    abcdefghiabcdefghi
    bcdefghi
    bcdefghiabcdefghi
    cdefghi
    cdefghiabcdefghi
    defghi
    defghiabcdefghi
    efghi
    efghiabcdefghi
    fghi
    fghiabcdefghi
    ghi
    ghiabcdefghi
    hi
    hiabcdefghi
    i
    iabcdefghi

Suffix Sorting in N log N Time: Key Idea

Input: "babaaaabcbabaaaaa"

Left: each suffix, followed by the end marker 0 and wrapped around to full length. Right: the same strings sorted on their first 4 characters only, so the three strings that begin with "aaaa" (3, 12, 13) are not yet in their final order.

     0  babaaaabcbabaaaaa0      17  0babaaaabcbabaaaaa
     1  abaaaabcbabaaaaa0b      16  a0babaaaabcbabaaaa
     2  baaaabcbabaaaaa0ba      15  aa0babaaaabcbabaaa
     3  aaaabcbabaaaaa0bab      14  aaa0babaaaabcbabaa
     4  aaabcbabaaaaa0baba       3  aaaabcbabaaaaa0bab
     5  aabcbabaaaaa0babaa      12  aaaaa0babaaaabcbab
     6  abcbabaaaaa0babaaa      13  aaaa0babaaaabcbaba
     7  bcbabaaaaa0babaaaa       4  aaabcbabaaaaa0baba
     8  cbabaaaaa0babaaaab       5  aabcbabaaaaa0babaa
     9  babaaaaa0babaaaabc       1  abaaaabcbabaaaaa0b
    10  abaaaaa0babaaaabcb      10  abaaaaa0babaaaabcb
    11  baaaaa0babaaaabcba       6  abcbabaaaaa0babaaa
    12  aaaaa0babaaaabcbab       2  baaaabcbabaaaaa0ba
    13  aaaa0babaaaabcbaba      11  baaaaa0babaaaabcba
    14  aaa0babaaaabcbabaa       0  babaaaabcbabaaaaa0
    15  aa0babaaaabcbabaaa       9  babaaaaa0babaaaabc
    16  a0babaaaabcbabaaaa       7  bcbabaaaaa0babaaaa
    17  0babaaaabcbabaaaaa       8  cbabaaaaa0babaaaab

Suffix Sorting in N log N Time

Manber's MSD algorithm.
- Phase 0: sort on first character using key-indexed sorting.
- Phase n: given list of suffixes sorted on first n characters, create list of suffixes sorted on first 2n characters.
- Finishes after lg N phases.

Manber's LSD algorithm.
- Same idea but go from right to left.
- O(N log N) guaranteed running time.
- O(N) extra space.
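The doubling idea in the phases above can be illustrated with a compact sketch. This is not Manber's algorithm as such: to stay short it re-sorts with a general-purpose comparison sort in every phase, so it runs in O(N log^2 N) rather than O(N log N), and the class and variable names are assumptions. What it does show is the key step: once suffixes are ranked by their first h characters, the pair (rank of suffix i, rank of suffix i+h) orders them by their first 2h characters.

```java
import java.util.Arrays;
import java.util.Comparator;

final class PrefixDoublingSuffixSort {
    // Returns the start indices of the suffixes of s in sorted order.
    static int[] suffixArray(String s) {
        final int n = s.length();
        Integer[] order = new Integer[n];
        int[] rank = new int[n];
        for (int i = 0; i < n; i++) { order[i] = i; rank[i] = s.charAt(i); }

        for (int h = 1; n > 1; h *= 2) {
            final int step = h;
            final int[] r = rank;   // ranks by the first h characters
            // Compare by (rank of first h chars, rank of the next h chars);
            // a suffix that ends early gets -1, which sorts first.
            Comparator<Integer> byFirst2h = (x, y) -> {
                if (r[x] != r[y]) return Integer.compare(r[x], r[y]);
                int rx = x + step < n ? r[x + step] : -1;
                int ry = y + step < n ? r[y + step] : -1;
                return Integer.compare(rx, ry);
            };
            Arrays.sort(order, byFirst2h);

            // Re-rank: equal pairs share a rank, each new pair gets the next one.
            int[] next = new int[n];
            for (int i = 1; i < n; i++) {
                boolean differs = byFirst2h.compare(order[i - 1], order[i]) != 0;
                next[order[i]] = next[order[i - 1]] + (differs ? 1 : 0);
            }
            rank = next;
            if (rank[order[n - 1]] == n - 1) break;   // all ranks distinct: done
        }

        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = order[i];
        return result;
    }
}
```

For example, `PrefixDoublingSuffixSort.suffixArray("banana")` yields the order 5, 3, 1, 0, 4, 2 (a, ana, anana, banana, na, nana). Replacing the comparison sort with radix passes on the rank pairs is what brings each phase down to linear time.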
String Sorting Performance

                                           Suffix Sort (seconds)
    String Sort              Worst Case    Moby Dick    AesopAesop
    Brute                    W N^2         36,000       3,990
    Quicksort                W N log N     9.5          167
    LSD *                    W(N + R)      -            -
    MSD                      W(N + R)      395          memory
    MSD with cutoff          W(N + R)      6.8          162
    3-Way Radix Quicksort    W N log N     2.8          400
    Manber                   N log N       17           8.5

R = radix.
W = max length of string.
N = number of strings.
* fixed length strings only.
Footnote markers: estimate; probabilistic guarantee; suffix sorting only.
