

Sorting is one of the most common operations performed by a computer. Because sorted data are easier to manipulate than randomly-ordered data, many algorithms require sorted data. Sorting is of additional importance to parallel computing because of its close relation to the task of routing data among processes, which is an essential part of many parallel algorithms. Many parallel sorting algorithms have been investigated for a variety of parallel computer architectures.

Sorting algorithms are categorized as internal or external. In internal sorting, the number of elements to be sorted is small enough to fit into the process's main memory. In contrast, external sorting algorithms use auxiliary storage (such as tapes and hard disks) for sorting because the number of elements to be sorted is too large to fit into memory. This chapter concentrates on internal sorting algorithms only. Sorting algorithms can also be categorized as comparison-based and noncomparison-based. A comparison-based algorithm sorts an unordered sequence of elements by repeatedly comparing pairs of elements and, if they are out of order, exchanging them. This fundamental operation of comparison-based sorting is called compare-exchange. The lower bound on the sequential complexity of any comparison-based sorting algorithm is Θ(n log n), where n is the number of elements to be sorted. Noncomparison-based algorithms sort by using certain known properties of the elements (such as their binary representation or their distribution). The lower-bound complexity of these algorithms is Θ(n). We concentrate on comparison-based sorting algorithms.

Issues in Sorting on Parallel Computers

Parallelizing a sequential sorting algorithm involves distributing the elements to be sorted onto the available processes. This process raises a number of issues that we must address in order to make the presentation of parallel sorting algorithms clearer.

Where the Input and Output Sequences are Stored

In sequential sorting algorithms, the input and the sorted sequences are stored in the process's memory. However, in parallel sorting there are two places where these sequences can reside. They may be stored on only one of the processes, or they may be distributed among the processes. We assume that the input and sorted sequences are distributed among the processes.

Consider the precise distribution of the sorted output sequence among the processes. A general method of distribution is to enumerate the processes and use this enumeration to specify a global ordering for the sorted sequence. In other words, the sequence will be sorted with respect to this process enumeration. For instance, if Pi comes before Pj in the enumeration, all the elements stored in Pi will be smaller than those stored in Pj.

How Comparisons are Performed

A sequential sorting algorithm can easily perform a compare-exchange on two elements because they are stored locally in the process's memory. In parallel sorting algorithms, this step is not so easy. If the elements reside on the same process, the comparison can be done easily. But if the elements reside on different processes, the situation becomes more complicated.

One Element Per Process

Consider the case in which each process holds only one element of the sequence to be sorted. At some point in the execution of the algorithm, a pair of processes (Pi, Pj) may need to compare their elements, ai and aj. After the comparison, Pi will hold the smaller and Pj the larger of {ai, aj}. We can perform comparison by having both processes send their elements to each other. Each process compares the received element with its own and retains the appropriate element. In our example, Pi will keep the smaller and Pj will keep the larger of {ai, aj}. As in the sequential case, we refer to this operation as compare-exchange.

Figure: A parallel compare-exchange operation. Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai, aj}, and Pj keeps max{ai, aj}.
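The compare-exchange step described above can be sketched in a few lines of Python. This is only an illustration: the function name `compare_exchange` is hypothetical, and the message exchange between Pi and Pj is modeled simply by passing both values to one function.

```python
def compare_exchange(a_i, a_j):
    """Simulate one compare-exchange between processes Pi and Pj.

    Assumption: the two sends are modeled by both values being visible here.
    Returns (value kept by Pi, value kept by Pj).
    """
    kept_by_pi = min(a_i, a_j)  # Pi retains the smaller element
    kept_by_pj = max(a_i, a_j)  # Pj retains the larger element
    return kept_by_pi, kept_by_pj
```

In a real message-passing program each process would perform its own comparison after receiving the other's element; the result is the same.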

More than One Element Per Process

A general-purpose parallel sorting algorithm must be able to sort a large sequence with a relatively small number of processes. Let p be the number of processes P0, P1, ..., Pp-1, and let n be the number of elements to be sorted. Each process is assigned a block of n/p elements, and all the processes cooperate to sort the sequence. Let A0, A1, ..., Ap-1 be the blocks assigned to processes P0, P1, ..., Pp-1, respectively. We say that Ai ≤ Aj if every element of Ai is less than or equal to every element in Aj. When the sorting algorithm finishes, each process Pi holds a block A'i such that A'i ≤ A'j for i ≤ j.

As in the one-element-per-process case, two processes Pi and Pj may have to redistribute their blocks of n/p elements so that one of them will get the smaller n/p elements and the other will get the larger n/p elements. Let Ai and Aj be the blocks stored in processes Pi and Pj. If the block of n/p elements at each process is already sorted, the redistribution can be done efficiently as follows. Each process sends its block to the other process. Now, each process merges the two sorted blocks and retains only the appropriate half of the merged block. We refer to this operation of comparing and splitting two sorted blocks as compare-split.

Figure: A compare-split operation. Each process sends its block of size n/p to the other process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.
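A minimal sketch of compare-split, assuming both blocks are already sorted and have the same length. The name `compare_split` is hypothetical, and the two processes are simulated inside one function rather than with real message passing.

```python
from heapq import merge


def compare_split(block_i, block_j):
    """Simulate a compare-split between Pi and Pj.

    Assumes both blocks are sorted and of equal size n/p.
    Returns (block kept by Pi, block kept by Pj).
    """
    merged = list(merge(block_i, block_j))  # linear-time merge of two sorted blocks
    half = len(block_i)
    return merged[:half], merged[half:]     # Pi keeps the smaller half, Pj the larger
```

For example, `compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])` leaves Pi with the five smallest elements and Pj with the five largest.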

Sorting Networks
In the quest for fast sorting methods, a number of networks have been designed that sort n elements in time significantly smaller than Θ(n log n). These sorting networks are based on a comparison network model, in which many comparison operations are performed simultaneously.

The key component of these networks is a comparator. A comparator is a device with two inputs x and y and two outputs x' and y'. For an increasing comparator, x' = min{x, y} and y' = max{x, y}; for a decreasing comparator x' = max{x, y} and y' = min{x, y}. As the two elements enter the input wires of the comparator, they are compared and, if necessary, exchanged before they go to the output wires. We denote an increasing comparator by ⊕ and a decreasing comparator by ⊖. A sorting network is usually made up of a series of columns, and each column contains a number of comparators connected in parallel. Each column of comparators performs a permutation, and the output obtained from the final column is sorted in increasing or decreasing order.

Figure: A schematic representation of comparators: (a) an increasing comparator, and (b) a decreasing comparator.

Figure 9.4. A typical sorting network. Every sorting network is made up of a series of columns, and each column contains a number of comparators connected in parallel.
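To make the column structure concrete, here is a small sketch that represents a sorting network as a list of columns, each column being a list of wire pairs joined by increasing comparators. The four-input network used here is a well-known small example chosen for illustration, not one taken from the figures; all names are hypothetical.

```python
from itertools import permutations


def comparator(x, y, increasing=True):
    """Return (x', y') for an increasing or a decreasing comparator."""
    lo, hi = min(x, y), max(x, y)
    return (lo, hi) if increasing else (hi, lo)


def run_network(columns, values):
    """Feed values through the columns of increasing comparators, left to right."""
    wires = list(values)
    for column in columns:
        # Comparators in one column touch disjoint wires, so in hardware
        # they would all operate at the same time.
        for i, j in column:
            wires[i], wires[j] = comparator(wires[i], wires[j])
    return wires


# Illustrative 4-input sorting network with three columns (an assumption,
# not one of the networks shown in the figures).
FOUR_INPUT_NETWORK = [
    [(0, 1), (2, 3)],
    [(0, 2), (1, 3)],
    [(1, 2)],
]


def sorts_all_permutations(columns, n):
    """Check the network on every ordering of n distinct inputs."""
    target = list(range(n))
    return all(run_network(columns, p) == target for p in permutations(range(n)))
```

The depth of the network (three columns here) corresponds to the parallel running time, since each column takes one comparison step.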

Bitonic Sort
A bitonic sorting network sorts n elements in Θ(log² n) time. The key operation of the bitonic sorting network is the rearrangement of a bitonic sequence into a sorted sequence. A bitonic sequence is a sequence of elements <a0, a1, ..., an-1> with the property that either (1) there exists an index i, 0 ≤ i ≤ n - 1, such that <a0, ..., ai> is monotonically increasing and <ai+1, ..., an-1> is monotonically decreasing, or (2) there exists a cyclic shift of indices so that (1) is satisfied.

Let s = <a0, a1, ..., an-1> be a bitonic sequence such that a0 ≤ a1 ≤ ... ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ... ≥ an-1. Consider the following subsequences of s:

s1 = <min{a0, an/2}, min{a1, an/2+1}, ..., min{an/2-1, an-1}>
s2 = <max{a0, an/2}, max{a1, an/2+1}, ..., max{an/2-1, an-1}>

Both s1 and s2 are bitonic, and every element of s1 is less than or equal to every element of s2. Thus, we have reduced the initial problem of rearranging a bitonic sequence of size n to that of rearranging two smaller bitonic sequences and concatenating the results. We refer to the operation of splitting a bitonic sequence of size n into the two bitonic sequences s1 and s2 as a bitonic split. Although in obtaining s1 and s2 we assumed that the original sequence had increasing and decreasing sequences of the same length, the bitonic split operation also holds for any bitonic sequence. We can recursively obtain shorter bitonic sequences by applying the bitonic split to each of the bitonic subsequences until we obtain subsequences of size one. At that point, the output is sorted in monotonically increasing order. Since after each bitonic split operation the size of the problem is halved, the number of splits required to rearrange the bitonic sequence into a sorted sequence is log n. The procedure of sorting a bitonic sequence using bitonic splits is called bitonic merge.

Example: Merging a 16-element bitonic sequence through a series of log 16 bitonic splits.
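The bitonic split and bitonic merge can be sketched recursively as follows. This is a sequential illustration under the assumption that the input is bitonic and its length is a power of two; the function names and the sample sequence are chosen here for illustration.

```python
def bitonic_split(seq):
    """Split a bitonic sequence into two bitonic halves s1 and s2.

    Every element of s1 is <= every element of s2.
    Assumes len(seq) is even.
    """
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2


def bitonic_merge(seq):
    """Rearrange a bitonic sequence into increasing order.

    Assumes len(seq) is a power of two; uses log n levels of splits.
    """
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    return bitonic_merge(s1) + bitonic_merge(s2)


# A 16-element bitonic sequence (illustrative input): it rises to 95, then falls.
SAMPLE = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
```

Calling `bitonic_merge(SAMPLE)` performs log 16 = 4 levels of splits, and the result equals `sorted(SAMPLE)`.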

We now have a method for merging a bitonic sequence into a sorted sequence. This method is easy to implement on a network of comparators, known as a bitonic merging network. The network contains log n columns. Each column contains n/2 comparators and performs one step of the bitonic merge. This network takes as input the bitonic sequence and outputs the sequence in sorted order. We denote a bitonic merging network with n inputs by ⊕BM[n]. If we replace the increasing (⊕) comparators in the network by decreasing (⊖) comparators, the input will be sorted in monotonically decreasing order; such a network is denoted by ⊖BM[n].

Example: A bitonic merging network for n = 16. The input wires are numbered 0, 1, ..., n - 1, and the binary representation of these numbers is shown. Each column of comparators is drawn separately; the entire figure represents a ⊕BM[16] bitonic merging network. The network takes a bitonic sequence and outputs it in sorted order.

Example: Figure 9.8. The comparator network that transforms an input sequence of 16 unordered numbers into a bitonic sequence. The columns of comparators in each bitonic merging network are drawn in a single box, separated by a dashed line.
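Putting the pieces together, an unordered sequence can be sorted by first building a bitonic sequence (sort one half increasing and the other half decreasing) and then applying a bitonic merge. The sketch below is a sequential simulation under the assumption that the length is a power of two; the names are hypothetical and the recursion stands in for the columns of the network.

```python
def bitonic_merge(seq, increasing=True):
    """Merge a bitonic sequence into increasing or decreasing order."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    # For decreasing order the larger half must come first.
    first, second = (lo, hi) if increasing else (hi, lo)
    return bitonic_merge(first, increasing) + bitonic_merge(second, increasing)


def bitonic_sort(seq, increasing=True):
    """Sort an arbitrary sequence whose length is a power of two."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    left = bitonic_sort(seq[:half], True)    # increasing half
    right = bitonic_sort(seq[half:], False)  # decreasing half
    # left + right is bitonic, so one bitonic merge finishes the job.
    return bitonic_merge(left + right, increasing)
```

There are log n merge stages, and stage k uses k columns of comparators, which is where the Θ(log² n) depth of the bitonic sorting network comes from.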


We describe a parallel sorting algorithm for an SIMD computer where the processors are connected to form a linear array.
- The algorithm uses n processors P1, P2, ..., Pn to sort the sequence S = {s1, s2, ..., sn}.
- At any time during the execution of the algorithm, processor Pi holds one element of the input sequence; we denote this element by xi for all 1 ≤ i ≤ n. Initially xi = si. It is required that, upon termination, xi be the ith element of the sorted sequence.
- The algorithm consists of two steps that are performed repeatedly. In the first step, all odd-numbered processors Pi obtain xi+1 from Pi+1.
- If xi > xi+1, then Pi and Pi+1 exchange the elements they held at the beginning of this step.
- In the second step, all even-numbered processors perform the same operations as did the odd-numbered ones in the first step.
After ⌈n/2⌉ repetitions of these two steps in this order, no further exchanges of elements can take place. Hence the algorithm terminates with xi ≤ xi+1 for all 1 ≤ i ≤ n - 1. We refer to this algorithm as procedure ODD-EVEN TRANSPOSITION.

Example: Let S = {6, 5, 9, 2, 4, 3, 5, 1, 7, 5, 8}. During the execution of procedure ODD-EVEN TRANSPOSITION on this input, a sorted sequence is produced after four iterations of steps 1 and 2, yet two more (redundant) iterations are performed, for a total of ⌈11/2⌉ = 6, as required by the procedure's statement.
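The two alternating steps can be simulated sequentially as in the sketch below. The function name is hypothetical, the list index i - 1 stands for processor Pi, and each inner loop models what all processors of one parity would do simultaneously on the linear array.

```python
import math


def odd_even_transposition(s):
    """Sequential simulation of odd-even transposition sort on a linear array.

    List index i - 1 plays the role of processor Pi (1-based in the text).
    """
    x = list(s)
    n = len(x)
    for _ in range(math.ceil(n / 2)):
        # Step 1: odd-numbered processors P1, P3, ... compare with their right neighbor.
        for i in range(0, n - 1, 2):
            if x[i] > x[i + 1]:
                x[i], x[i + 1] = x[i + 1], x[i]
        # Step 2: even-numbered processors P2, P4, ... do the same.
        for i in range(1, n - 1, 2):
            if x[i] > x[i + 1]:
                x[i], x[i + 1] = x[i + 1], x[i]
    return x
```

Applied to the 11-element sequence of the example, the simulation runs six iterations of the two steps and returns the sequence in nondecreasing order.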

Analysis. Each of steps 1 and 2 consists of one comparison and two routing operations and hence requires constant time. These two steps are executed ⌈n/2⌉ times. The running time of procedure ODD-EVEN TRANSPOSITION is therefore t(n) = O(n). Since p(n) = n, the procedure's cost is given by c(n) = p(n) × t(n) = O(n²), which is not optimal. From this analysis, procedure ODD-EVEN TRANSPOSITION does not appear to be too attractive. Indeed, (i) with respect to procedure QUICKSORT, it achieves a speedup of only O(log n), (ii) it uses a number of processors equal to the size of the input, which is unreasonable, and (iii) it is not cost optimal.

The only redeeming feature of procedure ODD-EVEN TRANSPOSITION seems to be its extreme simplicity. We are therefore tempted to salvage its basic idea in order to obtain a new algorithm with optimal cost. There are two obvious ways of doing this: either (1) reduce the running time or (2) reduce the number of processors used. The first approach is hopeless: the running time of procedure ODD-EVEN TRANSPOSITION is the smallest possible achievable on a linear array with n processors. To see this, assume that the largest element in S is initially in P1 and must therefore move n - 1 steps across the linear array before settling in its final position in Pn. This requires Ω(n) time.

Cost Optimal Sorting On Linear Array

Now consider the second approach. If N processors, where N < n, are available, then they can simulate the algorithm in n × t(n)/N time. The cost remains n × t(n), which as we know is not optimal. A more subtle simulation, however, allows us to achieve cost optimality. Assume that each of the N processors in the linear array holds a subsequence of S of length n/N. (It may be necessary to add some dummy elements to S if n is not a multiple of N.) In the new algorithm, the comparison-exchange operations of procedure ODD-EVEN TRANSPOSITION are replaced with merge-split operations on subsequences.

Basic Idea

Let Si denote the subsequence held by processor Pi. Initially, the Si are random subsequences of S. In step 1, each Pi sorts Si using procedure QUICKSORT. In step 2.1 each odd-numbered processor Pi merges the two subsequences Si and Si+1 into a sorted sequence Si'. It retains the first half of Si' and assigns to its neighbor Pi+1 the second half. Step 2.2 is identical to 2.1 except that it is performed by all even-numbered processors. Steps 2.1 and 2.2 are repeated alternately. After ⌈N/2⌉ iterations no further exchange of elements can take place between two processors.

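Example: Let S = {8, 2, 5, 10, 1, 7, 3, 12, 6, 11, 4, 9} and N = 4, so that each processor holds a subsequence of three elements. The contents of the processors change during the execution of procedure MERGE SPLIT until P1, P2, P3, and P4 hold {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, and {10, 11, 12}, respectively.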
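The merge-split scheme can be simulated sequentially as sketched below. Several things here are assumptions made for illustration: the function name, the contiguous initial distribution of S among the processors, the use of Python's built-in `sorted` in place of procedure QUICKSORT, and `heapq.merge` in place of SEQUENTIAL MERGE. The length of S is assumed to be a multiple of N.

```python
import math
from heapq import merge


def merge_split_sort(s, num_procs):
    """Simulate merge-split sorting on a linear array of num_procs processors.

    Assumes len(s) is a multiple of num_procs. Returns one block per processor.
    """
    size = len(s) // num_procs
    # Step 1: every processor sorts its own subsequence (sorted() stands in for QUICKSORT).
    blocks = [sorted(s[i * size:(i + 1) * size]) for i in range(num_procs)]

    def merge_split(i):
        # Processor at index i merges its block with its right neighbor's block,
        # keeps the first half, and hands the second half back.
        merged = list(merge(blocks[i], blocks[i + 1]))
        blocks[i], blocks[i + 1] = merged[:size], merged[size:]

    for _ in range(math.ceil(num_procs / 2)):
        for i in range(0, num_procs - 1, 2):  # step 2.1: odd-numbered P1, P3, ...
            merge_split(i)
        for i in range(1, num_procs - 1, 2):  # step 2.2: even-numbered P2, P4, ...
            merge_split(i)
    return blocks
```

With the sequence of the example and four processors, the simulation performs ⌈4/2⌉ = 2 iterations of steps 2.1 and 2.2.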

Analysis. Step 1 requires O((n/N) log(n/N)) steps. Transferring Si+1 to Pi, merging by SEQUENTIAL MERGE, and returning Si+1 to Pi+1 all require O(n/N) time. The total running time of procedure MERGE SPLIT is therefore
t(n) = O((n/N) log(n/N)) + ⌈N/2⌉ × O(n/N) = O((n log n)/N) + O(n),
and its cost is
c(n) = O(n log n) + O(nN),
which is optimal when N ≤ log n.


We attempt to deal with two of the objections raised with regard to procedure CRCW SORT: its excessive use of processors and its tolerance of write conflicts. Our purpose is to design an algorithm that is free of write conflicts and uses a reasonable number of processors. In addition, we shall require the algorithm to also satisfy our usual desired properties for shared-memory SIMD algorithms.

Thus the algorithm should have (i) a sublinear and adaptive number of processors, (ii) a running time that is small and adaptive, and (iii) a cost that is optimal.

Basic Idea

The idea is quite simple. Assume that a CREW SM SIMD computer with N processors P1, P2, ..., PN is to be used to sort the sequence S = {s1, s2, ..., sn}, where N ≤ n.

We begin by distributing the elements of S evenly among the N processors. Each processor sorts its allocated subsequence sequentially using procedure QUICKSORT. The N sorted subsequences are now merged pairwise, simultaneously, using procedure CREW MERGE for each pair. The resulting subsequences are again merged pairwise and the process continues until one sorted sequence of length n is obtained. We refer to this algorithm as procedure CREW SORT. We denote the initial subsequence of S allocated to processor Pi by Si. Subsequently, analogous notation denotes a subsequence obtained by merging two subsequences and the set of processors that performed the merge.


Analysis. The dominating operation in step 1 is the call to QUICKSORT, which requires O((n/N) log(n/N)) time. During each iteration of step 2.3, ⌊v/2⌋ pairs of subsequences with n/⌊v/2⌋ elements per pair are to be merged simultaneously using N/⌊v/2⌋ processors per pair. Procedure CREW MERGE thus requires O((n/⌊v/2⌋)/(N/⌊v/2⌋) + log(n/⌊v/2⌋)), that is, O((n/N) + log n) time. Since step 2.3 is iterated ⌈log N⌉ times, the total running time of procedure CREW SORT is
t(n) = O((n/N) log(n/N)) + O((n/N) log N + log n log N) = O((n/N) log n + log² n).

Since p(n) = N, the procedure's cost is given by c(n) = O(n log n + N log² n), which is optimal for N ≤ n/log n.
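The overall structure of the approach (local sort, then rounds of pairwise merging) can be sketched sequentially as follows. This is not procedure CREW SORT itself: `sorted` stands in for QUICKSORT, `heapq.merge` stands in for the parallel procedure CREW MERGE, and the function name and the returned round counter are assumptions added for illustration.

```python
from heapq import merge


def pairwise_merge_sort(s, num_procs):
    """Sort s by local sorting followed by rounds of pairwise merging.

    Returns (sorted list, number of merging rounds).
    """
    n = len(s)
    if n == 0:
        return [], 0
    size = -(-n // num_procs)  # ceiling of n / num_procs elements per processor
    # Step 1: each processor sorts its own subsequence.
    runs = [sorted(s[i:i + size]) for i in range(0, n, size)]
    rounds = 0
    # Step 2: merge neighboring runs pairwise until a single run remains.
    while len(runs) > 1:
        merged_runs = [list(merge(runs[k], runs[k + 1]))
                       for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2 == 1:
            merged_runs.append(runs[-1])  # an unpaired run is carried over unchanged
        runs = merged_runs
        rounds += 1
    return runs[0], rounds
```

With 27 elements and 5 processors the simulation starts from 5 sorted runs and needs 3 merging rounds (5 runs, then 3, then 2, then 1), in line with the ⌈log N⌉ iterations used in the analysis.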


Two of the criticisms expressed with regard to procedure CRCW SORT were addressed by procedure CREW SORT, which adapts to the number of existing processors and disallows multiple-write operations into the same memory location. Still, procedure CREW SORT tolerates multiple-read operations. Our purpose in this section is to deal with this third difficulty.

Basic Idea

The idea is to adapt the sequential procedure QUICKSORT to run on a parallel computer. We begin by noting that, since N < n, we can write N = n^(1-x), where 0 < x < 1.

Let k = 2^⌈1/x⌉. The algorithm is given as procedure EREW SORT:


Let S = {5,9,12,16,18,2,10,13,17,4,7,18,18,11,3,17,20,19,14,8,5,17,1,11,15, 10, 6} (i.e., n = 27) and let five processors P1, P2, P3, P4, P5 be available on an EREW SM-SIMD computer (i.e., N = 5).

Figure: The steps of procedure EREW SORT applied to S, including the recursive calls used to sort S3 and S4.

Analysis. The call to QUICKSORT takes constant time. From the analysis of procedure PARALLEL SELECT, steps 1-4 require cn^x time units for some constant c. The running time of procedure EREW SORT is therefore
t(n) = cn^x + 2t(n/k) = O(n^x log n).
Since p(n) = n^(1-x), the procedure's cost is given by c(n) = p(n) × t(n) = O(n log n), which is optimal. Note, however, that since n^(1-x) < n/log n, cost optimality is restricted to the range N < n/log n.

SORTING ON CRCW MODEL

Whenever an algorithm is to be designed for the CRCW model of computation, one must specify how write conflicts, that is, multiple attempts to write into the same memory location, can be resolved. For the purposes of the sorting algorithm to be described, we shall assume that write conflicts are created whenever several processors attempt to write potentially different integers into the same address. The conflict is resolved by storing the sum of these integers in that address.

Basic Idea

Assume that n² processors are available on such a CRCW computer to sort the sequence S = {s1, s2, ..., sn}. The sorting algorithm to be used is based on the idea of sorting by enumeration: the position of each element si of S in the sorted sequence is determined by computing ci, the number of elements smaller than it. If two elements si and sj are equal, then si is taken to be the larger of the two if i > j; otherwise sj is the larger. Once all the ci have been computed, si is placed in position 1 + ci of the sorted sequence. We assume that the processors are arranged into n rows of n elements each and are numbered P(i, j), where i is the row and j is the position within the row. The shared memory contains two arrays: the input sequence is stored in array S, while the counts ci are stored in array C. The sorted sequence is returned in array S. The ith row of processors is "in charge" of element si: processors P(i, 1), P(i, 2), ..., P(i, n) compute ci and store si in position 1 + ci of S. The algorithm is given as procedure CRCW SORT:

Example: Let S = {5, 2, 4, 5}. Each of the 16 processors compares two elements of S, and the contents of arrays S and C change after each step of procedure CRCW SORT. The counts are c1 = 2, c2 = 0, c3 = 1, and c4 = 3, so the sorted sequence returned in S is {2, 4, 5, 5}.

(ii) the write conflict resolution process is itself very powerful: all numbers to be stored in a memory location are added and stored in constant time; and (iii) it uses a very large number of processors; that is, the number of processors grows quadratically with the size of the input. For these reasons, particularly the last one, the algorithm is most likely to be of no great practical value. Nevertheless, procedure CRCW SORT is interesting in its own right: it demonstrates how sorting can be accomplished in constant time on a sufficiently powerful model.
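Sorting by enumeration, as described above, can be simulated sequentially with two nested loops. In this sketch (the function name is hypothetical) the loop body for a given pair (i, j) plays the role of processor P(i, j), and incrementing `c[i]` models the sum-resolving concurrent write; a real CRCW machine would perform all n² comparisons at once.

```python
def enumeration_sort(s):
    """Sequential simulation of sorting by enumeration.

    Returns (sorted list, list of counts c), using 0-based indices.
    """
    n = len(s)
    c = [0] * n
    for i in range(n):      # row i of processors is in charge of s[i]
        for j in range(n):  # processor P(i, j) compares s[i] with s[j]
            # s[i] counts as larger than s[j] if it is greater, or if the two
            # are equal and i > j (the tie-breaking rule from the text).
            if s[i] > s[j] or (s[i] == s[j] and i > j):
                c[i] += 1   # concurrent writes to c[i] are resolved by summation
    result = [None] * n
    for i in range(n):
        result[c[i]] = s[i]  # 0-based slot c[i] is position 1 + c_i in the text
    return result, c
```

The tie-breaking rule makes all counts distinct, so no two elements are ever written to the same position of the output.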


A CREW SM SIMD computer consists of N processors P1, P2, ..., PN. It is required to design a parallel algorithm for this computer that takes the two sequences A and B as input and produces the sequence C as output, as defined earlier. Without loss of generality, we assume that r ≤ s, where r and s are the lengths of A and B, respectively. It is desired that the parallel algorithm satisfy the following properties: (i) the number of processors used by the algorithm be sublinear and adaptive, (ii) the running time of the algorithm be adaptive and significantly smaller than that of the best sequential algorithm, and (iii) the cost be optimal. We now describe an algorithm that satisfies these properties. It uses N processors, where N ≤ r, and in the worst case, when r = s = n, runs in O((n/N) + log n) time. The algorithm is therefore cost optimal for N ≤ n/log n.

In addition to the basic arithmetic and logic functions usually available, each of the N processors is assumed capable of performing the following two sequential procedures:
1. Procedure SEQUENTIAL MERGE
2. Procedure BINARY SEARCH

Basic Idea

Procedure BINARY SEARCH takes as input a sequence S = {s1, s2, ..., sn} of numbers sorted in nondecreasing order and a number x. If x belongs to S, the procedure returns the index k of an element sk in S such that x = sk. Otherwise, the procedure returns a zero. Binary search is based on the divide-and-conquer principle. At each stage, a comparison is performed between x and an element of S. Either the two are equal and the procedure terminates, or half of the elements of the sequence under consideration are discarded. The process continues until the number of elements left is 0 or 1, and after at most one additional comparison the procedure terminates.

Since the number of elements under consideration is reduced by one-half at each step, the procedure requires O(log n) time in the worst case. We are now ready to describe our first parallel merging algorithm for a shared-memory computer. The algorithm is presented as procedure CREW MERGE.
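Before turning to the merge itself, the BINARY SEARCH helper described above can be sketched as follows. The function name is chosen here for illustration; the return convention (1-based index k, or 0 when x is absent) follows the description in the text.

```python
def binary_search(s, x):
    """Search the nondecreasing list s for x.

    Returns a 1-based index k with s[k - 1] == x, or 0 if x is not in s.
    """
    lo, hi = 0, len(s) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if s[mid] == x:
            return mid + 1  # convert to the 1-based index used in the text
        if s[mid] < x:
            lo = mid + 1    # discard the lower half
        else:
            hi = mid - 1    # discard the upper half
    return 0
```

Each pass through the loop halves the range still under consideration, which gives the O(log n) worst-case bound.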


In steps 3.1 and 3.2, Q(1) = (1, 1), Q(2) = (5, 3), Q(3) = (6, 7), and Q(4) = (10, 9) are determined.

In step 3.3 processor P1 begins at elements a1 = 2 and b1 = 1 and merges all elements of A and B smaller than 7, thus creating the subsequence {1, 2, 3, 4, 5, 6} of C. Similarly, processor P2 begins at a5 = 11 and b3 = 7 and merges all elements smaller than 12, thus creating {7, 8, 9, 10, 11}. Processor P3 begins at a6 = 12 and b7 = 14 and creates {12, 13, 14, 15, 16, 17}. Finally P4 begins at a10 = 20 and b9 = 18 and creates {18, 19, 20, 21, 22, 23, 24}.

The resulting sequence C is therefore {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24}.
Analysis. A step-by-step analysis of CREW MERGE follows:
Step 1: With all processors operating in parallel, each processor computes two subscripts. Therefore this step requires constant time.
Step 2: This step consists of two applications of procedure BINARY SEARCH to a sequence of length N - 1, each followed by an assignment statement. This takes O(log N) time.
Step 3: Step 3.1 consists of a constant-time assignment, and step 3.2 requires at most O(log s) time. To analyze step 3.3, we first observe that V contains 2N - 2 elements that divide C into 2N - 1 subsequences with maximum size equal to (⌈r/N⌉ + ⌈s/N⌉). This maximum size occurs if, for example, one element a'i of A' equals an element b'j of B'; then the ⌈r/N⌉ elements smaller than or equal to a'i (and larger than or equal to a'i-1) are also smaller than or equal to b'j, and similarly, the ⌈s/N⌉ elements smaller than or equal to b'j (and larger than or equal to b'j-1) are also smaller than or equal to a'i. In step 3 each processor creates two such subsequences of C whose total size is therefore no larger than 2(⌈r/N⌉ + ⌈s/N⌉), except PN, which creates only one subsequence of C. It follows that procedure SEQUENTIAL MERGE takes at most O((r + s)/N) time.
In the worst case, r = s = n, and since n ≥ N, the algorithm's running time is dominated by the time required by step 3. Thus t(2n) = O((n/N) + log n). Since p(2n) = N, c(2n) = p(2n) × t(2n) = O(n + N log n), and the algorithm is cost optimal when N ≤ n/log n.

Searching is one of the most fundamental operations in the field of computing. It is used in any application where we need to find out whether an element belongs to a list or, more generally, retrieve from a file information associated with that element. In its most basic form the searching problem is stated as follows: Given a sequence S = {s1, s2, ..., sn} of integers and an integer x, it is required to determine whether x = sk for some sk in S. In sequential computing, the problem is solved by scanning the sequence S and comparing x with its successive elements until either an integer equal to x is found or the sequence is exhausted without success. We refer to this as procedure SEQUENTIAL SEARCH. As soon as an sk in S is found such that x = sk, the procedure returns k; otherwise 0 is returned.

In the worst case, the procedure takes O(n) time. This is clearly optimal, since every element of S must be examined (when x is not in S) before declaring failure. Alternatively, if S is sorted in nondecreasing order, then procedure BINARY SEARCH can return the index of an element of S equal to x (or 0 if no such element exists) in O(log n) time.

EREW Searching
An N-processor EREW SM SIMD computer is available to search S for a given element x, where 1 < N ≤ n. First, the value of x must be made known to all processors. This can be done using procedure BROADCAST in O(log N) time. The sequence S is then subdivided into N subsequences of length n/N each, and processor Pi is assigned {s(i-1)(n/N)+1, s(i-1)(n/N)+2, ..., si(n/N)}. All processors now perform procedure BINARY SEARCH on their assigned subsequences. This requires O(log(n/N)) time in the worst case. Since the elements of S are all distinct, at most one processor finds an sk equal to x and returns k. The total time required by this EREW searching algorithm is therefore O(log N) + O(log(n/N)), which is O(log n).
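The EREW search scheme can be simulated sequentially as sketched below. The assumptions are labeled in the code: the function name is hypothetical, n is taken to be a multiple of N, the broadcast of x is modeled by simply passing x to every simulated processor, and the standard-library `bisect_left` stands in for procedure BINARY SEARCH on each block.

```python
from bisect import bisect_left


def erew_search(s, x, num_procs):
    """Simulate searching the sorted, distinct-valued list s with num_procs processors.

    Assumes len(s) is a multiple of num_procs.
    Returns the 1-based index k with s[k - 1] == x, or 0 if x is absent.
    """
    size = len(s) // num_procs
    # Each loop iteration plays the role of one processor working on its own block;
    # on the parallel machine all blocks are searched at the same time.
    for p in range(num_procs):
        lo, hi = p * size, (p + 1) * size
        k = bisect_left(s, x, lo, hi)  # binary search restricted to s[lo:hi]
        if k < hi and s[k] == x:
            return k + 1
    return 0
```

Because every processor only reads its own block, no two processors ever access the same memory location, which is what the EREW model requires.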

Fig: Format of record in file to be searched.

CREW Searching

An N-processor CREW SM SIMD computer is available to search S for a given element x, where 1 < N ≤ n.

The value of x must be made known to all processors. This can be done in constant time, since all processors can read x simultaneously.

Basic Idea

There are N processors and hence an (N + 1)-ary search can be used. At each stage, the sequence is split into N + 1 subsequences of equal length and the N processors simultaneously probe the elements at the boundary between successive subsequences. Let g be the smallest integer such that n ≤ (N + 1)^g - 1, that is, g = ⌈log(n + 1)/log(N + 1)⌉. Then g stages are sufficient to search a sequence of length n for an element equal to an input x. To search a