
Data and File Structures

Unit 13

Unit 13

External Sorting Techniques

Structure:
13.1 Introduction
     Objectives
13.2 External Sorting
     Run lists
     Tape sorting
13.3 Sorting on Disks
13.4 Generating Extended Initial Runs
13.5 Summary
13.6 Terminal Questions
13.7 Answers

13.1 Introduction
In the previous unit, you learnt what external storage devices are and why we need them. We also discussed file structures, which are among the most important storage structures for holding large amounts of data, and you studied three file organisations (sequential, indexed sequential and direct files) together with their processing methods. In unit 11 we discussed sorting data that is held in main memory. In this unit, we discuss how to sort data that is held on external storage.

External sorting is a generic term for a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted does not fit into the main memory of a computing device (usually RAM), so that a slower kind of memory (usually a hard drive) has to be used. Under external sorting we are going to discuss two concepts, run lists and tape sorting, and two algorithms for sorting data held on external storage, tape drive merge sort and polyphase sort. We conclude the unit with a discussion of sorting on disks and of how to generate extended initial runs.

Sikkim Manipal University


Objectives:
After studying this unit, you should be able to:
  state the meaning of external sorting
  explain the concepts of run list and tape sort
  describe tape drive merge sort and polyphase sorting
  explain sorting on disks
  discuss generating extended initial runs

13.2 External Sorting


External sorting is a must when we have to arrange in lexicographic order a very large number of items that cannot fit in main memory. The general strategy in external sorting is to begin by sorting small batches of records from a file in internal memory. These small batches are commonly called run lists. The size of these run lists depends on the amount of internal memory set aside for the sorting.

13.2.1 Run lists
Run lists are the small sorted batches created in external sorting; they are also called initial runs or initial strings. After creation, the run lists are stored in a target file and later retrieved and merged together to form larger run lists. This merging process continues and terminates with the production of a single run list, which is the desired sorted list.
Let us assume we have a buffer that holds m records and an unordered file containing n records, where m is smaller than n. The process begins by picking m records from the file, applying an internal sort to them, and storing the sorted batch in the target file as a run list of size m. Selection sort is recommended if m is less than 15; otherwise heap sort can be used. This generation of run lists continues until all n records have been picked. The run lists discussed here are used by the sorting algorithms in the rest of this unit.

13.2.2 Tape sorting
External sorting works well with magnetic tapes, which are the cheapest choice for external storage. The sequential nature of tapes is not a hurdle for sorting, because each step of the sort processes a significant amount of data in sequence. The major overhead of magnetic tape is the amount of rewinding, but this problem can be handled to some extent if the tape unit can read both forward and backward.
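The run list generation described in sub-section 13.2.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`selection_sort`, `heap_sort`, `generate_run_lists`) and the `small` parameter are hypothetical, the "file" is any iterable of keys, and the threshold of 15 is the one suggested in the text.

```python
import heapq
from itertools import islice


def selection_sort(batch):
    """Sort a small batch in place by repeatedly selecting the least key."""
    for i in range(len(batch)):
        least = min(range(i, len(batch)), key=batch.__getitem__)
        batch[i], batch[least] = batch[least], batch[i]
    return batch


def heap_sort(batch):
    """Sort a larger batch by building a heap and popping keys in order."""
    heap = list(batch)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def generate_run_lists(records, m, small=15):
    """Yield sorted run lists of at most m records each.

    `records` stands in for the unordered file; `small` is the threshold
    below which selection sort is used (an assumption taken from the text).
    """
    source = iter(records)
    while True:
        batch = list(islice(source, m))  # fill the m-record buffer
        if not batch:
            return                        # all n records have been picked
        yield selection_sort(batch) if m < small else heap_sort(batch)
```

In a real external sort each yielded run list would be written to the target file rather than kept in memory; the generator form is used here only to keep the sketch short.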


Tape drive merge sort
Merge sort is so naturally sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements. For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, which avoids rewind time.

Sorting procedure using four tape drives:
1) Divide the data to be sorted in half and put half on each of two tapes.
2) Merge individual pairs of records from the two tapes; write two-record chunks alternately to each of the two output tapes.
3) Merge the two-record chunks from the two output tapes into four-record chunks; write these alternately to the original two input tapes.
4) Merge the four-record chunks into eight-record chunks; write these alternately to the original two output tapes.
5) Repeat until you have one chunk containing all the data, sorted; that is, for about log2 n passes, where n is the number of records.

When M records can be sorted in memory at a time, the procedure starts from runs of M records instead of single records:

I Constructing the runs
1. If tape 1 is not finished:
   I. Read M items (if available) from tape 1.
   II. Sort them in memory.
   III. Write them to tape 3 (these M items are called one run).
2. If tape 1 is not finished:
   I. Read M items (if available) from tape 1.
   II. Sort them in memory.
   III. Write them to tape 4.
3. Repeat steps 1 and 2 until tape 1 is finished.

II Merging runs
1. Merge the runs on tapes 3 and 4 onto tapes 1 and 2:
   I. Take one run from tape 3 and one run from tape 4, merge them, and write the result alternately to tape 1 and tape 2.
   II. Continue in this way until tapes 3 and 4 are finished.
   At the end of this step we have runs of size 2*M on tapes 1 and 2.
2. Merge the runs on tapes 1 and 2 onto tapes 3 and 4. At the end of this step we have runs of size 4*M on tapes 3 and 4.
3. Repeat steps 1 and 2 until we have a single run of size N (the input size).
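The two stages above (constructing the runs, then merging them back and forth between two pairs of tapes) can be simulated with ordinary Python lists standing in for the tapes. This is a minimal sketch, not a tape driver: the function name `four_tape_merge_sort` and the sample data are hypothetical, and each "tape" is simply a list of runs.

```python
import heapq


def four_tape_merge_sort(records, m):
    """Simulate the four-tape merge sort; return (sorted list, merge passes)."""
    # Stage I: cut the input (tape 1) into sorted runs of at most m records
    # and write them alternately to two tapes (tapes 3 and 4 in the text).
    runs = [sorted(records[i:i + m]) for i in range(0, len(records), m)]
    tape_a, tape_b = runs[0::2], runs[1::2]

    # Stage II: merge one run from each input tape and write the longer
    # run alternately to the two output tapes; then swap the tape pairs.
    passes = 0
    while len(tape_a) + len(tape_b) > 1:
        out = ([], [])
        for i, run in enumerate(tape_a):
            pair = [run] + ([tape_b[i]] if i < len(tape_b) else [])
            out[i % 2].append(list(heapq.merge(*pair)))
        tape_a, tape_b = out
        passes += 1
    return (tape_a[0] if tape_a else []), passes


if __name__ == "__main__":
    # 13 hypothetical keys, m = 3, as in the example that follows.
    data = [81, 94, 11, 96, 12, 35, 17, 99, 28, 58, 41, 75, 15]
    print(four_tape_merge_sort(data, 3))
```

With 13 records and m = 3 the sketch starts from five runs and needs three merge passes, because each pass roughly halves the number of runs.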

We will now discuss the four-tape merge sort with an example. T1 is an unordered list with 13 elements, and we assume m = 3.

T3 and T4 are constructed by picking 3 elements at a time from T1, sorting them, and storing the resulting runs alternately on T3 and T4.

In pass 1, runs are picked alternately from T3 and T4, merged, and stored on T1 and T2. In pass 2, the runs on T1 and T2 are in turn picked, merged, and stored on T3 and T4.

At the end of pass 3 the process terminates, since all the elements are now sorted on a single tape.

Polyphase sorting
Polyphase is one of the popular methods for external sorting. The basic idea is to distribute ordered initial runs of predetermined size on the available tapes and then to merge these runs repeatedly in multiple phases, where each phase performs a predetermined number of merges before a new target tape is selected. The distribution of the runs on the working tapes may affect the performance of the sort. Simple balanced merging can be used here; it requires the runs to be distributed evenly on T-1 of the tapes. The merged runs are written to tape T, and remerging takes place with the T-1 working tapes until the final run is formed on T. It has been found, however, that a Fibonacci distribution of initial runs provides better performance.
In the polyphase merge with a Fibonacci distribution, merging continues until the tape with the least number of runs is empty. When this occurs the tapes are logically rotated: the recently emptied tape becomes the new target tape, and the old target tape becomes one of the working tapes in the next phase of merging. The Fibonacci distribution is perfect for 17 initial runs (here, 17 single records) on 4 tapes. The pth order Fibonacci series is used to determine the number of runs on each tape, where p = T-1. This series is defined as follows.

Formula: $F^{p}_{S} = F^{p}_{S-1} + F^{p}_{S-2} + \dots + F^{p}_{S-p}$    (1)
Conditions: $F^{p}_{S} = 0$ for $0 \le S \le p-2$, and $F^{p}_{p-1} = 1$


Derivation:
p = T-1, where T is the number of tapes (including the tape containing the elements). For this example T = 4, therefore p = 3. Assume S = 5.
We are using k tapes for sorting; here k = 3. Since we are using k tapes instead of p tapes, equation 1 becomes:
$F^{p}_{S} = F^{p}_{S-1} + F^{p}_{S-2} + \dots + F^{p}_{S-k}$    (2)
Using equation 2 we shall generate a perfect tape distribution table for the above values, shown in table 1.
T = 4, p = 3, k = 3, S = 5 for tape 1.
Substituting the values in equation 2 we get
$F^{3}_{5} = F^{3}_{4} + F^{3}_{3} + F^{3}_{2}$
$F^{3}_{0} = F^{3}_{1} = 0$, because $F^{p}_{S} = 0$ for $0 \le S \le p-2 = 3-2 = 1$
$F^{3}_{2} = 1$, because $F^{p}_{p-1} = F^{3}_{2} = 1$
$F^{3}_{3} = F^{3}_{2} + F^{3}_{1} + F^{3}_{0} = 1 + 0 + 0 = 1$
$F^{3}_{4} = F^{3}_{3} + F^{3}_{2} + F^{3}_{1} = 1 + 1 + 0 = 2$
$F^{3}_{5} = F^{3}_{4} + F^{3}_{3} + F^{3}_{2} = 2 + 1 + 1 = 4$
Similarly, $F^{3}_{6} = F^{3}_{5} + F^{3}_{4} + F^{3}_{3} = 4 + 2 + 1 = 7$
Table 1: Tape distribution table

LEVEL   TAPE 1   TAPE 2   TAPE 3   TOTAL NUMBER OF RUNS
1       0        0        1        1
2       1        1        1        3
3       1        2        2        5
4       2        3        4        9
5       4        6        7        17
6       7        11       13       31
7       13       20       24       57
8       24       37       44       105
9       44       68       81       193
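The series and the table can be regenerated with a short Python sketch. Both function names are hypothetical. The row-to-row rule used in `perfect_distribution` (the largest count of the previous level is added to every other tape and becomes the count of tape 1) is an assumption read off from Table 1, not a rule stated in the text; it corresponds to undoing one merge phase.

```python
def fibonacci_order_p(p, count):
    """Return F^p_0 .. F^p_(count-1) of the p-th order Fibonacci series."""
    series = [0] * (p - 1) + [1]          # F_0..F_(p-2) = 0, F_(p-1) = 1
    while len(series) < count:
        series.append(sum(series[-p:]))   # sum of the previous p terms
    return series[:count]


def perfect_distribution(k, levels):
    """Return one row [tape 1, ..., tape k] of run counts per level."""
    row = [0] * (k - 1) + [1]             # level 1: a single run on tape k
    table = [row]
    for _ in range(levels - 1):
        largest = row[-1]
        # Undo one phase: every other tape gains `largest` runs.
        row = [largest] + [count + largest for count in row[:-1]]
        table.append(row)
    return table


if __name__ == "__main__":
    for level, row in enumerate(perfect_distribution(3, 9), start=1):
        print(level, row, sum(row))
```

For k = 3 the tape 1 column of the generated table is the third order Fibonacci series from S = 1 onwards, and level 5 gives the distribution 4, 6, 7 (17 runs) used in the example that follows.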


Initially tape 4 is used as the input tape; it holds the 17 unordered records (step 1). Based on the above formula, the initial runs for the 17 records are distributed over tapes 1, 2 and 3 as shown in step 2. Steps 3 and 4 show the tapes after the first and second merges.

Step 1
Tape 1: (empty)
Tape 2: (empty)
Tape 3: (empty)
Tape 4: 15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36

Step 2
Tape 1: 15 27 3 14
Tape 2: 6 35 20 26 21 40
Tape 3: 12 8 32 19 1 18 36
Tape 4: (empty)

Step 3
Tape 1: 27 3 14
Tape 2: 35 20 26 21 40
Tape 3: 8 32 19 1 18 36
Tape 4: 6 12 15

Step 4
Tape 1: 3 14
Tape 2: 20 26 21 40
Tape 3: 32 19 1 18 36
Tape 4: 6 12 15 8 27 35

On the first merge, the initial runs at the head of each of the source tapes 1, 2 and 3 (that is, 15, 6 and 12) are merged, and the run [6, 12, 15] is placed on the initially empty target tape 4.

Step 5
Tape 1: 14
Tape 2: 26 21 40
Tape 3: 19 1 18 36
Tape 4: 6 12 15 8 27 35 3 20 32

Step 6
Tape 1: (empty)
Tape 2: 21 40
Tape 3: 1 18 36
Tape 4: 6 12 15 8 27 35 3 20 32 14 19 26

The second, third and fourth merges place the runs [8, 27, 35], [3, 20, 32] and [14, 19, 26] on tape 4. Tape 1 is now left with no records to merge, so it can be rewound and used as the output tape for the next phase. The next phase begins by merging runs from tapes 2, 3 and 4, which places the new run [1, 6, 12, 15, 21] on tape 1 (step 7). The next merge places the run [8, 18, 27, 35, 40] on tape 1, leaving tape 2 empty (step 8). Tapes 1 and 2 are rewound and a third phase is made, which places the run [1, 3, 6, 12, 15, 20, 21, 32, 36] on tape 2 in step 9. Tape 3 is now empty; tapes 2 and 3 are rewound, and the final merge places the single sorted run on tape 3, where you can see the sorted list in step 10.


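Step 7
Tape 1: 1 6 12 15 21
Tape 2: 40
Tape 3: 18 36
Tape 4: 8 27 35 3 20 32 14 19 26

Step 8
Tape 1: 1 6 12 15 21 8 18 27 35 40
Tape 2: (empty)
Tape 3: 36
Tape 4: 3 20 32 14 19 26

Step 9
Tape 1: 8 18 27 35 40
Tape 2: 1 3 6 12 15 20 21 32 36
Tape 3: (empty)
Tape 4: 14 19 26

Step 10
Tape 1: (empty)
Tape 2: (empty)
Tape 3: 1 3 6 8 12 14 15 18 19 20 21 26 27 32 35 36 40
Tape 4: (empty)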
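The phases of this walkthrough can be replayed with a small Python simulation in which each tape is a queue of runs. This is a sketch under explicit assumptions: the name `polyphase_merge` is hypothetical, exactly one tape starts empty, at least one run exists, and the runs follow a perfect Fibonacci distribution (otherwise the sketch raises an error instead of handling the imperfect case).

```python
import heapq
from collections import deque


def polyphase_merge(tapes):
    """Merge runs spread over tapes; return (index of final tape, sorted run).

    `tapes` is a list of lists of sorted runs. Assumes a perfect Fibonacci
    distribution with exactly one empty tape at the start.
    """
    tapes = [deque(runs) for runs in tapes]
    target = next(i for i, tape in enumerate(tapes) if not tape)
    while sum(len(tape) for tape in tapes) > 1:
        sources = [i for i in range(len(tapes)) if i != target]
        if not all(tapes[i] for i in sources):
            raise ValueError("runs are not in a perfect distribution")
        # One phase: merge one run from every source tape onto the target
        # until the source tape with the fewest runs is empty.
        while all(tapes[i] for i in sources):
            heads = [tapes[i].popleft() for i in sources]
            tapes[target].append(list(heapq.merge(*heads)))
        # Rotate: the tape that has just been emptied is the next target.
        target = next(i for i in sources if not tapes[i])
    final = next(i for i, tape in enumerate(tapes) if tape)
    return final, tapes[final][0]


if __name__ == "__main__":
    step2 = [
        [[15], [27], [3], [14]],
        [[6], [35], [20], [26], [21], [40]],
        [[12], [8], [32], [19], [1], [18], [36]],
        [],
    ]
    print(polyphase_merge(step2))
```

Started from the step 2 distribution, the simulation finishes with the sorted run on the tape at index 2, that is tape 3, in agreement with step 10.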


Self Assessment Questions
1. External sorting is needed when a large amount of data cannot fit in ___________________.
2. ______________ are the small batches created during external sorting.
3. What is the major overhead we need to handle while manipulating data on magnetic tape? ____________________________
4. Merge sort requires very little memory and can use tape drives as input and output devices. (True/False)
5. In polyphase sorting __________________________ is used to determine the number of runs on each tape.

13.3 Sorting on Disks


Special care is needed when accessing individual records on a magnetic disk for sorting. Because a disk gives direct access to records, we can ignore the problems of setting up initial runs according to particular merge patterns and of rewinding working tapes. We can use a k-way merge strategy, which avoids these problems and reduces the sorting time. Assume that we have r initial runs; then the number of passes required to sort a file using a k-way merge is $O(\log_k r)$. We will see one example with a buffer size of b = 3, shown in figure 13.1.
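The pass count can be checked with a few lines of Python. The function name `merge_passes` is hypothetical; the sketch counts merge passes only (not the pass that creates the initial runs) and assumes k is at least 2.

```python
def merge_passes(r, k):
    """Number of k-way merge passes needed to reduce r runs to one run."""
    passes = 0
    while r > 1:
        r = -(-r // k)   # ceiling of r / k: runs left after one pass
        passes += 1
    return passes
```

For r = 6 initial runs and k = 2 the run count goes 6, 3, 2, 1, so three merge passes are needed, which is the ceiling of log2 6.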


Input file:
  15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36
Pass 1 (runs of length 3):
  3 15 27 | 6 14 35 | 20 21 26 | 8 12 40 | 1 19 32 | 18 36
Pass 2 (k = 2 way merges):
  3 6 14 15 27 35 | 8 12 20 21 26 40 | 1 18 19 32 36
Pass 3:
  3 6 8 12 14 15 20 21 26 27 35 40 | 1 18 19 32 36
Pass 4:
  1 3 6 8 12 14 15 18 19 20 21 26 27 32 35 36 40

Figure 13.1: Sorting on disk with buffer size as 3

Here the first pass is required to create the initial runs of length 3. From this point, progressively longer runs are created by merging until the whole file is ordered. The merge tree has depth at least $\lceil \log_k r \rceil$; in the given two-way example, $\lceil \log_2 6 \rceil = 3$. Note that the merge tree is not truly balanced, so not all records are read in each pass; in the third pass only 4 of the 6 blocks are read. The storage required for two-way merging is two input buffers and one output buffer; for a k-way merge, k+1 buffers are required.
The general strategy for the merging process involves the following two steps:
1) Generate as many initial runs as possible; let the number of runs be r.
2) Merge the r runs, k at a time, using a merge algorithm that is a generalization of the simple two-way merging which we discussed in sub-section 11.2.3.
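Step 2 of this strategy can be sketched in Python as below. The function names are hypothetical, runs are plain in-memory lists rather than disk blocks, and a min-heap is used to pick the least key among the k runs, which is one common way (an assumption here, not the text's prescription) to generalize the two-way merge.

```python
import heapq


def k_way_merge(runs):
    """Merge sorted runs into one sorted list using a min-heap."""
    # Each heap entry is (key, run index, position within that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        key, i, pos = heapq.heappop(heap)
        merged.append(key)
        if pos + 1 < len(runs[i]):        # refill from the same run
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return merged


def merge_in_passes(runs, k):
    """Merge the runs k at a time; return (sorted list, merge passes)."""
    passes = 0
    while len(runs) > 1:
        runs = [k_way_merge(runs[i:i + k]) for i in range(0, len(runs), k)]
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    initial = [[3, 15, 27], [6, 14, 35], [20, 21, 26],
               [8, 12, 40], [1, 19, 32], [18, 36]]
    print(merge_in_passes(initial, 2))
```

With the six initial runs of figure 13.1 and k = 2, the sketch needs three merge passes; with k = 3 it needs only two, which illustrates why a larger k reduces the number of passes.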

13.4 Generating Extended Initial Runs


We had a detailed discussion about runs, and we assumed that the initial runs are of a fixed length, say m. Each run is produced by reading records from the unordered list and applying an internal sorting technique, chosen according to the number of records m stored in each run. As we already discussed, if m is small we can use selection sort (or merge sort), and if m is large heap sort is advisable. After sorting, the records are written out as runs and the merging process starts. It is possible to produce longer runs by applying the replacement selection technique to the records as they are read into memory. For example, consider the generation of extended runs for an unordered list with a merge buffer of size 3.
1) The run begins by picking 3 elements from the unordered list and storing them in the merge buffer.
2) A selection process is now invoked; it isolates the least element in the merge buffer and pushes it to the output file.
3) The next record from the input file is moved into the buffer to replace the previously selected record.
4) The selection process starts again, now with the restriction that the least key selected must be greater than the previous key pushed to the output file.
5) Selecting a key smaller than the previously selected key would terminate the current run, so always select the least key that is higher than the previously selected key.
6) When the merge buffer becomes filled with records whose keys are all less than the previously selected key, the current run is terminated and a new run must be initiated.

Now we will discuss one example with 17 records; figure 13.2 depicts how the runs are generated.

Input file: 15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36
[Figure: successive contents of the three-record merge buffer and the runs I, II and III written to the output file]
Figure 13.2: Extended initial runs

In the above example there are 17 records in the input file and the buffer size is 3. Initially 3 elements (15, 27, 3) are picked from the input file; among these, 3 is identified as the least and moved to the output file. The next element, 14, is then read into the buffer; among (15, 27, 14), 14 is identified as the least and moved to the output file. In the next buffer (15, 27, 6), even though the least value is 6, we cannot move it to the output file, because the previous key moved was 14, which is greater than 6. We are therefore forced to pick the next least element in the buffer, so 15 is identified and pushed to the output file. It can be shown empirically that the average length of an extended run is 2m, assuming that the unordered records are in random order and a merge buffer of size m is used.
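Replacement selection as described in this section can be sketched in Python as follows. The function name `replacement_selection` is hypothetical. Two implementation choices are assumptions of this sketch rather than part of the text: the buffer is kept as a min-heap, and records that are too small for the current run are set aside in a separate `frozen` list (together the two never hold more than m records). A record equal to the last key written is allowed to stay in the current run.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input file


def replacement_selection(records, m):
    """Split records into extended sorted runs using a buffer of m records."""
    source = iter(records)
    current = list(islice(source, m))   # records eligible for this run
    heapq.heapify(current)
    frozen = []                          # records held back for the next run
    runs, run = [], []
    while current:
        key = heapq.heappop(current)     # least eligible key in the buffer
        run.append(key)                  # push it to the output file
        incoming = next(source, _END)    # replace the selected record
        if incoming is not _END:
            if incoming >= key:
                heapq.heappush(current, incoming)
            else:
                frozen.append(incoming)  # too small for the current run
        if not current:                  # buffer holds only frozen records
            runs.append(run)
            run = []
            current, frozen = frozen, []
            heapq.heapify(current)
    return runs


if __name__ == "__main__":
    data = [15, 27, 3, 14, 6, 35, 20, 26, 21, 40, 12, 8, 32, 19, 1, 18, 36]
    for number, extended_run in enumerate(replacement_selection(data, 3), 1):
        print(number, extended_run)
```

On the 17-record input with m = 3 the sketch produces four runs, the first being 3 14 15 27 35, so most runs are longer than the buffer size of 3.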

Self Assessment Questions
6. How many passes are required to sort a file using a k-way merge, if we have r initial runs? ____________________________
7. Which sorting technique is recommended for runs holding fewer than 15 records? (Pick the right option)
   a) Heap
   b) Selection
   c) Merge
   d) Insertion
8. What will be the average length of the extended runs if the unordered records are in random order? _____________________

13.5 Summary
External sorting is mandatory when the input is so large that it does not fit into internal memory. In this unit we learnt how tape drive merge sort and polyphase sort help with this task. We mostly discussed two-way merging, but internal memory can generally hold many blocks, not just two or three. To reduce the number of passes we can extend the merge algorithm to k-way merging, in which k sorted sequences are merged into a single output sequence. Implementing this on external memory is easy as long as we have enough internal memory for k input buffer blocks, one output buffer block and a small amount of additional storage.

13.6 Terminal Questions


1. What is external sorting?
2. Discuss the concepts of run list and tape sorting.
3. Explain tape drive merge sort with an example.
4. Explain polyphase sorting using tape drives.
5. What do you mean by sorting on disks?
6. Discuss how to generate extended initial runs.


13.7 Answers
Self Assessment Questions
1. Main memory
2. Run lists
3. Rewinding
4. True
5. Fibonacci series
6. $O(\log_k r)$
7. b) Selection
8. 2m

Terminal Questions
1. External sorting is a must when we have to arrange in lexicographic order a very large number of items that cannot fit in main memory. The general strategy in external sorting is to begin by sorting small batches of records from a file in internal memory. (Refer section 13.2)
2. Run lists are the small sorted batches created in external sorting; they are also called initial runs or initial strings. External sorting works well with magnetic tapes, which are the cheapest choice for external storage. (Refer sub-sections 13.2.1 and 13.2.2 for details)
3. Merge sort is so naturally sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements. (Refer sub-section 13.2.2 for details)
4. Polyphase is one of the popular methods for external sorting. The basic idea is to distribute ordered initial runs of predetermined size on the available tapes. (Refer sub-section 13.2.2 for details)
5. Special care is needed when accessing individual records on a magnetic disk for sorting. (Refer section 13.3 for details)
6. Each run is produced by reading records from the unordered list and applying an internal sorting technique, chosen according to the number of records m stored in each run. (Refer section 13.4 for details)
