Unit 13
Structure:
13.1 Introduction
     Objectives
13.2 External Sorting
     Run lists
     Tape sorting
13.3 Sorting on Disks
13.4 Generating Extended Initial Runs
13.5 Summary
13.6 Terminal Questions
13.7 Answers
13.1 Introduction
In the previous unit, you learnt what external storage devices are and why we have to use them. We also discussed file structures, one of the most important storage structures for holding large amounts of data, and three file organizations (sequential, indexed sequential and direct files) together with their processing methods. In this unit, we discuss how to sort data that is held externally. The sorting techniques discussed in Unit 11 work on data that is available in main memory; this unit focuses on data that resides on external storage. External sorting is a generic term for a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted does not fit into the main memory of a computing device (usually RAM), so that a slower kind of memory (usually a hard drive) has to be used. Under external sorting we discuss two concepts, run lists and tape sorting, and two algorithms for sorting externally stored data, tape drive merge sort and polyphase sort. We conclude the unit with a discussion of sorting on disks and of how to generate extended initial runs.
Objectives:
After studying this unit, you should be able to:
- state the meaning of external sorting
- explain the concepts of run list and tape sort
- describe tape drive merge sort and polyphase sorting
- explain sorting on disks
- discuss generating extended initial runs
Tape drive merge sort
Merge sort is so naturally sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements. For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, which avoids rewind time.

Sorting procedure using four tape drives:
1) Divide the data to be sorted in half and put half on each of two tapes.
2) Merge individual pairs of records from the two tapes; write two-record chunks alternately to each of the two output tapes.
3) Merge the two-record chunks from the two output tapes into four-record chunks; write these alternately to the original two input tapes.
4) Merge the four-record chunks into eight-record chunks; write these alternately to the original two output tapes.
5) Repeat until you have one chunk containing all the data, sorted; that is, for about log2 n passes, where n is the number of records.

I Constructing the runs
1. If tape 1 is not finished:
   I. Read M items (if available) from tape 1.
   II. Sort them in memory.
   III. Write them to tape 3 (these M items are called one run).
2. If tape 1 is not finished:
   I. Read M items (if available) from tape 1.
   II. Sort them in memory.
   III. Write them to tape 4.
3. Repeat steps 1 and 2 until tape 1 is finished.

II Merging runs
1. Merge the runs on tapes 3 and 4 onto tapes 1 and 2:
   I. Take one run from tape 3 and one run from tape 4, merge them, and write the result alternately to tape 1 and tape 2.
   II. Continue in this way until tapes 3 and 4 are finished.
   At the end of this step we have runs of size 2*M on tapes 1 and 2.
2. Merge the runs on tapes 1 and 2 onto tapes 3 and 4 in the same way. At the end of this step we have runs of size 4*M on tapes 3 and 4.
3. Repeat steps 1 and 2 until we have a single run of size N (the input size).
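The two phases above (constructing the runs, then merging them back and forth between two pairs of tapes) can be sketched in Python. This is only a minimal simulation under stated assumptions: each tape is modelled as an in-memory list of runs, the function name four_tape_merge_sort is illustrative, and the 13 sample records are made up for the demonstration and are not taken from the unit.

```python
from heapq import merge


def four_tape_merge_sort(records, m):
    """Simulate a balanced four-tape merge sort.

    Returns (sorted list, number of merge passes). Tapes are modelled
    as Python lists of runs, which is an assumption of this sketch.
    """
    # Phase I: cut the input into sorted runs of at most m records and
    # write them alternately to the two source tapes (tapes 3 and 4).
    source = ([], [])
    for n, start in enumerate(range(0, len(records), m)):
        source[n % 2].append(sorted(records[start:start + m]))

    # Phase II: merge one run from each source tape, write the longer
    # run alternately to the two target tapes, then swap the tape pairs.
    passes = 0
    while len(source[0]) + len(source[1]) > 1:
        target = ([], [])
        for n, run_a in enumerate(source[0]):
            # The first tape never holds fewer runs than the second one,
            # so an unmatched run can only come from the first tape.
            run_b = source[1][n] if n < len(source[1]) else []
            target[n % 2].append(list(merge(run_a, run_b)))
        source = target
        passes += 1
    return (source[0][0] if source[0] else []), passes


# Hypothetical sample data: 13 unordered records, run size M = 3.
sample = [15, 27, 3, 14, 6, 35, 20, 26, 21, 40, 12, 8, 32]
ordered, passes = four_tape_merge_sort(sample, 3)
print(ordered)
print(passes)  # 3 merge passes for 5 initial runs
```

With 13 records and M = 3 the sketch builds five initial runs and needs three merge passes, because each pass roughly halves the number of runs.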
Sikkim Manipal University Page No.: 216
We now work through a four-tape merge sort example. T1 holds an unordered list of 13 elements, and we assume a run size of M = 3.
T3 and T4 are constructed by repeatedly reading 3 elements from T1, sorting them in memory, and writing the resulting run alternately to T3 and T4.
In pass 1, runs are taken alternately from T3 and T4, merged, and written to T1 and T2. In pass 2, the runs on T1 and T2 are merged in the same way and written back to T3 and T4.
At the end of pass 3 the process terminates, since all the elements now form a single sorted run on one tape.

Polyphase sorting
Polyphase sorting is one of the popular methods for external sorting. The basic idea is to distribute ordered initial runs of predetermined size over the available tapes and then to merge these runs repeatedly in multiple phases, where each phase performs a predetermined number of merges before a new target tape is selected. The distribution of the runs on the working tapes may affect the performance of the sort. Simple balanced merging can be used here; it requires the runs to be distributed evenly on T-1 of the tapes. Each merged run is written to tape T, and merging continues with the T-1 working tapes until the final run is formed on T. A Fibonacci distribution of the initial runs has been found to give better performance. In the polyphase merge with a Fibonacci distribution, merging continues until the tape holding the fewest runs is empty. When this occurs the tapes are logically rotated: the recently emptied tape becomes the new target tape, and the old target tape becomes one of the working tapes in the next merging phase. The Fibonacci distribution is perfect for 17 records on 4 tapes. The p-th order Fibonacci series is used to determine the number of runs on each tape, where p = T-1. The series is defined as follows.

Formula:
$$F^{(p)}_{S} = F^{(p)}_{S-1} + F^{(p)}_{S-2} + \dots + F^{(p)}_{S-p} \qquad (1)$$
Conditions:
$$F^{(p)}_{S} = 0 \ \text{ for } 0 \le S \le p-2, \qquad F^{(p)}_{p-1} = 1$$
Derivation:
p = T-1, where T is the number of tapes (including the tape containing the elements). For this example T = 4, therefore p = 3. Assume S = 5. We are using k tapes for sorting; here k = 3. Since we are using k tapes instead of p tapes, equation (1) becomes:
$$F^{(p)}_{S} = F^{(p)}_{S-1} + F^{(p)}_{S-2} + \dots + F^{(p)}_{S-k} \qquad (2)$$
Using equation (2) we generate the perfect tape distribution shown in Table 1. For tape 1, with T = 4, p = 3, k = 3 and S = 5, substituting the values in equation (2) gives:
$$F^{(3)}_{5} = F^{(3)}_{4} + F^{(3)}_{3} + F^{(3)}_{2}$$
$$F^{(3)}_{0} = F^{(3)}_{1} = 0, \ \text{ because } F^{(3)}_{S} = 0 \text{ for } 0 \le S \le p-2 = 1$$
$$F^{(3)}_{2} = 1, \ \text{ because } F^{(p)}_{p-1} = F^{(3)}_{2} = 1$$
$$F^{(3)}_{3} = F^{(3)}_{2} + F^{(3)}_{1} + F^{(3)}_{0} = 1 + 0 + 0 = 1$$
$$F^{(3)}_{4} = F^{(3)}_{3} + F^{(3)}_{2} + F^{(3)}_{1} = 1 + 1 + 0 = 2$$
$$F^{(3)}_{5} = F^{(3)}_{4} + F^{(3)}_{3} + F^{(3)}_{2} = 2 + 1 + 1 = 4$$
Similarly,
$$F^{(3)}_{6} = F^{(3)}_{5} + F^{(3)}_{4} + F^{(3)}_{3} = 4 + 2 + 1 = 7$$
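The derivation above can be checked with a few lines of Python. This is a minimal sketch, assuming the series starts at index 0 with p-1 zeros followed by a single 1, as stated in the conditions; the function name fib_p is illustrative.

```python
def fib_p(p, s):
    """Return the s-th term of the p-th order Fibonacci series.

    Assumes F(0) .. F(p-2) = 0 and F(p-1) = 1; every later term is the
    sum of the p terms before it.
    """
    values = [0] * (p - 1) + [1]          # F(0) .. F(p-1)
    while len(values) <= s:
        values.append(sum(values[-p:]))   # sum of the previous p terms
    return values[s]


print(fib_p(3, 5))  # 4, the number of runs on tape 1 at level 5
print(fib_p(3, 6))  # 7
```

For p = 3 the terms with S = 1 to 9 are 0, 1, 1, 2, 4, 7, 13, 24, 44, which is the row for tape 1 in Table 1.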
Table 1: Tape distribution table

LEVEL                   1    2    3    4    5    6    7    8    9
Tape 1                  0    1    1    2    4    7   13   24   44
Tape 2                  0    1    2    3    6   11   20   37   68
Tape 3                  1    1    2    4    7   13   24   44   81
Total number of runs    1    3    5    9   17   31   57  105  193
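The rows of Table 1 can also be generated level by level. The sketch below uses the usual polyphase recurrence, in which the next level is obtained by adding the largest run count to each of the other counts; this recurrence is an assumption brought in for illustration (the unit itself only gives the Fibonacci formula), and the function name perfect_distributions is hypothetical.

```python
def perfect_distributions(input_tapes, levels):
    """List the perfect run distributions for the given number of levels.

    Each row holds the run counts for the input tapes in ascending
    order, matching the Tape 1 .. Tape 3 columns of Table 1.
    """
    counts = [1] + [0] * (input_tapes - 1)   # level 1, largest count first
    rows = []
    for _ in range(levels):
        rows.append(sorted(counts))
        largest = counts[0]
        # Next level: add the largest count to every other count, and
        # let the largest count itself move to the last position.
        counts = [largest + c for c in counts[1:]] + [largest]
    return rows


for level, row in enumerate(perfect_distributions(3, 9), start=1):
    print(level, row, sum(row))
```

Level 5 gives the counts 4, 6 and 7, a total of 17 runs, which is why 17 records form a perfect distribution on 4 tapes.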
Initially tape 4 is used as the input buffer and holds the 17 unordered records (step 1). Based on the above formula, the initial runs for the 17 records are distributed over tapes 1, 2 and 3 as shown in step 2. In the steps below, runs of more than one record are separated by "|".

Step 1
Tape 1: (empty)
Tape 2: (empty)
Tape 3: (empty)
Tape 4: 15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36

Step 2
Tape 1: 15 27 3 14
Tape 2: 6 35 20 26 21 40
Tape 3: 12 8 32 19 1 18 36
Tape 4: (empty)

Step 3
Tape 1: 27 3 14
Tape 2: 35 20 26 21 40
Tape 3: 8 32 19 1 18 36
Tape 4: 6 12 15

Step 4
Tape 1: 3 14
Tape 2: 20 26 21 40
Tape 3: 32 19 1 18 36
Tape 4: 6 12 15 | 8 27 35
On the first merge pass, the first runs on the source tapes 1, 2 and 3, that is [15], [6] and [12], are merged, and the resulting run [6, 12, 15] is placed on the initially empty target tape 4.

Step 5
Tape 1: 14
Tape 2: 26 21 40
Tape 3: 19 1 18 36
Tape 4: 6 12 15 | 8 27 35 | 3 20 32

Step 6
Tape 1: (empty)
Tape 2: 21 40
Tape 3: 1 18 36
Tape 4: 6 12 15 | 8 27 35 | 3 20 32 | 14 19 26
The second, third and fourth merges place the runs [8, 27, 35], [3, 20, 32] and [14, 19, 26] on tape 4. Tape 1 now has no records left to merge, so it can be rewound and used as the output tape for the next merging phase. The next phase begins by merging runs from tapes 2, 3 and 4, which places the new run [1, 6, 12, 15, 21] on tape 1. The next merge places the run [8, 18, 27, 35, 40] on tape 1, leaving tape 2 empty. Tapes 1 and 2 are rewound and a third phase places the run [1, 3, 6, 12, 15, 20, 21, 32, 36] on tape 2 in step 9. Tape 3 is now empty, so tapes 2 and 3 are rewound and the final merge places the complete sorted list on tape 3.
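Step 7
Tape 1: 1 6 12 15 21
Tape 2: 40
Tape 3: 18 36
Tape 4: 8 27 35 | 3 20 32 | 14 19 26

Step 8
Tape 1: 1 6 12 15 21 | 8 18 27 35 40
Tape 2: (empty)
Tape 3: 36
Tape 4: 3 20 32 | 14 19 26

Step 9
Tape 1: 8 18 27 35 40
Tape 2: 1 3 6 12 15 20 21 32 36
Tape 3: (empty)
Tape 4: 14 19 26

Step 10
Tape 1: (empty)
Tape 2: (empty)
Tape 3: 1 3 6 8 12 14 15 18 19 20 21 26 27 32 35 36 40
Tape 4: (empty)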
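The ten steps of the polyphase example can be reproduced with a short simulation. This is a sketch under stated assumptions, not a definitive implementation: tapes are modelled as in-memory queues of runs, every initial run holds a single record as in the example, the input is non-empty, and the run counts passed in form a perfect distribution. The function name polyphase_sort is illustrative.

```python
from collections import deque
from heapq import merge


def polyphase_sort(records, distribution):
    """Simulate a polyphase merge; return (final tape number, sorted run).

    distribution gives the number of one-record initial runs placed on
    each source tape; one extra, initially empty tape is the target.
    """
    if sum(distribution) != len(records):
        raise ValueError("distribution must account for every record")
    runs = iter([key] for key in records)
    tapes = [deque(next(runs) for _ in range(count)) for count in distribution]
    tapes.append(deque())                 # the initially empty target tape
    target = len(tapes) - 1

    while sum(len(tape) for tape in tapes) > 1:
        sources = [i for i in range(len(tapes)) if i != target]
        if not all(tapes[i] for i in sources):
            raise ValueError("not a perfect distribution")
        # One phase: merge one run from every source tape until the
        # source tape with the fewest runs becomes empty.
        while all(tapes[i] for i in sources):
            heads = [tapes[i].popleft() for i in sources]
            tapes[target].append(list(merge(*heads)))
        # Rotate: the tape that was just emptied is the new target.
        target = next(i for i in sources if not tapes[i])

    final = next(i for i, tape in enumerate(tapes) if tape)
    return final + 1, tapes[final][0]


data = [15, 27, 3, 14, 6, 35, 20, 26, 21, 40, 12, 8, 32, 19, 1, 18, 36]
tape_no, ordered = polyphase_sort(data, [4, 6, 7])
print(tape_no)   # 3
print(ordered)
```

With the distribution 4, 6, 7 the simulation performs four phases of 4, 2, 1 and 1 merges, and the sorted list ends up on tape 3, as in step 10.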
Self Assessment Questions
1. External sorting is done where a large amount of data is stored in ___________________.
2. ______________ are the small batches created during external sorting.
3. What is the major overhead we need to handle while manipulating data on magnetic tape? ____________________________
4. Merge sort requires very little memory and can use tape drives as input and output devices. (True/False)
5. In polyphase sorting __________________________ is used to determine the number of runs on each tape.
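[Figure: two-way merging of the input file with initial runs of length 3]
Input file: 15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36
Pass 1 (initial runs of length 3): 3 15 27 | 6 14 35 | 20 21 26 | 8 12 40 | 1 19 32 | 18 36
Pass 2, Pass 3, Pass 4: successive two-way merges of these runs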
Here the first pass is required to create the initial runs of length 3. From this point on, progressively longer runs are created by the merging process until the whole file is ordered. A k-way merge tree over r initial runs has depth at least $\lceil \log_k r \rceil$; in the given example, with k = 2 and r = 6, $\lceil \log_2 6 \rceil = 3$. Note that the merge tree here is not truly balanced, so not all records are read in each pass. In particular, in the third pass only 4 of the 6 blocks are picked. Two-way merging needs two input buffers and one output buffer; in general, a k-way merge needs k+1 buffers. The general strategy for the merging process involves the following two steps:
1) Generate as many initial runs as possible; let the number of runs be r.
2) Merge the r runs, k at a time, using a merge algorithm that is a generalization of the simple two-way merging which we discussed in sub-section 11.2.3.
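The two-step strategy above can be sketched as follows. This is a minimal in-memory illustration, assuming runs are Python lists rather than disk blocks; the helper names make_runs and merge_runs are illustrative, and heapq.merge from the standard library stands in for the k-way merge with its k input buffers.

```python
import heapq


def make_runs(records, m):
    """Step 1: cut the input into sorted initial runs of at most m records."""
    return [sorted(records[i:i + m]) for i in range(0, len(records), m)]


def merge_runs(runs, k):
    """Step 2: merge the runs k at a time until a single run remains.

    Returns (sorted list, number of merge passes).
    """
    passes = 0
    while len(runs) > 1:
        # One pass: every group of up to k neighbouring runs becomes one run.
        runs = [list(heapq.merge(*runs[i:i + k]))
                for i in range(0, len(runs), k)]
        passes += 1
    return (runs[0] if runs else []), passes


data = [15, 27, 3, 14, 6, 35, 20, 26, 21, 40, 12, 8, 32, 19, 1, 18, 36]
runs = make_runs(data, 3)
print(len(runs))               # 6 initial runs
print(merge_runs(runs, 2)[1])  # 3 merge passes with two-way merging
print(merge_runs(runs, 3)[1])  # 2 merge passes with three-way merging
```

The pass counts agree with the depth of the merge tree: 6 runs need 3 two-way merge passes but only 2 three-way merge passes, in addition to the pass that creates the initial runs.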
We now discuss an example with the 17 records; figure 13.2 shows how the runs are generated.
[Figure 13.2: Generating extended initial runs with a buffer of size 3]
Input file: 15 27 3 14 6 35 20 26 21 40 12 8 32 19 1 18 36
Run I:   3 14 15 27 35
Run II:  6 20 21 26 40
Run III: 8 12 19 32 36
Run IV:  1 18
In the above example there are 17 records in the input file and the buffer size is 3. Initially the first 3 elements of the input file (15, 27, 3) are read into the buffer; among these, 3 is the least and is moved to the output file. The buffer is then refilled with the next element, 14; among (15, 27, 14), 14 is the least and is moved to the output file. In the next buffer (15, 27, 6), even though the least value is 6, we cannot move it to the output file, because the previous key written was 14, which is greater than 6. We are therefore forced to pick the smallest element in the buffer that is not less than the last key written, so 15 is selected and moved to the output file. When every element in the buffer is smaller than the last key written, the current run ends and a new run begins. It can be shown empirically that the average length of an extended run is 2m, assuming that the unordered records are in random order and a merge buffer of size m is used.
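The run generation just described is commonly implemented as replacement selection with a priority queue. The sketch below is one possible way to do it, assuming the buffer is kept as a heap of (run number, key) pairs so that keys held back for the next run sort after all keys of the current run; the function name replacement_selection is illustrative.

```python
import heapq
from itertools import islice


def replacement_selection(records, m):
    """Generate extended initial runs using a buffer of m records."""
    source = iter(records)
    # Every key initially in the buffer belongs to run 0.
    heap = [(0, key) for key in islice(source, m)]
    heapq.heapify(heap)
    runs = []
    end = object()                      # marks an exhausted input file
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):         # first key of a new run
            runs.append([])
        runs[run_no].append(key)
        nxt = next(source, end)
        if nxt is not end:
            # A key smaller than the one just written cannot join the
            # current run, so it is held back for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs


data = [15, 27, 3, 14, 6, 35, 20, 26, 21, 40, 12, 8, 32, 19, 1, 18, 36]
for run in replacement_selection(data, 3):
    print(run)
```

For the 17 records and a buffer of size 3 this yields four runs, three of which hold 5 records each, compared with the six runs of length 3 obtained by sorting the file three records at a time.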
Self Assessment Questions
6. How many passes are required to sort a file using k-way merging if we have r initial runs? ____________________________
7. Which sorting technique is recommended for runs holding fewer than 15 records? (Pick the right option)
   a) Heap
   b) Selection
   c) Merge
   d) Insertion
8. What will be the average length of the extended runs if the unordered records are in random order? _____________________
13.5 Summary
External sorting is mandatory when the input is so large that it does not fit into internal memory. In this unit we learnt how polyphase sort and merge sort help to carry out this process. We discussed two-way merging in this unit, but internal memory can generally hold many blocks, not just two or three. To reduce the number of phases we can extend our merge algorithm to k-way merging, in which k sorted runs are merged into a single output sequence. Implementing this on external memory is easy as long as we have enough internal memory for k input buffer blocks, one output buffer block and a small amount of additional storage.
13.7 Answers
Self Assessment Questions
1. Main memory
2. Run lists
3. Rewinding
4. True
5. Fibonacci series
6. $\lceil \log_k r \rceil$
7. b) Selection
8. 2m

Terminal Questions
1. External sorting is a must when we have to arrange in lexicographic order a very large number of items that cannot fit in main memory. The general strategy in external sorting is to begin by sorting small batches of records from a file in internal memory. (Refer section 13.2)
2. The run lists are the small batches created in external sorting, otherwise called initial runs or initial strings. External sorting is efficient when we use magnetic tapes, because tapes are the cheapest choice for external storage. (Refer sub-sections 13.2.1 and 13.2.2 for details)
3. Merge sort is so naturally sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not change with the number of data elements. (Refer sub-section 13.2.2 for details)
4. Polyphase sorting is one of the popular methods for external sorting. The basic idea is to distribute ordered initial runs of predetermined size on the available tapes. (Refer sub-section 13.2.2 for details)
5. We need to take special care while accessing a particular record on a magnetic disk for sorting purposes. (Refer section 13.3 for details)
6. These runs are generated by reading data from the unordered list and applying an internal sorting technique; the choice of sorting technique depends on m, the number of records stored in each run. (Refer section 13.4 for details)