
Polyphase merge

The polyphase merge changes the game. There might be N files, but the polyphase merge reads from N − 1 of them and writes only one output file at a time. Writing to that output file continues until an input file is exhausted, and then that input file becomes the new output file. The number of runs in each file is related to Fibonacci numbers and Fibonacci numbers of higher order[2][3]. A polyphase merge sort is an algorithm which decreases the number of "runs" at every iteration of the main loop by merging runs from the input files. Typically, a merge sort splits items into groups, recursively sorts each group, and then merges the sorted groups into a final, sorted sequence. Polyphase merge sorts are ideal for sorting and merging large files. The files are opened as file streams; at the end of each iteration, the exhausted input file is closed and reopened as the new output file, and the previous output file is reopened as an input file. The use of file streams makes it possible to sort and merge files which cannot be loaded into the computer's main memory.

Perfect 3 file polyphase merge sort


It is easiest to look at the polyphase merge starting from its ending conditions and working backwards. At the start of each iteration, there will be two input files and one output file. At the end of the iteration, one input file will have been completely consumed and will become the output file for the next iteration. The current output file will become an input file for the next iteration. The remaining files (just one in the 3 file case) have only been partially consumed and their remaining runs will be input for the next iteration. File 1 just emptied and became the new output file. One run is left on each input tape, and merging those runs together will make the sorted file.
File 1 (out):                           -->     <1 run> *            (the sorted file)
File 2 (in ): ... | <1 run> *           --> ... <1 run> | *          (consumed)
File 3 (in ):     | <1 run> *           -->     <1 run> | *          (consumed)

 ... possible runs that have already been read
  |  marks the read pointer of the file
  *  marks end of file

Stepping back to the previous iteration, we were reading from 1 and 2. One run is merged from 1 and 2 before file 1 goes empty. Notice that file 2 is not completely consumed -- it has one run left to match the final merge (above).
File 1 (in ): ... | <1 run> *           --> ... <1 run> | *
File 2 (in ):     | <2 run> *           -->     <1 run> | <1 run> *
File 3 (out):                           -->     <1 run> *

Stepping back another iteration, 2 runs are merged from 1 and 3 before file 3 goes empty.
File 1 (in ):     | <3 run> *           -->     <2 run> | <1 run> *
File 2 (out):                           -->     <2 run> *
File 3 (in ): ... | <2 run> *           --> ... <2 run> | *

Stepping back another iteration, 3 runs are merged from 2 and 3 before file 2 goes empty.
File 1 (out):                           -->     <3 run> *
File 2 (in ): ... | <3 run> *           --> ... <3 run> | *
File 3 (in ):     | <5 run> *           -->     <3 run> | <2 run> *

Stepping back another iteration, 5 runs are merged from 1 and 2 before file 1 goes empty.
File 1 (in ): ... | <5 run> *           --> ... <5 run> | *
File 2 (in ):     | <8 run> *           -->     <5 run> | <3 run> *
File 3 (out):                           -->     <5 run> *

Looking at the number of runs merged working backwards: 1, 1, 2, 3, 5, ... reveals a Fibonacci sequence. For everything to work out right, the initial file to be sorted must be distributed to the proper input files, and each input file must have the correct number of runs on it. In the example, that would mean an input file with 13 runs would write 5 runs to file 1 and 8 runs to file 2. In practice, the input file won't happen to have a Fibonacci number of runs in it (and the number of runs won't be known until after the file has been read). The fix is to pad the input files with dummy runs to make up the required Fibonacci sequence.

For comparison, the ordinary merge sort will combine 16 runs in 4 passes using 4 files. The polyphase merge will combine 13 runs in 5 passes using only 3 files. Alternatively, a polyphase merge will combine 17 runs in 4 passes using 4 files. (Sequence: 1, 1, 1, 3, 5, 9, 17, 31, 57, ...) An iteration (or pass) in ordinary merge sort involves reading and writing the entire file. An iteration in a polyphase sort does not read or write the entire file[4], so a typical polyphase iteration will take less time than a merge sort iteration.

Inverted List
In file organization, this is a file that is indexed on many of the attributes of the data itself. The inverted list method has a single index for each key type. The records are not necessarily stored in sequence: they are placed in the data storage area, and the indexes are updated with the record keys and locations. For example, in a company file, one index could be maintained for all products and another for product types. It is therefore faster to search the indexes than to scan every record. These types of files are also known as "inverted indexes." However, inverted list files use more media space, and storage devices fill up quickly with this type of organization. The benefit is apparent immediately because searching is fast; updating, on the other hand, is much slower.

Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism. Data items in these systems are usually stored compressed, which would normally slow the retrieval process, but the compression algorithm is chosen to support this technique. Some queries are modal, meaning that rules are set which require additional information to be held in the index. An example of this modality: when phrase querying is undertaken, the algorithm requires that word offsets be held in addition to document numbers.

External sorting

In situations where one must handle or store in lexicographic order a very large number of items, so large as to not fit in main memory, external sorting techniques are required. The general strategy in external sorting is to begin by sorting small batches of records from a file in internal memory. These small batches are commonly called run lists. The size of these run lists depends on the amount of internal memory set aside for the sorting. The run lists are written to a target file from which they are later retrieved and merged together to form fewer but larger run lists. This process of merging run lists to form fewer and larger run lists continues, and eventually terminates with the production of a single run list that is the desired sorted file. Two commonly used external sorting techniques are:

1. External merge sort
2. Bucket sort

External merge sort

Balanced Merge Sorting

Having created runs, the next step is to distribute and merge those runs on tapes until only one is left. One algorithm to do so is called balanced merge sorting. But before we introduce it, let us review an algorithm used to merge one ordered run onto a specified output file from all input files. Recall the internal sorting algorithm merge sort, in which two ordered sequences are combined into a single ordered sequence. It is not difficult to extend this idea to the notion of P-way merging, where P input runs are combined into a single output run. The algorithm should be easy to understand and is listed below:

merge(int out)
{
    int i, isml;
    typekey lastout;

    /* LastRec[] stores records from every input file */
    extern struct rec LastRec[];
    extern char FilStat[];

    lastout = Min_key;
    LastRec[0].key = Max_key;        /* sentinel for the minimum search */
    while (TRUE) {
        isml = 0;
        /*
         * Select the smallest record that is
         * no less than the last out
         */
        for (i = 1; i <= maxfiles; i++)
            if (FilStat[i] == 'i' && !Eof(i) &&
                LastRec[i].key >= lastout &&
                LastRec[i].key < LastRec[isml].key)
                isml = i;

        if (isml == 0) {             /* not found */
            for (i = 1; i <= maxfiles; i++)
                if (FilStat[i] == 'i' && !Eof(i))
                    return(0);       /* this run ended, but input remains */

            return(DONE);            /* all tapes exhausted */
        }
        WriteFile(out, LastRec[isml]);
        lastout = LastRec[isml].key;
        LastRec[isml] = ReadFile(isml);
    }
} /* end of merge() */

Balanced merge sorting is a simple scheme for sorting external files. The tapes are divided into two groups, and every pass merges the runs from one group while distributing the merge results onto the other group. However, during distribution only one tape from the output group is active at a time, so there is still room for improvement in tape utilization.

balance_sort()
{
    int i, runs;
    extern int maxfiles, unit;
    extern char FilStat[];
    extern struct rec LastRec[];

    /* Initialize input/output files */
    OpenRead(1);                     /* i.e. FilStat[1] = 'i' */
    for (i = 2; i <= maxfiles; i++)
        if (i <= maxfiles/2)
            FilStat[i] = '-';        /* idle */
        else
            OpenWrite(i);            /* i.e. FilStat[i] = 'o' */

    distribute();

    do {
        /* re-assign files */
        for (i = 1; i <= maxfiles; i++)
            if (FilStat[i] == 'o') {
                OpenRead(i);
                LastRec[i] = ReadFile(i);
            } else
                OpenWrite(i);

        for (runs = 1; merge(nextfile()) != DONE; runs++)
            ;
    } while (runs > 1);

    return (unit);
} /* end of balance_sort() */

Selection of the next output file for balanced merge sorting is as follows:

nextfile()
{
    extern int maxfiles, unit;
    extern char FilStat[];

    do
        unit = unit % maxfiles + 1;
    while (FilStat[unit] != 'o');    /* 'o' = output */
    return (unit);
} /* end of nextfile() */

Balanced merge sorting is already an improvement over the simplest straight merge sorting algorithm. In straight merge sorting, one tape is fixed as the output file, so runs must be redistributed from this fixed tape to the other input tapes after every merge. Distribution does nothing to reduce the number of runs, yet it still makes a pass over the file data. By comparison, balanced merge sorting makes a multi-way merge every time after the first distribution; it achieves this by uniting the merge and distribution passes (the activity level per tape is increased and balanced).

In general, the more tapes involved in a merge, the longer the runs created by that merge. So it is more advantageous to use a multi-way merge than a binary (or two-way) merge when enough tapes are available.
