
Modular Design of High-Throughput, Low-Latency Sorting Units

High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that as the number of inputs increases their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
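To make the hierarchical N-to-M idea concrete, here is a minimal Python sketch (an editorial illustration, not the paper's hardware design): small sorted blocks stand in for the leaf sorting units, and each merge stage keeps only the M largest values, which is what lets an N-to-M unit stay cheaper than a full N-sorter.

```python
# Software sketch of a modular N-to-M "largest values" unit: pairs of sorted
# blocks are merged level by level, and every merge is pruned to the top m,
# mirroring how small sorting units compose into a larger N-to-M unit.

def merge_top_m(a, b, m):
    """Merge two descending-sorted lists, keeping only the m largest values."""
    out = []
    i = j = 0
    while len(out) < m and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out

def n_to_m_sort(values, m, block=4):
    # Leaf stage: small sorted blocks (stand-ins for the small sorting units).
    blocks = [sorted(values[k:k + block], reverse=True)
              for k in range(0, len(values), block)]
    # Merge tree: log(N/block) levels, each output pruned to the top m.
    while len(blocks) > 1:
        blocks = [merge_top_m(blocks[k],
                              blocks[k + 1] if k + 1 < len(blocks) else [],
                              m)
                  for k in range(0, len(blocks), 2)]
    return blocks[0][:m]
```

For example, `n_to_m_sort(list(range(256)), 4)` returns the four largest inputs in descending order, the software analogue of the 256-to-4 unit discussed above.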

Sorting in parallel database systems


Sorting in database processing is frequently required through the use of Order By and Distinct clauses in SQL. Sorting is also widely studied in the computer science community at large. Sorting in general covers internal and external sorting. Past published work has extensively focused on external sorting on uni-processors (serial external sorting), and internal sorting on multiprocessors (parallel internal sorting). External sorting on multiprocessors (parallel external sorting) has received surprisingly little attention; furthermore, the way current parallel database systems do sorting is far from optimal in many scenarios. The authors present a taxonomy for parallel sorting in parallel database systems, which covers five sorting methods: parallel merge-all sort, parallel binary-merge sort, parallel redistribution binary-merge sort, parallel redistribution merge-all sort, and parallel partitioned sort. The first two methods are previously proposed approaches to parallel external sorting which have been adopted as the status quo of parallel database sorting, whereas the latter three methods, which are based on redistribution and repartitioning, are new in that they have not been discussed in the literature on parallel external sorting.
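The first two methods in the taxonomy can be sketched in a few lines of Python, simulated sequentially (an editorial illustration; the real systems run the local sorts on separate processors): "merge-all" merges every locally sorted partition at the host in one step, while "binary-merge" merges partitions pairwise in a tree.

```python
# Sequential simulation of parallel merge-all sort and parallel
# binary-merge sort: each "processor" sorts its partition locally,
# then the sorted runs are merged at the coordinator.
import heapq

def parallel_merge_all(partitions):
    local = [sorted(p) for p in partitions]   # local sort on each processor
    return list(heapq.merge(*local))          # single final merge at the host

def parallel_binary_merge(partitions):
    runs = [sorted(p) for p in partitions]
    while len(runs) > 1:                      # pairwise merging, log(P) rounds
        runs = [list(heapq.merge(runs[i],
                                 runs[i + 1] if i + 1 < len(runs) else []))
                for i in range(0, len(runs), 2)]
    return runs[0]
```

The redistribution variants differ in where the merge work happens, not in this basic local-sort-then-merge shape.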

A comparison of three representative hardware sorting units


Sorting is an important operation for many embedded computing systems. Since sorting large datasets may slow down overall execution, schemes to speed up sorting operations are needed.

Bearing in mind the hardware acceleration of sorting, we present in this paper an analysis and comparison of three hardware sorting units: a sorting network, an insertion sorting unit, and a FIFO-based merge sorting unit. We focus on embedded computing systems implemented with FPGAs, which give us the flexibility to accommodate customized hardware sorting units. We also present a hardware/software solution for sorting data sets larger than the size of the sorting unit. This hardware/software solution achieves an overall speedup of 20 over a pure software implementation of the well-known quicksort algorithm.

Zero-delay FPGA-based odd-even sorting network


Sorting is one of the most well-known problems in computer science and is frequently used for benchmarking computer systems. It can contribute significantly to the overall execution time of a process in a computer system. Dedicated sorting architectures can be used to accelerate applications and/or to reduce energy consumption. In this paper, we propose an efficient sorting network aimed at accelerating the sorting operation in FPGA-based embedded systems. The proposed sorting network is implemented based on an Optimized Odd-even sorting method (O2) using a fully pipelined combinational logic architecture and ring-shaped processing. Consequently, O2 generates the sorted array of numbers in parallel as soon as the input array is given, without any delay or lag. Unlike conventional sorting networks, the O2 sorting network needs neither memory to hold data and sorting state nor an input clock to perform the sorting operations sequentially. We conclude that by using O2 in FPGA-based image processing, we can optimize the performance of filters such as the median filter, which demands high-performance sorting for real-time applications.

Intelligent solid waste processing using optical sensor based sorting technology
Solid wastes are always collected as mixtures of different materials. They get crushed, classified, and sorted in solid waste treatment plants; among these processes, sorting is the determining step for recycling and reuse. Traditional sorting technologies such as magnetic sorting and eddy current sorting can only roughly process a few special ingredients of the waste mixture, such as separating ferrous from non-ferrous metals, because corresponding force fields exist between those waste particles and the separators. Other properties of the solid particles, such as colour, shape, and texture features, could also serve as sorting criteria, but no sufficient force field connects these properties to a separator. In this paper, an indirect sorting process using an optical sensor and a mechanical separating system was developed and introduced. With this system, the size, position, colour, and shape of each waste particle can be determined and used as sorting criteria. The mechanical sorting device consists of a computer-controlled compressed air nozzle; target particles recognized by the sensor are blown out of the main waste stream.

Feature recognition using the optical sensor yields good results. This research provides a new approach for multi-feature recognition in sensor-based sorting technology.

Parallelization of bitonic sort and radix sort algorithms on many core GPUs
Data sorting is used in many fields and plays an important role in defining overall speed and performance. There are many categories of sorting algorithms; in this study, two of them, bitonic sort and radix sort, are dealt with. We have designed and developed radix sort and bitonic sort algorithms for many-core Graphics Processing Units (GPUs). Bitonic sort is a concurrent sorting algorithm, while radix sort is a distribution sorting algorithm; neither is a conventional comparison sort, and both can be parallelized easily on GPUs to obtain better performance than other sorting algorithms. We parallelized these sorting algorithms on many-core GPUs using the Compute Unified Device Architecture (CUDA) platform, developed by NVIDIA Corporation, and obtained performance measurements.
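The data-parallel structure that makes bitonic sort attractive on GPUs is a fixed schedule of compare-exchange operations; the Python sketch below (sequential, for illustration only) shows that schedule, where each inner `for i` loop corresponds to the work one GPU thread per element would do.

```python
# Bitonic sort as a fixed compare-exchange schedule; the input length
# must be a power of two, as in the classic network.

def bitonic_sort(a):
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(a)
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:         # compare-exchange distance within each merge
            for i in range(n):           # data-parallel on a GPU
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because the comparison pattern depends only on the indices, not the data, every stage maps directly onto a grid of CUDA threads with no divergent control flow over element values.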

Breaking the Θ(n log^2 n) barrier for sorting with faults


We study the problem of constructing a sorting circuit, network, or PRAM algorithm that is tolerant to faults. For the most part, we focus on fault patterns that are random, e.g., where the result of each comparison is independently faulty with probability upper-bounded by some constant. All previous fault-tolerant sorting circuits, networks, and parallel algorithms require Ω(log^2 n) depth (time) and/or Ω(n log^2 n) comparisons to sort n items. In this paper, we construct a passive-fault-tolerant sorting circuit with O(n log n loglog n) comparators, a reversal-fault-tolerant sorting network with O(n loglog(2) 3 n) comparators, and a deterministic O(log n)-step, O(n)-processor EREW PRAM fault-tolerant sorting algorithm. The results are based on a new analysis of the AKS circuit, which uses a much weaker notion of expansion that can be preserved in the presence of faults. Previously, the AKS circuit was not believed to be fault-tolerant because the expansion properties that were believed to be crucial for the performance of the circuit are destroyed by random faults. Extensions of our results for worst-case faults are also presented.

The Parallel Enumeration Sorting Scheme for VLSI


We propose a new parallel sorting scheme, called the parallel enumeration sorting scheme, which is suitable for VLSI implementation. This scheme can be introduced into conventional computer systems without changing their architecture. In this scheme, sorting is divided into two stages, the ordering process and the rearranging process. The latter can be efficiently performed by central processing units or intelligent memory devices. For implementation of the ordering process in VLSI technology, we design a new hardware algorithm for parallel enumeration sorting circuits whose processing time is linearly proportional to the number of data items to be sorted. Data are serially transmitted between the sorting circuit and the memory devices, and the total communication between them is minimized. The basic structure used in the algorithm is called a bus-connected cellular array structure with pipeline and parallel processing. The circuit consists of a linear array of one type of simple cell and two buses connecting all cells for efficient global communication within the circuit. The sorting circuit is simple, regular, and small enough for realization with today's VLSI technology. We discuss several applications of the sorting circuit and evaluate its performance.
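The enumeration idea behind the ordering stage is easy to state in code (a software sketch, not the cellular-array circuit): each key's rank is the count of keys that precede it, and these counts are independent of one another, which is what makes the scheme parallel-friendly.

```python
# Enumeration (rank) sorting: the ordering stage computes each key's rank
# with independent comparisons; the rearranging stage places each key at
# its rank. An index tiebreak keeps the sort stable for duplicate keys.

def enumeration_sort(a):
    n = len(a)
    ranks = [sum(1 for j in range(n)
                 if a[j] < a[i] or (a[j] == a[i] and j < i))
             for i in range(n)]
    out = [None] * n
    for i in range(n):
        out[ranks[i]] = a[i]      # rearranging stage
    return out
```

In the hardware scheme the rank computation is pipelined through the cell array in linear time, while this sketch spells out the same ranks sequentially.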

Sorting by parallel insertion on a one-dimensional subbus array


We consider the problem of sorting on a one-dimensional subbus array of processors, an architecture that communicates using a segmentable bus. The subbus broadcast operation makes possible a new class of parallel sorting algorithms whose complexity we analyze with the parallel insertion model. We give per-input lower bounds for sorting in the parallel insertion model and demonstrate sorting strategies that are optimal by matching those lower bounds. For each of our sorting strategies, we discuss the issues involved in implementing them on subbus machines. Finally, we empirically evaluate the performance of our sorting strategies by applying them to shearsort, a common two-dimensional mesh sorting algorithm. Our results suggest that for sorting, the subbus broadcast capability gives at most a slight advantage over using only nearest-neighbor communication.

Carrier Sorting Plan and Sorting Rate of Letters in Postal Logistics


An accurate sorting plan is important for saving postal carriers' work time and for fast delivery of received mail. It is also very important to devise a proper carrier sorting plan when the number of pockets of the letter sorting machine (LSM) is smaller than the number of postal codes or the number of carriers. This paper describes a method for constructing the carrier sorting plan and improving the sorting rate when the number of LSM pockets is small. A sorting-rate improvement of more than 20% was shown by applying the proposed carrier sorting plan to two areas.

A Partition-Merge Based Cache-Conscious Parallel Sorting Algorithm for CMP with Shared Cache
To explore chip-level parallelism, the PSC (Parallel Shared Cache) model is provided in this paper to describe the high-performance shared cache of Chip Multi-Processors (CMP). Then, for a specific application, parallel sorting, a cache-conscious parallel algorithm, PMCC (Partition-Merge based Cache-Conscious), is designed based on the PSC model. The PMCC algorithm consists of two steps: partition-based in-cache sorting and merge-based k-way merge sorting. In the first stage, PMCC divides the input dataset into multiple blocks so that each block fits into the shared L2 cache, and then employs multiple cores to perform parallel in-cache sorting to generate sorted blocks. In the second stage, PMCC first selects an optimized parameter k which not only improves parallelism but also reduces the cache miss rate, and then performs a k-way merge sort over all the sorted blocks. The I/O complexities of the in-cache sorting step and the k-way merge step are analyzed in detail. The simulation results show that the PSC-based PMCC algorithm can outperform the latest PEM-based cache-conscious algorithm; the scalability of PMCC is also discussed. The low I/O complexity, high parallelism, and high scalability of PMCC take advantage of CMP to improve performance significantly and deal with large-scale problems efficiently.

Optimized GPU Sorting Algorithms on Special Input Distributions

We present a high-performance graphics processing unit (GPU) sorting algorithm, ISSD (Improved Sorting considering Special Distributions), implemented with the Compute Unified Device Architecture (CUDA). ISSD focuses on two aspects of improving parallel sorting performance: first, how to decompose the sorting task into independent, balanced subtasks that can be distributed to thousands of threads, realizing parallel sorting and efficiently exploiting the power of the GPU; and second, how to take advantage of special data distributions to further optimize the algorithms and improve their performance. The algorithm is redesigned from our previous general-data-distribution version and optimized both in its general implementation methods and for special input distributions. Experimental results show that for general data distribution inputs, ISSD outperforms existing parallel sorting algorithms by about 10% due to its practical implementation optimizations, and for three special data distribution inputs, ISSD outperforms existing algorithms by more than 40% due to its distribution-specific optimizations. The algorithm is therefore viable and efficient when dealing with specific data distributions.

Text compression using recency rank with context and relation to context sorting, block sorting and PPM*
A block sorting compression scheme was developed and its relation to a statistical scheme was studied, but a theoretical analysis of its performance has not been studied fully. Context sorting is a compression scheme based on context similarity; it is regarded as an on-line version of block sorting and is asymptotically optimal. However, its compression speed is slower and its practical performance is not better. We propose a compression scheme using recency rank code with context (RRC), which is based on context similarity. The proposed method encodes characters to recency ranks according to their contexts. It can be implemented using a suffix tree, and the recency rank code is realized by move-to-front transformation of edges in the suffix tree. It is faster than context sorting and is also asymptotically optimal. The performance is improved by changing models according to the length of the context and by combining some characters into a code. However, it is still inferior to block sorting in both performance and speed. We investigate the reason for the poor performance, prove the asymptotic optimality of a variation of block sorting, and make clear the relations among RRC, context sorting, block sorting, and PPM*.

A reversible watermarking prediction-based scheme using a new sorting technique


The prediction-error expansion technique is one of the reversible watermarking techniques. The sorting technique exploits the correlation between neighboring pixels to optimize the embedding order; sorting is therefore a fundamental step in enhancing the embedding capacity and visual quality.

In this paper, a new sorting technique is designed to improve the hiding capacity and visual quality. Using prediction-error expansion, histogram shifting, and our new sorting technique produces superior results compared with several existing methods. We use a new measure for sorting the cells and show that using only local variance values for sorting is ineffective in some cases. The new measure solves this problem and leads to a more efficient sorting procedure. Experimental results show the efficiency of the proposed sorting procedure.

Bitonic Sorting on Dynamically Reconfigurable Architectures


Sorting is one of the most investigated tasks computers are used for. Up to now, not much research has been put into increasing the flexibility and performance of sorting applications by applying reconfigurable computer systems. There are parallel sorting algorithms (sorting circuits) which are highly suitable for VLSI hardware realization and which outperform sequential sorting methods on traditional software processors by far. But usually they require a large area that increases with the number of keys to be sorted. This drawback concerns ASICs and statically reconfigurable systems. In this paper, we present a way to adapt the well-known Bitonic sorting method to dynamically reconfigurable systems such that this drawback is overcome. We give a detailed description of the design and actual implementation, and we present experimental results to show the performance benefits and the tradeoffs of our approach.

Reliability analysis of a fault-tolerant sorting network


A self-routing fault-tolerant sorting network that employs an enhanced scheme of the Batcher (1968) sorting network was proposed previously. It consists of two Batcher sorting planes with links provided at every stage to allow cell transfer to and from each sorting network, thereby offering multiple paths between each input-output pair, giving a high degree of fault-tolerance, and overcoming the single-path limitation of the Batcher sorting network. In this paper, expressions for reliability, mean time to failure, and availability are derived, and the numerical results show that the fault-tolerant sorting network has superior performance compared to the Batcher and parallel Batcher sorting networks.

How to sort N items using a sorting network of fixed I/O size

Sorting networks of fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network for sorting arbitrarily large datasets has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p to sort an arbitrarily large data set of N elements. A noteworthy feature of our design is that no extra data memory space is required, other than what is used for storing the input. As it turns out, our architecture is feasible for VLSI implementation and its time performance is virtually independent of the cost and depth of the underlying sorting network. Specifically, we show that by using our design N elements can be sorted in Θ((N/p) log(N/p)) time without memory access conflicts. Finally, we show how to use an AT^2-optimal sorting network of fixed I/O size p to construct a similar architecture that sorts N elements in Θ((N/p) log(N/p) log p) time.

Optimization tactics and simulation of a fast sorting system
With the rapid development of logistics technology in our country, the sorting efficiency of the distribution center has become the bottleneck of the whole process. At present, various sorting devices emerge as the times require, and improving sorting efficiency remains the difficult problem for these devices. This article starts from the optimization tactics of the sorting system and builds an optimization model of the rapid sorting system; simulation analysis and practical runs show that this model can greatly improve the efficiency of the sorting system. The model is based on the sorting devices of the tobacco industry and can be applied to other industries.

Segment Clustering Radar Signal Sorting


Radar signal sorting is the extraction of the pulse train of a single radar emitter from a dense, complex pulse stream. The tolerance of radar signal sorting in modern electronic warfare is analyzed. The complex and dense pulse environment has become a vital factor restricting the sorting efficiency of conventional multi-parameter signal sorting systems. A segment clustering radar signal sorting method based on support vector clustering (SVC) is presented, following the ideas of statistical learning theory. It prevents tolerance from affecting radar sorting. The sorting accuracy and the sensitivity of the algorithm to parameter variation are analyzed. The experimental results show that the presented sorting method effectively overcomes the tolerance of radar signal sorting.

Adaptive binary sorting schemes and associated interconnection networks

Many routing problems in parallel processing, such as concentration and permutation problems, can be cast as sorting problems. In this paper, we consider the problem of sorting on a new model, called an adaptive sorting network. We show that any sequence of n bits can be sorted on this model in O(lg^2 n) bit-level delay using O(n) constant fan-in gates. This improves the cost complexity of K.E. Batcher's binary sorters (1968) by a factor of O(lg^2 n) while matching their sorting time. The only other network that can sort binary sequences at O(n) cost is the network version of the columnsort algorithm, but it requires excessive pipelining. In addition, using binary sorters, we construct permutation networks with O(n lg n) bit-level cost and O(lg^3 n) bit-level delay. These results provide the asymptotically least-cost practical concentrators and permutation networks to date. We note, of course, that the well-known AKS sorting network has O(lg n) sorting time and O(n lg n) cost, but the constants hidden in these complexities are so large that our complexities outperform those of the AKS sorting network until n becomes extremely large.
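The reason binary sorting admits O(n)-cost solutions at all is that sorting a bit sequence reduces to counting: with c zeros among n bits, the sorted output is simply c zeros followed by n - c ones. The sketch below is the trivial software analogue; the hard part, and the contribution above, is realizing this with O(n) constant fan-in gates in O(lg^2 n) depth.

```python
# Sorting a binary sequence by counting: the sorted output is fully
# determined by the number of zeros.

def sort_bits(bits):
    zeros = bits.count(0)
    return [0] * zeros + [1] * (len(bits) - zeros)
```

Concentrators and permutation networks then follow because routing decisions can be encoded as bit sequences and sorted.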

Simulation and optimization about sorting system of distribution center


New areas of modern logistics research focus on three-dimensional simulation technology, mainly on methods of logistics simulation modeling and technologies for logistics simulation implementation. Based on an in-depth study of the principles of sorting systems, this paper uses the three-dimensional simulation software Flexsim to realize the system design, simulation modeling, simulation runs, data analysis, and system optimization of two sorting modes of a distribution center: a picking-style sorting system and a seeding-type sorting system. The validity, rationality, and optimization of the simulated sorting systems are verified, and the advantages and disadvantages of the two sorting modes are shown clearly, which has important reference value for the development and improvement of enterprise sorting systems and for applications of the Flexsim simulation software in logistics.

An optimal hardware-algorithm for sorting using a fixed-size parallel sorting device


We present a hardware-algorithm for sorting N elements using either a p-sorter or a sorting network of fixed I/O size p while strictly enforcing conflict-free memory accesses. To the best of our knowledge, this is the first realistic design that achieves optimal time performance, running in Θ((N log N)/(p log p)) time for all ranges of N. Our result completely resolves the problem of designing an implementable, time-optimal algorithm for sorting N elements using a p-sorter. More importantly, however, our result shows that, in order to achieve optimal time performance, all that is needed is a sorting network of depth O(log^2 p), such as, for example, Batcher's classic bitonic sorting network.
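A software sketch of the overall flow (an illustration only; it does not reproduce the paper's conflict-free memory schedule or its optimal time bound): a fixed-size p-sorter is treated as a black box that produces sorted runs of length p, and the runs are then merged.

```python
# Sorting N elements through a black-box p-sorter of fixed I/O size:
# the p-sorter generates sorted runs, which software then merges.
import heapq

P = 8  # illustrative fixed I/O size of the hypothetical p-sorter

def p_sorter(chunk):
    assert len(chunk) <= P       # the device only accepts p inputs at a time
    return sorted(chunk)

def sort_with_p_sorter(data):
    runs = [p_sorter(data[i:i + P]) for i in range(0, len(data), P)]
    return list(heapq.merge(*runs))
```

The hardware-algorithm above improves on this naive run-then-merge pattern by pipelining the merges through the p-sorter itself with conflict-free memory accesses.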

Fault tolerance analysis of odd-even transposition sorting networks


Sorting networks are important hardware and software models of parallel sorting operations. They have several applications, such as ATM switching, distributed processing, and optical implementations of sorting. In this paper we investigate the fault-tolerance properties of a special class of sorting networks called odd-even transposition sorting networks. These networks have a simple and reliable hardware structure which is easy to implement in VLSI technology. A simulation program of these networks' operation has been developed in C++. The simulation results revealed two important properties of odd-even transposition sorting networks: any single stuck-at-X fault occurring in an internal comparator is redundant, and any two stuck-at-X faults occurring in a large number of internal comparators are redundant.
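For reference, the fault-free behavior of the network being simulated is the classic odd-even transposition sort: n stages alternate compare-exchanges on even and odd adjacent pairs (a software sketch, not the C++ fault simulator).

```python
# Odd-even transposition sort: n alternating stages of adjacent
# compare-exchange operations suffice to sort n elements.

def odd_even_transposition_sort(a):
    a = list(a)
    n = len(a)
    for stage in range(n):
        start = stage % 2                 # even stage: pairs (0,1),(2,3),...
        for i in range(start, n - 1, 2):  # odd stage:  pairs (1,2),(3,4),...
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The internal-comparator redundancy found in the study reflects how much overlap there is between successive stages of this fixed schedule.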

Classical sorting embedded in genetic algorithms for improved permutation search


A sorting algorithm defines a path in the search space of n! permutations based on the information provided by a comparison predicate. Our generic mutation operator for hybridization is a hill-climber that follows the path traced by any sorting algorithm. Our proposal adds exploitation capability to the mutation operator. Mutation requests swaps to construct and test new permutations, while the sorting algorithm suggests which pairs to compare and swap. The need to compare pairs of items in sorting is fulfilled by evaluating a guiding function. This novel HGA-sorting hybrid, instantiated with Insertion Sort, dramatically improves previous results on a benchmark of Error-Correcting Graph Isomorphism experiments.

Mechatronics applications to fish sorting Part 2: Control and sorting mechanism


This paper describes research conducted at Monash University (Australia) to design a new mechatronic fish sorting system. The system consists of a vision system for fish size identification and control, and a sorting mechanism. This paper focuses on the control and sorting mechanism. The conceptual design and implementation of the mechatronic sorting system are discussed, the new design of the fish sorting flap is explained, and an illustration of the integrated mechatronic design approach is presented. The paper also describes the overall structure of the sorting system and the control methodology. Lessons learned are summarised in the conclusions.

A hardware design approach for merge-sorting networks


In this paper, a hardware design methodology for merge-sorting networks is proposed, which uses a fixed-size Batcher sorting network, a data memory module, and a memory addressing controller. In this method, the amount of data to be sorted can be extended easily simply by adjusting the data flow of the memory addressing controller; notably, the adjustment of the data flow is quite regular. Therefore, the proposed method has the following merits: low complexity of parallel sorting networks, low hardware fabrication cost, high extensibility, high regularity, and no extra data memory space needed. To verify the proposed approach, a 128-item merge-sorting network has been designed and simulated in Verilog HDL.

About sorting problems solution on neurocomputer


Different approaches to the solution of sorting problems on neural networks are given. Consideration is given to sorting on a neural network with adaptive coefficients, sorting on a neural network with fixed coefficients, sorting on a neural network in the dynamic regime, consecutive sorting on a one-layer neural network, and sorting on a neural network by the comparison method.

The effect of local sort on parallel sorting algorithms


We show the importance of sequential sorting in the context of in-memory parallel sorting of large data sets of 64-bit keys. First, we analyze several sequential strategies, such as Straight Insertion, Quick sort, Radix sort, and Cache-Conscious Radix sort (CC-Radix sort). As a consequence of the analysis, we propose a new algorithm that we call the Sequential Counting Split Radix sort (SCS-Radix sort). This is a combination of some of the algorithms analyzed and other new ideas. There are three important contributions in SCS-Radix sort: first, the work saved by detecting data skew dynamically; second, the exploitation of the memory hierarchy by the algorithm; and third, the execution-time stability of SCS-Radix when sorting data sets with different characteristics. We evaluate the use of SCS-Radix sort in the context of a parallel sorting algorithm on an SGI Origin 2000. The parallel algorithm is 1.2 to 45 times faster using the SCS-Radix sort than using the Radix sort or Quick sort.
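The sequential building block all the radix variants above share is least-significant-digit radix sort over fixed-width keys; here is a minimal Python sketch on byte digits (CC-Radix and SCS-Radix add cache-conscious partitioning and dynamic skew detection on top of this pattern).

```python
# LSD radix sort on byte digits for non-negative fixed-width integer keys:
# one stable distribution pass per byte, least significant byte first.

def radix_sort(keys, key_bytes=8):
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)   # stable distribution
        keys = [k for b in buckets for k in b]
    return keys
```

Stability of each distribution pass is what makes the byte-by-byte composition correct; cache-conscious variants chiefly change how the buckets are laid out and traversed in memory.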

Antisequential suffix sorting for BWT-based data compression

Suffix sorting requires ordering all suffixes of all symbols in an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows-Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are O(N^2), O(|X|N log(N/|X|)), and O(N|X| log(N/|X|)), respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that these algorithms are simple; some of them can be implemented in VLSI. This could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
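To make the suffix sorting/BWT connection concrete, here is a deliberately naive sketch (the suffix-lists structure above exists precisely to avoid this slice-comparison approach, whose worst case is far above the bounds quoted): sort suffixes, then read off the character preceding each suffix in sorted order.

```python
# Naive suffix sorting and the Burrows-Wheeler transform built from it.

def suffix_sort(s):
    """Return suffix start positions in lexicographic order of suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def bwt(s, sentinel="$"):
    s += sentinel                        # unique smallest terminator
    order = suffix_sort(s)
    # BWT output: the character just before each sorted suffix
    # (s[-1] is the sentinel, for the suffix starting at position 0).
    return "".join(s[i - 1] for i in order)
```

For example, `bwt("banana")` yields `"annb$aa"`, the clustered output that downstream move-to-front and entropy coding stages exploit.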

Optimal Sorting Algorithms for a Simplified 2D Array with Reconfigurable Pipelined Bus System
In recent years, many researchers have investigated optical interconnections for parallel computing. Optical interconnections are attractive due to their high bandwidth and concurrent access to the bus in a pipelined fashion. The Linear Array with Reconfigurable Pipelined Bus System (LARPBS) model is a powerful optical bus system that combines the advantages of optical buses and reconfiguration. To increase the scalability of the LARPBS model, we propose a two-dimensional extension: a simplified two-dimensional Array with Reconfigurable Pipelined Bus System (2D ARPBS). While achieving better scalability, we show the effectiveness of this newly proposed model by designing two novel optimal sorting algorithms on it. The first sorting algorithm is an extension of Leighton's seven-phase columnsort algorithm that eliminates the restriction of sorting only an r × s array, where r ≥ s^2, and sorts an n × n array in O(log n) time. The second is an optimal multiway mergesort algorithm that uses a novel processor-efficient two-way mergesort algorithm and a novel multiway merge scheme to sort n^2 items in O(log n) time. Using an optimal sorting algorithm, Pipelined Mergesort, designed for the LARPBS model as a building block, we extend our research on parallel sorting on the LARPBS to the more scalable 2D ARPBS model and achieve optimality in both sorting algorithms.

Analysis of Fast Parallel Sorting Algorithms for GPU Architectures

Sorting algorithms have been studied extensively for the past three decades. They are used in many applications, including real-time systems, operating systems, and discrete event simulations, and in most cases the efficiency of an application depends on the sorting algorithm it uses. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on the GPU and present an analysis of parallel and sequential bitonic, odd-even, and rank-sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and sorting rate, and the speed-up of bitonic sort over the odd-even sorting algorithm is shown on different GPUs and CPUs. The algorithms have been written to exploit the task parallelism model available on multi-core GPUs using the OpenCL specification. Our findings report a minimum speed-up of 19x for bitonic sort over the odd-even sorting technique for small queue sizes on the CPU, and a maximum speed-up of 2300x for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
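For reference, the comparison pattern of bitonic sort, whose data-independent compare-exchanges are what make it attractive on GPUs, can be sketched sequentially. This is a minimal illustration of the algorithm for power-of-two input sizes, not the paper's OpenCL implementation:

```python
def bitonic_sort(a, ascending=True):
    """Recursive bitonic sort; len(a) must be a power of two. Every
    compare-exchange position is fixed in advance (data-independent),
    which is what lets each one map to an independent GPU work-item."""
    n = len(a)
    if n <= 1:
        return list(a)
    # Build a bitonic sequence: sort the first half up, the second half down.
    first = bitonic_sort(a[:n // 2], True)
    second = bitonic_sort(a[n // 2:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    """Merge a bitonic sequence into a monotonic one."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

On a GPU, each level of the recursion becomes one kernel launch in which all compare-exchanges of that level run in parallel.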

Fast sorting algorithms on reconfigurable array of processors with optical buses


The Reconfigurable Array with Spanning Optical Buses (RASOB) has recently received a lot of attention from the research community. By taking advantage of the unique properties of optical transmission, the RASOB provides flexible reconfiguration and strong connectivity with low hardware and control complexity. In this paper, we use this architecture to design efficient sorting algorithms on the 1-D RASOB and the 2-D RASOB. Our parallel sorting algorithm on the 1-D RASOB, which sorts N data items using N processors in O(k) time, where k is the size in bits of the data items to be sorted, is based on a novel divide-and-conquer scheme. Our parallel sorting algorithm on the 2-D RASOB builds on the 1-D algorithm in conjunction with the well-known Rotatesort algorithm; it sorts N data items on a 2-D RASOB of size N in O(k) time. These sorting algorithms outperform state-of-the-art sorting algorithms on reconfigurable arrays of processors with electronic buses.
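The O(k)-time, bit-serial flavor of such algorithms can be conveyed by a sequential sketch: a stable least-significant-bit radix sort that spends one pass per bit, analogous to spending one bus round per bit of the keys. This is an illustrative analogy only, not the RASOB algorithm itself:

```python
def binary_radix_sort(items, k):
    """Stable LSD radix sort of k-bit non-negative integers: one full pass
    per bit, so total work is O(k * n) -- the sequential analogue of an
    O(k)-round bit-serial parallel sort."""
    for bit in range(k):
        # Stable partition on the current bit: zeros keep their order, then ones.
        zeros = [x for x in items if not (x >> bit) & 1]
        ones = [x for x in items if (x >> bit) & 1]
        items = zeros + ones
    return items

print(binary_radix_sort([5, 3, 7, 0, 6, 1], 3))  # [0, 1, 3, 5, 6, 7]
```

Stability of each pass is what makes the bit-by-bit ordering compose into a full sort.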

Electronic Data Sorting


This paper presents results of a study of the fundamentals of sorting. Emphasis is placed on understanding sorting and on minimizing the time required to sort with electronic equipment of reasonable cost. Sorting is viewed as a combination of information gathering and item moving activities. Shannon's communication theory measure of information is applied to assess the difficulty of various sorting problems. Bounds on the number of comparisons required to sort are developed, and optimal or near-optimal sorting schemes are described and investigated. Three abstract sorting models based on cyclic, linear, and random-access memories are defined. Optimal or near-optimal sorting methods are developed for the models and their parallel-register extensions. A brief review of the origin of the work and some of its hypotheses is also presented.
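The information-theoretic view above yields the classic comparison lower bound: sorting n distinct items must distinguish between n! possible permutations, and each binary comparison yields at most one bit, so at least ceil(log2 n!) comparisons are needed in the worst case. A small sketch:

```python
import math

def comparison_lower_bound(n):
    """Worst-case lower bound on comparisons to sort n distinct items:
    ceil(log2(n!)) -- the number of bits needed to identify one of n!
    permutations when each comparison reveals at most one bit."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (4, 8, 16):
    print(n, comparison_lower_bound(n))  # 5, 16, 45 comparisons respectively
```

By Stirling's approximation this bound is n log2 n - O(n), which is why Θ(n log n) comparison sorts are asymptotically optimal.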

Design and analysis of a systolic sorting architecture


We present a new parallel sorting algorithm that uses a fixed-size sorter iteratively to sort inputs of arbitrary size, and we propose a parallel sorting architecture based on this algorithm. The architecture consists of three components: linear arrays that support constant-time operations, a multilevel sorting network, and a termination detection tree, all operating concurrently in a systolic processing fashion. The structure of this sorting architecture is simple and regular, making it highly suitable for VLSI realization. Theoretical analysis and experimental data indicate that the performance of this architecture is likely to be excellent in practice.
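The general idea of reusing a fixed-size sorter iteratively can be sketched with block odd-even transposition: pre-sort blocks of p/2 items, then repeatedly feed adjacent block pairs through a p-sorter (simulated here by `sorted()`). This is an illustrative scheme under those assumptions, not the paper's systolic architecture:

```python
def sort_with_fixed_sorter(a, p):
    """Sort a list of arbitrary length using only a p-input sorter.
    Blocks of p//2 items are merge-split pairwise in odd-even
    transposition phases; each merge-split is one p-sorter invocation."""
    assert p >= 2 and p % 2 == 0
    half = p // 2
    blocks = [sorted(a[i:i + half]) for i in range(0, len(a), half)]  # initial pass
    m = len(blocks)
    for phase in range(m):
        # Alternate between even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),...
        for i in range(phase % 2, m - 1, 2):
            merged = sorted(blocks[i] + blocks[i + 1])  # one p-sorter call
            blocks[i], blocks[i + 1] = merged[:half], merged[half:]
    return [x for b in blocks for x in b]

print(sort_with_fixed_sorter([5, 1, 4, 2, 3], 4))  # [1, 2, 3, 4, 5]
```

With m blocks, m merge-split phases suffice, so the p-sorter is invoked O(m²) times in this naive scheme; the point is only that a small sorter can be iterated to sort arbitrarily large inputs.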

Sequence sorting in secondary storage


We investigate the I/O complexity of sorting sequences (or strings of characters) in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is Θ(K log2 K + N). By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting sequences is Θ((K/B) log_{M/B}(K/B) + N/B), but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where the strings are not allowed to be broken into their individual characters, and we show that the I/O complexity of string sorting in this model is Θ((N1/B) log_{M/B}(N1/B) + K2 + N/B), where N1 is the total length of all strings shorter than B and K2 is the number of strings longer than B. We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and, in several cases, matching lower bounds. Finally, we develop more practical algorithms outside the comparison model.

Modular design of a large sorting network


Batcher sorting networks have been extensively used in the design of ATM switches based on the Batcher-banyan interconnection network. Batcher sorting networks require a large number of stages of sorting elements, especially for large network sizes. This results in high delay, difficulty in partitioning the design into ICs, and difficulty in maintaining synchronization across the entire structure. In this paper, we present a simple design of a sorting network that can be used as a building block to build larger sorting networks of arbitrary size. The proposed design is very modular and can be efficiently implemented using current VLSI technology.
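As background, Batcher's odd-even mergesort defines a data-independent compare-exchange pattern, which is why it can be wired directly as stages of hardware comparators. A sequential sketch for power-of-two input sizes (illustrative only, not the modular design proposed in the paper):

```python
def oddeven_merge_sort(a):
    """Batcher's odd-even mergesort as a comparison network; len(a) must be
    a power of two. The compare-exchange sequence is fixed in advance, so
    the same pattern can be realized as comparator stages in hardware."""
    a = list(a)

    def compare_swap(i, j):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]

    def merge(lo, hi, r):            # odd-even merge of a[lo..hi] at stride r
        step = r * 2
        if step < hi - lo:
            merge(lo, hi, step)      # merge the even-indexed subsequence
            merge(lo + r, hi, step)  # merge the odd-indexed subsequence
            for i in range(lo + r, hi - r, step):
                compare_swap(i, i + r)
        else:
            compare_swap(lo, lo + r)

    def sort(lo, hi):                # sort a[lo..hi] inclusive
        if hi - lo >= 1:
            mid = lo + (hi - lo) // 2
            sort(lo, mid)
            sort(mid + 1, hi)
            merge(lo, hi, 1)

    sort(0, len(a) - 1)
    return a

print(oddeven_merge_sort([8, 3, 5, 1, 7, 2, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

For N inputs the network has O(log² N) comparator stages, which is the stage count that motivates the modular decomposition discussed above.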
