High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that as the number of inputs increases, their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
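The M-largest-from-N selection that these hardware units implement can be sketched in software. The following is a minimal illustrative analogue (the heap-based approach and the function name are our own, not the paper's architecture):

```python
import heapq

def largest_m(values, m):
    """Return the m largest of the inputs in descending order.

    Software analogue of an M-from-N sorting unit: keep a
    min-heap of the current m leaders and evict the smallest
    whenever a larger value arrives.
    """
    heap = []
    for v in values:
        if len(heap) < m:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

For example, `largest_m(range(256), 4)` returns `[255, 254, 253, 252]`, mirroring the 256-to-4 configuration mentioned above; the hardware version achieves this with a tree of compare-exchange blocks rather than a heap.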
Bearing in mind the hardware acceleration of sorting, we present in this paper an analysis and comparison of three hardware sorting units: a sorting network, insertion sorting, and FIFO-based merge sorting. We focus on embedded computing systems implemented with FPGAs, which give us the flexibility to accommodate customized hardware sorting units. We also present a hardware/software solution for sorting data sets larger than the capacity of the sorting unit. This hardware/software solution achieves a 20× overall speedup over a pure software implementation of the well-known quicksort algorithm.
Intelligent solid waste processing using optical sensor based sorting technology
Solid wastes are always collected as mixtures of different materials. They are crushed, classified, and sorted in solid waste treatment plants; among these processes, sorting is the decisive step for recycling and reuse. Traditional sorting technologies such as magnetic sorting and eddy-current sorting can only roughly process a few special ingredients of the waste mixture, such as separating ferrous from non-ferrous metals, because corresponding force fields exist between those waste particles and the separators. Other properties of the solid particles, such as colour, shape, and texture features, could also serve as sorting criteria, but no sufficient force field links these properties to a separator. In this paper, an indirect sorting process using an optical sensor and a mechanical separating system is developed and introduced. With this system, the size, position, colour, and shape of each waste particle can be determined and used as sorting criteria. The mechanical sorting device consists of a computer-controlled compressed-air nozzle; target particles recognized by the sensor are blown out of the main waste stream.
Feature recognition using the optical sensor yields good results. This research provides a new approach to multi-feature recognition in sensor-based sorting technology.
Parallelization of bitonic sort and radix sort algorithms on many core GPUs
Data sorting is used in many fields and plays an important role in defining overall speed and performance. There are many sorting categories. In this study, two of these sorting algorithms, bitonic sort and radix sort, are dealt with. We have designed and developed radix sort and bitonic sort algorithms for many-core Graphics Processing Units (GPUs). Bitonic sort is a concurrent sorting algorithm, while radix sort is a distribution sorting algorithm; neither is a conventional sequential sorting algorithm, and both can be parallelized easily on GPUs to achieve better performance than other sorting algorithms. We parallelized these sorting algorithms on many-core GPUs using the Compute Unified Device Architecture (CUDA) platform developed by NVIDIA Corporation and report performance measurements.
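As a point of reference for the data-independent compare-exchange pattern that makes bitonic sort GPU-friendly, here is a sequential sketch (our own illustration, not the paper's CUDA code; on a GPU, each pass of the inner loop would become one parallel kernel launch):

```python
def bitonic_sort(a):
    """Sort a list whose length is a power of two using the
    bitonic compare-exchange network (sequential sketch)."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(a)
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # compare-exchange distance within each sequence
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Because the comparison pattern depends only on indices, never on the data, all n/2 compare-exchanges of a pass can run in lockstep across GPU threads, which is exactly why bitonic sort parallelizes so naturally.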
We present an O(n)-processor EREW PRAM fault-tolerant sorting algorithm. The results are based on a new analysis of the AKS circuit, which uses a much weaker notion of expansion that can be preserved in the presence of faults. Previously, the AKS circuit was not believed to be fault-tolerant, because the expansion properties thought to be crucial for the performance of the circuit are destroyed by random faults. Extensions of our results to worst-case faults are also presented.
For sorting, the subbus broadcast capability gives at most a slight advantage over using only nearest-neighbor communication.
A Partition-Merge Based Cache-Conscious Parallel Sorting Algorithm for CMP with Shared Cache
To exploit chip-level parallelism, the PSC (Parallel Shared Cache) model is introduced in this paper to describe the high-performance shared cache of Chip Multi-Processors (CMPs). Then, for a specific application, parallel sorting, a cache-conscious parallel algorithm, PMCC (Partition-Merge based Cache-Conscious), is designed based on the PSC model. The PMCC algorithm consists of two stages: partition-based in-cache sorting and merge-based k-way merge sorting. In the first stage, PMCC divides the input dataset into multiple blocks so that each block fits into the shared L2 cache, and then employs multiple cores to perform parallel in-cache sorting to generate sorted blocks. In the second stage, PMCC first selects an optimized parameter k that both improves parallelism and reduces the cache miss rate, then performs a k-way merge sort over all the sorted blocks. The I/O complexities of the in-cache sorting stage and the k-way merge stage are analyzed in detail. Simulation results show that the PSC-based PMCC algorithm outperforms the latest PEM-based cache-conscious algorithm; the scalability of PMCC is also discussed. The low I/O complexity, high parallelism, and high scalability of PMCC take advantage of the CMP to improve performance significantly and to handle large-scale problems efficiently.
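The two-stage partition-then-merge structure can be illustrated with a short sketch (our own simplification: `sorted` stands in for the parallel in-cache sort, and `heapq.merge` stands in for the k-way merge stage; `block_size` plays the role of the L2-cache-sized block):

```python
import heapq

def partition_merge_sort(data, block_size):
    """Two-stage sort in the spirit of PMCC: sort cache-sized
    blocks independently (in parallel on a CMP), then k-way
    merge the resulting sorted runs."""
    # Stage 1: partition into blocks and sort each one in-cache.
    runs = [sorted(data[i:i + block_size])
            for i in range(0, len(data), block_size)]
    # Stage 2: k-way merge of all sorted runs.
    return list(heapq.merge(*runs))
```

In the real algorithm, the block size is chosen so a block fits in the shared L2 cache, and k is tuned to trade parallelism against cache misses; the sketch fixes neither and only shows the data flow.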
We present a high-performance graphics processing unit (GPU) sorting algorithm, ISSD (Improved Sorting considering Special Distributions), implemented with the Compute Unified Device Architecture (CUDA). The ISSD focuses on two aspects to improve parallel sorting performance. One is how to decompose the sorting task into independent and balanced subtasks that can then be easily distributed to thousands of threads, realizing the concept of parallel sorting and efficiently exploiting the power of the GPU. The other is how to take advantage of special data distributions to further optimize the algorithms and improve their performance. The algorithm is redesigned based on our previous general-data-distribution version and optimized both in its general implementation methods and for special input distributions. Experimental results show that for general data distribution inputs, the ISSD outperforms the existing parallel sorting algorithms by about 10% in performance due to its practical optimization in implementation; and for three special data distribution inputs, the ISSD outperforms the existing algorithms by more than 40% due to its special optimization based on the data distributions. Therefore, the algorithm is viable and efficient when dealing with specific data distributions.
Text compression using recency rank with context and relation to context sorting, block sorting and PPM*
A block sorting compression scheme has been developed and its relation to a statistical scheme studied, but a theoretical analysis of its performance has not been carried out fully. Context sorting is a compression scheme based on context similarity; it is regarded as an on-line version of block sorting and is asymptotically optimal. However, its compression speed is slower and its practical performance is not better. We propose a compression scheme using a recency rank code with context (RRC), which is based on context similarity. The proposed method encodes characters as recency ranks according to their contexts. It can be implemented using a suffix tree, with the recency rank code realized by move-to-front transformation of edges in the suffix tree. It is faster than context sorting and is also asymptotically optimal. Performance is improved by changing models according to the length of the context and by combining several characters into one code. However, it is still inferior to block sorting in both performance and speed. We investigate the reasons for this poor performance, prove the asymptotic optimality of a variation of block sorting, and make the relation among RRC, context sorting, block sorting, and PPM* clear.
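The recency-rank idea underlying RRC can be illustrated with plain (context-free) move-to-front coding; note that RRC itself conditions the ranks on each symbol's context via a suffix tree, which this sketch deliberately omits:

```python
def mtf_encode(text, alphabet):
    """Recency-rank (move-to-front) coding: each symbol is
    replaced by its position in a list ordered by recency of
    use, so recently seen symbols receive small ranks."""
    table = list(alphabet)
    ranks = []
    for ch in text:
        r = table.index(ch)        # current recency rank of ch
        ranks.append(r)
        table.insert(0, table.pop(r))  # move ch to the front
    return ranks
```

Runs of a repeated symbol encode as runs of zeros (e.g. `mtf_encode("aaab", "ab")` gives `[0, 0, 0, 1]`), which is why MTF output compresses well after a block sorting transform clusters similar contexts together.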
In this paper a new sorting technique is designed to improve hiding capacity and visual quality. The combination of prediction-error expansion, histogram shifting, and our new sorting technique produces results superior to those of several existing methods. We use a new measure for sorting the cells and show that using only local variance values for sorting is ineffective in some cases. The new measure solves this problem and leads to a more efficient sorting procedure. Experimental results show the efficiency of the proposed sorting procedure.
Sorting networks of fixed I/O size p have been used, thus far, for sorting a set of p elements. Somewhat surprisingly, the important problem of using such a sorting network to sort arbitrarily large datasets has not been addressed in the literature. Our main contribution is to propose a simple sorting architecture whose main feature is the pipelined use of a sorting network of fixed I/O size p to sort an arbitrarily large data set of N elements. A noteworthy feature of our design is that no extra data memory space is required beyond what is used for storing the input. As it turns out, our architecture is feasible for VLSI implementation, and its time performance is virtually independent of the cost and depth of the underlying sorting network. Specifically, we show that with our design N elements can be sorted in O((N/p) log(N/p)) time without memory access conflicts. Finally, we show how to use an AT²-optimal sorting network of fixed I/O size p to construct a similar architecture that sorts N elements in O((N/p) log(N/p) log p) time.
Optimization Strategy and Simulation of a Fast Sorting System
With the rapid development of logistics technology in our country, the sorting efficiency of the distribution center has become the bottleneck of the whole process. At present, various sorting devices are emerging as the times require, and improving sorting efficiency remains the difficult problem for these devices. This article starts from the optimization strategy of the sorting system and builds an optimization model of the rapid sorting system; simulation analysis and practical operation show that this model can greatly improve the efficiency of the sorting system. The model is based on the sorting equipment of the tobacco industry and can be applied to other industries.
Many routing problems in parallel processing, such as concentration and permutation problems, can be cast as sorting problems. In this paper, we consider the problem of sorting on a new model, called an adaptive sorting network. We show that any sequence of n bits can be sorted on this model in O(lg² n) bit-level delay using O(n) constant-fanin gates. This improves the cost complexity of K.E. Batcher's binary sorters (1968) by a factor of O(lg² n) while matching their sorting time. The only other network that can sort binary sequences at O(n) cost is the network version of the columnsort algorithm, but it requires excessive pipelining. In addition, using binary sorters, we construct permutation networks with O(n lg n) bit-level cost and O(lg³ n) bit-level delay. These results provide the asymptotically least-cost practical concentrators and permutation networks to date. We note, of course, that the well-known AKS sorting network has O(lg n) sorting time and O(n lg n) cost, but the constants hidden in these complexities are so large that our complexities outperform those of the AKS sorting network until n becomes extremely large.
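Why binary sorting admits O(n)-cost solutions is easy to see in software: sorting a bit sequence reduces to counting its ones (a sketch of the idea, not of the gate-level network):

```python
def sort_bits(bits):
    """Sorting n bits reduces to a single count: emit all the
    zeros, then all the ones. This is the software analogue of
    a binary sorter, and it is linear in n."""
    ones = sum(bits)
    return [0] * (len(bits) - ones) + [1] * ones
```

The hardware challenge the paper addresses is achieving this linear cost with constant-fanin gates and low delay, which counting alone does not give.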
Suffix sorting requires ordering all suffixes of an input sequence and has applications in running queries on large texts and in universal lossless data compression based on the Burrows-Wheeler transform (BWT). We propose a new suffix lists data structure that leads to three fast, antisequential, and memory-efficient algorithms for suffix sorting. For a length-N input over a size-|X| alphabet, the worst-case complexities of these algorithms are O(N²), O(|X|N log(N/|X|)), and O(N|X| log(N/|X|)), respectively. Furthermore, simulation results indicate performance that is competitive with other suffix sorting methods. In contrast, the suffix sorting methods that are fastest on standard test corpora have poor worst-case performance. Therefore, in comparison with other suffix sorting methods, suffix lists offer a useful trade-off between practical performance and worst-case behavior. Another distinguishing feature of suffix lists is that the algorithms are simple; some of them can be implemented in VLSI. This could accelerate suffix sorting by at least an order of magnitude and enable high-speed BWT-based compression systems.
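For readers unfamiliar with the problem, the naive formulation of suffix sorting is a one-liner (our own illustration; its worst case is far worse than the suffix-list bounds above, since comparing two suffixes can itself take O(N) time):

```python
def suffix_array(s):
    """Naive suffix sorting: order all suffixes of s
    lexicographically and return their starting indices."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For example, `suffix_array("banana")` yields `[5, 3, 1, 0, 4, 2]`, the starting positions of "a", "ana", "anana", "banana", "na", "nana" in sorted order; the BWT is then read off from the characters preceding each sorted suffix.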
Optimal Sorting Algorithms for a Simplified 2D Array with Reconfigurable Pipelined Bus System
In recent years, many researchers have investigated optical interconnections for parallel computing. Optical interconnections are attractive due to their high bandwidth and concurrent access to the bus in a pipelined fashion. The Linear Array with Reconfigurable Pipelined Bus System (LARPBS) model is a powerful optical bus system that combines the advantages of optical buses and reconfiguration. To increase the scalability of the LARPBS model, we propose a two-dimensional extension: a simplified two-dimensional Array with Reconfigurable Pipelined Bus System (2D ARPBS). While achieving better scalability, we show the effectiveness of this newly proposed model by designing two novel optimal sorting algorithms on it. The first is an extension of Leighton's seven-phase columnsort algorithm that eliminates the restriction of sorting only an r × s array, where r ≥ s², and sorts an n × n array in O(log n) time. The second is an optimal multiway mergesort algorithm that uses a novel processor-efficient two-way mergesort algorithm and a novel multiway merge scheme to sort n² items in O(log n) time. Using an optimal sorting algorithm, Pipelined Mergesort, designed for the LARPBS model as a building block, we extend our research on parallel sorting on the LARPBS to the more scalable 2D ARPBS model and achieve optimality in both sorting algorithms.
Sorting algorithms have been studied extensively for the past three decades. They are used in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on the sorting algorithm it uses. Lately, the use of graphics cards for general-purpose computing has renewed interest in sorting algorithms. In this paper we extend our previous work on parallel sorting algorithms on the GPU and present an analysis of parallel and sequential bitonic, odd-even, and rank sort algorithms on different GPU and CPU architectures. Their performance for various queue sizes is measured with respect to sorting time and rate, and the speed-up of bitonic sort over the odd-even sorting algorithm is shown on different GPUs and CPUs. The algorithms have been written to exploit the task-parallelism model available on multi-core GPUs using the OpenCL specification. Our findings report a minimum 19× speed-up of bitonic sort over the odd-even sorting technique for small queue sizes on the CPU and a maximum 2300× speed-up for very large queue sizes on the Nvidia Quadro 6000 GPU architecture.
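The odd-even sort compared against bitonic sort above is the transposition variant; a sequential sketch of it follows (our own illustration, not the paper's OpenCL kernels; on a GPU, every phase's compare-exchanges run in parallel):

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort: n phases alternating
    compare-exchanges at even and odd pair boundaries. Each
    phase is fully parallel, but n phases are needed, which is
    why bitonic sort's O(log^2 n) phases win at scale."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = phase % 2            # even phases pair (0,1),(2,3)...; odd pair (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The O(n) phase count versus bitonic sort's O(log² n) is consistent with the growing speed-ups reported above as queue sizes increase.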