Samplesort

Samplesort is a sorting algorithm published in 1970 by W. D. Frazer and A. C. McKellar in the paper Samplesort: A Sampling Approach to Minimal Storage Tree Sorting. The algorithm works similarly to Quicksort and, like it, follows the divide-and-conquer approach. It divides the input sequence into several subsequences, which are then sorted recursively.

Pseudocode

The following pseudocode shows how the algorithm works. Here A is the data to be sorted, k is the oversampling factor and p is the number of buckets into which the input data is divided (p − 1 splitters are used; p = 2 corresponds to the Quicksort algorithm). A large oversampling factor means that the sizes of the buckets vary less statistically.


  function sampleSort(A[1..n], k, p)
      // if average bucket size is below a threshold switch to e.g. quicksort
      if n / p < threshold then return smallSort(A)
      // select samples
      select S = [S_1, ..., S_{(p−1)·k}] randomly from A
      // sort sample
      sort S
      // select splitters
      [s_0, s_1, ..., s_{p−1}, s_p] <- [−∞, S_k, S_{2k}, ..., S_{(p−1)·k}, ∞]
      // classify elements
      for each a in A
          find j such that s_{j−1} < a <= s_j
          place a in bucket b_j
      // sort buckets recursively and concatenate them
      return concatenate(sampleSort(b_1, k, p), ..., sampleSort(b_p, k, p))

Choosing the right bucket for an element in the classification loop can be done in O(log p) time per element using binary search over the sorted splitters.

The pseudocode given above differs from the original form described by Frazer and McKellar. In the pseudocode, Samplesort is called recursively to sort the generated subsequences; Frazer and McKellar instead used Quicksort for this, i.e. the Samplesort function is only invoked once.

For large input sequences, the number of comparisons performed by this algorithm asymptotically approaches the information-theoretic optimum; in the experiments carried out by Frazer and McKellar, it required more than 15% fewer comparisons than Quicksort.
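The following C++ sketch illustrates the pseudocode above under simplifying assumptions (random sampling with replacement via std::mt19937, std::sort as the small-input fallback, std::upper_bound for the bucket choice, and a guard against degenerate splitters). It is only a minimal illustration, not the original implementation by Frazer and McKellar.

  #include <algorithm>
  #include <random>
  #include <vector>

  // Minimal samplesort sketch: k = oversampling factor, p = number of buckets.
  void sampleSort(std::vector<int>& A, std::size_t k, std::size_t p) {
      static std::mt19937 rng{42};
      const std::size_t n = A.size();
      if (p < 2 || k < 1 || n / p < 32) {      // small input: switch to e.g. introsort
          std::sort(A.begin(), A.end());
          return;
      }
      // Step 1: draw a random sample of size (p-1)*k and sort it.
      std::vector<int> S((p - 1) * k);
      std::uniform_int_distribution<std::size_t> dist(0, n - 1);
      for (int& s : S) s = A[dist(rng)];
      std::sort(S.begin(), S.end());
      // Step 2: pick every k-th sample element as a splitter.
      std::vector<int> splitters;
      for (std::size_t i = 1; i < p; ++i) splitters.push_back(S[i * k - 1]);
      // Step 3: classify elements into buckets via binary search (O(log p) each).
      std::vector<std::vector<int>> buckets(p);
      for (int a : A) {
          std::size_t j = std::upper_bound(splitters.begin(), splitters.end(), a)
                          - splitters.begin();
          buckets[j].push_back(a);
      }
      // Step 4: sort the buckets recursively and concatenate them.
      A.clear();
      for (auto& b : buckets) {
          if (b.size() == n) std::sort(b.begin(), b.end());  // degenerate splitters (many duplicates)
          else sampleSort(b, k, p);
          A.insert(A.end(), b.begin(), b.end());
      }
  }

The recursion terminates because every bucket is either strictly smaller than the input or is handed to the fallback sort by the guard.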

Oversampling

The oversampling factor k indicates how much larger the sample used for selecting the splitters is than the number of splitters. The aim is to approximate the distribution of the input data as closely as possible. Ideally, after the classification each bucket contains exactly n/p elements, so that the subsequent work of the algorithm is distributed evenly among the buckets.

After a sample S of size (p − 1) · k has been selected, it is sorted. The elements S_k, S_2k, ..., S_{(p−1)k}, i.e. the elements at positions k, 2k, ..., (p − 1) · k of the sorted sample, are then used as splitters serving as bucket boundaries, with −∞ and +∞ as the left and right boundaries of the leftmost and rightmost bucket. For example, with p = 4 buckets and an oversampling factor of k = 3, a sample of nine elements is drawn and sorted, and its third, sixth and ninth elements become the splitters.

Size of the buckets

With the size of the sample, or equivalently the oversampling factor, the expected bucket size and in particular the probability that a bucket exceeds a certain size can be estimated. For an oversampling factor of k = Ω(log n) it is shown below that the probability that any bucket contains more than (1 + ε) · n/p elements is less than 1/n.

Let the input ⟨e_1, ..., e_n⟩ already be in sorted order. For a bucket to receive more than (1 + ε) · n/p elements, there must exist a subsequence of the input of length (1 + ε) · n/p from which at most k elements are chosen for the sample. These cases are described by the random variables

  X_i = 1 if the sample element S_i lies in ⟨e_{j+1}, ..., e_{j+(1+ε)·n/p}⟩, and X_i = 0 otherwise;   X = X_1 + ... + X_{(p−1)k}

It follows:

  E[X] = ((p − 1) · k / n) · (1 + ε) · n/p ≈ (1 + ε) · k

Here P_fail denotes the probability that fewer than k sample elements fall into the considered subsequence, i.e. P(X < k). Using the Chernoff bound with δ = ε / (1 + ε), one obtains

  P(X < k) = P(X < (1 − δ) · E[X]) ≤ exp(−δ² · E[X] / 2) = exp(−ε² · k / (2 · (1 + ε)))

For k = Ω((1/ε²) · log n), this probability becomes polynomially small in n; a union bound then shows that with probability greater than 1 − 1/n no bucket contains more than (1 + ε) · n/p elements.
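As a rough worked example based on the bound just derived (the concrete numbers are only an illustration): for ε = 0.5, p = 16 and n = 10⁶, the condition k ≥ 2 · (1 + ε)/ε² · ln(p · n) yields k ≥ 12 · ln(1.6 · 10⁷) ≈ 200, i.e. a sample of about (p − 1) · k ≈ 3000 elements (0.3% of the input) already suffices to keep every bucket below 1.5 · n/p elements with high probability.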

Samplesort on systems with distributed memory

Working principle of parallel samplesort for p processors and an oversampling factor of k.

Samplesort is often used in parallel systems with distributed memory because, unlike Quicksort for example, its variable number of buckets makes it easy to adapt to any number of processors. The input data are distributed across p processors, i.e. each processor stores n/p elements. Correspondingly, p − 1 splitters are selected so that one bucket is created for each processor.

To select the splitters, it suffices for each processor to choose k elements from its n/p input elements. These p · k elements are sorted with a distributed sorting algorithm, and the resulting p − 1 splitters are distributed to all processors. This incurs the cost of sorting the p · k sample elements on p processors and of broadcasting the selected splitters to all p processors.

With the splitters received, each processor partitions its own input data into local buckets; with binary search this takes O((n/p) · log p) time. The processors then send each bucket to the corresponding processor; the time for this exchange is dominated by the size of the largest bucket. Processor i thus receives the i-th local bucket of every processor and sorts them locally. The time required for this local sorting is O(max_i n_i · log n_i), where n_i is the number of elements received by processor i.

If k = Ω(log n) is chosen as described in the previous section, each processor receives O(n/p) elements with high probability, so the local sorting costs O((n/p) · log(n/p)) and the overall computation time is O((n/p) · log n) plus the communication cost of the data exchange.
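The following C++ sketch simulates this distributed scheme on a single machine: each "processor" is represented by a std::vector, and the communication steps (gathering the sample, broadcasting the splitters, exchanging the buckets) are modeled as plain copies. All names and the simple splitter selection are illustrative assumptions; this is not an MPI implementation.

  #include <algorithm>
  #include <random>
  #include <vector>

  // Simulated parallel samplesort: procs[i] plays the role of processor i's memory.
  std::vector<std::vector<int>> parallelSampleSortSim(
          std::vector<std::vector<int>> procs, std::size_t k) {
      const std::size_t p = procs.size();
      std::mt19937 rng{123};

      // Step 1: every processor contributes k random local elements to the sample.
      std::vector<int> sample;
      for (const auto& local : procs) {
          if (local.empty()) continue;
          std::uniform_int_distribution<std::size_t> dist(0, local.size() - 1);
          for (std::size_t i = 0; i < k; ++i) sample.push_back(local[dist(rng)]);
      }
      std::sort(sample.begin(), sample.end());        // stands in for the distributed sample sort

      // Step 2: pick p-1 splitters and (conceptually) broadcast them.
      std::vector<int> splitters;
      for (std::size_t i = 1; i < p; ++i) splitters.push_back(sample[i * k - 1]);

      // Step 3: each processor classifies its elements; bucket j goes to processor j.
      std::vector<std::vector<int>> recv(p);           // models the all-to-all exchange
      for (const auto& local : procs)
          for (int a : local) {
              std::size_t j = std::upper_bound(splitters.begin(), splitters.end(), a)
                              - splitters.begin();
              recv[j].push_back(a);
          }

      // Step 4: every processor sorts the elements it received locally.
      for (auto& r : recv) std::sort(r.begin(), r.end());
      return recv;                                     // concatenated, this is the sorted output
  }

In an actual distributed implementation, steps 1 to 3 would use collective communication, e.g. a gather and broadcast for the splitters and an all-to-all exchange for the buckets.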

Efficient implementation of sequential samplesort

Working principle of Super Scalar Sample Sort (animation). In each step, numbers being compared are colored blue and numbers that are otherwise read or written are colored red.

As described above, the algorithm classifies the elements using the selected splitters and partitions them accordingly. An efficient implementation is described in the paper Super Scalar Sample Sort. It uses two arrays of size n: one for the input data and one temporary array; Samplesort therefore does not work in-place in this case. In each recursion step the data is copied from one array into the other and partitioned along the way. If the data ends up in the temporary array after the last recursion step, it is copied back into the original array. The essential steps are the classification of the elements using the splitters and the subsequent partitioning; both are explained in more detail below.

Classification of the elements

In a comparison-based sorting algorithm, the performance-critical part is the classification of the elements, i.e. the assignment of elements to buckets. This assignment requires O(log p) steps per element. Super Scalar Sample Sort uses a balanced search tree of splitters that is stored implicitly in an array t: the root is at position 1, and for the node at position i the left child is at position 2i and the right child at position 2i + 1. Given the search tree t and an element a, the algorithm computes the index j of the corresponding bucket as follows (the expression (a > t_j) evaluates to 1 if it is true and to 0 otherwise; p is assumed to be a power of two):

  j := 1
  repeat log2(p) times
      j := 2 · j + (a > t_j)
  j := j − p + 1    // index of the bucket for element a

Because the number of buckets p is known at compile time, the compiler can unroll this loop. If the processor architecture supports it (e.g. via conditional move instructions), the comparison can be evaluated without conditional jumps, so that no branch mispredictions occur in the loop.
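A minimal C++ sketch of this classification, assuming p is a compile-time power of two and that the sorted splitters have been arranged into the implicit search tree tree[1..p−1]; the layout routine buildTree below is a simplified illustration, not the routine from the paper.

  #include <cstddef>
  #include <vector>

  // Recursively store the sorted splitters s[lo..hi) as an implicit search tree:
  // node 'node' gets the median, its children get the left and right halves.
  void buildTree(const std::vector<int>& s, std::vector<int>& tree,
                 std::size_t node, std::size_t lo, std::size_t hi) {
      if (lo >= hi) return;
      std::size_t mid = lo + (hi - lo) / 2;
      tree[node] = s[mid];
      buildTree(s, tree, 2 * node,     lo,      mid);
      buildTree(s, tree, 2 * node + 1, mid + 1, hi);
  }

  // Branch-free bucket classification: p must be a power of two, tree has size p
  // (index 0 unused). Returns a bucket index in [0, p).
  template <std::size_t p>
  std::size_t classify(const std::vector<int>& tree, int a) {
      std::size_t j = 1;
      for (std::size_t level = 1; level < p; level *= 2)        // log2(p) iterations
          j = 2 * j + static_cast<std::size_t>(a > tree[j]);    // comparison, no conditional jump
      return j - p;                                             // 0-based bucket index
  }

After buildTree(splitters, tree, 1, 0, splitters.size()) has laid out the p − 1 sorted splitters, classify<p>(tree, a) returns the bucket of a; with p fixed at compile time the loop can be fully unrolled and, on architectures with conditional moves, compiled without branches, which is exactly the effect described above.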

Partitioning

To partition efficiently, the algorithm needs to know the sizes of the buckets before copying the elements. A naive implementation would classify all elements once to determine the bucket sizes and then a second time while copying them to the right place, i.e. the bucket would be determined twice for each element. Super Scalar Sample Sort avoids this double classification by storing the result of the classification in an additional array o (the "oracle"), which assigns to each element its bucket index. First, Super Scalar Sample Sort fills o by classifying every element; then the bucket sizes are computed, and each element is copied to the correct position using o. The oracle array requires n · ⌈log₂ p⌉ bits, which is little memory compared to the input data.
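A C++ sketch of this two-pass partitioning; the oracle is stored as one byte per element here for simplicity (sufficient for p ≤ 256), and the classifier is passed in as a parameter, e.g. the branch-free one sketched above. The names are illustrative.

  #include <cstdint>
  #include <numeric>
  #include <vector>

  // Partition 'in' into 'out' using an oracle array, so that each element is
  // classified only once. 'bucketOf' maps an element to a bucket in [0, p).
  template <typename Classifier>
  std::vector<std::size_t> partitionWithOracle(const std::vector<int>& in,
                                               std::vector<int>& out,
                                               std::size_t p, Classifier bucketOf) {
      const std::size_t n = in.size();
      std::vector<std::uint8_t> oracle(n);           // one byte per element, enough for p <= 256
      std::vector<std::size_t> bucketSize(p, 0);

      // Pass 1: classify every element once, remember the result, count bucket sizes.
      for (std::size_t i = 0; i < n; ++i) {
          oracle[i] = static_cast<std::uint8_t>(bucketOf(in[i]));
          ++bucketSize[oracle[i]];
      }
      // Exclusive prefix sum over the bucket sizes gives the bucket start positions.
      std::vector<std::size_t> start(p, 0);
      std::exclusive_scan(bucketSize.begin(), bucketSize.end(), start.begin(), std::size_t{0});

      // Pass 2: copy each element to its bucket using the stored oracle entry.
      out.resize(n);
      for (std::size_t i = 0; i < n; ++i) out[start[oracle[i]]++] = in[i];
      return bucketSize;                              // lets the recursion know the bucket bounds
  }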

Parallel in-place samplesort for systems with shared memory

The samplesort implementation shown above does not work in-place, because a second array of size n is required. Efficient implementations of, e.g., Quicksort work in-place and therefore need less memory. However, samplesort can also be implemented in-place for parallel systems with shared memory; such an implementation is described below. Note that its additional memory still depends on n, but in contrast to the Super Scalar Sample Sort explained above it is only sublinear in n (buffer blocks and bookkeeping rather than a full copy of the input). A modified variant of the algorithm whose additional memory is independent of n is described in the same paper.

The in-place algorithm can be divided into four phases:

  1. Sampling: Selection of the splitters.
  2. Local classification: Grouping of the input into blocks so that all elements of a block belong to one bucket (the blocks of a bucket need not yet be contiguous in memory).
  3. Block permutation: Swapping the blocks so that they are globally in the correct order.
  4. Clean-up: Copying the remaining elements to the edges of the buckets.

Because each element has to be read and written twice (once during the classification and once during the block permutation), this algorithm incurs more memory traffic than a single-pass partitioning. Nevertheless, it is up to three times faster than other current in-place competitors and up to 1.5 times faster than other current sequential competitors, including the Super Scalar Sample Sort described above. Since sampling and splitter selection have already been described above (see the pseudocode), only the other three phases are explained in detail here.

Local classification

The input array is first divided into equally sized areas, one per processor (thread). Each processor takes such an area and divides it further into blocks of b elements each. In addition, each processor allocates one buffer per bucket, each of which can also hold b elements. Each processor then scans its area and copies every element into the buffer of its bucket. When a buffer becomes full, its contents are copied as a block back into the processor's own area; the elements overwritten there have already been processed, since otherwise no buffer could be full yet. The correct bucket/buffer is selected as described under Classification of the elements. After this phase, each area consists of a sequence of full blocks, each containing only elements of a single bucket, followed by empty blocks; the remaining elements are still in the buffers. In addition, each processor records the size of each of its local buckets.
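A sequential C++ sketch of this local classification for a single processor's area (block size b, one buffer per bucket; the classifier is passed in, e.g. the branch-free one sketched earlier). This is a simplified illustration of the idea, not the multi-threaded code from the paper.

  #include <cstddef>
  #include <vector>

  struct LocalClassification {
      std::size_t numFullBlocks;                  // full blocks written back to 'area'
      std::vector<std::vector<int>> buffers;      // partially filled per-bucket buffers
      std::vector<std::size_t> bucketSize;        // local size of every bucket
  };

  // Classify 'area' block-wise: full buffers are flushed back into 'area' as blocks.
  // Invariant: the write position never overtakes the read position, because at most
  // as many elements have been flushed as have already been read.
  template <typename Classifier>
  LocalClassification classifyLocally(std::vector<int>& area, std::size_t numBuckets,
                                      std::size_t b, Classifier bucketOf) {
      LocalClassification r{0, std::vector<std::vector<int>>(numBuckets),
                            std::vector<std::size_t>(numBuckets, 0)};
      std::size_t write = 0;                      // start of the next block to be written
      for (std::size_t read = 0; read < area.size(); ++read) {
          std::size_t bkt = bucketOf(area[read]);
          ++r.bucketSize[bkt];
          r.buffers[bkt].push_back(area[read]);
          if (r.buffers[bkt].size() == b) {       // buffer full: flush it as one block
              for (std::size_t i = 0; i < b; ++i) area[write + i] = r.buffers[bkt][i];
              r.buffers[bkt].clear();
              write += b;
              ++r.numFullBlocks;
          }
      }
      return r;                                   // blocks [0, numFullBlocks*b) are classified
  }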

Block permutation

In this phase, the blocks in the input array are brought into the correct order. First the bucket boundaries are determined from a prefix sum over the bucket sizes. Since only whole blocks are moved in this phase, the boundaries are rounded up to whole blocks; the missing elements remain in the buffers for the time being. Because the size of the input array is not necessarily divisible by the block size, some elements of the last (incomplete) block may have to be moved to an overflow buffer.
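A small C++ sketch of this boundary computation, i.e. block-aligned bucket boundaries derived from the per-bucket element counts (the names are illustrative):

  #include <cstddef>
  #include <vector>

  // Compute block-aligned bucket boundaries: boundary[i] is the index of the first
  // block of bucket i, boundary[p] the total number of blocks (rounded up).
  std::vector<std::size_t> blockBoundaries(const std::vector<std::size_t>& bucketSize,
                                           std::size_t b) {
      std::vector<std::size_t> boundary(bucketSize.size() + 1, 0);
      std::size_t elements = 0;                         // running prefix sum of bucket sizes
      for (std::size_t i = 0; i < bucketSize.size(); ++i) {
          elements += bucketSize[i];
          boundary[i + 1] = (elements + b - 1) / b;     // round the boundary up to whole blocks
      }
      return boundary;
  }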

Before the block permutation starts, some empty blocks may have to be moved to the end of their bucket. Afterwards, for each bucket i, a write pointer w_i is created that points to the beginning of the bucket, and a read pointer r_i that points to the last non-empty block in the bucket.

To avoid processors blocking each other, each processor initially works on a different (primary) bucket. In addition, each processor is given two swap buffers, each of which can hold exactly one block. If both swap buffers are empty, the block at the read pointer r_i is moved into one of the swap buffers. By classifying the first element in that buffer, the target bucket of the block is determined, and the block is written to the write pointer of that target bucket. To prevent the block located there from being overwritten, it is first copied into the free swap buffer; this block in the swap buffer is then handled in the same way. The read and write pointers are adjusted accordingly.

Once all blocks of the primary bucket are in the right place, the processor continues with the blocks of the next bucket, until all blocks have reached their target bucket; this phase then ends.

Clean-up

Since only whole blocks were moved in the block permutation phase, some elements may still lie outside their bucket. Since there is enough space in the original array for all elements, the remaining elements can be copied from the buffers into the free positions at the bucket edges. Finally, the content of the overflow buffer is copied back as well.

Literature

  • W. D. Frazer, A. C. McKellar: Samplesort: A Sampling Approach to Minimal Storage Tree Sorting. In: Journal of the ACM. Vol. 17, No. 3, July 1970, ISSN 0004-5411, pp. 496–507.
  • Peter Sanders et al.: Lecture Parallel Algorithms, Karlsruhe 2009, pp. 75–80.

Extensions for parallel calculations:

References

  1. Peter Sanders, Sebastian Winkel: Super Scalar Sample Sort. In: Algorithms – ESA 2004 (= Lecture Notes in Computer Science). Springer, Berlin, Heidelberg 2004, ISBN 978-3-540-23025-0, pp. 784–796, doi:10.1007/978-3-540-30140-0_69.
  2. W. D. Frazer, A. C. McKellar: Samplesort: A Sampling Approach to Minimal Storage Tree Sorting. In: Journal of the ACM (JACM). Vol. 17, No. 3, July 1, 1970, ISSN 0004-5411, pp. 496–507, doi:10.1145/321592.321600 (acm.org [accessed March 22, 2018]).
  3. Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith: A comparison of sorting algorithms for the connection machine CM-2. ACM, 1991, ISBN 0-89791-438-4, pp. 3–16, doi:10.1145/113379.113380 (acm.org [accessed March 23, 2018]).
  4. Alexandros V. Gerbessiotis, Leslie G. Valiant: Direct Bulk-Synchronous Parallel Algorithms. In: Journal of Parallel and Distributed Computing. Vol. 22, 1994, pp. 251–267 (psu.edu [accessed April 1, 2018]).
  5. Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin Lang: BSPlib: The BSP Programming Library. 1998 (psu.edu [accessed April 1, 2018]).
  6. William L. Hightower, Jan F. Prins, John H. Reif: Implementations of randomized sorting on large parallel machines. ACM, 1992, ISBN 0-89791-483-X, pp. 158–167, doi:10.1145/140901.140918 (acm.org [accessed April 1, 2018]).
  7. Michael Axtmann, Sascha Witt, Daniel Ferizovic, Peter Sanders: In-Place Parallel Super Scalar Samplesort (IPSSSSo). In: 25th Annual European Symposium on Algorithms (ESA 2017) (= Leibniz International Proceedings in Informatics (LIPIcs)). Vol. 87. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany 2017, ISBN 978-3-95977-049-1, pp. 9:1–9:14, doi:10.4230/LIPIcs.ESA.2017.9 (dagstuhl.de [accessed March 30, 2018]).