Parallel quicksort

from Wikipedia, the free encyclopedia

Parallel quicksort refers to parallelizations of the sequential quicksort algorithm, which is based on the divide-and-conquer principle.

Like quicksort, parallel quicksort divides a set of elements into two subsets using a pivot element, so that the elements in one subset are smaller than the pivot and the elements in the other subset are greater than or equal to the pivot. The resulting sets are then recursively subdivided in the same way.

This procedure can be parallelized in various ways. On the one hand, several subsets can each be partitioned by a separate process. This has the disadvantage that the first recursion level is not parallelized: the very first partitioning takes Θ(n) time on a single processor, which limits the achievable speedup regardless of the number of processors. On the other hand, several processes can cooperate to partition a single set together. This approach has proven to be more efficient in practice. Parallel quicksort can be implemented in the PRAM model with a running time of O(log n).

Naive approach

Working method

Figure: partitioning of an array of n elements around a pivot element. After all elements have been processed, the two subsets (elements smaller than the pivot, and elements greater than or equal to it) are formed and the pivot element is placed between them. Each subset is then split in two and partitioned further.

With each recursive call, quicksort splits the current set of elements in two. The first partitioning can only be done by a single processor. Subsets generated later on the same recursion level can be processed in parallel by further processors without memory conflicts. Assigning the subsets to the various processors causes additional overhead. If the subsets are small enough, or if fewer processors are available, it is often more efficient to use a sequential sorting algorithm instead of making a new recursive call to quicksort.

It is possible that some processors finish partitioning their subsets faster than others. A queue can be used to store pointers to subsets that have not yet been partitioned. When removing from the queue, larger subsets are preferred, because they create new subsets which in turn go into the queue. The goal is to have as many subsets as possible available so that all processors are kept busy.
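As an illustration, this queue strategy can be simulated sequentially with a max-priority queue; the following Python sketch (function name illustrative) always partitions the largest pending subset first:

```python
import heapq

def quicksort_with_queue(a):
    """Sort `a` in place, always partitioning the largest pending subset first.
    Sequential simulation of the shared queue described above."""
    heap = [(-len(a), 0, len(a) - 1)]  # max-heap keyed on subset size
    while heap:
        _, lo, hi = heapq.heappop(heap)
        if lo >= hi:
            continue
        pivot, i = a[hi], lo
        for j in range(lo, hi):            # standard Lomuto partition
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        # both halves go back into the queue
        heapq.heappush(heap, (-(i - lo), lo, i - 1))
        heapq.heappush(heap, (-(hi - i), i + 1, hi))
    return a
```

In a real parallel implementation, idle processors would pop from this shared queue concurrently; here a single loop stands in for them.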

Figure: allocation of processors when executing parallel quicksort.

Allocation of processors

The order in which processors are allocated to subsets is adjusted dynamically. Processor 1 partitions the original set of elements. The first resulting subset is processed by processor 1 again, and the second by processor 2. Assume without loss of generality that the second subset is the smaller of the two. It is then likely that processor 2 will finish partitioning it sooner, so processor 2 requests a new processor for its subsets earlier than processor 1. As a result, processor 2 processes one of its subsets while processor 3 processes the other. When processor 1 later requests a new processor, it is assigned processor 4, which is free at that point.

The following pseudocode outlines an implementation of parallel quicksort. M denotes the maximum size of a subarray that is sorted more efficiently with a sequential sorting algorithm, e.g. insertion sort.

PROCEDURE QUICKSORT (L, U);
IF U - L > M THEN
BEGIN
    K := PARTITION (L, U);    (* pivot ends up at index K *)
    FORK L1, L2;              (* spawn two parallel branches *)
L1: QUICKSORT (L, K - 1);
    GOTO L3;
L2: QUICKSORT (K + 1, U);
    GOTO L3;
L3: JOIN L1, L2               (* wait for both branches to finish *)
END
ELSE IF U - L >= 1 THEN INSERTION_SORT (L, U);
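The pseudocode can be translated into a runnable sketch; the following Python version uses one thread per FORK/JOIN pair (illustrative only: Python threads do not execute comparisons in parallel because of the global interpreter lock, but the control structure is the same):

```python
import threading

M = 8  # cutoff: at or below this size, fall back to insertion sort

def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place by insertion."""
    for i in range(lo + 1, hi + 1):
        x, j = a[i], i - 1
        while j >= lo and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x

def partition(a, lo, hi):
    """Lomuto partition of a[lo..hi]; returns the pivot's final index K."""
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a, lo, hi):
    if hi - lo > M:
        k = partition(a, lo, hi)
        # FORK: recurse on both halves in parallel
        left = threading.Thread(target=quicksort, args=(a, lo, k - 1))
        right = threading.Thread(target=quicksort, args=(a, k + 1, hi))
        left.start(); right.start()
        # JOIN: wait for both branches
        left.join(); right.join()
    elif hi - lo >= 1:
        insertion_sort(a, lo, hi)
```

Called as `quicksort(a, 0, len(a) - 1)`, this mirrors the QUICKSORT(L, U) procedure above.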

Running time

This approach achieves only a low speedup over sequential quicksort.

In order to partition an n-element set with a pivot element from the same set, one needs n - 1 comparisons. It is assumed that a comparison takes one unit of time. Furthermore, let the original size of the array be n = 2^k - 1, the number of processors p = 2^m with m < k, and let the selected pivot element always be the true median of the subset to be partitioned.

Under these assumptions, the number of comparisons C(n) of the sequential algorithm can be determined by the following recurrence relation:

C(1) = 0, C(n) = (n - 1) + 2 · C((n - 1)/2)

with the solution:

C(n) = (n + 1) · log2(n + 1) - 2n
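The recurrence and its closed-form solution can be checked numerically; a small Python sketch (assuming n = 2^k - 1 and exact-median pivots, as above):

```python
def comparisons(n):
    """Comparisons of sequential quicksort with exact-median pivots:
    C(1) = 0, C(n) = (n - 1) + 2 * C((n - 1)/2)."""
    if n <= 1:
        return 0
    return (n - 1) + 2 * comparisons((n - 1) // 2)

def closed_form(n, k):
    """Closed form for n = 2^k - 1: C(n) = (n + 1) * log2(n + 1) - 2n,
    which equals (n + 1) * k - 2n."""
    return (n + 1) * k - 2 * n
```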

The running time of Parallel Quicksort can be determined as the sum of the running times of two sections.

In the first section, there are at least as many processors as there are subsets to be partitioned. For the first partitioning, only one of the p available processors is used; the others remain idle. This iteration takes n - 1 units of time to make n - 1 comparisons. In the second partitioning, only two processors are active: (n - 1)/2 - 1 time units are needed for n - 3 comparisons in total. Similarly, the third partitioning requires (n - 3)/4 - 1 time units for n - 7 comparisons. In the first m + 1 iterations, there are enough processors for every partitioning. The running time for this section can be expressed by the following formula:

T1 = sum_{i=1}^{m+1} ((n + 1)/2^(i-1) - 2) = 2(n + 1)(1 - 2^(-(m+1))) - 2(m + 1)

The number of comparisons in this section is:

C1 = sum_{i=1}^{m+1} ((n + 1) - 2^i) = (m + 1)(n + 1) - 2^(m+2) + 2

In the second section there are more subsets than available processors. Assuming that each processor makes the same number of comparisons, the running time is the number of remaining comparisons divided by p = 2^m. Hence it follows:

T2 = (C(n) - C1)/2^m = ((n + 1)(k - m - 1) - 2n + 2^(m+2) - 2)/2^m

The expected speedup is:

S = C(n) / (T1 + T2).

Even for favorable choices of n and p, the best achievable speedup is low. The problem is that only a few processors are active at the beginning, and it takes a long time before all processors have work.
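Under the same assumptions (n = 2^k - 1 elements, p = 2^m processors, exact-median pivots), the formulas above can be evaluated numerically to see how low the speedup stays; a Python sketch (function names illustrative):

```python
def seq_comparisons(n, k):
    # C(n) = (n + 1) * log2(n + 1) - 2n for n = 2^k - 1 and exact-median pivots
    return (n + 1) * k - 2 * n

def speedup(k, m):
    """Speedup S = C(n) / (T1 + T2) for n = 2^k - 1 and p = 2^m."""
    n, p = 2 ** k - 1, 2 ** m
    # first section: m + 1 partitioning levels with enough processors
    t1 = sum((n + 1) / 2 ** (i - 1) - 2 for i in range(1, m + 2))
    c1 = sum((n + 1) - 2 ** i for i in range(1, m + 2))
    # second section: remaining comparisons shared evenly among the p processors
    t2 = (seq_comparisons(n, k) - c1) / p
    return seq_comparisons(n, k) / (t1 + t2)
```

For example, with k = 20 (about one million elements) and p = 32 processors, the resulting speedup stays well below 8, far from the processor count.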

Parallelization of quicksort on EREW PRAM

In 1991, Zhang and Nageswara Rao presented a parallelization of quicksort on an EREW PRAM. The algorithm sorts n elements with p processors in O((n/p + log p) · log n) expected running time and O(n) additional memory; a deterministic variant is obtained by choosing the pivot deterministically, as described below. In the event that p ≤ n/log n, the cost, in terms of the product of the running time and the number of processors, is optimal.

Working method

The algorithm is carried out in three steps. First, the original set of n elements is divided into p blocks so that each block has ⌈n/p⌉ elements, except the last, which can have fewer. Block i is assigned to processor i, i.e. processor i processes the elements of block i. A pivot element is selected and sent to every processor with a parallel broadcast.

In the second step, each processor marks every element of its block with either 0 or 1, depending on whether the element is smaller than the pivot element (0) or greater than or equal to it (1). With a parallel prefix-sum algorithm, the rank of each element within group 0 or group 1 is calculated. The idea is to store all elements marked 0, from all blocks, before all elements marked 1 in a second array. The index of an element in this array is calculated from its rank and the number of its block. The original set is thus divided into two parts: the set of all 0-elements and the set of all 1-elements.

In the third step, the lengths of the two resulting sets are checked; if a set contains more than one element, the algorithm is invoked on it recursively, starting again from step one.
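The marking and rank computation of the second step can be simulated sequentially with prefix sums; a Python sketch (names illustrative; the parallel prefix algorithm is replaced here by a sequential scan):

```python
from itertools import accumulate

def prefix_partition(a, pivot):
    """Stable partition of `a` around `pivot` via rank computation,
    mirroring the marking + prefix-sum step (executed sequentially)."""
    # Marking step: 0 if the element is smaller than the pivot, else 1.
    flags = [0 if x < pivot else 1 for x in a]
    # Rank of each element within its own group, via prefix sums.
    ranks0 = list(accumulate(1 - f for f in flags))
    ranks1 = list(accumulate(flags))
    zeros = ranks0[-1] if ranks0 else 0   # total size of group 0
    out = [None] * len(a)
    for i, x in enumerate(a):
        if flags[i] == 0:
            out[ranks0[i] - 1] = x            # group 0 fills the front
        else:
            out[zeros + ranks1[i] - 1] = x    # group 1 follows behind it
    return out, zeros
```

Each processor would compute the prefix sums over its own block and combine them in parallel; the target index of every element is then known without write conflicts, as required on an EREW PRAM.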

Choice of pivot element and running time calculation

Only one additional array of length n and a constant number of auxiliary variables are required. In the recursive call, the two arrays swap roles: the original array is reused as the target of the next partitioning, and in the call after that the other way around. The additional memory requirement therefore remains in O(n).

The running time of the algorithm depends on the choice of pivot element. One possibility is to use a random element as a pivot.

Assuming that a processor can generate a random integer from the range {1, …, n} in constant time, partitioning with a randomly chosen pivot element corresponds to a random binary search tree. The analysis shows that the expected height of a random binary search tree with n elements is in O(log n); the depth of the recursion is therefore in O(log n) in expectation. The pivot element must also be broadcast, which takes O(log p). Each processor then has to mark the elements in its block, the prefix sum has to be calculated and the elements have to be moved; this takes O(n/p + log p) per recursion level in total. The total expected running time is therefore O((n/p + log p) · log n).

If the pivot element is not chosen at random, the running time of the algorithm depends on the recursion depth and thus on the ratio of the sizes of the two subsets. The recursion depth can be bounded by O(log n) if the two subsets are always of approximately equal size. This can be guaranteed by selecting the pivot element with the parallel median-finding algorithm by Akl, which runs in O(log log n) time using O(n/log log n) processors. The marking, the prefix sum and the moving of the elements are carried out as in the randomized case. With Brent's theorem, the algorithm can consequently be adapted to p processors and implemented deterministically.

Implementation of Quicksort in MCSTL

A parallel implementation of quicksort can be found in the MCSTL library, which provides two quicksort variants: a plain parallel quicksort and a variant with balanced load distribution achieved through work stealing. With both variants, the pivot selection is carried out by one processor, while the partitioning is carried out in parallel by several processors. The desired number of threads to be used is passed to the algorithm as a parameter.

Working method

First the pivot element is chosen as the median of several sampled elements of the array. The partitioning phase begins by dividing the elements into smaller groups, or chunks. Each thread then receives two chunks for processing; if there are too few chunks for this, fewer threads than requested are used. Each thread rearranges its elements so that its left chunk contains only elements smaller than the pivot element and its right chunk only elements that are not smaller. Each thread then reserves space for its two chunks in parallel: the left chunks are placed starting from the beginning of the array and the right chunks from the end. Partitioning is then complete.

Balanced load distribution

In the first step, all processors work together on partitioning the array. The two resulting subsets are each assigned half of the processors. In the course of the algorithm, a point is reached where each processor has to process its own subset. These subsets can have different lengths and therefore take different amounts of time to process; as a result, some processors have already partitioned their subsets while others are still busy, and processors become inactive. The solution consists of processor-local queues: when a processor obtains two subsets after a partitioning, it continues partitioning the larger one while the smaller one is added to its queue. If a processor's queue becomes empty, it can take work from the queues of the other processors.
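The processor-local queues and the stealing rule can be simulated sequentially; a Python sketch (round-robin simulation of the workers, names illustrative):

```python
from collections import deque

def quicksort_work_stealing(a, workers=4):
    """Sort `a` in place using per-worker deques: a worker keeps partitioning
    the larger half, queues the smaller one, and steals when its own queue is
    empty. Sequential round-robin simulation of the scheme described above."""
    queues = [deque() for _ in range(workers)]
    queues[0].append((0, len(a) - 1))

    def partition(lo, hi):
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        return i

    while any(queues):
        for w in range(workers):
            # take from the own queue, or steal from another worker's queue
            q = queues[w] or next((other for other in queues if other), None)
            if q is None:
                break
            lo, hi = q.popleft()
            while lo < hi:
                k = partition(lo, hi)
                # continue with the larger half, queue the smaller one
                if k - lo >= hi - k:
                    if k + 1 < hi:
                        queues[w].append((k + 1, hi))
                    hi = k - 1
                else:
                    if lo < k - 1:
                        queues[w].append((lo, k - 1))
                    lo = k + 1
    return a
```

In a real implementation the workers run concurrently and stealing requires synchronization; the single loop here only demonstrates the "keep the larger, queue the smaller, steal when idle" policy.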

References

  1. D. J. Evans, R. C. Dunbar: The parallel quicksort algorithm part I – run time analysis. In: International Journal of Computer Mathematics. Vol. 12, No. 1, January 1982, ISSN 0020-7160, pp. 19–55, doi:10.1080/00207168208803323.
  2. Weixiong Zhang, Nageswara S. V. Rao: Optimal parallel quicksort on EREW PRAM. In: BIT. Vol. 31, No. 1, March 1991, ISSN 0006-3835, pp. 69–74, doi:10.1007/bf01952784.
  3. Quinn, Michael J.: Instructor's manual to accompany Designing efficient algorithms for parallel computers. McGraw-Hill Book Co, 1987, ISBN 0-07-051072-5.
  4. Luc Devroye: A note on the height of binary search trees. In: Journal of the ACM. Vol. 33, No. 3, May 1, 1986, ISSN 0004-5411, pp. 489–498, doi:10.1145/5925.5930.
  5. Vishkin, U.: Synchronous parallel computation – a survey. Courant Institute of Mathematical Sciences, New York University, April 1983, OCLC 1085604497.
  6. Richard Cole, Uzi Vishkin: Approximate and exact parallel scheduling with applications to list, tree and graph problems. In: 27th Annual Symposium on Foundations of Computer Science (SFCS 1986). IEEE, 1986, ISBN 0-8186-0740-8, doi:10.1109/sfcs.1986.10.
  7. Akl, S. G.: "Parallel Selection in O(log log n) Time Using O(n/log log n) Processors." Technical Report 88-221, Department of Computing and Information Science, Queen's University, Kingston, Ontario (1988).
  8. Putze, Felix: MCSTL: the multi-core standard template library. ACM, 2007, OCLC 854741854.