Merge sort

from Wikipedia, the free encyclopedia
Example of how mergesort sorts a list. The list elements are represented by dots; the horizontal axis shows an element's position in the list, the vertical axis its size.

Mergesort (from English merge and sort) is a stable sorting algorithm that works on the divide-and-conquer principle. It was first introduced in 1945 by John von Neumann.

Functionality

Mergesort treats the data to be sorted as a list and breaks it down into smaller lists, which are sorted individually. The small sorted lists are then merged ("zipped") together into larger lists until a single sorted list is reached. With arrays the method does not usually work in-place, although (tricky) in-place implementations are known, in which the sub-arrays are usually merged recursively. Linked lists are particularly well suited for implementing mergesort; with them, in-place sorting comes almost automatically.

Illustration of how it works

The picture illustrates the three main steps of a divide-and-conquer process as implemented by mergesort. The splitting step is obviously trivial (the data is simply divided in half). The main work happens during merging, hence the name of the algorithm. With Quicksort it is the other way round: the partitioning step is elaborate and the merge step is simple (namely, concatenation).

When considering the procedure shown in the graphic, one should be aware that it depicts only one of several recursion levels. For example, the sorting function that is supposed to sort parts 1 and 2 could decide that these parts are still too big to sort directly. Both parts would then be split again and passed recursively to the sorting function, opening a further recursion level that performs the same steps. In the extreme case (which is the normal case with mergesort), splitting continues until the parts consist of single data elements.

Implementation

The following pseudocode illustrates how the algorithm works, where list contains the elements to be sorted.

function mergesort(list)
  if (size of list <= 1) then return list
  else
     split list in half into leftList, rightList
     leftList = mergesort(leftList)
     rightList = mergesort(rightList)
     return merge(leftList, rightList)

function merge(leftList, rightList)
  newList
  while (leftList and rightList are both non-empty)
  |    if (first element of leftList <= first element of rightList)
  |    then append the first element of leftList to newList and remove it from leftList
  |    else append the first element of rightList to newList and remove it from rightList
  end_while
  while (leftList is not empty)
  |    append the first element of leftList to newList and remove it from leftList
  end_while
  while (rightList is not empty)
  |    append the first element of rightList to newList and remove it from rightList
  end_while
  return newList

Example

[Figure: Mergesort example.png]

In the last merge step, the zipper technique is indicated ("Mixing" in the figure). Blue arrows mark the splitting steps, green arrows the merge steps.

Example code for the recursive sorting algorithm, analogous to the "Implementation" section above, follows. It recursively divides the input into two smaller lists until those are trivially sorted, and merges them on the way back up the recursion, whereby they become sorted.

function merge_sort(list x)
    if length(x) ≤ 1 then
        return x      // a short x is trivially sorted
    var l := empty list
    var r := empty list
    var i, nx := length(x)−1
    // Split x into the two halves l and r ...
    for i := 0 to floor(nx/2) do
        append x[i] to l
    for i := floor(nx/2)+1 to nx do
        append x[i] to r
    // ... and sort both (individually).
    l := merge_sort(l)
    r := merge_sort(r)
    // Merge the sorted halves.
    return merge(l, r)

Sample code for merging two sorted lists.

function merge(list l, list r)
    var y := empty list    // result list
    var nl := length(l)−1
    var nr := length(r)−1
    var i, il := 0, ir := 0
    for i := 0 to nl+nr+1 do
        if il > nl then         // l is exhausted
            append r[ir] to y
            ir := ir+1
            continue
        if ir > nr then         // r is exhausted
            append l[il] to y
            il := il+1
            continue
        // Now 0 ≤ il ≤ nl and 0 ≤ ir ≤ nr.
        if l[il] ≤ r[ir] then
            append l[il] to y
            il := il+1
        else
            append r[ir] to y
            ir := ir+1
    return y
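The pseudocode above can be checked with a short runnable version. The following Python sketch mirrors the two functions; list slicing replaces the explicit index arithmetic, and the names follow the pseudocode rather than any library API.

```python
def merge(l, r):
    """Zipper-merge two sorted lists into one sorted list."""
    y = []
    il = ir = 0
    while il < len(l) and ir < len(r):
        if l[il] <= r[ir]:          # "<=" keeps the sort stable
            y.append(l[il]); il += 1
        else:
            y.append(r[ir]); ir += 1
    y.extend(l[il:])                # at most one of these two is non-empty
    y.extend(r[ir:])
    return y

def merge_sort(x):
    if len(x) <= 1:
        return x                    # a short x is trivially sorted
    mid = len(x) // 2
    return merge(merge_sort(x[:mid]), merge_sort(x[mid:]))

print(merge_sort([3, 4, 2, 1, 7, 5, 8, 9, 0, 6]))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```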

C++11

An implementation in the programming language C++ using pointer arithmetic could look like this:

#ifndef ALGORITHM_H
#define ALGORITHM_H

#include <functional>

namespace ExampleNamespace
{
  class Algorithm
  {
  public:
    Algorithm() = delete;
    template<typename T>
    static auto ptrDiff(T const * const begin, T const * const end)
    {
      return end - begin;
    }
    template<typename T>
    static auto midPtr(T* const begin, long long const & ptr_diff)
    {
      return begin + ptr_diff / 2u;
    }
    template<typename T>
    static auto midPtr(T* const begin, T* const end)
    {
      return midPtr(begin, ptrDiff(begin, end));
    }
    static auto continueSplit(long long const & ptr_diff)
    {
      return ptr_diff > 1u;
    }
    template<typename T>
    static auto continueSplit(T const * const begin, T const * const end)
    {
      return continueSplit(ptrDiff(begin, end));
    }
    template<typename T>
    static auto mergeSort(T* const begin, T* const end)
    {
      mergeSort(begin, end, std::less<T>());
    }
    template<typename T, typename Compare>
    static auto mergeSort(T* const begin, T* const end, Compare&& comp)
    {
      auto ptr_diff = ptrDiff(begin, end);
      if (ptr_diff) {
        auto* temp = new T[ptr_diff];
        mergeSort(begin, end, temp, comp);
        delete[] temp;
      }
    }
    template<typename T>
    static auto copyRange(T const * begin, T const * const end, T* dst)
    {
      copyRangeOverwrite(begin, end, dst);
    }
    template<typename T>
    static auto copyRangeOverwrite(T const * begin, T const * const end, T*& dst)
    {
      while (begin != end) {
        *dst++ = *begin++;
      }
    }
  private:
    template<typename T, typename Compare>
    static void mergeSort(T* const begin, T* const end, T* temp, Compare&& comp)
    {
      auto ptr_diff = ptrDiff(begin, end);
      if (continueSplit(ptr_diff)) {
        auto * const mid = midPtr(begin, ptr_diff);
        mergeSort(begin, mid, temp, comp);
        mergeSort(mid, end, temp, comp);
        merge(begin, mid, end, temp, comp);
      }
    }
    template<typename T, typename Compare>
    static auto merge(T* begin, T const * const mid, T const * const end, T* temp, Compare&& comp)
    {
      copyRange(begin, end, temp);
      auto* right_temp = temp + ptrDiff(begin, mid);
      auto const * const left_temp_end = right_temp;
      auto const * const right_temp_end = temp + ptrDiff(begin, end);
      mergeResults(temp, right_temp, right_temp_end, begin, comp);
      copyRangeOverwrite(temp, left_temp_end, begin);
      copyRangeOverwrite(right_temp, right_temp_end, begin);
    }
    template<typename T, typename Compare>
    static auto mergeResults(T*& begin, T*& mid, T const * const end, T*& dst, Compare&& comp)
    {
      auto const * const mid_ptr = mid;
      while (begin != mid_ptr && mid != end) {
        *dst++ = comp(*begin, *mid) ? *begin++ : *mid++;
      }
    }
  };
}

#endif // !ALGORITHM_H

Complexity

Mergesort is a stable sorting method, provided the merge step is implemented accordingly. In the worst, best and average case its running time, expressed in Landau notation, is always Θ(n · log n).

In terms of worst-case complexity, mergesort is thus fundamentally superior to Quicksort, since Quicksort (without special precautions) has a worst-case behavior of Θ(n²). However, mergesort requires additional storage space of the order of magnitude O(n), so it is not an in-place method.

The recursion formula for the running time of mergesort on n elements to be sorted is

    T(n) = 2 · T(n/2) + Θ(n)

with the beginning of the recursion T(1) = Θ(1).

According to the master theorem, this recursion formula can be approximated by T(n) = 2 · T(n/2) + n, with the solution (2nd case of the master theorem, see there) T(n) = Θ(n · log n).

Average and maximum number of comparisons

If ℓ and r are the lengths of the two pre-sorted sequences to be merged, then the number C of comparisons required by the sorting merge satisfies

    min(ℓ, r) ≤ C ≤ ℓ + r − 1.

The lower bound is reached because one sequence can completely precede the other, i.e. every element of one sequence can be smaller than every element of the other (or vice versa); the upper bound is reached because the two sequences can interleave up to their last elements, so that comparisons are needed until each sequence is down to its final element. In each case it is assumed that the comparison of the two sequences begins at the elements with the lowest index. Let

    Cmax(ℓ, r) = ℓ + r − 1

(subscript max for maximum) denote the maximum number of comparisons for a merge. From this, the maximum number of comparisons Cmax(n) for an entire mergesort run on n elements works out to

    Cmax(n) = n · ⌈log₂ n⌉ − 2^⌈log₂ n⌉ + 1,

which is the sequence A001855 in OEIS.

For an even distribution, the average number of comparisons (subscript avg for average) can also be precisely calculated: merging two random pre-sorted sequences of lengths ℓ and r needs on average

    Cavg(ℓ, r) = ℓ + r − ℓ/(r + 1) − r/(ℓ + 1)

comparisons (see Knuth, The Art of Computer Programming, Vol. 3, Section 5.2.4). From this, the average number of comparisons Cavg(n) for an entire mergesort run on n elements follows from the recurrence

    Cavg(n) = Cavg(⌈n/2⌉) + Cavg(⌊n/2⌋) + Cavg(⌈n/2⌉, ⌊n/2⌋),    Cavg(1) = 0.

Tabulating both quantities, one computes

       n     Cavg(n)    Cmax(n)   (Cmax−Cavg)/n
       2      1.0000          1      0.00000
       3      2.6667          3      0.11111
       4      4.6667          5      0.08333
       6      9.8333         11      0.19444
       8      15.733         17      0.15833
      12      29.952         33      0.25397
      16      45.689         49      0.20694
      24      82.059         89      0.28922
      32      121.50        129      0.23452
      48      210.20        225      0.30839
      64      305.05        321      0.24920
      96      514.44        545      0.31838
     128      736.13        769      0.25677
     192      1218.9       1281      0.32348
     256      1726.3       1793      0.26061
     384      2819.8       2945      0.32606
     512      3962.6       4097      0.26255
     768      6405.6       6657      0.32736
    1024      8947.2       9217      0.26352
    1536     14345.0      14849      0.32801
    2048     19940.0      20481      0.26401
    3072     31760.0      32769      0.32833
    4096     43974.0      45057      0.26426
    6144     69662.0      71681      0.32849
    8192     96139.0      98305      0.26438
   12288    1.5161E5     155649      0.32857
   16384    2.0866E5     212993      0.26444
   24576    3.278E5      335873      0.32862
   32768    4.5009E5     458753      0.26447
   49152    7.0474E5     720897      0.32864
   65536    9.6571E5     983041      0.26448
   98304    1.5078E6    1540097      0.32865
  131072    2.0625E6    2097153      0.26449
  196608    3.2122E6    3276801      0.32865
  262144    4.3871E6    4456449      0.26450
  393216    6.8176E6    6946817      0.32865
  524288    9.2985E6    9437185      0.26450
  786432    1.4422E7   14680065      0.32865
 1048576    1.9646E7   19922945      0.26450
 1572864    3.0416E7   30932993      0.32866
 2097152    4.1388E7   41943041      0.26450
 3145728    6.3978E7   65011713      0.32866
 4194304    8.6971E7   88080385      0.26450
 6291456    1.3425E8  136314881      0.32866
 8388608    1.8233E8  184549377      0.26450
12582912    2.8108E8  285212673      0.32866
16777216    3.8144E8  385875969      0.26450

and finds that Cavg(n) approaches Cmax(n): the normalized difference (Cmax(n) − Cavg(n))/n converges, oscillating between about 0.26450 (for n a power of two) and about 0.32866 (for n = 3 · 2^k).
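The closed form for Cmax(n) can be checked empirically. The following sketch (function names are illustrative) counts the comparisons an actual top-down mergesort performs and compares the worst case over all permutations of six elements against the formula:

```python
import itertools
import math

def c_max(n):
    """Closed form n*ceil(log2 n) - 2^ceil(log2 n) + 1 (OEIS A001855)."""
    k = math.ceil(math.log2(n))
    return n * k - 2 ** k + 1

def sort_counting(x):
    """Top-down mergesort returning (sorted list, comparisons used)."""
    if len(x) <= 1:
        return list(x), 0
    mid = len(x) // 2
    l, cl = sort_counting(x[:mid])
    r, cr = sort_counting(x[mid:])
    y, il, ir, c = [], 0, 0, 0
    while il < len(l) and ir < len(r):
        c += 1                      # one comparison per appended element
        if l[il] <= r[ir]:
            y.append(l[il]); il += 1
        else:
            y.append(r[ir]); ir += 1
    y.extend(l[il:]); y.extend(r[ir:])
    return y, cl + cr + c

print(c_max(16), c_max(1024))      # → 49 9217 (matching the table)
worst = max(sort_counting(p)[1] for p in itertools.permutations(range(6)))
print(worst, c_max(6))             # → 11 11
```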

Correctness and timing

The recursion termination obviously ensures the termination of mergesort, so only correctness remains to be shown. We do this by proving the following assertion:

Assertion: In recursion depth i, the sorted sublists from recursion depth i+1 are correctly merged into sorted lists.

Proof: Let, without loss of generality, the i-th recursion level be the deepest. Then the sublists are obviously sorted, since they consist of single elements; thus one part of the assertion is already secured. Now these sorted sublists are passed one recursion level up, i.e. into the (i−1)-th recursion level. There, by the construction of the merge procedure, they are correctly merged by mergesort into sorted lists. Thus our assertion holds, and the total correctness of mergesort is proven.

Natural merge sort

Natural mergesort is an extension of mergesort that exploits already-sorted subsequences, so-called runs, within the starting list to be sorted. The basis for the merge process is not the recursively or iteratively obtained groups of two, but the runs determined in a first pass:

Start list     : 3--4--2--1--7--5--8--9--0--6
Determine runs : 3--4  2  1--7  5--8--9  0--6
Merge          : 2--3--4  1--5--7--8--9  0--6
Merge          : 1--2--3--4--5--7--8--9  0--6
Merge          : 0--1--2--3--4--5--6--7--8--9

This variant has the advantage that already-sorted sequences are "recognized", so that the complexity in the best case is O(n). The average- and worst-case behavior, however, does not change.
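The run-based variant can be written compactly. The following Python sketch (names are illustrative) first detects maximal non-decreasing runs and then merges neighbouring runs pass by pass, exactly as in the example above:

```python
def runs(x):
    """Split x into maximal non-decreasing runs."""
    out, start = [], 0
    for i in range(1, len(x)):
        if x[i] < x[i - 1]:          # run ends where the order breaks
            out.append(x[start:i])
            start = i
    out.append(x[start:])
    return out

def merge(l, r):
    y, il, ir = [], 0, 0
    while il < len(l) and ir < len(r):
        if l[il] <= r[ir]:
            y.append(l[il]); il += 1
        else:
            y.append(r[ir]); ir += 1
    return y + l[il:] + r[ir:]

def natural_merge_sort(x):
    if not x:
        return x
    level = runs(x)
    while len(level) > 1:            # one merge pass per round
        level = [merge(level[i], level[i + 1]) if i + 1 < len(level)
                 else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

print(runs([3, 4, 2, 1, 7, 5, 8, 9, 0, 6]))
# → [[3, 4], [2], [1, 7], [5, 8, 9], [0, 6]]
print(natural_merge_sort([3, 4, 2, 1, 7, 5, 8, 9, 0, 6]))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

For an already-sorted input, runs returns a single run and the while loop never executes, which is the O(n) best case mentioned above.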

Mergesort is also well suited to larger amounts of data that can no longer be held in main memory: when merging, only two lists need to be read from the external buffer (e.g. hard disk) at a time and one written back there. One variant makes better use of the available main memory (and minimizes read/write accesses to the hard disk) by merging more than just two sub-lists at a time, which reduces the recursion depth.
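Merging k sub-lists at once can be done with a min-heap over the current front elements, so that each output element costs O(log k) comparisons. A sketch (Python's standard library offers the equivalent heapq.merge):

```python
import heapq

def k_way_merge(lists):
    """Merge k sorted lists; each pop/push costs O(log k)."""
    # seed the heap with the first element of every non-empty list;
    # the list index i breaks ties and identifies the source list
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(lists[i]):    # advance within the source list
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(k_way_merge([[1, 4, 7], [2, 5, 8], [0, 3, 6, 9]]))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```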

Parallel merge sort

Mergesort can be parallelized easily due to its divide-and-conquer approach. Various parallel variants have been developed in the past. Some are closely related to the sequential variant presented here, while others have a fundamentally different structure and use K-way merging.

Mergesort with parallel recursion calls

The sequential merge sort can be described in two phases, the dividing phase and the subsequent merging phase. The first consists of many recursive calls that repeat the same splitting process until the subsequences are trivially sorted (containing one or no element). An intuitive approach is to parallelize these recursive calls. The following pseudocode describes the classic mergesort algorithm with parallel recursion, using the keywords fork and join.

// Sort elements lo through hi (exclusive) of array A.
algorithm mergesort(A, lo, hi) is
    if lo+1 < hi then  // Two or more elements.
        mid := ⌊(lo + hi) / 2⌋
        fork mergesort(A, lo, mid)
        mergesort(A, mid, hi)
        join
        merge(A, lo, mid, hi)

This algorithm is the trivial modification of the sequential algorithm and is not yet optimal; accordingly, its speedup is not impressive either. It has a span of Θ(n), which is only a Θ(log n)-factor improvement over the sequential version (see also Introduction to Algorithms). This is mainly due to the sequential merge method, which is the bottleneck of the parallel execution.
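The fork/join structure above can be sketched with Python threads. This is purely structural (CPython threads give no real speedup for CPU-bound work), and the depth cap limiting thread creation is an illustrative choice, not part of the pseudocode:

```python
import threading

def merge(l, r):
    """Sequential zipper merge, the bottleneck mentioned above."""
    y, il, ir = [], 0, 0
    while il < len(l) and ir < len(r):
        if l[il] <= r[ir]:
            y.append(l[il]); il += 1
        else:
            y.append(r[ir]); ir += 1
    return y + l[il:] + r[ir:]

def mergesort_forkjoin(x, depth=2):
    if len(x) <= 1:
        return x
    mid = len(x) // 2
    if depth > 0:
        box = [None]                   # holds the forked call's result
        t = threading.Thread(
            target=lambda: box.__setitem__(0, mergesort_forkjoin(x[:mid], depth - 1)))
        t.start()                      # "fork": sort the left half concurrently
        right = mergesort_forkjoin(x[mid:], depth - 1)
        t.join()                       # "join": wait for the forked half
        left = box[0]
    else:                              # below the cap, recurse sequentially
        left = mergesort_forkjoin(x[:mid], 0)
        right = mergesort_forkjoin(x[mid:], 0)
    return merge(left, right)

print(mergesort_forkjoin([5, 1, 4, 2, 3]))   # → [1, 2, 3, 4, 5]
```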

Merge sort with parallel merge method

Better parallelism can be achieved by using a parallel merge method. Cormen et al. present a binary variant that merges two sorted subsequences into one sorted output sequence. A more detailed description can be found in the article on merge algorithms.

In the longer of the two sequences (if they are of unequal length), the element at the middle index is selected. Its position in the other sequence is determined such that the sequence would remain sorted if this element were inserted at that position. Thus one knows how many elements are smaller than the pivot element, and the pivot's final position in the output sequence can be calculated. For the subsequences of smaller and larger elements created in this way, the merge method is again executed in parallel until the base case of the recursion is reached.

The following pseudocode illustrates merge sort with the modified parallel merge method (after Cormen et al.).

/**
 * A: Input array
 * B: Output array
 * lo: lower bound
 * hi: upper bound
 * off: offset
 */
algorithm parallelMergesort(A, lo, hi, B, off) is
    len := hi - lo + 1
    if len == 1 then
        B[off] := A[lo]
    else let T[1..len] be a new array
        mid := ⌊(lo + hi) / 2⌋
        mid' := mid - lo + 1
        fork parallelMergesort(A, lo, mid, T, 1)
        parallelMergesort(A, mid + 1, hi, T, mid' + 1)
        join
        parallelMerge(T, 1, mid', mid' + 1, len, B, off)
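The binary-search splitting can be followed in a sequential sketch: the recursion below mirrors parallelMerge, with the two independent recursive calls being the spots that would be forked in parallel (bisect stands in for the pseudocode's binary search; stability is not preserved here because the longer sequence is always taken as pivot source):

```python
import bisect

def p_merge(a, b):
    """Divide-and-conquer merge of two sorted lists (sequential sketch)."""
    if not a and not b:
        return []
    if len(a) < len(b):
        a, b = b, a                        # let a be the longer sequence
    mid = len(a) // 2
    pos = bisect.bisect_left(b, a[mid])    # pivot's final position is fixed here
    # these two calls are independent and would run in parallel
    return p_merge(a[:mid], b[:pos]) + [a[mid]] + p_merge(a[mid + 1:], b[pos:])

print(p_merge([1, 3, 5, 7], [0, 2, 4, 6, 8]))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8]
```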

In order to obtain a recurrence relation for the worst case, the recursive calls of parallelMergesort only have to be counted once due to the parallel execution. One obtains for the span

    T∞(n) = T∞(n/2) + Θ(log²(n)).

For more detailed information on the complexity of the parallel merging method, see Merge algorithm .

The solution of this recurrence is

    T∞(n) = Θ(log³(n)).

This algorithm achieves a parallelism of Θ(n / log²(n)), which is much better than that of the previous algorithm. Such a sorting algorithm can perform well in practice if it is equipped with a fast, stable sequential sorting algorithm and a sequential merge method as base cases for sorting and merging small sequences.

Parallel multi-way merge sort

It seems unnatural to restrict mergesort algorithms to binary merge methods, since more than two processors are often available. A better approach is K-way merging. In contrast to binary merging, this generalization merges k sorted sequences into one sorted sequence. This variant is well suited for describing a sorting algorithm on a PRAM.

Basic idea

The parallel multiway mergesort algorithm on four processors.

A sequence of n elements is given. The goal is to sort this sequence with p available processors. The elements are distributed equally over all processors and first pre-sorted locally with a sequential sorting algorithm. Accordingly, the data then consists of sorted sequences S_1, ..., S_p of length ⌈n/p⌉. For simplicity, let n be a multiple of p, so that |S_i| = n/p for i = 1, ..., p. Each of these sequences is in turn partitioned into p subsequences by determining the separating elements v_1, ..., v_{p−1} with global rank k · n/p (k = 1, ..., p−1). The corresponding indices are determined in each sequence S_i with a binary search, so that the sequences can be divided on the basis of these indices. Formally defined, S_{i,k} contains exactly those elements of S_i whose global rank lies between that of v_{k−1} (exclusive) and v_k (inclusive).

Processor i is now assigned the parts S_{1,i}, ..., S_{p,i}: these are all elements between global rank (i−1) · n/p and rank i · n/p, spread over the p sequences. In this way, each processor receives a set of sorted sequences. The fact that the ranks of the separating elements were chosen globally yields two important properties: First, the separating elements are chosen so that after the redistribution each processor is still responsible for n/p elements; the algorithm therefore has a perfect load balance. Second, all elements of processor i are less than or equal to all elements of processor i+1. If each processor now performs a p-way merge locally, the elements are therefore sorted globally, and the results only need to be concatenated in processor order.

Determining the separating elements

In the simplest form, p sorted sequences S_1, ..., S_p are distributed evenly over p processors and a rank k is given. The task is to find a separating element x with global rank k in the union of the sequences. Then each sequence S_i can be divided at an index l_i into two parts: the lower part consists only of elements smaller than x, while the upper part contains all elements greater than or equal to x.

The sequential algorithm presented here returns the indices of the splits, i.e. the indices l_i in the sequences S_i, such that l_1 + ... + l_p = k and every element below a split has a globally smaller rank than every element above one.

algorithm msSelect(S : Array of sorted Sequences [S_1,..,S_p], k : int) is
    for i = 1 to p do
        (l_i, r_i) = (0, |S_i|-1)
    while there exists i: l_i < r_i do
        // pick pivot element in S_j[l_j],..,S_j[r_j], choose random j uniformly
        v := pickPivot(S, l, r)
        for i = 1 to p do
            m_i = binarySearch(v, S_i[l_i, r_i])   // sequentially
        if m_1 + ... + m_p >= k then    // m_1 + ... + m_p is the global rank of v
            r := m    // vector assignment
        else
            l := m
    return l
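A runnable sketch of the multisequence selection follows. It is an assumption-laden translation of the pseudocode: the tie handling via bisect_left/bisect_right is added so that the loop always makes progress even with duplicate values, and the helper names are illustrative.

```python
import bisect
import random

def ms_select(seqs, k):
    """Return split indices l with sum(l) == k such that the elements below
    the splits are the k globally smallest.  seqs are sorted lists."""
    p = len(seqs)
    l = [0] * p
    r = [len(s) for s in seqs]
    while any(li < ri for li, ri in zip(l, r)):
        # pick a pivot from a randomly chosen non-exhausted range
        j = random.choice([i for i in range(p) if l[i] < r[i]])
        v = seqs[j][random.randrange(l[j], r[j])]
        lo = [bisect.bisect_left(seqs[i], v, l[i], r[i]) for i in range(p)]
        hi = [bisect.bisect_right(seqs[i], v, l[i], r[i]) for i in range(p)]
        if sum(lo) >= k:
            r = lo                  # k-th smallest lies strictly below v
        elif sum(hi) < k:
            l = hi                  # k-th smallest lies strictly above v
        else:
            # v itself has global rank k: split through its duplicates
            need = k - sum(lo)
            for i in range(p):
                take = min(need, hi[i] - lo[i])
                lo[i] += take
                need -= take
            return lo
    return l

random.seed(1)
print(ms_select([[1, 4, 7, 9], [2, 3, 8], [0, 5, 6]], 5))   # → [2, 2, 1]
```

Each round either shrinks r or grows l strictly, which bounds the expected number of rounds as in quickselect.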

The PRAM model was chosen for the complexity analysis. If the data is distributed evenly over all processors, the p-fold execution of the binarySearch method has a running time of O(p · log(n/p)). The expected recursion depth is, as in the quickselect algorithm, O(log n). Thus the total expected running time is O(p · log(n/p) · log(n)).

Applied to the parallel multiway merge sort, the msSelect method must be executed in parallel in order to find all p−1 separating elements with ranks i · n/p at the same time. These can then be used to cut each sequence into p pieces, with the same total running time of O(p · log(n/p) · log(n)).

Pseudocode

Here is the complete pseudocode for the parallel multiway merge sort. Barrier synchronization is assumed before and after the determination of the separating elements, so that each processor can correctly compute its separating elements and the partitioning of its sequence.

/**
 * d: Unsorted Array of Elements
 * n: Number of Elements
 * p: Number of Processors
 * return Sorted Array
 */
algorithm parallelMultiwayMergesort(d : Array, n : int, p : int) is
    o := new Array[0, n]                                // the output array
    for i = 1 to p do in parallel                       // each processor in parallel
        S_i := d[(i-1) * n/p, i * n/p]                  // sequence of length n/p
        sort(S_i)                                       // sort locally
        synch
        v_i := msSelect([S_1,...,S_p], i * n/p)         // element with global rank i * n/p
        synch
        (S_i,1 ,..., S_i,p) := sequence_partitioning(S_i, v_1, ..., v_p) // split S_i into subsequences
        o[(i-1) * n/p, i * n/p] := kWayMerge(S_1,i, ..., S_p,i)  // merge and assign to output array
    return o

Analysis

First, each processor sorts its assigned n/p elements locally using a comparison-based sorting algorithm of complexity O(n/p · log(n/p)). Then the separating elements can be determined in time O(p · log(n/p) · log(n)). Finally, each group of p pieces has to be merged simultaneously by each processor; using a sequential p-way merge algorithm, this has a running time of O(n/p · log(p)). This results in a total running time of

    O(n/p · log(n/p) + p · log(n/p) · log(n) + n/p · log(p)).

Practical adaptation and application

The multiway mergesort algorithm is very scalable due to its high degree of parallelism, which allows the use of many processors. This makes the algorithm a useful candidate for sorting large amounts of data, such as those processed in computer clusters. Since memory is usually not a limiting resource in such systems, the memory-usage disadvantage of mergesort is negligible. However, other factors that are not taken into account when modeling on a PRAM become important in such systems, among them: the memory hierarchy, when the data no longer fits into the processors' caches, and the communication effort of exchanging data between processors, which can become a bottleneck when the data can no longer be reached via the shared memory.

Sanders et al. presented in their paper a bulk synchronous parallel algorithm for a multi-level multiway mergesort, which divides the p processors into groups of size p′. All processors sort locally first. In contrast to a single-level multiway mergesort, these sequences are then divided into p′ parts and assigned to the corresponding processor groups. These steps are repeated recursively within the groups. This reduces communication and, in particular, avoids problems with many small messages. The hierarchical structure of the underlying real network (e.g. racks, clusters, ...) can be used to define the processor groups.

Other variants

Mergesort was one of the first sorting algorithms to achieve optimal speedup, with Richard Cole using a clever subsampling algorithm to ensure O(1) merging. Other sophisticated parallel sorting algorithms can achieve the same or better time bounds with lower constants. David Powers, for example, described in 1991 a parallelized quicksort (and a related radix sort) that can operate in O(log n) time with implicit partitioning on a CRCW parallel random access machine (PRAM) with n processors. Powers also showed that a pipelined version of Batcher's bitonic mergesort, running in O(log² n) time on a butterfly sorting network, is in practice faster than his O(log n) sorts on a PRAM, and offers a detailed discussion of the hidden overheads of comparison sorting, radix sorting and parallel sorting.

Miscellaneous

Since mergesort processes the start list and all intermediate lists sequentially, it is particularly well suited for sorting linked lists. For arrays, a temporary array of the same length as the array to be sorted is normally used as intermediate storage (i.e., mergesort usually does not work in-place, see above). Quicksort, by contrast, requires no temporary array.
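For illustration, a mergesort on a singly linked list that relinks nodes instead of allocating a temporary array (a sketch; the minimal Node class is ad hoc, and the middle of the list is found with the slow/fast pointer technique):

```python
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.next = value, nxt

def sort_list(head):
    if head is None or head.next is None:
        return head
    slow, fast = head, head.next        # find the middle of the list
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
    mid, slow.next = slow.next, None    # cut the list in two
    return merge(sort_list(head), sort_list(mid))

def merge(a, b):
    dummy = tail = Node(None)           # dummy head simplifies relinking
    while a and b:
        if a.value <= b.value:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    tail.next = a or b                  # append the non-empty remainder
    return dummy.next

# build 3 -> 1 -> 2, sort it, and read the values back
head = sort_list(Node(3, Node(1, Node(2))))
out = []
while head:
    out.append(head.value); head = head.next
print(out)   # → [1, 2, 3]
```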

The SGI implementation of the Standard Template Library (STL) uses mergesort as its algorithm for stable sorting.

References

  1. Donald E. Knuth: The Art of Computer Programming. Volume 3: Sorting and Searching. 2nd Edition. Addison-Wesley, 1998, p. 158.
  2. h2database/h2database. In: GitHub. Retrieved September 1, 2016.
  3. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford: Introduction to Algorithms. Third ed. MIT Press, Cambridge, Mass. 2009, ISBN 978-0-262-27083-0.
  4. Victor J. Duvanenko: "Parallel Merge Sort". Dr. Dobb's Journal & blog, and GitHub repo C++ implementation.
  5. Peter Sanders, Johannes Singler. 2008. Lecture: Parallel algorithms. Last visited 05.02.2020. http://algo2.iti.kit.edu/sanders/courses/paralg08/singler.pdf
  6. Practical Massively Parallel Sorting. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures. doi:10.1145/2755573.2755595.
  7. Peter Sanders. 2019. Lecture: Parallel algorithms. Last visited 05.02.2020. http://algo2.iti.kit.edu/sanders/courses/paralg19/vorlesung.pdf
  8. Richard Cole: Parallel Merge Sort. In: SIAM Journal on Computing. Vol. 17, No. 4, August 1988, ISSN 0097-5397, pp. 770–785, doi:10.1137/0217049.
  9. Powers, David M. W.: Parallelized Quicksort and Radixsort with Optimal Speedup. Proceedings of International Conference on Parallel Computing Technologies. Novosibirsk. 1991.
  10. David M. W. Powers: Parallel Unification: Practical Complexity. Australasian Computer Architecture Workshop, Flinders University, January 1995.
  11. stable_sort. http://www.sgi.com/tech/stl/stable_sort.html