Reduction operator


In computer science, a reduction operator (also known as a reduction clause) is an operator that is frequently used in parallel programming to reduce the elements of an array to a single result. Reduction operators are associative and often (but not always) commutative. The reduction of sets of elements is an important part of programming models such as MapReduce, in which a reduction operator is applied (mapped) to all elements before they are reduced. Other parallel algorithms use reduction operators as their primary operations to solve more complex problems. Many reduction operators can also be used for broadcasting, i.e. to distribute data to all processors.

Theory

A reduction operator can help divide a problem into sub-problems whose solutions are used to obtain the final result. Reduction operators make it possible to carry out certain serial operations in parallel, thereby reducing the number of computation steps required. A reduction operator stores the results of the sub-problems in a private copy of the variable; these private copies are finally merged into one shared copy.

An operator is a reduction operator if

  • it can reduce an array to a single value and
  • the final result can be obtained from the partial results.

These two requirements are met for commutative and associative operators, which are applied to all elements of the array.

Examples are addition and multiplication, as well as certain logical operators (and, or, etc.).

A reduction operator ⊕ can be applied in constant time to an input set V = {v_0, ..., v_{p−1}} of p vectors with m elements each. The result r = v_0 ⊕ v_1 ⊕ ... ⊕ v_{p−1} of the operation is the combination of all elements and must be stored on a designated root processor after execution. If the result is to be available on all processors afterwards, the operation is often called Allreduce. An optimal sequential linear-time algorithm for reduction applies the operator successively from front to back, each time replacing two vectors with the result of the operation on these vectors and thereby reducing the number of vectors by one. This requires (p − 1) · m steps until only r is left. Sequential algorithms therefore cannot be faster than linear time, but parallel algorithms leave room to shorten the running time.
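
A minimal sketch of this sequential reduction in C, assuming addition as the operator applied element-wise (the input values and the constants P and M are only illustrative):

#include <stdio.h>

#define P 4   /* number of vectors */
#define M 3   /* elements per vector */

int main(void) {
    /* illustrative input: p = 4 vectors with m = 3 elements each */
    int v[P][M] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}};
    int r[M];

    /* start with the first vector ... */
    for (int j = 0; j < M; j++) r[j] = v[0][j];

    /* ... and fold in the remaining p - 1 vectors front to back,
       which takes (p - 1) * m element operations in total */
    for (int i = 1; i < P; i++)
        for (int j = 0; j < M; j++)
            r[j] += v[i][j];   /* element-wise application of the operator */

    for (int j = 0; j < M; j++)
        printf("r[%d] = %d\n", j, r[j]);
    return 0;
}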

Example

Given an array a = [a_0, a_1, ..., a_7] with eight elements, the sum of the entire array can be calculated serially by reducing the array sequentially to a single sum using the '+' operator. Starting from the beginning, this yields the left-to-right calculation (((((((a_0 + a_1) + a_2) + a_3) + a_4) + a_5) + a_6) + a_7.

Since '+' is both associative and commutative, it is a reduction operator. This reduction can therefore also take place in parallel on several cores, with each core calculating the sum of only a subset of the array and the reduction operator combining these partial results. Using a binary tree, 4 cores can compute (a_0 + a_1), (a_2 + a_3), (a_4 + a_5) and (a_6 + a_7) respectively. Two cores can then compute (a_0 + a_1) + (a_2 + a_3) and (a_4 + a_5) + (a_6 + a_7), and finally a single core computes the overall sum. With 4 cores, the sum can thus be calculated in log_2(8) = 3 steps instead of the 7 steps required by the serial algorithm. The parallel algorithm computes ((a_0 + a_1) + (a_2 + a_3)) + ((a_4 + a_5) + (a_6 + a_7)), which corresponds to the same result due to the associativity of addition. Commutativity would be important if there were a main core that distributes the subtasks to other cores, since the partial results could come back in different orders; commutativity then guarantees that the result is still the same.
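
A minimal sketch of such a parallel summation in C with OpenMP (the array contents are only illustrative and not taken from the article): each thread accumulates a private partial sum, and the reduction(+:sum) clause merges these private copies at the end, matching the private-copy scheme described in the Theory section.

#include <stdio.h>

int main(void) {
    int a[8] = {2, 3, 5, 1, 7, 6, 8, 4};   /* example input */
    int sum = 0;

    /* each thread sums into a private copy of sum over its part of the array;
       the reduction clause combines the private copies with '+' at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 8; i++)
        sum += a[i];

    printf("sum = %d\n", sum);             /* 36 */
    return 0;
}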

Counterexample

Matrix multiplication is not a reduction operator because this operation is not commutative. If the cores returned their partial results in any order, the end result would most likely be wrong. However, matrix multiplication is associative, so the end result is correct if you make sure that the partial results are in the correct order. This is the case when using binary trees.
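
For illustration (this example is an addition of mine, not part of the original article), two 2×2 matrices already show that matrix multiplication is not commutative:

A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \qquad
B = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}, \qquad
AB = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \neq
\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = BA.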

Algorithms

Binomial Tree Algorithms

For the parallel algorithms there are mainly two models: the parallel random access machine (PRAM), an extension of the random access machine with memory shared between the cores, and the bulk synchronous parallel (BSP) model, which takes communication and synchronization between the cores into account. Both models have different implications for the time complexity, which is why both are presented here.

PRAM algorithm

This algorithm represents a widely used method for inputs where p is a power of two. The reverse procedure is often used to distribute (broadcast) the elements.

A visualization of the algorithm with p = 8 and addition as the reduction operator
for k ← 0 to ⌈log₂ p⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                set p_i to inactive
            else if i + 2^k < p then
                x_i ← x_i ⊕ x_{i + 2^k}

The binary operator ⊕ for vectors is defined element-wise, so that (x ⊕ y)_j = x_j ⊕ y_j. The algorithm is also based on the assumption that x_i = v_i holds for all i at the beginning and that p cores are used. In each iteration half of the cores become inactive; they no longer contribute to the computation. The animation shows a visualization of the algorithm with addition as the operator. Vertical lines represent the cores on which the elements of that line are computed; the eight input elements are shown at the bottom. Each step in the animation corresponds to one parallel step in the execution of the algorithm. An active core p_i applies the operator to the element x_i it currently holds and to x_j, where j = i + 2^k is the smallest index greater than i whose core becomes inactive in the current step. x_i and x_j are not necessarily part of the input, since these fields are overwritten and reused for previously computed expressions. To coordinate the cores without causing additional communication overhead between them, the algorithm makes use of the fact that the cores are indexed from 0 to p − 1. Each core decides on the basis of its k-th least significant bit whether it becomes inactive or combines its own element with the element of the core whose index additionally has the k-th bit set. The underlying communication scheme is a binomial tree, hence the name of the algorithm.
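
The following minimal C sketch simulates this scheme sequentially (an illustration under the assumption of addition as the operator and one element per core; on a real PRAM the inner loop would run in parallel):

#include <stdio.h>
#include <stdbool.h>

#define P 8   /* number of simulated cores, a power of two */

int main(void) {
    int  x[P] = {1, 2, 3, 4, 5, 6, 7, 8};   /* one element per core */
    bool active[P];
    for (int i = 0; i < P; i++) active[i] = true;

    for (int k = 0; (1 << k) < P; k++) {        /* ceil(log2 P) rounds */
        for (int i = 0; i < P; i++) {           /* "in parallel" on a PRAM */
            if (!active[i]) continue;
            if (i & (1 << k)) {
                active[i] = false;              /* bit k set: core drops out */
            } else if (i + (1 << k) < P) {
                x[i] += x[i + (1 << k)];        /* combine with partner core */
            }
        }
    }
    printf("reduction result: %d\n", x[0]);     /* 36, stored at core 0 */
    return 0;
}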

At the end of the algorithm, the result is available only at p_0. For an Allreduce operation, the result must be available on all cores, which can be achieved by a subsequent broadcast. The number of cores should be a power of two; otherwise it can be padded up to the next power of two. There are also algorithms that are specifically tailored to this case.

Runtime analysis

The outermost loop is executed ⌈log₂ p⌉ times, and the time for each parallel step is in O(m), since each core either combines two vectors or becomes inactive. The parallel time is therefore T(p, m) ∈ O(m · log p). To avoid read-write conflicts, exclusive read, exclusive write (EREW) can be used. For the speedup, S(p, m) ∈ O(T_seq / T(p, m)) = O(p / log p) holds, and the efficiency is therefore E(p, m) ∈ O(S(p, m) / p) = O(1 / log p). The efficiency suffers from the fact that half of the active cores become inactive in each step, i.e. in step i only p · 2^{−i} cores are active.

Distributed memory algorithms

In contrast to the PRAM algorithms, the cores do not share a common memory here. Therefore, the data must be exchanged explicitly between the cores, as the following algorithm shows.

for k ← 0 to ⌈log₂ p⌉ − 1 do
    for i ← 0 to p − 1 do in parallel
        if p_i is active then
            if bit k of i is set then
                send x_i to p_{i − 2^k}
                set p_i to inactive
            else if i + 2^k < p then
                receive x_{i + 2^k} from p_{i + 2^k}
                x_i ← x_i ⊕ x_{i + 2^k}

The only difference to the PRAM version above is the use of explicit primitives for communication. However, the principle remains the same.
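
A minimal sketch of this scheme in C with MPI (an illustration assuming addition as the operator, one integer per process and a power-of-two number of processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int x = rank + 1;   /* local contribution of this process */
    int active = 1;

    for (int k = 0; (1 << k) < p; k++) {
        if (!active) continue;
        if (rank & (1 << k)) {               /* bit k set: send and drop out */
            MPI_Send(&x, 1, MPI_INT, rank - (1 << k), 0, MPI_COMM_WORLD);
            active = 0;
        } else if (rank + (1 << k) < p) {    /* bit k clear: receive and combine */
            int y;
            MPI_Recv(&y, 1, MPI_INT, rank + (1 << k), 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            x += y;
        }
    }
    if (rank == 0)
        printf("reduction result: %d\n", x);   /* result only at process 0 */
    MPI_Finalize();
    return 0;
}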

Runtime analysis

Communication between the cores causes some overhead. A simple analysis of the algorithm uses the BSP model and takes into account the time T_start required to initiate a data exchange and the time T_byte required to send one byte of data. The resulting runtime is then Θ(log(p) · (T_start + n · T_byte)), where n is the size in bytes of a vector with m elements, which is sent in each iteration.

Pipeline algorithm

Visualization of the pipeline algorithm with p = 5, m = 4 and addition as the reduction operator.

For distributed memory models it can make sense to exchange the data in a pipelined fashion. This is especially true when p is small compared to m. Typically, linear pipelines split the data into smaller pieces and process them in stages. In contrast to the binomial tree algorithms, the pipeline algorithm makes use of the fact that the vectors are not indivisible: the operator can also be applied to individual elements.

for k ← 0 to p + m − 3 do
    for i ← 0 to p − 1 do in parallel
        if i ≤ k < i + m and i < p − 1 then
            send x_i[k − i] to p_{i + 1}
        if i − 1 ≤ k < i − 1 + m and i > 0 then
            receive x_{i − 1}[k − i + 1] from p_{i − 1}
            x_i[k − i + 1] ← x_i[k − i + 1] ⊕ x_{i − 1}[k − i + 1]

It is important that the send and receive operations are carried out simultaneously for the algorithm to work correctly. At the end, the complete result vector x_{p−1} is located at the last core p_{p−1}. The animation shows the execution of the algorithm on vectors of size 4 with 5 cores. Two steps of the animation correspond to one step of the parallel execution.
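
A minimal sketch of a pipelined chain reduction in C with MPI (an illustration assuming addition as the operator and a vector of M elements per process; for simplicity, the pipelining here emerges from blocking per-element sends and receives rather than from explicitly synchronized steps):

#include <mpi.h>
#include <stdio.h>

#define M 4   /* elements per vector */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double x[M];
    for (int j = 0; j < M; j++) x[j] = rank + j;   /* illustrative local data */

    for (int j = 0; j < M; j++) {
        if (rank > 0) {                  /* receive element j from the left neighbour */
            double y;
            MPI_Recv(&y, 1, MPI_DOUBLE, rank - 1, j, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            x[j] += y;                   /* combine with the own element */
        }
        if (rank < p - 1)                /* pass the partial result to the right */
            MPI_Send(&x[j], 1, MPI_DOUBLE, rank + 1, j, MPI_COMM_WORLD);
    }

    if (rank == p - 1)                   /* full result ends up on the last core */
        for (int j = 0; j < M; j++)
            printf("x[%d] = %g\n", j, x[j]);

    MPI_Finalize();
    return 0;
}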

Runtime analysis

The number of steps in the parallel execution is p + m − 2: it takes p − 1 steps until the last core receives its first element and a further m − 1 steps until all its elements have arrived. In the BSP model, the runtime is therefore T(p, m) = (T_start + (n / m) · T_byte) · (p + m − 2), where n is the size of a vector in bytes.

Even if m is a fixed value, it is possible to logically group elements of the vectors together and thereby reduce m. For example, a problem instance with vectors of length four can be solved by splitting the vectors into their first two and last two elements, which are then always sent and processed together. In this case twice as much data is sent in each step, but the number of steps is roughly halved: m is halved, while the size n in bytes remains the same. The runtime of this approach therefore depends on m, which can be optimized if T_start and T_byte are known. The optimum is m = √((p − 2) · n · T_byte / T_start), assuming this results in a smaller m that divides the original one.
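
A brief derivation sketch of this optimum from the runtime formula above (my addition; only the terms depending on m matter):

T(p, m) = \left(T_{\mathrm{start}} + \frac{n}{m}\,T_{\mathrm{byte}}\right)(p + m - 2)
        = (p - 2)\,T_{\mathrm{start}} + n\,T_{\mathrm{byte}}
          + m\,T_{\mathrm{start}} + \frac{(p - 2)\,n\,T_{\mathrm{byte}}}{m}

\frac{\partial T}{\partial m}
  = T_{\mathrm{start}} - \frac{(p - 2)\,n\,T_{\mathrm{byte}}}{m^{2}} = 0
\quad\Longrightarrow\quad
m = \sqrt{\frac{(p - 2)\,n\,T_{\mathrm{byte}}}{T_{\mathrm{start}}}}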

Applications

Reduction is one of the most important collective operations of the Message Passing Interface (MPI), where the performance of the algorithms used matters and is continuously evaluated for different use cases. Operators can be passed as parameters to MPI_Reduce and MPI_Allreduce, the difference being whether the result is available only on one (root) core or on all cores. Efficient reduction algorithms are also important for MapReduce in order to process large data sets, even on large clusters.
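
A minimal usage sketch in C (summing one value per process with the built-in MPI_SUM operator; with MPI_Reduce the result is available only on the root process, with MPI_Allreduce on every process):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, sum = 0.0, total = 0.0;

    /* result is available only on the root process (rank 0) */
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* result is available on every process */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("MPI_Reduce sum = %g\n", sum);
    printf("rank %d sees MPI_Allreduce total = %g\n", rank, total);

    MPI_Finalize();
    return 0;
}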

Some parallel sorting algorithms use reductions to process large data sets.

Literature

  • Rohit Chandra: Parallel Programming in OpenMP . Morgan Kaufmann, 2001, ISBN 1558606718 , pp. 59-77.
  • Yan Solihin: Fundamentals of Parallel Multicore Architecture . CRC Press, 2016, ISBN 978-1-4822-1118-4 , p. 75.

