Hyper cube (communication pattern)

The -dimensional hypercube is a network topology for computer bundles with computers. It allows an efficient implementation of the basic communication patterns broadcast , gossip, all-reduce and partial sum formation . For this, the participating computers are numbered from to in binary. Each computer then has a direct connection to the computers whose numbers differ in exactly one bit from . This structure is used efficiently by the following algorithms. ${\ displaystyle d}$ ${\ displaystyle 2 ^ {d}}$ ${\ displaystyle 0}$ ${\ displaystyle 2 ^ {d} -1}$ ${\ displaystyle i}$ ${\ displaystyle d}$ ${\ displaystyle i}$

sketch

Some of the communication primitives can be implemented according to the same sketch presented in this section. At the beginning of each communication primitive, each participating computer has a message that must reach every other computer in the course of communication. This process is outlined by the following pseudocode, with initialization , operation and output being placeholders that depend on the communication primitive.

Eingabe: Nachricht  $m$ .
Ausgabe: Abhängig von den Platzhaltern Initialisierung, Operation und Ausgabe.
Initialisierung
 $s:=m$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Sende  $s$  an  $y$ 
    Empfange  $m$  von  $y$ 
    Operation $(s,m)$ 
endfor
Ausgabe

In the pseudocode, each computer iterates over its neighbors once, because the expression negates the -th bit in the binary representation of , i.e. it describes the number of the -th neighbor of the computer . Each computer now exchanges a message with this neighbor and then processes its current message with the received message , whereby the accumulation operation to be used differs according to communication primitives. ${\ displaystyle i {\ text {XOR}} 2 ^ {k}}$ ${\ displaystyle k}$ ${\ displaystyle i}$ ${\ displaystyle k}$ ${\ displaystyle i}$ ${\ displaystyle s}$ ${\ displaystyle s}$ ${\ displaystyle m}$

The algorithm sketch applied to a -dimensional hypercube. In the first step (before any communication), each computer has a message (blue) that it exchanges with its neighbor in the respective step along the dimension marked in red. After an exchange, the messages are concatenated in this example. After the last step, every computer has every message.

{\ displaystyle 3}

Communication primitives

Prefix sum

With the prefix sum , each processor has a message at the beginning . The goal is for each processor to end up getting an associative operation . The algorithm can be embedded in the algorithm sketch as follows: ${\ displaystyle i}$ ${\ displaystyle m_ {i}}$ ${\ displaystyle i}$ ${\ displaystyle \ bigoplus _ {0 \ leq j \ leq i} m_ {j}}$ ${\ displaystyle \ oplus}$

Eingabe: Nachricht  $m_{i}$  auf Prozessor  $i$ .
Ausgabe: Präfixsumme  $\bigoplus _{0\leq j\leq i}m_{j}$  auf Prozessor  $i$ .
 $x:=m_{i}$ 
 $\sigma :=m_{i}$ 
for  $0\leq k\leq d-1$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Sende  $\sigma$  an  $y$ 
    Empfange  $m$  von  $y$ 
     $\sigma :=\sigma \oplus m$ 
    if Bit  $k$  in  $i$  gesetzt then  $x:=x\oplus m$ 
endfor

A hypercube of dimension can be broken down into two hypercubes of dimension . For this purpose, the sub-cube of all nodes whose numbers start with 0 in binary representation are referred to as 0-sub-cube. The remaining nodes form the 1-part cube analogously. After the prefix sum has been calculated in both sub-cube, the total sum of the elements in the 0-sub-cube has to be added to all elements of the 1-sub-cube. This is because, by definition, the computers in the 0-sub-cube have a lower rank than the computers in the 1-sub-cube. In the implementation, each node therefore saves not only its prefix sum (variable ) but also the sum of all elements in the sub-cube (variable ). In this way, in each step, all nodes in the 1-sub-cube can obtain the total amount from the 0-sub-cube. ${\ displaystyle d}$ ${\ displaystyle d-1}$ ${\ displaystyle x}$ ${\ displaystyle \ sigma}$

When the term of a factor of results for and a factor of for : . ${\ displaystyle \ log p}$ ${\ displaystyle T _ {\ text {start}}}$ ${\ displaystyle n \ log p}$ ${\ displaystyle T _ {\ text {byte}}}$ ${\ displaystyle T (n, p) = (T _ {\ text {start}} + nT _ {\ text {byte}}) \ log p}$

Example of a prefix sum calculation. Each node starts with its own node number as a message; H. . The top row of a node shows the bottom row . The operation is addition.

{\ displaystyle m_ {i} = i}

{\ displaystyle x}

{\ displaystyle \ sigma}

Gossip / All-Reduce

In the Gossip operation, every computer starts with a message . The aim is that after the execution every computer knows all computers, i.e. has the message , whereby the concatenation denotes. This operation can be implemented with the algorithm sketch as follows: ${\ displaystyle m_ {i}}$ ${\ displaystyle x: = m_ {0} \ cdot m_ {1} \ dots m_ {p}}$ ${\ displaystyle \ cdot}$

Eingabe: Nachricht  $x:=m_{i}$  auf Prozessor  $i$ .
Ausgabe: Alle Nachrichten  $m_{1}\cdot m_{2}\dots m_{p}$ .
 $x:=m_{i}$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Sende  $x$  an  $y$ 
    Empfange  $x'$  von  $y$ 
     $x:=x\cdot x'$ 
endfor

The process follows the sketch. Note that the length of the messages transmitted doubles in each step. This results in the following runtime: . ${\ displaystyle T (n, p) \ approx \ sum _ {j = 0} ^ {d-1} (T _ {\ text {start}} + n \ cdot 2 ^ {j} T _ {\ text {byte} }) = \ log (p) T _ {\ text {start}} + (p-1) nT _ {\ text {byte}}}$

With All-Reduce , in contrast to Gossip, the messages are not concatenated, but an operator is applied to the two messages. It is therefore a Reduce operation, the result of which is available to every processor. The Gossip algorithm can be adapted in the hypercube. This reduces the number of communication steps compared to reduce and broadcast.

All-to-all

With all-to-all communication, each processor has its own message for all other processors.

Eingabe: Nachrichten  $m_{ij}$  auf Prozessor  $i$  an Prozessor  $j$ .
for  $d>k\geq 0$  do
   Erhalte von Prozessor  $i{\text{ XOR }}2^{k}$ :
       alle Nachrichte für meinen  $k$ -dimensionalen Teilwürfel
   Sende an Prozessor  $i{\text{ XOR }}2^{k}$ :
       alle Nachrichte für seinen  $k$ -dimensionalen Teilwürfel
endfor

In each iteration step, a message comes one dimension closer to its destination if it has not yet reached it. Accordingly, only a maximum of many steps are required. Messages are sent in every step . For the first step, exactly half of the messages are not in their own sub-cube. In all of the following steps, the sub-cube is only half as large as before, however, in the previous step, just as many messages were received from another processor that are also intended for this sub-cube. ${\ displaystyle d = \ log {p}}$ ${\ displaystyle p / 2}$

Overall, this means a running time of . ${\ displaystyle T (n, p) \ approx \ log {p} (T _ {\ text {start}} + {\ frac {p} {2}} nT _ {\ text {byte}})}$

ESBT broadcast

The ESBT broadcast (Edge-Disjoint Spanning Binomial Tree) algorithm is a time-optimized broadcast for computer bundles with hypercube network topology. For this purpose, starting from the source (hereinafter the computer) , the network is divided into edge-disjoint binomial trees so that each neighbor of the source is the root of a binomial tree with computers. The source now divides its message into partial messages , which are then distributed cyclically to the roots of the binomial trees. Each binomial tree then carries out a broadcast. ${\ displaystyle 0}$ ${\ displaystyle d}$ ${\ displaystyle 2 ^ {d} -1}$ ${\ displaystyle k}$

If the source distributes a partial message in each step, it has distributed all partial messages according to the steps. The broadcast in a binomial tree takes steps. All in all, steps are therefore required until the broadcast for the last message is completed and the runtime for a message of the length results from . The optimal minimizes the running time too . ${\ displaystyle k}$ ${\ displaystyle d}$ ${\ displaystyle k + d}$ ${\ displaystyle n}$ ${\ displaystyle T (n, p, k) = \ left ({\ frac {n} {k}} T _ {\ text {byte}} + T _ {\ text {start}} \ right) (k + d) }$ ${\ displaystyle k ^ {*} = {\ sqrt {\ frac {nd \ cdot T _ {\ text {byte}}} {T _ {\ text {start}}}}}}$ ${\ displaystyle T ^ {*} (n, p) = n \ cdot T _ {\ text {byte}} + \ log (p) \ cdot T _ {\ text {start}} + {\ sqrt {n \ log ( p) \ cdot T _ {\ text {start}} \ cdot T _ {\ text {byte}}}}}$

Structure of the binomial trees

One- dimensional hypercube with three edge-disjoint ESBTs.

{\ displaystyle 3}

The binomial trees can be constructed systematically according to the following rule. To do this, a binomial tree with nodes is first defined. Edge-disjoint copies of the binomial tree are then embedded in the hypercube by translation and rotation . ${\ displaystyle d}$ ${\ displaystyle 2 ^ {d}}$ ${\ displaystyle d}$

A single binomial tree has nodes as its roots. The children of a node result from negating the leading zeros in the binary representation of the node number. The resulting graph is obviously a binomial tree. The set of edges of the -th binomial tree in the hypercube is now obtained as follows: an XOR operation is applied to each node and the binary representation of the node number is then shifted cyclically to the right. The resulting copies of the outgoing binomial tree are edge-disjoint and therefore meet the requirements of the ESBT broadcast algorithm. ${\ displaystyle 0}$ ${\ displaystyle k}$ ${\ displaystyle 2 ^ {k}}$ ${\ displaystyle k}$ ${\ displaystyle d}$

Individual evidence

^ A. Grama: Introduction to Parallel Computing. 2nd Edition. Addison-Wesley, 2003, ISBN 0-201-64865-2 .
^ I. Foster: Designing and Building Parallel Programs. Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1995, ISBN 0-201-57594-9 .
↑ SL Johnsson, C.-T. Ho: Optimum broadcasting and personalized communication in hypercubes . In: IEEE Transactions on Computers . tape 38 , no. 9 , 1989, ISSN 0018-9340 , pp. 1249-1268 , doi : 10.1109 / 12.29465 .

[1] A. Grama: Introduction to Parallel Computing. 2nd Edition. Addison-Wesley, 2003, ISBN 0-201-64865-2 .

[2] I. Foster: Designing and Building Parallel Programs. Concepts and Tools for Parallel Software Engineering. Addison-Wesley, 1995, ISBN 0-201-57594-9 .

[JohnssonHo1989-3] SL Johnsson, C.-T. Ho: Optimum broadcasting and personalized communication in hypercubes . In: IEEE Transactions on Computers . tape 38 , no. 9 , 1989, ISSN 0018-9340 , pp. 1249-1268 , doi : 10.1109 / 12.29465 .