Parallel matrix multiplication

The multiplication of matrices is part of many algorithms for solving more complex problems; it is used, for example, to calculate path lengths in graphs or to determine reachable nodes. In these cases the multiplication is often carried out many times, so there is an effort to reduce its running time. In addition, data sets keep growing, for example dependency graphs between the users of social platforms, so the matrices become larger and more efficient algorithms are required. Besides the possibility of developing algorithms with better complexity, the multiplication can be carried out in parallel, i.e. on several processors at the same time. An example of this is Cannon's matrix multiplication, developed by Lynn Elliot Cannon.

Algorithms

Fox algorithm

The Fox algorithm, named after Geoffrey C. Fox, is an algorithm for matrix multiplication carried out in parallel on p processors. In developing the algorithm, a topology was used in which the processors are arranged in a hypercube. The distributed memory model is used as the memory model: each processor has its own private address space, and communication between processors is message-based. The aim during development was to create an efficient and easy-to-use algorithm for scientific calculations. Furthermore, an effort was made to develop an algorithm that would also scale to future machines with many processors. Systolic arrays are essential for the algorithm; these describe a network of pipes through which data streams are passed in a clocked fashion.

Description

The video shows the schematic sequence of the Fox algorithm.

Two matrices A and B are multiplied, where A and B are dense (fully populated) n × n matrices. The elements of the matrices are distributed to the processors according to their i and j coordinates in the matrix; the processors likewise have an i and j coordinate within the hypercube. Thus, at the beginning of the calculation, each processor holds one element of matrix A and one element of matrix B. For matrices for which n² = p holds, each processor receives exactly one entry of matrix A and one entry of matrix B. For matrices for which n² > p holds, each processor receives a sub-matrix of size (n/√p) × (n/√p) of A and of B. In the sub-matrix case, each processor performs its own matrix multiplication on its sub-matrices.
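
The following sketch illustrates this block distribution for the case n² > p. It is only a minimal serial illustration, not part of the algorithm itself; the function name distribute_blocks and the assumption that √p divides n are choices made here for the example.

import numpy as np

def distribute_blocks(M, sqrt_p):
    # Return the (n/sqrt_p) x (n/sqrt_p) sub-matrix that each processor PE(i, j)
    # receives; assumes that sqrt_p divides n.
    n = M.shape[0]
    s = n // sqrt_p                              # side length of one block
    return {(i, j): M[i*s:(i+1)*s, j*s:(j+1)*s]
            for i in range(sqrt_p) for j in range(sqrt_p)}

A = np.arange(36.0).reshape(6, 6)
blocks = distribute_blocks(A, 3)                 # 3 x 3 processor grid, p = 9
print(blocks[(0, 0)].shape)                      # (2, 2): each PE holds a 2 x 2 block

Applying the same partitioning to B gives each processor its pair of sub-matrices.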

During the algorithm, the elements of matrix A are distributed by means of a broadcast within the processor rows, i.e. among all processors that share the same row coordinate i. The elements of matrix B are passed on to the processor directly above; processors in the top row pass their elements to the processor in the bottom row of the same column, so the columns of B are rotated cyclically.

In the initial configuration, each processor holds the element of matrix A and the element of matrix B whose coordinates i and j are the same as the processor's own coordinates in the hypercube, i.e. processor PE(i, j) holds a_ij and b_ij.

The algorithm consists of the following steps:

  1. Broadcast of the elements on the diagonal of matrix A to the processors in the same row (same i coordinate). Afterwards, all processors in the first row hold the element a_00 and all processors in the second row hold the element a_11. This pattern continues for all rows.
  2. Multiplication of the received element of matrix A by the element of matrix B currently held by the processor.
  3. All processors pass their current element of matrix B to the processor in the same column one position above them. The processors in the top row pass their element to the processor in the bottom row.
  4. Broadcast of the elements on the 'diagonal + 1' of matrix A to the processors in the same row (same i coordinate). Afterwards, all processors in the first row hold the element a_01 and all processors in the second row hold the element a_12. This pattern continues for all rows.
  5. Multiplication of the received element of matrix A by the element of matrix B currently present in the processor, and addition of the result to the result of the previous multiplication.

This pattern continues until the elements of B are back in their original position and all of A's secondary diagonals have been used.
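
As an illustration, consider the smallest non-trivial case p = 4 and a 2 × 2 matrix with one element per processor. Written out (a sketch of the two stages described above, using 0-based indices):

\begin{align*}
\text{Stage 1:}\quad & \text{row } 0 \text{ broadcasts } a_{00},\ \text{row } 1 \text{ broadcasts } a_{11}: \\
& c_{00} \leftarrow a_{00}b_{00},\quad c_{01} \leftarrow a_{00}b_{01},\quad c_{10} \leftarrow a_{11}b_{10},\quad c_{11} \leftarrow a_{11}b_{11};\\
& \text{then } B \text{ is rotated upwards, so } \mathrm{PE}(0,j) \text{ holds } b_{1j} \text{ and } \mathrm{PE}(1,j) \text{ holds } b_{0j}.\\
\text{Stage 2:}\quad & \text{row } 0 \text{ broadcasts } a_{01},\ \text{row } 1 \text{ broadcasts } a_{10}: \\
& c_{00} \leftarrow c_{00} + a_{01}b_{10},\quad c_{01} \leftarrow c_{01} + a_{01}b_{11},\quad c_{10} \leftarrow c_{10} + a_{10}b_{00},\quad c_{11} \leftarrow c_{11} + a_{10}b_{01}.
\end{align*}

After the second stage, each processor PE(i, j) holds exactly c_ij = a_i0 b_0j + a_i1 b_1j, i.e. one entry of C = AB.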

Pseudocode

In pseudocode, the algorithm can be implemented as follows:

FOX(A, B):
// processor unit PE(i, j), 0 ≤ i, j < √p; initially holds a = a_ij and b = b_ij
c := 0
for l := 0 to √p − 1 {
    k := (i + l) mod √p
    PE(i, k) broadcasts a to PE(i, j) for 0 ≤ j < √p     // every PE(i, j) receives this value as a'
    c := c + a' · b
    concurrently {
        send b to PE((i − 1) mod √p, j)                  // pass b to the processor above
    } with {
        receive b' from PE((i + 1) mod √p, j)            // receive from the processor below
    }
    b := b'
}
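
The following Python sketch simulates the algorithm serially for the case of one matrix element per processor (p = n²). It only illustrates the data movement under that assumption and is not a distributed implementation; the function name fox_multiply is chosen here for the example.

import numpy as np

def fox_multiply(A, B):
    n = A.shape[0]                  # sqrt(p) = n: PE(i, j) holds a[i, j] and b[i, j]
    a, b = A.copy(), B.copy()
    c = np.zeros_like(A)
    for l in range(n):
        # Stage l: PE(i, (i + l) mod n) broadcasts its element of A along row i.
        bcast = np.array([a[i, (i + l) % n] for i in range(n)])
        for i in range(n):
            for j in range(n):      # every PE multiplies the broadcast value with its b
                c[i, j] += bcast[i] * b[i, j]
        # Rotate B upwards: PE(i, j) sends its b to PE((i - 1) mod n, j).
        b = np.roll(b, -1, axis=0)
    return c

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
assert np.allclose(fox_multiply(A, B), A @ B)   # matches the serial product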

Analysis

Suppose two matrices of size n × n are to be multiplied using p processors. Each processor then receives two sub-matrices of size (n/√p) × (n/√p).

The broadcast-multiply-rotate pattern is repeated √p times, until the result has been computed.

Overall, the algorithm has a calculation time of

   $T_P = \frac{n^3}{p}\,t_c + \sqrt{p}\left(t_s + t_w \frac{n^2}{p}\right)\left(1 + \log \sqrt{p}\right)$

This is made up of the individual sub-steps:

1. Broadcast of the sub-matrices of matrix A: $\sqrt{p}\,\bigl(t_s + t_w \tfrac{n^2}{p}\bigr)\log \sqrt{p}$
2. Rotation of the sub-matrices of matrix B around the j axis: $\sqrt{p}\,\bigl(t_s + t_w \tfrac{n^2}{p}\bigr)$
3. Multiplication of the sub-matrices and addition to the previous partial result: $\sqrt{p}\,\bigl(\tfrac{n}{\sqrt{p}}\bigr)^3 t_c = \tfrac{n^3}{p}\,t_c$

Here $t_w$ is the time to transfer one floating-point number between processors, $t_s$ is the startup time needed to fill the pipeline, and $t_c$ is the time for one multiplication or addition of floating-point numbers.
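
Putting the three contributions together gives the total stated above (a brief derivation; the grouping by $t_s$ and $t_w$ is just algebra):

\begin{align*}
T_P &= \frac{n^3}{p}\,t_c
     + \sqrt{p}\Bigl(t_s + t_w\frac{n^2}{p}\Bigr)\log\sqrt{p}
     + \sqrt{p}\Bigl(t_s + t_w\frac{n^2}{p}\Bigr) \\
    &= \frac{n^3}{p}\,t_c
     + \sqrt{p}\,\bigl(1 + \log\sqrt{p}\bigr)\,t_s
     + \frac{n^2}{\sqrt{p}}\,\bigl(1 + \log\sqrt{p}\bigr)\,t_w .
\end{align*}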

The analysis applies to all machines that fall into the MIMD classification and that use a distributed memory model.

A disadvantage of the algorithm is its limited scalability for the multiplication of two n × n matrices. At most n² processors can be used, which means that the runtime is in Ω(n), since the serial algorithm performs n³ operations.
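
This lower bound follows directly from the operation count: with at most p = n² processors, the n³ multiply-add operations alone require

   $T_P \;\ge\; \frac{n^3}{p}\,t_c \;\ge\; \frac{n^3}{n^2}\,t_c \;=\; n\,t_c \;=\; \Omega(n).$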

DNS algorithm

The graphic shows matrix A, which has been colored red for better recognition in the video about the execution of the DNS algorithm.

The DNS algorithm, named after Eliezer Dekel, David Nassimi and Sartaj Sahni, was published by them in 1981. The algorithm was developed for the parallel multiplication of matrices on general purpose processors. It is aimed at parallel computers that fall into the SIMD classification. These computers have in common that they consist of p processing elements (PEs), each of which can perform standard arithmetic and logical operations. A total of up to n³ processors can be used; these are arranged in a three-dimensional hypercube. Since the multiplication of two n × n matrices requires n³ scalar multiplications, it follows that with n³ processors each processor has to perform exactly one scalar multiplication.

The algorithm was developed primarily to solve problems in graph theory, for example finding the shortest paths between all pairs of nodes (all-pairs shortest paths) or determining the radius, diameter and center of a graph.

Description

Two matrices, A and B, are multiplied; these are dense n × n matrices. Specifically, this algorithm uses p = q³ processors, which are arranged in a cube. The elements of the matrices are assigned to processors with suitable coordinates within the three-dimensional hypercube using their i and j coordinates. The multiplication is then carried out on q × q matrices whose entries are in turn sub-matrices of size (n/q) × (n/q). Here q should be a divisor of n.

The graphic shows matrix B, which has been colored yellow for better recognition in the video about the execution of the DNS algorithm.

We use the notation PE(i, j, k) to address the processors. In the initial state, the elements of matrix A lie on the processors on the front face of the hypercube, and the elements of matrix B lie on the processors on its left face. The processors on the front left edge thus already hold one element of matrix A and one element of matrix B. This is the initial start configuration.

The algorithm consists of the following steps:

  1. Broadcast of matrices A and B across the processors. The elements of matrix A (located on the front face of the cube) are sent backwards along the j dimension; the elements of matrix B (located on the left face of the cube) are sent to the right along the k dimension. After the broadcast is complete, each processor holds one element of matrix A and one element of matrix B.
  2. Now each processor performs the multiplication of the two elements it holds.
  3. In the last phase, the sum of the individual results is formed. This is done by sending the partial results from top to bottom and adding them along the way, as shown in the worked sum below. After this operation is complete, the result matrix lies on the bottom face of the hypercube.
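
In other words, each vertical column of processors above a given bottom-face position holds, from top to bottom, the factors of one inner product of a row of A with a column of B, so the top-to-bottom summation in step 3 produces one entry of the result (a sketch; the exact assignment of row and column indices to the cube coordinates is an assumption here):

   $c_{rs} \;=\; \sum_{t=0}^{q-1} a_{rt}\, b_{ts}, \qquad 0 \le r, s < q.$

When the entries are sub-matrices, the products a_rt · b_ts are themselves small matrix multiplications and the sum is a matrix sum.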

Pseudocode

An implementation in pseudocode could look like this.

DNS(A, B):
// processors PE(i, j, k) with 0 ≤ i, j, k < q and q = p^(1/3);
// i runs from top to bottom, j from front to back, k from left to right (one possible index layout)
store the element a_{k,i} of A in PE(i, 0, k)                    // matrix A on the front face
store the element b_{i,j} of B in PE(i, j, 0)                    // matrix B on the left face
PE(i, 0, k) broadcasts its element of A to PE(i, j, k) for 0 ≤ j < q
PE(i, j, 0) broadcasts its element of B to PE(i, j, k) for 0 ≤ k < q
compute c := a · b on every PE(i, j, k)
for each (j, k): the PEs (i, j, k) sum their values c over i into PE(q − 1, j, k)   // result c_{k,j} on the bottom face
The video shows the schematic sequence of the DNS algorithm on processors arranged in a hypercube.
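
The following Python sketch simulates these steps serially for one matrix element per processor (p = q³, i.e. q = n). The index layout matches the pseudocode above and is an assumption made for the example; the function name dns_multiply is likewise chosen here.

import numpy as np

def dns_multiply(A, B):
    q = A.shape[0]                      # q x q matrices on a q x q x q processor cube
    a = np.empty((q, q, q))             # a[i, j, k]: element of A held by PE(i, j, k)
    b = np.empty((q, q, q))             # b[i, j, k]: element of B held by PE(i, j, k)
    for i in range(q):
        for j in range(q):
            for k in range(q):
                a[i, j, k] = A[k, i]    # A broadcast backwards along j from the front face
                b[i, j, k] = B[i, j]    # B broadcast to the right along k from the left face
    # Each PE multiplies its two elements; the partial results are summed from top
    # to bottom (over i), so the bottom-face PE at (j, k) ends up with C[k, j].
    c = np.zeros((q, q))
    for j in range(q):
        for k in range(q):
            c[j, k] = sum(a[i, j, k] * b[i, j, k] for i in range(q))
    return c.T                          # reorder to the usual row/column layout

A = np.arange(9.0).reshape(3, 3)
B = np.arange(9.0, 18.0).reshape(3, 3)
assert np.allclose(dns_multiply(A, B), A @ B)   # matches the serial product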

Analysis

Assume that two matrices of size n × n are to be multiplied using p = q³ processors, i.e. q = p^(1/3) processors along each dimension of the cube. For the number of processors, p ≤ n³ holds.

Overall, the algorithm has a calculation time of:

   $T_P = \frac{n^3}{p}\,t_c + t_s \log p + t_w \frac{n^2}{p^{2/3}} \log p$
   $t_s$ stands for the startup time. This is the time needed to handle a message in the sending processor; it includes the time required to prepare the message, to execute the routing algorithm and to establish a connection between the two processors.
   $t_w$ stands for the transfer time per word, in this case for example the time to transmit one floating-point number.

The calculation time consists of:

  1 The broadcast operation, carried out for both matrices: $2\,\bigl(t_s + t_w \tfrac{n^2}{q^2}\bigr)\log q$
  2 The reduction for the result matrix C: $\bigl(t_s + t_w \tfrac{n^2}{q^2}\bigr)\log q$
  3 Multiplication of the sub-matrices: $\bigl(\tfrac{n}{q}\bigr)^3 t_c = \tfrac{n^3}{p}\,t_c$

This results in:

   $T_P = \frac{n^3}{p}\,t_c + 3\,\Bigl(t_s + t_w \frac{n^2}{q^2}\Bigr)\log q$

By inserting the condition q = p^(1/3) from above, one obtains the calculation time given above:

   $T_P = \frac{n^3}{p}\,t_c + t_s \log p + t_w \frac{n^2}{p^{2/3}} \log p$

The algorithm is cost-optimal for p = O(n³ / log³ n).
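
A brief sketch of where this condition comes from: the cost (processor count times parallel time) has to stay within a constant factor of the serial Θ(n³) work, and the dominant communication term gives the constraint.

\begin{align*}
p \cdot T_P &= n^3\,t_c + t_s\,p \log p + t_w\,n^2 p^{1/3} \log p ,\\
t_w\,n^2 p^{1/3} \log p = O(n^3) &\iff p^{1/3} \log p = O(n) \iff p = O\!\left(\frac{n^3}{\log^3 n}\right).
\end{align*}

For such p, the startup term $t_s\,p \log p$ is automatically $O(n^3)$ as well.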

