Parallel breadth-first search


Parallel breadth-first search (BFS) is, in computer science, a variant of the breadth-first search algorithm for graphs in which the algorithm is executed concurrently on multiple processors (parallelized).

In breadth-first search, all nodes that can be reached from a starting node are visited, and it is ensured that nodes at a smaller distance from the starting node are visited before nodes at a greater distance.

General

Parallel breadth-first search is the basis for many other algorithms and is used in particular for the analysis of large databases. With the help of breadth-first search one can, for example, determine the maximum flow, connected components, or centrality measures in graphs. The algorithm is also used in the Graph500 benchmark to compare the performance of supercomputers when processing large amounts of data. A similar variant of this benchmark is the Green Graph500, which compares performance in relation to power consumption.

Memory models

The underlying memory model and the communication between the individual processes play an important role when considering parallel algorithms. A basic distinction is made between two memory models:

  • With shared memory, each processor can access the same shared main memory.
  • With distributed memory, each processor has its own local memory, separate from that of the other processors. In this model, the processors must exchange messages in order to share data.

Level-based algorithm

[Animation: example of the level-based algorithm; legend: node not yet visited, visited node, currently visited edge]
In sequential operation, breadth-first search is usually implemented with a FIFO queue (see breadth-first search). A possible implementation of the algorithm can be described as follows (a code sketch is given after the list):

1. Mark all nodes as unvisited.
2. Choose a starting node, add it to the current queue Q, and mark it as visited.
3. Repeat as long as Q is not empty:
  1. Remove the first node from Q.
  2. Add all adjacent nodes that have not yet been visited to the next queue Q' and mark them as visited.
4. If Q' is not empty, swap Q and Q' and go to step 3.
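
A minimal Python sketch of these steps might look as follows; the adjacency-list representation (a dictionary mapping each node to its neighbors) and all names are chosen only for illustration:

    from collections import deque

    def bfs_levels(adj, start):
        """Level-synchronous BFS: returns the distance of every reachable node
        from `start`. `adj` maps each node to an iterable of its neighbors."""
        visited = {start}            # steps 1-2: mark the starting node as visited
        current = deque([start])     # Q: queue of the current level
        dist = {start: 0}
        level = 0
        while current:               # step 3: process the current level
            nxt = deque()            # Q': queue of the next level
            while current:
                v = current.popleft()        # step 3.1: remove the first node from Q
                for w in adj[v]:             # step 3.2: collect unvisited neighbors
                    if w not in visited:
                        visited.add(w)
                        dist[w] = level + 1
                        nxt.append(w)
            current = nxt            # step 4: swap Q and Q'
            level += 1
        return dist

    # Example: a path graph 0 - 1 - 2 - 3
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(bfs_levels(adj, 0))        # {0: 0, 1: 1, 2: 2, 3: 3}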

The following loop invariant holds for the algorithm: when step 3 is performed, Q (the current level) contains exactly the nodes at distance k from the starting node; in step 4, Q' (the next level) contains the nodes at distance k + 1. Since this algorithm searches for further nodes based on the node set of the current level, it is also referred to as a top-down algorithm. If the number of nodes in the current level is very large, a bottom-up variant is also available: in this variant, the search proceeds from all nodes that have not yet been visited, each of which checks whether one of its neighbors lies in the current level. The two variants can also be combined into a hybrid algorithm.
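
To illustrate the difference in direction, the following hedged sketch shows one bottom-up level expansion; the function name and the data structures are assumptions made for the example:

    def bottom_up_step(adj, frontier, visited):
        """One bottom-up level expansion: every *unvisited* node checks whether
        it has a neighbor in the current frontier, instead of the frontier
        scanning its outgoing edges as in the top-down variant."""
        next_frontier = set()
        for v in adj:
            if v not in visited and any(w in frontier for w in adj[v]):
                next_frontier.add(v)
        visited |= next_frontier
        return next_frontier

    # In a hybrid (direction-optimizing) BFS one would switch to this step
    # whenever the current level contains a large fraction of all nodes.
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    visited = {0}
    print(bottom_up_step(adj, {0}, visited))   # {1}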

In order to execute the algorithm shown above in parallel, the loop can be divided among several processes. To do this, it must be ensured that the state of the queues and of the data structure that stores the status of each node (visited or not) is synchronized between the processors. Without further adaptation, this algorithm is only possible in the shared memory model.
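
As an illustration of this idea in the shared memory model, the following sketch expands one level with a Python thread pool: the workers only read the shared visited set, and the subsequent sequential merge acts as the synchronization point. The function name and the chunking scheme are assumptions for the example, not a statement about how production implementations divide the work.

    from concurrent.futures import ThreadPoolExecutor

    def parallel_level_step(adj, frontier, visited, workers=4):
        """Expand one BFS level with several workers that share `visited`.
        Each worker scans a slice of the frontier and proposes neighbor
        candidates; the merge step resolves duplicates."""
        chunks = [frontier[i::workers] for i in range(workers)]

        def scan(chunk):
            return [w for v in chunk for w in adj[v] if w not in visited]

        with ThreadPoolExecutor(max_workers=workers) as pool:
            proposals = list(pool.map(scan, chunks))

        next_frontier = []
        for cand in proposals:            # synchronization point: update the shared
            for w in cand:                # visited set sequentially to avoid races
                if w not in visited:
                    visited.add(w)
                    next_frontier.append(w)
        return next_frontier

    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    visited = {0}
    print(parallel_level_step(adj, [0], visited))   # [1, 2]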

BFS with partitioning

One way of dividing the algorithm among several processes with distributed memory is to assign each process its own part of the nodes of the graph. This approach is known as partitioning.

1D partitioning

In the simplest case, each of the p processes is assigned n/p nodes together with all of their outgoing edges, where n denotes the number of nodes and p the number of processes. Each processor keeps the status of the nodes assigned to it in its own memory. Because of this partitioning, communication between the processors must take place after each step, since the outgoing edges do not necessarily point to nodes that are assigned to the same processor. Since potentially every processor has to send a message to every other processor, all-to-all communication must be used here.
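
The following sketch simulates this scheme within a single Python program: the p "processes" are just list indices, ownership is assigned by a simple modulo rule, and the all-to-all exchange is imitated with explicit outbox buffers; all of these choices are assumptions made for illustration.

    def bfs_1d_partitioned(adj, start, p=2):
        """Simulate 1D-partitioned BFS: process i owns the nodes v with
        v % p == i, keeps their visited flags locally, and after every level
        an all-to-all exchange delivers the discovered nodes to their owners."""
        owner = lambda v: v % p                       # assumed ownership rule
        visited = [set() for _ in range(p)]           # per-process visited flags
        frontier = [set() for _ in range(p)]          # per-process current level
        visited[owner(start)].add(start)
        frontier[owner(start)].add(start)
        while any(frontier):
            # each process scans its own frontier and sorts the targets by owner
            outbox = [[set() for _ in range(p)] for _ in range(p)]   # outbox[src][dst]
            for i in range(p):
                for v in frontier[i]:
                    for w in adj[v]:
                        outbox[i][owner(w)].add(w)
            # all-to-all exchange; every owner then filters out visited nodes
            new_frontier = [set() for _ in range(p)]
            for j in range(p):
                received = set().union(*(outbox[i][j] for i in range(p)))
                new_frontier[j] = received - visited[j]
                visited[j] |= new_frontier[j]
            frontier = new_frontier
        return visited

    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(bfs_1d_partitioned(adj, 0))   # [{0, 2}, {1, 3}]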

2D partitioning

In many cases a graph has very few edges compared to the number of possible edges. If such a graph is represented by an adjacency matrix, the result is a very sparse matrix. If the nodes visited in the current level are represented as a vector, the next step of the breadth-first search corresponds to multiplying this vector by the adjacency matrix. The multiplication of a sparse vector by a sparse matrix (sparse matrix-vector multiplication, SpMV) can be implemented efficiently.
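
A compact sketch of this formulation, using a dense NumPy matrix for readability (a real implementation would use a sparse format such as CSR), could look like this:

    import numpy as np

    def bfs_spmv(A, start):
        """BFS expressed as repeated matrix-vector products: x marks the
        current frontier, A.T @ x marks all neighbors of the frontier, and
        already-visited nodes are masked out afterwards."""
        n = A.shape[0]
        x = np.zeros(n, dtype=bool)
        x[start] = True
        visited = x.copy()
        dist = np.full(n, -1)
        dist[start] = 0
        level = 0
        while x.any():
            level += 1
            y = (A.T @ x) > 0          # one BFS step as a matrix-vector product
            x = y & ~visited           # keep only newly discovered nodes
            visited |= x
            dist[x] = level
        return dist

    # Adjacency matrix of a path graph 0 - 1 - 2 - 3 (undirected, hence symmetric)
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    print(bfs_spmv(A, 0))              # [0 1 2 3]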

The adjacency matrix can be divided into several sub-matrices, each of which corresponds to a part of the edges of the graph. The breadth-first search can then be parallelized by having each processor process its own part of the adjacency matrix.

How it works using an example

[Figure: undirected graph with 4 nodes and 3 edges, together with its associated adjacency matrix]

  • We choose the number of columns into which the matrix is divided as C = 2 and the number of rows as R = 2. Since the adjacency matrix is a 4 × 4 matrix, each of the resulting sub-matrices has 2 rows and 2 columns. The partitioning yields four sub-matrices, each of which is assigned to one processor.

  • In the following, a single calculation step of the breadth-first search is considered for the processor that owns the sub-matrix containing the edge between nodes 1 and 2 (a code sketch of a full 2D step follows the list):
  1. The processor calculates the part of the breadth-first search given by the edges contained in its sub-matrix; here this is the edge between nodes 1 and 2. For this calculation it first needs a vector x with two entries as input, which indicates whether nodes 1 and 2 were visited in the last step of the breadth-first search.
  2. In order to obtain an up-to-date state of the vector x, all processors in the same processor column must communicate, since each of these processors could have visited one of the two nodes in the last step. In this example, only the vector x needs to be exchanged.
  3. Subsequently, the sub-matrix is multiplied by the vector x in order to obtain the nodes that are reached in the next step. Let x = (1, 0)^T (i.e., node 1 was visited in the previous step); the multiplication then yields a vector whose one in the second component indicates that node 2 will be visited in the next step.
  4. The result of this matrix-vector multiplication must now be exchanged with all processors in the same processor row. This communication step compares the visited status of the nodes and resolves possible conflicts in case several processors visit the same node. Then the process starts all over again.
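
The following sketch simulates one such step on a 2 × 2 processor grid. The concrete graph (a path on four nodes, one possible graph with 4 nodes and 3 edges), the block layout, and the way the two communication phases are imitated with ordinary Python lists are assumptions made for illustration and need not match the figure above exactly.

    import numpy as np

    def bfs_2d_step(blocks, x_parts, visited_parts):
        """One level of 2D-partitioned BFS on an R x C grid of sub-matrices.
        blocks[r][c] is the sub-matrix of processor (r, c); x_parts[c] is the
        frontier piece for the node range of block column c; visited_parts[r]
        holds the visited flags for the node range of block row r."""
        R, C = len(blocks), len(blocks[0])
        # Phase 1: processors in the same grid column agree on their frontier
        # piece (here the pieces are already shared lists, so this is a read).
        # Phase 2: every processor multiplies its own sub-matrix by that piece.
        partial = [[blocks[r][c] @ x_parts[c] for c in range(C)] for r in range(R)]
        # Phase 3: combine the partial results along every processor row and
        # filter out already-visited nodes (conflict resolution).
        new_parts = []
        for r in range(R):
            combined = np.sum(partial[r], axis=0) > 0
            fresh = combined & ~visited_parts[r]
            visited_parts[r] |= fresh
            new_parts.append(fresh)
        return new_parts

    # 4 x 4 adjacency matrix of the path graph 1 - 2 - 3 - 4, split into 2x2 blocks
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    blocks = [[A[0:2, 0:2], A[0:2, 2:4]],
              [A[2:4, 0:2], A[2:4, 2:4]]]
    x_parts = [np.array([True, False]), np.array([False, False])]   # node 1 visited
    visited = [np.array([True, False]), np.array([False, False])]
    print(bfs_2d_step(blocks, x_parts, visited))   # node 2 is reached next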

The advantage over 1D partitioning is that each processor only needs to communicate with the processors in its own row and column, not with all other processors. Compared to 1D partitioning, however, two communication steps are required per level instead of one.


References

1. Yasui, Y., Fujisawa, K. (2017): Fast, Scalable, and Energy-Efficient Parallel Breadth-First Search. In: Anderssen, B. et al. (eds.): The Role and Importance of Mathematics in Innovation. Mathematics for Industry, vol. 25. Springer, Singapore.
2. Ueno, K., Suzumura, T., Maruyama, N. et al.: Data Sci. Eng. (2017) 2: 22. doi:10.1007/s41019-016-0024-y
3. Aydin Buluç and Kamesh Madduri (2011): Parallel breadth-first search on distributed memory systems. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New York, NY, USA, Article 65, 12 pages. doi:10.1145/2063384.2063471