Polytope model

The polytope model (or, more generally, polyhedron model ) is a mathematical model that can be used by compilers to optimize sets of loops. The loops in the source program are described by polytopes , to which a correctness-preserving transformation is then applied. In the last step, the resulting polytopes are translated back into (target) code.

Establishing the transformation matrix

A transformation consists of two parts, the schedule and the allocation. The schedule specifies when a computation should take place, while the allocation specifies where the computation takes place (i.e. on which processor it is executed).

Calculation of a valid schedule

A schedule is valid when it receives all data dependencies .

If an iteration with the loop indices , the results of calculation required needs for the Schedule shall apply: . This means that all required values must have been calculated at an earlier point in time. ${\ displaystyle (i_ {1}, \ ldots, i_ {n})}$ ${\ displaystyle (i '_ {1}, \ ldots, i' _ {n})}$ ${\ displaystyle t}$ ${\ displaystyle t (i_ {1}, \ ldots, i_ {n})> t (i '_ {1}, \ ldots, i' _ {n})}$

Calculation of a valid allocation

In contrast to the schedule, there are no restrictions on the allocation.

In principle, there is always the option of having the calculation performed by just one processor. However, you lose all parallelism. It is therefore advisable to distribute the calculation over as many processors as possible in order to maximize parallelism. However, one must take into account that this means that more data must be sent between the processors. These additional costs for communication can easily exceed the profit from the parallel calculation.

Example (automatic parallelization)

From the source program to the polytope

Let us consider the following program, which consists of a perfectly nested set of loops. The body of the loop contains a statement S .

   for i:= 0 to n do
       for j:= 0 to i+2 do
S:         A(i, j):= A(i-1, j) + A(i, j-1)
       end
   end

To represent the loop as a polytope, it is sufficient to write the upper and lower bounds as inequalities:

${\ displaystyle 0 \ leq i \ leq n \ wedge 0 \ leq j \ leq i + 2}$

or as a linear system of equations in matrix representation (one row corresponds to an inequality, columns: i, j, n, 1)

${\ displaystyle {\ color {Blue} {\ begin {pmatrix} 1 & 0 & 0 & 0 \\ - 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & -1 & 0 & 2 \ end {pmatrix}}} \ geq {\ begin {pmatrix} 0 \\ 0 \\ 0 \\ 0 \ end {pmatrix}}}$

Dependency analysis

The array cell is A(i,j)overwritten in each loop pass. For the calculation of A(i,j)you need the values of A(i-1,j)and A(i,j-1). This creates two data dependencies: Each iteration (i,j)depends on both the iteration (i-1,j)and (i,j-1)on. Both dependencies must be taken into account in the next step when calculating the schedule.

All dependencies can be calculated algorithmically using a dependency analysis method.

Establishing the transformation matrix

A correct schedule that contains both dependencies is e.g. B. . ${\ displaystyle t (i, j) = i + j}$

Interpretation:

The first step is to A(0,0)calculate
In the second step, A(1,0)and are A(0,1)calculated
In the third step A(2,0), A(1,1)andA(0,2)
etc.

In order to enable maximum parallelism in every calculation step, we choose the allocation ${\ displaystyle p (i, j) = i}$

This results in the following transformation matrix: ${\ displaystyle T = {\ begin {pmatrix} 1 & 1 \\ 1 & 0 \ end {pmatrix}}}$

(Explanation: first line = schedule (i + j), second line = allocation (i), first column: i, second column: j)

Transformed polytope

${\ displaystyle T ^ {- 1} = \ color {Red} {\ begin {pmatrix} 0 & 1 \\ 1 & -1 \ end {pmatrix}}}$

${\ displaystyle {\ color {Blue} {\ begin {pmatrix} 1 & 0 & 0 & 0 \\ - 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & -1 & 0 & 2 \ end {pmatrix}}} {\ begin {pmatrix} {\ color {Red} 0} & {\ color {Red} 1} & 0 & 0 \\ {\ color {Red} 1} & {\ color {Red} -1} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \ end {pmatrix}} = {\ begin {pmatrix} 0 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 1 & -1 & 0 & 0 \\ - 1 & 2 & 0 & 2 \ end {pmatrix}}}$

The loop indices are also transformed by: and ${\ displaystyle T ^ {- 1}}$ ${\ displaystyle i = p}$ ${\ displaystyle j = tp}$

Generation of the target program

The last step is to generate code that precisely enumerates the points from the target polyhedron and adheres to the correct order (more precisely the lexicographical order). This is calculated algorithmically by so-called scanning algorithms (e.g. Fourier-Motzkin elimination or the Quillerè method ).

The following (synchronous) target program is obtained:

for t:= 0 to 2n+2 do
    parfor p:= max(0, ceil(t/2)-1) to min(t, n) do
          A(p, t-p):= A(p-1, t-p) + A(p, t-p-1)
    end
end

The outer loop specifies global time steps, while the second loop represents the parallel calculation. Since this program lacks the code for communication between the processors, it cannot run directly on systems with distributed memory. However, it can be implemented directly with shared memory, e.g. B. as an OpenMP program.

application

The most important possible uses include: