Loop unrolling

Loop unrolling (sometimes also loop unwinding ), the "stretching of cyclic calculation plans " or "stretching a loop", is an optimization method that can accelerate the runtime of a computer program at the expense of the size of its program file. This creates a loop

either replaced by an equivalent loop that contains multiple copies of the loop body and has a smaller number of iterations,
or completely resolved by stringing together the loop body as often as the original number of runs.

This means that the loop condition is checked less often or not at all. It is also often possible to subsequently carry out further optimizations of the (unrolled) loop body. The number of copies of the original loop body will roll-off ( English unroll factor called).

Modern compilers try to unroll loops automatically if you want to optimize for speed. However, if you know exactly on which architecture a program will later be executed, manual optimization can be more effective.

If the repetitions come about through self-calls ( recursions ), procedure calls and returns can be saved by duplicating the procedure body. In such cases one speaks of recursion unrolling .

idea

The original idea of loop unrolling was to optimize the relationship between the loop body and loop control instructions. Since today's processors have complex mechanisms for optimizing the program flow (e.g. jump prediction and pipelining ), the unrolled loop body can be further optimized.

approach

In the case of a loop, the so-called loop condition must be checked for each run to determine whether another run should take place. This test can make up a large proportion of the running time, especially for loops with a very small body.

for( i=0 ; i<8 ; i=i+1 )
    dest[i] = src[i];

Such a loop consists only of a few instructions: copy, increase the counter, check the condition and, if necessary, a jump. This means that the processor spends about half of the time only with control instructions.

An obvious optimization is to replace the loop with n copies of its loop body ( complete unrolling ). Loop control instructions and counters are then completely omitted:

dest[0] = src[0];
dest[1] = src[1];
dest[2] = src[2];
dest[3] = src[3];
dest[4] = src[4];
dest[5] = src[5];
dest[6] = src[6];
dest[7] = src[7];

Partial unrolling

If a complete unrolling is not possible or not useful, the original loop body can alternatively be executed several times within one iteration ( partial unrolling ). This reduces the number of control instructions to be executed.

The original loop body is replaced by another loop, the passes of which correspond to the roll-off factor (here: 2):

for( i=0 ; i<8 ; i=i+2 ) {
    for( j=0; j<2; j=j+1 )
      dest[i+j] = src[i+j];
}

Then the inner loop is completely unrolled:

for( i=0 ; i<8 ; i=i+2) {
    dest[i]   = src[i];
    dest[i+1] = src[i+1];
}

A partial unroll can be used, though

the number of loops during the translation is not yet known (see also Duff's Device ),
the loop body is too long or the number of iterations is too high, d. H. the generated machine code would be too large.

Waste

There is always the possibility that the number of original iterations is not an integral multiple of the iterations of the unrolled loop. There then remains a remainder or a partial pass of the unrolled loop, which must be treated separately.

Example: The number of iterations does not result until runtime, the original loop has been unrolled 10 times (i.e. the original loop body is 10 times in the new loop body). At runtime, the result is that 1005 iterations are to be calculated, i.e. 100 runs that calculate 10 times - there remains (in this program run) a remainder of 5 iterations.

The following measures can now be taken:

Unroll the rest completely, provided the number of loop passes is known at the time of compilation,
process the rest individually in a separate loop (as in the original loop),
only partially execute the unrolled loop body (see Duff's Device ) or
pad the data to be processed to an integer multiple of the inner loop.

advantages

Fewer commands to be executed: The elimination or the less frequent execution of the control instructions reduces the execution time.
Improved register allocation : The loop counter has to be loaded into a register less often or not at all . This means that more registers are available for other calculations.
Latency masking ( English "Latency hiding" ): If there is a delay by accessing a slower memory to the next can be started (unrolled) run or even processed in parallel ( pipelining ).
Removing shared-code ( English "common subexpression elimination" ): Multiple calculations of the same value may be removed.
Vectorization : The instructions can be vectorized, i. H. several calculations can be combined within one command (e.g. from SSE ).
Optimized memory access: Memory accesses can be rearranged so that they better take into account the prefetching of the processor or write-through of the cache .

disadvantage

Manual unrolling usually worsens the readability of the source text.
The size of the program grows as the number of instructions increases. It must be taken into account that the probability increases that program parts will be displaced from the instruction cache and the slow reloading will compensate for the speed advantage.
Possibly. It is no longer possible to save all of the variables required by the loop in registers , especially in the case of low-register architectures such as B. x86 . Then you have to switch to the main memory (or the processor cache), which is usually slower.

Manual unwinding

When exactly manual processing is better than automatic compiler optimization, it cannot be described in general due to the ongoing development. If in doubt, runtime measurements must be made.

In general, the automatic optimization is made more difficult

the more complex the loop is (complex calculations, a lot of data, inner loops, dependent number of runs),
when conditional statements are executed (control flow dependencies) or
when (partially) recursive calculations are performed (data flow dependencies).

example

A simple example in which today's compilers (status: 2001) do not succeed in automatically unrolling the loop is:

double sum = 0.0;

for (int i=0; i<n; ++i)
    for (int j=0; j<n; ++j)
        sum += x[i][j] * (i*i + j*j);

The reason is that the factor (i ² + j ² ) depends on both the outer and the inner loop index. This bow can easily be unrolled by hand.

Web links

Intel 64 and IA-32 Architectures Software Developer Manuals , Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, and 3C (PDF; 15.4 MB)
AMD Developer Central: Developer Guides & Manuals , Software Optimization Guide for AMD Family 15h Processors (PDF; 2.0 MB)
Chapter 7, pp. 8-10. (PDF; 2.2 MB), Michael Abrash's Graphics Programming Black Book with a loop unrolling example in x86 assembler (English)
Optimizing subroutines in assembly language (PDF; 873 kB) Agner Fogs Optimierungshandbuch u. a. with the loop unrolling technique (English, 2012)

Individual evidence

↑ Rutishauser, Heinz : Automatic calculation plan production with program-controlled calculating machines. In: Messages from the Institute for Applied Mathematics at ETH Zurich. Birkhäuser, 1961, accessed January 6, 2019 .
^ Bauer, FL ; Wössner, H .: Algorithmic Language and Program Development . 2nd Edition. Springer, Berlin / Heidelberg / New York 1984, ISBN 3-540-12962-6 , pp. 287 .
↑ Jeffrey D. Ullman, Alfred V. Aho: Principles of compiler design. Addison-Wesley, 1977, ISBN 0-201-10073-8 .
^ Steven S. Muchnick: Advanced Compiler Design & Implementation. Morgan Kaufmann Publishers, 1997, ISBN 1-55860-320-4 .
↑ Intel 64 and IA-32 Architectures Optimization Reference Manual
↑ Intel C ++ Compiler XE 13.0 User and Reference Guides: O
↑ GCC 4.7.2 Manual: Optimization Options
^ ^A ^b Mike Wall: Using Block Prefetch for Optimized Memory Performance. (PDF; 136 kB) mit.edu , March 19, 2002, accessed on September 22, 2012 (English).
↑ ^a ^b Tom Duff on Duff's Device at www.lysator.liu.se ( English )
^ Agner Fog: Optimizing subroutines in assembly language. (PDF; 873 kB) Copenhagen University College of Engineering, February 29, 2012, p. 100 , accessed on September 22, 2012 (English): "12.11 Loop unrolling"
↑ Stefan Goedecker, Adolfy Hoisie: Performance Optimization of Numerically Intensive codes . SIAM, 2001, ISBN 978-0-89871-484-5 , pp. 59 .

[1] Rutishauser, Heinz : Automatic calculation plan production with program-controlled calculating machines. In: Messages from the Institute for Applied Mathematics at ETH Zurich. Birkhäuser, 1961, accessed January 6, 2019 .

[2] Bauer, FL ; Wössner, H .: Algorithmic Language and Program Development . 2nd Edition. Springer, Berlin / Heidelberg / New York 1984, ISBN 3-540-12962-6 , pp. 287 .

[ullman77-3] Jeffrey D. Ullman, Alfred V. Aho: Principles of compiler design. Addison-Wesley, 1977, ISBN 0-201-10073-8 .

[muchnick-4] Steven S. Muchnick: Advanced Compiler Design & Implementation. Morgan Kaufmann Publishers, 1997, ISBN 1-55860-320-4 .

[intel_opt_man-5] Intel 64 and IA-32 Architectures Optimization Reference Manual

[6] Intel C ++ Compiler XE 13.0 User and Reference Guides: O

[7] GCC 4.7.2 Manual: Optimization Options

[amd2002-8] A ^b Mike Wall: Using Block Prefetch for Optimized Memory Performance. (PDF; 136 kB) mit.edu , March 19, 2002, accessed on September 22, 2012 (English).

[duff83-9] Tom Duff on Duff's Device at www.lysator.liu.se ( English )

[fog2012-10] Agner Fog: Optimizing subroutines in assembly language. (PDF; 873 kB) Copenhagen University College of Engineering, February 29, 2012, p. 100 , accessed on September 22, 2012 (English): "12.11 Loop unrolling"

[11] Stefan Goedecker, Adolfy Hoisie: Performance Optimization of Numerically Intensive codes . SIAM, 2001, ISBN 978-0-89871-484-5 , pp. 59 .