Out-of-order execution

from Wikipedia, the free encyclopedia

Out-of-order execution ( English for about execution in a different order [as in the program code] ) the possibility of machine instructions referred to in the execution units of a (usually superscalar ) processor in a different order to perform when she in program code are without, however, the result to change. As a result, more commands can be executed in parallel, since the processing units of the processor are better utilized. The opposite of out-of-order execution is in-order execution , in which the commands are processed strictly according to the program sequence, such as in the Von-Neumann cycle . Because the result of the out-of-order execution must be the same as when executing in the program order, out-of-order execution only increases the speed for command sequences in which a subsequent command does not depend on the result of the previous command (but only commands that have been executed “far enough away” beforehand).

motivation

A processor has several functional units, u. a. arithmetic and logic unit (ALU), floating point unit (FPU), load and store unit and special (floating point) vector units. With superscalar processors, they make it possible to use instruction parallelism and thus to increase the execution speed: While a previous instruction is still occupying a functional unit, the free functional units are already available for the following instruction. However, due to data dependencies between the instructions or if they require the same functional unit, parallel execution is not always possible. Often commands could be executed in parallel, but they are not in direct succession in the program code; a processor without out-of-order execution adheres strictly to the execution order specified in the program and cannot then execute it in parallel.

Rearranging the commands in the program by hand or by the compiler can lead to better results on an in-order processor, but is generally not optimal because influences during runtime cannot or can hardly be taken into account. In particular, the execution time of memory accesses cannot be predicted. This depends on whether the cache can deliver the requested data or the translation buffer ( TLB ) can deliver the requested page translation . This is usually difficult or impossible to predict at compile time .

A dynamic method for rearranging the commands, such as out-of-order execution execution, can react accordingly at execution time and thus execute more commands in parallel and thus speed up processing.

Working principle

Basic principle of disorderly command execution

In earlier processors, the individual commands were processed one after the other, with each command being executed according to a fixed sequence of partial steps ( in-order execution ), cf. u. a. Von Neumann cycle . These sub-steps are:

  1. Load instruction ( instruction fetch )
  2. If the operands are available (for example in registers), the instruction is passed to the appropriate functional unit for execution. If one or more operands are not available in the current instruction cycle (mostly because they still have to be loaded from memory), the processor waits until they are loaded.
  3. The command is executed by the appropriate functional unit.
  4. The functional unit writes the result in a register.

The new concept ( out-of-order execution ) processes the commands in the following order:

  1. Load instruction ( instruction fetch ).
  2. Enter the command in a queue ( instruction buffer ).
  3. The instruction waits in the instruction buffer until its operands are loaded. The command is then allowed to leave the queue, often before older commands entered earlier.
  4. The command is transferred to the appropriate execution unit and executed there.
  5. The result is entered in a queue of results. ( register retirement file / buffer )
  6. Only after the results of all previous, older commands have been written into the register, the result of the current command is also written into the register.

implementation

Scoreboarding or the Tomasulo algorithm is usually implemented . With scoreboarding, occupied resources are marked on a central scoreboard and released again after they have been used. The Tomasulo algorithm implements dynamic scheduling . Several commands are executed at the same time as long as the operands are independent. Read-after-write hazards are prevented by delaying the command and write-after-read hazards by temporarily storing a new value. Register renaming is also used to reduce the data dependencies .

Almost all modern x86 processors from the Intel Pentium Pro or AMD K5 can execute commands out of order . Known exceptions are the IDT WinChip , VIA-C3 and -C7 series, which were developed by Centaur Technology , and the Intel Atom series up to and including Cedar Trail.

See also

literature

  • Oberschelp, Vossen: Computer structure and computer structures . 9th edition. Oldenbourg, 2003, ISBN 3-486-27206-3 .