Jump prediction

The branch prediction ( English prediction branch ) is in the (micro) computer architecture used and treated the problem of microprocessors , all stages of its pipeline utilize whenever possible and appropriate.

Overview

Jump prediction (also branch prediction) means:

The prediction of whether a conditional jump will be performed
To determine the destination address of a jump

There are two types of jumps :

conditional jump: J condition address
unconditional jump: JMP address JMP Computed address CALL address CALL Calculated address RET

In modern processors, machine instructions are executed in several processing steps within a processing chain ( pipeline ). In order to maximize the performance of the processor, after an instruction has been loaded into the pipeline and e.g. B. to continue with the analysis of the command in the next step, the loading of the next command is started at the same time. So there are (mostly) a whole series of commands for sequential processing in the pipeline. If it is now determined at the end of the pipeline that a conditional jump is being carried out, then all of the commands pending and partially processed in the pipeline are invalid. The processor now clears the pipeline and then reloads it from the new program code address. The more stages the pipeline has, the more intermediate results that have already been calculated must be discarded and the more cycles the pipeline is only partially used. This reduces the processing speed of programs and reduces energy efficiency.

The goal: to recognize a jump command as early as possible and to recognize its target address, so that the target address data can follow the jump command into the pipeline.

functionality

The jump prediction can be divided into two types.

Static jump prediction

The static branch prediction does not change its prediction while the program is running. It only achieves a prediction accuracy of 55 to 80%. This technique is based on known facts, e.g. B. that loops often perform jumps, while this occurs less often in selection processes. Some compilers also support the mechanism with special flags in the command code (prediction is built in during compilation).

Dynamic jump prediction

The dynamic jump prediction takes place during runtime through an electronic interconnection within the CPU. It uses various techniques to generate a prediction. Their prediction accuracy is up to 98%. The simplest method speculates on the basis of the jump direction: Jumps in the program code back are usually loops that are often run through several times, so that the pipeline is prophylactically filled with the previous code.

Detected unconditional jumps are simply sorted out in advance from the command queue and these are then further filled with the code from the jump target before they enter the pipeline ("Branch folding").

Per-address history

If a jump is recognized, it is logged and used for further jump predictions (in loops, jumps will usually occur more often - so the jump only needs to be recognized once). This technology is implemented e.g. B. from the Branch History Table (BHT) .

Global history

With the global prehistory, the path that a program has taken is logged over a limited number of steps. If you now recognize that two jumps are similar, they could take the same path - so the logic may already know part of what the program will do in the future. The path is usually stored in a shift register . Either a counter or a (slow) automaton is used for the predictions . This technology is implemented e.g. B. from Branch Target Buffer (BTB).

Static jump prediction techniques

Stall / Freeze

This technique just stops the whole pipeline briefly. If a jump command is detected in the ID stage (instruction decoding), the pipeline is stopped ( stalled / frozen ) until the EX stage (execution) knows whether the jump is being executed.

Jump does not take place: continue normally
Jump is executed: Set the program counter to the jump target address and fill the pipeline with the instructions at the jump target.

Predict taken

Just assume that every conditional jump will be executed; That is, if it is determined in the ID stage that a jump instruction is present, the CPU starts to determine the target address and to load the data there into the pipeline as follow-up instructions. If, however, it is determined in the EX stage that the jump does not take place, the previous work was for nothing (used for loops).

Predict not taken

Assumes that any conditional jump is not executed and continues normally. This means (if the jump is really not carried out) a good gain in performance . If it is found in the EX stage that the jump is carried out contrary to expectations, the follow-up instruction must be stopped, the PC must be set to the jump target address and the pipeline must then be filled (used in selection procedures).

Delayed branches

Delayed branches do not represent a jump prediction. Jump instructions are coded 1 to 3 instructions pulled forward in the instruction stream, the following 1 to 3 instructions are always executed regardless of the jump instruction.

They are therefore not transparent with regard to the interpretation of the machine language and are therefore an integral part of it.

 do
   *r3++ = *r1++ + *r2++;
 while (--r4);

.repeat
    dec  r4
    brz  .repeat
    ld   r0,r1+    ; diese drei Befehle
    add  r0,r2+    ; werden immer ausgeführt
    st   r3+,r0    ; unabhängig vom Sprungbefehl

The efficiency of this optimization strategy depends on how well it is possible to find statements that are independent of the jump result. In extreme cases, this does not work and the slots have to be filled with NOPs.

Dynamic jump prediction techniques

Branch History Table (BHT)

The BHT (also branch prediction buffer) tries, as its name suggests, to also log the last jumps. To do this, it uses part of the jump instruction address as a hash value . In general, the lower-order address part is used for this. Of course, these address parts cannot always be unique, so that collisions can occur (several different jumps occupy the same space in the table).

The table is updated after each jump.

n-bit carrier automat

Is a finite automaton that supplies forecast information.

1-bit automat: If a saved jump is taken, its bit is set from 0 to 1. One problem, however, is that it does not take alternating jumps into account (in the case of jumps that only take place every 2nd loop pass, the bit would be inverted again and again). The solution for this is an n-bit automaton.

n-bit carrier machine (n = 2)

n-bit carrier automat: This only sets the correctness bit to 0 after the n failures. In general, n = 2 is used. (Tests have shown that from n> 2 the increase in performance is only minimal.) During the first loop pass, the state is 00 and the condition is true. The status then changes to 01. If the condition is true again the next time the loop is run, the state becomes 10 and therefore predicts a true jump condition for all further jumps. If the condition is false on the second run, the status goes back to 00. If the state is 11, the jump condition must have been false twice before the prediction is "false" again. In other words: after jumping twice in the same direction, this direction is now also predicted for the further jumps. It can be calculated that with this method the probability of the correct prediction is 83%.

gshare

With gshare, the address part and the global history are linked with XOR and stored in a table. The information in the table is then used for branch prediction. gshare thus combines per-address history with global history . Since XOR is used as a hash process here, collisions can occur again.

The method takes place e.g. B. in the AMD Athlon and Pentium III application.

Overview

Procedure	accuracy	Low hardware expenditure	Timing
static at runtime	−−	++	++
static at compile time	-	++	++
Per-address history	+	+	+
Global history	+	+	+
gshare	++	+	+

Jump target prediction techniques

A jump target prediction is better than a mere jump prediction. As soon as you recognize in the ID stage that it is a jump, you can check whether this jump has already taken place and, if necessary, fetch its jump destination from a buffer. This means that you can set the program counter to this jump destination immediately and load the instructions there into the pipeline.

Branch Target Buffer (BTB)

The BTB (also branch target address cache or BTAC) is used to predict the next address before the command has been decoded, i.e. H. before it is determined whether it is a jump instruction at all. In this way, the otherwise unavoidable pipeline gap is avoided and the branching costs are thus reduced. The prediction is made on the basis of jumps stored in a table (previously actually executed).

This table contains:

Forecast information
Destination addresses
Tags

The BTB always returns an address. If an unknown jump is queried, it simply provides the following address. However, if a known jump is queried, it supplies the target address.

The BTB cannot always work correctly. Since z. If, for example, RETURN instructions have variable target addresses ( moving targets ), the BTB can save an incorrect target address for a correct jump. Since modern high-level programming languages are used for object-oriented programming, there are frequent method calls and thus many moving targets. In order to remedy this fatal weakness, BTBs are expanded by a call-return stack.

Call return stack

This stack saves all return addresses according to the LIFO principle. Furthermore, special call and return commands are assumed in the command set (a distinction is therefore made from a normal jump).

Special handling of both jumps in the Branch Target Buffer (BTB).

Call : When the call is made, the return address is stored on the call return stack.
Return : RET commands are specially marked in the BTB. When a command is fetched from an address marked in this way, the top address of the call-return stack is used instead of the target address from the BTB.

Web links

Elaboration on the most fundamental topics of computer architecture (overview) (Chapter 7 - Branch Prediction)
Simulator for branch prediction in processors with instruction phase pipelining on the website of one of the authors.
Error-free explanation of possible methods of branch prediction. Performance measurements by simulating SPEC95 benchmarks (English)

Individual evidence

↑ U. Brinkschulte, T. Ungerer: Mikrocontroller und Mikroprozessoren 2nd edition, 2007 , Springer, p. 328, table 7.6 online

[1] U. Brinkschulte, T. Ungerer: Mikrocontroller und Mikroprozessoren 2nd edition, 2007 , Springer, p. 328, table 7.6 online