Fused multiply-add

From Wikipedia, the free encyclopedia

The fused multiply-add operation (FMA operation) is a variant of the multiply-accumulate operation (MAC) for floating-point numbers and is used on some microprocessors with floating-point units to optimize calculations. In contrast to the conventional operation, also known as unfused multiply-add, the fused multiply-add operation performs the calculation at full precision and rounds the result only once, at the end of the calculation.

The technique was developed by IBM Research at the end of the 1980s, but initially found little use. With increasing integration density, implementing FMA in GPUs, DSPs and CPUs became practical. The FMA operation is specified in the IEEE 754-2008 standard.

Application

Operations of the form a + b · c occur frequently in numerical algorithms. This is the case, among other things, when evaluating scalar products, in matrix operations, and in numerical integration.

In the conventional unfused multiply-add operation with N digits, the product b · c is first calculated and rounded to N digits, then a is added and the final result is rounded again to N digits. With the fused multiply-add operation, the rounding after the multiplication is omitted: the expression a + b · c is calculated at full precision and rounded only once, at the end, to N digits. This comes at the cost of slightly more hardware for the fused multiply-add unit. In some situations the rounding errors are somewhat reduced. In the classic scalar-product calculation this is only rarely the case, since usually |a| ≫ |b · c|. There, much more accuracy is gained with other techniques (e.g. by using 4 or 8 accumulators and a final horizontal sum).

Without FMA, the evaluation requires at least three different instructions:

  • Load 'b' and 'c' into registers (unless they are already in registers or the CPU supports memory operands)
  • Multiply 'b' and 'c'
  • Store this result temporarily in a register
  • Load 'a' into the accumulator (unless 'a' is already in a register or the CPU supports memory operands)
  • Add 'a' to the previously stored product '(b · c)'

If special opcodes are defined for operations of the form a + b · c, the evaluation is carried out by an optimized processing unit, the multiplier-accumulator (MAC), which executes this instruction in one step. Of the scheme above, only two instructions remain: loading the operands and the subsequent FMA instruction.

Advantages

  • increased floating-point performance through use of the MAC unit
  • better utilization of registers and more compact machine code

Disadvantages

  • FMA must be supported by compilers; the machine code generated in this way requires opcodes that differ from the usual 2-address or 3-address schemes. Making optimal use of FMA sometimes requires considerable finesse and explicit intervention by programmers.

Implementations
