Fused multiply-add

From Wikipedia, the free encyclopedia

The fused multiply-add operation (FMA operation) is a variant of the multiply-accumulate operation (MAC) for floating-point numbers and is used on some microprocessors with floating-point units to optimize calculations. In contrast to the conventional operation, also known as unfused multiply-add, the fused multiply-add operation performs the calculation at full precision and rounds the result only once, at the end of the calculation.

The technique was developed by IBM Research at the end of the 1980s, but initially found little use. With increasing integration density, implementing FMA in GPUs, DSPs and CPUs became practical. The FMA operation is specified in the IEEE 754-2008 standard.

Application

Operations of the form a + b · c occur frequently in numerical algorithms. This is the case, among other things, when evaluating scalar products, in matrix operations, and in numerical integration.

In the conventional unfused multiply-add operation with N digits, the product b · c is first calculated and rounded to N digits, then a is added and the final result is rounded again to N digits. With the fused multiply-add operation, the rounding after the multiplication is omitted: the expression a + b · c is calculated at full precision and rounded only once, at the end, to N digits. This comes at the cost of slightly more hardware for the fused multiply-add unit. In some situations the rounding errors are somewhat reduced. In the classic scalar-product calculation this is only rarely the case, since usually |a| ≫ |b · c|. There, much more accuracy is gained with other techniques (e.g. by using 4 or 8 accumulators and a final horizontal sum).

Without FMA, the evaluation requires at least three different instructions:

  • Load 'b' and 'c' into registers (unless they are already in registers or the CPU supports memory operands)
  • Multiply 'b' and 'c'
  • Store this result temporarily in a register
  • Load 'a' into the accumulator (unless 'a' is already in a register or the CPU supports memory operands)
  • Add 'a' to the previously stored product '(b · c)'

If special opcodes are defined for operations of the form a + b · c, the evaluation is carried out by an optimized processing unit, the multiplier-accumulator (MAC), which executes this instruction in one step. Of the scheme above, only two instructions remain: loading the operands and the subsequent FMA instruction.

Advantages

  • increased floating-point performance through use of the MAC unit
  • better utilization of registers and more compact machine code

Disadvantages

  • FMA must be supported by compilers; the machine code generated in this way requires opcodes that differ from the usual 2-address or 3-address schemes. Making optimal use of FMA sometimes requires considerable finesse and explicit intervention by programmers.

Implementations
