Multiply accumulate

Multiply-accumulate (MAC), also called multiply-add (MAD), is an arithmetic operation in which two factors are multiplied and the product is added to a running sum (the accumulator):

${\displaystyle a \leftarrow a + (b \cdot c)}$

This operation is used extensively in digital signal processing. In modern FPGAs and in the design of application-specific integrated circuits (ASICs), it is provided as part of DSP blocks (hardware units); as a machine instruction it has been available in many signal processors since the 1980s and in conventional CPUs since the early 2000s. Fused multiply-accumulate is a multiply-accumulate instruction with higher computational accuracy.
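As an illustration of why the operation matters in signal processing, the following minimal sketch computes a FIR filter in which every inner step is one multiply-accumulate. The function name `fir_filter` and the convention that samples before the start count as zero are assumptions made for this example.

```python
def fir_filter(samples, coeffs):
    """Return y[n] = sum over k of coeffs[k] * samples[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0.0  # the accumulator a
        for k, coeff in enumerate(coeffs):
            if n - k >= 0:  # samples before the start are treated as zero
                # One multiply-accumulate step: a <- a + (b * c)
                acc = acc + coeff * samples[n - k]
        output.append(acc)
    return output
```

With the coefficients `[0.5, 0.5]` this acts as a two-tap moving average, so the input `[1.0, 2.0, 3.0, 4.0]` yields `[0.5, 1.5, 2.5, 3.5]`.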

By extending the hardware multiplier, processors can execute this operation just as quickly as a plain multiplication. Typical execution times are, for example, 2 clock cycles (40 ns) on the Texas Instruments TMS320C40 at a 50 MHz clock frequency and 5 clock cycles (2 ns) on the Intel Haswell at, for example, a 2.5 GHz clock frequency.
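The execution times quoted above follow directly from the cycle count and the clock frequency. A small sketch of that arithmetic, with a helper name (`latency_ns`) chosen only for this illustration:

```python
def latency_ns(clock_cycles, clock_mhz):
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f MHz lasts 1000 / f nanoseconds.
    """
    return clock_cycles * 1000.0 / clock_mhz


tms320c40 = latency_ns(2, 50)    # 2 cycles at 50 MHz  -> 40.0 ns
haswell = latency_ns(5, 2500)    # 5 cycles at 2.5 GHz -> 2.0 ns
```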

Contrary to common portrayals, multiply-accumulate instructions are also useful for calculations outside their main areas of application, which include digital image processing, video decoding, digital filters, and control engineering.

The arguments and the result of this operation can vary depending on the processor type and the selected data type:

Integers (Motorola DSP56K),
Fixed-point numbers (no example known),
Single-precision floating-point numbers (TI TMS320C30/40, Altivec, Intel Haswell), or
Double-precision floating-point numbers (Intel Haswell).

Accuracy

The MAC operation can improve the accuracy of the final result if the necessary rounding is performed only once, at the end of the operation, while the intermediate result is carried internally at full precision without rounding. This variant is called fused multiply-accumulate, abbreviated FMA or FMAC. In contrast to the plain MAC operation, the FMAC operation requires wider data paths and therefore more hardware.
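The effect of rounding only once can be sketched in software. The example below does not use a hardware FMA instruction; it emulates the fused behavior with exact rational arithmetic (`fractions.Fraction`) and a single final conversion to a double. The function names and the test values are assumptions chosen for illustration.

```python
from fractions import Fraction


def mac_unfused(a, b, c):
    """a + (b * c) with two roundings: one after the product, one after the sum."""
    product = b * c      # rounded to double precision here
    return a + product   # rounded a second time


def mac_fused_emulated(a, b, c):
    """Emulate a fused multiply-accumulate: exact intermediate, one rounding."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)  # no rounding at all
    return float(exact)  # the only rounding step


b = 1.0 + 2.0**-27
c = 1.0 - 2.0**-27
# The exact product is 1 - 2**-54, which is not representable as a double
# and rounds to 1.0, so the unfused result loses the small term entirely.
unfused = mac_unfused(-1.0, b, c)       # 0.0
fused = mac_fused_emulated(-1.0, b, c)  # -2**-54
```

The exact-arithmetic emulation is far slower than a hardware instruction; it only serves to show which result a single final rounding produces.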

Speed

The speedup can be up to 100%. In many DSPs, a multiply-accumulate instruction takes just as long as a single addition or a single multiplication (example: Texas Instruments TMS320C40). The speedup on the Intel Haswell is smaller: a multiply-accumulate instruction takes 5 clock cycles, whereas a single multiplication takes 5 clock cycles and a single addition 3 clock cycles, for a total of 8 clock cycles. With optimal use, this yields a gain of 60%.
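The percentages above can be reproduced with a short calculation that compares a separate multiplication plus addition against one multiply-accumulate instruction. The helper name `speedup_percent` is hypothetical, and the single-cycle figures used for the DSP case are an assumed example of "all three operations take equally long".

```python
def speedup_percent(mul_cycles, add_cycles, mac_cycles):
    """Throughput gain in percent of one MAC over a separate multiply and add."""
    separate = mul_cycles + add_cycles
    return 100.0 * (separate - mac_cycles) / mac_cycles


dsp_gain = speedup_percent(1, 1, 1)      # 100.0 (assumed one cycle each)
haswell_gain = speedup_percent(5, 3, 5)  # 60.0 (cycle counts from the text)
```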

On the other hand, the multiply-accumulate instruction often lies on the critical path and thus limits the achievable clock frequency. Another problem is that, in practice, one very often needs operations of the modified form

${\displaystyle z \leftarrow a + (b \cdot c) \cdot s}$

with

${\displaystyle s = \pm 0.5,\ \pm 1,\ \pm 2}$

Often the product must be subtracted rather than added (cost: one exclusive-or gate on the sign of ${\displaystyle b}$ or ${\displaystyle c}$).
Scaling by a factor of 0.5 or 2 is needed just as often (cost: an increment or decrement of the exponent of ${\displaystyle b}$ or ${\displaystyle c}$).
A four-operand form is required because ${\displaystyle a}$ must not be overwritten.

In the first case, the MAC instruction often cannot be used, even though it is only one exclusive-or gate away from the required operation. In the second case, the MAC instruction is clearly beneficial, but a trivial scaling operation unfortunately remains. The third case was addressed by AMD with FMA4; moreover, this limitation can usually be hidden by the register renaming of today's CPUs.
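The extended four-operand operation discussed above can be sketched as follows. This is a software illustration only: the function name `scaled_mac` and its parameters are hypothetical, the sign flip stands in for the exclusive-or gate, and `math.ldexp` stands in for the exponent increment or decrement.

```python
import math


def scaled_mac(a, b, c, sign=1, exp_shift=0):
    """Return z = a + (b * c) * s with s = sign * 2**exp_shift.

    sign is +1 or -1; exp_shift is -1, 0, or +1, so s covers
    the values +-0.5, +-1, and +-2 without touching a itself.
    """
    if sign not in (1, -1) or exp_shift not in (-1, 0, 1):
        raise ValueError("s must be one of +-0.5, +-1, +-2")
    b_adjusted = math.ldexp(b, exp_shift)  # adjust the exponent of b
    if sign < 0:
        b_adjusted = -b_adjusted           # flip the sign of b
    return a + b_adjusted * c              # result goes to z, a is kept


z = scaled_mac(1.0, 3.0, 4.0, sign=-1, exp_shift=-1)  # 1 - 0.5 * 12 = -5.0
```

Folding the sign and the power-of-two scale into one operand leaves a single multiply-add, which is the point the text makes about how little extra hardware would be needed.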

Examples:

Approximation of the reciprocal of a with the Newton-Raphson method: ${\displaystyle x' = x \cdot (\underline{2 - a \cdot x})}$
Approximation of the reciprocal square root of a with the Newton-Raphson method: ${\displaystyle x' = x \cdot (\underline{1.5 - 0.5 \cdot a \cdot (x^{2})})}$
Complex multiplication: ${\displaystyle r = \underline{(r_{1} r_{2}) - i_{1} i_{2}},\ i = r_{1} i_{2} + r_{2} i_{1}}$
Even the iteration of the Julia set: ${\displaystyle r_{n+1} = r_{n}^{2} \underline{- i_{n}^{2} + r_{0}},\ i_{n+1} = \underline{2 r_{n} i_{n} + i_{0}}}$
Function approximation by means of series expansion up to the quadratic term
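The first four examples can be written out so that each underlined term appears as one multiply-add with a negated or scaled operand. This is a sketch under assumptions: the function names, the starting values `x0`, and the default iteration count are chosen for illustration and are not taken from the sources.

```python
def reciprocal(a, x0, iterations=6):
    """Newton-Raphson approximation of 1/a: x' = x * (2 - a*x)."""
    x = x0
    for _ in range(iterations):
        t = 2.0 + (-a) * x  # underlined term: MAC with negated product
        x = x * t
    return x


def reciprocal_sqrt(a, x0, iterations=6):
    """Newton-Raphson approximation of 1/sqrt(a): x' = x * (1.5 - 0.5*a*x^2)."""
    x = x0
    for _ in range(iterations):
        t = 1.5 + (-0.5 * a) * (x * x)  # negated and scaled by 0.5
        x = x * t
    return x


def complex_mul(r1, i1, r2, i2):
    """Multiply (r1 + i1*j) by (r2 + i2*j), returning (real, imaginary)."""
    r = r1 * r2
    r = r + (-i1) * i2  # underlined term: the product is subtracted
    i = r1 * i2
    i = i + r2 * i1     # plain multiply-accumulate
    return r, i


def julia_step(r, i, r0, i0):
    """One iteration z <- z^2 + c, with z = r + i*j and c = r0 + i0*j."""
    r_next = r * r + (r0 + (-i) * i)  # subtracting variant
    i_next = i0 + (2.0 * r) * i       # variant scaled by 2
    return r_next, i_next
```

For example, `reciprocal(4.0, x0=0.2)` converges toward 0.25 and `reciprocal_sqrt(4.0, x0=0.4)` toward 0.5; both iterations only converge when the starting value is close enough to the result.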

References

Uwe Meyer-Baese: Digital Signal Processing with Field Programmable Gate Arrays, Springer Verlag, 2014, p. 124ff. doi:10.1007/978-3-642-45309-0
Eric Quinnell: Floating-Point Fused Multiply-Add Architectures. (PDF; 4.4 MB) 2007, accessed on July 25, 2013.
