FMA x86

from Wikipedia, the free encyclopedia

FMA x86 is an instruction set extension for x86 microprocessors from Intel and AMD that supports fused multiply-add (FMA) operations. AMD first implemented it in the "Bulldozer" CPUs; Intel did not adopt it until the Haswell processors.
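What "fused" means can be illustrated with a small model: a fused multiply-add rounds once, after the addition, whereas a separate multiply and add round twice. The following is only a sketch in Python, not hardware code; the helper names `fused_multiply_add` and `unfused_multiply_add` are made up for this illustration, and the fused result is emulated with exact rational arithmetic for finite inputs.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with one rounding at the end (finite inputs only)."""
    # Fractions hold the product and the sum exactly; float() rounds once.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Ordinary a*b + c: the product is rounded before the addition."""
    product = a * b        # first rounding
    return product + c     # second rounding


# Inputs chosen so that the exact product 1 - 2**-60 is not representable.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

In this example the separately rounded product becomes exactly 1.0, so the small term is lost before the addition, while the fused form keeps it.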

There are two incompatible versions, FMA4 and FMA3:

- FMA4 is the full-fledged version. It allows operations of the form $d = a + b \cdot c$, with a destination register that is separate from the three source operands.
- FMA3, on the other hand, requires the destination register to be one of the operand registers, whose previous value is then overwritten.
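The difference between the two forms can be sketched with a toy register file. This is an illustrative model only: the dictionary stands in for registers, single floats stand in for whole vectors, and the function names `fma4` and `fma3_231` are hypothetical.

```python
def fma4(regs, dst, src1, src2, src3):
    """Four-operand form: dst = src1 * src2 + src3, sources stay intact."""
    regs[dst] = regs[src1] * regs[src2] + regs[src3]


def fma3_231(regs, dst, src2, src3):
    """Three-operand form (231 variant): dst = src2 * src3 + dst."""
    regs[dst] = regs[src2] * regs[src3] + regs[dst]


# FMA4: the result goes to xmm0, all three inputs survive.
regs4 = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 5.0}
fma4(regs4, "xmm0", "xmm1", "xmm2", "xmm3")

# FMA3: the addend in xmm3 is replaced by the result.
regs3 = {"xmm1": 2.0, "xmm2": 3.0, "xmm3": 5.0}
fma3_231(regs3, "xmm3", "xmm1", "xmm2")

print(regs4)
print(regs3)
```

If the old value of the overwritten register is still needed after an FMA3 instruction, it has to be copied to another register beforehand.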

New features

FMA extends the capabilities for vector operations and can be seen as an extension of the AVX instructions.

New instructions

CPUs with FMA4

Intel
- No Intel processor implements FMA4; Intel supports only FMA3.
AMD
- Bulldozer-based processors (AMD FX), Q4/2011
- Piledriver-based processors (AMD FX, Trinity and Richland APUs), Q2/2012
- Steamroller-based processors (4th generation A-series processors, Kaveri APUs), Q1/2014

Mnemonic	Operands	Operation
VFMADDPDx	xmm, xmm, xmm/m128, xmm/m128	a = b ∙ c + d
VFMADDPDy	ymm, ymm, ymm/m256, ymm/m256	a = b ∙ c + d
VFMADDPSx	xmm, xmm, xmm/m128, xmm/m128	a = b ∙ c + d
VFMADDPSy	ymm, ymm, ymm/m256, ymm/m256	a = b ∙ c + d
VFMADDSD	xmm, xmm, xmm/m64, xmm/m64	a = b ∙ c + d
VFMADDSS	xmm, xmm, xmm/m32, xmm/m32	a = b ∙ c + d
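The mnemonic endings in the table encode how many elements one instruction processes: PD and PS are packed double and single precision, SD and SS are the scalar forms, and the trailing x or y is this table's notation for 128-bit xmm and 256-bit ymm registers. A minimal sketch of that reading follows; the helpers `lanes` and `packed_fma` are made up for illustration.

```python
REGISTER_BITS = {"x": 128, "y": 256}   # xmm and ymm register widths
ELEMENT_BITS = {"S": 32, "D": 64}      # single and double precision


def lanes(mnemonic: str) -> int:
    """Elements processed by one instruction, e.g. 'VFMADDPDy' -> 4."""
    if mnemonic[-1] in REGISTER_BITS:      # packed form with x/y suffix
        width = REGISTER_BITS[mnemonic[-1]]
        kind = mnemonic[-3:-1]             # 'PD' or 'PS'
    else:                                  # scalar form, no suffix
        width = None
        kind = mnemonic[-2:]               # 'SD' or 'SS'
    if kind[1] not in ELEMENT_BITS:
        raise ValueError(f"unknown precision in {mnemonic!r}")
    if kind[0] == "S":
        return 1                           # scalar: lowest element only
    if kind[0] != "P" or width is None:
        raise ValueError(f"cannot parse {mnemonic!r}")
    return width // ELEMENT_BITS[kind[1]]


def packed_fma(b, c, d):
    """Element-wise a = b * c + d over lists that stand in for vectors."""
    return [bi * ci + di for bi, ci, di in zip(b, c, d)]


for name in ("VFMADDPDx", "VFMADDPDy", "VFMADDPSx", "VFMADDPSy", "VFMADDSD"):
    print(name, lanes(name))
print(packed_fma([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))
```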

CPUs with FMA3

Intel
- Haswell-based processors (4th generation Core i processors)
AMD
- Piledriver-based processors (2nd generation FX CPUs, Trinity and Richland APUs), Q2/2012
- Steamroller-based processors (4th generation A-series processors, Kaveri APUs), Q1/2014

Mnemonic	Operands	Operation
VFMADD132PDy	ymm, ymm, ymm/m256	a = a ∙ c + b
VFMADD132PSy	ymm, ymm, ymm/m256	a = a ∙ c + b
VFMADD132PDx	xmm, xmm, xmm/m128	a = a ∙ c + b
VFMADD132PSx	xmm, xmm, xmm/m128	a = a ∙ c + b
VFMADD132SD	xmm, xmm, xmm/m64	a = a ∙ c + b
VFMADD132SS	xmm, xmm, xmm/m32	a = a ∙ c + b
VFMADD213PDy	ymm, ymm, ymm/m256	a = b ∙ a + c
VFMADD213PSy	ymm, ymm, ymm/m256	a = b ∙ a + c
VFMADD213PDx	xmm, xmm, xmm/m128	a = b ∙ a + c
VFMADD213PSx	xmm, xmm, xmm/m128	a = b ∙ a + c
VFMADD213SD	xmm, xmm, xmm/m64	a = b ∙ a + c
VFMADD213SS	xmm, xmm, xmm/m32	a = b ∙ a + c
VFMADD231PDy	ymm, ymm, ymm/m256	a = b ∙ c + a
VFMADD231PSy	ymm, ymm, ymm/m256	a = b ∙ c + a
VFMADD231PDx	xmm, xmm, xmm/m128	a = b ∙ c + a
VFMADD231PSx	xmm, xmm, xmm/m128	a = b ∙ c + a
VFMADD231SD	xmm, xmm, xmm/m64	a = b ∙ c + a
VFMADD231SS	xmm, xmm, xmm/m32	a = b ∙ c + a
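The three digits in an FMA3 mnemonic name the operand positions: the first two digits select the factors and the third selects the addend, and the result always goes to operand 1. A short sketch of this rule follows, with scalars standing in for vectors; the function `fma3` is hypothetical and not part of any library.

```python
VALID_VARIANTS = ("132", "213", "231")


def fma3(variant: str, op1: float, op2: float, op3: float) -> float:
    """Return the new value of operand 1 for VFMADD<variant>.

    result = op[first digit] * op[second digit] + op[third digit]
    """
    if variant not in VALID_VARIANTS:
        raise ValueError(f"unsupported variant {variant!r}")
    ops = {"1": op1, "2": op2, "3": op3}
    factor_a, factor_b, addend = variant
    return ops[factor_a] * ops[factor_b] + ops[addend]


# With a = 2, b = 3, c = 5 as operands 1, 2 and 3:
a, b, c = 2.0, 3.0, 5.0
results = {variant: fma3(variant, a, b, c) for variant in VALID_VARIANTS}
print(results)
```

The three variants let the overwritten operand be either one of the factors (132, 213) or the addend (231), so a compiler can choose which input value to give up.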

Applications

- Useful for floating-point-intensive workloads, especially multimedia, scientific, and financial calculations. Integer operations are to follow later.
- Increases the parallelism and throughput of floating-point SIMD calculations.
- Reduces register pressure through the non-destructive four-operand form (FMA4 only).
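A dot product is a typical case: each step multiplies two elements and adds the result to an accumulator, which maps onto one FMA per vector of elements. The sketch below imitates a loop of packed 231-style FMAs over four double-precision lanes (the width of a ymm register) and counts the FMA instructions it stands for. The function `dot_packed_fma` is an illustrative assumption, and plain Python arithmetic is used, so the single rounding of a real FMA is not reproduced.

```python
def dot_packed_fma(xs, ys, lanes=4):
    """Dot product in the style of a packed FMA loop.

    Each outer iteration stands for one packed FMA instruction:
    acc[i] = xs[i] * ys[i] + acc[i] for all lanes at once.
    Returns (result, number_of_fma_instructions).
    """
    if len(xs) != len(ys) or len(xs) % lanes != 0:
        raise ValueError("inputs need equal length, a multiple of the lane count")
    acc = [0.0] * lanes
    fma_count = 0
    for start in range(0, len(xs), lanes):
        for i in range(lanes):  # in hardware these lanes are updated in parallel
            acc[i] = xs[start + i] * ys[start + i] + acc[i]
        fma_count += 1
    return sum(acc), fma_count  # final horizontal sum of the lanes


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0] * 8
result, fma_count = dot_packed_fma(xs, ys)
print(result, fma_count)
```

For eight element pairs the loop stands for two packed FMAs, where a scalar version without FMA would need eight multiplications and eight additions.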