FMA x86
FMA x86 is an instruction set extension for Intel and AMD microprocessors that supports fused multiply-add (FMA) operations. AMD first implemented it in the "Bulldozer" CPU; Intel did not adopt it until the Haswell processors.
There are two incompatible versions, FMA4 and FMA3:
- FMA4 is the full-fledged version: it allows the four-operand form a = b ∙ c + d, in which the destination register is separate from the three source operands;
- FMA3, on the other hand, requires that the destination register be one of the operand registers, which is then overwritten.
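The difference between the two forms can be sketched with a toy register file. This is only an illustration: the dictionaries, the register values and the helper names `fma4` and `fma3_231` are assumptions made for the example, and ordinary Python arithmetic stands in for the hardware operation.

```python
# Toy model of the two operand forms. The register file is a plain dict;
# register contents and helper names are illustrative assumptions.

def fma4(regs, dst, src1, src2, src3):
    """FMA4 style: dst = src1 * src2 + src3, with a separate destination."""
    regs[dst] = regs[src1] * regs[src2] + regs[src3]


def fma3_231(regs, dst, src2, src3):
    """FMA3 style (231 form): dst = src2 * src3 + dst, dst is also an input."""
    regs[dst] = regs[src2] * regs[src3] + regs[dst]


# FMA4: the result goes to xmm0, all three inputs keep their values.
regs4 = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
fma4(regs4, "xmm0", "xmm1", "xmm2", "xmm3")

# FMA3: the result replaces xmm3, so the old addend is overwritten.
regs3 = {"xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}
fma3_231(regs3, "xmm3", "xmm1", "xmm2")

print(regs4)
print(regs3)
```

In the FMA3 case, code that still needs the old value of `xmm3` would first have to copy it to another register, which is the cost of the destructive form.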
New features
FMA extends the range of available vector operations and can be seen as an extension of the AVX instructions.
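As a rough sketch of what a vector (packed) FMA does, the following models one operation applied to every lane of a register. The lane counts follow from the register widths (128-bit xmm, 256-bit ymm) and element sizes (64-bit double, 32-bit single); the helper names `lanes` and `packed_fma` and the use of Python lists as registers are assumptions for illustration only.

```python
# Lane-wise model of a packed multiply-add. Lists stand in for registers.

REGISTER_BITS = {"x": 128, "y": 256}   # x = xmm register, y = ymm register
ELEMENT_BITS = {"PD": 64, "PS": 32}    # packed double / packed single


def lanes(kind, reg):
    """Number of elements processed by one packed instruction."""
    return REGISTER_BITS[reg] // ELEMENT_BITS[kind]


def packed_fma(b, c, d):
    """Compute b[i] * c[i] + d[i] independently for every lane i."""
    if not (len(b) == len(c) == len(d)):
        raise ValueError("all operands must have the same number of lanes")
    return [bi * ci + di for bi, ci, di in zip(b, c, d)]


for kind in ("PD", "PS"):
    for reg in ("x", "y"):
        print(kind + reg, lanes(kind, reg), "lanes")

# Four double-precision lanes, as in a 256-bit packed-double operation.
result = packed_fma([1.0, 2.0, 3.0, 4.0], [10.0] * 4, [0.5] * 4)
print(result)
```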
New instructions
CPUs with FMA4
- Intel
  - Intel has not implemented FMA4; its processors support only FMA3.
- AMD
  - Bulldozer-based processors (AMD FX), Q4 / 2011
  - Piledriver-based processors (AMD FX, Trinity and Richland APUs), Q2 / 2012
  - Steamroller-based processors (4th generation A-series processors, Kaveri APUs), Q1 / 2014
Mnemonic | Operands | Operation |
---|---|---|
VFMADDPDx | xmm, xmm, xmm / m128, xmm / m128 | a = b ∙ c + d |
VFMADDPDy | ymm, ymm, ymm / m256, ymm / m256 | a = b ∙ c + d |
VFMADDPSx | xmm, xmm, xmm / m128, xmm / m128 | a = b ∙ c + d |
VFMADDPSy | ymm, ymm, ymm / m256, ymm / m256 | a = b ∙ c + d |
VFMADDSD | xmm, xmm, xmm / m64, xmm / m64 | a = b ∙ c + d |
VFMADDSS | xmm, xmm, xmm / m32, xmm / m32 | a = b ∙ c + d |
CPUs with FMA3
- Intel
  - Haswell-based processors (4th generation Core i processors)
- AMD
  - Piledriver-based processors (2nd generation FX CPUs, Trinity and Richland APUs), Q2 / 2012
  - Steamroller-based processors (4th generation A-series processors, Kaveri APUs), Q1 / 2014
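Mnemonic | Operands | Operation |
---|---|---|
VFMADD132PDy | ymm, ymm, ymm / m256 | a = a ∙ c + b |
VFMADD132PSy | ymm, ymm, ymm / m256 | a = a ∙ c + b |
VFMADD132PDx | xmm, xmm, xmm / m128 | a = a ∙ c + b |
VFMADD132PSx | xmm, xmm, xmm / m128 | a = a ∙ c + b |
VFMADD132SD | xmm, xmm, xmm / m64 | a = a ∙ c + b |
VFMADD132SS | xmm, xmm, xmm / m32 | a = a ∙ c + b |
VFMADD213PDy | ymm, ymm, ymm / m256 | a = b ∙ a + c |
VFMADD213PSy | ymm, ymm, ymm / m256 | a = b ∙ a + c |
VFMADD213PDx | xmm, xmm, xmm / m128 | a = b ∙ a + c |
VFMADD213PSx | xmm, xmm, xmm / m128 | a = b ∙ a + c |
VFMADD213SD | xmm, xmm, xmm / m64 | a = b ∙ a + c |
VFMADD213SS | xmm, xmm, xmm / m32 | a = b ∙ a + c |
VFMADD231PDy | ymm, ymm, ymm / m256 | a = b ∙ c + a |
VFMADD231PSy | ymm, ymm, ymm / m256 | a = b ∙ c + a |
VFMADD231PDx | xmm, xmm, xmm / m128 | a = b ∙ c + a |
VFMADD231PSx | xmm, xmm, xmm / m128 | a = b ∙ c + a |
VFMADD231SD | xmm, xmm, xmm / m64 | a = b ∙ c + a |
VFMADD231SS | xmm, xmm, xmm / m32 | a = b ∙ c + a |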
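The three digits in an FMA3 mnemonic can be read as operand positions: with the operands numbered 1 (a, the destination), 2 (b) and 3 (c), a form "xyz" computes operand x ∙ operand y + operand z and stores the result in operand 1. The sketch below decodes the digits in this way; the function name `vfmadd` and the sample values are assumptions chosen for illustration.

```python
# Decode the digit triple of an FMA3 mnemonic into its arithmetic.

def vfmadd(form, a, b, c):
    """Return the new value of operand 1 (a) for form '132', '213' or '231'."""
    if form not in ("132", "213", "231"):
        raise ValueError("unknown FMA3 form: " + form)
    operands = {"1": a, "2": b, "3": c}
    x, y, z = form
    return operands[x] * operands[y] + operands[z]


a, b, c = 2.0, 3.0, 5.0
r132 = vfmadd("132", a, b, c)  # a * c + b
r213 = vfmadd("213", a, b, c)  # b * a + c
r231 = vfmadd("231", a, b, c)  # b * c + a
print(r132, r213, r231)
```

Having all three orderings lets a compiler or programmer pick which of the inputs is sacrificed as the destination, which partly offsets the lack of a fourth operand.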
Applications
- Useful for floating-point-intensive calculations, especially in multimedia, scientific or financial applications. Integer operations are to follow later.
- Increases parallelism and throughput of floating point SIMD calculations
- Reduces register pressure through the non-destructive four-operand form (in the case of FMA4)
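For floating-point-intensive code, one further property of a fused multiply-add is worth illustrating: the product and the sum are rounded only once, whereas a separate multiply followed by an add rounds twice. The sketch below emulates the single rounding with exact rational arithmetic; the function names and the sample values are assumptions made for the example, and the code models the behaviour rather than calling the hardware instruction.

```python
from fractions import Fraction

# Emulate single rounding (fused) versus double rounding (separate ops).

def fused_multiply_add(b, c, d):
    """Compute b * c + d exactly, then round once to a double."""
    exact = Fraction(b) * Fraction(c) + Fraction(d)
    return float(exact)


def separate_multiply_add(b, c, d):
    """Ordinary double arithmetic: the product is rounded before the add."""
    return b * c + d


# Sample values chosen so that the exact product 1 - 2**-60 is not
# representable as a double and rounds to 1.0.
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30
d = -1.0

fused = fused_multiply_add(b, c, d)
separate = separate_multiply_add(b, c, d)
print(fused, separate)
```

With these inputs the separate version loses the result entirely, while the fused version keeps the small non-zero remainder.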