FMA x86

from Wikipedia, the free encyclopedia

FMA x86 is an instruction set extension for microprocessors from Intel and AMD that supports fused multiply-add (FMA) operations. AMD first implemented it in the "Bulldozer" CPU. Intel did not adopt it until the Haswell processors.

There are two incompatible versions, FMA4 and FMA3:

  • FMA4 is the full-fledged version, which allows operations of the form a = b ∙ c + d with a destination register distinct from the three source operands;
  • FMA3, on the other hand, requires that the destination register be one of the operand registers, which is then overwritten.
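
The difference between the two forms can be modeled with a small Python sketch. The register file is a plain dictionary, and the names `fused`, `fma4` and `fma3_231` are hypothetical helpers chosen for illustration, not part of any real API. The arithmetic is done with exact fractions and rounded once at the end, which is an assumption made here to mimic the single rounding of a fused operation.

```python
from fractions import Fraction

def fused(b, c, d):
    """Compute b*c + d exactly, then round once to a float."""
    return float(Fraction(b) * Fraction(c) + Fraction(d))

def fma4(regs, dst, src1, src2, src3):
    """FMA4 style: dst = src1 * src2 + src3, no source is overwritten
    unless dst happens to name one of them."""
    regs[dst] = fused(regs[src1], regs[src2], regs[src3])

def fma3_231(regs, dst, src2, src3):
    """FMA3 style (231 ordering): dst = src2 * src3 + dst.
    The destination doubles as the addend and is overwritten."""
    regs[dst] = fused(regs[src2], regs[src3], regs[dst])

# Hypothetical register contents for illustration.
regs = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 10.0}

fma4(regs, "xmm0", "xmm1", "xmm2", "xmm3")  # xmm3 keeps its old value
fma3_231(regs, "xmm3", "xmm1", "xmm2")      # old value of xmm3 is lost
```

With FMA3, code that still needs the old addend has to copy it to another register first; the four-operand form avoids that copy.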

New features

FMA extends the possibilities for vector operations and can be seen as an extension of the AVX instructions.

New instructions

CPUs with FMA4

  • Intel
    • It is currently unclear whether Intel will adopt FMA4 or stay with FMA3.
  • AMD
    • Bulldozer-based processors (AMD FX), Q4/2011
    • Piledriver-based processors (AMD FX, Trinity and Richland APUs), Q2/2012
    • Steamroller-based processors (4th-generation A-series processors, Kaveri APUs), Q1/2014

Mnemonic     Operands                        Operation
VFMADDPDx    xmm, xmm, xmm/m128, xmm/m128    a = b ∙ c + d
VFMADDPDy    ymm, ymm, ymm/m256, ymm/m256    a = b ∙ c + d
VFMADDPSx    xmm, xmm, xmm/m128, xmm/m128    a = b ∙ c + d
VFMADDPSy    ymm, ymm, ymm/m256, ymm/m256    a = b ∙ c + d
VFMADDSD     xmm, xmm, xmm/m64, xmm/m64      a = b ∙ c + d
VFMADDSS     xmm, xmm, xmm/m32, xmm/m32      a = b ∙ c + d
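
The mnemonics in the table encode how many elements one instruction processes. The following sketch derives that count from the name. `lane_count` is a hypothetical helper, and the trailing x/y is treated as this table's notation for xmm (128-bit) and ymm (256-bit) operands rather than as assembler syntax.

```python
ELEMENT_BITS = {"D": 64, "S": 32}   # second letter: Double or Single precision
VECTOR_BITS = {"x": 128, "y": 256}  # table suffix: xmm or ymm operands

def lane_count(mnemonic):
    """Return how many elements the instruction computes at once."""
    if mnemonic[-1] in VECTOR_BITS:
        # Packed form with a width suffix, e.g. VFMADDPDy
        kind = mnemonic[-3:-1]
        width = VECTOR_BITS[mnemonic[-1]]
    else:
        # Scalar form without a suffix, e.g. VFMADDSD
        kind = mnemonic[-2:]
        width = None
    if kind[0] == "S":
        return 1  # scalar: only the lowest element is computed
    return width // ELEMENT_BITS[kind[1]]

for name in ("VFMADDPDx", "VFMADDPDy", "VFMADDPSx", "VFMADDPSy",
             "VFMADDSD", "VFMADDSS"):
    print(name, lane_count(name))
```

The same function also accepts the FMA3 names, since only the trailing letters are inspected.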

CPUs with FMA3

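Mnemonic        Operands              Operation
VFMADD132PDy    ymm, ymm, ymm/m256    a = a ∙ c + b
VFMADD132PSy    ymm, ymm, ymm/m256    a = a ∙ c + b
VFMADD132PDx    xmm, xmm, xmm/m128    a = a ∙ c + b
VFMADD132PSx    xmm, xmm, xmm/m128    a = a ∙ c + b
VFMADD132SD     xmm, xmm, xmm/m64     a = a ∙ c + b
VFMADD132SS     xmm, xmm, xmm/m32     a = a ∙ c + b
VFMADD213PDy    ymm, ymm, ymm/m256    a = b ∙ a + c
VFMADD213PSy    ymm, ymm, ymm/m256    a = b ∙ a + c
VFMADD213PDx    xmm, xmm, xmm/m128    a = b ∙ a + c
VFMADD213PSx    xmm, xmm, xmm/m128    a = b ∙ a + c
VFMADD213SD     xmm, xmm, xmm/m64     a = b ∙ a + c
VFMADD213SS     xmm, xmm, xmm/m32     a = b ∙ a + c
VFMADD231PDy    ymm, ymm, ymm/m256    a = b ∙ c + a
VFMADD231PSy    ymm, ymm, ymm/m256    a = b ∙ c + a
VFMADD231PDx    xmm, xmm, xmm/m128    a = b ∙ c + a
VFMADD231PSx    xmm, xmm, xmm/m128    a = b ∙ c + a
VFMADD231SD     xmm, xmm, xmm/m64     a = b ∙ c + a
VFMADD231SS     xmm, xmm, xmm/m32     a = b ∙ c + a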
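
The three digits in an FMA3 mnemonic select which operand plays which role: the first two digits name the factors and the third names the addend, with operand 1 always receiving the result. A minimal sketch of that selection follows. `fma3` is a hypothetical helper name, and ordinary Python arithmetic is used, so the sketch shows only the operand ordering and not the single rounding of the hardware instruction.

```python
def fma3(order, a, b, c):
    """Return the new value of operand 1 (a) for the given digit order.

    order is "132", "213" or "231"; a, b, c are operands 1, 2 and 3.
    """
    operands = {"1": a, "2": b, "3": c}
    x, y, z = (operands[digit] for digit in order)
    return x * y + z  # digits 1 and 2 pick the factors, digit 3 the addend

a, b, c = 2.0, 3.0, 5.0
print(fma3("132", a, b, c))  # a * c + b
print(fma3("213", a, b, c))  # b * a + c
print(fma3("231", a, b, c))  # b * c + a
```

Having all three orderings lets the compiler or programmer choose which input value is sacrificed as the destination.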

Application

  • Useful for floating-point-intensive calculations, especially in multimedia, scientific and financial applications. Integer operations are to follow later.
  • Increases the parallelism and throughput of floating-point SIMD calculations.
  • Reduces register pressure through the non-destructive four-operand form (in the case of FMA4).