Advanced Vector Extensions

from Wikipedia, the free encyclopedia

Advanced Vector Extensions (AVX) is an extension of the x86 instruction set for microprocessors from Intel and AMD, proposed by Intel in March 2008. AVX extends the earlier SIMD instruction set extension SSE4, which was also initiated by Intel, and widens the registers and data words to 256 bits. The following table shows the evolution of SIMD instructions in the x86 architecture:


Extension      Width    Registers           Encoding  First Intel CPUs                          First AMD CPUs
MMX / 3DNow!   64 bit   8 (MM0…MM7)         MMX       Pentium (P55C)                            K6 (MMX), K6-2 "Chomper" (3DNow!)
SSE (1…4.*)    128 bit  8/16 (XMM0…XMM15)   REX       SSE4: Core 2, Nehalem                     K7 "Palomino", K8, K8 "Venice"
AVX            256 bit  16 (YMM0…YMM15)     VEX       Sandy Bridge, Ivy Bridge                  Bulldozer, Piledriver, Steamroller, Jaguar
AVX2           256 bit  16 (YMM0…YMM15)     VEX       Haswell, Broadwell, Skylake, Kaby Lake    Excavator, Zen, Zen 2
AVX-512        512 bit  32 (ZMM0…ZMM31)     EVEX      Skylake-X, Xeon Phi x200, Xeon Scalable   –

AVX2 extends the AVX instruction set with further 256-bit instructions and was first supported by processors of Intel's Haswell architecture and AMD's Excavator architecture.

AVX-512 was announced in 2013 and widens the AVX instructions from 256 to 512 bits. It was first supported by processors of Intel's Knights Landing architecture.

New features

YMM AVX register scheme as an extension of the XMM-SSE registers

The width of the SIMD registers increases from 128 bits (with SSE) to 256 bits. The newly required registers are called YMM0 to YMM15. Processors that support AVX execute the older SSE instructions on the lower 128 bits of the new registers, i.e. the lower 128 bits of the YMM registers are shared with the XMM registers.

AVX introduces a three-operand SIMD instruction format c := a + b, so the result no longer necessarily destroys a source register, which saves copy operations. SSE instructions use the two-operand form a := a + b. The three-operand format can only be used with SIMD operands (YMM), not with general-purpose registers such as EAX or RAX.

Application

  • Useful for floating-point-intensive calculations, especially in the multimedia, scientific, and financial sectors; 256-bit integer operations followed later with AVX2.
  • Increases the parallelism and throughput of floating-point SIMD calculations.
  • Reduces register pressure through the non-destructive three-operand form.

Support in compilers and assemblers

GCC supports AVX from version 4.6, the Intel Compiler Suite from version 11.1, and Visual Studio from the 2010 release. The GNU assembler supports AVX, as does Intel's counterpart, and AVX can also be reached via inline assembly. In addition, MASM in the Visual Studio 2010 version, Yasm from version 1.1.0, FASM, and NASM support AVX. The x86 code generator of the LLVM compiler infrastructure has full AVX 1 support from version 3.0.

Operating system support

AVX requires explicit operating system support so that the new registers are correctly saved and restored on context switches. The following operating system versions support AVX:

DragonFly BSD: since early 2013
FreeBSD: since 9.1 (November 13, 2013), with a patch submitted on January 21, 2012
Linux: since kernel 2.6.30 (June 9, 2009)
macOS: since 10.6.8, the last Snow Leopard update (June 23, 2011)
OpenBSD: since 5.8 (October 18, 2015)
Solaris: since 10 Update 10 and Solaris 11
Windows: since Windows 7 SP1 and Windows Server 2008 R2 SP1 (February 22, 2011)


New AVX instructions

Register scheme of AVX-512 as an extension of the AVX (YMM0-YMM15) and SSE registers (XMM0-XMM15)
Bits 511…256 exist only in the ZMM registers; each YMM register forms the lower 256 bits (bits 255…0) of its ZMM register, and each XMM register forms the lower 128 bits (bits 127…0) of its YMM register. AVX-512 also widens the register file from 16 to 32 registers (ZMM0…ZMM31, with their YMM and XMM aliases).
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128: Copies a 32-bit, 64-bit, or 128-bit memory operand into all elements of an XMM or YMM register.

VINSERTF128: Replaces either the lower or the upper half of a 256-bit YMM register with the value of a 128-bit operand; the other half remains unchanged.

VEXTRACTF128: Extracts either the lower or the upper half of a 256-bit YMM register and copies the value into a 128-bit operand.

VMASKMOVPS, VMASKMOVPD: Conditionally reads vector elements from a SIMD memory operand into a destination register, filling the remaining elements with zeros; or conditionally writes vector elements from a SIMD register to a SIMD memory operand, leaving the remaining memory unchanged.

VPERMILPS, VPERMILPD: Permutes 32-bit or 64-bit vector elements.

VPERM2F128: Combines the four 128-bit vector halves of two 256-bit source operands into one 256-bit destination operand.

VTESTPS, VTESTPD: Sets the CF and ZF flags according to a comparison of all sign bits.

VZEROALL: Zeroes all YMM registers and marks them as unused; used when switching between 128-bit and 256-bit mode.

VZEROUPPER: Zeroes the upper halves of all YMM registers; used when switching between 128-bit and 256-bit mode.

Extension AVX2

Advanced Vector Extensions 2 (AVX2) is an extension in which some new instructions were introduced and numerous existing instructions were widened to 256 bits. AVX2 first shipped with AMD's Carrizo and Intel's Haswell processors.

Extension AVX-512

Since energy efficiency is becoming ever more important in high-performance computing and the SIMD concept promises progress here, the AVX units were thoroughly revised again for Intel's computing accelerator cards, known as Xeon Phi: the data and register width is doubled to 512 bits, and the number of registers is doubled to 32. This extension is called Intel AVX-512; it consists of several specified groups of new instructions that are implemented in stages. The second Xeon Phi generation ("Knights Landing", 2016) implements the "Foundation" group together with the "CD", "PF", and "ER" extensions.

In contrast to Xeon Phi / Knights Landing, the instruction groups "CD", "DQ", "BW", and "VL" are part of the Xeon Scalable Processors released in summer 2017 and of the Skylake-X processors derived from them (from the Core i7-7800X upward).

Intel documented the instruction groups in advance; their availability can be queried via the CPUID instruction, which sets specific register bits when an instruction group is present. AVX-512 is thus currently best seen as a specification or roadmap of the instructions Intel intends to bring to its AVX units in the future:

Instruction set     Name                                                 CPUID bit   Processors
AVX512F             Foundation (base set; all other groups optional)     EBX bit 16  Xeon Phi x200, Xeon SP
AVX512PF            Prefetch                                             EBX bit 26  Xeon Phi x200
AVX512DQ            Vector Doubleword and Quadword                       EBX bit 17  Xeon SP
AVX512BW            Vector Byte and Word                                 EBX bit 30  Xeon SP
AVX512VL            Vector Length                                        EBX bit 31  Xeon SP
AVX512CD            Conflict Detection                                   EBX bit 28  Xeon Phi x200, Xeon SP
AVX512ER            Exponential and Reciprocal                           EBX bit 27  Xeon Phi x200
AVX512IFMA          Integer Fused Multiply-Add with 512 bits             EBX bit 21  Cannon Lake
AVX512_VBMI         Vector Bit Manipulation                              ECX bit 1   Cannon Lake
AVX512_VBMI2        Vector Bit Manipulation 2                            ECX bit 6   Cannon Lake
AVX512_4FMAPS       Fused Multiply-Accumulation Packed Single            EDX bit 3   Xeon Phi 72x5
AVX512_4VNNIW       Vector Neural Network Instructions Word              EDX bit 2   Xeon Phi 72x5
AVX512_VPOPCNTDQ    Vector POPCNT Dword/Qword                            ECX bit 14  Xeon Phi 72x5
AVX512_VNNI         Vector Neural Network Instructions                   ECX bit 11  Xeon Cascade Lake
AVX512_BITALG       Bit algorithms (VPOPCNT[B,W], VPSHUFBITQMB)          ECX bit 12  Ice Lake
AVX512_GFNI         Galois Field New Instructions                        –           Ice Lake
AVX512_VPCLMULQDQ   Carry-Less Multiplication Quadword                   –           Ice Lake
AVX512_VAES         Vector AES                                           –           Ice Lake

Intel documents the implementation of the individual instruction groups for the Xeon SP and for the Xeon Phi Knights Landing (x200).

Use

Using these special instructions boils down to the following:

  • Isolating the program parts worth optimizing; only these need to be considered at all.
  • Optimizing, within those parts:
    • the memory layout of the data structures used (alignment, cache efficiency, locality of memory accesses);
    • the decomposition of the calculations into many independent threads that run in parallel and can, in part, be offloaded to other architectures (e.g. to one or more GPUs);
    • the use of the extended instruction sets, by …
      • using compilers that support these instruction sets,
      • using libraries that employ these instruction sets (e.g. the Math Kernel Library or OpenBLAS),
      • using libraries that in turn build on such libraries (e.g. graphics libraries),
      • using programming languages that make use of these instructions on their own (e.g. Python with the NumPy package),
      • or, for very performance-critical applications, resorting to compiler intrinsics or hand-written assembly routines to increase performance further.

These problems are not new, however, and the use of the instruction set extensions is still the part of such optimizations that can best be automated.

Conclusion

With AVX and its 256-bit-wide registers in x64 mode, a single instruction, for example a simple addition, performs four double-precision or eight single-precision floating-point operations at once. Each of the 16 AVX registers holds four double-precision or eight single-precision values, which are combined lane by lane with the corresponding partner.

With AVX2 the register width does not change; instead, some operations previously executed at 128 bits under AVX (e.g. FMA: fused multiply-add / floating-point multiply-accumulate, and integer operations) were brought to 256-bit execution. This increases the number of available 256-bit SIMD operations; a simple addition on a 64-bit architecture still computes (only) four double-precision or eight single-precision floating-point operations simultaneously.

With AVX-512, thanks to the doubled register width of 512 bits, eight double-precision or 16 single-precision additions are possible per clock (on a 64-bit architecture).

As of 2018, AVX-512 support in the desktop segment is limited to the Skylake-X processors on the X299 platform (socket 2066) and, since 2016, to a number of Xeon processor series.
