Advanced Vector Extensions
Advanced Vector Extensions (AVX) is an extension of the x86 instruction set for microprocessors from Intel and AMD, proposed by Intel in March 2008. AVX builds on the earlier SIMD instruction set extension SSE4, which was also initiated by Intel. The width of the registers and data words increases to 256 bits. The following table shows the evolution of SIMD instructions in the x86 architecture:
Extension name | Data width (bits) | Number of registers | Addressing scheme | Present in CPUs from (Intel) | Present in CPUs from (AMD) |
---|---|---|---|---|---|
MMX / 3DNow! | 64 | 8 (MM0…7) | – | from Pentium MMX (P55C) | K6 (MMX) / K6-2 "Chomper" (3DNow!) |
SSE (1…4.*) | 128 | 8/16 (XMM0…15) | REX | SSE4: Core 2, Nehalem | K7 "Palomino", K8, K8 "Venice" |
AVX | 256 | 16 (YMM0…15) | VEX | Sandy Bridge, Ivy Bridge | Bulldozer, Piledriver, Steamroller, Jaguar |
AVX2 | 256 | 16 (YMM0…15) | VEX | Haswell, Broadwell, Skylake, Kaby Lake | Excavator, Zen, Zen 2 |
AVX-512 | 512 | 32 (ZMM0…31) | EVEX | Skylake-X, Xeon Phi x200, Xeon Scalable Processors | – |
AVX2 extends the AVX instruction set with further 256-bit instructions and was first supported by processors of the Haswell architecture (Intel) and the Excavator architecture (AMD).
AVX-512 was announced in 2013 and widens the AVX registers from 256 to 512 bits. It was first supported by processors of the Knights Landing architecture (Intel).
New features
The width of the SIMD registers has been increased from 128 bits (with SSE) to 256 bits. The resulting registers are called YMM0 to YMM15. Processors that support AVX execute the older SSE instructions on the lower 128 bits of the new registers; i.e., the lower 128 bits of the YMM registers are aliased to the XMM registers.
AVX introduces a three-operand SIMD instruction format c := a + b, so the result no longer necessarily destroys a source register, which saves copy operations. SSE instructions use the two-operand form a := a + b. The three-operand format can only be used with SIMD operands (YMM), not with general-purpose registers such as EAX or RAX.
Application
- Useful for floating-point-intensive calculations, especially in the multimedia, scientific, or financial sector; integer operations were to follow later (and arrived with AVX2).
- Increases parallelism and throughput of floating-point SIMD calculations.
- Reduces register pressure through the non-destructive three-operand form.
Support in compilers and assemblers
GCC from version 4.6, the Intel Compiler Suite from version 11.1, and Visual Studio 2010 support AVX. The GNU assembler supports the AVX instructions, as does Intel's counterpart. In addition, MASM in the version shipped with Visual Studio 2010, Yasm from version 1.1.0, FASM, and NASM support AVX. The x86 code generator of the LLVM compiler infrastructure has full AVX1 support from version 3.0.
Operating system support
AVX needs explicit support from the operating system so that the new registers are correctly saved and restored on a context switch. The following operating system versions support AVX:
- DragonFly BSD
  - since early 2013
- FreeBSD
  - since 9.1, with a patch submitted on January 21, 2012
- Linux
  - since kernel 2.6.30 (June 9, 2009)
- macOS
  - since 10.6.8 (the last Snow Leopard update, June 23, 2011)
- OpenBSD
  - since 5.8 (October 18, 2015)
- Solaris
  - since 10 Update 10 and Solaris 11
- Windows
  - since Windows 7 SP1 and Windows Server 2008 R2 SP1 (February 22, 2011)
CPUs with AVX
- Intel
  - Sandy Bridge processors, Q1 2011
  - Ivy Bridge processors, Q2 2012
  - Haswell processors, Q2 2013
  - Broadwell processors, Q1 2015
  - Skylake processors, Q3 2015
  - Kaby Lake processors, Q3 2016
  - Coffee Lake processors, Q4 2017
- AMD
  - Bulldozer processors, Q4 2011
  - Piledriver processors, Q4 2012
  - Jaguar processors, Q2 2013
  - Steamroller processors, Q1 2014
  - Excavator processors, Q2 2015
  - Zen processors, Q1 2017
  - Zen 2 processors, Q3 2019
New AVX instructions
Bits 511…256 | Bits 255…128 | Bits 127…0 |
---|---|---|
ZMM0 | YMM0 | XMM0 |
ZMM1 | YMM1 | XMM1 |
ZMM2 | YMM2 | XMM2 |
ZMM3 | YMM3 | XMM3 |
ZMM4 | YMM4 | XMM4 |
ZMM5 | YMM5 | XMM5 |
ZMM6 | YMM6 | XMM6 |
ZMM7 | YMM7 | XMM7 |
ZMM8 | YMM8 | XMM8 |
ZMM9 | YMM9 | XMM9 |
ZMM10 | YMM10 | XMM10 |
ZMM11 | YMM11 | XMM11 |
ZMM12 | YMM12 | XMM12 |
ZMM13 | YMM13 | XMM13 |
ZMM14 | YMM14 | XMM14 |
ZMM15 | YMM15 | XMM15 |
ZMM16 | YMM16 | XMM16 |
ZMM17 | YMM17 | XMM17 |
ZMM18 | YMM18 | XMM18 |
ZMM19 | YMM19 | XMM19 |
ZMM20 | YMM20 | XMM20 |
ZMM21 | YMM21 | XMM21 |
ZMM22 | YMM22 | XMM22 |
ZMM23 | YMM23 | XMM23 |
ZMM24 | YMM24 | XMM24 |
ZMM25 | YMM25 | XMM25 |
ZMM26 | YMM26 | XMM26 |
ZMM27 | YMM27 | XMM27 |
ZMM28 | YMM28 | XMM28 |
ZMM29 | YMM29 | XMM29 |
ZMM30 | YMM30 | XMM30 |
ZMM31 | YMM31 | XMM31 |
instruction | description |
---|---|
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 | Copies a 32-bit, 64-bit, or 128-bit memory operand into all elements of an XMM or YMM register. |
VINSERTF128 | Replaces either the upper or lower half of a 256-bit YMM register with the value from the 128-bit operand. The other half remains unchanged. |
VEXTRACTF128 | Extracts either the upper or lower half of a 256-bit YMM register and copies the value into the 128-bit operand. |
VMASKMOVPS, VMASKMOVPD | Conditionally reads any number of vector elements from a SIMD memory operand into a destination register, filling the remaining elements with zeros; or conditionally writes any number of vector elements from a SIMD register to a SIMD memory operand, leaving the remaining memory unchanged. |
VPERMILPS, VPERMILPD | Permutes 32-bit or 64-bit vector elements. |
VPERM2F128 | Merges the four 128-bit vector elements from two 256-bit source operands into one 256-bit destination operand. |
VTESTPS, VTESTPD | Sets the flag bits CF and ZF according to a comparison of all sign bits. |
VZEROALL | Sets all YMM registers to zero and marks them as unused. Used when switching between 128-bit and 256-bit code. |
VZEROUPPER | Sets the upper half of all YMM registers to zero. Used when switching between 128-bit and 256-bit code. |
Extension AVX2
Advanced Vector Extensions 2 (AVX2) is an extension that introduces some new instructions and widens numerous existing instructions to 256 bits. AVX2 first shipped with Intel's Haswell and AMD's Carrizo processors.
Extension AVX-512
Since energy efficiency is becoming ever more important in high-performance computing and the SIMD concept promises progress here, the AVX units of Intel's computing accelerator cards, marketed as Intel Xeon Phi, were completely revised once again: the data and register width is doubled to 512 bits, and the number of registers is doubled to 32. This extension is called Intel AVX-512; it consists of several specified groups of new instructions that are implemented in stages. The second Xeon Phi generation ("Knights Landing", 2016) received the "Foundation" group as well as the "CD", "PF", and "ER" extensions.
In contrast to Xeon Phi / Knights Landing, the instruction groups "CD", "DQ", "BW", and "VL" are part of the Xeon Scalable Processors released in summer 2017 and the Skylake-X processors derived from them (from the Core i7-7800X).
Intel has documented the instruction groups in advance; their availability can be queried via the CPUID instruction, which sets certain register bits when a group is present. AVX-512 is currently best understood as a specification or "roadmap" of which instructions Intel intends to bring into the AVX units in the future:
Instruction set | Name | CPUID bit | Processors |
---|---|---|---|
AVX512F (basic instruction set; the remaining instructions are optional) | Foundation | EBX 16 | Xeon Phi x200, Xeon SP |
AVX512PF | Prefetch | EBX 26 | Xeon Phi x200 |
AVX512DQ | Vector Double Word and Quad Word | EBX 17 | Xeon SP |
AVX512BW | Vector Byte and Word | EBX 30 | Xeon SP |
AVX512VL | Vector length | EBX 31 | Xeon SP |
AVX512CD | Conflict Detection | EBX 28 | Xeon Phi x200, Xeon SP |
AVX512ER | Exponential and Reciprocal | EBX 27 | Xeon Phi x200 |
AVX512IFMA | Integer Fused Multiply-Add with 512 bits | EBX 21 | Cannon Lake |
AVX512_VBMI | Vector bit manipulation | ECX 01 | Cannon Lake |
AVX512_VBMI2 | Vector bit manipulation 2 | ECX 06 | Cannon Lake |
AVX512_4FMAPS | Vector Fused Multiply Accumulation Packed Single precision | EDX 03 | Xeon Phi 72x5 |
AVX512_4VNNIW | Vector Neural Network Instructions Word Variable Precision | EDX 02 | Xeon Phi 72x5 |
AVX512_VPOPCNTDQ | Vector POPCOUNT Dword / Qword | ECX 14 | Xeon Phi 72x5 |
AVX512_VNNI | Vector Neural Network Instructions | ECX 11 | Xeon Cascade Lake |
AVX512_BITALG | Bit algorithms, support for VPOPCNT[B,W] and VPSHUFBITQMB | ECX 12 | Ice Lake |
AVX512_GFNI | Galois Field New Instructions | Ice Lake | |
AVX512_VPCLMULQDQ | Carry-Less Multiplication Quadword | Ice Lake | |
AVX512_VAES | Vector AES | Ice Lake |
The implementation of the individual instruction groups is documented by Intel for the Xeon SP and for the Xeon Phi Knights Landing (x200).
Use
Using these special instructions boils down to the following:
- Isolation of the program parts to be optimized; only these need to be considered at all
- There, the following are optimized:
  - the memory layout of the data structures used (alignment, cache efficiency, locality of memory access)
  - the breakdown of the calculations into many independent threads that run in parallel and can in some cases be processed on different architectures (e.g. offloaded to one or more GPUs)
- Use of these extended instruction sets by:
  - using compilers that support these instruction sets
  - using libraries that use these instruction sets (e.g. the Math Kernel Library or OpenBLAS)
  - using libraries that in turn use such libraries (e.g. graphics libraries)
  - using programming languages that make use of these instructions on their own (e.g. Python with the NumPy package)
- For very performance-critical applications it may be necessary to use compiler intrinsics or to write assembler routines to increase performance further.
The problems are not new, however, and the use of the instruction set extensions is still the part of these optimizations that can best be automated.
Conclusion
With the help of AVX and its 256-bit wide registers in x64 mode, programs can, with a single instruction such as an addition, perform four double-precision or eight single-precision floating-point operations at once. Each of the 16 AVX registers holds four double-precision or eight single-precision values, which are then combined element-wise with a partner register.
With AVX2 the register width does not change; instead, some operations that under AVX were still executed with 128 bits (e.g. FMA: fused multiply-add / floating-point multiply-accumulate, integer operations, ...) were brought to 256-bit execution. This increases the number of available 256-bit SIMD operations. A single addition on a 64-bit architecture still computes (only) four double-precision or eight single-precision floating-point operations simultaneously.
With AVX-512, due to the doubled register width of 512 bits, there are eight additions with double precision or 16 additions with single precision per clock (on a 64-bit architecture).
The use of AVX-512 in the desktop segment is currently (2018) limited to the Skylake-X processors for socket 2066 with the X299 chipset and, since 2016, to a number of the Xeon processor series.
Individual evidence
- Thomas Hübner: SSE's successor is called AVX and is 256 bits wide. ComputerBase, March 17, 2008, accessed March 29, 2018.
- James Reinders: AVX-512 Instructions. Intel, July 23, 2013, retrieved March 3, 2017.
- x86_64 – support for AVX instructions. Retrieved November 20, 2013.
- FreeBSD 9.1-RELEASE Announcement. Retrieved May 20, 2013.
- Add support for the extended FPU states on amd64, both for native 64bit and 32bit ABIs. svnweb.freebsd.org, January 21, 2012, retrieved January 22, 2012.
- x86: add linux kernel support for YMM state. Retrieved July 13, 2009.
- Linux 2.6.30 – Linux Kernel Newbies. Retrieved July 13, 2009.
- Twitter. Retrieved June 23, 2010.
- Theo de Raadt: OpenBSD 5.8. Retrieved December 7, 2015.
- Floating-Point Support for 64-Bit Drivers. Retrieved December 6, 2009.
- Intel Offers Peek at Nehalem and Larrabee. ExtremeTech, March 17, 2008, retrieved August 20, 2011.
- Bulldozer Roadmap. Joe Doe, AMD Developer blogs, May 7, 2009, retrieved September 8, 2011.
- AMD Piledriver vs. Steamroller vs. Excavator – performance comparison of the architectures. In: Planet 3DNow!, August 14, 2015, archived from the original on February 21, 2017; retrieved February 20, 2017.
- ISA Extensions Programming Reference. Retrieved October 17, 2017.
- Xeon SP Technical Overview. Retrieved October 17, 2017.
- How to detect KNL instruction support. Retrieved October 17, 2017.
- Pavel Gepner: "Using AVX2 instruction set to increase performance of high performance computing code", Computing and Informatics 36.5 (2017): 1001–1018.