Advanced Vector Extensions
Advanced Vector Extensions (AVX) is an extension of the x86 instruction set for microprocessors from Intel and AMD, proposed by Intel in March 2008. AVX builds on the earlier SIMD instruction set extension SSE4, which was also initiated by Intel. The width of the registers and data words increases to 256 bits. The following table shows the evolution of SIMD instructions in the x86 architecture:
Extension name | Data width (bits) | Number of registers | Addressing scheme | Present in CPUs from (Intel) | Present in CPUs from (AMD) |
---|---|---|---|---|---|
MMX / 3DNow! | 64 | 8 (MM0…7) | – | from Pentium MMX (P55C) | K6 (MMX) / K6-2 "Chomper" (3DNow!) |
SSE (1…4.*) | 128 | 8/16 (XMM0…15) | REX | SSE4: Core 2, Nehalem | K7 "Palomino", K8, K8 "Venice" |
AVX | 256 | 16 (YMM0…15) | VEX | Sandy Bridge, Ivy Bridge | Bulldozer, Piledriver, Steamroller, Jaguar |
AVX2 | 256 | 16 (YMM0…15) | VEX | Haswell, Broadwell, Skylake, Kaby Lake | Excavator, Zen, Zen 2 |
AVX-512 | 512 | 32 (ZMM0…31) | EVEX | Skylake-X, Xeon Phi x200, Xeon Scalable Processors | – |
AVX2 extends the AVX instruction set with further 256-bit instructions and was first supported by processors of the Haswell architecture (Intel) and the Excavator architecture (AMD).
AVX-512 was announced in 2013 and widens the AVX registers from 256 to 512 bits. It was first supported by processors of the Knights Landing architecture (Intel).
New features
The width of the SIMD registers has been increased from 128 bits (with SSE) to 256 bits. The resulting registers are called YMM0 to YMM15. Processors that support AVX execute the older SSE instructions on the lower 128 bits of the new registers; i.e., the lower 128 bits of the YMM registers are aliased to the XMM registers.
AVX introduces a three-operand SIMD instruction format c := a + b, so the result no longer necessarily destroys a source register, which saves copy operations. SSE instructions use the two-operand form a := a + b. The three-operand format can only be used with SIMD operands (YMM), not with general-purpose registers such as EAX or RAX.
Application
- Useful for floating-point-intensive calculations, especially in the multimedia, scientific, or financial sector; integer operations were to follow later (and arrived with AVX2).
- Increases parallelism and throughput of floating-point SIMD calculations.
- Reduces register pressure through the non-destructive three-operand form.
Support in compilers and assemblers
GCC from version 4.6, the Intel Compiler Suite from version 11.1, and Visual Studio 2010 support AVX. The GNU assembler supports the AVX instructions, as does Intel's counterpart. In addition, MASM in the version shipped with Visual Studio 2010, Yasm from version 1.1.0, FASM, and NASM support AVX. The x86 code generator of the LLVM compiler infrastructure has full AVX1 support from version 3.0.
Operating system support
AVX needs explicit support from the operating system so that the new registers are correctly saved and restored on a context switch. The following operating system versions support AVX:
- DragonFly BSD
  - since early 2013
- FreeBSD
  - since 9.1, with a patch submitted on January 21, 2012
- Linux
  - since kernel 2.6.30 (June 9, 2009)
- macOS
  - since 10.6.8 (the last Snow Leopard update, June 23, 2011)
- OpenBSD
  - since 5.8 (October 18, 2015)
- Solaris
  - since 10 Update 10 and Solaris 11
- Windows
  - since Windows 7 SP1 and Windows Server 2008 R2 SP1 (February 22, 2011)
CPUs with AVX
- Intel
  - Sandy Bridge processors, Q1 2011
  - Ivy Bridge processors, Q2 2012
  - Haswell processors, Q2 2013
  - Broadwell processors, Q1 2015
  - Skylake processors, Q3 2015
  - Kaby Lake processors, Q3 2016
  - Coffee Lake processors, Q4 2017
- AMD
  - Bulldozer processors, Q4 2011
  - Piledriver processors, Q4 2012
  - Jaguar processors, Q2 2013
  - Steamroller processors, Q1 2014
  - Excavator processors, Q2 2015
  - Zen processors, Q1 2017
  - Zen 2 processors, Q3 2019
New AVX instructions
Bits 511…256 | Bits 255…128 | Bits 127…0 |
---|---|---|
ZMM0 | YMM0 | XMM0 |
ZMM1 | YMM1 | XMM1 |
ZMM2 | YMM2 | XMM2 |
ZMM3 | YMM3 | XMM3 |
ZMM4 | YMM4 | XMM4 |
ZMM5 | YMM5 | XMM5 |
ZMM6 | YMM6 | XMM6 |
ZMM7 | YMM7 | XMM7 |
ZMM8 | YMM8 | XMM8 |
ZMM9 | YMM9 | XMM9 |
ZMM10 | YMM10 | XMM10 |
ZMM11 | YMM11 | XMM11 |
ZMM12 | YMM12 | XMM12 |
ZMM13 | YMM13 | XMM13 |
ZMM14 | YMM14 | XMM14 |
ZMM15 | YMM15 | XMM15 |
ZMM16 | YMM16 | XMM16 |
ZMM17 | YMM17 | XMM17 |
ZMM18 | YMM18 | XMM18 |
ZMM19 | YMM19 | XMM19 |
ZMM20 | YMM20 | XMM20 |
ZMM21 | YMM21 | XMM21 |
ZMM22 | YMM22 | XMM22 |
ZMM23 | YMM23 | XMM23 |
ZMM24 | YMM24 | XMM24 |
ZMM25 | YMM25 | XMM25 |
ZMM26 | YMM26 | XMM26 |
ZMM27 | YMM27 | XMM27 |
ZMM28 | YMM28 | XMM28 |
ZMM29 | YMM29 | XMM29 |
ZMM30 | YMM30 | XMM30 |
ZMM31 | YMM31 | XMM31 |
instruction | description |
---|---|
VBROADCASTSS, VBROADCASTSD, VBROADCASTF128 | Copies a 32-bit, 64-bit, or 128-bit memory operand into all elements of an XMM or YMM register. |
VINSERTF128 | Replaces either the upper or lower half of a 256-bit YMM register with the value from the 128-bit operand. The other half remains unchanged. |
VEXTRACTF128 | Extracts either the upper or lower half of a 256-bit YMM register and copies the value into the 128-bit operand. |
VMASKMOVPS, VMASKMOVPD | Conditionally reads any number of vector elements from a SIMD memory operand into a destination register, filling the remaining elements with zeros; or conditionally writes any number of vector elements from a SIMD register to a SIMD memory operand, leaving the remaining memory unchanged. |
VPERMILPS, VPERMILPD | Permutes 32-bit or 64-bit vector elements. |
VPERM2F128 | Merges the four 128-bit vector elements from two 256-bit source operands into one 256-bit destination operand. |
VTESTPS, VTESTPD | Sets the flag bits CF and ZF according to a comparison of all sign bits. |
VZEROALL | Sets all YMM registers to zero and marks them as unused. Used when switching between 128-bit and 256-bit code. |
VZEROUPPER | Sets the upper half of all YMM registers to zero. Used when switching between 128-bit and 256-bit code. |
Extension AVX2
Advanced Vector Extensions 2 (AVX2) is an extension that introduces some new instructions and widens numerous existing instructions to 256 bits. AVX2 first shipped with Intel's Haswell and AMD's Carrizo processors.
Extension AVX-512
Since energy efficiency is becoming ever more important in high-performance computing and the SIMD concept promises progress here, the AVX units of Intel's computing accelerator cards, marketed as Intel Xeon Phi, were completely revised once again: the data and register width is doubled to 512 bits, and the number of registers is doubled to 32. This extension is called Intel AVX-512; it consists of several specified groups of new instructions that are implemented in stages. The second Xeon Phi generation ("Knights Landing", 2016) received the "Foundation" group as well as the "CD", "PF", and "ER" extensions.
In contrast to Xeon Phi / Knights Landing, the instruction groups "CD", "DQ", "BW", and "VL" are part of the Xeon Scalable Processors released in summer 2017 and the Skylake-X processors derived from them (from the Core i7-7800X).
Intel has documented the instruction groups in advance; their availability can be queried via the CPUID instruction, which sets certain register bits when a group is present. AVX-512 is currently best understood as a specification or "roadmap" of which instructions Intel intends to bring into the AVX units in the future:
Instruction set | Name | CPUID bit | Processors |
---|---|---|---|
AVX512F (basic instruction set; the remaining instructions are optional) | Foundation | EBX 16 | Xeon Phi x200, Xeon SP |
AVX512PF | Prefetch | EBX 26 | Xeon Phi x200 |
AVX512DQ | Vector Double Word and Quad Word | EBX 17 | Xeon SP |
AVX512BW | Vector Byte and Word | EBX 30 | Xeon SP |
AVX512VL | Vector length | EBX 31 | Xeon SP |
AVX512CD | Conflict Detection | EBX 28 | Xeon Phi x200, Xeon SP |
AVX512ER | Exponential and Reciprocal | EBX 27 | Xeon Phi x200 |
AVX512IFMA | Integer Fused Multiply-Add with 512 bits | EBX 21 | Cannon Lake |
AVX512_VBMI | Vector bit manipulation | ECX 01 | Cannon Lake |
AVX512_VBMI2 | Vector bit manipulation 2 | ECX 06 | Cannon Lake |
AVX512_4FMAPS | Vector Fused Multiply Accumulation Packed Single precision | EDX 03 | Xeon Phi 72x5 |
AVX512_4VNNIW | Vector Neural Network Instructions Word Variable Precision | EDX 02 | Xeon Phi 72x5 |
AVX512_VPOPCNTDQ | Vector POPCOUNT Dword / Qword | ECX 14 | Xeon Phi 72x5 |
AVX512_VNNI | Vector Neural Network Instructions | ECX 11 | Xeon Cascade Lake |
AVX512_BITALG | Bit algorithms, support for VPOPCNT[B,W] and VPSHUFBITQMB | ECX 12 | Ice Lake |
AVX512_GFNI | Galois Field New Instructions | Ice Lake | |
AVX512_VPCLMULQDQ | Carry-Less Multiplication Quadword | Ice Lake | |
AVX512_VAES | Vector AES | Ice Lake |
The implementation of the individual instruction groups is documented by Intel for the Xeon SP and for the Xeon Phi Knights Landing (x200).
Use
Using these special instructions boils down to the following:
- Isolation of the program parts to be optimized; only these need to be considered at all
- There, the following are optimized:
  - the memory layout of the data structures used (alignment, cache efficiency, locality of memory access)
  - the breakdown of the calculations into many independent threads that run in parallel and can in some cases be processed on different architectures (e.g. offloaded to one or more GPUs)
- Use of these extended instruction sets by:
  - using compilers that support these instruction sets
  - using libraries that use these instruction sets (e.g. the Math Kernel Library or OpenBLAS)
  - using libraries that in turn use such libraries (e.g. graphics libraries)
  - using programming languages that make use of these instructions on their own (e.g. Python with the NumPy package)
- For very performance-critical applications it may be necessary to use compiler intrinsics or to write assembler routines to increase performance further.
The problems are not new, however, and the use of the instruction set extensions is still the part of these optimizations that can best be automated.
Conclusion
With the help of AVX and its 256-bit wide registers in x64 mode, programs can, with a single instruction such as an addition, perform four double-precision or eight single-precision floating-point operations at once. Each of the 16 AVX registers holds four double-precision or eight single-precision values, which are then combined element-wise with a partner register.
With AVX2 the register width does not change; instead, some operations that under AVX were still executed with 128 bits (e.g. FMA: fused multiply-add / floating-point multiply-accumulate, integer operations, ...) were brought to 256-bit execution. This increases the number of available 256-bit SIMD operations. A single addition on a 64-bit architecture still computes (only) four double-precision or eight single-precision floating-point operations simultaneously.
With AVX-512, due to the doubled register width of 512 bits, there are eight additions with double precision or 16 additions with single precision per clock (on a 64-bit architecture).
The use of AVX-512 in the desktop segment is currently (2018) limited to the Skylake-X processors for socket 2066 with the X299 chipset and, since 2016, to a number of the Xeon processor series.
Individual evidence
- Thomas Hübner: SSE's successor is called AVX and is 256 bits wide. ComputerBase, March 17, 2008, accessed March 29, 2018.
- James Reinders: AVX-512 Instructions. Intel, July 23, 2013, retrieved March 3, 2017.
- x86_64 – support for AVX instructions. Retrieved November 20, 2013.
- FreeBSD 9.1-RELEASE Announcement. Retrieved May 20, 2013.
- Add support for the extended FPU states on amd64, both for native 64bit and 32bit ABIs. svnweb.freebsd.org, January 21, 2012, retrieved January 22, 2012.
- x86: add linux kernel support for YMM state. Retrieved July 13, 2009.
- Linux 2.6.30 – Linux Kernel Newbies. Retrieved July 13, 2009.
- Twitter. Retrieved June 23, 2010.
- Theo de Raadt: OpenBSD 5.8. Retrieved December 7, 2015.
- Floating-Point Support for 64-Bit Drivers. Retrieved December 6, 2009.
- Intel Offers Peek at Nehalem and Larrabee. ExtremeTech, March 17, 2008, retrieved August 20, 2011.
- Bulldozer Roadmap. Joe Doe, AMD Developer blogs, May 7, 2009, retrieved September 8, 2011.
- AMD Piledriver vs. Steamroller vs. Excavator – performance comparison of the architectures. In: Planet 3DNow!, August 14, 2015, archived from the original on February 21, 2017; retrieved February 20, 2017.
- ISA Extensions Programming Reference. Retrieved October 17, 2017.
- Xeon SP Technical Overview. Retrieved October 17, 2017.
- How to detect KNL instruction support. Retrieved October 17, 2017.
- Pavel Gepner: "Using AVX2 instruction set to increase performance of high performance computing code", Computing and Informatics 36.5 (2017): 1001–1018.