X86 SIMD instruction listings



The x86 instruction set has several times been extended with SIMD (Single instruction, multiple data) instruction set extensions. These extensions, starting from the MMX instruction set extension introduced with Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
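As an illustration of this lane-based model (an added minimal sketch, not part of the original listing; it uses the standard Intel intrinsics and assumes an SSE2-capable C compiler), the following fragment adds four pairs of 32-bit integers with a single instruction:

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add four 32-bit lanes in parallel: r[i] = a[i] + b[i] for i = 0..3.
   Compiles to a single PADDD instruction. */
__m128i add_4x32(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}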

Summary of SIMD extensions

The main SIMD instruction set extensions that have been introduced for x86 are:

SIMD instruction set extension Year Description Added in
Template:GlossaryTemplate:TermTemplate:Glossary end 1997 A set of 57 integer SIMD instructions acting on 64-bit vectors, mostly providing 8/16/32-bit lane-width operations.

Repurposed the old x87 FPU register-file as a bank of eight 64-bit vector registers, referred to as MM0..MM7 when used for MMX instructions.

Template:Nowrap
AMD K6,
Intel Pentium II,
Template:Nowrap
Rise mP6,
IDT WinChip C6,
Transmeta Crusoe
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 1999 "Katmai New Instructions" - introduced a set of 70 new instructions. Most but not all of these instructions provide scalar and vector operations on 32-bit floating-point values in 128-bit SIMD vector registers. (Some of the SSE instructions were instead new MMX instructions and non-SIMD instructions such as SFENCE - the subset of SSE that excludes the 128-bit SIMD register instructions is known as "MMX+", and is supported on some AMD processors that didn't implement full SSE, notably early Athlons and Geode LX.)

SSE introduced a new set of eight vector registers XMM0..XMM7, each 128 bits, and a status/control register MXCSR.

This set of eight vector registers would later be extended to 16 registers with the introduction of x86-64.

Intel Pentium III,
AMD Athlon XP,
VIA C3 "Nehemiah",
Transmeta Efficeon
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2000 Extended SSE with 144 new instructions - mainly additional instructions to work on scalars and vectors of 64-bit floating-point values, as well as 128-bit-vector forms of most of the MMX integer instructions. Intel Pentium 4,
Intel Pentium M,
AMD Athlon 64,
Transmeta Efficeon,
VIA C7
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2004 "Prescott New Instructions": added a set of 13 new instructions,Template:Efn mostly horizontal add/subtract operations. Intel Pentium 4 "Prescott",
Transmeta Efficeon 8800,
Athlon 64 "Venice",
VIA C7,
Intel Core "Yonah"
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2006 Added a set of 32 new instructions to extend MMX and SSE, including a byte-shuffle instruction. Intel Core 2 "Merom",
VIA Nano 2000,
Intel Atom "Bonnell",
AMD "Bobcat",
AMD FX "Bulldozer"
Template:GlossaryTemplate:TermTemplate:Glossary end 2007 AMD-only extension that added a set of 4 instructions, including bitfield insert/extract and scalar non-temporal store instructions. AMD K10
Template:GlossaryTemplate:TermTemplate:Glossary end 2007 Added a set of 47 instructions, including variants of integer min/max, widening integer conversions, vector lane insert/extract, and dot-product instructions. Intel Core 2 "Penryn",
VIA Nano 3000,
AMD FX "Bulldozer",
AMD "Jaguar",
Intel Atom "Silvermont"
Template:GlossaryTemplate:TermTemplate:Glossary end 2008 Added a set of 7 instructions, mostly pertaining to string processing. Intel Core i7 "Nehalem",
AMD FX "Bulldozer",
AMD "Jaguar",
Intel Atom "Silvermont",
VIA Nano QuadCore C4000
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2011 Extended the XMM0..XMM15 vector registers to 256-bit registers, referred to as YMM0..YMM15 when used as full 256-bit registers.

Added three-operand variants of most of the SSE1-4 vector instructions, as well as 256-bit vector variants of most of the SSE1-4 vector instructions acting on 32/64-bit floating-point values. These new instruction variants are all encoded with the new VEX prefix.

Intel Core i7 "Sandy Bridge",
AMD FX "Bulldozer",
AMD "Jaguar",
VIA Nano QuadCore C4000
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2013 Added three-operand floating-point fused-multiply add operations, scalar and vector variants. Intel Core i7 "Haswell",
AMD FX "Piledriver",
Zhaoxin Yongfeng
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2013 Added 256-bit vector variants of most of the MMX/SSE1-4 vector integer instructions. Also adds vector gather instructions. Intel Core i7 "Haswell",
AMD FX "Excavator",
VIA Nano QuadCore C4000
Template:GlossaryTemplate:TermTemplate:Glossary end 2016 Extended the YMM0..YMM15 vector registers to a set of 32 registers, each 512 bits wide, referred to as ZMM0..ZMM31 when used as 512-bit registers. Also added eight opmask registers K0..K7.

Added 512-bit versions of most of the MMX/SSE/AVX vector instructions, as well as a substantial number of additional instructions. These are mostly encoded with the new EVEX prefix (except for opmask management instructions, which continue to use the VEX prefix.)

Added the ability to perform per-vector-lane masking of the operation of most of its vector instructions, by using the opmask registers. Also added embedded rounding controls for floating-point instructions and a scalar-to-vector broadcast function for most instructions that can accept memory operands.

Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end

(See AVX-512#New instructions by sets for additional subsets.)

Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end 2023 Added a set of eight new tile registers, referred to as TMM0..TMM7. Each of these tile registers has a size of 8192 bits (16 rows of 64 bytes each). Also added instructions to perform matrix multiplication on these registers with various data formats. Template:Nowrap
Template:GlossaryTemplate:TermTemplate:Glossary end 2024 Reformulation of AVX-512 that includes most of the optional AVX-512 subsets as baseline functionality, but also allows for implementations to reduce their maximum supported vector-register width to 256 bits. Intel Xeon 6 "Granite Rapids"
Template:GlossaryTemplate:TermTemplate:Glossary end (2025) Adds support for rounding modifiers on 256-bit floating-point vector instructions, as well as a handful of new instructions. (Intel Diamond Rapids)

Template:Notelist Template:Vpad


MMX instructions and extended variants thereof

These instructions are, unless otherwise noted, available in the following forms:

  • MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
  • SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
  • AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
  • AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
  • AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.
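The differences between these forms can be sketched with the standard Intel intrinsics (an added example, not part of the original listing; the intrinsic names are the usual ones from immintrin.h, and appropriate compiler flags such as -msse2/-mavx2/-mavx512f are assumed):

#include <immintrin.h>

/* The same 32-bit lane-wise addition at three vector widths,
   plus an AVX-512 merge-masked variant. */
__m128i add_sse2  (__m128i a, __m128i b) { return _mm_add_epi32(a, b);    }  /* PADDD, 4 lanes    */
__m256i add_avx2  (__m256i a, __m256i b) { return _mm256_add_epi32(a, b); }  /* VPADDD, 8 lanes   */
__m512i add_avx512(__m512i a, __m512i b) { return _mm512_add_epi32(a, b); }  /* VPADDD, 16 lanes  */

/* AVX-512 opmask: lanes whose bit in k is set are computed,
   the remaining lanes are copied unchanged from src. */
__m512i add_masked(__m512i src, __mmask16 k, __m512i a, __m512i b)
{
    return _mm512_mask_add_epi32(src, k, a, b);
}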

For many of the instruction mnemonics, (V) is used to indicate that the mnemonic exists in forms both with and without a leading V: the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

Description Instruction mnemonics Basic opcode MMX
(no prefix)
SSE2
(66h prefix)
AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Empty MMX technology state. (MMX)

Mark all the FP/MMX registers as Empty, so that they can be freely used by later x87 code.

EMMS (MMX) 0F 77 rowspan=3 Template:Yes rowspan=3 Template:No rowspan=3 Template:YesTemplate:Efn rowspan=3 Template:No rowspan=3 Template:N/a rowspan=3 Template:N/a rowspan=3 Template:N/a
Zero out upper bits of vector registers YMM0 to YMM15 (AVX) VZEROUPPER (AVX)
Zero out all bits of vector registers YMM0 to YMM15 (AVX) VZEROALL (AVX)
Move scalar value from GPR (general-purpose register) or memory to vector register, with zero-fill 32-bit (V)MOVD mm, r/m32 0F 6E /r Template:Yes Template:Yes Template:Yes Template:Yes F Template:No Template:No
64-bit
(x86-64)
(V)MOVQ mm, r/m64,
MOVD mm, r/m64Template:Efn
Template:Yes Template:Yes Template:Yes Template:Yes F Template:No Template:No
Move scalar value from vector register to GPR or memory 32-bit (V)MOVD r/m32, mm 0F 7E /r Template:Yes Template:Yes Template:Yes Template:Yes F Template:No Template:No
64-bit
(x86-64)
(V)MOVQ r/m64, mm,
MOVD r/m64, mmTemplate:Efn
Template:Yes Template:Yes Template:Yes Template:Yes F Template:No Template:No
Vector move between vector register and either memory or another vector register.

For move to/from memory, the memory address is required to be aligned for (V)MOVDQA variants but not for MOVQ.

128-bit VEX-encoded form of VMOVDQA with memory argument will, if the memory is cacheable, perform its memory access atomically.Template:Efn

Template:Nowrap(MMX)
Template:Nowrap
0F 7F /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 Template:No
Template:Yes F 64 Template:No
MOVQ mm, mm/m64(MMX)
(V)MOVDQA xmm,xmm/m128
0F 6F /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 Template:No
Template:Yes F 64 Template:No
Pack 32-bit signed integers to 16-bit, with saturation Template:NowrapTemplate:ZwspTemplate:Efn 0F 6B /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 32
Pack 16-bit signed integers to 8-bit, with saturation (V)PACKSSWB mm, mm/m64Template:ZwspTemplate:Efn 0F 63 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
Pack 16-bit unsigned integers to 8-bit, with saturation (V)PACKUSWB mm, mm/m64Template:ZwspTemplate:Efn 0F 67 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
Unpack and interleave packed integers from the high halves of two input vectors 8-bit Template:NowrapTemplate:ZwspTemplate:Efn 0F 68 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit Template:NowrapTemplate:ZwspTemplate:Efn 0F 69 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Template:Nowrap Template:NowrapTemplate:ZwspTemplate:Efn 0F 6A /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Unpack and interleave packed integers from the low halves of two input vectors 8-bit Template:NowrapTemplate:ZwspTemplate:EfnTemplate:Efn 0F 60 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit Template:NowrapTemplate:ZwspTemplate:EfnTemplate:Efn 0F 61 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit Template:NowrapTemplate:ZwspTemplate:EfnTemplate:Efn 0F 62 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Add packed integers 8-bit (V)PADDB mm, mm/m64 0F FC /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PADDW mm, mm/m64 0F FD /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PADDD mm, mm/m64 0F FE /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Add packed signed integers with saturation 8-bit (V)PADDSB mm, mm/m64 0F EC /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PADDSW mm, mm/m64 0F ED /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Add packed unsigned integers with saturation 8-bit (V)PADDUSB mm, mm/m64 0F DC /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PADDUSW mm, mm/m64 0F DD /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Subtract packed integers 8-bit (V)PSUBB mm, mm/m64 0F F8 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PSUBW mm, mm/m64 0F F9 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PSUBD mm, mm/m64 0F FA /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Subtract packed signed integers with saturation 8-bit (V)PSUBSB mm, mm/m64 0F E8 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PSUBSW mm, mm/m64 0F E9 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Subtract packed unsigned integers with saturation 8-bit (V)PSUBUSB mm, mm/m64 0F D8 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PSUBUSW mm, mm/m64 0F D9 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Compare packed integers for equality 8-bit (V)PCMPEQB mm, mm/m64 0F 74 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PCMPEQW mm, mm/m64 0F 75 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PCMPEQD mm, mm/m64 0F 76 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Compare packed integers for signed greater-than 8-bit (V)PCMPGTB mm, mm/m64 0F 64 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PCMPGTW mm, mm/m64 0F 65 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PCMPGTD mm, mm/m64 0F 66 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
Multiply packed 16-bit signed integers, add results pairwise into 32-bit integers (V)PMADDWD mm, mm/m64 0F F5 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 32 Template:No
Multiply packed 16-bit signed integers, store high 16 bits of results (V)PMULHW mm, mm/m64 0F E5 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Multiply packed 16-bit integers, store low 16 bits of results (V)PMULLW mm, mm/m64 0F D5 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Vector bitwise AND (V)PAND mm, mm/m64 0F DB /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
Template:Yes F 64 64
Vector bitwise AND-NOT (V)PANDN mm, mm/m64 0F DF /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
Template:Yes F 64 64
Vector bitwise OR (V)POR mm, mm/m64 0F EB /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
Template:Yes F 64 64
Vector bitwise XOR (V)PXOR mm, mm/m64 0F EE /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
Template:Yes F 64 64
left-shift of packed integers, with common shift-amount 16-bit (V)PSLLW mm, imm8 0F 71 /6 ib Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
(V)PSLLW mm, mm/m64Template:Efn 0F F1 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PSLLD mm, imm8 Template:Nowrap Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
(V)PSLLD mm, mm/m64Template:Efn 0F F2 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 Template:No
64-bit (V)PSLLQ mm, imm8 Template:Nowrap Template:Yes Template:Yes Template:Yes Template:Yes F 64 64
(V)PSLLQ mm, mm/m64Template:Efn 0F F3 /r Template:Yes Template:Yes Template:Yes Template:Yes F 64 Template:No
Right-shift of packed signed integers, with common shift-amount 16-bit (V)PSRAW mm, imm8 0F 71 /4 ib Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
(V)PSRAW mm, mm/m64Template:Efn 0F E1 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PSRAD mm, imm8 0F 72 /4 ib Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
(V)PSRAD mm, mm/m64Template:Efn 0F E2 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 Template:No
Right-shift of packed unsigned integers, with common shift-amount 16-bit (V)PSRLW mm, imm8 0F 71 /2 ib Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
(V)PSRLW mm, mm/m64Template:Efn 0F D1 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PSRLD mm, imm8 0F 72 /2 ib Template:Yes Template:Yes Template:Yes Template:Yes F 32 32
(V)PSRLD mm, mm/m64Template:Efn 0F D2 /r Template:Yes Template:Yes Template:Yes Template:Yes F 32 Template:No
Template:Nowrap (V)PSRLQ mm, imm8 0F 73 /2 ib Template:Yes Template:Yes Template:Yes Template:Yes F 64 64
(V)PSRLQ mm, mm/m64Template:Efn 0F D3 /r Template:Yes Template:Yes Template:Yes Template:Yes F 64 Template:No

Template:Notelist Template:Vpad

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

Description Instruction mnemonics Basic opcode MMX
(no prefix)
SSE2
(66h prefix)
AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Added with SSE and MMX+
Perform shuffle of four 16-bit integers in 64-bit vector (MMX)Template:Efn Template:Nowrap 0F 70 /r ib rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes F 32 32
Perform shuffle of four 32-bit integers in 128-bit vector (SSE2) Template:Nowrap
Insert integer into 16-bit vector register lane (V)PINSRW mm,r32/m16,imm8 0F C4 /r ib Template:Yes Template:Yes Template:Yes Template:Yes BW Template:No Template:No
Extract integer from 16-bit vector register lane, with zero-extension (V)PEXTRW r32,mm,imm8Template:Efn 0F C5 /r ib Template:Yes Template:Yes Template:Yes Template:Yes BW Template:No Template:No
Create a bitmask made from the top bit of each byte in the source vector, and store to integer register (V)PMOVMSKB r32,mm 0F D7 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Minimum-value of packed unsigned 8-bit integers (V)PMINUB mm,mm/m64 0F DA /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
Maximum-value of packed unsigned 8-bit integers (V)PMAXUB mm,mm/m64 0F DE /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
Minimum-value of packed signed 16-bit integers (V)PMINSW mm,mm/m64 0F EA /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Maximum-value of packed signed 16-bit integers (V)PMAXSW mm,mm/m64 0F EE /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Rounded average of packed unsigned integers. The per-lane operation is:
dst ← (src1 + src2 + 1)>>1
8-bit (V)PAVGB mm,mm/m64 0F E0 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PAVGW mm,mm/m64 0F E3 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Multiply packed 16-bit unsigned integers, store high 16 bits of results (V)PMULHUW mm,mm/m64 0F E4 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Store vector register to memory using Non-Temporal Hint.

Memory operand required to be aligned for all (V)MOVNTDQ variants, but not for MOVNTQ.

MOVNTQ m64,mm(MMX)
(V)MOVNTDQ m128,xmm
0F E7 /r Template:Yes Template:Yes Template:Yes Template:Yes F Template:No Template:No
Compute sum of absolute differences for eight 8-bit unsigned integers, storing the result as a 64-bit integer.

For vector widths wider than 64 bits (SSE/AVX/AVX-512), this calculation is done separately for each 64-bit lane of the vectors, producing a vector of 64-bit integers.

(V)PSADBW mm,mm/m64 0F F6 /r Template:Yes Template:Yes Template:Yes Template:Yes BW Template:No Template:No
Unaligned store vector register to memory using byte write-mask, with Non-Temporal Hint.

First argument provides data to store, second argument provides byte write-mask (top bit of each byte).Template:Efn Address to store to is given by DS:DI/EDI/RDI (DS: segment overridable with segment-prefix).

MASKMOVQ mm,mm(MMX)
Template:Nowrap
0F F7 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Added with SSE2
Multiply packed 32-bit unsigned integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULUDQ mm,mm/m64 0F F4 /r Template:Yes Template:Yes Template:Yes Template:Yes F 64 64
Add packed 64-bit integers (V)PADDQ mm, mm/m64 0F D4 /r Template:Yes Template:Yes Template:Yes Template:Yes F 64 64
Subtract packed 64-bit integers (V)PSUBQ mm,mm/m64 0F FB /r Template:Yes Template:Yes Template:Yes Template:Yes F 64 64
Added with SSSE3
Vector Byte Shuffle (V)PSHUFB mm,mm/m64Template:Efn 0F38 00 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
Pairwise horizontal add of packed integers 16-bit (V)PHADDW mm,mm/m64Template:Efn 0F38 01 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
32-bit (V)PHADDD mm,mm/m64Template:Efn 0F38 02 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Pairwise horizontal add of packed 16-bit signed integers, with saturation (V)PHADDSW mm,mm/m64Template:Efn 0F38 03 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Multiply packed 8-bit signed and unsigned integers, add results pairwise into 16-bit signed integers with saturation. First operand is treated as unsigned, second operand as signed. (V)PMADDUBSW mm,mm/m64 0F38 04 /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Pairwise horizontal subtract of packed integers.

The higher-order integer of each pair is subtracted from the lower-order integer.

16-bit (V)PHSUBW mm,mm/m64Template:Efn 0F38 05 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
32-bit (V)PHSUBD mm,mm/m64Template:Efn 0F38 06 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Pairwise horizontal subtract of packed 16-bit signed integers, with saturation (V)PHSUBSW mm,mm/m64Template:Efn 0F38 07 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Modify packed integers in first source argument based on the sign of packed signed integers in second source argument. The per-lane operation performed is:
if( src2 < 0 ) dst ← -src1
else if( src2 == 0 ) dst ← 0
else dst ← src1
8-bit (V)PSIGNB mm,mm/m64 0F38 08 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
16-bit (V)PSIGNW mm,mm/m64 0F38 09 /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Template:Nowrap (V)PSIGND mm,mm/m64 0F38 0A /r Template:Yes Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Multiply packed 16-bit signed integers, then perform rounding and scaling to produce a 16-bit signed integer result.

The calculation performed per 16-bit lane is:
Template:Code

(V)PMULHRSW mm,mm/m64 0F38 0B /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
Absolute value of packed signed integers 8-bit (V)PABSB mm,mm/m64 0F38 1C /r Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No
16-bit (V)PABSW mm,mm/m64 0F38 1D /r Template:Yes Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PABSD mm,mm/m64 0F38 1E /r rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
64-bit Template:Nowrap Template:Yes F 64 64
Packed Align Right.

Concatenate two input vectors into a double-size vector, then right-shift by the number of bytes specified by the imm8 argument. The shift-amount is not masked - if the shift-amount is greater than the input vector size, zeroes will be shifted in.

(V)PALIGNR mm,mm/m64,imm8Template:Efn Template:Nowrap Template:Yes Template:Yes Template:Yes Template:Yes BW 8 Template:No

Template:Notelist Template:Vpad

SSE instructions and extended variants thereof

Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

For the instructions in the below table, the following considerations apply unless otherwise noted:

  • Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
  • FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
  • The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
  • For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical; however, some processors with SSE2 implement integer, FP32 and FP64 execution units as three different execution clusters, where forwarding results from one cluster to another may incur performance penalties, and where such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instruction under SSE2 - PXOR, XORPS, and XORPD - intended for use on integer, FP32, and FP64 data, respectively.)
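The three SSE2 XOR forms mentioned above map directly onto three distinct intrinsics (an added sketch, not part of the original listing; standard intrinsic names, SSE2-capable compiler assumed):

#include <emmintrin.h>

/* Three functionally identical 128-bit bitwise XORs, intended for
   integer, FP32 and FP64 data respectively. */
__m128i xor_int(__m128i a, __m128i b) { return _mm_xor_si128(a, b); }  /* PXOR  */
__m128  xor_f32(__m128  a, __m128  b) { return _mm_xor_ps(a, b);    }  /* XORPS */
__m128d xor_f64(__m128d a, __m128d b) { return _mm_xor_pd(a, b);    }  /* XORPD */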

Instruction Description Basic opcode Single Precision (FP32) Double Precision (FP64) rowspan=3 Template:Vert header
Packed (no prefix) Scalar (F3h prefix) Packed (66h prefix) Scalar (F2h prefix)
SSE instruction Template:Resize Template:Resize SSE instruction Template:Resize
Template:ResizeTemplate:Efn
Template:Resize SSE2 instruction Template:Resize Template:Resize SSE2 instruction Template:Resize
Template:ResizeTemplate:Efn
Template:Resize
Unaligned load from memory or vector register 0F 10 /r MOVUPS x,x/m128 Template:Yes Template:Yes MOVSS x,x/m32 Template:Yes Template:Yes MOVUPD x,x/m128 Template:Yes Template:Yes Template:NowrapTemplate:Efn Template:Yes Template:Yes Template:No
Unaligned store to memory or vector register 0F 11 /r MOVUPS x/m128,x Template:Yes Template:Yes MOVSS x/m32,x Template:Yes Template:Yes MOVUPD x/m128,x Template:Yes Template:Yes Template:NowrapTemplate:Efn Template:Yes Template:Yes Template:No
Load 64 bits from memory or upper half of XMM register into the lower half of XMM register while keeping the upper half unchanged 0F 12 /r MOVHLPS x,x Template:Yes Template:Yes rowspan=2 Template:N/a rowspan=2 Template:N/a rowspan=2 Template:N/a MOVLPD x,m64 rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:N/a rowspan=2 Template:N/a rowspan=2 Template:N/a rowspan=2 Template:No
MOVLPS x,m64 Template:Yes Template:Yes
Store 64 bits to memory from lower half of XMM register 0F 13 /r MOVLPS m64,x Template:Yes Template:Yes Template:No Template:No Template:No MOVLPD m64,x Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Unpack and interleave low-order floating-point values 0F 14 /r Template:Nowrap Template:Yes Template:Yes Template:No Template:No Template:No Template:Nowrap Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Unpack and interleave high-order floating-point values 0F 15 /r Template:Nowrap Template:Yes Template:Yes Template:No Template:No Template:No Template:Nowrap Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Load 64 bits from memory or lower half of XMM register into the upper half of XMM register while keeping the lower half unchanged 0F 16 /r MOVLHPS x,x Template:Yes Template:Yes rowspan=2 Template:N/a rowspan=2 Template:N/a rowspan=2 Template:N/a MOVHPD x,m64 rowspan=2 Template:Yes rowspan=2 Template:Yes rowspan=2 Template:No rowspan=2 Template:No rowspan=2 Template:No rowspan=2 Template:No
MOVHPS x,m64 Template:Yes Template:Yes
Store 64 bits to memory from upper half of XMM register 0F 17 /r MOVHPS m64,x Template:Yes Template:Yes Template:No Template:No Template:No MOVHPD m64,x Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Aligned load from memory or vector register Template:Nowrap MOVAPS x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No MOVAPD x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Aligned store to memory or vector register 0F 29 /r MOVAPS x/m128,x Template:Yes Template:Yes Template:No Template:No Template:No MOVAPD x/m128,x Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Integer to floating-point conversion using general-registers, MMX-registers or memory as source 0F 2A /r Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes RC
Non-temporal store to memory from vector register.

The packed variants require aligned memory addresses even in VEX/EVEX-encoded forms.

0F 2B /r MOVNTPS m128,x Template:Yes Template:Yes Template:Maybe Template:No Template:No MOVNTPD m128,x Template:Yes Template:Yes Template:Maybe Template:No Template:No Template:No
Floating-point to integer conversion with truncation, using general-purpose registers or MMX-registers as destination 0F 2C /r Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes SAE
Floating-point to integer conversion, using general-purpose registers or MMX-registers as destination 0F 2D /r Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes Template:NowrapTemplate:Efn Template:No Template:No Template:Resize Template:Yes Template:Yes RC
Unordered compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2E /r UCOMISS x,x/m32 Template:Yes Template:Yes Template:No Template:No Template:No UCOMISD x,x/m64 Template:Yes Template:Yes Template:No Template:No Template:No SAE
Compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2F /r COMISS x,x/m32 Template:Yes Template:Yes Template:No Template:No Template:No COMISD x,x/m64 Template:Yes Template:Yes Template:No Template:No Template:No SAE
Extract packed floating-point sign mask 0F 50 /r Template:Nowrap Template:Yes Template:No Template:No Template:No Template:No Template:Nowrap Template:Yes Template:No Template:No Template:No Template:No Template:N/a
Floating-point Square Root 0F 51 /r SQRTPS x,x/m128 Template:Yes Template:Yes SQRTSS x,x/m32 Template:Yes Template:Yes SQRTPD x,x/m128 Template:Yes Template:Yes SQRTSD x,x/m64 Template:Yes Template:Yes RC
Reciprocal Square Root ApproximationTemplate:Efn 0F 52 /r Template:Nowrap Template:Yes Template:No Template:Nowrap Template:Yes Template:No Template:No Template:No Template:No Template:No Template:No Template:No Template:N/a
Reciprocal ApproximationTemplate:Efn 0F 53 /r RCPPS x,x/m128 Template:Yes Template:No RCPSS x,x/m32 Template:Yes Template:No Template:No Template:No Template:No Template:No Template:No Template:No Template:N/a
Vector bitwise AND 0F 54 /r ANDPS x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No ANDPD x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Vector bitwise AND-NOT 0F 55 /r ANDNPS x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No ANDNPD x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Vector bitwise OR 0F 56 /r ORPS x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No ORPD x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Vector bitwise XORTemplate:Efn 0F 57 /r XORPS x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No XORPD x,x/m128 Template:Yes Template:Yes Template:No Template:No Template:No Template:No
Floating-point Add 0F 58 /r ADDPS x,x/m128 Template:Yes Template:Yes ADDSS x,x/m32 Template:Yes Template:Yes ADDPD x,x/m128 Template:Yes Template:Yes ADDSD x,x/m64 Template:Yes Template:Yes RC
Floating-point Multiply 0F 59 /r MULPS x,x/m128 Template:Yes Template:Yes MULSS x,x/m32 Template:Yes Template:Yes MULPD x,x/m128 Template:Yes Template:Yes MULSD x,x/m64 Template:Yes Template:Yes RC
Convert between floating-point formats
(FP32→FP64, FP64→FP32)
0F 5A /r Template:Nowrap
(SSE2)
Template:Yes Template:Yes Template:Nowrap
(SSE2)
Template:Yes Template:Yes Template:Nowrap Template:Yes Template:Yes Template:Nowrap Template:Yes Template:Yes SAE,
RCTemplate:Efn
Floating-point Subtract 0F 5C /r SUBPS x,x/m128 Template:Yes Template:Yes SUBSS x,x/m32 Template:Yes Template:Yes SUBPD x,x/m128 Template:Yes Template:Yes SUBSD x,x/m64 Template:Yes Template:Yes RC
Floating-point Minimum ValueTemplate:Efn 0F 5D /r MINPS x,x/m128 Template:Yes Template:Yes MINSS x,x/m32 Template:Yes Template:Yes MINPD x,x/m128 Template:Yes Template:Yes MINSD x,x/m64 Template:Yes Template:Yes SAE
Floating-point Divide 0F 5E /r DIVPS x,x/m128 Template:Yes Template:Yes DIVSS x,x/m32 Template:Yes Template:Yes DIVPD x,x/m128 Template:Yes Template:Yes DIVSD x,x/m64 Template:Yes Template:Yes RC
Floating-point Maximum ValueTemplate:Efn 0F 5F /r MAXPS x,x/m128 Template:Yes Template:Yes MAXSS x,x/m32 Template:Yes Template:Yes MAXPD x,x/m128 Template:Yes Template:Yes MAXSD x,x/m64 Template:Yes Template:Yes SAE
Floating-point compare. Result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but opmask register for AVX-512. Comparison function is specified by imm8 argument.Template:Efn Template:Nowrap Template:Nowrap Template:Yes Template:Yes Template:Nowrap Template:Yes Template:Yes Template:Nowrap Template:Yes Template:Yes Template:Nowrap
Template:Efn
Template:Yes Template:Yes SAE
Packed Interleaved Shuffle.

Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.

Template:Nowrap Template:NowrapTemplate:Efn Template:Yes Template:Yes Template:No Template:No Template:No Template:NowrapTemplate:Efn Template:Yes Template:Yes Template:No Template:No Template:No Template:No

Template:Notelist Template:Vpad

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

  • The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
  • The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
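As an example of how one of these instructions widens from its SSE4.1 form to an AVX2 VEX.256 form (an added sketch, not taken from the table below; standard Intel intrinsics, -msse4.1/-mavx2 assumed):

#include <immintrin.h>

/* Zero-extend packed 8-bit unsigned integers to 16-bit integers. */
__m128i widen_sse41(__m128i a) { return _mm_cvtepu8_epi16(a); }     /* PMOVZXBW xmm, xmm            */
__m256i widen_avx2 (__m128i a) { return _mm256_cvtepu8_epi16(a); }  /* VPMOVZXBW ymm, xmm (VEX.256) */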
Description Instruction mnemonics Basic opcode SSE (66h prefix) AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Added with SSE2
Unpack and interleave low-order 64-bit integers (V)PUNPCKLQDQ xmm,xmm/m128Template:Efn 0F 6C /r Template:Yes Template:Yes Template:Yes F 64 64
Unpack and interleave high-order 64-bit integers (V)PUNPCKHQDQ xmm,xmm/m128Template:Efn 0F 6D /r Template:Yes Template:Yes Template:Yes F 64 64
Right-shift 128-bit unsigned integer by specified number of bytes (V)PSRLDQ xmm,imm8Template:Efn Template:Nowrap Template:Yes Template:Yes Template:Yes BW Template:No Template:No
Left-shift 128-bit integer by specified number of bytes (V)PSLLDQ xmm,imm8Template:Efn 0F 73 /7 ib Template:Yes Template:Yes Template:Yes BW Template:No Template:No
Move 64-bit scalar value from xmm register to xmm register or memory (V)MOVQ xmm/m64,xmm 0F D6 /r Template:Yes Template:Yes Template:Yes F Template:No Template:No
Added with SSE4.1
Variable blend packed bytes.

For each byte lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding byte lane of XMM0.

PBLENDVB xmm,xmm/m128
Template:Nowrap
0F38 10 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a
Sign-extend packed integers into wider packed integers 8-bit → 16-bit (V)PMOVSXBW xmm,xmm/m64 0F38 20 /r Template:Yes Template:Yes Template:Yes BW 16 Template:No
8-bit → 32-bit (V)PMOVSXBD xmm,xmm/m32 0F38 21 /r Template:Yes Template:Yes Template:Yes F 32 Template:No
8-bit → 64-bit (V)PMOVSXBQ xmm,xmm/m16 0F38 22 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
16-bit → 32-bit (V)PMOVSXWD xmm,xmm/m64 0F38 23 /r Template:Yes Template:Yes Template:Yes F 32 Template:No
16-bit → 64-bit (V)PMOVSXWQ xmm,xmm/m32 0F38 24 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
32-bit → 64-bit (V)PMOVSXDQ xmm,xmm/m64 0F38 25 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
Multiply packed 32-bit signed integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULDQ xmm,xmm/m128 0F38 28 /r Template:Yes Template:Yes Template:Yes F 64 64
Compare packed 64-bit integers for equality (V)PCMPEQQ xmm,xmm/m128 0F38 29 /r Template:Yes Template:Yes Template:Yes F 64 64
Aligned non-temporal vector load from memory.Template:Efn (V)MOVNTDQA xmm,m128 0F38 2A /r Template:Yes Template:Yes Template:Yes F Template:No Template:No
Pack 32-bit unsigned integers to 16-bit, with saturation Template:Nowrap 0F38 2B /r Template:Yes Template:Yes Template:Yes BW 16 32
Zero-extend packed integers into wider packed integers 8-bit → 16-bit (V)PMOVZXBW xmm,xmm/m64 0F38 30 /r Template:Yes Template:Yes Template:Yes BW 16 Template:No
8-bit → 32-bit (V)PMOVZXBD xmm,xmm/m32 0F38 31 /r Template:Yes Template:Yes Template:Yes F 32 Template:No
8-bit → 64-bit (V)PMOVZXBQ xmm,xmm/m16 0F38 32 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
16-bit → 32-bit (V)PMOVZXWD xmm,xmm/m64 0F38 33 /r Template:Yes Template:Yes Template:Yes F 32 Template:No
16-bit → 64-bit (V)PMOVZXWQ xmm,xmm/m32 0F38 34 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
Template:Nowrap (V)PMOVZXDQ xmm,xmm/m64 0F38 35 /r Template:Yes Template:Yes Template:Yes F 64 Template:No
Packed minimum-value of signed integers 8-bit (V)PMINSB xmm,xmm/m128 0F38 38 /r Template:Yes Template:Yes Template:Yes BW 8 Template:No
32-bit (V)PMINSD xmm,xmm/m128 0F38 39 /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
64-bit VPMINSQ xmm,xmm/m128(AVX-512) Template:Yes F 64 64
Packed minimum-value of unsigned integers 16-bit (V)PMINUW xmm,xmm/m128 0F38 3A /r Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PMINUD xmm,xmm/m128
0F38 3B /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
64-bit VPMINUQ xmm,xmm/m128(AVX-512) Template:Yes F 64 64
Packed maximum-value of signed integers 8-bit (V)PMAXSB xmm,xmm/m128 0F38 3C /r Template:Yes Template:Yes Template:Yes BW 8 Template:No
32-bit (V)PMAXSD xmm,xmm/m128 0F38 3D /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
64-bit VPMAXSQ xmm,xmm/m128(AVX-512) Template:Yes F 64 64
Packed maximum-value of unsigned integers 16-bit (V)PMAXUW xmm,xmm/m128 0F38 3E /r Template:Yes Template:Yes Template:Yes BW 16 Template:No
32-bit (V)PMAXUD xmm,xmm/m128
0F38 3F /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
64-bit VPMAXUQ xmm,xmm/m128(AVX-512) Template:Yes F 64 64
Multiply packed 32/64-bit integers, store low half of results (V)PMULLD xmm,xmm/m128
Template:Nowrap(AVX-512)
0F38 40 /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 32 32
Template:Yes DQ 64 64
Packed Horizontal Word Minimum

Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector.

Template:Nowrap 0F38 41 /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Blend Packed Words.

For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.

Template:Nowrap Template:Nowrap Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Extract integer from indexed lane of vector register, and store to GPR or memory.

Zero-extended if stored to GPR.

8-bit Template:Nowrap 0F3A 14 /r ib Template:Yes Template:Yes Template:Yes BW Template:No Template:No
16-bit Template:Nowrap 0F3A 15 /r ib Template:Yes Template:Yes Template:Yes BW Template:No Template:No
32-bit (V)PEXTRD r/m32,xmm,imm8 0F3A 16 /r ib Template:Yes Template:YesTemplate:Efn Template:Yes DQ Template:No Template:No
64-bit
(x86-64)
Template:Nowrap Template:Yes Template:Yes Template:Yes DQ Template:No Template:No
Insert integer from general-purpose register into indexed lane of vector register 8-bit (V)PINSRB xmm,r32/m8,imm8Template:Efn 0F3A 20 /r ib Template:Yes Template:Yes Template:Yes BW Template:No Template:No
32-bit (V)PINSRD xmm,r32/m32,imm8 0F3A 22 /r ib Template:Yes Template:YesTemplate:Efn Template:Yes DQ Template:No Template:No
64-bit
(x86-64)
Template:Nowrap Template:Yes Template:Yes Template:Yes DQ Template:No Template:No
Compute Multiple Packed Sums of Absolute Difference.

The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source argument, and writes the sums to sequential 16-bit lanes of the destination register. If the two source arguments src1 and src2 are considered to be two 16-entry arrays of uint8 values and temp is considered to be an 8-entry array of uint16 values, then the operation of the instruction is:

for i = 0 to 7 do
    temp[i] := 0
    for j = 0 to 3 do
         a := src1[ i+(imm8[2]*4)+j ]
         b := src2[ (imm8[1:0]*4)+j ]
         temp[i] := temp[i] + abs(a-b)
    done
done
dst := temp

For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes where each lane internally performs the same operation as the 128-bit variant of the instruction - except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8.

Template:Nowrap Template:Nowrap Template:Yes Template:Yes Template:Yes 10.2Template:Efn 16 Template:No
Added with SSE 4.2
Compare packed 64-bit signed integers for greater-than (V)PCMPGTQ xmm, xmm/m128 0F38 37 /r Template:Yes Template:Yes Template:Yes F 64 64
Packed Compare Explicit Length Strings, Return Mask Template:Nowrap 0F3A 60 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Packed Compare Explicit Length Strings, Return Index (V)PCMPESTRI xmm,xmm/m128,imm8 0F3A 61 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Packed Compare Implicit Length Strings, Return Mask (V)PCMPISTRM xmm,xmm/m128,imm8 0F3A 62 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a
Packed Compare Implicit Length Strings, Return Index (V)PCMPISTRI xmm,xmm/m128,imm8 0F3A 63 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a

Template:Notelist Template:Vpad

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.
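For instance, the int↔FP32 conversion instructions listed under "Added with SSE2" below are exposed through intrinsics roughly as follows (an added sketch, not part of the original listing; standard intrinsic names, SSE2 assumed):

#include <emmintrin.h>

/* CVTDQ2PS: packed signed 32-bit integers -> FP32 */
__m128 int_to_fp32(__m128i a) { return _mm_cvtepi32_ps(a); }

/* CVTTPS2DQ: packed FP32 -> signed 32-bit integers, round toward zero */
__m128i fp32_to_int_trunc(__m128 a) { return _mm_cvttps_epi32(a); }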

Description Instruction mnemonics Basic opcode   SSE   AVX
(VEX prefix)
AVX-512 (EVEX prefix)
supported subset lane bcst rc/sae
Added with SSE
Load MXCSR (Media eXtension Control and Status Register) from memory (V)LDMXCSR m32 NP 0F AE /2 Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Store MXCSR to memory (V)STMXCSR m32 NP 0F AE /3 Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Added with SSE2
Move a 64-bit data item from MMX register to bottom half of XMM register. Top half is zeroed out. MOVQ2DQ xmm,mm F3 0F D6 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Move a 64-bit data item from bottom half of XMM register to MMX register. MOVDQ2Q mm,xmm F2 0F D6 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Load a 64-bit integer from memory or XMM register to bottom 64 bits of XMM register, with zero-fill (V)MOVQ xmm,xmm/m64 F3 0F 7E /r Template:Yes Template:Yes Template:Yes F Template:No Template:No Template:No
Vector load from unaligned memory or vector register (V)MOVDQU xmm,xmm/m128 F3 0F 6F /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 64 Template:No Template:No
Template:Yes F 32 Template:No Template:No
F2 0F 6F /r rowspan=2 Template:No rowspan=2 Template:No Template:Yes BW 16 Template:No Template:No
Template:Yes BW 8 Template:No Template:No
Vector store to unaligned memory or vector register (V)MOVDQU xmm/m128,xmm F3 0F 7F /r rowspan=2 Template:Yes rowspan=2 Template:Yes Template:Yes F 64 Template:No Template:No
Template:Yes F 32 Template:No Template:No
F2 0F 7F /r rowspan=2 Template:No rowspan=2 Template:No Template:Yes BW 16 Template:No Template:No
Template:Yes BW 8 Template:No Template:No
Shuffle the four top 16-bit lanes of source vector, then place result in top half of destination vector Template:NowrapTemplate:Efn F3 0F 70 /r ib Template:Yes Template:Yes Template:Yes BW 16 Template:No Template:No
Shuffle the four bottom 16-bit lanes of source vector, then place result in bottom half of destination vector Template:NowrapTemplate:Efn F2 0F 70 /r ib Template:Yes Template:Yes Template:Yes BW 16 Template:No Template:No
Convert packed signed 32-bit integers to FP32 (V)CVTDQ2PS xmm,xmm/m128 NP 0F 5B /r Template:Yes Template:Yes Template:Yes F 32 Template:Yes RC
Convert packed FP32 values to packed signed 32-bit integers (V)CVTPS2DQ xmm,xmm/m128 66 0F 5B /r Template:Yes Template:Yes Template:Yes F 32 Template:Yes RC
Convert packed FP32 values to packed signed 32-bit integers, with round-to-zero Template:Nowrap F3 0F 5B /r Template:Yes Template:Yes Template:Yes F 32 Template:Yes SAE
Convert packed FP64 values to packed signed 32-bit integers, with round-to-zero Template:Nowrap 66 0F E6 /r Template:Yes Template:Yes Template:Yes F 32 Template:Yes SAE
Convert packed signed 32-bit integers to FP64 (V)CVTDQ2PD xmm,xmm/m64 F3 0F E6 /r Template:Yes Template:Yes Template:Yes F 64 Template:Yes RCTemplate:Efn
Convert packed FP64 values to packed signed 32-bit integers (V)CVTPD2DQ xmm,xmm/m128 F2 0F E6 /r Template:Yes Template:Yes Template:Yes F 32 Template:Yes RC
Added with SSE3
Duplicate floating-point values from even-numbered lanes to next odd-numbered lanes up 32-bit (V)MOVSLDUP xmm,xmm/m128 F3 0F 12 /r Template:Yes Template:Yes Template:Yes F 32 Template:No Template:No
64-bit (V)MOVDDUP xmm,xmm/m64 F2 0F 12 /r Template:Yes Template:Yes Template:Yes F 64 Template:No Template:No
Duplicate FP32 values from odd-numbered lanes to next even-numbered lanes down (V)MOVSHDUP xmm,xmm/m128 F3 0F 16 /r Template:Yes Template:Yes Template:Yes F 32 Template:No Template:No
Packed pairwise horizontal addition of floating-point values 32-bit (V)HADDPS xmm,xmm/m128Template:Efn F2 0F 7C /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit (V)HADDPD xmm,xmm/m128Template:Efn 66 0F 7C /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Packed pairwise horizontal subtraction of floating-point values 32-bit (V)HSUBPS xmm,xmm/m128Template:Efn F2 0F 7D /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit (V)HSUBPD xmm,xmm/m128Template:Efn 66 0F 7D /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Packed floating-point add/subtract in alternating lanes. Even-numbered lanes (counting from 0) do subtract, odd-numbered lanes do add. Template:Nowrap (V)ADDSUBPS xmm,xmm/m128 F2 0F D0 /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Template:Nowrap (V)ADDSUBPD xmm,xmm/m128 66 0F D0 /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Vector load from unaligned memory with looser semantics than (V)MOVDQU.

Unlike (V)MOVDQU, it may fetch data more than once or, for a misaligned access, fetch additional data up until the next 16/32-byte alignment boundaries below/above the actually-requested data.

(V)LDDQU xmm,m128 F2 0F F0 /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Added with SSE4.1
Vector logical test.

Sets ZF=1 if the bitwise-AND of the first operand and the second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if the bitwise-AND of the second operand and the bitwise-NOT of the first operand results in all-0s, CF=0 otherwise.

(V)PTEST xmm,xmm/m128 66 0F38 17 /r Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Variable blend packed floating-point values.

For each lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding lane of XMM0.

32-bit BLENDVPS xmm,xmm/m128
Template:Nowrap
66 0F38 14 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit BLENDVPD xmm,xmm/m128
Template:Nowrap
66 0F38 15 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Rounding of packed floating-point values to integer.

Rounding mode specified by imm8 argument.

32-bit Template:Nowrap 66 0F3A 08 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit Template:Nowrap 66 0F3A 09 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Rounding of scalar floating-point value to integer. 32-bit Template:Nowrap 66 0F3A 0A /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit Template:Nowrap 66 0F3A 0B /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit. 32-bit (V)BLENDPS xmm,xmm/m128,imm8 66 0F3A 0C /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit Template:Nowrap Template:Nowrap Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Extract 32-bit lane of XMM register to general-purpose register or memory location.

Bits[1:0] of imm8 are used to select the lane.

(V)EXTRACTPS r/m32,xmm,imm8 66 0F3A 17 /r ib Template:Yes Template:Yes Template:Yes F Template:No Template:No Template:No
Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register.

If the source argument is an XMM register, then bits[7:6] of the imm8 are used to select which 32-bit lane to take the source value from; otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination register lane specified by bits[5:4] of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out - bits[3:0] of the imm8 provide a bitmap of which lanes to zero out.

Template:Nowrap 66 0F3A 21 /r ib Template:Yes Template:Yes Template:Yes F Template:No Template:No Template:No
4-component dot-product of 32-bit floating-point values.

Bits [7:4] of the imm8 specify which lanes should participate in the dot-product, bits[3:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

Template:NowrapTemplate:Efn 66 0F3A 40 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
2-component dot-product of 64-bit floating-point values.

Bits [5:4] of the imm8 specify which lanes should participate in the dot-product, bits[1:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

Template:NowrapTemplate:Efn 66 0F3A 41 /r ib Template:Yes Template:Yes Template:No Template:N/a Template:N/a Template:N/a Template:N/a
Added with SSE4a (AMD only)
64-bit bitfield insert, using the low 64 bits of XMM registers.

First argument is an XMM register to insert bitfield into, second argument is a source register containing the bitfield to insert (starting from bit 0).

For the 4-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bit-offset to insert bitfield at. For the 2-argument version, the length and offset are instead taken from bits [69:64] and [77:72] of the second argument, respectively.

INSERTQ xmm,xmm,imm8,imm8 Template:Nowrap Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
INSERTQ xmm,xmm F2 0F 79 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
64-bit bitfield extract, from the lower 64 bits of an XMM register.

The first argument serves as both source that bitfield is extracted from and destination that bitfield is written to.

For the 3-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains bitfield length at bits[5:0] and bit-offset at bits[13:8].

EXTRQ xmm,imm8,imm8 66 0F 78 /0 ib ib Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a
EXTRQ xmm,xmm 66 0F 79 /r Template:Yes Template:No Template:No Template:N/a Template:N/a Template:N/a Template:N/a

Template:Notelist Template:Vpad



AVX instructions

AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer.

Vector operations on 256-bit registers.

Instruction Description
VBROADCASTSS Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register.
VBROADCASTSD
VBROADCASTF128
VINSERTF128 Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTF128 Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VMASKMOVPS Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. On the AMD Jaguar processor architecture, this instruction with a memory source operand takes more than 300 clock cycles when the mask is zero, in which case the instruction should do nothing. This appears to be a design flaw.[1]
VMASKMOVPD
VPERMILPS Permute In-Lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits as two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes.[2]
VPERMILPD
VPERM2F128 Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VZEROALL Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use.
VZEROUPPER Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use.
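A couple of the AVX instructions above, expressed with the standard intrinsics (an added sketch, not part of the original table; -mavx assumed):

#include <immintrin.h>

/* VBROADCASTSS: replicate one FP32 value from memory into all eight lanes */
__m256 bcast(const float *p) { return _mm256_broadcast_ss(p); }

/* VPERMILPS (immediate form): in-lane shuffle - the same 4-element pattern
   is applied independently to the low and high 128-bit halves.
   0xB1 selects element order 1,0,3,2 within each lane (swap adjacent pairs). */
__m256 swap_pairs(__m256 v) { return _mm256_permute_ps(v, 0xB1); }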

Half-precision floating-point conversion.

Instruction Meaning
Template:Mono Convert four half-precision floating point values in memory or the bottom half of an XMM register to four single-precision floating-point values in an XMM register
Template:Mono Convert eight half-precision floating point values in memory or an XMM register (the bottom half of a YMM register) to eight single-precision floating-point values in a YMM register
Template:Mono Convert four single-precision floating point values in an XMM register to half-precision floating-point values in memory or the bottom half of an XMM register
Template:Nowrap Convert eight single-precision floating point values in a YMM register to half-precision floating-point values in memory or an XMM register
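
These half-precision conversions (the F16C instruction set extension) are also exposed through compiler intrinsics. A minimal sketch, assuming a compiler with F16C support enabled (e.g. -mf16c); the function name and rounding choice are illustrative:

  #include <immintrin.h>
  #include <stdint.h>

  /* Widen eight packed half-precision values to single precision and convert them back. */
  void roundtrip_f16(const uint16_t half_in[8], uint16_t half_out[8])
  {
      __m128i h = _mm_loadu_si128((const __m128i *)half_in);
      __m256  s = _mm256_cvtph_ps(h);                               /* half  -> float */
      __m128i r = _mm256_cvtps_ph(s, _MM_FROUND_TO_NEAREST_INT |
                                     _MM_FROUND_NO_EXC);            /* float -> half  */
      _mm_storeu_si128((__m128i *)half_out, r);
  }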

AVX2

AVX2 was introduced in Intel's Haswell microarchitecture and AMD's Excavator.

Expansion of most vector integer SSE and AVX instructions to 256 bits

Instruction Description
VBROADCASTSS Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register-source versions of the same instructions in AVX1. There is, however, no 128-bit version, but the same effect can be achieved using VINSERTF128.
VBROADCASTSD
VPBROADCASTB Copy an 8, 16, 32 or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register.
VPBROADCASTW
VPBROADCASTD
VPBROADCASTQ
VBROADCASTI128 Copy a 128-bit memory operand to all elements of a YMM vector register.
VINSERTI128 Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged.
VEXTRACTI128 Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand.
VGATHERDPD Gathers single- or double-precision floating-point values using either 32- or 64-bit indices and a scale factor (see the C sketch following this table).
VGATHERQPD
VGATHERDPS
VGATHERQPS
VPGATHERDD Gathers 32- or 64-bit integer values using either 32- or 64-bit indices and a scale factor.
VPGATHERDQ
VPGATHERQD
VPGATHERQQ
VPMASKMOVD Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged.
VPMASKMOVQ
VPERMPS Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMD
VPERMPD Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector.
VPERMQ
VPERM2I128 Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector.
VPBLENDD Doubleword immediate version of the PBLEND instructions from SSE4.
VPSLLVD Shift left logical. Allows variable shifts where each element is shifted according to the packed input.
VPSLLVQ
VPSRLVD Shift right logical. Allows variable shifts where each element is shifted according to the packed input.
VPSRLVQ
VPSRAVD Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input.
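
For instance, the 32-bit-index single-precision gather above can be reached from C through intrinsics. A minimal sketch, assuming a compiler with AVX2 enabled (e.g. -mavx2); the function name and data layout are illustrative:

  #include <immintrin.h>

  /* Gather eight floats table[idx[0]] .. table[idx[7]] with a single gather instruction. */
  __m256 gather8(const float *table, const int idx[8])
  {
      __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
      return _mm256_i32gather_ps(table, vindex, 4);   /* scale = 4: indices address 4-byte floats */
  }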

FMA3 and FMA4 instructions

Template:Main Floating-point fused multiply-add instructions were introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands – a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and was dropped from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes – of the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (x and y outside the given ranges will result in something that is not an FMA3 instruction.)
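
A small C sketch of this nibble scheme (a purely illustrative helper, not part of any real decoder API) maps an FMA3 opcode byte to its operand ordering and operation:

  /* Decode an FMA3 opcode byte (0x96..0xBF) per the nibble scheme described above. */
  static const char *fma3_ordering(unsigned op)    /* top nibble: 9/A/B */
  {
      switch (op >> 4) {
      case 0x9: return "132";
      case 0xA: return "213";
      case 0xB: return "231";
      default:  return 0;      /* not an FMA3 opcode */
      }
  }

  static const char *fma3_operation(unsigned op)   /* bottom nibble: 6..F */
  {
      static const char *ops[10] = {
          "FMADDSUB", "FMSUBADD", "FMADD (packed)", "FMADD (scalar)",
          "FMSUB (packed)", "FMSUB (scalar)", "FNMADD (packed)", "FNMADD (scalar)",
          "FNMSUB (packed)", "FNMSUB (scalar)"
      };
      unsigned y = op & 0xF;
      return (y >= 6) ? ops[y - 6] : 0;
  }
  /* Example: 0xA8 decodes as FMADD (packed) with '213' ordering, i.e. VFMADD213PS/PD. */
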
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:

  • vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
  • vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
  • vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
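
From C, the ordering is not chosen explicitly: the FMA intrinsics express only the operation, and the compiler selects whichever of the 132/213/231 forms best suits its register allocation. A minimal sketch, assuming a compiler with FMA enabled (e.g. -mfma); the function name is illustrative:

  #include <immintrin.h>

  /* Compute a*b + c with a single rounding, via one of the VFMADDxxxSD forms. */
  double fma_scalar(double a, double b, double c)
  {
      __m128d r = _mm_fmadd_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
      return _mm_cvtsd_f64(r);
  }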

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,[3] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similar to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, of the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:

  • vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
  • vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
  • vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
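
The interaction between VEX.W and the immediate byte can be summarised with a small illustrative helper (not part of any real disassembler API):

  /* For an FMA4 instruction, report which encoding slot supplies the third and fourth
     source operands, given VEX.W and the trailing 8-bit immediate. */
  static void fma4_operand_slots(int vex_w, unsigned char imm8,
                                 const char **src3, const char **src4,
                                 unsigned *imm_reg)
  {
      *imm_reg = imm8 >> 4;                   /* register number taken from imm8 bits 7:4 */
      if (vex_w == 0) { *src3 = "ModR/M r/m";         *src4 = "imm8[7:4] register"; }
      else            { *src3 = "imm8[7:4] register"; *src4 = "ModR/M r/m";         }
  }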


Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:

Basic operation Opcode byte FP32 instructions FP64 instructions FP16 instructions
(AVX512-FP16)
BF16 instructions
(AVX10.2)
Packed alternating multiply-add/subtract
  • (A*B)-C in even-numbered lanesTemplate:Efn
  • (A*B)+C in odd-numbered lanes
96 VFMADDSUB132PS VFMADDSUB132PD VFMADDSUB132PH Template:N/a
A6 VFMADDSUB213PS VFMADDSUB213PD VFMADDSUB213PH Template:N/a
B6 VFMADDSUB231PS VFMADDSUB231PD VFMADDSUB231PH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Packed alternating multiply-subtract/add
  • (A*B)+C in even-numbered lanes
  • (A*B)-C in odd-numbered lanes
97 VFMSUBADD132PS VFMSUBADD132PD VFMSUBADD132PH Template:N/a
A7 VFMSUBADD213PS VFMSUBADD213PD VFMSUBADD213PH Template:N/a
B7 VFMSUBADD231PS VFMSUBADD231PD VFMSUBADD231PH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Packed multiply-add
(A*B)+C
98 VFMADD132PS VFMADD132PD VFMADD132PH VFMADD132BF16
A8 VFMADD213PS VFMADD213PD VFMADD213PH VFMADD213BF16
B8 VFMADD231PS VFMADD231PD VFMADD231PH VFMADD231BF16
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Scalar multiply-add
(A*B)+C
99 VFMADD132SS VFMADD132SD VFMADD132SH Template:N/a
A9 VFMADD213SS VFMADD213SD VFMADD213SH Template:N/a
B9 VFMADD231SS VFMADD231SD VFMADD231SH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Packed multiply-subtract
(A*B)-C
9A VFMSUB132PS VFMSUB132PD VFMSUB132PH VFMSUB132BF16
AA VFMSUB213PS VFMSUB213PD VFMSUB213PH VFMSUB213BF16
BA VFMSUB231PS VFMSUB231PD VFMSUB231PH VFMSUB231BF16
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Scalar multiply-subtract
(A*B)-C
9B VFMSUB132SS VFMSUB132SD VFMSUB132SH Template:N/a
AB VFMSUB213SS VFMSUB213SD VFMSUB213SH Template:N/a
BB VFMSUB231SS VFMSUB231SD VFMSUB231SH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Packed negative-multiply-add
(-A*B)+C
9C VFNMADD132PS VFNMADD132PD VFNMADD132PH VFNMADD132BF16
AC VFNMADD213PS VFNMADD213PD VFNMADD213PH VFNMADD213BF16
BC VFNMADD231PS VFNMADD231PD VFNMADD231PH VFNMADD231BF16
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Scalar negative-multiply-add
(-A*B)+C
9D VFNMADD132SS VFNMADD132SD VFNMADD132SH Template:N/a
AD VFNMADD213SS VFNMADD213SD VFNMADD213SH Template:N/a
BD VFNMADD231SS VFNMADD231SD VFNMADD231SH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Packed negative-multiply-subtract
(-A*B)-C
9E VFNMSUB132PS VFNMSUB132PD VFNMSUB132PH VFNMSUB132BF16
AE VFNMSUB213PS VFNMSUB213PD VFNMSUB213PH VFNMSUB213BF16
BE VFNMSUB231PS VFNMSUB231PD VFNMSUB231PH VFNMSUB231BF16
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a
Scalar negative-multiply-subtract
(-A*B)-C
9F VFNMSUB132SS VFNMSUB132SD VFNMSUB132SH Template:N/a
AF VFNMSUB213SS VFNMSUB213SD VFNMSUB213SH Template:N/a
BF VFNMSUB231SS VFNMSUB231SD VFNMSUB231SH Template:N/a
Template:Maybe Template:Maybe Template:Maybe Template:N/a Template:N/a

Template:Notelist

AVX-512

Template:Main AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.[4] Most of the added instructions may also be used with the 256- and 128-bit registers.
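
As an illustration of how the opmask registers restrict an operation to selected lanes, the following C sketch uses AVX-512F intrinsics with merging masking; the function name and mask value are arbitrary examples (compile with AVX-512F enabled, e.g. -mavx512f).

  #include <immintrin.h>

  /* Add two 16-float vectors, but only in the lanes selected by the mask;
     lanes whose mask bit is clear keep the corresponding value from src. */
  __m512 masked_add(__m512 src, __m512 a, __m512 b)
  {
      __mmask16 k = 0x00FF;                   /* update only the low 8 lanes */
      return _mm512_mask_add_ps(src, k, a, b);
  }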

AMX

Template:Main Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.

AMX subset Instruction mnemonics Opcode Instruction description Added in
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end LDTILECFG m512 VEX.128.NP.0F38.W0 49 /0 Load AMX tile configuration data structure from memory as a 64-byte data structure. Template:Nowrap
STTILECFG m512 VEX.128.66.0F38.W0 49 /0 Store AMX tile configuration data structure to memory.
TILERELEASE VEX.128.NP.0F38.W0 49 C0 Initialize TILECFG and tile data registers (tmm0 to tmm7) to the INIT state (all-zeroes).
TILEZERO tmm Template:Nowrap Zero out contents of one tile register.
TILELOADD tmm, sibmem Template:Nowrap Load a data tile from memory into AMX tile register.
Template:Nowrap VEX.128.66.0F38.W0 4B /rTemplate:Efn Load a data tile from memory into AMX tile register, with a hint that data should not be kept in the nearest cache levels.
TILESTORED sibmem, tmm VEX.128.F3.0F38.W0 4B /rTemplate:Efn Store a data tile to memory from AMX tile register.
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end TDPBSSD tmm1,tmm2,tmm3Template:Efn VEX.128.F2.0F38.W0 5E /r Matrix multiply signed bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
TDPBSUD tmm1,tmm2,tmm3Template:Efn VEX.128.F3.0F38.W0 5E /r Matrix multiply signed bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.
TDPBUSD tmm1,tmm2,tmm3Template:Efn VEX.128.66.0F38.W0 5E /r Matrix multiply unsigned bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
TDPBUUD tmm1,tmm2,tmm3Template:Efn VEX.128.NP.0F38.W0 5E /r Matrix multiply unsigned bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end TDPBF16PS tmm1,tmm2,tmm3Template:Efn VEX.128.F3.0F38.W0 5C /r Matrix multiply BF16 values from tmm2 with BF16 values from tmm3, accumulating result in tmm1.
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end TDPFP16PS tmm1,tmm2,tmm3Template:Efn VEX.128.F2.0F38.W0 5C /r Matrix multiply FP16 values from tmm2 with FP16 values from tmm3, accumulating result in tmm1. (Granite Rapids)
Template:GlossaryTemplate:TermTemplate:DefnTemplate:Glossary end Template:Nowrap VEX.128.NP.0F38.W0 6C /r Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating real part of result in tmm1. Template:Nowrap
Template:Nowrap VEX.128.66.0F38.W0 6C /r Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating imaginary part of result in tmm1.
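
As a rough illustration of how these instructions are used together, the following C sketch configures one accumulator tile and two input tiles and issues a single TDPBSSD through the corresponding intrinsics. It is a simplified sketch: the tile shapes are hard-coded, the B matrix is assumed to already be in the K-packed layout the instruction expects, and the operating-system step of requesting AMX permission (on Linux, via arch_prctl) is omitted.

  #include <immintrin.h>
  #include <stdint.h>
  #include <string.h>

  /* 64-byte TILECFG image: palette id, start row, per-tile row sizes in bytes
     and per-tile row counts, laid out as documented by Intel. */
  struct tilecfg {
      uint8_t  palette_id;
      uint8_t  start_row;
      uint8_t  reserved[14];
      uint16_t colsb[16];
      uint8_t  rows[16];
  };

  /* C (16x16 int32) += A (16x64 int8) * B (16x64 int8, K-packed), using one TDPBSSD.
     Compile with AMX enabled (e.g. -mamx-tile -mamx-int8). */
  void amx_dot_product(const int8_t *a, const int8_t *b, int32_t *c)
  {
      struct tilecfg cfg;
      memset(&cfg, 0, sizeof cfg);
      cfg.palette_id = 1;
      cfg.rows[0] = 16; cfg.colsb[0] = 64;    /* tmm0: 16x16 int32 accumulator */
      cfg.rows[1] = 16; cfg.colsb[1] = 64;    /* tmm1: 16x64 int8 A            */
      cfg.rows[2] = 16; cfg.colsb[2] = 64;    /* tmm2: 16x64 int8 B            */
      _tile_loadconfig(&cfg);                 /* LDTILECFG  */

      _tile_zero(0);                          /* TILEZERO   tmm0 */
      _tile_loadd(1, a, 64);                  /* TILELOADD  tmm1, 64-byte row stride */
      _tile_loadd(2, b, 64);                  /* TILELOADD  tmm2 */
      _tile_dpbssd(0, 1, 2);                  /* TDPBSSD    tmm0, tmm1, tmm2 */
      _tile_stored(0, c, 64);                 /* TILESTORED tmm0, 64-byte row stride */

      _tile_release();                        /* TILERELEASE */
  }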

Template:Notelist

See also

References

Template:Reflist