These are instructions that were at some point documented in one or more x86 processors, but whose processor series has been discontinued or superseded, with no known plans to reintroduce the instructions.
These instructions are only present in the x86 operation mode of early Intel Itanium processors with hardware support for x86. This support was added in "Merced" and removed in "Montecito", replaced with software emulation.
These instructions were introduced in 6th generation Intel Core "Skylake" CPUs. The last CPU generation to support them was the 9th generation Core "Coffee Lake" CPUs.
Intel MPX adds 4 new registers, BND0 to BND3, that each contains a pair of addresses. MPX also defines a bounds-table as a 2-level directory/table data structure in memory that contains sets of upper/lower bounds.
Store bounds into the bounds-table, with address translation performed using an SIB-addressing expression mib.[d]
BND
F2
Instruction prefix used with certain branch instructions[e] to indicate that they should not clear the bounds registers.
^For all of the MPX instructions, 16-bit addressing is disallowed − this effectively makes the address-size override prefix 67h mandatory in 16-bit mode and prohibited in 32-bit mode. In 64-bit mode, the 67h prefix is ignored for the MPX instructions − address size is always 64-bit. These behaviors are unique to the MPX instructions.
^For BNDMK in 64-bit mode, RIP-relative addressing is not permitted and will cause #UD.
^The BNDLDX and BNDSTX instructions require memory addressing modes that use the SIB byte − non-SIB addressing modes cause #UD.
^The BNDLDX and BNDSTX instructions produce a #BR exception if the bounds directory entry is not valid (which prevents address translation).
^The branch instructions that can accept a BND prefix are the near forms of JMP (opcodes E9 and FF /4), CALL (opcodes E8 and FF /2), RET (opcodes C2 and C3), and the short/near forms of the Jcc instructions (opcodes 70..7F and 0F 80..8F). If the BNDPRESERVE config bit is not set, then executing any of these branch instructions without the BND prefix will clear all four bounds registers. (Other branch instructions − such as far jumps, short jumps (EB), LOOP and IRET − do not clear the bounds registers, regardless of whether an F2h prefix is present.)
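The rule in the last footnote can be sketched as a small predicate. This is only an illustration of the text above: the function name and its arguments are hypothetical, and the opcode sets are copied from the footnote rather than from a decoder.

```python
# Near JMP, CALL and RET opcodes that take no ModR/M byte (from the footnote).
NEAR_JMP_CALL_RET = {0xE9, 0xE8, 0xC2, 0xC3}

def branch_clears_bounds(opcode, modrm_reg=None, two_byte=False,
                         bnd_prefix=False, bndpreserve=False):
    """Return True if executing the branch would clear BND0..BND3.

    opcode      -- the opcode byte (after the 0F escape when two_byte is True)
    modrm_reg   -- the ModR/M reg field, only relevant for opcode FF
    bnd_prefix  -- True if the F2 (BND) prefix is present
    bndpreserve -- state of the BNDPRESERVE config bit
    """
    if two_byte:
        eligible = 0x80 <= opcode <= 0x8F           # near Jcc (0F 80..8F)
    elif opcode == 0xFF:
        eligible = modrm_reg in (2, 4)              # near indirect CALL / JMP
    else:
        eligible = opcode in NEAR_JMP_CALL_RET or 0x70 <= opcode <= 0x7F
    if not eligible:
        return False    # far jumps, EB, LOOP, IRET: bounds are never cleared
    return not bnd_prefix and not bndpreserve
```

For example, a plain `CALL rel32` (E8) clears the bounds registers, while the same instruction with a BND prefix, or with BNDPRESERVE set, does not.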
Hardware Lock Elision
The Hardware Lock Elision feature of Intel TSX is marked in the Intel SDM as removed from 2019 onwards.[2] This feature took the form of two instruction prefixes, XACQUIRE and XRELEASE, that could be attached to memory atomics/stores to elide the memory locking that they represent.
Instruction prefix
Opcode
Description
XACQUIRE
F2
Instruction prefix to indicate start of hardware lock elision, used with memory atomic instructions only (for other instructions, the F2 prefix may have other meanings). When used with such instructions, may start a transaction instead of performing the memory atomic operation.
XRELEASE
F3
Instruction prefix to indicate end of hardware lock elision, used with memory atomic/store instructions only (for other instructions, the F3 prefix may have other meanings). When used with such instructions during hardware lock elision, will end the associated transaction instead of performing the store/atomic.
VP2Intersect instructions
The VP2INTERSECT instructions (an AVX-512 subset) were introduced in Tiger Lake (11th generation mobile Core processors), but were never officially supported on any other Intel processors - they are now considered deprecated[3] and are listed in the Intel SDM as removed from 2023 onwards.[2]
As of July 2024, the VP2INTERSECT instructions have been re-introduced on AMD Zen 5 processors.[4]
Store, in an even/odd pair of mask registers, the indicators of the locations of value matches between 32-bit lanes in the two vector source arguments.
Store, in an even/odd pair of mask registers, the indicators of the locations of value matches between 64-bit lanes in the two vector source arguments.
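The match indicators described above can be modelled with a short reference function. This is a sketch, not Intel's pseudocode: it assumes that the even mask register reports matches per lane of the first source and the odd mask register per lane of the second source, and it works on plain Python lists rather than vector registers.

```python
def vp2intersect(src1, src2):
    """Return (k_even, k_odd) bitmasks for two equal-length lists of lanes.

    Bit i of k_even is set if src1[i] equals any lane of src2.
    Bit j of k_odd is set if src2[j] equals any lane of src1.
    (Which mask belongs to which source is an assumption of this sketch.)
    """
    k_even = 0
    k_odd = 0
    for i, a in enumerate(src1):
        for j, b in enumerate(src2):
            if a == b:
                k_even |= 1 << i
                k_odd |= 1 << j
    return k_even, k_odd
```

The lane width (32 or 64 bits) does not change the logic, only the number of lanes per vector.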
Instructions specific to Xeon Phi processors
"Knights Corner" instructions
The first generation Xeon Phi processors, codenamed "Knights Corner" (KNC), supported a large number of instructions that are not seen in any later x86 processor. An instruction reference is available[5] − the instructions/opcodes unique to KNC are the ones with VEX and MVEX prefixes (except for the KMOV, KNOT and KORTEST instructions − these are kept with the same opcodes and function in AVX-512, but with an added "W" appended to their instruction names).
Most of these KNC-unique instructions are similar but not identical to instructions in AVX-512 − later Xeon Phi processors replaced these instructions with AVX-512.
Early versions of AVX-512 avoided the instruction encodings used by KNC's MVEX prefix. However, with the introduction of Intel APX (Advanced Performance Extensions) in 2023, some of the old KNC MVEX instruction encodings have been reused for new APX encodings. For example, both KNC and APX accept the instruction encoding 62 F1 79 48 6F 04 C1 as valid, but assign different meanings to it:
KNC: VMOVDQA32 zmm0, k0, xmmword ptr [rcx+rax*8]{uint8} - vector load with data conversion
APX: VMOVDQA32 zmm0, [rcx+r16*8] - vector load with one of the new APX extended-GPRs used as scaled index
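The two readings of this byte sequence can be reproduced by splitting the prefix into bit fields. The sketch below assumes the following layouts, which are not spelled out in the text above: in the second payload byte, bit 2 is always 0 under MVEX, while APX treats it as an inverted fifth index bit (X4); in the third payload byte, MVEX uses bits 6:4 as a conversion selector (SSS). Function and key names are hypothetical.

```python
ENCODING = bytes.fromhex("62F179486F04C1")

def decode_both(enc):
    """Decode the memory operand of a 62-prefixed encoding under both readings."""
    if enc[0] != 0x62:
        raise ValueError("expected a 62h prefix")
    p0, p1, p2, opcode, modrm, sib = enc[1:7]
    if (modrm >> 6) != 0 or (modrm & 7) != 4:
        raise ValueError("this sketch only handles mod=00 with an SIB byte")

    x3 = ((p0 >> 6) & 1) ^ 1            # inverted X bit, shared by both readings
    b3 = ((p0 >> 5) & 1) ^ 1            # inverted B bit, shared by both readings
    index = ((sib >> 3) & 7) | (x3 << 3)
    base = (sib & 7) | (b3 << 3)
    scale = 1 << (sib >> 6)

    # KNC/MVEX reading: bits 6:4 of the third payload byte select a conversion.
    knc = {"index": index, "base": base, "scale": scale, "sss": (p2 >> 4) & 7}

    # APX reading (assumed layout): payload byte 1 bit 2 is inverted X4,
    # payload byte 0 bit 3 is B4.
    x4 = ((p1 >> 2) & 1) ^ 1
    b4 = (p0 >> 3) & 1
    apx = {"index": index | (x4 << 4), "base": base | (b4 << 4), "scale": scale}
    return knc, apx

knc, apx = decode_both(ENCODING)
```

With register numbers rax=0 and rcx=1, the KNC reading gives base 1, index 0, scale 8 and a conversion selector of 4 (the `{uint8}` form shown above), while the APX reading gives the same base and scale but index 16, that is, r16.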
"Knights Landing" and "Knights Mill" instructions
Some of the AVX-512 instructions in the Xeon Phi "Knights Landing" and later models belong to the AVX-512 subsets "AVX512ER", "AVX512_4FMAPS", "AVX512PF" and "AVX512_4VNNIW", all of which are unique to the Xeon Phi series of processors. The ER and PF subsets were introduced in "Knights Landing" − the 4FMAPS and 4VNNIW instructions were later added in "Knights Mill".
The ER and 4FMAPS instructions are floating-point arithmetic instructions that all follow a given pattern where:
EVEX.W is used to specify floating-point format (0=FP32, 1=FP64)
The bottom opcode bit is used to select between packed and scalar operation (0: packed, 1:scalar)
For a given operation, all the scalar/packed variants belong to the same AVX-512 subset.
The instructions all support result masking by opmask registers. The AVX512ER instructions also all support broadcast of memory operands.
^For the AVX512ER instructions, a numerically exact reference is available as C code.[6]
The AVX512PF instructions are a set of 16 prefetch instructions. These instructions all use VSIB encoding, where a memory addressing mode using the SIB byte is required, and where the index part of the SIB byte is taken to index into the AVX512 vector register file rather than the GPR register file. The selected AVX512 vector register is then interpreted as a vector of indexes, causing the standard x86 base+index+displacement address calculation to be performed for each vector lane, causing one associated memory operation (prefetches in case of the AVX512PF instructions) to be performed for each active lane. The instruction encodings all follow a pattern where:
EVEX.W is used to specify format of the prefetchable data (0:FP32, 1:FP64)
The bottom bit of the opcode is used to indicate whether the AVX512 index register is considered a vector of sixteen signed 32-bit indexes (bit 0 not set) or eight signed 64-bit indexes (bit 0 set)
The instructions all support operation masking by opmask registers.
The only supported vector width is 512 bits.
| Operation | Basic opcode | 32-bit indexes (opcode C6), FP32 prefetch (W=0) | 32-bit indexes (opcode C6), FP64 prefetch (W=1) | 64-bit indexes (opcode C7), FP32 prefetch (W=0) | 64-bit indexes (opcode C7), FP64 prefetch (W=1) |
|---|---|---|---|---|---|
| Prefetch into L1 cache (T0 hint) | EVEX.66.0F38 (C6/C7) /1 /vsib | VGATHERPF0DPS vm32z {k1} | VGATHERPF0DPD vm32y {k1} | VGATHERPF0QPS vm64z {k1} | VGATHERPF0QPD vm64y {k1} |
| Prefetch into L2 cache (T1 hint) | EVEX.66.0F38 (C6/C7) /2 /vsib | VGATHERPF1DPS vm32z {k1} | VGATHERPF1DPD vm32y {k1} | VGATHERPF1QPS vm64z {k1} | VGATHERPF1QPD vm64y {k1} |
| Prefetch into L1 cache (T0 hint) with intent to write | EVEX.66.0F38 (C6/C7) /5 /vsib | VSCATTERPF0DPS vm32z {k1} | VSCATTERPF0DPD vm32y {k1} | VSCATTERPF0QPS vm64z {k1} | VSCATTERPF0QPD vm64y {k1} |
| Prefetch into L2 cache (T1 hint) with intent to write | EVEX.66.0F38 (C6/C7) /6 /vsib | VSCATTERPF1DPS vm32z {k1} | VSCATTERPF1DPD vm32y {k1} | VSCATTERPF1QPS vm64z {k1} | VSCATTERPF1QPD vm64y {k1} |
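The encoding pattern of the AVX512PF instructions is regular enough that each mnemonic can be derived from three fields. The following sketch does this for the opcode byte, EVEX.W and the ModR/M reg field listed in the table; the function name is hypothetical.

```python
# ModR/M reg field -> (operation, cache hint digit), as listed in the table.
REG_FIELD = {1: ("GATHER", 0), 2: ("GATHER", 1),
             5: ("SCATTER", 0), 6: ("SCATTER", 1)}

def pf_mnemonic(opcode, w, reg):
    """Build the AVX512PF mnemonic for an EVEX.66.0F38 C6/C7 encoding."""
    if opcode not in (0xC6, 0xC7) or w not in (0, 1) or reg not in REG_FIELD:
        raise ValueError("not an AVX512PF encoding")
    kind, hint = REG_FIELD[reg]
    index = "D" if opcode == 0xC6 else "Q"   # bottom opcode bit: 32/64-bit indexes
    data = "PS" if w == 0 else "PD"          # EVEX.W: FP32/FP64 data
    return f"V{kind}PF{hint}{index}{data}"
```

Enumerating the two opcodes, two W values and four reg values yields the 16 distinct prefetch instructions of the subset.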
The AVX512_4VNNIW instructions read a 128-bit data item from memory, containing 4 two-component vectors (each component being signed 16-bit). Then, for each of 4 consecutive AVX-512 registers, they will, for each 32-bit lane, interpret the lane as a two-component vector (signed 16-bit) and perform a dot-product with the corresponding two-component vector that was read from memory (the first two-component vector from memory is used for the first AVX-512 source register, and so on). These results are then accumulated into a destination vector register.
Instruction
Opcode
Description
VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128
EVEX.512.F2.0F38.W0 52 /r
Dot-product of signed words with dword accumulation, 4 iterations
VP4DPWSSDS zmm1{k1}{z}, zmm2+3, m128
EVEX.512.F2.0F38.W0 53 /r
Dot-product of signed words with dword accumulation and saturation, 4 iterations
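The four-iteration dot-product described above can be written as a simplified reference model. This sketch makes several assumptions: it ignores opmask handling, represents each 32-bit lane as a (low, high) pair of signed 16-bit values, and models the non-saturating variant as wrapping to 32 bits.

```python
def wrap32(x):
    """Wrap an integer to the signed 32-bit range (assumed non-saturating behaviour)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def vp4dpwssd(acc, srcs, mem):
    """Simplified model of VP4DPWSSD.

    acc  -- list of signed 32-bit accumulator lanes (the destination register)
    srcs -- four source registers, each a list of (lo, hi) signed 16-bit pairs
    mem  -- four (lo, hi) signed 16-bit pairs from the 128-bit memory operand
    """
    out = list(acc)
    for reg, (m_lo, m_hi) in zip(srcs, mem):       # one iteration per source register
        for lane, (s_lo, s_hi) in enumerate(reg):
            out[lane] = wrap32(out[lane] + s_lo * m_lo + s_hi * m_hi)
    return out
```

A real register has sixteen 32-bit lanes; the model accepts any lane count so that small inputs can be checked by hand.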
Xeon Phi processors (from Knights Landing onwards) also featured the PREFETCHWT1 m8 instruction (opcode 0F 0D /2, prefetch into L2 cache with intent to write) − these were the only Intel CPUs to officially support this instruction, but it continues to be supported on some non-Intel processors (e.g. Zhaoxin YongFeng).
A handful of instructions to support System Management Mode were introduced in the Am386SXLV and Am386DXLV processors.[7][8] They were also present in the later Am486SXLV/DXLV and Elan SC300/310 processors.[9]
The SMM functionality of these processors was implemented using Intel ICE microcode without a valid license, resulting in a lawsuit that AMD lost in late 1994.[10] As a result of this loss, the ICE microcode was removed from all later AMD CPUs, and the SMM instructions were removed with it.
Instruction
Opcode
Description
SMI
F1
Call SMM interrupt handler (only if DR7 bit 12 is set; not available on Am486SXLV/DXLV[11])
UMOV r/m8, r8
0F 10 /r
Move data between registers and main system memory
UMOV r/m, r16/32
0F 11 /r
UMOV r8, r/m8
0F 12 /r
UMOV r16/32, r/m
0F 13 /r
RES3
0F 07
Return from SMM interrupt handler (Am386SXLV/DXLV only). Takes a pointer in ES:EDI to a processor save state to resume from − this save state has a format nearly identical to that of the undocumented Intel 386 LOADALL instruction.[12]
RES4
0F 07
Return from SMM interrupt handler (Am486SXLV/DXLV only). Similar to RES3, but with a different save state format.[13]
These SMM instructions were also present on the IBM 386SLC and its derivatives (albeit with the LOADALL-like SMM return opcode 0F 07 named ICERET),[12][14][11] as well as on the UMC U5S processor.[15]
The 3DNow! instruction set extension was introduced in the AMD K6-2, mainly adding support for floating-point SIMD instructions using the MMX registers (two FP32 components in a 64-bit vector register). The instructions were mainly promoted by AMD, but were supported on some non-AMD CPUs as well. The processors supporting 3DNow! were:
AMD K6-2, K6-III, and all processors based on the K7, K8 and K10 microarchitectures. (Later AMD microarchitectures such as Bulldozer, Bobcat and Zen do not support 3DNow!)
VIA Cyrix III (both "Joshua" and "Samuel" variants), and the "Samuel" and "Ezra" revisions of VIA C3. (Later VIA CPUs, from C3 "Nehemiah" onwards, dropped 3DNow! in favor of SSE.)
National Semiconductor Geode GX2; AMD Geode GX and LX.
The 3DNow! specification[16] does not directly specify the operation performed by the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions − instead, it imposes requirements on the results of using these instructions together in specific ways:[a]
If the bottom 32 bits of mm0 initially contains a value X in FP32 format, then the instruction sequence:
PFRCP mm1,mm0
PFRCPIT1 mm0,mm1
PFRCPIT2 mm0,mm1
must fill both 32-bit lanes of mm0 with 1/X in FP32 format, computed with an error of at most 1 ulp.
Multiply signed packed 16-bit integers with rounding and store the high 16 bits: dst <- ((dst * src) + 0x8000) >> 16
PAVGUSB mm1,mm2/m64
0F 0F /r BF
Average of unsigned packed 8-bit integers: dst <- (src+dst+1) >> 1
FEMMS
0F 0E
Faster Enter/Exit of the MMX or x87 floating-point state[c]
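The two integer formulas given in the table above (multiply with rounding, and unsigned byte average) can be checked per lane with a short sketch. The helper names are hypothetical, and the code models a single lane rather than a full 64-bit MMX register.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed 16-bit value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pmulhrw_lane(dst, src):
    """One signed 16-bit lane of dst <- ((dst * src) + 0x8000) >> 16."""
    return to_signed16((dst * src + 0x8000) >> 16)

def pavgusb_lane(dst, src):
    """One unsigned 8-bit lane of dst <- (src + dst + 1) >> 1."""
    return (src + dst + 1) >> 1
```

Adding 0x8000 before the shift rounds the high half of the product to nearest, and the `+ 1` in the average rounds halves upward.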
^The 3DNow! precision requirements can be fulfilled in several different ways, for example:
On AMD K6-2, the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions would perform various parts of a Newton-Raphson iteration to improve the precision of a low-precision initial result from PFRCP/PFRSQRT.[17]
On AMD Geode LX, the PFRCP and PFRSQRT instructions would instead compute their results with full 24-bit precision − this made it possible to turn the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions into pure data movement instructions, performing the same operation as MOVQ.[18]
^The 3DNow! PMULHRW instruction has the same mnemonic as the Cyrix EMMI PMULHRW instruction; however, its opcode and function differ (the EMMI instruction right-shifts its multiply-result by 15 bits, while the 3DNow! instruction right-shifts by 16 bits).
Some assemblers/disassemblers, such as NASM, resolve this ambiguity by using the mnemonic PMULHRWA for the 3DNow! instruction and PMULHRWC for the EMMI instruction.
^The FEMMS instruction differs from the standard MMX EMMS instruction in that FEMMS makes the FP/MMX register contents undefined after the instruction is executed.
3DNow! also introduced a couple of prefetch instructions: PREFETCH m8 (opcode 0F 0D /0) and PREFETCHW m8 (opcode 0F 0D /1). These instructions, unlike the rest of 3DNow!, are not discontinued but continue to be supported on modern AMD CPUs. The PREFETCHW instruction is also supported on Intel CPUs starting with 65 nm Pentium 4,[19] albeit executed as NOP until Broadwell.
^The PF2IW and PI2FW instructions also existed as undocumented instructions on the original K6-2.
The undocumented variant of PF2IW in K6-2 would set the top 16 bits of each 32-bit result lane to all-0s, while the documented variant in later processors would sign-extend the 16-bit result to 32 bits.[20][21]
^The PSWAPD instruction uses the same opcode as the older undocumented K6-2 PSWAPW instruction.[21]
SSE5 was a proposed SSE extension by AMD, using a new "DREX" instruction encoding to add support for new 3-operand and 4-operand instructions to SSE.[22] The bundle did not include the full set of Intel's SSE4 instructions, making it a competitor to SSE4 rather than a successor.
AMD chose not to implement SSE5 as originally proposed − it was instead reworked into FMA4 and XOP,[23] which provided similar functionality but with a quite different instruction encoding − using the VEX prefix for the FMA4 instructions and the new VEX-like XOP prefix for most of the remaining instructions.
Introduced with the Bulldozer processor core and removed again from the Zen microarchitecture onward.
A revision of most of the SSE5 instruction set.
The XOP instructions mostly make use of the XOP prefix, which is a 3-byte prefix with the following layout:
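| Byte | Bits | Usage |
|---|---|---|
| Byte 0 | 7:0 | 8Fh |
| Byte 1 | 7 | R̅ |
| Byte 1 | 6 | X̅ |
| Byte 1 | 5 | B̅ |
| Byte 1 | 4:0 | mmmmm |
| Byte 2 | 7 | W |
| Byte 2 | 6:3 | v̅v̅v̅v̅ |
| Byte 2 | 2 | L |
| Byte 2 | 1:0 | pp |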
where:
Overlines indicate inverted bits.
The R/X/B bits are argument extension bits similar to the RXB bits of the REX prefix.
mmmmm is an opcode-map specifier. While capable of encoding values from 8 to 31 (values 0 to 7 map to ModR/M-encoded variants of the older POP instruction, making them unusable for XOP), only maps 8, 9 and 0Ah were ever used: map 8 for instructions that take an 8-bit immediate, map 9 for instructions that don't take an immediate, and map 0Ah for instructions that take a 32-bit immediate.
W is used in a couple of different ways:
For XOP vector instructions, W is used to swap the last two vector source arguments to the instruction. For instructions that allow W=1, encodings with W=0 allow the second-to-last vector argument to be a memory argument, while encodings with W=1 allow the last vector argument to be a memory argument. For instructions that don't allow their last two vector arguments to be swapped, W is required to be 0.
For XOP-encoded integer-register instructions (the TBM and LWP instruction set extensions, see below), W is used for operand size. (0=32-bit, 1=64-bit)
vvvv is an extra source register argument, normally the first non-r/m source argument for instructions with ≥3 register arguments.
L is a vector length specifier. L=1 indicates 256-bit operation, L=0 indicates scalar or 128-bit operation.
pp is an embedded prefix − nominally 0/1/2/3=none/66h/F2h/F3h, but only 0 was ever used with any of the instructions defined for the XOP prefix.
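The prefix layout described above can be unpacked with a few shifts and masks. The following is a minimal sketch with a hypothetical function name; it returns the inverted fields (R, X, B, vvvv) in their logical, non-inverted form and rejects map values below 8, which belong to POP.

```python
def decode_xop_prefix(b0, b1, b2):
    """Split a 3-byte XOP prefix into its fields."""
    if b0 != 0x8F:
        raise ValueError("not an XOP prefix")
    mmmmm = b1 & 0x1F
    if mmmmm < 8:
        # Values 0..7 are ModR/M-encoded forms of POP, not XOP.
        raise ValueError("opcode map below 8: this is POP, not XOP")
    return {
        "R": ((b1 >> 7) & 1) ^ 1,          # stored inverted
        "X": ((b1 >> 6) & 1) ^ 1,          # stored inverted
        "B": ((b1 >> 5) & 1) ^ 1,          # stored inverted
        "map": mmmmm,
        "W": (b2 >> 7) & 1,
        "vvvv": ((b2 >> 3) & 0xF) ^ 0xF,   # stored inverted
        "L": (b2 >> 2) & 1,
        "pp": b2 & 3,
    }
```

For instance, the bytes 8F E9 78 decode to map 9 with W=0, L=0, pp=0 and no register extension bits set, which is the form a 128-bit map-9 instruction without a vvvv operand would use.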
The XOP instructions encoded with the XOP prefix are as follows:
Instruction description
Instruction mnemonics
Opcode
W=1 swap allowed
L=1 (256b) allowed
Extract fractional portion of floating-point value.
Packed FP32
VFRCZPS ymm1,ymm2/m256
XOP.9 80 /r
No
Yes
Packed FP64
VFRCZPD ymm1,ymm2/m256
XOP.9 81 /r
No
Yes
Scalar FP32
VFRCZSS xmm1,xmm2/m32
XOP.9 82 /r
No
No
Scalar FP64
VFRCZSD xmm1,xmm2/m64
XOP.9 83 /r
No
No
Vector per-bit-lane conditional move.
VPCMOV dst,src1,src2,src3 performs the equivalent of dst <- (src1 AND src3) OR (src2 AND NOT(src3))
VPCMOV ymm1,ymm2,ymm3/m256,ymm4
XOP.8 A2 /r /is4
Yes
Yes
Vector integer compare.
For each vector-register lane, compare src1 to src2, then set destination to all-1s if the comparison passes, all-0s if it fails. The imm8 argument specifies comparison function to perform:
For each N-bit lane, split the lane into a series of M-bit lanes, add the M-bit lanes together, then store the result into the destination as an N-bit zero/sign-extended value.
2x8bit -> 16bit, signed
VPHADDBW xmm1,xmm2/m128
XOP.9 C1 /r
No
No
4x8bit -> 32bit, signed
VPHADDBD xmm1,xmm2/m128
XOP.9 C2 /r
8x8bit -> 64bit, signed
VPHADDBQ xmm1,xmm2/m128
XOP.9 C3 /r
2x16bit -> 32bit, signed
VPHADDWD xmm1,xmm2/m128
XOP.9 C6 /r
4x16bit -> 64bit, signed
VPHADDWQ xmm1,xmm2/m128
XOP.9 C7 /r
2x32bit -> 64bit, signed
VPHADDDQ xmm1,xmm2/m128
XOP.9 CB /r
2x8bit -> 16bit, unsigned
VPHADDUBW xmm1,xmm2/m128
XOP.9 D1 /r
4x8bit -> 32bit, unsigned
VPHADDUBD xmm1,xmm2/m128
XOP.9 D2 /r
8x8bit -> 64bit, unsigned
VPHADDUBQ xmm1,xmm2/m128
XOP.9 D3 /r
2x16bit -> 32bit, unsigned
VPHADDUWD xmm1,xmm2/m128
XOP.9 D6 /r
4x16bit -> 64bit, unsigned
VPHADDUWQ xmm1,xmm2/m128
XOP.9 D7 /r
2x32bit -> 64bit, unsigned
VPHADDUDQ xmm1,xmm2/m128
XOP.9 DB /r
Vector Integer Horizontal Subtract.
For each N-bit lane, split the lane into two signed sub-lanes of N/2 bits each, then subtract the upper lane from the lower lane, then store the result as a signed N-bit result.
2x8bit -> 16bit
VPHSUBBW xmm1,xmm2/m128
XOP.9 E1 /r
No
No
2x16bit -> 32bit
VPHSUBWD xmm1,xmm2/m128
XOP.9 E2 /r
2x32bit -> 64bit
VPHSUBDQ xmm1,xmm2/m128
XOP.9 E3 /r
Vector Signed Integer Multiply-Add.
For each N-bit lane, perform dest <- src1*src2 + src3
For src1 and src2, the factors to multiply may be taken as signed values from the low half of each lane, high half of each lane or the lane in full (picked in the same way for src1 and src2) − the addend and the result use the full lane.
16-bit, full-lane
VPMACSWW xmm1,xmm2,xmm3/m128,xmm4
XOP.8 95 /r /is4
No
No
32-bit, low-half
VPMACSWD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 96 /r /is4
64-bit, low-half
VPMACSDQL xmm1,xmm2,xmm3/m128,xmm4
XOP.8 97 /r /is4
32-bit, full-lane
VPMACSDD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 9E /r /is4
64-bit, high-half
VPMACSDQH xmm1,xmm2,xmm3/m128,xmm4
XOP.8 9F /r /is4
16-bit, full-lane, saturating
VPMACSSWW xmm1,xmm2,xmm3/m128,xmm4
XOP.8 85 /r /is4
32-bit, low-half, saturating
VPMACSSWD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 86 /r /is4
64-bit, low-half, saturating
VPMACSSDQL xmm1,xmm2,xmm3/m128,xmm4
XOP.8 87 /r /is4
32-bit, full-lane, saturating
VPMACSSDD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 8E /r /is4
64-bit, high-half, saturating
VPMACSSDQH xmm1,xmm2,xmm3/m128,xmm4
XOP.8 8F /r /is4
Packed multiply, add and accumulate signed word to signed doubleword.
For each 32-bit lane, treat src1 and src2 as 2-component vectors of signed 16-bit values, then compute their dot-product, then add src3 as a 32-bit value.
with saturation
VPMADCSSWD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 A6 /r /is4
No
No
without saturation
VPMADCSWD xmm1,xmm2,xmm3/m128,xmm4
XOP.8 B6 /r /is4
Packed Permute Bytes.
For VPPERM dst,src1,src2,src3, src2:src1 are considered a 32-element vector of bytes. For each byte-lane, the byte in src3 is used to index into this 32-byte vector and transform the element:
bits 4:0 are used to pick one of the 32 bytes.
bits 7:6 specify a transform to perform on the byte (0=keep, 1=bitreverse, 2=set-to-zero, 3=replicate-MSB)
bit 5, if set, inverts the result after the transform.
VPPERM xmm1,xmm2,xmm3/m128,xmm4
XOP.8 A3 /r /is4
Yes
No
Packed left-rotate.
Rotation amount is given in the last source argument. It may be provided as an immediate or a vector register − in the latter case, the rotation amount is provided on a per-lane basis.
8-bit lanes
VPROTB xmm1,xmm2/m128,xmm3
XOP.9 90 /r
Yes
No
VPROTB xmm1,xmm2/m128,imm8
XOP.8 C0 /r ib
No
16-bit lanes
VPROTW xmm1,xmm2/m128,xmm3
XOP.9 91 /r
Yes
VPROTW xmm1,xmm2/m128,imm8
XOP.8 C1 /r ib
No
32-bit lanes
VPROTD xmm1,xmm2/m128,xmm3
XOP.9 92 /r
Yes
VPROTD xmm1,xmm2/m128,imm8
XOP.8 C2 /r ib
No
64-bit lanes
VPROTQ xmm1,xmm2/m128,xmm3
XOP.9 93 /r
Yes
VPROTQ xmm1,xmm2/m128,imm8
XOP.8 C3 /r ib
No
Packed shift, with signed shift-amounts.
Shift-amount is provided on a per-vector-lane basis, and is taken from the bottom 8 bits of each lane of the last source argument. The shift-amount is considered signed − a positive value will cause left-shift, while a negative value causes right-shift.
8-bit, signed
VPSHAB xmm1,xmm2/m128,xmm3
XOP.9 98 /r
Yes
No
16-bit, signed
VPSHAW xmm1,xmm2/m128,xmm3
XOP.9 99 /r
32-bit, signed
VPSHAD xmm1,xmm2/m128,xmm3
XOP.9 9A /r
64-bit, signed
VPSHAQ xmm1,xmm2/m128,xmm3
XOP.9 9B /r
8-bit, unsigned
VPSHLB xmm1,xmm2/m128,xmm3
XOP.9 94 /r
16-bit, unsigned
VPSHLW xmm1,xmm2/m128,xmm3
XOP.9 95 /r
32-bit, unsigned
VPSHLD xmm1,xmm2/m128,xmm3
XOP.9 96 /r
64-bit, unsigned
VPSHLQ xmm1,xmm2/m128,xmm3
XOP.9 97 /r
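Two of the operations in the table above, VPCMOV and the per-byte selector of VPPERM, are easy to misread in prose, so here is a sketch of both. It treats vectors as Python integers (VPCMOV) or byte sequences (VPPERM), and it assumes that in the src2:src1 concatenation src1 supplies byte positions 0 to 15; the function names are hypothetical.

```python
def vpcmov(src1, src2, src3, width=128):
    """dst <- (src1 AND src3) OR (src2 AND NOT(src3)), per bit."""
    mask = (1 << width) - 1
    return ((src1 & src3) | (src2 & ~src3)) & mask

def vpperm_byte(selector, src1, src2):
    """Compute one result byte of VPPERM from one selector byte of src3.

    src1, src2 -- 16-byte sequences; src1 is assumed to hold indexes 0..15.
    """
    table = bytes(src1) + bytes(src2)
    value = table[selector & 0x1F]               # bits 4:0 pick one of 32 bytes
    op = selector >> 6                           # bits 7:6 pick the transform
    if op == 1:
        value = int(f"{value:08b}"[::-1], 2)     # bit-reverse
    elif op == 2:
        value = 0x00                             # set to zero
    elif op == 3:
        value = 0xFF if value & 0x80 else 0x00   # replicate MSB
    if selector & 0x20:                          # bit 5 inverts the result
        value ^= 0xFF
    return value
```

A full VPPERM would apply `vpperm_byte` once for each of the 16 bytes of src3.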
^For each VPCOM* instruction, a series of alias mnemonics are available for the instruction, one for each of the eight comparison functions encodable in the imm8 argument. These alias mnemonics specify the comparison to perform after the "VPCOM" part of the mnemonic. For example:
VPCOMEQB xmm1,xmm2,xmm3 is an alias for VPCOMB xmm1,xmm2,xmm3,4
VPCOMFALSEUQ xmm1,xmm2,[ebx] is an alias for VPCOMUQ xmm1,xmm2,[ebx],6
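The alias scheme can be sketched as a small mnemonic expander. Only the values 4 (EQ) and 6 (FALSE) are confirmed by the examples above; the remaining entries of the comparison list follow the ordering commonly given for XOP (LT, LE, GT, GE, EQ, NEQ, FALSE, TRUE) and should be treated as an assumption here, as should the function name.

```python
# Index in this list = imm8 value. Only EQ=4 and FALSE=6 come from the text above.
COMPARISONS = ["LT", "LE", "GT", "GE", "EQ", "NEQ", "FALSE", "TRUE"]
# Unsigned suffixes are listed first so that "UQ" is not mistaken for "Q".
TYPES = ["UB", "UW", "UD", "UQ", "B", "W", "D", "Q"]

def expand_alias(mnemonic):
    """Turn an alias such as 'VPCOMEQB' into ('VPCOMB', 4)."""
    if not mnemonic.startswith("VPCOM"):
        raise ValueError("not a VPCOM alias")
    rest = mnemonic[len("VPCOM"):]
    for type_suffix in TYPES:
        if rest.endswith(type_suffix):
            comparison = rest[:-len(type_suffix)]
            if comparison in COMPARISONS:
                return "VPCOM" + type_suffix, COMPARISONS.index(comparison)
    raise ValueError("unknown comparison or element type")
```

An assembler would then emit the base instruction with the returned number as its imm8 operand.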
XOP also included two vector instructions that used the VEX prefix instead of the XOP prefix:
The instructions VPERMIL2PD and VPERMIL2PS were originally defined by Intel in early drafts of the AVX specification[24] − they were removed in later drafts[25][26] and were never implemented in any Intel processor. They were, however, implemented by AMD, who designated them as being a part of the XOP instruction set extension. (Like the other parts of XOP, they've been removed in AMD Zen.)
AMD introduced TBM together with BMI1 in its Piledriver[27] line of processors; later AMD Jaguar and Zen-based processors do not support TBM.[28] No Intel processors (as of 2023) support TBM.
The TBM instructions are all encoded using the XOP prefix. They are all available in 32-bit and 64-bit forms, selected with the XOP.W bit (0=32bit, 1=64bit). (XOP.W is ignored outside 64-bit mode.) Like all instructions encoded with VEX/XOP prefixes, they are unavailable in Real Mode and Virtual-8086 mode.
^For BEXTR, a register form is available as part of BMI1.
Lightweight Profiling instructions
The AMD Lightweight Profiling (LWP) feature was introduced in AMD Bulldozer and removed in AMD Zen. On all supported CPUs, the latest available microcode updates have disabled LWP due to Spectre mitigations.[31]
These instructions are available in Ring 3, but not available in Real Mode and Virtual-8086 mode. All of them use the XOP prefix.
Instruction
Opcode
Description
LLWPCB r32/64
XOP.9 12 /0
Load LWPCB (Lightweight Profiling Control Block) address.[a]
Loading an address of 0 disables LWP. Loading a nonzero address will cause the CPU to perform validation of the specified LWPCB, then enable LWP if the validation passed. If LWP was already enabled, state for the previous LWPCB is flushed to memory.
SLWPCB r32/64
XOP.9 12 /1
Store LWPCB address[a] to register, and flush LWP state to memory.
If LWP is not enabled, the stored address is 0.
LWPINS r32/64, r/m32, imm32
XOP.A 12 /0 imm32
Insert user event record with EventID=255 in LWP ring buffer. The arguments are inserted into the event record as follows:
The first argument is stored in bytes 23:16 (zero-extended if 32-bit)
The second argument is stored in bytes 7:4
The low 16 bits of the imm32 are stored in bytes 3:2 (the high 16 bits are ignored)
The LWPINS instruction sets CF=1 if LWP is enabled and the ring buffer is full, CF=0 otherwise.
LWPVAL r32/64, r/m32, imm32
XOP.A 12 /1 imm32
Decrement the event counter associated with the programmed value sample event. If the resulting counter value ends up negative, insert an event record with EventID=1 in LWP ring buffer. (The instruction arguments are inserted in this record in the same way as for LWPINS.)
Executes as NOP if LWP is not enabled or if the event counter is not enabled. If no event record is inserted, then the second argument (which may be a memory argument) is not accessed.
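The placement of the LWPINS arguments in the event record can be sketched with `struct`. The byte ranges for the three arguments come from the description above; the 32-byte record size, the position of the EventID in byte 0 and little-endian byte order are assumptions of this sketch, and all other record fields are left as zero.

```python
import struct

RECORD_SIZE = 32   # assumed size of one LWP event record

def lwpins_record(arg1, arg2, imm32, arg1_is_64bit=True):
    """Build a partial event record as LWPINS would fill it in."""
    rec = bytearray(RECORD_SIZE)
    rec[0] = 255                                        # EventID (byte position assumed)
    struct.pack_into("<H", rec, 2, imm32 & 0xFFFF)      # bytes 3:2, low 16 bits of imm32
    struct.pack_into("<I", rec, 4, arg2 & 0xFFFFFFFF)   # bytes 7:4, second argument
    mask = 0xFFFFFFFFFFFFFFFF if arg1_is_64bit else 0xFFFFFFFF
    struct.pack_into("<Q", rec, 16, arg1 & mask)        # bytes 23:16, zero-extended
    return bytes(rec)
```

The same layout applies to the record inserted by LWPVAL, except that its EventID is 1.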
^The address used by LLWPCB and SLWPCB is an effective-address, specified relative to the DS: segment base address. LLWPCB converts this effective-address to a linear-address by adding the DS base address to it, and SLWPCB converts it back by subtracting the DS base address. Changing the DS base address while LWP is enabled will thereby cause SLWPCB to return a different address than what was specified to LLWPCB, and may also cause XSAVE to fail to save LWP state properly.
These instructions are specific to the NEC V20/V30 CPUs and their successors, and do not appear in any non-NEC CPUs. Many of their opcodes have been reassigned to other instructions in later non-NEC CPUs.
Instruction
Opcode
Description
Available on
TEST1 r/m8, CL TEST1 r/m16, CL
0F 10 /0 0F 11 /0
Test one bit.
First argument specifies an 8/16-bit register or memory location.
Performs a string addition of integers in packed BCD format (2 BCD digits per byte). DS:SI points to a source integer, ES:DI to a destination integer, and CL provides the number of digits to add. The operation is then:
destination <- destination + source
SUB4S
0F 22
Subtract Nibble Strings.
destination <- destination − source
CMP4S
0F 26
Compare Nibble Strings.
ROL4 r/m8
0F 28 /0
Rotate Left Nibble.
Concatenates its 8-bit argument with the bottom 4 bits of AL to form a 12-bit bitvector, then left-rotates this bitvector by 4 bits, then writes this bitvector back to its argument and the bottom 4 bits of AL.
ROR4 r/m8
0F 2A /0
Rotate Right Nibble. Similar to ROL4, except performs a right-rotate by 4 bits.
EXT r8,r8
0F 33 /r
Bitfield extract.
Perform a bitfield read from memory. DS:SI (DS0:IX in NEC nomenclature) points to memory location to read from, first argument specifies bit-offset to read from, and second argument specifies the number of bits to read minus 1. The result is placed in AX. After the bitfield read, SI and the first argument are updated to point just beyond the just-read bitfield.
EXT r8,imm8
0F 3B /0 ib
INS r8,r8
0F 31 /r
Bitfield Insert.
Perform a bitfield write to memory. ES:DI (DS1:IY in NEC nomenclature) points to memory location to write to, AX contains data to write, first argument specifies bit-offset to write to, and second argument specifies the number of bits to write minus 1. After the bitfield write, DI and the first argument are updated to point just beyond the just-written bitfield.
INS r8,imm8
0F 39 /0 ib
REPC
64
Repeat if carry. Instruction prefix for use with CMPS/SCAS.
REPNC
65
Repeat if not carry. Instruction prefix for use with CMPS/SCAS.
FPO2
66 /r 67 /r
"Floating Point Operation 2": extra escape opcodes for floating-point coprocessor, in addition to the standard D8-DF ones used for x87.
The FPO2 escape opcodes are used by the NEC 72291 floating-point coprocessor - this coprocessor also uses the standard D8-DF escape opcodes, but uses them to encode an instruction set that is unique to the 72291 and not compatible with x87. A listing of the opcodes/instructions supported by the 72291 is available.[34]
Jump to an address picked from the IVT (Interrupt Vector Table) using the imm8 argument, similar to the 8086 INT instruction, but start executing as Intel 8080 code rather than x86 code.
Jump to an address picked from the IVT using the imm8 argument. Enables a simple memory paging mechanism after reading the IVT but before executing the jump.
The paging mechanism uses an on-chip page table with 16Kbyte pages and no access rights checking.[35]
Perform software interrupt with context switch to register bank specified by low 3 bits of r16.
RETRBI
0F 91
Return from register bank context switch interrupt.
FINT
0F 92
Finish Interrupt.
TSKSW r16
0F 94 /7
Perform task switch to register bank indicated by low 3 bits of r16.
MOVSPB r16
0F 95 /7
Transfer SS and SP of current register bank to register bank indicated by low 3 bits of r16.
BTCLR imm8,imm8,cb
0F 9C ib ib rel8
Bit Test and Clear.
The first argument specifies a V25/V35 Special Function Register to test a bit in. The second argument specifies a bit position in that register. The third argument specifies a short branch offset. If the bit was set to 1, then it is cleared and a short branch is taken, else the branch is not taken.
STOP
0F 9E
CPU Halt.
Differs from the conventional 8086 HLT instruction in that the clock is stopped too, so that an NMI or CPU reset is needed to resume operation.
BRKS imm8
F1 ib
Break and Enable Software Guard.
Jump to an address picked from the IVT using the imm8 argument, and then continue execution with "Software Guard" enabled. The "Software Guard" is an 8-bit Substitution cipher that, during instruction fetch/decode, translates opcode bytes using a 256-entry lookup table stored in an on-chip Mask ROM.
Break and Enable Native Mode. Similar to BRKS, except that it disables "Software Guard" rather than enabling it.
MOV r/m,DS3
8C /6
Move to/from the DS2 and DS3 extended segment registers.
The DS2 and DS3 registers (which are specific to the NEC V55) act similarly to regular x86 real mode segment registers, except that they are left-shifted by 8 rather than 4, enabling access to 16MB of memory. Block transfer instructions, such as MOVBKW, can access the 16MB memory space by simultaneously prefixing with DS2 and DS3.[39]
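The effect of the larger shift on the reachable address range can be seen in a two-line comparison. This sketch uses hypothetical function names and does not model any wrap-around or masking that the hardware may apply to the result.

```python
def real_mode_linear(segment, offset):
    """Regular real-mode address: 16-bit segment shifted left by 4."""
    return (segment << 4) + offset

def v55_extended_linear(segment, offset):
    """DS2/DS3-style address: 16-bit segment shifted left by 8."""
    return (segment << 8) + offset
```

A 16-bit segment value shifted by 8 spans 2^24 bytes, which is where the 16MB figure comes from, compared with 2^20 bytes (1MB) for the regular shift by 4.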
These instructions are present in Cyrix CPUs as well as NatSemi/AMD Geode CPUs derived from Cyrix microarchitectures (Geode GX and LX, but not NX). They are also present in Cyrix manufacturing partner CPUs from IBM, ST and TI, as well as the VIA Cyrix III ("Joshua" core only, not "Samuel") and a few SoCs such as STPC ATLAS and ZFMicro ZFx86.[43] Many of these opcodes have been reassigned to other instructions in later non-Cyrix CPUs.
Instruction
Opcode
Description
Available on
SVDC m80,sreg
0F 78 /r
Save segment register and descriptor to memory as a 10-byte data structure.
The first 8 bytes are the descriptor, the last two bytes are the selector.[44]
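The 10-byte memory image written by SVDC can be modelled as an 8-byte descriptor followed by a 16-bit selector. The helper names below are hypothetical, and little-endian storage of the selector is assumed, as is usual on x86.

```python
import struct

def svdc_image(descriptor, selector):
    """Build the 10-byte image: 8-byte descriptor, then 2-byte selector."""
    if len(descriptor) != 8:
        raise ValueError("descriptor must be exactly 8 bytes")
    return bytes(descriptor) + struct.pack("<H", selector)

def parse_svdc_image(image):
    """Split a 10-byte image back into (descriptor, selector)."""
    if len(image) != 10:
        raise ValueError("image must be exactly 10 bytes")
    (selector,) = struct.unpack_from("<H", image, 8)
    return image[:8], selector
```

Round-tripping a value through both helpers returns the original descriptor bytes and selector.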
^The Cyrix SMM instructions also include RSM (0F AA; Return from System Management mode), however, RSM is not a Cyrix-specific instruction, and it continues to exist in modern non-Cyrix x86 processors.
^RSDC with CS as a destination register is only supported on NatSemi Geode GX2 and AMD Geode GX/LX[47] - on other processors, it causes #UD.
^Some assemblers/disassemblers, such as NASM, use the instruction mnemonic SMINTOLD for the 0F 7E encoding.
^For the RDSHR and WRSHR instructions, Cyrix's documentation[48] specifies that the instruction accepts a ModR/M byte but does not specify the encoding of the ModR/M byte's reg field. NASM v0.98.31 and later uses /0 for these instructions,[49] while sandpile.org's opcode tables[50] indicate that the reg field is ignored for these instructions.
These instructions were introduced in the Cyrix 6x86MX and MII processors, and were also present in the MediaGXm and Geode GX1[53] processors. (In later non-Cyrix processors, all of their opcodes have been used for SSE or SSE2 instructions.)
These instructions are integer SIMD instructions acting on 64-bit vectors in MMX registers or memory. Each instruction takes two explicit operands, where the first one is an MMX register operand and the second one is either a memory operand or a second MMX register. In addition, several of the instructions take an implied operand, which is an MMX register implied from the first operand as follows:
First explicit operand | mm0 | mm1 | mm2 | mm3 | mm4 | mm5 | mm6 | mm7
Implied operand | mm1 | mm0 | mm3 | mm2 | mm5 | mm4 | mm7 | mm6
In the instruction descriptions in the table below, arg1 and arg2 refer to the two explicit operands of the instruction, and imp to the implied operand.
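The pairing above amounts to flipping the lowest bit of the register number. A minimal Python sketch of that mapping follows; the function name is hypothetical, and the bit-flip formulation is simply an observation about the listed pairs, not a statement about how the hardware derives the operand.

```python
def implied_mmx_register(first_operand: str) -> str:
    """Return the implied MMX register for a given first explicit operand."""
    valid = [f"mm{i}" for i in range(8)]
    if first_operand not in valid:
        raise ValueError(f"not an MMX register name: {first_operand!r}")
    index = int(first_operand[2:])
    # Registers are paired as (mm0, mm1), (mm2, mm3), ... so the partner
    # is found by toggling bit 0 of the register number.
    return f"mm{index ^ 1}"


if __name__ == "__main__":
    for i in range(8):
        print(f"mm{i} -> {implied_mmx_register(f'mm{i}')}")
```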
Packed conditional load from memory to MMX register.
Condition is evaluated on a per-byte-lane basis, by comparing byte lanes in the implied source to zero (with signed compare) − if the comparison passes, then the corresponding destination lane is loaded from memory, otherwise it keeps its original value.
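The per-lane selection described above can be sketched in Python as follows. Because the description does not fix a single comparison, the condition is passed in as a parameter; the function names and the sample data are hypothetical.

```python
from typing import Callable


def to_signed8(byte: int) -> int:
    """Interpret an unsigned byte value (0..255) as a signed 8-bit integer."""
    return byte - 256 if byte >= 128 else byte


def packed_conditional_load(dest: bytes, mem: bytes, imp: bytes,
                            condition: Callable[[int], bool]) -> bytes:
    """Per byte lane: take the memory byte if condition(signed imp byte)
    holds, otherwise keep the destination byte."""
    if not (len(dest) == len(mem) == len(imp) == 8):
        raise ValueError("all operands must be 8 bytes (64-bit vectors)")
    return bytes(m if condition(to_signed8(i)) else d
                 for d, m, i in zip(dest, mem, imp))


if __name__ == "__main__":
    dest = bytes([0x11] * 8)
    mem = bytes(range(8))
    imp = bytes([0, 1, 0, 0xFF, 0, 0x80, 0, 5])
    # Example condition: load the lanes where the implied byte equals zero.
    print(packed_conditional_load(dest, mem, imp, lambda x: x == 0).hex())
```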
^Implementations differ on whether the PAVEB instruction treats the bytes as signed or unsigned.[54]
^For PDISTIB, PMACHRIW and the PMV* instructions, the second explicit operand is required to be a memory operand − register operands are not supported.
^The Cyrix EMMI PMULHRW instruction has the same mnemonic as the 3DNow! PMULHRW instruction; however, its opcode and function differ (the EMMI instruction right-shifts its multiply-result by 15 bits, while the 3DNow! instruction right-shifts by 16 bits).
Some assemblers/disassemblers, such as NASM, resolve this ambiguity by using the mnemonic PMULHRWA for the 3DNow! instruction and PMULHRWC for the EMMI instruction.
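The difference in shift amount can be sketched for a single 16-bit lane as follows. Only the shift counts (15 and 16) come from the note above; the round-to-nearest bias added before the shift and the wrap-around of the result to 16 bits are assumptions made for illustration, and the function names are hypothetical.

```python
def to_signed16(value: int) -> int:
    """Wrap an integer to a signed 16-bit value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def rounded_high_multiply(a: int, b: int, shift: int) -> int:
    """Multiply two signed 16-bit lanes, then right-shift by `shift` bits."""
    product = to_signed16(a) * to_signed16(b)
    bias = 1 << (shift - 1)  # assumption: round to nearest before shifting
    # Assumption: the shifted result wraps to 16 bits rather than saturating.
    return to_signed16((product + bias) >> shift)


if __name__ == "__main__":
    a = b = 0x4000
    print("shift by 15 (EMMI-style):  ", rounded_high_multiply(a, b, 15))
    print("shift by 16 (3DNow!-style):", rounded_high_multiply(a, b, 16))
```

For the same inputs, the 15-bit shift yields a result roughly twice as large as the 16-bit shift, so code written for one instruction cannot be used unchanged with the other.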
The C&T F8680 PC/Chip is a system-on-a-chip featuring an 80186-compatible CPU core, with a few additional instructions to support the F8680-specific "SuperState R"[58] supervisor/system-management feature. Some of the added instructions for "SuperState R" are:[59]
Instruction | Opcode | Description
LFEAT AX | FE F8 | Load datum into F8680 "CREG" configuration register (AH=register-index, AL=datum)[60]
STFEAT AL,imm8 | FE F0 ib | Read F8680 status register into AL (imm8=register-index)
C&T also developed a 386-compatible processor known as the Super386. This processor supports, in addition to the basic Intel 386 instruction set, a number of instructions to support the Super386-specific "SuperState V" system-management feature. The added instructions for "SuperState V" are:[7]
The M6117 series of embedded microcontrollers features an Intel 386SX-compatible CPU core derived from the V.M. Technology (VMT) VM386SX+ processor. The VM386SX+ adds a few processor-specific extensions to the Intel 386 instruction set. The ones documented for the DM&P M6117D are:[63]
Instruction | Opcode | Description
BRKPM | F1 | System management interrupt − enters "hyper state mode"
Several 80387-class floating-point coprocessors provided extra instructions in addition to the standard 80387 ones − none of these are supported in later processors:
Instruction to signal to the FPU that the main CPU is exiting protected mode, similar to how the FSETPM instruction is used to signal to the FPU that the CPU is entering protected mode.
Different sources provide different encodings for this instruction.
Multiply 4-component vector with 4x4 matrix. For proper operation, the matrix must be preloaded into coprocessor register banks 1 and 2 (unique to IIT FPUs), and the vector must be loaded into coprocessor register bank 0. Example code is available.[67][69]
Round st(0) to integer, with round-to-nearest ties-away-from-zero rounding.[70]
FRICHOP
DD FC
Round st(0) to integer, with round-to-zero rounding.
FRINEAR
DF FC
Round st(0) to integer, with round-to-nearest-even rounding.[70]
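The three rounding behaviours listed above differ only on ties and on the direction of truncation. The following Python sketch illustrates them using the standard `decimal` module; the function names are hypothetical, and the sketch models only the arithmetic result, not FPU flags or exceptions.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP


def _round_with(x: float, mode: str) -> float:
    # Decimal(x) converts the float exactly, so ties are detected reliably.
    return float(Decimal(x).to_integral_value(rounding=mode))


def round_ties_away(x: float) -> float:
    """Round to nearest, ties away from zero (decimal's ROUND_HALF_UP)."""
    return _round_with(x, ROUND_HALF_UP)


def round_toward_zero(x: float) -> float:
    """Round toward zero, as described for FRICHOP."""
    return _round_with(x, ROUND_DOWN)


def round_nearest_even(x: float) -> float:
    """Round to nearest, ties to even, as described for FRINEAR."""
    return _round_with(x, ROUND_HALF_EVEN)


if __name__ == "__main__":
    for value in (2.5, -2.5, 3.5, 2.7):
        print(value, round_ties_away(value), round_toward_zero(value),
              round_nearest_even(value))
```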
^The FNSTSG AX instruction can be executed not just on the Intel 387SL FPU but on the Intel 387SX as well. Executing the instruction immediately after an FNINIT returns 0000h on the 387SX, but a nonzero signature value on the 387SL.[66]
^Microprocessor Report, System Management Mode Explained (vol 6, no. 8, June 17, 1992) − includes a listing of the AMD/Cyrix SMM opcodes and the C&T Super386 "SuperState V" opcodes. Archived on 29 Jun 2022.
^John H. Wharton, The Complete X86, Volume 1, 1994. MicroDesign Resources, ISBN 1-885330-02-2. Covers instruction set additions of Am486SXLV on page 210, Cyrix 486S on page 273 and IBM 386SLC on page 298.
^Hans Peter Messmer, "The Indispensable PC Hardware Book" (ISBN 0201403994), chapter 10.6.1, pages 280-281
^Frank van Gilluwe, "The Undocumented PC, second edition", 1997, ISBN 0-201-47950-8, page 120
^Microprocessor Report, UMC Announces Enhanced 486SX-Compatible (vol 8, no. 7, May 30, 1994) − describes the UMC U5S as having "built-in SMM, which is hardware- and software-compatible with AMD’s implementation." Archived on 7 Sep 2024.
^AMD, AMD64 Architecture Programmer’s Manual Volume 5, pub.no.26569, rev 3.16, Nov 2021 − provides details on how PFRCPIT1, PFRSQIT1 and PFRCPIT2 perform their Newton-Raphson iterations on pages 118 to 125. Archived on 24 Sep 2023.
^Intel, Advanced Vector Extensions Programming Reference, order no. 319433-002, March 2008 − contains specifications of VPERMIL2PD and VPERMIL2PS on pages 411 and 420, as well as FMA4 instructions on pages 612 to 660. Archived from the original on 7 Aug 2011.
^NEC V55SC 16-bit Microprocessor Preliminary Data Sheet (O.D.No ID-8206A, March 1993), pages 70 and 127. Located on Apr 20, 2022 by searching for "nec v55sc" at datasheetarchive.com. Archived on Nov 22, 2022.
^VIA Technologies, VIA C3 Samuel 2 Processor Datasheet, version 1.10, January 2002 − publicly available datasheet that lists the 0F 3F and 8D 84 00 imm32 AIS opcodes (without mnemonics) on page 60. Archived from the original on 10 Apr 2004.
^Intel, "Intel387 SL Mobile Math Coprocessor" (Feb 1992, order no. 290427-001), appendix A. Located on Jan 7, 2022 by searching for "intel387 sl" at datasheetarchive.com. Archived on Jan 7, 2022.