CN105204820B - Instructions and logic for providing generic GF(256) SIMD cryptographic arithmetic functions

Info

Publication number: CN105204820B
Application number: CN201510274232.0A
Authority: CN
Inventors: S·格伦
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2014-06-26
Filing date: 2015-05-26
Publication date: 2019-02-22
Anticipated expiration: 2035-05-26
Also published as: DE102015006670A1; CN105204820A; DE102015006670B4

Abstract

Instructions and logic provide generic GF(2⁸) SIMD cryptographic arithmetic functionality. Embodiments include a processor to decode an instruction for a SIMD binary finite field multiplicative inverse, specifying a source data operand and an irreducible polynomial, to compute, for each element of the source data operand, the inverse modulo the irreducible polynomial. The result of the instruction is stored in a SIMD destination register. Some embodiments also decode an instruction for a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector. The transformation matrix and the translation vector are applied to each element of the source data operand. Some embodiments also decode an instruction for a SIMD binary finite field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial.

Description

Instructions and logic for providing generic GF(256) SIMD cryptographic arithmetic functions

Technical Field

The present disclosure pertains to the field of processing logic, microprocessors, and associated instruction set architectures that, when executed by a processor or other processing logic, perform logical, mathematical, or other functional operations. More specifically, the disclosure relates to instructions and logic for providing generic GF(256) SIMD cryptographic arithmetic functionality.

Background

Cryptography is a tool that relies on an algorithm and a key to protect information. The algorithm is a complex mathematical algorithm, and the key is a string of bits. There are two basic types of cryptographic systems: secret key systems and public key systems. A secret key system (also known as a symmetric system) has a single key (the "secret key") that is shared by two or more parties. That single key is used both to encrypt and to decrypt information.

For example, the Advanced Encryption Standard (AES), also known as Rijndael, is a block cipher developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, and adopted as an encryption standard by the United States government. On November 26, 2001, the National Institute of Standards and Technology (NIST) announced AES as U.S. FIPS PUB 197 (FIPS 197).

AES has a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. Key expansion using the Rijndael key schedule transforms a key of 128, 192, or 256 bits into 10, 12, or 14 round keys of 128 bits each. The round keys are used to process the plaintext data in rounds as 128-bit blocks (viewed as 4×4 arrays of bytes) and to convert them into ciphertext blocks. Typically, for a 128-bit (16-byte) input to a round, each byte is replaced by another byte according to a lookup table called the S-box. This part of the block cipher is called SubBytes. Next, the rows of bytes (viewed as a 4×4 array) are cyclically shifted, or rotated, left by a particular offset (i.e., row zero by 0 bytes, row one by 1 byte, row two by 2 bytes, and row three by 3 bytes). This part of the block cipher is called ShiftRows. Then each column of bytes is viewed as the four coefficients of a polynomial over the finite field GF(256) (also called the Galois field 2⁸) and is multiplied by an invertible linear transformation. This part of the block cipher is called MixColumns. Finally, the 128-bit block is XORed with a round key to produce a ciphertext block of 16 bytes, which is called AddRoundKey.
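The ShiftRows and MixColumns steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented hardware: the function names are hypothetical, the state is assumed to be stored as four row lists, and the MixColumns coefficients {02, 03, 01, 01} and the reduction polynomial 0x11B are taken from FIPS 197 rather than from the text above.

```python
# Sketch of two AES round steps on a 4x4 byte state stored as four rows.
# Names are illustrative; constants come from FIPS 197, not from this document.
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def shift_rows(state):
    """Rotate row r of the 4x4 state left by r bytes (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def xtime(a):
    """Multiply a GF(256) element by x (i.e. by {02}) modulo AES_POLY."""
    a <<= 1
    return a ^ AES_POLY if a & 0x100 else a


def mul3(a):
    """Multiply a GF(256) element by {03} = {02} + {01}."""
    return xtime(a) ^ a


def mix_column(col):
    """Apply the FIPS 197 MixColumns matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


if __name__ == "__main__":
    state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(shift_rows(state))
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Addition in GF(256) is a bytewise XOR, which is why every combination step in `mix_column` uses `^`.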

On systems with words of 32 bits or larger, it is possible to implement the AES cipher by converting the SubBytes, ShiftRows, and MixColumns transformations into four 256-entry, 32-bit tables, which use 4096 bytes of memory. One drawback of a software implementation is performance. Software runs orders of magnitude slower than dedicated hardware, so the added performance of a hardware/firmware implementation is desirable.

Typical straightforward hardware implementations using lookup memories, truth tables, binary decision diagrams, or 256-input multiplexers are costly in terms of circuit area. Alternative approaches using finite fields isomorphic to GF(256) may be efficient in area, but may also be slower than the straightforward hardware implementations.

Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which can be exploited through an efficient implementation using various data storage devices, such as single-instruction, multiple-data (SIMD) vector registers. The central processing unit (CPU) may then provide parallel hardware to support processing vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size M (where M is 2^k, e.g., 256, 128, 64, 32, ... 4, or 2) may contain N vector elements of size O, where N = M/O. For example, a 64-byte vector register may be partitioned into: (a) 64 vector elements, each holding a data item that occupies 1 byte; (b) 32 vector elements, each holding a data item that occupies 2 bytes (one "word"); (c) 16 vector elements, each holding a data item that occupies 4 bytes (one "doubleword"); or (d) 8 vector elements, each holding a data item that occupies 8 bytes (one "quadword"). The parallel nature of SIMD vector registers is well suited to processing secure hashing algorithms.
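The partitioning rule N = M/O can be checked with a short Python sketch. The helper names below are hypothetical and the register is modeled simply as a `bytes` object; nothing here reflects a particular processor's register file.

```python
# Sketch: partition a vector register of M bytes into N = M / O elements of O bytes.
# Helper names are illustrative only.

def element_count(register_bytes, element_bytes):
    """Return N = M / O, rejecting element sizes that do not divide the register."""
    if register_bytes % element_bytes != 0:
        raise ValueError("element size must divide the register size")
    return register_bytes // element_bytes


def split_register(data, element_bytes):
    """Split the register contents into consecutive elements of element_bytes each."""
    element_count(len(data), element_bytes)  # validates the sizes
    return [data[i:i + element_bytes] for i in range(0, len(data), element_bytes)]


if __name__ == "__main__":
    register = bytes(range(64))  # a 64-byte vector register
    for name, size in [("byte", 1), ("word", 2), ("doubleword", 4), ("quadword", 8)]:
        print(name, element_count(len(register), size))
```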

Other similar encryption algorithms may also be of interest. For example, the Rijndael specification per se is specified with a variety of block and key sizes, each of which may be any multiple of 32 bits, with a minimum of 128 bits and a maximum of 256 bits. Another example is SMS4, a block cipher used in the Chinese national standard for wireless LAN, WAPI (Wired Authentication and Privacy Infrastructure). It also processes the plaintext data in rounds (i.e., 32 rounds) as 128-bit blocks in GF(256), but performs reductions modulo a different polynomial.

To date, options that provide efficient space-time design tradeoffs, and potential solutions to such complexities, performance-limiting issues, and other bottlenecks, have not been fully explored.

Brief Description of the Drawings

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

Figure 1A is a block diagram of one embodiment of a system that executes instructions to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 1B is a block diagram of another embodiment of a system that executes instructions to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 1C is a block diagram of another embodiment of a system that executes instructions to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 2 is a block diagram of one embodiment of a processor that executes instructions to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 3A illustrates packed data types according to one embodiment.

Figure 3B illustrates packed data types according to one embodiment.

Figure 3C illustrates packed data types according to one embodiment.

Figure 3D illustrates an instruction encoding to provide generic GF(256) SIMD cryptographic arithmetic functionality according to one embodiment.

Figure 3E illustrates an instruction encoding to provide generic GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3F illustrates an instruction encoding to provide generic GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3G illustrates an instruction encoding to provide generic GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 3H illustrates an instruction encoding to provide generic GF(256) SIMD cryptographic arithmetic functionality according to another embodiment.

Figure 4A illustrates elements of one embodiment of a processor microarchitecture to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 4B illustrates elements of another embodiment of a processor microarchitecture to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 5 is a block diagram of one embodiment of a processor to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 6 is a block diagram of one embodiment of a computer system to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 7 is a block diagram of another embodiment of a computer system to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 8 is a block diagram of another embodiment of a computer system to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 9 is a block diagram of one embodiment of a system-on-a-chip to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 10 is a block diagram of an embodiment of a processor to execute instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 11 is a block diagram of one embodiment of an IP core development system that provides generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 12 illustrates one embodiment of an architecture emulation system that provides generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 13 illustrates one embodiment of a system to translate instructions that provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 14 illustrates a flow diagram for one embodiment of a process for efficiently implementing the Advanced Encryption Standard (AES) encryption/decryption standard.

Figure 15 illustrates a flow diagram for one embodiment of a process for efficiently implementing the multiplicative inverse of the AES S-box.

Figure 16A illustrates a diagram for one embodiment of an apparatus for execution of an affine map instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 16B illustrates a diagram for one embodiment of an apparatus for execution of an affine inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 16C illustrates a diagram for an alternative embodiment of an apparatus for execution of an inverse affine instruction, which computes a multiplicative inverse and then applies an affine transformation to the result, to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 17A illustrates a diagram for one embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 17B illustrates a diagram for an alternative embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 17C illustrates a diagram for another alternative embodiment of an apparatus for execution of a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 18A illustrates a diagram for one embodiment of an apparatus for execution of a specific modulus reduction instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 18B illustrates a diagram for an alternative embodiment of an apparatus for execution of a specific modulus reduction instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 18C illustrates a diagram for another alternative embodiment of an apparatus for execution of a specific AES Galois Counter Mode (GCM) modulus reduction instruction to provide generic GF(2¹²⁸) SIMD cryptographic arithmetic functionality.

Figure 18D illustrates a diagram for one embodiment of an apparatus for execution of a modulus reduction instruction to provide generic GF(2^t) SIMD cryptographic arithmetic functionality.

Figure 19A illustrates a diagram for one embodiment of an apparatus for execution of a binary finite field multiplication instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 19B illustrates a diagram for an alternative embodiment of an apparatus for execution of a binary finite field multiplication instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 20A illustrates a flow diagram for one embodiment of a process for execution of an affine map instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 20B illustrates a flow diagram for one embodiment of a process for execution of a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 20C illustrates a flow diagram for one embodiment of a process for execution of an affine inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Figure 20D illustrates a flow diagram for one embodiment of a process for execution of a binary finite field multiplication instruction to provide generic GF(256) SIMD cryptographic arithmetic functionality.

Detailed Description

The following description discloses instructions and processing logic to provide generic GF(2ⁿ) SIMD cryptographic arithmetic functionality, in particular where n may be equal to 2^m (e.g., GF(2⁸), GF(2¹⁶), GF(2³²), ... GF(2¹²⁸), etc.). Embodiments include a processor to decode an instruction for a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector. The transformation matrix is applied to each element of the source data operand, and the translation vector is applied to each of the transformed elements. The result of the instruction is stored in a SIMD destination register. Some embodiments also decode an instruction for a SIMD binary finite field multiplicative inverse to compute, for each element of the source data operand, the inverse in the binary finite field modulo an irreducible polynomial. Some embodiments also decode an instruction for a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation), wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements. Some embodiments also decode an instruction for a SIMD modulus reduction to compute a reduction modulo a particular modulus polynomial p_s selected from the polynomials in the binary finite field for which modulus reduction is provided by the instruction (or micro-instruction). Some embodiments also decode an instruction for a SIMD binary finite field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial.
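As a rough software model of the per-element arithmetic behind the multiplication and multiplicative-inverse instructions described above, the following Python sketch operates on byte vectors. It is an illustration under stated assumptions, not the disclosed hardware: the function names are hypothetical, the default polynomial 0x11B is the AES polynomial from FIPS 197, the inverse is computed by the generic identity a⁻¹ = a²⁵⁴ rather than by any circuit from the figures, and mapping 0 to 0 is an assumed convention.

```python
# Software model of element-wise GF(2^8) multiply and inverse on byte vectors.
# All names are illustrative; the polynomial is a parameter, defaulting to AES's.
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1 (FIPS 197)


def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two GF(2^8) elements, reducing modulo the degree-8 polynomial poly."""
    product = 0
    while b:
        if b & 1:
            product ^= a      # add (XOR) the current multiple of a
        a <<= 1               # multiply a by x
        if a & 0x100:
            a ^= poly         # reduce when the degree reaches 8
        b >>= 1
    return product


def gf256_inv(a, poly=AES_POLY):
    """Multiplicative inverse as a^254; 0 maps to 0 (an assumed convention)."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a, poly)
    return result


def simd_gf256_mul(src1, src2, poly=AES_POLY):
    """Multiply corresponding byte elements of two equal-length vectors."""
    return bytes(gf256_mul(x, y, poly) for x, y in zip(src1, src2))


def simd_gf256_inv(src, poly=AES_POLY):
    """Invert every byte element of the source vector."""
    return bytes(gf256_inv(x, poly) for x in src)


if __name__ == "__main__":
    print(simd_gf256_mul(bytes([0x57, 0x02]), bytes([0x83, 0x80])).hex())
    print(simd_gf256_inv(bytes([0x00, 0x01, 0x53])).hex())
```

The test values {57}·{83} = {C1} and {53}⁻¹ = {CA} are the worked examples from FIPS 197. Passing a different irreducible degree-8 polynomial as `poly` changes the field representation without changing the code.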

It will be appreciated that generic GF(2ⁿ) SIMD cryptographic arithmetic instructions, as in the embodiments described herein, may be used to provide arithmetic functionality in applications such as cryptographic protocols and Internet communication, to assure privacy, data integrity, identity verification, message content authentication, and message origin authentication for financial transactions, electronic commerce, electronic mail, software distribution, data storage, and the like.

It will also be appreciated that providing for execution of instructions for at least: (1) a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector, wherein the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (2) a SIMD binary finite field multiplicative inverse, to compute, for each element of the source data operand, the inverse in the binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation) specifying a source data operand, a transformation matrix operand, and a translation vector, wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (4) a modulus reduction, to compute a reduction modulo a particular modulus polynomial p_s selected from the polynomials in the binary finite field for which modulus reduction is provided by the instruction (or micro-instruction); and (5) a SIMD binary finite field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial; wherein the results of the instructions are stored in SIMD destination registers; may provide generic GF(256) and/or other alternative binary finite field SIMD cryptographic arithmetic functionality in hardware and/or microcode sequences, in order to support significant performance improvements for several important performance-critical applications without excessive or undue functional units requiring additional circuitry, area, or power.
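Items (1) through (3) above can be modeled in software as follows. This is a minimal Python sketch, not the disclosed apparatus: the function names are hypothetical, and the bit ordering (matrix row i stored as a bitmask whose bit j is column j, with bit 0 the least significant) is an assumption made for this example. The matrix rows and the vector 0x63 are the AES S-box affine constants from FIPS 197, used here so the result can be checked against known S-box values.

```python
# Model of a per-byte affine transformation over GF(2), optionally preceded by a
# GF(2^8) multiplicative inverse. Names and bit ordering are illustrative assumptions.
AES_POLY = 0x11B                                                # FIPS 197 polynomial
AES_MATRIX = [0xF1, 0xE3, 0xC7, 0x8F, 0x1F, 0x3E, 0x7C, 0xF8]   # row i, bit j = column j
AES_VECTOR = 0x63                                               # translation vector


def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two GF(2^8) elements modulo poly."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return product


def gf256_inv(a, poly=AES_POLY):
    """Multiplicative inverse as a^254; 0 maps to 0 (an assumed convention)."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a, poly)
    return result


def affine(x, matrix, vector):
    """Compute matrix * x + vector over GF(2) for one byte x."""
    y = 0
    for i, row in enumerate(matrix):
        parity = bin(row & x).count("1") & 1   # dot product of row i with x, mod 2
        y |= parity << i
    return y ^ vector


def simd_affine(src, matrix, vector):
    """Apply the affine transformation to every byte element."""
    return bytes(affine(x, matrix, vector) for x in src)


def simd_inverse_affine(src, matrix, vector, poly=AES_POLY):
    """Invert every byte element, then apply the affine transformation."""
    return bytes(affine(gf256_inv(x, poly), matrix, vector) for x in src)


if __name__ == "__main__":
    print(simd_inverse_affine(bytes([0x00, 0x01, 0x53]), AES_MATRIX, AES_VECTOR).hex())
```

With the AES constants, inverse followed by affine reproduces the SubBytes substitution, so 0x00, 0x01, and 0x53 map to 0x63, 0x7C, and 0xED. Supplying a different matrix, vector, or polynomial models other ciphers that use the same structure.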

In the following description, numerous specific details such as processing logic, processor types, microarchitectural conditions, events, enablement mechanisms, and the like are set forth in order to provide a more thorough understanding of embodiments of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring embodiments of the present invention.

Although the following embodiments are described with reference to a processor, other embodiments are applicable to other types of integrated circuits and logic devices. Similar techniques and teachings of embodiments of the present invention can be applied to other types of circuits or semiconductor devices that can benefit from higher pipeline throughput and improved performance. The teachings of embodiments of the present invention are applicable to any processor or machine that performs data manipulations. The present invention is not, however, limited to processors or machines that perform 512-bit, 256-bit, 128-bit, 64-bit, 32-bit, 16-bit, or 8-bit data operations, and can be applied to any processor and machine in which manipulation or management of data is performed. In addition, the following description provides examples, and the accompanying drawings show various examples for the purposes of illustration. These examples should not be construed in a limiting sense, however, as they are merely intended to provide examples of embodiments of the present invention rather than an exhaustive list of all possible implementations.

Although the examples below describe instruction handling and distribution in the context of execution units and logic circuits, other embodiments of the present invention can be accomplished by way of data and/or instructions stored on a machine-readable, tangible medium, which, when performed by a machine, cause the machine to perform functions consistent with at least one embodiment of the invention. In one embodiment, functions associated with embodiments of the present invention are embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor that is programmed with the instructions to perform the steps of the present invention. Embodiments of the present invention may also be provided as a computer program product or software, which may include a machine-readable or computer-readable medium having stored thereon instructions that may be used to program a computer (or other electronic devices) to perform one or more operations according to embodiments of the present invention. Alternatively, steps of embodiments of the present invention might be performed by specific hardware components that contain fixed-function logic for performing the steps, or by any combination of programmed computer components and fixed-function hardware components.

Instructions used to program logic to perform embodiments of the invention can be stored within a memory in the system, such as DRAM, cache, flash memory, or other storage. Furthermore, the instructions can be distributed via a network or by way of other computer-readable media. Thus a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memory (CD-ROM), magneto-optical disks, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or a tangible, machine-readable storage used in the transmission of information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Accordingly, the computer-readable medium includes any type of tangible machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).

A design may go through various stages, from creation to simulation to fabrication. Data representing a design may represent the design in a number of ways. First, as is useful in simulations, the hardware may be represented using a hardware description language or another functional description language. Additionally, a circuit-level model with logic and/or transistor gates may be produced at some stages of the design process. Furthermore, most designs, at some stage, reach a level of data representing the physical placement of various devices in the hardware model. Where conventional semiconductor fabrication techniques are used, the data representing the hardware model may be the data specifying the presence or absence of various features on different mask layers for the masks used to produce the integrated circuit. In any representation of the design, the data may be stored in any form of machine-readable medium. A memory, or a magnetic or optical storage device such as a disc, may be the machine-readable medium that stores information transmitted via optical or electrical waves that are modulated or otherwise generated to transmit such information. When an electrical carrier wave indicating or carrying the code or design is transmitted, to the extent that copying, buffering, or re-transmission of the electrical signal is performed, a new copy is made. Thus, a communication provider or a network provider may store on a tangible, machine-readable medium, at least temporarily, an article, such as information encoded into a carrier wave, embodying techniques of embodiments of the present invention.

In modern processors, a number of different execution units are used to process and execute a variety of code and instructions. Not all instructions are created equal: some complete quickly, while others can take a number of clock cycles to complete. The faster the throughput of instructions, the better the overall performance of the processor. Thus it would be advantageous to have as many instructions as possible execute as fast as possible. However, certain instructions have greater complexity and require more in terms of execution time and processor resources. Examples include floating-point instructions, load/store operations, data moves, and so on.

As more computer systems are used in Internet, text, and multimedia applications, additional processor support has been introduced over time. In one embodiment, an instruction set may be associated with one or more computer architectures, including data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O).

In one embodiment, an instruction set architecture (ISA) may be implemented by one or more microarchitectures, which include the processor logic and circuits used to implement one or more instruction sets. Accordingly, processors with different microarchitectures can share at least a portion of a common instruction set. For example, Pentium 4 processors, Core™ processors, and processors from Advanced Micro Devices, Inc. of Sunnyvale, California implement nearly identical versions of the x86 instruction set (with some extensions added in newer versions), but have different internal designs. Similarly, processors designed by other processor development companies, such as ARM Holdings, Ltd., MIPS, or their licensees or adopters, may share at least a portion of a common instruction set, but may include different processor designs. For example, the same register architecture of the ISA may be implemented in different ways in different microarchitectures using new or well-known techniques, including dedicated physical registers, or one or more dynamically allocated physical registers using a register renaming mechanism (e.g., a register alias table (RAT), a reorder buffer (ROB), and a retirement register file). In one embodiment, registers may include one or more registers, register architectures, register files, or other register sets that may or may not be addressable by a software programmer.

In one embodiment, an instruction may include one or more instruction formats. In one embodiment, an instruction format may indicate various fields (number of bits, location of bits, etc.) to specify, among other things, the operation to be performed and the operand(s) on which that operation is to be performed. Some instruction formats may be further broken down by instruction templates (or sub-formats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields, and/or defined to have a given field interpreted differently. In one embodiment, an instruction is expressed using an instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and specifies or indicates the operation and the operands on which the operation will operate.

Scientific, financial, auto-vectorized general-purpose, RMS (recognition, mining, and synthesis), and visual and multimedia applications (e.g., 2D/3D graphics, image processing, video compression/decompression, voice recognition algorithms, and audio manipulation) may require the same operation to be performed on a large number of data items. In one embodiment, single instruction multiple data (SIMD) refers to a type of instruction that causes a processor to perform an operation on multiple data elements. SIMD technology may be used in processors that can logically divide the bits in a register into a number of fixed-size or variable-size data elements, each of which represents a separate value. For example, in one embodiment, the bits in a 64-bit register may be organized as a source operand containing four separate 16-bit data elements, each of which represents a separate 16-bit value. This type of data may be referred to as a packed data type or vector data type, and operands of this data type are referred to as packed data operands or vector operands. In one embodiment, a packed data item or vector may be a sequence of packed data elements stored within a single register, and a packed data operand or vector operand may be a source or destination operand of a SIMD instruction (or "packed data instruction" or "vector instruction"). In one embodiment, a SIMD instruction specifies a single vector operation to be performed on two source vector operands to generate a destination vector operand (also referred to as a result vector operand) of the same or different size, with the same or different number of data elements, and in the same or different data element order.
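The packed layout described above (four 16-bit elements in a 64-bit register) can be modelled in software. The following is a minimal Python sketch, not the patented hardware; the helper names `pack`, `unpack`, and `packed_add`, and the choice of placing element 0 in the low-order bits, are assumptions made for illustration only.

```python
LANE_BITS = 16                      # width of one packed data element
LANES = 4                           # four elements fill a 64-bit register
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit integer (element 0 in the low bits)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four elements")
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit integer back into its four 16-bit elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(src1, src2):
    """One 'SIMD' operation: add corresponding elements, each wrapping modulo 2**16.

    Carries never cross from one element into the next, which is what
    distinguishes a packed add from an ordinary 64-bit add.
    """
    return pack([(a + b) & LANE_MASK for a, b in zip(unpack(src1), unpack(src2))])


a = pack([1, 2, 0xFFFF, 4])
b = pack([10, 20, 1, 40])
print(unpack(packed_add(a, b)))
```

In this sketch the third element overflows and wraps to zero without disturbing the fourth element, illustrating that each data element represents a separate value.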

SIMD technology, such as that employed by Core™ processors having an instruction set including x86, MMX™, Streaming SIMD Extensions (SSE), SSE2, SSE3, SSE4.1, and SSE4.2 instructions, by ARM processors, such as the ARM processor family having an instruction set including Vector Floating Point (VFP) and/or NEON instructions, and by MIPS processors, such as the Loongson family of processors developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, has enabled a significant improvement in application performance (Core™ and MMX™ are registered trademarks or trademarks of Intel Corporation of Santa Clara, California).

In one embodiment, destination and source registers/data are generic terms that denote the source and destination of the corresponding data or operation. In some embodiments, they may be implemented by registers, memory, or other storage areas having names or functions other than those depicted. For example, in one embodiment, "DEST1" may be a temporary storage register or other storage area, whereas "SRC1" and "SRC2" may be first and second source storage registers or other storage areas, and so forth. In other embodiments, two or more of the SRC and DEST storage areas may correspond to different data storage elements within the same storage area (e.g., a SIMD register). In one embodiment, one of the source registers may also act as the destination register, for example by writing the result of an operation performed on the first and second source data back to whichever of the two source registers serves as the destination register.

Figure 1A is a block diagram of an exemplary computer system formed with a processor that includes execution units to execute an instruction, in accordance with one embodiment of the present invention. System 100 includes a component, such as processor 102, that employs execution units including logic to perform algorithms for processing data, in accordance with the present invention, such as in the embodiments described herein. System 100 is representative of processing systems based on the Pentium III, Pentium 4, Xeon™, XScale™ and/or StrongARM™ microprocessors available from Intel Corporation of Santa Clara, California, although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and the like) may also be used. In one embodiment, sample system 100 may execute a version of the WINDOWS™ operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (UNIX and Linux, for example), embedded software, and/or graphical user interfaces may also be used. Thus, embodiments of the present invention are not limited to any specific combination of hardware circuitry and software.

Embodiments are not limited to computer systems. Alternative embodiments of the present invention may be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications may include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPCs), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can execute one or more instructions in accordance with at least one embodiment.

Figure 1A is a block diagram of a computer system 100 formed with a processor 102 that includes one or more execution units 108 to perform an algorithm that executes at least one instruction in accordance with one embodiment of the present invention. One embodiment may be described in the context of a single-processor desktop or server system, but alternative embodiments may be included in a multiprocessor system. System 100 is an example of a "hub" system architecture. Computer system 100 includes a processor 102 to process data signals. Processor 102 may be a complex instruction set computer (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. Processor 102 is coupled to a processor bus 110 that can transmit data signals between processor 102 and other components in system 100. The elements of system 100 perform conventional functions that are well known in the art.

In one embodiment, processor 102 includes a level 1 (L1) internal cache memory 104. Depending on the architecture, processor 102 may have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory may reside external to processor 102. Other embodiments may also include a combination of internal and external caches, depending on the particular implementation and needs. Register file 106 may store different types of data in various registers, including integer registers, floating-point registers, status registers, and an instruction pointer register.

Execution unit 108, including logic to perform integer and floating-point operations, also resides in processor 102. Processor 102 also includes a microcode (ucode) ROM that stores microcode for certain macroinstructions. For one embodiment, execution unit 108 includes logic to handle a packed instruction set 109. By including the packed instruction set 109 in the instruction set of general-purpose processor 102, along with the associated circuitry to execute the instructions, the operations used by many multimedia applications may be performed using packed data in general-purpose processor 102. Many multimedia applications can thus be accelerated and executed more efficiently by using the full width of the processor's data bus to perform operations on packed data. This can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.

Alternative embodiments of execution unit 108 may also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 100 includes a memory 120. Memory 120 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or another memory device. Memory 120 may store instructions and/or data, represented by data signals, that may be executed by processor 102.

A system logic chip 116 is coupled to processor bus 110 and memory 120. The system logic chip 116 in the illustrated embodiment is a memory controller hub (MCH). Processor 102 may communicate with MCH 116 via processor bus 110. MCH 116 provides a high-bandwidth memory path 118 to memory 120 for instruction and data storage and for storage of graphics commands, data, and textures. MCH 116 directs data signals between processor 102, memory 120, and other components in system 100, and bridges the data signals between processor bus 110, memory 120, and system I/O 122. In some embodiments, the system logic chip 116 may provide a graphics port for coupling to a graphics controller 112. MCH 116 is coupled to memory 120 through memory interface 118. Graphics card 112 is coupled to MCH 116 through an Accelerated Graphics Port (AGP) interconnect 114.

System 100 uses a dedicated hub interface bus 122 to couple MCH 116 to the I/O controller hub (ICH) 130. ICH 130 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to memory 120, the chipset, and processor 102. Some examples are the audio controller, firmware hub (flash BIOS) 128, wireless transceiver 126, data storage 124, a legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 134. Data storage device 124 may comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or another mass storage device.

For another embodiment of a system, an instruction in accordance with one embodiment may be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory may be located on the same die as the processor and other system components. Additionally, other logic blocks, such as a memory controller or graphics controller, may also be located on the system on a chip.

Figure 1B illustrates a data processing system 140 that implements the principles of one embodiment of the present invention. Those skilled in the art will readily appreciate that the embodiments described herein may be used with alternative processing systems without departing from the scope of embodiments of the invention.

Computer system 140 comprises a processing core 159 capable of executing at least one instruction in accordance with one embodiment. For one embodiment, processing core 159 represents a processing unit of any type of architecture, including but not limited to a CISC, RISC, or VLIW type architecture. Processing core 159 may also be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate said manufacture.

Processing core 159 comprises an execution unit 142, a set of register files 145, and a decoder 144. Processing core 159 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention. Execution unit 142 executes instructions received by processing core 159. In addition to executing typical processor instructions, execution unit 142 can execute instructions in packed instruction set 143 to perform operations on packed data formats. Packed instruction set 143 includes instructions for performing embodiments of the invention as well as other packed instructions. Execution unit 142 is coupled to register file 145 by an internal bus. Register file 145 represents a storage area on processing core 159 for storing information, including data. As previously mentioned, it is understood that the storage area used for storing the packed data is not critical. Execution unit 142 is coupled to decoder 144. Decoder 144 decodes instructions received by processing core 159 into control signals and/or microcode entry points. In response to these control signals and/or microcode entry points, execution unit 142 performs the appropriate operations. In one embodiment, the decoder interprets the opcode of the instruction, which indicates what operation should be performed on the corresponding data indicated within the instruction.

Processing core 159 is coupled to bus 141 for communicating with various other system devices, which may include, but are not limited to, for example, a synchronous dynamic random access memory (SDRAM) controller 146, a static random access memory (SRAM) controller 147, a burst flash memory interface 148, a Personal Computer Memory Card International Association (PCMCIA)/CompactFlash (CF) card controller 149, a liquid crystal display (LCD) controller 150, a direct memory access (DMA) controller 151, and an alternative bus master interface 152. In one embodiment, data processing system 140 may also comprise an I/O bridge 154 for communicating with various I/O devices via an I/O bus 153. Such I/O devices may include, but are not limited to, for example, a universal asynchronous receiver/transmitter (UART) 155, a universal serial bus (USB) 156, a Bluetooth wireless UART 157, and an I/O expansion interface 158.

One embodiment of data processing system 140 provides for mobile, network, and/or wireless communications and a processing core 159 capable of performing SIMD operations, including a text string comparison operation. Processing core 159 may be programmed with various audio, video, imaging, and communications algorithms, including discrete transformations such as a Walsh-Hadamard transform, a fast Fourier transform (FFT), a discrete cosine transform (DCT), and their respective inverse transforms; compression/decompression techniques such as color space transformation, video encode motion estimation, or video decode motion compensation; and modulation/demodulation (MODEM) functions such as pulse code modulation (PCM).

Figure 1C illustrates further alternative embodiments of a data processing system capable of executing instructions to provide generic GF(256) SIMD cryptographic arithmetic functions. In accordance with one alternative embodiment, data processing system 160 may include a main processor 166, a SIMD coprocessor 161, a cache memory 167, and an input/output system 168. The input/output system 168 may optionally be coupled to a wireless interface 169. SIMD coprocessor 161 is capable of performing operations including instructions in accordance with one embodiment. Processing core 170 may be suitable for manufacture in one or more process technologies and, by being represented on a machine-readable medium in sufficient detail, may be suitable to facilitate the manufacture of all or part of data processing system 160, including processing core 170.

For one embodiment, SIMD coprocessor 161 comprises an execution unit 162 and a set of register files 164. One embodiment of main processor 166 comprises a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment, for execution by execution unit 162. For alternative embodiments, SIMD coprocessor 161 also comprises at least part of a decoder 165B to decode instructions of instruction set 163. Processing core 170 also includes additional circuitry (not shown) that is not necessary to the understanding of embodiments of the present invention.

In operation, main processor 166 executes a stream of data processing instructions that control data processing operations of a general type, including interactions with cache memory 167 and input/output system 168. Embedded within the stream of data processing instructions are SIMD coprocessor instructions. The decoder 165 of main processor 166 recognizes these SIMD coprocessor instructions as being of a type that should be executed by an attached SIMD coprocessor 161. Accordingly, main processor 166 issues these SIMD coprocessor instructions (or control signals representing SIMD coprocessor instructions) on the coprocessor bus 171, from which they are received by any attached SIMD coprocessors. In this case, SIMD coprocessor 161 will accept and execute any received SIMD coprocessor instructions intended for it.

Data may be received via wireless interface 169 for processing by the SIMD coprocessor instructions. For one example, voice communication may be received in the form of a digital signal, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples representative of the voice communication. For another example, compressed audio and/or video may be received in the form of a digital bit stream, which may be processed by the SIMD coprocessor instructions to regenerate digital audio samples and/or motion video frames. For one embodiment of processing core 170, main processor 166 and SIMD coprocessor 161 are integrated into a single processing core 170 comprising an execution unit 162, a set of register files 164, and a decoder 165 to recognize instructions of instruction set 163, including instructions in accordance with one embodiment.

Figure 2 is a block diagram of the microarchitecture of a processor 200 that includes logic circuits to execute instructions in accordance with one embodiment of the present invention. In some embodiments, an instruction in accordance with one embodiment can be implemented to operate on data elements having sizes of byte, word, doubleword, quadword, and so on, as well as data types such as single- and double-precision integer and floating-point data types. In one embodiment, the in-order front end 201 is the part of processor 200 that fetches instructions to be executed and prepares them for later use in the processor pipeline. The front end 201 may include several units. In one embodiment, the instruction prefetcher 226 fetches instructions from memory and feeds them to an instruction decoder 228, which in turn decodes or interprets them. For example, in one embodiment, the decoder decodes a received instruction into one or more operations called "microinstructions" or "micro-operations" (also called micro-ops or uops) that the machine can execute. In other embodiments, the decoder parses the instruction into an opcode and corresponding data and control fields that are used by the microarchitecture to perform operations in accordance with one embodiment. In one embodiment, the trace cache 230 takes decoded uops and assembles them into program-ordered sequences or traces in the uop queue 234 for execution. When the trace cache 230 encounters a complex instruction, the microcode ROM 232 provides the uops needed to complete the operation.

Some instructions are converted into a single micro-op, whereas others need several micro-ops to complete the full operation. In one embodiment, if more than four micro-ops are needed to complete an instruction, the decoder 228 accesses the microcode ROM 232 to carry out the instruction. For one embodiment, an instruction can be decoded into a small number of micro-ops for processing at the instruction decoder 228. In another embodiment, an instruction can be stored within the microcode ROM 232 should a number of micro-ops be needed to accomplish the operation. The trace cache 230 refers to an entry point programmable logic array (PLA) to determine the correct microinstruction pointer for reading the microcode sequences from the microcode ROM 232 to complete one or more instructions in accordance with one embodiment. After the microcode ROM 232 finishes sequencing micro-ops for an instruction, the front end 201 of the machine resumes fetching micro-ops from the trace cache 230.
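The decode rule in the embodiment above (instructions needing more than four micro-ops are handled through the microcode ROM) can be expressed as a small decision function. This is an illustrative Python sketch only; the function name `decode_path` and the returned labels are hypothetical and do not correspond to any actual decoder interface.

```python
MAX_DECODER_UOPS = 4  # threshold stated for one embodiment of decoder 228


def decode_path(uop_count):
    """Return which front-end structure supplies the micro-ops for an instruction.

    Instructions needing at most four micro-ops are emitted by the instruction
    decoder; anything larger is sequenced out of the microcode ROM.
    """
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-op")
    return "decoder" if uop_count <= MAX_DECODER_UOPS else "microcode_rom"


for n in (1, 4, 5, 12):
    print(n, decode_path(n))
```

The threshold is kept as a named constant because the text presents it as a property of one embodiment rather than a fixed requirement.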

The out-of-order execution engine 203 is where instructions are prepared for execution. The out-of-order execution logic has a number of buffers to smooth out and reorder the flow of instructions, optimizing performance as they go down the pipeline and get scheduled for execution. The allocator logic allocates the machine buffers and resources that each uop needs in order to execute. The register renaming logic renames logical registers onto entries in a register file. The allocator also allocates an entry for each uop in one of two uop queues, one for memory operations and one for non-memory operations, in front of the instruction schedulers: the memory scheduler, fast scheduler 202, slow/general floating-point scheduler 204, and simple floating-point scheduler 206. The uop schedulers 202, 204, 206 determine when a uop is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources the uop needs to complete its operation. The fast scheduler 202 of one embodiment can schedule on each half of the main clock cycle, while the other schedulers can schedule only once per main processor clock cycle. The schedulers arbitrate for the dispatch ports to schedule uops for execution.
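The readiness test applied by the uop schedulers (all source operands available and the required execution resource free) can be sketched as follows. This is a simplified Python model written for illustration; the `Uop` record, the register and port names, and the oldest-first selection policy are assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str        # label for the micro-op
    sources: tuple   # names of the input registers it depends on
    dest: str        # register it writes
    port: str        # dispatch port (execution resource) it needs


def select_ready(uops, ready_regs, free_ports):
    """Pick the uops that can issue this cycle, scanning in program order.

    A uop issues only if every source register already holds its value and
    its dispatch port has not been claimed by an earlier uop this cycle.
    """
    ports = set(free_ports)
    issued = []
    for uop in uops:
        operands_ready = all(src in ready_regs for src in uop.sources)
        if operands_ready and uop.port in ports:
            issued.append(uop.name)
            ports.remove(uop.port)   # arbitration: one uop per port per cycle
    return issued


queue = [
    Uop("add", ("r1", "r2"), "r3", "alu0"),
    Uop("mul", ("r3",), "r4", "alu1"),   # waits for the add's result
    Uop("sub", ("r1",), "r5", "alu0"),   # operands ready, but alu0 is taken
]
print(select_ready(queue, {"r1", "r2"}, {"alu0", "alu1"}))
```

Here only the first uop issues: the second is blocked on a data dependency and the third on port availability, which are the two conditions the paragraph above names.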

Register files 208 and 210 sit between the schedulers 202, 204, and 206 and the execution units 212, 214, 216, 218, 220, 222, and 224 in the execution block 211. There are separate register files 208 and 210 for integer and floating-point operations, respectively. Each register file 208, 210 of one embodiment also includes a bypass network that can bypass or forward just-completed results that have not yet been written into the register file to new dependent uops. The integer register file 208 and the floating-point register file 210 are also capable of communicating data with each other. For one embodiment, the integer register file 208 is split into two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. The floating-point register file 210 of one embodiment has 128-bit-wide entries because floating-point instructions typically have operands from 64 to 128 bits in width.
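The split of the integer register file into a low-order and a high-order 32-bit half can be illustrated with simple bit manipulation. This Python sketch only shows the data layout implied by the text; the function names `split64` and `join64` are hypothetical, and no claim is made about how the hardware actually routes the halves.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def split64(value):
    """Split a 64-bit integer into (low-order 32 bits, high-order 32 bits)."""
    value &= MASK64
    return value & MASK32, value >> 32


def join64(low, high):
    """Recombine the two 32-bit halves into the original 64-bit value."""
    return ((high & MASK32) << 32) | (low & MASK32)


low, high = split64(0x1122334455667788)
print(hex(low), hex(high))
print(hex(join64(low, high)))
```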

Execution block 211 contains the execution units 212, 214, 216, 218, 220, 222, 224, where the instructions are actually executed. This section includes the register files 208, 210 that store the integer and floating point data operand values that the micro-instructions need to execute. The processor 200 of one embodiment comprises a number of execution units: address generation unit (AGU) 212, AGU 214, fast ALU 216, fast ALU 218, slow ALU 220, floating point ALU 222, and floating point move unit 224. For one embodiment, the floating point execution blocks 222, 224 execute floating point, MMX, SIMD, SSE, and other operations. The floating point ALU 222 of one embodiment includes a 64-bit by 64-bit floating point divider to execute divide, square root, and remainder micro-ops. For embodiments of the present invention, instructions involving a floating point value may be handled with the floating point hardware. In one embodiment, ALU operations go to the high-speed ALU execution units 216, 218. The fast ALUs 216, 218 of one embodiment can execute fast operations with an effective latency of half a clock cycle. For one embodiment, most complex integer operations go to the slow ALU 220, because the slow ALU 220 includes integer execution hardware for long-latency operations, such as a multiplier, shifts, flag logic, and branch processing. Memory load/store operations are executed by the AGUs 212, 214. For one embodiment, the integer ALUs 216, 218, 220 are described in the context of performing integer operations on 64-bit data operands. In alternative embodiments, the ALUs 216, 218, 220 can be implemented to support a variety of data widths, including 16, 32, 128, and 256 bits. Similarly, the floating point units 222, 224 can be implemented to support a range of operands of various bit widths. For one embodiment, the floating point units 222, 224 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

In one embodiment, the uop schedulers 202, 204, 206 dispatch dependent operations before the parent load has finished executing. Because uops are speculatively scheduled and executed in processor 200, the processor 200 also includes logic to handle memory misses. If a data load misses in the data cache, there can be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. A replay mechanism tracks and re-executes instructions that use incorrect data. Only the dependent operations need to be replayed; the independent ones are allowed to complete. The schedulers and replay mechanism of one embodiment of a processor are also designed to catch instructions that provide general purpose GF(256) SIMD cryptographic arithmetic functionality.

The term "registers" refers to the on-board processor storage locations that are used as part of instructions to identify operands. In other words, registers may be those that are usable from outside the processor (from a programmer's perspective). However, the registers of an embodiment are not limited in meaning to a particular type of circuit. Rather, a register of an embodiment is capable of storing and providing data and of performing the functions described herein. The registers described herein can be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, or combinations of dedicated and dynamically allocated physical registers. In one embodiment, integer registers store 32-bit integer data. A register file of one embodiment also contains eight multimedia SIMD registers for packed data. For the discussion below, the registers are understood to be data registers designed to hold packed data, such as the 64-bit wide MMX™ registers (also referred to as "mm" registers in some instances) in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. These MMX registers, available in both integer and floating point forms, can operate with packed data elements that accompany SIMD and SSE instructions. Similarly, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, or later technology (referred to generically as "SSEx") can also be used to hold such packed data operands. In one embodiment, the registers do not need to differentiate between packed data and integer data when storing them. In one embodiment, integer and floating point data are contained either in the same register file or in different register files. Furthermore, in one embodiment, floating point and integer data may be stored in different registers or in the same registers.

In the examples of the following figures, a number of data operands are described. Figure 3A illustrates various packed data type representations in multimedia registers according to one embodiment of the present invention. Figure 3A illustrates data types for a packed byte 310, a packed word 320, and a packed doubleword (dword) 330 for 128-bit wide operands. The packed byte format 310 of this example is 128 bits long and contains sixteen packed byte data elements. A byte is defined here as 8 bits of data. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement increases the storage efficiency of the processor. Also, with sixteen data elements accessed, one operation can now be performed on sixteen data elements in parallel.
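The byte layout just described (byte i occupies bits 8i+7 down to 8i) can be modeled with a plain integer standing in for the 128-bit register. This is a minimal sketch; the helper names `pack_bytes`, `get_byte`, and `bit_range` are illustrative and do not come from the patent.

```python
def pack_bytes(values):
    """Pack sixteen byte values into a 128-bit integer, byte 0 in bits 7..0."""
    values = list(values)
    if len(values) != 16:
        raise ValueError("a 128-bit packed byte operand holds exactly 16 bytes")
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & 0xFF) << (8 * i)
    return reg


def get_byte(reg128, i):
    """Extract byte element i (0..15) from the 128-bit value."""
    if not 0 <= i < 16:
        raise IndexError("byte index out of range")
    return (reg128 >> (8 * i)) & 0xFF


def bit_range(i):
    """Return (high_bit, low_bit) occupied by byte element i."""
    return (8 * i + 7, 8 * i)


reg = pack_bytes(range(16))
print(get_byte(reg, 2), bit_range(2))
print(get_byte(reg, 15), bit_range(15))
```

Byte 15 lands in bits 127 through 120, so the sixteen elements together use every bit of the register.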

Generally, a data element is an individual piece of data that is stored in a single register or memory location together with other data elements of the same length. In packed data sequences relating to SSEx technology, the number of data elements stored in an XMM register is 128 bits divided by the length in bits of an individual data element. Similarly, in packed data sequences relating to MMX and SSE technology, the number of data elements stored in an MMX register is 64 bits divided by the length in bits of an individual data element. Although the data types illustrated in Figure 3A are 128 bits long, embodiments of the present invention can also operate with 64-bit wide, 256-bit wide, 512-bit wide, or other sized operands. The packed word format 320 of this example is 128 bits long and contains eight packed word data elements. Each packed word contains sixteen bits of information. The packed doubleword format 330 of Figure 3A is 128 bits long and contains four packed doubleword data elements. Each packed doubleword data element contains thirty-two bits of information. A packed quadword is 128 bits long and contains two packed quadword data elements.
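The element counts quoted above all follow from one division. As a quick check, assuming only that the register width is an exact multiple of the element width (the function name `element_count` is illustrative):

```python
def element_count(register_bits, element_bits):
    """Number of equal-width data elements that fit in one register."""
    if element_bits <= 0 or register_bits % element_bits:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


# 128-bit XMM register: bytes, words, doublewords, quadwords
for width in (8, 16, 32, 64):
    print(width, element_count(128, width))
```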

Figure 3B illustrates alternative in-register data storage formats. Each packed data can include more than one independent data element. Three packed data formats are illustrated: packed half 341, packed single 342, and packed double 343. One embodiment of packed half 341, packed single 342, and packed double 343 contains fixed-point data elements. For an alternative embodiment, one or more of packed half 341, packed single 342, and packed double 343 may contain floating point data elements. One alternative embodiment of packed half 341 is one hundred twenty-eight bits long and contains eight 16-bit data elements. One alternative embodiment of packed single 342 is one hundred twenty-eight bits long and contains four 32-bit data elements. One embodiment of packed double 343 is one hundred twenty-eight bits long and contains two 64-bit data elements. It will be appreciated that such packed data formats may be further extended to other register lengths, for example, to 96 bits, 160 bits, 192 bits, 224 bits, 256 bits, 512 bits, or more.

Figure 3C illustrates various signed and unsigned packed data type representations in multimedia registers according to one embodiment of the present invention. Unsigned packed byte representation 344 illustrates the storage of unsigned packed bytes in a SIMD register. Information for each byte data element is stored in bit 7 through bit 0 for byte 0, bit 15 through bit 8 for byte 1, bit 23 through bit 16 for byte 2, and so on, and finally bit 127 through bit 120 for byte 15. Thus, all available bits are used in the register. This storage arrangement can increase the storage efficiency of the processor. Also, with sixteen data elements accessed, one operation can be performed on sixteen data elements in parallel. Signed packed byte representation 345 illustrates the storage of signed packed bytes. Note that the eighth bit of every byte data element is the sign indicator. Unsigned packed word representation 346 illustrates how word 7 through word 0 are stored in a SIMD register. Signed packed word representation 347 is similar to the unsigned packed word in-register representation 346. Note that the sixteenth bit of each word data element is the sign indicator. Unsigned packed doubleword representation 348 shows how doubleword data elements are stored. Signed packed doubleword representation 349 is similar to the unsigned packed doubleword in-register representation 348. Note that the necessary sign bit is the thirty-second bit of each doubleword data element.
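The difference between the signed and unsigned representations is purely one of interpretation: the same bits are stored, and the top bit of each element acts as the sign indicator. A small sketch, assuming two's-complement interpretation (the patent text only says the top bit is the sign indicator; the helper name `as_signed` is illustrative):

```python
def as_signed(value, bits):
    """Reinterpret an unsigned element of the given width as two's complement."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)  # 8th, 16th, or 32nd bit of the element
    return value - (1 << bits) if value & sign_bit else value


print(as_signed(0xFF, 8))          # same bits as unsigned 255
print(as_signed(0x7F, 8))          # sign bit clear: value unchanged
print(as_signed(0x8000, 16))
print(as_signed(0x80000000, 32))
```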

Figure 3D is a depiction of one embodiment of an operation encoding (opcode) format 360, having thirty-two or more bits, and of register/memory operand addressing modes corresponding to a type of opcode format described in the "Intel 64 and IA-32 Intel Architecture Software Developer's Manual Combined Volumes 2A and 2B: Instruction Set Reference A-Z," which is available from Intel Corporation, Santa Clara, California, on the world-wide-web (www) at intel.com/products/processor/manuals/. In one embodiment, an instruction may be encoded by one or more of fields 361 and 362. Up to two operand locations per instruction may be identified, including up to two source operand identifiers 364 and 365. For one embodiment, destination operand identifier 366 is the same as source operand identifier 364, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 366 is the same as source operand identifier 365, whereas in other embodiments they are different. In one embodiment, one of the source operands identified by source operand identifiers 364 and 365 is overwritten by the result of the instruction, whereas in other embodiments identifier 364 corresponds to a source register element and identifier 365 corresponds to a destination register element. For one embodiment, operand identifiers 364 and 365 may be used to identify 32-bit or 64-bit source and destination operands.

Figure 3E is a depiction of another alternative operation encoding (opcode) format 370, having forty or more bits. Opcode format 370 corresponds to opcode format 360 and comprises an optional prefix byte 378. An instruction according to one embodiment may be encoded by one or more of fields 378, 371, and 372. Up to two operand locations per instruction may be identified by source operand identifiers 374 and 375 and by prefix byte 378. For one embodiment, prefix byte 378 may be used to identify 32-bit or 64-bit source and destination operands. For one embodiment, destination operand identifier 376 is the same as source operand identifier 374, whereas in other embodiments they are different. For an alternative embodiment, destination operand identifier 376 is the same as source operand identifier 375, whereas in other embodiments they are different. In one embodiment, an instruction operates on one or more of the operands identified by operand identifiers 374 and 375, and one or more operands identified by operand identifiers 374 and 375 are overwritten by the results of the instruction, whereas in other embodiments, operands identified by identifiers 374 and 375 are written to another data element in another register. Opcode formats 360 and 370 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD fields 363 and 373 and by optional scale-index-base and displacement bytes.

Turning next to Figure 3F, in some alternative embodiments, 64-bit (or 128-bit, or 256-bit, or 512-bit or more) single instruction multiple data (SIMD) arithmetic operations may be performed through a coprocessor data processing (CDP) instruction. Operation encoding (opcode) format 380 depicts one such CDP instruction having CDP opcode fields 382 and 389. For alternative embodiments, the type of CDP instruction operation may be encoded by one or more of fields 383, 384, 387, and 388. Up to three operand locations per instruction may be identified, including up to two source operand identifiers 385 and 390 and one destination operand identifier 386. One embodiment of the coprocessor can operate on 8-bit, 16-bit, 32-bit, and 64-bit values. For one embodiment, an instruction is performed on integer data elements. In some embodiments, an instruction may be executed conditionally, using condition field 381. For some embodiments, source data sizes may be encoded by field 383. In some embodiments, zero (Z), negative (N), carry (C), and overflow (V) detection can be done on SIMD fields. For some instructions, the type of saturation may be encoded by field 384.

Turning next to Figure 3G, depicted is another alternative operation encoding (opcode) format 397 for providing general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment, corresponding to a type of opcode format described in the "Intel Advanced Vector Extensions Programming Reference," which is available from Intel Corporation, Santa Clara, California, on the world-wide-web (www) at intel.com/products/processor/manuals/.

The original x86 instruction set provided for a 1-byte opcode with various formats of address syllable and immediate operand contained in additional bytes whose presence was known from the first "opcode" byte. Additionally, certain byte values were reserved as modifiers to the opcode (called prefixes, as they had to be placed before the instruction). When the original palette of 256 opcode bytes (including these special prefix values) was exhausted, a single byte was dedicated as an escape to a new set of 256 opcodes. As vector instructions (e.g., SIMD) were added, a need for more opcodes arose, and the "two-byte" opcode map was also insufficient, even when expanded through the use of prefixes. To this end, new instructions were added in additional maps that use two bytes plus an optional prefix as an identifier.

Additionally, in order to facilitate additional registers in 64-bit mode, an additional prefix (called "REX") may be used between the prefixes and the opcode (and any escape bytes necessary to determine the opcode). In one embodiment, the REX has 4 "payload" bits to indicate the use of additional registers in 64-bit mode. In other embodiments, it may have fewer or more than 4 bits. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:

[prefixes] [rex] escape [escape2] opcode modrm (etc.)
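The four REX "payload" bits mentioned above can be pulled out of the prefix byte with a few shifts. The bit assignment used in this sketch (a fixed high nibble of 0100 followed by W, R, X, B) is the publicly documented x86-64 REX layout and is an assumption relative to this text, which gives only the number of payload bits; the function name `decode_rex` is illustrative.

```python
def decode_rex(byte):
    """Decode a REX prefix byte (0x40..0x4F); return None for any other byte."""
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the ModRM r/m or SIB base field
    }


print(decode_rex(0x48))
print(decode_rex(0x90))
```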

Opcode format 397 corresponds to opcode format 370 and comprises optional VEX prefix bytes 391 (beginning with C4 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes. For example, the following illustrates an embodiment using two fields to encode an instruction, which may be used when a second escape code is present in the original instruction, or when extra bits in the REX field (e.g., the XB and W fields) need to be used. In the embodiment illustrated below, the legacy escape is represented by a new escape value; legacy prefixes are fully compressed as part of the "payload" bytes; legacy prefixes are reclaimed and available for future expansion; the second escape code is compressed in a "map" field, with future map or feature space available; and new features are added (e.g., increased vector length and an additional source register specifier).

An instruction according to one embodiment may be encoded by one or more of fields 391 and 392. Up to four operand locations per instruction may be identified by field 391 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, VEX prefix bytes 391 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit or 256-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 397 may be redundant with opcode format 370, whereas in other embodiments they are different. Opcode formats 370 and 397 allow register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, specified in part by MOD field 373 and by the optional SIB identifier 393, optional displacement identifier 394, and optional immediate byte 395.
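To make the "payload" idea concrete, the sketch below unpacks a three-byte VEX prefix that begins with C4 hex. The field layout it uses (R, X, B and vvvv stored inverted; a 5-bit map field; W, L, and pp) is the publicly documented VEX layout, which is an assumption relative to this text because the patent's own illustration of the fields is not reproduced here. The function name `decode_vex3` is illustrative.

```python
def decode_vex3(b0, b1, b2):
    """Decode a three-byte VEX prefix (C4 hex followed by two payload bytes)."""
    if b0 != 0xC4:
        raise ValueError("a three-byte VEX prefix begins with C4 hex")
    return {
        "R": ((b1 >> 7) & 1) ^ 1,      # stored inverted
        "X": ((b1 >> 6) & 1) ^ 1,      # stored inverted
        "B": ((b1 >> 5) & 1) ^ 1,      # stored inverted
        "map": b1 & 0x1F,              # replaces the legacy escape bytes
        "W": (b2 >> 7) & 1,
        "vvvv": (~(b2 >> 3)) & 0xF,    # extra source register, stored inverted
        "L": (b2 >> 2) & 1,            # vector length: 0 = 128-bit, 1 = 256-bit
        "pp": b2 & 0x3,                # compressed legacy prefix
    }


print(decode_vex3(0xC4, 0xE2, 0x79))
```

The compressed `pp` and `map` fields are what let a single prefix stand in for a legacy prefix byte plus one or two escape bytes.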

Turning next to Figure 3H, depicted is another alternative operation encoding (opcode) format 398 for providing general purpose GF(256) SIMD cryptographic arithmetic functionality according to another embodiment. Opcode format 398 corresponds to opcode formats 370 and 397 and comprises optional EVEX prefix bytes 396 (beginning with 62 hex in one embodiment) to replace most other commonly used legacy instruction prefix bytes and escape codes and to provide additional functionality. An instruction according to one embodiment may be encoded by one or more of fields 396 and 392. Up to four operand locations per instruction and a mask may be identified by field 396 in combination with source operand identifiers 374 and 375 and in combination with an optional scale-index-base (SIB) identifier 393, an optional displacement identifier 394, and an optional immediate byte 395. For one embodiment, EVEX prefix bytes 396 may be used to identify 32-bit or 64-bit source and destination operands and/or 128-bit, 256-bit, or 512-bit SIMD register or memory operands. For one embodiment, the functionality provided by opcode format 398 may be redundant with opcode formats 370 or 397, whereas in other embodiments they are different. Opcode format 398 allows register to register, memory to register, register by memory, register by register, register by immediate, and register to memory addressing, with masks, specified in part by MOD field 373 and by the optional SIB identifier 393, optional displacement identifier 394, and optional immediate byte 395. The general format of at least one instruction set (which corresponds generally to format 360 and/or format 370) is illustrated generically by the following:

evex1 RXBmmmmm WvvvLpp evex4 opcode modrm [sib] [disp] [imm]

For one embodiment, an instruction encoded according to the EVEX format 398 may have additional "payload" bits that may be used to provide general purpose GF(256) SIMD cryptographic arithmetic functionality with additional new features, such as a user-configurable mask register, an additional operand, or a selection from among 128-bit, 256-bit, or 512-bit vector registers, or more registers from which to select.

For example, where VEX format 397 may be used to provide general purpose GF(256) SIMD cryptographic arithmetic functionality with an implicit mask, the EVEX format 398 may be used to provide it with an explicit user-configurable mask. Additionally, where VEX format 397 may be used to provide general purpose GF(256) SIMD cryptographic arithmetic functionality on 128-bit or 256-bit vector registers, EVEX format 398 may be used to provide it on 128-bit, 256-bit, 512-bit, or larger (or smaller) vector registers.
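An explicit, user-configurable mask can be pictured as one bit per data element that decides whether the element is updated. The sketch below is a software model only. The text here does not say what happens to masked-off elements, so both behaviors shown (keeping the old destination value, or zeroing it) are assumptions borrowed from common masked-SIMD practice, and `masked_apply` is a hypothetical helper name.

```python
def masked_apply(op, src, dest, mask, zeroing=False):
    """Apply op to each byte of src where the matching mask bit is set.

    Masked-off elements keep the old dest value, or become 0 if zeroing=True.
    """
    out = []
    for i, (s, d) in enumerate(zip(src, dest)):
        if (mask >> i) & 1:
            out.append(op(s) & 0xFF)
        else:
            out.append(0 if zeroing else d)
    return out


invert = lambda x: x ^ 0xFF
print(masked_apply(invert, [1, 2, 3, 4], [9, 9, 9, 9], mask=0b0101))
print(masked_apply(invert, [1, 2, 3, 4], [9, 9, 9, 9], mask=0b0101, zeroing=True))
```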

Example instructions for providing general purpose GF(256) SIMD cryptographic arithmetic functionality are illustrated by the following examples:

It will also be appreciated that execution is provided for instructions for at least the following: (1) a SIMD affine transformation specifying a source data operand, a transformation matrix operand, and a translation vector, wherein the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (2) a SIMD binary finite-field multiplicative inverse, to compute, for each element of a source data operand, an inverse in a binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation) specifying a source data operand, a transformation matrix operand, and a translation vector, wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each of the transformed elements; (4) a modulus reduction, to compute a reduction modulo a particular modulus polynomial p selected from the polynomials in a binary finite field for which modulus reduction is provided by the instruction (or micro-instruction); and (5) a SIMD binary finite-field multiplication specifying first and second source data operands, to multiply each corresponding pair of elements of the first and second source data operands modulo an irreducible polynomial. The results of these instructions are stored in SIMD destination registers. Providing these instructions makes general purpose GF(256) and/or other alternative binary finite-field SIMD cryptographic arithmetic functionality available in hardware and/or microcode sequences, so that significant performance improvements can be supported for several important performance-critical applications without excessive or undue functional units requiring additional circuitry, area, or power.
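The operations enumerated above can be modeled in software, one byte element at a time. The following is a minimal sketch, not the patented hardware: it assumes the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) as the irreducible polynomial and the AES S-box matrix and vector as the affine operands, purely because their expected values are published in FIPS-197. All function and constant names are illustrative, and a Python list stands in for a SIMD register.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1; an example modulus, not mandated by the text


def gf256_mul(a, b, poly=AES_POLY):
    """Multiply two bytes as polynomials over GF(2), reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce as soon as the degree reaches 8
        b >>= 1
    return result


def gf256_inv(a, poly=AES_POLY):
    """Multiplicative inverse computed as a**254; 0 maps to 0 by convention."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a, poly)
    return result


def affine(x, matrix, vector):
    """Apply an 8x8 bit matrix (one byte per row) to x, then XOR the vector."""
    out = 0
    for i, row in enumerate(matrix):
        parity = bin(row & x).count("1") & 1  # GF(2) dot product of row and x
        out |= parity << i
    return out ^ vector


# AES S-box affine operands: output bit i = b[i]^b[i+4]^b[i+5]^b[i+6]^b[i+7] ^ c[i]
AES_MATRIX = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VECTOR = 0x63


def simd_mul(src1, src2, poly=AES_POLY):
    """Item (5): element-wise finite-field multiplication of two operands."""
    return [gf256_mul(a, b, poly) for a, b in zip(src1, src2)]


def simd_inv_affine(src, matrix, vector, poly=AES_POLY):
    """Item (3): multiplicative inverse followed by the affine transformation."""
    return [affine(gf256_inv(x, poly), matrix, vector) for x in src]


print([hex(v) for v in simd_mul([0x57, 0x02], [0x83, 0x80])])
print([hex(v) for v in simd_inv_affine([0x00, 0x01, 0x53], AES_MATRIX, AES_VECTOR)])
```

With these particular operands, inverse-then-affine reproduces the AES S-box, which is one reason a generic matrix, vector, and polynomial are useful: other ciphers and error-correcting codes can reuse the same operations with different parameters.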

Figure 4A is a block diagram illustrating an in-order pipeline and a register renaming stage, out-of-order issue/execution pipeline according to at least one embodiment of the invention. Figure 4B is a block diagram illustrating an in-order architecture core and register renaming logic, out-of-order issue/execution logic to be included in a processor according to at least one embodiment of the invention. The solid-lined boxes in Figure 4A illustrate the in-order pipeline, while the dashed-lined boxes illustrate the register renaming, out-of-order issue/execution pipeline. Similarly, the solid-lined boxes in Figure 4B illustrate the in-order architecture logic, while the dashed-lined boxes illustrate the register renaming logic and out-of-order issue/execution logic.

In Figure 4A, processor pipeline 400 includes a fetch stage 402, a length decode stage 404, a decode stage 406, an allocation stage 408, a renaming stage 410, a scheduling (also known as dispatch or issue) stage 412, a register read/memory read stage 414, an execute stage 416, a write-back/memory-write stage 418, an exception handling stage 422, and a commit stage 424.

In Figure 4B, arrows denote a coupling between two or more units, and the direction of an arrow indicates the direction of data flow between those units. Figure 4B shows processor core 490 including a front end unit 430 coupled to an execution engine unit 450, both of which are coupled to a memory unit 470.

The core 490 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 490 may be a special-purpose core, such as a network or communication core, a compression engine, a graphics core, or the like.

The front end unit 430 includes a branch prediction unit 432 coupled to an instruction cache unit 434, which is coupled to an instruction translation lookaside buffer (TLB) 436, which is coupled to an instruction fetch unit 438, which is coupled to a decode unit 440. The decode unit or decoder may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals that are decoded from, or otherwise reflect, or are derived from, the original instructions. The decoder may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), and the like. The instruction cache unit 434 is further coupled to a level 2 (L2) cache unit 476 in the memory unit 470. The decode unit 440 is coupled to a rename/allocator unit 452 in the execution engine unit 450.

The execution engine unit 450 includes the rename/allocator unit 452 coupled to a retirement unit 454 and a set of one or more scheduler units 456. The scheduler unit 456 represents any number of different schedulers, including reservation stations, a central instruction window, and the like. The scheduler unit 456 is coupled to the physical register file unit 458. Each physical register file unit 458 represents one or more physical register files, different ones of which store one or more different data types (such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, etc.), status (such as an instruction pointer that is the address of the next instruction to be executed), and so on. The physical register file unit 458 is overlapped by the retirement unit 454 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using a register map and a pool of registers; etc.). Generally, the architectural registers are visible from outside the processor or from a programmer's perspective. The registers are not limited to any known particular type of circuit. Various different types of registers are suitable as long as they are capable of storing and providing data as described herein. Examples of suitable registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations of dedicated and dynamically allocated physical registers. The retirement unit 454 and the physical register file unit 458 are coupled to an execution cluster 460. The execution cluster 460 includes a set of one or more execution units 462 and a set of one or more memory access units 464. The execution units 462 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 456, physical register file unit 458, and execution cluster 460 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file unit, and/or execution cluster; and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access units 464). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
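As an illustration of the "register map and pool of registers" option mentioned above, the following is a minimal sketch only. The class name `RenameTable`, its methods, and the register names are hypothetical and do not come from the patent; a real renamer would also handle branch recovery and stall conditions.

```python
from collections import deque


class RenameTable:
    """Toy renamer (hypothetical): a register map plus a pool of free physical registers."""

    def __init__(self, arch_regs, num_phys):
        if num_phys < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        # Initially each architectural register maps to its own physical register.
        self.reg_map = {name: i for i, name in enumerate(arch_regs)}
        # The remaining physical registers form the free pool.
        self.free_pool = deque(range(len(arch_regs), num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; return (new_dest, renamed_srcs, previous_dest)."""
        # Sources read the current mapping, so they see the most recent producer.
        renamed_srcs = [self.reg_map[s] for s in srcs]
        if not self.free_pool:
            raise RuntimeError("no free physical register; a real core would stall here")
        new_dest = self.free_pool.popleft()
        previous_dest = self.reg_map[dest]
        self.reg_map[dest] = new_dest
        return new_dest, renamed_srcs, previous_dest

    def retire(self, previous_dest):
        """At retirement, the superseded physical register returns to the pool."""
        self.free_pool.append(previous_dest)


table = RenameTable(["r0", "r1", "r2"], num_phys=6)
# r0 = r1 + r2, then r0 = r0 + r1: the second write gets a fresh physical register.
first = table.rename("r0", ["r1", "r2"])
second = table.rename("r0", ["r0", "r1"])
print(first)
print(second)
```

Because each write to `r0` receives a different physical register, the two instructions no longer conflict on the destination name, which is what allows an out-of-order scheduler to treat them by their true data dependences.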

The set of memory access units 464 is coupled to the memory unit 470, which includes a data TLB unit 472 coupled to a data cache unit 474, which is coupled to a level 2 (L2) cache unit 476. In one exemplary embodiment, the memory access units 464 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 472 in the memory unit 470. The L2 cache unit 476 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 400 as follows: 1) the instruction fetch unit 438 performs the fetch and length decode stages 402 and 404; 2) the decode unit 440 performs the decode stage 406; 3) the rename/allocator unit 452 performs the allocation stage 408 and the renaming stage 410; 4) the scheduler unit 456 performs the schedule stage 412; 5) the physical register file unit 458 and the memory unit 470 perform the register read/memory read stage 414, and the execution cluster 460 performs the execute stage 416; 6) the memory unit 470 and the physical register file unit 458 perform the write back/memory write stage 418; 7) various units may be involved in the exception handling stage 422; and 8) the retirement unit 454 and the physical register file unit 458 perform the commit stage 424.
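The stage-to-unit assignment listed above can be restated as a small lookup table. This is only an illustrative sketch: the names `PIPELINE_400` and `stages_for` are hypothetical, while the stage and unit reference numerals are taken from the description.

```python
# Each entry pairs a stage of pipeline 400 with the unit(s) said to perform it.
PIPELINE_400 = [
    ("fetch 402", ["instruction fetch unit 438"]),
    ("length decode 404", ["instruction fetch unit 438"]),
    ("decode 406", ["decode unit 440"]),
    ("allocation 408", ["rename/allocator unit 452"]),
    ("renaming 410", ["rename/allocator unit 452"]),
    ("schedule 412", ["scheduler unit 456"]),
    ("register read/memory read 414", ["physical register file unit 458", "memory unit 470"]),
    ("execute 416", ["execution cluster 460"]),
    ("write back/memory write 418", ["memory unit 470", "physical register file unit 458"]),
    ("exception handling 422", ["various units"]),
    ("commit 424", ["retirement unit 454", "physical register file unit 458"]),
]


def stages_for(unit):
    """Return the pipeline stages, in order, in which the given unit takes part."""
    return [stage for stage, units in PIPELINE_400 if unit in units]


print(stages_for("physical register file unit 458"))
```

Querying the table this way makes it easy to see that the physical register file unit participates in the register read, write back, and commit stages.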

The core 490 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California).

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, such as in Hyper-Threading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 434/474 and a shared L2 cache unit 476, alternative embodiments may have a single internal cache for both instructions and data, such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

FIG. 5 is a block diagram of a single-core processor and a multi-core processor 500 with an integrated memory controller and graphics according to embodiments of the invention. The solid-line boxes in FIG. 5 illustrate a processor 500 with a single core 502A, a system agent 510, and a set of one or more bus controller units 516, while the optional addition of the dashed-line boxes illustrates an alternative processor 500 with multiple cores 502A-N, a set of one or more integrated memory controller units 514 in the system agent unit 510, and integrated graphics logic 508.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 506, and external memory (not shown) coupled to the set of integrated memory controller units 514. The set of shared cache units 506 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 512 interconnects the integrated graphics logic 508, the set of shared cache units 506, and the system agent unit 510, alternative embodiments may use any number of well-known techniques for interconnecting such units.

In some embodiments, one or more of the cores 502A-N are capable of multithreading. The system agent 510 includes those components that coordinate and operate the cores 502A-N. The system agent unit 510 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or may include the logic and components needed for regulating the power state of the cores 502A-N and the integrated graphics logic 508. The display unit is for driving one or more externally connected displays.

The cores 502A-N may be homogeneous or heterogeneous in terms of architecture and/or instruction set. For example, some of the cores 502A-N may be in-order while others are out-of-order. As another example, two or more of the cores 502A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

The processor may be a general-purpose processor, such as a Core™ i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™, or StrongARM™ processor, all of which are available from Intel Corporation of Santa Clara, California. Alternatively, the processor may be from another company, such as ARM Holdings, MIPS, etc. The processor may be a special-purpose processor, e.g., a network or communication processor, a compression engine, a graphics processor, a co-processor, an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 500 may be a part of one or more substrates and/or may be implemented on one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS.

FIGS. 6-8 are exemplary systems suitable for including the processor 500, while FIG. 9 is an exemplary system on a chip (SoC) that may include one or more of the cores 502. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to FIG. 6, shown is a block diagram of a system 600 in accordance with one embodiment of the present invention. The system 600 may include one or more processors 610 and 615 coupled to a graphics memory controller hub (GMCH) 620. The optional nature of the additional processor 615 is denoted in FIG. 6 with dashed lines.

Each processor 610, 615 may be some version of the processor 500. However, it should be noted that integrated graphics logic and integrated memory control units are unlikely to exist in the processors 610 and 615. FIG. 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for example, a dynamic random access memory (DRAM). For at least one embodiment, the DRAM may be associated with a non-volatile cache.

The GMCH 620 may be a chipset or a portion of a chipset. The GMCH 620 may communicate with the processors 610 and 615 and control interaction between the processors 610, 615 and the memory 640. The GMCH 620 may also act as an accelerated bus interface between the processors 610, 615 and other elements of the system 600. For at least one embodiment, the GMCH 620 communicates with the processors 610 and 615 via a multi-drop bus, such as a frontside bus (FSB) 695.

Furthermore, the GMCH 620 is coupled to a display 645 (such as a flat panel display). The GMCH 620 may include an integrated graphics accelerator. The GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to the system 600. Shown by way of example in the embodiment of FIG. 6 is an external graphics device 660, which may be a discrete graphics device coupled to the ICH 650, along with another peripheral device 670.

Alternatively, additional or different processors may also be present in the system 600. For example, the additional processor 615 may include an additional processor that is the same as the processor 610, an additional processor that is heterogeneous or asymmetric to the processor 610, an accelerator (e.g., a graphics accelerator or a digital signal processing (DSP) unit), a field programmable gate array, or any other processor. There can be a variety of differences between the physical resources 610 and 615 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, and power consumption characteristics. These differences may effectively manifest themselves as asymmetry and heterogeneity among the processors 610 and 615. For at least one embodiment, the various processors 610 and 615 may reside in the same die package.

Referring now to FIG. 7, shown is a block diagram of a second system 700 in accordance with an embodiment of the present invention. As shown in FIG. 7, the multiprocessor system 700 is a point-to-point interconnect system and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. Each of the processors 770 and 780 may be some version of the processor 500 (e.g., one or more of the processors 610, 615).

While shown with only two processors 770 and 780, it should be understood that the scope of the present invention is not so limited. In other embodiments, one or more additional processors may be present in a given processor.

The processors 770 and 780 are shown including integrated memory controller (IMC) units 772 and 782, respectively. The processor 770 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 776 and 778; similarly, the second processor 780 includes P-P interfaces 786 and 788. The processors 770 and 780 may exchange information via the P-P interface 750 using P-P interface circuits 778 and 788. As shown in FIG. 7, the IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.

The processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752 and 754 using point-to-point interface circuits 776, 794, 786, and 798. The chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor, or outside of both processors yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

The chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, the first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 7, various I/O devices 714 may be coupled to the first bus 716, along with a bus bridge 718 that couples the first bus 716 to a second bus 720. In one embodiment, the second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 720 including, in one embodiment, a keyboard and/or mouse 722, communication devices 727, and a storage unit 728 such as a disk drive or other mass storage device that may include instructions/code and data 730. Further, an audio I/O 724 may be coupled to the second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 7, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 8, shown is a block diagram of a third system 800 in accordance with an embodiment of the present invention. Like elements in FIGS. 7 and 8 bear like reference numerals, and certain aspects of FIG. 7 have been omitted from FIG. 8 in order to avoid obscuring other aspects of FIG. 8.

FIG. 8 illustrates that the processors 870 and 880 may include integrated memory and I/O control logic ("CL") 872 and 882, respectively. For at least one embodiment, the CL 872 and 882 may include integrated memory controller units such as those described above in connection with FIGS. 5 and 7. In addition, the CL 872, 882 may also include I/O control logic. FIG. 8 illustrates that not only are the memories 832 and 834 coupled to the CL 872 and 882, but also that I/O devices 814 are coupled to the control logic 872 and 882. Legacy I/O devices 815 are coupled to the chipset 890.

Referring now to FIG. 9, shown is a block diagram of an SoC 900 in accordance with an embodiment of the present invention. Elements similar to those in FIG. 5 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In FIG. 9, an interconnect unit 902 is coupled to: an application processor 910, which includes a set of one or more cores 502A-N and the shared cache unit 506; a system agent unit 510; a bus controller unit 516; an integrated memory controller unit 514; a set of one or more media processors 920, which may include integrated graphics logic 508, an image processor 924 for providing still and/or video camera functionality, an audio processor 926 for providing hardware audio acceleration, and a video processor 928 for providing video encode/decode acceleration; a static random access memory (SRAM) unit 930; a direct memory access (DMA) unit 932; and a display unit 940 for coupling to one or more external displays.

FIG. 10 illustrates a processor containing a central processing unit (CPU) and a graphics processing unit (GPU), which may perform at least one instruction according to one embodiment. In one embodiment, an instruction to perform operations according to at least one embodiment could be performed by the CPU. In another embodiment, the instruction could be performed by the GPU. In still another embodiment, the instruction may be performed through a combination of operations performed by the GPU and the CPU. For example, in one embodiment, an instruction in accordance with one embodiment may be received and decoded for execution on the GPU. However, one or more operations within the decoded instruction may be performed by the CPU and the result returned to the GPU for final retirement of the instruction. Conversely, in some embodiments, the CPU may act as the primary processor and the GPU as the co-processor.

In some embodiments, instructions that benefit from highly parallel throughput processors may be performed by the GPU, while instructions that benefit from the performance of processors with deeply pipelined architectures may be performed by the CPU. For example, graphics, scientific applications, financial applications, and other parallel workloads may benefit from the performance of the GPU and be executed accordingly, whereas more sequential applications (e.g., operating system kernel or application code) may be better suited for the CPU.

In FIG. 10, the processor 1000 includes a CPU 1005, a GPU 1010, an image processor 1015, a video processor 1020, a USB controller 1025, a UART controller 1030, an SPI/SDIO controller 1035, a display device 1040, a High-Definition Multimedia Interface (HDMI) controller 1045, a MIPI controller 1050, a flash memory controller 1055, a double data rate (DDR) controller 1060, a security engine 1065, and an I²S/I²C (Integrated Interchip Sound/Inter-Integrated Circuit) interface 1070. Other logic and circuits, including more CPUs or GPUs and other peripheral interface controllers, may be included in the processor of FIG. 10.

One or more aspects of at least one embodiment may be implemented by representative data stored on a machine-readable medium that represents various logic within the processor and that, when read by a machine, causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible, machine-readable medium ("tape") and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor. For example, IP cores, such as the Cortex™ family of processors developed by ARM Holdings and the Loongson IP cores developed by the Institute of Computing Technology (ICT) of the Chinese Academy of Sciences, may be licensed or sold to various customers or licensees, such as Texas Instruments, Qualcomm, Apple, or Samsung, and implemented in processors produced by these customers or licensees.

FIG. 11 shows a block diagram illustrating the development of IP cores according to one embodiment. A storage device 1130 includes simulation software 1120 and/or a hardware or software model 1110. In one embodiment, the data representing the IP core design can be provided to the storage device 1130 via a memory 1140 (e.g., a hard disk), a wired connection (e.g., the Internet) 1150, or a wireless connection 1160. The IP core information generated by the simulation tool and model can then be transmitted to a fabrication facility, where it can be fabricated by a third party to perform at least one instruction in accordance with at least one embodiment.

In some embodiments, one or more instructions may correspond to a first type or architecture (e.g., x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). An instruction, according to one embodiment, may therefore be performed on any processor or processor type, including ARM, x86, MIPS, a GPU, or another processor type or architecture.

FIG. 12 illustrates how an instruction of a first type is emulated by a processor of a different type, according to one embodiment. In FIG. 12, a program 1205 contains some instructions that may perform the same or substantially the same function as an instruction according to one embodiment. However, the instructions of the program 1205 may be of a type and/or format that is different from or incompatible with the processor 1215, meaning that instructions of the type in the program 1205 cannot be executed natively by the processor 1215. With the help of emulation logic 1210, however, the instructions of the program 1205 are translated into instructions that can be natively executed by the processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment, the emulation logic is embodied in a tangible, machine-readable medium containing software to translate instructions of the type in the program 1205 into the type natively executable by the processor 1215. In other embodiments, the emulation logic is a combination of fixed-function or programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment, the processor contains the emulation logic, whereas in other embodiments the emulation logic exists outside of the processor and is provided by a third party. In one embodiment, the processor is capable of loading the emulation logic embodied in a tangible, machine-readable medium containing software by executing microcode or firmware contained in or associated with the processor.

Figure 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Figure 13 shows that a program in a high-level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306 that may be natively executed by a processor 1316 with at least one x86 instruction set core. The processor 1316 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1304 represents a compiler operable to generate x86 binary code 1306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1316 with at least one x86 instruction set core. Similarly, Figure 13 shows that the program in the high-level language 1302 may be compiled using an alternative instruction set compiler 1308 to generate alternative instruction set binary code 1310 that may be natively executed by a processor 1314 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1312 is used to convert the x86 binary code 1306 into code that may be natively executed by the processor 1314 without an x86 instruction set core. This converted code is not likely to be the same as the alternative instruction set binary code 1310, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1306.

Figure 14 illustrates a flow diagram for one embodiment of a process 1401 for efficiently implementing the Advanced Encryption Standard (AES) encryption/decryption standard. Process 1401 and other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by general-purpose machines, by special-purpose machines, or by a combination of both. In one embodiment, the composite field GF((2⁴)²) may be used with the irreducible polynomials x⁴+x²+x+1 and x²+2x+0xE for the AES inverse column-mix transformation.

In processing block 1411, a 128-bit input block comprising 16 byte values is logically exclusive-ORed (XORed) with a round key. In processing block 1412, it is determined whether the process is encrypting, in which case processing continues from point 1418, or decrypting, in which case processing resumes in processing block 1413.

In processing block 1413, a field conversion circuit is used to convert each of the 16 byte values, respectively, from its corresponding polynomial representation in GF(256) to another corresponding polynomial representation in the composite field GF((2⁴)²). For one embodiment of processing block 1413, a polynomial representation [a₇,a₆,a₅,a₄,a₃,a₂,a₁,a₀] in GF(256) can be converted to a corresponding polynomial representation [b₇,b₆,b₅,b₄,b₃,b₂,b₁,b₀] in the composite field GF((2⁴)²) by multiplying each byte value by an 8-bit by 8-bit conversion matrix, which may be implemented by the following series of XORs:

b₁ = a₇,
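As a rough sketch of how such a field conversion reduces to XORs, the following Python fragment multiplies a byte by an 8-bit by 8-bit matrix over GF(2). The function name `apply_bit_matrix` and the matrix `EXAMPLE_ROWS` are hypothetical and introduced only for illustration; apart from the row for b₁, the example rows are arbitrary and are not the conversion matrix of processing block 1413.

```python
def apply_bit_matrix(rows, a):
    """Multiply an 8x8 matrix over GF(2) by the bit vector of byte `a`.

    rows[i] is an 8-bit mask: the input bits selected by the mask are
    XORed together to form output bit i.
    """
    b = 0
    for i, mask in enumerate(rows):
        # AND selects the participating input bits; their parity is the XOR.
        parity = bin(mask & a).count("1") & 1
        b |= parity << i
    return b


# Identity matrix: output bit i is input bit i.
IDENTITY_ROWS = [1 << i for i in range(8)]

# Hypothetical example matrix (assumption, NOT the patent's matrix).
# Only row 1 follows the text: b1 = a7, i.e. mask 0x80.
EXAMPLE_ROWS = [0x01, 0x80, 0x06, 0x08, 0x30, 0x20, 0x40, 0x82]
```

Because every output bit is a fixed XOR of input bits, the mapping is linear over GF(2): converting `x ^ y` gives the same result as converting `x` and `y` separately and XORing the results.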

At this point, the 16 bytes may be viewed as a 4×4 block of bytes having four rows and four columns. In processing block 1414, it is determined whether the present round is the last/special round, in which case no inverse column mix is performed; otherwise, in processing block 1415, an inverse-column-mix circuit is used to compute an inverse column-mix transformation in GF((2⁴)²) of the 16 byte values to obtain corresponding transformed polynomial representations in GF((2⁴)²). For one embodiment, the inverse column-mix transformation in GF((2⁴)²) of the 16-byte input value can be performed as follows:

It will be appreciated that such a matrix multiplication can be performed in GF((2⁴)²) on [a₃,a₂,a₁,a₀,b₃,b₂,b₁,b₀] by computing, in a first stage, the unique terms needed to perform the multiplications by the matrix constants in the expression for each result, and then summing those unique terms to generate each result. For example, the unique terms from the nibble [a₃,a₂,a₁,a₀] needed to compute the above matrix multiplication are:

The unique terms from the nibble [b₃,b₂,b₁,b₀] needed to compute the above matrix multiplication are:

In either case determined in processing block 1414, in processing block 1416 a hard-wired permutation of the rows is performed on the 16 byte values, corresponding to an inverse row-mix transformation. In processing block 1417, a second field conversion circuit is used to convert each corresponding transformed polynomial representation in GF((2⁴)²), and also to apply an affine transformation, to generate, respectively, a third corresponding polynomial representation in a finite field other than GF((2⁴)²). In one embodiment of process 1401, that new finite field other than GF((2⁴)²) is the composite field GF((2²)⁴). This embodiment is described in greater detail below with reference to Figure 2. In an alternative embodiment of process 1401, the new finite field is the original field GF(256). These embodiments are described in greater detail below with reference to Figures 3a and 3b.

Continuing from point 1418, a multiplicative inverse circuit is used in processing block 1420 to compute, for each of the third corresponding polynomial representations of the 16 byte values, respectively, a corresponding multiplicative inverse polynomial representation in that new finite field other than GF((2⁴)²). In processing block 1421, it is determined whether the process is decrypting, in which case the round processing is finished and the result is output in processing block 1426, or encrypting, in which case processing resumes in processing block 1422.

In processing block 1422, a circuit is used to apply an affine transformation to each corresponding multiplicative inverse polynomial representation of the 16 byte values to generate, respectively, a transformed corresponding polynomial representation in that new finite field other than GF((2⁴)²). If that new finite field is not the original field GF(256), then in block 1422 another field conversion can be combined with the circuit to convert each corresponding transformed polynomial representation back to the original field GF(256). Therefore, the polynomial representations for the remainder of process 1401 may be assumed to be in the original field GF(256).

In processing block 1423, a hard-wired permutation of the rows is performed on the 16 byte values, corresponding to a forward row-mix transformation. In processing block 1424, it is determined whether the present round is the last/special round, in which case no column mix is performed; otherwise, in processing block 1425, a forward column-mix circuit is used to compute a forward column-mix transformation in GF(256) of the 16 byte values to obtain corresponding transformed polynomial representations in GF(256). It will be appreciated that, because the coefficients in the forward column-mix transformation in GF(256) are relatively small, no alternative field representation is used in processing block 1425. Finally, the round processing of process 1401 is finished and the 16-byte result is output in processing block 1426.
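To illustrate why the small coefficients make a direct GF(256) computation practical, the following Python sketch computes the forward column mix of a single 4-byte column. It assumes the standard AES coefficients (2, 3, 1, 1) and the standard AES reduction polynomial x⁸+x⁴+x³+x+1, neither of which is spelled out in the passage above; the names `xtime` and `mix_column` are illustrative only.

```python
def xtime(a):
    """Multiply a GF(256) element by x (i.e., by 2) modulo x^8+x^4+x^3+x+1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B  # reduce by the AES polynomial (assumed)
    return a


def mix_column(col):
    """Forward column mix of one 4-byte column.

    Output byte r is 2*c[r] ^ 3*c[r+1] ^ c[r+2] ^ c[r+3] (indices mod 4),
    where 3*v is computed as xtime(v) ^ v.
    """
    out = []
    for r in range(4):
        a0, a1, a2, a3 = (col[(r + k) % 4] for k in range(4))
        out.append(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3)
    return out
```

Multiplication by 2 is one shift and a conditional XOR, and multiplication by 3 adds one more XOR, so no conversion to a composite field is needed to keep the circuit small.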

Figure 15 illustrates a flow diagram for one embodiment of a process 1501 for efficiently implementing the multiplicative inversion of the AES S-box. In one embodiment illustrated below, the composite field GF((2²)⁴) may be used with the irreducible polynomial x⁴+x³+x²+2 for the S-box transformation.

Continuing from point 1418 of process 1401, at processing block 1518 it is determined whether the process is encrypting, in which case processing continues in processing block 1519. Otherwise, if the process is decrypting, a field conversion has already been performed in processing block 1417, and the third corresponding polynomial representations of the 16 byte values are in the composite field GF((2²)⁴). For one embodiment of processing block 1417, an inverse affine transformation can be applied, and a polynomial representation [a₇,a₆,a₅,a₄,a₃,a₂,a₁,a₀] in GF((2⁴)²) can be converted to a corresponding polynomial representation [b₇,b₆,b₅,b₄,b₃,b₂,b₁,b₀] in the composite field GF((2²)⁴), by multiplying each byte value by an 8-bit by 8-bit conversion matrix and XORing with some constants (i.e., bitwise inversions), which may be implemented by the following series of XORs:

In processing block 1519, a field conversion is needed for the encryption process, so a field conversion circuit is used to convert each of the 16 byte values, respectively, from its corresponding polynomial representation in GF(256) to a corresponding polynomial representation in the composite field GF((2²)⁴). For one embodiment of processing block 1519, a polynomial representation [a₇,a₆,a₅,a₄,a₃,a₂,a₁,a₀] in GF(256) can be converted to a corresponding polynomial representation [b₇,b₆,b₅,b₄,b₃,b₂,b₁,b₀] in the composite field GF((2²)⁴) by multiplying each byte value by an 8-bit by 8-bit conversion matrix, which may be implemented by the following series of XORs:

In processing block 1520, an inversion circuit is used to compute, for each of the polynomial representations in GF((2²)⁴) of the 16 byte values, respectively, a multiplicative inverse polynomial representation in GF((2²)⁴). For one embodiment, an input [a,b,c,d] corresponding to a polynomial representation in the composite field GF((2²)⁴) and its multiplicative inverse [A,B,C,D] are related as follows:

where '⊕' and '·' denote GF(2²) addition and multiplication, respectively.

The solution is: A = Δ⁻¹·Δ_a, B = Δ⁻¹·Δ_b, C = Δ⁻¹·Δ_c, D = Δ⁻¹·Δ_d, where the determinant Δ is given by:

and the determinants Δ_a, Δ_b, Δ_c and Δ_d are obtained from Δ by replacing the first, second, third and fourth columns of Δ, respectively, with {0,0,0,1}. Again it will be appreciated that such computations may be carried out in GF(2²) by expanding the determinant calculations, computing in hardware the unique terms such as a², b², a³, 3·b², etc. and the unique sums of terms required, and then summing the particular term combinations to generate the necessary results.

In processing block 1521, it is determined whether the process is decrypting, in which case processing continues in processing block 1522. In processing block 1522, another field conversion circuit is used to convert each of the 16 byte values, respectively, from its corresponding polynomial representation in the composite field GF((2²)⁴) to a corresponding polynomial representation in GF(256). For one embodiment of processing block 1522, a polynomial representation [a₇,a₆,a₅,a₄,a₃,a₂,a₁,a₀] in the composite field GF((2²)⁴) can be converted to a corresponding polynomial representation [b₇,b₆,b₅,b₄,b₃,b₂,b₁,b₀] in GF(256) by multiplying each byte value by an 8-bit by 8-bit conversion matrix, which may be implemented by the following series of XORs:

Otherwise, if the process is encrypting, processing proceeds to processing block 1421 of process 1401. As explained with regard to processing block 1422 of process 1401, the circuit used in processing block 1422 to apply an affine transformation to the 16 bytes can be combined with the field conversion circuit of this embodiment to convert the 16 byte values from their polynomial representations in GF((2²)⁴) to corresponding polynomial representations in GF(256). For one embodiment of processing block 1422, an affine transformation can be applied, and a polynomial representation [a₇,a₆,a₅,a₄,a₃,a₂,a₁,a₀] in the composite field GF((2²)⁴) can be converted to a corresponding polynomial representation [b₇,b₆,b₅,b₄,b₃,b₂,b₁,b₀] in GF(256), by multiplying each byte value by an 8-bit by 8-bit conversion matrix and XORing with some constants (i.e., bitwise inversions), which may be implemented by the following series of XORs:

Figure 16A illustrates a diagram for one embodiment of an apparatus 1601 for execution of an affine map instruction, which performs an affine transformation to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1601 may be duplicated 16 times, each apparatus 1601 comprising hardware processing blocks for efficiently implementing an affine transformation on a 128-bit block comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of an affine map instruction (or micro-instruction), an element size may also be specified, and/or the number of duplications of apparatus 1601 may be selected to implement an affine transformation on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1601 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an affine map instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1601 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an affine transformation in GF(256). In some embodiments, an affine map instruction may be implemented by micro-instructions (also called micro-operations, micro-ops, or uops), for example a finite-field matrix-vector multiplication micro-instruction followed by a finite-field vector addition (XOR) micro-instruction.

For example, embodiments of apparatus 1601 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the affine map instruction to provide general-purpose GF(256) SIMD affine transformation functionality specify a source data operand set of elements 1612, a transformation matrix 1610 operand, and a translation vector 1614 operand. Responsive to the decoded affine map instruction, one or more execution units (e.g., execution unit(s) 462) perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each element 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through the eight bitwise ANDs 1627-1620 of the GF(256) byte-multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed element of the source data operand set through the eight 9-input XORs 1637-1630 of the GF(256) bit-adder array in processing block 1603. An affine-transformed result element 1618 for each element 1612 of the source data operand set of the affine map instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).
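The AND/XOR structure described above can be modeled in software roughly as follows. This is a minimal Python sketch, not the apparatus itself: the function names are hypothetical, the matrix is represented as eight row masks (an assumed encoding), and the AES S-box affine matrix and the constant 0x63 are used only as a familiar example of a transformation matrix and translation vector.

```python
def affine_map(x, matrix_rows, vector):
    """Affine transform of one byte over GF(2): y = M*x XOR v.

    matrix_rows[i] is the 8-bit row of M that produces output bit i.
    """
    y = 0
    for i in range(8):
        # Bitwise AND selects the input bits that feed output bit i.
        selected = matrix_rows[i] & x
        # 9-input XOR: the eight selected bits plus bit i of the vector.
        bit = (bin(selected).count("1") + ((vector >> i) & 1)) & 1
        y |= bit << i
    return y


def affine_map_simd(block, matrix_rows, vector):
    """Apply the same matrix and vector to every byte element of a block."""
    return bytes(affine_map(x, matrix_rows, vector) for x in block)


# Example operands (assumption): the AES S-box affine transformation, where
# output bit i = b[i] ^ b[i+4] ^ b[i+5] ^ b[i+6] ^ b[i+7] (indices mod 8).
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VECTOR = 0x63
```

Calling `affine_map_simd(block, AES_ROWS, AES_VECTOR)` on a 16-byte block corresponds to one 128-bit SIMD operation in which all 16 elements share the same matrix and vector operands.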

Figure 16B illustrates a diagram for one embodiment of an apparatus 1605 for execution of an affine inverse instruction, which performs an affine transformation and then computes the multiplicative inverse of the result to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1605 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an affine inverse instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1605 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an affine transformation and inversion in GF(256). In some embodiments, an affine inverse instruction may be implemented by micro-instructions (also called micro-operations, micro-ops, or uops), for example an affine map 1601 micro-instruction followed by a finite-field multiplicative inverse micro-instruction 1604. In alternative embodiments, an affine inverse instruction may be implemented by different micro-instructions, for example a finite-field matrix-vector multiplication micro-instruction followed by a byte broadcast micro-instruction, a finite-field vector addition (XOR) micro-instruction, and a finite-field multiplicative inverse micro-instruction.

Embodiments of apparatus 1605 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the affine inverse instruction, which provides general-purpose GF(256) SIMD affine transformation functionality followed by computation of the multiplicative inverse of the result, specify a source data operand set of elements 1612, a transformation matrix 1610 operand, a translation vector 1614 operand, and an optional monic irreducible polynomial. Responsive to the decoded affine inverse instruction, one or more execution units (e.g., execution unit(s) 462) perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each element 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through the eight bitwise ANDs 1627-1620 of the GF(256) byte-multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed element of the source data operand set through the eight 9-input XORs 1637-1630 of the GF(256) bit-adder array in processing block 1603. It will be appreciated that this point in the computation may correspond to point 1418 in process 1403. A finite-field multiplicative inverse element 1648, modulo the irreducible polynomial, may then be computed by the multiplicative inverse unit 1640 from the affine-transformed result element 1618 for each element 1612 of the source data operand set. The multiplicative inverse result element 1648 for each affine-transformed result element 1618 of the affine inverse instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).

It will be appreciated that some embodiments of the affine inverse instruction may be useful for carrying out a process such as process 1403. Other embodiments may be useful for carrying out a process such as process 1402.

Figure 16C illustrates a diagram for an alternative embodiment of an apparatus 1606 for execution of an inverse affine instruction, which computes a multiplicative inverse and then applies an affine transformation to the result to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1606 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of an inverse affine instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1606 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for an inversion and affine transformation in GF(256). In some embodiments, an inverse affine instruction may be implemented by micro-instructions (also called micro-operations, micro-ops, or uops), for example a finite-field multiplicative inverse micro-instruction 1604 followed by an affine map 1601 micro-instruction. In alternative embodiments, an inverse affine instruction may be implemented by different micro-instructions, for example a finite-field multiplicative inverse micro-instruction followed by a finite-field matrix-vector multiplication micro-instruction and a finite-field vector scalar translation (e.g., broadcast and XOR) micro-instruction.

Embodiments of apparatus 1606 may be coupled with SIMD vector registers (e.g., physical register file unit(s) 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the inverse affine instruction, which provides general-purpose GF(256) SIMD computation of a multiplicative inverse followed by affine transformation functionality, specify a source data operand set of elements 1612, a transformation matrix 1610 operand, a translation vector 1614 operand, and an optional monic irreducible polynomial. In processing block 1604, responsive to the decoded inverse affine instruction, one or more execution units (e.g., execution unit(s) 462) compute, via the multiplicative inverse unit 1640, a SIMD binary finite-field multiplicative inverse element 1616, modulo the irreducible polynomial, for each element 1612 of the source data operand set. The one or more execution units then perform a SIMD affine transformation by applying the transformation matrix 1610 operand to each multiplicative inverse element 1616 of the elements 1612 of the source data operand set (e.g., in a 128-bit block of 16 byte elements) through the eight bitwise ANDs 1627-1620 of the GF(256) byte-multiplier array in processing block 1602, and by applying the translation vector 1614 operand to each transformed inverse element of the source data operand set through the eight 9-input XORs 1637-1630 of the GF(256) bit-adder array in processing block 1603. An affine-transformed result element 1638 for each multiplicative inverse element 1616 of the elements 1612 of the source data operand set of the inverse affine instruction is stored in a SIMD destination register (e.g., in physical register file unit(s) 458).
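As an illustration of the inversion-then-affine ordering, the Python sketch below composes a GF(256) multiplicative inverse with an affine transformation for every byte of a block. With the AES polynomial 0x11B, the AES affine matrix, and the constant 0x63 as operands (all assumptions chosen for the example), the result matches the AES S-box. The function names and the row-mask encoding of the matrix are hypothetical and do not reflect the instruction's actual operand format.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(256) elements modulo the given irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv(x, poly=0x11B):
    """Multiplicative inverse computed as x^254; maps 0 to 0."""
    result, base, e = 1, x, 254
    while e:
        if e & 1:
            result = gf_mul(result, base, poly)
        base = gf_mul(base, base, poly)
        e >>= 1
    return result


def affine(x, rows, vector):
    """y = M*x XOR v over GF(2); rows[i] produces output bit i."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y ^ vector


def inverse_affine_simd(block, rows, vector, poly=0x11B):
    """Invert each byte element, then apply the affine transformation."""
    return bytes(affine(gf_inv(x, poly), rows, vector) for x in block)


# Example operands (assumption): AES S-box affine matrix and constant.
AES_ROWS = [sum(1 << ((i + k) % 8) for k in (0, 4, 5, 6, 7)) for i in range(8)]
AES_VECTOR = 0x63
```

Swapping the two steps inside `inverse_affine_simd` (affine first, inversion second) would model the affine inverse ordering of apparatus 1605 instead.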

Figure 17A illustrates a diagram for one embodiment of an apparatus 1701 for execution of a finite-field multiplicative inverse instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. In some embodiments, apparatus 1701 may be duplicated 16 times, each apparatus 1701 comprising hardware processing blocks for efficiently implementing the multiplicative inversion of an AES S-box on a 128-bit block comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of a finite-field multiplicative inverse instruction (or micro-instruction), an element size may also be specified, and/or the number of duplications of apparatus 1701 may be selected to implement a finite-field multiplicative inversion on a 128-bit block, a 256-bit block, a 512-bit block, etc. Embodiments of apparatus 1701 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit(s) 462) for execution of a finite-field multiplicative inverse instruction to provide general-purpose GF(256) SIMD cryptographic arithmetic functionality. Embodiments of apparatus 1701 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for a multiplicative inversion in GF(256). In apparatus 1701, each byte x is taken to be the output at point 1418 of process 1401, so apparatus 1701 begins by accessing the source data operand set containing x. Processing blocks 1711-1717 comprise a polynomial-powers byte-slice generating circuit to compute, for each of the 16 byte values, respectively, byte values having polynomial representations in GF(256) corresponding to the powers x², x⁴, x⁸, x¹⁶, x³², x⁶⁴ and x¹²⁸ of the polynomial representation of their respective byte value x. Processing blocks 1718-1720 and 1728-1730 comprise a multiplier byte-slice circuit to multiply together in GF(256) the byte values corresponding to the powers of the polynomial representation for each of the 16 byte values, respectively, to produce 16 byte values each having a polynomial representation in GF(256) corresponding to the multiplicative inverse x⁻¹ = x²⁵⁴ of its respective byte value x. These 16 multiplicative inverse byte values are then stored (e.g., in physical register file unit(s) 458) or output to processing block 1421 of process 1401, where an affine transform circuit (e.g., 1601) is optionally used in processing block 1422 to apply an affine transformation, depending on whether process 1401 is performing encryption or decryption.
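The identity behind this arrangement is that 254 = 2 + 4 + 8 + 16 + 32 + 64 + 128, so x²⁵⁴ is the product of the seven repeated squares of x. The Python sketch below follows that decomposition with seven squarings and six multiplications. It assumes the AES reduction polynomial 0x11B as the default irreducible polynomial; the function names are illustrative and the code is a software model, not a description of the byte-slice circuits.

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply two GF(256) elements modulo the given irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r


def gf_inv_power_chain(x, poly=0x11B):
    """Compute x^-1 = x^254 as x^2 * x^4 * x^8 * x^16 * x^32 * x^64 * x^128."""
    powers = []
    p = x
    for _ in range(7):
        p = gf_mul(p, p, poly)  # successive squares: x^2, x^4, ..., x^128
        powers.append(p)
    result = powers[0]
    for p in powers[1:]:
        result = gf_mul(result, p, poly)  # six multiplications in total
    return result
```

Since x²⁵⁵ = 1 for every nonzero x in GF(256), x²⁵⁴ is the multiplicative inverse; the input 0 maps to 0, as the AES S-box requires.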

Figure 17B shows a diagram of an alternative embodiment of an apparatus 1702 for executing a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In some embodiments, apparatus 1702 may be replicated 16 times, each apparatus 1702 comprising hardware processing blocks for efficiently implementing the multiplicative inversion of the AES S-box on a 128-bit block comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of a finite field multiplicative inverse instruction (or micro-instruction), the element size may also be specified, and/or the number of replications of apparatus 1702 may be selected to implement finite field multiplicative inversion on a 128-bit block, a 256-bit block, a 512-bit block, and so on. Embodiments of apparatus 1702 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1702 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for multiplicative inversion in GF(256). In apparatus 1702, each byte x is again taken to be output from point 1418 of process 1401, so apparatus 1702 begins by accessing the source data operand set containing x. It will be appreciated that point 1418 of process 1401 may represent the output of an affine transform circuit (e.g., 1601) or of an affine map instruction in processing block 1417. Processing blocks 1721-1727 comprise polynomial-power byte-slice generation circuits that compute, for each of the 16 byte values, byte values having polynomial representations in GF(256) for the powers x⁶, x²⁴, x⁹⁶ and x¹²⁸ of the respective byte value x. Processing blocks 1728-1730 comprise multiplier byte-slice circuits that multiply together, in GF(256), the byte values corresponding to these powers for each of the 16 byte values, producing 16 byte values with polynomial representations in GF(256), each being the multiplicative inverse x⁻¹ = x²⁵⁴ of its respective byte value x. The 16 multiplicative-inverse byte values are stored (e.g., in register bank unit 458) or output to processing block 1421 of process 1401, where an affine transform circuit (e.g., 1601) is optionally used in processing block 1422 to apply an affine transformation depending on whether process 1401 is performing encryption or decryption.
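The reason four powers suffice in apparatus 1702 can be written out as a short worked identity. Every nonzero x in GF(256) satisfies x²⁵⁵ = 1, so x²⁵⁴ is its inverse. The way the intermediate powers are obtained below (by repeated squaring, which is linear over GF(2)) is an assumption for illustration and is not stated in the text.

```latex
\begin{align*}
x^{-1} &= x^{254} = x^{6}\cdot x^{24}\cdot x^{96}\cdot x^{128}, & 6+24+96+128 &= 254,\\
x^{6} &= x^{2}\cdot x^{4}, & x^{24} &= \left(x^{6}\right)^{4},\\
x^{96} &= \left(x^{24}\right)^{4}, & x^{128} &= \left(x^{64}\right)^{2}.
\end{align*}
```

Multiplying four factors takes three multiplications, which is consistent with the three multiplier blocks 1728-1730.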

Figure 17C shows a diagram of another alternative embodiment of an apparatus 1703 for executing a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In some embodiments, apparatus 1703 may be replicated 16 times, each apparatus 1703 comprising hardware processing blocks for efficiently implementing finite field multiplicative inversion on a 128-bit block comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of a finite field multiplicative inverse instruction (or micro-instruction), the element size may also be specified, and/or the number of replications of apparatus 1703 may be selected to implement finite field multiplicative inversion on a 128-bit block, a 256-bit block, a 512-bit block, and so on. Embodiments of apparatus 1703 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1703 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for multiplicative inversion in GF(256).

Embodiments of apparatus 1703 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the finite field multiplicative inverse instruction for providing generic GF(256) SIMD multiplicative inversion specify a set of source data elements 1710 and a monic irreducible polynomial 1740. In response to the decoded finite field multiplicative inverse instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1710 of the source data operand set, a SIMD binary finite field multiplicative inverse modulo the irreducible polynomial. Some embodiments of apparatus 1703 perform the finite field inversion in the composite field GF((2⁴)²). In processing block 1734, each element 1710 of the source data operand set is mapped to the composite field GF((2⁴)²), and processing block 1734 outputs the 4-bit field elements z_H 1735 and z_L 1736. For one embodiment, the inverse field element z_L⁻¹ 1746 is computed as follows: (1) field elements z_H 1735 and z_L 1736 are added in the composite field (bitwise XOR 1737); (2) in processing block 1739, the output of bitwise XOR 1737 is multiplied and reduced modulo the irreducible polynomial p. In one embodiment the polynomial is p = z⁴+z³+1, but other irreducible polynomials of degree 4 may be used in alternative embodiments. The computation of z_L⁻¹ 1746 continues: (3) in processing block 1738, field element z_H 1735 is squared, multiplied by the hexadecimal value 8 and reduced modulo p, and the result is added in the composite field to the output of processing block 1739 (bitwise XOR 1741); (4) in processing block 1742, the inverse of the output of bitwise XOR 1741 is computed; and (5) in processing block 1744, that inverse is multiplied by field element z_L 1736 and reduced modulo p to produce the inverse field element z_L⁻¹ 1746. For one embodiment, the inverse field element z_H⁻¹ 1745 is computed as follows: steps (1) through (4) are as described above; and (5) in processing block 1743, the output of processing block 1742 is multiplied by field element z_H 1735 and reduced modulo p to produce the inverse field element z_H⁻¹ 1745. Then, in processing block 1747, each pair of 4-bit field elements z_H⁻¹ 1745 and z_L⁻¹ 1746 is mapped back from the composite field GF((2⁴)²) to produce a multiplicative-inverse result element 1750 in GF(256). The multiplicative-inverse result element 1750 for each element 1710 of the source data operand set of the finite field multiplicative inverse instruction is finally stored in a SIMD destination register (e.g., in physical register bank unit 458).
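Inversion in a composite field can be sketched in software to show why only 4-bit multipliers and one 4-bit inverter are needed. The sketch below is illustrative only and rests on several assumptions not confirmed by the text: elements are written as z_H·y + z_L with y² = y + λ, λ = 8, the second multiplicand in step (2) is z_L, and the final outputs follow the textbook polynomial-basis formula (the low half is multiplied by z_H ⊕ z_L), which may differ from the basis convention wired in Figure 17C. The mappings to and from GF(256) (blocks 1734 and 1747) are omitted, and all function names are hypothetical.

```python
P4 = 0b11001   # z^4 + z^3 + 1, the degree-4 polynomial named in the text
LAMBDA = 0x8   # the hexadecimal constant 8 used in step (3)

def gf16_mul(a, b):
    """Multiply two 4-bit values in GF(2^4) modulo P4."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= P4
    return r

def gf16_inv(a):
    """a^14 = a^-1 for nonzero a in GF(2^4); 0 maps to 0."""
    a2 = gf16_mul(a, a)
    a4 = gf16_mul(a2, a2)
    a8 = gf16_mul(a4, a4)
    return gf16_mul(gf16_mul(a2, a4), a8)

def comp_mul(a, b):
    """Multiply (high, low) pairs in GF((2^4)^2), assuming y^2 = y + LAMBDA."""
    (ah, al), (bh, bl) = a, b
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(al, bl) ^ gf16_mul(LAMBDA, hh)
    return high, low

def comp_inv(z):
    """Invert a (high, low) pair following steps (1)-(5) under the stated assumptions."""
    zh, zl = z
    s = zh ^ zl                                   # step (1): z_H + z_L
    t = gf16_mul(s, zl)                           # step (2): assumed multiplicand z_L
    u = gf16_mul(gf16_mul(zh, zh), LAMBDA) ^ t    # step (3): 8*z_H^2 + t
    d = gf16_inv(u)                               # step (4): 4-bit inversion
    return gf16_mul(d, zh), gf16_mul(d, s)        # step (5): textbook output formula
```

With these choices, multiplying any nonzero pair by its computed inverse gives (0, 1), the composite-field representation of 1.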

Figure 18A shows a diagram of one embodiment of an apparatus 1801 for executing a specific modular reduction instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In the example shown, the specific modulus polynomial 1811B is p = x⁸+x⁴+x³+x+1 in GF(256). In some embodiments, apparatus 1801 may be replicated 16 times, each apparatus 1801 comprising hardware processing blocks for efficiently implementing the specific modular reduction of two 128-bit blocks (or one 256-bit block) comprising 16 two-byte values to produce a 128-bit block comprising 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256). Embodiments of apparatus 1801 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1801 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for the specific modular reduction in GF(256).

Embodiments of apparatus 1801 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the specific modular reduction instruction for providing generic GF(256) SIMD modular reduction specify a set of source data operand elements 1810 and a monic irreducible polynomial 1811B. In response to the decoded modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1810 of the source data operand set, a SIMD binary finite field reduction modulo the irreducible polynomial. An element 1810 of the source data operand set, having a two-byte value, is input to processing block 1821 as q_H 1828 and q_L 1820. In processing block 1821, some embodiments of apparatus 1801 perform a 12-bit operation equivalent to:

The resulting element T of processing block 1825, having a partially reduced 12-bit value, is input to processing block 1831 as T_H 1838 and T_L 1830. In processing block 1831, some embodiments of apparatus 1801 perform an 8-bit operation in processing block 1835 equivalent to:

It will be appreciated that zero (0) inputs to the XOR operations can be eliminated, further reducing the logic complexity of apparatus 1801. The specific modular reduction result element 1850 for each element 1810 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register bank unit 458).
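A two-stage reduction of this kind can be sketched as follows. This is not a reproduction of the equations for processing blocks 1821 and 1831, which are not present in the text; it is a plausible folding scheme, assuming p = x⁸+x⁴+x³+x+1 and the congruence x⁸ ≡ x⁴+x³+x+1. The names `reduce_two_step` and `reduce_reference` are hypothetical.

```python
def reduce_two_step(q):
    """Reduce a 16-bit polynomial q modulo x^8 + x^4 + x^3 + x + 1 in two folds."""
    q_h, q_l = (q >> 8) & 0xFF, q & 0xFF
    # First fold: q_H * x^8 becomes q_H * (x^4 + x^3 + x + 1), a value of at most 12 bits.
    t = q_l ^ q_h ^ (q_h << 1) ^ (q_h << 3) ^ (q_h << 4)
    t_h, t_l = t >> 8, t & 0xFF
    # Second fold: T_H has at most 4 bits, so the result fits in 8 bits.
    return t_l ^ t_h ^ (t_h << 1) ^ (t_h << 3) ^ (t_h << 4)

def reduce_reference(q, poly=0x11B):
    """Bit-serial polynomial long division, used only to cross-check the sketch."""
    for bit in range(15, 7, -1):
        if q & (1 << bit):
            q ^= poly << (bit - 8)
    return q
```

Because the polynomial's low part has degree 4, the first fold leaves at most four overflow bits, which is why a 12-bit stage followed by an 8-bit stage is enough.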

Figure 18B shows a diagram of an alternative embodiment of an apparatus 1802 for executing a specific modular reduction instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In the example shown, the specific modulus polynomial 1811B is again p = x⁸+x⁴+x³+x+1 in GF(256). It will be appreciated that similar techniques can be applied to implement specific modular reduction instructions (or micro-instructions) for other modulus polynomials, for example f₅ = x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in GF(256), which is used in the block cipher SMS4 of the Chinese national standard for wireless LANs, WAPI (WLAN Authentication and Privacy Infrastructure). In some embodiments, apparatus 1802 may be replicated 16 times, each apparatus 1802 comprising hardware processing blocks for efficiently implementing the specific modular reduction of two 128-bit blocks (or one 256-bit block) comprising 16 two-byte values to produce a 128-bit block comprising 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256). Embodiments of apparatus 1802 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1802 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for the specific modular reduction in GF(256).

Embodiments of apparatus 1802 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the specific modular reduction instruction for providing generic GF(256) SIMD modular reduction specify a set of source data operand elements 1810 and a monic irreducible polynomial 1811B. In response to the decoded modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1810 of the source data operand set, a SIMD binary finite field reduction modulo the irreducible polynomial. An element 1810 of the source data operand set, having a two-byte value, is input to processing block 1861 as q[15:8] 1828 and q[7:0] 1820. In processing block 1861, some embodiments of apparatus 1802 perform logic operations in XOR logic gates 1867-1860 equivalent to:

The specific modular reduction result element (q mod p) 1850 for each element 1810 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register bank unit 458).

Figure 18C shows a diagram of another alternative embodiment of an apparatus 1803 for executing a specific AES Galois Counter Mode (GCM) modular reduction instruction to provide GF(2¹²⁸) SIMD cryptographic arithmetic functions. In the example shown, the specific modulus polynomial 1887 is p = x¹²⁸+x⁷+x²+x+1 in GF(2¹²⁸). Embodiments of apparatus 1803 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing a specific modular reduction instruction to provide GF(2¹²⁸) SIMD cryptographic arithmetic functions. Embodiments of apparatus 1803 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for the specific modular reduction in GF(2¹²⁸).

Embodiments of apparatus 1803 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the specific instruction for providing AES GCM modular reduction in GF(2¹²⁸) specify a set of source data operand elements 1813 and a monic irreducible polynomial 1887. In response to the decoded finite field modular reduction instruction, one or more execution units (e.g., execution unit 462) compute, for each element 1813 of the source data operand set, a SIMD finite field reduction modulo the irreducible polynomial.

An element 1813 of the source data operand set, having a 32-byte value, is input to processing block 1871. In processing block 1871, some embodiments of apparatus 1803 perform a non-bit-reflected operation with respect to a non-bit-reflected reduction polynomial, which is equivalent to a bit-reflected modular reduction of a bit-reflected product, as follows:

(i) [X₃, X₂, X₁, X₀] = q[255:0] << 1;

(ii) A = X₀ << 63; B = X₀ << 62; C = X₀ << 57;

(iv) [E₁, E₀] = [D, X₀] >> 1; [F₁, F₀] = [D, X₀] >> 2; [G₁, G₀] = [D, X₀] >> 7;

Accordingly, equation (i) is implemented by shifter 1870 on element 1813 to produce [X₃, X₂, X₁, X₀] 1872. Equation (ii) is implemented by shifters 1873-1875. Equation (iii) is implemented by processing block 1876. Equation (iv) is implemented by shifters 1877-1879. Equation (v) is implemented by processing block 1885, and equation (vi) is implemented by processing block 1880. The specific modular reduction result element (q mod p) 1853 for each element 1813 of the source data operand set of the specific modular reduction instruction is stored in a SIMD destination register (e.g., in physical register bank unit 458).
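The shift-and-XOR reduction can be sketched in software with arbitrary-precision integers. Equations (iii), (v) and (vi) are not reproduced in the text, so the corresponding lines below are assumptions based on the conventional folding steps for this reduction, not the document's own equations. Operands are taken as 128-bit integers read big-endian from GCM blocks (the bit-reflected convention), and the names `clmul`, `gcm_reduce` and `ghash_mul_reference` are hypothetical.

```python
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carry-less (XOR) multiplication of two non-negative integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gcm_reduce(q):
    """Reduce a carry-less product q of two bit-reflected 128-bit operands."""
    x = q << 1                                   # (i)  [X3,X2,X1,X0] = q << 1
    x0, x1, x32 = x & MASK64, (x >> 64) & MASK64, x >> 128
    a = (x0 << 63) & MASK64                      # (ii) bits that the shifts in (iv) drop
    b = (x0 << 62) & MASK64
    c = (x0 << 57) & MASK64
    d = x1 ^ a ^ b ^ c                           # (iii) assumed: D = X1 ^ A ^ B ^ C
    dx = (d << 64) | x0
    e, f, g = dx >> 1, dx >> 2, dx >> 7          # (iv)
    h = dx ^ e ^ f ^ g                           # (v)  assumed: fold [D,X0] with its shifts
    return x32 ^ h                               # (vi) assumed: XOR into [X3,X2]

def ghash_mul_reference(x, y):
    """Bit-serial GF(2^128) multiplication in the GCM bit order, for cross-checking."""
    r = 0xE1 << 120
    z, v = 0, y
    for i in range(127, -1, -1):                 # leftmost bit of x first
        if (x >> i) & 1:
            z ^= v
        v = (v >> 1) ^ r if v & 1 else v >> 1
    return z
```

A usage check is to compare `gcm_reduce(clmul(x, y))` with `ghash_mul_reference(x, y)` for a few 128-bit values; under the stated bit-order assumption the two agree.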

Figure 18D shows a diagram of one embodiment of an apparatus 1804 for executing a modular reduction instruction to provide generic binary finite field GF(2^t) SIMD cryptographic arithmetic functions. In the example shown, a specific modulus polynomial p_s may be selected from the specific modulus polynomials for which the instruction (or micro-instruction) provides modular reduction (e.g., p₀, p₁, ... p_n). In some embodiments where t = 8, apparatus 1804 may be replicated 16 times, each apparatus 1804 comprising hardware processing blocks for efficiently implementing the specific modular reduction of two 128-bit blocks (or one 256-bit block) comprising 16 two-byte values to produce a 128-bit block comprising 16 byte values, each of the resulting 16 byte values having a polynomial representation in GF(256) or, alternatively, in some composite field (e.g., GF((2⁴)²) or GF((2²)⁴)). In other embodiments of a modular reduction instruction (or micro-instruction), the size t may also be specified, and/or the number of replications of apparatus 1804 may be selected to produce a 128-bit block, a 256-bit block, a 512-bit block, and so on. Embodiments of apparatus 1804 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1804 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for modular reduction in the binary finite field GF(2^t) or, alternatively, in some composite field (e.g., GF((2^u)^v), where t = u·v).

Figure 19A shows a diagram of one embodiment of an apparatus 1901 for executing a binary finite field multiplication instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In some embodiments, apparatus 1901 may be replicated 16 times, each apparatus 1901 comprising hardware processing blocks for efficiently implementing binary finite field multiplication of two 128-bit blocks, each comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of a binary finite field multiplication instruction (or micro-instruction), the element size may also be specified, and/or the number of replications of apparatus 1901 may be selected to implement binary finite field multiplication on a 128-bit block, a 256-bit block, a 512-bit block, and so on. Embodiments of apparatus 1901 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1901 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for finite field multiplication in GF(256).

Embodiments of apparatus 1901 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the binary finite field multiplication instruction for providing generic GF(256) SIMD binary finite field multiplication specify two sets of source data operand elements 1910 and 1920 and a monic irreducible polynomial. In processing block 1902, in response to the decoded binary finite field multiplication instruction, one or more execution units (e.g., execution unit 462) compute, for each pair of elements 1910 and 1912 of the source data operand sets, a SIMD carry-less 8×8 multiplication to produce a 15-bit product element 1915, and a reduced product 1918 obtained through modular reduction unit 1917 by reducing modulo the irreducible polynomial selected (e.g., through selector 1916). The reduced product 1918 of each binary finite field multiplication of the element pairs 1910 and 1912 of the source data operand sets is stored in a SIMD destination register (e.g., in physical register bank unit 458).
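The two stages described here, a carry-less 8×8 multiply followed by reduction modulo a selected polynomial, can be sketched per byte lane as follows. This is a minimal illustration; the function names are hypothetical, and passing the low eight coefficients of the polynomial as `poly_low` (with x⁸ implied) is an assumed encoding of the selector.

```python
def clmul8(a, b):
    """Carry-less 8x8 multiplication; the product has at most 15 bits."""
    prod = 0
    for i in range(8):
        if (b >> i) & 1:
            prod ^= a << i
    return prod

def mod_reduce(prod, poly_low):
    """Reduce a 15-bit product modulo x^8 + (polynomial encoded by poly_low)."""
    poly = 0x100 | poly_low
    for bit in range(14, 7, -1):
        if prod & (1 << bit):
            prod ^= poly << (bit - 8)
    return prod

def simd_gf256_mul(src1, src2, poly_low=0x1B):
    """Multiply corresponding bytes of two operands, lane by lane."""
    return bytes(mod_reduce(clmul8(a, b), poly_low) for a, b in zip(src1, src2))
```

For example, with the default polynomial, `simd_gf256_mul(bytes([0x57]), bytes([0x83]))` gives `bytes([0xC1])`, the worked multiplication example in the AES specification.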

Figure 19B shows a diagram of an alternative embodiment of an apparatus 1903 for executing a binary finite field multiplication instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In some embodiments, apparatus 1903 may be replicated 2 times, each apparatus 1903 comprising hardware processing blocks for efficiently implementing binary finite field multiplication of two 128-bit blocks, each comprising 16 byte values, each byte having a polynomial representation in GF(256). In other embodiments of a binary finite field multiplication instruction (or micro-instruction), the element size may also be specified, and/or the number of replications of apparatus 1903 may be selected to implement binary finite field multiplication on a 128-bit block, a 256-bit block, a 512-bit block, and so on. Embodiments of apparatus 1903 may be part of a pipeline 400 (e.g., execution stage 416) or part of a core 490 (e.g., execution unit 462) for executing such an instruction. Embodiments of apparatus 1903 may be coupled with a decode stage (e.g., decode 406) or a decoder (e.g., decode unit 440) to decode an instruction for finite field multiplication in GF(256).

Embodiments of apparatus 1903 may be coupled with SIMD vector registers (e.g., physical register bank unit 458) comprising a variable plurality of m variable-sized data fields to store values of a variable plurality of m variable-sized data elements. Some embodiments of the binary finite field multiplication instruction for providing generic GF(256) SIMD binary finite field multiplication specify two source data operand sets (e.g., 1920 and 1922) and a monic irreducible polynomial p. In processing blocks 1902 of array 1925, in response to the decoded binary finite field multiplication instruction, one or more execution units (e.g., execution unit 462) compute, for each pair of elements of source data operand sets 1920 and 1922, a SIMD carry-less 8×8 multiplication to produce a product element 1915, and a reduced product 1918 obtained through modular reduction unit 1917 by reducing modulo the irreducible polynomial selected (e.g., through selector 1916). The resulting set 1928 of reduced products of the SIMD binary finite field multiplication of source data operand sets 1920 and 1922 is stored in a SIMD destination register (e.g., in physical register bank unit 458).

Figure 20A shows a flowchart of one embodiment of a process 2001 for executing an affine map instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. Process 2001 and the other processes disclosed herein are performed by processing blocks that may comprise dedicated hardware, or software or firmware operation codes executable by general-purpose machines, by special-purpose machines, or by a combination of both.

In processing block 2011, a processor affine map instruction for a SIMD affine transformation in a finite field is decoded. In processing block 2016, decoding of the affine map instruction optionally generates a plurality of micro-instructions, for example a first micro-instruction for finite field matrix-vector multiplication 1602 and a second micro-instruction for finite field vector addition (or XOR) 1603. In processing block 2021, the set of source data operand elements is accessed. In processing block 2031, the transformation matrix operand is accessed. In processing block 2041, the translation vector operand is accessed. In processing block 2051, the transformation matrix operand is applied to each element of the source data operand set. In processing block 2061, the translation vector operand is applied to each transformed element of the source data operand set. In processing block 2081, a determination is made whether processing of every element of the source data operand set has finished. If not, the SIMD affine transformation reiterates beginning in processing block 2051. Otherwise, in processing block 2091, the result of the SIMD affine transformation is stored in a SIMD destination register.
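The per-element work of processing blocks 2051 and 2061 amounts to an 8×8 bit-matrix multiplication over GF(2) followed by an XOR with a vector. The sketch below is illustrative: the matrix is held as eight row bytes, where row i selects the input bits that are XORed into output bit i, which is an assumed layout rather than the instruction's operand format. The AES S-box matrix and constant 0x63 are used as sample data, and the function names are hypothetical.

```python
AES_MATRIX = (0xF1, 0xE3, 0xC7, 0x8F, 0x1F, 0x3E, 0x7C, 0xF8)  # row i feeds output bit i
AES_VECTOR = 0x63

def parity(v):
    """XOR of all bits of v."""
    return bin(v).count("1") & 1

def affine_byte(x, matrix, vector):
    """Apply the matrix to byte x over GF(2), then add (XOR) the vector."""
    out = 0
    for i, row in enumerate(matrix):
        out |= parity(row & x) << i
    return out ^ vector

def simd_affine_map(src, matrix, vector):
    """Apply the same affine transformation to every byte of the source operand."""
    return bytes(affine_byte(x, matrix, vector) for x in src)
```

Applied to 0xCA, the multiplicative inverse of 0x53 under the AES polynomial, this yields 0xED, the AES S-box value for 0x53.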

Figure 20B shows a flowchart of one embodiment of a process 2002 for executing a finite field multiplicative inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In processing block 2012, a processor multiplicative inverse instruction for SIMD multiplicative inversion in a finite field is decoded. In processing block 2016, decoding of the multiplicative inverse instruction optionally generates a plurality of micro-instructions, for example a first micro-instruction for multiplicative inversion and a second micro-instruction for a modular reduction such as one of 1801-1804. In processing block 2022, the set of source data operand elements is accessed. In processing block 2032, the irreducible polynomial is optionally explicitly identified. In one embodiment, the irreducible polynomial may be specified, for example in an immediate operand of the instruction, as the hexadecimal control value 1B to indicate the polynomial x⁸+x⁴+x³+x+1 in the Galois field GF(256). In another embodiment, the irreducible polynomial may be specified, for example in an immediate operand of the instruction, as the hexadecimal control value FA to indicate the polynomial x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(256), or alternatively another polynomial. In another alternative embodiment, the irreducible polynomial may be specified and/or explicitly identified in the instruction mnemonic. In processing block 2042, the binary finite field multiplicative inverse is computed for each element of the source data operand set, and in processing block 2052, the inverse of each element of the source data operand set is optionally reduced modulo the irreducible polynomial. In processing block 2082, a determination is made whether processing of every element of the source data operand set has finished. If not, the SIMD finite field multiplicative inversion reiterates beginning in processing block 2042. Otherwise, in processing block 2092, the result of the SIMD finite field multiplicative inversion is stored in a SIMD destination register.

Figure 20C shows a flow diagram of one embodiment of a process 2003 for executing an affine-inverse instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In processing block 2013, a processor affine-inverse instruction for a SIMD affine transformation and inverse in a finite field is decoded. In processing block 2016, decoding of the affine-inverse instruction optionally generates a plurality of microinstructions, for example, a first microinstruction for a finite field affine map 1601 and a second microinstruction for a finite field multiplicative inverse 1604; or alternatively, a first microinstruction for a finite field matrix-vector multiplication 1601, followed by a second microinstruction for a byte broadcast, a third microinstruction for a finite field vector addition (XOR) 1602 and a fourth microinstruction for a finite field multiplicative inverse 1604. In processing block 2023, a source data operand set of elements is accessed. In processing block 2033, a transformation matrix operand is accessed. In processing block 2043, a translation vector operand is accessed. In processing block 2053, the transformation matrix operand is applied to each element of the source data operand set. In processing block 2063, the translation vector operand is applied to each transformed element of the source data operand set. In processing block 2073, a binary finite field multiplicative inverse is computed for each transformed element of the source data operand set. In processing block 2083, a determination is made whether processing of each element of the source data operand set has finished. If not, the SIMD affine transformation and inverse reiterates, beginning in processing block 2053. Otherwise, in processing block 2093, the result of the SIMD affine transformation and multiplicative inverse is stored in a SIMD destination register.
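The order of operations in process 2003 (matrix, then vector, then inverse) can be sketched as follows. This is an illustrative model under stated assumptions: the transformation matrix is assumed to be passed as eight row bytes, with output bit i equal to the parity of (row i AND element); the actual operand layout is not specified in this passage. The names affine, simd_affine_inv and IDENTITY_ROWS are hypothetical.

```python
def gf256_mul(a: int, b: int, poly_low: int = 0x1B) -> int:
    """Multiply in GF(2^8); poly_low is the low byte of the irreducible polynomial."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= poly_low
    return product


def gf256_inv(x: int, poly_low: int = 0x1B) -> int:
    """Return x**254 by repeated multiplication (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, x, poly_low)
    return result


def affine(x: int, matrix_rows: list[int], vector: int) -> int:
    """Blocks 2053 and 2063: bit-matrix times x over GF(2), then XOR the vector.

    Assumed layout: matrix_rows[i] produces output bit i.
    """
    out = 0
    for i, row in enumerate(matrix_rows):
        parity = bin(row & x).count("1") & 1
        out |= parity << i
    return out ^ vector


def simd_affine_inv(source: bytes, matrix_rows: list[int], vector: int,
                    poly_low: int = 0x1B) -> bytes:
    """Blocks 2053-2093: affine-transform each element, invert it, store the result."""
    return bytes(gf256_inv(affine(e, matrix_rows, vector), poly_low) for e in source)


IDENTITY_ROWS = [1 << i for i in range(8)]  # hypothetical test matrix

if __name__ == "__main__":
    print(simd_affine_inv(bytes([0x52, 0x00]), IDENTITY_ROWS, 0x01).hex())
```

With the identity matrix and vector 01, the element 52 becomes 53 before inversion, so the example exercises both the translation and the inverse steps.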

Figure 20D shows a flow diagram of one embodiment of a process 2004 for executing a binary finite field multiply instruction to provide generic GF(256) SIMD cryptographic arithmetic functions. In processing block 2014, a processor multiply instruction for a SIMD multiplication in a finite field is decoded. In processing block 2016, decoding of the multiply instruction optionally generates a plurality of microinstructions, for example, a first microinstruction for a finite field carry-less multiplication 1913 and a second microinstruction for a finite field modulo reduction 1917 such as one of 1801-1804. In processing block 2024, a first source data operand set of elements is accessed. In processing block 2034, a second source data operand set of elements is accessed. In processing block 2044, an irreducible polynomial is optionally explicitly identified. In one embodiment, the irreducible polynomial may be specified, for example, in an immediate operand of the instruction as the hexadecimal control value 1B to indicate the polynomial x⁸+x⁴+x³+x+1 in the Galois field GF(256). In another embodiment, the irreducible polynomial may be specified, for example, in an immediate operand of the instruction as the hexadecimal control value F5 to indicate the polynomial x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(256). In another alternative embodiment, the irreducible polynomial may be specified and/or explicitly identified in the instruction mnemonic. In processing block 2054, a product is computed for each pair of corresponding elements of the first and second source data operand sets, and in processing block 2064, the product of each pair of corresponding elements is optionally reduced modulo the irreducible polynomial. In processing block 2084, a determination is made whether processing of each pair of corresponding elements of the first and second source data operand sets has finished. If not, the SIMD finite field multiplication reiterates, beginning in processing block 2054. Otherwise, in processing block 2094, the result of the SIMD finite field multiplication is stored in a SIMD destination register.
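The two-step structure described for process 2004, a carry-less multiplication followed by a separate modulo reduction, can be sketched as below. This is a software model only; it assumes byte elements and an immediate that holds the low eight bits of the irreducible polynomial, and the names clmul8, reduce_mod and simd_gf256_mul are hypothetical.

```python
def clmul8(a: int, b: int) -> int:
    """Carry-less (polynomial) product of two bytes; the result has up to 15 bits."""
    product = 0
    for i in range(8):
        if (b >> i) & 1:
            product ^= a << i
    return product


def reduce_mod(product: int, poly_low: int = 0x1B) -> int:
    """Reduce a 15-bit carry-less product modulo x^8 + (poly_low)."""
    modulus = 0x100 | poly_low            # restore the implicit x^8 term
    for bit in range(14, 7, -1):          # clear bits 14 down to 8
        if (product >> bit) & 1:
            product ^= modulus << (bit - 8)
    return product


def simd_gf256_mul(src1: bytes, src2: bytes, poly_low: int = 0x1B) -> bytes:
    """Blocks 2054-2094: multiply corresponding elements, reduce, store the result."""
    if len(src1) != len(src2):
        raise ValueError("source operands must have the same number of elements")
    return bytes(reduce_mod(clmul8(a, b), poly_low) for a, b in zip(src1, src2))


if __name__ == "__main__":
    print(simd_gf256_mul(bytes([0x57, 0x57, 0x53]), bytes([0x83, 0x13, 0xCA])).hex())
```

Keeping the reduction as its own function mirrors the option of issuing it as a second microinstruction, and lets the same carry-less product be reduced by different polynomials.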

It will be appreciated that, while the processes for executing instructions to provide generic SIMD cryptographic arithmetic functions may be illustrated above as iterative, one or more instances of the various processing blocks may be, and preferably are, executed simultaneously and/or in parallel whenever possible, in order to increase execution performance and throughput.

It will be appreciated that generic GF(256) SIMD cryptographic arithmetic instructions may be used to provide generic GF(256) SIMD cryptographic arithmetic functions in a variety of applications, for example, in cryptographic protocols and Internet communications that assure data integrity, identity verification, message content authentication and message origin authentication for financial transactions, electronic commerce, electronic mail, software distribution, data storage, and the like.

It will therefore be appreciated that providing execution of instructions for at least: (1) a SIMD affine transformation, specifying a source data operand, a transformation matrix operand and a translation vector, wherein the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each transformed element; (2) a SIMD binary finite field multiplicative inverse, to compute, for each element of the source data operand, the inverse in a binary finite field modulo an irreducible polynomial; (3) a SIMD affine transformation and multiplicative inverse (or multiplicative inverse and affine transformation), specifying a source data operand, a transformation matrix operand and a translation vector, wherein, either before or after the multiplicative inverse operation, the transformation matrix is applied to each element of the source data operand and the translation vector is applied to each transformed element; (4) a modulo reduction, to compute a reduction modulo a particular modulus polynomial ps selected from a plurality of polynomials in a binary finite field for which modulo reduction is provided by the instruction (or microinstruction); and (5) a SIMD binary finite field multiplication, specifying first and second source data operands, to multiply each pair of corresponding elements of the first and second source data operands modulo an irreducible polynomial; wherein the results of these instructions are stored in SIMD destination registers; may provide generic GF(256) and/or other alternative binary finite field SIMD cryptographic arithmetic functions in hardware and/or microcode sequences, so as to support significant performance improvements for several important performance-critical applications without requiring excessive or undue functional units that demand additional circuitry, area or power.
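As an illustration of how the listed primitives compose, the following sketch combines a multiplicative inverse with a subsequent affine transformation (the "multiplicative inverse and affine transform" ordering of item (3)). The matrix rows and the vector 63 used here are those of the well-known AES S-box, chosen as a familiar test case; this passage does not itself name that cipher, and the row-per-output-bit matrix layout and all function names are assumptions made for this example.

```python
def gf256_mul(a: int, b: int, poly_low: int = 0x1B) -> int:
    """Multiply in GF(2^8); poly_low is the low byte of the irreducible polynomial."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= poly_low
    return product


def gf256_inv(x: int, poly_low: int = 0x1B) -> int:
    """Return x**254 (the inverse for nonzero x; 0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, x, poly_low)
    return result


def rotl8(value: int, count: int) -> int:
    """Rotate a byte left by count bits."""
    return ((value << count) | (value >> (8 - count))) & 0xFF


def affine(x: int, matrix_rows: list[int], vector: int) -> int:
    """Bit-matrix times x over GF(2), then XOR the translation vector."""
    out = 0
    for i, row in enumerate(matrix_rows):
        out |= (bin(row & x).count("1") & 1) << i
    return out ^ vector


# AES S-box affine step: output bit i = b[i]^b[i+4]^b[i+5]^b[i+6]^b[i+7] (indices mod 8).
AES_MATRIX_ROWS = [rotl8(0xF1, i) for i in range(8)]
AES_VECTOR = 0x63


def inverse_then_affine(source: bytes, matrix_rows: list[int], vector: int,
                        poly_low: int = 0x1B) -> bytes:
    """Invert each element, then apply the affine transformation to the inverse."""
    return bytes(affine(gf256_inv(e, poly_low), matrix_rows, vector) for e in source)


if __name__ == "__main__":
    lanes = bytes([0x00, 0x01, 0x53])
    print(inverse_then_affine(lanes, AES_MATRIX_ROWS, AES_VECTOR).hex())  # 637ced
```

Because the matrix, vector and polynomial are all parameters, the same composition could model other byte substitutions built from an inverse and an affine map, which is the sense in which the functions are generic.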

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or an interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions, stored on a machine-readable medium, that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.

Thus, techniques for executing one or more instructions according to at least one embodiment are disclosed. While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of, and not restrictive on, the broad invention, and that this invention is not limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modified in arrangement and detail, as facilitated by enabling technological advancements, without departing from the principles of the present disclosure or the scope of the accompanying claims.

Claims

1. A processor comprising:

a decode stage for decoding a first instruction for a single instruction multiple data (SIMD) binary finite field multiplicative inverse, the first instruction specifying a source data operand set and an irreducible polynomial; and

one or more execution units to, in response to the decoded first instruction:

compute, for each element in the source data operand set, the binary finite field multiplicative inverse modulo the irreducible polynomial as a SIMD operation; and

store the result of the first instruction in a SIMD destination register.

2. The processor of claim 1, wherein the first instruction designates the SIMD destination register as a destination operand.

3. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold a plurality of 16-byte elements as the source data operand set.

4. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold a plurality of 32-byte elements as the source data operand set.

5. The processor of claim 1, wherein the first instruction specifies a SIMD register to hold a plurality of 64-byte elements as the source data operand set.

6. The processor of any of claims 1-5, wherein the SIMD binary finite field multiplicative inverse is computed by raising each element in the source data operand set to the 254th power in the Galois field GF(2⁸), modulo the irreducible polynomial.

7. The processor of any one of claims 1-5, wherein, in the mnemonic of the first instruction, the irreducible polynomial is designated as 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

8. The processor of any one of claims 1-5, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

9. The processor of any one of claims 1-5, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value F5 to indicate x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(2⁸).
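The correspondence between a hexadecimal control value and the polynomial it indicates can be checked mechanically. The sketch below is illustrative only: it assumes the control value holds the low-order coefficients and that the leading term (x⁸, or x¹²⁸ for the larger field mentioned in other claims) is implicit. The function name polynomial_from_control is hypothetical.

```python
def polynomial_from_control(control: int, degree: int = 8) -> str:
    """Expand a control value into polynomial text, adding the implicit leading term."""
    bits = (1 << degree) | control
    terms = []
    for power in range(degree, -1, -1):
        if (bits >> power) & 1:
            if power == 0:
                terms.append("1")
            elif power == 1:
                terms.append("x")
            else:
                terms.append(f"x^{power}")
    return " + ".join(terms)


if __name__ == "__main__":
    print(polynomial_from_control(0x1B))        # x^8 + x^4 + x^3 + x + 1
    print(polynomial_from_control(0xF5))        # x^8 + x^7 + x^6 + x^5 + x^4 + x^2 + 1
    print(polynomial_from_control(0x87, 128))   # x^128 + x^7 + x^2 + x + 1
```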

10. An apparatus for performing a single instruction multiple data (SIMD) binary finite field multiplicative inverse operation, the apparatus comprising:

means for accessing a source data operand set of elements and an irreducible polynomial;

means for computing the binary finite field multiplicative inverse modulo the irreducible polynomial for each element in the source data operand set of elements as a SIMD operation; and

means for storing, in a SIMD destination register, the result of the SIMD binary finite field multiplicative inverse modulo the irreducible polynomial.

11. The apparatus of claim 10, wherein, in an immediate operand of a first instruction for the SIMD binary finite field multiplicative inverse, the irreducible polynomial is specified as a hexadecimal control value 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

12. The apparatus of claim 10, wherein, in the mnemonic of a first instruction for the SIMD binary finite field multiplicative inverse, the irreducible polynomial is designated as 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

13. The apparatus of claim 10, wherein, in an immediate operand of a first instruction for the SIMD binary finite field multiplicative inverse, the irreducible polynomial is specified as a hexadecimal control value 87 to indicate x¹²⁸+x⁷+x²+x+1 in the Galois field GF(2¹²⁸).

14. The apparatus of claim 10, wherein, in an immediate operand of a first instruction for the SIMD binary finite field multiplicative inverse, the irreducible polynomial is specified as a hexadecimal control value F5 to indicate x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(2⁸).

15. A processing method, comprising:

decoding a first instruction for a single instruction multiple data (SIMD) binary finite field multiplicative inverse, the first instruction specifying a source data operand set and an irreducible polynomial;

in response to the decoded first instruction, for each element in the source data operand set, computing a binary finite field multiplicative inverse modulo the irreducible polynomial as a SIMD operation; and

storing the result of the first instruction in a SIMD destination register.

16. The processing method of claim 15, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

17. The processing method of claim 15, wherein, in the mnemonic of the first instruction, the irreducible polynomial is designated as 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

18. The processing method of claim 15, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 87 to indicate x¹²⁸+x⁷+x²+x+1 in the Galois field GF(2¹²⁸).

19. The processing method of claim 15, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value F5 to indicate x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(2⁸).

20. A processing system comprising:

a memory for storing a first instruction for a SIMD secure hash algorithm round slice; and

a processor, including:

an instruction fetch stage for fetching the first instruction;

one or more execution units to, in response to the decoded first instruction:

store the result of the first instruction in a SIMD destination register.

21. The processing system of claim 20, wherein, in the mnemonic of the first instruction, the irreducible polynomial is designated as 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

22. The processing system of claim 20, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

23. The processing system of any one of claims 20-22, wherein the first instruction is further for a SIMD affine transformation of each binary finite field multiplicative inverse, and the one or more execution units are further configured to, in response to the decoded first instruction:

perform the SIMD affine transformation by applying a transformation matrix operand to the multiplicative inverse of each element in the source data operand set and applying a translation vector operand to each transformed multiplicative inverse, in order to generate the result of the first instruction.

24. The processing system of claim 20, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 87 to indicate x¹²⁸+x⁷+x²+x+1 in the Galois field GF(2¹²⁸).

25. The processing system of claim 20, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value F5 to indicate x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(2⁸).

26. A machine-readable medium comprising a plurality of instructions stored on the machine-readable medium that, when executed, cause a computing device to perform the method of any of claims 15-19.

27. A processing device comprising:

means for decoding a first instruction for a single instruction multiple data (SIMD) binary finite field multiplicative inverse, the first instruction specifying a source data operand set and an irreducible polynomial;

means for computing a binary finite field multiplicative inverse modulo the irreducible polynomial for each element in the source data operand set as a SIMD operation in response to the decoded first instruction; and

means for storing the result of the first instruction in a SIMD destination register.

28. The processing device of claim 27, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

29. The processing device of claim 27, wherein, in the mnemonic of the first instruction, the irreducible polynomial is designated as 1B to indicate x⁸+x⁴+x³+x+1 in the Galois field GF(2⁸).

30. The processing device of claim 27, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value 87 to indicate x¹²⁸+x⁷+x²+x+1 in the Galois field GF(2¹²⁸).

31. The processing device of claim 27, wherein, in an immediate operand of the first instruction, the irreducible polynomial is specified as a hexadecimal control value F5 to indicate x⁸+x⁷+x⁶+x⁵+x⁴+x²+1 in the Galois field GF(2⁸).