CN107861757B - Arithmetic device and related product - Google Patents

Info

Publication number
CN107861757B
Authority
CN
China
Prior art keywords
instruction
vector
calculation
operation instruction
extended
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711244055.7A
Other languages
Chinese (zh)
Other versions
CN107861757A (en)
Inventor
陈天石
王秉睿
张潇
刘少礼
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN201711244055.7A
Publication of CN107861757A
Application granted
Publication of CN107861757B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30181 Instruction operation extension or modification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The present invention provides an arithmetic device for executing operations according to an extended instruction. The arithmetic device includes a memory, an arithmetic unit, and a control unit. The extended instruction includes an opcode and an operation field. The memory stores vectors. The control unit acquires the extended instruction, parses it to obtain a vector operation instruction and a second operation instruction, determines the calculation order of the two instructions from the instructions themselves, and reads the input vector corresponding to the input vector address from the memory. The arithmetic unit executes the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction. The technical solution provided by the invention offers low power consumption and low computational overhead.

Description

Arithmetic device and related products

Technical field

The present invention relates to the field of communication technologies, and in particular to an arithmetic device and related products.

Background

Modern general-purpose and special-purpose processors increasingly introduce calculation instructions (for example, vector instructions) to perform operations. A vector instruction causes the processor to perform a vector or matrix operation, such as vector addition and subtraction, vector inner product, matrix multiplication, or matrix convolution. At least one input of a vector instruction is a vector or matrix, or its result is a vector or matrix. A vector instruction can invoke the vector processing unit inside the processor to compute in parallel, which increases operation speed. In existing vector instructions, the vectors or matrices in the operands or results generally have a fixed size. For example, a vector instruction in Neon, the vector extension of ARM processors, can process a 32-bit floating-point vector of length 4 or a 16-bit fixed-point vector of length 8 at a time.

Existing vector operation instructions therefore cannot operate on vectors or matrices of variable size. In addition, an existing vector operation instruction can implement only one operation: a single vector instruction can implement either multiplication or addition, but not two or more operations. As a result, existing vector operations have high computational overhead and high energy consumption.

Summary of the invention

Embodiments of the present invention provide an arithmetic device and related products that allow a single operation instruction to implement multiple operations, which reduces operation overhead and lowers module power consumption.

In a first aspect, an embodiment of the present invention provides a method for implementing an extended instruction, the method including the following:

An arithmetic device for performing operations according to an extended instruction, the arithmetic device including: a memory, an arithmetic unit, and a control unit;

The extended instruction includes an opcode and an operation field. The opcode includes an identifier that identifies the vector calculation instruction. The operation field includes: the input vector address of the vector calculation instruction, the output vector address of the vector calculation instruction, the identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N;

The memory is configured to store vectors;

The control unit is configured to acquire the extended instruction, parse the extended instruction to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction according to the two instructions, and read the input vector corresponding to the input vector address from the memory;

The arithmetic unit is configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.

Optionally, the arithmetic device further includes:

A register unit configured to store extended instructions to be executed.

Optionally, the control unit includes:

An instruction fetch module configured to fetch the extended instruction from the register unit;

A decoding module configured to decode the fetched extended instruction to obtain the vector operation instruction, the second operation instruction, and the calculation order;

An instruction queue configured to store the decoded vector operation instruction and second operation instruction in the calculation order.

Optionally, the arithmetic device further includes:

A dependency processing unit configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction. If so, it waits until the previous extended instruction has finished executing and then provides the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit. Otherwise, it provides the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit directly.

Optionally, when the current extended instruction accesses the same vector as the previous extended instruction, the dependency processing unit stores the current extended instruction in a storage queue and, after the previous extended instruction has finished executing, provides the current extended instruction in the storage queue to the control unit.

Optionally, the memory is a scratchpad memory.

Optionally, the arithmetic unit includes a vector addition circuit, a vector multiplication circuit, a magnitude comparison circuit, a nonlinear operation circuit, a vector-scalar multiplication circuit, and an activation circuit.

Optionally, the arithmetic unit has a multi-stage pipeline structure, in which the vector multiplication circuit and the vector-scalar multiplication circuit are in the first pipeline stage, the magnitude comparison component and the vector addition circuit are in the second pipeline stage, and the nonlinear operation component and the activation circuit are in the third pipeline stage. The output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage.

Optionally, the control unit is specifically configured to: identify whether the output data of the vector operation instruction is the same as the input data of the second calculation instruction and, if so, determine that the calculation order is forward-order calculation; identify whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction and, if so, determine that the calculation order is reverse-order calculation; and identify whether the input data of the vector operation instruction is associated with the output data of the second calculation instruction and, if not, determine that the calculation order is unordered calculation.

In a second aspect, a chip is provided that integrates the arithmetic device provided in the first aspect.

In a third aspect, an electronic device is provided that includes the chip provided in the second aspect.

It can be seen that the extended instruction provided by the embodiments of the present invention strengthens the function of an instruction by replacing what were originally multiple instructions with a single instruction. This reduces the number of instructions required for complex vector and matrix operations and simplifies the use of vector instructions. Compared with multiple instructions, no intermediate results need to be stored, which both saves storage space and avoids additional read and write overhead.

Description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention.

FIG. 1A is a schematic structural diagram of an arithmetic device provided by the present invention.

FIG. 1 is a flowchart of the method for implementing an extended instruction according to the present invention.

FIG. 2 is a schematic structural diagram of an arithmetic unit provided by the present invention.

FIG. 3A is a schematic structural diagram of a control unit provided by the present invention.

FIG. 3 is a schematic structural diagram of a neural network processor board provided by an embodiment of the present application.

FIG. 4 is a schematic structural diagram of a neural network chip packaging structure provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a neural network chip provided by an embodiment of the present application.

FIG. 6 is a schematic diagram of a neural network chip packaging structure provided by an embodiment of the present application.

FIG. 6A is a schematic diagram of another neural network chip packaging structure provided by an embodiment of the present application.

Components drawn with dashed lines in the drawings are optional.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first", "second", "third", and "fourth" in the description, the claims, and the drawings of the present invention are used to distinguish different objects, not to describe a specific order. The terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units. It optionally also includes steps or units that are not listed, or other steps or units inherent to the process, method, product, or device. A "/" in the text may mean "or".

A reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present invention. Appearances of the phrase in various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

The following takes a CPU as an example to describe how a vector dot product is computed. Function description: given vectors x and y of length n and a scalar r, the following vector-vector operation is performed:

$$r = \sum_{i=1}^{n} x_i y_i$$

For the vector dot product, the instruction may take the form "DOT TYPE,N,X,Y,R", where DOT denotes the vector dot product instruction, TYPE denotes the data type operated on (for example, real or complex), N denotes the vector length, X denotes the start address of vector x, Y denotes the start address of vector y, and R is the scalar. As this shows, the vector dot product instruction implements only one type of operation, namely the dot product itself. It cannot implement multiple operations; for example, it cannot perform both a vector dot product and a read of discrete data.
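As an illustration of what a conventional single-purpose instruction of this kind computes, the following minimal Python sketch models memory as a flat list and treats X and Y as start addresses, as in the text. The function name `dot` and the flat-list memory model are assumptions made for illustration, and only real-valued data is handled.

```python
def dot(memory, n, x_addr, y_addr):
    """Dot product of two length-n vectors stored contiguously in memory.

    Illustrative model of "DOT TYPE,N,X,Y,R" for real data only; the result
    is returned rather than written to a scalar register R.
    """
    return sum(memory[x_addr + i] * memory[y_addr + i] for i in range(n))


# x = [1, 2, 3] starts at address 0, y = [4, 5, 6] starts at address 3.
memory = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r = dot(memory, 3, 0, 3)
```

The call does one thing only: any scaling, transposition, or strided access would need additional instructions, which is the limitation the extended instruction addresses.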

As shown in FIG. 1A, the arithmetic device includes: a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional).

As shown in FIG. 2, the arithmetic unit 114 includes: a conversion circuit (optional), a vector addition circuit, a vector multiplication circuit, a magnitude comparison circuit, a nonlinear operation circuit, a vector-scalar multiplication circuit, and an activation circuit.

The arithmetic unit has a multi-stage pipeline structure. Specifically, as shown in FIG. 2, the first pipeline stage includes, but is not limited to, the vector multiplication circuit and the vector-scalar multiplication circuit.

The second pipeline stage includes, but is not limited to, a magnitude comparison calculator (for example, a comparator) and the vector addition circuit.

The third pipeline stage includes, but is not limited to, a nonlinear operation component (which may specifically be an activation circuit, a transcendental function calculation circuit, or the like).

If the arithmetic unit includes a conversion circuit, the conversion circuit may be in the first pipeline stage or in the third pipeline stage.

The output data of the first pipeline stage is the input data of the second pipeline stage, and the output data of the second pipeline stage is the input data of the third pipeline stage. The input of the first pipeline stage may be input data (for example, an input vector), and the output of the third pipeline stage may be the calculation result.
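The data flow between the three pipeline stages can be sketched in Python as follows. This is only a functional model under stated assumptions: each stage is represented by one of the circuits named in the text (vector-scalar multiplication, vector addition, activation), the activation is assumed to be a ReLU, and the function names are hypothetical. It does not model timing or overlap between stages.

```python
def stage1_vector_scalar_multiply(vec, scalar):
    """First stage: multiplication circuits (here, vector-scalar multiply)."""
    return [scalar * v for v in vec]


def stage2_vector_add(vec, other):
    """Second stage: addition/comparison circuits (here, vector add)."""
    return [a + b for a, b in zip(vec, other)]


def stage3_activation(vec):
    """Third stage: nonlinear circuits (ReLU is an assumed example)."""
    return [max(0.0, v) for v in vec]


def run_pipeline(x, scalar, bias):
    # The output of each stage is the input of the next one.
    s1 = stage1_vector_scalar_multiply(x, scalar)
    s2 = stage2_vector_add(s1, bias)
    return stage3_activation(s2)


result = run_pipeline([1.0, -2.0, 3.0], 2.0, [0.5, 0.5, 0.5])
```

Because intermediate values pass directly from stage to stage, nothing has to be written back to memory between the multiply, the add, and the activation.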

The present invention also provides an extended instruction that includes an opcode and an operation field. The opcode includes an identifier that identifies the first operation instruction (for example, ROT). The operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, the identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N.

Optionally, the extended instruction may further include a third calculation instruction and the input data of the third calculation instruction.

It should be noted that the calculation instructions above may be vector operation instructions or matrix instructions. The specific embodiments of the present invention do not limit the specific form of the calculation instructions.
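The fields listed above can be pictured as a simple record. The Python sketch below is one possible in-memory representation, not an encoding defined by the patent: the class name, field names, and the example values "SCALE" and "float32" are hypothetical, while "ROT" is the example opcode identifier given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExtendedInstruction:
    # Opcode: identifier of the first calculation instruction.
    opcode: str
    # Operation field.
    input_addr: int                 # input data address of the first instruction
    output_addr: int                # output data address of the first instruction
    second_op: str                  # identifier of the second calculation instruction
    second_input: Sequence[float]   # input data of the second calculation instruction
    data_type: str
    length: int                     # data length N
    # Optional third calculation instruction and its input data.
    third_op: Optional[str] = None
    third_input: Optional[Sequence[float]] = None


inst = ExtendedInstruction(
    opcode="ROT",
    input_addr=0,
    output_addr=64,
    second_op="SCALE",
    second_input=(2.0,),
    data_type="float32",
    length=16,
)
```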

The arithmetic device described above can be used to execute the extended instruction and may specifically include:

A memory configured to store vectors;

A control unit configured to acquire the extended instruction, parse the extended instruction to obtain a vector operation instruction and a second operation instruction, determine the calculation order of the vector operation instruction and the second operation instruction according to the two instructions, and read the input vector corresponding to the input vector address from the memory;

An arithmetic unit configured to execute the vector operation instruction and the second operation instruction on the input vector in the calculation order to obtain the result of the extended instruction.

A register unit configured to store extended instructions to be executed.

Optionally, as shown in FIG. 3A, the control unit 115 may include:

An instruction fetch module configured to fetch the extended instruction from the register unit;

A decoding module configured to decode the fetched extended instruction to obtain the vector operation instruction, the second operation instruction, and the calculation order;

An instruction queue configured to store the decoded vector operation instruction and second operation instruction in the calculation order.

The dependency processing unit 116 is configured to determine, before the control unit acquires an extended instruction, whether that extended instruction accesses the same vector as the previous extended instruction. If so, it waits until the previous extended instruction has finished executing and then provides the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit. Otherwise, it provides the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit directly.

The dependency processing unit 116 is further configured to store the current extended instruction in a storage queue when the current extended instruction accesses the same vector as the previous extended instruction and, after the previous extended instruction has finished executing, to provide the current extended instruction in the storage queue to the control unit.
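The behaviour of the dependency processing unit can be sketched in Python as follows. This is a simplified software model under stated assumptions: vectors are identified by hypothetical names rather than address ranges, only the immediately preceding instruction is tracked (as in the text), and the class and method names are invented for illustration.

```python
from collections import deque


class DependencyUnit:
    def __init__(self):
        self.in_flight = None    # vectors used by the previous, unfinished instruction
        self.waiting = deque()   # storage queue for instructions that must wait

    def submit(self, name, vectors):
        """Return True if the instruction is issued now, False if it is queued."""
        vectors = set(vectors)
        if self.in_flight is not None and self.in_flight & vectors:
            # Same vector as the previous instruction: park it in the queue.
            self.waiting.append((name, vectors))
            return False
        self.in_flight = vectors
        return True

    def complete(self):
        """Signal that the previous instruction finished; release a queued one."""
        self.in_flight = None
        if self.waiting:
            name, vectors = self.waiting.popleft()
            self.in_flight = vectors
            return name
        return None


unit = DependencyUnit()
first = unit.submit("inst0", {"x", "y"})    # issued immediately
second = unit.submit("inst1", {"y", "z"})   # shares "y" with inst0, so it waits
released = unit.complete()                  # inst0 done, inst1 is released
```

A hardware implementation would compare address ranges rather than names, and would need a policy for new instructions arriving while the queue is non-empty; the sketch leaves that out.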

Optionally, the memory is a scratchpad memory.

Referring to FIG. 1, FIG. 1 shows a method for implementing an extended instruction. The extended instruction in this method may include an opcode and an operation field, and the first operation instruction may be a vector operation instruction, for example AXPY. The opcode includes an identifier that identifies the first operation instruction (for example, AXPY). The operation field includes: the input data address of the first calculation instruction, the output data address of the first calculation instruction, the identifier of the second calculation instruction, the input data of the second calculation instruction, the data type, and the data length N (a user-defined value; the present invention does not limit the specific form of N). The method is executed by an arithmetic device or a computing chip; the arithmetic device is shown in FIG. 1A. As shown in FIG. 1, the method includes the following steps:

Step S101: the arithmetic device acquires an extended instruction and parses the extended instruction to obtain a first calculation instruction and a second calculation instruction.

Step S102: the arithmetic device determines a calculation order according to the first calculation instruction and the second calculation instruction, and executes the first calculation instruction and the second calculation instruction in that order to obtain the result of the extended instruction.
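Steps S101 and S102 can be sketched in Python as a parse step followed by ordered execution. Everything concrete here is an assumption for illustration: the dictionary layout, the opcodes "NEG" and "SCALE", and the flat-list memory are hypothetical, and the calculation order is passed in as an argument instead of being derived, since the rule for determining it is described in the surrounding text.

```python
# Hypothetical elementwise operations keyed by opcode.
OPS = {
    "NEG": lambda vec, _operand: [-v for v in vec],
    "SCALE": lambda vec, k: [k * v for v in vec],
}


def parse(extended):
    """S101: split the extended instruction into two calculation instructions."""
    first = {"op": extended["opcode"], "operand": None}
    second = {"op": extended["second_op"], "operand": extended["second_input"]}
    return first, second


def execute(extended, memory, order):
    """S102: run both instructions in the given order and store the result."""
    first, second = parse(extended)
    sequence = [second, first] if order == "reverse" else [first, second]
    n, src, dst = extended["length"], extended["input_addr"], extended["output_addr"]
    data = memory[src:src + n]
    for inst in sequence:
        data = OPS[inst["op"]](data, inst["operand"])
    memory[dst:dst + n] = data   # only the final result is written back
    return data


memory = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
ext = {"opcode": "NEG", "second_op": "SCALE", "second_input": 2.0,
       "input_addr": 0, "output_addr": 3, "length": 3}
out = execute(ext, memory, "forward")
```

The intermediate vector produced by the first operation stays in a local variable and is never stored in `memory`, which mirrors the saving in read and write overhead that the text claims for a single extended instruction.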

The technical solution provided by the present invention gives a method for implementing an extended instruction that allows the arithmetic device to perform the calculations of two calculation instructions for one extended instruction. A single extended instruction can thus implement two types of calculation, which reduces calculation overhead and power consumption.

Optionally, the calculation order may be any one of unordered calculation, forward-order calculation, or reverse-order calculation. In unordered calculation, there is no ordering requirement between the first calculation instruction and the second calculation instruction. In forward-order calculation, the first calculation instruction is executed first and the second calculation instruction afterwards. In reverse-order calculation, the second calculation instruction is executed first and the first calculation instruction afterwards.

The arithmetic device may determine the calculation order according to the first calculation instruction and the second calculation instruction as follows. The arithmetic device identifies whether the output data of the first calculation instruction is the same as the input data of the second calculation instruction and, if so, determines that the calculation order is forward-order calculation. Otherwise, the arithmetic device identifies whether the input data of the first calculation instruction is the same as the output data of the second calculation instruction and, if so, determines that the calculation order is reverse-order calculation. The arithmetic device also identifies whether the input data of the first calculation instruction is associated with the output data of the second calculation instruction and, if not, determines that the calculation order is unordered calculation.

A practical example illustrates this. For F=A*B+C, the first calculation instruction is a matrix multiplication instruction and the second calculation instruction is a matrix addition instruction. Because the matrix addition of the second calculation instruction needs the result (that is, the output data) of the first calculation instruction, the calculation order is determined to be forward-order calculation. As another example, for F=OP(A)*OP(B), the first operation instruction is a matrix multiplication instruction and the second operation instruction is a transformation, such as transposition or conjugation. Because the first operation instruction uses the output of the second operation instruction, the operation order is reverse-order calculation. If there is no such association, that is, the output data of the first calculation instruction differs from the input data of the second calculation instruction and the input data of the first calculation instruction also differs from the input data of the second calculation instruction, the two instructions are determined to be unassociated.
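The order-determination rule and the two examples above can be sketched in Python. Each instruction is described by the names of its input and output data; the dictionary layout, the data names (including the intermediate name "T"), and the use of set overlap as the test for "the same data" are assumptions made for illustration.

```python
def calculation_order(first, second):
    """Decide the calculation order of two instructions from their data names."""
    if first["outputs"] & second["inputs"]:
        return "forward"    # second consumes the result of first
    if first["inputs"] & second["outputs"]:
        return "reverse"    # first consumes the result of second
    return "unordered"      # no association between the two


# F = A*B + C: multiply first, then add.
mul = {"inputs": {"A", "B"}, "outputs": {"T"}}
add = {"inputs": {"T", "C"}, "outputs": {"F"}}
order_axb_plus_c = calculation_order(mul, add)

# F = OP(A)*OP(B): the transform (second) must run before the multiply (first).
mul_op = {"inputs": {"OP(A)", "OP(B)"}, "outputs": {"F"}}
transform = {"inputs": {"A", "B"}, "outputs": {"OP(A)", "OP(B)"}}
order_op = calculation_order(mul_op, transform)

# Two instructions on unrelated data.
order_none = calculation_order({"inputs": {"X"}, "outputs": {"Y"}},
                               {"inputs": {"P"}, "outputs": {"Q"}})
```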

The extension of vector instructions provided by the present invention strengthens the function of an instruction by replacing what were originally multiple instructions with a single instruction. This reduces the number of instructions required for complex vector and matrix operations and simplifies the use of vector instructions. Compared with multiple instructions, no intermediate results need to be stored, which both saves storage space and avoids additional read and write overhead.

If the first calculation instruction is a vector instruction, then for an input vector or matrix of the vector instruction, the instruction adds the function of scaling it: an operand representing the scaling coefficient is added to the operation field, and when the vector is read in, it is first scaled by the scaling coefficient (that is, the second calculation instruction is a scaling instruction). If the vector instruction multiplies several input vectors or matrices together, the scaling coefficients of those inputs can be merged into one.
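Why the scaling coefficients of multiplied inputs can be merged follows from linearity: (a·x)·(b·y) = (a·b)(x·y). The Python sketch below checks this for a dot product; the function names and the choice of a dot product as the multiplying operation are illustrative assumptions.

```python
def scaled_dot_separate(a, x, b, y):
    """Scale each input on read, then take the dot product."""
    xs = [a * v for v in x]
    ys = [b * v for v in y]
    return sum(p * q for p, q in zip(xs, ys))


def scaled_dot_merged(a, x, b, y):
    """Take the dot product once and apply a single merged coefficient a*b."""
    return (a * b) * sum(p * q for p, q in zip(x, y))


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
separate = scaled_dot_separate(2.0, x, 0.5, y)
merged = scaled_dot_merged(2.0, x, 0.5, y)
```

The merged form needs one scalar multiplication instead of 2n elementwise ones. With general floating-point data the two results can differ by rounding, so the equality is exact only for values like those used here.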

If the first calculation instruction is a vector instruction, then for an input matrix of the vector instruction, the instruction adds the ability to transpose it (that is, the second calculation instruction is a transposition instruction). An operand is added to the instruction to indicate whether the matrix is to be transposed before the operation.

If the first calculation instruction is a vector instruction, then for an output vector or matrix of the vector instruction, the instruction adds the ability to add the result to the original output vector or matrix (that is, the second calculation instruction is an addition instruction). A coefficient for scaling the original output vector or matrix is added to the instruction (that is, a third calculation instruction is added, which may be a scaling instruction). The instruction then indicates that, after the vector or matrix operation is completed, the result is added to the scaled original output to form the new output.
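As an illustration of this accumulate-into-output behaviour, the sketch below applies it to a matrix-vector product, computing y := A*x + beta*y. The choice of a matrix-vector product, the name `beta`, and the list-of-rows matrix representation are assumptions made for the example only.

```python
def matvec_accumulate(A, x, beta, y):
    """Return A*x + beta*y, where y holds the original output vector."""
    result = []
    for row, y_old in zip(A, y):
        acc = sum(a_ij * x_j for a_ij, x_j in zip(row, x))  # one entry of A*x
        result.append(acc + beta * y_old)  # add the scaled original output
    return result
```

Setting `beta` to 0 discards the original output, and setting it to 1 gives a plain accumulation, so a single instruction form covers both cases.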

If the first calculation instruction is a vector instruction, then for an input vector of the vector instruction, the instruction adds the ability to read it at a fixed stride. An operand representing the read stride of the input vector is added to the instruction (that is, the second calculation instruction reads the vector at a fixed stride); it gives the difference between the addresses of two adjacent elements of the vector.

If the first calculation instruction is a vector instruction, then for a result vector of the vector instruction, the instruction adds the ability to write the result at a fixed stride (that is, the second calculation instruction writes the vector at a fixed stride). An operand representing the stride of the result vector is added to the instruction; it gives the difference between the addresses of two adjacent elements of the vector. If a vector is both an input and a result, the same stride is used when it is read as an input and when it is written as a result.
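Strided access can be sketched as follows, modelling memory as a flat Python list. The helper names are hypothetical, and addresses and strides are counted in elements rather than bytes, which is an assumption of this sketch.

```python
def read_strided(memory, base, n, stride):
    """Read n elements starting at base, with adjacent elements stride apart."""
    return [memory[base + i * stride] for i in range(n)]


def write_strided(memory, base, values, stride):
    """Write values starting at base, with adjacent elements stride apart."""
    for i, v in enumerate(values):
        memory[base + i * stride] = v
```

For example, with `memory = list(range(10))`, `read_strided(memory, 1, 3, 3)` picks the elements at addresses 1, 4 and 7 without first copying them into a contiguous buffer.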

If the first calculation instruction is a vector instruction, then for an input matrix of the vector instruction, the instruction adds the ability to read its row or column vectors at a fixed stride (that is, the second calculation instruction reads multiple vectors at a fixed stride). An operand representing the matrix read stride is added to the instruction; it gives the difference between the start addresses of successive row or column vectors of the matrix.

If the first calculation instruction is a vector instruction, then for a result matrix of the vector instruction, the instruction adds the ability to write its row or column vectors at a fixed stride (that is, the second calculation instruction writes multiple vectors at a fixed stride). An operand representing the matrix stride is added to the instruction; it gives the difference between the start addresses of successive row or column vectors of the matrix. If a matrix is both an input and a result matrix, the same stride is used when it is read as an input and when it is written as a result.
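A matrix stride of this kind lets an instruction address a sub-block of a larger matrix in place. The sketch below reads the rows of such a block from a flat list; the function name and the element-based addressing are assumptions for illustration.

```python
def read_matrix_rows(memory, base, rows, cols, row_stride):
    """Read a rows x cols block whose row start addresses are row_stride apart."""
    block = []
    for r in range(rows):
        start = base + r * row_stride        # start address of row r
        block.append(memory[start:start + cols])
    return block
```

With `memory = list(range(12))` viewed as a 3 x 4 row-major matrix, `read_matrix_rows(memory, 1, 3, 2, 4)` extracts the 3 x 2 block made of columns 1 and 2.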

The actual structure of the extended instructions described above is illustrated below using several concrete extended instructions.

Vector multiply-add

Computes the product of a vector and a scalar and adds the result to another vector.

Function description:

Given vectors x and y and a scalar a, the following vector-vector operation is performed:

y := a*x + y

The instruction format is shown in Table 1-1:

Table 1-1:

Figure BDA0001490454140000081

Figure BDA0001490454140000091

In the instruction format shown in Table 1-1, the vector length is variable, which reduces the number of instructions and simplifies their use.

Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.
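The semantics of the vector multiply-add instruction can be sketched in Python as below. Because Table 1-1 is available here only as an image, the parameter names (`n`, `x_base`, `incx`, `y_base`, `incy`) are hypothetical stand-ins for the length, address, and stride operands, and memory is modelled as a flat list indexed in elements.

```python
def vector_multiply_add(n, a, mem, x_base, incx, y_base, incy):
    """Perform y := a*x + y in place on two strided vectors stored in mem."""
    for i in range(n):
        mem[y_base + i * incy] += a * mem[x_base + i * incx]


# x = [1, 2, 3] stored contiguously; y = [10, 20, 30] stored with stride 2.
mem = [1, 2, 3, 10, 0, 20, 0, 30, 0]
vector_multiply_add(3, 2, mem, x_base=0, incx=1, y_base=3, incy=2)
```

After the call, the three elements of y hold 12, 24 and 36, while the padding elements between them are untouched; no contiguous copy of y is created.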

Vector dot product

Computes the dot product of two vectors.

Function description: given vectors x and y of length n and a scalar r, the following vector-vector operation is performed:

Figure BDA0001490454140000092

The instruction format is shown in Table 1-2:

Table 1-2:

Figure BDA0001490454140000093

In the instruction format shown in Table 1-2, the vector length is variable, which reduces the number of instructions and simplifies their use.

Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.
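A minimal sketch of the dot product on strided operands follows, assuming the standard definition r = x_1*y_1 + ... + x_n*y_n. The parameter names are hypothetical, since Table 1-2 is only available as an image, and strides are counted in elements.

```python
def vector_dot(n, mem, x_base, incx, y_base, incy):
    """Return the dot product of two strided vectors of length n in mem."""
    return sum(mem[x_base + i * incx] * mem[y_base + i * incy]
               for i in range(n))
```

With `mem = [1, 2, 3, 4, 5, 6]`, the call `vector_dot(3, mem, 0, 2, 1, 2)` treats the interleaved elements (1, 3, 5) and (2, 4, 6) as the two operands, which is the kind of interval-stored layout the instruction is said to support.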

Vector norm

Computes the Euclidean norm of a vector.

Function description:

This instruction performs the following vector reduction operation:

Figure BDA0001490454140000101

The instruction format is shown in Table 1-3:

Table 1-3:

Figure BDA0001490454140000102

In the instruction format shown in Table 1-3, the vector length is variable, which reduces the number of instructions and simplifies their use. Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.
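The reduction can be sketched as follows, assuming the usual definition of the Euclidean norm (the square root of the sum of squared elements). The parameter names are hypothetical, and the sketch makes no attempt at the overflow-safe scaling a production implementation might use.

```python
import math


def vector_norm2(n, mem, x_base, incx):
    """Return the Euclidean norm of a strided vector of length n in mem."""
    return math.sqrt(sum(mem[x_base + i * incx] ** 2 for i in range(n)))
```

For instance, the vector (3, 4) has norm 5 whether it is stored contiguously or with a stride of 2.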

Vector sum

Computes the sum of all elements of a vector.

Function description:

This instruction performs the following vector reduction operation:

Figure BDA0001490454140000103

The instruction format is shown in Table 1-4:

Table 1-4:

Figure BDA0001490454140000104

Figure BDA0001490454140000111

In the instruction format shown in Table 1-4, the vector length is variable, which reduces the number of instructions and simplifies their use.

Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.

Vector maximum

Finds the position of the largest element among all elements of a vector.

Function description:

For a vector x of length n, this instruction writes the position of the largest element of the vector into the scalar i.

The instruction format is shown in Table 1-5:

Table 1-5:

Figure BDA0001490454140000112

In the instruction format shown in Table 1-5, the vector length is variable, which reduces the number of instructions and simplifies their use. Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.
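A sketch of the position search is shown below. The text does not state how ties are broken or whether positions are zero- or one-based; this sketch assumes zero-based positions and returns the first occurrence of the maximum. The parameter names are hypothetical.

```python
def vector_max_index(n, mem, x_base, incx):
    """Return the zero-based position of the largest of n strided elements."""
    best_i = 0
    best_v = mem[x_base]
    for i in range(1, n):
        v = mem[x_base + i * incx]
        if v > best_v:          # strict comparison keeps the first maximum
            best_i, best_v = i, v
    return best_i
```

The returned position is relative to the logical vector, not to the underlying memory, so a strided vector reports the same position as its contiguous equivalent.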

Vector minimum

Finds the position of the smallest element among all elements of a vector.

Function description:

For a vector x of length n, this instruction writes the position of the smallest element of the vector into the scalar i.

The instruction format is shown in Table 1-6:

Table 1-6:

Figure BDA0001490454140000121

In the instruction format shown in Table 1-6, the vector length is variable, which reduces the number of instructions and simplifies their use. Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.

Vector outer product

Computes the tensor product (outer product) of two vectors.

Function description:

This instruction performs the following matrix-vector operation:

A := α*x*y^T + A

The instruction format is shown in Table 1-7:

Table 1-7:

Figure BDA0001490454140000122

Figure BDA0001490454140000131

In the instruction format shown in Table 1-7, the scalar alpha scales the result matrix, which makes the instruction more flexible and avoids the extra overhead of scaling with a separate scaling instruction. The sizes of the vectors and the matrix are variable, which reduces the number of instructions and simplifies their use. Matrices in different storage formats (row-major and column-major) can be processed, which avoids the overhead of converting the matrix. Vector formats stored at a fixed interval are supported, which avoids the execution overhead of converting the vector format and the storage occupied by intermediate results.
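The update A := α*x*y^T + A can be sketched for both storage orders as follows. The parameter names, the `lda` stride operand, and the flat-list memory model are assumptions for illustration, since the operand list in Table 1-7 is only available as an image.

```python
def outer_product_update(alpha, x, y, A_flat, m, n, lda, row_major=True):
    """Add alpha * x * y^T to the m x n matrix stored in A_flat, in place.

    lda is the distance between the starts of successive rows (row-major)
    or successive columns (column-major), counted in elements.
    """
    for i in range(m):
        for j in range(n):
            idx = i * lda + j if row_major else j * lda + i
            A_flat[idx] += alpha * x[i] * y[j]
```

The same loop handles both layouts by changing only the index computation, so no transposed copy of A is needed.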

For the arithmetic device shown in FIG. 1, the specific structure of the extended instruction is worked out when the extended instruction is executed; that is, executing one extended instruction realizes the combined execution of multiple calculation instructions. It should be noted that when the arithmetic device executes the extended instruction, it does not split the extended instruction into multiple calculation instructions.

Please refer to FIG. 3, which is a schematic structural diagram of a neural network processor board provided by an embodiment of the present application. As shown in FIG. 3, the neural network processor board 10 includes a neural network chip package structure 11, a first electrical and non-electrical connection device 12, and a first substrate 13.

The present application does not limit the specific structure of the neural network chip package structure 11. Optionally, as shown in FIG. 4, the neural network chip package structure 11 includes a neural network chip 111, a second electrical and non-electrical connection device 112, and a second substrate 113.

The specific form of the neural network chip 111 is not limited in the present application. The neural network chip 111 includes, but is not limited to, a neural network die that integrates a neural network processor; the die may be made of silicon, germanium, quantum materials, molecular materials, or the like. Depending on the actual situation (for example, a harsh environment) and different application requirements, the neural network die may be packaged so that most of the die is enclosed, with the pins on the die connected to the outside of the package structure through conductors such as gold wires for circuit connection with outer layers.

The present application does not limit the specific structure of the neural network chip 111. Optionally, refer to FIG. 1A, which is a schematic structural diagram of a computing device within a neural network chip provided by an embodiment of the present application. As shown in FIG. 1A, the computing device includes a memory 111, a register 112 (optional), an arithmetic unit 114, a control unit 115, and a dependency processing unit 116 (optional). For the specific functions or structures of these units, see the embodiment shown in FIG. 1A.

The present application does not limit the types of the first substrate 13 and the second substrate 113, which may be a printed circuit board (PCB) or printed wiring board (PWB), or another type of circuit board. The material from which the PCB is made is also not limited.

The second substrate 113 in the present application carries the neural network chip 111. The neural network chip package structure 11, obtained by connecting the neural network chip 111 and the second substrate 113 through the second electrical and non-electrical connection device 112, protects the neural network chip 111 and facilitates further packaging of the neural network chip package structure 11 with the first substrate 13.

The packaging method of the second electrical and non-electrical connection device 112, and the structure corresponding to that packaging method, are not limited. A suitable packaging method may be selected, and simply adapted, according to the actual situation and different application requirements, for example: flip chip ball grid array package (Flip Chip Ball Grid Array Package, FCBGAP), low-profile quad flat package (Low-profile Quad Flat Package, LQFP), quad flat package with heat sink (Quad Flat Package with Heat sink, HQFP), quad flat non-lead package (Quad Flat Non-lead Package, QFN), or fine-pitch ball grid array package (Fine-pitch Ball Grid Array, FBGA).

Flip chip (Flip Chip) is suitable when the packaged area must be small or when the design is sensitive to wire inductance and signal transmission time. Alternatively, wire bonding (Wire Bonding) may be used to reduce cost and increase the flexibility of the package structure.

A ball grid array (Ball Grid Array) provides more pins with a short average lead length and supports high-speed signal transmission. The package may alternatively use a pin grid array (Pin Grid Array, PGA), zero insertion force (Zero Insertion Force, ZIF), single edge contact connection (Single Edge Contact Connection, SECC), land grid array (Land Grid Array, LGA), or the like.

Optionally, the neural network chip 111 and the second substrate 113 are packaged using a flip chip ball grid array (Flip Chip Ball Grid Array). A schematic diagram of such a neural network chip package structure is shown in FIG. 6. As shown in FIG. 6, the neural network chip package structure includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, and pins 26.

The pads 22 are connected to the neural network chip 21. Solder balls 23 are formed by soldering between the pads 22 and the connection points 25 on the second substrate 24, connecting the neural network chip 21 to the second substrate 24 and thereby packaging the neural network chip 21.

The pins 26 connect to circuitry external to the package structure (for example, the first substrate 13 on the neural network processor board 10), enabling the transfer of external and internal data so that the neural network chip 21, or the neural network processor corresponding to it, can process the data. The type and number of pins are also not limited in the present application; different pin forms may be chosen for different packaging technologies and arranged according to certain rules.

Optionally, the neural network chip package structure further includes an insulating filler placed in the gaps between the pads 22, the solder balls 23, and the connection points 25 to prevent interference between solder balls.

The insulating filler may be silicon nitride, silicon oxide, or silicon oxynitride; the interference includes electromagnetic interference, inductive interference, and the like.

Optionally, the neural network chip package structure further includes a heat dissipation device for dissipating the heat generated when the neural network chip 21 operates. The heat dissipation device may be a metal sheet with good thermal conductivity, a heat-dissipating fin, or a heat sink, for example, a fan.

For example, as shown in FIG. 6A, the neural network chip package structure 11 includes a neural network chip 21, pads 22, solder balls 23, a second substrate 24, connection points 25 on the second substrate 24, pins 26, an insulating filler 27, thermal paste 28, and a metal-housing heat sink 29. The thermal paste 28 and the metal-housing heat sink 29 dissipate the heat generated when the neural network chip 21 operates.

Optionally, the neural network chip package structure 11 further includes a reinforcing structure that is connected to the pads 22 and embedded in the solder balls 23 to strengthen the connection between the solder balls 23 and the pads 22.

The reinforcing structure may be a metal wire structure or a columnar structure, which is not limited here.

The present application also does not limit the specific form of the first electrical and non-electrical connection device 12; see the description of the second electrical and non-electrical connection device 112. That is, the neural network chip package structure 11 may be packaged by soldering, or the second substrate 113 and the first substrate 13 may be connected by connecting wires or by a plug-in connection, which makes it easy to replace the first substrate 13 or the neural network chip package structure 11 later.

Optionally, the first substrate 13 includes an interface for memory units used to expand storage capacity, for example synchronous dynamic random access memory (Synchronous Dynamic Random Access Memory, SDRAM) or double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR). Expanding the memory improves the processing capability of the neural network processor.

The first substrate 13 may further include a Peripheral Component Interconnect Express (PCI-E or PCIe) interface, a small form-factor pluggable (Small Form-factor Pluggable, SFP) interface, an Ethernet interface, a controller area network (Controller Area Network, CAN) bus interface, and so on, for data transmission between the package structure and external circuits, which can improve operation speed and ease of operation.

The neural network processor is packaged as the neural network chip 111, the neural network chip 111 is packaged as the neural network chip package structure 11, and the neural network chip package structure 11 is packaged as the neural network processor board 10. The board exchanges data with an external circuit (for example, a computer motherboard) through an interface on the board (a slot or a ferrule); that is, the function of the neural network processor is realized directly by using the neural network processor board 10, and the neural network chip 111 is protected. Other modules may also be added to the neural network processor board 10, which widens the application range and improves the operating efficiency of the neural network processor.

In one embodiment, the present disclosure provides an electronic device that includes the neural network processor board 10 or the neural network chip package structure 11 described above.

The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, smart terminal, mobile phone, dashboard camera, navigator, sensor, webcam, server, camera, video camera, projector, watch, earphones, mobile storage, wearable device, vehicle, household appliance, and/or medical device.

The vehicle includes an airplane, a ship, and/or a road vehicle; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, or range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.

It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and that the actions and modules involved are not necessarily required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

The embodiments of the present invention have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. An arithmetic device for performing an operation in accordance with an extended instruction, the arithmetic device comprising: a memory, an arithmetic unit and a control unit;
the extended instruction includes: an opcode and an operation domain, the opcode comprising: an identifier that identifies a vector calculation instruction; the operation domain includes: an input vector address of the vector calculation instruction, an output vector address of the vector calculation instruction, an identifier of a second calculation instruction, input data of the second calculation instruction, a data type and a data length N;
the memory is used for storing vectors;
the control unit is used for acquiring the extended instruction, parsing the extended instruction to obtain a vector operation instruction and a second operation instruction, determining the calculation order of the vector operation instruction and the second operation instruction according to the vector operation instruction and the second operation instruction, and reading an input vector corresponding to the input vector address from the memory;
the control unit is specifically configured to identify whether output data of the vector operation instruction is the same as input data of the second calculation instruction, and if so, determine that the calculation order is forward-order calculation; identify whether the input data of the vector operation instruction is the same as the output data of the second calculation instruction, and if so, determine that the calculation order is reverse-order calculation; and identify whether the input data of the vector operation instruction is associated with the output data of the second calculation instruction, and if not, determine that the calculation order is unordered calculation;
the arithmetic unit is used for executing the vector operation instruction and the second operation instruction on the input vector according to the calculation order to obtain the result of the extended instruction, wherein: when the calculation order is forward-order calculation, the vector operation instruction is executed first and then the second operation instruction; when the calculation order is reverse-order calculation, the second operation instruction is executed first and then the vector operation instruction; and when the calculation order is unordered calculation, no particular order is required between the vector operation instruction and the second operation instruction.
2. The arithmetic device of claim 1, further comprising:
and the register unit is used for storing the extended instruction to be executed.
3. The arithmetic device according to claim 2, wherein the control unit includes:
the instruction fetching module is used for acquiring an extended instruction from the register unit;
the decoding module is used for decoding the obtained extended instruction to obtain a vector operation instruction, a second operation instruction and a calculation sequence;
and the instruction queue is used for storing the decoded vector operation instruction and the second operation instruction according to the calculation sequence.
4. The arithmetic device of claim 3, further comprising:
the dependency relationship processing unit is used for judging, before the control unit acquires the extended instruction, whether the extended instruction and a previous extended instruction access the same vector; if so, providing the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit after the previous extended instruction has finished executing; otherwise, providing the vector operation instruction and the second operation instruction of the current extended instruction to the arithmetic unit.
5. The arithmetic device of claim 4, wherein the dependency relationship processing unit is configured to store a current extended instruction in a store queue when the current extended instruction and a previous extended instruction access the same vector, and to provide the current extended instruction in the store queue to the control unit after the previous extended instruction has finished executing.
6. The arithmetic device of any of claims 1-5, wherein the memory is a scratchpad memory.
7. The arithmetic device of claim 1, wherein the arithmetic unit comprises a vector addition circuit, a vector multiplication circuit, a magnitude comparison circuit, a non-linear operation circuit, and a vector scalar multiplication circuit.
8. The arithmetic device according to claim 7, wherein the arithmetic unit has a multi-stage pipeline structure, wherein the vector multiplication circuit and the vector scalar multiplication circuit are in a first pipeline stage, the magnitude comparison circuit and the vector addition circuit are in a second pipeline stage, and the non-linear operation circuit is in a third pipeline stage, and wherein output data of the first pipeline stage is input data of the second pipeline stage, and output data of the second pipeline stage is input data of the third pipeline stage.
9. The arithmetic device according to claim 8, wherein the arithmetic unit further comprises a conversion circuit, the conversion circuit being located in the first pipeline stage and the third pipeline stage, or in the first pipeline stage, or in the third pipeline stage.
10. A chip incorporating an arithmetic device as claimed in any one of claims 1 to 9.
11. An electronic device, characterized in that the electronic device comprises a chip according to claim 10.
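The three calculation orders of claim 1 and the same-vector dependency check of claim 4 can be illustrated with a minimal Python sketch. This is not the claimed hardware and is not taken from the patent text: the names Order, ExtendedInstruction, execute, must_wait, and vector_id are hypothetical, and the two operations are modelled as plain functions on a list. For the out-of-order case the sketch simply reuses the forward order, which is only a faithful stand-in when the two operations are independent of each other.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

Vector = List[float]


class Order(Enum):
    FORWARD = 1    # vector operation first, then second operation
    REVERSE = 2    # second operation first, then vector operation
    UNORDERED = 3  # no ordering requirement between the two


@dataclass
class ExtendedInstruction:
    vector_id: str                         # hypothetical label of the vector accessed
    vector_op: Callable[[Vector], Vector]  # stands in for the vector operation instruction
    second_op: Callable[[Vector], Vector]  # stands in for the second operation instruction
    order: Order


def execute(instr: ExtendedInstruction, x: Vector) -> Vector:
    """Apply both operations to x in the order named by instr.order."""
    if instr.order is Order.REVERSE:
        steps = [instr.second_op, instr.vector_op]
    else:
        # FORWARD requires this order; UNORDERED permits any order,
        # so the sketch arbitrarily reuses it (an assumption).
        steps = [instr.vector_op, instr.second_op]
    for step in steps:
        x = step(x)
    return x


def must_wait(current: ExtendedInstruction,
              previous: Optional[ExtendedInstruction]) -> bool:
    """True if current accesses the same vector as the previous instruction."""
    return previous is not None and current.vector_id == previous.vector_id


def scale2(v: Vector) -> Vector:
    return [2.0 * e for e in v]


def add1(v: Vector) -> Vector:
    return [e + 1.0 for e in v]


fwd = ExtendedInstruction("v0", scale2, add1, Order.FORWARD)
rev = ExtendedInstruction("v0", scale2, add1, Order.REVERSE)
other = ExtendedInstruction("v1", scale2, add1, Order.FORWARD)

print(execute(fwd, [1.0, 2.0, 3.0]))   # [3.0, 5.0, 7.0]
print(execute(rev, [1.0, 2.0, 3.0]))   # [4.0, 6.0, 8.0]
print(must_wait(rev, fwd), must_wait(other, fwd))  # True False
```

In this sketch an instruction for which must_wait returns True would be held back until the previous instruction completes, which mirrors the storage queue of claim 5 only loosely.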
CN201711244055.7A 2017-11-30 2017-11-30 Arithmetic device and related product Active CN107861757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711244055.7A CN107861757B (en) 2017-11-30 2017-11-30 Arithmetic device and related product

Publications (2)

Publication Number Publication Date
CN107861757A CN107861757A (en) 2018-03-30
CN107861757B CN107861757B (en) 2020-08-25

Family

ID=61704370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711244055.7A Active CN107861757B (en) 2017-11-30 2017-11-30 Arithmetic device and related product

Country Status (1)

Country Link
CN (1) CN107861757B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388446A (en) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Computing module and method
CN110413561B (en) * 2018-04-28 2021-03-30 中科寒武纪科技股份有限公司 Data acceleration processing system
US11995556B2 (en) 2018-05-18 2024-05-28 Cambricon Technologies Corporation Limited Video retrieval method, and method and apparatus for generating video retrieval mapping relationship
CN110147872B (en) * 2018-05-18 2020-07-17 中科寒武纪科技股份有限公司 Code storage device and method, processor and training method
CN109032670B (en) * 2018-08-08 2021-10-19 上海寒武纪信息科技有限公司 Neural network processing device and method for executing vector copy instructions
CN110929855B (en) * 2018-09-20 2023-12-12 合肥君正科技有限公司 Data interaction method and device
CN110096310B (en) * 2018-11-14 2021-09-03 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111290788B (en) * 2018-12-07 2022-05-31 上海寒武纪信息科技有限公司 Computing method, apparatus, computer equipment and storage medium
CN111061507A (en) * 2018-10-16 2020-04-24 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN111275197B (en) * 2018-12-05 2023-11-10 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium
CN111353124A (en) * 2018-12-20 2020-06-30 上海寒武纪信息科技有限公司 Computing method, apparatus, computer equipment and storage medium
CN111290789B (en) * 2018-12-06 2022-05-27 上海寒武纪信息科技有限公司 Operation method, operation device, computer equipment and storage medium
CN109711539B (en) * 2018-12-17 2020-05-29 中科寒武纪科技股份有限公司 Operation method, device and related product
US11841822B2 (en) 2019-04-27 2023-12-12 Cambricon Technologies Corporation Limited Fractal calculating device and method, integrated circuit and board card
CN111860797B (en) * 2019-04-27 2023-05-02 中科寒武纪科技股份有限公司 Arithmetic device
WO2020220935A1 (en) 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 Operation apparatus
CN114677549B (en) * 2020-12-24 2025-09-02 安徽寒武纪信息科技有限公司 Method, electronic device and storage medium for reducing multidimensional vectors
CN119271570A (en) * 2022-09-13 2025-01-07 华为技术有限公司 A data computing method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5337395A (en) * 1991-04-08 1994-08-09 International Business Machines Corporation SPIN: a sequential pipeline neurocomputer
US6304963B1 (en) * 1998-05-14 2001-10-16 Arm Limited Handling exceptions occuring during processing of vector instructions
CN1349159A (en) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 Microprocessor vector processing method
CN105359052A (en) * 2012-12-28 2016-02-24 英特尔公司 Method and apparatus for integral image calculation instruction
CN106990940A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A vector computing device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008070250A2 (en) * 2006-09-26 2008-06-12 Sandbridge Technologies Inc. Software implementation of matrix inversion in a wireless communication system
US9250916B2 (en) * 2013-03-12 2016-02-02 International Business Machines Corporation Chaining between exposed vector pipelines

Also Published As

Publication number Publication date
CN107861757A (en) 2018-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant