[go: up one dir, main page]

CN115617396A - Register allocation method and device applied to novel artificial intelligence processor - Google Patents

Register allocation method and device applied to novel artificial intelligence processor Download PDF

Info

Publication number
CN115617396A
CN115617396A CN202211225754.8A CN202211225754A CN115617396A CN 115617396 A CN115617396 A CN 115617396A CN 202211225754 A CN202211225754 A CN 202211225754A CN 115617396 A CN115617396 A CN 115617396A
Authority
CN
China
Prior art keywords
register
block
virtual
instruction
registers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211225754.8A
Other languages
Chinese (zh)
Other versions
CN115617396B (en
Inventor
官孝峰
周浩
许志鹏
张忠军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Suiyuan Technology Co ltd
Original Assignee
Shanghai Enflame Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Enflame Technology Co ltd filed Critical Shanghai Enflame Technology Co ltd
Priority to CN202211225754.8A priority Critical patent/CN115617396B/en
Publication of CN115617396A publication Critical patent/CN115617396A/en
Application granted granted Critical
Publication of CN115617396B publication Critical patent/CN115617396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a register allocation method and a register allocation device applied to a novel artificial intelligence processor, wherein the register allocation method comprises the following steps: before the compiler instruction is scheduled, according to the grouping limit of each virtual register in the target chip in the instruction, constructing a block internal block group diagram, and analyzing the register group relation according to the block internal block group diagram so as to split the grouping of the registers; after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and the register is subjected to block assignment according to the register block conflict graph; when the registers are allocated, the registers are allocated and assigned according to the grouping limit of the registers, the block assignment and the life conflict limit of the registers. The technical scheme of the embodiment of the invention can ensure the problem of program functionality under the design of a two-stage hardware register structure of a partitioning-grouping artificial intelligence chip, and meets the load balancing requirement of register resources while ensuring the accuracy of the register distribution result.

Description

Register allocation method and device applied to novel artificial intelligence processor
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a register allocation method and a register allocation device applied to a novel artificial intelligence processor.
Background
In the design of a chip integrated circuit, a large Register File (Register File) may be implemented as a Static Random-Access Memory (SRAM), and such a large Register File is usually implemented as a multi-partitioned (multi-Banked) structure due to the limitation of various factors such as complexity, power consumption, clock cycle, etc. in the processor design.
In a multi-block architecture, each register block only contains a part of the registers, and since each register block only contains one Read Port (Read Port), one Write Port (Write Port), or a single Read/Write Port (Read/Write Port), a block collision may occur when an instruction reads and writes operands from multiple register blocks in the same clock cycle. Fig. 1 is a schematic diagram of an instruction under a two-partition design of a register in the prior art, and as shown in fig. 1, when the register adopts the two-partition design, a partition conflict may occur when a program instruction is executed, for instruction 0, its operands v0 and v1 are respectively from partition 0 and partition 1, and for instruction 1, its operands v0 and v2 are both from partition 0, thereby causing a partition conflict.
In the existing method, when the block conflict of the register is processed, a hardware or software control mode can be adopted. The simplest hardware processing method is to increase the design of hardware control logic, that is, to ensure the storage and reading of the Register block conflict period by using pipeline stall (pipeline stall), and in some GPU designs, the probability of Register block conflict or the overhead caused by Register block conflict is reduced by using Register Buffering (Buffering) or Cache (RFC) and the like, thereby ensuring the pipeline performance. However, both of the above methods increase the complexity of the hardware design process, and further affect the performance of the chip in terms of area, clock, and power consumption.
In contrast, some register designs do not generally consider handling register conflicts in hardware, but rather avoid register conflicts that may exist in instructions entirely by way of software. The software mode is that in the process of compiling a program source code, register partitioning Conflict is abstractly described into a Register Conflict Graph (RCG), the Register partitioning is designated by carrying out Graph coloring on the Register Conflict Graph, and the effect of Register allocation under the limit of Register partitioning is completed by combining the existing Register allocation method.
However, in such register design, hardware cannot guarantee register partitioning conflict, and the correctness of instruction execution is guaranteed by the register allocation process of compiler software. So the final executable file generated by compilation must eventually have operands for each instruction originating from the same register block. This can lead to two significant problems:
the first problem is that: fig. 2 is a schematic diagram of an instruction under another two-partition design of a Register in the prior art, where for the instruction in fig. 2, since a Register conflict graph cannot be colored two, when the total number of registers is 2, in order to avoid Register partition conflict, a compiler can only generate a Register Spill (Register Spill) instruction to guarantee functions, and the Register Spill instruction Spill causes a significant performance loss in many cases;
the second problem is that: in order to fully realize the register allocation process of the compiler, no register block conflict exists in a program, and generally unacceptable compiling time overhead exists in the prior art. It has been proved by research that finding a bipartite graph with the least overhead of register overflow in a register conflict graph is generally an NP-complete problem, and thus when the problem is complex in size, there may be a possibility that the compiling overhead may not be acceptable.
In the design of the existing novel artificial intelligence chip, the design of a register is further simplified usually in consideration of hardware factors such as reduction of power consumption and control area. In particular, by omitting cross-wires (Crossbar) in the register design, the design complexity of the register module can be further reduced. Fig. 3 is a schematic structural diagram of register blocks in the existing novel artificial intelligence chip, and as shown in fig. 3, each register block is connected with an arithmetic logic unit through a cross-connect wire. Fig. 4 is a schematic diagram of a two-stage register block-group structure in a conventional novel artificial intelligence chip design, and as shown in fig. 4, each register block includes a plurality of groups, and each group is directly connected to an arithmetic logic unit. Thus, for one arithmetic logic unit, it can only operate on register data within the same group of different blocks.
Aiming at the design of the chip, the complexity of hardware design can be reduced by reducing cross connection lines, the area of the hardware is favorably reduced, and the running frequency of the hardware is improved. But relatively, this design also brings more register use restrictions, i.e. the register read and write in the same clock cycle needs to satisfy the use restrictions of the register grouping (registers in the same grouping must be used) in addition to the register block restrictions (registers in different blocks must be used). When the register block has a problem, hardware only guarantees the functions of the basic registers by a pipeline pause method, and register grouping limitation needs to be guaranteed by compiler software. Moreover, in the existing disclosed chip design, there is no similar hardware design to implement, and the register allocation method based on the register block structure design has no guarantee of the conflict of the hardware to the register block, and does not consider the two-stage structure of block-grouping adopted in the hardware, so the related method is not suitable for the register allocation of the novel artificial intelligence chip.
Disclosure of Invention
The embodiment of the invention provides a register allocation method and a register allocation device applied to a novel artificial intelligence processor, which can ensure the problem of program functionality under the structural design of a novel artificial intelligence chip 'block-group' two-stage hardware register, and can realize the register allocation in a plurality of register blocks in a balanced manner while ensuring that the compiling result meets the functional correctness, thereby realizing the simplification of the hardware register design on the basis of software and hardware collaborative design.
In a first aspect, an embodiment of the present invention provides a register allocation method under a "block-and-packet" two-stage hardware register structure, where the method includes:
before the compiler instruction is scheduled, according to the grouping limit of each virtual register in the target chip in the instruction, constructing a block internal block group diagram, and analyzing the virtual register block group relation according to the block internal block group diagram so as to split the grouping of the virtual registers;
after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and according to the register block conflict graph, the virtual register is subjected to block assignment;
when the register is distributed, each virtual register is distributed and specified according to the grouping limit, the block specification and the life conflict limit of the register.
Optionally, constructing the intra-block node group diagram according to the grouping restriction of each virtual register in the target chip in the instruction, including:
and obtaining the operand of the virtual register read in each instruction, and connecting all the virtual registers corresponding to each instruction to obtain the block internal block group diagram corresponding to each virtual register.
Optionally, analyzing the virtual register group relationship according to the intra-block node group diagram to split the groups of virtual registers, including:
obtaining a plurality of knot groups and target registers in the block knot group diagram;
before all instructions taking a target register as a source operand, inserting a copy instruction corresponding to the target register, and modifying the instructions according to the copy instruction;
and updating the block internal node group diagram according to the modified instruction to obtain a new block internal node group diagram.
Optionally, the block specifying of the virtual register according to the register block conflict graph includes:
and distributing and assigning the blocks of the virtual register by adopting a block coloring algorithm based on a greedy algorithm.
Optionally, allocating and specifying each virtual register according to the grouping limit, the block specification of the virtual register and the lifetime conflict limit of the register, includes:
and allocating and assigning each virtual register according to the knot group limit, the block assignment and the life conflict limit of the register generated by the new block knot group diagram.
In a second aspect, an embodiment of the present invention further provides a register allocation apparatus, where the apparatus includes:
the node group diagram building module is used for building a block node group diagram according to the grouping limit of each virtual register in a target chip in an instruction before the instruction scheduling of the compiler, and analyzing the node group relation of the virtual registers according to the block node group diagram so as to split the grouping of the virtual registers;
the conflict graph constructing module is used for constructing a register block conflict graph according to the block limitation of each virtual register in the instruction after the instruction scheduling is finished, and performing block assignment on the virtual registers according to the register block conflict graph;
and the allocation module is used for allocating and assigning each virtual register according to the grouping limit, the block assignment and the life conflict limit of the register of the virtual register when the register is allocated.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
the register allocation method provided by any embodiment of the invention is implemented when the one or more programs are executed by the one or more processors such that the one or more processors execute the programs.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the register allocation method provided in any embodiment of the present invention.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, the computer program implements the register allocation method provided in any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, before the instruction scheduling of a compiler, a block internal block group diagram is constructed according to the grouping limit of each virtual register in a target chip in the instruction, and the relation of the virtual register block group is analyzed according to the block internal block group diagram so as to split the grouping of the virtual registers; after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and according to the register block conflict graph, the virtual register is subjected to block assignment; when the registers are allocated, according to the grouping limit, the block designation and the life conflict limit of the registers of the virtual registers, the technical means of allocating and designating the virtual registers can guarantee the program functionality problem under the two-stage hardware register structure design of the artificial intelligent chip block-grouping, and meet the requirement of load balance of register resources while ensuring the accuracy of the register allocation result, thereby finally making the reduction of the complexity of the hardware register design possible.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a diagram illustrating an instruction under a register two-partition design in the prior art;
FIG. 2 is a diagram illustrating an instruction of another register bipartite block design in the prior art;
FIG. 3 is a block diagram of a register in a novel artificial intelligence chip in the prior art;
FIG. 4 is a diagram of a two-stage block-group structure of a register in a novel artificial intelligence chip design in the prior art;
FIG. 5a is a flowchart of a register allocation method according to an embodiment of the present invention;
FIG. 5b is a diagram of an intra-block set according to an embodiment of the present invention;
fig. 5c is a block collision diagram according to an embodiment of the present invention;
FIG. 6a is a flowchart of a register allocation method according to a second embodiment of the present invention;
FIG. 6b is a diagram of a second embodiment of an intra-block set of knots;
FIG. 6c is a diagram of a new set of intra-block junctions according to the second embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a register allocation apparatus according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device implementing the register allocation method according to the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 5a is a flowchart of a register allocation method according to an embodiment of the present invention, where the present embodiment is applicable to a situation of allocating registers in a novel artificial intelligence chip, and the method may be executed by a register allocation apparatus. The register allocation device can be implemented by software and/or hardware, and can be generally integrated in an electronic device with a data processing function, and specifically includes the following steps:
step 110, before the compiler instruction is scheduled, building a block internal block group diagram according to the grouping limit of each virtual register in the target chip in the instruction, and analyzing the virtual register group relation according to the block internal block group diagram so as to split the grouping of the virtual registers.
In this embodiment, the target chip may be a novel artificial intelligence chip. When the register is applied to the target chip, a developer may generate an instruction (i.e., intermediate instruction replication) using the virtual register as an operand in the compiler according to a functional requirement corresponding to the target chip. Before allocating each virtual register (i.e., mapping the virtual register to a physical register), each instruction using the virtual register may be acquired, then a group relationship (i.e., a grouping restriction) between the virtual registers in each instruction is acquired, and a plurality of virtual registers having the group relationship are connected in sequence according to each instruction, so as to obtain an intra-block group diagram. Specifically, when the operand of an instruction contains more than or equal to an input virtual register and an output register number, the block group diagram needs to embody the group restriction relationship. Typically, an instruction uses a virtual register x, a virtual register y as an input operand, and a virtual register z as an output operand, and there is a group relationship between the virtual registers x, y, z, which requires interconnection in the block group diagram.
In an implementation manner of this embodiment, constructing the intra-block grouping map according to the grouping restriction of each virtual register in the target chip in the instruction includes: and obtaining the operand of the virtual register read in each instruction, and connecting the relevant virtual registers corresponding to each instruction to obtain the block internal block group diagram corresponding to each virtual register.
Specifically, the virtual registers in all operands of each instruction may be obtained, the names of the virtual registers are used as nodes, and the grouping relationship is (undirected) edges, so as to construct an intra-block grouping diagram corresponding to the code.
In a specific embodiment, assume that the instructions in the target chip are:
b=*
c=*
d=*
g=*
a=vadd b,c
e=vsub b,a
f=vmul g,d
according to the above instruction, the virtual register a has a group relationship with the virtual registers b and c, the virtual register e has a group relationship with the virtual registers b and a, and the virtual register f has a group relationship with the virtual registers g and d, so that the block group diagram obtained by the above relationship can be as shown in fig. 5 b. In fig. 5b, if there is a connection between two virtual registers, the two virtual registers may be considered to have a group relationship.
From fig. 5b it can be determined that in the intra block set graph formed by the code, two subgraphs are formed, namely node { a, b, c, e } and its associated edges form a first subgraph, and node { g, d, f } and its associated edges form a second subgraph. The nodes in the subgraph are limited by the grouping relationship and must be allocated and specified from physical registers in the same group. Taking the 2-4 register "block-wise" hardware setting example in fig. 4, the virtual register nodes { a, b, c, e } may be assigned and ultimately register-specified from any 4 different registers in physical register set 0, i.e., { v0, v4, v8, … }, or physical register set 1, i.e., { v1, v5, v9, … } or physical register set 2, i.e., { v2, v6, v10, … }, or physical register set 3, i.e., { v3, v7, v11, … }. Similarly, the virtual registers { g, d, f } may be allocated and register-specified from one of the four physical register sets. In this example, a register specification that complies with the register grouping restriction, i.e., the mapping of virtual registers to physical registers, is: { a, b, c, e } - > { v0, v4, v8, v12} - > { g, d, f } - > { v1, v5, v7}. In addition, since the grouping specification of the virtual register sets of the first sub-graph and the second sub-graph does not require inter-group mutual exclusion, the { a, b, c, e } - > { v0, v4, v8, v12}, { g, d, f } - > { v16, v20, v4} is also a register specification that can satisfy the register grouping restriction.
And 120, after the instruction scheduling is finished, constructing a register block conflict graph according to the block limit of each virtual register in the instruction, and performing block assignment on the virtual register according to the register block conflict graph.
In this embodiment, after the building of the intra-block node group map is completed, the instruction may be scheduled, and then the block conflict map corresponding to each virtual register may be built according to the block restriction of each virtual register in the instruction. Specifically, assuming that the input operand of the same instruction includes both the virtual register x and the virtual register y, it may be determined that a block conflict relationship exists between the virtual register x and the virtual register y, that is, the virtual register x and the virtual register y must be allocated to registers from different register blocks.
In an implementation manner of this embodiment, constructing a register block conflict graph according to block limitations of each virtual register in an instruction includes: and acquiring a plurality of operands read in each instruction, and constructing a corresponding block conflict graph by taking the name of the virtual register operand as a node and taking the block conflict relationship as a (undirected) edge.
In a specific embodiment, assume that the instructions in the target chip are:
b=*
c=*
d=*
g=*
a=vadd b,c
e=vsub b,a
f=vmul g,d
as can be seen from the above-mentioned commands, the virtual register b and the virtual register c have a block conflict relationship, the virtual register b and the virtual register a have a block conflict relationship, and the virtual register g and the virtual register d have a block conflict relationship, and therefore, a block conflict graph obtained by the above-mentioned block conflict relationship may be as shown in fig. 5 c. In fig. 5c, if there is a connection between two virtual registers, it can be considered that there is a block conflict relationship between the two virtual registers.
According to the block conflict map, block assignment of the virtual register can be performed. In an implementation manner of this embodiment, performing block specification on a virtual register according to the register block conflict graph includes: and distributing and assigning the blocks of the virtual register by adopting a block coloring algorithm based on a greedy algorithm. The block coloring algorithm, which starts coloring from a high-degree node, greedily looks for a color that is different from all neighboring nodes. If such a color exists, the color of the node is designated, and if the number of colors of the adjacent nodes is the same as the number of register blocks, the color is arbitrarily designated to the node. When a node is colored up, it will continue to color all its uncolored neighbors. When one sub-graph is colored completely, the same process is repeated for the next sub-graph.
In this embodiment, the 2-4 register "group-block" hardware in fig. 4 is taken as an example, and the hardware has two blocks, so that 2-coloring can be performed on fig. 5 c. A chunk assignment that conforms to the above algorithm is { a, c, d, f } - > chunk 0, { b, g, e } - > chunk 1. In this process, the register pressures of block 0 and block 1 may be calculated in a simulated manner to ensure that the block allocation of the registers is uniform.
And step 130, when the registers are distributed, distributing and specifying the virtual registers according to the grouping limit, the block specification and the life conflict limit of the registers of the virtual registers.
In a specific embodiment, assume that the instructions in the target chip are:
b=*
c=*
d=*
g=*
a=vadd b,c
e=vsub b,a
f=vmul g,d
the intra-block grouping diagram and the block conflict diagram are shown in fig. 5b and 5c, and taking the 2-4 register "grouping-blocking" hardware in fig. 4 as an example, a register instruction meeting both the register grouping restriction and the register blocking restriction is: a- > v0 (block 0, group 0), b- > v4 (block 1, group 0), c- > v8 (block 0, group 0), d- > v1 (block 0, group 1), e- > v12 (block 1, group 0), f- > v9 (block 0, group 1), g- > v5 (block 1, group 1).
In the register allocation, the existing register life analysis technology can be considered to be utilized, and the using amount of the registers is reduced through register reuse. Such as the following assignments: a- > v0 (block 0, group 0), b- > v4 (block 1, group 0), c- > v0 (block 0, group 0), d- > v1 (block 0, group 1), e- > v4 (block 1, group 0), f- > v1 (block 0, group 1), g- > v5 (block 1, group 1), the register usage is reduced from 7 to 4 by the reuse of virtual registers a and c, virtual registers b and e, and virtual registers f and d lifetime.
Compared with the prior art, the method solves the problem of program function guarantee under the hardware design of grouping-partitioning registers in the novel artificial intelligence chip step by step through a reasonable execution flow, and is different from other technical schemes of avoiding generation of codes with register partitioning conflict through a compiler. The reliability of register allocation results under the 'block-group' design of register hardware can be ensured by constructing an intra-block group diagram and a block conflict diagram and allocating registers by combining intra-block group limitation and block assignment; by constructing the block conflict graph after the instruction scheduling is finished, the pressure of the register in each block can be simulated and calculated, so that the condition that the pressure of the register in a single register block is too high is avoided, and the reasonable register block assignment is further carried out; finally, register allocation is completed by designing three separate processes instead of considering all complex conditions in the same process, so that the accuracy of allocation results can be ensured under the condition that the complexity of software design is controllable, the execution efficiency of allocating each part of the register is improved, and the effect of reducing the complexity of hardware is achieved by combining software and hardware.
According to the technical scheme of the embodiment of the invention, before the instruction scheduling of a compiler, a block internal block group diagram is constructed according to the grouping limit of each virtual register in a target chip in the instruction, and the relation of the virtual register block group is analyzed according to the block internal block group diagram so as to split the grouping of the virtual registers; after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and according to the register block conflict graph, the virtual register is subjected to block assignment; when the register is distributed, the technical means of distributing and specifying each virtual register can ensure the accuracy of the register distribution result under the hardware block-group design of the register, relieve the pressure of the register and reduce the complexity of the hardware design according to the group limit, the block specification of the virtual register and the life conflict limit of the register.
Example two
This embodiment is a further refinement of the above embodiment, and the same or corresponding terms as those of the above embodiment are explained, and this embodiment is not described again. Fig. 6a is a flowchart of a register allocation method provided in the second embodiment, in this embodiment, the technical solution of this embodiment may be combined with one or more methods in the solutions of the foregoing embodiments, as shown in fig. 6a, the method provided in this embodiment may further include:
step 201, before the command scheduling of the compiler, building a block internal block group diagram according to the grouping limit of each virtual register in the target chip in the command.
Step 202, obtain a plurality of knots and target registers in the block knot map.
In the present embodiment, a 2-4 "block-group" register design is adopted in the novel artificial intelligence chip, and although the number of physical registers is up to 1024, the novel artificial intelligence chip is specific to a single block-group, which contains 128 physical registers. For operation type codes with larger scale, subgraphs with excessive nodes are easier to form in the block internal node group graph. When the number of virtual register nodes in the subgraph is greater than 128 and the virtual register lifetimes conflict with each other, then a single block-packet register over-stressed condition results. At the same time, there may be a condition where the physical registers in other block-packets are not fully used, i.e., starved, which may also be referred to as a block-packet load imbalance.
In order to solve the above problem, this embodiment provides a way of performing value segmentation on a block internal node group diagram, so as to ensure block load balancing and avoid excessive register pressure.
In one specific embodiment, assume that the instructions in the target chip are:
b=*
c=*
d=*
a=vadd b,c
e=vsub b,d
f=vmul c,d
the block internal grouping diagram constructed by step 201 is shown in fig. 6b, and it can be seen that all virtual registers { a, b, c, d, e, f } are in the same sub-graph, which must be mapped into the same grouping. This places a large register pressure on a single packet.
In this step, a register common between the respective node groups may be acquired as a target register in the block node group map. Taking the block internal node group diagram in fig. 6b as an example, it is assumed that the node group 1 includes a register a, a register b, and a register c, the node group 2 includes a register e, a register d, and a register b, the node group 3 includes a register f, a register d, and a register c, and common target registers in the three node groups are the register b, the register d, and the register c, respectively.
Step 203, before all the instructions taking the target register as the source operand, inserting the copy instruction corresponding to the target register, and modifying the instructions according to the copy instruction.
In this step, assuming that the target register is register b, the copy instruction corresponding to the target register may be b' = b. Taking the target register in step 202 as an example, the modified instruction may be:
b=*
c=*
d=*
b’=b
a=vadd b’,c
d’=d
e=vsub b,d’
c’=c
f=vmul c’,d
and step 204, updating the block internal node group diagram according to the modified instruction to obtain a new block internal node group diagram.
In a specific embodiment, the new intra block grouping map obtained according to the modified instruction can be as shown in fig. 6 c.
Step 205, after the instruction scheduling is completed, according to the blocking limitation of each virtual register in the instruction, a register blocking conflict graph is constructed, and according to the register blocking conflict graph, the virtual register is subjected to blocking assignment.
And step 206, when the registers are distributed, distributing and specifying the virtual registers according to the grouping limit, the block specification and the life conflict limit of the registers.
In this embodiment, by performing value segmentation on the intra-block grouping diagram, on one hand, it is possible to ensure that the load of the registers in the register block-grouping is balanced, and alleviate the pressure of the single block-grouping register, and on the other hand, the updated intra-block grouping diagram can be better integrated into the existing register allocation process, and the accuracy of the register allocation result is ensured.
According to the technical scheme, before the compiler is used for instruction scheduling, a block internal node group diagram is constructed according to the grouping limit of each virtual register in a target chip in an instruction, a plurality of node groups and target registers are obtained in the block internal node group diagram, a copy instruction corresponding to the target register is inserted before all instructions taking the target registers as source operands, the instructions are modified according to the copy instruction, the block internal node group diagram is updated according to the modified instructions to obtain a new block internal node group diagram, after the instruction scheduling is completed, a register block conflict diagram is constructed according to the block limit of each virtual register in the instruction, the virtual registers are subjected to block assignment according to the register block conflict diagram, and when the registers are assigned, the virtual registers are assigned according to the grouping limit of the virtual registers, the block assignment and the life conflict limit of the registers, the technical means for assigning and assigning each virtual register are adopted, so that the accuracy of the register assignment result can be guaranteed, the pressure of a single block-group register is relieved, and the complexity of hardware design is reduced.
EXAMPLE III
Fig. 7 is a schematic structural diagram of a register allocation apparatus according to a third embodiment of the present invention, as shown in fig. 7, the apparatus includes: a group graph building module 310, a conflict graph building module 320, and an assignment module 330.
Before the compiler instruction is scheduled, the node group diagram building module 310 is configured to build an intra-block node group diagram according to the grouping restriction of each virtual register in the target chip in the instruction, and analyze the virtual register node group relationship according to the intra-block node group diagram to split the grouping of the virtual registers;
the conflict graph constructing module 320 is configured to, after the instruction scheduling is completed, construct a register block conflict graph according to the block restriction of each virtual register in the instruction, and perform block specification on the virtual register according to the register block conflict graph;
the allocating module 330 is configured to allocate and specify each virtual register according to the grouping limit, the block specification, and the lifetime conflict limit of the register when the register is allocated.
According to the technical scheme provided by the embodiment of the invention, before the instruction scheduling of a compiler, a block internal block group diagram is constructed according to the grouping limit of each virtual register in a target chip in the instruction, and the virtual register block group relation is analyzed according to the block internal block group diagram so as to split the grouping of the virtual registers; after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and according to the register block conflict graph, the virtual register is subjected to block assignment; when the register is distributed, the technical means of distributing and specifying each virtual register can ensure the accuracy of the register distribution result under the hardware block-group design of the register, relieve the pressure of the register and reduce the complexity of the hardware design according to the group limit, the block specification of the virtual register and the life conflict limit of the register.
On the basis of the above embodiment, the knot group diagram building module 310 includes:
the register connecting unit is used for acquiring the operand of the virtual register read in each instruction, and connecting all the virtual registers corresponding to each instruction to obtain the block internal node group diagram corresponding to each virtual register;
a target register acquisition unit configured to acquire a plurality of node groups and a target register in the block node group map;
the copy instruction insertion unit is used for inserting a copy instruction corresponding to the target register before all instructions taking the target register as a source operand and modifying the instructions according to the copy instruction;
and the knot group diagram updating unit is used for updating the block knot group diagram according to the modified instruction to obtain a new block knot group diagram.
The conflict graph building module 320 includes:
a conflict register obtaining unit, configured to obtain a plurality of conflict registers with a connection relationship in the block conflict graph;
and the register block assignment unit is used for assigning and assigning the blocks of the virtual register by adopting a block coloring algorithm based on a greedy algorithm.
The distribution module 330 includes:
and the allocation unit is used for allocating and assigning each virtual register according to the node group limit, the block assignment and the life conflict limit of the register generated by the new block node group diagram.
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For technical details which are not described in detail in the embodiments of the present invention, reference may be made to the methods provided in all the aforementioned embodiments of the present invention.
Example four
FIG. 8 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM12, and the RAM13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the register allocation method.
In some embodiments, the register allocation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM12 and/or the communication unit 19. When the computer program is loaded into RAM13 and executed by processor 11, one or more steps of the register allocation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the register allocation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A register allocation method, the method comprising:
before the compiler instruction is scheduled, according to the grouping limit of each virtual register in the target chip in the instruction, constructing a block internal block group diagram, and analyzing the virtual register block group relation according to the block internal block group diagram so as to split the grouping of the virtual registers;
after the instruction scheduling is finished, according to the block limitation of each virtual register in the instruction, a register block conflict graph is constructed, and according to the register block conflict graph, the virtual register is subjected to block assignment;
when the register is distributed, each virtual register is distributed and specified according to the grouping limit, the block specification and the life conflict limit of the register.
2. The method of claim 1, wherein constructing the intra block group map according to the grouping restriction of each virtual register in the target chip in the instruction comprises:
and obtaining the operand of the virtual register read in each instruction, and connecting all the virtual registers corresponding to each instruction to obtain the block internal block group diagram corresponding to each virtual register.
3. The method of claim 1, wherein analyzing virtual register bank relationships from the intra block bank map to split a grouping of virtual registers comprises:
obtaining a plurality of knot groups and target registers in the block knot group diagram;
before all instructions taking a target register as a source operand, inserting a copy instruction corresponding to the target register, and modifying the instructions according to the copy instruction;
and updating the block internal node group diagram according to the modified instruction to obtain a new block internal node group diagram.
4. The method of claim 3, wherein performing block assignment of virtual registers according to the register block conflict graph comprises:
and distributing and assigning the blocks of the virtual register by adopting a block coloring algorithm based on a greedy algorithm.
5. The method of claim 4, wherein assigning and specifying virtual registers based on their grouping limits, block designations, and register lifetime conflict limits comprises:
and allocating and assigning each virtual register according to the knot group limit, the block assignment and the life conflict limit of the register generated by the new block knot group diagram.
6. A register allocation apparatus, characterized in that the apparatus comprises:
the node group diagram building module is used for building a block node group diagram according to the grouping limit of each virtual register in a target chip in an instruction before the instruction scheduling of the compiler, and analyzing the node group relation of the virtual registers according to the block node group diagram so as to split the grouping of the virtual registers;
the conflict graph constructing module is used for constructing a register block conflict graph according to the block limitation of each virtual register in the instruction after the instruction scheduling is finished, and performing block assignment on the virtual registers according to the register block conflict graph;
and the allocation module is used for allocating and assigning each virtual register according to the grouping limit, the block assignment and the life conflict limit of the register of the virtual register when the register is allocated.
7. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs;
the register allocation method of any one of claims 1-5 when executed by the one or more processors such that the one or more processors execute the program.
8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the register allocation method according to any one of claims 1 to 5.
9. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the register allocation method according to any one of claims 1-5.
CN202211225754.8A 2022-10-09 2022-10-09 Register allocation method and device applied to novel artificial intelligence processor Active CN115617396B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211225754.8A CN115617396B (en) 2022-10-09 2022-10-09 Register allocation method and device applied to novel artificial intelligence processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211225754.8A CN115617396B (en) 2022-10-09 2022-10-09 Register allocation method and device applied to novel artificial intelligence processor

Publications (2)

Publication Number Publication Date
CN115617396A true CN115617396A (en) 2023-01-17
CN115617396B CN115617396B (en) 2023-08-29

Family

ID=84859787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211225754.8A Active CN115617396B (en) 2022-10-09 2022-10-09 Register allocation method and device applied to novel artificial intelligence processor

Country Status (1)

Country Link
CN (1) CN115617396B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1881175A (en) * 2005-06-16 2006-12-20 中国科学院计算技术研究所 Method for solving multi-register conflict
US20110219364A1 (en) * 2010-03-03 2011-09-08 Vladimir Makarov Mechanism for Performing Register Allocation of Program Variables Based on Priority Spills and Assignments
US8555035B1 (en) * 2010-07-07 2013-10-08 Nvidia Corporation Conflict-free register allocation using a multi-bank register file with input operand alignment
US8832671B1 (en) * 2010-07-07 2014-09-09 Nvidia Corporation Conflict-free register allocation
CN104461471A (en) * 2014-12-19 2015-03-25 中国人民解放军国防科学技术大学 Unified instruction scheduling and register allocating method on clustering VLIW processor
US20150113251A1 (en) * 2013-10-18 2015-04-23 Marvell World Trade Ltd. Systems and Methods for Register Allocation
CN105912304A (en) * 2016-03-31 2016-08-31 中国人民解放军国防科学技术大学 Vector VLIW architecture diagram coloring register grouping allocation method
US20170083320A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Predicated read instructions
CN106681694A (en) * 2016-12-30 2017-05-17 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction
CN109478136A (en) * 2016-06-23 2019-03-15 超威半导体公司 System and method for using virtual vector register files
CN110968544A (en) * 2019-11-22 2020-04-07 华中科技大学 SoC storage system based on embedded spin transfer torque magnetic random access memory
CN111638911A (en) * 2019-03-01 2020-09-08 阿里巴巴集团控股有限公司 Processor, instruction execution equipment and method
CN113835758A (en) * 2021-11-25 2021-12-24 之江实验室 Winograd convolution implementation method based on vector instruction accelerated computation
CN114298293A (en) * 2021-12-29 2022-04-08 杭州万高科技股份有限公司 Recurrent neural network acceleration methods, systems, and media based on Cortex-M processor
CN114816532A (en) * 2022-04-20 2022-07-29 湖南卡姆派乐信息科技有限公司 Register allocation method for improving RISC-V binary code density

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1881175A (en) * 2005-06-16 2006-12-20 中国科学院计算技术研究所 Method for solving multi-register conflict
US20110219364A1 (en) * 2010-03-03 2011-09-08 Vladimir Makarov Mechanism for Performing Register Allocation of Program Variables Based on Priority Spills and Assignments
US8555035B1 (en) * 2010-07-07 2013-10-08 Nvidia Corporation Conflict-free register allocation using a multi-bank register file with input operand alignment
US8832671B1 (en) * 2010-07-07 2014-09-09 Nvidia Corporation Conflict-free register allocation
US20150113251A1 (en) * 2013-10-18 2015-04-23 Marvell World Trade Ltd. Systems and Methods for Register Allocation
CN104461471A (en) * 2014-12-19 2015-03-25 中国人民解放军国防科学技术大学 Unified instruction scheduling and register allocating method on clustering VLIW processor
US20170083320A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Predicated read instructions
CN105912304A (en) * 2016-03-31 2016-08-31 中国人民解放军国防科学技术大学 Vector VLIW architecture diagram coloring register grouping allocation method
CN109478136A (en) * 2016-06-23 2019-03-15 超威半导体公司 System and method for using virtual vector register files
CN106681694A (en) * 2016-12-30 2017-05-17 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction
CN111638911A (en) * 2019-03-01 2020-09-08 阿里巴巴集团控股有限公司 Processor, instruction execution equipment and method
CN110968544A (en) * 2019-11-22 2020-04-07 华中科技大学 SoC storage system based on embedded spin transfer torque magnetic random access memory
CN113835758A (en) * 2021-11-25 2021-12-24 之江实验室 Winograd convolution implementation method based on vector instruction accelerated computation
CN114298293A (en) * 2021-12-29 2022-04-08 杭州万高科技股份有限公司 Recurrent neural network acceleration methods, systems, and media based on Cortex-M processor
CN114816532A (en) * 2022-04-20 2022-07-29 湖南卡姆派乐信息科技有限公司 Register allocation method for improving RISC-V binary code density

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张军超;连瑞琦;张兆庆;: "多寄存器组网络处理器上的寄存器分配技术", 计算机学报, no. 01, pages 1 - 7 *
戴涛;单征;卢帅兵;石强;潭捷;: "基于优先级动态二进制翻译寄存器分配算法", 浙江大学学报(工学版), no. 07, pages 1348 *

Also Published As

Publication number Publication date
CN115617396B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
JP7545211B2 (en) A method for allocating resources among layers of a multi-path neural network.
US10031775B2 (en) Backfill scheduling for embarrassingly parallel jobs
US10656970B2 (en) Scheduling graph computing on heterogeneous processing resources based on energy efficiency
US11630983B2 (en) Graph conversion method
CN116501503B (en) Architecture mapping method and device for load task, computer equipment and medium
WO2024021192A1 (en) Graph optimization method and apparatus for neural network calculation
US7788635B2 (en) Technique for processing a computer program
US12081636B2 (en) Distribution of machine learning workflows on webscale infrastructures
US12026606B2 (en) Fractal calculating device and method, integrated circuit and board card
CN117808048A (en) Operator execution method, device, equipment and storage medium
CN119536818B (en) Instruction scheduling decision method, device, equipment and medium
CN115617396B (en) Register allocation method and device applied to novel artificial intelligence processor
US20240104016A1 (en) Intermediate Representation Method and Apparatus for Compiling Computation Graphs
CN117539548A (en) Instruction execution method, device, equipment and storage medium based on wait mechanism
CN116545958A (en) Basic block arrangement method applied to PISA architecture chip
CN116993063A (en) A task solving method and its device
CN112019368B (en) VNF migration method, VNF migration device and VNF migration storage medium
Mirsadeghi Improving communication performance through topology and congestion awareness in HPC systems
Uscumlic et al. Design space exploration with deterministic latency guarantees for crossbar mpsoc architectures
CN118963994B (en) Resource management method, electronic device and storage medium
CN119861974B (en) Task processing method, device, equipment and medium based on data stream core architecture
US20240185110A1 (en) Distribution of quantum state vector elements across network devices in quantum computing simulation
US20240281294A1 (en) Apparatus, method, non-transitory computer-readable medium and system
CN116680296A (en) Large-scale graph data processing system based on single machine
CN120276857A (en) Task execution method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee after: Shanghai Suiyuan Technology Co.,Ltd.

Country or region after: China

Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.

Country or region before: China