Disclosure of Invention
The embodiment of the application provides an instruction conversion method and device, which can balance the loads of a scalar execution unit and a vector execution unit of computer equipment and improve the utilization rate of the execution unit, thereby improving the performance of the computer equipment.
In order to achieve the above purpose, the embodiment of the application adopts the following technical scheme:
In a first aspect, embodiments of the present application provide an instruction converting method, which may be executed by a computer device, or may be executed by a component of the computer device, for example, a processor, a chip, or a chip system of the computer device. The method is applied to computer equipment, the computer equipment comprises a first type execution unit and a second type execution unit, the first type execution unit is used for executing a first type instruction, the second type execution unit is used for executing a second type instruction, the method comprises the steps of acquiring an instruction set according to a source program, the execution set comprises at least two instructions, the type of any one instruction in the at least two instructions is the first type or the second type, and part or all of the first type instructions in the instruction set are converted into the second type instructions according to a policy rule, and the policy rule enables loads of the first type execution unit and the second type execution unit to be balanced. The first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, or the first type of instruction is a SISD instruction, and the second type of instruction is a SIMD instruction.
In the embodiment of the application, when the instruction conversion is executed, part or all of the first type of instructions can be converted into the second type of instructions according to the policy rules, and the policy rules can balance the loads of the first type of execution units and the second type of execution units, so that the load balancing method and the load balancing device can balance the loads of the first type of execution units and the second type of execution units, improve the utilization rate of the execution units, and further improve the performance of computer equipment.
In some possible designs, the policy rule indicates that an instruction of a first type in a first instruction relationship sub-tree, which is a sub-tree of instruction relationship trees of at least two instructions comprised by the instruction set, is converted to an instruction of a second type. Based on the possible design, instruction conversion can be performed according to the instruction relation tree, so that the complexity of an instruction conversion strategy is reduced.
In some possible designs, the instruction set includes a first instruction set, an initial set of the first instruction set including at least two first instructions, the first instructions being of a first type for implementing a first operation, the policy rules indicating that a portion of the first instructions in the first instruction set are to be converted into M second instructions, the second instructions being of a second type for implementing the first operation, M being a positive integer. Based on the possible design, instructions for realizing each operation in the instruction set can be respectively subjected to instruction conversion according to the operation realized by the instructions, and the accuracy of instruction conversion is improved, so that the load balancing of the first type of execution units and the second type of execution units is more effectively realized.
In some possible designs, the policy rules indicating converting a portion of the first instructions in the first set of instructions to M second instructions include the policy rules indicating converting a portion of the first instructions in the first set of instructions to M second instructions if the first instructions and the second instructions are executed in parallel. Based on this possible design, converting part of the first instructions in the first set into M second instructions may cause both the first type of execution unit and the second type of execution unit to be used, in case the first instructions and the second instructions are executed in parallel.
In some possible designs, the ratio of M to K is determined by a first value and a second value, K being the number of the first instructions of the other part of the first instruction set, the first value being determined by a first duration and a first number, the second value being determined by a second duration and a second number, the first duration being the duration used by a first execution unit to execute a first instruction, the first number being the number of first execution units comprised by the computer device, the second duration being the duration used by a second execution unit to execute a second instruction, the second number being the number of second execution units comprised by the computer device. Based on the possible design, the number of the second instructions and the first instructions after conversion can be determined by combining the hardware characteristics of the computer equipment, so that instruction conversion can be performed more finely according to the hardware differences of different computer equipment, and further the accuracy of load balancing is improved.
In some possible designs, the ratio of M to K is determined by a first value to a second value, including that the ratio of M to K is determined by the first value, the second value, and an instruction relationship tree of at least two instructions included in the instruction set. Based on the possible design, the number of the converted second instructions and the number of the converted first instructions can be determined by combining the hardware characteristics of the computer equipment and the characteristics of the instruction set, so that the accuracy of load balancing is further improved.
In some possible designs, the first value, the first duration, and the first number satisfy the following formula:
wherein, At a first value, T 1 is a first duration,Is a first number.
In some possible designs, the first time period is determined by a first number of times and a first time difference, the first number of times is the number of times of a first operation implemented by one first instruction, the first time difference is a time period used for implementing n times of the first operation by using the first instruction, and the time period used for implementing m times of the first operation by using the first instruction is a positive integer, and n is greater than m.
In some possible designs, the first time period, the first number of times, and the first time difference satisfy the following formulas:
Wherein T 1 is a first duration, For the first number of times,For the first time difference of the first time difference,The duration used to implement n first operations using the first instruction,The duration used to implement m first operations using the first instruction.
In some possible designs, the second value, the second duration, and the second number satisfy the following formulas:
wherein, At a second value, T 2 is a second duration,A second number.
In some possible designs, the second time period is determined by a second number of times and a second time difference, the second number of times is the number of times the first operation is performed by one second instruction, the second time difference is a time period used for performing the first operation n times by using the second instruction, and the time period used for performing the first operation m times by using the second instruction, n and m are positive integers, and n is greater than m.
In some possible designs, the second time period, the second number of times, and the second time difference satisfy the following formulas:
Wherein T 2 is a second time period, For the second number of times,For the second time difference to be a second time difference,The duration used to implement n first operations using the second instruction,The duration used to implement m first operations using the second instruction.
In some possible designs, the instruction set includes a second instruction set, an initial set of the second instruction set includes at least one third instruction, the third instruction is an instruction of the first type for implementing the second operation, the policy rule indicates that all third instructions in the second instruction set are converted into fourth instructions in a case that a third value is less than or equal to a fourth value, the fourth instruction is an instruction of the second type for implementing the second operation, the third value is determined by a third duration and a third number, the fourth value is determined by a fourth duration and a fourth number, the third duration is a duration used by a third execution unit to execute the third instruction, the third number is a number of third execution units included in the computer device, the fourth duration is a duration used by a fourth execution unit to execute the fourth instruction, and the fourth number is a number of fourth execution units included in the computer device.
Based on the possible design, instructions for realizing each operation in the instruction set can be respectively subjected to instruction conversion according to the operation realized by the instructions, and the accuracy of instruction conversion is improved, so that the load balancing of the first type of execution units and the second type of execution units is more effectively realized.
In some possible designs, the policy rules indicate that all third instructions in the third set of instructions are converted to fourth instructions if the third value is less than or equal to the fourth value, including converting all third instructions in the third set of instructions to fourth instructions if the policy rules indicate that the third instructions and the fourth instructions are not executed in parallel and the third value is less than or equal to the fourth value.
In some possible designs, the instruction translation method further includes storing the first type of instruction in the instruction set to a first instruction cache where the instruction set includes the first type of instruction. Based on this possible design, a sorted store of different types of instructions may be implemented such that when instructions are fetched, the required type of instructions may be fetched based on the sorted store.
In some possible designs, the instruction conversion method further comprises the steps of acquiring state information of a first micro instruction cache queue, wherein the first micro instruction cache queue is used for caching micro operations of the first type of instructions, and determining the quantity of the first type of instructions to be acquired from the first instruction cache according to the state information of the first micro instruction cache queue. Based on the possible design, the state information of the micro instruction cache queue can be obtained in real time, the number of the instructions of the corresponding types taken out from various instruction caches is dynamically adjusted according to the state information, and the parallel operation efficiency of the first type of execution units and the second type of execution units is improved, so that the performance of the computer equipment is further improved.
In some possible designs, the status information of the first micro instruction cache queue is used for indicating the quantity of the free space in the first micro instruction cache queue, and determining the quantity of the first type of instructions to be obtained from the first instruction cache according to the status information of the first micro instruction cache queue comprises determining the quantity of the free space in the first micro instruction cache queue as the quantity of the first type of instructions to be obtained from the first instruction cache.
In some possible designs, the instruction translation method further includes storing the second type of instruction in the instruction set to a second instruction cache where the instruction set includes the second type of instruction. Based on this possible design, a sorted store of different types of instructions may be implemented such that when instructions are fetched, the required type of instructions may be fetched based on the sorted store.
In some possible designs, the instruction conversion method further comprises the steps of acquiring state information of a second micro instruction cache queue, wherein the second micro instruction cache queue is used for caching micro operations of the second type of instructions, and determining the number of the second type of instructions to be acquired from the second instruction cache according to the state information of the second micro instruction cache queue. Based on the possible design, the state information of the micro instruction cache queue can be obtained in real time, the number of the instructions of the corresponding types taken out from various instruction caches is dynamically adjusted according to the state information, and the parallel operation efficiency of the first type of execution units and the second type of execution units is improved, so that the performance of the computer equipment is further improved.
In some possible designs, the status information of the second micro instruction cache queue is used for indicating the amount of free space in the second micro instruction cache queue, and determining the amount of the second type of instructions to be fetched from the second instruction cache according to the status information of the second micro instruction cache queue comprises determining the amount of free space in the second micro instruction cache queue as the amount of the second type of instructions to be fetched from the second instruction cache.
In a second aspect, a communications device is provided for implementing any of the methods of the above aspects. In some implementations, the communication apparatus may be a computer device in the first aspect, or an apparatus including the computer device. The communication device comprises corresponding modules, units or means (means) for realizing the method, and the modules, units or means can be realized by hardware, software or realized by executing corresponding software by hardware. The hardware or software includes one or more modules or units corresponding to the functions described above. Such as an acquisition module and a conversion module.
The system comprises an acquisition module, a conversion module and a policy rule, wherein the acquisition module is used for acquiring an instruction set according to a source program, the instruction set comprises at least two instructions, the type of any one instruction in the at least two instructions is a first type or a second type, and the conversion module is used for converting part or all of the first type of instructions in the instruction set into the second type of instructions according to the policy rule, and the policy rule enables load balancing of a first type of execution unit and a second type of execution unit of the computer equipment. The first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, or the first type of instruction is a SISD instruction, and the second type of instruction is a SIMD instruction.
In this embodiment, the specific implementation manner of the communication apparatus may refer to the behavior function of the computer device in the instruction conversion method provided by the first aspect or any one of possible designs of the first aspect.
In a third aspect, a communication apparatus is provided, which may be the computer device in the first aspect, or an apparatus comprising the computer device. In a possible implementation manner, the communication device may include at least one processor, where the processor may be configured to support the communication device to implement the functions involved in the first aspect or any of the possible designs of the first aspect. For example, the processor is configured to obtain an instruction set according to a source program, where the instruction set includes at least two instructions, and a type of any one of the at least two instructions is a first type or a second type, and the processor is further configured to convert some or all of the first type of instructions in the instruction set into the second type of instructions according to a policy rule, where the policy rule makes loads of the first type of execution unit and the second type of execution unit balanced. The first type of instruction is a SIMD instruction, and the second type of instruction is a SISD instruction; or the first type of instruction is a SISD instruction and the second type of instruction is a SIMD instruction, in another possible implementation the processor is configured to couple with the memory and to execute the method according to any of the above aspects according to the instructions after reading the instructions in the memory.
In this embodiment, the specific implementation manner of the communication apparatus may refer to the behavior function of the computer device in the instruction conversion method provided by the first aspect or any one of possible designs of the first aspect.
In a fourth aspect, there is provided a communications device comprising a memory and at least one processor, the memory storing computer instructions which, when executed by the processor, cause the communications device to perform the method of any of the preceding aspects. The communication means may be the computer device of the first aspect, or an apparatus comprising the computer device.
In a fifth aspect there is provided a communications apparatus comprising an interface circuit, which may be a code/data read/write interface circuit, for receiving computer-executable instructions (stored in memory, possibly read directly from memory, or possibly via other means) and transmitting them to the processor, and at least one processor for executing the computer-executable instructions to perform the method of any of the above aspects. The communication means may be the computer device of the first aspect, or an apparatus comprising the computer device.
In a sixth aspect, there is provided a computer readable storage medium having instructions stored therein which, when run on a communications device, cause the communications device to perform the method of any of the above aspects. The communication means may be the computer device of the first aspect, or an apparatus comprising the computer device.
In a seventh aspect, there is provided a computer program product comprising instructions which, when run on a communications device, cause the communications device to perform the method of any of the above aspects. The communication means may be the computer device of the first aspect, or an apparatus comprising the computer device.
In an eighth aspect, there is provided a communications device (e.g. which may be a chip or a system of chips) comprising at least one processor for carrying out the functions referred to in any of the above aspects. In some possible designs, the communication device further comprises a memory for holding necessary program instructions and/or data. When the communication device is a chip system, the communication device may be formed of a chip, or may include a chip and other discrete devices.
In a ninth aspect, embodiments of the present application provide a chip comprising at least one processor and a communication interface coupled to the at least one processor, the at least one processor being configured to execute a computer program or instructions to implement the method described in the first aspect or in various possible implementations of the first aspect. The communication interface is used for communicating with other modules outside the chip.
The chip provided by the embodiment of the application can also comprise a memory, wherein the memory is used for storing computer programs or instructions.
It will be appreciated that any of the above-mentioned communication devices, chips, computer-readable media, computer program products, etc. are used to perform the corresponding methods provided above, and therefore, the advantages achieved by the above-mentioned communication devices, chips, computer-readable media, computer-program products, etc. refer to the advantages of the corresponding methods, and are not repeated herein.
Detailed Description
In order to facilitate understanding of the technical solutions of the embodiments of the present application, a brief description of the related art of the present application is given below.
First, single-emission CPU:
as shown in fig. 1, when an instruction is executed, the single-issue CPU can only fetch one instruction from the instruction Cache (Cache) in each clock cycle, and can only decode one instruction, can only execute one instruction, and can only write back one operation result.
For a single-emission CPU, the expected value of the parallelism of the instruction level cannot reach 1 due to data correlation, conditional branching, resource conflict and the like, namely, the execution of one instruction per clock cycle cannot be realized on average.
Second, multiple transmit CPU:
In order to develop parallelism among pipelines and improve CPU processing performance, the current mainstream CPU microarchitecture mainly adopts a multiple pipeline technology, namely, the CPU is usually a multiple emission CPU.
For example, as shown in fig. 2, the three-issue CPU may fetch three instructions from the instruction cache at the same time in each clock cycle, decode the three instructions, execute the three instructions at the same time, and write back the three operation results.
Third, out-of-order multiple processor:
in order to alleviate the problem of CPU stalls due to data correlation, conditional branching, resource conflicts, and the like, further improving inter-pipeline parallelism, out-of-order execution paradigms are widely used in high performance microprocessors.
In particular, in the field of computer engineering, out-of-order execution is a paradigm that uses instruction cycles in high performance CPUs to avoid certain types of latency consumption. In this example, in order to avoid a CPU stall caused by unavailability of an operation target of the next instruction, the processor processes an instruction that can be immediately executed instead of the component such as a microinstruction buffer, thereby implementing execution in an instruction sequence not specified by the program.
Based on this, an out-of-order multi-issue processor may be understood as a multi-issue processor that employs an out-of-order execution paradigm. Illustratively, an out-of-order multiple-issue processor may pick an executable instruction to run first by the following steps, the CPU instruction pipeline of which is shown in FIG. 3.
Step 1, acquiring an instruction from an instruction cache;
Step 2, decoding the instruction to obtain a plurality of micro-operations (micro-ops), and sending the micro-operations to an instruction sequence (also called an execution buffer or a reservation station);
Step 3, the instruction sequence will send the instruction micro-operation with complete operands and executable to the corresponding execution units (PU).
In this step, even if the micro-operation of a certain instruction is located in front of the instruction sequence, execution cannot be started if its operands are not ready, so the instruction execution order in the instruction sequence is inconsistent with the program, i.e., out of order.
Step 4, each execution unit obtains and executes executable micro-operations from the instruction sequence to obtain an execution result, and the execution result is cached in the sequence (Queue);
And 5, writing the execution result in the sequence into a (write back) register (register).
In this step, for the execution result of a certain instruction, the execution result of the instruction is written into the register only after the execution result of the instruction preceding the instruction in the program is written into the register.
Fourth, SISD, SIMD:
As shown in fig. 4, for SISD, one instruction can only operate on one data stream at a time, and typically, SISD uses 64-bit general purpose registers. An instruction employing SISD techniques may be referred to as a SISD instruction, and an execution unit for executing the SISD instruction may be referred to as a scalar execution unit.
As shown in fig. 5, for SIMD, one instruction can operate on multiple data streams simultaneously, implementing the functions of multiple SISD instructions, typically SIMD uses SIMD-specific registers with a maximum length of 128 bits or 512 bits. An instruction employing SIMD technology may be referred to as a SIMD instruction and an execution unit for executing the SIMD instruction may be referred to as a vector execution unit.
At present, the function of program acceleration by using SIMD instructions is mainly realized by an automatic vectorization mode of a vectorization option of a compiler. As shown in fig. 6, the compiler is a computer program that can convert a source program (i.e., a program to be compiled) written in a high-level programming language into another programming language, i.e., a target language. Wherein vectorization refers to the conversion of optimizable SISD instructions into corresponding SIMD instructions.
The compiler is logically divided into a front-end and a back-end, the front-end consisting mainly of parts related to the source program but not related to the target machine (i.e. the device for running the program), typically comprising lexical, grammatical, semantic analysis, intermediate code generation, etc. The back end mainly comprises parts related to the target machine, including code optimization, target code generation and the like related to the target machine.
Notably, a Tree SSA optimization framework for supporting the compiler to implement automatic vectorization of source programs is deployed at the backend, SSA refers to static single assignment (STATIC SINGLE ASSIGNMENT).
Taking GNU compiler sets (the GNU compiler collection, GCC) as an example, based on SSA optimization framework, the following compiler optimization options can be used to implement automatic vectorization of the corresponding functions to the source program:
automatic vectorization of loop code in the source program can be achieved using the-ftree-loop-vectorize option;
Using the-ftree-SLP-vectorize option, automatic vectorization of common serial code in the source program can be achieved using super word LEVEL PARALLELISM (SLP) techniques;
Using the-ftree-vectorize or-O3 options, the-ftree-loop-vectorize and-ftree-slp-vectorize options described above can be enabled simultaneously, enabling automatic vectorization of loop code and common serial code in the source program.
Based on the automatic vectorization mode of the vectorization option of the compiler, all SISD instructions which can be optimized after being analyzed by the compiler in the source program are completely replaced by SIMD instructions. However, when the optimizable SISD instruction is replaced by the SIMD instruction, the vector execution unit may be operated under high load, and the scalar execution unit is in a light load state, so that loads of the scalar execution unit and the vector execution unit are uneven, and the average utilization rate of the execution units is low.
In order to solve the above technical problems, an embodiment of the present application provides an instruction conversion method, which is applied to a computer device, where the computer device includes a first type execution unit and a second type execution unit, the first type execution unit is used for executing a first type instruction, and the second type execution unit is used for executing a second type instruction. The method comprises the steps that computer equipment obtains an instruction set, the instruction set comprises at least two instructions, the type of any one instruction in the at least two instructions is a first type or a second type, and the computer equipment converts part or all of the first type of instructions in the instruction set into the second type of instructions according to a strategy rule, and the strategy rule enables load balancing of a first type of execution units and a second type of execution units. This method will be described in detail in the following examples, and will not be described here again.
It should be noted that, in the embodiment of the present application, the first type and the second type are different types, for example, the first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, and accordingly, the first type of execution unit is a vector execution unit, and the second type of execution unit is a scalar execution unit. Or the first type of instruction is a SISD instruction, the second type of instruction is a SIMD instruction, and correspondingly, the first type of execution unit is a scalar execution unit and the second type of execution unit is a vector execution unit.
In the embodiment of the application, when the instruction conversion is executed, part or all of the first type of instructions can be converted into the second type of instructions according to the policy rules, and the policy rules can balance the loads of the first type of execution units and the second type of execution units, so that the load balancing method and the load balancing device can balance the loads of the first type of execution units and the second type of execution units, improve the utilization rate of the execution units, and further improve the performance of computer equipment.
The following describes embodiments of the present application in detail with reference to the drawings.
As shown in fig. 7, a schematic structural diagram of a communication apparatus 700 according to an embodiment of the present application is provided, where the communication apparatus 700 may be a computer device or a chip or a system on a chip in the computer device. As shown in fig. 7, the communication device 700 includes a processor 701, a transceiver 702, and a communication line 703.
Further, the communication device 700 may also include a memory 704. The processor 701, the memory 704, and the transceiver 702 may be connected by a communication line 703.
The processor 701 is a central processing unit (central processing unit, CPU), a general-purpose processor, a network processor (network processor, NP), a digital signal processor (DIGITAL SIGNAL processing, DSP), a microprocessor, a microcontroller, a programmable logic device (programmable logic device, PLD), or any combination thereof. The processor 701 may also be any other device having processing functions, such as, without limitation, a circuit, a device, or a software module. Alternatively, the processor 701 may be an out-of-order multiple processor.
A transceiver 702 for communicating with other devices or other communication networks. The other communication network may be an ethernet, a radio access network (radio access network, RAN), a wireless local area network (wireless local area networks, WLAN), etc. The transceiver 702 may be a module, circuitry, transceiver, or any device capable of communicating.
Communication line 703 is used to transfer information between the components included in communication device 700.
Memory 704 for storing instructions. Wherein the instructions may be computer programs.
The memory 704 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device capable of storing static information and/or instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device capable of storing information and/or instructions, an EEPROM, a CD-ROM (compact disc read-only memory) or other optical disk storage, an optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), a magnetic disk storage medium or other magnetic storage device, etc.
It should be noted that the memory 704 may exist separately from the processor 701 or may be integrated with the processor 701. Memory 704 may be used to store instructions or program code or some data, etc. Memory 704 may be located within communication device 700 or external to communication device 700, without limitation. The processor 701 is configured to execute instructions stored in the memory 704 to implement a multi-path communication method according to the following embodiments of the present application.
In one example, processor 701 may include one or more CPUs, such as CPU0 and CPU1 in fig. 7.
As an alternative implementation, the communication device 700 includes multiple processors, e.g., the processor 707 may be included in addition to the processor 701 in fig. 7.
As an alternative implementation, the communication apparatus 700 further comprises an output device 705 and an input device 706. By way of example, input device 706 is a keyboard, mouse, microphone, or joystick, and output device 705 is a display screen, speaker (speaker), or the like.
It should be noted that the communication apparatus 700 may be a desktop computer, a portable computer, a web server, a mobile phone, a tablet computer, a wireless terminal, an embedded device, a chip system, or a device having a similar structure as in fig. 7. Further, the constituent structure shown in fig. 7 does not constitute a limitation of the communication apparatus, and the communication apparatus may include more or less components than those shown in fig. 7, or may combine some components, or may be arranged in different components, in addition to those shown in fig. 7.
Fig. 8 is a schematic structural diagram of a processor 701 according to an embodiment of the present application. The processor 701 includes a first instruction cache (instruction cache, IR) 7011a, a second IR7011b, a scheduler 7012, an instruction fetch unit 7013, a decode unit 7017, an instruction issue unit 7014, a first micro instruction cache queue 7015a, a second micro instruction cache queue 7015b, a first type of execution unit 7016a, and a second type of execution unit 7016b.
It will be appreciated that the processor 701 may include a plurality of execution units 7016a of a first type and a plurality of execution units 7016b of a second type, and that fig. 8 is merely illustrative and includes one execution unit 7016a of a first type and one execution unit 7016b of a second type.
Wherein the first IR7011a is used to store a first type of instruction and the second IR7011b is used to store a second type of instruction. Alternatively, storing the first type of instruction and the second type of instruction in order may be accomplished by an instruction type bit and/or an instruction location cache table.
The scheduler 7012 is configured to monitor a status (such as a dequeue rate and a number of micro operations in a queue) of the first micro instruction cache queue 7015a and the second micro instruction cache queue 7015b, and output a fetch policy to the fetch unit 7013 according to the status, where the fetch policy may be, for example, a number of instructions of the first type or a number of instructions of the second type to be fetched.
The first and second micro instruction cache queues 7015a, 7015b are used to feedback the state of the current queue to the scheduler 7012.
Alternatively, the first micro instruction cache queue 7015a may be a micro instruction cache queue corresponding to each of the existing first type execution units, or may be a micro instruction cache queue shared by all the first type execution units. Similarly, the second micro instruction cache queue 7015b may be a micro instruction cache queue corresponding to each of the existing second type execution units, or may be a micro instruction cache queue shared by all the second type execution units.
The instruction fetching unit 7013, the decoding unit 7017, the instruction transmitting unit 7014, the first type of executing unit 7016a, and the second type of executing unit 7016b can refer to the related descriptions in the prior art, and this step is repeated.
In the embodiment of the application, the chip system can be composed of chips, and can also comprise chips and other discrete devices.
Further, actions, terms, and the like, which are referred to between embodiments of the present application, are not limited thereto. The message names of interactions between the devices or parameter names in the messages in the embodiments of the present application are just an example, and other names may be used in specific implementations without limitation.
It will be appreciated that in embodiments of the present application, a computer device may perform some or all of the steps in embodiments of the present application, which are merely examples, and embodiments of the present application may also perform other operations or variations of the operations. Furthermore, the various steps may be performed in a different order presented in accordance with embodiments of the application, and it is possible that not all of the operations in the embodiments of the application may be performed.
In the embodiment of the present application, before the computer device performs instruction conversion, the computer device may first perform evaluation of the instruction implementation manner by using the method shown in fig. 9. As shown in fig. 9, the evaluation method may include the steps of:
S901, the computer device determines one or more operations to be evaluated.
Operations are understood to mean operations on data, such as addition or multiplication. Each of the one or more operations to be evaluated may be implemented on the computer device by either SISD instructions or SIMD instructions.
S902, the computer device determines whether there is an unevaluated operation.
If one or more operations in S901 are all evaluated, the evaluation is finished, and if there are non-evaluated operations in S901, the following step S903 is executed.
S903, the computer device determines whether instruction 1 and instruction 2 can be executed in parallel on the computer device.
Wherein, instruction 1 is a SISD instruction for implementing operation i, instruction 2 is a SIMD instruction for implementing operation i, and operation i is the operation of the current evaluation. It is understood that operation i belongs to one or more operations to be evaluated in S901.
Alternatively, whether instruction 1 and instruction 2 are executable in parallel on a computer device may be understood as whether the computer device supports parallel execution of instruction 1 and instruction 2 on the computer device.
If instruction 1 and instruction 2 can be executed in parallel on the computer device, step S904a described below is executed, or if instruction 1 and instruction 2 cannot be executed in parallel on the computer device, step S904b described below is executed.
S904a, determining the quantitative ratio of instruction 1 and instruction 2.
The number ratio of the instruction 1 to the instruction 2 is the number ratio (or called the combination ratio) of the instruction 1 to the instruction 2 which is used for realizing any operation i at the time of optimal performance. Alternatively, the quantitative ratio is a ratio of a value of 1 to a value of 2.
For a value of 1, the number of operations i that can be performed by one instruction 1Number of scalar execution units included in computer device for executing instruction 1Time difference 1 is determined, time difference 1 being the length of time used to implement operation i n times using instruction 1And the duration used to implement operation i m times using instruction 2And the time difference between the two is that n and m are positive integers, and n is larger than m.
Alternatively to this, the method may comprise,The following formula is satisfied:
wherein, Is a value of 1.It is understood that a scalar execution unit executing instruction 1 uses the length of time it is executing an instruction 1.
For a value of 2, the number of operations i that can be performed by one instruction 2Number of scalar execution units included in computer device for executing instruction 1Time difference 1 is determined, time difference 1 being the length of time used to implement operation i n times using instruction 1And the duration used to implement operation i m times using instruction 2Time difference between them.
Alternatively to this, the method may comprise,The following formula is satisfied:
wherein, Is a number of 2.It is understood that the duration used by a vector execution unit for executing instruction 2 to execute an instruction 2.
Alternatively, the aboveMay be measured after a performance monitoring unit (performance monitor unit, PMU) of the computer device binds the cores and empties the pipeline.
That is, the number ratio of instruction 1 and instruction 2 satisfies the following formula:
S904b, the computer device determines the best instruction implementation of operation i.
Alternatively, the computer device may determine the best implementation based on comparing the magnitudes of the values 1 and 2, e.g., the computer device determines the instruction implementation corresponding to the smaller of the values 1 and 2 as the best instruction implementation for operation i.
For example, if value 1 is less than value 2, the computer device may determine the SISD instruction as the best instruction implementation of operation i, i.e., select instruction 1 to implement operation i in the event that instructions 1 and 2 cannot be executed in parallel.
S905, the computer equipment records the evaluation result.
Alternatively, if the computer device executes S904a and then executes S905, the evaluation result may be the number ratio of the instruction 1 and the instruction 2 corresponding to the operation i, for example, the evaluation result is [ operation i:10:11], which indicates that the number ratio of the SISD instruction and the SIMD instruction implementing the operation i is 10:11. Or if the computer device executes S904b and then executes S905, the evaluation result may be the best instruction implementation corresponding to the operation i, for example, the evaluation result is [ operation i: SISD ], which indicates that the best instruction implementing the operation i is a SISD instruction.
It is understood that after S905, the computer device may continue to execute step S902, most in order to complete the evaluation of all the operations determined in S901.
Based on the method, the optimal quantity ratio of SISD instructions and SIMD instructions for realizing a certain operation can be evaluated by combining with the hardware characteristics of the computer equipment, or the optimal instruction realization mode for realizing a certain operation is evaluated, so that program optimization or instruction conversion can be performed subsequently based on the evaluation result obtained by the evaluation, the loads of a scalar execution unit and a vector execution unit are well balanced, and the performance of the computer equipment is improved.
The following describes in detail the instruction conversion method provided in the embodiment of the present application with reference to fig. 1 to 9.
Fig. 10 is a flow chart of an instruction conversion method according to an embodiment of the present application, where the instruction conversion method is applied to a computer device, and the computer device includes a first type of execution unit and a second type of execution unit, where the first type of execution unit is used for executing a first type of instruction, and the second type of execution unit is used for indicating a second type of instruction. The first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, or the first type of instruction is a SISD instruction, and the second type of instruction is a SIMD instruction. As shown in fig. 10, the instruction conversion method may include the steps of:
s1001, the computer equipment acquires an instruction set according to a source program.
The instruction set comprises at least two instructions, wherein the type of any one instruction in the at least two instructions is a first type or a second type. I.e. either one instruction is a SISD instruction or a SIMD instruction.
It will be appreciated that any one of the instructions included in the instruction set may be converted from a first type of instruction to a second type of instruction, or may be converted from a second type of instruction to a first type of instruction.
Alternatively, the computer device may obtain the instruction set according to the source program, where the computer device compiles the source program based on an existing compiling manner to obtain the instruction set. The computer device may determine, as the instruction in the instruction set, an instruction capable of converting an instruction type, for example, an instruction capable of converting an instruction type into a SISD instruction capable of being optimized as a SIMD instruction and/or a SIMD instruction capable of being converted into a SISD instruction, after compiling the source program based on an existing compiling manner to obtain a first assembly file of the program.
Alternatively, the existing compilation mode may be an automatic vectorized compilation mode using vectorization options, i.e., converting the optimizable SISD instructions all into SIMD instructions when compiling the source program. In this scenario, the first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, and correspondingly, the first type of execution unit is a vector execution unit, and the second type of execution unit is a scalar execution unit. Or the existing compilation may be a compilation that does not use vectorization options, i.e., instruction optimization is not performed when compiling the source program. In this scenario, the first type of instruction is a SISD instruction, the second type of instruction is a SIMD instruction, and accordingly, the first type of execution unit is a scalar execution unit, and the second type of execution unit is a vector execution unit.
S1002, the computer equipment converts part or all of the instructions of the first type in the instruction set into instructions of the second type according to the policy rules.
Wherein the policy rules balance loads of the first class of execution units and the second class of execution units.
It will be appreciated that the instruction conversion method provided in the embodiment of the present application may also be understood as a source program compiling method different from the prior art.
In the embodiment of the application, when the instruction conversion is executed, part or all of the first type of instructions can be converted into the second type of instructions according to the policy rules, and the policy rules can balance the loads of the first type of execution units and the second type of execution units, so that the load balancing method and the load balancing device can balance the loads of the first type of execution units and the second type of execution units, improve the utilization rate of the execution units, and further improve the performance of computer equipment.
The policy rules provided by the embodiments of the present application and the manner in which the computer device performs instruction conversion according to the policy rules are described below.
In a first possible implementation manner, the policy rule indicates that an instruction of a first type in a first instruction relationship subtree is converted into an instruction of a second type, where the first instruction relationship subtree is a subtree of an instruction relationship tree formed by at least two instructions included in the instruction set. It is to be appreciated that the first instruction relationship subtree can include one or more instruction relationship subtrees.
Optionally, in the implementation manner, the computer device converts part or all of the instructions of the first type in the instruction set into the instructions of the second type according to the policy rules, and the method can include the steps that the computer device establishes an instruction relationship tree of at least two instructions included in the instruction set according to data correlation and the like, and converts the instructions of the first type in a first instruction relationship subtree in the instruction relationship tree into the instructions of the second type according to the policy rules.
Alternatively, the first instruction relationship subtree may be an instruction relationship subtree with more instructions of the first type included in the instruction relationship tree.
For example, assuming that the first type of instruction is a SIMD instruction and the second type of instruction is a SISD instruction, the instruction set includes four instructions, where a+b=c, a/b=d, c×d=f, and h+i=j, respectively, and each of the four instructions is implemented by a SIMD instruction, an instruction relationship tree constructed by the computer device may be as shown in fig. 11. At this time, the first relational subtree selected by the computer device according to the policy rule may be a left subtree, i.e. a+b=c, a/b=d, c×d=f is converted into the SISD instruction.
In a second possible implementation, the policy rule indicates that part of the first instructions in the first instruction set is converted into M second instructions, where M is a positive integer. The first instruction set is an instruction subset included in the instruction set, the instruction subset included in the instruction set classifies operations implemented by instructions, that is, all instructions included in one instruction subset of the instruction set implement the same operations, and instructions included in different instruction subsets implement different operations.
The initial set of the first instruction set comprises at least two first instructions, wherein the first instructions are first type instructions for realizing first operations, and the second instructions are second type instructions for realizing the first operations. That is, the policy may be applied to a subset of instructions that includes at least two instructions of the first type.
Optionally, after the computer device obtains the instruction set, at least two instructions included in the instruction set may be divided into one or more instruction subsets according to an operation implemented by the instructions, and then instruction conversion is performed according to the policy rule. Illustratively, based on the example shown in FIG. 11, the computer device may divide the instruction set into three instruction subsets, which may be, for example, instruction subset 1: { a+b=c, h+i=j }, instruction subset 2: { a/b=d }, instruction subset 3: { c×d=f, respectively. And instruction subset 1 includes two instructions of the first type, then the policy rules may be used to perform instruction conversion on instructions in instruction subset 1.
Further optionally, the policy rules indicating that part of the first instructions in the first instruction set are converted into M second instructions may include the policy rules indicating that part of the first instructions in the first instruction set are converted into M second instructions in case the first instructions and the second instructions are executed in parallel.
Alternatively, parallel execution of the first instruction and the second instruction may be understood as a computer device supporting the first instruction and the second instruction in parallel, and at least two first instructions are included in an initial set of the first instruction set. This is because in the case where the initial set of the first instruction set includes one first instruction, if the first instruction is converted into a second instruction, there will be no first instruction in the whole instruction set, and thus there will be no case where the first instruction and the second instruction are executed in parallel.
Optionally, the ratio of M to K is determined by the first value and the second value, where K is the number of the other part of the first instructions in the first instruction set, i.e. K is the number of the first instructions in the first instruction set that are not converted into the second instructions.
The first value is determined by a first duration and a first number, wherein the first duration is a duration used by a first execution unit to execute a first instruction, and the first number is a number of first execution units included in the computer device. It will be appreciated that since the first instruction is a first type of instruction for performing the first operation, the first execution unit is a first type of execution unit for executing the first instruction.
Optionally, the first value, the first duration, and the first quantity satisfy the following formula:
wherein, At a first value, T 1 is a first duration,Is a first number.
Optionally, the first time length is determined by a first number of times and a first time difference, the first number of times is the number of times of a first operation implemented by one first instruction, the first time difference is a time length used for implementing n times of the first operation by using the first instruction, and a time length used for implementing m times of the first operation by using the first instruction, n and m are positive integers, and n is greater than m.
Optionally, the first duration, the first number of times, and the first time difference satisfy the following formula:
Wherein T 1 is a first duration, For the first number of times,For the first time difference of the first time difference,The duration used to implement n first operations using the first instruction,The duration used to implement m first operations using the first instruction.
And the second numerical value is determined by a second duration and a second number, wherein the second duration is used by a second execution unit to execute a second instruction, and the second number is the number of second execution units included in the computer device. It will be appreciated that since the second instruction is a second type of instruction for performing the first operation, the second execution unit is a second type of execution unit for executing the second instruction.
Optionally, the second value, the second duration, and the second number satisfy the following formula:
wherein, At a second value, T 2 is a second duration,A second number.
Optionally, the second duration is determined by a second number of times and a second time difference, where the second number of times is the number of times of the first operation implemented by one second instruction, the second time difference is a time difference between a duration used for implementing n times of the first operation by using the second instruction and a duration used for implementing m times of the first operation by using the second instruction, n and m are positive integers, and n is greater than m.
Optionally, the second duration, the second number of times, and the second time difference satisfy the following formula:
Wherein T 2 is a second time period, For the second number of times,For the second time difference to be a second time difference,The duration used to implement n first operations using the second instruction,The duration used to implement m first operations using the second instruction.
Finally, the ratio gamma of M to K is:
In connection with the above description, it may be understood that the first value and the second value may be evaluation results obtained by the computer device when performing the operation evaluation shown in fig. 9, and when performing instruction conversion on the first instruction set, the computer device may query the stored evaluation results corresponding to the first operation.
Further optionally, the ratio of M to K is determined by a first value to a second value, and may include that the ratio of M to K is determined by the first value, the second value, and an instruction relationship tree formed by at least two instructions included in the instruction set. That is, when determining the ratio of M to K, or determining the value of M to K, the computer device needs to consider not only the first value and the second value, but also an instruction relationship tree formed by at least two instructions included in the instruction set, that is, the computer device determines the ratio of M to K by combining the first value, the second value and the instruction relationship tree.
In a third possible implementation manner, the policy rule indicates that all third instructions in the second instruction set are converted into fourth instructions when the third value is greater than or equal to the fourth value, and that the third instructions in the second instruction set are not converted into fourth instructions when the third value is less than or equal to the fourth value. The second instruction set is an instruction subset included in the instruction set, the instruction subset included in the instruction set classifies operations implemented by instructions, that is, all instructions included in one instruction subset of the instruction set implement the same operations, and instructions included in different instruction subsets implement different operations.
Wherein the initial set of second instruction sets includes at least one third instruction, the third instruction being of a first type for effecting the second operation, the fourth instruction being of a second type for effecting the second operation. That is, the policy may be applied to a subset of instructions that includes at least one instruction of the first type. Illustratively, based on the example shown in fig. 11, the second instruction set may be instruction subset 2: { a/b=d } or instruction subset 3: { c x d=f }.
Optionally, for the third value, the third value is determined by a third duration and a third number, where the third duration is a duration used by a third execution unit to execute a third instruction, and the third number is a number of third execution units included in the computer device. It will be appreciated that since the third instruction is an instruction of the first type for performing the third operation, the third execution unit is the first type execution unit for executing the third instruction. Wherein, the third numerical value, the third duration and the third quantity satisfy the relation, and the related description of the first numerical value can be referred to.
Optionally, the third duration is determined by a third number of times and a third time difference, where the third number of times is the number of times of the second operation implemented by the third instruction, the third time difference is a time difference between a duration used for implementing n times of the second operation by using the third instruction and a duration used for implementing m times of the second operation by using the third instruction, n and m are positive integers, and n is greater than m. The third duration, the third number of times, and the relationship satisfied by the third time difference may be referred to the related description of the first duration.
Optionally, for the fourth value, the fourth value is determined by a fourth duration and a fourth number, the fourth duration being a duration used by a fourth execution unit to execute a fourth instruction, the fourth number being a number of fourth execution units included in the computer device. It will be appreciated that since the fourth instruction is a second type of instruction for performing the second operation, the fourth execution unit is a second type of execution unit for executing the fourth instruction. The relation that the fourth value, the fourth duration and the fourth quantity meet can be referred to the related description of the second value.
Optionally, the fourth time length is determined by a fourth number of times and a fourth time difference, the fourth number of times is the number of times of the second operation implemented by one fourth instruction, the fourth time difference is a time difference between a time length used for implementing n times of the second operation by using the fourth instruction and a time length used for implementing m times of the second operation by using the fourth instruction, n and m are positive integers, and n is greater than m. The fourth duration, the fourth number of times, and the fourth time difference satisfy the relationship, and reference may be made to the description related to the second duration.
Further optionally, the policy rules indicating that all third instructions in the second set of instructions are converted to fourth instructions if the third value is greater than or equal to the fourth value may include the policy rules indicating that all third instructions in the second set of instructions are converted to fourth instructions if the third instructions and the fourth instructions are not executed in parallel and the third value is greater than or equal to the fourth value.
Alternatively, the third instruction and the fourth instruction may not be executed in parallel, as understood by the computer device not supporting parallel execution of the third instruction and the fourth instruction, or by only including one third instruction in the initial set of the second instruction set. This is because in the case where the initial set of the second instruction set includes one third instruction, if the third instruction is converted into a fourth instruction, there will be no third instruction in the entire instruction set, and thus there will be no case where the third instruction and the fourth instruction are executed in parallel.
It will be appreciated that the magnitudes of the third and fourth values may reflect the performance of implementing the second operation with instructions of the first type and the performance of implementing the second operation with instructions of the second type, the smaller the value, the better the performance, and thus the policy rule may be considered to select the instruction type with better performance to implement the second operation.
In connection with the above description, it may be understood that the third value and the fourth value may be obtained when the computer device performs the operation evaluation shown in fig. 9, when recording the evaluation result, if the computer device records an instruction implementation manner corresponding to a smaller value of the third value and the fourth value, when performing instruction conversion on the second instruction set, the computer device may query the stored instruction implementation manner corresponding to the second operation, if the instruction implementation manner is of the first type, it is indicated that the third value is smaller than or equal to the fourth value, and the computer device does not convert the third instruction in the second instruction set into the fourth instruction according to the policy rule, or if the instruction implementation manner is of the second type, it is indicated that the third value is greater than or equal to the fourth value, and the computer device converts all the third instructions in the second instruction set into the fourth instruction according to the policy rule.
Next, a method for performing instruction conversion by the computer device according to the policy rules in the second implementation manner and/or the policy rules in the third implementation manner will be described, that is, a step S1002 shown in fig. 10 will be described in an expanded manner.
As shown in fig. 12, step S1002 shown in fig. 10 may include the steps of:
s1002a, the computer equipment constructs an instruction relation tree of at least two instructions included in an instruction set.
Illustratively, based on the above example, the instruction relationship tree constructed by the computer device may be as shown in fig. 11.
S1002b, the computer device divides the instruction set into one or more instruction subsets according to the operation type.
Wherein all instructions included in one subset of instructions of the instruction set implement the same operation, and instructions included in different subsets of instructions implement different operations.
Illustratively, based on the example shown in FIG. 11, the computer device may divide the instruction set into three instruction subsets, which may be, for example, instruction subset 1: { a+b=c, h+i=j }, instruction subset 2: { a/b=d }, instruction subset 3: { c×d=f, respectively.
S1002c, the computer equipment pre-judges the instruction conversion scheme of each instruction subset according to the policy rules.
Optionally, when the instruction subset includes at least two instructions of the first type and the instructions of the first type for implementing the operation corresponding to the instruction subset are executable in parallel, the computer device may pre-determine the instruction conversion scheme of the instruction subset according to the policy rules provided by the second possible implementation manner. I.e. the computer device may pre-judge the ratio of the number of instructions of the second type and the number of instructions of the first type comprised by the subset of instructions after conversion according to the policy rules.
By way of example, assuming that the first type of instruction and the second type of instruction implementing the add operation may be executed in parallel, the computer device may anticipate the instruction conversion scheme of instruction subset 1 described above according to policy rules provided by a second possible implementation. If the number ratio of the first type of instructions and the second type of instructions of the addition operation obtained by the evaluation method shown in fig. 9 is 1:1, the computer device may pre-determine the instruction conversion scheme of the instruction subset 1 to convert one first type of addition instruction of the instruction subset 1 into a second type of addition instruction.
Optionally, when the instruction subset including only one first type of instruction, or the first type of instruction and the second type of instruction for implementing the operation corresponding to the instruction subset are not executable in parallel, the computer device may pre-determine the instruction conversion scheme of the instruction subset according to the policy rules provided by the third possible implementation manner. I.e. the computer device may pre-decide whether to translate the first type of instruction in the subset of instructions according to the policy rules.
By way of example, assuming that the first type of instruction and the second type of instruction implementing the multiplication and division operations may not be executed in parallel, the computer device may pre-determine the instruction conversion schemes of instruction subset 2 and instruction subset 3 described above according to a policy rule of one of the third possible implementations. If the instruction implementation manner of the multiplication operation and the division operation obtained by the evaluation method shown in fig. 9 is the second type of instruction, the computer device may pre-determine the instruction conversion schemes of the instruction subset 2 and the instruction subset 3 to convert the first type of instruction in the instruction subset 2 and the instruction subset 3 into the second type of instruction.
S1002d, the computer device performs instruction conversion in combination with the instruction relationship tree and the instruction conversion schemes of the respective instruction subsets.
Alternatively, the computer device may combine the instruction relationship tree in S1002a, and target an instruction conversion scheme that satisfies as far as possible the respective instruction subsets predicted in S1002c, to convert part or all of the first type of instructions of the instruction set in S1002a into the second type of instructions.
For example, based on the above example, the instruction conversion scheme of the instruction subset 1 is to convert one of the first type of instructions into the second type of instructions, the conversion schemes of the instruction subset 2 and the instruction subset 3 are to convert all of the first type of instructions into the second type of instructions, and in combination with the instruction relationship tree shown in fig. 11, since the types of instructions that generally require the same instruction relationship subtree are the same, the computer device may convert a+b=c in the instruction subset 1 from the first type of instructions into the second type of instructions, and simultaneously convert a/b=d and c×d=f from the first type of instructions into the second type of instructions.
Optionally, after the computer device completes the instruction conversion, the second assembly file of the source program may be output according to a result of the instruction conversion. So far, the source program compiling process provided by the embodiment of the application can be considered to be ended.
In an implementation scenario of the present application, when the source program is executed, the computer device loads the second assembly file, and at this time, on the basis of the method shown in fig. 10 or fig. 12, as shown in fig. 13, in a case where the instruction set includes the first type of instruction after the instruction conversion is executed, the instruction conversion method may further include steps S1003a to S1005a, and in a case where the instruction set includes the second type of instruction, the instruction conversion method may further include steps S1003b to S1005b:
S1003a, the computer device stores the first type of instruction in the first instruction cache.
S1004a, the computer equipment acquires the state information of the first micro instruction cache queue.
The first micro instruction cache queue is used for caching micro operations of the first type of instructions.
Alternatively, from the processor perspective of the computer device, the step S1004a may send the state information of the first micro instruction cache queue to the scheduler for the first micro instruction cache queue. Accordingly, the scheduler receives status information from the first micro instruction cache queue.
S1005a, the computer device determines, according to the status information of the first micro instruction cache queue, the number of the first type of instructions to be obtained from the first instruction cache.
Optionally, the status information of the first micro instruction cache queue may be used to indicate an amount of free space in the first micro instruction cache queue. Correspondingly, the computer equipment determines the quantity of the first type of instructions to be acquired from the first instruction cache according to the state information of the first micro instruction cache queue, and can comprise determining the quantity of the free space in the first micro instruction cache queue as the quantity of the first type of instructions to be acquired from the first instruction cache.
Alternatively, from the processor perspective of the computer device, the step S1005a may be for the scheduler to determine, according to the status information of the first microinstruction cache queue, the number of instructions of the first type to be fetched from the first instruction cache. The scheduler then sends the number to the instruction fetch unit, which fetches a corresponding number of instructions of the first type from the first instruction cache.
S1003b, the computer device stores the second type of instruction in the second instruction cache.
S1004b, the computer equipment acquires the state information of the second micro instruction cache queue.
The second micro instruction cache queue is used for caching micro operations of the second type of instructions.
Alternatively, from the processor perspective of the computer device, the step S1004b may send the state information of the second micro instruction cache queue to the scheduler for the second micro instruction cache queue. Accordingly, the scheduler receives status information from the second micro instruction cache queue.
S1005b, the computer device determines, according to the status information of the second micro instruction cache queue, the number of instructions of the second type to be obtained from the second instruction cache.
Optionally, the status information of the second micro instruction cache queue may be used to indicate an amount of free space in the second micro instruction cache queue. Correspondingly, the computer equipment determines the number of the second type of instructions to be acquired from the second instruction cache according to the state information of the second micro instruction cache queue, and can comprise determining the number of the free space in the second micro instruction cache queue as the number of the second type of instructions to be acquired from the second instruction cache.
Alternatively, from the processor perspective of the computer device, this step S1005b may be for the scheduler to determine the number of instructions of the second type to be fetched from the second instruction cache based on the status information of the second micro instruction cache queue. The scheduler then sends the number to the instruction fetch unit, which fetches a corresponding number of instructions of the second type from the second instruction cache.
Illustratively, taking the first type of instruction as a SISD instruction and the second type of instruction as a SIMD instruction as an example, after adopting the method shown in fig. 13, the processor instruction pipeline of the computer device may be as shown in fig. 14.
Based on the scheme, the state information of the micro instruction cache queue can be acquired in real time, the number of the instructions of the corresponding types taken out from various instruction caches is dynamically adjusted according to the state information, and the parallel operation efficiency of the first type of execution units and the second type of execution units is improved, so that the performance of the computer equipment is further improved.
In summary, in the present application, the computer device may first execute the evaluation method shown in fig. 9 to evaluate each operation, and when executing the instruction conversion of the source program, pre-judge the instruction conversion scheme according to the evaluation result, and then complete the instruction conversion in combination with the instruction relationship tree. When the source program executes, the first type of instruction and the second type of instruction can be classified and stored, and the number of the corresponding types of instructions fetched from the first instruction cache and the second instruction cache can be dynamically adjusted.
It will be appreciated that the evaluation method shown in fig. 9 and the method of classifying storage and dynamically adjusting the number of fingers shown in fig. 13 may be performed independently of the method shown in fig. 10.
The above method will be described below with specific examples. Assuming that the source program of an application is mainly used for implementing multiple 32-bit unsigned integer addition operations, part of codes in the source program are as follows:
for(unsigned int i=0;i<len;i=i+4)
{
result[0]=bn[i]+bn[0];
result[1]=bn[i+1]+bn[0];
result[2]=bn[i+2]+bn[0];
result[3]=bn[i+3]+bn[0];
......
}
Before compiling the source program, the computer device can perform the method shown in fig. 9 to calculate the quantitative ratio of SISD instructions and SIMD instructions when implementing a 32-bit unsigned integer addition operation, here assuming that the quantitative ratio is 10:11.
Assuming that after the computer device compiles the source file based on the existing automatic vectorization compiling mode, the source code vectorizes the SIMD instruction of the 32-bit unsigned addition operation part in the first assembly file generated by the source code, namely, the instruction set acquired by the computer device comprises the following instructions:
add d15,d11,d10;
add d16,d12,d10;
add d17,d13,d10;
add d18,d14,d10;
Wherein d10-d18 are 128-bit SIMD special purpose registers, d10 stores bn [0], d11-d14 stores bn [ i ] -bn [ i+3] in each cycle, and d15-d18 stores the addition result to result [0] -result [3].
For the SIMD instruction, the computer equipment combines an instruction relation tree formed by the SIMD instruction and the estimated quantity ratio of the SISD instruction to the SIMD instruction when the 32-bit unsigned integer addition operation is realized is 10:11, and a conversion scheme is determined, so that the quantity of the converted SISD instruction and the SIMD instruction is as much as possible 10:11. It will be appreciated that since there is no dependency for the 4 SIMD instructions described above, in this example the computer device may determine the conversion scheme based solely on the quantitative ratio, e.g. the computer device determines to convert 2 of the 4 SIMD instructions described above to SISD instructions based on the quantitative ratio. Illustratively, the converted instruction may be as follows:
add x3,x1,x0;
add x4,x2,x0;
add d13,d11,d10;
add d14,d12,d10;
The first two instructions are SISD instructions obtained through conversion, and the second two instructions are untranslated SIMD instructions. x0-x4 are 64-bit general purpose registers, d10-d14 are 128-bit SIMD special purpose registers, bn [0] is stored in d10 and x0, x1 and x2 store bn [ i ] and bn [ i+1] in each cycle, d11 and d12 store bn [ i+2] and bn [ i+3] in each cycle, x3 and x4 store addition results result [0] and result [1] respectively, and d13 and d14 store addition results result [2] and result [3] respectively.
Subsequently, when the source program is executed, the computer device can execute the method shown in fig. 13 to acquire the state information of the micro instruction cache queue in real time, dynamically adjust the number of the corresponding type of instructions fetched from the first instruction cache and the second instruction cache, and improve the parallel running efficiency of the scalar execution unit and the vector execution unit.
In various embodiments of the application, where no special description or logic conflict exists, terms and/or descriptions between the various embodiments are consistent and may reference each other, and features of the various embodiments may be combined to form new embodiments based on their inherent logic.
It will be appreciated that in the various embodiments above, the methods and/or steps implemented by a computer device may also be implemented by components (e.g., chips or circuits) that may be used in a computer device.
The scheme provided by the embodiment of the application is introduced. Correspondingly, the embodiment of the application also provides a communication device which is used for realizing the various methods. The communication means may be the computer device in the above-described method embodiments, or an apparatus comprising the above-described computer device, or a component usable with the computer device. It will be appreciated that the communication device, in order to achieve the above-described functions, comprises corresponding hardware structures and/or software modules performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the computer equipment according to the embodiment of the method, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
For example, taking a communication device as an example of the computer device in the above method embodiment, fig. 15 shows a schematic structural diagram of a computer device 150. The computer device 150 includes an acquisition module 1501 and a conversion module 1502. Optionally, the computer device 150 further comprises a control module 1503.
The system comprises an acquisition module 1501, a conversion module 1502 and a policy rule, wherein the acquisition module 1501 is used for acquiring an instruction set according to a source program, the instruction set comprises at least two instructions, the type of any one instruction in the at least two instructions is a first type or a second type, and the conversion module 1502 is used for converting part or all of the first type of instructions in the instruction set into the second type of instructions according to the policy rule, and the policy rule enables the load of a first type of execution unit and a second type of execution unit to be balanced. The first type of instruction is a SIMD instruction, the second type of instruction is a SISD instruction, or the first type of instruction is a SISD instruction, and the second type of instruction is a SIMD instruction. Specific implementation may refer to the detailed descriptions of step S1001 to step S1002 in the embodiment shown in fig. 10, which is not repeated.
The specific implementation of the computer device 150 may refer to the behavior function of the computer device in the method shown in fig. 9, 10, 12, or 13. Specific implementation may refer to the detailed descriptions of step S901 to step S905 shown in fig. 9, or the detailed descriptions of step S1001 to step S1002 shown in fig. 10, or the detailed descriptions of step S1001 to step S1002d shown in fig. 12, or the detailed descriptions of step S1001 to step S1005a or step S1001 to step S1005b shown in fig. 13, which will not be repeated.
Optionally, in a specific embodiment, in a case where the instruction set includes a first type of instruction, the control module 1503 is configured to store the first type of instruction in the instruction set to the first instruction cache.
Optionally, the control module 1503 is further configured to obtain status information of a first micro instruction cache queue, where the first micro instruction cache queue is configured to cache micro operations of the first type of instructions, and the control module 1503 is further configured to determine, according to the status information of the first micro instruction cache queue, the number of the first type of instructions to be obtained from the first instruction cache.
Optionally, in a specific embodiment, the state information of the first micro instruction cache queue is used to indicate the number of free spaces in the first micro instruction cache queue, and the control module 1503 is further configured to determine, according to the state information of the first micro instruction cache queue, the number of the first type of instructions to be obtained from the first instruction cache, where the control module 1503 is further configured to determine the number of free spaces in the first micro instruction cache queue as the number of the first type of instructions to be obtained from the first instruction cache.
Optionally, in a specific embodiment, in a case where the instruction set includes a second type of instruction, the control module 1503 is configured to store the second type of instruction in the instruction set to the second instruction cache.
Optionally, the control module 1503 is further configured to obtain status information of a second micro instruction cache queue, where the second micro instruction cache queue is configured to cache micro operations of the second type of instructions, and the control module 1503 is further configured to determine, according to the status information of the second micro instruction cache queue, the number of the second type of instructions to be obtained from the second instruction cache.
Optionally, in a specific embodiment, the state information of the second micro instruction cache queue is used to indicate the number of free spaces in the second micro instruction cache queue, and the control module 1503 is further configured to determine, according to the state information of the second micro instruction cache queue, the number of instructions of the second type to be obtained from the second instruction cache, and may include the control module 1503 further configured to determine the number of free spaces in the second micro instruction cache queue as the number of instructions of the second type to be obtained from the second instruction cache.
All relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
Optionally, the computer device 150 may further include a processing module and a storage module (not shown in fig. 15), where the storage module is configured to store data and/or instructions, and the processing module may read the data or instructions in the storage module to implement a method corresponding to the foregoing embodiments.
It will be appreciated that the above modules may be provided independently or may be integrated, which is not limited in this embodiment of the present application.
In one possible approach, the computer device 150 is presented in the form of individual functional modules that are divided in an integrated manner. A "module" herein may refer to a particular ASIC, an electronic circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or other device that can provide the described functionality. In a simple embodiment, one skilled in the art will appreciate that the computer device 150 may take the form of the communications apparatus 700 shown in FIG. 7.
For example, the processor 701 in the communication device 700 shown in fig. 7 may cause the communication device 700 to perform the method shown in fig. 9, 10, 12, or 13 in the above-described method embodiments by invoking computer-executable instructions stored in the memory 704.
Specifically, the functions/implementation procedures of the acquisition module 1501, the conversion module 1502 and the control module 1503 in fig. 15 may be implemented by the processor 701 in the communication apparatus 700 shown in fig. 7 calling the computer-executable instructions stored in the memory 704.
Since the computer device 150 provided in the present embodiment can execute the above-mentioned instruction conversion method, the technical effects obtained by the method can be referred to the above-mentioned method embodiments, and will not be described herein.
Optionally, an embodiment of the present application further provides a communication device (for example, the communication device may be a chip or a chip system), where the communication device includes a processor, and the method is used to implement the method in any of the foregoing method embodiments. In one possible design, the communication device further includes a memory. The memory for storing the necessary program instructions and data, and the processor may invoke the program code stored in the memory to instruct the communication device to perform the method of any of the method embodiments described above. Of course, the memory may not be in the communication device. In another possible design, the communication device further includes an interface circuit, which is a code/data read/write interface circuit, for receiving computer-executable instructions (the computer-executable instructions being stored in a memory, possibly read directly from the memory, or possibly through other devices) and transmitting to the processor. When the communication device is a chip system, the communication device may be formed by a chip, or may include a chip and other discrete devices, which is not particularly limited in the embodiment of the present application.
Optionally, an embodiment of the present application further provides a chip, and fig. 16 is a schematic structural diagram of the chip 160 provided by the embodiment of the present application. The chip 160 includes one or more (including two) processors 1610 and a communication interface 1630.
Optionally, the chip 160 further includes a memory 1640, which memory 1640 may include read only memory and random access memory, and provides operating instructions and data to the processor 1610. A portion of the memory 1640 may also include non-volatile random access memory (non-volatile random access memory, NVRAM).
In an embodiment of the present application, processor 1610 may perform the corresponding operations in the methods shown in fig. 9, 10, 12, or 13 by invoking operational instructions stored in memory 1640 (which may be stored in an operating system).
In the alternative, processor 1610, communication interface 1630, and memory 1640 are coupled together by bus system 1620, which bus system 1620 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. But for clarity of illustration, the various buses are labeled as bus system 1620 in fig. 16.
Processor 1610 controls the processing operations of the computer device. The methods disclosed in the embodiments of the present application described above may be applied to the processor 1610 or implemented by the processor 1610. Processor 1610 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or instructions in software in processor 1610. The processor 1610 may be a general purpose processor, a Digital Signal Processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1640, and the processor 1610 reads information in the memory 1640 and performs the steps of the method described above in combination with its hardware.
In one possible implementation, the communication interface 1630 is used to perform the steps of receiving and transmitting by a computer device. Processor 1610 is used to perform the processing steps of the computer device.
The embodiment of the application also provides a computer readable storage medium. All or part of the flow in the above method embodiments may be implemented by a computer program to instruct related hardware, where the program may be stored in the above computer readable storage medium, and when the program is executed, the program may include the flow in the above method embodiments. The computer readable storage medium may be an internal storage unit of the terminal (including the data transmitting end and/or the data receiving end) of any of the foregoing embodiments, for example, a hard disk or a memory of the terminal. The computer-readable storage medium may be an external storage device of the terminal, such as a plug-in hard disk, a smart card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, or a flash memory card (FLASH CARD) provided in the terminal. Further, the computer-readable storage medium may further include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The above-described computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated, and in practical application, the above-described functional allocation may be implemented by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the method described in the embodiments of the present application. The storage medium includes various media capable of storing program codes such as a U disk, a mobile hard disk, a ROM, a RAM, a magnetic disk or an optical disk.
The foregoing is merely illustrative of specific embodiments of the present application, and the scope of the present application is not limited thereto, but any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.