Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions are also possible below.
As described above, existing thread group data interaction is performed mainly through shared memory. The system calculates the corresponding shared memory addresses from the indexes of the data held by the different threads, writes the data from the registers into the shared memory through a store instruction, and finally writes the data back into the registers through a load instruction. Because scheduling between threads is uncertain, all threads must complete the write operation before the read operation can be performed, which requires an additional synchronization operation, and writes to the same location can only be processed serially. The additional instructions increase latency and waste the bandwidth of the instruction memory. In addition, the index calculation for the shared memory is complex and requires extra computation to translate the indexes into the corresponding addresses. The index and address calculations therefore incur considerable overhead and delay, and the read and write operations on the shared memory themselves have long latency, all of which degrades program performance.
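For context, the following is a minimal CUDA sketch (not part of the disclosed scheme) of this conventional shared-memory approach for a single 32-thread warp; the array names reg0 and reg1 and the single-warp launch are assumptions made purely for illustration, to make the extra store, synchronization, and load steps explicit.

    // Illustrative sketch of the conventional shared-memory parity permutation
    // criticized above, for one 32-thread warp. reg0/reg1 are hypothetical global
    // arrays standing in for the per-thread registers holding 1..32 and 33..64.
    __global__ void parityPermuteViaShared(int *reg0, int *reg1)
    {
        __shared__ int buf[64];
        const int t = threadIdx.x;        // 0..31, single warp assumed

        // Store phase: compute shared-memory indices and spill both values.
        buf[t]      = reg0[t];            // values 1..32
        buf[t + 32] = reg1[t];            // values 33..64

        // Additional synchronization: all writes must finish before any read.
        __syncthreads();

        // Load phase: gather the de-interleaved values back into the registers.
        reg0[t] = buf[2 * t];             // odd column  1, 3, 5, ..., 63
        reg1[t] = buf[2 * t + 1];         // even column 2, 4, 6, ..., 64
    }

Even in this simple form, the address arithmetic, the barrier, and the two memory round trips are visible; the scheme described below avoids them.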
To at least partially address one or more of the above-mentioned problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for implementing data permutation in registers, in which arranging data stored in two registers, e.g., the consecutive data 1-64, may be achieved by utilizing two additional registers in a thread group and one staging register. Permutation may refer to reordering consecutive data, such as the consecutive data 1-64 stored in two registers of the 32 threads of one thread group (warp), into an odd column 1, 3, 5, ..., 63 and an even column 2, 4, 6, ..., 64. Eventually, the two registers store the corresponding odd column and even column, respectively.
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for data permutation according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes a computing device 110, a data storage device 130, and a network 140. The computing device 110 and the data storage device 130 may exchange data via the network 140 (e.g., the Internet).
The data storage device 130 may store, for example, a plurality of different types of data, such as consecutive numbers to be rearranged. The data storage device 130 may also transmit the stored data to the computing device 110. The data storage device may be a one-stop storage-and-computing structure running on one or more computer nodes for implementing high-concurrency, high-throughput query services, and may include special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs.
The computing device 110 may, for example, retrieve data from the data storage device 130 and perform the permutation on the received data. The rearranged odd and even columns may be stored in two registers within the same warp. The computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the data storage device 130 may be integrated together or may be separate from each other. In some embodiments, the computing device 110 includes, for example, an acquisition module 112, a shift module 114, an extraction module 116, a determination module 118, and a mapping module 120.
An acquisition module 112, the acquisition module 112 being configured to acquire the first auxiliary data and the second auxiliary data based on the number of threads of the thread group and the data arrangement requirement.
A shift module 114, the shift module 114 being configured to shift the data in the first register, the first auxiliary data, the data in the second register, and the second auxiliary data to the staging register.
An extraction module 116, the extraction module 116 being configured to perform extraction on the data in the staging register based on the address of the staging register, so as to obtain first data, second data, third data, and fourth data.
A determination module 118, the determination module 118 being configured to determine the mapping parameters based on the number of threads of the thread group and the data arrangement requirements.
A mapping module 120, the mapping module 120 being configured to map the third data to the first register and the fourth data to the second register based on the mapping parameters, thereby enabling data arrangement in the registers.
Fig. 2 illustrates a flow chart of a method 200 for implementing data permutation in registers according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 shown in Fig. 1, or at the electronic device 900 shown in Fig. 9. It should be understood that the method 200 may also include additional blocks not shown and/or that blocks shown may be omitted; the scope of the present disclosure is not limited in this respect.
In the context of the present application, a register may refer to a register in a thread group (warp), e.g., a register group comprising a first register and a second register. The register group may be located in the same thread or in different threads.
In one embodiment, a thread group (warp) may include, for example, 8, 16, 32, or 64 threads. For simplicity, the present disclosure is described with 32 threads. Note that thread groups with other numbers of threads are equally applicable to the solution described in the present disclosure.
At step 202, the computing device 110 may obtain the first auxiliary data and the second auxiliary data based on the number of threads of the thread group and the data arrangement requirement.
In one embodiment, the number of bits of the first auxiliary data and the second auxiliary data may be the same as the number of threads of the thread group. For example, if the number of threads of the thread group is 32, the first auxiliary data and the second auxiliary data may each be 32-bit data. For simplicity, the present context uses 32 threads as an illustration. As described above, the first register and the second register corresponding to the 32 threads store the data sequences 1, 2, 3, ..., 32 and 33, 34, 35, ..., 64, respectively. In the context of the present application, a data arrangement requirement may refer to the exchange, insertion, rearrangement, etc. of the data corresponding to the threads. For simplicity, the description below uses the rearrangement of the data held in the two registers of the 32 threads at an interval of one. For example, the data in the first register and the second register are swapped and reordered such that, after the reordering, the first register holds the odd column (1, 3, 5, 7, ..., 63) and the second register holds the even column (2, 4, 6, 8, ..., 64).
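As a compact restatement of this example requirement, the register contents before and after the permutation can be written as follows; this is an illustrative host-side helper in the same CUDA/C++ style as the later sketches, not part of the disclosure, and the name exampleArrangement is hypothetical.

    // Hypothetical helper stating the example data arrangement requirement:
    // before: R4[t] = t + 1  (1..32),   R5[t] = t + 33 (33..64)
    // after:  R4[t] = 2t + 1 (odd col), R5[t] = 2t + 2 (even col)
    static void exampleArrangement(int r4[32], int r5[32], bool before)
    {
        for (int t = 0; t < 32; ++t) {
            r4[t] = before ? (t + 1)  : (2 * t + 1);
            r5[t] = before ? (t + 33) : (2 * t + 2);
        }
    }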
In one embodiment, the first auxiliary data and the second auxiliary data may be any data used to populate registers. For example, the first auxiliary data and the second auxiliary data may be 32 consecutive data values, such as 1, 2, 3, ..., 32, or 32 identical values, such as 32 zeros. For ease of presentation, the present context uses 32 zeros as each of the first auxiliary data and the second auxiliary data.
At step 204, the computing device 110 may shift the data in the first register, the first auxiliary data, the data in the second register, and the second auxiliary data to the staging register.
In one embodiment, the computing device 110 may shift the data 1, 2, 3, ..., 32 in the first register R4, the 32 zeros of the first auxiliary data, the data 33, 34, 35, ..., 64 in the second register R5, and the 32 zeros of the second auxiliary data to the staging register X0. Fig. 3 shows a schematic diagram of the staging register after the shift according to an embodiment of the present disclosure. The data shifted into the staging register X0 will be extracted in a subsequent step, thereby realizing the data arrangement. Since the staging register X0 is a special register whose size is 4 times that of the first register, it can store the data of four such registers.
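The following host-side sketch models this shift step. The staging register X0 is a hardware resource in the actual scheme, so representing it as a 4 x 32 array (and the function name shiftToStaging) is an assumption made only to illustrate the resulting layout of Fig. 3.

    // Illustrative model of step 204: X0 is 4 times the size of a 32-entry
    // register, so it holds four rows of 32 values after the shift.
    static void shiftToStaging(int x0[4][32],
                               const int r4[32], const int aux1[32],
                               const int r5[32], const int aux2[32])
    {
        for (int t = 0; t < 32; ++t) {
            x0[0][t] = r4[t];     // row 1: data 1..32 from the first register R4
            x0[1][t] = aux1[t];   // row 2: the 32 zeros of the first auxiliary data
            x0[2][t] = r5[t];     // row 3: data 33..64 from the second register R5
            x0[3][t] = aux2[t];   // row 4: the 32 zeros of the second auxiliary data
        }
    }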
At step 206, the computing device 110 may perform extraction on the data in the staging register based on the address of the staging register X0, thereby obtaining the first data, the second data, the third data, and the fourth data.
In one embodiment, performing the extraction on the data in the staging register X0 may include constructing an array of data based on the number of threads (e.g., 32) of the thread group, wherein each element of the constructed array corresponds to one of the shifted data items.
Specifically, the computing device 110 may construct an array of data based on the 32-bit data of the first register and the second register. The array is constructed as a 4 x 8 array, thus dividing a 32-bit register into 4 groups of 8 bits. For example, the data shifted into the first row of the staging register X0 may be divided into 1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32. The data shifted into the second row of the staging register may be divided into four groups of eight zeros. The data shifted into the third row of the staging register may be divided into 33, 34, 35, ..., 40 and 41, 42, 43, ..., 48 and 49, 50, 51, ..., 56 and 57, 58, 59, ..., 64. The data shifted into the fourth row of the staging register may likewise be divided into four groups of eight zeros.
Subsequently, in one embodiment, the computing device 110 may combine the data of the staging register into a set of a plurality of such data columns, e.g., a set of 4 x 8 columns. Fig. 4 shows a schematic diagram of the data of the staging register organized as a set of multiple columns according to an embodiment of the present disclosure. As shown in Fig. 4, the data of the staging register X0 may be represented by a set of four 4 x 8 columns. For example, from the data of the first row of the staging register X0, four 8-bit data columns, that is, one 4 x 8 data column, of 1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32 can be constructed. From the data of the second row of the staging register X0, a data column of four groups of eight zeros can be constructed, and so on. Based on the constructed data columns, the data of the staging register may be combined into a set of, for example, four 4 x 8 data columns, wherein the 4 x 8 data column made up of the data of the first row is located at the upper left of the set, the 4 x 8 data column made up of the data of the second row is located at the upper right of the set, the 4 x 8 data column made up of the data of the third row is located at the lower left of the set, and the 4 x 8 data column made up of the data of the fourth row is located at the lower right of the set.
In the set composed of the plurality of data columns, the data in the upper left corner is the 4 x 8 column originally belonging to the first register (1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32), the data in the upper right corner is the 4 x 8 column originally belonging to the first auxiliary data (four groups of eight zeros), the data in the lower left corner is the 4 x 8 column originally belonging to the second register (33, 34, 35, ..., 40 and 41, 42, 43, ..., 48 and 49, 50, 51, ..., 56 and 57, 58, 59, ..., 64), and the data in the lower right corner is the 4 x 8 column originally belonging to the second auxiliary data (four groups of eight zeros).
The set may be represented in a coordinate system, wherein the coordinate values of the coordinate system comprise the abscissa and the ordinate to which each data item corresponds. Taking the upper left point as the origin, the y-axis extends to the right and the x-axis extends downward. Starting from the origin, in the first 2 x 2 matrix, the coordinate value corresponding to the data 1 is (0, 0), the coordinate value corresponding to the data 2 is (0, 1), the coordinate value corresponding to the data 9 is (1, 0), and the coordinate value corresponding to the data 10 is (1, 1); the coordinate values of the remaining data are determined in the same way.
Based on the determined coordinate values, the computing device 110 may perform the extraction on the combined set of the plurality of data columns. Specifically, the extraction is performed on the set composed of the plurality of data columns in the order of even abscissa and even ordinate, even abscissa and odd ordinate, odd abscissa and even ordinate, and odd abscissa and odd ordinate. The extracted data thus fall into four types, namely, the first data with even abscissa and even ordinate, the second data with even abscissa and odd ordinate, the third data with odd abscissa and even ordinate, and the fourth data with odd abscissa and odd ordinate.
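The extraction described above can be modeled on the host as building the 8 x 16 set (the 2 x 2 arrangement of 4 x 8 columns of Fig. 4) and reading it out by coordinate parity. The grid representation, the helper name, and the row-major read-out order are assumptions chosen so that the result matches Fig. 5; they are not the hardware extraction instruction itself.

    // Illustrative model of step 206: arrange the staging-register rows as a
    // 2 x 2 set of 4 x 8 blocks and extract by parity of the (x, y) coordinates.
    static void extractByParity(const int x0[4][32],
                                int first[32], int second[32],
                                int third[32], int fourth[32])
    {
        int grid[8][16];
        for (int r = 0; r < 4; ++r)
            for (int c = 0; c < 8; ++c) {
                grid[r][c]         = x0[0][8 * r + c];   // upper left:  1..32
                grid[r][c + 8]     = x0[1][8 * r + c];   // upper right: zeros
                grid[r + 4][c]     = x0[2][8 * r + c];   // lower left:  33..64
                grid[r + 4][c + 8] = x0[3][8 * r + c];   // lower right: zeros
            }

        int n0 = 0, n1 = 0, n2 = 0, n3 = 0;
        for (int x = 0; x < 8; ++x)                      // x: downward (rows)
            for (int y = 0; y < 16; ++y) {               // y: rightward (columns)
                const int v = grid[x][y];
                if (x % 2 == 0 && y % 2 == 0) first[n0++]  = v;  // -> R4: 1,3,5,7,0,...
                if (x % 2 == 0 && y % 2 == 1) second[n1++] = v;  // -> R5: 2,4,6,8,0,...
                if (x % 2 == 1 && y % 2 == 0) third[n2++]  = v;  // -> R6: 9,11,13,15,0,...
                if (x % 2 == 1 && y % 2 == 1) fourth[n3++] = v;  // -> R7: 10,12,14,16,0,...
            }
    }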
In one embodiment, the computing device 110 may also write the extracted first data, second data, third data, and fourth data to the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively. The third register R6 and the fourth register R7 will be used for mapping data in a subsequent step.
Fig. 5 shows a schematic diagram of extracted first, second, third, and fourth data according to an embodiment of the present disclosure. As shown in fig. 5, the extracted first data, second data, third data, and fourth data are written into the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively.
At step 208, the computing device 110 may determine mapping parameters based on the number of threads of the thread group and the data arrangement requirements.
In one embodiment, the mapping parameter is determined to be one eighth of the number of threads of the thread group. In the case where the number of threads of the thread group is 32, the mapping parameter k may be 4.
At step 210, computing device 110 may map the third data to the first register and the fourth data to the second register based on the mapping parameters, thereby implementing a data arrangement in the registers.
In one embodiment, the computing device 110 may apply a mask to the third register R6 and the fourth register R7. For example, the mask may be 0xf0f0f0f0, i.e., 11110000111100001111000011110000 in binary. With such a mask applied, the data is mapped once every 4 threads, i.e., the threads at positions where the applied mask bit is 1 may receive mapped data, and the threads at positions where the applied mask bit is 0 are not mapped.
In one embodiment, the computing device 110 maps the data in the masked third register R6 to the first register R4 every mapping parameter bits (i.e., every 4 bits) and maps the data in the masked fourth register R7 to the second register R5 every mapping parameter bits (i.e., every 4 bits), where the number of mapping parameter bits is determined based on the mapping parameter. For example, when the mapping parameter is 4, the mapping parameter bits are 4 bits.
For example, based on the applied mask (0xf0f0f0f0), the computing device 110 may map the data (9, 11, 13, 15) corresponding to threads T0-T3 in the third register R6 to threads T4-T7 of the first register R4, overwriting the 0, 0, 0, 0 in threads T4-T7. By analogy, the computing device 110 may map the data (25, 27, 29, 31) corresponding to threads T8-T11 in the third register R6 to threads T12-T15 of the first register R4, overwriting the 0, 0, 0, 0 in threads T12-T15; map the data (41, 43, 45, 47) corresponding to threads T16-T19 in the third register R6 to threads T20-T23 of the first register R4, overwriting the 0, 0, 0, 0 in threads T20-T23; and map the data (57, 59, 61, 63) corresponding to threads T24-T27 in the third register R6 to threads T28-T31 of the first register R4, overwriting the 0, 0, 0, 0 in threads T28-T31.
Similarly, the computing device 110 may map the data (10, 12, 14, 16) corresponding to threads T0-T3 in the fourth register R7 to threads T4-T7 of the second register R5, overwriting the 0, 0, 0, 0 in threads T4-T7; map the data (26, 28, 30, 32) corresponding to threads T8-T11 in the fourth register R7 to threads T12-T15 of the second register R5, overwriting the 0, 0, 0, 0 in threads T12-T15; map the data (42, 44, 46, 48) corresponding to threads T16-T19 in the fourth register R7 to threads T20-T23 of the second register R5, overwriting the 0, 0, 0, 0 in threads T20-T23; and map the data (58, 60, 62, 64) corresponding to threads T24-T27 in the fourth register R7 to threads T28-T31 of the second register R5, overwriting the 0, 0, 0, 0 in threads T28-T31.
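Steps 208 and 210 can be modeled as follows. The loop over thread positions and the helper name mapWithMask are assumptions for illustration; the mapping parameter, the mask value, and the 4-thread offset follow the description above.

    // Illustrative model of steps 208-210: mapping parameter k = 32 / 8 = 4;
    // mask 0xf0f0f0f0 marks threads T4-T7, T12-T15, T20-T23, T28-T31 as the
    // positions that receive mapped data, overwriting the zeros held there.
    static void mapWithMask(int r4[32], int r5[32],
                            const int r6[32], const int r7[32])
    {
        const int k = 32 / 8;                  // mapping parameter
        const unsigned mask = 0xf0f0f0f0u;
        for (int d = 0; d < 32; ++d) {
            if ((mask >> d) & 1u) {            // only masked positions are written
                r4[d] = r6[d - k];             // e.g. R6 T0-T3 -> R4 T4-T7
                r5[d] = r7[d - k];             // e.g. R7 T0-T3 -> R5 T4-T7
            }
        }
        // Result: r4 = 1, 3, 5, ..., 63 (odd column), r5 = 2, 4, ..., 64 (even column).
    }

In this host-side model, running shiftToStaging, extractByParity, and then mapWithMask on the example data reproduces the contents of Fig. 6.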
Fig. 6 shows a schematic diagram of a mapping result according to an embodiment of the present disclosure. As shown in Fig. 6, the first register R4 stores the data 1, 3, 5, 7, 9, ..., and the second register R5 stores the data 2, 4, 6, 8, 10, ..., thereby realizing the rearrangement of the data between the two registers, i.e., the parity rearrangement.
By employing the above means, the computing device 110 is able to quickly implement the arrangement of data by utilizing additional registers. Moreover, this does not require calculating addresses in the shared memory, and the data computed in the preceding steps can be reused, thereby reducing the power consumption of the arrangement.
In another embodiment, the computing device 110 may also map the data of the first mapping parameter bits of the third register R6 and the fourth register R7 directly to the first mapping parameter bits of a fifth register and a sixth register. For example, the computing device 110 may map the first mapping parameter bits, i.e., the first 4 bits of data, of the third register R6 and the fourth register R7 directly to the first mapping parameter bits, i.e., the first 4 bits, of the fifth register R8 and the sixth register R9. For example, the data 9, 11, 13, 15 of the third register R6 corresponding to threads T0-T3 are mapped directly to the fifth register R8, and the data 10, 12, 14, 16 of the fourth register R7 corresponding to threads T0-T3 are mapped directly to the sixth register R9.
Subsequently, the computing device 110 may map the data of the third register R6 and the fourth register R7 to the next mapping parameter bits of the fifth register R8 and the sixth register R9, i.e., to the positions offset by 4 bits. Since the first 4 bits of the fifth register R8 and the sixth register R9 have already received the first 4 bits of data of the third register R6 and the fourth register R7, the data of the last mapping parameter bits of the third register R6 and the fourth register R7 are not mapped, i.e., the data corresponding to the last 4 threads (threads T28-T31) are not mapped to the fifth register and the sixth register. Fig. 7 shows a schematic diagram of the third register R6 and the fourth register R7 mapped to the fifth register R8 and the sixth register R9 according to an embodiment of the present disclosure. As shown in Fig. 7, for example, the computing device 110 maps the data of the third register R6 corresponding to threads T0-T27 to the fifth register R8. At the same time, the computing device 110 maps the data of the fourth register R7 corresponding to threads T0-T27 to the sixth register R9.
Subsequently, the computing device 110 may apply a mask to the fifth register R8 and the sixth register R9. As described above, the mask may be 0xf0f0f0f0. With such a mask applied, the data is mapped once every 4 threads, i.e., the threads at positions where the applied mask bit is 1 may receive mapped data, and the threads at positions where the applied mask bit is 0 are not mapped.
The computing device 110 maps the data in the masked fifth register R8 to the first register R4 every mapping parameter bits (i.e., every 4 bits) and maps the data in the masked sixth register R9 to the second register R5 every mapping parameter bits (i.e., every 4 bits).
Fig. 8 shows a schematic diagram of mapping the data in the fifth register R8 and the sixth register R9 to the first register R4 and the second register R5 according to an embodiment of the present disclosure. As shown in Fig. 8, the data of the fifth register R8 and the sixth register R9 corresponding to threads T4-T7, T12-T15, T20-T23, and T28-T31 are mapped to the first register R4 and the second register R5, respectively. Fig. 6 illustrates the mapping result according to an embodiment of the present disclosure. As shown in Fig. 6, the first register R4 stores the data 1, 3, 5, 7, 9, ..., and the second register R5 stores the data 2, 4, 6, 8, 10, ..., thereby realizing the rearrangement of the data between the two registers, e.g., the parity rearrangement. In this way, the rearrangement of data is also achieved in the case where more than two additional registers are used.
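This alternative can be modeled in the same host-side style. The register names follow the description above, while the loop form and the helper name mapViaExtraRegisters are assumptions made only for illustration.

    // Illustrative model of the alternative embodiment: the 4-thread offset is
    // first materialized in two further registers R8/R9, after which the masked
    // copy into R4/R5 is a same-position overwrite (Figs. 7 and 8).
    static void mapViaExtraRegisters(int r4[32], int r5[32],
                                     const int r6[32], const int r7[32])
    {
        const int k = 4;                       // mapping parameter
        int r8[32] = {0}, r9[32] = {0};

        // Data of the first k threads is mapped directly (threads T0-T3).
        for (int t = 0; t < k; ++t) { r8[t] = r6[t]; r9[t] = r7[t]; }

        // Remaining data is mapped k threads ahead; the data of the last k
        // threads (T28-T31) of R6/R7 has no destination and is not mapped.
        for (int t = 0; t + k < 32; ++t) { r8[t + k] = r6[t]; r9[t + k] = r7[t]; }

        // Masked same-position copy into R4/R5 using mask 0xf0f0f0f0.
        const unsigned mask = 0xf0f0f0f0u;
        for (int d = 0; d < 32; ++d)
            if ((mask >> d) & 1u) { r4[d] = r8[d]; r5[d] = r9[d]; }
        // Result again: r4 holds the odd column and r5 holds the even column.
    }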
By using the above technical means, the arrangement of data between two registers can be realized quickly while only two additional registers and one staging register are needed. At the same time, there is no need to calculate addresses in the shared memory, thereby reducing the power consumption of the arrangement.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in FIG. 1 may be implemented by electronic device 900. As shown, the electronic device 900 includes a Central Processing Unit (CPU) 901 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the random access memory 903, various programs and data necessary for the operation of the electronic device 900 may also be stored. The central processing unit 901, the read only memory 902, and the random access memory 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A plurality of components in the electronic device 900 are connected to the input/output interface 905, including an input unit 906 such as a keyboard, a mouse, a microphone, and the like, an output unit 907 such as various types of displays, speakers, and the like, a storage unit 908 such as a magnetic disk, an optical disk, and the like, and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The various processes and processing described above, such as the method 200, may be performed by the central processing unit 901. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, some or all of the computer program may be loaded and/or installed onto the device 900 via the read only memory 902 and/or the communication unit 909. One or more of the acts of the method 200 described above may be performed when the computer program is loaded into the random access memory 903 and executed by the central processing unit 901.
The present disclosure relates to methods, apparatus, systems, electronic devices, computer readable storage media, and/or computer program products. The computer program product may include computer readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge computing devices. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, and the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions, the electronic circuitry being able to execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.