
CN114546328B - Method, device and medium for implementing data arrangement - Google Patents

Method, device and medium for implementing data arrangement

Info

Publication number
CN114546328B
Authority
CN
China
Prior art keywords
data
register
mapping
threads
thread group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210194323.3A
Other languages
Chinese (zh)
Other versions
CN114546328A (en)
Inventor
Name not to be published, at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202210194323.3A priority Critical patent/CN114546328B/en
Publication of CN114546328A publication Critical patent/CN114546328A/en
Application granted
Publication of CN114546328B publication Critical patent/CN114546328B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/22Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F7/24Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Executing Machine-Instructions (AREA)
  • Bus Control (AREA)

Abstract


The embodiments of the present disclosure relate to a method, device and medium for implementing data arrangement, the method comprising: obtaining first auxiliary data and second auxiliary data based on the number of threads of a thread group and data arrangement requirements; shifting the data in a first register, the first auxiliary data, the data in a second register, and the second auxiliary data to a staging register; performing extraction on the data in the staging register based on the address of the staging register, thereby obtaining first data, second data, third data, and fourth data; determining mapping parameters based on the number of threads of the thread group and the data arrangement requirements; and mapping the third data to the first register and the fourth data to the second register based on the mapping parameters, thereby implementing data arrangement in the registers. Thus, data arrangement among multiple registers can be achieved more quickly without calculating shared memory addresses.

Description

Method, apparatus and medium for implementing data permutation
Technical Field
Embodiments of the present disclosure relate generally to the field of processors and, more particularly, to a method, computing device, and computer-readable storage medium for implementing data arrangement.
Background
In general-purpose computing on graphics processing units (GPGPU), fast Fourier transform (FFT) operations are commonly involved, often with real-number inputs and complex-number outputs (R2C). Such an operation entails arranging (e.g., parity-rearranging) the original input real sequence so that it can be reassembled into a new complex sequence. This involves arranging data between different registers across the threads of a thread group (warp), and the permutation relies on data interaction between those threads.
The existing approach to such data interaction mainly relies on shared memory. The system calculates the corresponding shared memory addresses from the indices of the different threads' data, writes the register data into shared memory with a store instruction, and finally writes the data back into the registers with a load instruction. Because scheduling between threads is not deterministic, all threads must finish their writes before any read can be performed, which requires an additional synchronization operation, and writes to the same location can only be processed serially. This increases the instruction latency and wastes data bandwidth. In addition, the index calculation for shared memory is complex and requires extra computation to translate indices into the corresponding addresses.
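For concreteness, the following CUDA sketch illustrates this conventional shared-memory path for the parity rearrangement used as the running example later in this disclosure (64 consecutive values held in two registers of a 32-thread warp). The kernel name, buffer layout, and index arithmetic are assumptions added for illustration only and are not taken from the patent.

    // Conventional approach: parity rearrangement through shared memory.
    // Hypothetical CUDA sketch for a single 32-thread warp.
    __global__ void parity_rearrange_shared(const int* in, int* out)
    {
        __shared__ int buf[64];
        int lane = threadIdx.x & 31;

        int a = in[lane];        // "first register": values 1..32
        int b = in[lane + 32];   // "second register": values 33..64

        // Extra index arithmetic: the element at original position i is
        // scattered to buf[(i % 2) * 32 + i / 2], so odd values gather in
        // buf[0..31] and even values in buf[32..63].
        int ia = lane, ib = lane + 32;
        buf[(ia % 2) * 32 + ia / 2] = a;   // store instruction
        buf[(ib % 2) * 32 + ib / 2] = b;   // store instruction

        __syncwarp();                      // all writes must finish before any read

        a = buf[lane];                     // load instruction: odd column 1, 3, ..., 63
        b = buf[lane + 32];                // load instruction: even column 2, 4, ..., 64

        out[lane] = a;
        out[lane + 32] = b;
    }

The address computation and the synchronization before the loads are exactly the costs that the scheme described below avoids.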
In summary, the conventional scheme for implementing data arrangement has the drawbacks that it adds the complex computation of converting shared memory indices into addresses, and the read and write operations on shared memory have long latency, which degrades program performance.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a method, a computing device, and a computer-readable storage medium for implementing data arrangement that can implement data arrangement among a plurality of registers more quickly without computing a shared memory address.
According to a first aspect of the present disclosure, there is provided a method for implementing data arrangement in a register, including: acquiring first auxiliary data and second auxiliary data based on a thread count and data arrangement requirements of a thread group; shifting the data in the first register, the first auxiliary data, the data in the second register, and the second auxiliary data to a staging register; performing extraction on the data in the staging register based on an address of the staging register, thereby acquiring first data, second data, third data, and fourth data; determining a mapping parameter based on the thread count and data arrangement requirements of the thread group; and mapping the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby implementing data arrangement in the register.
According to a second aspect of the present disclosure there is provided a computing device comprising at least one processor and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the present disclosure.
In a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.
In one embodiment, the method further comprises writing first, second, third and fourth data to the first, second, third and fourth registers, respectively.
In one embodiment, mapping the third data to the first register and the fourth data to the second register includes applying a mask to the third register and the fourth register, mapping the masked data in the third register to the first register every mapping-parameter bits, and mapping the masked data in the fourth register to the second register every mapping-parameter bits, the mapping-parameter bits being determined based on the mapping parameter.
In one embodiment, mapping the third data to the first register and the fourth data to the second register includes mapping the data of the current mapping-parameter bits of the third and fourth registers to the current mapping-parameter bits of a fifth and a sixth register, respectively; mapping the data of the third and fourth registers to the next mapping-parameter bits of the fifth and sixth registers, respectively, wherein the data of the last mapping-parameter bits of the third and fourth registers are not mapped; applying a mask to the fifth and sixth registers; and mapping the masked data in the fifth register to the first register and the masked data in the sixth register to the second register.
In one embodiment, determining the mapping parameters based on the thread count and data arrangement requirements of the thread group includes determining the mapping parameters as one eighth of the thread count of the thread group.
In one embodiment, performing extraction on the data in the staging register includes constructing data arrays based on the number of threads of the thread group, wherein each element in the constructed arrays corresponds to data in the staging register, combining the data of the staging register into a set of a plurality of the data arrays, and performing extraction on the combined set of the plurality of data arrays.
In one embodiment, performing extraction on the combined set of the plurality of data arrays includes determining coordinate values of the elements in the plurality of data arrays, wherein the coordinate values include an abscissa and an ordinate corresponding to each element, and performing extraction on the combined set in the order of even abscissa and even ordinate, even abscissa and odd ordinate, odd abscissa and even ordinate, and odd abscissa and odd ordinate, based on the determined coordinate values.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for implementing data permutation according to an embodiment of the invention.
Fig. 2 illustrates a flow chart of a method 200 for implementing data permutation in registers according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a shifted staging register according to an embodiment of the disclosure.
Fig. 4 shows a schematic diagram of data of a staging register made up of a set of multiple columns according to an embodiment of the disclosure.
Fig. 5 shows a schematic diagram of extracted first, second, third, and fourth data according to an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of a mapping result according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of third and fourth registers and fifth and sixth register maps according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of mapping data in a fifth register, a sixth register to a first register, a second register, according to an embodiment of the present disclosure.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment. The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, the existing data interaction between the threads of a thread group is mainly performed through shared memory. The system calculates the corresponding shared memory addresses from the indices of the different threads' data, writes the register data into shared memory with a store instruction, and finally writes the data back into the registers with a load instruction. Because scheduling between threads is not deterministic, all threads must finish their writes before any read can be performed, which requires an additional synchronization operation, and writes to the same location can only be processed serially. This increases the instruction latency and wastes data bandwidth. In addition, the index calculation for shared memory is complex and requires extra computation to translate indices into the corresponding addresses. Therefore, the extra index and address calculation for the shared memory is costly and introduces serious delay, and at the same time the read and write operations on shared memory have long latency, which affects program performance.
To at least partially address one or more of the above-mentioned problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for implementing data arrangement in which data stored in two registers, e.g., the consecutive values 1-64, can be arranged by utilizing two additional registers in a thread group and one staging register. Arrangement may refer to reordering consecutive data, such as the consecutive data 1-64 stored in two registers across the 32 threads of one thread group (warp), into an odd column 1, 3, 5, ..., 63 and an even column 2, 4, 6, ..., 64. Eventually, the two registers store the corresponding odd column and even column.
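Expressed per lane, the target of this running example is the following mapping (a sketch of the intended result only; the helper name is hypothetical):

    // Desired result of the arrangement for a 32-thread warp.
    // Before: r4[lane] = lane + 1 (1..32), r5[lane] = lane + 33 (33..64).
    // After:  r4[lane] = 2 * lane + 1 (odd column), r5[lane] = 2 * lane + 2 (even column).
    __host__ __device__ inline void target_arrangement(int lane, int& r4, int& r5)
    {
        r4 = 2 * lane + 1;   // 1, 3, 5, ..., 63
        r5 = 2 * lane + 2;   // 2, 4, 6, ..., 64
    }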
Fig. 1 shows a schematic diagram of a system 100 for implementing a method for implementing data arrangement according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes a computing device 110, a data storage device 130, and a network 140. The computing device 110 and the data storage device 130 may exchange data via the network 140 (e.g., the internet).
The data storage device 130 may store, for example, a plurality of different types of data, such as consecutive numbers to be arranged. The data storage device 130 may also transmit the stored data to the computing device 110. The data storage device may be a one-stop storage-and-computing structure running on one or more computer nodes for providing high-concurrency, high-throughput query services, and may include special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs.
The computing device 110 is used, for example, to retrieve data from the data storage device 130 and to perform arrangement on the received data. The rearranged odd and even columns may be stored in two registers within the same warp. The computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the data storage device 130 may be integrated together or may be separate from each other. In some embodiments, the computing device 110 includes, for example, an acquisition module 112, a shift module 114, an extraction module 116, a determination module 118, and a mapping module 120.
An acquisition module 112, the acquisition module 112 being configured to acquire the first assistance data and the second assistance data based on the number of threads of the thread group and the data arrangement requirement.
A shift module 114, the shift module 114 being configured to shift the data in the first register, the first auxiliary data, the data in the second register, the second auxiliary data to the staging register.
And an extraction module 116, wherein the extraction module 116 is configured to perform extraction on the data in the transfer register based on the address of the transfer register, so as to obtain first data, second data, third data and fourth data.
A determination module 118, the determination module 118 being configured to determine the mapping parameters based on the number of threads of the thread group and the data arrangement requirements.
A mapping module 120, the mapping module 120 being configured to map the third data to the first register and the fourth data to the second register based on the mapping parameters, thereby enabling data arrangement in the registers.
Fig. 2 illustrates a flow chart of a method 200 for implementing data arrangement in registers according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 shown in fig. 1, or by the electronic device 900 shown in fig. 9. It should be understood that the method 200 may also include additional blocks not shown and/or that blocks shown may be omitted, and the scope of the disclosure is not limited in this respect.
In the context of the present application, the registers may refer to a register group within a thread group (warp), e.g., a register group comprising a first register and a second register. The registers of the group may be located in the same thread or in different threads.
In one embodiment, for example, a thread group (warp) may include 8, 16, 32, or 64 threads. For simplicity, this disclosure is presented with 32 threads. Note that thread groups with other thread counts are equally applicable to the solution described in this disclosure.
At step 202, the computing device 110 may obtain first assistance data and second assistance data based on the number of threads and data arrangement requirements of the thread group.
In one embodiment, the number of bits of the first auxiliary data and the second auxiliary data may be the same as the number of threads of the thread group. For example, if the thread count of the thread group is 32, the first auxiliary data and the second auxiliary data may be 32-bit data. For simplicity, 32 threads are used as the illustration here. As described above, the first and second registers corresponding to the 32 threads store the data 1, 2, 3, ..., 32 and 33, 34, 35, ..., 64, respectively. In the context of the present application, a data arrangement requirement may refer to the exchange, insertion, rearrangement, etc. of the data corresponding to the threads. For simplicity, the description is given in terms of the parity rearrangement of the data held in the two registers of the 32 threads. For example, the data in the two registers are rearranged such that, after arrangement, the first register holds the odd column (1, 3, 5, 7, ..., 63) and the second register holds the even column (2, 4, 6, 8, ..., 64).
In one embodiment, the first auxiliary data and the second auxiliary data may be any data used to populate registers. For example, the first auxiliary data and the second auxiliary data may be 32 consecutive values, such as 1, 2, 3, ..., 32, or the same value repeated, such as 32 zeros. For ease of description, 32 zeros are used here as the first auxiliary data and the second auxiliary data.
In step 204, the computing device 110 may shift the data in the first register, the first auxiliary data, the data in the second register, the second auxiliary data to the staging register.
In one embodiment, the computing device 110 may shift the data 1, 2, 3, ..., 32 in the first register R4, the 32 zeros of the first auxiliary data, the data 33, 34, 35, ..., 64 in the second register R5, and the 32 zeros of the second auxiliary data to the staging register X0. Fig. 3 shows a schematic diagram of the shifted staging register according to an embodiment of the disclosure. The shifted staging register X0 will be extracted in a subsequent step, thereby realizing the data arrangement. Since the staging register X0 is a special register whose size is 4 times that of the first register, it can store the data of 4 such registers.
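For illustration, the contents of the staging register X0 after this shift can be modeled on the host as a 4 x 32 array, one entry per thread (a hypothetical model; the register names match the example above):

    // Host-side model of the shift step (Fig. 3). Row 0 = first register,
    // row 1 = first auxiliary data, row 2 = second register, row 3 = second
    // auxiliary data; one column per thread of the 32-thread warp.
    void shift_to_staging(const int r4[32], const int r5[32], int x0[4][32])
    {
        for (int t = 0; t < 32; ++t) {
            x0[0][t] = r4[t];   // 1..32
            x0[1][t] = 0;       // first auxiliary data
            x0[2][t] = r5[t];   // 33..64
            x0[3][t] = 0;       // second auxiliary data
        }
    }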
At step 206, the computing device 110 may perform extraction on the data in the staging register based on the address of the staging register X0, thereby obtaining the first data, the second data, the third data, and the fourth data.
In one embodiment, performing the extraction on the data in the staging register X0 may include constructing data arrays based on the number of threads (e.g., 32) of the thread group, wherein each element in the constructed arrays corresponds to data in the staging register.
Specifically, the computing device 110 may construct data arrays based on the 32 data items of the first register and the second register. Each array is constructed as a 4 x 8 array, thus dividing the 32 items of a register into 4 groups of 8. For example, the data shifted into the first row of the staging register X0 may be divided into 1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32. The data shifted into the second row of the staging register may be divided into four groups of eight zeros. The data shifted into the third row of the staging register may be divided into 33, 34, 35, ..., 40 and 41, 42, 43, ..., 48 and 49, 50, 51, ..., 56 and 57, 58, 59, ..., 64. The data shifted into the fourth row of the staging register may likewise be divided into four groups of eight zeros.
Subsequently, in one embodiment, the computing device 110 may combine the data of the staging register into a set of a plurality of the data arrays, e.g., a set of four 4 x 8 arrays. Fig. 4 shows a schematic diagram of the data of the staging register composed into such a set according to an embodiment of the disclosure. As shown in fig. 4, the data of the staging register X0 may be represented by a set of four 4 x 8 arrays. For example, from the data of the first row in the staging register X0, four groups of 8 items, i.e., one 4 x 8 array, can be constructed: 1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32. From the data of the second row in the staging register X0, a 4 x 8 array of zeros can be constructed, and so on. Based on the constructed arrays, the data of the staging register may be combined into a set of such 4 x 8 arrays, wherein the 4 x 8 array made up of the data of the first row may be located at the upper left of the set, the 4 x 8 array made up of the data of the second row at the upper right, the 4 x 8 array made up of the data of the third row at the lower left, and the 4 x 8 array made up of the data of the fourth row at the lower right.
In the resulting set, the data in the upper left corner is the 4 x 8 array originally belonging to the first register (1, 2, 3, ..., 8 and 9, 10, 11, ..., 16 and 17, 18, 19, ..., 24 and 25, 26, 27, ..., 32), the data in the upper right corner is the 4 x 8 array originally belonging to the first auxiliary data (four groups of eight zeros), the data in the lower left corner is the 4 x 8 array originally belonging to the second register (33, 34, 35, ..., 40 and 41, 42, 43, ..., 48 and 49, 50, 51, ..., 56 and 57, 58, 59, ..., 64), and the data in the lower right corner is the 4 x 8 array originally belonging to the second auxiliary data (four groups of eight zeros).
The set may be represented in a coordinate system, wherein the coordinate values comprise the abscissa and the ordinate corresponding to each data item. Taking the upper-left point as the origin, the y-axis points to the right and the x-axis points downward. Starting from the origin, in the first 2 x 2 sub-block, the coordinate value corresponding to the first data item 1 is (0, 0), the coordinate value corresponding to data 2 is (0, 1), the coordinate value corresponding to data 9 is (1, 0), the coordinate value corresponding to data 10 is (1, 1), and so on for the rest of the data.
Based on the determined coordinate values, the computing device 110 may perform extraction on the combined set of the data arrays. Specifically, the extraction is performed on the set in the order of even abscissa and even ordinate, even abscissa and odd ordinate, odd abscissa and even ordinate, and odd abscissa and odd ordinate. The extracted data thus fall into four types: first data with even abscissa and even ordinate, second data with even abscissa and odd ordinate, third data with odd abscissa and even ordinate, and fourth data with odd abscissa and odd ordinate.
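A host-side sketch of this extraction step is given below (the 8 x 16 layout and the array names are assumptions made for illustration; the four 4 x 8 arrays of Fig. 4 are laid out as two rows of two arrays):

    // Host-side model of the extraction step (Figs. 4 and 5).
    // set[8][16]: upper-left = first register (1..32), upper-right = first
    // auxiliary zeros, lower-left = second register (33..64), lower-right =
    // second auxiliary zeros. Elements are classified by coordinate parity.
    void extract_by_parity(const int set[8][16],
                           int first[32], int second[32], int third[32], int fourth[32])
    {
        int n1 = 0, n2 = 0, n3 = 0, n4 = 0;
        for (int x = 0; x < 8; ++x) {          // x: abscissa, increasing downward
            for (int y = 0; y < 16; ++y) {     // y: ordinate, increasing rightward
                int v = set[x][y];
                if      (x % 2 == 0 && y % 2 == 0) first[n1++]  = v;  // -> R4
                else if (x % 2 == 0 && y % 2 == 1) second[n2++] = v;  // -> R5
                else if (x % 2 == 1 && y % 2 == 0) third[n3++]  = v;  // -> R6
                else                               fourth[n4++] = v;  // -> R7
            }
        }
    }

With the example data this yields first = {1, 3, 5, 7, 0, 0, 0, 0, 17, 19, 21, 23, 0, 0, 0, 0, ...} and third = {9, 11, 13, 15, 0, 0, 0, 0, 25, 27, 29, 31, 0, 0, 0, 0, ...}, matching the registers R4 and R6 shown in Fig. 5.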
In one embodiment, the computing device 110 may also write the extracted first data, second data, third data, and fourth data to the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively. The third register R6 and the fourth register R7 will be used for mapping data in a subsequent step.
Fig. 5 shows a schematic diagram of extracted first, second, third, and fourth data according to an embodiment of the present disclosure. As shown in fig. 5, the extracted first data, second data, third data, and fourth data are written into the first register R4, the second register R5, the third register R6, and the fourth register R7, respectively.
At step 208, the computing device 110 may determine mapping parameters based on the number of threads of the thread group and the data arrangement requirements.
In one embodiment, the mapping parameter is determined to be one eighth of the thread number of the thread group. In the case where the thread count of the thread group is 32, the mapping parameter k may be 4.
At step 210, computing device 110 may map the third data to the first register and the fourth data to the second register based on the mapping parameters, thereby implementing a data arrangement in the registers.
In one embodiment, the computing device 110 may apply a mask to the third register R6 and the fourth register R7. For example, the mask may be 0xf0f0f0f0, i.e., 11110000111100001111000011110000. With such a mask applied, the data are mapped in groups of 4 threads: threads at positions where the applied mask bit is 1 may be mapped to, and threads at positions where the applied mask bit is 0 are not mapped to.
In one embodiment, the computing device 110 maps the masked data in the third register R6 to the first register R4 every mapping-parameter bits (i.e., every 4 bits) and maps the masked data in the fourth register R7 to the second register R5 every mapping-parameter bits (i.e., every 4 bits), where the mapping-parameter bits are determined based on the mapping parameter. For example, if the mapping parameter is 4, the mapping-parameter bits are 4 bits.
For example, based on the applied mask (0xf0f0f0f0), the computing device 110 may map the data (9, 11, 13, 15) corresponding to threads T0-T3 in the third register R6 to threads T4-T7 of the first register R4, overwriting the zeros in threads T4-T7. By analogy, the computing device 110 may map the data (25, 27, 29, 31) corresponding to threads T8-T11 in the third register R6 to threads T12-T15 of the first register R4, overwriting the zeros in threads T12-T15; map the data (41, 43, 45, 47) corresponding to threads T16-T19 in the third register R6 to threads T20-T23 of the first register R4, overwriting the zeros in threads T20-T23; and map the data (57, 59, 61, 63) corresponding to threads T24-T27 in the third register R6 to threads T28-T31 of the first register R4, overwriting the zeros in threads T28-T31.
Similarly, the computing device 110 may map the data (10, 12, 14, 16) corresponding to threads T0-T3 in the fourth register R7 to threads T4-T7 of the second register R5, overwriting the zeros in threads T4-T7; map the data (26, 28, 30, 32) corresponding to threads T8-T11 in the fourth register R7 to threads T12-T15 of the second register R5, overwriting the zeros in threads T12-T15; map the data (42, 44, 46, 48) corresponding to threads T16-T19 in the fourth register R7 to threads T20-T23 of the second register R5, overwriting the zeros in threads T20-T23; and map the data (58, 60, 62, 64) corresponding to threads T24-T27 in the fourth register R7 to threads T28-T31 of the second register R5, overwriting the zeros in threads T28-T31.
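One possible realization of this masked mapping step, expressed with CUDA warp shuffles, is sketched below. The use of __shfl_up_sync is an assumption for illustration; the disclosure does not prescribe a specific mapping instruction.

    // Hypothetical in-warp sketch of the masked mapping step. r4/r5 hold the
    // first and second data, r6/r7 the third and fourth data (Fig. 5).
    __device__ void masked_map(int& r4, int& r5, int r6, int r7)
    {
        const unsigned mask = 0xf0f0f0f0u;   // destination lanes 4-7, 12-15, 20-23, 28-31
        int lane = threadIdx.x & 31;
        int k = 4;                            // mapping parameter = 32 / 8

        int from_r6 = __shfl_up_sync(0xffffffffu, r6, k);  // value held k lanes below
        int from_r7 = __shfl_up_sync(0xffffffffu, r7, k);

        if ((mask >> lane) & 1u) {            // only masked lanes are overwritten
            r4 = from_r6;                     // e.g. lanes 4-7 receive 9, 11, 13, 15
            r5 = from_r7;                     // e.g. lanes 4-7 receive 10, 12, 14, 16
        }
    }

After this step, r4 holds 1, 3, 5, ..., 63 and r5 holds 2, 4, 6, ..., 64 across the warp, which is the result shown in Fig. 6.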
Fig. 6 shows a schematic diagram of the mapping result according to an embodiment of the present disclosure. As shown in fig. 6, the first register R4 stores the data 1, 3, 5, 7, 9, ..., 63 and the second register R5 stores the data 2, 4, 6, 8, 10, ..., 64, thereby realizing the rearrangement of data between the two registers, i.e., parity rearrangement.
By employing the means described above, the computing device 110 is able to implement the arrangement of data quickly by utilizing additional registers. Moreover, this does not require calculating shared memory addresses and can reuse the data computed in the preceding steps, thereby reducing the power consumption of the arrangement.
In another embodiment, the computing device 110 may also map the data of the first mapping-parameter bits of the third register R6 and the fourth register R7 directly to the first mapping-parameter bits of a fifth register and a sixth register. For example, the computing device 110 may map the first mapping-parameter bits of the third register R6 and the fourth register R7, i.e., the first 4 bits of data, directly to the first mapping-parameter bits of the fifth register R8 and the sixth register R9, i.e., their first 4 bits. For example, the data 9, 11, 13, 15 of the third register R6 corresponding to threads T0-T3 are mapped directly to the fifth register R8, and the data 10, 12, 14, 16 of the fourth register R7 corresponding to threads T0-T3 are mapped directly to the sixth register R9.
Subsequently, the computing device 110 may map the data of the third register R6 and the fourth register R7 to the next mapping-parameter bits of the fifth register R8 and the sixth register R9, i.e., to the positions offset by 4 bits. Since the first 4 bits of the fifth register R8 and the sixth register R9 have already been mapped with the first 4 bits of data of the third register R6 and the fourth register R7, the data of the last mapping-parameter bits of the third register R6 and the fourth register R7 are not mapped, i.e., the data corresponding to the last 4 threads (threads T28-T31) of the third and fourth registers are not mapped. Fig. 7 shows a schematic diagram of the third and fourth registers R6, R7 mapped to the fifth and sixth registers R8, R9 according to an embodiment of the present disclosure. As shown in fig. 7, for example, the computing device 110 maps the data of the third register R6 corresponding to threads T0-T27 to the fifth register R8. At the same time, the computing device 110 maps the data of the fourth register R7 corresponding to threads T0-T27 to the sixth register R9.
Subsequently, the computing device 110 may apply a mask to the fifth register R8 and the sixth register R9. As described above, the mask may be 0xf0f0f0f0. With such a mask applied, the data are mapped in groups of 4 threads: threads at positions where the applied mask bit is 1 may be mapped to, and threads at positions where the applied mask bit is 0 are not mapped to.
Computing device 110 maps data in fifth register R8 of the applied mask to the first register R4 every the mapping parameter bits (i.e., 4 bits) and maps data in sixth register R9 of the applied mask to the second register R5 every the mapping parameter bits (i.e., 4 bits).
Fig. 8 shows a schematic diagram of mapping the data in the fifth and sixth registers R8, R9 to the first and second registers R4, R5 according to an embodiment of the disclosure. As shown in fig. 8, the data of the fifth and sixth registers R8 and R9 corresponding to threads T4-T7, T12-T15, T20-T23, and T28-T31 are mapped to the respective first and second registers R4 and R5. Fig. 6 illustrates the mapping result according to an embodiment of the present disclosure. As shown in fig. 6, the first register R4 stores the data 1, 3, 5, 7, 9, ..., 63 and the second register R5 stores the data 2, 4, 6, 8, 10, ..., 64, thereby realizing the rearrangement of data between the two registers, for example, parity rearrangement. In this way, the rearrangement of data is also achieved when more than two additional registers are used.
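Both stages of this alternative path can be sketched together as follows; as before, the use of warp shuffles is an assumption made only to illustrate the data movement described above.

    // Hypothetical sketch of the two-stage variant via the fifth and sixth
    // registers (R8, R9). Stage 1 builds R8/R9 from R6/R7; stage 2 merges the
    // masked lanes of R8/R9 into R4/R5.
    __device__ void masked_map_two_stage(int& r4, int& r5, int r6, int r7)
    {
        const unsigned mask = 0xf0f0f0f0u;
        int lane = threadIdx.x & 31;
        int k = 4;                            // mapping parameter

        // Stage 1: lanes 0-3 take R6/R7 directly; lanes 4-31 take the value
        // held k lanes below, so the last k lanes of R6/R7 never move onward.
        int r8 = (lane < k) ? r6 : __shfl_up_sync(0xffffffffu, r6, k);
        int r9 = (lane < k) ? r7 : __shfl_up_sync(0xffffffffu, r7, k);

        // Stage 2: masked merge into R4/R5, as in Fig. 8.
        if ((mask >> lane) & 1u) {
            r4 = r8;
            r5 = r9;
        }
    }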
By using the above technical means, the arrangement of data between two registers can be realized quickly while utilizing only two extra registers and one staging register, and without the need to calculate shared memory addresses, thereby reducing the power consumption of the arrangement.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in FIG. 1 may be implemented by electronic device 900. As shown, the electronic device 900 includes a Central Processing Unit (CPU) 901 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the random access memory 903, various programs and data necessary for the operation of the electronic device 900 may also be stored. The central processing unit 901, the read only memory 902, and the random access memory 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A plurality of components in the electronic device 900 are connected to the input/output interface 905, including an input unit 906 such as a keyboard, a mouse, a microphone, and the like, an output unit 907 such as various types of displays, speakers, and the like, a storage unit 908 such as a magnetic disk, an optical disk, and the like, and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The various processes and treatments described above, such as method 200, may be performed by central processing unit 901. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, some or all of the computer program may be loaded and/or installed onto device 900 via read only memory 902 and/or communication unit 909. One or more of the acts of the method 200 described above may be performed when a computer program is loaded into the random access memory 903 and executed by the central processing unit 901.
The present disclosure relates to methods, apparatus, systems, electronic devices, computer readable storage media, and/or computer program products. The computer program product may include computer readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, punch cards or intra-groove protrusion structures such as those having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge computing devices. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as SMALLTALK, C ++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. A method for implementing data arrangement in a register, comprising:
acquiring first auxiliary data and second auxiliary data based on a thread count and data arrangement requirements of a thread group;
shifting the data in a first register, the first auxiliary data, the data in a second register, and the second auxiliary data to a staging register;
performing extraction on the data in the staging register based on the address of the staging register, thereby acquiring first data, second data, third data, and fourth data, which comprises constructing data arrays based on the thread count of the thread group, combining the data of the staging register into a set of a plurality of the data arrays, and performing extraction on the combined set composed of the plurality of data arrays;
determining a mapping parameter based on the thread count and data arrangement requirements of the thread group; and
mapping the third data to the first register and the fourth data to the second register based on the mapping parameter, thereby implementing data arrangement in the registers, which comprises applying a mask to a third register and a fourth register, mapping the masked data in the third register to the first register every mapping-parameter bits, and mapping the masked data in the fourth register to the second register every mapping-parameter bits.

2. The method according to claim 1, further comprising:
writing the first data, the second data, the third data, and the fourth data to the first register, the second register, the third register, and the fourth register, respectively.

3. The method according to claim 2, wherein mapping the third data to the first register and the fourth data to the second register comprises:
mapping the data of the current mapping-parameter bits of the third register and the fourth register to the current mapping-parameter bits of a fifth register and a sixth register, respectively;
mapping the data of the third register and the fourth register to the next mapping-parameter bits of the fifth register and the sixth register, respectively, wherein the data of the last mapping-parameter bits of the third register and the fourth register are not mapped;
applying a mask to the fifth register and the sixth register; and
mapping the masked data in the fifth register to the first register and mapping the masked data in the sixth register to the second register.

4. The method according to claim 1, wherein determining the mapping parameter based on the thread count and data arrangement requirements of the thread group comprises:
determining the mapping parameter as one eighth of the thread count of the thread group.

5. The method according to claim 1, wherein performing extraction on the combined set composed of the plurality of data arrays comprises:
determining coordinate values of the elements in the plurality of data arrays, wherein the coordinate values comprise an abscissa and an ordinate corresponding to each element; and
based on the determined coordinate values, performing extraction on the combined set of the plurality of data arrays in the order of even abscissa and even ordinate, even abscissa and odd ordinate, odd abscissa and even ordinate, and odd abscissa and odd ordinate.

6. A computing device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-5.

7. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-5.
CN202210194323.3A 2022-03-01 2022-03-01 Method, device and medium for implementing data arrangement Active CN114546328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210194323.3A CN114546328B (en) 2022-03-01 2022-03-01 Method, device and medium for implementing data arrangement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210194323.3A CN114546328B (en) 2022-03-01 2022-03-01 Method, device and medium for implementing data arrangement

Publications (2)

Publication Number Publication Date
CN114546328A CN114546328A (en) 2022-05-27
CN114546328B true CN114546328B (en) 2025-02-11

Family

ID=81661354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210194323.3A Active CN114546328B (en) 2022-03-01 2022-03-01 Method, device and medium for implementing data arrangement

Country Status (1)

Country Link
CN (1) CN114546328B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365631A (en) * 2012-04-05 2013-10-23 辉达公司 Dynamic bank mode addressing for memory access
CN107111485A (en) * 2014-11-14 2017-08-29 英特尔公司 Three-dimensional Morton coordinate transformation processor, method, system and instructions

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6625721B1 (en) * 1999-07-26 2003-09-23 Intel Corporation Registers for 2-D matrix processing
CN103827815B (en) * 2011-09-26 2017-11-28 英特尔公司 Instruction and the logic across vectorial loading and storage with mask function are utilized for providing
JP5852677B2 (en) * 2011-12-26 2016-02-03 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Register mapping method
US20140122842A1 (en) * 2012-10-31 2014-05-01 International Business Machines Corporation Efficient usage of a register file mapper mapping structure
US10871967B2 (en) * 2015-09-19 2020-12-22 Microsoft Technology Licensing, Llc Register read/write ordering
CN111459543B (en) * 2019-01-21 2022-09-13 上海登临科技有限公司 Method for managing register file unit

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365631A (en) * 2012-04-05 2013-10-23 辉达公司 Dynamic bank mode addressing for memory access
CN107111485A (en) * 2014-11-14 2017-08-29 英特尔公司 Three-dimensional Morton coordinate transformation processor, method, system and instructions

Also Published As

Publication number Publication date
CN114546328A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
US11704548B2 (en) Multicast network and memory transfer optimizations for neural network hardware acceleration
CN117707791B (en) Method, apparatus and storage medium for performing attention calculations
CN117435855B (en) Method for performing convolution operation, electronic device, and storage medium
CN115408391B (en) A database table change method, device, equipment and storage medium
WO2019078890A1 (en) Parallel processing for signal generation neural networks
CN117785284B (en) Method, device and medium for performing convolution operation
CN110909527A (en) Operation method, device, electronic device, and storage medium of text processing model
US20150331712A1 (en) Concurrently processing parts of cells of a data structure with multiple processes
KR20210086936A (en) Data output method, data acquisition method, device, and electronic equipment
CN117132446A (en) GPU data access processing method, device and storage medium
US20210217187A1 (en) Method and apparatus for image processing and computer storage medium
US9058301B2 (en) Efficient transfer of matrices for matrix based operations
CN113779082B (en) Method and device for updating data
US11429317B2 (en) Method, apparatus and computer program product for storing data
CN107451070B (en) Data processing method and server
CN114546328B (en) Method, device and medium for implementing data arrangement
US10613861B2 (en) Programmable linear feedback shift register
CN114546329B (en) Method, apparatus and medium for implementing data parity rearrangement
CN110705701B (en) High-parallelism convolution operation method and circuit
CN116360858B (en) Data processing method, graphics processor, electronic device and storage medium
CN116804915B (en) Data interaction method, processor, device and medium based on memory
KR102471553B1 (en) Method, apparatus, device and computer-readable storage medium executed by computing devices
US10540183B2 (en) Accelerated execution of execute instruction target
CN119536810B (en) Data processing method, device, equipment and medium of single-port RAM based on BHT
CN119809910B (en) Video memory control method, device, equipment and storage medium for model training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant after: Shanghai Bi Ren Technology Co.,Ltd.

Address before: 201114 room 1302, 13 / F, building 16, 2388 Chenhang Road, Minhang District, Shanghai

Applicant before: Shanghai Bilin Intelligent Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant