WO2018121118A1 - Calculating apparatus and method - Google Patents
Calculating apparatus and method
- Publication number
- WO2018121118A1 PCT/CN2017/111333
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- memory
- data
- hmc
- buffer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- the present disclosure relates to the field of neural network computing, and more particularly to a neural network computing apparatus and method.
- at present, the field of artificial intelligence is developing rapidly, and machine learning affects many aspects of daily life. As an important part of machine learning, neural network research is a focus of both industry and academia. Because of the huge amount of data involved in neural network computing, executing neural network algorithms efficiently has become an important problem, and dedicated neural network computing devices have emerged to address it.
- the DRAMs used in the architecture of current neural network computing devices are mostly high-performance graphics memory, namely Graphics Double Data Rate version 4 (GDDR4) or version 5 (GDDR5).
- because bandwidth, performance, power consumption, and area must all be considered, GDDR4 and GDDR5 can no longer fully meet the needs of neural network computing devices, and their development has entered a bottleneck period. Each additional 1 GB/s of bandwidth brings more power consumption, which is not a sensible, efficient, or cost-effective option for designers or consumers.
- GDDR4 and GDDR5 also have the serious problem that their footprint is difficult to shrink. They will therefore gradually hinder the continued growth of neural network computing device performance.
- the present disclosure provides a neural network computing apparatus and method for overcoming the bandwidth, power-consumption, and area bottlenecks faced by the neural network computing devices described above.
- a neural network computing device including: a storage device; and a neural network processor electrically connected to the storage device, where the neural network processor exchanges data with the storage device and performs neural network calculations.
- a chip including the neural network computing device.
- a chip package structure including the chip is provided.
- a board including the chip package structure.
- an electronic device including the board is provided.
- a neural network calculation method comprising: storing data by a storage device; and performing data exchange with the storage device by a neural network processor and performing neural network calculation.
- during neural network operation, high-bandwidth memory serving as the memory of the neural network computing device can exchange input data and operation parameters between the buffer and the memory more quickly, which greatly shortens I/O time.
- high-bandwidth memory with a stacked storage structure, together with a neural network processor having an HBM memory control module, can greatly increase the storage bandwidth, to more than twice that of the prior art, so computing performance is greatly improved.
- because the high-bandwidth memory is a stacked structure that occupies no additional horizontal plane space, the area of the neural network computing device can be greatly reduced, to about 5% of that of the prior art.
- HMC memory provides high data transmission bandwidth, exceeding 15 times that of DDR3, and reduces overall power consumption: compared with commonly used memory technologies such as DDR3/DDR4, HMC memory saves more than 70% of the power consumed per bit stored.
- the HMC memory module includes a plurality of cascaded HMC memory units, so the number of HMC memory units can be flexibly selected according to the actual memory size required during neural network operation, reducing the waste of functional components.
- the HMC memory unit adopts a stacked structure, which can reduce the memory footprint by more than 90% compared with existing RDIMM technology, while greatly reducing the overall volume of the neural network computing device. Since the HMC can perform massively parallel processing, the latency of the memory components is small.
- the neural network computing device has a plurality of neural network processors that are interconnected and pass information to one another, avoiding data consistency problems during operation; the device supports a multi-core processor architecture and can fully exploit the parallelism of the neural network operation process to accelerate the computation.
- FIG. 1 is a block diagram showing the overall structure of a high bandwidth memory based neural network computing device in accordance with an embodiment of the present disclosure.
- FIG. 2 is a cross-sectional view of the neural network computing device of FIG. 1.
- FIG. 3 is an overall architecture diagram of a neural network processor with an HBM memory control module in accordance with an embodiment of the present disclosure.
- FIG. 4 is a cross-sectional view of a high bandwidth memory based neural network computing device in accordance with another embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of an overall structure of a neural network computing device based on an HMC memory according to an embodiment of the present disclosure.
- FIG. 6 is a schematic diagram of an HMC memory unit of an HMC memory-based neural network computing device according to an embodiment of the present disclosure.
- FIG. 7 is a schematic diagram of expanding the capacity of the HMC memory of an HMC memory-based neural network computing device according to actual needs, in accordance with an embodiment of the present disclosure.
- FIG. 8 is a flow chart of a high bandwidth memory based neural network calculation method in accordance with an embodiment of the present disclosure.
- FIG. 9 is a flowchart of an HMC memory-based neural network calculation method according to an embodiment of the present disclosure.
- 1-HMC memory module; 2-neural network processor; 3-external access unit; 4-other modules; 11-hybrid memory cube; 12-logic base layer; 101, 201, 401-package substrate; 102, 202-interposer; 103, 203, 403-logic die; 104-high-bandwidth memory; 105, 205, 402-neural network processor; 204-HBM memory control module; 206, 406-through hole; 207, 407-micro solder ball; 208, 405-dynamic random access memory (DRAM); 301-storage interface; 302-package structure; 303-control unit; 304, 404-HBM memory control module; 305-buffer; 306-buffer control module; 307-neural processing unit.
- the present disclosure provides a neural network computing device, including:
- At least one storage device; and
- At least one neural network processor is electrically connected to the storage device, the neural network processor exchanges data with the storage device, and performs neural network calculation.
- in one embodiment of the present disclosure, the storage device is a high-bandwidth memory (correspondingly, the neural network computing device is a high-bandwidth-memory-based neural network computing device), and each high-bandwidth memory includes a plurality of memories accumulated in a stack; at least one neural network processor is electrically coupled to the high-bandwidth memory, exchanges data with it, and performs neural network calculations.
- High-Bandwidth Memory (HBM) is a new type of low-power memory featuring an ultra-wide communication data path, low power consumption, and a small area.
- An embodiment of the present disclosure provides a neural network computing device based on a high-bandwidth memory.
- referring to FIG. 1, the neural network computing device includes: a package substrate 101 (Package Substrate), an interposer 102 (Interposer), a logic die 103 (Logic Die), a high-bandwidth memory 104 (Stacked Memory), and a neural network processor 105.
- the package substrate 101 is configured to carry the other components of the neural network computing device and is electrically connected to the host device, such as a computer, a mobile phone, and various embedded devices.
- the interposer 102 is formed on the package substrate 101 and carries the logic die 103 and the neural network processor 105.
- the logic die 103 is formed on the interposer 102 and is used to connect the interposer 102 and the high-bandwidth memory 104, providing one level of packaging for the high-bandwidth memory 104.
- the high bandwidth memory 104 is formed on a logic die 103 that includes a plurality of memories stacked (stacked) in a direction perpendicular to the package substrate 101.
- the neural network processor 105 is also formed on the interposer 102 and is used to perform neural network calculations; it can complete an entire neural network calculation or carry out basic neural network operations such as convolution. The neural network processor 105 is electrically connected to the logic die 103 through the interposer 102 and exchanges data with the high-bandwidth memory 104.
- referring to FIG. 2, which shows a cross-section of this embodiment in a direction perpendicular to the package substrate, the neural network computing device has a 2.5D (2.5-dimensional) memory architecture.
- the high-bandwidth memory includes four dynamic random access memories (DRAMs) 208 that are stacked using a micro-bump (μbump) process, with micro solder balls 207 formed between adjacent DRAMs 208.
- the bottommost DRAM 208 is formed on the logic die 203 by a micro-bumping process
- the logic die 203 and the neural network processor 205 are formed on the interposer 202 by a micro-bumping process
- the interposer 202 is formed on the package substrate 201 by a flip-chip soldering process.
- Through holes 206 are formed in the DRAM 208 by using a through-silicon via (TSVs) process
- through holes are formed in the logic die 203 and the interposer 202 by using a through-silicon via process
- the wires are arranged by using the through holes and the micro solder balls.
- the DRAM 208 is electrically connected to the logic die 203, and the logic die 203 is electrically connected to the neural network processor 205 through wires in the vias of the interposer 202, realizing interconnection between the high-bandwidth memory and the neural network processor 205. Under the control of the HBM memory control module 204 of the neural network processor 205, data is transferred between the neural network processor 205 and the high-bandwidth memory.
- existing GDDR memory has a channel width of 32 bits, so a 16-channel memory bus is 512 bits wide.
- the high bandwidth memory may include four DRAMs, each having two 128-bit channels, and the high bandwidth memory providing a bit width of 1024 bits, which is twice the bit width of the above GDDR memory.
- the neural network computing device may have multiple logic dies and, correspondingly, multiple high-bandwidth memories; the number of DRAMs in each high-bandwidth memory may also be greater than four, and the numbers of these components can be set according to actual needs.
- a neural network computing device can include four logic dies and four high-bandwidth memories, each high-bandwidth memory including four DRAMs, each DRAM having two 128-bit channels.
- Each high-bandwidth memory can provide 1024 bits of bit width, and four high-bandwidth memories can provide 4096 bits of bit width, which is eight times the bit width of the above GDDR memory.
- Each high-bandwidth memory may also include eight DRAMs, each with two 128-bit channels; each high-bandwidth memory then provides a 2048-bit width, and four such high-bandwidth memories provide an 8192-bit width, which is sixteen times the bit width of the above GDDR memory.
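To make the bit-width arithmetic in the preceding paragraphs concrete, the short sketch below recomputes the configurations described in this embodiment. It is illustrative only: the stack counts, DRAM counts, and channel widths are taken from the text, while the helper name is invented here.

```python
def bus_width_bits(num_stacks, drams_per_stack, channels_per_dram, channel_width_bits):
    """Total bit width exposed by a group of stacked high-bandwidth memories."""
    return num_stacks * drams_per_stack * channels_per_dram * channel_width_bits

gddr_width = 16 * 32                        # prior-art GDDR: 16 channels x 32 bits = 512 bits
hbm_1x4 = bus_width_bits(1, 4, 2, 128)      # one stack, four DRAMs   -> 1024 bits (2x GDDR)
hbm_4x4 = bus_width_bits(4, 4, 2, 128)      # four stacks, four DRAMs -> 4096 bits (8x GDDR)
hbm_4x8 = bus_width_bits(4, 8, 2, 128)      # four stacks, eight DRAMs -> 8192 bits (16x GDDR)

print(hbm_1x4 // gddr_width, hbm_4x4 // gddr_width, hbm_4x8 // gddr_width)  # 2 8 16
```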
- the neural network processor includes: a storage interface 301 (Memory Interface), a control unit 303 (Control Processor), an HBM memory control module 304 (HBM Controller), a buffer 305 (BUFFER), a buffer control module 306 (Buffer Controller), and a neural processing unit 307 (NFU), where the control unit 303, the HBM memory control module 304, the buffer 305, the buffer control module 306, and the neural processing unit 307 are packaged together to form a package structure 302.
- the storage interface 301, serving as the interface between the neural network processor and the high-bandwidth memory, is electrically connected by wires to the logic die and to the DRAMs of the high-bandwidth memory, and is used for receiving data transmitted by the high-bandwidth memory and transmitting data to the high-bandwidth memory.
- the HBM memory control module 304 is configured to control data transmission between the high bandwidth memory and the buffer, including coordinating the data bandwidth of the high bandwidth memory and the buffer, and synchronizing the clock of the high bandwidth memory and the buffer.
- the HBM memory control module 304 synchronizes the clocks of the high-bandwidth memory and the buffer, converts the bandwidth of data received from the high-bandwidth memory via the storage interface 301 into a bandwidth matched to the buffer, and transmits the bandwidth-matched data to the buffer; it likewise converts the bandwidth of the buffer's data into a bandwidth matched to the high-bandwidth memory and transmits the bandwidth-matched data to the high-bandwidth memory via the storage interface 301.
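A minimal sketch of the width-matching role described above, modeling data as byte strings. The 1024-bit HBM word comes from the example in this disclosure, while the 256-bit buffer word is an assumption made only for illustration; clock synchronization, which the module also performs, is not modeled.

```python
HBM_WORD_BITS = 1024   # width of one high-bandwidth-memory transfer (from the example above)
BUF_WORD_BITS = 256    # assumed width accepted by the on-chip buffer

def hbm_to_buffer(hbm_word: bytes):
    """Split one wide HBM word into buffer-width words."""
    assert len(hbm_word) * 8 == HBM_WORD_BITS
    step = BUF_WORD_BITS // 8
    return [hbm_word[i:i + step] for i in range(0, len(hbm_word), step)]

def buffer_to_hbm(buf_words):
    """Merge buffer-width words back into one HBM-width word."""
    word = b"".join(buf_words)
    assert len(word) * 8 == HBM_WORD_BITS
    return word

chunks = hbm_to_buffer(bytes(128))          # 1024 bits split into four 256-bit words
assert buffer_to_hbm(chunks) == bytes(128)  # round trip back to the HBM width
```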
- the buffer 305 is an internal storage unit of the neural network processor for receiving bandwidth-matched data transmitted by the HBM memory control module 304 and transmitting the stored data to the HBM memory control module 304.
- the buffer control module 306 is configured to control data interaction between the buffer 305 and the neural processing unit 307: it transmits the data stored in the buffer 305 to the neural processing unit 307, the neural processing unit 307 performs the neural network calculation, and the buffer control module 306 transmits the calculation result of the neural processing unit 307 back to the buffer 305.
- the control unit 303 decodes instructions and sends control commands to the HBM memory control module 304, the buffer 305, the buffer control module 306, and the neural processing unit 307, coordinating and scheduling these modules so that they work together to implement the computing functions of the neural network processor.
- it can thus be seen that, by using high-bandwidth memory with a stacked storage structure and a neural network processor with an HBM memory control module, the storage bandwidth can be greatly improved, increasing to more than twice that of the prior art, and computing performance is greatly improved.
- during neural network operation, the high-bandwidth memory serves as the memory of the neural network computing device, so input data and operation parameters can be exchanged between the buffer and the memory more quickly, greatly reducing I/O time.
- because the high-bandwidth memory is a stacked structure that occupies no additional horizontal plane space, the area of the neural network computing device can be greatly reduced, to about 5% of that of the prior art, and the power consumption of the device is also reduced. In addition, the DRAMs are interconnected by micro-bump and through-silicon-via wiring, and the interposer is used for data exchange between the neural network processor and the high-bandwidth memory, further improving the transmission bandwidth and transmission speed between different DRAMs and between the neural network processor and the high-bandwidth memory.
- in another embodiment, referring to FIG. 4, the neural network computing device has a 3D (3-dimensional) memory architecture, including, stacked from bottom to top, a package substrate 401, a neural network processor 402, a logic die 403, and a high-bandwidth memory.
- the high-bandwidth memory includes four DRAMs 405.
- the DRAM 405 is stacked and stacked by a micro-bumping process.
- a micro solder ball 407 is formed between the adjacent DRAMs 405.
- the bottommost DRAM 405 of the high-bandwidth memory is formed on the logic die 403 by a micro-bumping process.
- the logic die 403 is formed on the neural network processor 402 by a micro-bumping process, and the neural network processor 402 is formed on the package substrate 401 by a micro-bumping process.
- a via hole 406 is formed in the DRAM 405 by a through-silicon via process, and a via hole is formed in the logic die 403 by using a through-silicon via process.
- wires are arranged using the via holes and the micro solder balls so that the DRAM 405, the logic die 403, and the neural network processor 402 are electrically connected, enabling vertical interconnection of the high-bandwidth memory with the neural network processor 402. Under the control of the HBM memory control module 404 of the neural network processor 402, data is transferred between the neural network processor 402 and the high-bandwidth memory.
- because the high-bandwidth memory is stacked directly on the neural network processor, this neural network computing device further reduces the device area compared with the 2.5D storage architecture, which is particularly advantageous for miniaturization; moreover, the distance between the high-bandwidth memory and the neural network processor is shorter, meaning shorter wiring between the two, so signal transmission quality and transmission speed can be further improved.
- in another embodiment of the present disclosure, the storage device is an HMC memory module (correspondingly, the neural network computing device is a neural network computing device based on Hybrid Memory Cube (HMC) memory); at least one neural network processor is connected to the HMC memory module and is configured to acquire the data and instructions required for the neural network operation from the HMC memory module, perform a partial neural network operation, and write the operation result back to the HMC memory module.
- the HMC-based neural network computing device and method can effectively meet the data transmission and storage requirements of a neural network computing device during operation; the computing device can provide unprecedented system performance and bandwidth while offering high memory utilization, low power consumption, low cost, and a fast data transfer rate.
- FIG. 5 is a schematic diagram of the overall structure of an HMC memory-based neural network computing device according to an embodiment of the present disclosure. As shown in FIG. 5, the device includes an HMC memory module 1, a neural network processor 2, and an external access unit 3.
- the HMC memory module 1 includes a plurality of cascaded HMC memory units, each of which includes a hybrid memory cube and a logic base layer.
- the hybrid memory cube is formed by a plurality of memory die layers connected through through-silicon vias (TSVs), and the logic base layer includes a logic control unit that controls data read and write operations of the hybrid memory cube, as well as links for connecting to an external processor, other HMC devices, or the external access unit.
- the neural network processor is configured to perform the neural network operation: it acquires the required instructions and data from the HMC memory module, performs a partial neural network operation, writes the computed intermediate value or final result back to the HMC memory module, and meanwhile transmits data-preparation-completion signals to other neural network processors through the data path between the neural network processors.
- the data includes first data and second data: the first data includes network parameters (such as weights and offsets) and a function table, the second data includes network input data, and the data may further include intermediate values produced during the operation.
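As a rough illustration of this split, the first and second data can be pictured as two small containers; the field names below are invented for the example and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FirstData:                  # network parameters kept in the HMC memory module
    weights: List[float]
    offsets: List[float]          # biases
    function_table: List[float]   # e.g. sampled activation-function values

@dataclass
class SecondData:                 # per-run data
    network_input: List[float]
    intermediates: List[float] = field(default_factory=list)  # values written back mid-run

fd = FirstData(weights=[0.1, 0.2], offsets=[0.0], function_table=[0.0, 0.5, 1.0])
sd = SecondData(network_input=[1.0, 2.0])
```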
- the external access unit connects the HMC memory module to an externally designated address; it reads the neural network instructions and data required for the operation from the externally designated address into the HMC memory module, and writes the neural network operation results from the HMC memory module out to the externally designated address space.
- the cascaded HMC memory units are uniformly addressed, and the cascading manner and specific implementation of the units are transparent to the neural network processor, which reads and writes locations in the HMC memory module simply through the memory addresses of the cascaded HMC memory units.
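Unified addressing over cascaded units can be sketched as a flat address space partitioned by unit capacity. The capacity and unit count below are assumptions for illustration only; the disclosure does not specify them.

```python
UNIT_CAPACITY = 4 * 2**30   # assume each HMC memory unit holds 4 GB
NUM_UNITS = 2               # assume two cascaded units addressed as one module

def route(address: int):
    """Map a flat HMC-module address to (unit index, local address within that unit)."""
    assert 0 <= address < NUM_UNITS * UNIT_CAPACITY, "address outside the module"
    return address // UNIT_CAPACITY, address % UNIT_CAPACITY

# The processor only ever sees the flat address; the cascading stays transparent.
print(route(64))                # (0, 64)  -> first unit
print(route(4 * 2**30 + 64))    # (1, 64)  -> second unit
```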
- the neural network processor may use a multi-core processor in a specific implementation process to improve the efficiency of operation.
- the HMC memory module of the present disclosure can be used by a multi-core neural network processor.
- multiple neural network processors are interconnected and information is transmitted between each other.
- when some of the neural network processors are in an operating state and the other neural network processors are waiting for an operation result from one of those processors, the waiting processors remain in a wait state; after that processor writes its operation result to the HMC memory module and sends the data-delivery signal to the other neural network processors, the waiting processors are woken up, read the corresponding data from the HMC memory module, and perform the corresponding operations.
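The wait/wake behavior described above can be pictured with a small threading sketch, in which an event object stands in for the inter-processor "data preparation complete" signal and a dictionary stands in for the HMC memory module; everything here is illustrative only.

```python
import threading

hmc = {}                        # stands in for the shared HMC memory module
ready = threading.Event()       # stands in for the data-preparation-completion signal

def producer():                 # a processor currently in the operating state
    hmc["partial_result"] = [0.5, 1.25]   # write the operation result back to HMC memory
    ready.set()                           # notify the waiting processors

def consumer():                 # a processor waiting for that result
    ready.wait()                          # sleeps until it is woken up
    data = hmc["partial_result"]          # read the corresponding data
    hmc["next_result"] = [x * 2 for x in data]   # perform its own partial operation

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t2.start(); t1.start(); t1.join(); t2.join()
print(hmc["next_result"])       # [1.0, 2.5]
```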
- FIG. 6 is a schematic diagram of an HMC memory unit of the HMC memory-based neural network computing device according to an embodiment of the present disclosure. As shown in FIG. 6, the HMC memory unit includes a hybrid memory cube 11 and a logic base layer 12.
- the hybrid memory cube 11 is a memory composed of a plurality of memory banks, each of which includes a plurality of memory die layers. In this embodiment, the hybrid memory cube is composed of 16 memory banks, each including 16 memory die layers.
- the top and bottom of the memory die layer are interconnected using a through silicon via (TSV) structure.
- the TSV interconnects a plurality of memory die layers, adding another dimension in addition to the rows and columns to form a three-dimensional structure of the die.
- each memory die layer is a conventional dynamic random access memory (DRAM).
- the logic base layer is configured to directly select (strobe) the desired memory die layer by going vertically up the stack.
- rows and columns are used to organize DRAM.
- a read or write request to a specific location in a memory bank is performed by that bank's memory bank controller. The logic base layer includes a logic control unit that controls data read and write operations of the hybrid memory cube; the logic control unit includes a plurality of memory bank controllers, with each memory bank having a corresponding memory bank controller in the logic base layer for managing 3D access control of that bank's memory.
- this 3D layered approach allows memory to be accessed not only along rows and columns but also in parallel across multiple memory die layers.
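One way to picture this 3D organization is to decompose a local address into bank, die layer, row, and column fields. The 16 banks and 16 die layers come from the embodiment above; the row and column counts are assumed purely for illustration.

```python
BANKS, LAYERS, ROWS, COLS = 16, 16, 1024, 256   # ROWS and COLS are assumed sizes

def decode(addr: int):
    """Split a local HMC address into (bank, layer, row, column)."""
    col = addr % COLS;     addr //= COLS
    row = addr % ROWS;     addr //= ROWS
    layer = addr % LAYERS; addr //= LAYERS
    return addr % BANKS, layer, row, col

# Requests that decode to different banks (or different die layers within a bank)
# can be serviced in parallel by the per-bank controllers in the logic base layer.
for request in (0, 300_000, 600_000, 900_000):
    print(request, decode(request))
```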
- the logic base layer further includes links to external HMC memory units for connecting a plurality of HMC memory units to increase the total capacity of the HMC storage device.
- the links include external I/O links connected to the HMC memory module and internal routing and switching logic; the external I/O links comprise a plurality of logical links attached to the switching logic, which directs the internal routing to control data transfers for each memory bank or to forward data to other HMC memory units and to the neural network processors.
- the external I/O links comprise 4 or 8 logical links, each logical link consisting of 16 or 8 serial I/O (SerDes) bidirectional lane groups, respectively.
- HMC memory units with 4 logical links can transmit data at 10 Gbps, 12.5 Gbps, or 15 Gbps, while HMC memory units with 8 logical links can transmit data at 10 Gbps.
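Treating the figures above as per-lane SerDes rates (an assumption; the text only gives the rates and lane counts), the aggregate raw link bandwidth can be estimated as follows, ignoring protocol overhead and full-duplex accounting.

```python
def link_bandwidth_gbps(num_links, lanes_per_link, lane_rate_gbps):
    """Aggregate raw serial bandwidth over all logical links, one direction."""
    return num_links * lanes_per_link * lane_rate_gbps

for rate in (10, 12.5, 15):                       # 4 links x 16 lanes
    print("4x16 @", rate, "Gbps ->", link_bandwidth_gbps(4, 16, rate), "Gbps")
print("8x8 @ 10 Gbps ->", link_bandwidth_gbps(8, 8, 10), "Gbps")   # 8 links x 8 lanes
```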
- in the address strobing phase, the switching logic determines whether the HMC memory address arriving over the external I/O link belongs to this HMC memory unit. If so, the HMC memory address is converted into the internal hybrid-memory-cube address format that the internal routing can recognize; if not, the address is forwarded to another HMC memory unit connected to this one.
- in the data read/write phase, if the read or write takes place on this HMC memory unit, the unit is responsible for data conversion between the external I/O link and the internal routing; if it does not, the unit forwards the data between the two external I/O links connected to it to complete the data transmission.
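The two-phase behavior just described, address strobing followed by local data access or forwarding, can be sketched as below. The class layout, method names, and address windows are invented for the example.

```python
class HMCUnit:
    def __init__(self, base, size, neighbor=None):
        self.base, self.size = base, size   # module-level address window owned by this unit
        self.neighbor = neighbor            # next cascaded HMC memory unit, if any
        self.cube = {}                      # stands in for the hybrid memory cube

    def owns(self, addr):
        """Address strobing phase: does this module-level address fall on this unit?"""
        return self.base <= addr < self.base + self.size

    def access(self, addr, value=None):
        """Data read/write phase: service locally, or forward over the external link."""
        if self.owns(addr):
            local = addr - self.base        # convert to an internal cube address
            if value is None:
                return self.cube.get(local, 0)      # read
            self.cube[local] = value                # write
            return value
        assert self.neighbor is not None, "no cascaded unit to forward to"
        return self.neighbor.access(addr, value)    # forward to the next HMC unit

unit1 = HMCUnit(base=2**32, size=2**32)
unit0 = HMCUnit(base=0, size=2**32, neighbor=unit1)
unit0.access(2**32 + 8, value=42)   # lands on unit1 via forwarding
print(unit0.access(2**32 + 8))      # 42
```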
- FIG. 7 is a schematic diagram of expanding the capacity of the HMC memory of an HMC memory-based neural network computing device according to actual needs, in accordance with an embodiment of the present disclosure.
- the logical base layer structure in the HMC memory unit supports attaching the device to a neural network processor or another HMC memory unit.
- the total capacity of the HMC storage device (memory module) can be increased without changing the structure of the HMC memory unit.
- Multiple HMC memory units can be connected to other modules 4 through topology connection.
- as shown in FIG. 7, by connecting two HMC memory units, the memory capacity of the neural network computing device can be doubled.
- this design allows the neural network computing device to dynamically change the number of HMC memory units according to the needs of the actual application, so that the HMC memory module is fully utilized while the configurability of the neural network computing device is greatly improved.
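As a sketch of how the number of cascaded units might be chosen from the memory a network actually needs, assuming a per-unit capacity and model sizes invented purely for illustration:

```python
import math

UNIT_CAPACITY_BYTES = 4 * 2**30      # assume each HMC memory unit provides 4 GB

def units_needed(weight_count, activation_count, bytes_per_value=4):
    """Smallest number of cascaded HMC memory units covering the model's memory need."""
    total = (weight_count + activation_count) * bytes_per_value
    return max(1, math.ceil(total / UNIT_CAPACITY_BYTES))

# A hypothetical network with 1.5e9 weights and 2e8 activation values needs two units,
# i.e. the doubled capacity obtained by cascading two HMC memory units as in FIG. 7.
print(units_needed(1_500_000_000, 200_000_000))   # 2
```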
- Another embodiment of the present disclosure further discloses a chip including the high bandwidth memory based neural network computing device of the above embodiment.
- Another embodiment of the present disclosure also discloses a chip package structure including the above chip.
- Another embodiment of the present disclosure also discloses a board that includes the above chip package structure.
- Another embodiment of the present disclosure also discloses an electronic device including the above board.
- Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
- the vehicle includes an airplane, a ship, and/or a vehicle;
- the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, a range hood;
- the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
- the present disclosure also provides a neural network calculation method, including:
- in one embodiment, the neural network calculation method performs neural network calculation with the high-bandwidth-memory-based neural network computing device and, referring to FIG. 8, includes:
- Step S1: write the operation parameters for the neural network calculation into the high-bandwidth memory.
- the high bandwidth memory is connected to an external storage device such as an external disk through an external access unit.
- the external access unit writes the operation parameters of the external specified address to the high-bandwidth memory, and the operation parameters include the weight, the offset table, and the function table.
- Step S2: transmit the input data for the neural network calculation from the high-bandwidth memory to the buffer of the neural network processor. This step may specifically include:
- Sub-step S21: the HBM memory control module performs addressing according to the start address of the input data. If the address hits, data of the bit width provided by the high-bandwidth memory, starting from the start address, is transmitted sequentially through the storage interface to the HBM memory control module until all of the input data has been transferred to the HBM memory control module.
- For example, if a high-bandwidth memory provides a 1024-bit width and the stored input data totals 4096 bits, the high-bandwidth memory transmits 1024 bits of input data to the HBM memory control module each time, and all of the input data is transferred after four transmissions.
- Sub-step S22: the HBM memory control module converts the bit width of the input data into a bit width matched to the buffer and transmits the bit-width-matched input data to the buffer.
- the input data can be the input neuron vector calculated by the neural network.
- Step S3: the operation parameters for the neural network calculation are transmitted from the high-bandwidth memory to the buffer of the neural network processor. This step is similar to step S2 and may specifically include:
- Sub-step S31: the HBM memory control module performs addressing according to the start address of the operation parameters. If the address hits, data of the bit width provided by the high-bandwidth memory, starting from the start address, is transmitted sequentially through the storage interface to the HBM memory control module until all of the operation parameters have been transferred to the HBM memory control module.
- Sub-step S32: the HBM memory control module converts the bit width of the operation parameters into a bit width matched to the buffer and transmits the bit-width-matched operation parameters to the buffer.
- Step S4: the buffer control module transmits the input data and the operation parameters stored in the buffer to the neural processing unit; the neural processing unit processes the input data and the operation parameters to obtain the output data of the current neural network calculation, and the buffer control module stores the output data in the buffer.
- if intermediate data is produced, the buffer control module stores the intermediate data in the buffer while the neural processing unit continues the operation; when the intermediate data is required as an input to the operation, the buffer control module transmits it back to the neural processing unit, which continues the operation using the intermediate data to obtain the output data of the neural network calculation. The output data may be the output neuron vector.
- Step S5: the output data in the buffer is transferred to the high-bandwidth memory, and the output data is then transmitted to the external storage device through the external access unit.
- specifically, the HBM memory control module converts the bit width of the output data into a bit width matched to the high-bandwidth memory and transmits the bit-width-matched output data to the high-bandwidth memory; the external access unit then transfers the output data stored in the high-bandwidth memory to the external storage device.
- the process then returns to step S2 to obtain the operation result of the next neural network calculation.
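Putting steps S1 to S5 together, the following is a highly simplified functional sketch of one pass of this method. The dictionaries and the toy element-wise operation merely stand in for the hardware modules and the real neural network calculation.

```python
hbm, buf = {}, {}                                        # high-bandwidth memory, on-chip buffer
external = {"params": [0.1, 0.2], "input": [1.0, 2.0]}   # external storage (source and sink)

def nfu(inputs, params):
    """Stand-in for the neural processing unit: a toy element-wise product."""
    return [x * w for x, w in zip(inputs, params)]

hbm["params"] = external["params"]          # S1: access unit writes operation parameters to HBM
hbm["input"] = external["input"]            # (input data assumed already staged in HBM)
buf["input"] = hbm["input"]                 # S2: HBM memory control module fills the buffer
buf["params"] = hbm["params"]               # S3: same path for the operation parameters
buf["output"] = nfu(buf["input"], buf["params"])   # S4: NFU computes, result kept in the buffer
hbm["output"] = buf["output"]               # S5: output back to HBM ...
external["output"] = hbm["output"]          # ... and out through the external access unit
print(external["output"])                   # [0.1, 0.4]; control then returns to step S2
```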
- the neural network calculation method of this embodiment, using high-bandwidth memory with a stacked storage structure and a neural network processor with an HBM memory control module, can greatly improve the storage bandwidth, so computing performance as well as signal transmission bandwidth and transmission speed are greatly improved.
- in another embodiment, the neural network calculation method performs neural network calculation with the neural network computing device based on the HMC memory module and, referring to FIG. 9, includes:
- Step S1: the external access unit writes the data and instructions at the externally designated address into the HMC memory module;
- Step S2: the neural network processor reads, from the HMC memory module, the data and instructions required for performing a partial operation of the neural network;
- Step S3: the neural network processor performs the partial operation of the neural network and writes the intermediate value or final value obtained by the operation back to the HMC memory module;
- Step S4: the operation result is written from the HMC memory module to the externally designated address via the external access unit;
- Step S5: if the operation instructions contain an operation termination instruction, execution is terminated; otherwise, the process returns to step S1.
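A condensed sketch of the loop formed by these steps, with a toy doubling operation standing in for the partial neural network operation and a dictionary standing in for the HMC memory module:

```python
def hmc_method(batches):
    """Steps S1-S5 of the HMC-based method, run over a list of input batches."""
    hmc, results = {}, []                     # HMC memory module stand-in, collected outputs
    for i, batch in enumerate(batches):
        last = (i == len(batches) - 1)
        hmc["data"] = batch                                   # S1: write data ...
        hmc["instr"] = {"op": "double", "terminate": last}    # ... and instructions into HMC
        data, instr = hmc["data"], hmc["instr"]               # S2: processor reads them back
        hmc["result"] = [x * 2 for x in data]                 # S3: partial operation, write back
        results.append(hmc["result"])                         # S4: result out via access unit
        if instr["terminate"]:                                # S5: stop on termination instruction
            break                                             # otherwise loop back to S1
    return results

print(hmc_method([[1, 2], [3, 4]]))   # [[2, 4], [6, 8]]
```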
- in a specific implementation, the method includes the following steps. Step S11: the first data at the externally designated address is written into the HMC memory module 1 through the external access unit 3.
- the first data includes the weights, offsets, function table, and the like used for performing the neural network operation. In step S12, the instructions at the externally designated address are written into the HMC memory module 1 through the external access unit 3.
- the instructions include the operation instructions for the current neural network operation; if no further operation is to be performed after the current neural network operation, the instructions also include an operation termination instruction. In step S13, the second data at the externally designated address is written into the HMC memory module 1 through the external access unit 3.
- the second data includes the input data for the current neural network operation.
- the step in which the neural network processor reads the data required for performing the partial neural network operation from the HMC memory module may include: S21, setting the data preparation state of the neural network processor that is about to operate to prepared; S22, the neural network processor whose data preparation state is prepared reads the data required for the partial neural network operation from the HMC memory module; and S23, after the reading is completed, setting that neural network processor's data preparation state back to not prepared.
- step S3 further includes determining whether the neural network operation has ended: if it has, the process goes to step S4; otherwise, a data-preparation-completion signal is transmitted to the other neural network processors, the data preparation state in those processors is set to prepared, and the process returns to step S2.
- in this HMC memory-based neural network operation method, when some of the neural network processors are in an operating state and the other neural network processors are waiting for an operation result from one of those processors, the waiting processors remain in a wait state; after that processor writes its operation result to the HMC memory module and sends the data-delivery signal to the other neural network processors, the waiting processors are woken up, read the corresponding data from the HMC memory module, and perform the corresponding operations.
- each functional unit in various embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- Each functional unit/module may be hardware; for example, the hardware may be a circuit, including digital circuits, analog circuits, and the like.
- Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like.
- the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Neurology (AREA)
- Memory System (AREA)
Abstract
Description
本公开涉及神经网络计算领域,尤其是一种神经网络计算装置和方法。The present disclosure relates to the field of neural network computing, and more particularly to a neural network computing apparatus and method.
目前人工智能领域正在飞速发展,机器学习也在影响着人们生活的方方面面。作为机器学习领域一个重要组成部分,神经网络方面的研究也是工业界和学术界同时关注的热点。由于神经网络计算中庞大的数据量,如何处理神经网络算法执行成为需要我们解决的重要问题。因此专用的神经网络计算装置应运而生。At present, the field of artificial intelligence is developing rapidly, and machine learning is also affecting all aspects of people's lives. As an important part of the machine learning field, research on neural networks is also a hot topic in both industry and academia. Due to the huge amount of data in neural network computing, how to deal with neural network algorithm execution becomes an important issue that we need to solve. Therefore, a dedicated neural network computing device has emerged.
目前阶段神经网络计算装置的架构中使用的动态随机存取存储区DRAM的类型大多为高性能显卡用内存(Graphics Double Data Rate Version 4,GDDR4)或高性能显卡用内存(Graphics Double Data Rate Version 5,GDDR5)。然而在神经网络计算装置中,由于需要考虑带宽、性能、功耗和面积方面的问题,GDDR4或GDDR5已经不能完全满足神经网络计算装置的需要,技术发展也已进入了瓶颈期。每秒增加1GB的带宽将会带来更多的功耗,这不论对于设计人员还是消费者来说都不是一个明智、高效或合算的选择。同时,GDDR4或GDDR5还存在着难以缩小面积的严重问题。因此,GDDR4或GDDR5将会渐渐阻碍神经网络计算装置性能的持续增长。At present, the types of dynamic random access memory DRAMs used in the architecture of neural network computing devices are mostly high performance graphics memory (Graphics Double Data Rate Version 4, GDDR4) or high performance graphics memory (Graphics Double Data Rate Version 5). , GDDR5). However, in the neural network computing device, due to the consideration of bandwidth, performance, power consumption and area, GDDR4 or GDDR5 can not fully meet the needs of the neural network computing device, and the technological development has entered a bottleneck period. Adding 1GB of bandwidth per second will result in more power consumption, which is not a sensible, efficient or cost-effective option for designers and consumers. At the same time, GDDR4 or GDDR5 still has serious problems that are difficult to reduce the area. Therefore, GDDR4 or GDDR5 will gradually hinder the continuous growth of the performance of neural network computing devices.
发明内容Summary of the invention
(一)要解决的技术问题(1) Technical problems to be solved
鉴于上述技术问题,本公开提供了一种神经网络计算装置和方法,用于解决上面提出的神经网络计算装置所面临的带宽,能耗,面积等方面的瓶颈。In view of the above technical problems, the present disclosure provides a neural network computing apparatus and method for solving bottlenecks in terms of bandwidth, power consumption, area, and the like faced by the neural network computing device proposed above.
(二)技术方案(2) Technical plan
根据本公开的一个方面,提供了一种神经网络计算装置,包括:存储装置;以及神经网络处理器,与所述存储装置电性连接,所述神经网络处理器与存储装置之间进行数据交换,并执行神经网络计算。According to an aspect of the present disclosure, a neural network computing device is provided, including: a storage device; and a neural network processor electrically connected to the storage device, and the data exchange between the neural network processor and the storage device And perform neural network calculations.
根据本公开的另一个方面,提供了一种芯片,包括所述的神经网络计 算装置。According to another aspect of the present disclosure, a chip is provided, including the neural network meter Calculation device.
根据本公开的又一个方面,提供了一种芯片封装结构,包括所述的芯片。According to still another aspect of the present disclosure, a chip package structure including the chip is provided.
根据本公开的又一个方面,提供了一种板卡,包括所述的芯片封装结构。According to still another aspect of the present disclosure, a board is provided, including the chip package structure.
根据本公开的又一个方面,提供了一种电子装置,包括所述的板卡。According to still another aspect of the present disclosure, an electronic device including the board is provided.
根据本公开的又一个方面,提供了一种神经网络计算方法,包括:由存储装置存储数据;以及由神经网络处理器与所述存储装置进行数据交换,并执行神经网络计算。According to still another aspect of the present disclosure, a neural network calculation method is provided, comprising: storing data by a storage device; and performing data exchange with the storage device by a neural network processor and performing neural network calculation.
(三)有益效果(3) Beneficial effects
从上述技术方案可以看出,本公开神经网络计算装置和方法至少具有以下有益效果其中之一:It can be seen from the above technical solutions that the neural network computing device and method of the present disclosure have at least one of the following beneficial effects:
(1)在神经网络运算过程中,高宽带存储器作为神经网络计算装置的内存,能更快地在缓存器与内存之间进行输入数据和运算参数的数据交换,这使得I/O时间大大缩短。(1) In the process of neural network operation, high-bandwidth memory, as the memory of the neural network computing device, can exchange data of input data and operation parameters between the buffer and the memory more quickly, which greatly shortens the I/O time. .
(2)采用堆栈式存储结构的高宽带存储器和具有HBM内存控制模块的神经网络处理器,可以极大地提高存储带宽,带宽可提升至现有技术的两倍以上,运算性能得到很大改善。(2) The high-bandwidth memory with stacked storage structure and the neural network processor with HBM memory control module can greatly improve the storage bandwidth, and the bandwidth can be increased to more than twice that of the prior art, and the computing performance is greatly improved.
(3)由于高宽带存储器是堆栈式(堆叠式)结构,不占用横向的平面空间,可以大幅降低神经网络计算装置的面积,神经网络计算装置的面积可以缩小至现有技术的大约5%。(3) Since the high-bandwidth memory is a stacked (stacked) structure that does not occupy a horizontal plane space, the area of the neural network computing device can be greatly reduced, and the area of the neural network computing device can be reduced to about 5% of the prior art.
(4)降低了神经网络计算装置的功耗,通过微凸焊工艺和硅穿孔工艺布线互联,提高了数据传输带宽和传输速度。(4) The power consumption of the neural network computing device is reduced, and the data transmission bandwidth and the transmission speed are improved by the micro-bumping process and the TSV process wiring interconnection.
(5)HMC的内存提供的数据传输带宽高,其数据传输带宽可以超过15倍的DDR3的带宽;降低了整体的功耗,HMC内存技术,相对于常用的DDR3/DDR4等内存技术,对于每比特的存储可以节省超过70%的功耗。(5) HMC's memory provides high data transmission bandwidth, its data transmission bandwidth can exceed 15 times the bandwidth of DDR3; reduce the overall power consumption, HMC memory technology, compared to the commonly used memory technology such as DDR3/DDR4, for each Bit storage can save more than 70% of power consumption.
(6)HMC内存模块包括级联的多个HMC内存单元,可根据神经网络运算过程中所需的实际内存大小灵活选择HMC内存单元的数量,减少功能部件的浪费。(6) The HMC memory module includes a plurality of cascaded HMC memory units, which can flexibly select the number of HMC memory units according to the actual memory size required in the neural network operation process, thereby reducing the waste of functional components.
(7)级联的多个HMC内存单元为统一编址,该多个HMC内存单元 的级联方式和具体实现对所述神经网络处理器透明,方便实现神经网络处理器对内存体位置进行读写,减小了内存部件的延时。(7) Multiple HMC memory units cascaded for unified addressing, the multiple HMC memory units The cascading mode and the specific implementation are transparent to the neural network processor, and the neural network processor is convenient to read and write the memory device position, thereby reducing the delay of the memory component.
(8)HMC内存单元采用堆栈式结构,相比于现有的RDIMM技术,可以减少超过90%的内存占用面积,同时使得神经网络运算装置的整体体积大幅度减小。由于HMC可以进行大规模并行处理,使得内存部件的延时较小。(8) The HMC memory unit adopts a stacked structure, which can reduce the memory footprint by more than 90% compared with the existing RDIMM technology, and at the same time greatly reduce the overall volume of the neural network computing device. Since the HMC can perform massive parallel processing, the latency of the memory components is small.
(9)神经网络计算装置具有多个神经网络处理器,多个神经网络处理器之间通过互连,并相互之间传递信息,避免在运行过程中产生数据一致性问题;而且神经网络运算装置支持多核处理器架构,可以充分利用神经网络运算过程中的并行性,加速神经网络的运算。(9) The neural network computing device has a plurality of neural network processors, and the plurality of neural network processors are interconnected and communicate information with each other to avoid data consistency problems during operation; and the neural network computing device Support multi-core processor architecture, can make full use of the parallelism in the neural network operation process, accelerate the operation of the neural network.
图1是依据本公开实施例的基于高带宽存储器的神经网络计算装置的整体结构示意图。1 is a block diagram showing the overall structure of a high bandwidth memory based neural network computing device in accordance with an embodiment of the present disclosure.
图2为图1所示神经网络计算装置的剖面图。2 is a cross-sectional view of the neural network computing device of FIG. 1.
图3为本公开实施例的具有HBM内存控制模块的神经网络处理器的整体架构图。3 is an overall architectural diagram of a neural network processor with a HBM memory control module in accordance with an embodiment of the present disclosure.
图4为依据本公开另一实施例的基于高带宽存储器的神经网络计算装置的剖面图。4 is a cross-sectional view of a high bandwidth memory based neural network computing device in accordance with another embodiment of the present disclosure.
图5为根据本公开实施例提供的基于HMC内存的神经网络运算装置的整体结构示意图。FIG. 5 is a schematic diagram of an overall structure of a neural network computing device based on an HMC memory according to an embodiment of the present disclosure.
图6为根据本公开实施例提供的基于HMC内存的神经网络运算装置的HMC内存单元示意图。FIG. 6 is a schematic diagram of an HMC memory unit of a HMC memory-based neural network computing device according to an embodiment of the present disclosure.
图7为根据本公开实施例提供的基于HMC内存的神经网络运算装置的HMC内存根据实际需要进行容量扩充的示意图。FIG. 7 is a schematic diagram of capacity expansion of an HMC memory of a neural network computing device based on an HMC memory according to an actual need according to an embodiment of the present disclosure.
图8为依据本公开实施例的基于高带宽存储器的神经网络计算方法的流程图。8 is a flow chart of a high bandwidth memory based neural network calculation method in accordance with an embodiment of the present disclosure.
图9为根据本公开实施例提供的基于HMC内存的神经网络计算方法的流程图。FIG. 9 is a flowchart of a HMC memory-based neural network calculation method according to an embodiment of the present disclosure.
【本公开主要元件符号说明】[Explanation of main component symbols of the present disclosure]
1-HMC内存模块;2-神经网络处理器;3-外部访问单元;4-其他模块; 11-混合内存立方体;12-逻辑基底层;101、201、401-封装基板;102、202-中介层;103、203、403-逻辑裸片;104-高带宽存储器;105、205、402-神经网络处理器;204-HBM内存控制模块;206、406-通孔;207、407-微焊球;208、405-动态随机存取存储器DRAM;301-存储接口;302-封装结构;303-控制单元;304、404-HBM内存控制模块;305-缓存器;306-缓冲控制模块;307-神经处理单元。1-HMC memory module; 2-neural network processor; 3-external access unit; 4-other modules; 11-mixed memory cube; 12-logic base layer; 101, 201, 401-package substrate; 102, 202-intermediate layer; 103, 203, 403-logic die; 104-high bandwidth memory; 105, 205, 402- Neural network processor; 204-HBM memory control module; 206, 406-through hole; 207, 407-micro solder ball; 208, 405-dynamic random access memory DRAM; 301-storage interface; 302-package structure; Control unit; 304, 404-HBM memory control module; 305-buffer; 306-buffer control module; 307-neural processing unit.
为使本公开的目的、技术方案和优点更加清楚明白,以下结合具体实施例,并参照附图,对本公开进一步详细说明。The present disclosure will be further described in detail below with reference to the specific embodiments thereof and the accompanying drawings.
本公开提供了一种神经网络计算装置,包括:The present disclosure provides a neural network computing device, including:
至少一个存储装置;At least one storage device;
至少一个神经网络处理器,与所述存储装置电性连接,所述神经网络处理器与存储装置之间进行数据交换,并执行神经网络计算。At least one neural network processor is electrically connected to the storage device, the neural network processor exchanges data with the storage device, and performs neural network calculation.
在本公开的一具体实施例中,所述存储装置为高带宽存储器(相应的,神经网络计算装置为基于高带宽存储器的神经网络计算装置),每个高带宽存储器包括堆栈式(堆叠式)累加的多个存储器;至少一个神经网络处理器,与所述高带宽存储器电性连接,所述神经网络处理器与高带宽存储器之间进行数据交换,并执行神经网络计算。In a specific embodiment of the present disclosure, the storage device is a high bandwidth memory (correspondingly, the neural network computing device is a high bandwidth memory based neural network computing device), and each high bandwidth memory includes a stacked (stacked) A plurality of accumulated memories; at least one neural network processor electrically coupled to the high bandwidth memory, the neural network processor and the high bandwidth memory exchange data, and perform neural network calculations.
其中,所述高带宽存储器(High-Bandwidth Memory,HBM)作为一种新型低功耗存储器,具有超宽通信数据通路、低功耗和面积小的优秀特性。本公开一实施例提出了一种基于高带宽存储器的神经网络计算装置,参见图1,该神经网络计算装置包括:封装基板101(Package Substrate)、中介层102(Interposer)、逻辑裸片103(Logic Die)、高带宽存储器104(Stacked Memory)和神经网络处理器105。其中,The High-Bandwidth Memory (HBM) is a new type of low-power memory with excellent communication data path, low power consumption and small area. An embodiment of the present disclosure provides a neural network computing device based on a high-bandwidth memory. Referring to FIG. 1 , the neural network computing device includes: a package substrate 101 (Package Substrate), an interposer 102 (Interposer), and a logic die 103 ( Logic Die), a high bandwidth memory 104 (Stacked Memory) and a neural network processor 105. among them,
封装基板101,用于承载神经网络计算装置的上述其他部件,并与上位设备电性连接,例如计算机、手机以及各种嵌入式设备。The package substrate 101 is configured to carry the other components of the neural network computing device and is electrically connected to the host device, such as a computer, a mobile phone, and various embedded devices.
中介层102,形成于封装基板101上,承载逻辑裸片103和神经网络处理器105。The interposer 102 is formed on the package substrate 101 and carries the logic die 103 and the neural network processor 105.
逻辑裸片103形成于中介层102上,逻辑裸片103用于连接中介层102和高带宽存储器104,实现对高带宽存储器104的一层封装。 The logic die 103 is formed on the interposer 102, and the logic die 103 is used to connect the interposer 102 and the high bandwidth memory 104 to implement a layer encapsulation of the high bandwidth memory 104.
高带宽存储器104形成于逻辑裸片103上,该高带宽存储器104包括沿垂直于封装基板101的方向上堆栈式(堆叠式)累加的多个存储器。The high bandwidth memory 104 is formed on a logic die 103 that includes a plurality of memories stacked (stacked) in a direction perpendicular to the package substrate 101.
神经网络处理器105也形成于中介层102上,用于执行神经网络计算,其可以完成整个神经网络计算,也可以完成卷积等神经网络计算的基本操作,神经网络处理器105通过中介层102与逻辑裸片103电性连接,并与高带宽存储器104之间进行数据交换。The neural network processor 105 is also formed on the interposer 102 for performing neural network calculations, which can complete the entire neural network calculation, and can also perform basic operations of neural network calculation such as convolution, and the neural network processor 105 passes through the interposer 102. The data is exchanged with the logic die 103 and with the high bandwidth memory 104.
请参见图2,其显示了本实施例的神经网络计算装置沿垂直于封装基板方向的剖面图,神经网络计算装置为2.5D(2.5维)存储架构。其中,高带宽存储器包括四个动态随机存取存储器(DRAM)208,利用微凸焊(μbumps)工艺对DRAM 208进行堆栈式累加,相邻DRAM 208之间形成有微焊球207,高带宽存储器最底层的DRAM 208利用微凸焊工艺形成于逻辑裸片203上,逻辑裸片203和神经网络处理器205利用微凸焊工艺形成于中介层202上,中介层202利用倒装芯片焊接工艺形成于封装基板201上。利用硅穿孔(Through-Silicon Vias,TSVs)工艺在DRAM 208开设通孔206,并利用硅穿孔工艺在逻辑裸片203和中介层202开设通孔,利用上述通孔和微焊球布置导线,使DRAM 208与逻辑裸片203电性连接,逻辑裸片203通过中介层202通孔的导线与神经网络处理器205电性连接,实现高带宽存储器与神经网络处理器205的互联,在神经网络处理器205的HBM内存控制模块204的控制下,数据在神经网络处理器205和高带宽存储器之间传输。Referring to FIG. 2, a cross-sectional view of the neural network computing device of the present embodiment in a direction perpendicular to the package substrate is shown. The neural network computing device is a 2.5D (2.5-dimensional) memory architecture. The high-bandwidth memory includes four dynamic random access memories (DRAMs) 208, and the DRAM 208 is stacked and stacked by a microbumps process. Micro solder balls 207 are formed between the adjacent DRAMs 208, and a high-bandwidth memory is formed. The bottommost DRAM 208 is formed on the logic die 203 by a micro-bumping process, the logic die 203 and the neural network processor 205 are formed on the interposer 202 by a micro-bumping process, and the interposer 202 is formed by a flip chip soldering process. On the package substrate 201. Through holes 206 are formed in the DRAM 208 by using a through-silicon via (TSVs) process, and through holes are formed in the logic die 203 and the interposer 202 by using a through-silicon via process, and the wires are arranged by using the through holes and the micro solder balls. The DRAM 208 is electrically connected to the logic die 203, and the logic die 203 is electrically connected to the neural network processor 205 through the wires of the vias of the interposer 202 to realize interconnection between the high bandwidth memory and the neural network processor 205, and is processed in the neural network. Under the control of the HBM memory control module 204 of the 205, data is transferred between the neural network processor 205 and the high bandwidth memory.
现有的GDDR存储器的每个通道宽度为32bits,16个通道的存储器总线宽度为512bit。而在本实施例中,高带宽存储器可以包括四个DRAM,每个DRAM具有两个128-bit的通道,高带宽存储器可以提供1024bit的位宽,是上述GDDR存储器位宽的两倍。The existing GDDR memory has a channel width of 32 bits and a 16 channel memory bus width of 512 bits. In the present embodiment, the high bandwidth memory may include four DRAMs, each having two 128-bit channels, and the high bandwidth memory providing a bit width of 1024 bits, which is twice the bit width of the above GDDR memory.
以上仅是示例性的介绍了本公开,但本公开的内容并不以此为限,例如,神经网络计算装置的逻辑裸片可以是多个,相应地,高带宽存储器也可以是多个,每个高带宽存储器的DRAM也可以大于四个,上述部件的数量可以根据实际需求而设置。The above description is only illustrative of the disclosure, but the content of the disclosure is not limited thereto. For example, the logical network of the neural network computing device may be multiple, and correspondingly, the high-bandwidth memory may also be multiple. The DRAM of each high-bandwidth memory can also be greater than four, and the number of the above components can be set according to actual needs.
例如,神经网络计算装置可以包括四个逻辑裸片和四个高带宽存储器,每个高带宽存储器包括四个DRAM,每个DRAM具有两个128-bit的通道, 每个高带宽存储器可以提供1024bit的位宽,四个高带宽存储器可以提供4096bit的位宽,是上述GDDR存储器位宽的八倍。For example, a neural network computing device can include four logical dies and four high-bandwidth memories, each high-bandwidth memory including four DRAMs, each having two 128-bit channels. Each high-bandwidth memory can provide 1024 bits of bit width, and four high-bandwidth memories can provide 4096 bits of bit width, which is eight times the bit width of the above GDDR memory.
每个高带宽存储器也可以包括八个DRAM,每个DRAM具有两个128-bit的通道,每个高带宽存储器可以提供2048bit的位宽,四个高带宽存储器可以提供8192bit的位宽,是上述GDDR存储器位宽的十六倍。Each high-bandwidth memory can also include eight DRAMs, each with two 128-bit channels, each high-bandwidth memory can provide 2048 bits of bit width, and four high-bandwidth memories can provide 8192 bits of bit width. GDDR memory is sixteen times wider than the bit width.
Referring to FIG. 3, which shows the overall architecture of the neural network processor with the HBM memory control module of this embodiment, the neural network processor includes: a memory interface 301, a control unit 303 (control processor), an HBM memory control module 304 (HBM controller), a buffer 305, a buffer control module 306 (buffer controller), and a neural processing unit 307 (NFU). The control unit 303, the HBM memory control module 304, the buffer 305, the buffer control module 306, and the neural processing unit 307 are packaged together to form a package structure 302.
The memory interface 301, serving as the interface between the neural network processor and the high-bandwidth memory, is electrically connected through wires to the logic die and to the DRAMs of the high-bandwidth memory, and is used to receive data transmitted by the high-bandwidth memory and to transmit data to the high-bandwidth memory.
The HBM memory control module 304 is used to control data transmission between the high-bandwidth memory and the buffer, which includes coordinating the data bandwidths of the high-bandwidth memory and the buffer and synchronizing their clocks. The HBM memory control module 304 synchronizes the clocks of the high-bandwidth memory and the buffer, converts the bandwidth of the high-bandwidth memory data received through the memory interface 301 into a bandwidth matching the buffer and transmits the bandwidth-matched data to the buffer; conversely, it converts the bandwidth of the buffer data into a bandwidth matching the high-bandwidth memory and transmits the bandwidth-matched data to the high-bandwidth memory through the memory interface 301.
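A minimal sketch of this width conversion is given below. The 1024-bit memory-side word and 256-bit buffer-side beat are assumed values chosen only for illustration; the disclosure does not fix these widths.

```python
# A sketch of splitting wide memory-side words into buffer-sized beats and back.
from collections import deque

def hbm_to_buffer(hbm_words, hbm_width=1024, buffer_width=256):
    """Split wide memory-side words into buffer-sized beats (memory -> buffer)."""
    beats = deque()
    per_word = hbm_width // buffer_width
    for word in hbm_words:
        for i in range(per_word):
            shift = (per_word - 1 - i) * buffer_width
            beats.append((word >> shift) & ((1 << buffer_width) - 1))
    return beats

def buffer_to_hbm(beats, hbm_width=1024, buffer_width=256):
    """Reassemble buffer-sized beats into wide memory-side words (buffer -> memory)."""
    per_word = hbm_width // buffer_width
    beats = list(beats)
    words = []
    for start in range(0, len(beats), per_word):
        word = 0
        for beat in beats[start:start + per_word]:
            word = (word << buffer_width) | beat
        words.append(word)
    return words

word = int.from_bytes(bytes(range(128)), "big")        # one 1024-bit word
assert buffer_to_hbm(hbm_to_buffer([word]))[0] == word  # round trip is lossless
```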
The buffer 305 is an internal storage unit of the neural network processor, used to receive the bandwidth-matched data transmitted by the HBM memory control module 304 and to transmit the data it stores to the HBM memory control module 304.
The buffer control module 306 is used to control the data interaction between the buffer 305 and the neural processing unit 307: it transmits the data stored in the buffer 305 to the neural processing unit 307, the neural processing unit 307 performs the neural network calculation, and the buffer control module 306 transmits the calculation result of the neural processing unit 307 back to the buffer 305.
The control unit 303 decodes instructions and sends control commands to the HBM memory control module 304, the buffer 305, the buffer control module 306, and the neural processing unit 307, coordinating and scheduling these modules so that they work together to implement the computing functions of the neural network processor.
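One way to picture this coordination is a small dispatcher that decodes an instruction and steers the corresponding module, as sketched below; the instruction fields, opcodes, and module interfaces are hypothetical and are not taken from the disclosure.

```python
from types import SimpleNamespace

def run_instruction(instr, hbm_ctrl, buffer_ctrl, nfu):
    """Decode one instruction and steer the corresponding module."""
    op = instr["op"]
    if op == "LOAD":            # memory -> buffer, via the HBM memory control module
        hbm_ctrl.load(instr["addr"], instr["size"])
    elif op == "COMPUTE":       # buffer -> NFU -> buffer, via the buffer control module
        operands = buffer_ctrl.read(instr["src"])
        buffer_ctrl.write(instr["dst"], nfu.compute(operands))
    elif op == "STORE":         # buffer -> memory, via the HBM memory control module
        hbm_ctrl.store(instr["addr"], instr["src"])
    else:
        raise ValueError(f"unknown opcode {op!r}")

# Toy stand-ins for the modules, just to exercise the dispatcher.
buffer_data = {}
nfu = SimpleNamespace(compute=lambda xs: [2 * x for x in xs])
buffer_ctrl = SimpleNamespace(read=buffer_data.get, write=buffer_data.__setitem__)
hbm_ctrl = SimpleNamespace(
    load=lambda addr, size: buffer_data.update({"in": list(range(size))}),
    store=lambda addr, src: print("stored", buffer_data[src], "at", hex(addr)),
)

for instr in ({"op": "LOAD", "addr": 0x0, "size": 4},
              {"op": "COMPUTE", "src": "in", "dst": "out"},
              {"op": "STORE", "addr": 0x100, "src": "out"}):
    run_instruction(instr, hbm_ctrl, buffer_ctrl, nfu)
```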
It can thus be seen that the neural network computing device of the present disclosure, by using high-bandwidth memory with a stacked storage structure and a neural network processor with an HBM memory control module, can greatly increase the storage bandwidth, to more than twice that of the prior art, and greatly improve the computing performance. During neural network operations, the high-bandwidth memory serves as the memory of the neural network computing device, so input data and operation parameters can be exchanged between the buffer and the memory more quickly, which greatly shortens the I/O time. Since the high-bandwidth memory is a stacked structure that does not occupy horizontal planar space, the area of the neural network computing device can be greatly reduced, to about 5% of the prior art, and its power consumption is also reduced. Moreover, the DRAMs are interconnected by wiring formed with the micro-bump process and the through-silicon-via process, and the neural network processor and the high-bandwidth memory exchange data through the interposer, which further improves the transmission bandwidth and transmission speed between the different DRAMs and between the neural network processor and the high-bandwidth memory.
Another embodiment of the present disclosure provides a neural network computing device based on high-bandwidth memory; referring to FIG. 4, features identical to those of the above embodiment are not described again. This neural network computing device has a 3D (3-dimensional) memory architecture and includes, stacked from the bottom layer to the top layer, a package substrate 401, a neural network processor 402, a logic die 403, and a high-bandwidth memory. The high-bandwidth memory includes four DRAMs 405, which are stacked using the micro-bump process, with micro solder balls 407 formed between adjacent DRAMs 405. The bottommost DRAM 405 of the high-bandwidth memory is formed on the logic die 403 by the micro-bump process, the logic die 403 is formed on the neural network processor 402 by the micro-bump process, and the neural network processor 402 is formed on the package substrate 401 by the micro-bump process. Through holes 406 are opened in the DRAMs 405 using the through-silicon-via process, and through holes are likewise opened in the logic die 403; wires are routed through these through holes and the micro solder balls so that the DRAMs 405 are electrically connected to the logic die 403 and the neural network processor 402, achieving vertical interconnection between the high-bandwidth memory and the neural network processor 402. Under the control of the HBM memory control module 404 of the neural network processor 402, data is transferred between the neural network processor 402 and the high-bandwidth memory.
It can thus be seen that, because the high-bandwidth memory is stacked directly on the neural network processor, this neural network computing device can further reduce the area of the device compared with the 2.5D memory architecture, which is particularly beneficial to the miniaturization of the neural network computing device. Moreover, the distance between the high-bandwidth memory and the neural network processor is shorter, meaning that the wiring between the two is shorter, so the signal transmission quality and transmission speed can be further improved.
In still another embodiment of the present disclosure, the storage device is an HMC memory module (correspondingly, the neural network computing device is a neural network computing device based on Hybrid Memory Cube (HMC) memory); at least one neural network processor is connected to the HMC memory module and is used to acquire the data and instructions required for a neural network operation from the HMC memory module, perform part of the neural network operation, and write the operation result back to the HMC memory module.
The HMC-based neural network computing device and method of the present disclosure can effectively meet the data transmission and storage requirements of a neural network computing device during operation. The computing device can provide unprecedented system performance and bandwidth, while offering high memory utilization, low power consumption, low cost, and a fast data transfer rate.
FIG. 5 is a schematic diagram of the overall structure of an HMC-memory-based neural network computing device according to an embodiment of the present disclosure. As shown in FIG. 5, the device includes an HMC memory module 1, a neural network processor 2, and an external access unit 3.
The HMC memory module 1 includes a plurality of cascaded HMC memory units, each of which includes a hybrid memory cube and a logic base layer.
The hybrid memory cube is formed of a plurality of memory die layers connected by through-silicon vias (TSVs), while the logic base layer includes a logic control unit that controls the read and write operations of the hybrid memory cube, as well as links to an external processor, another HMC device, or the external access unit.
The neural network processor is the functional component that performs the neural network operation. It acquires the instructions and data required for the neural network operation from the HMC memory module, performs part of the neural network operation, writes the resulting intermediate values or final results back to the HMC memory module, and at the same time sends a data-preparation-complete signal to the other neural network processors through the data paths between neural network processors. The data includes first data and second data, where the first data includes network parameters (for example weights and biases) and a function table, and the second data includes the network input data; the data may further include intermediate values produced by the operation.
The external access unit connects the HMC memory module to an externally designated address. It reads the neural network instructions and data required for the neural network operation from the externally designated address into the HMC memory module, and outputs the neural network operation results from the HMC memory module to the externally designated address space.
Further, the plurality of cascaded HMC memory units are uniformly addressed; the cascading manner and the specific implementation of the HMC memory units are transparent to the neural network processor, and the neural network processor reads from and writes to memory bank locations of the HMC memory module through the memory addresses of the cascaded HMC memory units.
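The flat view the processor sees over the cascade can be sketched as a simple address split; the per-unit capacity and two-unit cascade below are assumptions used only to illustrate the uniform addressing.

```python
# Illustrative mapping of one flat address space onto cascaded HMC memory units.
UNIT_CAPACITY = 4 * 2**30          # assume 4 GB per HMC memory unit

def decode(global_addr, num_units=2):
    """Map a flat address onto (unit index, local offset) within the cascade."""
    if not 0 <= global_addr < num_units * UNIT_CAPACITY:
        raise ValueError("address outside the cascaded address space")
    return global_addr // UNIT_CAPACITY, global_addr % UNIT_CAPACITY

print(decode(5 * 2**30))   # -> (1, 1073741824): second unit, 1 GB into it
```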
In addition, considering that a neural network processor may be implemented with multiple cores to improve operating efficiency, the HMC memory module of the present disclosure can be used by a multi-core neural network processor. Since data consistency problems may arise during actual operation, the multiple neural network processors are interconnected and pass information to one another. Specifically, when some of the neural network processors are in an operating state and the other neural network processors are waiting for the operation result of one of them, the other neural network processors first remain in a waiting state; after that neural network processor transfers its operation result to the HMC memory module and sends a data-delivery signal to the other neural network processors, the other neural network processors are woken up, read the corresponding data from the HMC memory module, and perform the corresponding operations.
FIG. 6 is a schematic diagram of an HMC memory unit of the HMC-memory-based neural network computing device according to an embodiment of the present disclosure. As shown in FIG. 6, the HMC memory unit includes a hybrid memory cube 11 and a logic base layer 12.
The hybrid memory cube 11 contains a memory composed of a plurality of memory banks, each of which includes a plurality of memory die layers. Preferably, the hybrid memory cube contains a memory composed of 16 memory banks, each including 16 memory die layers. The tops and bottoms of the memory die layers are interconnected using through-silicon via (TSV) structures. The TSVs interconnect the memory die layers, adding another dimension beyond rows and columns and forming a three-dimensional die structure. The memory die layers are conventional dynamic random access memory (DRAM). For access to the hybrid memory cube, the logic base layer is configured so that the required memory die layer can be selected directly in the vertical direction. Within a memory die layer, the DRAM is organized by rows and columns, and read or write requests to specific locations in a memory bank are executed by the memory bank controller.
The logic base layer includes a logic control unit that controls the read and write operations of the hybrid memory cube. The logic control unit includes a plurality of memory bank controllers; each memory bank has a corresponding memory bank controller in the logic base layer that manages 3D access control of that memory bank. The 3D layered approach allows memory accesses to proceed not only along the rows and columns of a storage array but also in parallel across multiple memory die layers.
The logic base layer further includes links connected to external HMC memory units, which are used to connect a plurality of HMC memory units so as to increase the total capacity of the HMC storage device. The links include external I/O links connected to the HMC memory module, internal routing, and switching logic. The external I/O links include a plurality of logic links attached to the switching logic, which directs the internal routing to control data transmission for each memory bank or to forward data to other HMC memory units and neural network processors. Preferably, the external I/O links include 4 or 8 logic links, each consisting of a group of 16 or 8 serial I/O or SerDes bidirectional lanes. An HMC memory unit with 4 logic links can transmit data at rates of 10 Gbps, 12.5 Gbps, and 15 Gbps, while an HMC memory unit with 8 logic links can transmit data at 10 Gbps. In the address strobe phase, the switching logic determines whether the HMC memory address arriving over an external I/O link belongs to this HMC memory unit. If it does, the HMC memory address is converted into the data format of the internal hybrid-memory-cube address that the internal routing can recognize; if not, the HMC memory address is forwarded to another HMC memory unit connected to this one. In the data read/write phase, if the data read or write takes place on this HMC memory unit, the HMC memory unit is responsible for converting data between the external I/O link and the internal routing; if not, the HMC memory unit is responsible for forwarding the data between the two external I/O links connected to it to complete the data transmission.
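The claim-or-forward behaviour of the switching logic in the address strobe phase can be sketched as a chain of units, as below; the chain model, address ranges, and dictionary-backed storage are assumptions for illustration only.

```python
# Illustrative claim-or-forward routing across a chain of HMC memory units.
class HmcUnit:
    def __init__(self, base, size, next_unit=None):
        self.base, self.size, self.next_unit = base, size, next_unit
        self.storage = {}

    def request(self, addr, data=None):
        if self.base <= addr < self.base + self.size:        # local hit
            local = addr - self.base                         # internal cube address
            if data is None:
                return self.storage.get(local, 0)            # read
            self.storage[local] = data                       # write
            return None
        if self.next_unit is None:
            raise ValueError("address not mapped by any unit in the chain")
        return self.next_unit.request(addr, data)            # forward downstream

unit1 = HmcUnit(base=0x4000, size=0x4000)
unit0 = HmcUnit(base=0x0000, size=0x4000, next_unit=unit1)
unit0.request(0x5000, data=42)          # forwarded to unit1 and written there
print(unit0.request(0x5000))            # -> 42
```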
FIG. 7 is a schematic diagram of capacity expansion of the HMC memory of the HMC-memory-based neural network computing device according to actual needs, according to an embodiment of the present disclosure. The logic base layer structure in an HMC memory unit supports attaching the device to a neural network processor or to another HMC memory unit. By connecting multiple HMC memory units together, the total capacity of the HMC storage device (memory module) can be increased without changing the structure of the HMC memory units. Multiple HMC memory units can be connected in a topology and then connected to other modules 4. As shown in FIG. 7, connecting two HMC memory units doubles the memory capacity of the neural network computing device; at the same time, this design allows the neural network computing device to change the number of HMC memory units dynamically according to the needs of the actual application, so that the HMC memory module is fully utilized and the configurability of the neural network computing device is greatly improved.
Another embodiment of the present disclosure further discloses a chip, which includes the neural network computing device based on high-bandwidth memory of the above embodiment.
Another embodiment of the present disclosure further discloses a chip package structure, which includes the above chip.
Another embodiment of the present disclosure further discloses a board, which includes the above chip package structure.
Another embodiment of the present disclosure further discloses an electronic device, which includes the above board.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, smart terminal, mobile phone, driving recorder, navigator, sensor, camera, cloud server, still camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.
The present disclosure also provides a neural network calculation method, including:
electrically connecting at least one storage device to at least one neural network processor; and
exchanging data between the at least one neural network processor and the at least one storage device, and performing the neural network calculation.
In a specific embodiment of the present disclosure, the neural network calculation method performs the neural network calculation based on the above neural network computing device with high-bandwidth memory and, referring to FIG. 8, includes:
Step S1: writing the operation parameters of the neural network calculation into the high-bandwidth memory.
The high-bandwidth memory is connected to an external storage device, for example an external disk, through the external access unit. The external access unit writes the operation parameters at the externally designated address into the high-bandwidth memory; the operation parameters include weights, a bias table, a function table, and the like.
Step S2: transferring the input data of the current neural network calculation from the high-bandwidth memory to the buffer of the neural network processor, which may specifically include:
Sub-step S21: the HBM memory control module performs addressing according to the start address of the input data. If the address hits, then, at the high bandwidth provided by the high-bandwidth memory, data of one bus width starting from the start address is transferred through the memory interface to the HBM memory control module, and this is repeated until all the input data has been transferred to the HBM memory control module. For example, if the high-bandwidth memory provides a 1024-bit width and the stored input data totals 4096 bits, the high-bandwidth memory transfers 1024 bits of input data to the HBM memory control module each time, and all the input data has been transferred after four transfers (a sketch of this transfer loop follows step S2 below).
Sub-step S22: the HBM memory control module converts the bit width of the input data into a bit width matching the buffer, and transmits the width-matched input data to the buffer.
The input data may be the input neuron vector of the current neural network calculation.
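The burst transfer described in sub-step S21 can be sketched as a simple chunked read; the byte-addressed memory image and the function name are assumptions chosen only to make the 4096-bit / 1024-bit example above concrete.

```python
# Illustrative chunked read: move total_bits in bus-width transfers from start.
def burst_read(memory, start, total_bits, bus_bits=1024):
    """Yield successive bus-width chunks of total_bits starting at start (byte address)."""
    bus_bytes = bus_bits // 8
    for offset in range(0, total_bits // 8, bus_bytes):
        yield memory[start + offset : start + offset + bus_bytes]

memory = bytearray(range(256)) * 4                      # a toy 1 KiB memory image
chunks = list(burst_read(memory, start=0, total_bits=4096))
print(len(chunks), len(chunks[0]) * 8)                  # -> 4 transfers of 1024 bits each
```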
Step S3: transferring the operation parameters of the current neural network calculation from the high-bandwidth memory to the buffer of the neural network processor. Similarly to step S2, this step may specifically include:
Sub-step S31: the HBM memory control module performs addressing according to the start address of the operation parameters. If the address hits, then, at the high bandwidth provided by the high-bandwidth memory, data of one bus width starting from the start address is transferred through the memory interface to the HBM memory control module, and this is repeated until all the operation parameters have been transferred to the HBM memory control module.
Sub-step S32: the HBM memory control module converts the bit width of the operation parameters into a bit width matching the buffer, and transmits the width-matched operation parameters to the buffer.
Step S4: the buffer control module transmits the input data and operation parameters stored in the buffer to the neural processing unit, the neural processing unit processes the input data and operation parameters to obtain the output data of the current neural network calculation, and the buffer control module stores the output data in the buffer.
During the processing of the input data and operation parameters by the neural processing unit, if the neural network calculation produces intermediate data, the buffer control module stores the intermediate data in the buffer while the neural processing unit continues the operation; when the intermediate data is needed in the operation, the buffer control module transmits it back to the neural processing unit, which uses the intermediate data to continue the operation and obtain the output data of the neural network calculation. The output data may be the output neuron vector.
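The spill-and-reload of intermediate data described in step S4 can be pictured with the toy two-layer computation below; the layer shapes, weights, and the dictionary standing in for the buffer 305 are assumptions for illustration only.

```python
# Illustrative spill of an intermediate value to the buffer and reload for the next stage.
def run_layers(inputs, weights, buffer_store):
    hidden = [sum(x * w for x, w in zip(inputs, row)) for row in weights[0]]
    buffer_store["intermediate"] = hidden     # spill intermediate data to the buffer
    hidden = buffer_store["intermediate"]     # reload it when the next stage needs it
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights[1]]

weights = ([[1, 0], [0, 1]], [[1, 1]])        # two toy layers: 2x2 then 1x2
print(run_layers([3, 5], weights, {}))        # -> [8]
```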
Step S5: the output data in the buffer is transferred to the high-bandwidth memory, and the output data is transmitted to the external storage device through the external access unit. Specifically, this may include: the HBM memory control module converts the bit width of the output data into a bit width matching the high-bandwidth memory and transmits the width-matched output data to the high-bandwidth memory, and the external access unit transmits the output data stored in the high-bandwidth memory to the external storage device.
At this point, the operation result of the current neural network calculation has been obtained. If the next neural network calculation is to be carried out, the method returns to step S2 to obtain the operation result of the next neural network calculation.
It can thus be seen that the neural network calculation method of this embodiment, by using high-bandwidth memory with a stacked storage structure and a neural network processor with an HBM memory control module, can greatly increase the storage bandwidth, greatly improve the computing performance, and increase the signal transmission bandwidth and transmission speed.
In another specific embodiment of the present disclosure, the neural network calculation method performs the neural network calculation based on the above neural network computing device with the HMC memory module and, referring to FIG. 9, includes:
Step S1: the external access unit writes the data and instructions at the externally designated address into the HMC memory module;
Step S2: the neural network processor reads from the HMC memory module the data and instructions required for performing part of the neural network operation;
Step S3: the neural network processor performs part of the neural network operation, and writes the intermediate values or final values obtained by the operation back into the HMC memory module;
Step S4: the operation result is written from the HMC memory module back to the externally designated address through the external access unit;
Step S5: if the operation instructions include an operation termination instruction, execution is terminated; otherwise, the method returns to step S1.
Further, step S1 includes: step S11, in which the first data at the externally designated address is written into the HMC memory module 1 through the external access unit 3, the first data including the weights, biases, function table, and the like for the neural network operation; step S12, in which the instructions at the externally designated address are written into the HMC memory module 1 through the external access unit 3, the instructions including the operation instructions for the current neural network operation and, if no further operation is to be performed after the current neural network operation, an operation termination instruction; and step S13, in which the second data at the externally designated address is written into the HMC memory module 1 through the external access unit 3, the second data including the input data for the current neural network operation.
In step S2, the neural network processor reads from the HMC memory module the data required for performing part of the neural network operation, which includes: S21, setting the data-preparation state of the neural network processor that is about to operate to prepared; S22, the neural network processor whose data-preparation state is prepared reading from the HMC memory module the data required for performing part of the neural network operation; and S23, after the reading is completed, setting the data-preparation state of that neural network processor back to not prepared.
Step S3 further includes determining whether the neural network operation has ended. If it has ended, the method proceeds to step S4; otherwise, a data-preparation-complete signal is sent to the other neural network processors, the data-preparation states of the other neural network processors are set to prepared, and the method returns to step S2.
In the HMC-memory-based neural network operation method, when some of the neural network processors are in an operating state and the other neural network processors are waiting for the operation result of one of them, the other neural network processors first remain in a waiting state; after that neural network processor transfers its operation result to the HMC memory module and sends a data-delivery signal to the other neural network processors, the other neural network processors are woken up, read the corresponding data from the HMC memory module, and perform the corresponding operations.
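The wait-and-wake handshake described here can be sketched with one thread per processor and an event standing in for the data-delivery signal; the shared dictionary standing in for the HMC memory module and all names below are illustrative assumptions.

```python
# Illustrative wait/wake handshake between two "processors" sharing an HMC-like store.
import threading

hmc = {}
data_ready = threading.Event()

def producer():
    hmc["partial_sum"] = 42                  # write the partial result to HMC memory
    data_ready.set()                         # send the data-delivery signal

def consumer(results):
    data_ready.wait()                        # sleep until the data-delivery signal arrives
    results.append(hmc["partial_sum"] + 1)   # read from HMC memory and continue computing

results = []
threads = [threading.Thread(target=consumer, args=(results,)),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                               # -> [43]
```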
The processes or methods depicted in the preceding figures may be performed by processing logic that includes hardware (for example, circuitry or dedicated logic), firmware, software (for example, software embodied on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods have been described above in a certain order, it should be understood that some of the described operations may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
Each functional unit/module may be hardware; for example, the hardware may be a circuit, including digital circuits, analog circuits, and so on. Physical implementations of the hardware structures include, but are not limited to, physical devices, which in turn include, but are not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC.
It should be noted that implementations not shown or described in the drawings or in the text of the specification take forms known to those of ordinary skill in the art and are not described in detail. In addition, the above definitions of the elements and methods are not limited to the specific structures, shapes, or manners mentioned in the embodiments, and those of ordinary skill in the art may simply modify or replace them.
It should also be noted that examples of parameters with specific values may be provided herein, but these parameters need not be exactly equal to the corresponding values and may instead approximate the corresponding values within acceptable error tolerances or design constraints. In addition, unless steps are specifically described or must occur in sequence, the order of the above steps is not limited to that listed above and may be changed or rearranged according to the desired design. The above embodiments may also, based on design and reliability considerations, be mixed and matched with one another or with other embodiments; that is, the technical features in different embodiments may be freely combined to form further embodiments.
The specific embodiments described above further describe the objectives, technical solutions, and beneficial effects of the present disclosure in detail. It should be understood that the above are only specific embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.