CN120547881A - In-memory computing chip - Google Patents
- Publication number
- CN120547881A (application CN202510656041.4A)
- Authority
- CN
- China
- Prior art keywords
- computing
- unit
- memory
- layer
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Semiconductor Memories (AREA)
Abstract
The invention discloses an in-memory computing chip. The in-memory computing chip comprises at least one layer of first core particles and at least one layer of second core particles. The first core particle includes at least one first unit including a plurality of storage nodes, and the second core particle includes at least one second unit including a computation control unit. The first unit and/or the second unit further comprises a plurality of computing nodes, the plurality of computing nodes and the plurality of storage nodes form an in-memory computing unit, and one computation control unit is vertically interconnected with at least one in-memory computing unit. The invention thereby provides a scheme for reducing the power consumption and time cost of data movement in an in-memory computing chip.
Description
Technical Field
The invention relates to the technical field of microelectronics, in particular to an in-memory computing chip.
Background
An in-memory computing chip comprises a control unit and an in-memory computing unit, the in-memory computing unit comprising storage nodes and computing nodes. The storage nodes and computing nodes are organized at extremely fine granularity (typically the bit level), which allows the chip to play an important role in various fields, such as artificial intelligence.
However, constrained by the planar layout of the control unit and the in-memory computing unit, current in-memory computing chips suffer from high data-movement power consumption and long data-movement latency.
Disclosure of Invention
The present invention has been made in view of the above-mentioned problems, and it is an object of the present invention to provide an in-memory computing chip that overcomes or at least partially solves the above-mentioned problems.
In a first aspect, there is provided an in-memory computing chip comprising:
At least one layer of first core particles, each first core particle comprising at least one first unit, the first unit comprising a plurality of storage nodes;
At least one layer of second core particles stacked with the at least one layer of first core particles, each second core particle comprising at least one second unit, the second unit comprising a computation control unit;
The first unit and/or the second unit further comprises a plurality of computing nodes, the computing nodes and the storage nodes form an in-memory computing unit, and one computation control unit is vertically interconnected with at least one in-memory computing unit.
Optionally, the first core particle is at least two layers, and each layer of the first core particle is stacked and vertically interconnected;
The first unit comprises a plurality of storage nodes and a plurality of calculation nodes, and is the in-memory calculation unit.
Optionally, the plurality of storage nodes of the first unit are arranged in an array along a first direction and a second direction, the plurality of calculation nodes of the first unit are arranged in an array along the first direction and the second direction, and the first direction and the second direction intersect;
in the second direction, at least one storage node is arranged between two adjacent computing nodes, and/or at least one computing node is arranged between two adjacent storage nodes.
Optionally, one of the computation control units is vertically interconnected with at least two of the in-memory computing units of different layers, or
one of the computation control units is vertically interconnected with at least one of the in-memory computing units of the same layer.
Optionally, the orthographic projection of the computation control unit on the second core particle substantially overlaps with the orthographic projection of the in-memory computing unit on the second core particle.
Optionally, the first core particle is at least one layer, and the first unit includes a plurality of storage nodes;
the second unit of the second core particle comprises the computation control unit and a plurality of the computing nodes.
Optionally, the orthographic projection of the second unit on the second core particle substantially overlaps with the orthographic projection of the first unit on the second core particle.
Optionally, the first core particle has at least two layers, and at least one layer of the second core particle is arranged between the first core particles of two adjacent layers, or
The second core particles are at least two layers, and at least one layer of the first core particles is arranged between the second core particles of two adjacent layers.
Optionally, each layer of the first core particle is stacked to form a first core particle group, each layer of the second core particle is stacked to form a second core particle group, the second core particle group being located on one side of the first core particle group.
Optionally, one of the computation control units and at least one of the in-memory computing units are vertically interconnected through a three-dimensional heterogeneous integrated structure, wherein the three-dimensional heterogeneous integrated structure comprises one or more of a through-silicon via, a hybrid bonding structure and a microbump.
The technical scheme provided by the embodiment of the invention has at least the following technical effects or advantages:
The in-memory computing chip comprises at least one layer of first core particles and at least one layer of second core particles. The first core particle includes at least one first unit including a plurality of storage nodes. The first unit and/or the second unit further comprises a plurality of computing nodes, the plurality of computing nodes and the plurality of storage nodes form an in-memory computing unit, and one computation control unit is vertically interconnected with at least one in-memory computing unit. In addition, the vertical interconnection between one computation control unit and at least one in-memory computing unit ensures that, while one in-memory computing unit updates its multiplexed computation data, the other in-memory computing units and the computation control unit remain in an effective working state; through ping-pong operation, the throughput loss caused by updating the multiplexed computation data of an in-memory computing unit is reduced.
The foregoing is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the contents of the specification, and in order that the above and other objects, features and advantages of the present invention may be more readily appreciated, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic diagram showing the structure of an in-memory computing chip in the related art;
FIG. 2 is a schematic diagram of a first architecture of an in-memory computing chip according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing a second configuration of an in-memory computing chip according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a third configuration of an in-memory computing chip according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a fourth configuration of an in-memory computing chip according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a fifth configuration of an in-memory computing chip according to an embodiment of the application;
FIG. 7 illustrates an arrangement of storage nodes and compute nodes in accordance with an embodiment of the present application.
100 - in-memory computing chip; 101 - computation control unit; 102 - in-memory computing unit; 103 - input/output interface; 104 - storage access interface; 105 - computation data interface; 10 - storage node; 20 - computing node; 200 - in-memory computing chip; 21 - first core particle; 21A - first unit; 211 - storage node; 212 - first storage access interface; 213 - first three-dimensional integration region; 22 - second core particle; 22A - second unit; 221 - computation control unit; 221A - computation control function circuit; 221B - second storage access interface; 221C - second three-dimensional integration region; 221D - global network; 203 - computing node; X - first direction; Y - second direction.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
Various structural schematic diagrams according to embodiments of the present disclosure are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and relative sizes, positional relationships between them shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
It should also be noted that the terms "first," "second," and the like in the description and claims of the present application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the objects so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
As used herein, "about," "approximately," or "substantially" includes the stated values as well as average values within an acceptable deviation of the particular values as determined by one of ordinary skill in the art in view of the measurement in question and the errors associated with the measurement of the particular quantity (i.e., limitations of the measurement system).
As used herein, "parallel", "perpendicular", "equal" includes the stated case as well as the case that approximates the stated case, the range of which is within an acceptable deviation range as determined by one of ordinary skill in the art taking into account the measurement in question and the errors associated with the measurement of the particular quantity (i.e., limitations of the measurement system). For example, "parallel" includes absolute parallel and approximately parallel, where the range of acceptable deviation of approximately parallel may be, for example, within 5 ° of deviation, and "perpendicular" includes absolute perpendicular and approximately perpendicular, where the range of acceptable deviation of approximately perpendicular may also be, for example, within 5 ° of deviation. "equal" includes absolute equal and approximately equal, where the difference between the two, which may be equal, for example, is less than or equal to 5% of either of them within an acceptable deviation of approximately equal.
In this specification, unless explicitly stated or limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly. For example, they may denote fixed, detachable or integral connection; mechanical or electrical connection; direct connection or indirect connection through an intermediate member; or internal communication between two elements. The specific meanings of the above terms in the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
Fig. 1 shows a schematic diagram of the structure of an in-memory computing chip in the related art.
As shown in fig. 1, the in-memory computing chip 100 includes a computation control unit 101 and an in-memory computing unit 102. The in-memory computing unit 102 includes storage nodes 10 and computing nodes 20, which are organized at extremely fine granularity (typically the bit level); as the distance between a computing node 20 and a storage node 10 shrinks, energy efficiency rises, allowing the chip to play an important role in various fields (for example, artificial intelligence). The storage nodes 10 of the in-memory computing unit 102 hold reusable (multiplexed) computation data that is used repeatedly during computation, which reduces the energy and time overhead of moving data to a certain extent. The computation control unit 101 is connected through the input/output interface 103 to the storage access interface 104 of the in-memory computing unit 102, and is also connected through the computation data interface 105 to the storage access interface 104 of the in-memory computing unit 102.
However, the existing in-memory computing chip 100 is limited by the physical constraints of the planar layout of the computation control unit 101 and the in-memory computing unit 102, which causes three main problems. First, when the in-memory computing unit 102 inputs or outputs data, the data is transferred based on the conventional local cache (scratchpad) technique over link physical distances of 100 μm to 1000 μm, so the power consumption of data movement is relatively high and the latency relatively long. Second, the multiplexed computation data of the in-memory computing unit 102 is likewise transferred based on the conventional local cache technique (link physical distance of 100 μm to 1000 μm), again incurring high data-movement power consumption and long latency. Third, some in-memory computing units 102 (depending on the storage medium and structure) update their multiplexed computation data much more slowly than a local cache does, which becomes a time bottleneck for executing computing tasks.
In view of this, the embodiments of the present application provide an in-memory computing chip that shortens, at least to a certain extent, the data link between the computation control unit and the in-memory computing unit, reducing the power consumption and time cost of data movement from the 100 μm-1000 μm scale of the traditional planar layout to the 10 μm scale of a vertically stacked structure. In addition, in the form of vertical interconnection between one computation control unit and at least one in-memory computing unit, while one in-memory computing unit updates its multiplexed computation data, the other in-memory computing units and the computation control unit remain in an effective working state; through ping-pong operation, the throughput loss of updating the multiplexed computation data of an in-memory computing unit is reduced.
The in-memory computing chip of the embodiments of the present application is described below with reference to the specific drawings.
Fig. 2 shows a first schematic configuration of an in-memory computing chip 200 according to an embodiment of the present application, fig. 3 shows a second schematic configuration of the in-memory computing chip 200 according to an embodiment of the present application, fig. 4 shows a third schematic configuration of the in-memory computing chip 200 according to an embodiment of the present application, fig. 5 shows a fourth schematic configuration of the in-memory computing chip 200 according to an embodiment of the present application, and fig. 6 shows a fifth schematic configuration of the in-memory computing chip 200 according to an embodiment of the present application.
According to a first aspect of the embodiments of the present application, there is provided an in-memory computing chip 200 comprising at least one layer of first core particles 21 and at least one layer of second core particles 22. The first core particle 21 comprises at least one first unit 21A, and the first unit 21A comprises a plurality of storage nodes 211. The at least one layer of second core particles 22 is stacked with the at least one layer of first core particles 21; the second core particle 22 comprises at least one second unit 22A, and the second unit 22A comprises a computation control unit 221. The first unit 21A and/or the second unit 22A further comprises a plurality of computing nodes 203, the plurality of computing nodes 203 and the plurality of storage nodes 211 form an in-memory computing unit, and one computation control unit 221 is vertically interconnected with at least one in-memory computing unit.
It will be appreciated that the in-memory computing chip 200 can improve chip energy efficiency by performing computation inside or near the memory: either performing logic or analog computation (e.g., matrix multiply-add) directly using the physical characteristics of the storage nodes 211 (e.g., resistance, current), or laying out the computing nodes 203 in close proximity to the storage nodes 211 and achieving near-zero data handling via ultra-short-range interconnects. The in-memory computing unit of the in-memory computing chip 200 may be implemented in an analog, digital-analog hybrid or digital manner on a standard memory medium such as SRAM (Static Random-Access Memory), DRAM (Dynamic Random-Access Memory), RRAM (Resistive Random-Access Memory), Flash (flash memory), FeRAM (Ferroelectric Random-Access Memory), MRAM (Magnetoresistive Random-Access Memory) or PCM (Phase-Change Memory), or on a portion of a standard memory using SRAM, DRAM or RRAM as the medium.
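As an illustration only (not part of the patent disclosure; all class and function names here are hypothetical), the following minimal Python sketch models the data-reuse principle described above: the multiplexed weight data stays resident in the storage nodes while successive inputs stream through the computing nodes, so only inputs and results move.

```python
# Minimal sketch of in-memory multiply-add with resident multiplexed data.
# Names and shapes are illustrative assumptions, not the patent's circuit.
import numpy as np

class InMemoryComputeUnit:
    def __init__(self, weights: np.ndarray):
        # Weights are written once into the storage nodes and reused
        # across many computations ("multiplexed computation data").
        self.resident_weights = weights

    def matvec(self, x: np.ndarray) -> np.ndarray:
        # Matrix-vector multiply-add performed next to the stored weights;
        # the weight matrix itself never leaves the unit.
        return self.resident_weights @ x

unit = InMemoryComputeUnit(np.random.rand(64, 64))
for _ in range(1000):          # many inputs reuse the same resident weights
    y = unit.matvec(np.random.rand(64))
```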
Illustratively, the at least one layer of first core particles 21 may be one layer, two layers, three layers, four layers, five layers, six layers or more of first core particles 21, as specifically designed according to the requirements of the in-memory computing chip 200, which is not limited herein. When there are two or more layers of first core particles 21, the layers of first core particles 21 are stacked.
That is, the first core particles 21 may be extended to multiple layers, where each first unit 21A includes a plurality of computing nodes 203 and storage nodes 211, i.e., each first core particle 21 includes a plurality of in-memory computing units. For example, a combination of one computation control unit 221 and three in-memory computing units is suitable when updating the multiplexed computation data of an in-memory computing unit takes a long time, since it avoids the interruption of ping-pong operation, and hence the throughput loss, that would occur in a 1+2 structure (one computation control unit 221 plus two in-memory computing units). As another example, in a combination of one computation control unit 221 and four in-memory computing units, two units operate in one state while the other two operate in the complementary state, doubling the computation throughput.
Illustratively, the at least one layer of second core particles 22 may be one layer, two layers, three layers, four layers, five layers, six layers or more of second core particles 22, as specifically designed according to the requirements of the in-memory computing chip 200, which is not limited herein. When there are two or more layers of second core particles 22, the layers of second core particles 22 are stacked.
In some embodiments, the combination of the at least one layer of second core particles 22 and the at least one layer of first core particles 21 may be one layer of first core particles 21 with one layer of second core particles 22, one layer of first core particles 21 with two or more layers of second core particles 22, two or more layers of first core particles 21 with one layer of second core particles 22, or two or more layers of each, as specifically designed according to the requirements of the in-memory computing chip 200, which is not limited herein.
It will be appreciated that by increasing the ratio of the number of layers of the first core particle 21 to the second core particle 22, for example, two layers of the first core particle 21 to one layer of the second core particle 22, the memory density of the in-memory computing chip 200 may be increased and the area overhead ratio of the second core particle 22 may be reduced.
In some embodiments, the first core particle 21 is at least two layers, and at least one layer of the second core particle 22 is disposed between the first core particles 21 of two adjacent layers, or the second core particle 22 is at least two layers, and at least one layer of the first core particle 21 is disposed between the second core particles 22 of two adjacent layers.
It will be appreciated that different stacking relationships between the layers of first core particles 21 and second core particles 22 yield different data-link distances between them. For example, when the first core particles 21 are at least two layers and at least one layer of second core particles 22 is arranged between two adjacent layers of first core particles 21, the data links from the computation control unit 221 in the second core particle 22 to the storage nodes 211 in the layer above and to those in the layer below are both short and comparable, so the computation control unit 221 can control the data input or output of storage nodes 211 in different layers in a balanced way, reducing the power consumption and time cost of data movement. Conversely, when the second core particles 22 are at least two layers and at least one layer of first core particles 21 is arranged between two adjacent layers of second core particles 22, the data links from the storage nodes 211 in the first core particle 21 to the computation control units 221 above and below are both short, so the data can be invoked by computation control units 221 of different layers, likewise reducing the power consumption and time cost of data movement.
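For a quantitative feel for this balancing effect, the short sketch below (illustrative assumptions only; the 10 μm per-layer pitch is not a figure from this disclosure) compares the average control-to-storage link length for two stacking orders:

```python
# Compare average vertical link length from control layers (C) to memory
# layers (M) for two stack orders, assuming ~10 um of distance per layer.
LAYER_PITCH_UM = 10  # assumed vertical pitch between adjacent layers

def avg_link_um(stack: list[str]) -> float:
    c = [i for i, kind in enumerate(stack) if kind == "C"]
    m = [i for i, kind in enumerate(stack) if kind == "M"]
    dists = [abs(i - j) * LAYER_PITCH_UM for i in c for j in m]
    return sum(dists) / len(dists)

print(avg_link_um(["M", "C", "M"]))  # control sandwiched between memories: 10.0
print(avg_link_um(["C", "M", "M"]))  # control on one side only: 15.0
```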
It will be appreciated that the above stacking manners of the first core particles 21 and second core particles 22 also apply to the case where the first core particle 21 contains the storage nodes 211 and the second core particle 22 contains the computation control unit 221 and the computing nodes 203. In that case, the computation control unit 221 and the computing nodes 203 are arranged in-plane within the second core particle 22 and vertically interconnected with the storage nodes 211 in the first core particle 21, and the computing nodes 203 can move the multiplexed computation data stored in the storage nodes 211 over the ultra-short distance of the vertical interconnection, reducing the power consumption and time cost of data movement.
In some embodiments, the layers of first core particles 21 are stacked to form a first core particle group, the layers of second core particles 22 are stacked to form a second core particle group, and the second core particle group is located on one side of the first core particle group.
For example, two layers of first core particles 21 are stacked to form a first core particle group, and a single layer of second core particles 22 is located on one side of the first core particle group, e.g., above or below it. It will be appreciated that this arrangement is particularly suitable when the computing nodes 203 and the storage nodes 211 are both placed in the first core particles 21, since computing nodes and storage nodes of different layers are then close to each other. For example, the data link between a computing node 203 in the first-layer first core particle 21 and a storage node 211 in the second-layer first core particle 21 is ultra-short (on the 10 μm scale), and likewise in the reverse direction, so the computing nodes 203 can rapidly reuse the multiplexed computation data stored in the storage nodes 211 of different layers, reducing the power consumption and time cost of data movement.
In some embodiments, the in-memory computing unit may be manufactured by the same process as the computation control unit 221 or by a different one. The optimal process for the in-memory computing unit is biased toward the memory, and its process node is generally less advanced than that of the computation-control core particle; separating the two relieves the mutual constraint between the in-memory computing circuit process and the computation control circuit process, allows each to be fabricated in a process matched to its functional emphasis, and thereby improves the PPA (power, performance, area) of the system.
In some embodiments, the first core particle 21 is at least two layers, each layer of the first core particle 21 is stacked and vertically interconnected, the first unit 21A comprises a plurality of the storage nodes 211 and a plurality of the computing nodes 203, and the first unit 21A is the in-memory computing unit.
Illustratively, the in-memory computing unit is configured to implement in-memory computing and includes storage nodes 211, computing nodes 203, a first storage access interface 212 and a first three-dimensional integration region 213. The storage nodes 211 store multiplexed computation data. The computing nodes 203 perform in-memory computation using the multiplexed computation data stored in the storage nodes 211 and/or the computation data input by the computation control unit 221 through the first storage access interface 212; the multiplexed computation data remains on the storage nodes 211 during computation without being moved, which significantly improves computing energy efficiency. The computation result is output to the computation control unit 221 through the first storage access interface 212, and since the in-memory computing unit is vertically interconnected with the computation control unit 221, the data link is very short, which improves the efficiency of data transmission.
It will be appreciated that, as shown in fig. 2, the computing nodes 203 and storage nodes 211 are arranged in-plane within the first unit 21A to form an in-memory computing unit. This is suitable for directly embedded in-memory computing and for data-intensive computing tasks (such as matrix multiply-add), allowing the computing nodes 203 to compute near the data and reducing the data-movement overhead. Further, the computation control unit 221 is disposed in the second core particle 22 and communicates with the computing nodes 203 and storage nodes 211 through the vertical interconnection, enabling low-latency data exchange.
Fig. 7 shows a schematic layout of the storage node 211 and the computing node 203 according to an embodiment of the present application.
In some embodiments, the plurality of storage nodes 211 of the first unit 21A are arranged in an array along a first direction X and a second direction Y, the plurality of computing nodes 203 of the first unit 21A are arranged in an array along the first direction X and the second direction Y, and the first direction X and the second direction Y intersect, and at least one storage node 211 is disposed between two adjacent computing nodes 203 and/or at least one computing node 203 is disposed between two adjacent storage nodes 211 in the second direction Y.
Illustratively, the first direction X and the second direction Y are perpendicular, and in the second direction Y, at least one storage node 211 (e.g., one storage node 211) is disposed between two adjacent computing nodes 203, and/or at least one computing node 203 (e.g., one computing node 203) is disposed between two adjacent storage nodes 211. In this way, a computing node 203 can directly access and compute on adjacent data, improving computing energy efficiency for data-intensive computing tasks.
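A minimal sketch of such an interleaved arrangement (purely illustrative; the row-level alternation below is one simple instance of "at least one storage node between two adjacent computing nodes" along Y):

```python
# Build a grid where storage nodes (S) and computing nodes (C) alternate
# along the second direction Y, so each computing row sits next to the
# storage rows holding the data it consumes.
def build_unit_layout(rows: int, cols: int) -> list[list[str]]:
    # rows advance along Y (second direction), cols along X (first direction)
    return [["C" if y % 2 else "S" for _ in range(cols)] for y in range(rows)]

for row in build_unit_layout(4, 8):
    print(" ".join(row))
# S S S S S S S S
# C C C C C C C C
# S S S S S S S S
# C C C C C C C C
```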
In some embodiments, one of the computation control units 221 is vertically interconnected with at least two of the in-memory computing units of different layers, or one of the computation control units 221 is vertically interconnected with at least one of the in-memory computing units of the same layer.
For example, one computation control unit 221 is vertically interconnected with two in-memory computing units located in first core particles 21 of different layers, or with one or two in-memory computing units of the same layer. In this case, first, the vertical interconnection gives the computation control unit 221 an ultra-short link to each in-memory computing unit, so it can rapidly update the multiplexed computation data of the in-memory computing units, rapidly input or output data streams, and efficiently control the working modes of the in-memory computing units; second, the in-memory computing units are also at ultra-short distance from one another through the vertical interconnection, which improves the reuse efficiency of the multiplexed computation data across in-memory computing units of different layers.
In some embodiments, the computation control unit 221 includes a computation control function circuit 221A, a second storage access interface 221B, a second three-dimensional integration region 221C and a global network 221D, and is used to control the updating of the multiplexed computation data of the in-memory computing unit, the input/output data flow, the working mode of the in-memory computing unit, and the like.
In some embodiments, the orthographic projection of the computation control unit 221 on the second core particle 22 substantially overlaps with the orthographic projection of the in-memory computing unit on the second core particle 22.
It will be appreciated that the orthographic projection of the computation control unit 221 on the second core particle 22 may refer to its projection onto the silicon substrate, and the orthographic projection of the in-memory computing unit on the second core particle 22 may likewise be taken on the silicon substrate. Of course, the silicon substrate is only one example reference layer; in other embodiments other layers may serve as the reference, for example the back surface or the front (device) surface of the second core particle 22, or the substrate layer, wafer back surface or wafer front surface of the first core particle 21, which is not limited herein.
It will be appreciated that, when the orthographic projection of the computation control unit 221 on the second core particle 22 substantially overlaps the orthographic projection of the in-memory computing unit on the second core particle 22, one computation control unit 221 and one in-memory computing unit can be arranged in one-to-one correspondence along the vertical direction. One computation control unit 221 and one in-memory computing unit can then be combined into a memory-control module, and the memory-control modules are independent of one another, so that within each module the computation control unit 221 and the in-memory computing unit can be vertically interconnected, reducing interlayer or intra-layer routing.
In some embodiments, the first core particles 21 are at least one layer, the first unit 21A includes a plurality of the storage nodes 211, and the second unit 22A of the second core particle 22 includes the computation control unit 221 and a plurality of the computing nodes 203.
It will be appreciated that when the second unit 22A of the second core particle 22 includes the computation control unit 221 and a plurality of the computing nodes 203, as shown in fig. 4, the computing nodes 203 are integrated into the second core particle 22; that is, the computation control unit 221 and the computing nodes 203 are located in the same core particle, which improves the data-interaction performance between the computing nodes and the computation control circuitry. Furthermore, the computing portion can then be manufactured using the same process as the computation-control core particle, and since the computing nodes 203 require a high-performance logic process, their performance can be improved.
For example, the computing node 203 and the storage node 211 may have a one-to-one relationship, a one-to-many relationship, or a many-to-one relationship, and may be specifically designed according to the storage density requirement and the computing area overhead requirement, which is not limited herein.
In some embodiments, the orthographic projection of the second unit 22A onto the second core particle 22 substantially overlaps the orthographic projection of the first unit 21A onto the second core particle 22.
It will be appreciated that the orthographic projection of the second unit 22A on the second core particle 22 may refer to its projection onto the silicon substrate, and the orthographic projection of the first unit 21A may likewise be taken on the silicon substrate. As before, the silicon substrate is only one example reference layer; in other embodiments other layers may serve as the reference, for example the back surface or the front surface of the second core particle 22, or the substrate layer, wafer back surface or wafer front surface of the first core particle 21, which is not limited herein.
It will be appreciated that, in this case, the second unit 22A includes the computation control unit 221 and the plurality of computing nodes 203, and one second unit 22A and one first unit 21A can be arranged in one-to-one correspondence along the vertical direction. One first unit 21A and one second unit 22A can then be combined into a computation-control module, and the computation-control modules are independent of one another, so that within each module the computation control unit 221 and the in-memory computing unit can be vertically interconnected, reducing interlayer or intra-layer routing.
In some embodiments, one of the computation control units 221 and at least one of the in-memory computing units are vertically interconnected through a three-dimensional heterogeneous integrated structure, the three-dimensional heterogeneous integrated structure including one or more of through-silicon vias, hybrid bonding structures and microbumps.
For example, the first three-dimensional integration region 213 is provided in the first core particle 21 and the second three-dimensional integration region 221C is provided in the second core particle 22, and vertical interconnection between the first core particles 21 and the second core particles 22, among the first core particles 21, and among the second core particles 22 is achieved by providing through-silicon vias (TSVs) and/or hybrid bonding structures between the first three-dimensional integration region 213 and the second three-dimensional integration region 221C. The wafer top metal or back metal of adjacent core particles is interconnected by hybrid bonding, while the top metal and back metal within the wafer of the first core particle 21, or within the wafer of the second core particle 22, are interconnected by TSVs. This enables cross-core-particle vertical interconnection of the power, clock and storage access paths between the computation control unit 221 and the computing nodes 203 and/or storage nodes 211, and between the computing nodes 203 and the storage nodes 211.
Based on the above disclosure, the embodiments of the present application vertically interconnect the computation control unit 221 and the storage nodes 211, or vertically interconnect them after integrating the computation control unit 221 and the computing nodes 203 in a planar layout, thereby shortening the data link from the 100 μm-1000 μm scale of the conventional layout to the 10 μm scale. This eliminates the data transfer overhead between the computation control unit 221 and the computing nodes 203 and/or storage nodes 211, and between the computing nodes 203 and the storage nodes 211, and reduces the computation data transfer cost to about 1 pJ/bit, which lowers the transmission delay of computation data and raises the achievable transmission frequency.
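As a rough first-order check (the per-length energy coefficient below is a commonly cited ballpark for on-chip wires and an assumption of this sketch, not a figure from the disclosure), shortening a link from 1000 μm to 10 μm cuts the wire portion of the transfer energy by about two orders of magnitude:

```python
# First-order wire-energy scaling: energy per bit grows ~linearly with
# link length, so a 1000 um -> 10 um reduction is ~100x on this term.
ENERGY_PJ_PER_BIT_PER_MM = 0.1  # assumed ballpark coefficient

def link_energy_pj_per_bit(length_um: float) -> float:
    return ENERGY_PJ_PER_BIT_PER_MM * (length_um / 1000.0)

planar = link_energy_pj_per_bit(1000)  # ~0.1 pJ/bit over a 1000 um link
stacked = link_energy_pj_per_bit(10)   # ~0.001 pJ/bit over a 10 um link
print(f"planar {planar:.4f} pJ/bit, stacked {stacked:.4f} pJ/bit, "
      f"ratio {planar / stacked:.0f}x")
```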
It will be appreciated that, in an in-memory computing chip 100 with the conventional planar layout, when an in-memory computing unit completes a computing task it notifies the computation control unit, which receives the data from that unit over a clock beat; if other in-memory computing units need the data for subsequent operations, the computation control unit then sends the data, together with control instructions governing their operation, to those units. This forwarding through the computation control unit takes multiple clock beats, and if another in-memory computing unit cannot continue its subsequent computation without the data, it must wait until the data arrives; the data movement time of the whole process is therefore long and the working efficiency of the in-memory computing units is poor. In the embodiments of the present application, by stacking and vertically interconnecting the computation control unit 221 and the in-memory computing units, for example with the vertical interconnection structure of fig. 2, an upper-layer in-memory computing unit and the corresponding lower-layer in-memory computing unit are at ultra-short distance, so the computation control unit 221 can effect data transfer between the upper and lower computing units with simple control, and the whole process can complete within one clock beat, reducing the data waiting time and shortening the data link to lower the data transmission power consumption.
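The cycle-count contrast can be sketched in the same spirit (the planar beat count is an illustrative assumption; the one-beat vertical transfer follows the paragraph above):

```python
# Toy comparison of forwarding cost: multi-beat relay through the control
# unit in a planar layout vs. a single-beat vertical hop when stacked.
PLANAR_BEATS_PER_TRANSFER = 4    # notify + read + route + write (assumed)
STACKED_BEATS_PER_TRANSFER = 1   # direct vertical transfer in one beat

def total_beats(transfers: int, beats_each: int) -> int:
    return transfers * beats_each

print(total_beats(1000, PLANAR_BEATS_PER_TRANSFER))   # 4000 beats
print(total_beats(1000, STACKED_BEATS_PER_TRANSFER))  # 1000 beats
```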
In addition, when one computation control unit 221 is vertically interconnected with two or more in-memory computing units, the other in-memory computing units and the computation control unit 221 can remain in an effective working state while one in-memory computing unit updates its multiplexed computation data, and ping-pong operation avoids the throughput loss of updating the multiplexed computation data of an in-memory computing unit.
It should be noted that ping-pong operation refers to alternating data processing between two or more in-memory computing units: while one in-memory computing unit performs data input or processing, another in-memory computing unit can perform data output or data preparation.
For example, with two stacked layers of in-memory computing units, during initialization the computation control unit 221 writes the multiplexed computation data and working mode first into the upper-layer in-memory computing unit and then into the lower-layer one, after which the two layers start computing simultaneously. When the upper-layer unit has completed the local computation on its loaded multiplexed computation data and needs to reload, the computation control unit 221 preferentially reads out the computation result of the upper-layer unit and updates its multiplexed computation data and working mode so that it re-enters the effective in-memory computing mode, while the lower-layer unit continues computing. The updates of the multiplexed computation data of the upper-layer and lower-layer units are thus performed concurrently with computation, and the two layers operate in ping-pong fashion, providing the computation control unit 221 with continuous effective computing throughput that is not limited by the updating of multiplexed computation data.
For another example, when one in-memory computing unit (e.g., unit A) is in a data update/multiplexing state (e.g., writing computation results back to the storage nodes 211, or loading the next batch of weights from the storage nodes 211), the computation control unit 221 can immediately switch the computing task to another in-memory computing unit (e.g., unit B) for execution without waiting for unit A to complete the update. In stage 1, unit A completes its computing task and starts writing the result back to the storage nodes 211, while unit B loads new data from the storage nodes 211 and performs computation; in stage 2, when unit B completes computation and updates its data, unit A is ready to receive the next task; the cycle then alternates. Through this ping-pong switching, the computation control unit 221 and at least one in-memory computing unit are always active.
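The two-stage alternation above can be summarized in a small scheduling sketch (unit names and the task list are illustrative assumptions): while the standby unit overlaps its data update, the active unit keeps computing, so the control unit always sees useful throughput.

```python
# Ping-pong scheduling between two in-memory computing units: the data
# update of one unit is hidden behind the computation of the other.
from dataclasses import dataclass

@dataclass
class Unit:
    name: str

def run_pingpong(tasks: list[str]) -> None:
    active, standby = Unit("A"), Unit("B")
    for task in tasks:
        # Active unit computes while the standby unit updates its
        # multiplexed data (write-back / reload), then the roles swap.
        print(f"unit {active.name} computes {task}; "
              f"unit {standby.name} updates multiplexed data")
        active, standby = standby, active

run_pingpong(["t0", "t1", "t2", "t3"])
```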
It can be seen that ping-pong operation hides the latency of data updates through the time overlap of multiple computing units (computation runs in parallel with data updates), while the short-range interconnects of the three-dimensional stack physically eliminate the data movement bottleneck. The throughput of the in-memory computing chip 200 is then no longer limited by the data update cycle of a single computing unit but is determined by the parallel efficiency of the multiple in-memory computing units, which is important in scenarios requiring high throughput and low latency (e.g., neural network inference, real-time streaming).
Therefore, the in-memory computing chip 200 of the embodiments of the present application reduces the power consumption and time cost of transferring multiplexed computation data between the computation control unit 221 and the in-memory computing units, prevents the computation control unit 221 from idling in a waiting state and forming a throughput bottleneck while the multiplexed computation data of an in-memory computing unit is updated, and relieves the mutual constraint between the in-memory computing circuit process and the computation control circuit process, so that processes with different functional emphases can be chosen for each, improving the PPA (power, performance and area) of the chip.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language; it will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose the enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including the abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including the accompanying abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention. Any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.
Claims (10)
1. An in-memory computing chip, comprising:
At least one layer of first core particles, each first core particle comprising at least one first unit, the first unit comprising a plurality of storage nodes;
At least one layer of second core particles stacked with the at least one layer of first core particles, each second core particle comprising at least one second unit, the second unit comprising a computation control unit;
The first unit and/or the second unit further comprises a plurality of computing nodes, the computing nodes and the storage nodes form an in-memory computing unit, and one computation control unit is vertically interconnected with at least one in-memory computing unit.
2. The in-memory computing chip of claim 1, wherein the first core particles are at least two layers, each layer of the first core particles being stacked and vertically interconnected;
The first unit comprises a plurality of storage nodes and a plurality of calculation nodes, and is the in-memory calculation unit.
3. The in-memory computing chip of claim 2, wherein the plurality of storage nodes of the first unit are arranged in an array along a first direction and a second direction, the plurality of computing nodes of the first unit are arranged in an array along the first direction and the second direction, the first direction and the second direction intersecting;
in the second direction, at least one storage node is arranged between two adjacent computing nodes, and/or at least one computing node is arranged between two adjacent storage nodes.
4. The in-memory computing chip of claim 2, wherein one of the computation control units is vertically interconnected with at least two of the in-memory computing units of different layers, or
one of the computation control units is vertically interconnected with at least one of the in-memory computing units of the same layer.
5. The in-memory computing chip of claim 2, wherein an orthographic projection of the computation control unit on the second core particle substantially overlaps an orthographic projection of the in-memory computing unit on the second core particle.
6. The in-memory computing chip of claim 1, wherein the first core particles are at least one layer, the first unit comprising a plurality of the storage nodes;
the second unit of the second core particle comprises the computation control unit and a plurality of the computing nodes.
7. The in-memory computing chip of claim 6, wherein an orthographic projection of the second unit on the second core particle substantially overlaps an orthographic projection of the first unit on the second core particle.
8. The in-memory computing chip of any one of claims 1 to 7, wherein the first core particle has at least two layers, and at least one layer of the second core particle is arranged between the first core particles of two adjacent layers, or
The second core particles are at least two layers, and at least one layer of the first core particles is arranged between the second core particles of two adjacent layers.
9. The in-memory computing chip of any one of claims 1-7, wherein each layer of the first core particles is stacked to form a first core particle set and each layer of the second core particles is stacked to form a second core particle set, the second core particle set being located on one side of the first core particle set.
10. The in-memory computing chip of any one of claims 1-7, wherein one of the computation control units and at least one of the in-memory computing units are vertically interconnected through a three-dimensional heterogeneous integrated structure, the three-dimensional heterogeneous integrated structure including one or more of a through-silicon via, a hybrid bonding structure and a microbump.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510656041.4A | 2025-05-21 | 2025-05-21 | In-memory computing chip |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120547881A (en) | 2025-08-26 |
Family
ID=96782632
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |