[go: up one dir, main page]

US11249939B2 - Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor - Google Patents

Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor Download PDF

Info

Publication number
US11249939B2
US11249939B2 US16/781,428 US202016781428A US11249939B2 US 11249939 B2 US11249939 B2 US 11249939B2 US 202016781428 A US202016781428 A US 202016781428A US 11249939 B2 US11249939 B2 US 11249939B2
Authority
US
United States
Prior art keywords
node
nodes
communication
network
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/781,428
Other versions
US20200250131A1 (en
Inventor
Gerald George Pechanek
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US16/781,428 priority Critical patent/US11249939B2/en
Publication of US20200250131A1 publication Critical patent/US20200250131A1/en
Application granted granted Critical
Publication of US11249939B2 publication Critical patent/US11249939B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023Two dimensional arrays, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17393Indirect interconnection networks non hierarchical topologies having multistage networks, e.g. broadcasting scattering, gathering, hot spot contention, combining/decombining
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution

Definitions

  • the present invention relates to unique and improved methods and apparatuses for processing architectures and organizations of processing elements in networks configured to reduce power. More specifically, this invention concerns processing architectures and interconnection networks, based on 1 to K+1 adjacency, that reduce power for communicating between nodes, including execution units and local files, as controlled by a result forwarding instruction set architecture.
  • Significant performance and power problems in current multi-processor architectures involve accessing data from memory, operating on the accessed data, and sharing of data between processors. These multi-processor architectures are generally based on use of large central multi-ported register files. Also, having adequate memory bandwidth to support high performance operations is related to the organization of the processors, memory modules, execution unit connections and the interconnection network used for load and store operations.
  • One of the problems associated with increasing performance in multiprocessor parallel processing systems is the efficient accessing of data or instructions from memory. Having adequate memory bandwidth for sharing of data between processors is another problem associated with parallel processing systems.
  • These problems are related to the organization of the processors and memory modules and the processor architecture used for data communication between a processor, including execution units, and a plurality of memories and between processors.
  • Various approaches to solving these problems have been attempted in the past, for example, array processors and shared memory processors.
  • Multiprocessor systems can be classified generally in terms of coupling strength for communication between processors. Those multiprocessor systems that communicate using a share memory facility between the processors and the shared memory over an interconnection network are generally considered tightly coupled. Loosely coupled multiprocessor systems generally use an input/output (I/O) communication process in each processor for communicating between the processors over an interconnection network, such as message passing process.
  • I/O input/output
  • a wide variety of interconnection networks have been utilized in multiprocessing systems. For example, rings, bus connected, crossbar, tree, shuffle, omega, and butterfly, mesh, hypercube, and ManArray networks, have been used in prior multiprocessor systems. From an application or use perspective, specific networks have been chosen primarily based upon performance characteristics and cost to implement tradeoffs.
  • Parallel processing and the distribution of data between functional execution elements may be described using Kronecker product expressions of signal transform functions, such as the fast Fourier transform (FFT) and other signal flow graph expressions.
  • FFT fast Fourier transform
  • the underlying processor architecture may present limitations that affect the efficiency of implementing functions described using Kronecker product expressions and other signal analysis techniques.
  • An embodiment of the present invention addresses a method of executing a sequence of instructions in an execution unit (E ⁇ U) node in an array of E ⁇ Units.
  • a first instruction and a destination instruction having a dependency on the first instruction are received, wherein the first instruction identifies the destination instruction in a sequence of instructions from a program and specifies that a result generated by execution of the first instruction by a first E ⁇ U node is to be forwarded to a destination E ⁇ U node that is to execute the destination instruction.
  • the first instruction is executed on the first E ⁇ U r,c node to generate the result for delivery through an E ⁇ U network to the destination E ⁇ U node associated with the identified destination instruction, wherein according to a Row by Column (R ⁇ C) matrix, an R ⁇ C array of E ⁇ U row(r),column(c) nodes are interconnected by the E ⁇ U network, the E ⁇ U network comprising (K+1) by (K+1) array of E ⁇ U r,c nodes, a first stage (K+1) ⁇ (K+1) array of R r,c nodes for a first direction of communication, a second stage (K+1) ⁇ (K+1) array of S r,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1) ⁇ (K+1) arrays, K is an odd integer, K>1, R ⁇ (K+1), C ⁇ (K+1), r ⁇ 0, 1,
  • the first E ⁇ U r,c node generates the result for a selectable first data path that connects to an R r,c+1 node and for a selectable second data path that connects to an R r,c ⁇ 1 node for single step adjacency and for a selectable third data path that connects to an R r,c+2 node for two step adjacency, and for a selectable fourth data path that connects to an R r,c node in the same r,c position in the R ⁇ C matrix as the connecting E ⁇ U r,c node, and wherein connections exist between each R r,c node and S r,c nodes with the same column number in the second direction of communication, wherein an R r,c node, associated with
  • a 1 ⁇ C array of E ⁇ U 1,column(c) nodes are interconnected by an E ⁇ U network, the E ⁇ U network comprising 1 by (K+1) array of E ⁇ U 1,c nodes connected to a 1 ⁇ (K+1) array of R 1,c nodes for a first direction of communication, and having wiring configured according to a 1 to K+1 adjacency of connections between the E ⁇ U 1,c nodes and the R 1,c nodes which includes wrapping around data paths at the edges of the 1 ⁇ (K+1) arrays, K is an odd integer, K>1, C ⁇ (K+1), and c ⁇ 0, 1, . . .
  • a first E ⁇ U 1,c node is connected by a first data path to an R 1,c+1 node and by a second data path to an R 1,c ⁇ 1 node for single step adjacency and by a third data path to an R 1,c+2 node for two step adjacency, and by a fourth data path to an R 1,c node in the same 1,c position in the 1 ⁇ C matrix as the first E ⁇ U 1,c node, wherein the R 1,c ⁇ 1 node is connected by a first outputA path to its associated E ⁇ U 1,c ⁇ 1 node, the R 1,c node is connected by a second outputA path to its associated E ⁇ U 1,c node, the R 1,c+1 node is connected by a third outputA path to its associated E ⁇ U 1,c+1 node, and the R 1,c+2 node is connected by a fourth outputA path to its associated E ⁇ U 1,c+2
  • a further embodiment of the invention a system is provided.
  • the system has a load unit having a source of data values external to an array of execution unit (E ⁇ U) nodes that are interconnected by an E ⁇ U network.
  • a first multiplexing element in the load unit to connect externally received data values to an E ⁇ U located in the E ⁇ U network for processing by one or more program instructions.
  • the system has a store unit having a source of data values internal to the array of E ⁇ U nodes.
  • a second multiplexing element in the store unit to connect to the E ⁇ U network to receive data values from an E ⁇ U source and connect the internally received data values to a destination node located external to the E ⁇ U network for processing by the destination node, wherein the load unit is combined with the store unit as a single node of the array of E ⁇ U nodes.
  • FIG. 1 illustrates exemplary specifiable paths in an array beginning from a source node to nodes in a K+1 row by K+1 column array of nodes interconnected by a K+1 adjacency network, wherein K is an odd integer >1, in accordance with an embodiment of the present invention
  • FIG. 2 illustrates an execution array organized in an exemplary 4 row by 4 column arrangement of execution units and local files in a physical layout form with a one to K+1 level adjacency, where K is a positive odd integer, in accordance with an embodiment of the present invention
  • FIG. 3 illustrates exemplary specifiable paths in a 4 ⁇ 4 array of nodes beginning from a source node to nodes in a K+1 row by K+1 column array of nodes interconnected by a K+1 adjacency network with an increased number of internode data paths, wherein K is an odd integer >1, in accordance with an embodiment of the present invention
  • FIG. 5 illustrates three R ⁇ C XarMa processors that are based on the 4 ⁇ 4 Execution unit (E ⁇ U) array of FIGS. 1-4 in accordance with embodiments of the invention.
  • FIG. 1 illustrates exemplary specifiable paths in an array 100 from a single node (Nb 11 ) 102 to nodes in a K+1 row by K+1 column array of nodes Na 00 110 to Na 33 125 interconnected by a K+1 adjacency network, wherein K is an odd integer >1, in accordance with an embodiment of the present invention.
  • the Na 00 -Na 33 nodes 110 - 125 such as Na 11 115 , also shown in exemplary node illustration 157 , may be processor nodes (Pa), or memory nodes (Ma), or execution unit nodes (Xa), or local file nodes (LFa).
  • the single Nb 11 node 102 is one of a 4 ⁇ 4 array of nodes Nb 00 to Nb 33 , not shown for reason of clarity in the drawing.
  • the Nb 00 -Nb 33 nodes may also be processor nodes (Pb), or memory nodes (Mb), or execution unit nodes (Xb), or local file nodes (LFb).
  • S r,c 4 ⁇ 1 multiplexer nodes including S 00 4 ⁇ 1 140 to S 33 4 ⁇ 1 155 nodes.
  • An exemplary R r,c node 176 is configured with four 4 to 1 multiplexers (4 ⁇ 1) 177 .
  • An exemplary S r,c node 178 is configured with one 4 to 1 multiplexer (4 ⁇ 1) 179 .
  • the bus paths for Nb r,c ⁇ R r,c , R r,c ⁇ S r,c , and S r,c ⁇ (Pa/Ma/Xa/LFa) r,c , having the same r and the same c, are prioritized for short layouts, such as the case where Nb r,c is a processor and the S r,c node connects to a memory node Ma r,c .
  • the node Nb 11 102 is designed to be an execution unit, so is referenced here in this description as Xb 11 .
  • the execution unit Xb 11 102 generates a result upon executing an instruction which is programmatically directed to use one or more selectable data buses 135 - 138 , such as the data bus 135 .
  • the data buses 135 - 138 comprise data buses 135 and 137 having connections between the Xb 11 node 102 and the R 1,0 node 130 and the R 1,2 node 132 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors.
  • the first direction of communication of single step adjacency for the Xb 11 node is communication in the east and west horizontal direction.
  • the single step adjacency for Xb 11 is to R nodes having an integer column number of the starting node, in this case column 1 for the Xb 11 node 102 , increased by a value of “1” for single step adjacency in the east direction to R 1,2 node 132 and decreased by the value “1” for single step adjacency in the west direction to R 1,0 node 130 .
  • the data bus 136 has a connection between Xb 11 node and R 1,1 node 131 having the same position in the R ⁇ C matrix.
  • the data bus 138 has a connection between Xb 11 node 102 and the R 1,3 node 133 representing one additional connection in the first direction of communication of two step adjacency.
  • the one additional connection in the first direction of communication of two step adjacency for the Xb 11 node 102 may be communication in either the east direction or communication in the west horizontal direction.
  • the east direction of communication of two step adjacency for the Xb 11 node 102 is to an R node having an integer column number of the starting node, in this case column 1 for the Xb 11 node 102 , increased by a value of “2” in the east direction to R 1,3 node 133 .
  • the west direction of communication of two step adjacency for the Xb 11 node 102 is to an R node having an integer column number of 1 for the starting node Xb 11 node 102 , is decreased by a value of “2” in the west direction to a ⁇ 1 value and is directed to R 1,3 node 133 due to wraparound. With wrap around, a decreased column number of ⁇ 2 wraps around to column 2.
  • the data travels across the data bus 135 and reaches node R 10 130 which is configured with four 4to1 multiplexers, such as shown R r,c 4 ⁇ 4 crossbar node 177 .
  • Each of the four 4to1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R 10 130 4 ⁇ 4 crossbar.
  • the first type of connection path is for data buses 160 and 168 having connections between the R 1,0 node 130 and the S 0,0 node 140 and the S 2,0 node 148 with the same column number in a vertical second direction of communication of single step adjacency between next door adjacent neighbors.
  • the second type of connection path is for data bus 164 which has a connection between R 1,0 node 130 and S 1,0 node 144 having the same position in the R ⁇ C matrix.
  • the third type of connection path is for data bus 172 which has a connection between the R 1,0 node 130 and the S 3,0 node 152 representing one additional connection in the second direction of communication of two step adjacency.
  • the first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being is a horizontal East/West direction.
  • FIG. 2 illustrates an execution array 200 organized in an exemplary 4 row by 4 column arrangement of execution units and local files in a physical layout form with a one to K+1 level adjacency, where K is a positive odd integer, in accordance with an embodiment of the present invention.
  • functional units and local storage units are separately coupled across each row with the same row number in a first direction of communication of single step adjacency between next door adjacent neighbors by horizontal row networks 202 - 205 to R r,c nodes.
  • the R r,c nodes are separately coupled across each column with the same column number in a second direction of communication of single step adjacency between next door adjacent neighbors by vertical column networks 207 - 210 to S r,c nodes and from there to the functional units and local storage units.
  • a load 0 (L 0 )/store 0 (S 0 ) unit 220 there are a plurality of functional units comprising a load 0 (L 0 )/store 0 (S 0 ) unit 220 , a multiplication M 01 unit 221 , an ALU Complex (C) unit 222 , and a ALU Bit operation (B) unit 223 .
  • LF storage units comprising LF 00 225 , LF 01 226 , LF 02 227 , and LF 03 228 .
  • the other three rows 1 - 3 203 - 205 contain a similar organization of functional units and local file units labeled according to their position in the R ⁇ C array, such as multiplication Mq 11 unit 231 and LF 11 236 .
  • the local files in each row provide a distributed register file for storage of variables as required by a program. Each local file is placed local to its associated functional unit by nature of the timing path to read from and write to the local file as required by a particular implementation.
  • Each local file may also be considered a sub-file portion of a distributed register file supporting computations in row.
  • the 4 ⁇ 4 execution unit/LF network connecting the functional units and local LFs according to a 1toK+1 adjacency as defined herein contains paths such as shown in FIG. 1 .
  • the four buses 260 - 263 are provided to transport four 32-bit results generated in each multiplier (Mqxx) unit 221 , 231 , 241 , and 251 over to an associated add and subtract function in the ALU Complex (C) unit 222 , 232 , 242 , and 252 as part of a complex multiplication operation.
  • the four buses 260 - 263 are able to operate in parallel with the horizontal row networks 202 - 205 to R r,c nodes and the vertical column networks 207 - 210 to S r,c nodes and from there to the functional units and local storage units of E ⁇ U network operations.
  • Results of the complex multiplication generated in C 02 222 , C 12 232 , C 22 242 , and C 32 252 may be stored locally in associated LF 02 227 , LF 12 237 , LF 22 247 , and LF 32 257 , respectively.
  • FIG. 3 illustrates exemplary specifiable paths in a 4 ⁇ 4 array of nodes 300 beginning from a source node, Mq 11 306 or LF 11 323 , to nodes in a K+1 row by K+1 column array of nodes (L 0 /S 0 )/LF 00 301 / 318 to B 33 /LF 33 316 / 333 interconnected by a K+1 adjacency network with an increased number of internode data paths, wherein K is an odd integer >1, in accordance with an embodiment of the present invention.
  • the (L 0 /S 0 )/LF 00 301 / 318 -B 33 /LF 33 316 / 333 nodes may be different types of execution unit nodes, such as a load and store unit loadx(Lx)/storex(Sx) or a multiply complex unit (Mqxx) or an ALU Complex (Cxx) unit or an ALU Bit operation (Bxx) unit located with their corresponding local file nodes (LFxx).
  • a load and store unit loadx(Lx)/storex(Sx) or a multiply complex unit (Mqxx) or an ALU Complex (Cxx) unit or an ALU Bit operation (Bxx) unit located with their corresponding local file nodes (LFxx).
  • a row 1 of nodes L 1 /S 1 /LF 10 305 / 322 , Mq 11 /LF 11 306 / 323 , C 12 /LF 12 307 / 324 , and B 13 /LF 13 308 / 325 is an exemplary row of nodes in the 4 ⁇ 4 array of nodes 300 .
  • each cell comprises an execution node, a local file node, an R node, and an S node for the same row column position, such as exemplary cell 389 comprising Mq 11 /LF 11 306 / 323 , R 11 336 , S 11 345 .
  • the cells are configured with expanded capabilities in the R and S nodes.
  • the R nodes, such as R 11 336 and shown in Rrc 4 ⁇ 5 390 and in more detail in Rrc 4 ⁇ 5 391 comprises an additional 4to1 multiplexer 392 .
  • R r,c 4 ⁇ 5 crossbar nodes of which nodes R 10 335 , R 11 336 , R 12 337 , and R 13 338 are shown.
  • S r,c 5 ⁇ 2 multiplexer nodes including S 00 340 to S 33 355 nodes.
  • Each R r,c node, such as R r,c 390 is configured with five 4 to 1 multiplexers as shown in R r,c 391 .
  • S r,c node, such as S r,c 393 is configured with two 5 to 1 multiplexers as shown in S r,c 5 ⁇ 2 394 .
  • the data bus paths within cells include paths, such as from Mq 11 /LF 11 306 / 323 over bus 357 to R 11 336 , R 11 336 over a first data bus path 365 and a second data bus path 377 to S 11 345 , and S 11 345 over two data bus paths 381 to Mq 11 /LF 11 306 / 323 , are prioritized for short layouts.
  • the execution unit Mq 11 306 generates a result upon executing an instruction which is programmatically directed to use one or more data buses 356 - 359 , such as the data bus 356 .
  • the data buses 356 - 359 comprise data buses 356 and 358 having connections between the Mq 11 306 and the R 1,0 node 335 and the R 1,2 node 337 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors.
  • the first direction of communication of single step adjacency for the Mq 11 306 node is communication in the east and west horizontal direction.
  • the single step adjacency for Mq 11 306 is to R nodes having an integer column number of the starting node, in this case column 1 for the Mq 11 306 , increased by a value of “1” for single step adjacency in the east direction to R 1,2 node 337 and decreased by the value “1” for single step adjacency in the west direction to R 1,0 node 335 .
  • the data bus 357 has a connection between Mq 11 306 and R 1,1 node 336 having the same position in the R ⁇ C matrix.
  • the data bus 359 has a connection between Mq 11 306 and the R 1,3 node 338 representing one additional connection in the first direction of communication of two step adjacency.
  • the one additional connection in the first direction of communication of two step adjacency for the Mq 11 306 node is communication in either the east direction or communication in the west horizontal direction.
  • the east direction of communication of two step adjacency for Mq 11 306 is to an R node having an integer column number of the starting node, in this case column 1 for the Mq 11 306 , increased by a value of “2” in the east direction to R 1,3 node 338 .
  • the data travels across the data bus 356 and reaches node R 10 335 which is configured with five 4to1 multiplexers, such as shown R r,c 4 ⁇ 5 crossbar node 391 .
  • Each of the five 4to1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R 10 335 4 ⁇ 5 crossbar.
  • the first type of connection path is for data buses 360 and 368 having connections between the R 1,0 node 335 and the S 0,0 node 340 and the S 2,0 node 348 with the same column number in a second vertical direction of communication of single step adjacency between next door adjacent neighbors.
  • the second type of connection path is for data buses 364 and 376 which have a connection between R 1,0 node 335 and S 1,0 node 344 having the same position in the R ⁇ C matrix.
  • the third type of connection path is for data bus 372 which has a connection between the R 1,0 node 335 and the S 3,0 node 352 representing one additional connection in the second direction of communication of two step adjacency.
  • the first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being is a horizontal East/West direction.
  • FIG. 4 illustrates three of the four row pipeline control units that control pipeline stage operations in each row to execute chained execution packets (CEPs), for reasons of clarity of presentation.
  • a CEP is a chain of instructions that generally contain sequential dependencies between one or more instructions in the chain.
  • the CEPs may be fetched in packets or streamed 420 , 422 , 423 over to row packet registers, such as row 0 packet register 425 .
  • a chained execution packet which contains control parameters in a header 427 , shown as 12345, for a selected execution row, such as row 3
  • the control parameters are loaded into the control unit 408 for row 3 to coordinate operations on the processor.
  • Instructions are selected from each row packet register for execution and as specified by the control unit and loaded into the associated prolog instruction control (PIC) memory.
  • PIC prolog instruction control
  • the PIC memories 410 , 412 , 414 load up the instructions during execution of a prolog code sequence and are accessible from the PIC memories for execution in combinations of instructions of up to five instructions at a time, in this example, for independent parallel decode and execution.
  • the 4 ⁇ 4 execution unit (E ⁇ U) array 402 can operate all 16 execution units and 16 LFs under a single program counter control 416 in a master mode of operations or the rows of execution units can operate separately under control of four program counters, three of which 416 , 418 , and 419 are shown in FIG. 4 .
  • the program counters operate under a program mode control that if in Master Mode (InMM), the Row 0 program counter (R 0 PC) is the PC for Rows0-3, else, the R 0 PC is used for R 0 only and for rows 1 - 3 if InMM, the rows 1 - 3 PCs are not used (NU), else, the R 1 PC not shown for reasons of clarity of presentation, the R 2 PC 418 is used for R 2 only and the R 3 PC 419 is used for R 3 only.
  • FIG. 5 illustrates three R ⁇ C XarMa processors 500 that are based on the 4 ⁇ 4 Execution unit (E ⁇ U) array of FIGS. 1-4 in accordance with embodiments of the invention.
  • the XarMa processor can be scaled both smaller and larger as shown in FIGS. 5A-5C .
  • FIG. 5A illustrates a 1 ⁇ 4 XarMa processor 502 having a single row of five types of execution units, one load unit (L), a multiply (M), such as the Mq units shown in FIGS. 2-4 , an ALU complex (C) unit, an ALU bitop (B) unit, and a store unit (S).
  • L load unit
  • M multiply
  • C ALU complex
  • B ALU bitop
  • S store unit
  • a load unit (L 1 ) and a store unit (S 1 ) may be combined (LS) with a single two read port 2 write port (2R2W) LF allowing L 1 to load directly to an S 1 input register or the associated LF or by means of the execution unit network to one or more OIPRs in function execution units.
  • L 1 load unit
  • S 1 store unit
  • Such a store unit and load unit combination may facilitate directly communicating between processor and memory nodes to reach further network attached elements.
  • the load unit that provides a data value to a function unit or to a local file write port may be located with the store unit that receives the data value from the load unit, a function unit, or from a local file read port.
  • the load unit may access a source data value from a memory and load the fetched data to one or more function units or LFs.
  • the store unit may receive a data value from a function unit or from a LF for storage to memory.
  • FIG. 5B illustrates a 2 ⁇ 4 XarMa processor 503 having two rows of four types of execution units, LS, M, C, B per row.
  • 5C illustrates a 4 ⁇ 4 XarMa processor 504 having four rows of four types of execution units, LS, M, C, B, per row.
  • the data (D) memory banks and instruction memory are configured on a silicon plane separate from the processing logic and execution array plane. While this is a preferred approach, it does not preclude placing the data memory banks and instruction memory on the same silicon with the processing logic and execution array.
  • an instruction is formatted to specify that a result is to be forwarded to one or more destination instructions in a chain of execution instructions instead of a destination register in a central register file.
  • the forwarding of the result to the destination instruction is decoded by internal logic to be an operand input port register (OIPR) of an associated execution unit thereby eliminating the storage of the temporary result variable in a central register file.
  • OIPR operand input port register
  • a row specifier is used in identifying the appropriate execution unit associated with destination instructions. If there are variables in a program that need to be maintained longer than a specified lifetime they may be stored in one or more of the LFs having available storage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)

Abstract

An Execution Array Memory Array (XarMa©) processor is described for signal processing and internet of things (IoT) applications, (pronounced sharma, that means happiness in Sanskrit). The XarMa© processor uses a 1 to K+1 adjacency network in an array of execution units. The 1 to K+1 adjacency refers to connections separately made in rows and in columns of execution unit and local file nodes, where the number of Rows≥K>1 and of Columns≥K>1 and K is an odd integer. Instead of a large central multi-ported register file, a distributed set of storage files local to each execution unit is used. The instruction set architecture uses instructions that specify forwarding of execution results to execution units associated with destination instructions. This execution array is scalable to support cost effective and low power high-performance application specific processing focused on target product requirements.

Description

RELATED APPLICATION DATA
The present application claims the benefit of U.S. Provisional Application No. 62/801,315 filed Feb. 5, 2019 entitled “Methods and Apparatus for Sharing Execution Units In an Execution Unit Network With Connections Based on 1 to K+1 Adjacency” which is incorporated by reference herein in its entirety.
FIELD OF INVENTION
The present invention relates to unique and improved methods and apparatuses for processing architectures and organizations of processing elements in networks configured to reduce power. More specifically, this invention concerns processing architectures and interconnection networks, based on 1 to K+1 adjacency, that reduce power for communicating between nodes, including execution units and local files, as controlled by a result forwarding instruction set architecture.
CROSS REFERENCE TO RELATED APPLICATIONS
The U.S. Pat. Nos. 7,581,079, 7,886,128, 8,156,311, 8,443,169, 9,460,048, 9,507,603, 10,078,517, and 10,503,515 have the same inventor, are related patents, and are hereby incorporated by reference in their entirety.
BACKGROUND OF INVENTION
A driving factor in development of internet of things (IoT) products, including phones, watches, medical related sensor devices, etc., is low cost, low power, high performance, and scalability. Significant performance and power problems in current multi-processor architectures involve accessing data from memory, operating on the accessed data, and sharing of data between processors. These multi-processor architectures are generally based on use of large central multi-ported register files. Also, having adequate memory bandwidth to support high performance operations is related to the organization of the processors, memory modules, execution unit connections and the interconnection network used for load and store operations.
One of the problems associated with increasing performance in multiprocessor parallel processing systems is the efficient accessing of data or instructions from memory. Having adequate memory bandwidth for sharing of data between processors is another problem associated with parallel processing systems. These problems are related to the organization of the processors and memory modules and the processor architecture used for data communication between a processor, including execution units, and a plurality of memories and between processors. Various approaches to solving these problems have been attempted in the past, for example, array processors and shared memory processors.
Multiprocessor systems can be classified generally in terms of coupling strength for communication between processors. Those multiprocessor systems that communicate using a share memory facility between the processors and the shared memory over an interconnection network are generally considered tightly coupled. Loosely coupled multiprocessor systems generally use an input/output (I/O) communication process in each processor for communicating between the processors over an interconnection network, such as message passing process. A wide variety of interconnection networks have been utilized in multiprocessing systems. For example, rings, bus connected, crossbar, tree, shuffle, omega, and butterfly, mesh, hypercube, and ManArray networks, have been used in prior multiprocessor systems. From an application or use perspective, specific networks have been chosen primarily based upon performance characteristics and cost to implement tradeoffs.
Parallel processing and the distribution of data between functional execution elements may be described using Kronecker product expressions of signal transform functions, such as the fast Fourier transform (FFT) and other signal flow graph expressions. However, the underlying processor architecture may present limitations that affect the efficiency of implementing functions described using Kronecker product expressions and other signal analysis techniques.
SUMMARY OF THE INVENTION
It is appreciated that improvements to processor architecture, network design, and organizations of processors and memory are desired. Such improvements are provided by multiple embodiments of the present invention.
An embodiment of the present invention addresses a method of executing a sequence of instructions in an execution unit (E×U) node in an array of E×Units. A first instruction and a destination instruction having a dependency on the first instruction are received, wherein the first instruction identifies the destination instruction in a sequence of instructions from a program and specifies that a result generated by execution of the first instruction by a first E×U node is to be forwarded to a destination E×U node that is to execute the destination instruction. The first instruction is executed on the first E×Ur,c node to generate the result for delivery through an E×U network to the destination E×U node associated with the identified destination instruction, wherein according to a Row by Column (R×C) matrix, an R×C array of E×Urow(r),column(c) nodes are interconnected by the E×U network, the E×U network comprising (K+1) by (K+1) array of E×Ur,c nodes, a first stage (K+1)×(K+1) array of Rr,c nodes for a first direction of communication, a second stage (K+1)×(K+1) array of Sr,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1)×(K+1) arrays, K is an odd integer, K>1, R≥(K+1), C≥(K+1), r∈{0, 1, . . . , K}, and c∈{0, 1, . . . , K}, and wherein connections exist between each E×Ur,c node and Rr,c nodes with the same row number in the first direction of communication, the first E×Ur,c node generates the result for a selectable first data path that connects to an Rr,c+1 node and for a selectable second data path that connects to an Rr,c−1 node for single step adjacency and for a selectable third data path that connects to an Rr,c+2 node for two step adjacency, and for a selectable fourth data path that connects to an Rr,c node in the same r,c position in the R×C matrix as the connecting E×Ur,c node, and wherein connections exist between each Rr,c node and Sr,c nodes with the same column number in the second direction of communication, wherein an Rr,c node, associated with a selected path in the first direction of communication, produces the result for a selectable first data path that connects to an Sr+1,c node and for a second data path that connects to an Sr−1,c node for single step adjacency and for a third data path that connects to an Sr+2,c node for two step adjacency, and for a fourth data path that connects to an Sr,c node in the same r,c position in the R×C matrix as the connecting Rr,c node, wherein an Sr,c node, associated with the selected data path in the second direction of communication, produces the result on a destination data path that connects to the destination E×U node to be received at the destination E×U node. The destination instruction is executed in the destination E×U node based on the received result to produce a destination result for use by the program.
Another embodiment of the invention addresses a network organized according to a 1 by Column (1×C) matrix. A 1×C array of E×U1,column(c) nodes are interconnected by an E×U network, the E×U network comprising 1 by (K+1) array of E×U1,c nodes connected to a 1×(K+1) array of R1,c nodes for a first direction of communication, and having wiring configured according to a 1 to K+1 adjacency of connections between the E×U1,c nodes and the R1,c nodes which includes wrapping around data paths at the edges of the 1×(K+1) arrays, K is an odd integer, K>1, C≥(K+1), and c∈{0, 1, . . . , K} and wherein connections exist between each E×U1,c node and R1,c nodes in the first direction of communication, a first E×U1,c node is connected by a first data path to an R1,c+1 node and by a second data path to an R1,c−1 node for single step adjacency and by a third data path to an R1,c+2 node for two step adjacency, and by a fourth data path to an R1,c node in the same 1,c position in the 1×C matrix as the first E×U1,c node, wherein the R1,c−1 node is connected by a first outputA path to its associated E×U1,c−1 node, the R1,c node is connected by a second outputA path to its associated E×U1,c node, the R1,c+1 node is connected by a third outputA path to its associated E×U1,c+1 node, and the R1,c+2 node is connected by a fourth outputA path to its associated E×U1,c+2 node.
A further embodiment of the invention a system is provided. The system has a load unit having a source of data values external to an array of execution unit (E×U) nodes that are interconnected by an E×U network. A first multiplexing element in the load unit to connect externally received data values to an E×U located in the E×U network for processing by one or more program instructions. The system has a store unit having a source of data values internal to the array of E×U nodes. A second multiplexing element in the store unit to connect to the E×U network to receive data values from an E×U source and connect the internally received data values to a destination node located external to the E×U network for processing by the destination node, wherein the load unit is combined with the store unit as a single node of the array of E×U nodes.
These and other features, aspects, techniques and advantages of the invention will be apparent to those skilled in the art from the following detailed description, taken together with the accompanying drawings and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates exemplary specifiable paths in an array beginning from a source node to nodes in a K+1row by K+1column array of nodes interconnected by a K+1 adjacency network, wherein K is an odd integer >1, in accordance with an embodiment of the present invention;
FIG. 2 illustrates an execution array organized in an exemplary 4 row by 4 column arrangement of execution units and local files in a physical layout form with a one to K+1 level adjacency, where K is a positive odd integer, in accordance with an embodiment of the present invention;
FIG. 3 illustrates exemplary specifiable paths in a 4×4 array of nodes beginning from a source node to nodes in a K+1row by K+1column array of nodes interconnected by a K+1 adjacency network with an increased number of internode data paths, wherein K is an odd integer >1, in accordance with an embodiment of the present invention;
FIG. 4 illustrates a control system for a R=4×C=4 XarMa processor comprising row 0-3 control units with corresponding prolog instruction code (PIC) memories in accordance with an embodiment of the present invention; and
FIG. 5 illustrates three R×C XarMa processors that are based on the 4×4 Execution unit (E×U) array of FIGS. 1-4 in accordance with embodiments of the invention.
DETAILED DESCRIPTION
While the present invention is disclosed in a presently preferred context, it will be recognized that the teachings of the present invention may be variously embodied consistent with the disclosure and claims. It will be recognized that the present teachings may be adapted to other present and future architectures to which they may be beneficial.
In order to amortize development costs for such devices across multiple products targeted for different applications, a scalable architecture with multiple design points using the same instruction set architecture is proposed. To address low power, high performance, and scalability, a new architecture is presented that reduces storage of temporary variables lowering power usage, provide efficient processor and shared memory transfers, and is scalable.
FIG. 1 illustrates exemplary specifiable paths in an array 100 from a single node (Nb11) 102 to nodes in a K+1row by K+1column array of nodes Na00 110 to Na33 125 interconnected by a K+1 adjacency network, wherein K is an odd integer >1, in accordance with an embodiment of the present invention. Array notation is used with nodes, such as Na00=Narow=0column=0. The Na00-Na33 nodes 110-125, such as Na11 115, also shown in exemplary node illustration 157, may be processor nodes (Pa), or memory nodes (Ma), or execution unit nodes (Xa), or local file nodes (LFa). The single Nb11 node 102 is one of a 4×4 array of nodes Nb00 to Nb33, not shown for reason of clarity in the drawing. The Nb00-Nb33 nodes may also be processor nodes (Pb), or memory nodes (Mb), or execution unit nodes (Xb), or local file nodes (LFb). There is also a 4×4 array of R r,c 4×4 crossbar nodes of which nodes R10 130, R11 131, R12 132, and R13 133 are shown. There is further shown a 4×4 array of S r,c 4×1 multiplexer nodes including S00 4×1 140 to S33 4×1 155 nodes. FIG. 2 shows all the nodes together. An exemplary Rr,c node 176 is configured with four 4 to 1 multiplexers (4×1) 177. An exemplary Sr,c node 178 is configured with one 4 to 1 multiplexer (4×1) 179. The horizontal row data buses 135-138 and vertical column data buses 160-175 are Bb-bits, for example Bb=16-bits or 32-bits or 64-bits, and the like. Generally, the bus paths for Nbr,c→Rr,c, Rr,c→Sr,c, and Sr,c→(Pa/Ma/Xa/LFa)r,c, having the same r and the same c, are prioritized for short layouts, such as the case where Nbr,c is a processor and the Sr,c node connects to a memory node Mar,c.
To illustrate an exemplary data path, the node Nb11 102 is designed to be an execution unit, so is referenced here in this description as Xb11. The execution unit Xb11 102 generates a result upon executing an instruction which is programmatically directed to use one or more selectable data buses 135-138, such as the data bus 135. The data buses 135-138 comprise data buses 135 and 137 having connections between the Xb11 node 102 and the R1,0 node 130 and the R1,2 node 132 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors. The first direction of communication of single step adjacency for the Xb11 node is communication in the east and west horizontal direction. The single step adjacency for Xb11 is to R nodes having an integer column number of the starting node, in this case column 1 for the Xb11 node 102, increased by a value of “1” for single step adjacency in the east direction to R1,2 node 132 and decreased by the value “1” for single step adjacency in the west direction to R1,0 node 130. Wraparound is also in effect, in this case, after the increase of a starting column number 3 by “1” for a value of K+1=4, the starting column number 3 wraps around to column “0” and after the decrease of a starting column number 0 by “1” for a value of “−1”, the starting column number 0 wraps around to column “3”.
The data bus 136 has a connection between Xb11 node and R1,1 node 131 having the same position in the R×C matrix. The data bus 138 has a connection between Xb11 node 102 and the R1,3 node 133 representing one additional connection in the first direction of communication of two step adjacency. The one additional connection in the first direction of communication of two step adjacency for the Xb11 node 102 may be communication in either the east direction or communication in the west horizontal direction. The east direction of communication of two step adjacency for the Xb11 node 102 is to an R node having an integer column number of the starting node, in this case column 1 for the Xb11 node 102, increased by a value of “2” in the east direction to R1,3 node 133. With wrap around, an increased column number of 4 wraps around to column 0 and an increased column number of 5 wraps around to column 1. The west direction of communication of two step adjacency for the Xb11 node 102 is to an R node having an integer column number of 1 for the starting node Xb11 node 102, is decreased by a value of “2” in the west direction to a −1 value and is directed to R1,3 node 133 due to wraparound. With wrap around, a decreased column number of −2 wraps around to column 2.
The data travels across the data bus 135 and reaches node R10 130 which is configured with four 4to1 multiplexers, such as shown R r,c 4×4 crossbar node 177. Each of the four 4to1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R10 130 4×4 crossbar. There are three types of Rr,c node to Sr,c node connection paths. The first type of connection path is for data buses 160 and 168 having connections between the R1,0 node 130 and the S0,0 node 140 and the S2,0 node 148 with the same column number in a vertical second direction of communication of single step adjacency between next door adjacent neighbors. The second type of connection path is for data bus 164 which has a connection between R1,0 node 130 and S1,0 node 144 having the same position in the R×C matrix. The third type of connection path is for data bus 172 which has a connection between the R1,0 node 130 and the S3,0 node 152 representing one additional connection in the second direction of communication of two step adjacency. The first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being is a horizontal East/West direction.
FIG. 2 illustrates an execution array 200 organized in an exemplary 4 row by 4 column arrangement of execution units and local files in a physical layout form with a one to K+1 level adjacency, where K is a positive odd integer, in accordance with an embodiment of the present invention. In FIG. 2, functional units and local storage units are separately coupled across each row with the same row number in a first direction of communication of single step adjacency between next door adjacent neighbors by horizontal row networks 202-205 to Rr,c nodes. The Rr,c nodes are separately coupled across each column with the same column number in a second direction of communication of single step adjacency between next door adjacent neighbors by vertical column networks 207-210 to Sr,c nodes and from there to the functional units and local storage units. For example, in row 0 202 there are a plurality of functional units comprising a load0 (L0)/store0 (S0) unit 220, a multiplication M01 unit 221, an ALU Complex (C) unit 222, and a ALU Bit operation (B) unit 223. Also in row 0 202 and associated with the plurality of functional units are local file (LF) storage units comprising LF00 225, LF01 226, LF02 227, and LF03 228. The other three rows 1-3 203-205 contain a similar organization of functional units and local file units labeled according to their position in the R×C array, such as multiplication Mq11 unit 231 and LF11 236. The local files in each row provide a distributed register file for storage of variables as required by a program. Each local file is placed local to its associated functional unit by nature of the timing path to read from and write to the local file as required by a particular implementation. Each local file may also be considered a sub-file portion of a distributed register file supporting computations in row. The 4×4 execution unit/LF network connecting the functional units and local LFs according to a 1toK+1 adjacency as defined herein contains paths such as shown in FIG. 1.
In FIG. 2, the four buses 260-263 are provided to transport four 32-bit results generated in each multiplier (Mqxx) unit 221, 231, 241, and 251 over to an associated add and subtract function in the ALU Complex (C) unit 222, 232, 242, and 252 as part of a complex multiplication operation. The four buses 260-263 are able to operate in parallel with the horizontal row networks 202-205 to Rr,c nodes and the vertical column networks 207-210 to Sr,c nodes and from there to the functional units and local storage units of E×U network operations. Results of the complex multiplication generated in C02 222, C12 232, C22 242, and C32 252 may be stored locally in associated LF02 227, LF12 237, LF22 247, and LF32 257, respectively.
FIG. 3 illustrates exemplary specifiable paths in a 4×4 array of nodes 300 beginning from a source node, Mq11 306 or LF11 323, to nodes in a K+1row by K+1column array of nodes (L0/S0)/LF00 301/318 to B33/LF33 316/333 interconnected by a K+1 adjacency network with an increased number of internode data paths, wherein K is an odd integer >1, in accordance with an embodiment of the present invention. Array notation is used with nodes, such as (L0/S0)/LF00=LFrow=0column=0 301/318. The (L0/S0)/LF00 301/318-B33/LF33 316/333 nodes may be different types of execution unit nodes, such as a load and store unit loadx(Lx)/storex(Sx) or a multiply complex unit (Mqxx) or an ALU Complex (Cxx) unit or an ALU Bit operation (Bxx) unit located with their corresponding local file nodes (LFxx). A row 1 of nodes L1/S1/LF10 305/322, Mq11/LF11 306/323, C12/LF12 307/324, and B13/LF13 308/325 is an exemplary row of nodes in the 4×4 array of nodes 300. Wiring according to the 1 to K+1 adjacency where K=3 is only shown for the Mq11 306 and LF11 323 nodes for reason of clarity in the drawing. In order to support two operand data paths, cells are defined, wherein each cell comprises an execution node, a local file node, an R node, and an S node for the same row column position, such as exemplary cell 389 comprising Mq11/LF11 306/323, R11 336, S11 345. The cells are configured with expanded capabilities in the R and S nodes. The R nodes, such as R11 336 and shown in Rrc 4×5 390 and in more detail in Rrc 4×5 391 comprises an additional 4to1 multiplexer 392. There is also a 4×4 array of R r,c 4×5 crossbar nodes of which nodes R10 335, R11 336, R12 337, and R13 338 are shown. There is further shown a 4×4 array of S r,c 5×2 multiplexer nodes including S00 340 to S33 355 nodes. Each Rr,c node, such as R r,c 390, is configured with five 4 to 1 multiplexers as shown in R r,c 391. Each Sr,c node, such as S r,c 393 is configured with two 5 to 1 multiplexers as shown in S r,c 5×2 394. The horizontal row data buses 356-359 and vertical column data buses 360-379 are Bb-bits, for example Bb=16-bits or 32-bits or 64-bits, and the like. Generally, the data bus paths within cells, such as exemplary cell 389, include paths, such as from Mq11/LF11 306/323 over bus 357 to R11 336, R11 336 over a first data bus path 365 and a second data bus path 377 to S11 345, and S11 345 over two data bus paths 381 to Mq11/LF11 306/323, are prioritized for short layouts.
To illustrate an exemplary data path, the execution unit Mq11 306 generates a result upon executing an instruction which is programmatically directed to use one or more data buses 356-359, such as the data bus 356. The data buses 356-359 comprise data buses 356 and 358 having connections between the Mq11 306 and the R1,0 node 335 and the R1,2 node 337 with the same row number in the first direction of communication of single step adjacency between next door adjacent neighbors. The first direction of communication of single step adjacency for the Mq11 306 node is communication in the east and west horizontal direction. The single step adjacency for Mq11 306 is to R nodes having an integer column number of the starting node, in this case column 1 for the Mq11 306, increased by a value of “1” for single step adjacency in the east direction to R1,2 node 337 and decreased by the value “1” for single step adjacency in the west direction to R1,0 node 335. Wraparound is also in effect, in this case, after the increase of a starting column number 3 by “1” for a value of K+1=4, the starting column number 3 wraps around to column “0” and after the decrease of a starting column number 0 by “1” for a value of “−1”, the starting column number 0 wraps around to column “3”.
The data bus 357 has a connection between Mq11 306 and R1,1 node 336 having the same position in the R×C matrix. The data bus 359 has a connection between Mq11 306 and the R1,3 node 338 representing one additional connection in the first direction of communication of two step adjacency. The one additional connection in the first direction of communication of two step adjacency for the Mq11 306 node is communication in either the east direction or communication in the west horizontal direction. The east direction of communication of two step adjacency for Mq11 306 is to an R node having an integer column number of the starting node, in this case column 1 for the Mq11 306, increased by a value of “2” in the east direction to R1,3 node 338. With wrap around, an increased column number of 4 wraps around to column 0 and an increased column number of 5 wraps around to column 1. The west direction of communication of two step adjacency for Mq11 306 is to an R node having an integer column number of 1 for the starting node Mq11 306, is decreased by a value of “2” in the west direction to a −1 value and is directed to R1,3 node 338 due to wraparound. With wrap around, a decreased column number of −2 wraps around to column 2.
The data travels across the data bus 356 and reaches node R10 335 which is configured with five 4to1 multiplexers, such as shown R r,c 4×5 crossbar node 391. Each of the five 4to1 multiplexers receives control signals that cause each multiplexer to select none or one of that multiplexer's four input signals to pass to its associated output of the R10 335 4×5 crossbar. There are three types of Rr,c node to Sr,c node connection paths. The first type of connection path is for data buses 360 and 368 having connections between the R1,0 node 335 and the S0,0 node 340 and the S2,0 node 348 with the same column number in a second vertical direction of communication of single step adjacency between next door adjacent neighbors. The second type of connection path is for data buses 364 and 376 which have a connection between R1,0 node 335 and S1,0 node 344 having the same position in the R×C matrix. The third type of connection path is for data bus 372 which has a connection between the R1,0 node 335 and the S3,0 node 352 representing one additional connection in the second direction of communication of two step adjacency. The first direction of communication and the second direction of communication can be reversed, with the first direction of communication being in a vertical North/South direction and the second direction of communication being is a horizontal East/West direction.
FIG. 4 illustrates control system 400 for an R=4×C=4 XarMa processor comprising row 0-3 control units 405, 407, 408 with corresponding prolog instruction code (PIC) memories 410, 412, 414 in accordance with an embodiment of the present invention. FIG. 4 illustrates three of the four row pipeline control units that control pipeline stage operations in each row to execute chained execution packets (CEPs), for reasons of clarity of presentation. A CEP is a chain of instructions that generally contain sequential dependencies between one or more instructions in the chain. The CEPs may be fetched in packets or streamed 420, 422, 423 over to row packet registers, such as row 0 packet register 425. Upon receiving a chained execution packet (CEP) which contains control parameters in a header 427, shown as 12345, for a selected execution row, such as row 3, the control parameters are loaded into the control unit 408 for row 3 to coordinate operations on the processor. Instructions are selected from each row packet register for execution and as specified by the control unit and loaded into the associated prolog instruction control (PIC) memory. As shown, the PIC memories 410, 412, 414 load up the instructions during execution of a prolog code sequence and are accessible from the PIC memories for execution in combinations of instructions of up to five instructions at a time, in this example, for independent parallel decode and execution. The 4×4 execution unit (E×U) array 402 can operate all 16 execution units and 16 LFs under a single program counter control 416 in a master mode of operations or the rows of execution units can operate separately under control of four program counters, three of which 416, 418, and 419 are shown in FIG. 4. The program counters (PCs) operate under a program mode control that if in Master Mode (InMM), the Row 0 program counter (R0PC) is the PC for Rows0-3, else, the R0PC is used for R0 only and for rows 1-3 if InMM, the rows 1-3 PCs are not used (NU), else, the R1PC not shown for reasons of clarity of presentation, the R2PC 418 is used for R2 only and the R3PC 419 is used for R3 only.
FIG. 5 illustrates three R×C XarMa processors 500 that are based on the 4×4 Execution unit (E×U) array of FIGS. 1-4 in accordance with embodiments of the invention. The XarMa processor can be scaled both smaller and larger as shown in FIGS. 5A-5C. FIG. 5A illustrates a 1×4 XarMa processor 502 having a single row of five types of execution units, one load unit (L), a multiply (M), such as the Mq units shown in FIGS. 2-4, an ALU complex (C) unit, an ALU bitop (B) unit, and a store unit (S). By developing an instruction set architecture that allows operands to be specified for delivery to a function unit's operand input instead of specifying a register in a register file, local files with a reduced capacity and reduced number of ports can be used instead of a large capacity multi-ported register file. A load unit (L1) and a store unit (S1) may be combined (LS) with a single two read port 2 write port (2R2W) LF allowing L1 to load directly to an S1 input register or the associated LF or by means of the execution unit network to one or more OIPRs in function execution units. Such a store unit and load unit combination may facilitate directly communicating between processor and memory nodes to reach further network attached elements. For example, the load unit that provides a data value to a function unit or to a local file write port may be located with the store unit that receives the data value from the load unit, a function unit, or from a local file read port. The load unit may access a source data value from a memory and load the fetched data to one or more function units or LFs. The store unit may receive a data value from a function unit or from a LF for storage to memory. FIG. 5B illustrates a 2×4 XarMa processor 503 having two rows of four types of execution units, LS, M, C, B per row. FIG. 5C illustrates a 4×4 XarMa processor 504 having four rows of four types of execution units, LS, M, C, B, per row. The data (D) memory banks and instruction memory are configured on a silicon plane separate from the processing logic and execution array plane. While this is a preferred approach, it does not preclude placing the data memory banks and instruction memory on the same silicon with the processing logic and execution array.
To minimize the storage of temporary variables, an instruction is formatted to specify that a result is to be forwarded to one or more destination instructions in a chain of execution instructions instead of a destination register in a central register file. The forwarding of the result to the destination instruction is decoded by internal logic to be an operand input port register (OIPR) of an associated execution unit thereby eliminating the storage of the temporary result variable in a central register file. For the 1×4 XarMa processor 502 of FIG. 5A no row specifier is required, but for the 2×4XarMa processor 503 of FIG. 5B and for the 4×4 XarMa processor 504 of FIG. 5C or for other configurations such as a 5×4 or 5×5 XarMa processor, a row specifier is used in identifying the appropriate execution unit associated with destination instructions. If there are variables in a program that need to be maintained longer than a specified lifetime they may be stored in one or more of the LFs having available storage.

Claims (20)

I claim:
1. A method of executing a sequence of instructions in an execution unit (E×U) node in an array of E×Units, the method comprising:
receiving a first instruction and a destination instruction having a dependency on the first instruction, wherein the first instruction identifies the destination instruction in a sequence of instructions from a program and specifies that a result generated by execution of the first instruction by a first E×U node is to be forwarded to a destination E×U node that is to execute the destination instruction;
executing the first instruction on the first E×Ur,c node to generate the result for delivery through an E×U network to the destination E×U node associated with the identified destination instruction, wherein according to a Row by Column (R×C) matrix, an R×C array of E×Urow(r),column(c) nodes are interconnected by the E×U network, the E×U network comprising (K+1) by (K+1) array of E×Ur,c nodes, a first stage (K+1)×(K+1) array of Rr,c nodes for a first direction of communication, a second stage (K+1)×(K+1) array of Sr,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1)×(K+1) arrays, K is an odd integer, K>1, R≥(K+1), C≥(K+1), r∈{0,1, . . . , K}, and c∈{0,1, . . . , K}, and wherein connections exist between each E×Ur,c node and Rr,c nodes with the same row number in the first direction of communication, the first E×Ur,c node generates the result for a selectable first data path that connects to an Rr,c+1 node and for a selectable second data path that connects to an Rr,c−1 node for single step adjacency and for a selectable third data path that connects to an Rr,c+2 node for two step adjacency, and for a selectable fourth data path that connects to an Rr,c node in the same r,c position in the R×C matrix as the connecting E×Ur,c node, and wherein connections exist between each Rr,c node and Sr,c nodes with the same column number in the second direction of communication, wherein an Rr,c node, associated with a selected path in the first direction of communication, produces the result for a selectable first data path that connects to an Sr+1,c node and for a second data path that connects to an Sr−1,c node for single step adjacency and for a third data path that connects to an Sr+2,c node for two step adjacency, and for a fourth data path that connects to an Sr,c node in the same r,c position in the R×C matrix as the connecting Rr,c node, wherein an Sr,c node, associated with the selected data path in the second direction of communication, produces the result on a destination data path that connects to the destination E×U node to be received at the destination E×U node; and
executing the destination instruction in the destination E×U node based on the received result to produce a destination result for use by the program.
2. The method of claim 1, wherein the Rr,c nodes are 4×4 crossbars having four inputs and four outputs and the Sr,c nodes are 4×1 multiplexers having four inputs and one output.
3. The method of claim 1 further comprising:
wrapping around when Rr,c+1=Rr,K+1 in the first direction of communication to Rr,0 for single step adjacency;
wrapping around when Rr,c−1=Rr,−1 in the first direction of communication to Rr,K for single step adjacency; and
wrapping around when Rr,c+2=Rr,K+2 in the first direction of communication to Rr,1 for two step adjacency.
4. The method of claim 1 further comprising:
wrapping around when Sr+1,c=SK+1,c in the second direction of communication to S0,c for single step adjacency;
wrapping around when Sr−1,c=S−1,c in the second direction of communication to SK,c for single step adjacency; and
wrapping around when Sr+2,c=SK+2,c in the second direction of communication to R1,c for two step adjacency.
5. The method of claim 1 further comprising:
executing a second instruction on a second E×Ur,c node to generate a second result for a selectable fifth data path that connects to the Rr,c node, associated with the selected path in the first direction of communication;
producing the second result on the Rr,c node, associated with the selected path in the first direction of communication, for a selectable fifth data path that connects to the Sr,c node, in the same r,c position in the R×C matrix as the connecting Rr,c node; and
producing the second result, by the Sr,c node associated with the selected data path in the second direction of communication, on a second destination data path that connects to the destination E×U node to be received at the destination E×U node.
6. The method of claim 5, wherein the Rr,c nodes are 4×5 crossbars having four inputs and five outputs and the Sr,c nodes are 5×2 multiplexers having five inputs and two outputs.
7. The method of claim 1 further comprising:
setting a program counter mode control to master mode: and
controlling the instruction sequence from the program for operation of the K+1 rows of the R×C array of E×Urow(r),column(c) nodes by using the program counter for row 0 and making program counters for rows 1 to row K to be in a not used state.
8. The method of claim 1 further comprising:
setting a program counter mode control to not master mode; and
controlling the instruction sequence from the program for each row of the R×C array of E×Urow(r),column(c) nodes using K+1 program counters for separate control of rows 0 to row K to be in an active state.
9. A network organized according to a 1 by Column (1×C) matrix, the network comprising:
a 1×C array of E×U1,column(c) nodes interconnected by an E×U network, the E×U network comprising 1 by (K+1) array of E×U1,c nodes connected to a 1×(K+1) array of R1,c nodes for a first direction of communication, and having wiring configured according to a 1 to K+1 adjacency of connections between the E×U1,c nodes and the R1,c nodes which includes wrapping around data paths at the edges of the 1×(K+1) arrays, K is an odd integer, K>1, C≥(K+1), and c∈{0,1, . . . , K} and wherein connections exist between each E×U1,c node and R1,c nodes in the first direction of communication, a first E×U1,c node is connected by a first data path to an R1,c+1 node and by a second data path to an R1,c−1 node for single step adjacency and by a third data path to an R1,c+2 node for two step adjacency, and by a fourth data path to an R1,c node in the same 1,c position in the 1×C matrix as the first E×U1,c node, wherein the R1,c−1 node is connected by a first outputA path to its associated E×U1,c−1 node, the R1,c node is connected by a second outputA path to its associated E×U1,c node, the R1,c+1 node is connected by a third outputA path to its associated E×U1,c+1 node, and the R1,c+2 node is connected by a fourth outputA path to its associated E×U1,c+2 node.
10. The network of claim 9, wherein the Rr,c nodes comprise:
4×1 multiplexers, in the Rr,c nodes, having four inputs and one output.
11. The network of claim 9, wherein the node is connected by a first outputB path to its associated E×U1,c−1 node, the R1,c node is connected by a second outputB path to its associated E×U1,c node, the R1,c+1 node is connected by a third outputB path to its associated E×U1,c+1 node, and the R1,c+2 node is connected by a fourth outputB path to its associated E×U1,c+2 node.
12. The network of claim 11, wherein the Rr,c nodes comprise:
4×2 crossbars, in the Rr,c nodes, having four inputs and two outputs.
13. The network of claim 9 further comprising:
the first data path is wrapped around when R1,c+1=R1,K+1 in the first direction of communication to Rr,0 for single step adjacency;
the second data path is wrapped around when R1,c−1=R1,K−1 in the first direction of communication to Rr,K for single step adjacency; and
the third data path is wrapped around when R1,c+2=R1,K+2 in the first direction of communication to Rr,1 for two step adjacency.
14. The network of claim 9 further comprising:
connecting two 1×C arrays of E×U1,column(c) nodes by a second stage (K+1)×(K+1) array of Sr,c nodes for a second direction of communication, wherein each Rr,c node is connected by a selectable first data path to an Sr+1,c node and by a second data path to an Sr−1,c node for single step adjacency and by a third data path to an Sr+2,c node for two step adjacency, and by a fourth data path to an Sr,c node in the same r,c position in the R×C matrix as the connecting Rr,c node, wherein each Sr,c node is connected by a destination data path to a corresponding destination E×Ur,c node.
15. The network of claim 14, wherein the Rr,c nodes and Sr,c nodes comprise:
4×2 multiplexers in the Rr,c nodes having four inputs and one output; and
2×1 multiplexers in the Sr,c nodes having two inputs and one output.
16. A system apparatus comprising:
a load unit having a source of data values external to an array of execution unit (E×U) nodes that are interconnected by an E×U network;
a first multiplexing element in the load unit to connect externally received data values to an E×U located in the E×U network for processing by one or more program instructions;
a store unit having a source of data values internal to the array of E×U nodes;
a second multiplexing element in the store unit to connect to the E×U network to receive data values from an E×U source and connect the internally received data values to a destination node located external to the E×U network for processing by the destination node, wherein the load unit is combined with the store unit as a single node of the array of E×U nodes and wherein the E×U network comprises:
an Row×Column (R×C) array of E×Urow(r),column(c) nodes interconnected by the E×U network, the E×U network comprising (K+1) by (K+1) array of E×Ur,c nodes, a first stage of (K+1)×(K+1) array of Rr,c nodes for a first direction of communication, a second stage of (K+1)×(K+1) array of Sr,c nodes for a second direction of communication, and in each stage having wiring configured according to a 1 to K+1 adjacency of connections between nodes which includes wrapping around data paths at the edges of the (K+1)×(K+1) arrays, K is an odd integer, K>1, R≥(K+1), C≥(K+1), r∈{0,1, . . . , K}, and c∈{0,1, . . . , K}, and wherein connections exist between each E×Ur,c node and Rr,c nodes with the same row number in the first direction of communication, and wherein connections exist between each Rr,c node and Sr,c nodes with the same column number in the second direction of communication, and wherein each Sr,c node is connected to corresponding E×Ur,c nodes.
17. The system apparatus of claim 16, wherein the source of data values comprises:
a memory unit having a read port providing the source of data values.
18. The system apparatus of claim 16 wherein the destination node comprises:
a memory unit having a write port to receive the data values from the E×U source and store the received data values in the memory.
19. The system apparatus of claim 16, wherein the load unit connects to the E×U network as an E×Ur,c node connects in the first direction of communication to an Rr,c node to send a load supplied value to the E×U network and the store unit connects to the E×U network as an E×Ur,c node to receive an E×U network provided value from an Sr,c node output.
20. The system apparatus of claim 16 further comprises:
each Rr,c node is configured with a 4×5 crossbar, each Rr,c node having four inputs and five outputs; and
each Sr,c node is configured as a 5×2 multiplexer node having two outputs, wherein each Rr,c node is connected to a corresponding Sr,c node and each Sr,c node is connected to a corresponding E×Ur,c node, and further wherein two of the five outputs of each Rr,c node are connected to the corresponding Sr,c node having the same r,c position as the connected Rr,c node and the two outputs of each Sr,c node are connected to the corresponding E×Ur,c node having the same r,c position as the connected Sr,c node, whereby parallel operations are provided through the E×U network.
US16/781,428 2019-02-05 2020-02-04 Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor Active 2040-04-24 US11249939B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/781,428 US11249939B2 (en) 2019-02-05 2020-02-04 Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962801315P 2019-02-05 2019-02-05
US16/781,428 US11249939B2 (en) 2019-02-05 2020-02-04 Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor

Publications (2)

Publication Number Publication Date
US20200250131A1 US20200250131A1 (en) 2020-08-06
US11249939B2 true US11249939B2 (en) 2022-02-15

Family

ID=71836608

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/781,428 Active 2040-04-24 US11249939B2 (en) 2019-02-05 2020-02-04 Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor

Country Status (1)

Country Link
US (1) US11249939B2 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085721A1 (en) * 2014-09-22 2016-03-24 International Business Machines Corporation Reconfigurable array processor for pattern matching
US20200174787A1 (en) * 2016-04-26 2020-06-04 Onnivation, LLC Secure Matrix Space With Partitions For Concurrent Use

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085721A1 (en) * 2014-09-22 2016-03-24 International Business Machines Corporation Reconfigurable array processor for pattern matching
US20200174787A1 (en) * 2016-04-26 2020-06-04 Onnivation, LLC Secure Matrix Space With Partitions For Concurrent Use

Also Published As

Publication number Publication date
US20200250131A1 (en) 2020-08-06

Similar Documents

Publication Publication Date Title
US9460048B2 (en) Methods and apparatus for creating and executing a packet of instructions organized according to data dependencies between adjacent instructions and utilizing networks based on adjacencies to transport data in response to execution of the instructions
US7581079B2 (en) Processor composed of memory nodes that execute memory access instructions and cooperate with execution nodes to execute function instructions
US10503515B2 (en) Methods and apparatus for adjacency network delivery of operands to instruction specified destinations that reduces storage of temporary variables
US7196708B2 (en) Parallel vector processing
US6631439B2 (en) VLIW computer processing architecture with on-chip dynamic RAM
US9594724B2 (en) Vector register file
US7398347B1 (en) Methods and apparatus for dynamic instruction controlled reconfigurable register file
US7146486B1 (en) SIMD processor with scalar arithmetic logic units
US8078834B2 (en) Processor architectures for enhanced computational capability
JP6785738B2 (en) DRAM-based processing unit
EP3566134A1 (en) Multi-function unit for programmable hardware nodes for neural network processing
EP3869352A1 (en) Network-on-chip data processing method and device
US20020023201A1 (en) VLIW computer processing architecture having a scalable number of register files
US20070239970A1 (en) Apparatus For Cooperative Sharing Of Operand Access Port Of A Banked Register File
JPH0922404A (en) Array processor communication architecture with broadcast communication processor instruction
US10761851B2 (en) Memory apparatus and method for controlling the same
US8024549B2 (en) Two-dimensional processor array of processing elements
US20050226337A1 (en) 2D block processing architecture
US11249939B2 (en) Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor
CN116069684A (en) Multicast for hardware-software co-design of in-memory computing architecture
WO2014202825A1 (en) Microprocessor apparatus
US20020032849A1 (en) VLIW computer processing architecture having the program counter stored in a register file register
TWI845081B (en) Graphics processor
US20250021306A1 (en) Hardware supported multiply-accumulate (mac) operation in a reconfigurable parallel processor
US20250211420A1 (en) Techniques for compressed route tables for contention-free routing associated with number-theoretic- transform and inverse-number-theoretic-transform computations

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY