CN109491809A - A communication method for reducing high-speed bus latency - Google Patents
- Publication number: CN109491809A (application CN201811341410.7A)
- Authority: CN (China)
- Prior art keywords: queue, rdma, data, memory, host
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
 
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a communication method for reducing high-speed bus latency. Memory is registered with the system, and RDMA is used to transfer data from local memory to the peer's memory. A send queue, a receive queue and a completion queue are created, and RDMA Send/Receive, RDMA Read and RDMA Write are then performed. When an application needs to communicate, a channel connection is created; the two endpoints of each channel are two QPs, each QP consists of an SQ and an RQ, and the QP is mapped into the application's virtual address space. The CQ provided by RDMA is used to notify the application that messages on a WQ have been processed. The user creates a work request (WR) and posts it to a queue (WQ) in the QP; the WR is converted into the WQE format, waits to be asynchronously scheduled and parsed by the RDMA NIC, and the actual message is fetched from the buffer that the WQE points to and sent to the peer. The data communication of the invention bypasses the Linux network protocol stack and transfers data by DMA directly from one device to another, which noticeably improves communication performance.
    Description
Technical field
      The invention belongs to the technical field of communication design, and specifically relates to a communication method for reducing high-speed bus latency, suitable for high-speed, highly reliable, fault-tolerant, low-latency distributed high-performance computing platforms.
    Background art
      The seven-layer Open Systems Interconnection (OSI) model defined by the International Organization for Standardization is overly complex, and a five-layer network model has become the de facto standard. Its five layers are the physical layer, data link layer, network layer, transport layer and application layer. The physical layer transmits bits, handles the mechanical and electrical characteristics of the physical medium, and provides the physical connection between devices. The data link layer provides error-free transmission and supplies data framing, error control and flow control. The network layer is responsible for delivering data to the correct destination node. The transport layer provides communication between applications; TCP and UDP are the common protocols. The application layer consists of user programs and protocols such as HTTP, FTP and SMTP.
      The Linux protocol stack originates from the BSD UNIX protocol stack. After many years of refinement by the open-source community, the Linux network protocol stack is large and complete, and is known for its generality. It provides general-purpose interfaces upward to the user layer and downward to the physical layer, and the stack itself is very well organized.
      The Linux network protocol stack can be divided into five layers: the system call interface layer, the protocol-independent socket layer, the network protocol implementation layer, the device-independent driver interface layer, and the driver layer. The system call interface layer provides the socket application programming interface library, which gives user applications access to network services. The socket layer hides the differences between the underlying protocols, such as TCP, UDP and RAW sockets, so that the interface to the system call layer is uniform; the data portion of a message is delivered to the user application through this interface layer. The network protocol implementation layer implements protocols such as TCP, UDP, ARP, RARP, IP, IGMP and ICMP, and is the core of the network protocol stack. The device-independent driver interface layer defines a uniform set of driver interfaces and is the bridge between the driver layer and the protocol implementation layer. It abstracts the functions of many different drivers into a few common actions, such as open, close and init, and hides the specific drivers underneath. The driver layer is the code that operates the actual hardware. The layering of the Linux network protocol stack is clear, and each layer implements a relatively independent function. The two "independent" layers are the most ingenious part of the design: they hide the specific protocols and hardware and make the stack extensible.
      If high-performance real-time computing platforms communicate over mature TCP/UDP, communication latency increases significantly, and CPU time must be dedicated to sending and receiving data, which increases overhead. The root cause lies in the following characteristics of the Linux network protocol stack:
      1) Interrupt handling. When a large number of packets arrive from the network, hardware interrupt requests are generated frequently. These hardware interrupts can preempt lower-priority soft interrupts or system calls that were executing, and if they occur frequently they cause a high performance cost.
      2) Memory copying. Normally a network packet travels from the NIC to the application as follows: the data is transferred from the NIC into a kernel buffer, for example by DMA, and is then copied from kernel space to user space. In the Linux kernel protocol stack this time-consuming copy accounts for as much as 57.1% of the entire packet processing flow.
      3) Context switching. Frequently arriving hardware interrupts and soft interrupts may preempt running system calls at any time, which produces a large amount of context-switch overhead. In addition, in server designs based on multithreading, scheduling between threads also causes frequent context switches, and the cost of lock contention is likewise a very serious problem.
      4) Loss of locality. Mainstream processors today are all multi-core, which means that the processing of one packet may span several CPU cores. For example, a packet may be interrupted on cpu0, processed in kernel mode on cpu1 and processed in user mode on cpu2. Crossing cores in this way easily invalidates the CPU cache and destroys locality. On a NUMA architecture it also causes memory accesses across NUMA nodes, which severely affects performance.
      5) Memory management. Memory pages on a conventional server are 4K. To speed up memory access and avoid cache misses, the number of entries in the cached mapping table can be increased, but this in turn reduces the CPU's lookup efficiency.
    Summary of the invention
      In view of the above deficiencies in the prior art, the technical problem to be solved by the present invention is to provide a communication method for reducing high-speed bus latency that significantly reduces communication latency, increases the reliability of system communication, and reduces CPU resource usage.
      The invention adopts the following technical solution:
      A communication method for reducing high-speed bus latency: memory is registered with the system, and RDMA is used to transfer data from local memory to the peer's memory; a send queue (Send Queue, SQ), a receive queue (Receive Queue, RQ) and a completion queue (Complete Queue, CQ) are created; RDMA Send/Receive, RDMA Read and RDMA Write are then performed. RDMA provides point-to-point communication based on message queues, and each application obtains its own messages directly. The messaging service is built on a channel-IO connection created between the local and remote applications of the two communicating parties. When an application needs to communicate, a channel connection is created; the two endpoints of each channel are two Queue Pairs (QPs), each QP consists of a Send Queue (SQ) and a Receive Queue (RQ), and the QP is mapped into the application's virtual address space. The Complete Queue (CQ) provided by RDMA is used to notify the application that messages on a Work Queue (WQ) have been processed. The user creates a Work Request (WR) that describes the message content the application wishes to transfer to the peer, and the WR is posted to a queue (WQ) in the QP. In the WQ, the user's WR is converted into the Work Queue Element (WQE) format, waits to be asynchronously scheduled and parsed by the RDMA NIC, and the actual message is fetched from the buffer that the WQE points to and sent to the peer.
      Specifically, when memory is registered: during data transfer the application must not modify the memory holding the data, and the operating system must not page out the memory holding the data, so the mapping between physical and virtual addresses remains fixed.
      Further, each memory registration creates two keys, one for the local side and one for the remote side. The keys point to the memory region to be operated on, and the registered keys are part of the data transfer request. The local side uses l_key to access local memory, and the remote key is used by the remote RNIC to access system memory remotely. The same memory can be registered multiple times with different access permissions, and each registration generates a different set of keys.
      Specifically, RDMA is a message-based transport protocol, data transfer is asynchronous, and RDMA operates as follows:
      S201, the Host submits a work request (WR) to a work queue (WQ);
      The work queues comprise the send queue (SQ) and the receive queue (RQ). Each element of a work queue is called a WQE, that is, a WR;
      S202, the Host obtains work completions (WC) from the completion queue (CQ);
      S203, the hardware with the RDMA engine is a queue element processor;
      The RDMA hardware takes work requests (WR) from the work queue (WQ) and executes them, and after execution places a work completion (WC) in the completion queue (CQ).
      Further, the send queue (SQ) and receive queue (RQ) are created in pairs to form Queue Pairs (QP). Before a QP is established, communication management is used to exchange information about the QP. After the QP is established, RDMA Write/Read operations are performed through the Verbs API, or serialized Send/Receive operations similar to socket reads/writes are performed.
      Specifically, RDMA Send/Receive is a message-based data transport protocol. All packet assembly is done on the RDMA hardware, and the transport, network, data link and physical layers of the OSI model are all handled on the RDMA hardware.
      Further, RDMA Send/Receive is a two-sided operation, and data is sent as follows:
      1) Host A and host B each create and initialize their own QP and completion queue (CQ), where the send queue (SQ) and receive queue (RQ) are created in pairs to form Queue Pairs (QP);
      2) Host A and host B each post a WQE to their own WQ. For host A, WQ = SQ, and the WQE points to the data waiting to be sent; for host B, WQ = RQ, and the WQE points to a buffer in which data will be stored;
      3) The RNIC of host A asynchronously schedules A's WQE, parses it as a SEND message, and sends the data directly from the buffer to host B. When the data stream reaches the RNIC of host B, host B's WQE is consumed and the data is stored directly at the location the WQE points to;
      4) After the communication between host A and host B completes, a completion message (CQE) is generated in A's CQ to indicate that the send has completed. At the same time, a completion message is generated in B's CQ to indicate that the receive has completed. The completion of each WQE in a WQ generates one CQE.
      Specifically, RDMA Read requires the remote memory address and the local memory address to copy into to be specified. Before an RDMA Read operation, the remote program grants the local program permission to access the corresponding memory; after obtaining this permission, the local program performs the RDMA Read operation. For RDMA Read/Write operations, the remote program does not need to know whether the operation has finished.
      Specifically, RDMA Write is a Push operation that pushes data in local system memory into the memory of the remote system.
      Further, Read/Write is a one-sided operation, and the data flow is as follows:
      1) File system A and storage medium B establish a connection, and the QPs have been created and initialized, where the send queue (SQ) and receive queue (RQ) are created in pairs to form Queue Pairs (QP);
      2) The data is held at buffer address VA of file system A; VA is registered in advance with A's RNIC, and the returned local key is obtained;
      3) File system A encapsulates the data address VA and the key into a dedicated message and sends it to storage medium B; at the same time, file system A posts a WR in its WQ to receive the transfer status returned by storage medium B;
      4) After storage medium B receives the data address VA and R_key sent by file system A, its RNIC encapsulates them together with the storage address VB into an RDMA READ;
      5) After the storage completes, storage medium B returns the status of the whole transfer to file system A.
      Compared with the prior art, the present invention has at least the following beneficial effects:
      The communication method of the invention bypasses the operating system's network protocol stack and thereby avoids the performance bottleneck of a complex protocol stack. Remote memory is mapped to local memory, and the local device communicates with the remote device by operating on the mapped local memory, which greatly reduces network latency, improves network bandwidth and reduces CPU load.
      The specific advantages are as follows:
      1) The invention significantly reduces latency and jitter in high-speed bus communication and improves the reliability of data communication; latency is reduced from the tens of us of conventional communication to 1 us. With a network protocol stack, messages must be copied between kernel space and user space, and the interrupt handling of the network driver, thread context switches and lock synchronization all limit network efficiency, so communication latency is high. The invention registers memory and creates two keys, the local l_key and the remote r_key, which point to the memory region to be accessed. Memory regions carry different access permissions; once permission to access the peer's memory has been obtained, DMA read and write operations can be performed, and the data does not pass through the complex processing of the protocol stack. The invention also borrows the lock-free concurrent programming technique of the kernel's kfifo, which eliminates the time spent on lock synchronization, so communication latency naturally decreases.
      2) The invention reduces CPU involvement in communication, reduces CPU overhead and improves system CPU utilization. Earlier network drivers were based entirely on interrupt handling: frequent interrupts during communication cause frequent context switches and saving and restoring of the interrupted context, and every protocol stack step applied to the data is executed by the CPU, which consumes CPU resources heavily. The invention copies memory directly between the local and remote ends; the data path does not involve the CPU, which takes part only in the control path, so CPU usage is reduced.
      3) The communication process of the invention is zero-copy, which improves communication bandwidth. In earlier Linux network communication, data had to be copied between kernel space and user space; this extra copy reduced communication bandwidth. The invention uses remote DMA to deliver data directly to the remote user-space receiver without any data copy, which increases bandwidth.
      In summary, the data communication of the invention bypasses the Linux network protocol stack and transfers data by DMA directly from one device to another, which noticeably improves communication performance.
      The technical solution of the invention is described in further detail below with reference to the drawings and embodiments.
    Brief description of the drawings
      Fig. 1 is a schematic diagram of RDMA communication.
    Detailed description of the embodiments
      RDMA transfers data over the network directly into the peer's storage area. It does not involve the CPU of the peer computing platform, has no impact on the operating system, and does not consume the platform's computing capacity. It eliminates external memory copies and context switches, and thus frees bus bandwidth and CPU cycles for improving application performance. In current systems, incoming information is first analyzed and only then stored in the correct virtual memory region.
      Referring to Fig. 1, the communication method for reducing high-speed bus latency of the present invention comprises the following four parts:
      1. Memory registration
      RDMA transfers data from local memory to the peer's memory, so memory must be registered with the system before RDMA is used. Registered memory has the following two properties, and the RDMA controller uses physically contiguous addresses.
      S101, during data transfer, the application must not modify the memory holding the data;
      S102, the operating system must not page out the memory holding the data, so the mapping between physical and virtual addresses remains fixed;
      Each memory registration creates two keys, one for the local side and one for the remote side (l_key, r_key). The keys point to the memory region to be operated on, and the registered keys are part of the data transfer request. The local side uses l_key to access local memory, for example in an RDMA receive operation. The remote key is used by the remote RNIC to access system memory remotely. The same memory can be registered multiple times, even with different access permissions, but each registration generates a different set of keys.
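The registration rules above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: `ToyRnic`, `MemoryRegion` and the counter-based key values are hypothetical names invented for illustration and are not part of any real RDMA API. It shows one buffer being registered twice with different permissions, each registration yielding its own key pair.

```python
import itertools
from dataclasses import dataclass


@dataclass
class MemoryRegion:
    """Toy record for one registration of a buffer (hypothetical, not a verbs object)."""
    buf: bytearray
    l_key: int          # used by the local side to access the region
    r_key: int          # handed to the remote RNIC for remote access
    remote_write: bool  # access permission attached to this registration


class ToyRnic:
    """Hypothetical model of an RNIC's registration table."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self.regions = {}  # r_key -> MemoryRegion

    def register(self, buf, remote_write=False):
        # Every registration draws a fresh l_key and r_key.
        l_key = next(self._next_key)
        r_key = next(self._next_key)
        mr = MemoryRegion(buf, l_key, r_key, remote_write)
        self.regions[r_key] = mr
        return mr


rnic = ToyRnic()
data = bytearray(b"payload")
mr_ro = rnic.register(data)                     # remote side may only read
mr_rw = rnic.register(data, remote_write=True)  # same memory, different rights
```

A real RNIC would also pin the pages so that the address mapping stays fixed (S102); the toy model does not attempt to represent that.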
      2. Queue creation
      RDMA supports three kinds of queues in total: the send queue (SQ), the receive queue (RQ) and the completion queue (CQ). The SQ and RQ are usually created as a pair, called a Queue Pair (QP). To carry out RDMA operations, the QP mechanism is needed to establish the connection to the remote end and the corresponding operation permissions. This mechanism is similar to the standard IP protocol stack, and a QP is similar to a socket. The QP is used to initialize the connection at both ends. Before a QP is established, communication management is used to exchange information about the QP. Once the QP is set up, RDMA Write/Read operations can be performed through the Verbs API, as can serialized Send/Receive operations similar to socket reads/writes.
      RDMA is a message-based transport protocol, and all data transfers are asynchronous. RDMA operates as follows:
      S201, the Host submits a work request (WR) to a work queue (WQ);
      The work queues comprise the send queue (SQ) and the receive queue (RQ). Each element of a work queue is called a WQE, that is, a WR.
      S202, the Host obtains work completions (WC) from the completion queue (CQ);
      Each element of the completion queue is called a CQE, that is, a WC.
      S203, the hardware with the RDMA engine is simply a queue element processor.
      The RDMA hardware continuously takes work requests (WR) from the work queue (WQ) and executes them, and when execution finishes it places a work completion (WC) in the completion queue (CQ).
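Steps S201 to S203 can be pictured as a producer/consumer loop over queues. The following Python sketch is a simplified simulation under assumptions: the class and function names (`WorkRequest`, `ToyQueuePair`, `hardware_step`, `poll_cq`) are hypothetical, and the "hardware" only moves entries between queues instead of transferring data.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:      # WR; once queued it plays the role of a WQE
    wr_id: int
    opcode: str         # e.g. "SEND" or "RECV"
    buffer: bytearray


@dataclass
class WorkCompletion:   # WC; an entry in the CQ is a CQE
    wr_id: int
    status: str


class ToyQueuePair:
    """A QP is a send queue plus a receive queue, reporting to a CQ."""

    def __init__(self, cq):
        self.sq = deque()
        self.rq = deque()
        self.cq = cq

    def post_send(self, wr):   # S201: host submits a WR to the SQ
        self.sq.append(wr)

    def post_recv(self, wr):   # S201: host submits a WR to the RQ
        self.rq.append(wr)


def hardware_step(qp):
    """S203: take one WQE from the SQ, 'execute' it, place a WC in the CQ."""
    if not qp.sq:
        return False
    wqe = qp.sq.popleft()
    qp.cq.append(WorkCompletion(wqe.wr_id, "SUCCESS"))
    return True


def poll_cq(cq):
    """S202: host drains the completions that are currently available."""
    done = list(cq)
    cq.clear()
    return done


cq = deque()
qp = ToyQueuePair(cq)
qp.post_send(WorkRequest(1, "SEND", bytearray(b"a")))
qp.post_send(WorkRequest(2, "SEND", bytearray(b"b")))
while hardware_step(qp):
    pass
completions = poll_cq(cq)
```

In this model the host never blocks on a transfer: it posts requests, continues with other work, and later collects the completions, which is the asynchronous behaviour the text describes.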
      3. RDMA data transfer
      S301, RDMA Send/Receive
      This is similar to send/recv in TCP/IP. The difference is that RDMA is a message-based data transport protocol: all packet assembly is done on the RDMA hardware, and the lower four layers of the OSI model (transport, network, data link and physical) are all handled on the RDMA hardware.
      S302, RDMA Read
      The remote memory address and the local memory address to copy into must be specified. Before an RDMA Read operation, the remote program grants the local program permission to access the corresponding memory. Once the local program has obtained this permission it can perform the RDMA Read operation, and it does not need to notify the remote program of the result. For RDMA Read/Write operations, the remote program does not need to know whether the operation has finished.
      An RDMA Read is essentially a Pull operation that pulls data from remote system memory back into local system memory.
      S303, RDMA Write
      Similar to RDMA Read, an RDMA Write is essentially a Push operation that pushes data in local system memory into the memory of the remote system.
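The Pull/Push distinction can be sketched in a few lines of Python. This is a toy model under assumptions: `ToyRemoteMemory`, `rdma_read` and `rdma_write` are hypothetical helpers, the key value is arbitrary, and a real transfer is performed by the RNIC rather than by function calls. Note that no code runs on the "remote" side; the only precondition is that the caller holds a valid key.

```python
class ToyRemoteMemory:
    """Hypothetical stand-in for a remote registered memory region."""

    def __init__(self, size, r_key, allow_write):
        self.mem = bytearray(size)
        self.r_key = r_key
        self.allow_write = allow_write


def rdma_write(remote, r_key, offset, local_buf):
    """Push: copy local bytes into remote memory."""
    if r_key != remote.r_key or not remote.allow_write:
        raise PermissionError("remote access not granted")
    if offset + len(local_buf) > len(remote.mem):
        raise ValueError("write exceeds the registered region")
    remote.mem[offset:offset + len(local_buf)] = local_buf


def rdma_read(remote, r_key, offset, length):
    """Pull: copy remote bytes into a new local buffer."""
    if r_key != remote.r_key:
        raise PermissionError("remote access not granted")
    return bytes(remote.mem[offset:offset + length])


remote = ToyRemoteMemory(16, r_key=0x1234, allow_write=True)
rdma_write(remote, 0x1234, 4, b"data")    # push local data to the remote side
pulled = rdma_read(remote, 0x1234, 4, 4)  # pull it back into local memory
```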
      To make the purpose, technical solution and advantages of the embodiments of the present invention clearer, the technical solution in the embodiments is described clearly and completely below with reference to the drawings of the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the invention. The components of the embodiments generally described and shown in the drawings can be arranged and designed in a variety of different configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
      RDMA provides point-to-point communication based on message queues. Each application can obtain its own messages directly, without the intervention of the operating system or the network protocol stack.
      The messaging service is built on a channel-IO connection created between the local and remote applications of the two communicating parties. When an application needs to communicate, a channel connection is created. The two endpoints of each channel are two Queue Pairs (QPs); each QP consists of a Send Queue (SQ) and a Receive Queue (RQ), and these queues manage the various types of messages. A QP is mapped into the application's virtual address space, so that the application accesses the RDMA NIC directly through it. RDMA also provides the Complete Queue (CQ), which is used to notify the application that messages on a WQ have been processed. RDMA provides a software transport interface with which the user creates a Work Request (WR). The WR describes the message content that the application wishes to transfer to the peer, and it is posted to a Work Queue (WQ) in the QP. In the WQ, the user's WR is converted into the Work Queue Element (WQE) format, waits to be asynchronously scheduled and parsed by the RDMA NIC, and the actual message is fetched from the buffer that the WQE points to and sent to the peer.
      Send/receive in RDMA is a two-sided operation: the remote application must be aware of the transfer and take part in it for the exchange to complete. Read/write is a one-sided operation: only the local end needs to know the source and destination addresses of the message, the remote application does not need to be aware of the communication, and the reading or writing of data is carried out by RDMA between the NIC and the application's buffer. In practice, send/receive is mostly used for connection-control messages, while the actual data messages are transferred with write/read.
      For a two-sided operation, host A sends data to host B as follows:
      1) First, A and B each create and initialize their own QP and CQ.
      2) A and B each post a WQE to their own WQ. For A, WQ = SQ, and the WQE points to the data waiting to be sent; for B, WQ = RQ, and the WQE points to a buffer in which data will be stored.
      3) A's RNIC asynchronously schedules A's WQE, parses it as a SEND message, and sends the data directly from the buffer to B. When the data stream reaches B's RNIC, B's WQE is consumed and the data is stored directly at the location the WQE points to.
      4) After the communication between A and B completes, a completion message (CQE) is generated in A's CQ to indicate that the send has completed. At the same time, a completion message is generated in B's CQ to indicate that the receive has completed. The completion of each WQE in a WQ generates one CQE.
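The four steps can be walked through with a minimal Python simulation. It is a sketch under assumptions: `ToyHost` and `deliver` are hypothetical names, queue entries are reduced to the buffers they point to, and the transfer between the two RNICs is modelled as a plain copy.

```python
from collections import deque


class ToyHost:
    def __init__(self, name):
        self.name = name
        # Step 1: each host creates its own queues (SQ + RQ form the QP).
        self.sq, self.rq, self.cq = deque(), deque(), deque()


def deliver(sender, receiver):
    """Steps 3 and 4: the sender's RNIC processes one send WQE."""
    send_buf = sender.sq.popleft()           # WQE points at the data to send
    if not receiver.rq:
        raise RuntimeError("receiver has not posted a receive buffer")
    recv_buf = receiver.rq.popleft()         # receiver's WQE is consumed
    if len(send_buf) > len(recv_buf):
        raise ValueError("posted receive buffer is too small")
    recv_buf[:len(send_buf)] = send_buf      # data lands in the posted buffer
    sender.cq.append("send complete")        # CQE on the sender
    receiver.cq.append("receive complete")   # CQE on the receiver


a, b = ToyHost("A"), ToyHost("B")
inbox = bytearray(5)
b.rq.append(inbox)      # step 2: B posts a receive buffer (WQ = RQ)
a.sq.append(b"hello")   # step 2: A posts the data to send (WQ = SQ)
deliver(a, b)
```

The `RuntimeError` branch reflects why the operation is called two-sided: unless B has posted a receive buffer beforehand, A's send has nowhere to land.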
      A two-sided operation resembles the underlying buffer pool of a traditional network, and the way sender and receiver participate is no different; the differences lie in zero-copy and kernel bypass. For RDMA this is in practice a relatively complex message transfer mode, and it is mostly used to transfer short control messages.
      For a one-sided operation, taking storage in a storage network environment as an example (A acts as the file system and B as the storage medium), the data flow is as follows:
      1) First, A and B establish a connection, and the QPs have been created and initialized.
      2) The data is held at buffer address VA on A. Note that VA must be registered in advance with A's RNIC, and the returned local key must be obtained; this key amounts to the permission for RDMA to operate on this buffer.
      3) A encapsulates the data address VA and the key into a dedicated message and sends it to B, which amounts to A handing B the right to operate on the data buffer. At the same time, A posts a WR in its WQ to receive the transfer status returned by B.
      4) After B receives the data address VA and R_key sent by A, its RNIC encapsulates them together with the storage address VB into an RDMA READ. This process requires no software involvement on either A or B, and A's data is stored at B's virtual address VB.
      5) After the storage completes, B returns the status of the whole transfer to A.
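The one-sided flow can be modelled in Python as follows. This is a hedged sketch: `ToyNode`, the addresses `0x1000` and `0x2000`, and the key value `42` are hypothetical placeholders, the control messages of steps 3 and 5 are represented by a plain list, and the RDMA READ of step 4 is a function call rather than an RNIC operation.

```python
class ToyNode:
    """Hypothetical model of one endpoint (file system A or storage medium B)."""

    def __init__(self):
        self.memory = {}   # virtual address -> bytearray
        self.keys = {}     # r_key -> registered virtual address
        self.inbox = []    # control messages received via send/receive

    def register(self, va, r_key):
        self.keys[r_key] = va
        return r_key


def rdma_read(target, va, r_key):
    """Pull the bytes at `va` from `target`, if the key matches the address."""
    if target.keys.get(r_key) != va:
        raise PermissionError("invalid key for this address")
    return bytes(target.memory[va])


a, b = ToyNode(), ToyNode()                 # step 1: connection established
VA, VB = 0x1000, 0x2000                     # hypothetical virtual addresses
a.memory[VA] = bytearray(b"file block")     # step 2: data sits in A's buffer
key = a.register(VA, r_key=42)              # step 2: register VA, obtain key
b.inbox.append((VA, key))                   # step 3: A sends (VA, key) to B
va, r_key = b.inbox.pop()                   # step 4: B receives the message
b.memory[VB] = bytearray(rdma_read(a, va, r_key))  # step 4: RDMA READ into VB
a.inbox.append("transfer complete")         # step 5: B reports status to A
```

Only the small (VA, key) message and the final status travel as two-sided control messages; the bulk data moves in the single read, which matches the division of labour between send/receive and read/write described earlier.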
      The one-sided transfer mode is the biggest difference between RDMA and conventional network transfer. Only direct access to the remote virtual address needs to be provided, without the participation of the remote application, which makes it suitable for bulk data transfer.
      Hardware environment:
      Server: Intel(R) Xeon(R) CPU E5-2648L v4 @ 1.80 GHz, 14 CPU cores × 2 NUMA nodes;
      Memory: 64 GB, 2 × 8 GB DIMMs × 2 NUMA nodes, 2133 MHz;
      NIC: Mellanox ConnectX-4 40GbE;
      Software environment:
      Operating system: CentOS 7.1;
      Kernel version: 3.10.0-229.el7.x86_64;
      Firmware version: 12_18_2000
      Experimental conclusion:
      With the hardware offload functions of the NIC disabled, data communication latency was reduced from 25 us to 5 us, and communication bandwidth increased from 16 Gbps to 36 Gbps.
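For reference, the relative improvement implied by these figures can be computed directly. The short Python calculation below uses only the four numbers reported above; the variable names are illustrative.

```python
latency_before_us, latency_after_us = 25, 5
bw_before_gbps, bw_after_gbps = 16, 36

# Fraction of the original latency that was removed.
latency_reduction = (latency_before_us - latency_after_us) / latency_before_us
# Ratio of the new bandwidth to the original bandwidth.
bandwidth_gain = bw_after_gbps / bw_before_gbps

print(f"latency reduced by {latency_reduction:.0%}")   # 80%
print(f"bandwidth increased {bandwidth_gain:.2f}x")    # 2.25x
```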
      The above merely illustrates the technical idea of the present invention and does not limit its scope of protection. Any change made on the basis of the technical solution in accordance with the technical idea proposed by the invention falls within the scope of protection of the claims of the present invention.
    Claims (10)
1. A communication method for reducing high-speed bus delay, characterized in that: the system registers memory and uses RDMA to transfer data from local memory to the memory of the peer; a Send Queue (SQ), a Receive Queue (RQ) and a Completion Queue (CQ) are created; RDMA Send/Receive, RDMA Read and RDMA Write are then carried out; RDMA provides point-to-point communication based on message queues, and each application obtains its own messages directly; the messaging service is built on the Channel-IO connection created between the local and remote applications of the two communicating parties; when an application needs to communicate, a channel connection is created; the two endpoints of each channel are two Queue Pairs (QP), each QP consisting of a Send Queue (SQ) and a Receive Queue (RQ); the QP is mapped into the virtual address space of the application; the Completion Queue (CQ) provided by RDMA is used to notify the application that messages on a Work Queue (WQ) have been processed; the user creates a transmission request, a Work Request (WR), which describes the message content that the application wishes to transfer to the peer; the WR is posted to one of the queues (WQ) in the QP; in the WQ, the user's WR is converted into the format of a Work Queue Element (WQE) and waits for asynchronous scheduling and parsing by the RDMA network interface card, which takes the real message from the buffer that the WQE points to and sends it to the peer.
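The queue structure recited in claim 1 can be pictured with a small in-memory model. This is a minimal sketch, not the claimed implementation: the class names, the `post_send` and `nic_schedule` helpers and the string placed in the completion queue are all hypothetical, and the network interface card is replaced by an ordinary function call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkRequest:
    """User-level request describing the message to transfer."""
    buffer: bytearray   # memory holding the message
    length: int         # number of bytes to send


@dataclass
class WorkQueueElement:
    """Queue-internal form of a WR; it references the buffer, it does not copy it."""
    buffer: bytearray
    length: int


@dataclass
class QueuePair:
    """One channel endpoint: a Send Queue plus a Receive Queue."""
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def post_send(self, wr: WorkRequest) -> None:
        # The WR is converted to a WQE and waits for the NIC to schedule it.
        self.send_queue.append(WorkQueueElement(wr.buffer, wr.length))


def nic_schedule(qp: QueuePair, completion_queue: deque) -> bytes:
    """Stand-in for the RDMA NIC: parse one WQE and fetch the real message."""
    wqe = qp.send_queue.popleft()
    message = bytes(wqe.buffer[:wqe.length])   # read from the buffer the WQE points to
    completion_queue.append("send complete")   # notify the application through the CQ
    return message


qp = QueuePair()
cq = deque()
buf = bytearray(b"hello")
qp.post_send(WorkRequest(buf, len(buf)))
buf[0:1] = b"J"                 # the buffer is changed before the NIC runs
sent = nic_schedule(qp, cq)
print(sent, list(cq))
```

Because the WQE only points at the buffer, the message is read when the NIC schedules the element, so the change made after posting is what gets sent in this model.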
    2. The communication method for reducing high-speed bus delay according to claim 1, characterized in that, for registered memory, during data transmission the application program cannot modify the memory where the data resides; the operating system cannot page out the memory where the data resides, and the mapping between physical addresses and virtual addresses remains fixed.
    3. The communication method for reducing high-speed bus delay according to claim 2, characterized in that each memory registration creates two keys, a local key and a remote key; the keys point to the memory region to be operated on, and the registered keys are part of the data transmission request; the local side uses l_key to access local memory, and the remote key is used by the remote RNIC to access system memory; the same memory can be registered multiple times with different access permissions, and each registration generates a different key set.
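The registration behaviour in claims 2 and 3 (a key pair per registration, repeated registration of the same memory with different permissions) can be sketched as follows. This is an illustrative model only: the `Registrar` class, the permission constants and the sequential key numbering are assumptions, not the RNIC's actual key format.

```python
import itertools
from dataclasses import dataclass

REMOTE_READ = 1
REMOTE_WRITE = 2


@dataclass(frozen=True)
class MemoryRegion:
    start: int     # offset of the region inside host memory
    length: int
    access: int    # bitmask of permissions granted at registration
    l_key: int     # used by the local side
    r_key: int     # handed to the remote side


class Registrar:
    """Hypothetical registry that issues a fresh key pair for every registration."""

    def __init__(self) -> None:
        self._next_key = itertools.count(1)
        self._by_r_key = {}

    def register(self, start: int, length: int, access: int) -> MemoryRegion:
        mr = MemoryRegion(start, length, access,
                          l_key=next(self._next_key), r_key=next(self._next_key))
        self._by_r_key[mr.r_key] = mr
        return mr

    def check_remote(self, r_key: int, offset: int, length: int, need: int) -> bool:
        """Return True if a remote access with this key would be allowed."""
        mr = self._by_r_key.get(r_key)
        if mr is None or (mr.access & need) != need:
            return False
        return mr.start <= offset and offset + length <= mr.start + mr.length


reg = Registrar()
read_only = reg.register(0, 4096, REMOTE_READ)
read_write = reg.register(0, 4096, REMOTE_READ | REMOTE_WRITE)  # same memory, new keys
print(read_only.r_key != read_write.r_key)
print(reg.check_remote(read_only.r_key, 0, 64, REMOTE_WRITE))
print(reg.check_remote(read_write.r_key, 0, 64, REMOTE_WRITE))
```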
    4. The communication method for reducing high-speed bus delay according to claim 1, characterized in that RDMA is a message-based transport protocol, data transmission is an asynchronous operation, and RDMA operates as follows:
      S201: the host submits a work request (WR) to a work queue (WQ);
      the work queue comprises the send queue (SQ) and the receive queue (RQ), and each element of a work queue is called a WQE, that is, a WR;
      S202: the host obtains a work completion (WC) from the completion queue (CQ);
      S203: the hardware with an RDMA engine acts as a queue element processor;
      the RDMA hardware takes work requests (WR) from the work queue (WQ) and executes them, and after execution places a work completion (WC) in the completion queue (CQ).
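Steps S201 to S203 describe an asynchronous submit-then-poll pattern, which can be sketched as below. The `RdmaEngine` class and its `post`, `run_once` and `poll` methods are hypothetical names chosen for illustration; real hardware runs concurrently with the host, whereas here the engine is stepped by hand so the ordering is easy to follow.

```python
from collections import deque


class RdmaEngine:
    """Toy queue-element processor standing in for the RDMA hardware (S203)."""

    def __init__(self):
        self.work_queue = deque()        # holds WQEs submitted by the host
        self.completion_queue = deque()  # holds work completions for the host

    def post(self, wr_id, payload):
        # S201: the host submits a work request and returns immediately.
        self.work_queue.append((wr_id, payload))

    def run_once(self):
        # S203: the hardware takes one WR, executes it, then places a WC in the CQ.
        if not self.work_queue:
            return
        wr_id, payload = self.work_queue.popleft()
        self.completion_queue.append(
            {"wr_id": wr_id, "status": "ok", "bytes": len(payload)})

    def poll(self):
        # S202: the host fetches a work completion, or None if nothing has finished.
        return self.completion_queue.popleft() if self.completion_queue else None


engine = RdmaEngine()
engine.post(1, b"abc")
first_poll = engine.poll()     # nothing completed yet: the operation is asynchronous
engine.run_once()
second_poll = engine.poll()
print(first_poll, second_poll)
```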
    5. The communication method for reducing high-speed bus delay according to claim 4, characterized in that the send queue (SQ) and the receive queue (RQ) are created in pairs to form a Queue Pair (QP); before the QP is established, communication management is used to exchange information about the QP; after the QP is established, RDMA Write/Read operations are carried out through the Verbs API, or serialized Send/Receive operations similar to Socket Reads/Writes are carried out.
    6. The communication method for reducing high-speed bus delay according to claim 1, characterized in that RDMA Send/Receive is a message-based data transport protocol; the assembly of all data packets is completed on the RDMA hardware, and the transport layer, network layer, data link layer and physical layer of the OSI model are all implemented on the RDMA hardware.
    7. The communication method for reducing high-speed bus delay according to claim 6, characterized in that RDMA Send/Receive is a two-sided operation, and the process of sending data is as follows:
      1) host A and host B each create and initialize their own QP and completion queue (CQ), where a send queue (SQ) and a receive queue (RQ) are created in pairs to form a Queue Pair (QP);
      2) host A and host B each register a WQE in their own WQ; for host A, WQ = SQ and the WQE points to data waiting to be sent; for host B, WQ = RQ and the WQE points to a buffer for storing data;
      3) the RNIC of host A schedules asynchronously; when the WQE of A is reached, it is parsed as a SEND message and the data is sent directly from the buffer to host B; when the data stream reaches the RNIC of host B, the WQE of host B is consumed and the data is stored directly at the storage location that the WQE points to;
      4) after the communication between host A and host B completes, a completion message (CQE) is generated in the CQ of A to indicate that sending is complete; at the same time, a completion message is generated in the CQ of B to indicate that receiving is complete; the completed processing of each WQE in a WQ generates one CQE.
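The four steps of the two-sided Send/Receive flow can be traced with a small simulation. This is a sketch under simplifying assumptions: the `Host` class and the `deliver` function are hypothetical, both RNICs are collapsed into one function, and the wire is replaced by a direct copy.

```python
from collections import deque


class Host:
    def __init__(self, name):
        self.name = name
        self.sq = deque()   # Send Queue
        self.rq = deque()   # Receive Queue
        self.cq = deque()   # Completion Queue


def deliver(sender: Host, receiver: Host) -> None:
    """Stand-in for both RNICs: move one message from the sender's SQ to a posted buffer."""
    send_wqe = sender.sq.popleft()            # step 3: A's WQE is scheduled
    if not receiver.rq:
        raise RuntimeError("receiver has not posted a receive buffer")
    recv_buf = receiver.rq.popleft()          # step 3: B's WQE is consumed
    data = send_wqe["buffer"]
    recv_buf[:len(data)] = data               # data lands in the buffer B posted
    sender.cq.append("send complete")         # step 4: CQE on A
    receiver.cq.append("receive complete")    # step 4: CQE on B


a, b = Host("A"), Host("B")                   # step 1: queues created on both hosts
landing = bytearray(16)
b.rq.append(landing)                          # step 2: B posts a receive buffer
a.sq.append({"buffer": b"ping"})              # step 2: A posts the data to send
deliver(a, b)
print(bytes(landing[:4]), list(a.cq), list(b.cq))
```

The receive buffer must be posted before the send is processed, which is why the operation is called two-sided: both hosts take part in every transfer.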
    8. The communication method for reducing high-speed bus delay according to claim 1, characterized in that RDMA Read requires specifying the remote memory address and the local memory address to copy into; before the RDMA Read operation, the remote program grants the corresponding memory access permission to the local program, and the local program carries out the RDMA Read operation after obtaining the access permission; for RDMA Read/Write operations, the remote program does not need to be aware of whether the operation has finished.
    9. The communication method for reducing high-speed bus delay according to claim 1, characterized in that RDMA Write is a push operation that pushes data in local system memory into the memory of the remote system.
    10. The communication method for reducing high-speed bus delay according to claim 8 or claim 9, characterized in that Read/Write are one-sided operations, and the data flow is as follows:
      1) file system A and storage medium B establish a connection, and the QPs have been created and initialized, where a send queue (SQ) and a receive queue (RQ) are created in pairs to form a Queue Pair (QP);
      2) the data is stored in the buffer at address VA of file system A; VA is registered in advance with the RNIC of A, and the returned local key is obtained;
      3) file system A encapsulates the data address VA and the key into a dedicated message and sends it to storage medium B; at the same time, file system A registers a WR in its WQ to receive the data-transfer status returned by storage medium B;
      4) after storage medium B receives the data address VA and the R_key sent by file system A, its RNIC encapsulates them, together with the storage address VB, into an RDMA READ;
      5) storage medium B returns the status information of the whole data transfer to file system A after storage is complete.
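The one-sided flow in claim 10 can be sketched as below, with storage medium B pulling the data out of file system A's registered buffer. The `Node` class, the `rdma_read` function, the key value and the fact that the message length travels with the address and key are assumptions made for illustration; the claim itself only names VA, the key and VB.

```python
class Node:
    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self.registered = {}   # key -> (address, length) of a registered region
        self.inbox = []        # messages received through Send/Receive

    def register(self, key: int, address: int, length: int) -> None:
        self.registered[key] = (address, length)


def rdma_read(reader: Node, target: Node, remote_addr: int, key: int,
              local_addr: int, length: int) -> None:
    """Toy one-sided READ: copy from target memory without running target software."""
    base, size = target.registered[key]       # a KeyError models an invalid key
    if not (base <= remote_addr and remote_addr + length <= base + size):
        raise PermissionError("access outside the registered region")
    reader.memory[local_addr:local_addr + length] = \
        target.memory[remote_addr:remote_addr + length]


fs_a, storage_b = Node(64), Node(64)
VA, VB, KEY, data = 8, 0, 0x1234, b"payload"
fs_a.memory[VA:VA + len(data)] = data               # step 2: data sits in A's buffer at VA
fs_a.register(KEY, VA, len(data))                   # step 2: VA registered, key obtained
storage_b.inbox.append((VA, KEY, len(data)))        # step 3: A sends address and key to B
addr, key, length = storage_b.inbox.pop()
rdma_read(storage_b, fs_a, addr, key, VB, length)   # step 4: B pulls the data into VB
fs_a.inbox.append("transfer complete")              # step 5: B reports status back to A
print(bytes(storage_b.memory[VB:VB + length]), fs_a.inbox)
```

Only the control messages in steps 3 and 5 involve software on both sides; the data movement in step 4 touches A's memory without any code running on A in this model.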
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201811341410.7A CN109491809A (en) | 2018-11-12 | 2018-11-12 | A kind of communication means reducing high-speed bus delay | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201811341410.7A CN109491809A (en) | 2018-11-12 | 2018-11-12 | A kind of communication means reducing high-speed bus delay | 
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| CN109491809A (en) | 2019-03-19 | 
Family
ID=65695762
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN201811341410.7A Pending CN109491809A (en) | 2018-11-12 | 2018-11-12 | A kind of communication means reducing high-speed bus delay | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN109491809A (en) | 
Cited By (34)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN110149262A (en) * | 2019-04-02 | 2019-08-20 | 视联动力信息技术股份有限公司 | A kind for the treatment of method and apparatus and storage medium of signaling message | 
| CN110519242A (en) * | 2019-08-13 | 2019-11-29 | 新华三大数据技术有限公司 | Data transmission method and device | 
| CN111064680A (en) * | 2019-11-22 | 2020-04-24 | 华为技术有限公司 | A communication device and data processing method | 
| CN111752728A (en) * | 2020-06-30 | 2020-10-09 | 中国科学技术大学 | Message transmission method and device | 
| CN112003860A (en) * | 2020-08-21 | 2020-11-27 | 上海交通大学 | Memory management method, system and medium suitable for remote direct memory access | 
| WO2021013046A1 (en) * | 2019-07-19 | 2021-01-28 | 华为技术有限公司 | Communication method and network card | 
| CN112383443A (en) * | 2020-09-22 | 2021-02-19 | 北京航空航天大学 | Parallel application communication performance prediction method running in RDMA communication environment | 
| CN112732166A (en) * | 2019-10-28 | 2021-04-30 | 华为技术有限公司 | Method and device for accessing solid state disk | 
| CN113064846A (en) * | 2021-04-14 | 2021-07-02 | 中南大学 | Zero-copy data transmission method based on Rsockets protocol | 
| CN113518082A (en) * | 2021-06-24 | 2021-10-19 | 深之蓝(天津)水下智能科技有限公司 | Message processing method, electronic equipment and storage medium | 
| CN113626216A (en) * | 2021-07-23 | 2021-11-09 | 济南浪潮数据技术有限公司 | Method and system for optimizing network application performance based on remote direct data access | 
| CN113746897A (en) * | 2021-07-28 | 2021-12-03 | 浪潮电子信息产业股份有限公司 | A file transmission method, device, device and storage medium | 
| CN113886295A (en) * | 2020-07-02 | 2022-01-04 | 北京瀚海云星科技有限公司 | Efficient and low-delay transmission method for label data, and related device and system | 
| WO2022021988A1 (en) * | 2020-07-31 | 2022-02-03 | 华为技术有限公司 | Network interface card, storage apparatus, message receiving method and sending method | 
| CN114090481A (en) * | 2020-07-02 | 2022-02-25 | 北京瀚海云星科技有限公司 | Data sending method, data receiving method and related device | 
| CN114090483A (en) * | 2021-09-30 | 2022-02-25 | 上海浦东发展银行股份有限公司 | Protocol-based RDMA (remote direct memory Access) communication method and device and storage medium | 
| CN114201313A (en) * | 2021-12-07 | 2022-03-18 | 杭州时代银通软件股份有限公司 | Message transmission system and message transmission method | 
| CN114201317A (en) * | 2021-12-16 | 2022-03-18 | 北京有竹居网络技术有限公司 | Data transmission method, device, storage medium and electronic device | 
| CN114490462A (en) * | 2020-10-28 | 2022-05-13 | 华为技术有限公司 | Network interface card, controller, storage device and message sending method | 
| CN114584492A (en) * | 2022-02-15 | 2022-06-03 | 珠海星云智联科技有限公司 | Time delay measuring method, system and related equipment | 
| WO2022142562A1 (en) * | 2020-12-31 | 2022-07-07 | 中兴通讯股份有限公司 | Rdma-based communication method, node, system, and medium | 
| CN114827234A (en) * | 2022-04-29 | 2022-07-29 | 广东浪潮智慧计算技术有限公司 | Data transmission method, system, device and storage medium | 
| CN114979001A (en) * | 2022-05-20 | 2022-08-30 | 北京百度网讯科技有限公司 | Data transmission method, device and equipment based on remote direct data access | 
| WO2022179417A1 (en) * | 2021-02-24 | 2022-09-01 | 华为技术有限公司 | Network interface card, message transceiving method, and storage apparatus | 
| CN115639954A (en) * | 2022-09-09 | 2023-01-24 | 苏州浪潮智能科技有限公司 | Data transmission method, device, equipment and medium | 
| CN115861082A (en) * | 2023-03-03 | 2023-03-28 | 无锡沐创集成电路设计有限公司 | Low-delay picture splicing system and method | 
| CN118069387A (en) * | 2023-12-01 | 2024-05-24 | 中科驭数(北京)科技有限公司 | A method and device for managing RDMA data transmission queue based on hardware multithreading | 
| CN118093499A (en) * | 2024-02-06 | 2024-05-28 | 贝格迈思(深圳)技术有限公司 | Data transmission method, device, equipment and storage medium for remote memory access | 
| WO2024140375A1 (en) * | 2022-12-27 | 2024-07-04 | 华为技术有限公司 | Storage device, and data communication method and system | 
| CN119052344A (en) * | 2024-07-26 | 2024-11-29 | 浙江大学 | Network transmission method based on lock-free queue | 
| CN119201831A (en) * | 2024-11-27 | 2024-12-27 | 广州壁仞集成电路有限公司 | Processor, electronic device, and data communication method | 
| CN119621651A (en) * | 2025-02-07 | 2025-03-14 | 浙江大学 | RDMA network batch task processing method and device based on selective signal | 
| CN119675892A (en) * | 2024-10-25 | 2025-03-21 | 深圳强基计算技术有限公司 | Method for safely using cross-bus domain RNIC devices to perform RDMA operations | 
| CN120017602A (en) * | 2025-04-16 | 2025-05-16 | 中国人民解放军国防科技大学 | A virtual channel scheduling system for RDMA transmission | 
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US6948004B2 (en) * | 2001-03-28 | 2005-09-20 | Intel Corporation | Host-fabric adapter having work queue entry (WQE) ring hardware assist (HWA) mechanism | 
| US8244826B2 (en) * | 2007-10-23 | 2012-08-14 | International Business Machines Corporation | Providing a memory region or memory window access notification on a system area network | 
| CN103562882A (en) * | 2011-05-16 | 2014-02-05 | 甲骨文国际公司 | System and method for providing a messaging application program interface | 
| CN105408880A (en) * | 2013-07-31 | 2016-03-16 | 甲骨文国际公司 | Direct access to persistent memory of shared storage | 
| CN107888657A (en) * | 2017-10-11 | 2018-04-06 | 上海交通大学 | Low latency distributed memory system | 
| CN108710638A (en) * | 2018-04-13 | 2018-10-26 | 上海交通大学 | A kind of Distributed concurrency control method and system based on mixing RDMA operation | 
- 2018-11-12: CN application CN201811341410.7A (published as CN109491809A), status: active, pending
Non-Patent Citations (2)
| Title | 
|---|
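| HARDY: "详解RDMA(远程直接内存访问)架构原理" [Detailed explanation of the RDMA (Remote Direct Memory Access) architecture principle], 《架构师技术联盟》 [Architect Technology Alliance] * | 
| 王之: "面向数据中心的RDMA高速网络服务通用平台" [A general platform for RDMA high-speed network services for data centers], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] * | 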
Cited By (44)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN110149262B (en) * | 2019-04-02 | 2021-03-12 | 视联动力信息技术股份有限公司 | Method and device for processing signaling message and storage medium | 
| CN110149262A (en) * | 2019-04-02 | 2019-08-20 | 视联动力信息技术股份有限公司 | A kind for the treatment of method and apparatus and storage medium of signaling message | 
| US11431624B2 (en) | 2019-07-19 | 2022-08-30 | Huawei Technologies Co., Ltd. | Communication method and network interface card | 
| WO2021013046A1 (en) * | 2019-07-19 | 2021-01-28 | 华为技术有限公司 | Communication method and network card | 
| CN110519242A (en) * | 2019-08-13 | 2019-11-29 | 新华三大数据技术有限公司 | Data transmission method and device | 
| CN112732166A (en) * | 2019-10-28 | 2021-04-30 | 华为技术有限公司 | Method and device for accessing solid state disk | 
| CN111064680B (en) * | 2019-11-22 | 2022-05-17 | 华为技术有限公司 | Communication device and data processing method | 
| CN111064680A (en) * | 2019-11-22 | 2020-04-24 | 华为技术有限公司 | A communication device and data processing method | 
| CN111752728B (en) * | 2020-06-30 | 2022-09-06 | 中国科学技术大学 | Message transmission method and device | 
| CN111752728A (en) * | 2020-06-30 | 2020-10-09 | 中国科学技术大学 | Message transmission method and device | 
| CN113886295A (en) * | 2020-07-02 | 2022-01-04 | 北京瀚海云星科技有限公司 | Efficient and low-delay transmission method for label data, and related device and system | 
| CN114090481A (en) * | 2020-07-02 | 2022-02-25 | 北京瀚海云星科技有限公司 | Data sending method, data receiving method and related device | 
| WO2022021988A1 (en) * | 2020-07-31 | 2022-02-03 | 华为技术有限公司 | Network interface card, storage apparatus, message receiving method and sending method | 
| CN112003860A (en) * | 2020-08-21 | 2020-11-27 | 上海交通大学 | Memory management method, system and medium suitable for remote direct memory access | 
| CN112383443B (en) * | 2020-09-22 | 2022-06-14 | 北京航空航天大学 | A Communication Performance Prediction Method for Parallel Applications Running in RDMA Communication Environment | 
| CN112383443A (en) * | 2020-09-22 | 2021-02-19 | 北京航空航天大学 | Parallel application communication performance prediction method running in RDMA communication environment | 
| CN114490462A (en) * | 2020-10-28 | 2022-05-13 | 华为技术有限公司 | Network interface card, controller, storage device and message sending method | 
| WO2022142562A1 (en) * | 2020-12-31 | 2022-07-07 | 中兴通讯股份有限公司 | Rdma-based communication method, node, system, and medium | 
| WO2022179417A1 (en) * | 2021-02-24 | 2022-09-01 | 华为技术有限公司 | Network interface card, message transceiving method, and storage apparatus | 
| CN113064846A (en) * | 2021-04-14 | 2021-07-02 | 中南大学 | Zero-copy data transmission method based on Rsockets protocol | 
| CN113518082B (en) * | 2021-06-24 | 2021-12-17 | 深之蓝(天津)水下智能科技有限公司 | Message processing method, electronic equipment and storage medium | 
| CN113518082A (en) * | 2021-06-24 | 2021-10-19 | 深之蓝(天津)水下智能科技有限公司 | Message processing method, electronic equipment and storage medium | 
| CN113626216A (en) * | 2021-07-23 | 2021-11-09 | 济南浪潮数据技术有限公司 | Method and system for optimizing network application performance based on remote direct data access | 
| CN113746897A (en) * | 2021-07-28 | 2021-12-03 | 浪潮电子信息产业股份有限公司 | A file transmission method, device, device and storage medium | 
| CN114090483A (en) * | 2021-09-30 | 2022-02-25 | 上海浦东发展银行股份有限公司 | Protocol-based RDMA (remote direct memory Access) communication method and device and storage medium | 
| CN114201313A (en) * | 2021-12-07 | 2022-03-18 | 杭州时代银通软件股份有限公司 | Message transmission system and message transmission method | 
| CN114201317A (en) * | 2021-12-16 | 2022-03-18 | 北京有竹居网络技术有限公司 | Data transmission method, device, storage medium and electronic device | 
| CN114201317B (en) * | 2021-12-16 | 2024-02-02 | 北京有竹居网络技术有限公司 | Data transmission method and device, storage medium and electronic equipment | 
| CN114584492A (en) * | 2022-02-15 | 2022-06-03 | 珠海星云智联科技有限公司 | Time delay measuring method, system and related equipment | 
| CN114827234A (en) * | 2022-04-29 | 2022-07-29 | 广东浪潮智慧计算技术有限公司 | Data transmission method, system, device and storage medium | 
| CN114979001B (en) * | 2022-05-20 | 2023-06-13 | 北京百度网讯科技有限公司 | Data transmission method, device and equipment based on remote direct data access | 
| CN114979001A (en) * | 2022-05-20 | 2022-08-30 | 北京百度网讯科技有限公司 | Data transmission method, device and equipment based on remote direct data access | 
| CN115639954A (en) * | 2022-09-09 | 2023-01-24 | 苏州浪潮智能科技有限公司 | Data transmission method, device, equipment and medium | 
| CN115639954B (en) * | 2022-09-09 | 2025-08-01 | 苏州浪潮智能科技有限公司 | Data transmission method, device, equipment and medium | 
| WO2024140375A1 (en) * | 2022-12-27 | 2024-07-04 | 华为技术有限公司 | Storage device, and data communication method and system | 
| CN115861082A (en) * | 2023-03-03 | 2023-03-28 | 无锡沐创集成电路设计有限公司 | Low-delay picture splicing system and method | 
| CN115861082B (en) * | 2023-03-03 | 2023-04-28 | 无锡沐创集成电路设计有限公司 | Low-delay picture splicing system and method | 
| CN118069387A (en) * | 2023-12-01 | 2024-05-24 | 中科驭数(北京)科技有限公司 | A method and device for managing RDMA data transmission queue based on hardware multithreading | 
| CN118093499A (en) * | 2024-02-06 | 2024-05-28 | 贝格迈思(深圳)技术有限公司 | Data transmission method, device, equipment and storage medium for remote memory access | 
| CN119052344A (en) * | 2024-07-26 | 2024-11-29 | 浙江大学 | Network transmission method based on lock-free queue | 
| CN119675892A (en) * | 2024-10-25 | 2025-03-21 | 深圳强基计算技术有限公司 | Method for safely using cross-bus domain RNIC devices to perform RDMA operations | 
| CN119201831A (en) * | 2024-11-27 | 2024-12-27 | 广州壁仞集成电路有限公司 | Processor, electronic device, and data communication method | 
| CN119621651A (en) * | 2025-02-07 | 2025-03-14 | 浙江大学 | RDMA network batch task processing method and device based on selective signal | 
| CN120017602A (en) * | 2025-04-16 | 2025-05-16 | 中国人民解放军国防科技大学 | A virtual channel scheduling system for RDMA transmission | 
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109491809A (en) | | A kind of communication means reducing high-speed bus delay |
| US20220263913A1 (en) | | Data center cluster architecture |
| EP4298776A1 (en) | | Service mesh offload to network devices |
| US11372787B2 (en) | | Unified address space for multiple links |
| DE102022104207A1 (en) | | Pooling of network processing resources |
| US11750418B2 (en) | | Cross network bridging |
| DE102018006546A1 (en) | | PLATFORMS INTERFACIAL LAYER AND PROTOCOL FOR ACCELERATORS |
| US11902184B2 (en) | | Methods and systems for providing a virtualized NVMe over fabric service |
| García-Dorado et al. | | High-performance network traffic processing systems using commodity hardware |
| Alian et al. | | Netdimm: Low-latency near-memory network interface architecture |
| DE102019108798A1 (en) | | HIGH-BAND CONNECTION LAYER FOR COHERENT MESSAGES |
| US11895027B2 (en) | | Methods and systems for service distribution using data path state replication and intermediate device mapping |
| Abbasi et al. | | A performance comparison of container networking alternatives |
| CN116089331A (en) | | A TTE network communication method based on RDMA |
| CN117931481B (en) | | A method for rapid data exchange between real-time and time-sharing systems |
| CN108989317A (en) | | A kind of RoCE network card data communication method and network interface card based on FPGA |
| CN106484657A (en) | | A kind of reconfigurable signal processor ASIC framework and its reconstructing method |
| US10877911B1 (en) | | Pattern generation using a direct memory access engine |
| CN117240935A (en) | | Data plane forwarding method, device, equipment and medium based on DPU |
| CN102375789A (en) | | Non-buffer zero-copy method of universal network card and zero-copy system |
| Balaji et al. | | Asynchronous zero-copy communication for synchronous sockets in the sockets direct protocol (SDP) over InfiniBand |
| CN105718393A (en) | | Multi-source access scheduling method and device for registers of network interface chip |
| Schlansker et al. | | High-performance ethernet-based communications for future multi-core processors |
| Lant et al. | | Enabling shared memory communication in networks of MPSoCs |
| WO2017063447A1 (en) | | Computing apparatus, node device, and server |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190319 |