
CN115509452B - A lightweight method for key I/O paths in distributed storage systems - Google Patents

A lightweight method for key I/O paths in distributed storage systems

Info

Publication number
CN115509452B
Authority
CN
China
Prior art keywords
request
rdma
distributed storage
memory
protocol communication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211194147.XA
Other languages
Chinese (zh)
Other versions
CN115509452A (en)
Inventor
冯丹
王芳
曹郁超
帅晓雨
陈思新
何营
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Mass Institute Of Information Technology
Huazhong University of Science and Technology
Original Assignee
Shandong Mass Institute Of Information Technology
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Mass Institute Of Information Technology and Huazhong University of Science and Technology
Priority to CN202211194147.XA
Publication of CN115509452A
Application granted
Publication of CN115509452B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061: Improving I/O performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Communication Control (AREA)

Abstract


The present invention discloses a lightweight method for key I/O paths in a distributed storage system. Compared with existing distributed storage systems, the standard protocol network layer in the present invention is no longer connected directly to the physical storage devices; instead, it reaches physical storage devices located on multiple different remote nodes through a virtual block device layer. Based on this architectural change, the invention extends the RDMA registered memory attributes in the standard protocol network layer so that the memory carries remote read and write permissions. Because the memory management module in the NVMe-oF standard protocol Target is reused, redundant management of RDMA memory on the key I/O path is eliminated, the number of memory allocations and releases is reduced, and request-processing delay drops. The method also solves the problem that the communication queue pair of the private protocol communication connection and the RDMA memory of the NVMe-oF standard protocol lie in different protection domains and therefore cannot access each other, eliminating the extra data copy on the critical path and improving overall system performance.

Description

A lightweight method for key I/O paths in a distributed storage system
Technical Field
The invention belongs to the field of distributed computer storage, and in particular relates to a lightweight method for key I/O paths in a distributed storage system.
Background
RDMA is a technology that addresses problems such as insufficient bandwidth and high latency in network transmission. Compared with a traditional TCP/IP network, RDMA bypasses the operating-system kernel and exchanges data between memory and the network card directly over the system bus. The CPU participates only when a control command is issued; the data transfer itself is completed by the direct memory access controller, so CPU utilization stays very low. RDMA has been widely adopted in the network communication modules of distributed storage and high-performance computing systems.
NVMe-oF extends the NVMe protocol to Ethernet, Fibre Channel, and similar fabrics, adopting a message-based model to transmit requests and responses between the host side and the target storage device over a network. The design aims to replace PCIe and extend the communication distance between the NVMe host and the NVMe storage subsystem at the cost of little performance loss, realizing a cross-network block storage service with high performance, high resource utilization, high scalability, and fault isolation.
NVMe-oF usually uses an RDMA network as its transport-layer protocol to fully exploit high-speed interconnects. In this type of distributed storage system, RDMA memory must be preregistered to reduce latency, so the system has to manage this portion of RDMA memory. In addition, when data is delivered from the user layer to the network layer, it must first be copied into RDMA memory before it can be sent.
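As a concrete picture of this baseline (a minimal sketch against the libibverbs C API; the helper name and buffer handling are illustrative assumptions, not code from the patent):

    #include <infiniband/verbs.h>
    #include <string.h>

    /* Baseline sketch: the buffer is preregistered with local rights only,
     * so user data must be copied into it before every send. This is the
     * extra copy on the critical I/O path that the invention removes. */
    static struct ibv_mr *stage_for_send(struct ibv_pd *pd, void *rdma_buf,
                                         size_t buf_len, const void *user_data,
                                         size_t data_len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, rdma_buf, buf_len,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (mr != NULL)
            memcpy(rdma_buf, user_data, data_len); /* redundant data movement */
        return mr; /* subsequent work requests reference mr->lkey */
    }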
Since the control command stream is unidirectional, even if the proprietary protocol communication connection of the back-end distributed storage system also uses an RDMA network, it cannot interact with the network layer of the NVMe-oF protocol. The virtual block device layer can only passively receive information from the network layer; after encapsulation and attribute stripping, the information seen at the virtual block device layer has lost the key RDMA attributes, and only ordinary memory information is exposed to the service interface of the distributed storage system. This working mode exists because, in the original NVMe-oF protocol design, data in memory is operated on directly by the DMA controller, which needs only the local read-write right lkey, so the rights information is hidden when the request is transferred to the virtual block device layer. Moreover, even if this rights information were available, remote access still could not be accomplished. As a result, the proprietary protocol communication link must redundantly reorganize and manage RDMA memory and move data on the system's critical I/O path; these redundant operations increase request-processing delay and degrade system performance.
Disclosure of Invention
Aiming at the defects of the prior art, the present invention provides a lightweight method for key I/O paths in a distributed storage system, with the aim of reducing the processing delay of system requests and improving system performance.
In order to achieve the above object, in a first aspect, the present invention provides a lightweight method for key I/O paths in a distributed storage system, where the distributed storage system includes a standard protocol network layer, a virtual block device layer, and distributed storage nodes connected in sequence, the virtual block device layer interacts with the distributed storage nodes through a private protocol communication connection, and the method includes the following steps:
S1, before an original request is transferred from the standard protocol network layer to the virtual block device layer, extending and storing the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device (these extended fields are sketched as a struct after the step list);
S2, after the original request is transferred to the virtual block device layer, the virtual block device layer parses the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request, and, when the magic_number shows that the original request is a normal read-write request, reconstructs a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx, and then generates a plurality of parallel sub-requests in combination with the offset in the original request;
S3, each sub-request is sent to the distributed storage node through the private protocol communication connection;
S4, the distributed storage node receives and parses each sub-request, and initiates remote reading and writing through the private protocol communication connection.
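For concreteness, the extended fields named in S1 and S2 can be pictured as the following C struct; this is a sketch only, with field names taken from the text and widths/layout assumed rather than specified by the patent:

    #include <stdint.h>
    #include <sys/uio.h>              /* struct iovec */

    #define MAX_SEGS 16               /* rkeys holds at most 16 entries */

    struct flag_request {
        uint64_t     magic_number;    /* verification info: request type    */
        uint32_t     rkeys[MAX_SEGS]; /* remote access rights, one per seg  */
        void        *pd;              /* protection domain (ibv_pd *)       */
        void        *device_ctx;      /* RDMA device context (ibv_context *)*/
        struct iovec iovs[MAX_SEGS];  /* RDMA memory segment info           */
        int          iov_cnt;         /* number of RDMA memory segments     */
        uint64_t     offset;          /* target data storage address        */
        uint64_t     len;             /* target data length                 */
        uint64_t     cpl_len;         /* length completed so far            */
    };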
Further, in step S2, after the original request is transferred to the virtual block device layer, only the first address of the original request is encapsulated in the private protocol communication request;
when the plurality of parallel sub-requests are generated, the first address of the original request is parsed out of the encapsulated private protocol communication request and then combined with the offset to directly obtain the associated field information.
Further, before the step S3, the method further includes:
S3', judging whether the private protocol communication connection has been initialized; if yes, executing S3; if not, initializing the private protocol communication connection using pd and device_ctx, and then executing S3.
Further, the initialization of the private protocol communication connection is performed in two stages, with the second stage delayed: the first stage completes when the system starts, and the second stage completes by parsing pd and device_ctx from the private protocol communication request.
Further, in S4, if the sub-request is a read request, data is read from the local SSD into local memory and then sent to the memory area indicated by the sub-request using an RDMA write operation combined with rkeys; if the sub-request is a write request, the data is read into local memory using an RDMA read operation combined with rkeys and then written to the local SSD.
In a second aspect, the present invention provides a distributed storage system, including a standard protocol network layer, a virtual block device layer, and a distributed storage node connected in sequence, where the virtual block device layer interacts with the distributed storage node through a private protocol communication connection;
The standard protocol network layer is configured to extend and store, before an original request is transferred from the standard protocol network layer to the virtual block device layer, the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device;
The virtual block device layer is configured to parse the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request after the original request is transferred to the virtual block device layer, and, when the magic_number shows that the original request is a normal read-write request, to reconstruct a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx and generate a plurality of parallel sub-requests;
The virtual block device layer is further configured to send each sub-request to the distributed storage node through the private protocol communication connection;
the distributed storage node is used for receiving and analyzing each sub-request and initiating remote reading and writing through the private protocol communication connection.
In a third aspect, the present invention provides a computer readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the computer readable storage medium is located to perform the lightweight method for key I/O paths in a distributed storage system according to the first aspect.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
In existing distributed storage systems, the standard protocol network layer is usually connected directly to the physical storage devices; in the present invention it is no longer directly connected, and instead reaches the physical storage devices of multiple different remote nodes through the virtual block device layer. Meanwhile, based on this architectural change, the invention extends the attributes of the RDMA registered memory in the standard protocol network layer so that it carries remote read-write permission. Because the memory management module in the NVMe-oF standard protocol Target is reused, redundant management of RDMA memory on the key I/O path is eliminated, the number of memory allocations and releases is reduced, and request-processing delay drops. The method solves the problem that the RDMA memory of the NVMe-oF standard protocol cannot be accessed because the communication queue pair of the private protocol communication connection lies in a different protection domain, eliminates the extra data copy on the key path, and improves the overall performance of the system.
Drawings
FIG. 1 is a flow chart of the lightweight method for key I/O paths in a distributed storage system provided by an embodiment of the present invention;
FIG. 2 is a diagram of the data structures of requests at different stages of the key I/O path provided by an embodiment of the present invention;
FIG. 3 is a flow chart of one read operation under the lightweight method for key I/O paths in a distributed storage system provided by an embodiment of the present invention;
FIG. 4 is a flow chart of one write operation under the lightweight method for key I/O paths in a distributed storage system provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In the present invention, the terms "first," "second," and the like in the description and in the drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Referring to fig. 1, in conjunction with fig. 2 to 4, the present invention provides a lightweight method for key I/O paths in a distributed storage system, the method including operations S1 to S4.
In operation S1, before the original request is transferred from the standard protocol network layer to the virtual block device layer, the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device are extended and stored.
In this embodiment, the RDMA registered memory attributes in the standard protocol network layer are extended: the remote read-write right rkey is added on top of the local read-write right lkey. In the original request from the NVMe-oF standard protocol network layer, four additional pieces of memory space are opened up to store the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device. The magic_number indicates the original request type; rkeys is reconstructed into each mapped sub-request, each RDMA memory segment has one associated permission rkey, and rkeys stores at most 16 rkey values; pd and device_ctx are used for lazy initialization of the private protocol communication connection.
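With the libibverbs C API, this attribute extension amounts to adding the two remote-access flags at registration time; a minimal sketch (the wrapper name is illustrative):

    #include <infiniband/verbs.h>

    /* Register an NVMe-oF data buffer with remote permissions added, so
     * that back-end private-protocol queue pairs created in the same
     * protection domain can read and write it directly. mr->lkey keeps
     * serving local DMA as before; mr->rkey is the new remote access
     * right propagated in the rkeys array. */
    static struct ibv_mr *reg_remote_mr(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE  |
                          IBV_ACCESS_REMOTE_READ  |  /* added: remote read  */
                          IBV_ACCESS_REMOTE_WRITE);  /* added: remote write */
    }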
In addition, two global variables are maintained: a state variable InitFlag, which indicates whether the private protocol communication connection has completed its two-stage initialization, and a lock variable InitLock, which acts as a synchronization lock while the private protocol communication connection is being initialized. Predefined field access semantics allow local reverse control flows to access the above information. Specifically, InitFlag being False indicates that the second stage of the private protocol communication connection has not been initialized, and InitFlag being True indicates that the connection has been initialized. InitLock in the locked state indicates that the second-stage initialization of the private protocol communication connection is in progress; InitLock in the unlocked state indicates that the connection state is given by the InitFlag variable.
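These semantics are classic double-checked locking. A minimal C11 sketch, where private_conn_init is a hypothetical stand-in for the second-stage connection setup described below:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <infiniband/verbs.h>

    extern void private_conn_init(struct ibv_pd *pd, struct ibv_context *ctx);

    static atomic_bool init_flag = false;                          /* InitFlag */
    static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;  /* InitLock */

    static void ensure_private_conn(struct ibv_pd *pd, struct ibv_context *ctx)
    {
        if (atomic_load(&init_flag))          /* first check: lock-free path */
            return;
        pthread_mutex_lock(&init_lock);       /* InitLock -> locked state    */
        if (!atomic_load(&init_flag)) {       /* second check under the lock */
            private_conn_init(pd, ctx);       /* hypothetical: build QPs in
                                                 the same protection domain  */
            atomic_store(&init_flag, true);
        }
        pthread_mutex_unlock(&init_lock);     /* InitLock -> unlocked state  */
    }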
In operation S2, after the original request is transferred to the virtual block device layer, the virtual block device layer parses the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request; when the magic_number shows that the original request is a normal read-write request, it reconstructs a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx, and then generates a plurality of parallel sub-requests in combination with the offset in the original request.
When the plurality of parallel sub-requests are generated, the first address of the original request is parsed out of the encapsulated private protocol communication request and then combined with the offset to directly obtain the associated field information.
In operation S3, each sub-request is sent to the distributed storage node through the private protocol communication connection.
In this embodiment, before operation S3, the method further includes:
S3', judging whether the private protocol communication connection has been initialized; if yes, executing S3; if not, initializing the private protocol communication connection using pd and device_ctx, and then executing S3.
Further, the initialization of the private protocol communication connection is performed in two stages, with the second stage delayed: the first stage completes when the system starts, and the second stage completes by parsing pd and device_ctx from the private protocol communication request.
Specifically, S3' includes:
S31', checking the state variable to judge whether the private protocol communication connection has completed its two-stage initialization with pd and device_ctx; if not, executing S32', otherwise executing S33'.
S32', after the lock variable is successfully acquired and locked, checking the state variable a second time; if the second check also finds the connection uninitialized, the communication connection is established and RDMA resources are allocated and configured using the pd and device_ctx in the private protocol communication request, and the state variable is then set to avoid repeating the initialization. If the second check fails, S33' is performed.
S33', traversing the RDMA memory segments in the private protocol communication request one by one, and mapping each RDMA memory segment to an actual storage node by combining the target data address and the actual data length. The mapped sub-requests are reconstructed together with the remote access rights rkeys of the associated RDMA memory and sent to the actual storage nodes through the private protocol communication connection (a sketch of this mapping loop follows).
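The mapping loop of S33' can be sketched as follows, assuming send_sub_request is a hypothetical private-protocol send and chunk_size is the ChunkSize shard size:

    #include <stdint.h>

    extern void send_sub_request(uint64_t id, uint64_t offset_inchunk,
                                 uint64_t len_inchunk, uint64_t remote_addr,
                                 uint32_t rkey);   /* hypothetical */

    /* Map one RDMA memory segment onto Chunk shards, emitting one
     * sub-request per shard it touches. */
    static void map_segment(uint64_t vol_off, uint64_t seg_addr,
                            uint64_t seg_len, uint32_t rkey,
                            uint64_t chunk_size)
    {
        while (seg_len > 0) {
            uint64_t id   = vol_off / chunk_size;
            uint64_t off  = vol_off % chunk_size;
            uint64_t room = chunk_size - off;       /* bytes left in shard */
            uint64_t n    = seg_len < room ? seg_len : room;
            send_sub_request(id, off, n, seg_addr, rkey);
            vol_off  += n;
            seg_addr += n;
            seg_len  -= n;
        }
    }

With chunk_size = 1GB (0x40000000), a segment at volume offset 0x3FFFF000 of length 0x4000 splits into 0x1000 bytes in shard 0 and 0x3000 bytes in shard 1, matching the write walkthrough below.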
In operation S4, the distributed storage node receives and parses each sub-request and initiates remote reading and writing through the private protocol communication connection.
In this embodiment, if the sub-request is a read request, data is read from the local SSD into local memory and then sent to the memory area indicated by the sub-request using an RDMA write operation combined with rkeys; if the sub-request is a write request, the data is read into local memory using an RDMA read operation combined with rkeys and then written to the local SSD.
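A sketch of the read-path write-back using libibverbs (qp and local_mr are the storage node's own resources; remote_addr and rkey come from the sub-request). The write path mirrors this with IBV_WR_RDMA_READ before the SSD write:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Push data read from the local SSD into the requester's memory with
     * a one-sided RDMA write, authorized by the rkey carried in the
     * sub-request. This succeeds only if the QP's protection domain
     * matches the domain in which the remote memory was registered. */
    static int rdma_write_back(struct ibv_qp *qp, struct ibv_mr *local_mr,
                               void *buf, uint32_t len,
                               uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = local_mr->lkey,
        };
        struct ibv_send_wr wr = {0}, *bad = NULL;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad);
    }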
The present invention is described in further detail below in connection with specific read and write operations.
As shown in fig. 2, the reconstructed private protocol communication request flag_request carries the verification information magic_number, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device, together with the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, and the length cpl_len of data read or written so far for the current request. Each private protocol sub-request chunk_request carries the target shard identifier id, the offset offset_inchunk of the target data within the target shard, the length len_inchunk within the target shard, the remote buffer address remote_addr, and the remote access right rkey corresponding to that buffer.
The verification information magic_number, the protection domain pd, and the RDMA network device context information each occupy 8 bytes. The remote access rights rkeys accommodate at most 16 memory segment permission identifiers of 4 bytes each, 64 bytes in total. The state variable InitFlag and the lock variable InitLock are maintained, along with an associated variable ChunkSize that gives the shard size used when the virtual volume is mapped to the actual storage nodes. If the magic_number equals the globally predefined value MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request from the standard protocol network layer; otherwise it is a probe request issued while the back-end virtual device is being built.
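Correspondingly, a sub-request can be sketched as the following C struct, again with field names from the text and widths assumed:

    #include <stdint.h>

    #define MAGIC_NUMBER 0xABCD1234u  /* globally predefined check value */

    struct chunk_request {
        uint64_t id;              /* target Chunk shard identifier          */
        uint64_t offset_inchunk;  /* offset of the target data in the shard */
        uint64_t len_inchunk;     /* length within the shard                */
        uint64_t remote_addr;     /* remote buffer address                  */
        uint32_t rkey;            /* remote access right for that buffer    */
    };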
In this embodiment, ChunkSize is 1GB and the NVMe-oF protocol transport block is set to a maximum of 4KB. In the original request, iovs[0].base is NULL, iovs[0].iov_len is 0, iovs[1].base is NULL, iovs[1].iov_len is 0, and iov_cnt is 0; pd is NULL and device_ctx is NULL; offset is 0x3FFFF000 and len is 0x2000; magic_number is 0xABCD1234, rkeys[0] is 0, and rkeys[1] is 0. As shown in fig. 3, the flow of one data read operation is as follows:
(1) Upon RDMA memory registration, the IBV_ACCESS_REMOTE_WRITE and IBV_ACCESS_REMOTE_READ flags are added, expanding the registration to remotely accessible RDMA memory. When the original request is constructed in the NVMe-oF standard protocol network layer, the target data length is 0x2000, so two RDMA memory segments of length 0x1000 are constructed; after buffer space is allocated, iovs[0].base is 0x7FFD9E9C8000, iovs[0].iov_len is 0x1000, iovs[1].base is 0x7FFD9E9D4000, iovs[1].iov_len is 0x1000, and iov_cnt is 2. pd is 0xBC63B0, device_ctx is 0xBDB7B0, rkeys[0] is 0xC3389, and rkeys[1] is 0xC3389. The RDMA memory segment information iovs, the remote access rights rkeys, the RDMA device context device_ctx, the protection domain pd of the RDMA memory segments, the target data address offset, and the target data length len are reconstructed into a private protocol communication request flag_request, which is sent to the virtual block device layer for processing.
(2) After receiving the private protocol communication request flag_request, the virtual block device layer reads its magic_number field; since magic_number equals MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request, and the currently completed data length cpl_len is set to 0.
(3) If the first check of the state variable InitFlag yields False, an attempt is made to lock the lock variable InitLock. If the lock is acquired and a second check of InitFlag again yields False, the second-stage initialization of the private protocol communication connection is performed using the pd and device_ctx of the private protocol communication request, i.e., a communication queue pair is constructed based on the protection domain pd 0xBC63B0 and the RDMA network device context device_ctx 0xBDB7B0, and InitFlag is set to True once initialization completes; if the second check finds InitFlag True, the initialization is skipped and the lock variable is released to the unlocked state. If locking fails, the current thread blocks until it acquires the lock, at which point the check of InitFlag must yield True, so the initialization is bypassed. If the first check of InitFlag yields True, the initialization is skipped directly.
(4) After the double check, the RDMA memory segment information is processed one by one. The buffer pointed to by iovs[0].base is 0x1000 bytes long and the target address is 0x3FFFF000, so the request crosses a Chunk shard boundary. Sub-request 1 is generated with id 0, offset_inchunk 0x3FFFF000, len_inchunk 0x1000, remote address remote_addr 0x7FFD9E9C8000, and remote access right rkey 0xC3389. Sub-request 2 is generated with id 1, offset_inchunk 0, len_inchunk 0x1000, remote address remote_addr 0x7FFD9E9D4000, and remote access right rkey 0xC3389. The sub-requests are sent in parallel to the target storage nodes for processing.
(5) The storage node receiving sub-request 1 parses the Chunk shard id in the request, converts it into an SSD address, and reads the data into local RDMA memory; the communication queue pair then writes the data to remote_addr 0x7FFD9E9C8000 using the remote access right rkey 0xC3389. Because the receiver's communication queue pair and remote_addr are registered in the same protection domain pd 0xBC63B0, the write operation succeeds. Sub-request 2 is processed in the same way.
(6) After the storage node writes the data to remote_addr 0x7FFD9E9C8000, the callback of the private protocol communication request flag_request is triggered, updating its cpl_len field to 0x1000. After the storage node writes the data to remote_addr 0x7FFD9E9D4000, a second callback of flag_request is triggered, updating cpl_len to 0x2000.
(7) When cpl_len in the private protocol communication request flag_request is updated to 0x2000, i.e., equal to the target data length len, the request is complete; feedback is then returned layer by layer to the virtual block device layer, and the read operation is finished.
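The completion accounting of steps (6) and (7) can be sketched as follows; the walkthrough implies serialized callbacks, so a plain addition is shown (a concurrent design would need an atomic add), and complete_to_block_layer is a hypothetical name for the layer-by-layer feedback:

    /* Uses the flag_request sketch shown earlier. */
    extern void complete_to_block_layer(struct flag_request *req); /* hypothetical */

    /* Per-sub-request completion callback: accumulate into cpl_len and
     * finish the request once the whole target length has completed. */
    static void on_sub_request_done(struct flag_request *req, uint64_t done_len)
    {
        req->cpl_len += done_len;          /* serialized callbacks assumed */
        if (req->cpl_len == req->len)
            complete_to_block_layer(req);  /* layer-by-layer feedback */
    }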
In this embodiment, ChunkSize is 1GB and the NVMe-oF protocol transport block is set to a maximum of 128KB. In the original request, iovs[0].base is 0x7FFDA0E91000, iovs[0].iov_len is 0x2000, and iov_cnt is 1; pd is NULL and device_ctx is NULL; offset is 0x3FFFF000 and len is 0x4000; magic_number is 0xABCD1234 and rkeys[0] is 0. As shown in fig. 4, the flow of one data write operation is as follows:
(1) Upon RDMA memory registration, the IBV_ACCESS_REMOTE_WRITE and IBV_ACCESS_REMOTE_READ flags are added, expanding the registration to remotely accessible RDMA memory. When the original request is constructed in the NVMe-oF standard protocol network layer, pd is 0xBD72B0, device_ctx is 0xBEC6B0, and rkeys[0] is 0x84E83. The RDMA memory segment information iovs, the remote access rights rkeys, the RDMA device context device_ctx, the protection domain pd of the RDMA memory segments, the target data address offset, and the target data length len are reconstructed into a private protocol communication request flag_request, which is sent to the virtual block device layer for processing.
(2) After receiving the private protocol communication request flag_request, the virtual block device layer reads its magic_number field; since magic_number equals MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request, and the currently completed data length cpl_len is set to 0.
(3) If the first check of the state variable InitFlag yields False, an attempt is made to lock the lock variable InitLock. If the lock is acquired and a second check of InitFlag again yields False, the second-stage initialization of the private protocol communication connection is performed using the pd and device_ctx of the private protocol communication request, i.e., a communication queue pair is constructed based on the protection domain pd 0xBD72B0 and the RDMA network device context device_ctx 0xBEC6B0, and InitFlag is set to True once initialization completes. If the second check finds InitFlag True, the initialization is skipped and the lock variable is released to the unlocked state. If locking fails, the current thread blocks until it acquires the lock, at which point the check of InitFlag must yield True, so the initialization is bypassed. If the first check of InitFlag yields True, the initialization is skipped directly.
(4) After the double check, the RDMA memory segment information is processed one by one. The buffer pointed to by iovs[0].base is 0x4000 bytes long and the target address is 0x3FFFF000, so the request crosses a Chunk shard boundary. Sub-request 1 is generated with id 0, offset_inchunk 0x3FFFF000, len_inchunk 0x1000, remote address remote_addr 0x7FFDA0E91000, and remote access right rkey 0x84E83. Sub-request 2 is generated with id 1, offset_inchunk 0, len_inchunk 0x3000, remote address remote_addr 0x7FFDA0E92000, and remote access right rkey 0x84E83. The sub-requests are sent in parallel to the target storage nodes for processing.
(5) At the storage node receiving sub-request 1, the communication queue pair reads the data from remote_addr 0x7FFDA0E91000 into local RDMA memory using the remote access right rkey 0x84E83; because the receiver's communication queue pair and remote_addr are registered in the same protection domain pd 0xBD72B0, the remote read operation succeeds. The Chunk shard id in the request is then parsed and converted into an SSD address, and the data is written to the SSD. Sub-request 2 is processed in the same way.
(6) After writing the data corresponding to 0x7FFDA0E91000 to the SSD, the storage node triggers the callback of the private protocol communication request flag_request, updating its cpl_len to 0x1000. After writing the data corresponding to 0x7FFDA0E92000 to the SSD, it triggers a second callback of flag_request, updating cpl_len to 0x4000.
(7) When cpl_len in the private protocol communication request flag_request is updated to 0x4000, i.e., equal to the data length len to be written, the request is complete; feedback is then returned layer by layer to the virtual block device layer, and the write operation is finished.
In summary, in the prior art, bridging the communication connections between the NVMe-oF standard protocol and the private protocol produces overlapping RDMA memory management functions, and fine-grained protection domains force data movement on the key I/O path. The present invention, through an incremental extension of NVMe-oF, opens metadata and data access tunnels for the private protocol communication connection without affecting the original working semantics. The metadata access tunnel adds and maintains the remote access right rkey of each RDMA memory segment and maps discrete RDMA memory segments directly to discrete actual storage nodes, eliminating the redundant management of remotely accessible RDMA memory, reducing the number of RDMA memory allocations and releases, and lowering system processing delay. The data access tunnel maintains the protection domain of the communication queue pair and registered memory together with the context information of the RDMA network device, and customizes a two-stage initialization of the private protocol with a double-checked lazy initialization strategy; this solves the problem that the communication queue pair of the private protocol and the RDMA memory of the standard protocol lie in different protection domains and thus cannot access each other, eliminates data movement on the key I/O path, reduces the processing delay of system requests, and improves system performance.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A lightweight method for key I/O paths in a distributed storage system, characterized in that the distributed storage system comprises a standard protocol network layer, a virtual block device layer, and distributed storage nodes connected in sequence, wherein the virtual block device layer interacts with the distributed storage nodes through a private protocol communication connection, the method comprising the following steps:
S1, before an original request is transferred from the standard protocol network layer to the virtual block device layer, extending and storing the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device;
S2, after the original request is transferred to the virtual block device layer, the virtual block device layer parses the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request, and, when the magic_number shows that the original request is a normal read-write request, reconstructs a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx, and then generates a plurality of parallel sub-requests in combination with the offset in the original request;
S3, each sub-request is sent to the distributed storage node through the private protocol communication connection;
S4, the distributed storage node receives and parses each sub-request, and initiates remote reading and writing through the private protocol communication connection.
2. The lightweight method for key I/O paths in a distributed storage system according to claim 1, characterized in that in S2, after the original request is transferred to the virtual block device layer, only the first address of the original request is encapsulated in the private protocol communication request;
when the plurality of parallel sub-requests are generated, the first address of the original request is parsed out of the encapsulated private protocol communication request and then combined with the offset to directly obtain the associated field information.
3. The lightweight method for key I/O paths in a distributed storage system according to claim 1 or 2, further comprising, before S3:
S3', judging whether the private protocol communication connection has been initialized; if yes, executing S3; if not, initializing the private protocol communication connection using pd and device_ctx, and then executing S3.
4. The lightweight method for key I/O paths in a distributed storage system according to claim 3, characterized in that the initialization of the private protocol communication connection is performed in two stages with the second stage delayed: the first stage completes when the system starts, and the second stage completes by parsing pd and device_ctx from the private protocol communication request.
5. The lightweight method for key I/O paths in a distributed storage system according to claim 1, characterized in that in S4, if the sub-request is a read request, data is read from the local SSD into local memory and then sent to the memory area indicated by the sub-request using an RDMA write operation combined with rkeys; if the sub-request is a write request, the data is read into local memory using an RDMA read operation combined with rkeys and then written to the local SSD.
6. The distributed storage system is characterized by comprising a standard protocol network layer, a virtual block device layer and distributed storage nodes which are sequentially connected, wherein the virtual block device layer interacts with the distributed storage nodes through private protocol communication connection;
The standard protocol network layer is configured to extend and store, before an original request is transferred from the standard protocol network layer to the virtual block device layer, the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd where the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device;
The virtual block device layer is configured to parse the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request after the original request is transferred to the virtual block device layer, and, when the magic_number shows that the original request is a normal read-write request, to reconstruct a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx and generate a plurality of parallel sub-requests;
The virtual block device layer is further configured to send each sub-request to the distributed storage node through the private protocol communication connection;
the distributed storage node is used for receiving and analyzing each sub-request and initiating remote reading and writing through the private protocol communication connection.
7. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the computer readable storage medium is located to perform the lightweight method for key I/O paths in a distributed storage system according to any of claims 1 to 5.
CN202211194147.XA · Priority 2022-09-28 · Filed 2022-09-28 · A lightweight method for key I/O paths in distributed storage systems · Active · CN115509452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211194147.XA CN115509452B (en) 2022-09-28 2022-09-28 A lightweight method for key I/O paths in distributed storage systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211194147.XA CN115509452B (en) 2022-09-28 2022-09-28 A lightweight method for key I/O paths in distributed storage systems

Publications (2)

Publication Number Publication Date
CN115509452A (en) 2022-12-23
CN115509452B (en) 2025-05-30

Family

ID=84507219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211194147.XA Active CN115509452B (en) 2022-09-28 2022-09-28 A lightweight method for key I/O paths in distributed storage systems

Country Status (1)

Country Link
CN (1) CN115509452B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7713068B2 (en) * 2006-12-06 2010-05-11 Fusion Multisystems, Inc. Apparatus, system, and method for a scalable, composite, reconfigurable backplane
CN110908600B (en) * 2019-10-18 2021-07-20 华为技术有限公司 Data access method, apparatus and first computing device
CN114691026A (en) * 2020-12-31 2022-07-01 华为技术有限公司 Data access method and related equipment
CN113703672B (en) * 2021-07-30 2023-07-14 郑州云海信息技术有限公司 A hyper-converged system and its IO request sending method, physical server
CN114327903B (en) * 2021-12-30 2023-11-03 苏州浪潮智能科技有限公司 NVMe-oF management system, resource configuration method and IO reading and writing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Optimization Techniques for Distributed Block Storage Systems Based on the NVMe over Fabrics Protocol; 曹郁超; China Excellent Master's Theses Electronic Journals Database; 2023-11-15; full text *

Also Published As

Publication number Publication date
CN115509452A (en) 2022-12-23

Similar Documents

Publication Publication Date Title
CN114756388B (en) A method for on-demand shared memory between cluster system nodes based on RDMA
CN109313644B (en) System and method for database proxy
CN111400307B (en) A Persistent Hash Table Access System Supporting Remote Concurrent Access
JP4755390B2 (en) Method and apparatus for controlling the flow of data between data processing systems via a memory
US8671152B2 (en) Network processor system and network protocol processing method
US9703796B2 (en) Shared dictionary between devices
US6799200B1 (en) Mechanisms for efficient message passing with copy avoidance in a distributed system
JP4755391B2 (en) Method and apparatus for controlling the flow of data between data processing systems via a memory
US20140115182A1 (en) Fibre Channel Storage Area Network to Cloud Storage Gateway
WO2015078219A1 (en) Information caching method and apparatus, and communication device
CN106657365A (en) High concurrent data transmission method based on RDMA (Remote Direct Memory Access)
US7472231B1 (en) Storage area network data cache
US7409432B1 (en) Efficient process for handover between subnet managers
KR20040012716A (en) Method and Apparatus for transferring interrupts from a peripheral device to a host computer system
CN107992368A (en) Method for interchanging data and system between a kind of multi-process
CN114490439A (en) Data writing, reading and communication method based on lock-free ring shared memory
CN118012569A (en) EBPF-based Redis database cluster proxy system and eBPF-based Redis database cluster proxy method
CN116489250A (en) Method for transmitting path in zero copy in stack based on shared memory communication mode
CN115509452B (en) A lightweight method for key I/O paths in distributed storage systems
CN100486345C (en) Business system based on PC server
JPWO2018131550A1 (en) Connection management unit and connection management method
JP3237599B2 (en) Multiprocessor system and data transfer method in multiprocessor system
CN116662264B (en) Data management method, device, equipment and storage medium thereof
Noronha Designing High-Performance And Scalable Clustered Network Attached Storage With Infiniband
CN118368190A (en) Stream table updating method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant