Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a key I/O path light-weight method based on a distributed storage system, which aims to reduce the processing time delay of a system request and improve the system performance.
In order to achieve the above object, in a first aspect, the present invention provides a method for lightening a key I/O path based on a distributed storage system, where the distributed storage system includes a standard protocol network layer, a virtual block device layer, and a distributed storage node that are sequentially connected, and the virtual block device layer interacts with the distributed storage node through a private protocol communication connection, and the method includes the following steps:
S1, before an original request is transferred from the standard protocol network layer to the virtual block device layer, extending the original request to store its verification information magic_number, the remote access rights rkeys of the associated RDMA memory, the protection domain pd in which the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device;
S2, after the original request is transferred to the virtual block device layer, the virtual block device layer parses the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd and device_ctx carried in the original request and, when the original request is judged by magic_number to be a normal read-write request, reconstructs a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd and device_ctx, and then generates a plurality of parallel sub-requests in combination with the offset in the original request;
S3, each sub-request is sent to the distributed storage node through the private protocol communication connection;
S4, the distributed storage node receives and parses each sub-request, and initiates remote reading and writing through the private protocol communication connection.
Further, in step S2, after the original request is transferred to the virtual block device layer, only the first address of the original request is encapsulated in the private protocol communication request;
when the plurality of parallel sub-requests are generated, the first address of the original request is parsed from the encapsulated private protocol communication request and then combined with the offset to directly obtain the associated field information.
Further, before the step S3, the method further includes:
S3', judging whether the private protocol communication connection is initialized, if yes, executing S3, if not, initializing the private protocol communication connection by using pd and device_ctx, and then executing S3.
Further, the private protocol communication connection is initialized in two stages, with the second stage deferred (lazy initialization): the first-stage initialization is completed at system start-up, and the second-stage initialization is completed by parsing pd and device_ctx from the private protocol communication request.
Further, in S4, if the sub-request is a read request, the data is read from the local SSD into local memory and then sent to the memory area indicated by the sub-request using an RDMA write operation together with rkeys; if the sub-request is a write request, the data is read into local memory using an RDMA read operation together with rkeys and then written to the local SSD.
In a second aspect, the present invention provides a distributed storage system, including a standard protocol network layer, a virtual block device layer, and a distributed storage node connected in sequence, where the virtual block device layer interacts with the distributed storage node through a private protocol communication connection;
The standard protocol network layer is configured to extend the original request, before it is transferred from the standard protocol network layer to the virtual block device layer, to store the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd in which the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device;
The virtual block device layer is configured to parse, after the original request is transferred to the virtual block device layer, the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd, and device_ctx carried in the original request, and, when the original request is judged by magic_number to be a normal read-write request, to reconstruct a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd, and device_ctx and then generate a plurality of parallel sub-requests in combination with the offset in the original request;
The virtual block device layer is further configured to send each sub-request to the distributed storage node through the private protocol communication connection;
The distributed storage node is configured to receive and parse each sub-request and to initiate remote reading and writing through the private protocol communication connection.
In a third aspect, the present invention provides a computer readable storage medium, the computer readable storage medium comprising a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the computer readable storage medium is located to perform the distributed storage system-based critical I/O path light-weight method according to the first aspect.
In general, through the above technical solutions conceived by the present invention, the following beneficial effects can be obtained:
In existing distributed storage systems, the standard protocol network layer is usually connected directly to the physical storage device. In the present invention, the standard protocol network layer is no longer directly connected to the physical storage device, but is connected through the virtual block device layer to the physical storage devices of a plurality of different remote nodes. Based on this architectural change, the invention extends the attributes of the RDMA registered memory in the standard protocol network layer so that it has remote read-write permission. Because the memory management module in the NVMe-oF standard protocol Target is reused, redundant management of the RDMA memory on the key I/O path is eliminated, the number of memory allocations and releases is reduced, and the request processing latency is reduced. The invention also solves the problem that the RDMA memory of the NVMe-oF standard protocol cannot be accessed because the communication queue pair of the private protocol communication connection is in a different protection domain, eliminates the extra data copy on the key path, and improves the overall performance of the system.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
In the present invention, the terms "first," "second," and the like in the description and in the drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Referring to fig. 1, in conjunction with fig. 2 to 4, the present invention provides a method for lightening a key I/O path based on a distributed storage system, the method including operations S1 to S4.
Operation S1, before the original request is transferred from the standard protocol network layer to the virtual block device layer, the original request is extended to store its verification information magic_number, the remote access rights rkeys of the associated RDMA memory, the protection domain pd in which the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device.
In this embodiment, the attributes of the RDMA registered memory in the standard protocol network layer are extended: the remote read-write permission rkey is added on top of the local read-write permission lkey. In the original request from the NVMe-oF standard protocol network layer, four additional memory areas are allocated to store the verification information magic_number of the original request, the remote access rights rkeys of the associated RDMA memory, the protection domain pd in which the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device. Here magic_number indicates the type of the original request; rkeys is reconstructed into each mapped sub-request, each RDMA memory segment has one associated permission identifier rkey, and rkeys stores at most 16 rkey values; pd and device_ctx are used for the lazy initialization of the private protocol communication connection.
In addition, two global variables are maintained: a state variable InitFlag that indicates whether the private protocol communication connection has completed its two-stage initialization, and a lock variable InitLock that acts as a synchronization lock while the private protocol communication connection is being initialized. Predefined field access semantics are provided that allow local inverse control flows to access the above information. Specifically, InitFlag being False indicates that the second stage of the private protocol communication connection msg_m has not been initialized, and InitFlag being True indicates that the private protocol communication connection has been initialized. InitLock in the locked state indicates that the second-stage initialization of the private protocol communication connection is in progress; InitLock in the unlocked state indicates that the state of the private protocol communication connection is given by the InitFlag variable.
In operation S2, after the original request is transferred to the virtual block device layer, the virtual block device layer parses the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, magic_number, rkeys, pd, and device_ctx carried in the original request. When the original request is judged by magic_number to be a normal read-write request, the virtual block device layer reconstructs a private protocol communication request based on the parsed iovs, iov_cnt, offset, len, magic_number, rkeys, pd, and device_ctx, and then generates a plurality of parallel sub-requests in combination with the offset in the original request.
When generating a plurality of parallel sub-requests, the first address of the original request is resolved from the encapsulated private protocol communication request, and then the first address is combined with the offset to directly acquire the associated field information.
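The field access described above can be illustrated with a minimal sketch. It reads "offset" in this paragraph as a field offset inside the request structure, which is an interpretation rather than something the text states outright. The structure name OriginalRequest, its field order, and the leading payload member are hypothetical; only the field names magic_number, rkeys, pd, and device_ctx come from the description.

```python
import ctypes


class OriginalRequest(ctypes.Structure):
    # Hypothetical layout: the field order and the 'payload' member are assumptions.
    _fields_ = [
        ("payload", ctypes.c_uint8 * 32),   # stand-in for the standard-protocol part
        ("magic_number", ctypes.c_uint64),
        ("rkeys", ctypes.c_uint32 * 16),
        ("pd", ctypes.c_uint64),
        ("device_ctx", ctypes.c_uint64),
    ]


def read_u64(base_addr, field_offset):
    """Read a 64-bit field located at base_addr + field_offset."""
    return ctypes.c_uint64.from_address(base_addr + field_offset).value


req = OriginalRequest()
req.magic_number = 0xABCD1234
req.pd = 0xBC63B0
req.device_ctx = 0xBDB7B0

# Only the first address travels with the private protocol request.
first_address = ctypes.addressof(req)

# Each field is recovered from the first address plus its offset, without copying.
magic = read_u64(first_address, OriginalRequest.magic_number.offset)
pd = read_u64(first_address, OriginalRequest.pd.offset)
device_ctx = read_u64(first_address, OriginalRequest.device_ctx.offset)
```

Passing only the first address keeps the private protocol request small, at the cost of requiring the original request to stay alive until every sub-request has finished.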
Operation S3, each sub-request is sent to the distributed storage node through the private protocol communication connection.
In this embodiment, before the step S3, the method further includes:
S3', judging whether the private protocol communication connection is initialized, if yes, executing S3, if not, initializing the private protocol communication connection by using pd and device_ctx, and then executing S3.
Further, the private protocol communication connection is initialized in two stages, with the second stage deferred (lazy initialization): the first-stage initialization is completed at system start-up, and the second-stage initialization is completed by parsing pd and device_ctx from the private protocol communication request.
Specifically, the S3' includes:
S31', checking the state variable, judging whether the private protocol communication connection has completed two-stage initialization by using pd and device_ctx, if not, executing S32', otherwise executing S33'.
S32', attempting to acquire the lock variable; after the lock is successfully acquired, the state variable is checked a second time, and if it still indicates that the initialization has not been completed, the establishment of the communication connection and the allocation and configuration of RDMA resources are completed using pd and device_ctx in the private protocol communication request. The state variable is then set so that the initialization is not repeated. If the second check finds that the initialization has already been completed, S33' is performed.
S33', traversing the RDMA memory segments in the private protocol communication request one by one, and mapping each RDMA memory segment to an actual storage node by combining the target data address and the actual data length. The mapped sub-request is reconstructed in combination with the remote access rights rkeys of the associated RDMA memory, and the sub-request is sent to the actual storage node over the private protocol communication connection.
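The double check in S31' and S32' can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class LazyConnection, its method names, and the init_count counter are hypothetical, and the actual queue pair construction is replaced by simply recording pd and device_ctx.

```python
import threading


class LazyConnection:
    """Hypothetical stand-in for the private protocol communication connection."""

    def __init__(self):
        self.init_flag = False              # plays the role of InitFlag
        self.init_lock = threading.Lock()   # plays the role of InitLock
        self.init_count = 0                 # illustration only: counts second-stage runs
        self.pd = None
        self.device_ctx = None

    def ensure_initialized(self, pd, device_ctx):
        # First check, without the lock: the common fast path once initialized.
        if self.init_flag:
            return
        # Blocks until the lock is free, like the suspended thread in the text.
        with self.init_lock:
            # Second check, under the lock: another thread may have finished first.
            if not self.init_flag:
                # Placeholder for building the queue pair from pd and device_ctx.
                self.pd = pd
                self.device_ctx = device_ctx
                self.init_count += 1
                self.init_flag = True


conn = LazyConnection()
threads = [
    threading.Thread(target=conn.ensure_initialized, args=(0xBC63B0, 0xBDB7B0))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a language without a global interpreter lock, the unlocked first read of the flag would additionally need an atomic load with suitable memory ordering.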
Operation S4, the distributed storage node receives and parses each sub-request and initiates remote reading and writing through the private protocol communication connection.
In this embodiment, if the sub-request is a read request, the data is read from the local SSD into local memory and then sent to the memory area indicated by the sub-request using an RDMA write operation together with rkeys; if the sub-request is a write request, the data is read into local memory using an RDMA read operation together with rkeys and then written to the local SSD.
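The read and write branches of S4 can be simulated with a toy model. No real RDMA is involved: the class RequesterMemory, the dict standing in for the SSD, and the dictionary form of the sub-request are all hypothetical, and the rkey check is only a rough analogue of the access control that RDMA hardware performs. The addresses and the rkey are borrowed from the read example given in this description.

```python
class RequesterMemory:
    """Toy stand-in for remotely accessible RDMA memory (no real RDMA involved)."""

    def __init__(self):
        self.regions = {}  # remote_addr -> (rkey, bytearray)

    def register(self, addr, length, rkey):
        self.regions[addr] = (rkey, bytearray(length))

    def _checked(self, addr, rkey):
        expected_rkey, buf = self.regions[addr]
        if rkey != expected_rkey:
            raise PermissionError("rkey does not grant access to this region")
        return buf

    def rdma_write(self, addr, rkey, data):
        self._checked(addr, rkey)[:len(data)] = data

    def rdma_read(self, addr, rkey, length):
        return bytes(self._checked(addr, rkey)[:length])


def handle_sub_request(sub, ssd, remote):
    """Storage-node side of S4; `ssd` is a dict standing in for the local SSD."""
    key = (sub["id"], sub["offset_inchunk"])
    if sub["op"] == "read":
        local = ssd[key]  # SSD -> local memory
        remote.rdma_write(sub["remote_addr"], sub["rkey"], local)
    else:
        local = remote.rdma_read(sub["remote_addr"], sub["rkey"], sub["len_inchunk"])
        ssd[key] = local  # local memory -> SSD
    return sub["len_inchunk"]


remote = RequesterMemory()
remote.register(0x7FFD9E9C8000, 0x1000, 0xC3389)
ssd = {(0, 0x3FFFF000): b"\x5a" * 0x1000}
sub = {"op": "read", "id": 0, "offset_inchunk": 0x3FFFF000, "len_inchunk": 0x1000,
       "remote_addr": 0x7FFD9E9C8000, "rkey": 0xC3389}
done = handle_sub_request(sub, ssd, remote)
```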
The present invention is described in further detail below in connection with specific read and write operations.
As shown in FIG. 2, the reconstructed private protocol communication request flag_request carries the check information magic_number, the remote access rights rkeys of the associated RDMA memory, the protection domain pd in which the communication queue pair and the registered memory of the standard protocol network layer are located, and the context information device_ctx of the RDMA network device, as well as the RDMA memory segment information iovs, the RDMA memory segment count iov_cnt, the target data storage address offset, the target data length len, and the data length cpl_len that the current request has already read or written. The sub-request chunk_request of the private protocol communication request carries information such as the target fragment identifier id, the offset offset_inchunk of the target data within the target fragment, the length len_inchunk within the target fragment, the remote buffer address remote_addr, and the remote access right rkey corresponding to that buffer.
The check information magic_number, the protection domain pd, and the RDMA network device context information device_ctx each occupy 8 bytes. The remote access rights rkeys hold at most 16 memory segment permission identifiers of 4 bytes each, for a total of 64 bytes. The state variable InitFlag and the lock variable InitLock are maintained, together with an associated variable ChunkSize that represents the fragment size used when the virtual volume is mapped to the actual storage nodes. If magic_number is equal to the globally predefined value MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request from the standard protocol network layer; otherwise, the request is a probe request issued when the back-end virtual device is built.
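The sizes and the magic-number test above can be checked with a short sketch. The byte order, the field order inside the extension area, and the helper names pack_ext and is_normal_rw are assumptions made for illustration; the field sizes and the value 0xABCD1234 are taken from the description.

```python
import struct

MAGIC_NUMBER = 0xABCD1234
MAX_RKEYS = 16

# Assumed little-endian layout without padding:
# magic_number (8) | rkeys (16 x 4) | pd (8) | device_ctx (8)
EXT_FORMAT = "<Q16IQQ"


def pack_ext(magic_number, rkeys, pd, device_ctx):
    """Pack the four extension fields into one byte string."""
    if len(rkeys) > MAX_RKEYS:
        raise ValueError("rkeys holds at most 16 entries")
    padded = list(rkeys) + [0] * (MAX_RKEYS - len(rkeys))
    return struct.pack(EXT_FORMAT, magic_number, *padded, pd, device_ctx)


def is_normal_rw(ext_bytes):
    """True for a normal read-write request, False for a probe request."""
    (magic,) = struct.unpack_from("<Q", ext_bytes, 0)
    return magic == MAGIC_NUMBER


ext = pack_ext(MAGIC_NUMBER, [0xC3389, 0xC3389], 0xBC63B0, 0xBDB7B0)
probe = pack_ext(0, [], 0, 0)
ext_size = struct.calcsize(EXT_FORMAT)  # 8 + 64 + 8 + 8 bytes
```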
In this embodiment, ChunkSize is 1 GB and the maximum NVMe-oF protocol transport block is set to 4 KB. In the original request, iovs[0].base is NULL, iovs[0].iov_len is 0, iovs[1].base is NULL, iovs[1].iov_len is 0, and iov_cnt is 0; pd is NULL and device_ctx is NULL; offset is 0x3FFFF000 and len is 0x2000; magic_number is 0xABCD1234, rkeys[0] is 0, and rkeys[1] is 0. As shown in FIG. 3, the flow of one data read operation is as follows:
(1) When the RDMA memory is registered, the IBV_ACCESS_REMOTE_WRITE and IBV_ACCESS_REMOTE_READ flags are added, extending it into remotely accessible RDMA memory. When the original request is constructed in the NVMe-oF standard protocol network layer, the target data length is 0x2000, so two RDMA memory segments of length 0x1000 are constructed. After the buffer space is allocated, iovs[0].base is 0x7FFD9E9C8000, iovs[0].iov_len is 0x1000, iovs[1].base is 0x7FFD9E9D4000, iovs[1].iov_len is 0x1000, and iov_cnt is 2; pd is 0xBC63B0, device_ctx is 0xBDB7B0, rkeys[0] is 0xC3389, and rkeys[1] is 0xC3389. The RDMA memory segment information iovs, the remote access rights rkeys, the RDMA device context information device_ctx, the protection domain pd in which the RDMA memory segments are located, the target data address offset, and the target data length len are reconstructed into a private protocol communication request flag_request, which is sent to the virtual block device layer for processing.
(2) After receiving the private protocol communication request flag_request, the virtual block device layer reads its magic_number field. If magic_number is equal to MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request, and the data length read so far, cpl_len, is set to 0.
(3) If the first check finds the state variable InitFlag to be False, an attempt is made to acquire the lock variable InitLock. If the lock is acquired and the second check still finds InitFlag to be False, the second-stage initialization of the private protocol communication connection is performed using the pd and device_ctx of the private protocol communication request, that is, a communication queue pair is constructed based on the protection domain pd 0xBC63B0 and the RDMA network device context device_ctx 0xBDB7B0, and InitFlag is set to True once the initialization is complete. If the second check finds InitFlag to be True, the initialization is skipped and the lock variable is released to the unlocked state. If acquiring the lock fails, the current thread is suspended until the lock is acquired, at which point the check of InitFlag must return True, so the initialization is bypassed. If the first check finds InitFlag to be True, the initialization is skipped directly.
(4) After the double check, the RDMA memory segments are processed one by one. The buffer pointed to by iovs[0].base has length 0x1000, the target address is 0x3FFFF000, and a Chunk fragment boundary is crossed. Sub-request 1 is generated with id 0, offset_inchunk 0x3FFFF000, len_inchunk 0x1000, remote address remote_addr 0x7FFD9E9C8000, and remote access right rkey 0xC3389. Sub-request 2 is generated with id 1, offset_inchunk 0, len_inchunk 0x1000, remote address remote_addr 0x7FFD9E9D4000, and remote access right rkey 0xC3389. The sub-requests are sent in parallel to the target storage nodes for processing.
(5) The storage node that receives sub-request 1 parses the Chunk fragment id in the request, converts it into an SSD address, and reads the data into local RDMA memory. The communication queue pair then writes the data to remote_addr 0x7FFD9E9C8000 using the remote access right rkey 0xC3389. Because the receiving end of the communication queue pair and remote_addr are registered in the same protection domain pd 0xBC63B0, the write operation succeeds. Sub-request 2 is processed in the same way.
(6) After the storage node writes the data to remote_addr 0x7FFD9E9C8000, the callback of the private protocol communication request flag_request is triggered and the cpl_len field is updated to 0x1000 in the callback. After the storage node writes the data to remote_addr 0x7FFD9E9D4000, the second callback of flag_request is triggered and the cpl_len field is updated to 0x2000 in the callback.
(7) When cpl_len in the private protocol communication request flag_request is updated to 0x2000, that is, equal to the target data length len, the request is completed, and at this time, hierarchical feedback is performed to the virtual block device layer, so that the read operation is completed.
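The mapping of RDMA memory segments to sub-requests in step (4) can be reproduced with a short sketch. It assumes that the fragment id is the target address divided by ChunkSize and that offset_inchunk is the remainder, which matches the numbers in this example but is not stated as a rule in the text. The names ChunkRequest and split_request are hypothetical.

```python
from dataclasses import dataclass

CHUNK_SIZE = 1 << 30  # ChunkSize of 1 GB, as in this embodiment


@dataclass
class ChunkRequest:
    id: int
    offset_inchunk: int
    len_inchunk: int
    remote_addr: int
    rkey: int


def split_request(iovs, rkeys, offset, chunk_size=CHUNK_SIZE):
    """Map each (base, length) memory segment onto chunk-aligned sub-requests."""
    subs = []
    pos = offset  # running target address on the virtual volume
    for (base, length), rkey in zip(iovs, rkeys):
        done = 0
        while done < length:
            chunk_id, off_in_chunk = divmod(pos, chunk_size)
            # Never let one sub-request run past the end of its chunk.
            n = min(length - done, chunk_size - off_in_chunk)
            subs.append(ChunkRequest(chunk_id, off_in_chunk, n, base + done, rkey))
            done += n
            pos += n
    return subs


# Values from the read example: two 0x1000 segments starting at offset 0x3FFFF000.
read_subs = split_request(
    [(0x7FFD9E9C8000, 0x1000), (0x7FFD9E9D4000, 0x1000)],
    [0xC3389, 0xC3389],
    0x3FFFF000,
)
```

With these inputs the first segment ends exactly at the 1 GB boundary, so each segment yields one sub-request and no segment has to be split.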
In this embodiment, ChunkSize is 1 GB and the maximum NVMe-oF protocol transport block is set to 128 KB. In the original request, iovs[0].base is 0x7FFDA0E91000, iovs[0].iov_len is 0x4000, and iov_cnt is 1; pd is NULL and device_ctx is NULL; offset is 0x3FFFF000 and len is 0x4000; magic_number is 0xABCD1234 and rkeys[0] is 0. As shown in FIG. 4, the flow of one data write operation is as follows:
(1) When the RDMA memory is registered, the IBV_ACCESS_REMOTE_WRITE and IBV_ACCESS_REMOTE_READ flags are added, extending it into remotely accessible RDMA memory. When the original request is constructed in the NVMe-oF standard protocol network layer, pd is 0xBD72B0, device_ctx is 0xBEC6B0, and rkeys[0] is 0x84E83. The RDMA memory segment information iovs, the remote access rights rkeys, the RDMA device context information device_ctx, the protection domain pd in which the RDMA memory segment is located, the target data address offset, and the target data length len are reconstructed into a private protocol communication request flag_request, which is sent to the virtual block device layer for processing.
(2) After receiving the private protocol communication request flag_request, the virtual block device layer reads its magic_number field. If magic_number is equal to MAGIC_NUMBER (0xABCD1234), the request is a normal read-write request, and the data length completed so far, cpl_len, is set to 0.
(3) If the first check finds the state variable InitFlag to be False, an attempt is made to acquire the lock variable InitLock. If the lock is acquired and the second check still finds InitFlag to be False, the second-stage initialization of the private protocol communication connection is performed using the pd and device_ctx of the private protocol communication request, that is, a communication queue pair is constructed based on the protection domain pd 0xBD72B0 and the RDMA network device context device_ctx 0xBEC6B0, and InitFlag is set to True once the initialization is complete. If the second check finds InitFlag to be True, the initialization is skipped and the lock variable is released to the unlocked state. If acquiring the lock fails, the current thread is suspended until the lock is acquired, at which point the check of InitFlag must return True, so the initialization is bypassed. If the first check finds InitFlag to be True, the initialization is skipped directly.
(4) After the double check, the RDMA memory segments are processed one by one. The buffer pointed to by iovs[0].base has length 0x4000 and the target address is 0x3FFFF000, so a Chunk fragment boundary is crossed. Sub-request 1 is generated with id 0, offset_inchunk 0x3FFFF000, len_inchunk 0x1000, remote address remote_addr 0x7FFDA0E91000, and remote access right rkey 0x84E83. Sub-request 2 is generated with id 1, offset_inchunk 0, len_inchunk 0x3000, remote address remote_addr 0x7FFDA0E92000, and remote access right rkey 0x84E83. The sub-requests are sent in parallel to the target storage nodes for processing.
(5) On the storage node that receives sub-request 1, the communication queue pair reads the data from remote_addr 0x7FFDA0E91000 into local RDMA memory using the remote access right rkey 0x84E83. Because the receiving end of the communication queue pair and remote_addr are registered in the same protection domain pd 0xBD72B0, the remote read operation succeeds. The Chunk fragment id in the request is then parsed and converted into an SSD address, and the data is written to the SSD. Sub-request 2 is processed in the same way.
(6) After writing the data corresponding to 0x7FFDA0E91000 to the SSD, the storage node triggers the callback of the private protocol communication request flag_request and updates cpl_len to 0x1000. After writing the data corresponding to 0x7FFDA0E92000 to the SSD, the storage node triggers the second callback of flag_request and updates cpl_len to 0x4000.
(7) When cpl_len in the private protocol communication request flag_request is updated to 0x4000, that is, equal to the data length len to be written, the request is complete; at this point hierarchical feedback is returned to the virtual block device layer, and the write operation is finished.
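The completion accounting in steps (6) and (7) can be sketched as a small counter that fires once cpl_len reaches len. The class FlagRequest, its method sub_request_done, and the on_complete hook are hypothetical names; the lock is an added assumption to keep the update safe if callbacks arrive from different threads.

```python
import threading


class FlagRequest:
    """Hypothetical completion tracker for one private protocol request."""

    def __init__(self, length, on_complete):
        self.len = length            # target data length
        self.cpl_len = 0             # data length completed so far
        self.on_complete = on_complete
        self.completed = False
        self._lock = threading.Lock()

    def sub_request_done(self, len_inchunk):
        """Callback invoked when one sub-request has finished."""
        with self._lock:
            self.cpl_len += len_inchunk
            finished = self.cpl_len == self.len and not self.completed
            if finished:
                self.completed = True
        if finished:
            # Feed the result back to the upper layer exactly once.
            self.on_complete(self)


events = []
req = FlagRequest(0x4000, lambda r: events.append(r.cpl_len))
req.sub_request_done(0x1000)   # sub-request 1 of the write example
req.sub_request_done(0x3000)   # sub-request 2 of the write example
```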
In summary, the prior art duplicates RDMA memory management functions where the NVMe-oF standard protocol is bridged to the private protocol communication connection, and incurs data movement on the key I/O path because of the fine-grained protection domain restriction. The present invention, through an incremental extension of NVMe-oF, opens a metadata access tunnel and a data access tunnel for the private protocol communication connection without affecting the original working semantics. The metadata access tunnel adds and maintains the remote access right rkey of each RDMA memory segment and maps the discrete RDMA memory segments directly to the discrete actual storage nodes, which eliminates the redundant management mechanism for RDMA memory with remote access rights, reduces the number of RDMA memory allocations and releases, and reduces the processing latency of the system. The data access tunnel maintains the protection domain in which the communication queue pair and the registered memory are located together with the context information of the RDMA network device, and customizes the two-stage initialization of the private protocol with a double-checked lazy initialization of the second stage. This solves the problem that the RDMA memory of the standard protocol cannot be accessed because the communication queue pair of the private protocol is in a different protection domain, eliminates data movement on the key I/O path, reduces the processing latency of system requests, and improves system performance.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.