CN109144406B

CN109144406B - Metadata storage method, system and storage medium in distributed storage system

Info

Publication number: CN109144406B
Application number: CN201710508014.8A
Authority: CN
Inventors: 饶蓉; 魏明昌
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-06-28
Filing date: 2017-06-28
Publication date: 2020-08-07
Anticipated expiration: 2037-06-28
Also published as: CN111949210B; CN109144406A; CN111949210A; WO2019000949A1

Abstract

In a distributed storage system in which data reliability is provided by metadata stripes built with an erasure coding (EC) algorithm, a primary data storage node backs up the other metadata blocks of the metadata stripe. Because the metadata blocks held on the data storage nodes only need to be backed up on the primary data storage node, less storage space is used than in the prior art, where every metadata block is kept in multiple copies. In addition, a client that accesses the metadata only needs to read the metadata blocks from the primary data storage node, which speeds up metadata access.

Description

Metadata storage method, system and storage medium in distributed storage system

Technical Field

The present invention relates to the field of data storage technologies, and in particular to a metadata storage method, system and storage medium in a distributed storage system.

Background

In a distributed storage system, after a management node stores user data on storage nodes, metadata is generated that records information such as the logical address and physical address of the data, and this metadata must also be stored on the storage nodes. A common approach is to scatter the blocks of a metadata stripe across the storage nodes. Reading the metadata then requires reading the blocks of the stripe from each storage node and reassembling them into the metadata stripe, which causes a large amount of data forwarding between storage nodes and degrades performance. Another approach stores the metadata on the storage nodes as multiple copies, which increases storage space overhead.

Summary of the Invention

In a first aspect, an embodiment of the present invention provides a metadata storage scheme in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes all store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M. In the storage scheme, the management node determines the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe. The metadata stripe includes metadata blocks D_A and D_i and check blocks C_r. The management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r. The check storage node CS_r receives and stores C_r. The data storage node DS_i receives and stores D_i and, according to the partition view of the metadata stripe, sends D_i to the primary data storage node DS_A. The primary data storage node DS_A receives and stores D_A and D_i.

In this scheme, with the metadata protected by erasure coding (EC), the primary data storage node DS_A backs up the other metadata blocks D_i of the metadata stripe. Only the metadata blocks D_i on the data storage nodes DS_i need to be backed up on DS_A, so no copies of the check blocks are required, unlike the prior art in which every metadata block is kept in multiple copies, and storage space is reduced. In addition, when a client accesses the metadata, all metadata blocks can be read from the primary data storage node DS_A, which increases metadata access speed. The distributed storage system of this scheme may be a distributed file system, a distributed object storage system or distributed block device storage.
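As a rough illustration of the space argument above, the following sketch counts how many blocks one stripe occupies under this scheme and under a baseline in which every block is duplicated. The two-copy baseline and both function names are assumptions made for the comparison; the text only states that the prior art keeps all metadata blocks in multiple copies.

```python
def blocks_stored_primary_backup(n_data: int, m_check: int) -> int:
    """Blocks stored per stripe when only D_i (i != A) is copied to DS_A."""
    if n_data < 2 or m_check < 1:
        raise ValueError("the scheme requires N >= 2 and M >= 1")
    originals = n_data + m_check      # D_A, every D_i, every C_r
    backups_on_primary = n_data - 1   # one extra copy of each D_i on DS_A
    return originals + backups_on_primary


def blocks_stored_full_copies(n_data: int, m_check: int, copies: int = 2) -> int:
    """Hypothetical baseline: every block, check blocks included, kept `copies` times."""
    return copies * (n_data + m_check)


if __name__ == "__main__":
    for n, m in [(3, 1), (4, 2), (8, 2)]:
        print(n, m, blocks_stored_primary_backup(n, m), blocks_stored_full_copies(n, m))
```

For a 3+1 stripe this gives 6 stored blocks against 8 for the assumed two-copy baseline; the difference is exactly the M check-block copies plus the copy of D_A that the scheme does not keep.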

Optionally, the management node determining the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r.

Optionally, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request.

Optionally, the check storage node CS_r storing C_r specifically includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i specifically includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i specifically includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i.

Further, the management node establishes a mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the copies of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.

In a second aspect, an embodiment of the present invention correspondingly provides a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes all store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M. The distributed storage system is configured to implement the various implementations of the first aspect.

Correspondingly, the present invention further provides a non-volatile computer-readable storage medium and a computer program product. When the memory of a storage device provided by an embodiment of the present invention loads the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product, the computer program instructions can run in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes all store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M. One or more computers that execute the computer program instructions act as the management node, the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r of the distributed storage system, respectively, to implement the various implementations of the first aspect.

The metadata storage schemes disclosed in the first aspect may also be applied to storing the data to which the metadata corresponds. Correspondingly, the distributed storage system of the second aspect and the non-volatile computer-readable storage medium and computer program product of the third aspect are likewise applicable to data storage.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of a distributed block device storage architecture according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a server in the distributed block device storage according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the relationship between a data stripe and a partition view according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a data stripe according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a partition view according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the relationship between a metadata stripe and a partition view according to an embodiment of the present invention;

FIG. 7 is a flowchart of metadata storage according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a metadata stripe according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of metadata storage according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings in the embodiments of the present invention.

Distributed storage systems mainly take the form of distributed file system storage, distributed object storage and distributed block device storage, for example product series from Huawei. The embodiments of the present invention are described using distributed block device storage as an example. As shown by way of example in FIG. 1, the distributed block device storage includes multiple servers, namely server 1, server 2, server 3, server 4, server 5 and server 6, which communicate with one another. In practice, the number of servers in the distributed block device storage may be increased as required, which is not limited in the embodiments of the present invention. Each server of the distributed block device storage has the structure shown in FIG. 2.

As shown in FIG. 2, each server in the distributed block device storage includes a central processing unit (CPU) 201, a memory 202, hard disk 1, hard disk 2 and hard disk 3. The memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform the corresponding operations. The hard disks may be at least one of mechanical hard disks and solid-state drives. To save the computing resources of the CPU 201, a field programmable gate array (FPGA) or other hardware may perform these operations in place of the CPU 201, or may perform them together with the CPU 201. For ease of description, the embodiments of the present invention refer collectively to a processor that implements these operations.

In the structure shown in FIG. 2, when an application is loaded into the memory 202 and the CPU 201 executes the application instructions in the memory 202, the server acts as a client. The application may be a virtual machine (VM) or a specific application such as office software. The client writes data to, or reads data from, the distributed block device storage.

When a storage management program is loaded into the memory 202 and the CPU 201 executes its instructions, with the storage management program serving as a virtual block storage management program, the server acts as a management node. The management node is responsible for managing volume metadata, provides the client with a block protocol access interface, and provides a distributed storage access point service so that the client can access the storage resources of the distributed block device storage through the management node.

When a storage object program is loaded into the memory 202 and the CPU 201 executes its instructions, the server acts as a storage node that performs the actual input/output (I/O) operations. Each server can run multiple storage object program processes. For example, by default one storage object program process runs per hard disk and manages that disk, and each such process running on the server acts as one storage node. In a specific implementation, a single storage object program process on a server may instead correspond to all hard disks of that server. The embodiments of the present invention are described using the case in which one storage object program process manages one hard disk.

When the distributed block device storage is initialized, each storage object program process divides its hard disk into 1 MB slices for management and records the allocation information of each 1 MB slice in the metadata management area of the hard disk. The slices of the hard disks form a storage resource pool. The storage management program communicates point to point with the processes of all storage object programs of the resource pool it can access; that is, the management node communicates with all storage nodes of that resource pool, so the management node can access all hard disks of the resource pool concurrently.

When the distributed block device storage is initialized, the hash space (for example, 0 to 2^32) is divided into N equal parts, each of which is one partition, and the N partitions are distributed evenly over the hard disks. For example, N defaults to 3600 in the distributed block device storage, so the partitions are P1, P2, P3, ..., P3600. As shown in FIG. 3, if the distributed block device storage currently has 18 hard disks (storage nodes), each storage node carries 200 partitions. This correspondence between partitions and storage nodes, the partition view, is assigned when the distributed block device storage is initialized and is adjusted later as the number of hard disks changes. Each server of the distributed block device storage keeps the partition view in the memory 202, and the management node uses it for fast routing. Each storage node also stores all partition views of the distributed block device storage system, that is, the correspondence between every partition and the storage nodes.

Depending on the reliability requirements of the distributed block device storage, an erasure coding (EC) algorithm can be used to improve data reliability. In 3+1 mode, for example, 3 data blocks and 1 check block form a data stripe, as shown in FIG. 4, and the partition view has the form "partition-primary data storage node-data storage node 1-data storage node 2-check storage node"; an example partition view is shown in FIG. 5. The partition view indicates, for a partition, the primary data storage node, the data storage node 1 and data storage node 2 that store the other data blocks of the data stripe, and the check storage node that stores the check data. The primary data storage node is the backup node for the data blocks stored on data storage node 1 and data storage node 2.
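The partitioning and routing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the rotation rule that picks four nodes per partition, the use of SHA-256 as the hash, and the names `build_partition_view` and `partition_of` are all assumptions; only the figures 2^32, 3600 and 18 come from the text.

```python
import hashlib

HASH_SPACE = 2 ** 32
NUM_PARTITIONS = 3600   # default number of partitions given in the text
NUM_NODES = 18          # one storage node per hard disk, as in FIG. 3


def build_partition_view(num_partitions=NUM_PARTITIONS, num_nodes=NUM_NODES):
    """Map partition number -> (primary, data node 1, data node 2, check node).

    Picking four consecutive nodes by rotation is a made-up placement rule.
    """
    view = {}
    for p in range(1, num_partitions + 1):
        base = (p - 1) % num_nodes
        view[p] = tuple((base + k) % num_nodes for k in range(4))
    return view


def partition_of(key: bytes, num_partitions=NUM_PARTITIONS) -> int:
    """Hash a key into [0, 2^32) and return its partition, numbered from 1."""
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return h * num_partitions // HASH_SPACE + 1
```

With these defaults every node is the primary for 200 partitions, and `build_partition_view()[partition_of(key)]` returns the four nodes a stripe with that key would be routed to.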

The distributed block device storage logically divides each logical unit number (LUN) into 1 MB slices; a 1 GB LUN, for example, is divided into 1024 slices of 1 MB. As shown in FIG. 3, when a client sends a write request to a LUN through the management node, the Small Computer System Interface (SCSI) command carries the LUN identifier (ID), the logical block address (LBA) ID and the data to be written. The management node where the client resides receives the write request and builds a key from the LUN ID and the LBA ID; the key contains the result of rounding the LBA ID to a 1 MB boundary. A distributed hash table (DHT) hash of the key yields an integer in the range 0 to 2^32, which falls into a particular partition. The management node then uses the partition view recorded in the memory 202 to determine the primary data storage node, data storage node 1, data storage node 2 and the check storage node, and sends data block 1, data block 2, data block 3 and check block 1 of the EC data stripe to the primary data storage node, data storage node 1, data storage node 2 and the check storage node, respectively. The primary data storage node stores data block 1, data storage node 1 stores data block 2, data storage node 2 stores data block 3, and the check storage node stores check block 1. Data storage nodes 1 and 2 each determine the primary data storage node from the partition view; data storage node 1 backs up data block 2 to the primary data storage node, data storage node 2 backs up data block 3 to the primary data storage node, and the primary data storage node stores data block 2 and data block 3.

In a specific implementation, the primary data storage node allocates slice 1 for data block 1 from the hard disk it manages and establishes a mapping between the identifier of data block 1 and slice 1; data storage node 1 allocates slice 2 for data block 2 from the hard disk it manages and establishes a mapping between the identifier of data block 2 and slice 2; data storage node 2 allocates slice 3 for data block 3 from the hard disk it manages and establishes a mapping between the identifier of data block 3 and slice 3; and the check storage node allocates slice 4 for check block 1 from the hard disk it manages and establishes a mapping between the identifier of check block 1 and slice 4. The primary data storage node receives data block 2 from data storage node 1 and data block 3 from data storage node 2, allocates slice 5 and slice 6 from the hard disk it manages, and establishes a mapping between the identifier of data block 2 and slice 5 and a mapping between the identifier of data block 3 and slice 6.

Regarding the mapping between a data block identifier and a slice: when one storage object program process corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping is between the identifier of the data block and the physical address of the slice. When one storage object program process corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping comprises a mapping from the identifier of the data block to the hard disk that stores it and a mapping from that hard disk to the slice.

Further, data block 2 is stored in slice 2 and slice 5, and data block 3 is stored in slice 3 and slice 6. The management node establishes and saves a mapping between the identifier of data block 2 and both data storage node 1 and the primary data storage node, and a mapping between the identifier of data block 3 and both data storage node 2 and the primary data storage node. Likewise, data storage node 1 saves the mapping between the identifier of data block 2 and both data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of data block 3 and both data storage node 2 and the primary data storage node. When garbage collection is performed on a data stripe, the management node can use the mapping between the identifiers of the data blocks in the stripe and the storage nodes to reclaim the copies of a data block on both the data storage node and the primary data storage node, which improves the efficiency of data reclamation.
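The key construction in the write path can be sketched as below. The text only says that the key is formed from the LUN ID and the LBA ID and includes the LBA rounded to 1 MB; treating the LBA as a byte offset, the `"lun:slice"` string format, and the name `make_key` are assumptions made for illustration.

```python
SLICE_SIZE = 1024 * 1024  # 1 MB logical slice


def make_key(lun_id: int, lba_offset: int) -> str:
    """Combine the LUN ID with the index of the 1 MB slice containing the offset."""
    if lba_offset < 0:
        raise ValueError("offset must be non-negative")
    slice_index = lba_offset // SLICE_SIZE   # round down to the 1 MB boundary
    return f"{lun_id}:{slice_index}"
```

All offsets inside the same 1 MB slice of a LUN yield the same key and therefore hash to the same partition, while a 1 GB LUN produces 1024 distinct keys.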

In the embodiments of the present invention, when a client sends a write request to the distributed block device storage to write data, metadata is generated that records, among other things, the logical address and physical address of the data. The metadata corresponding to the data is stored using the same EC algorithm as the data. A metadata stripe formed with the EC algorithm has the same partition view as the data stripe formed with the EC algorithm described above, as shown in FIG. 6.

Metadata is stored in a distributed storage system that includes a management node and (M+N) storage nodes, where the management node and the (M+N) storage nodes all store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M. The distributed storage system performs the procedure shown in FIG. 7:

Step 701: The management node determines the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe. The metadata stripe includes metadata blocks D_A and D_i and check blocks C_r.

Specifically, the management node determining the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r.

Specifically, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request. For details, see the procedure described above for how the distributed block device storage handles a write request sent by a client; they are not repeated here.

Step 702: The management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r.

Step 703: The check storage node CS_r receives and stores C_r.

Step 704: The data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe.

Step 705: The primary data storage node DS_A receives and stores D_A and D_i.
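Steps 702 to 705 can be traced with a small in-memory simulation. This is a sketch under simplifying assumptions: nodes are plain Python objects rather than networked processes, the `view` dictionary stands for the result of step 701, and the block identifiers `D_A`, `D_1`, `C_1` and the names `StorageNode` and `store_metadata_stripe` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    name: str
    blocks: dict = field(default_factory=dict)   # block id -> payload

    def store(self, block_id, payload):
        self.blocks[block_id] = payload


def store_metadata_stripe(view, nodes, data_blocks, check_blocks):
    """Run steps 702-705 for one stripe.

    view: {"primary": name, "data": [names], "check": [names]} from step 701.
    data_blocks: payloads; index 0 is D_A, the rest are D_i in view["data"] order.
    check_blocks: payloads C_r in view["check"] order.
    """
    primary = nodes[view["primary"]]
    # Step 702 and first half of 705: D_A goes to DS_A.
    primary.store("D_A", data_blocks[0])
    # Steps 702 and 703: each C_r goes to CS_r only; it is never copied.
    for r, (name, payload) in enumerate(zip(view["check"], check_blocks), start=1):
        nodes[name].store(f"C_{r}", payload)
    # Steps 702 and 704: each D_i goes to DS_i, which forwards a copy to DS_A.
    for i, (name, payload) in enumerate(zip(view["data"], data_blocks[1:]), start=1):
        nodes[name].store(f"D_{i}", payload)
        primary.store(f"D_{i}", payload)   # step 705: DS_A stores the backup
    return primary
```

After the call, the primary node holds every metadata block of the stripe, each data storage node holds only its own block, and each check storage node holds only its check block.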

Specifically, the check storage node CS_r storing C_r includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. Further, the management node establishes a mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. Further, the data storage node DS_i saves the mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
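The slice bookkeeping and the garbage collection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the classes `StorageNode` and `ManagementNode`, their methods, and the slice numbers are hypothetical and chosen only for illustration, and the check block is left out for brevity.

```python
class StorageNode:
    """Allocates a slice per stored block and records identifier -> slice."""

    def __init__(self, name, free_slices):
        self.name = name
        self.free_slices = list(free_slices)
        self.block_to_slice = {}

    def store(self, block_id):
        slice_id = self.free_slices.pop(0)       # allocate a slice
        self.block_to_slice[block_id] = slice_id
        return slice_id

    def reclaim(self, block_id):
        slice_id = self.block_to_slice.pop(block_id)
        self.free_slices.append(slice_id)        # the slice becomes reusable
        return slice_id


class ManagementNode:
    """Records, per block identifier, every node that holds a copy."""

    def __init__(self):
        self.block_to_nodes = {}

    def record(self, block_id, nodes):
        self.block_to_nodes[block_id] = list(nodes)

    def collect_stripe(self, block_ids):
        """Reclaim every copy of every listed block of the stripe."""
        freed = []
        for block_id in block_ids:
            for node in self.block_to_nodes.pop(block_id):
                freed.append((node.name, block_id, node.reclaim(block_id)))
        return freed


ds_a = StorageNode("DS_A", free_slices=[1, 2])
ds_i = StorageNode("DS_i", free_slices=[3])
mgmt = ManagementNode()

ds_a.store("D_A")
mgmt.record("D_A", [ds_a])
ds_i.store("D_i")
ds_a.store("D_i")                  # backup copy on the primary node
mgmt.record("D_i", [ds_i, ds_a])

freed = mgmt.collect_stripe(["D_A", "D_i"])
print(freed)
```

Because the management node knows both locations of D_i, a single pass over its mapping frees the original copy and the backup together, without scanning the nodes.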

In this embodiment of the present invention, in combination with the distributed block device storage and the data storage manner described above, as shown in FIG. 8, the metadata stripe formed with the EC algorithm contains metadata blocks D₁, D₂ and D₃ and a check block C₁. The management node where the client is located determines the primary data storage node, data storage node 1, data storage node 2 and the check storage node according to the partition view "partition-primary data storage node-data storage node 1-data storage node 2-check storage node" recorded in the memory 202. The partition view indicates that the partition corresponds to the primary data storage node, to data storage node 1 and data storage node 2, which store the other data blocks of the metadata stripe, and to the check storage node, which stores the check data. It also indicates that the backup data storage node for the metadata blocks stored on data storage node 1 and data storage node 2 is the primary data storage node.

The management node sends D₁, D₂, D₃ and C₁ of the EC-based metadata stripe to the primary data storage node, data storage node 1, data storage node 2 and the check storage node, respectively. The primary data storage node receives and stores D₁, data storage node 1 receives and stores D₂, data storage node 2 receives and stores D₃, and the check storage node receives and stores C₁. Data storage nodes 1 and 2 each determine the primary data storage node according to the partition view; data storage node 1 backs up D₂ to the primary data storage node, data storage node 2 backs up D₃ to the primary data storage node, and the primary data storage node receives and stores D₂ and D₃.

In a specific implementation, as shown in FIG. 9, the primary data storage node allocates slice 7 for D₁ from the hard disks it manages and establishes a mapping between the identifier of D₁ and slice 7; data storage node 1 allocates slice 8 for D₂ from the hard disks it manages and establishes a mapping between the identifier of D₂ and slice 8; data storage node 2 allocates slice 9 for D₃ from the hard disks it manages and establishes a mapping between the identifier of D₃ and slice 9; and the check storage node allocates slice 10 for C₁ from the hard disks it manages and establishes a mapping between the identifier of C₁ and slice 10. The primary data storage node receives D₂ sent by data storage node 1 and D₃ sent by data storage node 2, allocates slice 11 and slice 12 from the hard disks it manages, and establishes a mapping between the identifier of D₂ and slice 11 and a mapping between the identifier of D₃ and slice 12. Taking the mapping between the identifier of a metadata block and a slice as an example: when one process of the storage object program corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping is between the identifier of the metadata block and the physical address of the slice; when one process of the storage object program corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping includes a mapping between the identifier of the metadata block and the hard disk that stores it, and a mapping from that hard disk to the slice.

Further, D₂ is stored in slice 8 and slice 11, and D₃ is stored in slice 9 and slice 12. The management node establishes and saves a mapping between the identifier of D₂ and both data storage node 1 and the primary data storage node, and a mapping between the identifier of D₃ and both data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping between the identifier of D₂ and both data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of D₃ and both data storage node 2 and the primary data storage node. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
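The two forms of the identifier-to-slice mapping described for FIG. 9 can be illustrated for the primary data storage node, which holds D₁ in slice 7, D₂ in slice 11 and D₃ in slice 12. This is only a sketch: the physical addresses, the disk names `disk-0` and `disk-1`, the assignment of slices to disks, and the helper `locate` are assumptions made for the example and do not come from the patent.

```python
# Case 1: one process per hard disk, so the storage node is the disk itself.
# The mapping goes straight from block identifier to the slice's physical
# address (addresses here are invented for illustration).
single_disk_map = {"D1": 0x7000, "D2": 0xB000, "D3": 0xC000}

# Case 2: one process manages several hard disks. The mapping has two levels:
# block identifier -> hard disk, then hard disk -> slice for that block.
block_to_disk = {"D1": "disk-0", "D2": "disk-1", "D3": "disk-1"}
disk_to_slice = {
    "disk-0": {"D1": 7},
    "disk-1": {"D2": 11, "D3": 12},
}


def locate(block_id):
    """Resolve a block identifier to (hard disk, slice) in the two-level case."""
    disk = block_to_disk[block_id]
    return disk, disk_to_slice[disk][block_id]


# Reading the whole stripe's metadata only touches the primary node's maps.
locations = {block_id: locate(block_id) for block_id in ("D1", "D2", "D3")}
print(locations)
```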

Therefore, in a scenario where data reliability is achieved with a metadata stripe formed by the EC algorithm, the primary data storage node backs up the other metadata blocks of the metadata stripe. Because the metadata blocks on the data storage nodes only need to be backed up on the primary data storage node, less storage space is used than in the prior art, where all metadata blocks are kept as multiple copies. In addition, when a client accesses the metadata, it only needs to read all the metadata blocks from the primary data storage node, which improves metadata access speed.
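The space comparison can be made concrete with a small count of stored blocks. This is a rough sketch under stated assumptions: the patent does not say how many copies the multi-copy prior art keeps, so three copies is assumed here, and the function names are hypothetical.

```python
def blocks_with_primary_backup(n_data, m_check):
    """N metadata blocks + M check blocks + (N - 1) backups on DS_A."""
    return n_data + m_check + (n_data - 1)


def blocks_with_full_replication(n_data, copies):
    """Every metadata block kept as `copies` full copies (assumed prior art)."""
    return n_data * copies


n, m = 3, 1  # the stripe of FIG. 8: D1, D2, D3 and C1
print(blocks_with_primary_backup(n, m))     # 6
print(blocks_with_full_replication(n, 3))   # 9 (three copies is an assumption)
```

For the stripe of FIG. 8 this matches the six slices 7 to 12 allocated in FIG. 9. The saving depends on the assumed copy count and on M, so the figures above illustrate the argument rather than quantify it in general.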

Embodiments of the present invention further provide a non-volatile computer-readable storage medium and a computer program product. The non-volatile computer-readable storage medium and the computer program product contain computer program instructions; a CPU executes these instructions, loaded in memory, to implement the functions of the management node and the storage nodes (the primary data storage node, the data storage node and the check storage node) in the embodiments of the present invention.

The descriptions given in the embodiments of the present invention are exemplary. "Slice 1", "Slice 2", ..., "Slice 12" and so on in the embodiments are not intended to strictly define an order; they serve only to distinguish different slices. A slice in the embodiments may be a physical block in a hard disk or the like. As described above, the hard disk in the embodiments may be at least one of a mechanical disk and a solid-state disk. The hard disk corresponding to a process of the storage object program may also be a storage array or the like, which is not limited in the embodiments of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the division of units in the apparatus embodiments described above is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

Claims

1. A metadata storage method in a distributed storage system, characterized in that the distributed storage system comprises a management node and (M + N) storage nodes, wherein the management node and the (M + N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M; the method comprises the following steps:

the management node determines the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe; the metadata stripe contains metadata blocks D_A and D_i and a check block C_r;

the management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r;

the check storage node CS_r receives and stores C_r;

the data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;

the primary data storage node DS_A receives and stores D_A and D_i.

2. The method of claim 1, wherein the management node determining the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe specifically comprises:

the management node determines a partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe;

the management node queries the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r.

3. The method according to claim 2, wherein the management node determines the partition corresponding to the metadata stripe according to an address carried by the write request.

4. The method of claim 1, wherein the check storage node CS_r storing C_r specifically comprises: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping relationship between the identifier of C_r and the slice S_r;

the data storage node DS_i storing D_i specifically comprises: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i;

the primary data storage node DS_A storing D_A and D_i specifically comprises: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping relationship between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i.

5. A distributed storage system, comprising a management node and (M + N) storage nodes, wherein the management node and the (M + N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M;

the management node is configured to determine the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe, wherein the metadata stripe contains metadata blocks D_A and D_i and a check block C_r, and to send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the check storage node CS_r;

the check storage node CS_r is configured to receive and store C_r;

the data storage node DS_i is configured to receive and store D_i, and to send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;

the primary data storage node DS_A is configured to receive and store D_A and D_i.

6. The system of claim 5, wherein the management node is specifically configured to determine the partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r.

7. The system according to claim 6, wherein the management node is further configured to determine, according to an address carried by the write request, a partition corresponding to the metadata stripe.

8. The system of claim 5, wherein the check storage node CS_r is specifically configured to allocate a slice S_r for C_r and establish a mapping relationship between the identifier of C_r and the slice S_r;

the data storage node DS_i is specifically configured to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i;

the primary data storage node DS_A is specifically configured to allocate a slice SD_A for D_A and establish a mapping relationship between the identifier of D_A and the slice SD_A, and to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i.

9. A non-transitory readable storage medium containing computer program instructions executable in a distributed storage system, the distributed storage system comprising a management node and (M + N) storage nodes, each of which stores a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M; when the computer program instructions are executed by one or more computers, the one or more computers, acting as the management node, are configured to determine the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe, wherein the metadata stripe contains metadata blocks D_A and D_i and a check block C_r, and to send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the check storage node CS_r; the one or more computers, acting as the check storage node CS_r, are configured to receive and store C_r;

the one or more computers, acting as the data storage node DS_i, are configured to receive and store D_i, and to send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;

the one or more computers, acting as the primary data storage node DS_A, are configured to receive and store D_A and D_i.

10. The storage medium of claim 9, further comprising computer program instructions that cause the one or more computers, acting as the management node, to determine a partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i and the check storage node CS_r.