
CN109144406B - Metadata storage method, system and storage medium in distributed storage system - Google Patents


Info

Publication number
CN109144406B
Authority
CN
China
Prior art keywords
storage node
data storage
metadata
node
stripe
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710508014.8A
Other languages
Chinese (zh)
Other versions
CN109144406A (en
Inventor
饶蓉
魏明昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201710508014.8A (CN109144406B)
Priority to CN202010648620.1A (CN111949210B)
Priority to PCT/CN2018/075077 (WO2019000949A1)
Publication of CN109144406A
Application granted
Publication of CN109144406B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1464 Management of the backup or restore process for networked environments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In a distributed storage system in which metadata reliability is provided by metadata stripes formed with an erasure coding (EC) algorithm, the primary data storage node backs up the other metadata blocks of the metadata stripe. Because the metadata blocks on the data storage nodes need to be backed up only on the primary data storage node, less storage space is used than in the prior art, in which every metadata block is stored as multiple copies. In addition, when a client accesses the metadata, all metadata blocks can be read from the primary data storage node alone, which improves metadata access speed.

Description

Metadata storage method, system and storage medium in distributed storage system

Technical Field

The present invention relates to the technical field of data storage, and in particular to a metadata storage method, system, and storage medium in a distributed storage system.

Background

In a distributed storage system, after the management node stores user data on the storage nodes, metadata that records the logical address, physical address, and similar information of the data is generated, and this metadata must also be stored on the storage nodes. A common approach is to scatter the blocks of a metadata stripe across the storage nodes. Reading the metadata then requires reading the blocks of the metadata stripe from each storage node and assembling them into the stripe, so a large amount of data is forwarded between storage nodes, which degrades performance. Another approach stores the metadata on the storage nodes as multiple copies, but this increases storage space overhead.

Summary of the Invention

In a first aspect, an embodiment of the present invention provides a metadata storage scheme in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. In the storage scheme, the management node determines the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe. The metadata stripe includes metadata blocks D_A and D_i and check blocks C_r; the management node sends D_i to the data storage node DS_i, D_A to the primary data storage node DS_A, and C_r to the check storage node CS_r. The check storage node CS_r receives and stores C_r. The data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe. The primary data storage node DS_A receives and stores D_A and D_i.

In this scheme, with the metadata protected by erasure coding (EC), the primary data storage node DS_A backs up the other metadata blocks D_i of the metadata stripe. Because only the metadata blocks D_i on the data storage nodes DS_i need to be backed up on the primary data storage node DS_A, no copies of the check blocks are needed, which reduces storage space compared with the prior art, in which all metadata blocks are stored as multiple copies. In addition, when a client accesses the metadata, all metadata blocks can be read from the primary data storage node DS_A, which improves metadata access speed. The distributed storage system of this scheme may be a distributed file system, a distributed object storage system, or distributed block device storage.
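The write path of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `StorageNode` and `write_stripe` and the in-memory dictionaries are hypothetical, the data blocks are simply numbered 1 to N-1 after D_A, and the check block is passed in as opaque bytes rather than computed by a real EC algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical in-memory stand-in for one storage node."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> payload

    def store(self, block_id, payload):
        self.blocks[block_id] = payload


def write_stripe(view, data_blocks, check_blocks):
    """Distribute one metadata stripe according to its partition view.

    view: {"primary": DS_A, "data": [DS_i, ...], "check": [CS_r, ...]}
    data_blocks: [D_A, D_i, ...] with D_A first; check_blocks: [C_r, ...]
    """
    primary = view["primary"]
    # The management node sends D_A to the primary data storage node.
    primary.store("D_A", data_blocks[0])
    # Each D_i goes to its data storage node DS_i ...
    for i, (node, payload) in enumerate(zip(view["data"], data_blocks[1:]), 1):
        node.store(f"D_{i}", payload)
        # ... and DS_i forwards a backup of D_i to the primary DS_A.
        primary.store(f"D_{i}", node.blocks[f"D_{i}"])
    # Check blocks are stored only on the check storage nodes (no backup).
    for r, (node, payload) in enumerate(zip(view["check"], check_blocks), 1):
        node.store(f"C_{r}", payload)


# 3+1 example: N = 3 metadata blocks, M = 1 check block.
ds_a = StorageNode("DS_A")
ds = [StorageNode("DS_1"), StorageNode("DS_2")]
cs = [StorageNode("CS_1")]
view = {"primary": ds_a, "data": ds, "check": cs}
write_stripe(view, [b"meta-a", b"meta-1", b"meta-2"], [b"parity"])

stored = sum(len(n.blocks) for n in [ds_a, *ds, *cs])
```

In this sketch the stripe occupies 2N - 1 + M blocks in total (6 for N = 3, M = 1), and a reader can obtain D_A, D_1, and D_2 from DS_A alone.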

Optionally, the management node determining the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r.

Optionally, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request.

Optionally, the check storage node CS_r storing C_r specifically includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i specifically includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i specifically includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i.
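The slice allocation and the mapping from block identifier to slice can be illustrated with a small sketch. The class `SliceStore`, its first-free allocation policy, and the integer slice numbers are assumptions made for illustration; the source does not specify how a node picks a free slice.

```python
class SliceStore:
    """Hypothetical storage node that hands out fixed-size slices."""

    def __init__(self, slice_count):
        self.free = list(range(slice_count))   # unallocated slice numbers
        self.block_to_slice = {}               # block identifier -> slice
        self.slices = {}                       # slice number -> payload

    def store(self, block_id, payload):
        """Allocate a slice for the block and record identifier -> slice."""
        if block_id in self.block_to_slice:
            raise KeyError(f"{block_id} already stored")
        if not self.free:
            raise RuntimeError("no free slice")
        slice_no = self.free.pop(0)
        self.block_to_slice[block_id] = slice_no
        self.slices[slice_no] = payload
        return slice_no

    def read(self, block_id):
        """Resolve the identifier to its slice and return the payload."""
        return self.slices[self.block_to_slice[block_id]]


primary = SliceStore(slice_count=4)
primary.store("D_A", b"meta-a")      # slice SD_A on the primary node
primary.store("D_1", b"meta-1")      # backup copy of D_1 on the primary node
```

Reads go through the same mapping, so `primary.read("D_1")` returns the backed-up block without contacting DS_1.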

Further, the management node establishes a mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on the metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
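A sketch of how such a mapping could drive garbage collection is shown below. The class `ManagementIndex`, its methods, and the modelling of node contents as plain dictionaries are hypothetical; the source only states that the management node keeps the mapping and uses it to reclaim both copies.

```python
from collections import defaultdict


class ManagementIndex:
    """Hypothetical management-node index: block identifier -> node names."""

    def __init__(self):
        self.locations = defaultdict(set)

    def record(self, block_id, *node_names):
        self.locations[block_id].update(node_names)

    def collect(self, block_ids, nodes):
        """Reclaim every stored copy of the given blocks; return the count."""
        reclaimed = 0
        for block_id in block_ids:
            for name in self.locations.pop(block_id, set()):
                if nodes[name].pop(block_id, None) is not None:
                    reclaimed += 1
        return reclaimed


# Node contents modelled as plain dicts: node name -> {block id: payload}.
nodes = {
    "DS_A": {"D_A": b"a", "D_1": b"1"},
    "DS_1": {"D_1": b"1"},
}
index = ManagementIndex()
index.record("D_A", "DS_A")
index.record("D_1", "DS_1", "DS_A")   # D_1 lives on DS_1, its backup on DS_A

reclaimed = index.collect(["D_1"], nodes)
```

Because the index already names both locations of D_1, the collector does not need to query the storage nodes to find the backup copy.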

In a second aspect, an embodiment of the present invention correspondingly provides a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system is configured to implement the various implementations of the first aspect.

Correspondingly, the present invention further provides a non-volatile computer-readable storage medium and a computer program product. When the memory of the storage device provided by the embodiments of the present invention loads the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product, the computer program instructions can run in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. When one or more computers execute the computer program instructions, they act respectively as the management node, the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r of the distributed storage system to implement the various implementations of the first aspect.

The metadata storage schemes disclosed in the first aspect can also be applied to storing the data that corresponds to the metadata. Correspondingly, the distributed storage system of the second aspect and the non-volatile computer-readable storage medium and computer program product of the third aspect are likewise applicable to data storage.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of a distributed block device storage architecture according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a server in distributed block device storage according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the relationship between a data stripe and a partition view according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a data stripe according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a partition view according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of the relationship between a metadata stripe and a partition view according to an embodiment of the present invention;

FIG. 7 is a flowchart of metadata storage according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a metadata stripe according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of metadata storage according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings in the embodiments of the present invention.

Distributed storage systems mainly take the form of distributed file system storage, distributed object storage, and distributed block device storage, for example the Huawei product series whose names appear only as images in the original (Figure BDA0001335108050000031 and Figure BDA0001335108050000032). The embodiments of the present invention are described using distributed block device storage as an example. For example, as shown in FIG. 1, the distributed block device storage includes multiple servers: server 1, server 2, server 3, server 4, server 5, and server 6, which communicate with one another. In practice, the number of servers in the distributed block device storage may be increased as required, which is not limited in the embodiments of the present invention. Each server of the distributed block device storage has the structure shown in FIG. 2.

As shown in FIG. 2, each server in the distributed block device storage includes a central processing unit (CPU) 201, a memory 202, hard disk 1, hard disk 2, and hard disk 3. The memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform the corresponding operations. A hard disk may be at least one of a mechanical hard disk and a solid-state drive. In addition, to save computing resources of the CPU 201, a field programmable gate array (FPGA) or other hardware may perform the above operations in place of the CPU 201, or the FPGA or other hardware may perform them jointly with the CPU 201. For ease of description, the embodiments of the present invention refer to all of these collectively as a processor that implements the corresponding operations.

In the structure shown in FIG. 2, when an application program is loaded into the memory 202 and the CPU 201 executes the application program instructions in the memory 202, the server acts as a client. The application program may be a virtual machine (VM) or a specific application such as office software. The client writes data to, or reads data from, the distributed block device storage.

When a storage management program is loaded into the memory 202 and the CPU 201 executes the instructions of the storage management program, which serves as a virtual block storage management program, the server acts as a management node. The management node is responsible for managing volume metadata, provides the client with a block protocol access interface, and provides the client with a distributed storage access point service, so that the client can access the storage resources of the distributed block device storage through the management node.

When a storage object program is loaded into the memory 202 and the CPU 201 executes the storage object program instructions in the memory 202, the server acts as a storage node and performs the actual input/output (I/O) operations. Multiple storage object program processes can run on each server. For example, by default one storage object program process runs per hard disk and manages that hard disk, so each storage object program process running on the server acts as one storage node. In a specific implementation, a single storage object program process on a server may instead correspond to all hard disks on that server. The embodiments of the present invention are described using the example in which one storage object program process manages one hard disk.

When the distributed block device storage is initialized, each storage object program process divides its hard disk into 1 MB slices and records the allocation information of each 1 MB slice in the metadata management area of the hard disk; the slices of the hard disks form a storage resource pool. The storage management program communicates point-to-point with all storage object program processes of the resource pools it can access. That is, the management node communicates with all storage nodes of the resource pools it can access, so the management node can access all hard disks of a resource pool concurrently.

When the distributed block device storage is initialized, the hash space (for example, 0 to 2^32) is divided into N equal parts (N here denotes the number of partitions, not the number of data blocks in a stripe). Each part is one partition, and the N partitions are distributed evenly across the hard disks. For example, N defaults to 3600 in the distributed block device storage, so the partitions are P1, P2, P3, ..., P3600. As shown in FIG. 3, assuming the distributed block device storage currently has 18 hard disks (storage nodes), each storage node carries 200 partitions. The correspondence between partitions and storage nodes, that is, the partition view, is allocated when the distributed block device storage is initialized and is adjusted later as the number of hard disks changes. Each server of the distributed block device storage keeps the partition view in the memory 202, and the management node uses the partition view for fast routing. Each storage node also stores all partition views of the distributed block device storage system, that is, the correspondence between every partition and the storage nodes.

Depending on the reliability requirements of the distributed block device storage, an erasure coding (EC) algorithm can be used to improve data reliability. For example, in 3+1 mode, 3 data blocks and 1 check block form a data stripe, as shown in FIG. 4, and the partition view has the form "partition - primary data storage node - data storage node 1 - data storage node 2 - check storage node"; an example partition view is shown in FIG. 5. The partition view indicates that a partition corresponds to the primary data storage node, to data storage node 1 and data storage node 2, which store the other data blocks of the data stripe, and to the check storage node, which stores the check data. The primary data storage node serves as the backup data storage node for the data blocks stored on data storage node 1 and data storage node 2.
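A partition view of this shape can be sketched as a simple table. The function `build_partition_view` and its round-robin placement are assumptions for illustration only; the source states that partitions are divided equally among the hard disks but does not give the placement rule.

```python
def build_partition_view(partition_count, node_names, data_blocks=3, check_blocks=1):
    """Assign each partition a primary, data, and check node round-robin.

    The round-robin placement is an assumption for illustration; the source
    only states that partitions are divided equally among the hard disks.
    """
    width = data_blocks + check_blocks
    if len(node_names) < width:
        raise ValueError("need at least one node per stripe block")
    view = {}
    for p in range(partition_count):
        members = [node_names[(p + k) % len(node_names)] for k in range(width)]
        view[f"P{p + 1}"] = {
            "primary": members[0],           # primary data storage node
            "data": members[1:data_blocks],  # data storage nodes 1 and 2
            "check": members[data_blocks:],  # check storage node
        }
    return view


nodes = [f"disk{n}" for n in range(1, 19)]        # 18 hard disks
view = build_partition_view(3600, nodes)          # partitions P1..P3600

# Count how many partitions each disk serves as primary.
primaries_per_node = {}
for entry in view.values():
    name = entry["primary"]
    primaries_per_node[name] = primaries_per_node.get(name, 0) + 1
```

With 3600 partitions and 18 disks, each disk is the primary for 3600 / 18 = 200 partitions in this sketch. How the source counts the 200 partitions per node with respect to the other roles is not stated.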

Distributed block device storage logically slices each logical unit number (LUN) into 1 MB slices; for example, a 1 GB LUN is cut into 1024 slices of 1 MB each. As shown in FIG. 3, when a client sends a write request to a LUN through the management node, the Small Computer System Interface (SCSI) command carries the LUN identifier (ID), the logical block address (LBA) ID, and the data to be written. The management node where the client is located receives the write request and composes a key from the LUN ID and the LBA ID; the key contains the result of rounding the LBA ID to a 1 MB boundary. A distributed hash table (DHT) hash of the key yields an integer in the range 0 to 2^32, which falls into a particular partition. The management node where the client is located then uses the partition view recorded in the memory 202 to determine the primary data storage node, data storage node 1, data storage node 2, and the check storage node, and sends data block 1, data block 2, data block 3, and check block 1 of the EC data stripe to the primary data storage node, data storage node 1, data storage node 2, and the check storage node, respectively. The primary data storage node stores data block 1, data storage node 1 stores data block 2, data storage node 2 stores data block 3, and the check storage node stores check block 1. Data storage nodes 1 and 2 each determine the primary data storage node from the partition view; data storage node 1 backs up data block 2 to the primary data storage node, data storage node 2 backs up data block 3 to the primary data storage node, and the primary data storage node stores data block 2 and data block 3.

In a specific implementation, the primary data storage node allocates slice 1 for data block 1 from the hard disks it manages and establishes a mapping between the identifier of data block 1 and slice 1; data storage node 1 allocates slice 2 for data block 2 from the hard disks it manages and establishes a mapping between the identifier of data block 2 and slice 2; data storage node 2 allocates slice 3 for data block 3 from the hard disks it manages and establishes a mapping between the identifier of data block 3 and slice 3; the check storage node allocates slice 4 for check block 1 from the hard disks it manages and establishes a mapping between the identifier of check block 1 and slice 4. The primary data storage node receives data block 2 sent by data storage node 1 and data block 3 sent by data storage node 2, allocates slice 5 and slice 6 from the hard disks it manages, and establishes a mapping between the identifier of data block 2 and slice 5 and a mapping between the identifier of data block 3 and slice 6.

In the embodiments of the present invention, taking the mapping between a data block identifier and a slice as an example: when one process of the storage object program corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping between the data block identifier and the slice is a mapping between the data block identifier and the physical address of the slice; when one process of the storage object program corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping between the data block identifier and the slice includes a mapping from the data block identifier to the hard disk that stores the data block, and a mapping from that hard disk to the slice.

Further, data block 2 is stored in both slice 2 and slice 5, and data block 3 is stored in both slice 3 and slice 6. The management node establishes and saves a mapping from the identifier of data block 2 to data storage node 1 and the primary data storage node, and a mapping from the identifier of data block 3 to data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping from the identifier of data block 2 to data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping from the identifier of data block 3 to data storage node 2 and the primary data storage node. When garbage collection is performed on a data stripe, the management node can use the mapping between the identifiers of the data blocks in the stripe and the storage nodes to reclaim the data of each data block on both the data storage node and the primary data storage node, which improves the efficiency of data reclamation.
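The key composition and hash-to-partition step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the LBA is treated as a byte offset for simplicity, SHA-256 stands in for the unspecified DHT hash, and `NUM_PARTITIONS` and the equal-range partition mapping are hypothetical choices.

```python
import hashlib

SLICE_SIZE = 1 << 20    # 1 MB logical slice, as stated in the description
HASH_SPACE = 1 << 32    # hash values are integers in [0, 2^32)
NUM_PARTITIONS = 3600   # hypothetical partition count, not given in the source


def make_key(lun_id: int, lba: int) -> str:
    """Compose a key from the LUN ID and the LBA rounded down to a 1 MB slice."""
    slice_index = lba // SLICE_SIZE  # assumption: the LBA is a byte offset
    return f"{lun_id}:{slice_index}"


def key_to_partition(key: str) -> int:
    """Hash the key into [0, 2^32) and map that integer onto a partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()  # stand-in hash
    value = int.from_bytes(digest[:4], "big")               # 32-bit integer
    return value * NUM_PARTITIONS // HASH_SPACE             # equal-range mapping
```

All addresses inside the same 1 MB slice of a LUN produce the same key, so they land in the same partition and therefore on the same set of storage nodes.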

In the embodiments of the present invention, when a client sends a write request to the distributed block device storage to write data, metadata is generated to record, for example, the logical address and the physical address of the data. The metadata corresponding to the data is stored using the same EC algorithm as the data itself. A metadata stripe formed with the EC algorithm has the same partition view as the data stripe formed with the EC algorithm described above, as shown in FIG. 6.
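One way to picture a partition view shared by data stripes and metadata stripes is a small lookup table keyed by partition. The sketch below is illustrative only: the class name, field names, and node names are hypothetical, and the checks simply encode the constraints the description places on N (at least 2 data-bearing nodes, counting the primary) and M (at least 1 check node).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PartitionView:
    """Node roles for one partition (hypothetical representation)."""
    primary: str                   # primary data storage node
    data_nodes: Tuple[str, ...]    # remaining data storage nodes
    check_nodes: Tuple[str, ...]   # check storage nodes

    def __post_init__(self) -> None:
        if self.n < 2:
            raise ValueError("N must be at least 2")
        if self.m < 1:
            raise ValueError("M must be at least 1")

    @property
    def n(self) -> int:
        """Number of data-bearing nodes, including the primary."""
        return 1 + len(self.data_nodes)

    @property
    def m(self) -> int:
        """Number of check storage nodes."""
        return len(self.check_nodes)


# A single table serves both data stripes and metadata stripes of a partition.
PARTITION_VIEWS: Dict[int, PartitionView] = {
    0: PartitionView("node-a", ("node-b", "node-c"), ("node-d",)),
}


def nodes_for_stripe(partition: int) -> PartitionView:
    """Look up the node roles for a data or metadata stripe in a partition."""
    return PARTITION_VIEWS[partition]
```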

Metadata is stored in a distributed storage system that contains a management node and (M+N) storage nodes, where the management node and each of the (M+N) storage nodes store the partition view of the metadata stripe. The partition view of the metadata stripe contains a primary data storage node DS_A, data storage nodes DS_i, and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system performs the process shown in FIG. 7:

Step 701: The management node determines, according to the partition view of the metadata stripe, the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r for the metadata stripe. The metadata stripe contains metadata blocks D_A and D_i and check blocks C_r.

Specifically, this determination includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; the management node then queries the partition view of the metadata stripe by that partition to determine the primary data storage node DS_A, the data storage nodes DS_i, and the check storage nodes CS_r.

Specifically, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request. For details, refer to the scheme described above by which the distributed block device storage handles a write request sent by the client; it is not repeated here.

Step 702: The management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r.

Step 703: The check storage node CS_r receives and stores C_r.

Step 704: The data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe.

Step 705: The primary data storage node DS_A receives and stores D_A and D_i.
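Steps 702 to 705 can be simulated in memory as shown below. This is a rough sketch assuming the node roles have already been resolved from the partition view in step 701; the function name, the dictionary-based "stores", and the argument layout are all hypothetical, and network transfer is replaced by direct assignment.

```python
from typing import Dict, List


def write_metadata_stripe(
    primary: str,
    data_nodes: List[str],
    check_nodes: List[str],
    data_blocks: Dict[str, bytes],
    check_blocks: Dict[str, bytes],
) -> Dict[str, Dict[str, bytes]]:
    """Return {node: {block id: payload}} after steps 702 to 705.

    data_blocks lists D_A first, then one D_i per entry of data_nodes;
    check_blocks lists one C_r per entry of check_nodes.
    """
    if len(data_blocks) != 1 + len(data_nodes):
        raise ValueError("need one metadata block per data-bearing node")
    if len(check_blocks) != len(check_nodes):
        raise ValueError("need one check block per check storage node")

    stores: Dict[str, Dict[str, bytes]] = {
        node: {} for node in [primary, *data_nodes, *check_nodes]
    }
    block_ids = list(data_blocks)

    # Steps 702 and 705: D_A is sent to, and stored by, the primary node.
    stores[primary][block_ids[0]] = data_blocks[block_ids[0]]

    # Steps 702 and 703: each C_r is sent to, and stored by, its check node.
    for node, block_id in zip(check_nodes, check_blocks):
        stores[node][block_id] = check_blocks[block_id]

    # Steps 702 and 704: each D_i is stored by DS_i, which then forwards it.
    for node, block_id in zip(data_nodes, block_ids[1:]):
        stores[node][block_id] = data_blocks[block_id]
        # Step 705: the primary node stores the forwarded copy of D_i.
        stores[primary][block_id] = data_blocks[block_id]

    return stores
```

After the call, the primary node holds every metadata block of the stripe, each other data storage node holds only its own block, and the check nodes hold only check blocks.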

Specifically, the check storage node CS_r storing C_r includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. Further, the management node establishes a mapping from the identifier of D_i to the data storage node DS_i and the primary data storage node DS_A. Further, the data storage node DS_i saves the mapping from the identifier of D_i to the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on a metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
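The slice allocation and the garbage-collection lookup described above can be sketched with two small tables: a per-node map from block identifier to slice, and a management-node map from block identifier to the nodes holding a copy. The class, the slice naming scheme, and the function names below are hypothetical and only illustrate the bookkeeping.

```python
from typing import Dict, Iterable, List, Optional


class StorageNode:
    """A storage node that allocates slices and tracks block-to-slice mappings."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._next_slice = 0
        self.block_to_slice: Dict[str, str] = {}

    def store(self, block_id: str) -> str:
        """Allocate a fresh slice for the block and record the mapping."""
        slice_id = f"{self.name}/slice-{self._next_slice}"
        self._next_slice += 1
        self.block_to_slice[block_id] = slice_id
        return slice_id

    def reclaim(self, block_id: str) -> Optional[str]:
        """Drop the mapping for the block and return the freed slice, if any."""
        return self.block_to_slice.pop(block_id, None)


def collect_garbage(
    block_to_nodes: Dict[str, List[str]],
    nodes: Dict[str, StorageNode],
    block_ids: Iterable[str],
) -> List[str]:
    """Reclaim every copy of the given blocks using the management-node map."""
    freed: List[str] = []
    for block_id in block_ids:
        for node_name in block_to_nodes.pop(block_id, []):
            slice_id = nodes[node_name].reclaim(block_id)
            if slice_id is not None:
                freed.append(slice_id)
    return freed
```

Because the management-node map names both holders of a block, a single pass frees the copy on the data storage node and the backup on the primary node without searching other nodes.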

In the embodiments of the present invention, combining the distributed block device storage and the data storage method described above, as shown in FIG. 8, the metadata blocks in a metadata stripe that uses the EC algorithm are D_1, D_2, and D_3, and the check block is C_1. The management node where the client is located uses the partition view recorded in the memory 202, "partition - primary data storage node - data storage node 1 - data storage node 2 - check storage node", to determine the primary data storage node, data storage node 1, data storage node 2, and the check storage node. This partition view indicates that the partition corresponds to the primary data storage node, to data storage node 1 and data storage node 2, which store the other metadata blocks of the metadata stripe, and to the check storage node, which stores the check data; the primary data storage node serves as the backup node for the metadata blocks stored on data storage node 1 and data storage node 2. The management node sends D_1, D_2, D_3, and C_1 of the EC-based metadata stripe to the primary data storage node, data storage node 1, data storage node 2, and the check storage node, respectively. The primary data storage node receives and stores D_1, data storage node 1 receives and stores D_2, data storage node 2 receives and stores D_3, and the check storage node receives and stores C_1. Data storage nodes 1 and 2 each determine the primary data storage node from the partition view; data storage node 1 backs up D_2 to the primary data storage node, data storage node 2 backs up D_3 to the primary data storage node, and the primary data storage node receives and stores D_2 and D_3.

In a specific implementation, as shown in FIG. 9, the primary data storage node allocates slice 7 for D_1 from the hard disks it manages and establishes a mapping between the identifier of D_1 and slice 7; data storage node 1 allocates slice 8 for D_2 from the hard disks it manages and establishes a mapping between the identifier of D_2 and slice 8; data storage node 2 allocates slice 9 for D_3 from the hard disks it manages and establishes a mapping between the identifier of D_3 and slice 9; the check storage node allocates slice 10 for C_1 from the hard disks it manages and establishes a mapping between the identifier of C_1 and slice 10. The primary data storage node receives D_2 sent by data storage node 1 and D_3 sent by data storage node 2, allocates slice 11 and slice 12 from the hard disks it manages, and establishes a mapping between the identifier of D_2 and slice 11 and a mapping between the identifier of D_3 and slice 12.

In the embodiments of the present invention, taking the mapping between a metadata block identifier and a slice as an example: when one process of the storage object program corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping between the metadata block identifier and the slice is a mapping between the metadata block identifier and the physical address of the slice; when one process of the storage object program corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping between the metadata block identifier and the slice includes a mapping from the metadata block identifier to the hard disk that stores the metadata block, and a mapping from that hard disk to the slice.

Further, D_2 is stored in both slice 8 and slice 11, and D_3 is stored in both slice 9 and slice 12. The management node establishes and saves a mapping from the identifier of D_2 to data storage node 1 and the primary data storage node, and a mapping from the identifier of D_3 to data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping from the identifier of D_2 to data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping from the identifier of D_3 to data storage node 2 and the primary data storage node. When garbage collection is performed on a metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of each metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
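The slice assignments in this example (slices 7 to 12) can be written out as plain tables and inverted to see where each block lives. The node labels and the helper function below are hypothetical names chosen for illustration; the slice numbers are the ones enumerated in the description.

```python
from typing import Dict, List, Tuple

# node -> {block identifier: slice number}, as enumerated in the description
NODE_SLICES: Dict[str, Dict[str, int]] = {
    "primary": {"D1": 7, "D2": 11, "D3": 12},
    "data1": {"D2": 8},
    "data2": {"D3": 9},
    "check": {"C1": 10},
}


def block_locations(
    node_slices: Dict[str, Dict[str, int]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Invert the per-node tables into block id -> [(node, slice), ...]."""
    locations: Dict[str, List[Tuple[str, int]]] = {}
    for node, mapping in node_slices.items():
        for block_id, slice_no in mapping.items():
            locations.setdefault(block_id, []).append((node, slice_no))
    return locations
```

Inverting the tables shows that D_2 and D_3 each have two copies (one on their own data storage node, one on the primary node), while D_1 and C_1 each have a single copy.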

Therefore, in a scenario where data reliability is achieved with metadata stripes formed by the EC algorithm, the primary data storage node backs up the other metadata blocks of the metadata stripe. Because only the metadata blocks on the data storage nodes need to be backed up on the primary data storage node, less storage space is used than with the prior-art approach of keeping multiple copies of every metadata block. In addition, when a client accesses the metadata, it only needs to read all the metadata blocks from the primary data storage node, which improves metadata access speed.
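The space claim can be made concrete by counting stored blocks. The sketch below is one possible reading, not a figure from the source: it assumes the prior-art baseline keeps a fixed number of full copies of each of the N metadata blocks, and the copy count of 3 used in the usage note is purely an assumption.

```python
def blocks_with_primary_backup(n: int, m: int) -> int:
    """Blocks stored per stripe: N metadata blocks, M check blocks,
    plus N - 1 backup copies held by the primary data storage node."""
    return n + m + (n - 1)


def blocks_with_full_replication(n: int, copies: int) -> int:
    """Blocks stored when every metadata block is kept `copies` times
    (assumed baseline; the source does not state the copy count)."""
    return n * copies
```

For the N = 3, M = 1 example in the description, the scheme stores 3 + 1 + 2 = 6 blocks per stripe, whereas an assumed three-copy baseline would store 9.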

An embodiment of the present invention further provides a non-volatile computer-readable storage medium and a computer program product. Both contain computer program instructions; a CPU executes these instructions, loaded in memory, to implement the functions of the management node and the storage nodes (primary data storage node, data storage node, and check storage node) in the embodiments of the present invention.

The descriptions given in the embodiments of the present invention are exemplary. Labels such as "slice 1", "slice 2", ..., "slice 12" are not intended to strictly define an order; they only distinguish different slices. A slice in the embodiments of the present invention may be, for example, a physical block on a hard disk. As described above, the hard disk may be at least one of a mechanical disk and a solid-state drive. The hard disk corresponding to a process of the storage object program may also be a storage array or the like, which is not limited in the embodiments of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the division of units in the apparatus embodiments described above is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

Claims (10)

1. A method for storing metadata in a distributed storage system, characterized in that the distributed storage system comprises a management node and (M+N) storage nodes, wherein the management node and the (M+N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i, and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M; the method comprises the following steps:
the management node determines the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe; the metadata stripe contains metadata blocks D_A and D_i and a check block C_r;
the management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r;
the check storage node CS_r receives and stores C_r;
the data storage node DS_i receives and stores D_i, and sends D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;
the primary data storage node DS_A receives and stores D_A and D_i.
2. The method of claim 1, wherein the management node determining the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe specifically comprises:
the management node determines a partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe;
the management node queries the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r.
3. The method according to claim 2, wherein the management node determines the partition corresponding to the metadata stripe according to an address carried by the write request.
4. The method of claim 1, wherein the check storage node CS_r storing C_r specifically comprises: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping relationship between the identifier of C_r and the slice S_r;
the data storage node DS_i storing D_i specifically comprises: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i;
the primary data storage node DS_A storing D_A and D_i specifically comprises: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping relationship between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping relationship between the identifier of D_i and the slice SD_i.
5. A distributed storage system, comprising a management node and (M+N) storage nodes, wherein the management node and the (M+N) storage nodes each store a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i, and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M;
the management node is configured to determine the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe, the metadata stripe containing metadata blocks D_A and D_i and a check block C_r; and to send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the check storage node CS_r;
the check storage node CS_r is configured to receive and store C_r;
the data storage node DS_i is configured to receive and store D_i, and to send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;
the primary data storage node DS_A is configured to receive and store D_A and D_i.
6. The system of claim 5, wherein the management node is specifically configured to determine the partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r.
7. The system according to claim 6, wherein the management node is further configured to determine, according to an address carried by the write request, a partition corresponding to the metadata stripe.
8. The system of claim 5, wherein the check storage node CS_r is specifically configured to allocate a slice S_r for C_r and establish a mapping relationship between the identifier of C_r and the slice S_r;
the data storage node DS_i is specifically configured to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i;
the primary data storage node DS_A is specifically configured to allocate a slice SD_A for D_A and establish a mapping relationship between the identifier of D_A and the slice SD_A, and to allocate a slice SD_i for D_i and establish a mapping relationship between the identifier of D_i and the slice SD_i.
9. A non-transitory readable storage medium containing computer program instructions executable in a distributed storage system, the distributed storage system comprising a management node and (M+N) storage nodes, each storing a partition view of a metadata stripe; the partition view of the metadata stripe contains a primary data storage node DS_A, a data storage node DS_i, and a check storage node CS_r; wherein N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N except A, and r is each of the natural numbers 1 to M; when the computer program instructions are executed by one or more computers, the one or more computers, acting as the management node, determine the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r for the metadata stripe according to the partition view of the metadata stripe, the metadata stripe containing metadata blocks D_A and D_i and a check block C_r, and send D_i to the data storage node DS_i, send D_A to the primary data storage node DS_A, and send C_r to the check storage node CS_r; the one or more computers, acting as the check storage node CS_r, receive and store C_r;
the one or more computers, acting as the data storage node DS_i, receive and store D_i, and send D_i to the primary data storage node DS_A according to the partition view of the metadata stripe;
the one or more computers, acting as the primary data storage node DS_A, receive and store D_A and D_i.
10. The storage medium of claim 9, further comprising computer program instructions that cause the one or more computers, acting as the management node, to determine a partition corresponding to the metadata stripe according to a write request that generates the metadata in the metadata stripe, and to query the partition view of the metadata stripe according to the partition corresponding to the metadata stripe to determine the primary data storage node DS_A, the data storage node DS_i, and the check storage node CS_r.
CN201710508014.8A 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system Active CN109144406B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201710508014.8A CN109144406B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system
CN202010648620.1A CN111949210B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system
PCT/CN2018/075077 WO2019000949A1 (en) 2017-06-28 2018-02-02 Metadata storage method and system in distributed storage system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710508014.8A CN109144406B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202010648620.1A Division CN111949210B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system

Publications (2)

Publication Number Publication Date
CN109144406A CN109144406A (en) 2019-01-04
CN109144406B true CN109144406B (en) 2020-08-07

Family

ID=64740945

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010648620.1A Active CN111949210B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system
CN201710508014.8A Active CN109144406B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010648620.1A Active CN111949210B (en) 2017-06-28 2017-06-28 Metadata storage method, system and storage medium in distributed storage system

Country Status (2)

Country Link
CN (2) CN111949210B (en)
WO (1) WO2019000949A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3935515B1 (en) * 2019-03-04 2024-02-14 Hitachi Vantara LLC Metadata routing in a distributed system
EP3971701A4 (en) * 2019-09-09 2022-06-15 Huawei Cloud Computing Technologies Co., Ltd. Data processing method in storage system, device, and storage system
CN111444274B (en) * 2020-03-26 2021-04-30 上海依图网络科技有限公司 Data synchronization method, data synchronization system, and apparatus, medium, and system thereof
CN111638995B (en) * 2020-05-08 2024-09-20 杭州海康威视系统技术有限公司 Metadata backup method, device and equipment and storage medium
WO2022094895A1 (en) * 2020-11-05 2022-05-12 Alibaba Group Holding Limited Virtual data copy supporting garbage collection in distributed file systems
CN112947864B (en) * 2021-03-29 2024-03-08 南方电网数字平台科技(广东)有限公司 Metadata storage method, apparatus, device and storage medium
CN115904794A (en) * 2021-08-18 2023-04-04 华为技术有限公司 A data processing method and device
CN115268801B (en) * 2022-09-30 2023-01-10 天津卓朗昆仑云软件技术有限公司 Backup system and method for block device
CN119718197B (en) * 2024-12-06 2025-08-12 北京邮电大学 A method and system for distributed data storage based on erasure codes
CN119336276B (en) * 2024-12-19 2025-03-21 北京大道云行科技有限公司 High-performance bare disk management method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399823A (en) * 2011-12-31 2013-11-20 华为数字技术(成都)有限公司 Method, equipment and system for storing service data
CN106233264A (en) * 2014-03-31 2016-12-14 亚马逊科技公司 File storage with variable stripe size
CN106471461A (en) * 2014-06-04 2017-03-01 纯存储公司 Automatically reconfigure storage device memorizer topology
CN106662983A (en) * 2015-12-31 2017-05-10 华为技术有限公司 Method, apparatus and system for data reconstruction in distributed storage system

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7051155B2 (en) * 2002-08-05 2006-05-23 Sun Microsystems, Inc. Method and system for striping data to accommodate integrity metadata
CN102411637B (en) * 2011-12-30 2013-07-24 创新科软件技术(深圳)有限公司 Metadata management method of distributed file system
US8914668B2 (en) * 2012-09-06 2014-12-16 International Business Machines Corporation Asynchronous raid stripe writes to enable response to media errors
CN102937964B (en) * 2012-09-28 2015-02-11 无锡江南计算技术研究所 Intelligent data service method based on distributed system
US9104332B2 (en) * 2013-04-16 2015-08-11 International Business Machines Corporation Managing metadata and data for a logical volume in a distributed and declustered system
US9529675B2 (en) * 2013-07-26 2016-12-27 Huawei Technologies Co., Ltd. Data recovery method, data recovery device and distributed storage system
CN103699494B (en) * 2013-12-06 2017-03-15 北京奇虎科技有限公司 A kind of date storage method, data storage device and distributed memory system
CN103729436A (en) * 2013-12-27 2014-04-16 中国科学院信息工程研究所 Distributed metadata management method and system
CN106294772B (en) * 2016-08-11 2019-03-19 电子科技大学 The buffer memory management method of distributed memory columnar database
CN106599308B (en) * 2016-12-29 2020-01-31 郭晓凤 distributed metadata management method and system

Also Published As

Publication number Publication date
CN111949210B (en) 2024-11-15
CN109144406A (en) 2019-01-04
CN111949210A (en) 2020-11-17
WO2019000949A1 (en) 2019-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant