CN109144406B - Metadata storage method, system and storage medium in distributed storage system - Google Patents
- Publication number
- CN109144406B (application number CN201710508014.8A)
- Authority
- CN
- China
- Prior art keywords
- storage node
- data storage
- metadata
- node
- stripe
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
- G06F11/1464—Management of the backup or restore process for networked environments
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical field
The present invention relates to the field of data storage technologies, and in particular to a metadata storage method, system and storage medium in a distributed storage system.
Background
In a distributed storage system, after a management node stores user data on storage nodes, metadata is generated that records, among other things, the logical address and physical address of the data, and this metadata must also be stored on the storage nodes. A common approach is to scatter the blocks of a metadata stripe across the storage nodes. Reading the metadata then requires reading the blocks of the stripe from each storage node and reassembling them into the stripe, which causes a large volume of data forwarding between storage nodes and degrades performance. Another approach stores the metadata on the storage nodes as multiple replicas, but this increases storage space overhead.
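Summary of the invention
In a first aspect, an embodiment of the present invention provides a metadata storage scheme in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. In the storage scheme: the management node determines the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe; the metadata stripe includes metadata blocks D_A and D_i and check blocks C_r; the management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r; the check storage node CS_r receives and stores C_r; the data storage node DS_i receives and stores D_i and, according to the partition view of the metadata stripe, sends D_i to the primary data storage node DS_A; the primary data storage node DS_A receives and stores D_A and D_i. In this scheme, with the metadata protected by erasure coding (EC), the primary data storage node DS_A backs up the other metadata blocks D_i of the metadata stripe. Because only the metadata blocks D_i on the data storage nodes DS_i need to be backed up on the primary data storage node DS_A, no replicas of the check blocks are needed, which reduces storage space compared with the prior-art approach of keeping multiple replicas of all metadata blocks. At the same time, when a client accesses the metadata, all metadata blocks can be accessed from the primary data storage node DS_A, which improves metadata access speed. The distributed storage system of this scheme may be a distributed file system, a distributed object storage system or distributed block device storage.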
Optionally, the management node determining the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe specifically includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r.
Optionally, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request.
Optionally, the check storage node CS_r storing C_r specifically includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i specifically includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i specifically includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i.
Further, the management node establishes a mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on a metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
In a second aspect, an embodiment of the present invention correspondingly provides a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system is configured to implement the various implementations of the first aspect.
Correspondingly, the present invention further provides a non-volatile computer-readable storage medium and a computer program product. When the memory of the storage device provided by the embodiments of the present invention loads the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product, the computer program instructions can run in a distributed storage system. The distributed storage system includes a management node and (M+N) storage nodes, and the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. One or more computers executing the computer program instructions act respectively as the management node, the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r of the distributed storage system, to implement the various implementations of the first aspect.
The metadata storage schemes disclosed in the first aspect may also be applied to storing the data to which the metadata corresponds. Correspondingly, the distributed storage system of the second aspect and the non-volatile computer-readable storage medium and computer program product of the third aspect are likewise applicable to data storage.
Description of drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of a distributed block device storage architecture according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a server in distributed block device storage according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the relationship between a data stripe and a partition view according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data stripe according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a partition view according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the relationship between a metadata stripe and a partition view according to an embodiment of the present invention;
FIG. 7 is a flowchart of metadata storage according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a metadata stripe according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of metadata storage according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings.
Distributed storage systems mainly take the forms of distributed file system storage, distributed object storage and distributed block device storage, for example Huawei's product series. The embodiments of the present invention are described using distributed block device storage as an example. As shown by way of example in FIG. 1, the distributed block device storage includes multiple servers, namely server 1, server 2, server 3, server 4, server 5 and server 6, which communicate with one another. In practice, the number of servers in the distributed block device storage can be increased as required, which is not limited in the embodiments of the present invention. Each server of the distributed block device storage has the structure shown in FIG. 2.
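As shown in FIG. 2, each server in the distributed block device storage includes a central processing unit (CPU) 201, a memory 202, hard disk 1, hard disk 2 and hard disk 3. The memory 202 stores computer instructions, and the CPU 201 executes the program instructions in the memory 202 to perform the corresponding operations. A hard disk may be at least one of a mechanical hard disk and a solid-state drive. In addition, to save computing resources of the CPU 201, a field programmable gate array (FPGA) or other hardware may perform the above operations in place of the CPU 201, or the FPGA or other hardware may perform them jointly with the CPU 201. For ease of description, the embodiments of the present invention uniformly refer to a processor as performing the above operations.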
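In the structure shown in FIG. 2, when an application program is loaded into the memory 202 and the CPU 201 executes the application program instructions in the memory 202, the server acts as a client. The application program may be a virtual machine (VM) or a specific application such as office software. The client writes data to, or reads data from, the distributed block device storage. When a storage management program is loaded into the memory 202 and the CPU 201 executes the instructions of the storage management program, which serves as a virtual block storage management program, the server acts as a management node. The management node is responsible for managing volume metadata, provides a block protocol access interface to clients, and offers a distributed storage access point service so that clients can access the storage resources of the distributed block device storage through the management node. When a storage object program is loaded into the memory 202 and the CPU 201 executes the storage object program instructions in the memory 202, the server acts as a storage node, which performs the actual input/output (I/O) operations. Multiple storage object program processes can run on each server. By way of example, one storage object program process runs for each hard disk by default and manages that hard disk, so that each storage object program process run by the server acts as one storage node. In a specific implementation, a single storage object program process on a server may instead correspond to all hard disks on the server. The embodiments of the present invention are described using the case in which one storage object program process manages one hard disk. When the distributed block device storage is initialized, each storage object program process divides its hard disk into slices of 1 MB and records the allocation information of each 1 MB slice in the metadata management area of the hard disk; the slices of the hard disks form a storage resource pool. The storage management program communicates point to point with all storage object program processes of the resource pools it can access; that is, the management node communicates with all storage nodes of the resource pools it can access, so that the management node can access all hard disks of the resource pool concurrently.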
When the distributed block device storage is initialized, the hash space (for example, 0 to 2^32) is divided into N equal parts, each of which is one partition, and these N parts are distributed evenly across the hard disks. For example, N defaults to 3600 in the distributed block device storage, giving partitions P1, P2, P3, ..., P3600. As shown in FIG. 3, assuming the distributed block device storage currently has 18 hard disks (storage nodes), each storage node carries 200 partitions. The correspondence between partitions and storage nodes, that is, the partition view, is assigned when the distributed block device storage is initialized and is subsequently adjusted as the number of hard disks changes. Each server of the distributed block device storage keeps the partition view in the memory 202, and the management node uses it for fast routing. Each storage node also keeps all partition views of the distributed block device storage system, that is, the correspondence between every partition and the storage nodes. Depending on the reliability requirements of the distributed block device storage, an erasure coding (EC) algorithm can be used to improve data reliability, for example in 3+1 mode, in which 3 data blocks and 1 check block form a data stripe, as shown in FIG. 4. The partition view then has the form "partition - primary data storage node - data storage node 1 - data storage node 2 - check storage node"; an example partition view is shown in FIG. 5. This partition view indicates, for a partition, the primary data storage node, data storage node 1 and data storage node 2, which store the other data blocks of the data stripe, and the check storage node, which stores the check data. The backup data storage node for the data blocks stored on data storage node 1 and data storage node 2 is the primary data storage node.
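The partition view described above can be pictured as a table from partition number to an ordered group of storage nodes. The following is a minimal sketch of such a table, not the patented implementation: the round-robin placement, the `PartitionEntry` type and the function names are assumptions made for illustration, since the text does not state how nodes are chosen for each partition.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PartitionEntry:
    """One row of a partition view (hypothetical layout)."""
    primary: int                  # primary data storage node
    data_nodes: Tuple[int, ...]   # the other data storage nodes
    check_nodes: Tuple[int, ...]  # check storage nodes


def build_partition_view(num_partitions: int, num_nodes: int,
                         n_data: int = 3, m_check: int = 1) -> Dict[int, PartitionEntry]:
    """Assign each partition P1..Pn a group of n_data + m_check nodes.

    Round-robin placement is an assumption of this sketch.
    """
    width = n_data + m_check
    if num_nodes < width:
        raise ValueError("need at least n_data + m_check storage nodes")
    view = {}
    for p in range(1, num_partitions + 1):
        start = (p - 1) % num_nodes
        members = [(start + k) % num_nodes for k in range(width)]
        view[p] = PartitionEntry(
            primary=members[0],
            data_nodes=tuple(members[1:n_data]),
            check_nodes=tuple(members[n_data:]),
        )
    return view


def primaries_per_node(view: Dict[int, PartitionEntry]) -> Dict[int, int]:
    """Count how many partitions each node serves as primary."""
    counts: Dict[int, int] = {}
    for entry in view.values():
        counts[entry.primary] = counts.get(entry.primary, 0) + 1
    return counts
```

With 3600 partitions and 18 nodes, `primaries_per_node(build_partition_view(3600, 18))` reports 200 partitions per node, matching the even split in the example above.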
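The distributed block device storage logically slices each logical unit number (LUN) into 1 MB slices; for example, a 1 GB LUN is cut into 1024 slices of 1 MB. As shown in FIG. 3, when a client sends a write request to a LUN through the management node, the Small Computer System Interface (SCSI) command carries the LUN identifier (ID), the logical block address (LBA) ID and the data to be written. The management node where the client resides receives the write request and forms a key from the LUN ID and the LBA ID; the key contains the result of rounding the LBA ID to a 1 MB boundary. A distributed hash table (DHT) hash of the key yields an integer in the range 0 to 2^32, which falls into a specific partition. The management node where the client resides determines the primary data storage node, data storage node 1, data storage node 2 and the check storage node according to the partition view recorded in the memory 202, and sends data block 1, data block 2, data block 3 and check block 1 of the EC data stripe to the primary data storage node, data storage node 1, data storage node 2 and the check storage node respectively. The primary data storage node stores data block 1, data storage node 1 stores data block 2, data storage node 2 stores data block 3, and the check storage node stores check block 1. Data storage nodes 1 and 2 each determine the primary data storage node according to the partition view; data storage node 1 backs up data block 2 to the primary data storage node, data storage node 2 backs up data block 3 to the primary data storage node, and the primary data storage node stores data block 2 and data block 3.
In a specific implementation, the primary data storage node allocates slice 1 for data block 1 from the hard disk it manages and establishes a mapping between the identifier of data block 1 and slice 1; data storage node 1 allocates slice 2 for data block 2 from the hard disk it manages and establishes a mapping between the identifier of data block 2 and slice 2; data storage node 2 allocates slice 3 for data block 3 from the hard disk it manages and establishes a mapping between the identifier of data block 3 and slice 3; and the check storage node allocates slice 4 for check block 1 from the hard disk it manages and establishes a mapping between the identifier of check block 1 and slice 4. The primary data storage node receives data block 2 sent by data storage node 1 and data block 3 sent by data storage node 2, allocates slice 5 and slice 6 from the hard disk it manages, and establishes a mapping between the identifier of data block 2 and slice 5 and a mapping between the identifier of data block 3 and slice 6. Regarding the mapping between a data block identifier and a slice: when one storage object program process corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping is between the data block identifier and the physical address of the slice; when one storage object program process corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping comprises a mapping from the data block identifier to the hard disk that stores the data block and a mapping from that hard disk to the slice.
Further, data block 2 is stored in slice 2 and slice 5, and data block 3 is stored in slice 3 and slice 6. The management node establishes and saves a mapping between the identifier of data block 2 and both data storage node 1 and the primary data storage node, and a mapping between the identifier of data block 3 and both data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping between the identifier of data block 2 and both data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of data block 3 and both data storage node 2 and the primary data storage node. When garbage collection is performed on a data stripe, the management node can use the mapping between the identifiers of the data blocks in the stripe and the storage nodes to reclaim the data of a data block on both the data storage node and the primary data storage node, which improves the efficiency of data reclamation.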
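The routing step described above (forming a key from the LUN ID and the LBA rounded to a 1 MB boundary, hashing it into the range 0 to 2^32, and mapping the result to a partition) can be sketched as follows. This is an illustration under stated assumptions: the key format, the 512-byte logical block size, and the use of SHA-256 truncated to 32 bits as the hash are all choices made here, not details given in the text.

```python
import hashlib

SLICE_SIZE = 1 << 20   # 1 MB slices, as in the description
HASH_SPACE = 1 << 32   # hash values lie in [0, 2**32)


def make_key(lun_id: int, lba: int, block_size: int = 512) -> str:
    """Build a routing key from the LUN ID and the LBA rounded down to 1 MB.

    block_size (bytes per logical block) is an assumption of this sketch.
    """
    slice_index = (lba * block_size) // SLICE_SIZE
    return f"{lun_id}:{slice_index}"


def key_to_partition(key: str, num_partitions: int = 3600) -> int:
    """Hash the key into [0, 2**32) and return a partition number 1..num_partitions."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")          # 32-bit hash value
    return h * num_partitions // HASH_SPACE + 1    # equal-width ranges of the hash space
```

All LBAs that fall inside the same 1 MB slice of a LUN produce the same key and therefore route to the same partition, which is what lets the management node look up a single row of the partition view per slice.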
In the embodiments of the present invention, when a client sends a write request to the distributed block device storage to write data, metadata is generated that records, among other things, the logical address and physical address of the data. The metadata corresponding to the data is stored using the same EC algorithm as the data. A metadata stripe formed by the EC algorithm has the same partition view as the data stripe formed by the EC algorithm described above, as shown in FIG. 6.
Metadata is stored in a distributed storage system that includes a management node and (M+N) storage nodes, where the management node and the (M+N) storage nodes each store a partition view of a metadata stripe. The partition view of the metadata stripe includes a primary data storage node DS_A, data storage nodes DS_i and check storage nodes CS_r, where N is a natural number not less than 2, M is a natural number not less than 1, A is one of the natural numbers 1 to N, i is each of the natural numbers 1 to N other than A, and r is each of the natural numbers 1 to M. The distributed storage system performs the procedure shown in FIG. 7:
Step 701: The management node determines the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r for the metadata stripe according to the partition view of the metadata stripe. The metadata stripe includes metadata blocks D_A and D_i and check blocks C_r.
Specifically, this determination includes: the management node determines the partition corresponding to the metadata stripe according to the write request that generated the metadata in the metadata stripe; and the management node queries the partition view of the metadata stripe according to that partition to determine the primary data storage node DS_A, the data storage nodes DS_i and the check storage nodes CS_r.
Specifically, the management node determines the partition corresponding to the metadata stripe according to the address carried in the write request. For details, see the scheme described above for how the distributed block device storage stores a write request sent by a client; they are not repeated here.
Step 702: The management node sends D_i to the data storage node DS_i, sends D_A to the primary data storage node DS_A, and sends C_r to the check storage node CS_r.
Step 703: The check storage node CS_r receives and stores C_r.
Step 704: The data storage node DS_i receives and stores D_i and, according to the partition view of the metadata stripe, sends D_i to the primary data storage node DS_A.
Step 705: The primary data storage node DS_A receives and stores D_A and D_i.
Specifically, the check storage node CS_r storing C_r includes: the check storage node CS_r allocates a slice S_r for C_r and establishes a mapping between the identifier of C_r and the slice S_r. The data storage node DS_i storing D_i includes: the data storage node DS_i allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. The primary data storage node DS_A storing D_A and D_i includes: the primary data storage node DS_A allocates a slice SD_A for D_A and establishes a mapping between the identifier of D_A and the slice SD_A, and allocates a slice SD_i for D_i and establishes a mapping between the identifier of D_i and the slice SD_i. Further, the management node establishes a mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. Further, the data storage node DS_i saves the mapping between the identifier of D_i and both the data storage node DS_i and the primary data storage node DS_A. When garbage collection is performed on a metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.
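Steps 701 to 705 can be traced with a small in-memory simulation. The sketch below is illustrative only: the `StorageNode` class, the function names and the block identifier "C1" are hypothetical, a dictionary stands in for slice allocation, and a single XOR parity block stands in for the EC check block (the text does not specify which erasure code is used).

```python
from functools import reduce


class StorageNode:
    """A storage node; the dict stands in for slices allocated on its disk."""

    def __init__(self, name):
        self.name = name
        self.slices = {}   # block identifier -> stored bytes

    def store(self, block_id, payload):
        self.slices[block_id] = payload


def xor_parity(blocks):
    """XOR equal-length blocks byte by byte (a stand-in for an M = 1 erasure code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_metadata_stripe(stripe, primary, data_nodes, check_node):
    """stripe is a list of (block_id, payload); the first entry plays the role of D_A.

    Returns the block-to-node mapping that the management node would keep.
    """
    (primary_id, primary_payload), others = stripe[0], stripe[1:]
    if len(others) != len(data_nodes):
        raise ValueError("one data storage node is needed per remaining block")
    parity = xor_parity([payload for _, payload in stripe])

    # Step 702: the management node sends D_A to DS_A and the check block to CS_r.
    primary.store(primary_id, primary_payload)
    # Step 703: the check storage node stores the check block.
    check_node.store("C1", parity)

    locations = {primary_id: [primary.name], "C1": [check_node.name]}
    for (block_id, payload), node in zip(others, data_nodes):
        # Steps 702 and 704: DS_i receives and stores D_i ...
        node.store(block_id, payload)
        # ... then forwards it to DS_A, which stores the backup copy (step 705).
        primary.store(block_id, node.slices[block_id])
        locations[block_id] = [node.name, primary.name]
    return locations
```

After one call with a three-block stripe, the primary node holds all three metadata blocks, each data storage node holds only its own block, and the check storage node holds only the parity, which is the layout the steps describe.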
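In the embodiments of the present invention, in combination with the distributed block device storage and the data storage method described above, and as shown in FIG. 8, the metadata blocks in a metadata stripe that uses the EC algorithm are D_1, D_2 and D_3, and the check block is C_1. The management node where the client resides determines the primary data storage node, data storage node 1, data storage node 2 and the check storage node according to the partition view "partition - primary data storage node - data storage node 1 - data storage node 2 - check storage node" recorded in the memory 202. This partition view indicates, for a partition, the primary data storage node, data storage node 1 and data storage node 2, which store the other metadata blocks of the metadata stripe, and the check storage node, which stores the check data. The backup data storage node for the metadata blocks stored on data storage node 1 and data storage node 2 is the primary data storage node. The management node sends D_1, D_2, D_3 and C_1 of the EC-based metadata stripe to the primary data storage node, data storage node 1, data storage node 2 and the check storage node respectively. The primary data storage node receives and stores D_1, data storage node 1 receives and stores D_2, data storage node 2 receives and stores D_3, and the check storage node receives and stores C_1. Data storage nodes 1 and 2 each determine the primary data storage node according to the partition view; data storage node 1 backs up D_2 to the primary data storage node, data storage node 2 backs up D_3 to the primary data storage node, and the primary data storage node receives and stores D_2 and D_3.
In a specific implementation, as shown in FIG. 9, the primary data storage node allocates slice 7 for D_1 from the hard disk it manages and establishes a mapping between the identifier of D_1 and slice 7; data storage node 1 allocates slice 8 for D_2 from the hard disk it manages and establishes a mapping between the identifier of D_2 and slice 8; data storage node 2 allocates slice 9 for D_3 from the hard disk it manages and establishes a mapping between the identifier of D_3 and slice 9; and the check storage node allocates slice 10 for C_1 from the hard disk it manages and establishes a mapping between the identifier of C_1 and slice 10. The primary data storage node receives D_2 sent by data storage node 1 and D_3 sent by data storage node 2, allocates slice 11 and slice 12 from the hard disk it manages, and establishes a mapping between the identifier of D_2 and slice 11 and a mapping between the identifier of D_3 and slice 12. Regarding the mapping between a metadata block identifier and a slice: when one storage object program process corresponds to one hard disk, that is, the storage node is the hard disk itself, the mapping is between the metadata block identifier and the physical address of the slice; when one storage object program process corresponds to multiple hard disks, that is, the storage node manages multiple hard disks, the mapping comprises a mapping from the metadata block identifier to the hard disk that stores the metadata block and a mapping from that hard disk to the slice.
Further, D_2 is stored in slice 8 and slice 11, and D_3 is stored in slice 9 and slice 12. The management node establishes and saves a mapping between the identifier of D_2 and both data storage node 1 and the primary data storage node, and a mapping between the identifier of D_3 and both data storage node 2 and the primary data storage node. Further, data storage node 1 saves the mapping between the identifier of D_2 and both data storage node 1 and the primary data storage node, and data storage node 2 saves the mapping between the identifier of D_3 and both data storage node 2 and the primary data storage node. When garbage collection is performed on a metadata stripe, the management node can use the mapping between the identifiers of the metadata blocks in the stripe and the storage nodes to reclaim the data of a metadata block on both the data storage node and the primary data storage node, which improves the efficiency of metadata reclamation.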
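The garbage collection described above relies on the management node knowing every node that holds a copy of a block. A minimal sketch of that lookup, using the slice numbers 7 to 12 from this example, is shown below; the dictionary layout, the node names and the function name `collect_stripe` are assumptions made for illustration.

```python
def collect_stripe(block_locations, node_slice_maps):
    """Free every slice that holds a block of one stripe.

    block_locations: block identifier -> list of node names (kept by the management node)
    node_slice_maps: node name -> {block identifier -> slice number} (kept by each node)
    Returns the list of (node name, slice number) pairs that were released.
    """
    freed = []
    for block_id, node_names in block_locations.items():
        for name in node_names:
            slice_no = node_slice_maps[name].pop(block_id, None)
            if slice_no is not None:
                freed.append((name, slice_no))
    return freed


# Layout from the example: D_2 and D_3 each live on a data storage node and on the primary.
block_locations = {
    "D1": ["primary"],
    "D2": ["data1", "primary"],
    "D3": ["data2", "primary"],
    "C1": ["check"],
}
node_slice_maps = {
    "primary": {"D1": 7, "D2": 11, "D3": 12},
    "data1": {"D2": 8},
    "data2": {"D3": 9},
    "check": {"C1": 10},
}
```

Because the mapping for D_2 and D_3 names both holders, a single pass releases the original and the backup copy together, with no need to query the storage nodes to find out where copies live.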
Therefore, where metadata stripes formed by the EC algorithm are used for data reliability, the primary data storage node backs up the other metadata blocks of the metadata stripe. Because only the metadata blocks on the data storage nodes need to be backed up on the primary data storage node, storage space is reduced compared with the prior-art approach of keeping multiple replicas of all metadata blocks. At the same time, when a client accesses the metadata, it only needs to access the primary data storage node to obtain all metadata blocks, which improves metadata access speed.
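The space saving can be made concrete by counting slices per stripe. The sketch below is a back-of-the-envelope comparison, not a figure from the patent: it assumes that the prior-art baseline keeps a fixed number of full copies of every block in the stripe, check blocks included, and the function names are hypothetical.

```python
def slices_with_primary_backup(n_data: int, m_check: int) -> int:
    """Slices used per stripe when only the N - 1 non-primary data blocks are backed up."""
    return n_data + m_check + (n_data - 1)


def slices_with_full_copies(n_data: int, m_check: int, copies: int = 2) -> int:
    """Slices used per stripe if every block, check blocks included, is kept `copies` times.

    This baseline is an assumption about the prior art, for comparison only.
    """
    return copies * (n_data + m_check)
```

For the 3+1 example, `slices_with_primary_backup(3, 1)` gives 6, which agrees with the six slices (7 through 12) allocated in the embodiment, while `slices_with_full_copies(3, 1)` gives 8 under the assumed baseline.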
The embodiments of the present invention further provide a non-volatile computer-readable storage medium and a computer program product. A CPU executes the computer program instructions contained in the non-volatile computer-readable storage medium and the computer program product, once they are loaded into memory, to implement the functions of the management node and the storage nodes (primary data storage node, data storage node and check storage node) in the embodiments of the present invention.
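The descriptions given in the embodiments of the present invention are exemplary. "Slice 1", "slice 2", ..., "slice 12" and so on in the embodiments are not intended to strictly define an order; they only distinguish different slices. A slice in the embodiments may be a physical block on a hard disk or the like. As stated above, a hard disk in the embodiments may be at least one of a mechanical hard disk and a solid-state drive. The hard disk corresponding to a storage object program process may also be a storage array or the like, which is not limited in the embodiments of the present invention.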
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the division into units in the apparatus embodiments described above is only a division by logical function; other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
Claims (10)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710508014.8A CN109144406B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
| CN202010648620.1A CN111949210B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
| PCT/CN2018/075077 WO2019000949A1 (en) | 2017-06-28 | 2018-02-02 | Metadata storage method and system in distributed storage system, and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710508014.8A CN109144406B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010648620.1A Division CN111949210B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109144406A (en) | 2019-01-04 |
| CN109144406B (en) | 2020-08-07 |
Family
ID=64740945
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010648620.1A Active CN111949210B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
| CN201710508014.8A Active CN109144406B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010648620.1A Active CN111949210B (en) | 2017-06-28 | 2017-06-28 | Metadata storage method, system and storage medium in distributed storage system |
Country Status (2)
| Country | Link |
|---|---|
| CN (2) | CN111949210B (en) |
| WO (1) | WO2019000949A1 (en) |
Families Citing this family (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3935515B1 (en) * | 2019-03-04 | 2024-02-14 | Hitachi Vantara LLC | Metadata routing in a distributed system |
| EP3971701A4 (en) * | 2019-09-09 | 2022-06-15 | Huawei Cloud Computing Technologies Co., Ltd. | Data processing method in storage system, device, and storage system |
| CN111444274B (en) * | 2020-03-26 | 2021-04-30 | 上海依图网络科技有限公司 | Data synchronization method, data synchronization system, and apparatus, medium, and system thereof |
| CN111638995B (en) * | 2020-05-08 | 2024-09-20 | 杭州海康威视系统技术有限公司 | Metadata backup method, device and equipment and storage medium |
| WO2022094895A1 (en) * | 2020-11-05 | 2022-05-12 | Alibaba Group Holding Limited | Virtual data copy supporting garbage collection in distributed file systems |
| CN112947864B (en) * | 2021-03-29 | 2024-03-08 | 南方电网数字平台科技(广东)有限公司 | Metadata storage method, apparatus, device and storage medium |
| CN115904794A (en) * | 2021-08-18 | 2023-04-04 | 华为技术有限公司 | A data processing method and device |
| CN115268801B (en) * | 2022-09-30 | 2023-01-10 | 天津卓朗昆仑云软件技术有限公司 | Backup system and method for block device |
| CN119718197B (en) * | 2024-12-06 | 2025-08-12 | 北京邮电大学 | A method and system for distributed data storage based on erasure codes |
| CN119336276B (en) * | 2024-12-19 | 2025-03-21 | 北京大道云行科技有限公司 | High-performance bare disk management method |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103399823A (en) * | 2011-12-31 | 2013-11-20 | 华为数字技术(成都)有限公司 | Method, equipment and system for storing service data |
| CN106233264A (en) * | 2014-03-31 | 2016-12-14 | 亚马逊科技公司 | File storage with variable stripe size |
| CN106471461A (en) * | 2014-06-04 | 2017-03-01 | 纯存储公司 | Automatically reconfigure storage device memorizer topology |
| CN106662983A (en) * | 2015-12-31 | 2017-05-10 | 华为技术有限公司 | Method, apparatus and system for data reconstruction in distributed storage system |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7051155B2 (en) * | 2002-08-05 | 2006-05-23 | Sun Microsystems, Inc. | Method and system for striping data to accommodate integrity metadata |
| CN102411637B (en) * | 2011-12-30 | 2013-07-24 | 创新科软件技术(深圳)有限公司 | Metadata management method of distributed file system |
| US8914668B2 (en) * | 2012-09-06 | 2014-12-16 | International Business Machines Corporation | Asynchronous raid stripe writes to enable response to media errors |
| CN102937964B (en) * | 2012-09-28 | 2015-02-11 | 无锡江南计算技术研究所 | Intelligent data service method based on distributed system |
| US9104332B2 (en) * | 2013-04-16 | 2015-08-11 | International Business Machines Corporation | Managing metadata and data for a logical volume in a distributed and declustered system |
| US9529675B2 (en) * | 2013-07-26 | 2016-12-27 | Huawei Technologies Co., Ltd. | Data recovery method, data recovery device and distributed storage system |
| CN103699494B (en) * | 2013-12-06 | 2017-03-15 | 北京奇虎科技有限公司 | A kind of date storage method, data storage device and distributed memory system |
| CN103729436A (en) * | 2013-12-27 | 2014-04-16 | 中国科学院信息工程研究所 | Distributed metadata management method and system |
| CN106294772B (en) * | 2016-08-11 | 2019-03-19 | 电子科技大学 | The buffer memory management method of distributed memory columnar database |
| CN106599308B (en) * | 2016-12-29 | 2020-01-31 | 郭晓凤 | distributed metadata management method and system |
2017
- 2017-06-28: CN application CN202010648620.1A (CN111949210B), status Active
- 2017-06-28: CN application CN201710508014.8A (CN109144406B), status Active
2018
- 2018-02-02: WO application PCT/CN2018/075077 (WO2019000949A1), status Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| CN111949210B (en) | 2024-11-15 |
| CN109144406A (en) | 2019-01-04 |
| CN111949210A (en) | 2020-11-17 |
| WO2019000949A1 (en) | 2019-01-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109144406B (en) | | Metadata storage method, system and storage medium in distributed storage system |
| US11379142B2 (en) | | Snapshot-enabled storage system implementing algorithm for efficient reclamation of snapshot storage space |
| US11082206B2 (en) | | Layout-independent cryptographic stamp of a distributed dataset |
| US11409705B2 (en) | | Log-structured storage device format |
| US9733848B2 (en) | | Method and system for pooling, partitioning, and sharing network storage resources |
| US9582198B2 (en) | | Compressed block map of densely-populated data structures |
| CN102255962B (en) | | Distributive storage method, device and system |
| KR20200017363A (en) | | MANAGED SWITCHING BETWEEN ONE OR MORE HOSTS AND SOLID STATE DRIVES (SSDs) BASED ON THE NVMe PROTOCOL TO PROVIDE HOST STORAGE SERVICES |
| US20200356282A1 (en) | | Distributed Storage System, Data Processing Method, and Storage Node |
| CN110325958B (en) | | Data storage method and device in distributed block storage system and computer readable storage medium |
| US11899533B2 (en) | | Stripe reassembling method in storage system and stripe server |
| US12032849B2 (en) | | Distributed storage system and computer program product |
| US11775194B2 (en) | | Data storage method and apparatus in distributed storage system, and computer program product |
| WO2021017782A1 (en) | | Method for accessing distributed storage system, client, and computer program product |
| CN114489465A (en) | | Method, network device and computer system for data processing using network card |
| US20250094068A1 (en) | | Parallelized recovery of logical block address (lba) tables |
| CN119536619A (en) | | Method for writing data and storage device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |