CN107943421B - Partition division method and device based on distributed storage system - Google Patents
- Publication number: CN107943421B
- Application number: CN201711241562.5A
- Authority
- CN
- China
- Prior art keywords
- storage
- node
- partition
- partitions
- storage media
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0616—Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application disclose a partition division method and device based on a distributed storage system, relate to the field of storage, and solve the problem of ensuring data reliability when a small number of storage nodes or a small number of disks in a small-scale cluster fail. The specific scheme is as follows: a cluster management node acquires failure information, wherein the failure information is used for indicating a failed storage node or a failed storage medium; the cluster management node re-divides the storage media in a normal state according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the erasure code (EC) to obtain first updated partition information; and the cluster management node sends the first updated partition information to the application node. The embodiments of the present application are used in a data storage process.
Description
Technical Field
The embodiments of the present application relate to the field of storage, and in particular to a partition division method and device based on a distributed storage system.
Background
In a big-data environment, the more data one holds, the greater the value the data implies. Currently, enterprise users, data center infrastructures, and the like mainly store massive data through cloud storage technology, for example, a distributed storage system. While storing massive data, it is necessary to ensure its reliability. Existing strategies for ensuring data reliability mainly include multi-replication and erasure coding (EC).
In the distributed storage system of a small-scale cluster, partitions can be preset according to the redundancy mode of the EC; each partition comprises N + K disks, the disks in each partition belong to different storage nodes, N represents the number of data fragments of the EC, and K represents the number of check fragments of the EC. After the application node performs EC encoding on the data to be written, at least one EC stripe is obtained, and each EC stripe is written into one partition. When a small number of storage nodes or a small number of disks fail, the original data can be recovered by acquiring part of the data and performing simple XOR calculations, so that the reliability of data reading is ensured. However, this data writing method requires that the application node must perform full-stripe writes according to the partition configuration specified by the system, which limits the flexibility of the upper-layer service; moreover, when a partition does not satisfy the conditions of the current data write, the write and subsequent processes cannot be executed normally, and degraded writes occur.
To avoid degraded writes, storage nodes in other partitions of the storage system that are in a normal state may temporarily store the data that should have been written to the failed storage node. In this way, the system's write services can be supported without degradation, and the reliability of data writes is ensured. However, the complexity of managing storage nodes across partitions increases; and once the failed storage node recovers, a data migration operation must be started, which adds data migration overhead and reduces overall system performance.
Therefore, ensuring data reliability when a small number of storage nodes or a small number of disks in a small-scale cluster fail is an urgent problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a partition division method and device based on a distributed storage system, which solve the problem of ensuring data reliability when a small number of storage nodes or a small number of disks in a small-scale cluster fail.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
in a first aspect of the embodiments of the present application, a partition division method based on a distributed storage system is provided. The distributed storage system comprises a cluster management node, an application node and S storage nodes, wherein each storage node comprises X storage media; the S × X storage media included in the S storage nodes are divided into P partitions according to a redundancy mode of an erasure code (EC); each partition in the P partitions comprises Y storage media, and the Y storage media consist of one storage medium from each of Y different storage nodes. The redundancy mode of the EC specifies the number of data fragments and the number of check fragments, where N represents the number of data fragments, K represents the number of check fragments, and Y = N + K. The basic principle is as follows: firstly, the cluster management node acquires failure information, wherein the failure information is used for indicating a failed storage node or a failed storage medium; then, the cluster management node re-divides the storage media in a normal state according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the EC to obtain first updated partition information; finally, the cluster management node sends the first updated partition information to the application node. According to the partition division method based on the distributed storage system, after a storage node or a storage medium fails, the storage media in a normal state are divided again according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the EC, so that the number of storage media in each partition always matches the configuration of the redundancy mode of the EC; data can therefore be written successfully, and data reliability is effectively improved.
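As a rough illustration of this repartitioning step, the following sketch regroups the surviving media into partitions of Y media drawn from Y distinct, least-loaded storage nodes. The function name, data shapes, and the specific selection policy are assumptions for illustration only; the patent does not prescribe them.

```python
# Hypothetical sketch of the first-aspect repartitioning: after i storage nodes
# fail, the cluster management node regroups the (S - i) * X surviving storage
# media into partitions of Y = N + K media, one medium per storage node,
# preferring the least-loaded nodes first.
def repartition(surviving_media, load, y):
    """surviving_media: {node_id: [media_id, ...]} for nodes in a normal state.
    load: {node_id: current load}; y: media per partition (N + K).
    Returns first-updated partition info: {partition_id: [(node, medium), ...]}."""
    remaining = {n: list(m) for n, m in surviving_media.items()}
    partitions = {}
    pid = 0
    while sum(1 for m in remaining.values() if m) >= y:
        # Pick the y least-loaded nodes that still have an unassigned medium.
        candidates = sorted((n for n, m in remaining.items() if m),
                            key=lambda n: load[n])[:y]
        partitions[pid] = [(n, remaining[n].pop(0)) for n in candidates]
        pid += 1
    return partitions

# Example: S = 5 nodes with X = 2 media each, node 5 has failed, EC 2+1 (Y = 3).
media = {n: [1, 2] for n in (1, 2, 3, 4)}
info = repartition(media, load={1: 0, 2: 0, 3: 0, 4: 0}, y=3)
# Every resulting partition holds 3 media from 3 distinct storage nodes.
assert all(len({node for node, _ in p}) == 3 for p in info.values())
```

Note that node 4's leftover media cannot fill a whole partition by themselves; the later section on remaining storage media addresses that case.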
With reference to the first aspect, in a possible implementation manner, if the failure information is the node identifier of a failed storage node and i storage nodes fail, the step in which the cluster management node re-divides the storage media in a normal state according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the EC to obtain the first updated partition information includes: the cluster management node divides the (S − i) × X storage media included in the S − i storage nodes into Q partitions according to the redundancy mode of the EC and the load of the S − i storage nodes to obtain the first updated partition information, wherein the first updated partition information includes a partition identifier of each of the Q partitions and the media identifiers of the storage media included in each of the Q partitions.
With reference to the first aspect, in another possible implementation manner, if the failure information is the media identifier of a failed storage medium and j storage media fail, the step in which the cluster management node re-divides the storage media in a normal state according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the EC to obtain the first updated partition information includes: the cluster management node divides the (S × X) − j storage media included in the S storage nodes into W partitions according to the redundancy mode of the EC and the load of the S storage nodes to obtain the first updated partition information, wherein the first updated partition information includes a partition identifier of each of the W partitions and the media identifiers of the storage media included in each of the W partitions.
With reference to the foregoing possible implementation manners, in another possible implementation manner, after the cluster management node sends the first updated partition information to the application node, the method further includes: the cluster management node acquires recovery information, wherein the recovery information is used for indicating that a failed storage node or a failed storage medium has recovered from its fault; the cluster management node re-divides the storage media in a normal state according to the recovery information, the redundancy mode of the EC and the load of the storage nodes in a normal state to obtain second updated partition information; and the cluster management node sends the second updated partition information to the application node. Therefore, after a storage node or storage medium recovers from its fault and returns to a normal state, the storage media in the distributed storage system are partitioned again, so that the storage media are fully utilized and waste of storage space is avoided.
In a second aspect of the embodiments of the present application, a data writing method is provided. The distributed storage system comprises a cluster management node, an application node and S storage nodes, wherein each storage node comprises X storage media; the S × X storage media included in the S storage nodes are divided into P partitions according to the redundancy mode of the erasure code (EC); each partition in the P partitions comprises Y storage media, and the Y storage media consist of one storage medium from each of Y different storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K. The method comprises the following steps: the application node performs EC encoding on the data to be written to obtain L EC stripes, each EC stripe comprising N data fragments and K check fragments, where L is determined by the data volume of the data to be written and L ≥ 1; the application node stores the L EC stripes into L partitions among Q partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and the media identifiers of the storage media included in each of the Q partitions, the Q partitions are obtained by the cluster management node dividing the (S − i) × X storage media included in the S − i storage nodes according to the redundancy mode of the EC and the load of the S − i storage nodes, and i represents the number of failed storage nodes; or the application node stores the L EC stripes into L partitions among W partitions according to the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and the media identifiers of the storage media included in each of the W partitions, the W partitions are obtained by the cluster management node dividing the (S × X) − j storage media included in the S storage nodes according to the redundancy mode of the EC and the load of the S storage nodes, and j represents the number of failed storage media. Therefore, after a storage node or storage medium fails, the storage media are divided into partitions again, the number of storage media in each partition always matches the configuration of the redundancy mode of the EC, data are written according to the updated partitions, the data can be written successfully, and data reliability is effectively improved.
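The write path of the second aspect can be sketched as follows: the application node splits the data into L stripes, encodes each stripe into N data fragments plus K check fragments, and writes stripe l to one partition of the updated partition info, one fragment per storage medium. The encoding below fakes the K check fragments with repeated XOR parity purely for illustration (a real 4+2 scheme needs a Reed-Solomon-style code); all names are assumptions, not taken from the patent.

```python
# Hedged sketch of the second-aspect write path. ec parity here is illustrative
# only: duplicated XOR parity cannot tolerate two arbitrary losses.
import functools

def xor_parity(frags):
    """Byte-wise XOR of equal-length fragments."""
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*frags))

def write(data, partition_info, n, k, stripe_size):
    """Encode `data` into L EC stripes and assign stripe l to partition l.
    Returns a placement map {(node, medium): fragment}, one fragment per medium."""
    stripes = [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]
    placement = {}
    for l, chunk in enumerate(stripes):
        chunk = chunk.ljust(stripe_size, b"\0")          # pad the last stripe
        size = stripe_size // n
        frags = [chunk[i * size:(i + 1) * size] for i in range(n)]
        frags += [xor_parity(frags)] * k                 # illustrative parity
        media = partition_info[l % len(partition_info)]  # Y = n + k media
        for frag, medium in zip(frags, media):
            placement[medium] = frag
    return placement

# One partition of Y = 6 media (EC 4+2), one medium on each of 6 nodes.
pinfo = {0: [(node, 1) for node in range(1, 7)]}
out = write(b"ABCDEFGHIJKLMNOP", pinfo, n=4, k=2, stripe_size=16)
assert out[(1, 1)] == b"ABCD"          # first data fragment
assert out[(5, 1)] == out[(6, 1)]      # the two illustrative check fragments
```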
With reference to the second aspect, in a possible implementation manner, before the application node performs EC encoding on the data to be written to obtain the L EC stripes, the method further includes: the application node receives the first updated partition information sent by the cluster management node.
With reference to the foregoing possible implementation manner, in another possible implementation manner, after the application node receives the first updated partition information sent by the cluster management node, the method further includes: the application node receives second updated partition information sent by the cluster management node, wherein the second updated partition information is obtained by the cluster management node re-dividing the storage media in a normal state according to recovery information, the redundancy mode of the EC and the load of the storage nodes in a normal state, and the recovery information is used for indicating that a failed storage node or a failed storage medium has recovered from its fault.
In a third aspect of the embodiments of the present application, a cluster management node is provided. The distributed storage system comprises the cluster management node, an application node and S storage nodes, wherein each storage node comprises X storage media; the S × X storage media included in the S storage nodes are divided into P partitions according to the redundancy mode of an erasure code (EC); each partition in the P partitions comprises Y storage media, and the Y storage media consist of one storage medium from each of Y different storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K. The cluster management node comprises: a transceiver unit, configured to acquire failure information, wherein the failure information is used for indicating a failed storage node or a failed storage medium; and a processing unit, configured to re-divide the storage media in a normal state according to the failure information, the load of the storage nodes in a normal state and the redundancy mode of the EC to obtain first updated partition information; the transceiver unit is further configured to send the first updated partition information to the application node.
In a fourth aspect of the embodiments of the present application, an application node is provided. The distributed storage system comprises a cluster management node, the application node and S storage nodes, wherein each storage node comprises X storage media; the S × X storage media included in the S storage nodes are divided into P partitions according to the redundancy mode of an erasure code (EC); each partition in the P partitions comprises Y storage media, and the Y storage media consist of one storage medium from each of Y different storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K. The application node comprises: a processing unit, configured to perform EC encoding on data to be written to obtain L EC stripes, each EC stripe comprising N data fragments and K check fragments, where L is determined by the data volume of the data to be written and L ≥ 1; the processing unit and a transceiver unit are configured to store the L EC stripes into L partitions among Q partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and the media identifiers of the storage media included in each of the Q partitions, the Q partitions are obtained by the cluster management node dividing the (S − i) × X storage media included in the S − i storage nodes according to the redundancy mode of the EC and the load of the S − i storage nodes, and i represents the number of failed storage nodes; or to store the L EC stripes into L partitions among W partitions according to the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and the media identifiers of the storage media included in each of the W partitions, the W partitions are obtained by the cluster management node dividing the (S × X) − j storage media included in the S storage nodes according to the redundancy mode of the EC and the load of the S storage nodes, and j represents the number of failed storage media.
It should be noted that after the cluster management node obtains the failure information, if Y is greater than S, each of the Q partitions includes at least two storage media belonging to the same storage node. In addition, the distributed storage system described in the embodiments of the present application is a small-scale cluster system, and S is an integer greater than or equal to 3 and less than or equal to 20. For example, the failure rate may be 10% of S, and i may be 3.
It should be noted that the functional modules in the third and fourth aspects may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, for example, a communication interface for performing the functions of the transceiver unit, a processor for performing the functions of the processing unit, and a memory for storing the program instructions of the partition division method and the data writing method based on the distributed storage system according to the embodiments of the present application. The processor, the communication interface and the memory are connected through a bus and communicate with one another. Specifically, reference may be made to the behavior of the cluster management node in the partition division method based on the distributed storage system provided in the first aspect, and the behavior of the application node in the data writing method provided in the second aspect.
In a fifth aspect of the embodiments of the present application, a cluster management node is provided, and the cluster management node may include: at least one processor, a memory, a communication interface and a communication bus; the at least one processor is connected to the memory and the communication interface through the communication bus, the memory is used for storing computer-executable instructions, and when running, the processor executes the computer-executable instructions stored in the memory, so that the cluster management node executes the partition division method based on the distributed storage system according to the first aspect or any one of its possible implementation manners.
In a sixth aspect of the embodiments of the present application, an application node is provided, and the application node may include: at least one processor, a memory, a communication interface and a communication bus; the at least one processor is connected to the memory and the communication interface through the communication bus, the memory is used for storing computer-executable instructions, and when running, the processor executes the computer-executable instructions stored in the memory, so that the application node executes the data writing method according to the second aspect or any one of its possible implementation manners.
A seventh aspect of the embodiments of the present application provides a computer-readable storage medium for storing computer software instructions for the cluster management node, where the computer software instructions, when executed by a processor, enable the cluster management node to perform the method of any of the above aspects.
In an eighth aspect of embodiments of the present application, a computer-readable storage medium is provided, which stores computer software instructions for the application node, and when the computer software instructions are executed by a processor, the application node may execute the method of any aspect.
In a ninth aspect of embodiments of the present application, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the above aspects.
In addition, for the technical effects brought by any design in the third to ninth aspects, reference may be made to the technical effects of the corresponding designs in the first and second aspects; details are not repeated here.
In the embodiments of the present application, the names of the cluster management node and the application node do not limit the devices themselves; in actual implementation, the devices may appear under other names. As long as the functions of the devices are similar to those in the embodiments of the present application, they fall within the scope of the claims of the present application and their equivalents.
These and other aspects of the embodiments of the present application will be more readily apparent from the following description of the embodiments.
Drawings
Fig. 1 is a simplified schematic diagram of a distributed storage system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a partition division provided in the prior art;
Fig. 3 is a schematic diagram of another partition division provided in the prior art;
Fig. 4 is a schematic diagram of another partition division provided in the prior art;
Fig. 5 is a flowchart of a partition division method based on a distributed storage system according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a partition division provided in an embodiment of the present application;
Fig. 7 is a schematic diagram of another partition division provided in an embodiment of the present application;
Fig. 8 is a schematic diagram of another partition division provided in an embodiment of the present application;
Fig. 9 is a schematic diagram of another partition division provided in an embodiment of the present application;
Fig. 10 is a schematic diagram of another partition division provided in an embodiment of the present application;
Fig. 11 is a schematic diagram of another partition division provided in an embodiment of the present application;
Fig. 12 is a flowchart of a data writing method according to an embodiment of the present application;
Fig. 13 is a diagram illustrating a data writing process according to an embodiment of the present application;
Fig. 14 is a schematic composition diagram of a cluster management node according to an embodiment of the present application;
Fig. 15 is a schematic diagram of a computer device according to an embodiment of the present application;
Fig. 16 is a schematic composition diagram of another cluster management node according to an embodiment of the present application;
Fig. 17 is a schematic composition diagram of an application node according to an embodiment of the present application;
Fig. 18 is a schematic composition diagram of another application node according to an embodiment of the present application.
Detailed Description
For clarity and conciseness of the following descriptions of the various embodiments, a brief introduction to the related art is first given:
the distributed storage system is a storage system easy to expand, positions of all storage nodes are equal, positions and the number of the storage nodes in the system are not limited, the distributed storage system can be expanded at will, data can be stored on a plurality of independent storage nodes in a scattered mode, and the effect of load balancing is achieved. Compared with the traditional network storage system, the system adopts a centralized storage server to store all data, and improves the reliability, the availability and the access efficiency of the system.
Erasure coding fragments the data to obtain data fragments, then computes a small number of check fragments from the data fragments, and stores all the data fragments and check fragments on different data nodes. When reading, only part of the fragments needs to be acquired, and the original data can be recovered with simple XOR calculations. This method greatly improves the space utilization of hard disks; hardware devices can accelerate the calculation process, and the performance loss can be controlled within a certain range.
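As an illustration of the recovery principle described above, the following is a minimal sketch of XOR-based erasure coding with a single check fragment (K = 1). Real N + K schemes with K > 1, such as the 4 + 2 configuration used in the examples below, require Reed-Solomon-style codes; this only shows how a lost fragment is rebuilt from the surviving ones.

```python
# Minimal XOR erasure-coding sketch (N data fragments, K = 1 check fragment).
from functools import reduce

def xor_fragments(fragments):
    """Byte-wise XOR of equal-length fragments."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*fragments))

# Split the data into N = 4 equal data fragments and compute one check fragment.
data = b"ABCDEFGHIJKLMNOP"
n = 4
size = len(data) // n
data_frags = [data[i * size:(i + 1) * size] for i in range(n)]
check_frag = xor_fragments(data_frags)

# Simulate losing data fragment 2: the XOR of all survivors recovers it.
survivors = data_frags[:2] + data_frags[3:] + [check_frag]
recovered = xor_fragments(survivors)
assert recovered == data_frags[2]  # b"IJKL"
```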
By way of example, Fig. 1 is a simplified schematic diagram of a distributed storage system according to an embodiment of the present application. As shown in Fig. 1, the system architecture may include a cluster management node, an application node and S storage nodes, wherein each storage node comprises X storage media. The system may pre-configure the redundancy mode of the EC, that is, the number of data fragments and the number of check fragments, according to the number of storage nodes and the number of storage media. The system may divide the S × X storage media included in the S storage nodes into P partitions according to the redundancy mode of the EC, each of the P partitions comprising Y storage media: Y different storage nodes are selected from the S storage nodes, one storage medium is then selected from each of these Y storage nodes, and the storage media so selected are determined to form one partition. It can be understood that the Y storage media included in each partition belong to different storage nodes, and the storage media included in different partitions are also different. The redundancy mode of the EC specifies the number of data fragments and the number of check fragments, where N represents the number of data fragments, K represents the number of check fragments, and Y = N + K.
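The division rule just described can be sketched as follows. The round-robin node selection is a hypothetical illustration (it will not reproduce the exact layouts of the figures below); what matters is the invariant it maintains: each partition takes one medium from each of Y distinct storage nodes.

```python
# Illustrative initial division: partition p takes y consecutive node indices
# modulo s (distinct whenever y <= s) and one unused medium from each node.
def divide(s, x, y):
    """Divide the s * x storage media of s storage nodes (x media each) into
    partitions of y media, one medium per node (requires y <= s)."""
    next_medium = {node: 0 for node in range(1, s + 1)}
    partitions = []
    for p in range(s * x // y):
        # Consecutive residues modulo s give y distinct nodes when y <= s.
        nodes = [(p * y + j) % s + 1 for j in range(y)]
        part = []
        for node in nodes:
            next_medium[node] += 1
            part.append((node, next_medium[node]))
        partitions.append(part)
    return partitions

parts = divide(s=8, x=6, y=6)  # the 8-node, 4+2 (Y = 6) example below
assert len(parts) == 8
# Every partition draws its 6 media from 6 different storage nodes.
assert all(len({node for node, _ in part}) == 6 for part in parts)
```

Because 6 consecutive indices modulo 8 hit 6 distinct nodes, and 8 partitions of 6 slots use each node exactly 6 times, no node ever runs out of media in this example.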
It should be noted that the partition dividing method based on the distributed storage system according to the embodiment of the present application is applicable to a small-scale cluster distributed storage system. For example, S is an integer of 3 or more and 20 or less. X is an integer of 1 or more. For example, the distributed storage system includes 8 storage nodes, and each storage node includes 6 storage media, that is, S has a value of 8 and X has a value of 6. Or, the distributed storage system includes 6 storage nodes, and each storage node includes 6 storage media, that is, the value of S is 6, and the value of X is 6.
Assume that the redundancy mode of the EC is 4 data fragments and 2 check fragments, i.e. N is 4 and K is 2. According to this redundancy mode, a distributed storage system including 8 storage nodes, each including 6 storage media, is partitioned: 6 different storage nodes are selected from the 8 storage nodes, one storage medium is then selected from each of these 6 storage nodes, and the selected storage media are determined to form a partition. Eight partitions can be obtained, each including 6 storage media. Assume that the 8 storage nodes are numbered 1 to 8, and the 6 storage media included in each storage node are numbered 1 to 6. Storage medium 1 in storage node 1 can be denoted as 1-1, storage medium 2 in storage node 1 as 1-2, storage medium 3 in storage node 1 as 1-3, storage medium 4 in storage node 1 as 1-4, storage medium 5 in storage node 1 as 1-5, and storage medium 6 in storage node 1 as 1-6. Similarly, storage medium 1 in storage node 2 may be denoted as 2-1, and storage medium 2 in storage node 2 as 2-2. Storage media in other storage nodes may be denoted in the same notation, which is not repeated here.
As shown in FIG. 2, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-1. Partition two includes storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, storage medium 6-2, and storage medium 7-2. Partition three includes storage medium 3-3, storage medium 4-3, storage medium 5-3, storage medium 6-3, storage medium 7-3, and storage medium 8-3. Partition four includes storage media 4-4, storage media 5-4, storage media 6-4, storage media 7-4, storage media 8-4, and storage media 1-2. Partition five includes storage media 5-5, storage media 6-5, storage media 7-5, storage media 8-5, storage media 1-3, and storage media 2-3. Partition six includes storage media 6-6, storage media 7-6, storage media 8-6, storage media 1-4, storage media 2-4, and storage media 3-4. Partition seven includes storage media 1-5, storage media 2-5, storage media 3-5, storage media 4-5, storage media 7-1, and storage media 8-1. Partition eight includes storage media 1-6, storage media 2-6, storage media 3-6, storage media 4-6, storage media 5-6, and storage media 8-2.
Assume that the redundancy mode of the EC is 4 data slices and 2 check slices, i.e., N is 4 and K is 2. According to the redundancy mode of the EC, a distributed storage system including 6 storage nodes each including 6 storage media is partitioned for the 4 data fragments and 2 check fragments. That is, one storage medium is selected from each of the 6 storage nodes, and the 6 selected storage media are determined to form one partition. Six partitions may be obtained, each including 6 storage media. It is assumed that the 6 storage nodes are numbered 1 to 6, and the 6 storage media included in each storage node are numbered 1 to 6. The storage media in the storage nodes may be denoted in the manner described above, and details are not described again in this embodiment of the present application.
As shown in FIG. 3, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-1. Partition two includes storage medium 1-2, storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, and storage medium 6-2. Partition three includes storage medium 1-3, storage medium 2-3, storage medium 3-3, storage medium 4-3, storage medium 5-3, and storage medium 6-3. Partition four includes storage medium 1-4, storage medium 2-4, storage medium 3-4, storage medium 4-4, storage medium 5-4, and storage medium 6-4. Partition five includes storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, storage medium 5-5, and storage medium 6-5. Partition six includes storage medium 1-6, storage medium 2-6, storage medium 3-6, storage medium 4-6, storage medium 5-6, and storage medium 6-6.
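The FIG. 3 layout is the special case where the number of storage nodes equals the EC width N + K = 6: medium m of every node simply forms partition m. A minimal sketch under that assumption (the function name and (node, medium) representation are illustrative, not part of the embodiment):

```python
def partition_equal_width(num_nodes: int, media_per_node: int):
    """One partition per medium index: partition m takes medium m of
    every node, so media within a partition never share a node."""
    return [
        [(node, medium) for node in range(1, num_nodes + 1)]
        for medium in range(1, media_per_node + 1)
    ]

partitions = partition_equal_width(6, 6)
print(partitions[0])  # partition one: medium 1 of nodes 1 to 6
```

Running this for 6 nodes with 6 media each yields the six partitions of FIG. 3.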
It should be noted that the above partitioning manner is only an exemplary illustration; this embodiment of the present application does not limit it, and other partitioning manners may also be used in practical applications. However, when dividing a partition, it is necessary to ensure that the storage media included in the same partition belong to different storage nodes. If the remaining storage media in the distributed storage system are not enough to form a complete partition, storage media that have already been allocated to other partitions may be selected to form a partition together with the remaining storage media; that is, different partitions may include the same storage medium (the same storage medium in the same storage node). It should be noted that, when selecting a storage medium that has already been allocated to another partition, the storage medium with the smallest load needs to be selected from the storage nodes other than the storage nodes that include the remaining storage media.
Assume that the redundancy mode of the EC is 4 data slices and 2 check slices, i.e., N is 4 and K is 2. According to the redundancy mode of the EC, a distributed storage system including 8 storage nodes each including 5 storage media is partitioned. That is, 6 different storage nodes are selected from the 8 storage nodes, one storage medium is then selected from each of the 6 selected storage nodes, and the 6 selected storage media are determined to form one partition. Seven partitions can be obtained, each including 6 storage media. It is assumed that the 8 storage nodes are numbered 1 to 8, and the 5 storage media included in each storage node are numbered 1 to 5. The storage media in the storage nodes may be denoted in the manner described above, and details are not described again in this embodiment of the present application.
As shown in FIG. 4, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-1. Partition two includes storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, storage medium 6-2, and storage medium 7-2. Partition three includes storage medium 3-3, storage medium 4-3, storage medium 5-3, storage medium 6-3, storage medium 7-3, and storage medium 8-3. Partition four includes storage medium 4-4, storage medium 5-4, storage medium 6-4, storage medium 7-4, storage medium 8-4, and storage medium 1-2. Partition five includes storage medium 5-5, storage medium 6-5, storage medium 7-5, storage medium 8-5, storage medium 1-3, and storage medium 2-3. Partition six includes storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, storage medium 7-1, and storage medium 8-1. At this time, the remaining four storage media, namely storage medium 1-4, storage medium 2-4, storage medium 3-4, and storage medium 8-2, are two storage media short of forming a partition, and two storage media can be selected from storage node 4, storage node 5, storage node 6, and storage node 7 in ascending order of load. Assuming that storage medium 6-4 and storage medium 7-4 have the smallest loads, storage medium 1-4, storage medium 2-4, storage medium 3-4, storage medium 8-2, storage medium 6-4, and storage medium 7-4 may form partition seven.
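The completion of partition seven above can be sketched as: take the leftover media, then pull the least-loaded already-assigned media from nodes not yet represented. For illustration only; the load values below are invented so that storage media 6-4 and 7-4 come out lightest, as assumed in the example:

```python
def complete_partition(remaining, candidate_loads, y):
    """Top up `remaining` (a list of (node, medium) pairs) to y media,
    using the least-loaded candidates whose node is not yet in the
    partition. candidate_loads maps (node, medium) -> load."""
    used_nodes = {node for node, _medium in remaining}
    pool = sorted(
        (c for c in candidate_loads if c[0] not in used_nodes),
        key=candidate_loads.get,
    )
    return remaining + pool[: y - len(remaining)]

remaining = [(1, 4), (2, 4), (3, 4), (8, 2)]         # two media short
loads = {(4, 4): 9, (5, 4): 8, (6, 4): 1, (7, 4): 2}  # invented loads
print(complete_partition(remaining, loads, 6))
```

With these invented loads the result appends (6, 4) and (7, 4), reproducing partition seven of FIG. 4.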
A storage medium refers to a carrier for storing data, such as a floppy disk, an optical disc, a DVD, a hard disk, a flash memory, a Secure Digital (SD) card, a MultiMediaCard (MMC), or a Memory Stick. The most popular storage medium today is the flash-memory (NAND flash) based disk.
A storage node includes the storage media described above and is used for storing data. The storage node in this embodiment of the present application may also be referred to as a storage server. The storage nodes may be different logical nodes integrated on the same device, or may be devices distributed in different locations.
The cluster management node is used for managing metadata, the addresses of the storage nodes, the states of the storage nodes, and the loads of the storage nodes. Metadata, also called intermediary data or relay data, is data that describes data (data about data); it mainly describes data attributes (properties) and supports functions such as indicating storage locations, recording historical data, resource searching, and file recording.
The application node stores application software and is used for generating data, writing the data into the storage nodes, or accessing the storage nodes to read data. The application node and the cluster management node may be different logical nodes integrated on the same device, or may be devices distributed in different locations, and may be connected through a network or directly.
The embodiment of the present application provides a partition division method based on a distributed storage system, wherein the distributed storage system includes a cluster management node, an application node, and S storage nodes. The basic principle is as follows: first, the cluster management node acquires failure information, which is used for indicating a failed storage node or a failed storage medium; then, the cluster management node re-divides the storage media in a normal state according to the failure information, the loads of the storage nodes in a normal state, and the redundancy mode of the EC, to obtain first updated partition information; finally, the cluster management node sends the first updated partition information to the application node. Therefore, after a storage node or a storage medium fails, the storage media in a normal state are divided again according to the failure information, the loads of the normal storage nodes, and the redundancy mode of the EC, so that the number of storage media in each partition always matches the configuration of the redundancy mode of the EC. This ensures that data is written successfully when data is written, and effectively improves data reliability.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 5 is a flowchart of a partition dividing method based on a distributed storage system according to an embodiment of the present application, and as shown in fig. 5, the method may include:
S501, the cluster management node acquires failure information.
For example, the cluster management node may periodically send a status request message to the storage nodes in the distributed storage system to query whether the states of the storage nodes are normal. The status request message is used for requesting a storage node to return status information. If a storage node is in a normal state, it may return a normal-state response message to the cluster management node; if a storage node is in a fault state, it may return a fault-state response message to the cluster management node. In addition, if the cluster management node does not receive any message, such as a normal-state response message or a fault-state response message, within a predetermined period of time, the cluster management node may determine that the storage node has timed out and is in a fault state. The fault-state response message may indicate a failure of the storage node or a failure of a storage medium included in the storage node.
Optionally, instead of the cluster management node actively sending the status request message to the storage nodes in the distributed storage system, the storage nodes may periodically send status information to the cluster management node to notify it of whether their states are normal. If the cluster management node does not receive the status information of a certain storage node in the distributed storage system within a predetermined period of time, it can determine that the storage node has failed.
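For illustration only, the passive monitoring variant above can be sketched as a timestamp check: each node's last status report is recorded, and any node whose report is older than the timeout window is treated as failed. The 30-second default and the timestamp values are invented for the sketch:

```python
def failed_nodes(last_report, now, timeout=30.0):
    """Return the nodes whose most recent status report is older than
    `timeout` seconds; these are treated as failed. last_report maps
    node id -> time of last report (seconds)."""
    return sorted(node for node, t in last_report.items() if now - t > timeout)

reports = {1: 100.0, 2: 95.0, 3: 60.0}   # node -> last report time (s)
print(failed_nodes(reports, now=100.0))  # [3]: node 3 missed the 30 s window
```

The cluster management node would run such a check periodically and feed any failed node ids into the repartitioning step of S502.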
The failure information is used to indicate a failed storage node or a failed storage medium.
S502, the cluster management node re-divides the storage media in a normal state according to the failure information, the loads of the storage nodes in a normal state, and the redundancy mode of the EC, to obtain first updated partition information.
Optionally, the failure information may be the node identifier of the failed storage node; for example, storage node 1 and storage node 2 may be node identifiers. Assume that i storage nodes fail, where i is an integer greater than or equal to 1 and less than or equal to 3. At this time, data cannot be written to a partition that includes a storage medium of a failed storage node. Therefore, the cluster management node divides the (S − i) × X storage media included in the S − i storage nodes into Q partitions according to the redundancy mode of the EC and the loads of the S − i storage nodes, and obtains the first updated partition information. The first updated partition information includes a partition identifier of each of the Q partitions and the media identifiers of the storage media included in each of the Q partitions. For example, storage medium 1-1 and storage medium 2-3 may be media identifiers, and partition one and partition two may be partition identifiers.
When Y is less than or equal to S, for each of the Q partitions, the cluster management node selects Y different storage nodes from the S − i storage nodes, selects one storage medium from each of the Y different storage nodes, and determines that the storage media selected from the Y storage nodes form one partition.
For example, assume that the redundancy mode of the EC is 4 data slices and 2 check slices, i.e., N is 4 and K is 2. The distributed storage system includes 8 storage nodes, each including 6 storage media. Assume that the cluster management node learns that storage node 8 has failed. According to the redundancy mode of the EC, the cluster management node partitions the 6 × 7 = 42 storage media included in storage nodes 1 to 7 for the 4 data fragments and 2 check fragments. That is, 6 different storage nodes are first selected from the 7 storage nodes, one storage medium is then selected from each of the 6 selected storage nodes, and the 6 selected storage media are determined to form one partition. Seven partitions may be obtained, each including 6 storage media.
As shown in FIG. 6, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-1. Partition two includes storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, storage medium 6-2, and storage medium 7-2. Partition three includes storage medium 3-3, storage medium 4-3, storage medium 5-3, storage medium 6-3, storage medium 7-3, and storage medium 1-2. Partition four includes storage medium 4-4, storage medium 5-4, storage medium 6-4, storage medium 7-4, storage medium 1-3, and storage medium 2-3. Partition five includes storage medium 5-5, storage medium 6-5, storage medium 7-5, storage medium 1-4, storage medium 2-4, and storage medium 3-4. Partition six includes storage medium 6-6, storage medium 7-6, storage medium 1-5, storage medium 2-5, storage medium 3-5, and storage medium 4-5. Partition seven includes storage medium 1-6, storage medium 2-6, storage medium 3-6, storage medium 4-6, storage medium 5-6, and storage medium 7-1.
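One simple schedule that produces seven six-wide partitions from the 42 surviving media (though not necessarily the exact medium numbering of FIG. 6) is to walk the surviving nodes round-robin, column by column, and cut the walk into groups of Y: because Y is at most the number of nodes, any Y consecutive picks land on distinct nodes. A hedged sketch, with (node, medium) pairs as the representation:

```python
def repartition(nodes, media_per_node, y):
    """List media column-major over the surviving nodes, then slice the
    sequence into partitions of y. Consecutive picks cycle through the
    nodes, so each partition spans y distinct nodes when y <= len(nodes)."""
    walk = [(n, m) for m in range(1, media_per_node + 1) for n in nodes]
    full = len(walk) // y * y          # drop any incomplete tail group
    return [walk[i:i + y] for i in range(0, full, y)]

# Storage node 8 failed: nodes 1..7 survive, 6 media each, EC width 6.
parts = repartition(nodes=list(range(1, 8)), media_per_node=6, y=6)
print(len(parts))  # 7
```

Seven partitions result, each drawing its 6 media from 6 distinct nodes, matching the count in this example.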
In the above division manner, the storage media included in the partitions are all different. In a possible implementation, when the remaining storage media in the distributed storage system are not enough to form a partition, already-divided storage media may be selected from the storage nodes with the smallest loads in the distributed storage system to form a partition together with the remaining storage media; that is, different partitions may include the same storage medium.
For example, assume that the redundancy mode of the EC is 4 data slices and 2 check slices, i.e., N is 4 and K is 2. The distributed storage system includes 8 storage nodes, each including 5 storage media. Assume that the cluster management node learns that storage node 8 has failed. According to the redundancy mode of the EC, the cluster management node partitions the 5 × 7 = 35 storage media included in storage nodes 1 to 7 for the 4 data fragments and 2 check fragments. That is, 6 different storage nodes are first selected from the 7 storage nodes, one storage medium is then selected from each of the 6 selected storage nodes, and the 6 selected storage media are determined to form one partition. Six partitions may be obtained, each including 6 storage media.
As shown in FIG. 7, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-1. Partition two includes storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, storage medium 6-2, and storage medium 7-2. Partition three includes storage medium 3-3, storage medium 4-3, storage medium 5-3, storage medium 6-3, storage medium 7-3, and storage medium 1-2. Partition four includes storage medium 4-4, storage medium 5-4, storage medium 6-4, storage medium 7-4, storage medium 1-3, and storage medium 2-3. Partition five includes storage medium 5-5, storage medium 6-5, storage medium 7-5, storage medium 1-4, storage medium 2-4, and storage medium 3-4. At this time, the remaining five storage media, namely storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, and storage medium 7-1, are one storage medium short of forming a partition, and one storage medium can be selected from storage node 5 or storage node 6 in ascending order of load. Assuming that storage medium 5-3 has the smallest load, storage medium 5-3, together with storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, and storage medium 7-1, may form partition six.
In the above partitioning manner, although one storage node in the distributed storage system has failed, the number of remaining storage nodes is still no smaller than what the redundancy mode of the EC requires. In a possible implementation, the number of storage nodes in a normal state in the distributed storage system may be smaller than the configuration of the redundancy mode of the EC, that is, Y is greater than S. In that case, at least two of the storage media included in each of the Q partitions belong to the same storage node.
For example, assume that the redundancy mode of the EC is 4 data slices and 2 check slices, i.e., N is 4 and K is 2. The distributed storage system includes 6 storage nodes, each including 6 storage media. Assume that the cluster management node learns that storage node 6 has failed. According to the redundancy mode of the EC, the cluster management node partitions the 6 × 5 = 30 storage media included in storage nodes 1 to 5 for the 4 data fragments and 2 check fragments. It selects one storage medium from each of the 5 storage nodes to obtain 5 storage media, sorts the 5 storage nodes in ascending order of load, selects 1 more storage medium from the storage node with the smallest load, and forms a new partition from the 5 storage media and the 1 extra storage medium. Two different storage media of the same storage node therefore exist in such a partition.
As shown in FIG. 8, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 1-2. Partition two includes storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-2, storage medium 1-3, and storage medium 2-3. Partition three includes storage medium 3-3, storage medium 4-3, storage medium 5-3, storage medium 1-4, storage medium 2-4, and storage medium 3-4. Partition four includes storage medium 4-4, storage medium 5-4, storage medium 1-5, storage medium 2-5, storage medium 3-5, and storage medium 4-5. Partition five includes storage medium 1-6, storage medium 2-6, storage medium 3-6, storage medium 4-6, storage medium 5-6, and storage medium 5-5.
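For illustration only, one round of the oversubscribed case above (5 surviving nodes feeding a width-6 partition) can be sketched as: take each node's next free medium, then top up from the least-loaded node. With node 1 assumed lightest (the load values are invented), this reproduces partition one of FIG. 8:

```python
def partition_round(free, loads, y):
    """free: node -> list of unused medium ids; loads: node -> load.
    Take one medium from every node first, then extra media from the
    least-loaded nodes until the partition reaches width y."""
    part = [(n, free[n].pop(0)) for n in sorted(free) if free[n]]
    for n in sorted(loads, key=loads.get):
        while len(part) < y and free[n]:
            part.append((n, free[n].pop(0)))
        if len(part) >= y:
            break
    return part

free = {n: list(range(1, 7)) for n in range(1, 6)}  # 5 nodes x 6 media
loads = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}              # invented loads
print(partition_round(free, loads, 6))
# [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (1, 2)]
```

Because `free` is consumed in place, calling the function repeatedly would build the successive partitions of the layout, each containing two media from one node.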
Optionally, the failure information may be the media identifier of the failed storage medium, where the media identifier is used to indicate the location of the storage medium in the distributed storage system. For example, storage medium 1-1, representing storage medium 1 in storage node 1, and storage medium 4-5, representing storage medium 5 in storage node 4, may be media identifiers. Assume that j storage media fail, where j is an integer greater than or equal to 1. At this time, data cannot be written to a partition that includes a failed storage medium. Therefore, the cluster management node divides the (S × X) − j storage media included in the S storage nodes into W partitions according to the redundancy mode of the EC and the loads of the storage nodes in a normal state, and obtains the first updated partition information. The first updated partition information includes a partition identifier of each of the W partitions and the media identifiers of the storage media included in each of the W partitions. For example, partition one and partition two may be partition identifiers.
It should be noted that, if a storage medium in a storage node fails, then when dividing a partition, one storage medium should be selected from each of the S storage nodes to form the partition, and it should be ensured that the storage media included in each partition belong to different storage nodes. When the remaining storage media in the distributed storage system are not enough to form a partition, already-divided storage media may be selected from the storage nodes with the smallest loads in the distributed storage system to form a partition together with the remaining storage media; that is, different partitions may include the same storage medium. Accordingly, different ones of the W partitions may include the same storage medium (the same storage medium in the same storage node).
For example, taking the partitions shown in FIG. 2 as an example, assume that storage medium 1 of storage node 8 fails, that is, storage medium 8-1 in partition seven fails, so partition seven is one storage medium short. At this time, storage medium 8-2 may be supplemented into partition seven, which in turn leaves partition eight one storage medium short. Assuming that the load of storage medium 6-6 is the smallest, storage medium 6-6 may be multiplexed and supplemented into partition eight. As shown in FIG. 9, partition seven includes storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, storage medium 7-1, and storage medium 8-2. Partition eight includes storage medium 1-6, storage medium 2-6, storage medium 3-6, storage medium 4-6, storage medium 5-6, and storage medium 6-6. The storage media included in partitions one to six are the same as those included in partitions one to six shown in FIG. 2.
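The FIG. 9 repair — replacing a failed medium with the least-loaded candidate from a node not already in the partition — can be sketched as follows. For illustration only; the candidate set and load values are invented so that storage medium 8-2 is the lightest choice:

```python
def patch_partition(partition, failed, candidate_loads):
    """Keep the surviving media of `partition`, then refill each lost
    slot with the least-loaded candidate whose node is not already
    represented. candidate_loads maps (node, medium) -> load."""
    kept = [pm for pm in partition if pm not in failed]
    nodes_in = {node for node, _medium in kept}
    pool = sorted(
        (c for c in candidate_loads if c[0] not in nodes_in),
        key=candidate_loads.get,
    )
    return kept + pool[: len(partition) - len(kept)]

partition_seven = [(1, 5), (2, 5), (3, 5), (4, 5), (7, 1), (8, 1)]
loads = {(8, 2): 1, (5, 6): 4, (6, 6): 5}  # invented candidate loads
print(patch_partition(partition_seven, {(8, 1)}, loads))
```

With these invented loads the failed medium 8-1 is replaced by 8-2, matching partition seven of FIG. 9.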
For example, taking the partitions shown in FIG. 3 as an example, assume that storage medium 1 of storage node 6 fails, that is, storage medium 6-1 in partition one fails, and that storage medium 2 of storage node 5 fails, that is, storage medium 5-2 in partition two fails. As shown in FIG. 10, storage medium 6-2 may be supplemented into partition one, which then includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 6-2. Storage medium 5-3 and storage medium 6-3 may be supplemented into partition two, which then includes storage medium 1-2, storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-3, and storage medium 6-3. Storage medium 5-4 and storage medium 6-4 may be supplemented into partition three, which then includes storage medium 1-3, storage medium 2-3, storage medium 3-3, storage medium 4-3, storage medium 5-4, and storage medium 6-4. Storage medium 5-5 and storage medium 6-5 may be supplemented into partition four, which then includes storage medium 1-4, storage medium 2-4, storage medium 3-4, storage medium 4-4, storage medium 5-5, and storage medium 6-5. Storage medium 5-6 and storage medium 6-6 may be supplemented into partition five, which then includes storage medium 1-5, storage medium 2-5, storage medium 3-5, storage medium 4-5, storage medium 5-6, and storage medium 6-6. At this time, the remaining four storage media, namely storage medium 1-6, storage medium 2-6, storage medium 3-6, and storage medium 4-6, are two storage media short of forming a partition, and two storage media can be selected from storage node 5 and storage node 6 in ascending order of load. Assuming that storage medium 5-4 and storage medium 6-5 have the smallest loads, storage medium 5-4 and storage medium 6-5, together with storage medium 1-6, storage medium 2-6, storage medium 3-6, and storage medium 4-6, may form partition six.
In another implementation, as shown in FIG. 11, partition one includes storage medium 1-1, storage medium 2-1, storage medium 3-1, storage medium 4-1, storage medium 5-1, and storage medium 1-2. Partition two includes storage medium 1-3, storage medium 2-2, storage medium 3-2, storage medium 4-2, storage medium 5-3, and storage medium 6-3. Partition three includes storage medium 1-4, storage medium 2-3, storage medium 3-3, storage medium 4-3, storage medium 5-4, and storage medium 6-4. Partition four includes storage medium 1-5, storage medium 2-4, storage medium 3-4, storage medium 4-4, storage medium 5-5, and storage medium 6-5. Partition five includes storage medium 1-6, storage medium 2-5, storage medium 3-5, storage medium 4-5, storage medium 5-6, and storage medium 6-6. Assuming that storage medium 1-1 has a small load, storage medium 1-1 may be multiplexed at this time and supplemented into partition six. Partition six includes storage medium 1-1, storage medium 2-6, storage medium 3-6, storage medium 4-6, storage medium 5-6, and storage medium 6-6.
Although there are failed storage nodes or failed storage media in the distributed storage system, the number of storage media in each partition is always kept at 6, the same as the configuration of the redundancy mode of the EC, which effectively improves data reliability. It should be noted that, at this time, one storage medium may belong to multiple partitions. In this case, if all the storage media of one of two partitions that include the same storage medium are fully written, the other of the two partitions has no remaining capacity on the shared storage medium and cannot write data, and it may be determined that the other partition is an invalid partition. If the storage media of one of the two partitions are not all fully written, the other partition still has free capacity on the shared storage medium and can write data until it is fully written.
S503, the cluster management node sends the first updated partition information to the application node.
S504, the application node receives the first updated partition information sent by the cluster management node.
After receiving the first updated partition information sent by the cluster management node, the application node stores the first updated partition information.
Further, after the cluster management node re-divides the storage media in the distributed storage system into partitions, the application node may write data according to the newly divided partitions when it needs to write data. As shown in fig. 12, the method may further include the following detailed steps:
S505, the application node performs EC encoding on the data to be written to obtain L EC stripes.
Each EC stripe includes N data slices and K check slices.
S506, the application node stores the L EC stripes to L partitions in the Q partitions according to the first updated partition information.
The first updated partition information includes a partition identifier of each of the Q partitions and the media identifiers of the storage media included in each of the Q partitions. The Q partitions are obtained by the cluster management node dividing the (S − i) × X storage media included in the S − i storage nodes according to the redundancy mode of the EC and the loads of the S − i storage nodes, where i represents the number of failed storage nodes.
S507, the application node stores the L EC stripes to L partitions in the W partitions according to the first updated partition information.
The first updated partition information includes a partition identifier of each of the W partitions and the media identifiers of the storage media included in each of the W partitions. The W partitions are obtained by the cluster management node dividing the (S × X) − j storage media included in the S storage nodes according to the redundancy mode of the EC and the loads of the S storage nodes, where j represents the number of failed storage media.
For example, assume that data slices 0 to 11 need to be written to the distributed storage system, the partition structure of the distributed storage system is the updated partition layout shown in FIG. 8, N is 4, and K is 2. As shown in fig. 13, data slices 0 to 11 are first divided into three data blocks of 4 data slices each: the first data block includes data slice 0 to data slice 3, the second data block includes data slice 4 to data slice 7, and the third data block includes data slice 8 to data slice 11. Then, for the first data block, data slice 0 to data slice 3 are XORed to obtain two check slices, namely check slice P0 and check slice Q0; data slice 0 to data slice 3, check slice P0, and check slice Q0 form the first stripe. For the second data block, data slice 4 to data slice 7 are XORed to obtain check slice P1 and check slice Q1; data slice 4 to data slice 7, check slice P1, and check slice Q1 form the second stripe. For the third data block, data slice 8 to data slice 11 are XORed to obtain check slice P2 and check slice Q2; data slice 8 to data slice 11, check slice P2, and check slice Q2 form the third stripe. According to the loads of the partitions, the three stripes are written, in ascending order of load, into three partitions, e.g., the partitions shown in FIG. 8.
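A minimal sketch of the striping step above: twelve data slices are grouped four at a time and an XOR check slice P is appended to each group. Note that the second check slice Q of a real 4+2 code requires Galois-field (Reed-Solomon) arithmetic and is omitted here; the slice contents are invented for illustration:

```python
from functools import reduce

def make_stripes(slices, n):
    """Group data slices into stripes of n and append the XOR parity P.
    Any single lost data slice of a stripe equals the XOR of the
    remaining data slices and P."""
    stripes = []
    for i in range(0, len(slices), n):
        data = slices[i:i + n]
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data)
        stripes.append(data + [parity])
    return stripes

slices = [bytes([v] * 4) for v in range(12)]  # data slices 0..11
stripes = make_stripes(slices, 4)
print(len(stripes), len(stripes[0]))  # 3 stripes, 4 data + 1 parity each
```

Each of the three resulting stripes would then be written to one partition, with every slice landing on a different storage medium of that partition.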
In addition, after the cluster management node sends the first updated partition information to the application node, if the failure of the failed storage node is resolved or the failure of the failed storage medium is resolved, the embodiment of the present application may further include the following detailed steps:
S508, the cluster management node acquires a recovery message.
The recovery message is used to indicate the failure resolution of the failed storage node or the failure resolution of the failed storage medium.
S509, the cluster management node re-divides the storage media in a normal state according to the recovery message, the redundancy mode of the EC, and the loads of the storage nodes in a normal state, to obtain second updated partition information.
S510, the cluster management node sends the second updated partition information to the application node.
S511, the application node receives the second updated partition information sent by the cluster management node.
The second updated partition information is obtained by the cluster management node re-dividing the storage media in a normal state according to the recovery message, the redundancy mode of the EC, and the loads of the storage nodes in a normal state, where the recovery message is used for indicating that the failure of the failed storage node or the failed storage medium has been resolved. If the application node needs to write data again, the data can be written into the storage nodes according to the second updated partition information.
It should be noted that the capacity of the distributed storage system and the capacity of each storage medium in the distributed storage system may be set by a user according to a requirement, which is not limited in the embodiment of the present application. For a partition, if the capacity of the partition is sufficient, a number of different stripes may be written.
In addition, when dividing partitions, it should first be ensured that the storage media in a partition belong to different storage nodes. If this cannot be ensured, the required number of storage media may be selected from the storage nodes of the distributed storage system in ascending order of storage-medium load to form the partition.
The above solution provided by the embodiment of the present application has been introduced mainly from the perspective of interaction between network elements. It can be understood that each network element, for example the cluster management node or the application node, includes corresponding hardware structures and/or software modules for performing each of the above functions. Those skilled in the art will readily appreciate that the various illustrative algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments of the present application, the cluster management node and the application node may be divided into functional modules according to the above method examples; for example, one functional module may be provided for each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is merely a logical function division; other division manners are possible in actual implementation.
In the case where each functional module is divided according to a corresponding function, fig. 14 shows a schematic diagram of a possible composition of the cluster management node involved in the above embodiments. As shown in fig. 14, the cluster management node may include a transceiving unit 141 and a processing unit 142.
The transceiving unit 141 is configured to support the cluster management node to execute S501 and S503 in the partition dividing method based on the distributed storage system shown in fig. 5, and S501, S503, S508, and S510 in the data writing method shown in fig. 12.
The processing unit 142 is configured to support the cluster management node to execute S502 in the partition dividing method based on the distributed storage system shown in fig. 5, and S502 and S509 in the data writing method shown in fig. 12.
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The cluster management node provided by the embodiment of the application is used for executing the partition dividing method based on the distributed storage system, so that the same effect as the partition dividing method based on the distributed storage system can be achieved.
In a specific implementation, the cluster management node illustrated in fig. 14 may be implemented by a computer device illustrated in fig. 15.
Fig. 15 is a schematic diagram of a computer device according to an embodiment of the present disclosure, and as shown in fig. 15, the computer device may include at least one processor 151, a memory 152, a communication interface 153, and a communication bus 154.
The following describes the components of the computer device in detail with reference to fig. 15:
The processor 151 may perform various functions of the computer device by running or executing software programs stored in the memory 152 and by calling data stored in the memory 152.
In a specific implementation, as an embodiment, the processor 151 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 15.
The processor in the embodiment of the application is mainly used for acquiring the fault information, and repartitioning the storage medium with the normal state according to the fault information, the load of the storage node with the normal state and the redundancy mode of the EC to obtain the first updated partition information.
In a specific implementation, as an embodiment, the computer device may include multiple processors, such as the processor 151 and the processor 155 shown in fig. 15. Each of these processors may be a single-core processor or a multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
The memory 152 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 152 may be self-contained and coupled to the processor 151 via the communication bus 154. The memory 152 may also be integrated with the processor 151.
The memory 152 is used for storing software programs for implementing the present application, and is controlled by the processor 151 to execute the software programs.
The communication interface 153 is used for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN). The communication interface 153 may include a receiving unit implementing a receiving function and a sending unit implementing a sending function.
The communication interface according to the embodiment of the present application is mainly used for sending the first updated partition information to the application node.
The communication bus 154 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 15, but this is not intended to represent only one bus or type of bus.
The device structure shown in fig. 15 does not constitute a limitation on the computer device, which may include more or fewer components than shown, may combine some components, or may arrange the components differently.
In case of using integrated units, fig. 16 shows another possible composition diagram of the cluster management node involved in the above embodiments. As shown in fig. 16, the cluster management node includes: a processing module 161 and a communication module 162.
Processing module 161 is configured to control and manage actions of the cluster management node, e.g., processing module 161 is configured to support the cluster management node to perform S502 in fig. 5, S502 and S509 in fig. 12, and/or other processes for the techniques described herein. The communication module 162 is used to support communication between the cluster management node and other network entities, such as the application node and the storage node shown in fig. 1. The cluster management node may also include a storage module 163 for storing program codes and data for the cluster management node.
The processing module 161 may be a processor or a controller. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 162 may be a transceiver, a transceiver circuit, a communication interface, or the like. The storage module 163 may be a memory.
When the processing module 161 is a processor, the communication module 162 is a communication interface, and the storage module 163 is a memory, the cluster management node according to the embodiment of the present application may be a computer device shown in fig. 15.
In the case where each functional module is divided according to a corresponding function, fig. 17 shows a schematic diagram of a possible composition of the application node involved in the above embodiments. As shown in fig. 17, the application node may include a processing unit 171 and a transceiver unit 172.
The processing unit 171 is configured to support the application node in executing S505, S506, and S507 in the data writing method shown in fig. 12.
The transceiving unit 172 is configured to support the application node to execute S504 in the partition dividing method based on the distributed storage system shown in fig. 5, and S504, S505, S506, S507, and S511 in the data writing method shown in fig. 12.
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The application node provided by the embodiment of the present application is configured to execute the partition dividing method based on the distributed storage system, so that it can achieve the same effect as the partition dividing method based on the distributed storage system.
In case of integrated units, fig. 18 shows another possible composition diagram of the application node involved in the above embodiment. As shown in fig. 18, the application node includes: a processing module 181 and a communication module 182.
Processing module 181 is used to control and manage the actions of the application node, e.g., processing module 181 is used to support the application node in performing S505, S506, and S507 in fig. 12, and/or other processes for the techniques described herein. The communication module 182 is used to support communication between the application node and other network entities, for example, the cluster management node shown in fig. 1. The application node may also comprise a storage module 183 for storing program codes and data of the application node.
The processing module 181 may be a processor or a controller. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 182 may be a transceiver, a transceiver circuit, a communication interface, or the like. The storage module 183 may be a memory.
When the processing module 181 is a processor, the communication module 182 is a communication interface, and the storage module 183 is a memory, the application node according to the embodiment of the present application may be a computer device shown in fig. 15.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (26)
1. A partition dividing method based on a distributed storage system, wherein the distributed storage system comprises a cluster management node, an application node, and S storage nodes, each storage node comprises X storage media, the S × X storage media included in the S storage nodes are divided into P partitions according to a redundancy mode of an erasure code (EC), each of the P partitions comprises Y storage media, and the Y storage media consist of one storage medium in each of Y storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K,
the method comprises the following steps:
the cluster management node acquires fault information, wherein the fault information is used for indicating a storage node with a fault or a storage medium with a fault;
the cluster management node re-divides the storage medium with the normal state according to the fault information, the load of the storage node with the normal state and the redundancy mode of the EC to obtain first updated partition information;
the cluster management node sends the first updated partition information to the application node.
2. The method according to claim 1, wherein the failure information is a node identifier of each failed storage node and i storage nodes have failed, and the re-dividing, by the cluster management node, the storage media with a normal state according to the failure information, the load of the storage nodes with a normal state, and the redundancy mode of the EC to obtain first updated partition information comprises:
the cluster management node divides the (S - i) × X storage media included in the S - i storage nodes into Q partitions according to the redundancy mode of the EC and the loads of the S - i storage nodes to obtain the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and a medium identifier of each storage medium included in each of the Q partitions.
3. The method of claim 2, wherein each of said Q partitions includes at least two storage media belonging to a same storage node if Y is greater than S.
4. The method according to claim 1, wherein S is an integer of 3 or more and 20 or less.
5. The method according to claim 1, wherein the failure information is a media identifier of each failed storage medium and j storage media have failed, and the re-dividing, by the cluster management node, the storage media with a normal state according to the failure information, the load of the storage nodes with a normal state, and the redundancy mode of the EC to obtain first updated partition information comprises:
the cluster management node divides the (S × X) - j storage media included in the S storage nodes into W partitions according to the redundancy mode of the EC and the loads of the S storage nodes to obtain the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and a medium identifier of each storage medium included in each of the W partitions.
6. The method of any of claims 1-5, wherein after the cluster management node sends the first updated partition information to the application node, the method further comprises:
the cluster management node acquires recovery information, wherein the recovery information is used for indicating the failure removal of the failed storage node or the failure removal of the failed storage medium;
the cluster management node re-divides the storage medium with the normal state according to the recovery information, the redundancy mode of the EC and the load of the storage node with the normal state to obtain second updated partition information;
the cluster management node sends the second updated partition information to the application node.
7. A data writing method, wherein a distributed storage system comprises a cluster management node, an application node, and S storage nodes, each storage node comprises X storage media, the S × X storage media included in the S storage nodes are divided into P partitions according to a redundancy mode of an erasure code (EC), each of the P partitions comprises Y storage media, and the Y storage media consist of one storage medium in each of Y storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K,
the method comprises the following steps:
the application node carries out EC encoding on data to be written to obtain L EC strips, each EC strip comprises N data fragments and K check fragments, L is determined by the data volume of the data to be written, and L is larger than or equal to 1;
the application node stores the L EC stripes into L partitions among Q partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and a medium identifier of each storage medium included in each of the Q partitions, the Q partitions are obtained by the cluster management node dividing the (S - i) × X storage media included in S - i storage nodes according to the redundancy mode of the EC and the loads of the S - i storage nodes, and i represents the number of failed storage nodes; or the application node stores the L EC stripes into L partitions among W partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and a medium identifier of each storage medium included in each of the W partitions, the W partitions are obtained by the cluster management node dividing the (S × X) - j storage media included in the S storage nodes according to the redundancy mode of the EC and the loads of the S storage nodes, and j represents the number of failed storage media.
8. The method of claim 7, wherein each of said Q partitions includes at least two storage media belonging to a same storage node if Y is greater than S.
9. The method according to claim 7, wherein S is an integer of 3 or more and 20 or less.
10. The method of claim 7, wherein before the application node EC-encodes data to be written, resulting in L EC stripes, the method further comprises:
and the application node receives the first updating partition information sent by the cluster management node.
11. The method according to any of claims 7-10, wherein after the application node receives the first updated partition information sent by the cluster management node, the method further comprises:
the application node receives second updated partition information sent by the cluster management node, wherein the second updated partition information is obtained by the cluster management node by re-dividing a storage medium with a normal state according to recovery information, the redundancy mode of the EC and the load of the storage node with a normal state, and the recovery information is used for indicating fault removal of a failed storage node or fault removal of a failed storage medium.
12. A cluster management node, wherein a distributed storage system comprises the cluster management node, an application node, and S storage nodes, each storage node comprises X storage media, the S × X storage media included in the S storage nodes are divided into P partitions according to a redundancy mode of an erasure code (EC), each of the P partitions comprises Y storage media, and the Y storage media consist of one storage medium in each of Y storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K,
the cluster management node comprises:
a transceiving unit, configured to acquire failure information, where the failure information is used to indicate a failed storage node or a failed storage medium;
the processing unit is used for repartitioning the storage medium with the normal state according to the fault information, the load of the storage node with the normal state and the redundancy mode of the EC to obtain first updated partition information;
the transceiver unit is further configured to send the first updated partition information to the application node.
13. The cluster management node according to claim 12, wherein if the failure information is a node identifier of a failed storage node, i storage nodes fail, the processing unit is specifically configured to:
dividing the (S - i) × X storage media included in the S - i storage nodes into Q partitions according to the redundancy mode of the EC and the loads of the S - i storage nodes to obtain the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and a medium identifier of each storage medium included in each of the Q partitions.
14. The cluster management node of claim 13, wherein each of the Q partitions includes at least two storage media belonging to the same storage node if Y is greater than S.
15. The cluster management node of claim 12, wherein S is an integer greater than or equal to 3 and less than or equal to 20.
16. The cluster management node according to claim 12, wherein if the failure information is a media identifier of a failed storage medium, where j storage media have failed, the processing unit is specifically configured to:
dividing the (S × X) - j storage media included in the S storage nodes into W partitions according to the redundancy mode of the EC and the loads of the S storage nodes to obtain the first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and a medium identifier of each storage medium included in each of the W partitions.
17. The cluster management node according to any of claims 12-16,
the transceiver unit is further configured to acquire recovery information, where the recovery information is used to indicate failure resolution of the failed storage node or failure resolution of the failed storage medium;
the processing unit is further configured to re-partition the storage medium with the normal state according to the recovery information, the redundancy mode of the EC, and the load of the storage node with the normal state, so as to obtain second updated partition information;
the transceiver unit is further configured to send the second updated partition information to the application node.
18. An application node, wherein a distributed storage system comprises a cluster management node, the application node, and S storage nodes, each storage node comprises X storage media, the S × X storage media included in the S storage nodes are divided into P partitions according to a redundancy mode of an erasure code (EC), each of the P partitions comprises Y storage media, and the Y storage media consist of one storage medium in each of Y storage nodes, wherein the redundancy mode of the EC specifies the number of data fragments and the number of check fragments, N represents the number of data fragments, K represents the number of check fragments, and Y = N + K,
the application node comprises:
the processing unit is used for carrying out EC encoding on data to be written to obtain L EC strips, each EC strip comprises N data fragments and K verification fragments, L is determined by the data volume of the data to be written, and L is larger than or equal to 1;
the processing unit and the transceiver unit are configured to store the L EC stripes into L partitions among Q partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the Q partitions and a medium identifier of each storage medium included in each of the Q partitions, the Q partitions are obtained by the cluster management node dividing the (S - i) × X storage media included in S - i storage nodes according to the redundancy mode of the EC and the loads of the S - i storage nodes, and i represents the number of failed storage nodes; or configured to store the L EC stripes into L partitions among W partitions according to first updated partition information, wherein the first updated partition information comprises a partition identifier of each of the W partitions and a medium identifier of each storage medium included in each of the W partitions, the W partitions are obtained by the cluster management node dividing the (S × X) - j storage media included in the S storage nodes according to the redundancy mode of the EC and the loads of the S storage nodes, and j represents the number of failed storage media.
19. The application node of claim 18, wherein each of said Q partitions includes at least two storage media belonging to the same storage node if Y is greater than S.
20. The application node of claim 18, wherein S is an integer greater than or equal to 3 and less than or equal to 20.
21. The application node of claim 18,
the transceiver unit is further configured to receive the first updated partition information sent by the cluster management node.
22. The application node according to any of claims 18-21,
the transceiver unit is further configured to receive second updated partition information sent by the cluster management node, where the second updated partition information is obtained by the cluster management node repartitioning the storage medium with a normal state according to recovery information, the redundancy mode of the EC, and a load of the storage node with a normal state, and the recovery information is used to indicate failure relief of the failed storage node or failure relief of the failed storage medium.
23. A computer-readable storage medium comprising instructions that, when executed on a cluster management node, cause the cluster management node to perform the distributed storage system based partitioning method of any one of claims 1-6.
24. A computer-readable storage medium comprising instructions that, when run on an application node, cause the application node to perform a data writing method according to any one of claims 7-11.
25. An apparatus applied to a partition method based on a distributed storage system, wherein the apparatus exists in a product form of a chip, and the apparatus has a structure including a processor and a memory, the memory is coupled to the processor and is configured to store program instructions and data of the apparatus, and the processor is configured to execute the program instructions stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 6.
26. A device for use in a method of writing data, the device being in the form of a chip, the device comprising a processor and a memory, the memory being configured to be coupled to the processor and to store program instructions and data for the device, the processor being configured to execute the program instructions stored in the memory to cause the device to perform the method of any one of claims 7 to 11.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711241562.5A CN107943421B (en) | 2017-11-30 | 2017-11-30 | Partition division method and device based on distributed storage system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107943421A CN107943421A (en) | 2018-04-20 |
| CN107943421B true CN107943421B (en) | 2021-04-20 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102197372A (en) * | 2008-10-24 | 2011-09-21 | Microsoft Corporation | Partition management in a partitioned, scalable, and available structured storage |
| CN103299296A (en) * | 2011-12-22 | 2013-09-11 | Huawei Technologies Co., Ltd. | Partition management method, device and system in a distributed storage system |
| CN104823170A (en) * | 2012-11-26 | 2015-08-05 | Amazon Technologies, Inc. | Distributed cache cluster management |
| CN106662983A (en) * | 2015-12-31 | 2017-05-10 | Huawei Technologies Co., Ltd. | Method, apparatus and system for data reconstruction in a distributed storage system |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050188075A1 (en) * | 2004-01-22 | 2005-08-25 | International Business Machines Corporation | System and method for supporting transaction and parallel services in a clustered system based on a service level agreement |
| CN103699494B (en) * | 2013-12-06 | 2017-03-15 | Beijing Qihoo Technology Co., Ltd. | Data storage method, data storage device and distributed storage system |
| US9542239B2 (en) * | 2014-04-30 | 2017-01-10 | International Business Machines Corporation | Resolving write request conflicts in a dispersed storage network |
- 2017-11-30: Application filed in China as CN201711241562.5A; granted as CN107943421B (status: Active)
Also Published As
| Publication number | Publication date |
|---|---|
| CN107943421A (en) | 2018-04-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107943421B (en) | Partition division method and device based on distributed storage system | |
| CN110427284B (en) | Data processing method, distributed system, computer system, and medium | |
| CN107807794B (en) | Data storage method and device | |
| US8862847B2 (en) | Distributed storage method, apparatus, and system for reducing a data loss that may result from a single-point failure | |
| US10356150B1 (en) | Automated repartitioning of streaming data | |
| CN112015583A (en) | A method, device and system for data storage | |
| CN105630418A (en) | Data storage method and device | |
| CN106293492B (en) | Storage management method and distributed file system | |
| CN109582213B (en) | Data reconstruction method and device and data storage system | |
| US20190347165A1 (en) | Apparatus and method for recovering distributed file system | |
| US9354826B2 (en) | Capacity expansion method and device | |
| CN110825698A (en) | Metadata management method and related device | |
| CN111666047B (en) | Method and related device for generating storage volume in distributed system | |
| CN111104057B (en) | Node capacity expansion method in storage system and storage system | |
| CN116149846A (en) | Application performance optimization method and device, electronic equipment and storage medium | |
| CN115756955A (en) | Data backup and data recovery method and device and computer equipment | |
| CN112269661A (en) | Partition migration method and device based on Kafka cluster | |
| CN112631994A (en) | Data migration method and system | |
| CN113535087A (en) | Data processing method, server and storage system in data migration process | |
| CN110865901B (en) | Method and device for constructing an EC (erasure coding) stripe | |
| CN112988696A (en) | File sorting method and device and related equipment | |
| CN112860694B (en) | Service data processing method, device and equipment | |
| CN111475277A (en) | Resource allocation method, system, equipment and machine readable storage medium | |
| CN115470041A (en) | Data disaster recovery management method and device | |
| CN109151016B (en) | Flow forwarding method and device, service system, computing device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |