CN120523411A

CN120523411A - Data writing method, device, equipment and medium based on data deduplication

Info

Publication number: CN120523411A
Application number: CN202510685255.4A
Authority: CN
Inventors: 穆向东; 李雪生
Original assignee: Inspur Jinan data Technology Co ltd
Current assignee: Inspur Jinan data Technology Co ltd
Priority date: 2025-05-26
Filing date: 2025-05-26
Publication date: 2025-08-22

Abstract

The invention discloses a data writing method, device, equipment and medium based on data deduplication, relating to the technical field of computers. The method comprises: obtaining a target data writing request whose data volume to be written is larger than a preset data volume threshold or which meets a preset trigger time condition; determining, by using target fingerprint information of each logical block in the target data writing request and a fingerprint table, a first logical block conforming to the deduplication condition and a second logical block not conforming to the deduplication condition, wherein the fingerprint table records a block fingerprint forward relation representing the mapping from fingerprint information to the storage position of a physical block in an internal object; searching the internal object for a first physical block corresponding to the first logical block, and newly establishing a physical block in the internal object as a second physical block; establishing mapping relations between the first logical block and the first physical block and between the second logical block and the second physical block; and writing the data to be written of the second logical block into the second physical block. The data writing efficiency is improved, and data deduplication is completed.

Description

Data writing method, device, equipment and medium based on data deduplication

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data writing method, apparatus, device, and medium based on data deduplication.

Background

Data deduplication is a storage-system technique that identifies and eliminates duplicate data blocks or files, retaining only a unique copy and referencing redundant data with pointers. Its core aim is to optimize storage resource utilization and reduce cost by reducing redundant data storage. For example, when 100 employees store the same file, deduplication stores only one original copy and represents the rest through pointer mappings, achieving a storage compression ratio of 100:1. In recent years, explosive data growth has caused a shortage of storage resources; deduplication can save 30% to 95% of space and reduce hardware purchasing and operation and maintenance costs.

The chunking mechanism of data deduplication is a core link in achieving efficient deduplication; it aims to cut data into the smallest units in which repetition can be identified. In the traditional scheme, the correspondence between service objects and internal objects is traversed to determine whether the data to be written already exists in a physical block of the internal object. The large number of correspondences occupies a large amount of storage space, and searching them for data is slow and inefficient.

It can be seen that how to improve the data writing efficiency and complete the data deduplication is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the invention aims to provide a data writing method, device, equipment and medium based on data deduplication, which improve the data writing efficiency and complete the data deduplication. The specific scheme is as follows:

in a first aspect, the present invention discloses a data writing method based on data deduplication, which includes:

Acquiring a target data writing request of which the data quantity to be written is larger than a preset data quantity threshold value or meets a preset trigger time condition;

determining target fingerprint information of each logic block in the target data writing request;

Determining, by using the target fingerprint information and a fingerprint table, a first logical block meeting the deduplication condition and a second logical block not meeting the deduplication condition from the logical blocks, wherein a block fingerprint forward relation is recorded in the fingerprint table, the block fingerprint forward relation being a mapping relation from fingerprint information to the storage position of a physical block in an internal object;

Searching for a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and newly establishing a physical block in the internal object as a second physical block;

And respectively establishing a mapping relation between the first logical block and the first physical block and a mapping relation between the second logical block and the second physical block, and writing the data to be written of the second logical block into the second physical block.
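The steps of the first aspect can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the class `InternalObject`, the constant `BLOCK_SIZE`, fixed-length blocks, SHA-256 as the hash, and the in-memory dictionaries are all hypothetical choices that the text does not specify.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical fixed block size, kept tiny for illustration


class InternalObject:
    """Hypothetical in-memory stand-in for the internal object (data pool)."""

    def __init__(self):
        self.physical_blocks = []    # list index = storage position
        self.fingerprint_table = {}  # block fingerprint forward relation

    def write(self, service_object, data):
        """Deduplicating write; returns {(service object, offset): position}."""
        mapping = {}
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()  # target fingerprint
            position = self.fingerprint_table.get(fp)
            if position is None:
                # Second logical block: create a physical block and write data.
                position = len(self.physical_blocks)
                self.physical_blocks.append(block)
                self.fingerprint_table[fp] = position
            # Both kinds of logical block get a logical-to-physical mapping;
            # a first logical block reuses the existing physical block.
            mapping[(service_object, offset)] = position
        return mapping
```

Writing `b"AAAABBBBAAAA"` through this sketch stores two physical blocks, and the first and third logical blocks map to the same storage position.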

Optionally, obtaining the target data writing request with the data volume to be written greater than the preset data volume threshold value includes:

if the data volume to be written of a single data writing request is larger than a preset data volume threshold value, determining each data writing request as a target data writing request;

And if the data volume to be written of the single data writing request is not greater than the preset data volume threshold, aggregating the data writing requests to obtain a target data writing request with the data volume to be written greater than the preset data volume threshold.

Optionally, the determining the target fingerprint information of each logical block in the target data writing request includes:

dividing the target data block of the target data writing request to obtain each logic block, and respectively carrying out hash operation on each logic block to determine the obtained hash value as target fingerprint information of each logic block.

Optionally, the determining, by using the target fingerprint information and the fingerprint table, a first logical block meeting the deduplication condition and a second logical block not meeting the deduplication condition from the logical blocks includes:

judging whether the current target fingerprint information of the logic block is stored in the block fingerprint forward relation of the fingerprint table;

If the target fingerprint information of the current logic block is stored, determining the current logic block as a first logic block conforming to the deduplication condition;

And if the target fingerprint information of the current logic block is not stored, determining the current logic block as a second logic block which does not meet the deduplication condition.

Optionally, the first logical block meeting the deduplication condition is a logical block whose data to be written is already stored in an established physical block of an internal object, and the second logical block not meeting the deduplication condition is a logical block whose data to be written is not stored in any established physical block.

Optionally, the establishing a mapping relationship between the first logical block and the first physical block and a mapping relationship between the second logical block and the second physical block respectively includes:

Establishing a first block data forward relationship mapped from the first logical block to the first physical block and a first block data reverse relationship mapped from the first physical block to the first logical block, a second block data forward relationship mapped from the second logical block to the second physical block, and a second block data reverse relationship mapped from the second physical block to the second logical block, respectively;

correspondingly, after the mapping relation between the first logical block and the first physical block and the mapping relation between the second logical block and the second physical block are respectively established, the method further includes:

respectively storing the forward relation of each block data in a service object corresponding to the first logic block and a service object corresponding to the second logic block, and respectively storing the reverse relation of each block data in a corresponding physical block in an internal object;

And constructing, according to the second block data forward relation, a block fingerprint forward relation from the fingerprint information of the second logical block to the storage position of the corresponding second physical block, constructing, according to the second block data reverse relation, a block fingerprint reverse relation from the storage position of the second physical block to the fingerprint information of the second logical block, and writing the block fingerprint forward relation and the block fingerprint reverse relation into the fingerprint table and the corresponding physical block respectively.

Optionally, establishing a mapping relationship between the first logical block and the first physical block includes:

determining the relation quantity of the reverse relation of the stored block data in the first physical block, and judging whether the relation quantity is larger than a preset relation quantity threshold value or not;

If the relation number is not greater than the preset relation number threshold, establishing a mapping relation between the first logical block and the first physical block;

And if the relation number is larger than the preset relation number threshold, changing the first logical block into a second logical block, and jumping to the step of newly establishing a physical block in the internal object as a second physical block.
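The relation-number check above can be illustrated with a short sketch. It is hedged: `MAX_REFERENCES`, `choose_physical_block`, and the dictionary layout are hypothetical names introduced here, and the choice to let the fingerprint table point at the newest physical block is an assumption drawn from the step that writes the fingerprint forward relation of a second logical block into the fingerprint table.

```python
MAX_REFERENCES = 2  # hypothetical preset relation-number threshold


def choose_physical_block(fp, logical_id, fingerprint_table, reverse_relations):
    """Pick the storage position that logical_id should map to.

    fingerprint_table: fingerprint -> storage position
    reverse_relations: storage position -> list of logical block ids
    Returns (position, created), where created is True for a new block.
    """
    position = fingerprint_table.get(fp)
    if position is not None and len(reverse_relations[position]) <= MAX_REFERENCES:
        # Relation number not greater than the threshold: deduplicate.
        reverse_relations[position].append(logical_id)
        return position, False
    # No match, or too many references: treat as a second logical block
    # and establish a new physical block.
    new_position = max(reverse_relations, default=-1) + 1
    reverse_relations[new_position] = [logical_id]
    fingerprint_table[fp] = new_position  # assumption: table tracks newest block
    return new_position, True
```

Capping the number of reverse relations per physical block bounds the metadata held by any single block, at the cost of storing an occasional extra copy of very popular data.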

In a second aspect, the present invention discloses a data writing device based on data deduplication, comprising:

the request acquisition module is used for acquiring a target data writing request of which the data volume to be written is larger than a preset data volume threshold value or meets a preset trigger time condition;

the fingerprint acquisition module is used for determining target fingerprint information of each logic block in the target data writing request;

a module for determining, by using the target fingerprint information and a fingerprint table, a first logical block meeting the deduplication condition and a second logical block not meeting the deduplication condition from the logical blocks;

The physical block determining module is used for searching a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and newly establishing the physical block in the internal object as a second physical block;

And the data writing module is used for respectively establishing the mapping relation between the first logical block and the first physical block and the mapping relation between the second logical block and the second physical block, and writing the data to be written of the second logical block into the second physical block.

In a third aspect, the present invention discloses an electronic device, comprising:

a memory for storing a computer program;

And a processor for executing a computer program to implement the steps of the disclosed data writing method based on data deduplication.

In a fourth aspect, the present invention discloses a computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the steps of the data writing method based on data deduplication disclosed above.

The method comprises: obtaining a target data writing request whose data volume to be written is larger than a preset data volume threshold or which meets a preset trigger time condition; determining target fingerprint information of each logical block in the target data writing request; determining, by using the target fingerprint information and a fingerprint table, a first logical block conforming to the deduplication condition and a second logical block not conforming to the deduplication condition from the logical blocks, wherein a block fingerprint forward relation is recorded in the fingerprint table and is a mapping relation from fingerprint information to the storage position of a physical block in an internal object; searching for a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and newly establishing a physical block in the internal object as a second physical block; and respectively establishing a mapping relation between the first logical block and the first physical block and a mapping relation between the second logical block and the second physical block, and writing the data to be written of the second logical block into the second physical block.

The beneficial effects of the invention are as follows. First, the target data writing request is a request whose data volume to be written is larger than the preset data volume threshold, or a request meeting the preset trigger time condition. That is, the data writing flow is not triggered directly whenever a data writing request with a small data volume arrives; it is triggered only when the data volume to be written is larger than the preset data volume threshold or the preset trigger time condition is met, which avoids the reduction in service life of the internal object caused by frequent data writing. Second, the target fingerprint information and the fingerprint table are used to determine, from the logical blocks, the first logical block meeting the deduplication condition and the second logical block not meeting it: a logical block whose fingerprint information exists in the block fingerprint forward relation of the fingerprint table is a first logical block, and any other logical block is a second logical block. Because the block fingerprint forward relation maps fingerprint information to the storage position of a physical block in the internal object, the data to be written of a first logical block is already stored in an established physical block and does not need to be written again, which completes data deduplication, whereas the data to be written of a second logical block is written into a newly established second physical block. Third, the physical block is located through the block fingerprint forward relation, and the mapping relations between all logical blocks and physical blocks do not need to be traversed, so the efficiency is higher.

Drawings

For a clearer description of embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a flow chart of a data writing method based on data deduplication according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a mapping relationship between a specific logical partition and a physical partition according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating a mapping relationship between a logical partition and a physical partition according to another embodiment of the present invention;

FIG. 4 is a flowchart of a specific data writing method based on data deduplication according to an embodiment of the present invention;

FIG. 5 is a flow chart of a specific data writing process according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a data writing device based on data deduplication according to an embodiment of the present invention;

FIG. 7 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.

The terms "comprising" and "having" in the description of the invention and in the above-described figures, as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.

In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description.

Next, a data writing scheme based on data deduplication provided by the embodiment of the present invention will be described in detail. FIG. 1 shows a data writing method based on data deduplication according to an embodiment of the present invention, including:

step S11, obtaining a target data writing request of which the data quantity to be written is larger than a preset data quantity threshold value or meets a preset trigger time condition.

A target data writing request is acquired, where the target data writing request is a write request whose data volume to be written is larger than a preset data volume threshold, or a write request meeting a preset trigger time condition. That is, not every write request is responded to immediately; a write request is responded to only when either of these two conditions is satisfied.

In a first specific embodiment, obtaining target data writing requests with data to be written larger than a preset data amount threshold value comprises determining each data writing request as the target data writing request if the data to be written of a single data writing request is larger than the preset data amount threshold value, and aggregating each data writing request to obtain the target data writing request with the data to be written larger than the preset data amount threshold value if the data to be written of the single data writing request is not larger than the preset data amount threshold value.

If the data volume to be written of a single data writing request is larger than the preset data volume threshold, the request is a large IO (Input/Output) request; otherwise it is a small IO request. A large IO request is itself determined as a target data writing request and is written directly, without writing the WAL (Write-Ahead Logging, i.e. a pre-write log). Direct writing is used for large IO mainly to reduce the wear rate of the solid state disk (Solid State Disk or Solid State Drive, i.e. SSD): in a large IO scenario, the write IO does not pass through the cache pool but is written directly into the data pool. Small IO requests are aggregated to obtain a target data writing request whose data volume to be written is larger than the preset data volume threshold. Small IO uses the WAL mechanism: the WAL is written first, the data of multiple IO writes is cached and aggregated in memory, and when the aggregated volume is larger than the preset data volume threshold the data is flushed down to the data pool.

In a second specific embodiment, a target data writing request meeting the preset trigger time condition is obtained: when the current time is a time for responding to write requests, the write requests can be responded to even if their data volume to be written is not larger than the preset data volume threshold. The data volume threshold and the trigger time are thus two independent response conditions. For example, if write requests are responded to every 1 minute, the write requests aggregated within that minute can be used as a target data writing request even if their total data volume is not larger than the preset data volume threshold.
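The two trigger conditions of step S11 can be sketched as a small aggregator. This is an illustrative sketch only: `WriteAggregator`, its parameters, and the injectable `clock` are hypothetical names, the WAL write is indicated by a comment rather than implemented, and the ordering between a direct large write and earlier pending small writes is not handled.

```python
import time


class WriteAggregator:
    """Hypothetical aggregator producing target data writing requests."""

    def __init__(self, size_threshold, flush_interval, clock=time.monotonic):
        self.size_threshold = size_threshold  # preset data volume threshold
        self.flush_interval = flush_interval  # preset trigger time, in seconds
        self.clock = clock
        self.pending = []
        self.pending_bytes = 0
        self.last_flush = clock()

    def submit(self, data):
        """Return the target requests that are ready after this write."""
        if len(data) > self.size_threshold:
            return [data]  # large IO: written directly, no aggregation
        # Small IO: a real system would append to the WAL here first.
        self.pending.append(data)
        self.pending_bytes += len(data)
        if self.pending_bytes > self.size_threshold:
            return [self._flush()]
        return []

    def tick(self):
        """Time trigger: flush pending data once the interval has elapsed."""
        if self.pending and self.clock() - self.last_flush >= self.flush_interval:
            return [self._flush()]
        return []

    def _flush(self):
        target = b"".join(self.pending)
        self.pending, self.pending_bytes = [], 0
        self.last_flush = self.clock()
        return target
```

Passing the clock in as a parameter is a design choice made for this sketch so that the time trigger can be exercised without waiting for real time to pass.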

And step S12, determining target fingerprint information of each logic block in the target data writing request.

Distributed storage is a storage system that aggregates the storage space of multiple storage devices into a whole that provides a unified access interface and management interface for application servers. RADOS (Reliable Autonomic Distributed Object Store) is a core component of the storage system. It provides highly reliable, self-healing distributed object storage, is designed to achieve automatic storage and disaster recovery of massive data through a decentralized architecture and intelligent data management, and can be understood as a storage pool. An OSD (Object Storage Device) is the actual storage unit in the storage cluster and is responsible for storing and retrieving data objects; each OSD typically corresponds to a physical storage device (e.g., a hard disk drive or solid state drive) and is managed by an OSD process. A PG (Placement Group) is the basic unit for organizing and managing data: a logical container holding a set of objects that have similar access patterns or attributes and are stored and replicated together. The storage system uses PGs to decide on which OSDs data is stored, so as to optimize data distribution and access efficiency. PGs allow the storage system to manage and balance storage load more efficiently because the PG is the basic unit on which the CRUSH algorithm (Controlled Replication Under Scalable Hashing) operates; CRUSH decides the storage location of data based on the PG ID, so that data is evenly distributed among OSDs and remains well balanced even when the cluster size changes.

Data deduplication is a storage-system technique that identifies and eliminates duplicate data blocks or files, retaining only a unique copy and referencing redundant data with pointers; its core goal is to optimize storage resource utilization and reduce cost by reducing redundant data storage. The chunking mechanism of data deduplication is a key link in efficient deduplication and aims to cut data into the smallest units in which repetition can be identified. Depending on the chunking method, chunking is divided into fixed-length chunking, variable-length chunking, and hybrid chunking, which combines the advantages of the two.

It may be understood that the target data writing request may be a request that the aggregate amount reaches a preset data amount threshold, or may be a writing request obtained when the current time is the time of responding to the writing request after a period of aggregation. The target data write request may be a write request from a different client, and then the target data block of the write request is from a different business object.

In this embodiment, determining the target fingerprint information of each logical block in the target data write request includes dividing the target data block of the target data write request to obtain each logical block, and performing hash operation on each logical block to determine the obtained hash value as the target fingerprint information of each logical block.

The target data block of the target data writing request may be divided according to the different service objects to obtain the logical blocks corresponding to each service object. A hash operation is then performed on each logical block, and the resulting hash value is determined as the target fingerprint information of that logical block. It can be understood that if the data in logical block 1 and the data in logical block 2 are the same, their fingerprint information is the same; conversely, if the data differ, their fingerprint information differs. That is, the target fingerprint information is used to indicate whether the data in different logical blocks is the same.
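The division and hashing of step S12 can be sketched as follows. The text does not name a chunking method or hash function, so fixed-length chunking and SHA-256 are assumptions made for this illustration, and the function names `split_into_logical_blocks` and `fingerprint` are hypothetical.

```python
import hashlib


def split_into_logical_blocks(data, block_size):
    """Cut a target data block into fixed-length logical blocks (assumed)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def fingerprint(block):
    """Hash a logical block; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(block).hexdigest()


blocks = split_into_logical_blocks(b"AAAABBBBAAAA", 4)
fingerprints = [fingerprint(b) for b in blocks]
# Logical blocks 0 and 2 hold the same data, so their fingerprints match.
same = fingerprints[0] == fingerprints[2]
```

With a cryptographic hash, two different blocks producing the same fingerprint is possible in principle but extremely unlikely, which is why the fingerprint can stand in for a full data comparison.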

And S13, determining a first logical block conforming to the deduplication condition and a second logical block not conforming to the deduplication condition from the logical blocks by utilizing the target fingerprint information and a fingerprint table, wherein a block fingerprint forward relation is recorded in the fingerprint table, and is a mapping relation for representing a storage position mapped from the fingerprint information to a physical block in an internal object.

The internal object is a data pool that contains the historical physical blocks, and the physical blocks store the data of the logical blocks. Omap (Object Map) is a key mechanism for managing object metadata; it is essentially a key-value storage structure associated with an object. Because it is independent of the way the object data itself is stored, it provides flexible and efficient metadata management and is particularly suitable for large-scale distributed scenarios. The fingerprint table records the block fingerprint forward relation, which is a mapping relation from fingerprint information to the storage position of a physical block in the internal object. It will be appreciated that the fingerprint table records the fingerprint information of the historical logical blocks together with the storage position of the physical block mapped to each piece of fingerprint information. Therefore, if the fingerprint information of a logical block is recorded in the fingerprint table, the corresponding physical block can be found in the internal object, i.e. the data of that logical block is already stored in the physical block.

In this embodiment, determining the first logical block meeting the deduplication condition and the second logical block not meeting the deduplication condition from the logical blocks by using the target fingerprint information and the fingerprint table includes: determining whether the target fingerprint information of the current logical block is stored in the block fingerprint forward relation of the fingerprint table; if it is stored, determining the current logical block as a first logical block meeting the deduplication condition; and if it is not stored, determining the current logical block as a second logical block not meeting the deduplication condition.

If the target fingerprint information of the current logical block is stored in the block fingerprint forward relation of the fingerprint table, the current logical block is judged to be a first logical block conforming to the deduplication condition; otherwise it is judged to be a second logical block not conforming to the deduplication condition. For example, if the target fingerprint information of the logical blocks is FP1, FP2 and FP3 respectively, and only FP1 and FP2 are stored in the block fingerprint forward relation of the fingerprint table, then the logical blocks corresponding to FP1 and FP2 are first logical blocks, and the logical block corresponding to FP3 is a second logical block.
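The FP1, FP2, FP3 example can be reproduced with a few lines. This is a sketch under the assumption that the block fingerprint forward relation is held as a dictionary from fingerprint to storage position; the function name `classify` and the block identifiers are hypothetical.

```python
def classify(logical_blocks, fingerprint_table):
    """Split (block id, fingerprint) pairs into first and second logical blocks.

    A block is a first logical block when its fingerprint is a key of the
    block fingerprint forward relation; otherwise it is a second one.
    """
    first, second = [], []
    for block_id, fp in logical_blocks:
        if fp in fingerprint_table:
            first.append(block_id)
        else:
            second.append(block_id)
    return first, second


# Only FP1 and FP2 are recorded, at hypothetical storage positions 0 and 1.
table = {"FP1": 0, "FP2": 1}
first, second = classify([("A", "FP1"), ("B", "FP2"), ("C", "FP3")], table)
```

Each membership check is a single keyed lookup, so the cost per logical block does not grow with the number of existing logical-to-physical mappings.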

In this embodiment, the first logical block meeting the deduplication condition is a logical block whose data to be written is already stored in an established physical block of the internal object, and the second logical block not meeting the deduplication condition is a logical block whose data to be written is not stored in any established physical block.

Because the target fingerprint information of the first logical block is stored in the block fingerprint forward relation of the fingerprint table, and the block fingerprint forward relation maps fingerprint information to the storage position of a physical block in the internal object, an established physical block corresponding to the first logical block exists in the internal object. In other words, the data to be written of the first logical block is already stored in an established physical block of the internal object, so it does not need to be written again; rewriting the same data is avoided, and data deduplication is thereby completed. Similarly, because the target fingerprint information of the second logical block is not stored in the block fingerprint forward relation of the fingerprint table, no established physical block corresponding to the second logical block exists in the internal object, that is, the data to be written of the second logical block is not yet stored there. The second logical block therefore does not meet the deduplication condition, and its data to be written needs to be written into the internal object.

And step S14, searching a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and newly establishing the physical block in the internal object as a second physical block.

Because the target fingerprint information of the first logical block is stored in the block fingerprint forward relation of the fingerprint table, and the block fingerprint forward relation maps fingerprint information to the storage position of a physical block in the internal object, the corresponding first physical block can be found in the internal object based on the target fingerprint information of the first logical block. Because the target fingerprint information of the second logical block is not stored in the block fingerprint forward relation of the fingerprint table, no established physical block corresponding to the second logical block exists in the internal object. Therefore, in order to subsequently write the data to be written of the second logical block into the internal object, a physical block needs to be newly established in the internal object as the second physical block.

And S15, respectively establishing a mapping relation between the first logical block and the first physical block and a mapping relation between the second logical block and the second physical block, and writing the data to be written of the second logical block into the second physical block.

A mapping relationship between the first logical block and the first physical block and a mapping relationship between the second logical block and the second physical block are established respectively. Although the data to be written of the first logical block is already stored in the first physical block, and the first physical block can be located through the fingerprint information, no direct mapping relationship between the first logical block and the first physical block exists yet. If the physical block had to be looked up through the fingerprint information on every access, subsequent reads would be cumbersome; establishing the mapping relationship between the first logical block and the first physical block therefore simplifies subsequent reads. The mapping relationship between the second logical block and the second physical block is established for the same reason, so that reads can be completed according to the mapping relationship. Because the second logical block does not meet the deduplication condition, its data to be written must be written into the second physical block.
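Steps S13 to S15 can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class `DedupStore`, its field names, the use of SHA-256 as the fingerprint function, and the use of integer physical block ids are all assumptions made for illustration, and the sketch classifies and writes each block in a single pass rather than in separate steps.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DedupStore:
    # All three containers are hypothetical stand-ins for the structures in the text.
    fingerprint_table: dict = field(default_factory=dict)  # fingerprint -> physical block id
    physical_blocks: dict = field(default_factory=dict)    # physical block id -> stored bytes
    lp: dict = field(default_factory=dict)                 # logical block id -> physical block id

    def write(self, logical_blocks):
        """Handle one write request; return how many blocks were physically written."""
        written = 0
        for logical_id, data in logical_blocks.items():
            fingerprint = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
            physical_id = self.fingerprint_table.get(fingerprint)
            if physical_id is None:
                # "Second" logical block: no fingerprint match, so create a
                # new physical block, write the data, and record the fingerprint.
                physical_id = len(self.physical_blocks)
                self.physical_blocks[physical_id] = data
                self.fingerprint_table[fingerprint] = physical_id
                written += 1
            # Both kinds of logical block receive a logical-to-physical mapping.
            self.lp[logical_id] = physical_id
        return written
```

With three logical blocks of which two carry identical data, only two physical blocks are written, and the two identical logical blocks map to the same physical block.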

In this embodiment, the establishing the mapping relationship between the first logical block and the first physical block and the mapping relationship between the second logical block and the second physical block includes establishing a first block data forward relationship from the first logical block to the first physical block and a first block data reverse relationship from the first physical block to the first logical block, a second block data forward relationship from the second logical block to the second physical block, and a second block data reverse relationship from the second physical block to the second logical block, respectively.

The mapping relationship between logical blocks and physical blocks comprises two types: the block data forward relationship LP and the block data reverse relationship PL. In either case, the mapping between logical blocks and physical blocks is not necessarily one-to-one; as shown in the mapping relationship diagram between logical blocks and physical blocks in fig. 2, it may be many-to-one. One logical block corresponds to only one physical block, whereas one physical block may be referenced by one or more logical blocks. When a physical block is referenced by several logical blocks, its block data is duplicate data, and those logical blocks may belong to the same logical object or to different logical objects. For example, logical block 1 and logical block 2 may both correspond to physical block 1, but logical block 1 cannot correspond to both physical block 1 and physical block 2. Further, the block data forward relationship represents a mapping from a logical block to a physical block, while the block data reverse relationship represents a mapping from a physical block to a logical block. For example, the forward relationship maps logical block 1 to physical block 1 and the reverse relationship maps physical block 1 to logical block 1; a forward relationship and a reverse relationship that have opposite directions but the same correspondence can be used to verify each other.
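The many-to-one constraint and the mutual verification of LP and PL can be sketched as follows. The names `lp`, `pl`, `map_block` and `verify` are hypothetical, and plain in-memory dictionaries stand in for the service object metadata and the internal object.

```python
from collections import defaultdict

lp = {}                # LP: logical block -> exactly one physical block
pl = defaultdict(set)  # PL: physical block -> set of referencing logical blocks


def map_block(logical, physical):
    """Point `logical` at `physical`, keeping LP and PL consistent."""
    previous = lp.get(logical)
    if previous is not None:
        # A logical block maps to only one physical block, so drop the old reference.
        pl[previous].discard(logical)
    pl[physical].add(logical)
    lp[logical] = physical


def verify(logical, physical):
    """True only if the forward and reverse relationships agree."""
    return lp.get(logical) == physical and logical in pl.get(physical, set())
```

Mapping `"L1"` and `"L2"` to `"P1"` yields two reverse entries on `"P1"`; remapping `"L1"` to `"P2"` removes its entry from `"P1"`, so `verify("L1", "P1")` no longer holds.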

In this embodiment, after the mapping relationship between the first logical block and the first physical block and the mapping relationship between the second logical block and the second physical block are respectively established, the method further includes respectively storing forward relationships of the block data in a service object corresponding to the first logical block and a service object corresponding to the second logical block, respectively storing reverse relationships of the block data in corresponding physical blocks in an internal object, constructing a block fingerprint forward relationship from fingerprint information of the second logical block to a storage location of the corresponding second physical block according to the forward relationships of the second block data, constructing a block fingerprint reverse relationship from the storage location of the second physical block to fingerprint information of the second logical block according to the reverse relationships of the second block data, and writing the block fingerprint forward relationship and the block fingerprint reverse relationship into the fingerprint table and the corresponding physical block respectively.

The forward relation of each block data is respectively stored in a service object corresponding to a first logic block and a service object corresponding to a second logic block, namely, the forward relation of the first block data is stored in the service object corresponding to the first logic block, the forward relation of the second block data is stored in the service object corresponding to the second logic block, and the reverse relation of each block data is respectively stored in a physical block corresponding to an internal object, namely, the reverse relation of the first block data is stored in the physical block corresponding to the first logic block (namely, the first physical block), and the reverse relation of the second block data is stored in the physical block corresponding to the second logic block (namely, the second physical block).

It will be appreciated that the forward relationship of the block fingerprints from the fingerprint information of the first logical block to the storage location of the first physical block has been recorded in the fingerprint table and the reverse relationship of the block fingerprints from the storage location of the first physical block to the fingerprint information of the first logical block has also been stored in the first physical block, whereas the forward relationship of the block fingerprints of the second logical block has not been recorded in the fingerprint table, so that the forward relationship of the block fingerprints from the fingerprint information of the second logical block to the storage location of the corresponding second physical block needs to be constructed from the forward relationship of the second block data, and the reverse relationship of the block fingerprints from the storage location of the second physical block to the fingerprint information of the second logical block needs to be constructed from the reverse relationship of the second block data, and the forward relationship of the block fingerprints and the reverse relationship of the block fingerprints from the storage location of the second logical block to the second physical block need to be written into the fingerprint table and the second physical block respectively.

For example, fig. 3 illustrates the mapping relationships between a specific logical block and a physical block. The LP (block data forward relationship) maps a logical block to a physical block; it is stored in the metadata of the service object, and the actual data can be accessed according to it. The PL (block data reverse relationship) indicates that a physical block is referenced by a specific logical block. The reverse relationship makes it possible to determine whether a physical block may currently be referenced, and whether a reverse relationship is still valid is checked by querying whether the corresponding LP exists. Under deduplication, one physical block may be referenced by multiple logical blocks, that is, one physical block may have multiple block data reverse relationships; accordingly, when checking whether a physical block is being referenced, it must be checked whether each logical block pointed to by those reverse relationships still has a block data forward relationship pointing to the physical block. When service data is updated (including adding, deleting and modifying), the block data reverse relationship is updated as well (including adding, deleting and modifying), so the block data reverse relationship is stored in the summary information of the internal object to facilitate dynamic updating.

The FP (block fingerprint forward relationship) points to the location where the content of the data block represented by the fingerprint is actually stored, that is, which data block of which internal object holds it. It is used for deduplication queries: it provides the physical storage location of the existing duplicate data block to the logical block being deduplicated, so that new block data forward and reverse relationships can be constructed and deduplication achieved. The block fingerprint forward relationships are stored in the fingerprint table, and each corresponds to one KV in the fingerprint table keyed by the block fingerprint information. The PF (block fingerprint reverse relationship) records the fingerprint information of the data block content of the internal object; it is stored in the internal object in the form of data so as to save metadata space.
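The reference check described for PL can be sketched as below: a physical block counts as referenced only if at least one of its reverse entries is confirmed by a matching forward entry. The function names and the dictionary layout are assumptions for illustration.

```python
def is_referenced(physical, pl, lp):
    """True if at least one PL entry of `physical` is confirmed by a matching LP."""
    return any(lp.get(logical) == physical for logical in pl.get(physical, ()))


def stale_reverse_entries(physical, pl, lp):
    """PL entries whose logical block no longer points back at `physical`."""
    return {logical for logical in pl.get(physical, ()) if lp.get(logical) != physical}
```

For instance, with `lp = {"L1": "P1", "L2": "P2"}` and `pl = {"P1": {"L1", "L2"}}`, block `"P1"` is still referenced through `"L1"`, while its entry for `"L2"` is a leftover that can be cleaned up.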

Under the deduplication model, logical objects, fingerprint tables and internal objects usually belong to different PGs, so the multi-object transaction mechanism available within a single PG cannot be used, and synchronized changes to the various forward and reverse relationships must be processed in a defined order. The general principle is that under no circumstances may a forward relationship exist without its corresponding reverse relationship. The reason is that the physical storage location pointed to by a forward relationship may have its space released if no reverse relationship (which is in essence a reference) exists, and it cannot then be guaranteed that the location will not be reallocated and overwritten with new data. If, on the other hand, a redundant reverse relationship is left behind, its validity can be determined by checking whether the corresponding forward relationship exists.
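One ordering that satisfies this principle is to write the reverse relationship before the forward relationship when adding a reference, and to remove the forward relationship before the reverse relationship when dropping one. The sketch below simulates an interruption between the two steps; the function names, the `crash` flag and the `InterruptedUpdate` exception are hypothetical devices for illustration only.

```python
class InterruptedUpdate(Exception):
    """Simulates a failure between the two halves of a non-transactional update."""


def add_reference(lp, pl, logical, physical, crash=False):
    pl.setdefault(physical, set()).add(logical)  # reverse relationship first
    if crash:
        raise InterruptedUpdate
    lp[logical] = physical                       # forward relationship second


def remove_reference(lp, pl, logical, crash=False):
    physical = lp.pop(logical)                   # forward relationship first
    if crash:
        raise InterruptedUpdate
    pl[physical].discard(logical)                # reverse relationship second


def invariant_holds(lp, pl):
    """Every forward relationship must have its reverse relationship."""
    return all(logical in pl.get(physical, set()) for logical, physical in lp.items())
```

An interruption in either function leaves at most a redundant reverse entry, which the forward-relationship check described above can later identify as invalid.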

The fingerprint tables are stored as omap objects, and each PG in the storage pool has one or more fingerprint tables. Each KV in a fingerprint table is a block fingerprint forward relationship: its key is the fingerprint information, and its value is the storage location of the corresponding physical block (i.e. an internal object name and a data range). The total number of KVs in a single fingerprint table is kept within 10,000. When a storage pool is created, the number of fingerprint tables in the pool is determined on this scale from the effective capacity of the pool, and it remains unchanged when the pool is later expanded or shrunk. Fingerprint information is stored in and accessed from the fingerprint table it belongs to, which is determined by hashing the fingerprint value. The fingerprint table is the core of the deduplication implementation, and duplicate data is identified by querying the fingerprint table.
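The sizing and routing rules above can be sketched as follows. The text gives only the 10,000-KV bound and states that the owning table is chosen by hashing the fingerprint value; the block size parameter, the assumption of one KV per physical block, and the modulo routing over the first eight fingerprint bytes are illustrative assumptions.

```python
MAX_KV_PER_TABLE = 10_000  # bound stated in the text


def table_count(effective_capacity_bytes, block_size_bytes):
    """Number of fingerprint tables, fixed at pool creation (assumes one KV per block)."""
    max_blocks = -(-effective_capacity_bytes // block_size_bytes)  # ceiling division
    return max(1, -(-max_blocks // MAX_KV_PER_TABLE))


def table_index(fingerprint: bytes, num_tables: int) -> int:
    """Route a fingerprint to its owning table (modulo routing is an assumption)."""
    return int.from_bytes(fingerprint[:8], "big") % num_tables
```

Because the table count never changes after pool creation, the same fingerprint always routes to the same table, which is what makes a single lookup sufficient for the deduplication query.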

In the embodiment, the mapping relation between the first logical block and the first physical block is established, and the method comprises the steps of determining the relation number of the reverse relation of the stored block data in the first physical block, judging whether the relation number is larger than a preset relation number threshold value, establishing the mapping relation between the first logical block and the first physical block if the relation number is not larger than the preset relation number threshold value, changing the first logical block into a second logical block if the relation number is larger than the preset relation number threshold value, and jumping to the step of newly establishing the physical block in the internal object as a second physical block.

To reduce storm effects during metadata updates when integrating objects, the maximum number of references to a single physical block needs to be limited. The maximum number of references is 64, which is the preset relation number threshold; the deduplication ratio therefore cannot exceed 64:1. The actual deduplication ratio under a general data model is much lower than this, so the deduplication ratio of the storage system can be regarded as being determined mainly by the actual duplication ratio of the input data model. Specifically, after the first logical block and its corresponding first physical block are determined, the mapping relationship between them, that is, the first block data forward relationship and the first block data reverse relationship, would normally be established. To prevent a storm effect, however, the number of block data reverse relationships already stored in the first physical block is determined first, and it is judged whether this number is greater than the preset relation number threshold. If the number is not greater than the threshold, the first physical block can still be referenced, and the mapping relationship between the first logical block and the first physical block is established. If the number is greater than the threshold, the first physical block cannot be referenced any further; the first logical block is then changed into a second logical block, that is, it is treated as a second logical block that does not meet the deduplication condition, and the process jumps to the step of newly establishing a physical block in the internal object as the second physical block. For example, if logical block 1 corresponds to physical block 1 but the number of block data reverse relationships stored in physical block 1 is greater than the threshold, logical block 1 is treated as a second logical block, a new physical block is established in the internal object, the mapping relationship between logical block 1 and the new physical block is established, and the data to be written of logical block 1 is written into the new physical block.
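The threshold check can be sketched as a small decision function. The function name, the `allocate` callback and the `reverse_counts` dictionary are hypothetical. The comparison follows the wording of the text (strictly greater than the threshold); an implementation that must never exceed 64 references might instead compare with greater-than-or-equal.

```python
MAX_REFERENCES = 64  # preset relation number threshold from the text


def choose_physical_block(existing_id, reverse_counts, allocate, threshold=MAX_REFERENCES):
    """Return (physical_id, is_new) for a logical block.

    existing_id    -- physical block found via the fingerprint table, or None
    reverse_counts -- physical block id -> number of stored reverse relationships
    allocate       -- callable that creates and returns a new physical block id
    """
    if existing_id is not None and reverse_counts.get(existing_id, 0) <= threshold:
        return existing_id, False  # block may still be referenced: deduplicate
    return allocate(), True        # treat as a second logical block: write new data
```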

The deduplication mechanism in the large IO scene is the same as that in the small IO scene, namely whether the data to be written of the writing request exist in the internal object or not is judged according to fingerprint information, and if the data to be written exist in the internal object, the physical block is not newly built, but the mapping relation between the logical block and the physical block is updated.

It should be noted that in the small IO scenario the target data writing request is handled in an online deduplication mode. The data of multiple small write IOs is cached and aggregated in memory; when an aggregation flush condition (including the aggregated amount, the flush waiting time and the like) is met, the data is flushed to the data pool and the related metadata is finally updated. Deduplication is performed during caching and aggregation (before flushing), and the deduplication-related metadata and its update processing (including adding, deleting, modifying and querying) are added on top of the existing append write; the caching and flushing processes do not occupy the write IO path. In the large IO scenario, the target data writing request is handled in an offline deduplication mode. If online deduplication were applied to large IO, chunking and fingerprint computation would have to be completed before the data is written to the data pool, in order to construct the fingerprint reverse relationship that must be persisted together with the valid data. This processing would inevitably fall on the large-IO write path and affect large-IO write performance to some extent. In addition, the subsequent deduplication metadata queries and updates can only be carried out at block granularity, which greatly increases the number of metadata operations and brings further performance cost. Online deduplication therefore has a greater impact on large-IO write performance; it is better suited to the small-IO write model based on the WAL mechanism, while the large-IO write model is better suited to the offline deduplication mode described later.
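The aggregation flush condition for small IO (aggregated amount or waiting time) can be sketched as a simple buffer. The class name `WriteAggregator`, its parameters, and the choice to measure waiting time from the first buffered write are assumptions; the caller supplies the current time so that the behaviour is deterministic.

```python
class WriteAggregator:
    """Buffers small writes in memory until a size or time condition is met."""

    def __init__(self, size_threshold, max_wait_seconds):
        self.size_threshold = size_threshold
        self.max_wait_seconds = max_wait_seconds
        self.pending = []
        self.pending_bytes = 0
        self.first_arrival = None

    def add(self, data, now):
        if self.first_arrival is None:
            self.first_arrival = now  # waiting time starts at the first buffered write
        self.pending.append(data)
        self.pending_bytes += len(data)

    def should_flush(self, now):
        if not self.pending:
            return False
        too_big = self.pending_bytes > self.size_threshold
        too_old = now - self.first_arrival >= self.max_wait_seconds
        return too_big or too_old

    def flush(self):
        """Return the aggregated data and reset the buffer."""
        data = b"".join(self.pending)
        self.pending, self.pending_bytes, self.first_arrival = [], 0, None
        return data
```

The aggregated bytes returned by `flush` would then be chunked and fingerprinted before being written to the data pool.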

Referring to fig. 4, the embodiment of the invention discloses a specific data writing method based on data deduplication, and compared with the previous embodiment, the embodiment further describes and optimizes the technical scheme. Comprising the following steps:

Step S21, obtaining a target data writing request of which the data quantity to be written is larger than a preset data quantity threshold value or meets a preset trigger time condition.

And S22, determining target fingerprint information of each logic block in the target data writing request.

And S23, judging whether the target fingerprint information of the current logic block is stored in the block fingerprint forward relation of the fingerprint table, wherein the block fingerprint forward relation is recorded in the fingerprint table, and the block fingerprint forward relation is a mapping relation representing the storage position of the physical block in the internal object mapped from the fingerprint information.

It should be noted that the block fingerprint forward relationships recorded in the fingerprint table store the fingerprint information of the historical logical blocks of write requests that have already been served; they do not necessarily contain the fingerprint information of the logical blocks of a write request that has not yet been served. Any fingerprint information stored in a block fingerprint forward relationship, however, proves that the corresponding data has already been stored in the corresponding physical block. Whether the data to be written of a logical block is already stored can therefore be determined simply by checking whether the fingerprint table contains that block's fingerprint information, without traversing all block data forward and reverse relationships in turn, which effectively speeds up the judgment of whether the current logical block meets the deduplication condition. In a distributed storage system, the data deduplication technique and mechanism improve storage efficiency and reduce cost. Network resource consumption is also reduced; in scenarios with many concurrent large writes, the amount of data written to storage decreases, which improves storage performance, and the reduced write volume after deduplication lowers block erasure and write amplification on the disk and extends disk life.

And step S24, if the target fingerprint information of the current logic block is stored, determining the current logic block as a first logic block meeting the deduplication condition, and if the target fingerprint information of the current logic block is not stored, determining the current logic block as a second logic block not meeting the deduplication condition.

If the target fingerprint information of the current logical block is stored in the block fingerprint forward relation of the fingerprint table, the current logical block is judged to be a first logical block meeting the deduplication condition, and if the target fingerprint information of the current logical block is not stored in the block fingerprint forward relation of the fingerprint table, the current logical block is judged to be a second logical block not meeting the deduplication condition.

And S25, searching a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and newly establishing the physical block in the internal object as a second physical block.

And S26, respectively establishing a mapping relation between the first logical block and the first physical block and a mapping relation between the second logical block and the second physical block, and writing the data to be written of the second logical block into the second physical block.

Specifically, establishing the mapping relationship between the first logical block and the first physical block and the mapping relationship between the second logical block and the second physical block means establishing a first block data forward relationship from the first logical block to the first physical block and a first block data reverse relationship from the first physical block to the first logical block, as well as a second block data forward relationship from the second logical block to the second physical block and a second block data reverse relationship from the second physical block to the second logical block. After the block data forward and reverse relationships are established, the forward relationships are stored in the service object corresponding to the first logical block and the service object corresponding to the second logical block respectively, and the reverse relationships are stored in the corresponding physical blocks in the internal object. When data needs to be read from physical block 1, the corresponding logical block 1 can be found from the block data reverse relationship, the block data forward relationship recorded in the service object corresponding to logical block 1 is read, and it is judged whether the physical block recorded in that forward relationship is physical block 1. If so, the verification passes and the data can be read; if not, the verification fails, the data cannot be read, and prompt information can be generated so that maintenance personnel can handle the situation accordingly, for example by correcting the block data forward relationship.

In a distributed storage system, the data deduplication technique and mechanism improve storage efficiency, reduce cost and reduce network resource consumption. In scenarios with many concurrent large writes, the amount of data written to storage decreases, which improves storage performance; the reduced write volume after deduplication also lowers block erasure and write amplification on the disk and extends disk life.

The present application is described below with reference to the specific data writing process diagram shown in fig. 5. Taking the small IO scenario as an example, a client initiates a write request (comprising a service object, a cut-off address and a data length). The service object receives the write request and caches and aggregates it in memory based on the WAL mechanism. When the WAL-aggregated data is flushed, the forward relationships within the write range of the service object are read and the data is divided into blocks to obtain the logical blocks, and the target fingerprint information of each logical block is then obtained. Before the target fingerprint information is obtained, a physical block storing all-0 data and a physical block storing all-1 data are created in advance during system initialization, and fixed fingerprint information is allocated to the all-0 and all-1 data. It is judged whether any of the logical blocks is an all-0 or all-1 logical block; if so, no hash calculation is needed for that block and its fixed fingerprint information can be determined directly, which avoids meaningless fingerprint calculation for all-0 and all-1 blocks whose result is fixed. After the fingerprint information of all logical blocks has been determined, the fingerprint table is queried according to that fingerprint information.
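The fast path for all-0 and all-1 blocks can be sketched as follows. The constant names, the fixed fingerprint strings, the interpretation of "all 1" as every byte being `0xFF`, and the use of SHA-256 for ordinary blocks are all assumptions for illustration.

```python
import hashlib

# Hypothetical fixed fingerprints allocated at system initialization.
ZERO_FP = "fixed-all-zero"
ONE_FP = "fixed-all-one"


def fingerprint(block: bytes) -> str:
    """Return the block fingerprint, skipping the hash for all-0 / all-1 blocks."""
    if block and block.count(0x00) == len(block):
        return ZERO_FP  # maps to the pre-created all-0 physical block
    if block and block.count(0xFF) == len(block):
        return ONE_FP   # maps to the pre-created all-1 physical block
    return hashlib.sha256(block).hexdigest()  # hash choice is an assumption
```

Because the fixed fingerprints already have physical blocks behind them, such logical blocks always meet the deduplication condition and never cause a data write.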

And determining a first logical block conforming to the deduplication condition and a second logical block not conforming to the deduplication condition from the logical blocks by utilizing the target fingerprint information and the fingerprint table, wherein the fingerprint table is internally recorded with a block fingerprint forward relation which is a mapping relation representing the storage position of the physical block mapped from the fingerprint information to the internal object.

The method comprises the steps of establishing a first block data forward relation mapped from a first logical block to a first physical block and a first block data reverse relation mapped from the first physical block to the first logical block, respectively storing the block data forward relation in a business object corresponding to the first logical block, respectively storing the block data reverse relation in a corresponding physical block in an internal object, and eliminating cache data of the first logical block in a memory.

Establishing a second block data forward relation mapped from the second logical block to the second physical block and a second block data reverse relation mapped from the second physical block to the second logical block, respectively storing each block data forward relation in a service object corresponding to the second logical block, respectively storing each block data reverse relation in a corresponding physical block in an internal object, constructing a block fingerprint forward relation mapped from fingerprint information of the second logical block to a storage position of the corresponding second physical block according to the second block data forward relation, constructing a block fingerprint reverse relation mapped from the storage position of the second physical block to fingerprint information of the second logical block according to the second block data reverse relation, and respectively writing the block fingerprint forward relation and the block fingerprint reverse relation into a fingerprint table and the corresponding physical block.

It can be understood that if a block data reverse relationship of a physical block needs to be deleted, and the physical block has no block data reverse relationship left after the deletion, the corresponding block fingerprint reverse relationship also needs to be deleted and the corresponding block fingerprint forward relationship is deleted from the fingerprint table. This ensures that the block data forward relationships correspond to the block data reverse relationships, and that the block fingerprint forward relationships likewise correspond to the block fingerprint reverse relationships.
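The cleanup rule can be sketched as below, with four dictionaries standing in for LP, PL, FP and PF. The function name and the `state` layout are hypothetical. In this sketch the forward relationships are removed before the matching reverse relationships, in line with the ordering principle stated earlier.

```python
def drop_reference(state, logical):
    """Remove one logical reference; return True if the fingerprint entry was removed.

    state["lp"]: logical -> physical      state["pl"]: physical -> set of logicals
    state["fp"]: fingerprint -> physical  state["pf"]: physical -> fingerprint
    """
    physical = state["lp"].pop(logical)      # block data forward relationship first
    references = state["pl"][physical]
    references.discard(logical)              # then the block data reverse relationship
    if references:
        return False                         # block is still referenced elsewhere
    fingerprint = state["pf"][physical]
    state["fp"].pop(fingerprint, None)       # block fingerprint forward relationship first
    del state["pf"][physical]                # then the block fingerprint reverse relationship
    del state["pl"][physical]
    return True
```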

Fig. 6 is a schematic structural diagram of a data writing device based on data deduplication according to an embodiment of the present invention, including:

a request acquisition module 11, configured to acquire a target data writing request in which a data amount to be written is greater than a preset data amount threshold or a preset trigger time condition is satisfied;

a fingerprint acquisition module 12, configured to determine target fingerprint information of each logical partition in the target data write request;

a deduplication judging module 13, configured to determine a first logical block meeting the deduplication condition and a second logical block not meeting the deduplication condition from the logical blocks by using the target fingerprint information and a fingerprint table, where a block fingerprint forward relationship is recorded in the fingerprint table, and the block fingerprint forward relationship is a mapping relationship from fingerprint information to the storage location of a physical block in an internal object;

a physical block determining module 14, configured to find a corresponding first physical block in the internal object based on the target fingerprint information of the first logical block, and to newly establish a physical block in the internal object as a second physical block;

a data writing module 15, configured to establish a mapping relationship between the first logical block and the first physical block and a mapping relationship between the second logical block and the second physical block, respectively, and to write the data to be written of the second logical block into the second physical block.

Further, the embodiment of the present application further discloses an electronic device, and fig. 7 is a block diagram of an electronic device according to an exemplary embodiment, where the content of the diagram is not to be considered as any limitation on the scope of use of the present application. The electronic device may comprise, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25 and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the data writing method based on data deduplication disclosed in any of the foregoing embodiments. In addition, the electronic device in the present embodiment may be specifically an electronic computer.

In this embodiment, the power supply 23 is configured to provide working voltages for each hardware device on the electronic device, the communication interface 24 is configured to create a data transmission channel with an external device for the electronic device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein, and the input/output interface 25 is configured to obtain external input data or output data to the outside, where the specific interface type may be selected according to the needs of the specific application, which is not specifically limited herein.

The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.

The operating system 221, which may be Windows Server, NetWare, Unix, Linux, etc., is used for managing and controlling the hardware devices on the electronic device and the computer program 222. In addition to the computer program that performs the data writing method based on data deduplication executed by the electronic device as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs that perform other specific tasks.

Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program realizes the data writing method based on data deduplication when being executed by a processor. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.

In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method.

Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The principles and embodiments of the present application have been described above with reference to specific examples, which are provided only to assist in understanding those principles and embodiments and are in no way limiting. Those of ordinary skill in the art will appreciate, in light of the above teachings, that the embodiments and the scope of application of the present application may be varied.

Claims

1. The data writing method based on data deduplication is characterized by comprising the following steps:

2. The data writing method based on data deduplication according to claim 1, wherein obtaining a target data writing request in which an amount of data to be written is greater than a preset data amount threshold value includes:

3. The data writing method based on data deduplication according to claim 1, wherein the determining the target fingerprint information of each logical partition in the target data writing request includes:

4. The data writing method based on data deduplication according to claim 1, wherein the determining, from the logical partitions, a first logical partition that meets the deduplication condition and a second logical partition that does not meet the deduplication condition using the target fingerprint information and a fingerprint table, includes:

5. The data writing method based on data deduplication according to claim 1, wherein the first logical partition conforming to a deduplication condition is a logical partition in which data to be written has been stored within an established physical partition of an internal object, and the second logical partition not conforming to the deduplication condition is a logical partition in which data to be written has not been stored within the established physical partition.

6. The data writing method based on data deduplication as claimed in claim 1, wherein the establishing a mapping relationship between the first logical partition and the first physical partition and a mapping relationship between the second logical partition and the second physical partition, respectively, includes:

7. The data writing method based on data deduplication as claimed in claim 6, wherein establishing a mapping relationship between the first logical partition and the first physical partition comprises:

8. A data writing apparatus based on data deduplication, comprising:

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the data writing method based on data deduplication as claimed in any of claims 1 to 7.

10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, which when executed by a processor, implements the steps of the data writing method based on data deduplication as claimed in any of claims 1 to 7.
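
By way of non-limiting illustration only, the deduplication condition recited in claim 5 can be sketched as follows. The sketch is not the claimed implementation. It assumes that the fingerprint is a SHA-256 digest, that the fingerprint table is an in-memory dictionary from fingerprint to physical partition index, and that the internal object is a list of physical partitions; the function name `write_with_dedup` and all variable names are hypothetical.

```python
import hashlib


def write_with_dedup(logical_partitions, fingerprint_table, internal_object):
    """Map each logical partition to a physical partition, storing only new data.

    logical_partitions: list of bytes, one entry per logical partition to write.
    fingerprint_table: dict mapping fingerprint -> physical partition index
        (hypothetical in-memory stand-in for the fingerprint table).
    internal_object: list of bytes holding the established physical partitions.
    Returns (mapping, first, second): mapping[i] is the physical index for
    logical partition i; first/second list the logical indices that did or
    did not meet the deduplication condition.
    """
    mapping, first, second = {}, [], []
    for i, data in enumerate(logical_partitions):
        # Assumed fingerprint function; the source does not fix one here.
        fp = hashlib.sha256(data).hexdigest()
        if fp in fingerprint_table:
            # Data already stored in an established physical partition:
            # reuse it instead of writing again.
            first.append(i)
        else:
            # Data not yet stored: establish a new physical partition,
            # write the data, and record its location in the table.
            internal_object.append(data)
            fingerprint_table[fp] = len(internal_object) - 1
            second.append(i)
        mapping[i] = fingerprint_table[fp]
    return mapping, first, second
```

In this sketch, a request containing the partitions `b"A"`, `b"B"`, `b"A"` written against an empty table stores only two physical partitions, and the third logical partition is mapped to the physical partition already created for the first.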