CN113253947B

CN113253947B - A deduplication method, apparatus, device and readable storage medium

Info

Publication number: CN113253947B
Application number: CN202110803769.7A
Authority: CN
Inventors: 甄凤远; 刘志勇
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2021-07-16
Filing date: 2021-07-16
Publication date: 2021-10-15
Anticipated expiration: 2041-07-16
Also published as: CN113253947A

Abstract

The invention discloses a deduplication method. The method converts the comparison of deduplication data into a two-level comparison. One part of the characteristic values of written IOs, namely the first part of the characteristic values, is stored in the cache; the other part, namely the second part of the characteristic values, is stored on the disk to avoid excessive occupation of cache space. During comparison, the first part of the characteristic value is compared first, and the second part is compared only when the first part matches. If the comparison of the first part fails, the comparison of the second part is abandoned and the data is written directly. This two-level comparison improves deduplication performance through cache pre-screening, and post-cache allocation reduces unnecessary memory overhead, so the overall performance of the system is high. The invention also discloses a deduplication apparatus, a device and a readable storage medium, which have corresponding technical effects.

Description

Deduplication method, deduplication device, deduplication equipment and readable storage medium

Technical Field

The present invention relates to the field of all-flash storage technologies, and in particular, to a deduplication method, apparatus, device, and readable storage medium.

Background

At present, all-flash storage systems have gradually become the mainstream storage devices of major operators and financial institutions. To increase the total amount of data that can be stored without changing capacity, major manufacturers support a deduplication feature. Deduplication is a technology for saving storage space: a data storage pool generally contains much repeated data, and deduplication finds and processes that repeated data. In brief, deduplication keeps only one of N copies of repeated data and points the address pointers of the other N-1 copies to that single copy. Deduplication saves customer costs, but it introduces overheads that do not exist in traditional storage, such as HASH computation and metadata.

Compared with a traditional storage system, an all-flash system calculates a HASH value from the data block when writing data, and then reads the disk according to the calculated HASH value. If the read hits, the data block is not written again; if the read misses, a copy of the HASH data is written along with the data block. This method effectively reduces the writing of repeated data and greatly improves the effective utilization of the storage system. However, frequently reading the disk for HASH lookups greatly increases write latency and reduces the overall performance of the storage system. One approach to this problem is to cache all HASH data, but current memory sizes cannot hold enough HASH data for full caching.

In summary, how to reduce unnecessary disk accessing operations and reduce unnecessary memory overhead is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a deduplication method, a deduplication device, deduplication equipment and a readable storage medium, which can reduce unnecessary disk accessing operation and reduce unnecessary memory overhead.

In order to solve the technical problems, the invention provides the following technical scheme:

a method of deduplication, comprising:

after receiving an IO write-in request, determining an IO characteristic value to be written in;

determining a first part of characteristic values and a second part of characteristic values in the IO characteristic values to be written;

inquiring whether a first cache block pointed by the first part of characteristic values is empty in a first cache block group; the first cache block group is used for caching a first part of characteristic values written into IO;

if the first cache block is not empty, inquiring whether a second memory block pointed by the second part of characteristic value is empty in a disk; the magnetic disk is used for storing a second part of characteristic values written into IO;

if the second memory block is not empty, pointing the storage address to be written into the IO to the IO storage address corresponding to the first cache block and the second memory block;

and if the first cache block or the second memory block is empty, executing data writing step on the IO to be written.

Optionally, the querying, in the first cache block group, whether the first cache block pointed to by the first partial feature value is empty includes:

determining a first sub-portion feature value and a second sub-portion feature value in the first portion feature value;

querying, in a first sub-cache block group, whether a first sub-cache block pointed to by the first sub-portion feature value is empty; the first sub-cache block group consists of the cache blocks in the first cache block group that are used for caching the first sub-portion feature values of written IOs;

if the first sub-cache block is not empty, querying, in a second sub-cache block group, whether a second sub-cache block pointed to by the second sub-portion feature value is empty; the second sub-cache block group consists of the cache blocks in the first cache block group that are used for caching the second sub-portion feature values of written IOs;

if the second sub-cache block is not empty, determining that the first cache block is not empty;

and if the first sub-cache block or the second sub-cache block is empty, determining that the first cache block is empty.

Optionally, the determining a first part of feature values and a second part of feature values in the IO feature values to be written includes:

reading the first n-bit characteristic values in the IO characteristic values to be written as the first part characteristic values; wherein n is any positive integer smaller than the total number of bits of the IO characteristic value to be written;

reading the last m-bit characteristic values in the IO characteristic values to be written as the second part characteristic values; wherein m is the difference between the total number of bits and n.

Optionally, after receiving the IO write request, determining the IO characteristic value to be written includes:

and after receiving the IO write-in request, determining the hash value of the IO to be written in as the IO characteristic value to be written in.

A deduplication apparatus, comprising:

the characteristic value determining unit is used for determining the IO characteristic value to be written after receiving the IO writing request;

a partial characteristic determining unit, configured to determine a first partial characteristic value and a second partial characteristic value in the IO characteristic values to be written;

a first judging unit, configured to query, in a first cache block group, whether a first cache block pointed to by the first partial feature value is empty; the first cache block group is used for caching a first part of characteristic values written into IO; if the first cache block is not empty, triggering a second judgment unit; if the first cache block is empty, triggering a write-in unit;

the second judging unit is configured to query, in the disk, whether a second memory block pointed to by the second partial characteristic value is empty; the magnetic disk is used for storing a second part of characteristic values written into IO; if the second memory block is empty, triggering the write-in unit; if the second memory block is not empty, triggering a repeated data management unit;

the duplicate data management unit is configured to point a storage address to be written into an IO to an IO storage address corresponding to the first cache block and the second memory block;

and the writing unit is used for executing a data writing step on the IO to be written.

Optionally, the first determining unit includes:

a buffer part feature determining subunit, configured to determine a first sub-part feature value and a second sub-part feature value in the first part feature value;

a first determining subunit, configured to query, in the first sub-cache block group, whether the first sub-cache block pointed by the first sub-portion feature value is empty; the first sub-cache block group is a cache block in the first cache block group, and is used for caching the first sub-part characteristic value written into the IO; if the first sub-cache block is not empty, triggering a second judgment subunit; if the first sub-cache block is empty, determining that the first cache block is empty;

the second determining subunit is configured to query, in a second sub-cache block group, whether a second sub-cache block pointed by the second sub-portion feature value is empty; the second sub-cache block group is a cache block in the first cache block group, and is used for caching the second sub-part characteristic value written into the IO; if the second sub-cache block is not empty, determining that the first cache block is not empty; and if the second sub-cache block is empty, determining that the first cache block is empty.

Optionally, the partial feature determination unit includes:

a first partial characteristic determining subunit, configured to read a first n-bit characteristic value in the IO characteristic value to be written as the first partial characteristic value; wherein n is any positive integer smaller than the total number of bits of the IO characteristic value to be written;

a second part characteristic determining subunit, configured to read a last m-bit characteristic value in the IO characteristic value to be written as the second part characteristic value; wherein m is the difference between the total number of bits and n.

Optionally, the characteristic value determining unit specifically includes: and the hash determining unit is used for determining the hash value of the IO to be written as the IO characteristic value to be written after receiving the IO write request.

A computer device, comprising:

a memory for storing a computer program;

and the processor is used for realizing the steps of the deduplication method when executing the computer program.

A readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described deduplication method.

In the method provided by the embodiment of the invention, the comparison of the deleted data is converted into two-stage comparison, one part of the characteristic values written into the IO, namely the first part of the characteristic values, is stored in the cache, and the other part of the characteristic values written into the IO, namely the second part of the characteristic values, is stored in the disk, so that the excessive occupation of the cache space is avoided; the two-stage comparison mode can improve the deduplication performance through a cache pre-screening mode, reduce unnecessary memory overhead through a cache post-allocation mode, and achieve high overall performance of the system.

Accordingly, embodiments of the present invention further provide a deduplication apparatus, a device, and a readable storage medium corresponding to the deduplication method, which have the technical effects described above and are not described herein again.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the related art more clearly, the drawings used in the description of the embodiments or the related art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart illustrating an implementation of a deduplication method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a deduplication apparatus according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The core of the invention is to provide a deduplication method, which can reduce unnecessary disk accessing operation and reduce unnecessary memory overhead.

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a deduplication method according to an embodiment of the present invention, where the method includes the following steps:

S101, after receiving an IO write request, determining an IO characteristic value to be written;

In the all-flash storage system, when a host performs an IO (input/output) write, a characteristic value of the IO to be written is determined. The type of the characteristic value is not limited in this embodiment. Deduplication comparison generally extracts a HASH value of the IO to be written, so the HASH value may also be extracted in this step. Other types of characteristic values may also be used, which are not described herein again.
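Step S101 can be sketched as follows. This is only an illustrative sketch: the source does not name a hash algorithm, so the use of BLAKE2b truncated to 64 bits and the function name `io_feature_value` are assumptions made for the example.

```python
import hashlib


def io_feature_value(block: bytes) -> int:
    """Return a 64-bit characteristic value (a hash) for the data block to be written.

    Assumption: BLAKE2b with an 8-byte digest stands in for the unspecified
    HASH algorithm of the embodiment.
    """
    digest = hashlib.blake2b(block, digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Identical data blocks always produce the same characteristic value, which is the property the later comparison steps rely on.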

S102, determining a first part of characteristic values and a second part of characteristic values to be written in the IO characteristic values;

In this embodiment, a two-level comparison is used when searching for the HASH of duplicate data. The characteristic value is divided into a first part characteristic value and a second part characteristic value; each is a part of the characteristic value, and the characteristic value is the combination of the two. The specific division method is not limited. The leading bits of the characteristic value may be used directly as the first part characteristic value and the remaining bits as the second part characteristic value; alternatively, several designated non-contiguous bits of the characteristic value may be used as the first part characteristic value and the rest as the second part characteristic value. The division can be set according to the comparison requirements of the actual characteristic value. To facilitate splitting and comparison, the first n bits of the IO characteristic value to be written can be read as the first part characteristic value, where n is any positive integer less than the total number of bits of the IO characteristic value to be written; and the last m bits of the IO characteristic value to be written can be read as the second part characteristic value, where m is the difference between the total number of bits and n. This embodiment takes only the above splitting scheme as an example; implementations under other splitting schemes can refer to the description of this embodiment and are not described herein again.

Taking n as 32 bits as an example, the original 64-bit deduplication HASH comparison is converted into a 32-bit deduplication HASH comparison in this embodiment, so as to reduce the comparison data amount and reduce unnecessary disk accessing behavior.
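The first-n-bits / last-m-bits split described above can be sketched as follows. The function name `split_feature_value` and its parameter names are illustrative assumptions; only the rule m = total bits - n comes from the text.

```python
from typing import Tuple


def split_feature_value(value: int, total_bits: int = 64, n: int = 32) -> Tuple[int, int]:
    """Split a characteristic value into its first n bits and its last m bits."""
    if not 0 < n < total_bits:
        raise ValueError("n must be a positive integer smaller than total_bits")
    m = total_bits - n                  # m is the difference between the total and n
    first_part = value >> m             # leading n bits
    second_part = value & ((1 << m) - 1)  # trailing m bits
    return first_part, second_part
```

For a 64-bit value with n = 32, the value `0x1122334455667788` splits into `0x11223344` and `0x55667788`, and shifting the first part left by m bits and OR-ing in the second part restores the original value.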

S103, inquiring whether a first cache block pointed by the first part of characteristic value is empty or not in the first cache block group;

The first cache block group is used to cache the first part characteristic values of written IOs, for example, cache blocks storing the first 32 bits of the characteristic values of written IOs. In this embodiment, one part of the characteristic values of written IOs is stored in the cache, in cache blocks, to ensure data reading speed.

The first cache block group is queried to determine whether the first cache block pointed to by the first part characteristic value is empty. If it is empty, no cache block pointed to by the first part characteristic value exists, that is, the written IOs do not include the IO currently to be written. In this case, the data writing step is executed directly on the IO to be written (step S106) without reading the second part characteristic value, which reduces the comparison workload. If it is not empty, a cache block pointed to by the first part characteristic value exists, that is, the written IOs may include the IO currently to be written. In this case, it is necessary to further determine whether the second part characteristic value is the same as that of a written IO, and step S104 is executed.

S104, inquiring whether a second memory block pointed by the second part of characteristic values is empty in the disk;

In this embodiment, the second part characteristic values are stored on the disk, that is, locally on the device, to reduce the occupation of cache space.

The disk is queried to determine whether the second memory block pointed to by the second part characteristic value is empty. If it is empty, no memory block pointed to by the second part characteristic value exists, that is, the written IOs do not include the IO currently to be written, and step S106 is executed. If it is not empty, a memory block pointed to by the second part characteristic value exists, that is, the written IOs include the IO currently to be written and the data currently to be written is duplicate data; step S105 is executed directly and the data is not written again.

S105, pointing the storage address to be written into the IO to the IO storage address corresponding to the first cache block and the second memory block;

The storage address of the IO to be written is pointed to the IO storage address corresponding to the first cache block and the second memory block, so that the data does not need to be written repeatedly.

And S106, executing data writing step on the IO to be written.

When the written IOs do not include the IO currently to be written, that is, when the IO currently to be written is not duplicate data, the data writing step may be performed according to a related method. It specifically includes memory allocation, data writing, and generation and storage of the characteristic value, which are not described herein again.
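Steps S101 to S106 can be put together in a small in-memory sketch. Everything here is an illustrative assumption rather than the patented implementation: the class name `TwoLevelDedup`, the use of BLAKE2b as the hash, a Python set standing in for the first cache block group, and a dictionary keyed by both parts standing in for the on-disk second-part index. As in the described method, a match on the full characteristic value is treated as duplicate data without comparing the data itself.

```python
import hashlib


class TwoLevelDedup:
    """Toy model of the two-level comparison (n = 32 of 64 bits)."""

    TOTAL_BITS = 64
    N = 32

    def __init__(self) -> None:
        self.first_level = set()   # cache: first parts of written IOs
        self.second_level = {}     # stands in for disk: (first, second) -> address
        self.store = []            # stands in for data blocks; list index = address
        self.disk_lookups = 0      # how often the "disk" index was consulted

    def write(self, block: bytes) -> int:
        # S101: characteristic value of the IO to be written
        digest = hashlib.blake2b(block, digest_size=self.TOTAL_BITS // 8).digest()
        value = int.from_bytes(digest, "big")
        # S102: first n bits and last m bits
        m = self.TOTAL_BITS - self.N
        first, second = value >> m, value & ((1 << m) - 1)
        # S103: is the first cache block empty?
        if first in self.first_level:
            # S104: only now consult the disk-resident second part
            self.disk_lookups += 1
            address = self.second_level.get((first, second))
            if address is not None:
                return address     # S105: point at the existing storage address
        # S106: write data and record both parts of the characteristic value
        address = len(self.store)
        self.store.append(block)
        self.first_level.add(first)
        self.second_level[(first, second)] = address
        return address
```

Writing the same block twice returns the same address and stores the data only once, while a block whose first part is not cached is written without any disk lookup.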

To aid understanding, take the first part characteristic value to be the first 32 bits of the HASH value. In this embodiment, the first 32 bits of the 64-bit HASH data are fully cached, and the last 32 bits are stored locally on the device's disk, which avoids occupying a large amount of cache space. The first 32 bits serve as a first-level pre-read and the last 32 bits as a second-level pre-read. The second-level pre-read, which reads from the disk, is performed only when the first-level pre-read hits. If the first-level pre-read misses, the second-level pre-read is abandoned, the data and the HASH value are written directly (that is, the data writing step is executed on the IO to be written), and the HASH read cache is updated. If the second-level pre-read hits, the IO to be written duplicates data that already exists, so the data does not need to be written again, and the address of the data to be written only needs to point to the storage address of the duplicate IO. This deduplication mode pre-screens in memory before the HASH is read from disk, which reduces unnecessary disk accesses; the two-level read cache design enables dynamic memory allocation and reduces unnecessary memory overhead, thereby greatly improving the performance of the deduplication feature of the storage system.

Based on the above description, in the technical scheme provided in the embodiment of the present invention, the comparison of the deduplication data is converted into two-level comparison, one part (i.e., the first part of feature values) of the IO-written feature values is stored in the cache, and the other part (i.e., the second part of feature values) of the IO-written feature values is stored in the disk, so as to avoid excessive occupation of the cache space; the two-stage comparison mode can improve the deduplication performance through a cache pre-screening mode, reduce unnecessary memory overhead through a cache post-allocation mode, and achieve high overall performance of the system.

It should be noted that, based on the above embodiments, the embodiments of the present invention also provide corresponding improvements. In the preferred/improved embodiment, the same steps as those in the above embodiment or corresponding steps may be referred to each other, and corresponding advantageous effects may also be referred to each other, which are not described in detail in the preferred/improved embodiment herein.

In step S103, the first cache block group is queried to determine whether the first cache block pointed to by the first part characteristic value is empty. To further increase the comparison and lookup speed for the cache blocks, the following steps may be performed:

(1) determining a first subsection feature value and a second subsection feature value in the first part feature value;

(2) querying, in the first sub-cache block group, whether the first sub-cache block pointed to by the first sub-portion feature value is empty; the first sub-cache block group consists of the cache blocks in the first cache block group that are used for caching the first sub-portion feature values of written IOs;

(3) if the first sub-cache block is not empty, querying, in a second sub-cache block group, whether a second sub-cache block pointed to by the second sub-portion feature value is empty; the second sub-cache block group consists of the cache blocks in the first cache block group that are used for caching the second sub-portion feature values of written IOs;

(4) if the second sub-cache block is not empty, determining that the first cache block is not empty;

(5) and if the first sub-cache block or the second sub-cache block is empty, determining that the first cache block is empty.

For the first part characteristic value, the lookup comparison against cached data is itself converted into a two-level cache lookup comparison: the first part characteristic value is divided into a first sub-portion feature value and a second sub-portion feature value. The specific division manner is not limited in this embodiment, and reference may be made to the step of dividing the characteristic value into the first part characteristic value and the second part characteristic value in the above embodiment. The two-level cache comparison improves deduplication performance through cache pre-screening and reduces unnecessary memory overhead through post-cache allocation, so the overall performance of the system is high.
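Steps (1) to (5) can be sketched as a nested lookup. The source does not specify how the two sub-cache block groups are laid out, so this sketch assumes that each first sub-cache block leads to its own set of second sub-portion values; the class name `SubLevelCache`, the 16/16 split of a 32-bit first part, and the `add` helper are likewise hypothetical.

```python
class SubLevelCache:
    """Toy two-level lookup inside the first cache block group."""

    def __init__(self, first_part_bits: int = 32, k: int = 16) -> None:
        if not 0 < k < first_part_bits:
            raise ValueError("k must be between 1 and first_part_bits - 1")
        self.low_bits = first_part_bits - k
        # Assumed layout: first sub-portion -> set of second sub-portions
        self.groups = {}

    def _split(self, first_part: int):
        # (1) first sub-portion = leading k bits, second sub-portion = the rest
        return first_part >> self.low_bits, first_part & ((1 << self.low_bits) - 1)

    def is_empty(self, first_part: int) -> bool:
        first_sub, second_sub = self._split(first_part)
        second_group = self.groups.get(first_sub)
        if second_group is None:
            return True                        # (2)/(5): first sub-cache block empty
        return second_sub not in second_group  # (3)/(4)/(5): second sub-cache block

    def add(self, first_part: int) -> None:
        # Record the first part of a newly written IO.
        first_sub, second_sub = self._split(first_part)
        self.groups.setdefault(first_sub, set()).add(second_sub)
```

With this layout, a miss on the first sub-portion ends the lookup immediately, and the second sub-portion is examined only for values that share a cached first sub-portion.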

Corresponding to the above method embodiments, the embodiments of the present invention further provide a deduplication device, and the deduplication device described below and the deduplication method described above may be referred to correspondingly.

Referring to fig. 2, the apparatus includes the following modules:

the characteristic value determining unit 110 is mainly configured to determine an IO characteristic value to be written after receiving an IO write request;

the partial feature determining unit 120 is mainly configured to determine a first partial feature value and a second partial feature value to be written in the IO feature value;

the first determining unit 130 is mainly configured to query whether the first cache block pointed by the first part of feature values in the first cache block group is empty; the first cache block group is used for caching a first part of characteristic values written into IO; if the first cache block is not empty, the second determining unit 140 is triggered; if the first cache block is empty, the write unit 150 is triggered;

the second determining unit 140 is mainly configured to query whether the second memory block pointed by the second part of feature values is empty in the disk; the disk is used for storing the second part of characteristic values written into IO; if the second memory block is empty, the write unit 150 is triggered; if the second memory block is not empty, the duplicate data management unit 160 is triggered;

the write unit 150 is mainly used for performing a data write step on IO to be written;

the repeated data management unit 160 is mainly configured to point the storage address to be written into the IO to the IO storage address corresponding to the first cache block and the second memory block.

In an embodiment of the present invention, the first determining unit specifically includes:

a cache portion feature determination subunit, configured to determine a first sub-portion feature value and a second sub-portion feature value in the first portion feature value;

a first determining subunit, configured to query, in the first sub-cache block group, whether the first sub-cache block pointed to by the first sub-portion feature value is empty; the first sub-cache block group is a cache block in the first cache block group, and is used for caching the first sub-part characteristic value written into the IO; if the first sub-cache block is not empty, triggering a second judgment subunit; if the first sub-cache block is empty, determining that the first cache block is empty;

a second determining subunit, configured to query, in the second sub-cache block group, whether a second sub-cache block pointed by the second sub-portion feature value is empty; the second sub-cache block group is a cache block used for caching the second sub-part characteristic value written in the IO in the first cache block group; if the second sub-cache block is not empty, determining that the first cache block is not empty; if the second sub-cache block is empty, the first cache block is determined to be empty.

In one embodiment of the present invention, the partial feature determination unit includes:

the first part characteristic determining subunit is used for reading the first n-bit characteristic values in the IO characteristic values to be written as first part characteristic values; wherein n is any positive integer less than the total number of bits of the IO characteristic value to be written;

the second part characteristic determining subunit is used for reading a rear m-bit characteristic value to be written in the IO characteristic value as a second part characteristic value; wherein m is the difference between the total number of bits and n.

In an embodiment of the present invention, the characteristic value determining unit specifically includes: and the hash determining unit is used for determining the hash value of the IO to be written as the IO characteristic value to be written after receiving the IO write request.

Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer device, and a computer device described below and a deduplication method described above may be referred to correspondingly.

The computer device includes:

a memory for storing a computer program;

a processor for implementing the steps of the deduplication method of the above method embodiments when executing the computer program.

Specifically, referring to Fig. 3, which is a schematic structural diagram of a computer device provided in this embodiment, the computer device may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 322 and a memory 332, where the memory 332 stores one or more computer applications 342 or data 344. The memory 332 may be transient or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a sequence of instruction operations for a data processing device. Further, the central processing unit 322 may be configured to communicate with the memory 332 and to execute, on the computer device 301, the series of instruction operations in the memory 332.

The computer device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input-output interfaces 358, and/or one or more operating systems 341.

The steps in the above described deduplication method may be implemented by the structure of a computer device.

Corresponding to the above method embodiment, the embodiment of the present invention further provides a readable storage medium, and a readable storage medium described below and a deduplication method described above may be referred to correspondingly.

A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the deduplication method of the above-mentioned method embodiments.

The readable storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other readable storage media capable of storing program codes.

Those of skill in the art would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims

1. A deduplication method, comprising:

determining a first part of characteristic values and a second part of characteristic values in the IO characteristic values to be written; dividing the characteristic value into the first part characteristic value and the second part characteristic value, and enabling the first part characteristic value and the second part characteristic value to be one part of the characteristic values;

if the first cache block is not empty, inquiring whether a second memory block pointed by the second part of characteristic value is empty in a disk; the magnetic disk is used for storing a second part of characteristic values written into IO; empty indicates that no cache block pointed to by the characteristic value exists; not empty indicates that a cache block pointed to by the characteristic value exists;

2. A method according to claim 1, wherein said querying in the first cache block group whether the first cache block pointed to by the first partial characteristic value is empty comprises:

3. A method according to claim 1, wherein the determining the first partial characteristic value and the second partial characteristic value of the IO characteristic values to be written includes:

4. A deduplication method according to claim 1, wherein determining an IO characteristic value to be written after receiving an IO write request comprises:

5. A deduplication apparatus, comprising:

a partial characteristic determining unit, configured to determine a first partial characteristic value and a second partial characteristic value in the IO characteristic values to be written; dividing the characteristic value into the first part characteristic value and the second part characteristic value, and enabling the first part characteristic value and the second part characteristic value to be one part of the characteristic values;

a first judging unit, configured to query, in a first cache block group, whether a first cache block pointed to by the first partial feature value is empty; the first cache block group is used for caching a first part of characteristic values written into IO; if the first cache block is not empty, triggering a second judgment unit; if the first cache block is empty, triggering a write-in unit; the null indication does not have a cache block pointed to by the characteristic value; the not null indication has a cache block to which the characteristic value points;

6. The de-duplication apparatus of claim 5 wherein the first determination unit comprises:

7. The de-duplication apparatus of claim 5 wherein the partial feature determination unit comprises:

8. The de-duplication apparatus according to claim 5, wherein the eigenvalue determination unit is specifically: and the hash determining unit is used for determining the hash value of the IO to be written as the IO characteristic value to be written after receiving the IO write request.

9. A computer device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the deduplication method of any of claims 1 to 4 when executing the computer program.

10. A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the deduplication method as recited in any one of claims 1 through 4.