CN101814045B

CN101814045B - Data organization method for backup services

Info

Publication number: CN101814045B
Application number: CN2010101523978A
Authority: CN
Inventors: 周可; 王桦; 张鹏
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2010-04-22
Filing date: 2010-04-22
Publication date: 2011-09-14
Anticipated expiration: 2030-04-22
Also published as: CN101814045A

Abstract

The invention discloses a data organization method for the storage server side of backup service software, intended to improve data organization and data management efficiency at the storage server. The method comprises: ① initializing the storage server's storage space into a metadata area (comprising a main record, an index header and a data index) and a data area; ② receiving a user command and determining its type, then going to step ③ for a backup operation, step ④ for a restore operation, or step ⑤ for a delete operation; ③ processing the backup operation: backing up user data to the data area of the storage server, using deduplication to avoid storing duplicate data, then returning to step ②; ④ processing the restore operation: locating the data the user wants restored in the data area of the storage server and transmitting it to the client, then returning to step ②; ⑤ processing the delete operation: finding the data the user wants deleted and handling it according to the reference counts of the corresponding backup data blocks in the data area, then returning to step ②. The method improves the storage utilization, manageability and scalability of the storage server, saves network bandwidth, and improves backup efficiency.

Description

A data organization method for backup services

Technical field

The invention relates to computer data storage and backup methods, and in particular to a data organization method for backup services that implements block-level data deduplication.

Background technology

With the rapid development of the information society, information systems increasingly pervade everything from daily life to business operations, and dependence on them keeps growing. In industries such as finance, communications, transportation and insurance in particular, the loss or corruption of critical data can cause immeasurable losses to individuals and enterprises.

The backup service discussed here is essentially a backup and recovery software system with some disaster tolerance capability. It provides complete data backup, recovery and related management functions for individual and enterprise users, who can customize backup policies to their actual needs. The backup service is also a software delivery model: the provider builds all the network infrastructure and the software and hardware platform that the enterprise's information systems require, and is responsible for the initial deployment, subsequent maintenance and related services. The enterprise does not need to buy software or hardware, build a machine room or hire IT staff; it uses the information system over the Internet. Just as water flows when a tap is turned on, the enterprise rents software services as it needs them.

Data backup and recovery are important measures for ensuring information security. The growing importance of data requires that data on storage systems be protected effectively and comprehensively. With the emergence and development of high-speed networks, communication technology and mass storage technology, underlying storage resources have changed dramatically. The spread of information systems has also made the volume of data worth preserving grow geometrically. All of this places higher demands on the research and development of backup and recovery software and related technology.

When using a backup service, users typically have the following requirements for storage space and data: storage usage can be increased or reduced on demand; all available space can be used without obstruction, that is, a backup task executes correctly as long as free space remains and the network is reachable; and backed-up data can be restored at any time. Meeting these requirements calls for a degree of logical independence between user space and data, so user space management and backup data organization methods need to be studied. In addition, efficient space allocation and reclamation mechanisms are needed that exploit the storage sharing made possible by duplicate data while preserving the logical independence of user data and supporting efficient data lookup and access.

In a typical backup software architecture, a storage server is a physical medium authenticated by the management console of the backup software. It can be hard disk space on a server, a storage device attached to a server, or a mapped network disk. Multiple storage servers can be configured through the management console; under the unified management of the backup server, backup clients back up data to the corresponding storage server.

Earlier designs of the storage server side of backup software mostly used file-level backup. In file-level backup, the backup software is aware only of the file layer and copies all files on the source disk to another target medium. File-level backup software therefore either relies on the file system interface provided by the operating system or implements file system functions itself and can interpret file system metadata. In short, file-level backup software reads data file by file and stores the files it reads on another medium. For PB-scale storage systems this becomes a performance bottleneck: because the unit of data managed at the storage server is the file, large amounts of duplicate data are inevitably backed up, and managing the storage server becomes very inconvenient. Performing backup with block-level data deduplication can largely solve these problems.

In addition, in earlier backup software the storage capacity initially allocated to each user at the storage server was usually fixed, which greatly reduces system scalability. In practice, the system cannot predict how much storage each user will eventually use (although a maximum quota may apply depending on the user's permissions and type). Allocating too much wastes storage space and lowers utilization; allocating too little can severely restrict the user.

Avamar was recently acquired by EMC. Its patented data deduplication and global single-instance storage technology ensures that each backup data segment is stored only once globally. This can reduce the amount of data moved and restored by a factor of 300 while still allowing daily full backups and fast recovery. For each 24 KB data segment, Avamar uses the SHA-1 hash algorithm to generate a unique 20-byte ID. This ID is the fingerprint of the segment, and Avamar's software uses it to determine whether a segment has already been stored. However, SHA-1 is computationally expensive and consumes considerable CPU. Because the segments are small, the fingerprint space consumed becomes large when users back up large amounts of data, which also raises scalability concerns.
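To make the fingerprint-space concern concrete, the following rough estimate compares the fingerprint volume per TiB of unique data for 24 KB segments with 20-byte IDs against 4M blocks with 16-byte MD5 fingerprints, the block size and hash used later in this document. It is only a back-of-the-envelope sketch: the function name is hypothetical, segments are assumed to be fixed-size, and index overhead beyond the raw fingerprints is ignored.

```python
# Fingerprint-space estimate. The 24 KB / 20-byte figures come from the text
# above; the 4M / 16-byte (MD5) figures are the ones this method uses later.
KIB = 1024
MIB = 1024 * KIB
TIB = 1024 * 1024 * MIB


def fingerprint_bytes(total_bytes: int, segment_bytes: int, fp_bytes: int) -> int:
    """Bytes of fingerprints needed to index total_bytes of unique data."""
    segments = -(-total_bytes // segment_bytes)  # ceiling division
    return segments * fp_bytes


small_segments = fingerprint_bytes(TIB, 24 * KIB, 20)  # 24 KB segments, 20-byte IDs
large_blocks = fingerprint_bytes(TIB, 4 * MIB, 16)     # 4M blocks, 16-byte MD5

print(f"24 KB segments: {small_segments / MIB:.1f} MiB of fingerprints per TiB")
print(f"4M blocks:      {large_blocks / MIB:.1f} MiB of fingerprints per TiB")
```

Under these assumptions the larger block size shrinks the fingerprint volume by more than two orders of magnitude, at the cost of coarser duplicate detection.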

Summary of the invention

The object of the invention is to provide a data organization method for backup services that implements block-level data deduplication and improves the efficiency of data organization and management.

The invention provides a data organization method for backup services, comprising the following steps:

(1) Initialization:

Initialize the metadata information, which includes assigning initial values to the index header information, data index information and data area metadata information in the metadata area;

Pre-allocate one data space in the data area in preparation for accepting user backup requests;

(2) Receive a user command and determine its type:

If the command is a backup operation, go to step (3); if it is a restore operation, go to step (4); if it is a delete operation, go to step (5);

(3) Perform backup processing as follows:

(3.1) First divide the file to be backed up into blocks, then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the block. Data blocks are indexed by fingerprint in the index header and data index of the storage server's metadata;

(3.2) The backup client sends the fingerprint of the data block to the storage server, which uses the fingerprint to check whether the block already exists;

(3.3) If the fingerprint does not exist, the data block is a new backup data block: the backup client transmits it to the storage server, which dynamically allocates storage space and writes the new backup data block. If the fingerprint exists, the storage server only updates the index information for the data block, incrementing its reference count by one;

(3.4) Go to step (2);

(4) To restore, the backup server looks up the hash list of the file to be restored; the storage space metadata is accessed according to the hash list to locate the logical position in the corresponding data space; the data of the file is then read in order from the storage server into a memory buffer, passed to the backup client through a socket, and assembled into the required file set; then go to step (2);

(5) Delete backed-up files as follows:

(5.1) The backup server in the backup system software looks up the hash list of the file to be deleted;

(5.2) Search the index header and data index mapping table in the metadata area of the storage server by hash value. If the hash value passed in does not exist, return immediately with the return value false;

(5.3) Otherwise, decrement by 1 the reference count in the object metadata corresponding to the hash value and return true;

(5.4) Go to step (2).
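As an illustration of steps (2) to (5), the following minimal in-memory sketch models the reference-counted block store. It is not the patented implementation: the class and function names are hypothetical, Python dictionaries stand in for the metadata and data areas, and what happens when a reference count reaches zero is left open because the steps above do not specify it.

```python
import hashlib


class BlockStore:
    """Toy in-memory stand-in for the storage server's metadata and data areas."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block bytes (data area)
        self.refcount = {}   # fingerprint -> reference count (metadata area)

    def backup(self, blocks):
        """Step (3): store unseen blocks, bump the count of known ones."""
        hash_list = []
        for block in blocks:
            fp = hashlib.md5(block).digest()
            if fp in self.refcount:
                self.refcount[fp] += 1
            else:
                self.blocks[fp] = block
                self.refcount[fp] = 1
            hash_list.append(fp)
        return hash_list

    def restore(self, hash_list):
        """Step (4): reassemble a file from its hash list."""
        return b"".join(self.blocks[fp] for fp in hash_list)

    def delete(self, fp):
        """Steps (5.2)-(5.3): False if the hash is unknown, else decrement."""
        if fp not in self.refcount:
            return False
        self.refcount[fp] -= 1
        return True


def dispatch(store, command, payload):
    """Step (2): route a user command to the matching step."""
    handlers = {
        "backup": store.backup,
        "restore": store.restore,
        "delete": lambda hashes: [store.delete(fp) for fp in hashes],
    }
    return handlers[command](payload)


store = BlockStore()
hashes = dispatch(store, "backup", [b"aaaa", b"bbbb", b"aaaa"])
restored = dispatch(store, "restore", hashes)
```

In this usage the block `b"aaaa"` is stored once with a reference count of 2, and the restored bytes equal the original three blocks concatenated.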

Enterprise data today is not only diverse and fast-growing but also highly redundant. Many identical files or pieces of data are stored within and across systems, and edited files are highly redundant with their earlier versions. Traditional backup software backs up this redundant data again and again, amplifying the redundancy. A comparatively good solution at present is data deduplication, which achieves high compression ratios, frees storage space, and lowers both the cost of disk-based backup and the cost of data management. The invention is a data organization and management method for the storage server that is based on block-level data deduplication. It transfers backup and restore data to and from clients efficiently, and manages local storage space and organizes data according to the backup server's policies. The invention achieves global data deduplication without affecting the user's normal backup and restore operations, and the benefit of deduplication becomes more pronounced as the number of users and the volume of backup data grow. It can significantly reduce the amount of data a user backup requires, saving backup time, network bandwidth and storage space.

Description of drawings

Fig. 1 shows the storage data organization used by the method and the process of locating a data item;

Fig. 2 is a flow chart of the method;

Fig. 3 is a flow chart of dynamic storage space allocation in the invention;

Fig. 4 is a flow chart of the write operation within the backup operation of the invention.

Embodiment

The invention is described in further detail below by way of an embodiment. The embodiment is illustrative only and does not limit the scope of protection of the invention.

The backup service system is based on a three-party architecture consisting of a backup server, storage servers and backup clients. The backup client accepts customized data backup policies, restore requests and other data management requests. The backup server connects the backup clients and the storage servers and is the control center of the whole backup software; it is responsible for user permission control, global job scheduling and global storage management. When a backup client initiates a backup or restore operation, the backup server directs it to connect to the designated storage server and begin execution. The backup server also monitors the computation, transmission and storage load on each storage server and applies a load balancing policy. User information, storage server state and the other basic metadata that support the backup server are to be stored in a database. The storage server transfers backup and restore data to and from clients, and manages local storage space and organizes data according to the backup server's policies.

This example uses four data structures, described below: the main record area, the index header, the data index and the data area. Their structure is shown in Fig. 1.

Main record area: describes the storage space as a whole and holds the index header information, the data index information and the data area metadata information.

Index header: an object hash table that maps an object ID (the hash value generated from the data content) to the data index. An object here is the basic unit of data storage in the storage system. Unlike the files and blocks that are the basic units of traditional storage systems, an object combines application data with the storage attributes (metadata) that describe it, and carries enough information for the data to be autonomous and self-managing. The hash value of the file content serves as the object ID in object-oriented storage and as the basis for storage; hash-value indexing establishes the mapping between content and object. Because hash values are globally unique in a statistical sense, the system has a globally unique namespace, which improves the manageability of shared storage. The system uses the mature MD5 algorithm, which transforms data content of arbitrary length into a 128-bit (16-byte) integer, namely the object ID.

Data index: an array of size N, where N is the number of index entries and ranges from 2^20 to 2^30. Each element of the array is the metadata structure of one object and contains: the object ID; the starting address of the object in the data area, expressed as (I, J, K), where I is the data space number, J is the logical data block number within that data space, and K is the offset within that logical data block; the size of the object's data; the number of copies of the object's data content; and the position in the object table of the next object that maps to the same slot of the object hash table. This last field links the objects that map to the same slot of the object hash table into a linked list.

Data area: stores the objects' data, each consisting of the object ID, the data content length and the data content. To simplify storage management, the data area is divided into several contiguous data spaces (each represented by a separate data file), and each data space consists of a number of logical data blocks.

Data block: whenever the backup service system performs a backup or restore operation, it divides the data to be processed into fixed-length pieces; each piece is a data block.

Backup data block: when a user backs up specified files and folders with the backup client, the client first divides the data into fixed-length blocks (4M in the actual backup service software); each such block is a backup data block.

Logical data block: to make the storage server's storage space easier to manage and use efficiently, each data space is divided into a number of sub-storage units; each unit is a logical data block (1G in the actual backup service software).
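The relationship between the index header and the data index can be sketched as a chained hash table. The sketch below is an assumption-laden illustration rather than the layout of Fig. 1: the class and field names are hypothetical, `M_BITS` is kept tiny for readability, entry 0 of the data index is reserved so that 0 can mark a free header slot, and new entries are pushed onto the front of a chain.

```python
from dataclasses import dataclass
from typing import List, Optional

M_BITS = 8  # illustrative only; real deployments would use far more bits


@dataclass
class IndexEntry:
    object_id: bytes   # 16-byte MD5 fingerprint
    space: int         # I: data space number
    block: int         # J: logical data block number within the space
    offset: int        # K: offset inside the logical data block
    size: int          # length of the object's data
    copies: int = 1    # number of copies of the data content
    next: int = 0      # next entry hashed to the same slot; 0 ends the chain


class MetadataArea:
    def __init__(self):
        self.header: List[int] = [0] * (1 << M_BITS)     # 0 marks a free slot
        self.index: List[Optional[IndexEntry]] = [None]  # entry 0 is reserved

    @staticmethod
    def slot(object_id: bytes) -> int:
        """Use the first M_BITS bits of the hash as the header position."""
        return int.from_bytes(object_id, "big") >> (len(object_id) * 8 - M_BITS)

    def insert(self, entry: IndexEntry) -> int:
        """Append the entry to the data index and link it into its chain."""
        number = len(self.index)
        s = self.slot(entry.object_id)
        entry.next = self.header[s]   # push onto the front of the chain
        self.header[s] = number
        self.index.append(entry)
        return number

    def find(self, object_id: bytes) -> Optional[IndexEntry]:
        """Walk the chain for the object's slot until the ID matches."""
        number = self.header[self.slot(object_id)]
        while number != 0:
            entry = self.index[number]
            if entry.object_id == object_id:
                return entry
            number = entry.next
        return None
```

Two object IDs that share their first `M_BITS` bits land in the same header slot and are distinguished by walking the `next` links.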

The implementation of this example is described below with reference to the drawings.

As shown in Fig. 2, the method comprises the following steps:

(1) Initialization:

Stored data is normally divided into two parts: the metadata area and the data area. The data the user actually backs up is stored in the data area, and the information describing that data is stored in the metadata area. Initializing the metadata area mainly means assigning initial values to its index header information, data index information and data area metadata information. The index header, that is, the object hash table, is set entirely to 0 to indicate that every slot is available, and every element of the data index array is likewise set to zero to indicate that no data has been written yet. One data space is then pre-allocated in the data area in preparation for user backup requests. Let S be the total number of data spaces, with S at most 1000, and let V be the number of data spaces currently in use, with V ≤ S. Let P be the number of logical data blocks pre-allocated in each data space, with P at most 10, and let W be the maximum number of logical data blocks per data space, with W at most 1024. In our current deployment of the backup service software, each data space consists of 1024 logical data blocks of 1G each, so a data space is at most 1T.
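The initialization step and the bounds on S, P and W can be sketched as follows. This is a hedged illustration only: the names `SpaceLimits` and `initialize` are hypothetical, the defaults simply take the upper bounds quoted above, and Python lists stand in for the on-disk metadata area.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class SpaceLimits:
    S: int = 1000           # maximum number of data spaces (at most 1000)
    P: int = 10             # logical blocks pre-allocated per space (at most 10)
    W: int = 1024           # maximum logical blocks per space (at most 1024)
    block_bytes: int = GIB  # logical data block size in the described deployment

    def validate(self) -> None:
        if not (0 < self.S <= 1000 and 0 < self.P <= 10 and 0 < self.W <= 1024):
            raise ValueError("S, P or W is outside the bounds given in the text")
        if self.P > self.W:
            raise ValueError("cannot pre-allocate more blocks than a space can hold")

    def max_space_bytes(self) -> int:
        """Largest size a single data space can reach."""
        return self.W * self.block_bytes


def initialize(limits: SpaceLimits, header_slots: int, index_size: int):
    """Step (1): zero the metadata and pre-allocate one data space."""
    limits.validate()
    header = [0] * header_slots                # every header slot is available
    index = [0] * index_size                   # no data written yet
    spaces = [{"allocated_blocks": limits.P}]  # V starts at 1
    return header, index, spaces


limits = SpaceLimits()
header, index, spaces = initialize(limits, header_slots=16, index_size=32)
```

With W = 1024 and 1G logical data blocks, `max_space_bytes()` evaluates to 2^40 bytes, matching the 1T maximum stated above.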

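(2) Receive a user command and determine its type: for a backup operation go to step (3), for a restore operation go to step (4), and for a delete operation go to step (5).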

(3) Perform backup processing as follows:

(3.1) First divide the file to be backed up into blocks. Let b be the data block size; b ranges from 1M to 4M and is 4M in the actual deployment of this backup service software. Then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the block. Data blocks are indexed by fingerprint in the index header and data index of the storage server's metadata;

(3.2) The backup client sends the fingerprint of the data block to the storage server, which uses the fingerprint to check whether the block already exists;

(3.3) If the fingerprint does not exist, the backup client transmits the data block to the storage server, which dynamically allocates storage space and writes the block. If the fingerprint exists, no data needs to be transmitted; the storage server only updates the index information for the data block, incrementing its reference count by one;

(3.4) Go to step (2);
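The fingerprint-first exchange in steps (3.1) to (3.3) can be sketched from the client's point of view as below. The sketch assumes fixed-length blocks and uses a plain Python set as a hypothetical stand-in for the storage server's index; the function names are illustrative and no network transfer is modelled.

```python
import hashlib
from typing import Iterator, Set, Tuple

BLOCK_SIZE = 4 * 1024 * 1024  # b = 4M, the deployed value given in the text


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Step (3.1): cut the data into fixed-length blocks (the last may be short)."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def backup(data: bytes, server_fingerprints: Set[bytes],
           block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Return (blocks_sent, blocks_skipped) for one backup run.

    server_fingerprints is a hypothetical stand-in for the server's index.
    """
    sent = skipped = 0
    for block in split_blocks(data, block_size):
        fingerprint = hashlib.md5(block).digest()   # 16-byte fingerprint
        if fingerprint in server_fingerprints:      # step (3.2): query first
            skipped += 1                            # known block: nothing is sent
        else:
            server_fingerprints.add(fingerprint)    # step (3.3): block is sent
            sent += 1
    return sent, skipped


server_index: Set[bytes] = set()
first_run = backup(b"A" * 8 + b"B" * 8 + b"A" * 8, server_index, block_size=8)
second_run = backup(b"A" * 8 + b"B" * 8 + b"A" * 8, server_index, block_size=8)
```

In this usage the first run sends two blocks and skips the repeated one; the second run of the same data sends nothing, which is where the bandwidth saving comes from.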

In step (3.3) above, storage space can be allocated dynamically by the process shown in Fig. 3, as follows:

(a1) Determine whether the P logical data blocks of the current data space have enough free space for a backup data block of the specified size. If so, go to step (a5); otherwise go to step (a2);

(a2) Determine whether P < W. If so, go to step (a6); otherwise go to step (a3);

(a3) Determine whether any other data space in the storage server's main record has enough free space for a backup data block of the specified size. If so, go to step (a5); otherwise go to step (a4);

(a4) Determine whether V < S. If so, add a data space on the storage server, allocate a data index for the backup data block to be written in the new data space, and go to step (a8); otherwise go to step (a7);

(a5) Allocate a data index for the backup data block to be written in the free space found, then go to step (a8);

(a6) Grow this data space on the storage server by one logical data block, allocate a data index for the backup data block, then go to step (a8);

(a7) No free space large enough for a backup data block of the specified size was found, so report that dynamic allocation has failed;

(a8) End the dynamic allocation process.
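Steps (a1) to (a8) can be sketched as a single function. The model is deliberately simplified and rests on several assumptions: each data space is a list holding the free bytes of its allocated logical data blocks, the check P < W is read as "the current space holds fewer than W blocks", a grown space gains one whole logical data block, and a request larger than one logical data block is rejected outright. All names are hypothetical.

```python
from typing import List, Optional, Tuple

Space = List[int]  # free bytes remaining in each allocated logical data block


def find_block(space: Space, size: int) -> Optional[int]:
    """Return the first logical block in the space with at least size bytes free."""
    for j, free in enumerate(space):
        if free >= size:
            return j
    return None


def allocate(spaces: List[Space], current: int, size: int, *,
             block_bytes: int, P: int, W: int, S: int) -> Optional[Tuple[int, int]]:
    """Return (data space number, logical block number), or None on failure."""
    if size > block_bytes:
        return None                        # assumption: must fit in one block
    # (a1) room in the blocks already allocated to the current space?
    j = find_block(spaces[current], size)
    if j is None and len(spaces[current]) < W:
        # (a2) -> (a6): grow the current space by one logical data block
        spaces[current].append(block_bytes)
        j = len(spaces[current]) - 1
    if j is not None:
        spaces[current][j] -= size         # (a5)/(a6): reserve the space
        return current, j
    # (a3) room in any other data space?
    for i, space in enumerate(spaces):
        if i == current:
            continue
        j = find_block(space, size)
        if j is not None:
            space[j] -= size               # (a5)
            return i, j
    # (a4) add a new data space if V < S
    if len(spaces) < S:
        spaces.append([block_bytes] * P)
        spaces[-1][0] -= size
        return len(spaces) - 1, 0
    return None                            # (a7) allocation failed


spaces: List[Space] = [[10]]
params = dict(block_bytes=10, P=1, W=2, S=2)
results = [allocate(spaces, 0, 6, **params) for _ in range(4)]
```

In this usage, with tiny illustrative limits, the four requests are served by the existing block, a grown block, a newly added data space, and finally fail once both W and S are exhausted.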

The write operation can be completed by the process shown in Fig. 4, as follows:

(b1) Search dynamically for free storage space at the storage server, checking whether a logical data block satisfies the requirement;

(b2) If no logical data block is available, return failure;

(b3) If there is free storage space large enough for the new backup data block, create a new data index, write the new backup data block to the corresponding location on the storage server, and then write the corresponding index header and data index metadata to the main record area.
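A minimal sketch of the write path in steps (b1) to (b3) follows. It assumes an append-only data space held in an `io.BytesIO` buffer in place of a data file, and a record layout of object ID, content length, then content, which follows the description of the data area but whose exact byte format is an assumption. The dictionary `index` is a hypothetical stand-in for the data index and main record area.

```python
import hashlib
import io
import struct
from typing import Dict, Optional, Tuple

HEADER = struct.Struct(">16sQ")  # object ID, content length (layout is assumed)


def write_block(data_space: io.BytesIO, capacity: int, block: bytes,
                index: Dict[bytes, Tuple[int, int]]) -> Optional[int]:
    """Append one new backup data block; return its offset or None on failure."""
    record_size = HEADER.size + len(block)
    offset = data_space.seek(0, io.SEEK_END)              # (b1) find free space
    if offset + record_size > capacity:
        return None                                       # (b2) nothing available
    object_id = hashlib.md5(block).digest()
    data_space.write(HEADER.pack(object_id, len(block)))  # (b3) data header
    data_space.write(block)                               # (b3) data entity
    index[object_id] = (offset, len(block))               # (b3) record metadata
    return offset


def read_block(data_space: io.BytesIO, offset: int) -> bytes:
    """Read back the record at offset and verify it against its object ID."""
    data_space.seek(offset)
    object_id, length = HEADER.unpack(data_space.read(HEADER.size))
    block = data_space.read(length)
    if hashlib.md5(block).digest() != object_id:
        raise ValueError("stored block does not match its object ID")
    return block


space = io.BytesIO()
index: Dict[bytes, Tuple[int, int]] = {}
first = write_block(space, 64, b"hello", index)
second = write_block(space, 64, b"world!", index)
third = write_block(space, 64, b"x" * 40, index)   # does not fit in 64 bytes
```

Storing the object ID next to the content lets a reader verify a block independently of the index, which is one plausible reason for the record layout described for the data area.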

(4) To restore, the backup server in the backup system software looks up the hash list of the file to be restored; the storage space metadata is accessed according to the hash list to locate the logical position in the corresponding data space; the data of the file is then read in order from the storage server into a memory buffer, passed to the backup client through a socket, and assembled into the required file set; then go to step (2).

As shown in Fig. 1, the process of accessing the storage space metadata according to the hash list to locate the logical position in the corresponding data space is as follows:

(4.1) Let m be the predefined number of bits of the index header. The first m bits of the data's hash value index into the index header, and the content of the index header entry is a data index number.

Typically each index header entry occupies two bytes and there are 2^m entries in total; m generally ranges from 20 to 30.

(4.2) Address the data index through the index header. The data index in turn addresses the data item of a single operation, in three parts:

(4.2.1) the structure member I of the data index (the data space number) identifies the data space;

(4.2.2) the structure member J of the data index (the logical data block number within the data space) identifies the block within the data space;

(4.2.3) the structure member K of the data index (the offset within the logical data block) gives the offset of the data item within the logical data block. This amounts to three-level addressing, and locates the data header and data entity of the data item.

(4.4) With the three-level logical data block address obtained above, the corresponding position in the data area can be located and the data read.
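The three-level addressing can be made concrete with a little arithmetic. The sketch below assumes that each data space is one data file whose logical data blocks are laid out back to back, so the byte position inside the file is J times the block size plus K. The file-naming scheme and function names are hypothetical.

```python
from typing import Tuple

BLOCK_BYTES = 1 << 30    # 1G logical data blocks, as in the described deployment
BLOCKS_PER_SPACE = 1024  # W in the described deployment


def header_slot(fingerprint: bytes, m: int) -> int:
    """Step (4.1): the first m bits of the hash select an index header entry."""
    return int.from_bytes(fingerprint, "big") >> (len(fingerprint) * 8 - m)


def locate(i: int, j: int, k: int) -> Tuple[str, int]:
    """Steps (4.2.1)-(4.2.3): turn (I, J, K) into a file name and byte offset.

    The file-naming scheme is hypothetical.
    """
    if not (0 <= j < BLOCKS_PER_SPACE and 0 <= k < BLOCK_BYTES):
        raise ValueError("J or K is out of range")
    return f"dataspace_{i:04d}.dat", j * BLOCK_BYTES + k


def to_ijk(i: int, file_offset: int) -> Tuple[int, int, int]:
    """Inverse mapping, useful when recording a freshly written object."""
    j, k = divmod(file_offset, BLOCK_BYTES)
    return i, j, k


name, position = locate(3, 2, 5)
```

Here `locate(3, 2, 5)` points 5 bytes into the third logical data block of data space 3, and `to_ijk` recovers the same triple from the file offset.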

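(5) Delete backed-up files as follows:

(5.1) The backup server in the backup system software looks up the hash list of the backup file to be deleted;

(5.2) Search the index header and data index mapping table in the metadata area of the storage server by hash value. If the hash value passed in does not exist, return immediately with the return value false;

(5.3) Otherwise, decrement by 1 the reference count in the object metadata corresponding to the hash value and return true;

(5.4) Go to step (2).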

Because this is an online backup service, the backup server and the storage server run continuously in the background as daemons. They therefore never terminate and always wait for user operation requests. The backup client is the interface through which the user uses the online backup service; the user can log in to the backup client at any time to perform operations such as backup, restore and delete.

Example:

The runtime environment of the backup service system is as follows:

1. Hardware and supporting environment

The backup client host requires 512M or more of memory and network throughput of 10 Mbps or more.

The dispatch server host requires 2 GB or more of memory and network throughput of 1000 Mbps or more.

The storage server host requires 4 GB or more of memory, TB-scale external storage, and network throughput of 1000 Mbps or more.

The hosts running the dispatch server and storage server software must be connected with gigabit-class network switching, and the hosts running the client and server software must have network connectivity between them. The environment hosting the servers must provide the basic conditions for normal operation, such as redundant power, redundant communication links, temperature control and fire protection.

2. Software runtime environment

The backup client program runs on Windows XP and later operating systems, or on operating system platforms based on the Linux 2.6 kernel.

The dispatch server and storage server run on the Windows Server 2003 operating system platform.

In the online backup service system currently implemented and running, each data space on the storage server is 1T in size and there are at most 20 data spaces. Each data space is divided into 1024 logical data blocks of 1G each.

The above is a preferred embodiment of the invention, but the invention is not limited to what is disclosed in this embodiment and the drawings. Any equivalent or modification made without departing from the spirit disclosed by the invention falls within its scope of protection.

Claims

1. A data organization method for backup service, characterized in that the method comprises the steps of:

(1) Initialization:

Partial initialization of metadata information, including assigning initial values to index header information, data index information, and data area metadata information in the metadata area;

Pre-allocate a data space in the data area and prepare to accept the user's backup request;

(2) Receive user commands and determine the type of user commands:

Determine the user command type, if it is a backup operation, go to step (3), if it is a restore operation, go to step (4), if it is a delete operation, go to step (5);

(3) Perform backup processing according to the following process:

(3.1) First divide the file to be backed up by the user into blocks, then hash the content of each data block with the MD5 algorithm to obtain a fingerprint that uniquely identifies the data block. Data blocks are indexed by fingerprint in the index header and data index of the storage server's metadata;

(3.2) The backup client transmits the fingerprint of the data block to the storage server, and the storage server queries whether the block exists according to the fingerprint;

(3.3) If the fingerprint does not exist, the data block is a new backup data block: the backup client sends it to the storage server, which dynamically allocates storage space and completes the write operation for the new backup data block; if the fingerprint exists, only the index information corresponding to the data block on the storage server needs to be updated, increasing its reference count by one;

(3.4) Go to step (2);

(4) When restoring, the backup server looks up the hash list contained in the file to be restored, accesses the storage space metadata according to the hash list to locate the logical position in the corresponding data space, reads the data of the file to be restored from the storage server in turn into a memory buffer, passes it to the backup client through a socket, and assembles the required file set; then go to step (2);

(5) Delete the backed up files according to the following procedure:

(5.1) The Hash list contained in the file to be deleted is found by the backup server in the backup system software;

(5.2) Search the index header and data index mapping table of the metadata area on the storage server according to the Hash value. If the Hash value of the entry parameter does not exist, return immediately, and the return value is false;

(5.3) Otherwise, the reference count of the object metadata corresponding to the Hash value is decremented by 1, and the return value is true;

(5.4) Go to step (2) again.

2. The data organization method for backup service according to claim 1, characterized in that, in step (3.3), P denotes the number of logical data blocks pre-allocated in each data space, W denotes the maximum number of logical data blocks each data space can hold, V denotes the number of data spaces in use, and S denotes the total number of data spaces;

The specific steps of dynamically allocating storage space are as follows:

(a1) Judging whether there is remaining available space in the P logical data blocks of the current data space that can satisfy the specified size of the backup data block, if yes, go to step (a5), otherwise, go to step (a2);

(a2) Determine whether P<W is true, if true, go to step (a6), otherwise, go to step (a3);

(a3) Judging whether other data spaces in the main record of the storage server have remaining free space that can meet the specified size of the backup data block, if yes, go to step (a5), otherwise, go to step (a4);

(a4) Judging whether V<S is true, if it is true, add a data space on the storage server, allocate a data index for the backup data block to be written in the new data space, and then go to step (a8), otherwise, go to step (a7);

(a5) Allocate a data index for the backup data block to be written in the remaining available space, and then go to step (a8);

(a6) Increase the space of a logical data block size for the data space on the storage server, then allocate a data index for the backup data block, and then turn to step (a8);

(a7) declare a dynamic allocation failure;

(a8) End the dynamic allocation process.

3. The data organization method for backup service according to claim 1, wherein the writing operation comprises the following steps:

(b1) Dynamically search for available storage space on the storage server side, and find out whether there are logical data blocks that meet the conditions;

(b2) Return failure if there is no available logical data block;

(b3) If there is available storage space that meets the size of the new backup data block, create a new data index, write the new backup data block to the corresponding storage server location, and then write the corresponding index header and data index metadata into the main record area.

4. The data organization method for backup service according to claim 1, characterized in that the process of accessing the storage space metadata according to the hash list to locate the logical position in the corresponding data space is as follows:

(4.1) Let m be the preset number of bits of the index header; the index header is indexed by the first m bits of the data hash value, and the content of the index header constitutes the data index number;

(4.2) Address the data index through the index header, and then address the data item of an operation through the data index, which specifically includes three parts:

(4.2.1) Find the specific data space number through the data space number in the structure member of the data index;

(4.2.2) Find the block number in the data space through the logical data block number in the data space in the structure member of the data index;

(4.2.3) Find the offset address of the data item in the logical data block through the offset address in the logical data block in the data space of the structure member of the data index, and locate the data header and data entity of the data item;

(4.4) Use the obtained address information to locate the corresponding data area to read data.