Background Art
With the rapid development of the information society, information systems have permeated everything from daily life to enterprise business operations, and dependence on them keeps growing. Particularly in industries such as finance, telecommunications, transportation, and insurance, the loss or corruption of critical data can cause immeasurable losses for individuals and enterprises.
The backup service described here is essentially a backup and recovery software system that provides a degree of disaster tolerance. It offers individual and enterprise users complete data backup, recovery, and related management functions, and lets them customize backup policies according to their actual needs. The backup service is also a software delivery model: the provider builds all the network infrastructure and the software and hardware platforms that an enterprise needs for its information systems, and is responsible for a series of services from initial deployment to later maintenance. The enterprise does not need to buy software or hardware, build a machine room, or recruit IT staff; it can use the information system over the Internet. Just as water flows when a tap is turned on, the enterprise leases software services according to its actual needs.
Data backup and recovery are important measures for ensuring information security. As the importance of data becomes ever more evident, the data on a storage system must be protected effectively and comprehensively. With the emergence and development of high-speed networks, communication technology, and innovative mass-storage technology, underlying storage resources have changed dramatically compared with the past. The growing use of information systems has also caused the volume of data worth preserving to rise geometrically. All of this places higher demands on the research and development of data backup and recovery software and related technologies.
When using a backup service, users typically have the following requirements for storage space and data: they can increase or decrease their storage usage on demand; they can use all available space without obstruction, that is, as long as space remains and the network is reachable, a data backup task executes correctly; and they can recover backed-up data at any time. To meet these requirements, user space and user data must have a degree of logical independence, so user space management and backup data organization methods need to be studied. In addition, efficient space allocation and reclamation mechanisms need to be designed that keep user data logically independent while fully exploiting the storage space shared by duplicate data, and that support efficient data lookup and access.
In a typical backup software architecture, a storage server is a physical medium authenticated by the management console of the data backup software. It can be a disk area on a server, a storage device attached to a server, or a disk mapped over the network. Multiple storage servers can be configured through the management console; under the unified management of the backup server, backup clients back up data to the corresponding storage server.
Earlier designs of the storage server side of backup software mostly adopted file-level backup. In file-level backup, the backup software is aware only of the file layer and copies all files on the source disk to another destination medium. File-level backup software therefore either relies on the file system interface provided by the operating system to back up files, or implements file system functionality itself and can interpret file system metadata. In short, file-level backup software reads data file by file and then stores the files it has read on another medium. This clearly becomes a performance bottleneck for PB-scale storage systems: because the unit of data managed on the storage server side is the file, large amounts of duplicate data are inevitably backed up, and management of the storage server side becomes very inconvenient. Backing up data with block-level deduplication technology can solve these problems to a large extent.
On the other hand, in earlier backup software the storage capacity that the storage server side initially allocates to a user is often fixed, which greatly reduces the scalability of the system. In practice the system cannot predict how much storage space each user will eventually use (although a maximum available storage limit may apply according to the user's privileges and type). Allocating too much is likely to waste storage space and lower utilization; allocating too little may severely restrict the user.
Avamar, recently acquired by EMC, holds patented data deduplication and global single-instance storage technology that ensures a backup data segment is stored only once globally. This can effectively reduce the amount of data moved and recovered by a factor of 300, while also enabling daily full backups and fast recovery. For each 24 KB data segment, Avamar uses the SHA-1 hash algorithm to generate a unique 20-byte ID. This ID is the fingerprint of the data segment, so Avamar's software can use it to determine whether a data segment has been stored before. However, SHA-1 is computationally complex and consumes considerable CPU time. Because the data segments are small, the space consumed by fingerprints also becomes large when the volume of user backup data is large, and there is a certain scalability problem as well.
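The fingerprint overhead mentioned above can be estimated with a short calculation. The sketch below is illustrative only: the function name and the 1 TiB data volume are assumptions chosen for the example, and only the 24 KB segment size and the 20-byte ID come from the text.

```python
KIB = 1024
TIB = 1024 ** 4

def fingerprint_bytes(data_bytes: int, segment_bytes: int, id_bytes: int) -> int:
    """Bytes needed to keep one fingerprint per segment of unique data."""
    # Integer ceiling division: a trailing partial segment still needs an ID.
    segments = (data_bytes + segment_bytes - 1) // segment_bytes
    return segments * id_bytes

# 24 KB segments with 20-byte SHA-1 IDs, as described for Avamar.
# The 1 TiB volume is a hypothetical figure for illustration.
overhead = fingerprint_bytes(1 * TIB, 24 * KIB, 20)
print(f"{overhead / 1024 ** 2:.0f} MiB of fingerprints per TiB of unique data")
```

Under these assumptions the fingerprints alone occupy about 853 MiB per TiB of unique data, which illustrates why a small segment size inflates the index.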
Summary of the Invention
The object of the present invention is to provide a data organization method for backup services that implements block-level deduplication and improves the efficiency of data organization and management.
The invention provides a data organization method for backup services, comprising the following steps:
(1) Initialization:
Initialize the metadata, which includes assigning initial values to the index head information, data directory information, and data area metadata information in the metadata area;
Pre-allocate one data space in the data area in preparation for accepting user backup requests;
(2) Receive a user command and determine its type:
If the command is a backup operation, go to step (3); if it is a recovery operation, go to step (4); if it is a delete operation, go to step (5);
(3) Perform backup processing as follows:
(3.1) First split the file to be backed up into chunks, then hash the content of each data chunk with the MD5 algorithm to obtain a fingerprint that uniquely identifies the chunk; data chunks are indexed by fingerprint in the index head and data directory of the storage server-side metadata;
(3.2) The backup client sends the fingerprint of the data chunk to the storage server, and the storage server queries by fingerprint whether the chunk already exists;
(3.3) If the fingerprint does not exist, the chunk is a new backup data block: the backup client transmits it to the storage server, which dynamically allocates storage space and completes the write of the new backup data block. If the fingerprint exists, only the index information for this chunk on the storage server side needs to be updated, by incrementing its reference count by one;
(3.4) Go to step (2);
(4) For recovery, the backup server looks up the hash list of the files to be recovered; the storage space metadata is accessed according to the hash list to locate the logical positions in the corresponding data spaces; the data of the files to be recovered is then read in order from the storage server into a memory buffer, passed to the backup client over a socket, and assembled into the required file set; then go to step (2);
(5) Delete backup files as follows:
(5.1) The backup server in the backup system software looks up the hash list of the files to be deleted;
(5.2) Search the index head and data directory mapping table in the metadata area of the storage server side by hash value; if the input hash value does not exist, return immediately with the return value false;
(5.3) Otherwise, decrement by 1 the reference count of the object metadata corresponding to the hash value and return true;
(5.4) Go to step (2).
Enterprise data today is not only diverse and fast-growing but also highly redundant: many identical files or pieces of data are stored within a system and across systems, and edited files likewise share much redundant content with their earlier versions. Traditional backup software backs up this redundant data again and again, amplifying the redundancy. A better solution at present is data deduplication. Deduplication not only achieves high compression ratios and frees storage space, it also lowers the cost of disk-based backup and of data management. The present invention is a data organization and management method for the storage server side based on block-level deduplication. It efficiently transfers backup and recovery data to and from clients, and performs local storage space management and data organization according to the backup server's policy. The invention achieves global deduplication without affecting users' primary backup and recovery operations, and the benefit of deduplication becomes more pronounced as the number of users and the volume of backup data grow. It can significantly reduce the amount of data that users need to back up, saving backup time, network bandwidth, and the storage space that backups require.
Detailed Description of the Embodiments
The present invention is described in further detail below by way of an embodiment. The embodiment is illustrative only and does not limit the scope of protection of the invention.
The backup service system is based on a three-party architecture consisting of a backup server, storage servers, and backup clients. The backup client accepts customized data backup policies, recovery requests, and other data management requests. The backup server connects backup clients and storage servers and is the control center of the whole data backup software. It is responsible for user access control, global job scheduling, and global storage management. When a backup client initiates a backup or recovery operation, the backup server directs it to connect to the designated storage server and begin execution. The backup server also monitors the compute, transfer, and storage load of each storage server and applies a load-balancing policy. User information, storage server state, and other basic metadata supporting the operation of the backup server are to be stored in a database. The storage server transfers backup and recovery data to and from clients, and performs local storage space management and data organization according to the backup server's policy.
The four data structures used in this example are described below: the master record area, the index head, the data directory, and the data area. Their structure is shown in Figure 1.
Master record area: describes the storage space as a whole. It holds the index head information, the data directory information, and the data area metadata information.
Index head: an object hash table that maps an object ID (a 128-bit hash value generated from the data content) to the data directory. An object here is the basic unit of data storage in the storage system. Unlike the files and blocks that serve as basic units in traditional storage systems, an object combines application data with storage attributes (metadata), and carries the data together with enough additional information to allow the data to be autonomous and self-managing. The hash value of the content serves as the object ID in object-oriented storage: file content is the basis for storage, and the hash value index establishes the mapping between content and object. Because hash values are globally unique in a statistical sense, the system has a globally unique namespace, which improves the manageability of shared storage. The system uses the mature MD5 algorithm, which transforms data content of arbitrary length into a 128-bit (16-byte) integer, namely the object ID.
Data directory: an array of size N (N is the number of entries in the data directory, ranging from 2^20 to 2^30). Each element of the array is the metadata structure of one object, containing: the object ID; the starting offset address of the object in the data area (I is the data space number, J is the logical data block number within that data space, and K is the offset within that logical data block); the size of the object's data; the copy count of the object's data content; and the position in the object table of the next object that maps to the same slot of the object hash table. This last field links the objects that map to the same slot of the object hash table into a chained list.
Data area: stores the data of objects. An object's data comprises the object ID, the data content length, and the data content. To simplify storage space management, the data area is divided into several contiguous data spaces (each represented by a separate data file), and each data space consists of a number of logical data blocks.
Data chunk: in the backup service system, the data to be processed in a backup or recovery operation is split into pieces of fixed length; each piece is a data chunk.
Backup data block: when a user backs up specified files and folders with the backup client, the client first splits the data to be backed up into fixed-length pieces (the piece size in the actual backup service software is 4 MB); each piece is a backup data block.
Logical data block: on the storage server side, to manage and use storage space efficiently, each data space is divided into a number of sub-storage units; each sub-storage unit is a logical data block (each logical data block is 1 GB in the actual backup service software).
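The relationship between the index head and the data directory described above can be sketched as a chained hash table. This is a minimal illustration, not the patented implementation: the class and field names are hypothetical, the tiny slot count is chosen so collisions are easy to see, and reserving directory entry 0 so that the value 0 can mean "empty" is an assumption made for the sketch.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ObjectMeta:
    object_id: bytes     # 16-byte MD5 digest of the data content
    space_no: int        # I: data space number
    block_no: int        # J: logical data block number in that space
    offset: int          # K: offset inside the logical data block
    size: int            # size of the object's data
    copies: int = 1      # copy count of the data content
    next_entry: int = 0  # next directory entry in the same slot; 0 ends the chain

class Metadata:
    def __init__(self, m_bits: int = 4):  # the text gives m = 20 to 30 in practice
        self.m_bits = m_bits
        self.index_head = [0] * (1 << m_bits)                # 0 marks an empty slot
        self.directory: List[Optional[ObjectMeta]] = [None]  # entry 0 is reserved

    def slot(self, object_id: bytes) -> int:
        # Use the first m bits of the hash value as the index head position.
        total_bits = len(object_id) * 8
        return int.from_bytes(object_id, "big") >> (total_bits - self.m_bits)

    def find(self, object_id: bytes) -> Optional[ObjectMeta]:
        pos = self.index_head[self.slot(object_id)]
        while pos != 0:                       # walk the chain for this slot
            entry = self.directory[pos]
            if entry.object_id == object_id:
                return entry
            pos = entry.next_entry
        return None

    def insert(self, meta: ObjectMeta) -> int:
        s = self.slot(meta.object_id)
        meta.next_entry = self.index_head[s]  # new entry becomes the chain head
        self.directory.append(meta)
        self.index_head[s] = len(self.directory) - 1
        return self.index_head[s]
```

A short usage example: `md = Metadata(); oid = hashlib.md5(b"data").digest(); md.insert(ObjectMeta(oid, 0, 0, 0, 4))`, after which `md.find(oid)` returns the stored metadata and `md.find` on an unknown ID returns `None`.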
The implementation of this example is further described below with reference to the accompanying drawings.
As shown in Figure 2, the method of the invention comprises the following steps:
(1) Initialization:
Stored data is usually divided into two parts: the metadata area and the data area. The data that users actually back up is stored in the data area, and the information describing that user data is stored in the metadata area. Initializing the metadata area mainly means assigning initial values to its index head information, data directory information, and data area metadata information. The index head, that is, the object hash table, is set entirely to 0 to indicate that every slot is available, and every element of the data directory array is likewise set to zero to indicate that no data has been written yet. One data space is pre-allocated in the data area in preparation for accepting user backup requests. The total number of data spaces is defined as S, with S at most 1000. The number of data spaces currently in use in the data area is V, with V <= S. The number of logical data blocks pre-allocated for each data space is defined as P, with P at most 10. The maximum number of logical data blocks per data space is defined as W, with W at most 1024. In our current deployment of the backup service software, each data space consists of 1024 logical data blocks of 1 GB each, so each data space is at most 1 TB.
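The initialization step and the stated limits can be summarized in a small sketch. The names `Layout` and `initialize` are hypothetical, the dictionary used to hold the state is an assumption made for brevity, and the sizes are interpreted as binary units since 1024 blocks of 1 GB are said to make up 1 TB.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

@dataclass
class Layout:
    max_spaces: int = 1000     # upper bound on S
    prealloc_blocks: int = 10  # upper bound on P
    max_blocks: int = 1024     # upper bound on W
    block_size: int = GIB      # logical data block size (1 GB in the deployment)

    def validate(self) -> None:
        if not 0 < self.prealloc_blocks <= self.max_blocks:
            raise ValueError("P must satisfy 0 < P <= W")
        if self.max_spaces < 1:
            raise ValueError("S must be at least 1")

    def max_space_bytes(self) -> int:
        # Largest size one data space can grow to: W blocks of block_size bytes.
        return self.max_blocks * self.block_size

def initialize(m_bits: int, n_entries: int, layout: Layout) -> dict:
    """Zero the metadata and pre-allocate one data space, as in step (1)."""
    layout.validate()
    return {
        "index_head": [0] * (1 << m_bits),           # all slots available
        "data_directory": [0] * n_entries,           # nothing written yet
        "spaces_in_use": 1,                          # V after pre-allocation
        "blocks_per_space": [layout.prealloc_blocks],
    }
```

With the default values, `Layout().max_space_bytes()` equals 1024 GB, which matches the 1 TB per data space stated for the deployment.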
(2) Receive a user command and determine its type:
If the command is a backup operation, go to step (3); if it is a recovery operation, go to step (4); if it is a delete operation, go to step (5);
(3) Perform backup processing as follows:
(3.1) First split the file to be backed up into chunks (the chunk size is defined as b, with b between 1 MB and 4 MB; b is 4 MB in the actual deployment of this backup service software), then hash the content of each data chunk with the MD5 algorithm to obtain a fingerprint that uniquely identifies the chunk; data chunks are indexed by fingerprint in the index head and data directory of the storage server-side metadata;
(3.2) The backup client sends the fingerprint of the data chunk to the storage server, and the storage server queries by fingerprint whether the chunk already exists;
(3.3) If the fingerprint does not exist, the backup client transmits the data chunk to the storage server, which dynamically allocates storage space and completes the write of the chunk. If the fingerprint exists, no data needs to be transmitted; only the index information for this chunk on the storage server side is updated, by incrementing its reference count by one;
(3.4) Go to step (2);
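Steps (3.1) to (3.3) can be sketched as follows. This is an in-memory illustration under stated assumptions: `StorageServer`, `split_chunks`, and `backup` are hypothetical names, the client and server are ordinary objects rather than networked processes, and stored blocks live in a dictionary instead of data spaces.

```python
import hashlib
from typing import Dict, Iterator, List

CHUNK_SIZE = 4 * 1024 * 1024  # b = 4 MB, as in the described deployment

def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Fixed-length chunking; the final chunk may be shorter."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

class StorageServer:
    def __init__(self) -> None:
        self.blocks: Dict[bytes, bytes] = {}   # fingerprint -> stored chunk
        self.refcount: Dict[bytes, int] = {}   # fingerprint -> reference count
        self.bytes_received = 0                # chunk payload actually transferred

    def has(self, fingerprint: bytes) -> bool:
        return fingerprint in self.refcount

    def add_reference(self, fingerprint: bytes) -> None:
        self.refcount[fingerprint] += 1

    def store(self, fingerprint: bytes, chunk: bytes) -> None:
        self.blocks[fingerprint] = chunk
        self.refcount[fingerprint] = 1
        self.bytes_received += len(chunk)

def backup(data: bytes, server: StorageServer, size: int = CHUNK_SIZE) -> List[bytes]:
    """Back up `data` and return its hash list (one fingerprint per chunk)."""
    hash_list = []
    for chunk in split_chunks(data, size):
        fingerprint = hashlib.md5(chunk).digest()  # (3.1) fingerprint the chunk
        if server.has(fingerprint):                # (3.2) query by fingerprint
            server.add_reference(fingerprint)      # (3.3) duplicate: count only
        else:
            server.store(fingerprint, chunk)       # (3.3) new block: send and write
        hash_list.append(fingerprint)
    return hash_list
```

Backing up the same data twice transfers chunk payloads only the first time; the second run only raises reference counts, which is the saving in bandwidth and storage that the method aims for.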
In step (3.3) above, storage space can be dynamically allocated according to the process shown in Figure 3. The specific steps are as follows:
(a1) Determine whether the P logical data blocks of the current data space have enough remaining free space for a backup data block of the specified size. If so, go to step (a5); otherwise, go to step (a2);
(a2) Determine whether P < W holds. If so, go to step (a6); otherwise, go to step (a3);
(a3) Determine whether any other data space in the storage server's master record has enough remaining free space for a backup data block of the specified size. If so, go to step (a5); otherwise, go to step (a4);
(a4) Determine whether V < S holds. If so, add a data space on the storage server, allocate a data directory entry in the new data space for the backup data block to be written, and go to step (a8); otherwise, go to step (a7);
(a5) Allocate a data directory entry in the remaining free space for the backup data block to be written, then go to step (a8);
(a6) Grow this data space on the storage server by the size of one backup data block, allocate a data directory entry for this backup data block, then go to step (a8);
(a7) Because no remaining free space sufficient for a backup data block of the specified size can be found, report that dynamic allocation has failed;
(a8) End the dynamic allocation procedure.
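The decision sequence (a1) to (a8) can be sketched as a single function. Several simplifications are assumptions of this sketch rather than part of the method: free space in a data space is treated as one contiguous region, a request never exceeds one logical data block, and step (a6) grows the space by one whole logical data block instead of by one backup data block. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class DataSpace:
    blocks: int    # P: logical data blocks currently allocated to this space
    used: int = 0  # bytes already written

@dataclass
class Allocator:
    block_size: int  # logical data block size in bytes
    max_blocks: int  # W
    max_spaces: int  # S
    prealloc: int    # initial P for every new data space
    spaces: List[DataSpace] = field(default_factory=list)
    current: int = 0

    def __post_init__(self) -> None:
        if not self.spaces:
            self.spaces.append(DataSpace(self.prealloc))  # pre-allocated space

    def _free(self, space: DataSpace) -> int:
        return space.blocks * self.block_size - space.used

    def _take(self, idx: int, size: int) -> Tuple[int, int]:
        space = self.spaces[idx]
        offset = space.used
        space.used += size
        self.current = idx
        return idx, offset  # (data space number, byte offset in that space)

    def allocate(self, size: int) -> Optional[Tuple[int, int]]:
        if size > self.block_size:
            raise ValueError("sketch assumes size <= logical data block size")
        cur = self.spaces[self.current]
        if self._free(cur) >= size:                    # (a1) -> (a5)
            return self._take(self.current, size)
        if cur.blocks < self.max_blocks:               # (a2) -> (a6)
            cur.blocks += 1
            return self._take(self.current, size)
        for idx, space in enumerate(self.spaces):      # (a3) -> (a5)
            if idx != self.current and self._free(space) >= size:
                return self._take(idx, size)
        if len(self.spaces) < self.max_spaces:         # (a4): V < S
            self.spaces.append(DataSpace(self.prealloc))
            return self._take(len(self.spaces) - 1, size)
        return None                                    # (a7) allocation failed
```

Returning `None` for failure keeps the caller's handling explicit; a real storage server would also have to persist the updated metadata to the master record area.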
The write operation can be completed according to the process shown in Figure 4, with the following steps:
(b1) Dynamically search for free storage space on the storage server side, checking whether a logical data block satisfying the requirement exists;
(b2) If no logical data block is available, return failure;
(b3) If there is free storage space sufficient for the new backup data block, create a new data directory entry, write the new backup data block to the corresponding location on the storage server, and then write the corresponding index head and data directory metadata to the master record area.
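The write in step (b3) stores an object made up of its object ID, its content length, and its content. A minimal sketch of such a record follows. The byte layout (a 16-byte ID followed by an 8-byte big-endian length) and the function names are assumptions for illustration; the text does not specify field widths, and the integrity check on read is an addition of this sketch.

```python
import hashlib
import io
import struct
from typing import BinaryIO

# Assumed record header: 16-byte MD5 object ID + 8-byte unsigned length.
HEADER = struct.Struct(">16sQ")

def write_object(f: BinaryIO, offset: int, content: bytes) -> bytes:
    """Write one object record at `offset` and return its object ID."""
    object_id = hashlib.md5(content).digest()
    f.seek(offset)
    f.write(HEADER.pack(object_id, len(content)))
    f.write(content)
    return object_id

def read_object(f: BinaryIO, offset: int) -> bytes:
    """Read the record at `offset` and verify the content against its ID."""
    f.seek(offset)
    object_id, length = HEADER.unpack(f.read(HEADER.size))
    content = f.read(length)
    if hashlib.md5(content).digest() != object_id:
        raise ValueError("stored content does not match its object ID")
    return content

# Usage with an in-memory file standing in for a data space file.
space = io.BytesIO()
first_id = write_object(space, 0, b"first block")
next_offset = HEADER.size + len(b"first block")
write_object(space, next_offset, b"second block")
```

Because each record carries its own ID and length, a data space file can be scanned record by record even if the metadata area has to be rebuilt, although the text does not state that the system does this.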
(4) For recovery, the backup server in the backup system software looks up the hash list of the files to be recovered; the storage space metadata is accessed according to the hash list to locate the logical positions in the corresponding data spaces; the data of the files to be recovered is then read in order from the storage server into a memory buffer, passed to the backup client over a socket, and assembled into the required file set; then go to step (2).
As shown in Figure 2, the process of accessing the storage space metadata according to the hash list to locate the logical position in the corresponding data space is as follows:
(4.1) Let m be the predefined number of bits of the index head. The first m bits of the data's hash value are used to index into the index head, and the content of the index head entry gives the data directory number.
Usually each index head entry occupies two bytes and there are 2^m entries in total; m is generally between 20 and 30.
(4.2) The index head addresses the data directory. The data directory entry in turn addresses the data item of a single operation, in three parts:
(4.2.1) The structure member I of the data directory entry (I is the data space number) gives the specific data space number;
(4.2.2) The structure member J of the data directory entry (J is the logical data block number within the data space) gives the block number within the data space;
(4.2.3) The structure member K of the data directory entry (K is the offset within the logical data block) gives the offset of the data item within the logical data block. This amounts to three-level addressing, which locates the data header and data body of the data item.
(4.3) With the three-level logical data block address information obtained above, the corresponding position in the data area can be located and the data read.
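The three-level addressing can be illustrated with two small helper functions. This is a sketch under assumptions: the function names are hypothetical, and converting (J, K) into a byte offset assumes that logical data blocks are laid out back to back inside the data space file, which the text implies but does not state outright.

```python
LOGICAL_BLOCK_SIZE = 1 << 30  # 1 GB per logical data block, as in the deployment

def index_head_slot(digest: bytes, m: int) -> int:
    """Take the first m bits of the hash value as the index head position."""
    total_bits = len(digest) * 8
    return int.from_bytes(digest, "big") >> (total_bits - m)

def locate(i: int, j: int, k: int, block_size: int = LOGICAL_BLOCK_SIZE):
    """Map (I, J, K) to (data space number, byte offset in that space's file)."""
    if not 0 <= k < block_size:
        raise ValueError("K must lie inside one logical data block")
    return i, j * block_size + k

# Example: a digest whose leading four bits are 1011 falls into slot 11
# when m = 4 (a toy value; the text gives m = 20 to 30).
example_digest = bytes([0b10110000]) + bytes(15)
slot = index_head_slot(example_digest, 4)

# Data space 2, logical block 3, offset 4096 within that block.
space_no, byte_offset = locate(2, 3, 4096)
```

Once the data space number and byte offset are known, reading the object is an ordinary seek and read on the file that represents that data space.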
(5) Delete backup files as follows:
(5.1) The backup server in the backup system software looks up the hash list of the backup files to be deleted;
(5.2) Search the index head and data directory mapping table in the metadata area of the storage server side by hash value; if the input hash value does not exist, return immediately with the return value false;
(5.3) Otherwise, decrement by 1 the reference count of the object metadata corresponding to the hash value and return true;
(5.4) Go to step (2).
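The deletion steps reduce to a reference-count decrement per hash value. The sketch below uses a plain dictionary in place of the index head and data directory, and the function names are hypothetical. The `reclaimable` helper is an assumption added for illustration: the text does not describe what happens when a count reaches zero.

```python
from typing import Dict, List

def release(refcounts: Dict[bytes, int], hash_value: bytes) -> bool:
    """Steps (5.2) and (5.3) for one hash value."""
    if hash_value not in refcounts:
        return False             # (5.2) unknown hash value: return false
    refcounts[hash_value] -= 1   # (5.3) one fewer reference to this object
    return True

def delete_backup(refcounts: Dict[bytes, int], hash_list: List[bytes]) -> List[bool]:
    """Apply the decrement to every block in a file's hash list."""
    return [release(refcounts, h) for h in hash_list]

def reclaimable(refcounts: Dict[bytes, int]) -> List[bytes]:
    """Objects no longer referenced by any backup (assumed reclamation rule)."""
    return [h for h, count in refcounts.items() if count == 0]
```

A block shared by two backups survives the deletion of one of them, because its count only drops from 2 to 1; this is what keeps each user's data logically independent even though storage is shared.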
Because we provide an online backup service, the backup server and storage server run continuously in the background as daemon processes. There is therefore no termination condition; they always wait for user operation requests. The backup client is the interface through which the user uses the online backup service: the user can log in to the backup client at any time to perform operations such as backup, recovery, and deletion.
Example:
The runtime environment of the backup service system is as follows:
1. Hardware environment and supporting environment
The backup client requires a host with at least 512 MB of memory and network throughput of at least 10 Mbps.
The dispatch server requires a host with at least 2 GB of memory and network throughput of at least 1000 Mbps.
The storage server requires a host with at least 4 GB of memory, TB-scale external storage, and network throughput of 1000 Mbps or more.
The hosts running the dispatch server and storage server software must have GB-level network switching capability between them, and the hosts running the client and server software must have network connectivity between them. The server host environment must provide the basic conditions for normal operation, such as redundant power, redundant communication links, temperature control, and fire protection.
2. software runtime environment
The backup client program runs on Windows XP and later, or on operating system platforms based on the Linux 2.6 kernel.
The dispatch server and storage server run on the Windows Server 2003 operating system platform.
In the online backup service system currently implemented and operating normally, each data space on the storage server side is 1 TB and there are at most 20 data spaces. Each data space is divided into 1024 logical data blocks of 1 GB each.
The above is a preferred embodiment of the present invention, but the invention is not limited to what this embodiment and the accompanying drawings disclose. Any equivalents or modifications made without departing from the spirit disclosed by the invention fall within its scope of protection.