TWI442223B

TWI442223B - Data recovery method for data deduplication

Info

Publication number: TWI442223B
Application number: TW100128572A
Authority: TW
Inventors: Wei Liu; Chih Feng Chen
Original assignee: Inventec Corp
Priority date: 2011-08-10
Filing date: 2011-08-10
Publication date: 2014-06-21
Also published as: TW201308070A

Description

Data recovery method for deduplication

A data maintenance method for data deduplication, and in particular a data recovery method for data deduplication.

Data deduplication is a data reduction technique commonly used in disk-based backup systems, with the primary goal of reducing the storage capacity that a storage system consumes. It works by looking, over a given period of time, for duplicate variable-sized data blocks at different locations in different files. Duplicate data blocks are replaced with pointers. Deduplication frees up backup space, which not only lets backup data be retained on the storage system for longer but also saves much of the bandwidth required for offline storage.

During deduplication, the client 111 segments the input file 112. Segmentation turns the input file 112 into a plurality of data blocks (defined here as segmented data blocks 113). Refer to FIG. 1, a schematic diagram of the segmented data blocks after deduplication in the prior art. The client 111 then hashes each segmented data block 113 to produce a fingerprint feature value for that block. The client 111 compares each fingerprint feature value it obtains with the fingerprint feature values stored on the storage server and determines whether an identical value exists. If an identical fingerprint feature value exists, the data block has already been stored on the storage server.
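The fingerprint comparison described above can be illustrated with a minimal sketch. It assumes SHA-1 as the hash and models the server-side fingerprints as a plain Python set; the names `fingerprint` and `already_stored` are hypothetical and do not come from the patent.

```python
import hashlib

def fingerprint(chunk):
    """Return the SHA-1 hex digest of a chunk, used here as its fingerprint."""
    return hashlib.sha1(chunk).hexdigest()

def already_stored(chunks, stored_fingerprints):
    """For each chunk, report whether its fingerprint is already known."""
    return [fingerprint(c) in stored_fingerprints for c in chunks]

# Assumption: the server already holds one block whose content is b"AAAA".
stored = {fingerprint(b"AAAA")}
flags = already_stored([b"AAAA", b"BBBB", b"AAAA"], stored)
print(flags)
```

Identical blocks always hash to the same fingerprint, so the first and third blocks are reported as present and only the second would need to be uploaded.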

When the client 111 wants to restore data, it sends a file retrieval request to the storage server. In response, the storage server transmits all of the segmented data blocks 113 (that is, the entire input file 112) directly to the client 111. The client 111 then overwrites the input file 112 with the received segmented data blocks 113 to restore it. Although this approach is fast, it places a high load on the client 111 (and on the storage server) and occupies bandwidth during transmission.

In view of the above problems, the present invention provides a data recovery method for data deduplication that restores part of the data of a target file on a client.

The data recovery method for data deduplication disclosed in the present invention includes the following steps: the client obtains the file attribute of the target file; the client queries the storage server for the file attribute of the original file corresponding to the target file; the client compares whether the file attribute of the target file is consistent with the file attribute of the original file; if the file attributes of the target file and the original file are inconsistent, the client segments the target file and generates at least one segmented data block and its corresponding fingerprint feature value; the client obtains all fingerprint feature values of the original file from the storage server and identifies where the fingerprint feature values of the original file and the target file differ; and, according to the differing fingerprint feature values, the client obtains the corresponding segmented data blocks from the storage server and overwrites them at the corresponding locations in the target file.

The present invention provides a data recovery method for data deduplication that restores part of the data of a target file on a client. The client performs a partial data restoration of the target file using the fingerprint feature values stored on the storage server and the corresponding segmented data blocks.

The features and implementation of the present invention are described in detail below through a preferred embodiment, with reference to the drawings.

Refer to FIG. 2, a schematic diagram of the architecture of the present invention. The present invention includes a client 210 and a storage server 220. The client 210 can connect to the storage server 220 over the Internet or an intranet, or the client 210 and the storage server 220 can run on the same computer.

The storage server 220 further includes a fingerprint feature value index list 221, which records multiple sets of fingerprint feature values 222. When the client 210 sends the storage server 220 a query request for an input file, the storage server 220 carries out the query, based on the contents of the fingerprint feature value index list 221, in the following manner. Refer to FIG. 3, a flow diagram of deduplication according to the present invention.

Step S310: The client loads the input file and generates the data blocks of the input file and a fingerprint feature value for each data block;

Step S320: The client sends a query request to the storage server, recording the fingerprint feature values of the data blocks in the request, to ask the storage server whether identical fingerprint feature values already exist;

Step S330: If a fingerprint feature value is not stored in the fingerprint feature value index list of the storage server, the storage server sends a storage request to the client so that the data block corresponding to that fingerprint feature value is transmitted to the storage server for storage, and the storage server appends the received fingerprint feature values to the fingerprint feature value index list in order; and

Step S340: If the fingerprint feature value already exists in the fingerprint feature value index list of the storage server, the storage server replies to the client that the segmented data block already exists.
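Steps S310 to S340 can be sketched as a short exchange between a client function and an in-memory server. This is only an illustration under stated assumptions: the `StorageServer` class, its `has` and `store` methods, and the `backup` function are hypothetical names, SHA-1 is assumed as the hash, and the network exchange is reduced to direct method calls.

```python
import hashlib

class StorageServer:
    """Hypothetical in-memory stand-in for the storage server 220."""

    def __init__(self):
        self.index = []    # fingerprint index list, in arrival order
        self.chunks = {}   # fingerprint -> data block

    def has(self, fp):
        """Answer the query of step S320."""
        return fp in self.chunks

    def store(self, fp, chunk):
        """Step S330: keep the block and append its fingerprint to the index."""
        self.chunks[fp] = chunk
        self.index.append(fp)

def backup(server, chunks):
    """Upload only blocks the server does not hold; return the upload count."""
    uploaded = 0
    for chunk in chunks:                      # S310: blocks are already split
        fp = hashlib.sha1(chunk).hexdigest()
        if server.has(fp):                    # S340: block already exists
            continue
        server.store(fp, chunk)               # S330: server requests the block
        uploaded += 1
    return uploaded

server = StorageServer()
first = backup(server, [b"alpha", b"beta", b"alpha"])
second = backup(server, [b"alpha", b"gamma"])
print(first, second, len(server.index))
```

In this run the repeated `b"alpha"` block is uploaded only once, so the second backup transfers just the one block the server has not seen.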

The client 210 loads the input file, segments it, and generates the data blocks of the input file together with a fingerprint feature value 222 for each data block. The algorithm used to compute the fingerprint feature value 222 may be, but is not limited to, SHA-1 or MD5. The data blocks are produced by either fixed-size partitioning or content-defined chunking (CDC). The fixed-size algorithm segments the input file using a predefined block size; its advantages are simplicity and high performance. Content-defined chunking is a variable-length strategy that uses fingerprint data (such as a Rabin fingerprint) to split the file into blocks of unequal length. Unlike the fixed-size algorithm, content-defined chunking places the boundaries according to the content of the input file, so the size of the segmented data blocks varies.
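The difference between the two segmentation strategies can be shown with a small sketch. The content-defined variant below is a toy: it uses a rolling byte sum as a stand-in for a Rabin fingerprint, and the `window` and `mask` parameters are illustrative assumptions. Practical implementations also enforce minimum and maximum block sizes, which this sketch omits.

```python
def fixed_chunks(data, size):
    """Fixed-size partitioning: cut every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=4, mask=0x07):
    """Toy content-defined chunking.

    Cuts after any position where the sum of the last `window` bytes,
    masked with `mask`, is zero. The rolling sum is a simple stand-in
    for a Rabin fingerprint, not an implementation of one.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # drop the byte leaving the window
        if i + 1 >= window and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])       # trailing block without a boundary
    return chunks

data = b"the quick brown fox jumps over the lazy dog"
shifted = b"X" + data                     # one byte inserted at the front

cdc_original = content_defined_chunks(data)
cdc_shifted = content_defined_chunks(shifted)
reused = set(cdc_original[1:]) <= set(cdc_shifted)
print(fixed_chunks(data, 8)[:2], reused)
```

Because each boundary depends only on the bytes inside the window, inserting a byte at the front of the file moves the boundaries along with the content. Every block after the first one reappears unchanged in the shifted file, whereas fixed-size partitioning would shift every block by one byte.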

Next, the client 210 sends a query request to the storage server 220, recording the fingerprint feature values 222 of the data blocks in the request, to ask the storage server 220 whether identical fingerprint feature values 222 already exist. If a fingerprint feature value 222 is not yet stored in the fingerprint feature value index list 221 of the storage server 220, the storage server 220 sends a storage request to the client 210 so that the data block corresponding to that fingerprint feature value 222 is transmitted to the storage server 220 for storage, and the storage server 220 appends the received fingerprint feature values 222 to the fingerprint feature value index list 221 in order.

When the client 210 wants to restore a file, it sends a file restore request to the storage server 220. To distinguish clearly between the files held by the client 210 and those held by the storage server, the file that the client 210 wants to restore is defined as the target file. The data files stored on the storage server 220 (that is, the segmented data blocks of each file) are defined as original files, so there may be more than one original file. The storage server 220 performs the corresponding file restoration according to the following steps. Refer to FIG. 4 and FIG. 5, which are, respectively, a schematic diagram of the operational flow of the present invention and a schematic diagram of the differences between segmented data blocks. The steps are as follows:

Step S410: The client obtains the file attribute of the target file;

Step S420: The client queries the storage server for the file attribute of the original file corresponding to the target file;

Step S430: The client compares whether the file attribute of the target file is consistent with the file attribute of the original file;

Step S440: If the file attributes of the target file and the original file are consistent, the client does not perform file restoration;

Step S450: If the file attributes of the target file and the original file are inconsistent, the client segments the target file and generates at least one segmented data block and its corresponding fingerprint feature value;

Step S460: The client obtains all fingerprint feature values of the original file from the storage server and identifies where the fingerprint feature values of the original file and the target file differ; and

Step S470: According to the differing fingerprint feature values, the client obtains the corresponding segmented data blocks from the storage server and overwrites them at the corresponding locations in the target file.
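Steps S410 to S470 can be sketched end to end as follows. The sketch makes several assumptions that the patent does not state: fixed-size blocks of `CHUNK_SIZE` bytes, a timestamp as the file attribute, SHA-1 fingerprints, a target file of the same length as the original, and an in-memory `StorageServer` class in place of a real network service. All names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical fixed block size, tiny so the example is readable

def split(data):
    """Cut data into fixed-size blocks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def fingerprint(chunk):
    return hashlib.sha1(chunk).hexdigest()

class StorageServer:
    """Hypothetical in-memory stand-in for the storage server 220."""

    def __init__(self, original, timestamp):
        self.timestamp = timestamp                      # file attribute
        blocks = split(original)
        self.fingerprints = [fingerprint(b) for b in blocks]
        self.chunks = dict(zip(self.fingerprints, blocks))

def recover(target, target_timestamp, server):
    """Restore `target` (a bytearray) in place; return the blocks fetched."""
    if target_timestamp == server.timestamp:            # S410 to S440
        return 0
    target_fps = [fingerprint(b) for b in split(bytes(target))]   # S450
    fetched = 0
    for i, original_fp in enumerate(server.fingerprints):         # S460
        if target_fps[i] == original_fp:   # assumes equal file lengths
            continue
        chunk = server.chunks[original_fp]                        # S470
        offset = i * CHUNK_SIZE
        target[offset:offset + len(chunk)] = chunk
        fetched += 1
    return fetched

original = b"AAAABBBBCCCCDDDD"
server = StorageServer(original, timestamp=100)
target = bytearray(b"AAAAxxxxCCCCDDyD")   # second and fourth blocks modified
fetched = recover(target, 200, server)
print(fetched, bytes(target) == original)
```

Only the two modified blocks are fetched and overwritten; the unmodified blocks are read and hashed on the client but never transferred.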

First, the client 210 obtains the file attribute of the target file; the file attribute is a timestamp or an index number. In other words, before segmenting the target file, the client 210 records the file attribute of the target file 520. Next, the client 210 queries the storage server 220 for the file attribute of the original file 510 corresponding to the target file 520. The storage server 220 checks whether it has stored the file attribute of the original file 510 corresponding to the target file 520. If the client 210 has previously backed up the target file 520, the storage server 220 holds the original file 510 corresponding to the target file 520 together with its file attributes.

The client 210 compares the file attribute of the original file 510 received from the storage server 220 with the file attribute of the target file 520. Taking a timestamp as the file attribute, for example, data files created at different times are given different timestamps. Therefore, if the file attributes of the target file 520 and the original file 510 are inconsistent, the target file 520 has been modified.

If the file attributes of the target file 520 and the original file 510 are inconsistent, the client 210 segments the target file 520 and generates at least one segmented data block and its corresponding fingerprint feature value 222. The client 210 obtains all fingerprint feature values 222 of the original file 510 from the storage server 220 and identifies where the fingerprint feature values 222 of the original file 510 and the target file 520 differ (the black blocks among the segmented data blocks in FIG. 5). When the storage server 220 receives the client 210's request for the fingerprint feature values 222, it may send them to the client 210 all at once or in batches. Because the fingerprint feature values 222 are far smaller than the segmented data blocks, transmitting them does not seriously affect bandwidth usage. Finally, according to the differing fingerprint feature values 222, the client 210 obtains the corresponding segmented data blocks from the storage server 220 and overwrites them at the corresponding locations in the target file 520.
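The bandwidth argument above can be made concrete with a small calculation. The numbers are illustrative assumptions only: 1024 blocks of 4096 bytes, 10 modified blocks, and 20-byte fingerprints (the size of a SHA-1 digest). The function name `restore_cost` is hypothetical.

```python
def restore_cost(n_chunks, chunk_size, n_different, fingerprint_size=20):
    """Bytes transferred for a full restore versus a partial restore.

    Full restore: every block is sent.
    Partial restore: every fingerprint plus only the differing blocks.
    """
    full = n_chunks * chunk_size
    partial = n_chunks * fingerprint_size + n_different * chunk_size
    return full, partial

full, partial = restore_cost(n_chunks=1024, chunk_size=4096, n_different=10)
print(full, partial)   # 4194304 61440
```

Under these assumptions the partial restore transfers 61,440 bytes instead of 4,194,304 bytes, roughly 1.5% of the full file. The saving shrinks as the share of modified blocks grows.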

The present invention provides a data recovery method for data deduplication that restores part of the data of the target file 520 on the client 210. The client 210 performs a partial data restoration of the target file 520 using the fingerprint feature values 222 stored on the storage server 220 and the corresponding segmented data blocks. Moreover, unlike the prior art, the present invention does not need to read and write the target file 520 block by block; it needs only read and compute operations. Compared with the prior art, the write time is therefore effectively shortened.

Although the present invention is disclosed above by way of the foregoing preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of patent protection is defined by the claims appended to this specification.

111 . . . Client

112 . . . Input file

113 . . . Segmented data block

210 . . . Client

220 . . . Storage server

221 . . . Fingerprint feature value index list

222 . . . Fingerprint feature value

510 . . . Original file

520 . . . Target file

FIG. 1 is a schematic diagram of the segmented data blocks after deduplication in the prior art.

FIG. 2 is a schematic diagram of the architecture of the present invention.

FIG. 3 is a flow diagram of deduplication according to the present invention.

FIG. 4 is a schematic diagram of the operational flow of the present invention.

FIG. 5 is a schematic diagram of the differences between segmented data blocks according to the present invention.

Claims

1. A data recovery method for data deduplication, by which a client performs partial data recovery on a target file according to an original file that is stored on a storage server and has undergone deduplication processing, the data recovery method comprising: the client obtaining a file attribute of the target file; the client querying the storage server for the file attribute of an original file corresponding to the target file; the client comparing whether the file attribute of the target file is consistent with the file attribute of the original file; if the file attributes of the target file and the original file are inconsistent, segmenting the target file and generating at least one segmented data block and its corresponding fingerprint feature value; obtaining all fingerprint feature values of the original file from the storage server, the client identifying where the fingerprint feature values of the original file and the target file differ; and the client obtaining, according to the differing fingerprint feature values, the corresponding segmented data blocks from the storage server and overwriting the obtained segmented data blocks at the corresponding locations in the target file.

2. The data recovery method for data deduplication as described in claim 1, wherein the file attribute is a timestamp or an index number.

3. The data recovery method for data deduplication as described in claim 1, wherein the fingerprint feature value is obtained through a hash algorithm or a one-way algorithm.

4. The data recovery method for data deduplication as described in claim 1, wherein the step of overwriting the obtained segmented data blocks at the corresponding locations in the target file further comprises: the client repeatedly obtaining from the storage server the segmented data blocks corresponding to the differing fingerprint feature values and overwriting the target file until the entire target file has been processed.