CN105068757B - A redundant data deduplication method based on file semantics and real-time system status - Google Patents

Info

Publication number
CN105068757B
Authority
CN
China
Prior art keywords
file
deduplication
sla
duplicate removal
less
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510435945.0A
Other languages
Chinese (zh)
Other versions
CN105068757A (en)
Inventor
尹建伟
唐彦
邓水光
李莹
吴健
吴朝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201510435945.0A
Publication of CN105068757A
Application granted
Publication of CN105068757B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a redundant data deduplication method based on file semantics and real-time system status. The method is implemented mainly by three functional modules: a deduplication priority calculation module based on multi-semantic-dimension partitioning (the MPD module), a hierarchical data deduplication module (the deduplicator), and a deduplication control module based on real-time system status (the controller). Using multi-dimensional file semantics, the MPD module outputs the file objects that should be deduplicated first. The deduplicator then processes this output in sequence with a hierarchical strategy that comprises global file-level deduplication and local data-block-level deduplication. While the deduplicator runs, the controller adjusts it dynamically according to the real-time status of the system, so that more storage cost is saved while the read-request response performance of the distributed primary storage system is preserved.

Description

A redundant data deduplication method based on file semantics and real-time system status

Technical field

The invention belongs to the technical field of computer information management, and in particular relates to a redundant data deduplication method based on file semantics and real-time system status.

Background

With the further spread and deeper adoption of cloud computing and the mobile Internet, online applications and services play an increasingly important role in every industry, and as the user base of Internet applications rises sharply, the total volume of information worldwide is growing explosively. Distributed storage systems are the back-end support for cloud services: all service data is stored in them, and the storage system exposes a unified read/write interface so that users or upper-layer application services can access or modify the data on disk. By application scenario and main purpose, distributed storage systems can be roughly divided into two categories: backup storage systems and primary storage systems. Backup storage systems are mainly used to back up cold data such as system logs and historical archives. They are generally built on relatively inexpensive storage hardware, possibly even tape-based devices, because the data in a backup storage system is accessed very rarely; historical data is usually read only when there is a special need. Backup storage systems therefore place no high demands on read/write performance.

In contrast, a primary storage system, which generally means a system storing data that upper-layer application services access directly, has high requirements for data access performance, because read/write efficiency directly determines the user's experience of the upper-layer applications and services. A primary storage system usually takes the data block as its basic unit: it splits files into data blocks stored on the underlying disks and maintains in memory an index of all data blocks. The index records which file each data block belongs to and the block's physical location on disk. Because a primary storage system needs high access performance, developers of distributed systems usually adopt a redundant data placement and management mechanism: more than one copy of the same data is kept in the distributed primary storage system, and two structurally very similar items, such as two project folders containing a large number of identical files, are also kept as two complete folders.

Before the big data era, storage devices had to store and process relatively small amounts of data. Even though the hardware required to build a distributed primary storage system was expensive, hardware cost was not a factor that service providers needed to worry about. In the big data era, with the surge of Internet applications and services, the number of users grows daily and services and applications become larger and more complex, so the amount of data that the primary storage systems behind cloud services must hold has grown explosively. This data must be directly accessible to upper-layer applications and services, so inexpensive backup storage systems cannot be used to help store it. Although the cost of storage hardware keeps falling as technology improves, and larger capacities can be bought for less, data volume in the big data era grows exponentially. In practice, the growth rate of stored data has overtaken the rate at which hardware cost declines, so buying more storage devices cannot fundamentally cope with the surge in data. Economically, it is also a very costly approach.

Driven by this background and these challenges, redundant data deduplication has attracted growing attention from service providers, especially those that must keep large amounts of data in primary storage. In general terms, deduplication compares the "signatures" of data items. Data with identical signatures is judged redundant; the redundant copy is deleted, and the index entry of the deleted data is updated so that its physical disk location points to the copy that was kept. The next time a user or application accesses the deleted data, the system uses the index to direct the request to the retained copy on disk and performs the requested operation there.

By granularity, deduplication techniques fall into two categories: file-level and data-block-level. File-level deduplication removes only files that are completely identical; its unit of comparison is the whole file. Block-level deduplication compares and processes individual data blocks. As noted above, the underlying storage of a distributed primary storage system is block-based, so a file may be split into several blocks on disk. A signature is computed for each data block, and redundant blocks are deleted on the basis of signature comparison.

By scope, deduplication can be divided into global and local. Scope matters in a distributed storage system: global deduplication detects redundant data across all servers of the system, even when those servers are geographically separated, whereas local deduplication considers only redundant data on the same server or the same storage device.

By execution timing, deduplication can be divided into offline and online. Offline deduplication generally means that the deduplication program runs in the background and detects and deletes redundant data after new file data has been written to disk; online deduplication detects and removes redundancy while new data is being written.
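To make the difference in granularity concrete, the following Python sketch counts how many bytes file-level and block-level signature comparison would each identify as redundant on a toy data set. It is illustrative only: the function names, the fixed block size, the sample data and the choice of SHA-1 as the signature are assumptions made for this example, not details of the prior-art schemes described above.

```python
import hashlib


def signature(data: bytes) -> str:
    """Content signature; SHA-1 is an assumption for this sketch."""
    return hashlib.sha1(data).hexdigest()


def file_level_savings(files: dict) -> int:
    """Bytes identified as redundant when whole files are compared."""
    seen, saved = set(), 0
    for content in files.values():
        sig = signature(content)
        if sig in seen:
            saved += len(content)  # an identical file is already kept
        else:
            seen.add(sig)
    return saved


def block_level_savings(files: dict, block_size: int) -> int:
    """Bytes identified as redundant when fixed-size blocks are compared."""
    seen, saved = set(), 0
    for content in files.values():
        for start in range(0, len(content), block_size):
            block = content[start:start + block_size]
            sig = signature(block)
            if sig in seen:
                saved += len(block)  # an identical block is already kept
            else:
                seen.add(sig)
    return saved


# Hypothetical toy data: a.bin and c.bin are identical, b.bin shares one block.
FILES = {
    "a.bin": b"A" * 8 + b"B" * 8,
    "b.bin": b"A" * 8 + b"C" * 8,
    "c.bin": b"A" * 8 + b"B" * 8,
}

print(file_level_savings(FILES))      # whole-file comparison
print(block_level_savings(FILES, 8))  # 8-byte block comparison
```

On this data, file-level comparison finds 16 redundant bytes (the copy of `a.bin`), while block-level comparison finds 24, because it also catches the block that `b.bin` shares with the other two files.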

Combining these classifications, the deduplication schemes commonly implemented in real storage systems are combinations of the categories above. Two are used most: global file-level deduplication (GFD) and local chunk-level deduplication (LCD), the latter operating on data blocks. Although both can meet some deduplication needs, neither was designed with the trade-offs of the primary storage scenario in mind. First, deduplication costs read performance. GFD may require the data a user accesses to be transferred from a remote server, adding network latency, while LCD fragments the local disk, so that reading one item may need several disk seeks, which delays the response to read operations. The higher the degree of deduplication, the more storage cost is saved but the more read performance suffers, so balancing deduplication against read performance is the primary concern.

Second, primary storage differs from backup storage. Backup data is cold, rarely accessed and usually kept as a historical record, so from the system's point of view it can be treated uniformly as a binary byte stream. In a primary storage system, which directly supports upper-layer services, the stored data is highly diverse and carries semantics that vary with the type of service and the access characteristics of the user group. These file semantics should be exploited when designing deduplication for primary storage. Third, because the primary storage system deals with users directly, access patterns vary with time, region and changes in the service, and individual users differ as well. A deduplication scheme for primary storage therefore needs to be dynamic and adjustable in order to strike a better balance between read performance and storage space efficiency.

In summary, in the big data era cloud service providers urgently need to reduce storage cost, yet they also want deduplication not to degrade the performance of upper-layer applications and services too much, so that the user experience is preserved. How to design and implement an efficient deduplication scheme that takes account of the usage and data characteristics of distributed primary storage systems, and that exploits the rich file semantics and changing system status that distinguish them from backup storage systems, so as to balance space efficiency against read performance, has become an important problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of the above technical problems in the prior art, the invention provides a redundant data deduplication method based on file semantics and real-time system status, which allows a distributed primary storage system to reduce storage cost while maintaining high read-request response performance.

A redundant data deduplication method based on file semantics and real-time system status proceeds as follows:

The read response delay and the deduplication ratio of the distributed storage system are measured periodically. Based on their current values, the system's deduplicator is adjusted by the following dynamic adjustment mechanism based on SLAs (Service Level Agreements):

According to the SLA the system currently references, determine whether the current read response delay exceeds 1.1 times the upper limit of that SLA's read response delay interval:

If so, the deduplicator stops executing both GFD and LCD in the next cycle. If not, determine whether the current read response delay is below the upper limit of the SLA's read response delay interval:

If not, the deduplicator keeps executing GFD but stops executing LCD in the next cycle. If so, determine whether the current deduplication ratio is below the lower limit of the SLA's deduplication ratio interval:

If so, the deduplicator executes GFD and LCD normally in the next cycle. If not, the SLA the system references is raised by one level, and the judgment is repeated under the new SLA using the same mechanism.

The current read response delay is the average response delay of all user read requests to the system during the previous cycle. The current deduplication ratio is the total volume of redundant data removed by the system from the initial time to the current time, divided by the total volume of data that would have accumulated in storage over the same period had no redundant data been removed.
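The two monitored quantities can be written down as a short Python sketch. The function names and the handling of an empty cycle or an empty store (returning 0.0) are assumptions made for illustration; the text defines only the two ratios themselves.

```python
def average_read_delay_ms(delays_ms: list) -> float:
    """Mean response delay of all read requests seen in the previous cycle."""
    if not delays_ms:  # assumption: an idle cycle counts as zero delay
        return 0.0
    return sum(delays_ms) / len(delays_ms)


def dedup_ratio(removed_bytes: int, logical_bytes: int) -> float:
    """Removed redundant volume divided by the volume that would have been
    stored since the initial time if nothing had been removed."""
    if logical_bytes == 0:  # assumption: an empty system has ratio 0
        return 0.0
    return removed_bytes / logical_bytes


# Hypothetical measurements for one cycle.
print(average_read_delay_ms([500, 700]))  # delays in milliseconds
print(dedup_ratio(25, 100))               # 25 of 100 stored units removed
```

With these sample inputs the average delay is 600 ms and the deduplication ratio is 0.25.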

If the current read response delay exceeds 1.1 times the upper limit of the read response delay interval of the SLA the system currently references, the deduplicator stops executing GFD and LCD in the next cycle. If, counting the next cycle, the deduplicator has then stopped executing GFD and LCD for three consecutive cycles, the SLA the system references is lowered by one level.

The SLA levels are as follows:

Level 1: read response delay interval [0, 600 ms], deduplication ratio interval [0.25, 1);

Level 2: read response delay interval [0, 600 ms], deduplication ratio interval [0.1, 0.25];

Level 3: read response delay interval [600 ms, 750 ms], deduplication ratio interval [0.25, 1);

Level 4: read response delay interval [600 ms, 750 ms], deduplication ratio interval [0.1, 0.25];

Level 5: read response delay interval [750 ms, 900 ms], deduplication ratio interval [0.25, 1);

Level 6: read response delay interval [750 ms, 900 ms], deduplication ratio interval [0.1, 0.25];

Level 7: read response delay interval [900 ms, 1200 ms], deduplication ratio interval [0.25, 1);

Level 8: read response delay interval [900 ms, 1200 ms], deduplication ratio interval [0.1, 0.25].
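The adjustment mechanism and the eight SLA levels can be sketched together in Python. This is one possible reading, not the patented implementation. The class and method names are hypothetical, and three points are assumptions the text leaves open: "raising" an SLA is taken to mean moving toward level 1, a system already at level 1 simply runs both tiers, and the counter of consecutive full stops is reset after a downgrade.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    delay_upper_ms: float  # upper limit of the read response delay interval
    ratio_lower: float     # lower limit of the deduplication ratio interval


# Levels 1..8 as listed above (index 0 is level 1).
SLA_LEVELS = [
    SLA(600, 0.25), SLA(600, 0.10),
    SLA(750, 0.25), SLA(750, 0.10),
    SLA(900, 0.25), SLA(900, 0.10),
    SLA(1200, 0.25), SLA(1200, 0.10),
]


class Controller:
    def __init__(self, level: int = 8):
        self.level = level   # 1-based SLA level currently referenced
        self.full_stops = 0  # consecutive cycles with GFD and LCD both stopped

    def decide(self, delay_ms: float, ratio: float) -> tuple:
        """Return (run_gfd, run_lcd) for the next cycle."""
        while True:
            sla = SLA_LEVELS[self.level - 1]
            if delay_ms > 1.1 * sla.delay_upper_ms:
                self.full_stops += 1
                # Three consecutive full stops: lower the SLA by one level.
                if self.full_stops >= 3 and self.level < len(SLA_LEVELS):
                    self.level += 1
                    self.full_stops = 0  # assumption: restart the count
                return (False, False)
            self.full_stops = 0
            if delay_ms >= sla.delay_upper_ms:
                return (True, False)  # keep GFD, stop LCD
            if ratio < sla.ratio_lower or self.level == 1:
                return (True, True)   # run both tiers normally
            self.level -= 1           # raise the SLA and judge again


controller = Controller(level=2)
print(controller.decide(delay_ms=620, ratio=0.2))
```

In the usage line, a delay of 620 ms under level 2 lies between the 600 ms limit and 1.1 times that limit, so the sketch keeps GFD running and stops LCD.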

The deduplicator deduplicates the system using a priority-based hierarchical redundant data deduplication scheme, as follows:

(1) Based on file semantics, calculate the deduplication priority of each file in the system;

(2) Extract from the system's global file index the file location records of the n files with the highest deduplication priority, where n is a natural number greater than 1;

(3) Perform hierarchical redundant data deduplication on these n files according to their file location records:

First, perform GFD on the n files and update the global file index. Initially the file location records of all n files are marked "dirty"; the record of a file whose redundant data has been removed by GFD is marked "clean";

Then, perform LCD on the files whose file location records are still marked "dirty" after GFD, and update the local data block index;

(4) Repeat steps (1) to (3).
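One pass through steps (2) and (3) might look like the following Python sketch. The names `FileRecord` and `dedup_pass` are hypothetical, and `gfd` and `lcd` are caller-supplied stand-ins for the real operations: `gfd` is assumed to return `True` when it removed a redundant copy of the file, and `lcd` is assumed to update the local data block index itself.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    name: str
    priority: float     # deduplication priority from step (1)
    dirty: bool = True  # "dirty" until GFD has removed redundant data


def dedup_pass(records, n, gfd, lcd):
    """Run one GFD-then-LCD pass; return the names handed to LCD."""
    # Step (2): the n records with the highest deduplication priority.
    candidates = sorted(records, key=lambda r: r.priority, reverse=True)[:n]
    for rec in candidates:
        rec.dirty = True
    # Step (3), first tier: global file-level deduplication.
    for rec in candidates:
        if gfd(rec):
            rec.dirty = False  # marked "clean"
    # Step (3), second tier: local block-level deduplication on the rest.
    handed_to_lcd = []
    for rec in candidates:
        if rec.dirty:
            lcd(rec)
            handed_to_lcd.append(rec.name)
    return handed_to_lcd


# Hypothetical data: only file "a" has a redundant copy on another server.
records = [FileRecord("a", 5), FileRecord("b", 3),
           FileRecord("c", 9), FileRecord("d", 1)]
print(dedup_pass(records, 3, gfd=lambda r: r.name == "a", lcd=lambda r: None))
```

Step (4) would call `dedup_pass` again after priorities have been recomputed. Whether a record is also marked "clean" after LCD is not stated at this point in the text, so the sketch leaves the flag unchanged.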

In step (1), the deduplication priority of each file is calculated by the following formula:

$\rho = \alpha_1 H + \alpha_2 G + \alpha_3 E$

where $\rho$ is the deduplication priority of the file, $H$ is the priority value for the file's last access time, $G$ is the priority value for the file's size, $E$ is the priority value for the file's type, and $\alpha_1$ to $\alpha_3$ are the weight coefficients of $H$, $G$ and $E$ respectively.

The priority value H is determined as follows:

If the file was last accessed more than 1 month ago, H = 7;

If the file was last accessed more than 15 days and at most 1 month ago, H = 6;

If the file was last accessed more than 7 days and at most 15 days ago, H = 5;

If the file was last accessed more than 3 days and at most 7 days ago, H = 4;

If the file was last accessed more than 1 day and at most 3 days ago, H = 3;

If the file was last accessed more than 12 hours and at most 1 day ago, H = 2;

If the file was last accessed more than 6 hours and at most 12 hours ago, H = 1;

If the file was last accessed at most 6 hours ago, H = -1.

The priority value G is determined as follows:

If the file size is greater than 1 GB, G = 7;

If the file size is greater than 512 MB and at most 1 GB, G = 6;

If the file size is greater than 256 MB and at most 512 MB, G = 5;

If the file size is greater than 64 MB and at most 256 MB, G = 4;

If the file size is greater than 8 MB and at most 64 MB, G = 3;

If the file size is greater than 1 MB and at most 8 MB, G = 2;

If the file size is greater than 128 KB and at most 1 MB, G = 1;

If the file size is at most 128 KB, G = -1.

The priority value E is determined as follows:

If the file type is backup log, E = 7;

If the file type is image file (such as a disk or system image), E = 6;

If the file type is project, E = 5;

If the file type is video, E = 4;

If the file type is audio, E = 3;

If the file type is document, E = 2;

If the file type is picture, E = 1;

If the file type is anything else, E = -1.
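The formula and the three lookup rules translate directly into code. The following Python sketch makes several assumptions that the text does not fix: "1 month" is taken as 30 days, "1G" as 1024 MB, the file type arrives as one of the English labels shown, and the default weights are the values 0.5, 0.3 and 0.2 used later in the embodiment. All function and constant names are hypothetical.

```python
from datetime import timedelta

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# (exclusive lower bound, priority value), checked from largest to smallest.
H_RULES = [(timedelta(days=30), 7), (timedelta(days=15), 6),
           (timedelta(days=7), 5), (timedelta(days=3), 4),
           (timedelta(days=1), 3), (timedelta(hours=12), 2),
           (timedelta(hours=6), 1)]
G_RULES = [(GB, 7), (512 * MB, 6), (256 * MB, 5), (64 * MB, 4),
           (8 * MB, 3), (MB, 2), (128 * KB, 1)]
E_VALUES = {"backup_log": 7, "image": 6, "project": 5, "video": 4,
            "audio": 3, "document": 2, "picture": 1}


def lookup(value, rules) -> int:
    """Return the value of the first rule whose bound is strictly exceeded."""
    for bound, priority in rules:
        if value > bound:
            return priority
    return -1


def h_value(since_last_access: timedelta) -> int:
    return lookup(since_last_access, H_RULES)


def g_value(size_bytes: int) -> int:
    return lookup(size_bytes, G_RULES)


def e_value(file_type: str) -> int:
    return E_VALUES.get(file_type, -1)


def priority(since_last_access, size_bytes, file_type,
             weights=(0.5, 0.3, 0.2)) -> float:
    """rho = a1*H + a2*G + a3*E."""
    a1, a2, a3 = weights
    return (a1 * h_value(since_last_access)
            + a2 * g_value(size_bytes)
            + a3 * e_value(file_type))


# Hypothetical file: a 300 MB video last read 10 days ago (H=5, G=5, E=4).
print(priority(timedelta(days=10), 300 * MB, "video"))
```

For the sample file, H = 5, G = 5 and E = 4, so the priority is 0.5 × 5 + 0.3 × 5 + 0.2 × 4 = 4.8 up to floating-point rounding.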

By adopting the above technical solution, the invention has the following significant advantages over the prior art:

(1) Compared with traditional redundant data deduplication schemes, an evident advantage of the method is that it makes full use of the rich file semantics in the primary storage system. It quantitatively partitions the attribute values of multi-dimensional file semantics, assigns a specific deduplication priority value to each file attribute, and, by weighting the dimensions differently, computes a numerical deduplication priority for each file. The deduplicator can thus treat files differently, saving storage space while maintaining efficient responses to read requests from upper-layer applications.

(2) The method introduces a mechanism for dynamically adjusting the deduplication program at run time and a mechanism for describing deduplication requirements uniformly. The unified requirement description lets the functional modules of the scheme obtain users' specific needs qualitatively and clearly ranks, by priority, users' multi-level trade-offs between read performance and space efficiency. Dynamic adjustment makes deduplication aware of system status: when read-request response performance is affected, the deduplicator's mode of operation can be adjusted to different degrees.

(3) The method introduces a two-tier deduplication mechanism that combines GFD and LCD and applies each at different times. Compared with traditional single-mechanism schemes, the proposed two-tier hybrid mechanism adds flexibility in deduplication granularity and scope, so that the controller, according to user requirements and the real-time status of the system, can adjust when the two schemes run in order to satisfy a higher-priority SLA.

Description of the drawings

FIG. 1 is a schematic diagram of the functional architecture of the redundant data deduplication method of the invention.

FIG. 2 is a schematic flow chart of the redundant data deduplication method of the invention.

Detailed description

To describe the invention more specifically, its technical solution is explained in detail below with reference to the drawings and a specific embodiment.

As shown in FIG. 1, in a real operating environment the redundant data deduplication scheme of the invention runs inside a general-purpose distributed primary storage system, which mainly comprises a metadata server and a cluster of object storage servers, where:

The metadata server receives user requests and directs them to the appropriate object storage server. It also monitors the running status of the whole distributed primary storage system and maintains a global index keyed by file name. The index holds each file's "signature", location information and metadata. The signature is unique to each file and is computed over the file's binary content with the SHA-1 algorithm.

The object storage server cluster consists of multiple independent object storage servers, each holding a certain number of files. A server's local disk stores file data as data blocks, and each object storage server maintains a block-level index of the files stored locally. Likewise, this index holds each data block's physical location on the local disk and the signature computed with SHA-1 over the block's binary content.
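A local data block index of this kind can be modelled in a few lines of Python using the standard `hashlib.sha1`. The sketch below is a simplification under stated assumptions: blocks have a fixed size, a block's physical location is represented by its byte offset, and the names `BlockEntry`, `index_blocks` and `redundant_blocks` are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockEntry:
    signature: str  # SHA-1 over the block's binary content
    offset: int     # physical location, modelled here as a byte offset
    length: int


def index_blocks(data: bytes, block_size: int) -> list:
    """Build index entries for data split into fixed-size blocks."""
    entries = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        sig = hashlib.sha1(block).hexdigest()
        entries.append(BlockEntry(sig, offset, len(block)))
    return entries


def redundant_blocks(entries) -> dict:
    """Map each signature that occurs more than once to its offsets."""
    by_signature = defaultdict(list)
    for entry in entries:
        by_signature[entry.signature].append(entry.offset)
    return {s: offs for s, offs in by_signature.items() if len(offs) > 1}


# Hypothetical 12-byte "disk" whose first and third blocks are identical.
entries = index_blocks(b"aaaabbbbaaaa", block_size=4)
print(list(redundant_blocks(entries).values()))
```

Blocks that share a signature are the candidates a block-level pass would collapse into one stored copy, with the index redirecting the other locations to it.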

This embodiment mainly comprises three functional modules: the deduplication priority calculation module based on multi-semantic-dimension partitioning (the MPD module), the hierarchical data deduplication module (the deduplicator), and the deduplication control module based on real-time system status (the controller). These modules run on the metadata server or on the object storage servers and can interact with and operate on the metadata server's global file index and the data block index on each object storage server, where:

MPD模块运行于元数据服务器上,它通过获取元数据服务器上所保存的关于文件以及文件的附属信息,计算出具有高去重优先度的文件列表,MPD模块基于多维度的文件语义,周期性地输出一个列表,该列表里面包含了固定数量的文件位置记录,这些文件为去重器应该优先实施去重的对象。本方案的MPD模块基于文件的三个维度语义进行去重优先度计算:文件的最近访问时间戳,文件的大小,文件的类型。在此基础上,MPD模块对于每一个维度中的具体值的范围作了明确的赋值,使得每个文件在这三个维度中的属性值都对应一个明确的数值,这个数据标示着某个文件在这个维度当中的去重优先度,数值越大表示越被优先考虑进行去重操作。而且,这三个维度有着不同权重系数,文件的最终去重优先度是这三个维度当中的对应值的加权。在实际应用当中,这些赋值是可以根据具体的服务场景和需求来定制的,本实施方式的缺省划分标准和对应的赋值,以及各个不同维度的权重赋值如表1所示:The MPD module runs on the metadata server. It calculates the list of files with high deduplication priority by obtaining the files and file auxiliary information stored on the metadata server. The MPD module is based on multi-dimensional file semantics and periodicity. Output a list, which contains a fixed number of file location records, and these files are the objects that the de-duplicator should prioritize to implement de-duplication. The MPD module of this solution performs deduplication priority calculation based on the three-dimensional semantics of the file: the latest access timestamp of the file, the size of the file, and the type of the file. On this basis, the MPD module clearly assigns the range of specific values in each dimension, so that the attribute values of each file in these three dimensions correspond to a clear value, and this data marks a certain file. For the deduplication priority in this dimension, the larger the value, the more priority is given to the deduplication operation. Moreover, these three dimensions have different weight coefficients, and the final deduplication priority of the file is the weighting of the corresponding values in these three dimensions. In practical applications, these assignments can be customized according to specific service scenarios and requirements. The default division standards and corresponding assignments in this embodiment, as well as the weight assignments of different dimensions are shown in Table 1:

Table 1

According to the values in Table 1, the final deduplication priority of each file = access-time priority value × 0.5 + file-size priority value × 0.3 + file-type priority value × 0.2. The MPD module periodically scans the file index on the metadata server and places the information of the 50 files with the highest deduplication priority into a list called the deduplication candidate list. Each file occupies one row of the list, and each row carries a flag. Every flag is initialized to "dirty", meaning the file in that row has not yet been processed by the deduplicator; once the deduplicator has finished processing the file, the flag is changed to "clean".
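The weighted score and the 50-entry candidate list described above can be sketched as follows. This is a minimal illustration, not the patented implementation: Table 1 is not reproduced on this page, so the per-dimension thresholds are taken from claims 5 to 7, "1 month" is assumed to mean 30 days, and the names `FileMeta`, `priority`, and `candidate_list` are hypothetical.

```python
import heapq
from dataclasses import dataclass

# Default weights from the description: access time 0.5, size 0.3, type 0.2.
WEIGHTS = (0.5, 0.3, 0.2)

HOUR = 3600
DAY = 24 * HOUR
KB, MB, GB = 1024, 1024**2, 1024**3

# (exclusive lower bound, score) pairs, checked from largest to smallest.
# Thresholds follow claims 5 and 6; "1 month" = 30 days is an assumption.
AGE_STEPS = [(30 * DAY, 7), (15 * DAY, 6), (7 * DAY, 5), (3 * DAY, 4),
             (1 * DAY, 3), (12 * HOUR, 2), (6 * HOUR, 1)]
SIZE_STEPS = [(1 * GB, 7), (512 * MB, 6), (256 * MB, 5), (64 * MB, 4),
              (8 * MB, 3), (1 * MB, 2), (128 * KB, 1)]
# File-type scores follow claim 7; the string keys are hypothetical labels.
TYPE_SCORES = {"backup_log": 7, "image": 6, "project": 5, "video": 4,
               "audio": 3, "document": 2, "picture": 1}


@dataclass
class FileMeta:
    path: str
    last_access: float  # POSIX timestamp of the most recent access
    size: int           # bytes
    kind: str           # file type label


def step_score(value, steps):
    """Return the score of the first range whose lower bound is exceeded."""
    for bound, score in steps:
        if value > bound:
            return score
    return -1  # smallest range in every dimension scores -1


def priority(f, now):
    """Weighted sum of the three per-dimension scores."""
    h = step_score(now - f.last_access, AGE_STEPS)
    g = step_score(f.size, SIZE_STEPS)
    e = TYPE_SCORES.get(f.kind, -1)
    return WEIGHTS[0] * h + WEIGHTS[1] * g + WEIGHTS[2] * e


def candidate_list(files, now, n=50):
    """Top-n files by priority, each row initialized with a 'dirty' flag."""
    top = heapq.nlargest(n, files, key=lambda f: priority(f, now))
    return [{"path": f.path, "priority": priority(f, now), "flag": "dirty"}
            for f in top]
```

A file that is old, large, and of a high-scoring type reaches the maximum score of 7, while a small, recently accessed file of an unlisted type scores -1 and is effectively excluded from the list.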

The deduplicator runs on each object storage server. It takes the list output by the MPD module as input and deduplicates the files in the list hierarchically, in two tiers: global file-level deduplication (GFD) and local data-block-level deduplication (LCD). The deduplicator periodically fetches the latest deduplication candidate list from the MPD module and starts with the file that has the highest deduplication priority. When the controller permits, the deduplicator first performs GFD on the file. To find out whether redundant copies of the file exist anywhere in the system, the local deduplicator sends a query to the metadata server; because the metadata server holds the global file index, it knows whether redundant copies of the file reside on other object storage servers. If redundant copies are found, the deduplicator sends a deduplication request to each object storage server holding one, and that server instructs its own local deduplicator to delete the redundant file from its local disk. Once the requesting deduplicator learns that the redundant file has been deleted, it sets the flag of that file's record in the deduplication candidate list to "clean" and notifies the metadata server to update the file index, so that the location information of the just-deleted file points to the object storage server on which the requesting deduplicator resides.

After the deduplicator has traversed the deduplication candidate list in GFD mode, it traverses the list again from the beginning in LCD mode, if the controller permits. To perform LCD, the deduplicator searches the data block index of the local object storage server to determine whether any locally stored data blocks duplicate one or more blocks of the current file. If redundant blocks exist, they are deleted, the index of the local object storage server is updated accordingly, and the flag of the corresponding file entry in the deduplication candidate list is set to "clean". Once the LCD pass has also traversed the whole list, the local deduplicator notifies the metadata server that the list for this cycle has been fully processed.
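The two passes over the candidate list can be sketched with in-memory stand-ins. Everything here is an illustrative assumption: the global file index is modeled as a dict from a file identifier to the set of servers holding a copy, the local block index as a set of block fingerprints already stored, and the `allowed` callback stands in for the controller's permission. Remote deletion requests and network calls are reduced to simple dict and set updates.

```python
def gfd_pass(candidates, global_index, local_server, allowed=lambda: True):
    """Global file-level pass: drop copies of each file held on other servers."""
    for entry in candidates:
        if not allowed():
            return False  # controller has paused GFD
        holders = global_index[entry["file_id"]]
        remote = holders - {local_server}
        if local_server in holders and remote:
            # Stand-in for asking each remote server to delete its copy,
            # then pointing the global index at the local copy only.
            global_index[entry["file_id"]] = {local_server}
            entry["removed_bytes"] += entry["size"] * len(remote)
            entry["flag"] = "clean"
    return True


def lcd_pass(candidates, block_index, allowed=lambda: True):
    """Local block-level pass: drop blocks already present in the local index."""
    for entry in candidates:
        if not allowed():
            return False  # controller has paused LCD
        for block_hash, block_size in entry["blocks"]:
            if block_hash in block_index:
                # Redundant block: count the space that would be reclaimed.
                entry["removed_bytes"] += block_size
                entry["flag"] = "clean"
            else:
                block_index.add(block_hash)
    return True
```

This sketch runs LCD over every entry, as the description states; claim 3 restricts the LCD pass to entries still marked "dirty" after GFD, which would be a one-line filter at the top of the loop.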

While the deduplicator is running, the controller dynamically adjusts its deduplication strategy according to the real-time status of the system. The controller is a distributed component that runs on the metadata server and on every object storage server at the same time. The part on the metadata server monitors the read-request response latency of the whole distributed primary storage system, and the part on each object storage server monitors the storage space usage of that server. The controller adjusts the deduplicator according to a set of <read response latency range, deduplication ratio range> requirement pairs, which the user can set in advance and which are ordered by priority. This prioritized, tiered set of requirement pairs is called a service level agreement (SLA). This embodiment defines a fixed format for the SLA, so users can fill in their own values in that format and supply the result as input to the controller. An SLA in the example format is shown in Table 2:

Table 2
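Table 2 is not reproduced on this page. As a rough illustration of what an ordered set of <read response latency range, deduplication ratio range> pairs could look like as controller input, the sketch below encodes the eight SLA grades listed in claim 1. The type name `SLA`, the field names, and the helper `latency_within` are hypothetical, and the claim's half-open ratio interval [0.25, 1) is stored simply as the pair (0.25, 1.0).

```python
from typing import NamedTuple, Tuple


class SLA(NamedTuple):
    grade: int
    latency_ms: Tuple[float, float]  # (lower, upper) read response latency
    ratio: Tuple[float, float]       # (lower, upper) deduplication ratio


# The eight grades as listed in claim 1, in claim order.
SLA_TABLE = [
    SLA(1, (0, 600), (0.25, 1.0)),
    SLA(2, (0, 600), (0.1, 0.25)),
    SLA(3, (600, 750), (0.25, 1.0)),
    SLA(4, (600, 750), (0.1, 0.25)),
    SLA(5, (750, 900), (0.25, 1.0)),
    SLA(6, (750, 900), (0.1, 0.25)),
    SLA(7, (900, 1200), (0.25, 1.0)),
    SLA(8, (900, 1200), (0.1, 0.25)),
]


def latency_within(sla, latency_ms):
    """True if the measured latency meets or beats the SLA's upper limit."""
    return latency_ms <= sla.latency_ms[1]
```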

When a read request from a user arrives at the metadata server, the metadata server takes the current timestamp as the start timestamp of the response to that request, inserts it into the request message, and forwards the request to the corresponding object storage server. The object storage server that receives the request reads the requested object in full and, at the moment the last data block starts to be sent to the client, takes the current timestamp as the end timestamp of the request. The interval between the start and end timestamps is the response latency of the request. This latency is captured by the controller component on each object storage server and sent to the controller component on the metadata server. The latter collects the response latency of every read request from the controller components on all object storage servers and, at every fixed cycle T, computes the average of all read response latencies within that cycle. This average serves as the read response latency reference value for the cycle.
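The per-cycle averaging on the metadata server side can be sketched as below. The class name `LatencyWindow` and its methods are hypothetical, timestamps are plain floats in seconds, and returning `None` for a cycle with no read requests is an assumption the text does not address.

```python
class LatencyWindow:
    """Collects read response latencies reported during one cycle T."""

    def __init__(self):
        self._samples = []

    def record(self, start_ts, end_ts):
        # Latency of one request: end timestamp minus start timestamp.
        self._samples.append(end_ts - start_ts)

    def close_cycle(self):
        """Return the average latency of the cycle and start a new one."""
        if not self._samples:
            return None  # assumption: no reference value without samples
        average = sum(self._samples) / len(self._samples)
        self._samples.clear()
        return average
```

A timer firing every T seconds would call `close_cycle()` and hand the result to the controller logic as the cycle's read response latency reference value.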

The deduplication ratio is the ratio of the volume of redundant data removed to the total volume of data stored before deduplication. After the deduplicator on an object storage server completes a GFD or LCD operation on a file, it sets the flag in the deduplication candidate list to "clean" and also appends the volume of redundant data removed to that file's entry. When the metadata server collects a fully traversed list, it sums the volume of data removed in that list and divides it by the total size of all files in the distributed primary storage system, which can be obtained directly from the global file index, to obtain the real-time deduplication ratio.
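As a small worked example of this ratio, the function below sums the removed volume recorded on each entry of a returned list and divides by the total stored volume. The key `removed_bytes` and the function name are hypothetical, and returning 0.0 for an empty system is an assumption.

```python
def dedup_ratio(processed_list, total_bytes):
    """Removed volume in one processed list divided by total stored volume."""
    removed = sum(entry.get("removed_bytes", 0) for entry in processed_list)
    if total_bytes == 0:
        return 0.0  # assumption: an empty system has ratio 0
    return removed / total_bytes
```

For instance, a list whose entries record 100 and 40 removed bytes against a total of 1400 stored bytes gives 140 / 1400 = 0.1.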

In each cycle T, the controller on the metadata server checks the read response latency and the deduplication ratio of the whole distributed primary storage system, and then adjusts the deduplicator dynamically against the requirement ranges in the SLA, as shown in Figure 2. The main operations are as follows:

The controller starts from the SLA with the lowest priority level. If the current read response latency meets or is better than the requirement range of the current SLA, and the current deduplication ratio reaches or exceeds the upper limit of the deduplication ratio range of the current level, the deduplicator is allowed to continue working normally and the current SLA priority level is raised by one;

If the current read response latency meets or is better than the requirement range of the current level, but the current deduplication ratio is below the lower limit of the deduplication ratio range of the current level, the deduplicator is allowed to continue working normally and the SLA stays at the current level;

If the current read response latency does not meet the current requirement range, there are two cases: a) if the latency exceeds 1.1 times the upper limit of the read latency requirement range of the current level, the controller stops all deduplicator operations; if the system's read response latency has not fallen back below 1.1 times that upper limit for three consecutive cycles, the current SLA level is lowered by one; b) if the latency exceeds the upper limit of the current requirement range but is below 1.1 times that limit, the controller stays at the current level and instructs the deduplicator to stop LCD and keep only GFD.
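The per-cycle rules above can be condensed into a small state machine. This is a sketch under stated assumptions rather than the patented controller: each SLA level is reduced to a hypothetical tuple `(latency_upper_ms, ratio_lower, ratio_upper)`, the list is ordered from lowest priority (index 0) to highest, and the case the text leaves open (latency within range, ratio between the lower and upper limits) is treated as "keep working, stay at the current level".

```python
from dataclasses import dataclass


@dataclass
class Controller:
    slas: list            # (latency_upper_ms, ratio_lower, ratio_upper) per level
    level: int = 0        # start at the lowest-priority SLA
    over_cycles: int = 0  # consecutive cycles above 1.1x the latency limit

    def step(self, latency_ms, ratio):
        """Decide which deduplication tiers may run in the next cycle."""
        upper, _ratio_lo, ratio_hi = self.slas[self.level]
        if latency_ms > 1.1 * upper:
            # Case a): stop everything; drop a level after three such cycles.
            self.over_cycles += 1
            if self.over_cycles >= 3:
                self.level = max(self.level - 1, 0)
                self.over_cycles = 0
            return {"gfd": False, "lcd": False}
        self.over_cycles = 0
        if latency_ms > upper:
            # Case b): slightly over the limit; keep GFD, pause LCD.
            return {"gfd": True, "lcd": False}
        if ratio >= ratio_hi:
            # Latency fine and ratio target reached: move up one level.
            self.level = min(self.level + 1, len(self.slas) - 1)
        # Ratio below the upper limit (assumption for the mid-range): stay.
        return {"gfd": True, "lcd": True}
```

Calling `step()` once per cycle T with the averaged read latency and the current deduplication ratio yields the permissions the deduplicator checks before each GFD or LCD pass.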

The above description of the embodiments is provided so that those of ordinary skill in the art can understand and apply the present invention. Those skilled in the art can readily make various modifications to the above embodiments and apply the general principles described here to other embodiments without creative effort. The present invention is therefore not limited to the above embodiments, and improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the scope of protection of the present invention.

Claims (7)

1. A redundant data deduplication method based on file semantics and real-time system status, as follows:
The read response latency and deduplication ratio of a distributed storage system are detected periodically; according to the read response latency and deduplication ratio of the system at the current time, the deduplicator is adjusted using the following SLA-based dynamic mechanism:
According to the SLA currently referenced by the system, determine whether the read response latency of the system at the current time exceeds 1.1 times the upper limit of the read response latency interval of that SLA:
If so, the deduplicator stops performing GFD and LCD on the system in the next cycle; if not, determine whether the read response latency of the system at the current time is less than the upper limit of the read response latency interval of that SLA:
If not, the deduplicator keeps performing GFD on the system in the next cycle and stops performing LCD; if so, determine whether the deduplication ratio of the system at the current time is less than the lower limit of the deduplication ratio interval of that SLA:
If so, the deduplicator performs GFD and LCD on the system normally in the next cycle; if not, the SLA currently referenced by the system is raised by one grade, and the determination is made again with the new SLA according to the above dynamic mechanism;
Wherein: GFD denotes global deduplication at the file level, LCD denotes local deduplication at the data block level, and the grades of the SLA are as follows:
The first-grade SLA has a read response latency interval of [0, 600 ms] and a deduplication ratio interval of [0.25, 1);
The second-grade SLA has a read response latency interval of [0, 600 ms] and a deduplication ratio interval of [0.1, 0.25];
The third-grade SLA has a read response latency interval of [600 ms, 750 ms] and a deduplication ratio interval of [0.25, 1);
The fourth-grade SLA has a read response latency interval of [600 ms, 750 ms] and a deduplication ratio interval of [0.1, 0.25];
The fifth-grade SLA has a read response latency interval of [750 ms, 900 ms] and a deduplication ratio interval of [0.25, 1);
The sixth-grade SLA has a read response latency interval of [750 ms, 900 ms] and a deduplication ratio interval of [0.1, 0.25];
The seventh-grade SLA has a read response latency interval of [900 ms, 1200 ms] and a deduplication ratio interval of [0.25, 1);
The eighth-grade SLA has a read response latency interval of [900 ms, 1200 ms] and a deduplication ratio interval of [0.1, 0.25].
2. The redundant data deduplication method according to claim 1, characterized in that: if the read response latency of the system at the current time exceeds 1.1 times the upper limit of the read response latency interval of the SLA the system is currently at, the deduplicator stops performing GFD and LCD on the system in the next cycle; if, counting the next cycle, the deduplicator has stopped performing GFD and LCD on the system for three consecutive cycles, the SLA currently referenced by the system is lowered by one grade.
3. The redundant data deduplication method according to claim 1, characterized in that: the deduplicator deduplicates the system using a priority-based hierarchical redundant data deduplication scheme, the detailed process being as follows:
(1) based on file semantics, calculate the deduplication priority of each file in the system;
(2) extract from the global file index of the system the file location records of the n files with the highest deduplication priority, n being a natural number greater than 1;
(3) perform hierarchical redundant data deduplication on these n files according to the file location records:
First, perform GFD on the n files and then update the global file index; initially, the file location records of the n files are marked "dirty", and the file location record of each file from which GFD has removed redundant data is then marked "clean";
Then, perform LCD on the files whose file location records are still marked "dirty" after GFD, and then update the local data block index;
(4) repeat steps (1) to (3).
4. The redundant data deduplication method according to claim 3, characterized in that: in step (1), the deduplication priority of each file is calculated by the following formula:
$\rho = \alpha_1 H + \alpha_2 G + \alpha_3 E$
Wherein: $\rho$ is the deduplication priority of the file, $H$ is the priority value of the file's last access time, $G$ is the priority value of the file size, $E$ is the priority value of the file type, and $\alpha_1$ to $\alpha_3$ are the weight coefficients corresponding to $H$, $G$ and $E$ respectively.
5. The redundant data deduplication method according to claim 4, characterized in that: the priority value H is determined according to the following criteria:
If the time since the file's last access is more than 1 month, H=7;
If the time since the file's last access is more than 15 days and less than or equal to 1 month, H=6;
If the time since the file's last access is more than 7 days and less than or equal to 15 days, H=5;
If the time since the file's last access is more than 3 days and less than or equal to 7 days, H=4;
If the time since the file's last access is more than 1 day and less than or equal to 3 days, H=3;
If the time since the file's last access is more than 12 hours and less than or equal to 1 day, H=2;
If the time since the file's last access is more than 6 hours and less than or equal to 12 hours, H=1;
If the time since the file's last access is less than or equal to 6 hours, H=-1.
6. The redundant data deduplication method according to claim 4, characterized in that: the priority value G is determined according to the following criteria:
If the file size is more than 1 GB, G=7;
If the file size is more than 512 MB and less than or equal to 1 GB, G=6;
If the file size is more than 256 MB and less than or equal to 512 MB, G=5;
If the file size is more than 64 MB and less than or equal to 256 MB, G=4;
If the file size is more than 8 MB and less than or equal to 64 MB, G=3;
If the file size is more than 1 MB and less than or equal to 8 MB, G=2;
If the file size is more than 128 KB and less than or equal to 1 MB, G=1;
If the file size is less than or equal to 128 KB, G=-1.
7. The redundant data deduplication method according to claim 4, characterized in that: the priority value E is determined according to the following criteria:
If the file type is backup log, E=7;
If the file type is mirror image, E=6;
If the file type is project, E=5;
If the file type is video, E=4;
If the file type is audio, E=3;
If the file type is document, E=2;
If the file type is picture, E=1;
If the file type is other, E=-1.
CN201510435945.0A 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status Active CN105068757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510435945.0A CN105068757B (en) 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510435945.0A CN105068757B (en) 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status

Publications (2)

Publication Number Publication Date
CN105068757A CN105068757A (en) 2015-11-18
CN105068757B true CN105068757B (en) 2017-12-22

Family

ID=54498139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510435945.0A Active CN105068757B (en) 2015-07-23 2015-07-23 A kind of redundant data De-weight method based on file semantics and system real-time status

Country Status (1)

Country Link
CN (1) CN105068757B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832406B (en) * 2017-11-03 2020-09-11 北京锐安科技有限公司 Method, device, equipment and storage medium for removing duplicate entries of mass log data
CN107819697B (en) * 2017-11-27 2020-03-27 锐捷网络股份有限公司 Data transmission method, switch and data center
CN111291000B (en) * 2018-05-18 2023-11-03 腾讯科技(深圳)有限公司 Blockchain-based file acquisition methods, equipment and storage media
CN110413235B (en) * 2019-07-26 2020-07-24 华中科技大学 SSD (solid State disk) deduplication oriented data distribution method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102523285A (en) * 2011-12-15 2012-06-27 杭州电子科技大学 Storage caching method of object-based distributed file system
CN102662859A (en) * 2012-03-14 2012-09-12 北京神州数码思特奇信息技术股份有限公司 Data cache system based on service grade and method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7689602B1 (en) * 2005-07-20 2010-03-30 Bakbone Software, Inc. Method of creating hierarchical indices for a distributed object system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《基于二分图匹配的语义Web服务发现方法》;邓水光等;《计算机学报》;20080831;第31卷(第8期);1364-1375 *

Also Published As

Publication number Publication date
CN105068757A (en) 2015-11-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant