CN100505762C - Distributed multi-level cache system for object network storage - Google Patents
- Publication number: CN100505762C
- Authority: CN (China)
- Prior art keywords: cache, level, queue, access, client
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Description
Technical Field
The invention belongs to the field of computer storage technology, and in particular relates to a distributed multi-level cache system suitable for object-based network storage.
Background
The next-generation Internet is faster, larger in scale, and offers higher security and quality of service, which requires network storage systems built on it to be correspondingly efficient, scalable, secure, and capable of high quality of service. Traditional storage systems built on blocks and files (such as NAS and SAN) can no longer meet these requirements. Professor Paterson of the University of California, Berkeley, together with Carnegie Mellon University and other universities and research institutions, proposed an object-based network storage interface protocol to facilitate the construction of efficient large-scale network storage systems.
Chinese invention patent publication CN 1728665A discloses a method for building an object-based storage system. The object-based network storage system described in that invention includes I metadata servers, N object storage nodes and M clients connected by a computer network; its object file system is based on the standard Linux ext2 file system, with journaling file system features.
A large-scale network storage system requires an efficient parallel file system to match it in order to achieve high bandwidth, scalability, security, and intelligence. As the I/O subsystem, the file system in many cases becomes the bottleneck to further performance improvement. Designing a high-performance parallel file system must be considered from two aspects: reducing the number of disk accesses as much as possible, and simplifying the network access protocol to reduce protocol overhead. File system caching is a mechanism that minimizes the number of disk accesses by the file system, improving file system performance to make up for the shortcomings of disk performance; at the same time, a well-designed cache system can also simplify the network access protocol.
At present, object-based storage file systems have become a research hotspot in high-performance network storage, producing systems such as Lustre from Cluster File Systems, the ActiveScale file system from Panasas, and OBFS, the object-based file system developed at the University of California, Santa Cruz. Each of these file systems has its own characteristics, and each touches on file system caching, but they mainly study the caching strategy of one particular aspect. The resulting cache systems lack integrity as a whole, leading to complex cache access control protocols and long access request response times. Another way to implement caching is to run the cache as an independent device. Such a scheme manages the cache centrally and makes certain replacement and lookup strategies easy to implement, but the independent device increases cost and is also a single point of failure for the system.
Summary of the Invention
The purpose of the present invention is to provide a distributed multi-level cache system suitable for object-based network storage. The system overcomes the shortcomings of existing cache systems, namely complex access control protocols and long access request latency. By establishing three levels of cache at the clients and object storage nodes, it provides a complete cache system with its caching strategies, simplifies the network access protocol between nodes, reduces the load on the metadata server, shortens the response time of data access requests, improves the I/O processing capacity of the system, and lowers the system construction cost.
The distributed multi-level cache system for object-based network storage provided by the present invention is characterized in that: the system includes a zero-level cache and a first-level cache located in the client, and a second-level cache located in the object storage node. The zero-level cache caches metadata originating from the metadata server, the first-level cache is a data cache that directly serves application processes, and the second-level cache is a data cache that directly serves clients.
The zero-level cache management module is responsible for lookup and replacement of metadata in the zero-level cache, and for accessing metadata according to the network access control protocol.
The first-level cache management module controls the first-level cache and is responsible for requesting, allocating, and reclaiming shared memory space, as well as for the replacement of cache units.
The second-level cache management module is responsible for unified multi-queue replacement of the second-level caches across object storage nodes.
The present invention has the following features:
(1) The zero-level metadata cache resides on the client. On a cache hit, one access to the metadata server is saved compared with a PVFS-style parallel file system, reducing network traffic. A simplified metadata access protocol is designed that solves the metadata cache consistency problem across multiple clients, speeds up metadata access, shortens service request response time, and improves the I/O throughput of the whole file system.
(2) The first-level data cache is implemented with shared memory, which reduces the number of memory copies during inter-process communication and enables data sharing among application processes. The first-level cache uses a hash-based LFU-DA algorithm, which retains the high hit rate of LFU-DA while greatly reducing the average lookup time for cached objects.
(3) The second-level cache in the object storage node can postpone and reduce direct disk accesses. Based on the object access characteristics of object storage nodes, the concept of an access group is proposed, and a multi-queue unified replacement algorithm is designed that takes both access time and access frequency into account, so that the cache hit rates of the individual object storage nodes tend toward uniformity and the overall hit rate of the storage node caches is improved.
(4) Together, the three cache levels form a distributed multi-level cache system suitable for object-based network storage, which greatly improves the efficiency of the storage system, reduces system latency, and raises the system data transfer rate.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the distributed multi-level cache system for object-based network storage according to the present invention;
Fig. 2.1 is a schematic diagram of the file reading process of the file system before improvement;
Fig. 2.2 is a schematic diagram of the file reading process after the metadata cache is established on the client in the present invention;
Fig. 3.1 is a schematic diagram of the cache state before a cache block is allocated, after the first-level cache is established on the client in the present invention;
Fig. 3.2 is a schematic diagram of the cache state after a cache block is allocated, after the first-level cache is established on the client in the present invention;
Fig. 3.3 is a schematic diagram of the cache state before a cache block is released, after the first-level cache is established on the client in the present invention;
Fig. 3.4 is a schematic diagram of the cache state after a cache block is released, after the first-level cache is established on the client in the present invention;
Fig. 4 is a flow chart of data access control in the present invention.
Detailed Description
As shown in Fig. 1, in an object-based network storage system, the metadata server 1, object storage nodes 2 and clients 3 are connected by a computer network. The data access process is as follows: client 3 sends an access request to metadata server 1; metadata server 1 verifies the client's identity and, after confirmation, queries its local metadata. If the metadata exists, metadata server 1 sends an access request test packet to the corresponding object storage node 2; after object storage node 2 confirms that the object is open, it returns to metadata server 1 the authorization and the storage object mapping table requested by the client. After client 3 receives this information from metadata server 1, it establishes a connection with the object storage node and performs the data transfer.
The distributed multi-level cache system of the present invention includes the zero-level cache 6 and first-level cache 8 in client 3, and the second-level cache 4 in object storage node 2. All three cache levels are implemented in main memory. The zero-level cache 6 in client 3 caches metadata originating from metadata server 1; on a hit it saves one access to metadata server 1, and the high hit rate of the metadata cache greatly reduces network traffic. The first-level cache 8 is a data cache that directly serves application processes; a hit in it entirely eliminates the network cost between client 3 and object storage node 2, improving system performance. The second-level cache 4 resides in object storage node 2 and directly serves client 3; on a hit it avoids accessing the slow physical storage device, speeding up data access.
The three cache levels of the present invention are each controlled by their own cache management module. The zero-level cache management module 7 is responsible for lookup and replacement of metadata in the zero-level cache 6 and, on a zero-level cache miss, for accessing the metadata server according to the network access control protocol. Metadata lookup uses a cache hash table built with the global path name corresponding to each metadata entry as the hash parameter, to speed up lookup; the metadata replacement policy uses a FIFO (First In First Out) queue. Under the metadata access control protocol, the zero-level cache 6 is consulted first: on a hit, a service request is sent directly to metadata server 1; on a miss, a metadata query request is sent first, and the service request is sent after confirmation. The first-level cache management module 9 is responsible for requesting, allocating, and reclaiming shared memory space, and for the replacement of cache units. The first-level cache 8 is implemented with shared memory; the system manages the request, allocation and reclamation of cache space through the first-level cache management module 9, and replacement of cache units in the first-level cache 8 uses a hash-based LFU-DA (LFU with Dynamic Aging) algorithm. The second-level cache management module 5 is responsible for unified replacement of the second-level caches 4 across object storage nodes 2, using a multi-queue unified replacement algorithm, the MQU (Multi-Queue United) algorithm.
The zero-level cache 6 is a metadata cache. In an object-based storage system, metadata includes file-to-object mapping information, object storage node information, and attribute information of storage objects. Although object metadata (attributes) in an object-based storage system is rich, metadata server 1 stores only the object attributes relevant to users (such as the file name to object ID mapping and object storage node location information). These attributes occupy little space, usually only one to two hundred bytes. The zero-level cache is used by clients to check whether a file exists, so it only needs to keep the file name to object ID mapping. A zero-level cache of 1 MB in main memory can therefore hold nearly ten thousand queryable storage object metadata entries; given ever-growing main memory sizes this cost is worthwhile, and it guarantees a very high hit rate for the metadata cache. With this many metadata entries cached, replacement happens infrequently, so the replacement algorithm uses a FIFO queue, whose advantage is simple implementation. However, the large number of metadata entries makes lookup harder, and the performance of the lookup algorithm determines the efficiency of the zero-level cache 6. The present invention therefore uses a hash table, the most efficient lookup structure, built with the global path name of each metadata entry as the hash parameter, to speed up lookup. The zero-level cache is empty when the client starts; it obtains a copy of the required metadata from the metadata server the first time the client accesses remote data.
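The combination described above, a hash-table lookup keyed by the global path name with FIFO replacement, can be sketched in Python. This is a minimal illustration, not the patent's implementation; the class and method names, and the 10,000-entry default capacity taken from the 1 MB estimate above, are assumptions.

```python
from collections import OrderedDict

class MetadataCache:
    """Zero-level metadata cache sketch: hash lookup by global path name,
    FIFO replacement. ~10,000 entries fit in roughly 1 MB at 100-200
    bytes per entry, per the estimate in the text."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as the FIFO queue

    def lookup(self, global_path):
        # Hash-table lookup; a hit saves one round trip to the metadata server.
        return self.entries.get(global_path)

    def insert(self, global_path, object_id):
        if global_path in self.entries:
            self.entries[global_path] = object_id  # refresh value, keep FIFO position
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[global_path] = object_id
```

With a capacity of 2, inserting three paths evicts the first-inserted one regardless of how often it was looked up, which is exactly the FIFO behavior the text chooses for its simplicity.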
In an object-based network storage system, when a client application process accesses data, it first performs a path check on the file to be accessed (the data access form presented to the application is still a file). If it is a local file, local operation functions are called to handle it; otherwise, the path of the file is converted into the unique path name of the global file system, and a connection with the metadata server is then established. In an object-based network storage system, data is stored centrally on object storage nodes, so metadata operations in the object file system are very frequent. The distributed file system for object network storage involved in the present invention is designed with reference to PVFS; its access process is clear and its implementation simple, but PVFS requires two accesses to the metadata server for every metadata operation, one to query and one to fetch the actual metadata. This incurs substantial network communication overhead and cannot meet the requirements of an object file system, so the data access protocol must be improved. The improvement is to establish a zero-level cache on the client.
The data access process for an object storage node without a zero-level cache is shown in Fig. 2.1. From the client's perspective, it consists of the following steps:
A1: The client converts the path of the file to be accessed into the unique path name of the global file system, establishes a connection with the metadata server, and sends a query request;
A2: The metadata server authenticates the client's access. If it is legal, it queries the metadata, confirms whether the accessed file exists, and sends a response to the client; otherwise the client's access is illegal and the process ends;
A3: The client receives the response; if the data to be accessed exists, it sends a read-object request to the metadata server, otherwise it reports an error to the application process and the process ends;
A4: After the metadata server receives the client's read-object request, it sends the object ID to the relevant object storage nodes;
A5: After an object storage node verifies the metadata server's access request, it opens the object and sends a response to the metadata server;
A6: After receiving the response, the metadata server replies to the client with the client's authorization, the file to storage object ID mapping table, and the location information of the storage object nodes;
A7: After the client receives the metadata server's reply, it sends a read-object request command packet to the object storage nodes corresponding to the received metadata and establishes connections with them;
A8: The object storage node verifies the user request and, after confirmation, sends the data object to the client until the data transfer is complete.
It can be seen that the above data access control protocol involves two network communications between the client and the metadata server, one network communication with each relevant object storage node, and one network communication between the metadata server and each storage node. The resulting network communication overhead is large and severely limits the improvement of system performance.
A detailed analysis of the above protocol shows that it can be simplified. Building a zero-level metadata cache on the client, as in the present invention, is one way to simplify the above network access control protocol. The simplified network access control protocol is shown in Fig. 2.2 and consists of the following steps:
B1: The client converts the path of the file to be accessed into the unique path name of the global file system; the application process queries the zero-level cache to determine whether it hits; on a hit, go to step B4;
B2: On a zero-level cache miss, the client establishes a connection with the metadata server and sends a metadata query request;
B3: The metadata server authenticates the client's access. If it is legal, it queries the metadata, confirms whether the accessed file exists, sends a response to the client, and goes to step B5;
B4: On a client zero-level cache hit, the application process is answered directly;
B5: After the client application process receives the response, if the accessed object exists it sends a read-object request to the metadata server; otherwise an error is reported and the process ends;
B6: After the metadata server receives the client's read-object request, it sends the object ID to the relevant object storage nodes;
B7: After an object storage node verifies the metadata server's access request, it opens the object and sends a response to the metadata server;
B8: After receiving the response, the metadata server replies to the client with the client's authorization, the file to storage object ID mapping table, and the location information of the storage object nodes, and the zero-level cache is updated at the same time;
B9: After the client receives the metadata server's reply, it sends a read-object request command packet to the object storage nodes corresponding to the received metadata and establishes connections with them;
B10: The object storage node verifies the user request and, after confirmation, sends the data object to the client until the data transfer is complete.
From the above data network access control protocol it can be seen that, on a hit, the zero-level cache saves one access to the metadata server; the very high zero-level cache hit rate greatly reduces the network overhead of metadata access.
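The client-side portion of the simplified protocol (the existence check in steps B1-B4) can be sketched in Python. The `MetadataServer` class below is a hypothetical stand-in for the real RPC interaction of steps B2-B3, introduced only so the sketch is self-contained; it is not part of the patent.

```python
class MetadataServer:
    """Hypothetical stand-in for the metadata server (steps B2-B3).
    A real client would issue a network query; here a dict plays the
    server's metadata table, and `queries` counts round trips."""

    def __init__(self, table):
        self.table = table
        self.queries = 0

    def query(self, path):
        self.queries += 1
        return self.table.get(path)

def check_file_exists(path, cache, server):
    """Steps B1-B4: consult the zero-level cache first; only a miss
    costs a round trip to the metadata server."""
    meta = cache.get(path)            # B1: query the zero-level cache
    if meta is not None:
        return meta                   # B4: hit, answer the application directly
    meta = server.query(path)         # B2-B3: miss, one query round trip
    if meta is None:
        raise FileNotFoundError(path) # the accessed file does not exist
    cache[path] = meta                # keep a copy for later hits
    return meta
```

Calling `check_file_exists` twice for the same path costs only one server query: the second call is answered from the zero-level cache, which is exactly the saved round trip the text describes.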
Establishing caches generally raises cache consistency problems: since every client has a zero-level cache, inconsistencies between caches are inevitable, and failure to handle cache consistency, especially for metadata, would have disastrous consequences for the system. The simplified network access control protocol of the present invention solves the metadata cache inconsistency problem. Metadata changes caused by any operation on data objects must be recorded in the metadata server or the object storage nodes. The consistency of object metadata is guaranteed by the client's second metadata access, steps B5-B8, because the actual metadata for all operations is returned from the metadata server in step B8 and is never read directly from the local zero-level cache. When a client opens an object or performs any other metadata-related operation, it must access the metadata server at least once, and any operation a user performs on an object must be reported to the metadata server, keeping the object metadata correct at all times; object storage nodes accept open requests only from the metadata server. These two rules guarantee metadata consistency, so even if the metadata in multiple zero-level caches is inconsistent, no problems arise. Suppose that while a client holds an object's metadata in its zero-level cache, the metadata is changed by another client's operation, making the cached copy stale. The purpose of the first metadata access, steps B1-B4, is only to check whether the target file exists; the second metadata access, steps B5-B8, does not obtain metadata from the stale zero-level cache. On the contrary, the metadata obtained in steps B5-B8 also updates the zero-level cache, thereby avoiding any consequences that zero-level cache inconsistency could have for the system.
The first-level cache in the client of the present invention is implemented with shared memory. The daemon process in the first-level cache management module is responsible for requesting, allocating and reclaiming the shared memory space, but does not handle the read and write requests of other processes. Application processes map the shared memory into their own virtual address space; when they need to read or write data, they access the shared-memory cache directly, avoiding large amounts of inter-process communication and repeated memory copies. When the client starts, the daemon requests one block of shared memory from the system, divides it into many physically contiguous small data blocks according to the cache block size, and manages them with a cache block array. The system additionally maintains an allocatable-block array of the same size as the cache block array, indicating which cache block array items can currently be allocated. The value of an allocatable-block array item is an index into the cache block array: a value of -1 means the block it represented has been allocated, while a value of s means cache block s is available for allocation. When the system is initialized, all cache blocks are free, and the value of the i-th allocatable-block array item is i. For ease of management, two numbers are used, the allocation number and the free number, to indicate the free items available in the allocatable-block array: the allocation number indicates the first block available for allocation, and the free number indicates the last free block and also the position in the allocatable-block array at which a reclaimed block will be inserted. If the allocation number equals the free number, there is no free cache space, and room must be made through replacement.
The first-level cache management module daemon requests, allocates and reclaims cache space in the following steps:
(1) Request a block of cache space from the system and divide it into blocks according to the specified cache block size; represent the cache space with a cache block array whose size is the total number of blocks in the requested space;
(2) The daemon maintains an allocatable-block array of the same size as the cache block array, with the allocation number and free number indicating the head and tail of the free items in the allocatable-block array, respectively;
(3) When an application process accesses the shared cache space, go to step (4); when an application process releases cache space, go to step (5);
(4) When accessing the shared cache space, if the access hits, perform the read or write directly; otherwise, first determine whether a free cache block is available by checking whether the allocation number equals the free number. If they are equal, go to step (4.1); otherwise go to step (4.2);
(4.1) Equal allocation and free numbers mean no free block is available, so a cache block is replaced; after replacement the read or write is performed, and the work is done;
(4.2) Unequal allocation and free numbers mean free blocks remain. Read the value of the allocatable-block array item indicated by the allocation number; this value is the index of the first free cache block available for allocation. Allocate the cache block at that index to the application process, set the value of the allocatable-block array item indicated by the allocation number to -1, then add 1 to the allocation number and take the remainder modulo the total number of cache blocks; the remainder is the new allocation number, which indicates the next free block, and the work is done;
(5) When an application process releases a cache block, first add 1 to the free number and take the remainder modulo the total number of cache blocks to obtain the new free number, then set the value of the allocatable-block array item indicated by the free number to the number of the released block. This completes the reclamation of the cache block, and the work is done.
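The allocation and reclamation steps above can be sketched as a small Python class. This is an illustrative sketch, not the patent's shared-memory implementation: the class and method names are assumptions, and the replacement step of (4.1) is left to the caller (signalled by returning `None`).

```python
class BlockAllocator:
    """Sketch of the allocatable-block array scheme for n cache blocks.
    A slot value of -1 means 'consumed'; any other value is the index
    of a free cache block."""

    def __init__(self, n):
        self.n = n
        self.slots = list(range(n))  # allocatable-block array: item i = i initially
        self.alloc_no = 0            # first item available for allocation
        self.free_no = n - 1         # last free item; freed blocks are inserted after it

    def alloc(self):
        # Step (4): equal numbers mean no free block; the caller must replace.
        if self.alloc_no == self.free_no:
            return None
        block = self.slots[self.alloc_no]             # step (4.2): read the free index
        self.slots[self.alloc_no] = -1                # mark the item consumed
        self.alloc_no = (self.alloc_no + 1) % self.n  # advance modulo total blocks
        return block

    def free(self, block):
        # Step (5): advance the free number, then record the released block.
        self.free_no = (self.free_no + 1) % self.n
        self.slots[self.free_no] = block
```

Note that, as in a circular buffer, distinguishing "full" from "empty" by the test `alloc_no == free_no` means one slot always stays in reserve, so at most n-1 blocks are handed out at once.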
Figs. 3.1-3.4 show a concrete example of the first-level cache allocation and release process. Fig. 3.1 shows the state before the first allocation of a cache unit: each value in the allocatable-block array equals its index; the allocation number is 0, indicating that the first allocatable data block is item zero of the cache block array; the free number is n, indicating that the last allocatable data block is item n of the cache block array. Fig. 3.2 shows the state after the first block of the cache block array has been allocated: item zero of the allocatable-block array has the value -1, and the allocation number has advanced to 1, indicating that the next allocatable block is item one of the cache block array. Fig. 3.3 shows the state before a data block is released for the first time; before this, the first four blocks have been allocated. Fig. 3.4 shows the state after the data block has been released: after the second block of the cache block array is released, following the free number's indication, the value of item zero of the allocatable-block array is set to 2, indicating that the last allocatable block is item two of the cache block array, while the free number wraps around and advances to 0.
The above describes the implementation of first-level cache space request, allocation and reclamation in the present invention, performed by the daemon. The daemon does not handle individual read and write requests; those are handled by the hash-based LFU-DA (LFU with Dynamic Aging) algorithm implemented in the first-level cache management module.
LFU-DA is an improvement of the LFU algorithm. LFU is a replacement algorithm that gives priority to access frequency: it considers only that blocks accessed many times have a high chance of being accessed again soon, and ignores the time of their most recent access. In practice, a system often turns to other tasks after finishing one. Blocks that were accessed very frequently in a previous task may never be accessed again in the new task, yet because their access counts are large, the LFU algorithm cannot evict them, greatly reducing the utilization of the cache space. The LFU-DA algorithm adds an "age" attribute to cached data blocks and keeps adjusting it during accesses; it does not adjust the ages of all blocks on every access but only adjusts them dynamically at access time. The "age" Ki of a data block is computed from the block's access count and the "age" of the last replaced block, according to the following formula:
Ki = CI * Fi + K    (Formula 1)
where Fi is the number of times the data block has been accessed; for a block newly added to the cache, the access count F1 is 1. CI is the frequency factor, a tunable parameter usually set to 1, but it can be adjusted for different applications to obtain the best value for the current workload. K is a global variable with an initial value of 0; whenever a data block is evicted, K is set to the "age" of the evicted block. The LFU-DA algorithm always evicts the data block with the smallest "age". Practice shows that LFU-DA achieves a higher hit rate than LRU. Because LFU-DA considers both access time and access frequency, it performs well as a client cache replacement algorithm, with a clearly higher hit rate than algorithms that consider access time alone. However, when the file system cache is large, LFU-DA has a drawback: the cache queue becomes too long, and applications take too long to find data in the cache.
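Formula 1 and the eviction rule can be sketched as follows. This is a minimal illustration of LFU-DA under the definitions above (CI = 1, K initially 0, evict the smallest Ki); the class layout and the linear-scan eviction are simplifications for clarity, not the patent's implementation.

```python
class LFUDACache:
    """LFU with Dynamic Aging sketch: Ki = CI * Fi + K (Formula 1).
    K is raised to the evicted block's age on every eviction, so stale
    high-frequency blocks eventually become evictable."""

    def __init__(self, capacity, ci=1.0):
        self.capacity = capacity
        self.ci = ci       # CI, the frequency factor
        self.K = 0.0       # global "age" of the last evicted block, initially 0
        self.freq = {}     # Fi: access count per block
        self.age = {}      # Ki per block
        self.data = {}

    def get(self, block_id):
        if block_id not in self.data:
            return None
        self.freq[block_id] += 1                                # Fi grows on access
        self.age[block_id] = self.ci * self.freq[block_id] + self.K
        return self.data[block_id]

    def put(self, block_id, value):
        if block_id not in self.data and len(self.data) >= self.capacity:
            victim = min(self.age, key=self.age.get)  # evict the smallest "age" Ki
            self.K = self.age[victim]                 # dynamic aging: raise the floor
            for d in (self.data, self.freq, self.age):
                del d[victim]
        self.data[block_id] = value
        self.freq[block_id] = self.freq.get(block_id, 0) + 1    # new blocks start at F1 = 1
        self.age[block_id] = self.ci * self.freq[block_id] + self.K
```

Because a newly inserted block enters with age CI * 1 + K, and K only ever rises, new blocks are not immediately crowded out by old frequently-accessed ones, which is the aging behavior that distinguishes LFU-DA from plain LFU.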
The invention further improves the LFU-DA algorithm as follows: the over-long cache queue is split into m small queues by a chosen hash function. A hash value is computed from the block number of a cached block and the ID of the object the block belongs to, and the block is inserted into the queue indicated by that hash value. Each hash queue is itself an LFU-DA queue. This speeds up cache lookups. The argument is simple: if the average lookup time of the LFU-DA algorithm is t and the system's cache queue is split into m queues, the improved algorithm's average lookup time is roughly t/m.
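A minimal sketch of the hash split, assuming Python's built-in tuple hash stands in for the patent's unspecified hash function:

```python
M = 8  # number of sub-queues the long cache queue is split into

# The hash key combines the block number with the ID of the object the
# block belongs to, as the text describes.
def queue_index(object_id, block_no, m=M):
    return hash((object_id, block_no)) % m

queues = [[] for _ in range(M)]  # each sub-queue is itself an LFU-DA queue

def insert_block(object_id, block_no):
    queues[queue_index(object_id, block_no)].append((object_id, block_no))

def lookup_block(object_id, block_no):
    # Only one sub-queue (about 1/m of all cached blocks) is scanned,
    # which is where the roughly t/m average lookup time comes from.
    return (object_id, block_no) in queues[queue_index(object_id, block_no)]
```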
With a shared-memory cache structure, multiple processes may access the same cache block at the same time, so locks are needed for inter-process mutual exclusion. Lock granularity strongly affects cache performance: too coarse a granularity causes excessive consistency and contention overhead, while too fine a granularity causes excessive lock-management overhead. The cache in this invention uses queue-level lock granularity. The cache queue is divided into m small queues by the hash function, and a read-write mutex is created for each small queue. The queue-level granularity strategy is easy to manage: when the file system starts, the daemon statically allocates one lock per queue in advance, and thereafter all locks are managed and released by the daemon. In this embodiment the locks are implemented with semaphores.
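A sketch of queue-level locking with one binary semaphore per hash queue; Python threads stand in here for the patent's processes sharing memory:

```python
import threading

M = 8
queues = [[] for _ in range(M)]
# One binary semaphore per sub-queue, created up front as the daemon does.
queue_locks = [threading.Semaphore(1) for _ in range(M)]

def locked_insert(queue_no, block):
    # Acquire on entry, release on exit: only the touched queue is
    # serialized, so operations on different queues proceed in parallel.
    with queue_locks[queue_no]:
        queues[queue_no].append(block)
```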
The control flow of the first-level cache management module is as follows:
(1) The daemon applies for shared-memory space and maintains and manages the cache-block array and the allocatable-cache-block array;
(2) On a data-access request from an application process, check whether the first-level cache hits; on a hit, go to step (5);
(3) On a first-level cache miss, access the object storage node via the level-zero cache network access protocol to obtain the required data object, whose access frequency is initialized to 0. If the cache queue is full, go to step (4); otherwise place the data object in an empty cache unit and go to step (5);
(4) With the cache queue full, find the cache unit with the smallest age among all cache units, replace it with the data object, and set the global variable K to the age of the replaced block;
(5) Increment the data block's access frequency by 1, compute its new age using (Formula 1), compute the hash-value queue number, acquire the queue-level mutex, insert the cache unit into the new queue, and perform the read/write operation on the cache unit.
The second-level cache management module is responsible for unified replacement across the second-level caches of the object storage nodes, using a multi-queue unified replacement algorithm, MQU (Multi-Queue United).

In a large object-based network storage system, a file is often striped into multiple objects stored on different object storage nodes to increase parallel bandwidth. In high-performance computing (such as matrix computation), an operation often cannot begin until the data objects on all storage nodes have been obtained; if any one node has a large access latency, it becomes a single point of failure for the system and degrades computing performance. To solve this single-point-of-failure problem, the invention introduces the concept of an access group: data objects with the same attribute that are distributed across different object storage nodes form one access group. The MQU algorithm replaces the data objects of an access group as a unit, either retaining all of them or evicting all of them, thereby maintaining a high system hit rate.

Unlike the first-level cache on the client, accesses to the second-level cache on an object storage node do not exhibit the strong temporal and spatial locality that gives access-time-first algorithms their high hit rates in the first-level cache. Replacement algorithms based on access time, such as LRU and MRU, fail to consider access frequency; frequency-based algorithms such as LFU fail to consider access time and suffer from cache pollution; and although algorithms such as LFU-DA and MQ consider both access time and access frequency, they do not consider unified replacement across multiple storage nodes in a distributed environment, so in a distributed system with many storage nodes their cache hit rate and cache read/write response time fall short.

The MQU algorithm of this invention is designed on the basis of the MQ algorithm. During replacement, objects on different storage nodes that belong to the same access group are considered as a whole. Compared with traditional per-block replacement algorithms, MQU reduces the probability that any single node misses in the storage nodes' second-level caches, and the hit rates of the individual nodes become more uniform. Whether different objects belong to the same access group is determined through collection-object attributes: objects belonging to the same collection are in the same access group.

The MQU algorithm has three parts: cache-unit access, dynamic adjustment of the cache queues, and selection of replacement victims. Cache-unit access is the main body of the algorithm and invokes the other two parts. The algorithm is described as follows:
MQU uses the access frequencies of the objects in an access group as the basis for unified replacement. When a storage node replaces an access group as a unit, it must know the access frequencies of the group's objects on the other storage nodes; this is achieved by having every object storage node maintain cache information for all storage nodes. MQU manages the cache information belonging to each storage node with multiple queues: q LRU queues per storage node, where the value of q is application-dependent and ranges from 4 to 8. The access frequency of an object determines which LRU queue it is inserted into. The cache information of each storage node is thus divided into q parts, one per level, so a replacement victim can quickly be found in the lowest-level cache queue. Each storage node therefore maintains p*q cache queues, where p is the number of object storage nodes.
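The p*q queue layout can be sketched as follows; deriving the queue level from log2 of the access frequency is an assumption borrowed from the MQ algorithm the text builds on:

```python
import math

Q = 4   # LRU queues per storage node (the text allows 4 to 8)
P = 3   # number of object storage nodes

# queues[node][level] is one LRU list, so each node maintains p*q lists
# covering the cache information of all p nodes at q levels each.
queues = [[[] for _ in range(Q)] for _ in range(P)]

def level_for(frequency, q=Q):
    # Higher access frequency maps to a higher queue level, capped at q-1.
    if frequency <= 0:
        return 0
    return min(int(math.log2(frequency)), q - 1)
```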
MQU dynamically adjusts a data object's queue placement according to how long the object has gone unaccessed in the cache. Each object is given a life span (lifeTime), and its expiration time (expireTime) is set to the current time (currentTime) plus the lifeTime. If an object passes its expiration time, that is, its expireTime is smaller than the current time, it is demoted to the next lower queue (which tolerates a shorter idle time), and data objects in the lowest queue may be evicted from the cache.
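The demotion rule can be sketched as follows, with time as an abstract tick counter and the lifeTime value an arbitrary assumption:

```python
LIFE_TIME = 5   # assumed lifeTime in ticks

def maybe_demote(obj, current_time):
    """obj carries a queue level and an expireTime, as in the text."""
    if obj["expire_time"] < current_time and obj["level"] > 0:
        obj["level"] -= 1                               # drop one queue level
        obj["expire_time"] = current_time + LIFE_TIME   # currentTime + lifeTime
    return obj["level"]
```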
When MQU evicts an object from the cache, it does not simply discard it: for the q cache-information queues of each storage node, a FIFO queue records the historical access frequency and cache-unit tag number of evicted objects, so that if an evicted data object is accessed again, its historical access frequency can be recovered. Replacing an access group as a whole requires frequent lookups of the access frequencies of the other data objects in the group; to reduce lookup time, MQU links all the data objects of an access group together.
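The history FIFO can be sketched as follows; the cache-unit tag number is omitted and the queue length is an arbitrary assumption:

```python
from collections import deque

HISTORY_SIZE = 4
# FIFO of (object_id, access_frequency) pairs for evicted objects; the
# bounded deque drops its oldest entry when full, giving FIFO behavior.
history = deque(maxlen=HISTORY_SIZE)

def remember_eviction(object_id, frequency):
    history.append((object_id, frequency))

def recalled_frequency(object_id):
    # If an evicted object is accessed again, resume from its old frequency.
    for oid, freq in history:
        if oid == object_id:
            return freq
    return 0   # unknown object: start from the initial frequency
```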
The control flow of the second-level cache management module is as follows:
(1) Apply for cache space and listen for client access requests;
(2) The object storage node establishes a connection with the client, and the client requests an operation on some data object a;
(3) The object storage node accesses the second-level cache and checks for a hit. On a hit, move object a from its current queue to the next higher queue (an object already in the highest queue stays put) and go to step (6); otherwise, on a second-level cache miss, go to step (4);
(4) Check whether the cache queue has a free block. If so, go to step (6); otherwise the cache queue is full, and the replacement victim is selected according to steps (4.1)-(4.5):
(4.1) Select the candidate victim: the first object in the storage node's lowest-level non-empty queue;
(4.2) If the candidate belongs to no access group, discard it directly and go to step (5); otherwise obtain the maximum access frequency over all objects in the candidate's access group;
(4.3) If that maximum access frequency is less than or equal to the candidate's access frequency, discard all objects of the access group and go to step (4.5);
(4.4) Otherwise set the access frequency of every object in the access group to that maximum, insert them at the tails of the corresponding queues, take the next object in the lowest-level non-empty queue as the new candidate, and go to step (4.2);
(4.5) Record the victim's access frequency in the FIFO queue and return the victim;
(5) With the cache queue full: if data object a is in the FIFO queue, obtain its access frequency from the FIFO queue; otherwise set its access frequency to the initial value 0;
(6) Read object a from the storage device, increment its access frequency by 1, compute the new queue number, insert a at the tail of the new queue, and compute a's expiration time; then dynamically adjust the storage node's second-level cache queues according to steps (6.1)-(6.3):
(6.1) Increment the current time by 1;
(6.2) In the current object storage node, take the q second-level cache queues in turn and select the first object of the current queue. If the current queue is the last-level queue, go to step (7);
(6.3) If that first object's expiration time is smaller than the current time, move it to the tail of the next lower cache queue, set its new expiration time to the current time plus the lifeTime, increment the current queue level by 1, and go to step (6.2);
(7) The operation ends.
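The victim-selection loop of steps (4.1)-(4.4) can be sketched as follows; the flat candidate list and the group map are assumed data layouts for illustration, and the FIFO bookkeeping of step (4.5) is omitted:

```python
def select_victims(candidates, groups):
    """candidates: (object_id, frequency) pairs, lowest-level queue first.
    groups: object_id -> set of (object_id, frequency) for its access group."""
    for obj_id, freq in candidates:
        group = groups.get(obj_id)
        if group is None:
            return [obj_id]                      # (4.2) ungrouped: discard alone
        max_freq = max(f for _, f in group)      # (4.2) hottest group member
        if max_freq <= freq:
            return [oid for oid, _ in group]     # (4.3) evict the whole group
        # (4.4) Promote the whole group to the maximum frequency and move on
        # to the next candidate (promotion on partner nodes is simplified here).
        groups[obj_id] = {(oid, max_freq) for oid, _ in group}
    return []
```

A group whose hottest member elsewhere in the system is still frequently accessed is kept and promoted, so no single node is left holding a cold fragment of a hot file.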
The distributed multi-level cache system of the invention makes good use of existing main-memory resources and achieves good performance at minimal cost. Figure 4 is a flowchart of data-access control with the distributed multi-level cache system of the invention in place.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNB2006100188340A CN100505762C (en) | 2006-04-19 | 2006-04-19 | Distributed multi-level cache system for object network storage |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN1852318A CN1852318A (en) | 2006-10-25 |
| CN100505762C true CN100505762C (en) | 2009-06-24 |
Family
ID=37133786
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| C14 | Grant of patent or utility model | ||
| GR01 | Patent grant | ||
| C17 | Cessation of patent right | ||
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090624; Termination date: 20120419 | |