
CN109840247B - File system and data layout method - Google Patents


Publication number
CN109840247B
Authority
CN
China
Prior art keywords
file system
cost
module
file
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811547400.9A
Other languages
Chinese (zh)
Other versions
CN109840247A (en
Inventor
王洋
夏明辉
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN201811547400.9A
Publication of CN109840247A
Priority to PCT/CN2019/121301 (WO2020125362A1)
Application granted
Publication of CN109840247B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a file system comprising a cost calculation module and a region division module. The cost calculation module calculates or estimates the access cost of file requests in the file system and outputs a cost model to the region division module. The region division module divides a file into different regions so that the total cost of the given accesses is minimized, and also obtains the stripe size corresponding to each region. The invention further provides a data layout method. Compared with the prior art, the beneficial effects are as follows: the file system supports region-level data layout by dividing a file into a set of optimal regions, and it can determine the optimal regions and their corresponding stripe sizes. The file system can therefore optimize the data layout of a hybrid storage system, adapt flexibly to workload changes and to the heterogeneity of the storage system, and significantly improve I/O performance.

Figure 201811547400

Description

File system and data layout method

Technical field

The invention belongs to the technical field of data layout, and in particular relates to a file system and a data layout method.

Background

As large-scale data-intensive applications multiply across application domains, I/O (input/output) performance is becoming the bottleneck of storage systems. To address this, many parallel file systems (PFS) have been introduced into high-performance storage systems, including OrangeFS, Lustre, GPFS, PanFS and PLFS. Each is briefly introduced below:

1. OrangeFS is a branch of the Parallel Virtual File System (PVFS). Like PVFS, it is a parallel file system designed for high-performance computing and high-performance data access. Compared with traditional PVFS, OrangeFS focuses on improving small-file performance, adding cross-server fault tolerance, and providing secure access control.

2. Lustre is a parallel file system for Linux clusters developed by HP, Intel and Cluster File Systems together with the US Department of Energy. Lustre uses a distributed lock management mechanism for concurrency control, and manages the communication links for metadata and file data separately.

3. GPFS stands for General Parallel File System. Originating at IBM, GPFS is a scalable, high-performance, general-purpose parallel file system based on shared disks. It provides parallel, high-speed, secure and reliable data access to all nodes in the storage system.

4. PanFS is a general-purpose parallel file system developed by Panasas. Its main application areas are currently similar to those of Lustre. PanFS is scalable and provides strong consistency through distributed locks.

5. PLFS is an open-source parallel checkpoint file system.

In summary, these parallel file systems can distribute data files across multiple servers, so a PFS allows the multiple tasks of a parallel application to access data files simultaneously with aggregated I/O bandwidth.

However, existing parallel file systems are not without shortcomings: they are poorly matched to hybrid storage systems built on new storage technologies. Before describing this mismatch, the hybrid storage systems themselves need explaining. With the development of new storage technologies, flash-based solid-state drives (SSD) are increasingly widely used. Compared with hard disk drives (HDD), SSDs offer high storage efficiency and fast response, but at high cost. On balance, a reasonable storage system should not consist entirely of HDDs, because their read/write and response speeds are slow, nor entirely of SSDs, because they are expensive. In other words, SSDs will not completely replace HDDs in a large cluster. A hybrid storage system that includes both SSD-based servers and HDD-based servers is therefore a preferred strategy, and it is especially practical for HPC systems with a limited budget. HPC is short for High Performance Computing cluster.

On the other hand, the efficiency of a parallel file system depends on an effective data file layout, that is, on how data files are distributed across the available nodes. Most existing layout schemes split a data file into fixed-size stripes distributed across multiple servers, and use those fixed-size stripes to provide concurrent data access from multiple servers, so that every server holds a share of the data. Although such schemes are simple to implement and widely used, they are clearly suited to storage systems built from homogeneous servers, not to hybrid storage systems.

When an existing parallel file system is applied to a hybrid storage system, the performance gap between SSD-based servers and HDD-based servers can significantly degrade its performance. SSD-based servers always outperform HDD-based servers and so need less I/O time to complete the same amount of data access. An existing layout scheme assigns the same stripe size to both kinds of server, which can cause severe load imbalance among the heterogeneous servers. In addition, complex I/O workloads can also compromise the efficiency of the I/O system.

Summary of the invention

In view of this, to solve the unreasonable data distribution that arises when an existing parallel file system (PFS) is paired with a hybrid storage system based on new storage technology, the present invention provides a file system. The file system includes an I/O tracer, a cost calculation module and a region division module that are electrically connected to one another. The I/O tracer provides the region division module with the I/O information it collects while the file system is running, and provides the cost calculation module with the file system configuration file it collects. The cost calculation module calculates or estimates the access cost of file requests in the file system and outputs a cost model to the region division module. The region division module uses the cost model to generate a division into regions that minimizes the total cost, divides the file into those regions, and obtains the stripe size corresponding to each region.

Preferably, the file system further includes a kernel part, which handles the exchange of information or data among the metadata server, the hybrid storage system and the client; the kernel part includes a FUSE module.

Preferably, the file system further includes a daemon module, which runs a daemon process in the background; the FUSE module acts as a proxy for the daemon.

Preferably, the file system further includes a data layout update module, connected to the daemon module, the I/O tracer, the region division module and the hybrid storage system respectively, which dynamically detects and updates region changes.

Preferably, the cost calculation module calculates the total cost of a request as $T = T_s + T_c + T_2$, where $T_s$ is the time of the two context switches between the FUSE module and the daemon module, $T_c$ is the copy time, and $T_2$ is the network and storage cost.

Preferably, the hybrid storage system includes solid-state-drive-based servers (SServer) and hard-disk-drive-based servers (HServer).

The copy time is calculated as $T_c(r, h, s) \approx 3(mh + ns)\,t_c$, where $t_c$ is the time to copy one unit of data from kernel space to user space, $h$ is the stripe size on an HServer, $s$ is the stripe size on an SServer, $m$ is the number of HServers and $n$ is the number of SServers.

The network and storage cost is calculated as $T_2 \approx T_e + \max\{h(t_h + t),\, s(t_s + t)\}$, where $t$ is the network data transmission time, $t_h$ and $t_s$ are the unit data transfer times on an HServer and an SServer respectively, and $T_e$ is the network connection time.
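The three formulas above can be evaluated directly. The following is a minimal sketch, not the patented implementation: the `CostParams` container and the function names are hypothetical, and `t` is treated as a per-unit network time because it multiplies the stripe sizes in the formula. The approximation for the copy time, as stated, does not depend on the request `r`, so `r` is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    """Hypothetical container for the symbols used in the cost formulas."""
    m: int      # number of HServers
    n: int      # number of SServers
    t_c: float  # time to copy one unit of data from kernel space to user space
    t: float    # network transmission time (assumed per unit of data)
    t_h: float  # unit data transfer time on an HServer
    t_s: float  # unit data transfer time on an SServer
    T_s: float  # time of the two FUSE/daemon context switches
    T_e: float  # network connection time


def copy_time(p, h, s):
    # T_c ~ 3 (m h + n s) t_c
    return 3 * (p.m * h + p.n * s) * p.t_c


def network_storage_time(p, h, s):
    # T_2 ~ T_e + max{ h (t_h + t), s (t_s + t) }
    return p.T_e + max(h * (p.t_h + p.t), s * (p.t_s + p.t))


def total_cost(p, h, s):
    # T = T_s + T_c + T_2
    return p.T_s + copy_time(p, h, s) + network_storage_time(p, h, s)
```

The `max` term reflects that the sub-requests run in parallel, so the slower of the two server classes determines $T_2$. Giving the SServers a larger stripe than the HServers is what allows both arguments of the `max` to be balanced.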

Preferably, the region division module obtains the minimum cost [image GDA0002731175200000041] of dividing $l$ events into $k$ regions starting from event $i$. The minimum cost [image GDA0002731175200000042] is calculated as:

[formula image GDA0002731175200000043]

In the formula, [image GDA0002731175200000044] [image GDA0002731175200000045] defines a region of size [image GDA0002731175200000046], and [image GDA0002731175200000047] denotes the cost of the first region of size $f$.
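The recurrence itself survives only as an image in this text, so it cannot be reproduced here. As an illustration of the kind of dynamic program the paragraph describes, the sketch below uses an assumed textbook form: the best split of $l$ events into $k$ regions starting at event $i$ is the cheapest choice of a first region of size $f$ plus the best split of the remaining $l - f$ events into $k - 1$ regions. The function `min_partition_cost` and the callback `region_cost` are hypothetical names; `region_cost` stands in for the cost model.

```python
from functools import lru_cache


def min_partition_cost(num_events, k, region_cost):
    """Minimum total cost of splitting events 0..num_events-1 into k contiguous regions.

    region_cost(i, f) is a caller-supplied cost of one region that starts at
    event i and covers f consecutive events. Returns (cost, region_sizes).
    """
    if not 1 <= k <= num_events:
        raise ValueError("need 1 <= k <= num_events")

    @lru_cache(maxsize=None)
    def best(i, l, parts):
        # Best way to split l events starting at event i into `parts` regions.
        if parts == 1:
            return region_cost(i, l), (l,)
        candidates = []
        for f in range(1, l - parts + 2):  # f = size of the first region
            rest_cost, rest_sizes = best(i + f, l - f, parts - 1)
            candidates.append((region_cost(i, f) + rest_cost, (f,) + rest_sizes))
        return min(candidates)

    return best(0, num_events, k)
```

With a toy cost that grows quadratically in region size, `min_partition_cost(6, 3, lambda i, f: f * f)` returns `(12, (2, 2, 2))`, that is, three equal regions.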

Preferably, the solid-state-drive-based servers and the hard-disk-drive-based servers stripe [image GDA0002731175200000048], yielding $h_i$ and $s_i$ respectively, where $s_i = \alpha h_i$ and $h_i$ is calculated as:

[formula image GDA0002731175200000049]

In the formula, $\alpha \ge 1$ is the scaling factor of the SServer relative to the HServer, and $B$ denotes the block size in the configuration.

The present invention also provides a data layout method, which includes:

Step S1: collect the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file; the file system configuration file is used to build the cost model, and the I/O information is used for region division.

Step S2: calculate or estimate the access cost of file requests to form a cost model.

Step S3: according to the cost model, generate a division into regions that minimizes the total cost, and divide the file into those regions.

Step S4: obtain the stripe size corresponding to each region.

Compared with the prior art, the embodiments of the present invention have the following beneficial effects:

The file system proposed by the invention supports region-level data layout by dividing a file into a set of optimal regions, and it can determine the optimal regions and their stripe sizes. It can therefore optimize the data layout of the hybrid storage system 100. The file system adapts flexibly to workload changes and server heterogeneity, significantly improving I/O performance.

Description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a prior-art data layout scheme using fixed-size stripes;

FIG. 2 is a schematic diagram of a data layout scheme based on region division;

FIG. 3 is a schematic diagram of a file system based on region-level data layout;

FIG. 4 is a schematic diagram of the working principle of the cost calculation module in an embodiment of the present invention;

FIG. 5 is a diagram of an application example of the file system in an embodiment of the present invention.

Detailed description

The above and other technical features and advantages of the present invention are described in more detail below with reference to the accompanying drawings.

In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, unless expressly and specifically defined otherwise.

In the present invention, unless otherwise expressly specified and limited, terms such as "installed", "linked", "connected" and "fixed" are to be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical, electrical or communicative; it may be direct or indirect through an intermediate medium; and it may be an internal connection between two elements or an interaction between two elements, unless otherwise expressly limited. A person of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention.

In the following description, for purposes of illustration rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of the present invention. It will be apparent to those skilled in the art, however, that the invention may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the invention.

In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.

The technical solutions of the present invention are described below through specific embodiments.

Embodiment

In general, most PFSs adopt one of three typical data layouts: 1-DH, 1-DV and 2-D. In the 1-DH layout, a client process can access data from all storage servers. In contrast, in the 1-DV layout a client process can access data from only a single storage server. The 2-D layout lies between the two: a client process accesses data from a subset of the storage servers.

First, compare the data layout of most PFSs with that of the region-level file system, taking two concurrent reads of a file of size 9x as an example. The two concurrent reads are defined as the first read 71 and the second read 72, and both proceed along the time axis 73.

With the data layout of a traditional file system, as shown in FIG. 1, the file is evenly partitioned and stored on each storage server with a stripe size of 3x. With the three servers finishing at the same time, each request completes in time 3x, and the two read requests together need time 6x. This layout ignores the differences between the storage media in a hybrid storage system, so the read/write efficiency of the higher-performance media cannot be fully exploited; in effect, the higher-performance media are forced to run in a degraded mode.

As shown in FIG. 2, the data layout scheme of the region-level file system has a clear advantage. The region-level file system RLFS divides file 7 into two regions, the first region 73 and the second region 74, and each region is striped across all servers with its own stripe size (x or 2x). The second read 72 is split into two parts and takes the time of reading 3x in total (3x = x + 2x), so for the second read 72 the two layouts perform the same. The time of the first read 71 under the region-level file system, however, is reduced to the time needed to read 2x.
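The arithmetic of this example can be checked with a small model. The sketch below makes several assumptions that are not in the source: three identical servers, round-robin stripe placement within each region, a request time equal to the largest amount of data any single server must deliver, and request ranges chosen only to reproduce the times quoted above (the text does not state which byte ranges reads 71 and 72 cover). The helper `per_server_load` is a hypothetical name.

```python
def per_server_load(start, end, regions, num_servers):
    """Amount of data each server must serve for the read [start, end).

    `regions` is a list of (region_start, region_end, stripe) tuples; inside a
    region, stripes are assigned to servers round-robin.
    """
    load = [0] * num_servers
    for r_start, r_end, stripe in regions:
        lo, hi = max(start, r_start), min(end, r_end)
        pos = lo
        while pos < hi:
            index = (pos - r_start) // stripe            # stripe number within the region
            stripe_end = r_start + (index + 1) * stripe
            chunk = min(hi, stripe_end) - pos
            load[index % num_servers] += chunk
            pos += chunk
    return load


X = 1                                  # one unit of data, the "x" in the text
fixed = [(0, 9 * X, 3 * X)]            # a single region with stripe 3x (FIG. 1)
regional = [(0, 3 * X, X),             # first region, stripe x
            (3 * X, 9 * X, 2 * X)]     # second region, stripe 2x (FIG. 2)

# Hypothetical request ranges, chosen only to match the quoted times.
first_read = (3 * X, 9 * X)
second_read = (0, 9 * X)

for name, layout in (("fixed", fixed), ("regional", regional)):
    t1 = max(per_server_load(*first_read, layout, 3))
    t2 = max(per_server_load(*second_read, layout, 3))
    print(name, t1, t2)
# prints:
# fixed 3 3
# regional 2 3
```

Under these assumptions the first read drops from 3x to 2x because the 2x stripes of the second region spread it over all three servers, while the second read costs x + 2x = 3x on each server in the regional layout, the same as with fixed striping.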

This example shows that, compared with the traditional data layout, the region scheme in RLFS is a finer-grained and more adaptive data layout scheme, with different stripe sizes on the storage servers. The region scheme in RLFS can thus be regarded as a variant of the 1-DH layout scheme: RLFS aggregates the bandwidth of all storage servers to maximize I/O performance, and is well matched to the hybrid storage system 100.

RLFS is designed to support region-based data layout by using file stripes of different sizes. To accommodate both the hybrid storage system 100 and complex I/O workloads, RLFS uses a partitioning approach to reach an optimal data layout. RLFS generates a cost model and, based on it, divides a large file into a set of regions, each with its own stripe size. The optimal regions and their stripe sizes are those that minimize the total cost of all of the application's I/O requests.

The storage system in this embodiment is the hybrid storage system 100, which includes solid-state-drive-based servers 102 and hard-disk-drive-based servers 101. A solid-state-drive-based server is abbreviated SServer, and a hard-disk-drive-based server is abbreviated HServer.

An embodiment of the present invention provides a file system called the region-level file system (Region Level File System, RLFS). It supports region-level data layout and solves the data distribution problem of existing parallel file systems. RLFS relies on a defined cost model and a per-region, heterogeneity-aware scheme to determine the optimal file stripe size for each server, and further uses changes in the access pattern to adjust the region scheme at run time.

More specifically, a cost model is first developed for RLFS to estimate the completion time of region accesses; dynamic programming then uses it to divide the file into fine-grained regions, and the optimal file stripe size chosen for each region is assigned to the HDD and SSD servers. RLFS is aware of both the storage system and the application. In essence it represents a shift from the traditional one-dimensional, fixed-stripe-size layout to a two-dimensional layout with varying stripe sizes, and it adapts well to changes in server performance and application behavior. In addition, RLFS updates the generated data layout scheme according to detected changes in the access pattern, which addresses the static-layout problem and makes the layout better suited to run-time file access.

FIG. 3 shows the region-level file system RLFS provided by this embodiment.

RLFS includes a kernel part and a user-level daemon module 20; preferably, the kernel part includes the FUSE module 10. In other words, the RLFS provided by the embodiment is preferably designed on the FUSE framework. FUSE stands for Filesystem in Userspace. The kernel part is preferably a Linux kernel module and also includes the VFS module 11; VFS stands for Virtual File System. The VFS module 11 is used to register RLFS. A block device is created in the kernel part and serves as the interface between the daemon module 20 and the kernel part. The FUSE module 10 acts as the proxy of the daemon module 20 for the various file system requests issued by applications.

An application on client 1 accesses RLFS by mounting it into its namespace. From then on, all file system calls targeting the mount point are forwarded through the VFS module 11 to the FUSE module 10. The FUSE module 10 then relays the calls in its request queue to the daemon module 20 through the block device; the daemon invokes the appropriate service handler for each file system call, contacting the metadata server 200 and/or the storage servers as needed. The response travels back along the reverse path through the kernel part to the application, which typically waits for the response after issuing the request. The RLFS daemon and the storage servers should implement all PFS semantics. For example, the read handler should first identify which storage servers hold the requested data segments and where on each server they are stored, and then issue sub-requests to those servers for parallel access. The kernel part also includes a file log module 12, which records operation logs for the metadata server 200.
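The read handler's two steps, locating the data and issuing parallel sub-requests, can be sketched as follows. This is a simplified illustration, not RLFS code: it assumes plain round-robin striping with a single fixed stripe size (RLFS uses per-region stripe sizes that differ between HServers and SServers), and `split_read`, `parallel_read` and the `fetch(server_offset, size)` callables are hypothetical stand-ins for the real storage-server protocol.

```python
from concurrent.futures import ThreadPoolExecutor


def split_read(offset, length, stripe, num_servers):
    """Split a read into (server, server_offset, size) sub-requests."""
    subrequests = []
    pos, end = offset, offset + length
    while pos < end:
        index = pos // stripe                  # global stripe number
        within = pos % stripe                  # offset inside that stripe
        size = min(stripe - within, end - pos)
        server = index % num_servers
        # Each server stores its stripes back to back.
        server_offset = (index // num_servers) * stripe + within
        subrequests.append((server, server_offset, size))
        pos += size
    return subrequests


def parallel_read(offset, length, stripe, servers):
    """Fetch all sub-requests concurrently and reassemble them in file order.

    `servers` is a list of callables fetch(server_offset, size) -> bytes.
    """
    subs = split_read(offset, length, stripe, len(servers))
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # Executor.map yields results in the order of `subs`, i.e. file order.
        parts = pool.map(lambda sub: servers[sub[0]](sub[1], sub[2]), subs)
        return b"".join(parts)
```

For instance, `split_read(5, 10, 4, 3)` yields `[(1, 1, 3), (2, 0, 4), (0, 4, 3)]`: the read starts one byte into the stripe held by server 1, covers a full stripe on server 2, and ends in the second stripe stored on server 0.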

In addition to the general semantics of a PFS, RLFS also needs to implement region-based data layout. To this end, RLFS is equipped with three user-level components, namely the I/O tracer 3, the cost calculation module 4, and the region division module 5, through which it completes a three-phase data layout cycle. The cycle begins with the tracing phase, in which the I/O tracer 3 collects, during application execution, the runtime statistics of data accesses and the file system configuration file used for cost modeling (for example, FUSE queue information) into a trace file. In the following analysis phase, the I/O tracer 3 feeds the read/write traces to the region division module 5 and directs the file system configuration file to the cost calculation module 4; the region division module 5 then uses the updated cost model to generate regions, each with its own stripe sizes for the two kinds of servers. Finally, in the placement phase, the file is placed on the underlying hybrid storage system 100 at runtime according to the layout scheme obtained in the previous phase, so as to optimize I/O requests in subsequent runs. Through these three phases, RLFS can greatly improve the I/O performance of the application in subsequent runs. RLFS also includes an update data layout module 8, which is connected to the daemon module 20, the I/O tracer 3, the region division module 5, and the hybrid storage system 100, respectively, and which dynamically detects region changes and updates the data layout. The specific functions of the I/O tracer 3, the cost calculation module 4, and the region division module 5 are described below.

1. I/O tracer 3

In RLFS, the I/O tracer 3 is used to collect both runtime I/O information and file system configuration files. Although the prior art offers techniques and tools for I/O data collection, such as IOSIG [42], such tools cannot be applied directly to RLFS because the file system provided by this embodiment is built on the FUSE framework; this follows from the inherent characteristics of the FUSE framework. The I/O tracer 3 of this embodiment therefore follows the N-1 logging pattern, in which all RLFS daemon processes write to a single shared file. Designed this way, the I/O tracer 3 can collect all information about I/O operations, including file access type, operation time, and other process-related data.

After the corresponding application has been run with the I/O tracer 3, the process ID, file descriptor, operation type, offset, request size, and timestamp of each request are available. To facilitate the subsequent region division and to guide the optimal data layout, all I/O requests to a file are sorted in ascending order of offset.
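As an illustration of the trace contents and the ordering step above, the following minimal sketch models one trace record and sorts records by offset. The class name `TraceRecord`, the field names, and the tie-break on timestamp are assumptions; the source only lists which items are recorded and that requests are sorted by ascending offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRecord:
    """One traced I/O request (field names are illustrative)."""
    pid: int          # process ID
    fd: int           # file descriptor
    op: str           # operation type, e.g. "read" or "write"
    offset: int       # byte offset within the file
    size: int         # request size in bytes
    timestamp: float  # time at which the request was issued


def sort_by_offset(records):
    """Return the records in ascending order of offset.

    Ties are broken by timestamp; this tie-break is an assumption.
    """
    return sorted(records, key=lambda r: (r.offset, r.timestamp))


if __name__ == "__main__":
    trace = [
        TraceRecord(101, 3, "read", 4096, 1024, 0.20),
        TraceRecord(102, 3, "read", 0, 4096, 0.35),
        TraceRecord(101, 3, "write", 4096, 512, 0.10),
    ]
    for record in sort_by_offset(trace):
        print(record)
```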

Runtime I/O information is collected in a specific environment, and certain parameters are needed to interpret it fully. For this reason, in addition to the I/O information, the I/O tracer 3 should also collect the runtime configuration file of the file system, in particular that of a file system based on the FUSE framework. This configuration file is directed to the cost calculation module 4 to help update the cost model, and the region division module 5 then determines the optimal region division according to the minimum total cost obtained by the cost calculation module 4.

2. Cost calculation module 4

The cost calculation module 4 generates a cost model whose goal is to find the minimum total cost.

To obtain the optimal region division and the stripe size for each server in the storage system, the file system proposed in this embodiment relies on the cost calculation module 4. In the cost calculation module 4, cost is defined as the I/O completion time of each file request, and the module calculates the cost of file request accesses in the file system. The file system is designed to work with a hybrid storage system.

Because the total access cost of a file request depends on the file system itself and on the underlying network and storage servers, it comprises the system cost of the file system and the network and storage cost. The cost calculation module 4 must therefore take both into account. Correspondingly, it includes a system cost calculation module 41 and a network and storage cost calculation module 42.

Because the file system proposed in this embodiment is built on the FUSE framework, its system cost mainly refers to the time overhead in the FUSE data path. Since the main goal of RLFS is to optimize read requests through the optimal placement of data files on the hybrid storage system 100, this embodiment defines the system cost only for read requests; the cost of write requests can be derived in the same way with the same parameters.

As shown in FIG. 4, the service time of each read request is divided into three parts: first, the time spent waiting in the queue of the FUSE module 10; second, the time for the two context switches between the FUSE module 10 and the daemon module 20; and third, the time for the three copy operations, starting from the first copy.

Data flows from the network system, which contains m HServers and n SServers, to the daemon module 20, then from the daemon module 20 to the FUSE module 10, and is finally sent by the FUSE module 10 to client 1.

The time a read request waits in the queue of the FUSE module 10 is closely tied to the application running between client 1 and RLFS. It depends not only on the I/O request pattern of the application but also on other factors introduced by the file system, such as page caching or interrupts, and is therefore difficult to estimate accurately. However, given that the multithreading support of the RLFS daemon module 20 minimizes queue delay, it is safe to assume that the waiting time of read requests in the FUSE module 10 queue is negligible, i.e., $T_q = 0$.

Further, the context switch time is system-dependent and independent of the data size, so it can be treated as a constant. The time $T_s$ for the two context switches between the FUSE module 10 and the daemon module 20 is therefore $T_s = 2\mu$, where $\mu$ is the context switch time.

The time for the three copy operations, starting from the first copy, is referred to as the copy time $T_c$. It is proportional to the data size $r$ of the file request and is calculated as:

$$T_c(r, h, s) = 3 r t_c$$

where the data size $r$ of the file request is:

$$r = m s_m + n s_n$$

Here $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively, with $s_m \le h$ and $s_n \le s$, where $h$ is the stripe size on the HServers and $s$ is the stripe size on the SServers. The copy time $T_c$ can therefore be further expressed as:

$$T_c(r, h, s) \approx 3 (m h + n s) t_c$$

Here $t_c$ is the unit data copy time from kernel space to user space. The first part of the total cost, calculated by the system cost calculation module 41, is therefore denoted $T_1$, with $T_1 = T_s + T_c$.
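The system cost $T_1 = T_s + T_c$ can be evaluated directly from the formulas above. The following is a minimal sketch; the function name `system_cost` and the sample parameter values are illustrative assumptions and do not come from the source. The queue waiting time is taken as zero, as assumed in the text.

```python
def system_cost(m, n, h, s, mu, t_c):
    """Approximate system cost T1 = Ts + Tc of one read request.

    m, n : number of HServers and SServers
    h, s : stripe size on the HServers and on the SServers
    mu   : time of one context switch
    t_c  : unit data copy time from kernel space to user space
    """
    t_switch = 2 * mu                      # Ts = 2 * mu (two context switches)
    t_copy = 3 * (m * h + n * s) * t_c     # Tc ~= 3 * (m*h + n*s) * t_c
    return t_switch + t_copy


if __name__ == "__main__":
    # Illustrative values only: 2 HServers, 1 SServer, stripes of 4 and 8 units.
    print(system_cost(m=2, n=1, h=4, s=8, mu=1.0, t_c=0.5))
```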

The network and storage cost calculated by the network and storage cost calculation module 42 comprises the network connection time $T_e$, the storage access time $T_a$, and the network transmission time $T_x$. In a PFS, a request is divided into a set of sub-requests, each forwarded to a separate storage server for parallel execution, so the cost of the request in the network and storage servers is determined by the maximum cost over all sub-requests. Assuming that all servers of each type (HServer or SServer) have the same network and storage configuration, the network transmission time $T_x$ is determined by the data sizes ($s_m$ and $s_n$) and the data transmission network time $t$:

$$T_x = \max\{s_m t,\ s_n t\}$$

In the above formula, $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively.

Like the network transmission time $T_x$, the storage access time $T_a$ is determined by the sub-requests:

$$T_a = \max\{s_m t_h,\ s_n t_s\}$$

In the above formula, $s_m$ and $s_n$ denote the maximum sub-request size on an HServer and on an SServer, respectively, and $t_h$ and $t_s$ denote the unit data transmission time on an HServer and on an SServer, respectively.

Unlike the storage access time $T_a$ and the network transmission time $T_x$, the network connection time $T_e$ is a constant that is independent of the data size. In summary, the network and storage cost $T_2$ calculated by the network and storage cost calculation module 42 can be expressed as:

$$T_2 = T_e + \max\{s_m (t_h + t),\ s_n (t_s + t)\}$$

Since $s_m \le h$ and $s_n \le s$, the network and storage cost $T_2$ can be further expressed as:

$$T_2 \approx T_e + \max\{h (t_h + t),\ s (t_s + t)\}$$

In the above formula, $h$ is the stripe size on the HServers and $s$ is the stripe size on the SServers.

As the cost calculation module 4 shows, the total cost $T$ of a request can be expressed as $T = T_1 + T_2$. The total cost is a function of parameters describing the application, the file system, and the data layout. It is therefore highly heterogeneous and is determined by the server stripe sizes $h$ and $s$.
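To make the dependence of the total cost on the stripe sizes concrete, the sketch below evaluates $T = T_1 + T_2$ with the approximate formulas $T_1 = 2\mu + 3(mh + ns)t_c$ and $T_2 \approx T_e + \max\{h(t_h + t), s(t_s + t)\}$. The function names and all numeric values are illustrative assumptions, not measurements from the source.

```python
def network_storage_cost(h, s, t_e, t_h, t_s, t):
    """T2 ~= Te + max{h*(th + t), s*(ts + t)}."""
    return t_e + max(h * (t_h + t), s * (t_s + t))


def total_cost(m, n, h, s, mu, t_c, t_e, t_h, t_s, t):
    """T = T1 + T2, with the FUSE queue waiting time assumed to be zero."""
    t1 = 2 * mu + 3 * (m * h + n * s) * t_c   # system cost
    t2 = network_storage_cost(h, s, t_e, t_h, t_s, t)
    return t1 + t2


if __name__ == "__main__":
    # Illustrative parameters: the SServer is faster per unit of data (t_s < t_h).
    params = dict(m=2, n=1, mu=1.0, t_c=0.5, t_e=1.0, t_h=2.0, t_s=0.5, t=0.5)
    for h, s in [(4, 8), (8, 8), (4, 16)]:
        print(h, s, total_cost(h=h, s=s, **params))
```

Because the slower of the two server types dominates the `max` term, giving the faster SServers a larger stripe than the HServers can balance the two branches, which is the intuition behind choosing $h$ and $s$ per region.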

It should also be noted that reads and writes differ greatly on SServers: a write request involves more operations than a read. In that case, the time $T_s$ for the two context switches between the FUSE module 10 and the daemon module 20 must additionally include the time for write amplification, garbage collection, and wear leveling.

To help explain how the cost calculation module 4 works, the parameters of the cost analysis model used in the module are listed in Table 1.

Figure GDA0002731175200000132

Figure GDA0002731175200000141

Table 1. Parameters in the cost analysis model

3. Region division module 5

Guided by the cost model generated by the cost calculation module 4, the region division module 5 divides a file into regions so as to minimize the total cost of a given set of accesses characteristic of a parallel application. An existing region division scheme is HARL, which handles region division and stripe size determination in two separate phases. In contrast, the layout strategy of RLFS is holistic: it considers region division and stripe size determination in a unified way, so RLFS determines both at once. Rather than heuristically scanning the trace file to find logical regions as HARL does, RLFS considers logical regions and physical blocks together, with the minimum total cost as its goal. This is natural because the smallest unit of file access is a block, for example 64MB or 128MB, and a logical region can naturally span a sequence of adjacent physical blocks.

The region division module 5 executes a first algorithm, which is an offline, optimal, and relatively fast algorithm that can be repeated periodically to adapt to the dynamic nature of accesses. "Relatively fast" means that the algorithm runs in pseudo-polynomial time. In essence, the algorithm first represents the shared file as a sequence of blocks, then partitions the file in units of blocks according to the given access requests, and finally uses a dynamic programming module to find the optimal region division among these partitions.

As an example, consider a file F of size L, measured in segments (L = 12 segments). According to the I/O events given by the access pattern, such as the start or end of an I/O operation, sequences of adjacent segments are merged into regions, with each region delimited by red vertical dashed lines. The data of each region is striped across the HServers and SServers, and a single logical I/O request can be served by multiple physical requests for the data concerned. With this layout optimization, the total access cost is minimized under the defined cost model, in contrast to traditional strategies.

Let
Figure GDA0002731175200000142
denote the minimum cost when a file with l request events, starting at index i, is divided into k regions. It is computed by the following recursion for 0 ≤ i < l:

Figure GDA0002731175200000151

where
Figure GDA0002731175200000152
defines a region whose size is
Figure GDA0002731175200000153
The quantity
Figure GDA0002731175200000154
can be expressed as:

Figure GDA0002731175200000155

Figure GDA0002731175200000156
will be striped across the HServers and SServers with stripe sizes $h_i$ and $s_i$, respectively.

From the recursion, the minimum cost
Figure GDA0002731175200000157
of dividing l events into k regions starting from event i is obtained as follows. As m varies from l to l-i, the cost
Figure GDA0002731175200000158
of the first region of size f is added to the remaining
Figure GDA0002731175200000159
and the minimum sum is taken. When the number of segments is not enough to support the remaining k regions, set
Figure GDA00027311752000001510
otherwise set
Figure GDA00027311752000001511

Given the definition of
Figure GDA00027311752000001512
the next step is to compute the (sub)requests of the region from
Figure GDA00027311752000001513
to
Figure GDA00027311752000001514
and then, for
Figure GDA00027311752000001515
the cost
Figure GDA00027311752000001516
which is calculated as follows:

Figure GDA00027311752000001517

In the above formula, $T(r, h_i, s_i)$ is defined by the cost model. Assuming $s_i = \alpha h_i$, then:

Figure GDA00027311752000001518

Here, $\alpha \ge 1$ is the expansion factor of the SServer relative to the HServer, and $B$ denotes the block size in the configuration.

With the above four equations, the optimal region division of the file layout can be obtained and the cost of the given requests minimized. The file is then placed, region by region, on the underlying heterogeneous servers, where each region has its own determined stripe sizes. Thereafter, each request R is served by reading the corresponding regions according to their stripe sizes.
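The general shape of such a dynamic program can be sketched as follows. This is a generic interval-partition sketch, not the patent's exact recursion, which is given only in the equation images above: the function name `partition_regions` and the callback `region_cost(i, j)`, which stands in for the cost of a single region covering segments i to j-1 with its best stripe sizes, are assumptions made for illustration.

```python
import math
from functools import lru_cache


def partition_regions(num_segments, k, region_cost):
    """Split segments 0..num_segments-1 into exactly k contiguous regions
    so that the sum of region_cost(i, j) over the regions [i, j) is minimal.

    Returns (minimum_cost, regions), where regions is a tuple of (i, j) pairs.
    """

    @lru_cache(maxsize=None)
    def best(i, parts):
        remaining = num_segments - i
        if parts == 0:
            return (0.0, ()) if remaining == 0 else (math.inf, ())
        if remaining < parts:
            # Not enough segments left to form the remaining regions.
            return (math.inf, ())
        best_cost, best_regions = math.inf, ()
        # The first region is [i, j); keep at least one segment per later region.
        for j in range(i + 1, num_segments - (parts - 1) + 1):
            rest_cost, rest_regions = best(j, parts - 1)
            cost = region_cost(i, j) + rest_cost
            if cost < best_cost:
                best_cost, best_regions = cost, ((i, j),) + rest_regions
        return best_cost, best_regions

    return best(0, k)


if __name__ == "__main__":
    # Toy cost: a region's cost grows with the square of its length.
    cost, regions = partition_regions(12, 3, lambda i, j: (j - i) ** 2)
    print(cost, regions)
```

In a fuller model, `region_cost` would presumably evaluate the cost model over the traced requests that fall inside the region; the toy quadratic cost here only demonstrates the mechanics of the recursion and the infinite-cost base case for infeasible splits.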

FIG. 5 shows an application example of the file system in an embodiment of the present invention. As shown in FIG. 5, the file client 1 issues requests on behalf of applications running on the computing server 301, the hybrid storage system 100 stores and manages the striped regions, and the metadata server 200 (MDS) holds the description information of the files stored in RLFS. During a file operation, client 1 first contacts the MDS to obtain the file metadata and then uses it to access data on the hybrid storage system 100 through the RLFS daemon.

RLFS logically maps one large file to multiple small (region) files, each representing a file region with a similar I/O workload. Each region file is further striped across all HServers and SServers, and each stripe is stored as a separate data file on its storage server. To this end, the MDS maintains a region stripe table (RST) for each physical file in RLFS, as shown in Table 2 below, which records each region of the file by its offset and its stripe size on each server. The RST is created by the region division module 5 when the file is written to RLFS and is updated when the access pattern changes. For efficiency, the RST of a file to be read can be cached when RLFS is mounted and released when RLFS is unmounted, and is stored in the same directory as the application.

Figure GDA0002731175200000161

Table 2. Data structure of the region stripe table
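As a rough illustration of how a region stripe table could be queried, the sketch below stores one entry per region and finds the region that contains a given file offset by binary search. The actual columns of Table 2 are in the image above; the class names and the fields `region_id`, `offset`, `h`, and `s` are assumptions based only on the description that each region is recorded by its offset and its stripe size on each server.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionEntry:
    """One row of a hypothetical region stripe table."""
    region_id: int
    offset: int   # starting offset of the region within the file
    h: int        # stripe size on the HServers for this region
    s: int        # stripe size on the SServers for this region


class RegionStripeTable:
    def __init__(self, entries):
        self.entries = sorted(entries, key=lambda e: e.offset)
        self._starts = [e.offset for e in self.entries]

    def lookup(self, file_offset):
        """Return the entry of the region containing file_offset."""
        idx = bisect.bisect_right(self._starts, file_offset) - 1
        if idx < 0:
            raise KeyError(f"offset {file_offset} precedes the first region")
        return self.entries[idx]


if __name__ == "__main__":
    kb = 1024
    rst = RegionStripeTable([
        RegionEntry(0, 0, 32 * kb, 160 * kb),
        RegionEntry(1, 64 * kb * kb, 36 * kb, 148 * kb),
    ])
    print(rst.lookup(10 * kb))
    print(rst.lookup(64 * kb * kb + 1))
```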

For the cost calculation module 4, one file server in the parallel file system is used to measure the context switch time, the unit data copy time, and the unit data transmission time of the HServers and SServers under read and write patterns; these parameters can vary with the I/O pattern. In addition, a pair of nodes, one client node and one file server, is used to estimate the network transmission time. The network transmission test is repeated thousands of times, and the average of the results is used as the parameter value for building the cost model.
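The repeat-and-average procedure described above can be sketched as a small measurement helper. This is only an outline under stated assumptions: the function name `estimate_unit_time` is hypothetical, and the in-memory buffer copy in the usage example merely stands in for a real copy or network transfer, which would have to be supplied by the test harness.

```python
import statistics
import time


def estimate_unit_time(operation, data_size, repeats=1000):
    """Run `operation` repeatedly and return the mean time per unit of data.

    operation : zero-argument callable that processes `data_size` units
    data_size : amount of data handled by one call (e.g. bytes)
    repeats   : number of repetitions to average over
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) / data_size)
    return statistics.mean(samples)


if __name__ == "__main__":
    buffer = bytearray(64 * 1024)
    # Stand-in workload: copy a 64 KB buffer in memory.
    per_byte = estimate_unit_time(lambda: bytes(buffer), len(buffer))
    print(f"estimated copy time per byte: {per_byte:.3e} s")
```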

To produce the optimal data layout for a particular file, RLFS first uses its region division module 5 to compute the optimal region division of the file, and then uses the cost model and the I/O trace data to determine the stripe sizes of each region. The computed optimal region information is used to write the file to the servers in parallel, and the region division module 5 creates an RST for the file in the MDS for subsequent reads. The MDS holds the RLFS namespace, the RSTs, and other information about each file. However, because the region division algorithm produces only a limited number of regions, the size of the MDS is tightly bounded and remains small.

In addition, to facilitate parallel reads of each file, the hybrid storage system 100 maintains a flat namespace in which each stripe file on the local disk is identified by "filename_region#_stripe#". Note that "filename" can include path information specified by the application. A background I/O daemon receives incoming requests from client 1, each identified by "filename", "region#", and "stripe#", and serves a request by sending back the requested stripe file, which is combined with other stripe files to satisfy the application.
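The flat naming scheme above can be illustrated with a pair of helper functions. This is a sketch under an assumption: "region#" and "stripe#" are taken to be decimal indices appended with underscores, and the function names are hypothetical.

```python
def stripe_file_name(filename, region, stripe):
    """Build the flat-namespace name "filename_region#_stripe#"."""
    return f"{filename}_{region}_{stripe}"


def parse_stripe_file_name(name):
    """Split a stripe file name back into (filename, region, stripe).

    Splitting from the right keeps underscores and path separators
    inside the original filename intact.
    """
    filename, region, stripe = name.rsplit("_", 2)
    return filename, int(region), int(stripe)


if __name__ == "__main__":
    name = stripe_file_name("results/run_01.dat", region=3, stripe=7)
    print(name)
    print(parse_stripe_file_name(name))
```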

The file system (RLFS) proposed by the present invention supports region-level data layout by dividing a file into a set of optimal regions, so that both the optimal regions and their stripe sizes can be determined and the data layout of the hybrid storage system 100 can be optimized. Compared with an in-kernel approach, using the FUSE module 10 not only greatly simplifies development but also allows RLFS to be accessed through the standard file system interface, so applications can access RLFS transparently. With its variable sizes, RLFS can mitigate load imbalance among servers and adapt flexibly to workload changes and server heterogeneity, thereby significantly improving I/O performance.

The file system (RLFS) proposed by this embodiment has been verified experimentally and found to be feasible and to perform well. The experimental results show that RLFS works well together with the hybrid storage system 100 and greatly improves parallel I/O performance.

In the experiments, three data layout schemes were compared: scheme one uses fixed-size stripes, scheme two uses randomly selected stripes, and scheme three is RLFS. For reads and writes, RLFS uses the optimal data layouts {32KB, 160KB} and {36KB, 148KB}, respectively, which improve I/O performance by 73.4% and 176.7% compared with the default layout with fixed 64KB stripes. Compared with other layouts with different but fixed stripe sizes, RLFS improves read performance by up to 138.6% and write performance by up to 177.6%. Compared with the randomly selected stripe strategy, RLFS improves read performance by up to 154.5% and write performance by up to 215.4%.

Experimental results on representative benchmarks show that RLFS is a promising and feasible solution for hybrid parallel file systems, with parallel I/O performance improvements ranging from 20.6% to 556.1% for reads and from 22.7% to 288.7% for writes.

The present invention also provides a data layout method, which includes:

Step S1: collect the runtime I/O information of data accesses and the file system configuration file used for cost modeling into a trace file; direct the file system configuration file to the building of the cost model, and use the I/O information for region division;

Step S2: calculate or estimate the access cost of file requests to form a cost model;

Step S3: according to the cost model, generate a region distribution that minimizes the total cost, and divide the file into the different regions;

Step S4: obtain the stripe size corresponding to each region.

The beneficial effect of the above method is that it supports region-level data layout by dividing a file into a set of optimal regions, thereby determining the optimal regions and their stripe sizes and optimizing the data layout of the hybrid storage system.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (8)

1. A file system, characterized in that the file system comprises an I/O tracer, a cost calculation module, and a region division module that are electrically connected with each other, wherein the I/O tracer is used for providing the region division module with the I/O information it collects while the file system is running; the I/O tracer is also used for providing the cost calculation module with the configuration file of the file system that it collects; the cost calculation module is used for calculating or estimating the access cost of file requests in the file system so as to output a cost model to the region division module; the region division module is used for generating, according to the cost model, a region distribution that minimizes the total cost and dividing files into different regions, and is also used for obtaining the stripe size corresponding to each region;
the cost calculation module is used for calculating the total cost of a request, the total cost being calculated as $T = T_s + T_c + T_2$, where $T_s$ represents the time of two context switches between the FUSE module and the daemon module, $T_c$ represents the copy time, and $T_2$ represents the network and storage cost.
2. The file system of claim 1, wherein the file system further comprises a kernel portion for performing interaction of information or data between a metadata server, the hybrid storage system, and a client party; the kernel portion includes a FUSE module.
3. The file system of claim 2, wherein the file system further comprises a daemon module for executing a daemon process in the background; the FUSE module is used as a proxy of the daemon.
4. The file system of claim 3, wherein the file system further comprises an update data layout module, the update data layout module being respectively connected to the daemon module, the I/O tracer, the region division module, and the hybrid storage system, and being configured to dynamically detect and update region changes.
5. The file system of claim 4, wherein the hybrid storage system comprises a solid state drive-based server SServer and a hard disk drive-based server HServer;
the copy time is calculated as $T_c(r, h, s) \approx 3 (m h + n s) t_c$, where $t_c$ represents the unit data copy time from kernel space to user space, $r$ represents the data size of a file request, $h$ represents the stripe size on the HServers, $s$ represents the stripe size on the SServers, $m$ represents the number of HServers, and $n$ represents the number of SServers;
the network and storage cost is calculated as $T_2 \approx T_e + \max\{h (t_h + t),\ s (t_s + t)\}$, where $t$ represents the data transmission network time, $t_h$ and $t_s$ represent the unit data transmission time on the HServers and on the SServers, respectively, and $T_e$ represents the network connection time.
6. The file system of claim 5, wherein the region partitioning module is configured to obtain the minimum cost of dividing l events, starting from event i, into k regions,
Figure FDA0002731175190000021
The minimum cost
Figure FDA0002731175190000022
is calculated as follows:
Figure FDA0002731175190000023
where
Figure FDA0002731175190000024
Figure FDA0002731175190000025
defines a region of size
Figure FDA0002731175190000026
and
Figure FDA0002731175190000027
denotes the cost of the first region of size f.
7. The file system of claim 6, wherein the solid-state-drive-based server and the hard-disk-drive-based server are capable of striping
Figure FDA0002731175190000028
to obtain $h_i$ and $s_i$ respectively, where $s_i$ is calculated as $s_i = \alpha h_i$ and $h_i$ is calculated as follows:
Figure FDA0002731175190000029
where $\alpha \geq 1$ is the spreading factor of the SServer relative to the HServer, and B denotes the block size in the configuration.
8. A data layout method using the file system according to any one of claims 1 to 7, characterized by comprising:
step S1, collecting, into a trace file, the I/O information of data accesses at run time and the file system configuration file, the file system configuration file being used to establish a cost model and the I/O information being used for region division;
step S2, calculating or estimating the access cost of a file request to form the cost model;
step S3, generating, according to the cost model, a region partition that minimizes the total cost, and dividing the file into different regions;
step S4, obtaining the stripe size corresponding to each region.
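
The cost model in claims 1 and 5 and the stripe relation in claim 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `CostParams` container, the function names, and all numeric values are hypothetical and chosen only for illustration, and units are arbitrary. The region-partitioning recurrence of claim 6 and the formula for $h_i$ in claim 7 are not reproduced because they appear only as figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    """Hypothetical container for the symbols named in claims 1 and 5."""
    m: int      # number of HServers (HDD-based servers)
    n: int      # number of SServers (SSD-based servers)
    t_c: float  # unit data copy time, kernel space to user space
    t: float    # network data transmission time (assumed per unit of data)
    t_h: float  # unit data transfer time on an HServer
    t_s: float  # unit data transfer time on an SServer
    T_e: float  # network connection time
    T_s: float  # time of two context switches (FUSE module <-> daemon)


def copy_time(p: CostParams, h: float, s: float) -> float:
    """T_c(r, h, s) ~= 3 (m h + n s) t_c.

    The request size r is an argument of T_c in the claim, but the
    approximation as stated does not use it, so it is omitted here.
    """
    return 3 * (p.m * h + p.n * s) * p.t_c


def network_storage_time(p: CostParams, h: float, s: float) -> float:
    """T_2 ~= T_e + max{h (t_h + t), s (t_s + t)}."""
    return p.T_e + max(h * (p.t_h + p.t), s * (p.t_s + p.t))


def total_cost(p: CostParams, h: float, s: float) -> float:
    """T = T_s + T_c + T_2."""
    return p.T_s + copy_time(p, h, s) + network_storage_time(p, h, s)


def ssd_stripe(alpha: float, h: float) -> float:
    """s = alpha * h with alpha >= 1, the relation stated in claim 7."""
    if alpha < 1:
        raise ValueError("alpha must be >= 1")
    return alpha * h


# Made-up parameter values, for illustration only.
params = CostParams(m=2, n=1, t_c=0.5, t=1.0, t_h=3.0, t_s=1.0,
                    T_e=2.0, T_s=4.0)
h = 4.0
s = ssd_stripe(2.0, h)           # 8.0
print(total_cost(params, h, s))  # 4.0 + 24.0 + 18.0 = 46.0
```

With these made-up values the two arguments of the max are equal (16.0 each), meaning the HServer and SServer finish their stripes at the same time; this is the kind of balance a larger SServer stripe ($\alpha > 1$) is meant to allow.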
CN201811547400.9A 2018-12-18 2018-12-18 File system and data layout method Active CN109840247B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811547400.9A CN109840247B (en) 2018-12-18 2018-12-18 File system and data layout method
PCT/CN2019/121301 WO2020125362A1 (en) 2018-12-18 2019-11-27 File system and data layout method

Publications (2)

Publication Number Publication Date
CN109840247A (en) 2019-06-04
CN109840247B (en) 2020-12-18

Family

ID=66883264

Country Status (2)

Country Link
CN (1) CN109840247B (en)
WO (1) WO2020125362A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840247B (en) * 2018-12-18 2020-12-18 深圳先进技术研究院 File system and data layout method
CN110825698B (en) * 2019-11-07 2021-02-09 重庆紫光华山智安科技有限公司 Metadata management method and related device
CN114578299B (en) * 2021-06-10 2024-11-15 中国人民解放军63698部队 A method and system for generating radio frequency signals by wireless remote control beacon equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104020961A (en) * 2014-05-15 2014-09-03 深圳市深信服电子科技有限公司 Distributed data storage method, device and system
CN105872031A (en) * 2016-03-26 2016-08-17 天津书生云科技有限公司 Storage system
CN106326344A (en) * 2016-08-05 2017-01-11 中国水产科学研究院东海水产研究所 Distributed massive data management and retrieval method
CN107479827A (en) * 2017-07-24 2017-12-15 上海德拓信息技术股份有限公司 A kind of mixing storage system implementation method based on IO and separated from meta-data
CN107734026A (en) * 2017-10-11 2018-02-23 郑州云海信息技术有限公司 A kind of design method, device and the equipment of network attached storage cluster
US10014028B2 (en) * 2014-10-17 2018-07-03 Panasonic Intellectual Property Corporation Of America Recording medium, playback device, and playback method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2299375A3 (en) * 2002-11-14 2012-02-01 EMC Corporation Systems and methods for restriping files in a distributed file system
JP2005302152A (en) * 2004-04-12 2005-10-27 Sony Corp Composite type storage device, data writing method, and program
US7949636B2 (en) * 2008-03-27 2011-05-24 Emc Corporation Systems and methods for a read only mode for a portion of a storage system
CN102566942A (en) * 2011-12-28 2012-07-11 华为技术有限公司 File striping writing method, device and system
US9916311B1 (en) * 2013-12-30 2018-03-13 Emc Corporation Storage of bursty data using multiple storage tiers with heterogeneous device storage
CN103778222A (en) * 2014-01-22 2014-05-07 浪潮(北京)电子信息产业有限公司 File storage method and system for distributed file system
US9772787B2 (en) * 2014-03-31 2017-09-26 Amazon Technologies, Inc. File storage using variable stripe sizes
CN105760164B (en) * 2016-02-15 2020-01-10 苏州浪潮智能科技有限公司 Method for realizing ACL authority in user space file system
CN106528761B (en) * 2016-11-04 2019-06-18 郑州云海信息技术有限公司 A file caching method and device
CN109840247B (en) * 2018-12-18 2020-12-18 深圳先进技术研究院 File system and data layout method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于FUSE的用户态文件系统的设计与实现";黄永胜;《东北大学硕士毕业论文》;20150817;第8-45页 *

Also Published As

Publication number Publication date
CN109840247A (en) 2019-06-04
WO2020125362A1 (en) 2020-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant