
CN109522428B - An external memory access method for graph computing system based on index positioning - Google Patents


Info

Publication number
CN109522428B
Authority
CN
China
Prior art keywords
edge
data
file
vertex
index
Prior art date
Legal status
Active
Application number
CN201811082365.8A
Other languages
Chinese (zh)
Other versions
CN109522428A (en)
Inventor
王芳
冯丹
陈静
蒋子威
王子毅
刘上
杨蕾
杨文鑫
陈硕
曹孟媛
戴凯航
施展
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201811082365.8A priority Critical patent/CN109522428B/en
Publication of CN109522428A publication Critical patent/CN109522428A/en
Application granted granted Critical
Publication of CN109522428B publication Critical patent/CN109522428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an external memory access method for a graph computing system based on index positioning, comprising: dividing the complete graph data into multiple subgraphs; sorting the edges of each subgraph by source vertex number and by target vertex number, respectively; writing the sorted subgraphs into external memory files and building indexes for the source vertex numbers and the target vertex numbers; selecting the optimal loading mode between the index-positioned loading mode and the complete-data loading mode; and loading each subgraph from external memory into memory using the optimal loading mode. The invention redesigns the external memory data structure and improves the data loading method, so that the system can analyze the valid data in external memory before loading, significantly reducing the amount of I/O data and the number of random accesses. By analyzing the time overhead of the complete-data access mode and of the index positioning mode, the system dynamically determines the optimal data loading mode, reducing the time overhead of data loading.

Description

An external memory access method for a graph computing system based on index positioning

Technical Field

The present invention belongs to the field of external-memory-based graph computing, and more particularly relates to an external memory access method for a graph computing system based on index positioning.

Background Art

Graph computation proceeds by iteratively executing an update function. External-memory graph computing systems commonly organize the graph data into multiple subgraph data files on disk so that each subgraph file can fit into memory. Each subgraph contains the vertex information needed to compute updates, and one complete iteration processes all of the subgraph data. The key issue is how to manage the computation state of all subgraphs so that the results are correct, which includes loading graph data from external memory into memory and writing intermediate results back to external memory so that subsequent computations see the updated values. Each iteration therefore accesses a large amount of data, which produces heavy I/O overhead and becomes the bottleneck of external-memory-based approaches.

During preprocessing, the GraphChi system partitions the vertices of the graph into disjoint intervals and splits the edges into multiple data shards, with a one-to-one correspondence between shards and vertex intervals; the target vertices of the edges in each shard fall within the corresponding vertex interval. GraphChi uses a vertex-centric processing model that collects data from neighboring vertices, executes an update function on each vertex, and computes and updates vertex values, and it applies a parallel sliding window technique to reduce random external memory I/O. Keval Vora et al. proposed ADS, a general optimized access method for external-memory graph computing systems that reads only active data. Its main idea is that at the end of each iteration, the active data for the next iteration is regenerated into new sub-partitions, and only those newly generated sub-partitions are read in the next iteration. A DELAY_BUFFER is also maintained in memory to store vertices that need to be updated but have not yet been processed; every few iterations the original subgraphs are read to update the subgraphs held in the DELAY_BUFFER.

In summary, most current external memory access schemes in graph computing systems load complete subgraph partitions. Whether it is GraphChi with its parallel sliding windows or ADS with its dynamically created subgraph partitions, the external memory data is not partitioned at a fine enough granularity to precisely locate and access only the data required by the computation. On the other hand, simply loading external memory data by locating it reduces the amount of loaded data and improves resource utilization, but it turns the original sequential accesses into many random accesses, which introduces extra time overhead.

Summary of the Invention

In view of the defects of the prior art, the purpose of the present invention is to solve the technical problems of the large I/O data volume, the large number of random accesses, and the high time overhead of existing external memory access methods.

To achieve the above object, in a first aspect, an embodiment of the present invention provides an external memory access method for a graph computing system based on index positioning, the method comprising the following steps:

S0. Divide the complete graph data into multiple subgraphs that fit in memory;

S1. Sort the edges of each subgraph by source vertex number and by target vertex number, respectively;

S2. Write the sorted subgraphs into external memory files, and build indexes for the source vertex numbers and the target vertex numbers, respectively;

S3. Select the optimal loading mode between the index-positioned loading mode and the complete-data loading mode;

S4. Load each subgraph from external memory into memory using the optimal loading mode.

Specifically, writing the sorted subgraphs into external memory files proceeds as follows:

When the edge-sorted subgraphs are written into external memory files, edges with the same source vertex (or the same target vertex) are stored in contiguous external memory data blocks; the data of an edge consists of the edge itself and its edge value, which are saved in external memory as two files: an edge file and an edge-value file; the edge file stores the topology of the graph in adjacency list format; the order of edge values in the edge-value file corresponds to the order of edges in the edge file.

Specifically, the edge files comprise an out-edge file organized by source vertex number and an in-edge file organized by target vertex number; the out-edge file is organized by source vertex number and sequentially and contiguously stores the source vertex number, the out-degree, and the target vertex numbers of the outgoing edges; the in-edge file is organized by target vertex number and sequentially and contiguously stores the target vertex number, the in-degree, and the source vertex numbers of the incoming edges.

Specifically, the index records the offset address, within the external memory file, of the edges corresponding to each vertex.

Specifically, the index-positioned loading mode is as follows:

(1) Locate, via the index, the data blocks in the out-edge file that need to be loaded into memory;

(2) Using the out-edge offset addresses of the data blocks to be loaded from the out-edge file, find the edge-value data of those out-edges in the edge-value file and load the edge-value data into memory;

(3) Build the out-edge topology of the vertices in memory;

(4) When the application needs in-edges, locate, via the index, the in-edge data blocks in the in-edge file that need to be loaded into memory;

(5) Using the in-edge offset addresses of the data blocks to be loaded from the in-edge file, find the edge-value offset addresses of those in-edges in the edge-value index file and load the edge-value data offset addresses into memory;

(6) Build the in-edge topology of the vertices in memory.

Specifically, step S3 is as follows:

S30. Determine the active vertices of each subgraph and judge whether the proportion of active vertices exceeds a threshold; if so, go to step S31; otherwise, the complete-data loading mode is the optimal loading mode and step S3 ends;

S31. Record the numbers of the data blocks corresponding to the active-vertex data;

S32. Based on the numbers of the data blocks corresponding to the active-vertex data, calculate the cost Cost(Index) of the index-positioned loading mode and the cost Cost(All) of the complete-data loading mode;

S33. Judge whether Cost(All) is greater than Cost(Index); if so, the index-positioned loading mode is the optimal loading mode; otherwise, the complete-data loading mode is the optimal loading mode.

Specifically, the threshold takes a value in the range of 20% to 30%.

Specifically, the cost Cost(All) of the complete-data loading mode and the cost Cost(Index) of the index-positioned loading mode are computed as

[Cost(All) formula, given as Figure BDA0001802287430000041 in the original]

[Cost(Index) formula, given as Figure BDA0001802287430000042 in the original]

where E is the set of all edges in the graph; D is the storage space occupied by a single edge; B is the data block size of one I/O; b is the size of the data block pointed to by each index entry; k is the number of index data blocks corresponding to all active vertices in the current system; and r is the random-access overhead.

In a second aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the external memory access method described in the first aspect above.

In general, compared with the prior art, the technical solution conceived by the present invention has the following beneficial effects:

1. The present invention redesigns the organization of the external memory data in light of the runtime characteristics of graph applications, storing the data of the same vertex in contiguous external memory space for easy reading and indexing the offset addresses of the data blocks corresponding to each vertex in the file so that those data blocks can be accessed quickly. This improves the system's data loading method and enables the system to analyze the valid data in external memory before the loading stage, so that only the vertex data required by the computation is loaded, which significantly reduces the amount of I/O data and the number of random accesses.

2. The present invention analyzes the time overhead of the original complete-data access mode and of the index positioning mode and designs a decision function that dynamically determines the system's optimal data loading mode in each iteration, effectively reducing the time overhead of data loading.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the correspondence between the edge-value file and the edge-value offset address file provided by the present invention;

Figure 2 is a schematic diagram of the out-edge index file and the out-edge file provided by an embodiment of the present invention;

Figure 3 is a flowchart of step S3 provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

The present invention provides an external memory access method for a graph computing system based on index positioning, the method comprising the following steps:

S0. Divide the complete graph data into multiple subgraphs that fit in memory;

S1. Sort the edges of each subgraph by source vertex number and by target vertex number, respectively;

S2. Write the sorted subgraphs into external memory files, and build indexes for the source vertex numbers and the target vertex numbers, respectively;

S3. Select the optimal loading mode between the index-positioned loading mode and the complete-data loading mode;

S4. Load each subgraph from external memory into memory using the optimal loading mode.

A graph computing system consists mainly of a preprocessing stage and a computation execution stage. The preprocessing stage reads the original graph data into memory, converts the data into the format required for computation, and writes it to external memory files; the computation execution stage loads the external memory data into memory, builds the subgraphs in memory, executes the computation updates, and writes the results back to external memory. To enable index-based selective loading of external memory data, subgraph edge sorting and index construction are added to the preprocessing stage, and a decision step is added to the computation execution stage, with the loading of edge data and edge-value data modified accordingly.

The preprocessing stage of the present invention comprises steps S0-S2.

Step S0. Divide the complete graph data into multiple subgraphs that fit in memory.

According to the amount of memory available among the computing resources, the graph data is divided from the complete original data into multiple subgraphs; the data volume of a subgraph generally does not exceed the available memory capacity, so that memory can hold all the data of a single subgraph.
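For illustration only, the following Python sketch shows one way step S0 could be realized; the per-edge size constant and the greedy chunking rule are assumptions, not the patent's partitioning scheme.

```python
# Sketch of step S0: split the full edge list into subgraphs that fit in memory.
# BYTES_PER_EDGE and the greedy chunking rule are illustrative assumptions, not
# the patent's actual partitioning scheme.
BYTES_PER_EDGE = 12  # e.g. two 4-byte vertex ids plus a 4-byte edge value (assumed)

def partition_edges(edges, memory_budget_bytes):
    """Pack edges into subgraphs whose estimated size stays within the budget."""
    max_edges = max(1, memory_budget_bytes // BYTES_PER_EDGE)
    return [edges[i:i + max_edges] for i in range(0, len(edges), max_edges)]

if __name__ == "__main__":
    edges = [(1, 2, 0.5), (1, 3, 0.2), (2, 4, 1.0), (3, 1, 0.7), (4, 2, 0.1)]
    for i, sg in enumerate(partition_edges(edges, memory_budget_bytes=36)):
        print(f"subgraph {i}: {sg}")  # two subgraphs of 3 and 2 edges
```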

Step S1. Sort the edges of each subgraph by source vertex number and by target vertex number, respectively.

The vertices are regarded as numbered consecutively from 0 to |V|-1, and the subgraph data is sorted by source vertex number and by target vertex number, respectively. The vertex set V is the set of all vertices in the graph; a vertex consists of its vertex number and its own value.
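A minimal sketch of step S1, assuming edges are held in memory as (source, target, value) triples: each subgraph is sorted once by source vertex number for the out-edge view and once by target vertex number for the in-edge view.

```python
# Sketch of step S1: produce the two sorted views of a subgraph's edge list,
# assuming edges are (source, target, value) triples.
def sort_subgraph(edges):
    """Return (edges sorted by source vertex, edges sorted by target vertex)."""
    by_source = sorted(edges, key=lambda e: (e[0], e[1]))  # out-edge order
    by_target = sorted(edges, key=lambda e: (e[1], e[0]))  # in-edge order
    return by_source, by_target

edges = [(3, 1, 0.7), (1, 3, 0.2), (1, 2, 0.5), (2, 1, 1.0)]
out_view, in_view = sort_subgraph(edges)
print(out_view)  # grouped by source vertex: 1, 1, 2, 3
print(in_view)   # grouped by target vertex: 1, 1, 2, 3
```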

Step S2. Write the sorted subgraphs into external memory files, and build indexes for the source vertex numbers and the target vertex numbers, respectively; the index records the offset address of each vertex's corresponding edges in the external memory file.

When the edge-sorted subgraphs are written into external memory files, edges with the same source vertex (or the same target vertex) are stored in contiguous external memory data blocks. The data of an edge consists of the edge itself and its edge value, which are saved in external memory as two files: an edge file and an edge-value file. The edge file stores the topology of the graph in adjacency list format. The edge files comprise an out-edge file organized by source vertex number and an in-edge file organized by target vertex number. The out-edge file is organized by source vertex number and sequentially and contiguously stores the source vertex number, the out-degree, and the target vertex numbers of the outgoing edges; the in-edge file is organized by target vertex number and sequentially and contiguously stores the target vertex number, the in-degree, and the source vertex numbers of the incoming edges. The order of edge values in the edge-value file corresponds to the order of edges in the edge file. Since the two files store their information in the same order, the data in the edge-value file can be conveniently located from the information in the edge file.
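To make the layout concrete, here is a sketch of writing one out-edge file in the adjacency-list format described above together with its matching edge-value file; the binary encoding (4-byte little-endian integers and 4-byte floats) is an assumption, since the patent does not fix field widths.

```python
# Sketch: write a source-vertex-ordered out-edge file (vertex id, out-degree,
# target ids...) and the matching edge-value file, whose values appear in the
# same order as the edges.  The 4-byte little-endian fields are an assumption.
import struct
from collections import defaultdict

def write_out_edge_files(edges, edge_path, value_path):
    adj = defaultdict(list)                       # source -> [(target, value), ...]
    for src, dst, val in sorted(edges):
        adj[src].append((dst, val))
    with open(edge_path, "wb") as ef, open(value_path, "wb") as vf:
        for src in sorted(adj):
            ef.write(struct.pack("<ii", src, len(adj[src])))   # id, out-degree
            for dst, val in adj[src]:
                ef.write(struct.pack("<i", dst))               # target vertex id
                vf.write(struct.pack("<f", val))               # edge value, same order

write_out_edge_files([(1, 2, 0.5), (1, 3, 0.2), (2, 4, 1.0)],
                     "outedges.bin", "outvalues.bin")
```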

Since keeping two copies of the edge-value data would inevitably require synchronizing and writing back twice as much data, the write-back process is optimized. Because the two copies of edge values would differ only in storage order, an improved edge-value file structure is designed that stores only one copy of the edge-value data plus one copy of each edge value's offset address in that file. Figure 1 is a schematic diagram of the correspondence between the edge-value file and the edge-value offset address file provided by the present invention. As shown in Figure 1, the out-edge data file stores the original edge-value data, with the order of edge values corresponding to the order of edges in that file, which keeps compatibility with the system's original data loading method; the in-edge data file holds an "edge value" data set in the same format, but what is stored there is not the real edge values; each entry is a file offset address pointing to the position of the real edge value in the out-edge value file.
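The idea of Figure 1 can be sketched as follows: rather than duplicating the edge values in in-edge order, the in-edge "value" file stores, for each in-edge, the byte offset of that edge's real value in the out-edge value file. The value size and the 8-byte offsets below are illustrative assumptions.

```python
# Sketch of the improved edge-value layout of Figure 1: one real value file kept
# in out-edge order, plus an offset file that maps each in-edge to the byte
# position of its value in that file.
import struct

VALUE_SIZE = 4  # assumed size of one stored edge value, in bytes

def build_in_edge_offset_file(edges, offset_path):
    """edges: (source, target, value) triples.  Out-edge order = sorted by source."""
    out_order = sorted(range(len(edges)), key=lambda i: (edges[i][0], edges[i][1]))
    value_offset = {e: pos * VALUE_SIZE for pos, e in enumerate(out_order)}
    in_order = sorted(range(len(edges)), key=lambda i: (edges[i][1], edges[i][0]))
    with open(offset_path, "wb") as f:
        for e in in_order:                       # in-edge order, but offsets only
            f.write(struct.pack("<q", value_offset[e]))

build_in_edge_offset_file([(1, 2, 0.5), (2, 1, 0.3), (1, 3, 0.2)], "invalue_offsets.bin")
```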

The index records the offset address of each vertex's corresponding edges in the external memory file. The present invention builds sparse indexes for the source vertex numbers and the target vertex numbers, respectively, and uses redundant storage, storing every edge once as an out-edge and once as an in-edge, trading roughly double the external memory storage overhead for better time performance. The details are as follows:

When the index is built, an index file is created over the vertex numbers of the edge file, and the index points to the offset address of each vertex's out-edges or in-edges in that file. Each line of the index file corresponds to a pair: a vertex number plus the offset address of its out-edge or in-edge information in the file. The edge file is divided into blocks at a given interval, and each index entry points to the first record of the corresponding block, containing its vertex number and file offset address. In this way large amounts of vertex data can be skipped. When data is loaded one block at a time, some useless data is inevitably read in, but because each block may contain the data of several required vertices, the number of I/O operations is reduced and efficiency improves.

Figure 2 is a schematic diagram of the out-edge index file and the out-edge file provided by an embodiment of the present invention. As shown in Figure 2, each line in the out-edge file represents the out-edge information of one source vertex and consists of the source vertex number, the out-degree, and the target vertex numbers. The given interval is 3. When the index is built, the out-edge file is divided into blocks at an interval of 3, so vertices 1-3 form one block and vertices 4-6 form another, and each index entry points to the first vertex of its block; for example, the first vertex V1 points to address 0, the start of the first block, and vertex 4 then points to address 32, the start of the second block (assuming that each value requires 4 bits of storage).
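A sketch of the sparse index that mirrors the Figure 2 example (an interval of three vertex records per block); the in-memory record form and the 4-byte field width are assumptions, so the printed offsets differ from the figure's numbers.

```python
# Sketch of the sparse index of Figure 2: every block covers `interval`
# consecutive source-vertex records; each index entry stores the first vertex id
# of the block and the byte offset where the block starts.
FIELD = 4  # assumed bytes per stored value (vertex id, degree, target id)

def build_sparse_index(records, interval=3):
    """records: list of (source id, out-degree, [target ids]) in source order."""
    index, offset = [], 0
    for i, (src, deg, targets) in enumerate(records):
        if i % interval == 0:
            index.append((src, offset))       # first vertex of block, file offset
        offset += FIELD * (2 + deg)           # id + degree + target ids
    return index

records = [(1, 2, [2, 3]), (2, 1, [3]), (3, 2, [1, 4]),
           (4, 1, [5]), (5, 1, [1]), (6, 2, [2, 3])]
print(build_sparse_index(records))  # [(1, 0), (4, 44)] with these example records
```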

The topology of the graph directly affects the organization of the external memory data and therefore how that data is located. Without changing the topology of the graph, the present invention builds indexes for the external memory data and locates data through them, which reduces implementation complexity and also avoids additional external memory access overhead.

The computation execution stage of the present invention comprises steps S3-S4.

Step S3. Select the optimal loading mode between the index-positioned loading mode and the complete-data loading mode. Figure 3 is a flowchart of step S3 provided by an embodiment of the present invention. As shown in Figure 3, step S3 is as follows:

S30. Determine the active vertices of each subgraph and judge whether the proportion of active vertices exceeds a threshold; if so, go to step S31; otherwise, the complete-data loading mode is the optimal loading mode and step S3 ends.

Vertices that participate in computing updates are called active vertices; vertices that do not participate in the computation are called inactive vertices.

In rounds where the proportion of active vertices is low, the index-positioned loading mode yields a performance gain, while in rounds where the proportion of active vertices is high the original complete-data access mode performs better; combining the two data loading modes achieves the best overall result, that is, selecting the data loading mode through the decision module obtains the largest performance improvement. The threshold takes a value in the range of 20% to 30%, preferably 30%.

S31. Record the numbers of the data blocks corresponding to the active-vertex data.

A data block usually contains the data of multiple vertices, so the data blocks corresponding to all active-vertex data are determined. That is, the distribution of active vertices across the data blocks is considered: the blocks that contain active vertices are identified and counted.
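A small sketch of step S31 under a simplifying assumption: vertex ids are consecutive and every index block covers a fixed number of vertices, as in the Figure 2 example, so the block id can be computed arithmetically; a real implementation would consult the sparse index instead.

```python
# Sketch of step S31: collect the ids of the index data blocks that hold data of
# active vertices.  Assumes consecutive vertex ids and a fixed `interval` of
# vertices per block; these are illustrative simplifications.
def active_block_ids(active_vertices, first_vertex=1, interval=3):
    return sorted({(v - first_vertex) // interval for v in active_vertices})

print(active_block_ids({2, 3, 7}))  # [0, 2]: vertices 2 and 3 share block 0, vertex 7 is in block 2
```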

S32. Based on the numbers of the data blocks corresponding to the active-vertex data, calculate the cost Cost(Index) of the index-positioned loading mode and the cost Cost(All) of the complete-data loading mode.

There are two ways to load data from external memory into memory, and the data they load is not the same. When the complete data is accessed, the complete subgraph data is loaded, corresponding to the source-vertex-ordered data file, while the target-vertex-ordered data file is not loaded; when index positioning is used to load the valid data, the out-edges of vertices are loaded from the source-vertex-ordered out-edge file and the in-edges of vertices are loaded from the target-vertex-ordered in-edge file.

The complete-subgraph loading mode is usually used when the application needs a large amount of data; in that case little invalid data can be skipped, the optimization benefit is low, and the high performance of sequential external memory access gives better access efficiency. Note that preprocessing the original graph data produces two copies of external memory data, and both data files contain the complete subgraph information, so the system only needs to access one of them when loading data. When the subgraph is constructed, each edge in the data file is processed in turn, and it is judged whether the edge's source vertex and target vertex are data required by the application: if the source vertex needs to participate in the computation, the current edge is added to that vertex's out-edge sequence; if the target vertex needs to participate in the computation, the current edge is added to that vertex's in-edge sequence.

The data loading mode for accessing the complete subgraph is as follows (a sketch follows the list):

(1) Sequentially load the source-vertex-ordered data file;

(2) Process each edge in the data file in turn and judge whether the edge's source vertex and target vertex are data required by the application; if the source vertex needs to participate in the computation, add the current edge to that vertex's out-edge sequence; if the target vertex needs to participate in the computation, add the current edge to that vertex's in-edge sequence.
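As referenced above, here is a sketch of the complete-data path, assuming the source-ordered file has already been read sequentially into (source, target) pairs; only the filtering into out-edge and in-edge sequences is shown.

```python
# Sketch of the complete-data loading mode: scan every edge of the subgraph file
# sequentially and keep an edge only where its source or target vertex has to
# participate in this iteration's computation.
from collections import defaultdict

def build_subgraph_full_scan(edges, active):
    """edges: iterable of (source, target) pairs from the source-ordered file."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        if src in active:                 # source computes: keep as its out-edge
            out_edges[src].append(dst)
        if dst in active:                 # target computes: keep as its in-edge
            in_edges[dst].append(src)
    return out_edges, in_edges

outs, ins = build_subgraph_full_scan([(1, 2), (2, 3), (3, 1)], active={2})
print(dict(outs), dict(ins))  # {2: [3]} {2: [1]}
```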

The index-positioned loading mode is usually used when the application needs only a small amount of data; in that case much invalid data can be skipped and the optimization effect is significant, because index positioning accesses only the data the application needs, reducing the amount of I/O data and achieving higher access efficiency. Before the external memory data is accessed, the vertex state information of the system is obtained first and the numbers of the data blocks corresponding to active-vertex data are recorded; a data block usually contains the data of multiple vertices, all data blocks corresponding to active-vertex data are determined, and the index is used to locate the corresponding file positions and access those blocks. Ideally only the data of the active vertices would be loaded, but in practice accessing external memory once per vertex would cause too many I/O operations and be inefficient, whereas accessing the data blocks corresponding to the vertices reads the data of multiple vertices at once; although a block inevitably contains some useless data, this greatly reduces the number of I/O operations and improves I/O efficiency. Note that the loaded data comes from the two data files: the out-edges of vertices are loaded from the source-vertex-ordered file and the in-edges from the target-vertex-ordered file. Therefore, when the subgraph is constructed, out-edges are added to the out-edge sequence of the corresponding source vertex and in-edges to the in-edge sequence of the corresponding target vertex, which reduces the amount of data to be processed compared with accessing the complete data.

The index-positioned data loading mode is as follows (a sketch follows the list):

(1) Locate, via the index, the data blocks in the out-edge file that need to be loaded into memory;

(2) Using the out-edge offset addresses of the data blocks to be loaded from the out-edge file, find the edge-value data of those out-edges in the edge-value file and load the edge-value data into memory;

(3) Build the out-edge topology of the vertices in memory;

(4) When the application needs in-edges, locate, via the index, the in-edge data blocks in the in-edge file that need to be loaded into memory;

(5) Using the in-edge offset addresses of the data blocks to be loaded from the in-edge file, find the edge-value offset addresses of those in-edges in the edge-value index file and load the offset addresses of the edge-value data into memory;

(6) Build the in-edge topology of the vertices in memory.
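As referenced above, here is a sketch that puts steps (1) and (3) together for the out-edge side, reusing the assumed binary layout and sparse index of the earlier sketches; step (2), and the in-edge side via the offset file, would follow the same block-offset pattern. All names are illustrative.

```python
# Sketch of index-positioned loading, out-edge side (steps (1) and (3)): parse
# only the index data blocks that contain active vertices and rebuild their
# out-edge lists.  Step (2), fetching the edge values at the matching positions
# of the edge-value file, follows the same block-offset logic and is omitted.
import struct

FIELD = 4  # assumed width of ids and degrees

def parse_blocks(edge_bytes, index, block_ids, active):
    """index: [(first vertex id of block, byte offset), ...] in block order."""
    out_edges = {}
    for b in sorted(block_ids):
        start = index[b][1]
        end = index[b + 1][1] if b + 1 < len(index) else len(edge_bytes)
        pos = start
        while pos < end:
            src, deg = struct.unpack_from("<ii", edge_bytes, pos)
            pos += 2 * FIELD
            targets = list(struct.unpack_from(f"<{deg}i", edge_bytes, pos))
            pos += deg * FIELD
            if src in active:             # keep only vertices that must compute
                out_edges[src] = targets
    return out_edges

# Tiny usage example: block 0 holds vertices 1-2, block 1 holds vertex 3.
raw = (struct.pack("<iiii", 1, 2, 2, 3)    # vertex 1, degree 2, targets 2 and 3
       + struct.pack("<iii", 2, 1, 3)      # vertex 2, degree 1, target 3
       + struct.pack("<iii", 3, 1, 1))     # vertex 3, degree 1, target 1
index = [(1, 0), (3, 28)]
print(parse_blocks(raw, index, block_ids=[1], active={3}))  # {3: [1]}
```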

During data loading, the position of a vertex's data in the file can be located quickly from its vertex number, which improves the efficiency of data lookup and access at runtime.
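A sketch of that lookup, assuming the sparse index is the sorted list of (first vertex id, byte offset) pairs used in the earlier sketches.

```python
# Sketch: given a vertex id, binary-search the sparse index for the block that
# contains the vertex's edges and return the block's starting byte offset.
import bisect

def locate_block(index, vertex_id):
    keys = [first for first, _ in index]
    pos = bisect.bisect_right(keys, vertex_id) - 1
    if pos < 0:
        raise KeyError(f"vertex {vertex_id} precedes the first indexed block")
    return index[pos][1]                  # byte offset of the containing block

index = [(1, 0), (4, 44), (7, 88)]
print(locate_block(index, 5))  # 44: vertex 5 lives in the block that starts at byte 44
```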

The cost of the complete-data loading mode is

[Cost(All) formula, given as Figure BDA0001802287430000101 in the original]

and the cost of the index-positioned loading mode is

[Cost(Index) formula, given as Figure BDA0001802287430000102 in the original]

where E is the set of all edges in the graph; D is the storage space occupied by a single edge; B is the data block size of one I/O; b is the size of the data block pointed to by each index entry; k is the number of index data blocks corresponding to all active vertices in the current system; and r is the random-access overhead.

The constants D, |E|, B, b, and r are obtained and assigned before the computation runs; D*|E| is the space occupied by the data file.
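Since the two cost expressions are only reproduced as figures here, the sketch below uses an assumed reconstruction from the variable definitions (a full sequential scan versus k indexed block reads, each with a random-access penalty); it illustrates the S32-S33 decision rather than the patent's exact equations.

```python
# Illustrative reconstruction of the S32/S33 decision.  The exact cost formulas
# appear only as figures in the original document, so the expressions below are
# an assumption derived from the variable definitions.
def cost_all(D, num_edges, B):
    """Sequential scan of the complete data file, in I/O blocks of size B."""
    return (D * num_edges) / B

def cost_index(k, b, B, r):
    """k index blocks of size b, each read paying the random-access overhead r."""
    return k * (b / B + r)

def choose_loading_mode(D, num_edges, B, b, k, r):
    return "index" if cost_all(D, num_edges, B) > cost_index(k, b, B, r) else "complete"

print(choose_loading_mode(D=8, num_edges=1_000_000, B=4096, b=4096, k=50, r=5))  # 'index'
```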

S33. Judge whether Cost(All) is greater than Cost(Index); if so, the index-positioned loading mode is the optimal loading mode; otherwise, the complete-data loading mode is the optimal loading mode.

Comparing the two data loading modes shows that the index-positioned loading mode is suitable when the number of active vertices is small: in that case the computation needs little data, accessing the complete data would bring in a great deal of invalid data, and the amount of data to process while constructing the subgraph would also be large, wasting CPU resources. The complete-subgraph loading mode, by contrast, makes full use of the sequential access performance of the external disk, and loading all the data at once achieves better performance when the proportion of active vertices is high. Using the two data loading modes together lets the system always obtain the best performance when loading data, greatly improving the overall efficiency of the system.

Step S4. Load each subgraph from external memory into memory using the optimal loading mode.

The above are only preferred embodiments of the present application, and the protection scope of the present application is not limited thereto; any change or substitution that can readily occur to a person skilled in the art within the technical scope disclosed by the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. An external memory access method for a graph computing system based on index positioning, the method comprising the steps of:
S0. dividing the complete graph data into multiple subgraphs that the memory can hold;
S1. sorting the edges of each subgraph by source vertex number and by target vertex number, respectively;
S2. writing the sorted subgraphs into external memory files, and building indexes for the source vertex numbers and the target vertex numbers, respectively;
S3. selecting an optimal loading mode between the index-positioned loading mode and the data loading mode that accesses the complete subgraph, wherein step S3 is as follows:
S30. determining the active vertices of each subgraph and judging whether the proportion of active vertices exceeds a threshold; if so, entering step S31; otherwise, taking the data loading mode that accesses the complete subgraph as the optimal loading mode and ending step S3;
S31. recording the numbers of the data blocks corresponding to the active-vertex data;
S32. calculating, based on the numbers of the data blocks corresponding to the active-vertex data, the cost Cost(Index) of the index-positioned loading mode and the cost Cost(All) of the data loading mode that accesses the complete subgraph;
S33. judging whether Cost(All) is greater than Cost(Index); if so, taking the index-positioned loading mode as the optimal loading mode; otherwise, taking the data loading mode that accesses the complete subgraph as the optimal loading mode;
wherein the index-positioned loading mode is as follows: (1) locating, via the index, the data blocks in the out-edge file that need to be loaded into memory; (2) finding, according to the out-edge offset addresses of the data blocks to be loaded from the out-edge file, the edge-value data of the out-edges in the edge-value file, and loading the edge-value data into memory; (3) constructing the out-edge topology of the vertices in memory; (4) when the application needs in-edges, locating, via the index, the in-edge data blocks in the in-edge file that need to be loaded into memory; (5) finding, according to the in-edge offset addresses of the data blocks to be loaded from the in-edge file, the edge-value offset addresses of the in-edges in the edge-value index file, and loading the edge-value data offset addresses into memory; (6) constructing the in-edge topology of the vertices in memory;
wherein the data loading mode that accesses the complete subgraph is as follows: (1) sequentially loading the source-vertex-ordered data file; (2) processing each edge in the data file in turn and judging whether the source vertex and the target vertex of the edge are data required by the application; if the source vertex needs to participate in the computation, adding the current edge to the out-edge sequence of that vertex; if the target vertex needs to participate in the computation, adding the current edge to the in-edge sequence of that vertex;
and S4. loading each subgraph in external memory into memory using the optimal loading mode.
2. The external memory access method according to claim 1, wherein writing the sorted subgraphs into external memory files is as follows:
when each edge-sorted subgraph is written into an external memory file, edges with the same source vertex or the same target vertex are stored in contiguous external memory data blocks; the data of an edge comprises the edge and an edge value, which are stored in external memory as two files: an edge file and an edge-value file; the edge file stores the topology of the graph in adjacency list format; and the order of the edge values in the edge-value file corresponds to the order of the edges in the edge file.
3. The external memory access method according to claim 2, wherein the edge file comprises an out-edge file organized by source vertex number and an in-edge file organized by target vertex number; the out-edge file is organized by source vertex number and sequentially and contiguously stores the source vertex number, the out-degree, and the target vertex numbers of the outgoing edges; and the in-edge file is organized by target vertex number and sequentially and contiguously stores the target vertex number, the in-degree, and the source vertex numbers of the incoming edges.
4. The external memory access method according to claim 2, wherein, when the index is built, an edge-value index file is created using the vertex numbers of the edge file, and each row of the edge-value index file corresponds to a pair: a vertex number of the edge file and the offset address of its corresponding out-edge or in-edge in the external memory file.
5. The external memory access method according to claim 1, wherein the threshold takes a value in the range of 20% to 30%.
6. The external memory access method according to claim 1, wherein the cost of the data loading mode that accesses the complete subgraph is
[Cost(All) formula, given as Figure FDA0002575606660000031 in the original]
and the cost of the index-positioned loading mode is
[Cost(Index) formula, given as Figure FDA0002575606660000032 in the original]
wherein E is the set of all edges in the graph; D is the storage space occupied by a single edge; B is the data block size of one I/O; b is the size of the data block pointed to by each index entry; k is the number of index data blocks corresponding to all active vertices in the current system; and r is the random-access overhead.
7. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements a method of external memory access according to any one of claims 1 to 6.
CN201811082365.8A 2018-09-17 2018-09-17 An external memory access method for graph computing system based on index positioning Active CN109522428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811082365.8A CN109522428B (en) 2018-09-17 2018-09-17 An external memory access method for graph computing system based on index positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811082365.8A CN109522428B (en) 2018-09-17 2018-09-17 An external memory access method for graph computing system based on index positioning

Publications (2)

Publication Number Publication Date
CN109522428A CN109522428A (en) 2019-03-26
CN109522428B true CN109522428B (en) 2020-11-24

Family

ID=65771279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811082365.8A Active CN109522428B (en) 2018-09-17 2018-09-17 An external memory access method for graph computing system based on index positioning

Country Status (1)

Country Link
CN (1) CN109522428B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523000B (en) * 2020-04-23 2023-06-23 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for importing data
CN112287182B (en) * 2020-10-30 2023-09-19 杭州海康威视数字技术股份有限公司 Graph data storage and processing method and device and computer storage medium
CN112799845A (en) * 2021-02-02 2021-05-14 深圳计算科学研究院 Graph algorithm parallel acceleration method and device based on GRAPE framework
CN112988064B (en) * 2021-02-09 2022-11-08 华中科技大学 Concurrent multitask-oriented disk graph processing method
CN113448964B (en) * 2021-06-29 2022-10-21 四川蜀天梦图数据科技有限公司 Hybrid storage method and device based on graph-KV
CN114282073B (en) * 2022-03-02 2022-07-15 支付宝(杭州)信息技术有限公司 Data storage method and device, data reading method and device
CN114756483A (en) * 2022-03-31 2022-07-15 深圳清华大学研究院 Subgraph segmentation optimization method based on inter-core storage access and application
CN115391341A (en) * 2022-08-23 2022-11-25 抖音视界有限公司 Distributed graph data processing system, method, device, equipment and storage medium
CN116861032B (en) * 2023-06-12 2025-09-16 达梦数据技术(江苏)有限公司 Edge storage implementation method, system, equipment and storage medium of distributed protogram database


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122248A (en) * 2017-05-02 2017-09-01 华中科技大学 A kind of distributed figure processing method of storage optimization
CN107491495A (en) * 2017-07-25 2017-12-19 南京师范大学 Storage method of the preferential space-time trajectory data file of space attribute in auxiliary storage device
CN107957962A (en) * 2017-12-19 2018-04-24 重庆大学 It is a kind of to calculate efficient figure division methods and system towards big figure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale graph data processing based on multi-GPU platforms; Zhang Heng et al.; Journal of Computer Research and Development; 2018-01-15; Vol. 55, No. 2; pp. 273-288 *

Also Published As

Publication number Publication date
CN109522428A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109522428B (en) An external memory access method for graph computing system based on index positioning
US8229968B2 (en) Data caching for distributed execution computing
CN112287182A (en) Graph data storage and processing method and device and computer storage medium
CN110737804B (en) Graph processing access optimization method and system based on activity degree layout
CN106777351A (en) Computing system and its method are stored based on ART tree distributed systems figure
CN115114294B (en) Adaptive method, device and computer equipment for database storage mode
CN103559016A (en) Frequent subgraph excavating method based on graphic processor parallel computing
Jaiyeoba et al. Graphtinker: A high performance data structure for dynamic graph processing
US20230281157A1 (en) Post-exascale graph computing method, system, storage medium and electronic device thereof
CN104778077A (en) High-speed extranuclear graph processing method and system based on random and continuous disk access
US20250284539A1 (en) Distributed graph data processing system, method, apparatus and device, and storage medium
CN109189994B (en) CAM structure storage system for graph computation application
CN110688055B (en) A data access method and system in large graph computing
Huang et al. Survey of external memory large-scale graph processing on a multi-core system
CN115049103A (en) Parallel graph traversal method facing single-source shortest path
CN119106202A (en) A data access mode-aware software-hardware collaborative dynamic graph processing device
KR102354343B1 (en) Spatial indexing method and apparatus for blockchain-based geospatial data
CN117149795B (en) Adaptive graph calculation updating method and system based on hybrid memory
CN109254725B (en) A disk map processing method and system based on subgraph construction
CN106933882A (en) A kind of big data incremental calculation method and device
CN105005627A (en) Shortest path key node query method based on Spark distributed system
CN116610640A (en) Operation method for file, electronic device and storage medium
CN114579537A (en) Distributed graph database optimization method and device, electronic equipment and storage medium
CN110851178B (en) Inter-process program static analysis method based on distributed graph reachable computation
CN107665291A (en) A kind of mutation detection method based on cloud computing platform Spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant