[go: up one dir, main page]

CN114118443A - Large-scale graph embedding training method and system based on Optane DIMM - Google Patents

Large-scale graph embedding training method and system based on Optane DIMM Download PDF

Info

Publication number
CN114118443A
CN114118443A CN202111415792.5A CN202111415792A CN114118443A CN 114118443 A CN114118443 A CN 114118443A CN 202111415792 A CN202111415792 A CN 202111415792A CN 114118443 A CN114118443 A CN 114118443A
Authority
CN
China
Prior art keywords
graph
training
data
gpu
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111415792.5A
Other languages
Chinese (zh)
Other versions
CN114118443B (en
Inventor
姚建国
陈悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN202111415792.5A priority Critical patent/CN114118443B/en
Publication of CN114118443A publication Critical patent/CN114118443A/en
Application granted granted Critical
Publication of CN114118443B publication Critical patent/CN114118443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9014Indexing; Data structures therefor; Storage structures hash tables
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明提供了一种基于Optane DIMM的大规模图嵌入训练方法及系统,包括:原始图处理步骤:对原始图进行处理,生成DRAM能够加载的图数据;数据预处理步骤:将图数据根据特征进行两层图分割,将完整图拆分成子图,并存储在磁盘中,使得子图能够加载入GPU进行分区训练;图训练步骤:根据不同介质的访存特性,将训练所用的图数据存储在不同的物理介质中,根据训练过程中所依赖数据的不同特性来切割算法,采用CPU和GPU分工训练,平衡CPU计算、GPU计算以及CPU‑GPU通信三者的开销。本发明根据图的特性进行高质量的两层图分割,将大图转化为子图,存储在磁盘中,从而子图能够加载入GPU进行分区训练。

Figure 202111415792

The invention provides a large-scale graph embedding training method and system based on Optane DIMM, including: an original graph processing step: processing the original graph to generate graph data that can be loaded by DRAM; a data preprocessing step: converting graph data according to features Perform two-layer graph segmentation, split the complete graph into subgraphs, and store them in the disk, so that the subgraphs can be loaded into the GPU for partition training; graph training steps: According to the memory access characteristics of different media, the graph data used for training is stored. In different physical media, the algorithm is divided according to the different characteristics of the data dependent on the training process, and the CPU and GPU are used for training to balance the overhead of CPU computing, GPU computing and CPU-GPU communication. The present invention performs high-quality two-layer graph segmentation according to the characteristics of the graph, converts the large graph into subgraphs, and stores them in the disk, so that the subgraphs can be loaded into the GPU for partition training.

Figure 202111415792

Description

Large-scale graph embedding training method and system based on Optane DIMM
Technical Field
The invention relates to the technical field of computer storage, computer calculation and deep learning, in particular to a large-scale graph embedding training method and system based on Optane DIMM.
Background
Graphs, such as social networks, word coexistence networks, and communication networks, are widely present in a variety of real-world applications. Through the analysis of the above, we can understand the social structure, language and different communication modes, so the graph is always the hot point of the research in the academic world.
The network representation using the adjacency matrix a represents a graph using a storage space of | V | × | V |, and the space required for such representation increases exponentially as the number of nodes increases. Meanwhile, the vast majority of adjacency matrices are 0, and the sparsity of data makes it difficult to apply a fast and efficient learning method. Graph embedding learning means learning to obtain low-dimensional vector representation of nodes in a network, and formally, the goal of graph embedding learning is to learn a real-valued vector for each node V e V
Figure RE-GDA0003445938290000011
Where κ < | V | represents the dimension of the vector. The graph and graph embedding mathematics are defined as follows:
the following drawings: graph G (V, E) is a set of vertices V ═ V1,…,vnAnd E set of edge sets. e.g. of the typeijE E contains a source vertex viAnd a target vertex vj. For weighted graph G, adjacency matrix W contains non-negative weights W associated with each edgeijIs more than or equal to 0. If v isiAnd vjWithout connection, then WijIs set to 0. For undirected weighted graphs, there is always
Figure RE-GDA0003445938290000012
Figure RE-GDA0003445938290000013
Graph embedding: given a graph G (V, E) and the predefined dimensionality of embedding d, a graph (node) embedding is a mapping f:
Figure RE-GDA0003445938290000014
so that the function f retains some of the semantic features defined on the graph G. The graph embedding problem is the problem of mapping an entire graph, sub-graph or edge to a d-dimensional vector. A visual representation of graph embedding is shown in fig. 1.
The graph embedding technique is a technique belonging to graph analysis and representation learning. The purpose of this is to represent a graph as a low-dimensional vector while maintaining the graph structure. As a simple and effective method for reducing dimensionality, graph embedding has been widely applied and successfully applied to the fields of node classification, clustering, recommendation, link prediction, network visualization, and the like.
Graph embedding algorithms often measure similarity using random walks. Random walk is also the basis of a class of output sensitive algorithms, which utilize random walk to calculate local structural information under the linear complexity of input graph volume size. It is this association with local structures that encourages random walks to be the basic tool for extracting information from the graph. In addition to capturing community information, there are two other desirable features that use random walks as the basis for the algorithm. First, local exploration is easily parallelized. Several random walkers (on different threads, processes, or machines) can simultaneously explore different parts of the same graph. Second, small changes in the graph structure can be accommodated without requiring global recalculation, relying on information obtained from short random walks. We can update the learned model with new random walk iterations, from regions of sub-linear variation in time to the entire graph. Therefore, the invention extracts the graph characteristics by using the random walk as the basis.
In the big data age, the size of graphs is increasing. For example, in social networks, the number of nodes has increased to the billions and the number of edges has also increased to the billions. Effectively processing graphs of this scale remains a challenge. For large graphs with billions of vertices, it is difficult for a typical single-node server to guarantee enough DRAM memory to support the service. Although existing swap partitioning techniques may build a larger logical memory for a machine than a physical memory, it is still difficult to achieve TB level capacity and limited by low disk access speed and inefficiency.
Another solution referred to is to store the graph data and the embedded data on a magnetic disk. However, since graph embedding algorithms often require byte addressing and disk access latency is high, the possibility of building a high performance graph embedding system scheme with disks is low. While distributed solutions have limitations in terms of efficiency, cost and user-friendliness: first, the graphics data and embedded data need to be accessed and transferred frequently between different machines. It causes high network communication costs and delays. Second, purchasing a set of powerful machines is expensive, placing a burden on small corporations and individual developers. Therefore, how to build an efficient large graph embedding system on a single machine becomes a current challenge and opportunity. Non-volatile memory is another direction in which stand-alone solutions to this problem.
The advent of new hardware Optane DIMMs promises for efficient large-scale embedded training on a stand-alone machine. Intel's opto DIMM was the first commercial persistent memory that supported byte-granular access in DRAM order. Meanwhile, the storage capacity of a single Optane DIMM can be as high as 512 GB. For a dual-socket machine, the supported Optane DIMM size is up to 6T (2 sockets x 6 channels x 512 gb/DIMM).
In chinese patent publication No. CN113343123A, a training method for generating an antagonistic multiple relation graph network model for detecting a machine account is disclosed, the training method comprising: modeling a platform as a graph containing nodes v and relations r, wherein the number of the graph is determined by the number of types of the relations r; generating a false target node vt of a source node v by using a generator G; respectively inputting the sampled node pairs (v, u) and (v, vt) into a connection relation discriminator D, and repeatedly training the connection relation discriminator D; reasoning node pairs in the graph by using the trained connection relation discriminator D, determining the connection relation of the node pairs, and further updating the structure of the graph; and inputting the characterization vectors of the nodes into a classifier, reversely propagating and updating parameters of the model according to the loss function, and performing training for multiple times to obtain a trained network model for generating the confrontation multiple relation graph.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a large-scale graph embedding training method and system based on Optane DIMM.
The invention provides a large-scale graph embedding training method based on Optane DIMM, which comprises the following steps:
original graph processing steps: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing step: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
and (3) graph training: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
Preferably, the primitive map processing step includes the sub-steps of:
s1: initializing a hash table with the size of hash _ table _ size, and storing the mapping from the vertex name to the vertex ID;
s2: entering circulation, wherein the circulation frequency is the size of the original graph list;
s3: reading an edge from the original graph in each cycle, wherein the edge consists of a vertex v _ name and a target vertex u _ name;
s4: searching whether v _ name appears already from a hash table, if so, returning the mapped v _ id, and if not, calling a hash _ table. AddVertex (name _ v, count _ num _ vertices) method and inserting a new vertex;
s5: if a hash _ table. AddVertex (name _ v, count _ num _ verticals) is called, firstly creating a new vertex, adding a vertex set verticals, and then adding a counter count _ num _ verticals by one; if the count _ num _ thresholds exceeds the maximum capacity of the existing thresholds, the thresholds will automatically expand;
s6: circulating until a vacancy of the hash table is found, and inserting a mapping relation from a vertex name to an ID in the hash table;
s7: processing u _ name according to the steps in S4-S6;
s8: and writing the mapped edge into an output file.
Preferably, the graph data is stored on a disk in a file form, and the data format is source _ vertex _ id and destination _ vertex _ id; for an undirected graph, simultaneously storing source _ vertex _ id and destination _ vertex _ id; destination _ vertex _ id, source _ vertex _ id.
Preferably, the splitting of the graph data includes:
-using an edge segmentation strategy: dividing the graph data according to points, separating edges, dividing sub-graph data into the number of GPUs, and not performing edge division for single GPU equipment;
-using a point segmentation strategy: and dividing the graph data according to edges, determining the number of the divided subgraphs according to the size of a GPU memory, and not performing a point division strategy when the subgraphs can be completely loaded into the GPU.
Preferably, the CPU performs negative sampling and trimming training operations, the GPU performs positive sampling, positive sample training, and negative sample training operations, and the CPU-GPU communication is performed using a PCI load.
Preferably, the CPU task specifically includes:
loading data: loading data required by training, wherein the data comprises graph structure data and graph embedding data;
the Graph structure data is loaded into the Optane DIMM from a disk, and a metadata structure Graph is abstracted and points to concrete data in the Optane DIMM, and the metadata is stored in a DRAM;
the graph embedded data is allocated with a memory and initialized in the DRAM, a metadata structure Embedding is abstracted, and the metadata structure Embedding points to concrete data and is stored in the DRAM;
negative sampling: the system starts the FIRST _ PARTITION _ NUM number of threads, each thread maintains a NEG _ SAMPLE _ POOL _ NUM block sampling POOL, the SIZE of the sampling POOL is NEG _ SAMPLE _ POOL _ SIZE, and different threads perform parallel negative sampling;
cutting edge training: training each cut edge generated in the edge segmentation process by using a CPU (central processing unit); the CPU starts CROSS _ EDGE _ TRAIN _ THREAD THREADs, and each THREAD is responsible for training partial trimming
Task scheduling: the first layer of subgraphs are trained by different GPUs in parallel, the different GPUs are isolated from each other, and no data communication overhead exists; serially training a second layer of sub-graphs;
graph embedding evaluation: with the generated node embedding, different types of machine learning tasks are run, evaluating either the micro f1 or the macro f1, for comparison with other solutions.
Preferably, in the negative sampling process, the sampling strategy for each small sampling pool includes: counting the degree sum of the whole graph point and the degree sum of the sub-graph points, and determining the sampling number of each sub-graph according to the proportion of the sub-graph degree sum to the whole graph degree sum, wherein the following conditions are met:
subgraphi-degree_num/graph_degree_num
=subgraphi-neg_sam_num/NEG_SAMPLE_POOL_SIZE。
preferably, the GPU task includes:
positive sampling: for each edge, positive sampling is carried out in a random walk mode, the input is a vertex vid, and the output is a positive sampling edge list;
training: for each edge in the subgraph, training is performed with the GPU.
Preferably, the PCI load task includes:
graph structure data transmission: copying the graph structure data from Optane DIMMs to a GPU for calculation through a PCIe protocol;
and (3) negative sample transmission: copying the graph structure data from the host DRAM to the GPU for calculation through a PCIe protocol;
embedding and transmitting: copying the graph structure data from the host DRAM to the GPU for calculation through a PCIe protocol;
embedding and transmitting: the embedded data is copied back to the host DRAM by the GPU over the PCIe protocol.
The invention discloses a large-scale graph embedding training system based on Optane DIMM, which comprises:
an original graph processing module: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing module: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
a graph training module: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
Compared with the prior art, the invention has the following beneficial effects:
1. according to the invention, high-quality two-layer graph segmentation is carried out according to the characteristics of the graph, and the large graph is converted into the subgraph and stored in the disk, so that the subgraph can be loaded into a GPU for partition training.
2. According to the method, the graph data used for training are respectively stored in different physical media according to the access characteristics of different media, the frequently written data are stored in a DRAM, and the read-only data are stored in an Optane DIMM.
3. According to different characteristics of data depended on in the algorithm flow, the method cuts the algorithm, the GPU processes positive sampling (random walk), positive sample training and negative sample training, and the CPU processes negative sampling and edge cutting training, so that the system computing power can be fully utilized.
4. Asynchronous transmission in the invention balances computation and IO overhead, thereby constructing a high-performance end-to-end training complete execution flow.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram illustrating a graph embedding process in the prior art;
FIG. 2 is a schematic diagram of an overall method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention. All falling within the scope of the present invention.
According to the large-scale graph embedding training method based on Optane DIMM provided by the invention, referring to FIG. 1, the method comprises the following steps:
original graph processing steps: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing step: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
and (3) graph training: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
In order to break through the limitation, the system designs a streaming Hash mapping scheme without loading complete graph data into the system, and the processing step of the original graph comprises the following substeps:
s1: initializing a hash table with the size of hash _ table _ size, and storing the mapping from the vertex name to the vertex ID;
s2: entering circulation, wherein the circulation frequency is the size of the original graph list;
s3: reading an edge from the original graph in each cycle, wherein the edge consists of a vertex v _ name and a target vertex u _ name;
s4: searching whether v _ name appears already from a hash table, if so, returning the mapped v _ id, and if not, calling a hash _ table. AddVertex (name _ v, count _ num _ vertices) method and inserting a new vertex;
s5: if a hash _ table. AddVertex (name _ v, count _ num _ verticals) is called, firstly creating a new vertex, adding a vertex set verticals, and then adding a counter count _ num _ verticals by one; if the count _ num _ thresholds exceeds the maximum capacity of the existing thresholds, the thresholds will automatically expand;
s6: circulating until a vacancy of the hash table is found, and inserting a mapping relation from a vertex name to an ID in the hash table;
s7: processing u _ name according to the steps in S4-S6;
s8: and writing the mapped edge into an output file.
Further, the graph data is stored on a disk in a file form, and the data format is source _ vertex _ id and destination _ vertex _ id; for an undirected graph, simultaneously storing source _ vertex _ id and destination _ vertex _ id; destination _ vertex _ id, source _ vertex _ id.
The graph partitioning divides a large complete graph into small sub-graphs so as to be loaded to a GPU for calculation, and the graph data splitting comprises the following two strategies:
the first layer employs an edge-splitting strategy: dividing sub-graph data into the number of GPUs, and not performing edge division on single GPU equipment; edge division refers to dividing the edge according to points in the dividing process; the method has the advantages that points in the subgraph have uniqueness, and data synchronous communication in the training process is reduced; the overhead is the creation of cut edges, requiring additional processing during the training process.
The second layer employs a point segmentation strategy: the number of the sub-graphs to be divided is determined according to the size of the GPU memory, and when the sub-graphs can be completely loaded into the GPU, a point division strategy is not carried out. So in the case of a graph that is too small to load completely into the GPU, two levels of graph partitioning will not really be performed. The point segmentation refers to the division according to edges in the division process, and multiple copies exist in a vertex; the method has the advantages that the overhead of additional cutting edges is not generated; the disadvantage is that multiple copies of the vertices are created, potentially creating synchronous communication overhead.
Further, during the graph data partitioning process, a graph data compression mechanism between partitions and within partitions is operated. For example, in the full graph, the vertex ID is large and needs to be stored with long data types, each vertex taking 8 bytes. After partitioning, the vertices may be stored with < first _ partition _ id, second _ partition _ id, offset > triplets, of the type byte, each vertex occupying 3 bytes.
According to different characteristics of data depended on in the training process, the method comprises a cutting algorithm, wherein the GPU is responsible for positive sampling (random walk), positive sample training and negative sample training, and the CPU is responsible for negative sampling and trimming training and balances the expenses of CPU calculation, GPU calculation and CPU-GPU communication, so that a high-performance end-to-end training complete execution stream is constructed. Specifically, the CPU performs negative sampling and trimming training operations, the GPU performs positive sampling, positive sample training, and negative sample training operations, and the CPU-GPU communication is performed using a PCI load.
Task assignments are shown in the following table:
Figure RE-GDA0003445938290000071
Figure RE-GDA0003445938290000081
in more detail, the CPU tasks specifically include:
loading data: the first step in training the workflow is to perform data loading. The data in the system comprises two types, the first type is graph structure data (nodes and edges), the structure relation of a storage graph is large in data size, and only reading and not writing are carried out in the training process. The second method is graph embedding data, each graph vertex maintains an embedding vector which is a training target and a training result, reading and writing are frequent, and the data volume is small.
The Graph structure data is stored on a disk after being divided, the Graph structure data is loaded into an Optane DIMM from the disk in the loading process, the metadata structure Graph is abstracted and points to concrete data in the Optane DIMM, and the metadata is stored in a DRAM.
The graph embedded data is allocated with a memory and initialized in the DRAM, and a metadata structure Embelling is abstracted, points to concrete data and is stored in the DRAM.
Negative sampling: the system starts the FIRST _ PARTITION _ NUM number of threads, each thread maintains a NEG _ SAMPLE _ POOL _ NUM block SAMPLE POOL, the SIZE of the SAMPLE POOL is NEG _ SAMPLE _ POOL _ SIZE, and different threads perform negative sampling in parallel.
For each small sampling pool, the sampling strategy comprises: counting the degree sum of the whole graph point and the degree sum of the sub-graph points, and determining the sampling number of each sub-graph according to the proportion of the sub-graph degree sum to the whole graph degree sum, wherein the following conditions are met:
subgraphi-degree_num/graph_degree_num
=subgraphi-neg_sam_mum/NEG_SAMPLE_POOL_SIZE
in addition, a producer consumer model is adopted by the negative sampling module, the CPU sampling is taken as a producer, and the GPU is trained as a consumer. When the sampling pool is full, the CPU sampling thread is blocked, and the GPU training thread can be executed; when the sampling pool is empty, the CPU sampling thread can execute, and the GPU training thread is blocked. The size of NEG _ SAMPLE _ POOL _ NUM determines the load balance of the two; NEG _ SAMPLE _ POOL _ SIZE affects the quality of training, all user controllable parameters.
Cutting edge training: training each cut edge generated in the edge segmentation process by using a CPU (central processing unit); the CPU starts a CROSS _ EDGE _ TRAIN _ THREAD THREAD, and each THREAD is responsible for training partial trimming.
The specific strategy is as follows: the CPU starts CROSS _ EDGE _ TRAIN _ THREAD THREADs, and each THREAD is responsible for training partial trimming; each trimming, namely randomly walking the source _ vertex to generate a positive sample; for each positive sample, globally and randomly sampling a vertex to replace target _ vertex, and generating a negative sample; for each positive sample, sampling num _ negative samples; and carrying out gradient descent on the positive and negative samples to achieve the training effect.
Task scheduling: according to different characteristics of data depended on in the training process, the method adopts a cutting algorithm, a GPU is responsible for positive sampling (random walk), positive sample training and negative sample training, and a CPU is responsible for negative sampling and edge cutting training. Algorithm scheduling and data communication are uniformly controlled by a CPU.
The specific strategy is as follows: the first layer of subgraphs are trained by different GPUs in parallel, the different GPUs are isolated from each other, and no data communication overhead exists; serially training a second layer of sub-graphs; the negative sampling module adopts a producer consumer model, the CPU samples as a producer, and the GPU trains as a consumer. When the sampling pool is full, the CPU sampling thread is blocked, the GPU training thread can execute, when the sampling pool is empty, the CPU sampling thread can execute, and the GPU training thread is blocked; asynchronous transfer, overlapping GPU training overhead and CPU-GPU data transferOverhead, specifically, when the sub-graph P1 is trained in the GPU, the sub-graph P is trained2The host (Optane DIMM) transmits to the GPU.
Graph embedding evaluation: and executing different types of machine learning tasks, such as node classification, by utilizing the generated node embedding. Either micro f1 or macro f1 was evaluated and compared to other solutions.
In more detail, the GPU tasks include:
positive sampling: for each edge, positive sampling is performed by random walk. The sampling pseudo-code is as follows:
Figure RE-GDA0003445938290000091
Figure RE-GDA0003445938290000101
the input is the vertex vid and the output is the list of positively sampled edges. Line 4 initializes walk _ length to 0. The 5 th row enters the loop, with the end condition that counter i accumulates to random _ walk _ length times. In the loop, line 9 obtains a neighbor vertex list of vertex vid; line 10 judges whether the neighbor list is empty, if so, the loop is skipped; line 13 performs random sampling on the neighbor vertex, the sampled probability is determined by the vertex weight, and the sampled vertex is assigned to the uid; line 14 assigns vid to head [ i ]; line 15 assigns uid to tail [ i ]; row 16 assigns uid to vid; the variable i, walk _ length, of lines 17 and 18 are incremented by one, respectively.
Training: for each edge in the subgraph, training is performed with the GPU. The specific strategy is as follows: starting NUM _ EDGES threads by the GPU, wherein each thread is responsible for training one edge; each trimming, namely randomly walking the source _ vertex to generate a positive sample; for each positive sample, sampling a vertex by a negative sample pool to replace target _ vertex, and generating a negative sample; for each positive sample, sampling num _ negative samples; and carrying out gradient descent on the positive and negative samples to achieve the training effect.
In more detail, the PCI load task includes:
graph structure data transmission: host (Optane DIMM) to GPU;
copying the graph structure data from Optane DIMMs to a GPU for calculation through a PCIe protocol;
and (3) negative sample transmission: host (DRAM) to GPU;
copying the graph structure data from the host DRAM to the GPU for calculation through a PCIe protocol;
embedding and transmitting: host (DRAM) to GPU;
copying the graph structure data from the host DRAM to the GPU for calculation through a PCIe protocol;
embedding and transmitting: GPU to host (DRAM);
the embedded data is copied back to the host DRAM by the GPU over the PCIe protocol.
The invention also introduces a large-scale graph embedding training system based on Optane DIMM, which comprises:
an original graph processing module: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing module: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
a graph training module: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A large-scale graph embedding training method based on Optane DIMM is characterized by comprising the following steps:
original graph processing steps: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing step: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
and (3) graph training: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
2. The Optane DIMM based large scale graph embedding training method of claim 1, wherein: the original map processing step includes the sub-steps of:
s1: initializing a hash table with the size of hash _ table _ size, and storing the mapping from the vertex name to the vertex ID;
s2: entering circulation, wherein the circulation frequency is the size of the original graph list;
s3: reading an edge from the original graph in each cycle, wherein the edge consists of a vertex v _ name and a target vertex u _ name;
s4: searching whether v _ name appears already from a hash table, if so, returning the mapped v _ id, and if not, calling a hash _ table. AddVertex (name _ v, count _ num _ vertices) method and inserting a new vertex;
s5: if a hash _ table. AddVertex (name _ v, count _ num _ verticals) is called, firstly creating a new vertex, adding a vertex set verticals, and then adding a counter count _ num _ verticals by one; if the count _ num _ thresholds exceeds the maximum capacity of the existing thresholds, the thresholds will automatically expand;
s6: circulating until a vacancy of the hash table is found, and inserting a mapping relation from a vertex name to an ID in the hash table;
s7: processing u _ name according to the steps in S4-S6;
s8: and writing the mapped edge into an output file.
3. The Optane DIMM based large scale graph embedding training method of claim 1, wherein: the graph data is stored on a disk in a file form, and the data format is source _ vertex _ id and destination _ vertex _ id; for an undirected graph, simultaneously storing source _ vertex _ id and destination _ vertex _ id; destination _ vertex _ id, source _ vertex _ id.
4. The Optane DIMM based large scale graph embedding training method of claim 1, wherein: the splitting of the graph data comprises:
-using an edge segmentation strategy: dividing the graph data according to points, separating edges, dividing sub-graph data into the number of GPUs, and not performing edge division for single GPU equipment;
-using a point segmentation strategy: and dividing the graph data according to edges, determining the number of the divided subgraphs according to the size of a GPU memory, and not performing a point division strategy when the subgraphs can be completely loaded into the GPU.
5. The Optane DIMM based large scale graph embedding training method of claim 1, wherein: the CPU executes negative sampling and trimming training operations, the GPU executes positive sampling, positive sample training and negative sample training operations, and the CPU-GPU communication is executed by adopting a PCI load.
6. The Optane DIMM based large scale graph embedding training method of claim 5, wherein: the CPU task specifically includes:
loading data: loading data required by training, wherein the data comprises graph structure data and graph embedding data;
the Graph structure data is loaded into the Optane DIMM from a disk, and a metadata structure Graph is abstracted and points to concrete data in the Optane DIMM, and the metadata is stored in a DRAM;
the graph embedded data is allocated with a memory and initialized in the DRAM, a metadata structure Embedding is abstracted, and the metadata structure Embedding points to concrete data and is stored in the DRAM;
negative sampling: the system starts the FIRST _ PARTITION _ NUM number of threads, each thread maintains a NEG _ SAMPLE _ POOL _ NUM block sampling POOL, the SIZE of the sampling POOL is NEG _ SAMPLE _ POOL _ SIZE, and different threads perform parallel negative sampling;
cutting edge training: training each cut edge generated in the edge segmentation process by using a CPU (central processing unit); the CPU starts CROSS _ EDGE _ TRAIN _ THREAD THREADs, and each THREAD is responsible for training partial trimming
Task scheduling: the first layer of subgraphs are trained by different GPUs in parallel, the different GPUs are isolated from each other, and no data communication overhead exists; serially training a second layer of sub-graphs;
graph embedding evaluation: with the generated node embedding, different types of machine learning tasks are run, evaluating either the micro f1 or the macro f1, for comparison with other solutions.
7. The Optane DIMM based large scale graph embedding training method of claim 1, wherein: in the negative sampling process, the sampling strategy for each small sampling pool comprises the following steps: counting the degree sum of the whole graph point and the degree sum of the sub-graph points, and determining the sampling number of each sub-graph according to the proportion of the sub-graph degree sum to the whole graph degree sum, wherein the following conditions are met:
subgraphi_degree_num/graph_degree_num
=subgraphi_neg_sam_num/NEG_SAMPLE_POOL_SIZE。
8. the Optane DIMM based large scale graph embedding training method of claim 5, wherein: the GPU task comprises the following steps:
positive sampling: for each edge, positive sampling is carried out in a random walk mode, the input is a vertex vid, and the output is a positive sampling edge list;
training: for each edge in the subgraph, training is performed with the GPU.
9. The Optane DIMM based large scale graph embedding training method of claim 5, wherein: the PCI load tasks include:
the graph structure data transmission comprises the steps of copying the graph structure data from Optane DIMMs to a GPU for calculation through a PCIe protocol;
negative sample transmission, namely copying the graph structure data from the host DRAM to the GPU for calculation through a PCIe protocol;
the embedded transmission is that the graph structure data is copied to the GPU from the host DRAM through the PCIe protocol for calculation;
embedded transfer-the embedded data is copied back by the GPU to the host DRAM over the PCIe protocol.
10. A large-scale graph embedding training system based on Optane DIMMs, comprising:
an original graph processing module: processing the original graph to generate graph data which can be loaded by a DRAM;
a data preprocessing module: dividing the graph data into two layers of graphs according to characteristics, splitting the complete graph into subgraphs, and storing the subgraphs in a disk, so that the subgraphs can be loaded into a GPU for partition training;
a graph training module: the method comprises the steps of storing graph data used for training in different physical media according to the memory access characteristics of different media, cutting an algorithm according to different characteristics of data depended on in the training process, and balancing the expenses of CPU calculation, GPU calculation and CPU-GPU communication by adopting the division training of a CPU and a GPU.
CN202111415792.5A 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM Active CN114118443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111415792.5A CN114118443B (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111415792.5A CN114118443B (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Publications (2)

Publication Number Publication Date
CN114118443A true CN114118443A (en) 2022-03-01
CN114118443B CN114118443B (en) 2025-09-26

Family

ID=80373219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111415792.5A Active CN114118443B (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Country Status (1)

Country Link
CN (1) CN114118443B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897664A (en) * 2022-05-19 2022-08-12 北京百度网讯科技有限公司 Graph model deployment method and device, GPU and storage medium
CN118484296A (en) * 2024-04-17 2024-08-13 华南理工大学 A heuristic graph partitioning method based on GPU computing power awareness

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190073590A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Sparse Neural Network Training Optimization
US20200272907A1 (en) * 2019-02-22 2020-08-27 Huazhong University Of Science And Technology Deep learning heterogeneous computing method based on layer-wide memory allocation and system thereof
CN113609310A (en) * 2021-08-25 2021-11-05 上海交通大学 Single-machine large-scale knowledge graph embedding system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190073590A1 (en) * 2017-09-01 2019-03-07 Facebook, Inc. Sparse Neural Network Training Optimization
US20200272907A1 (en) * 2019-02-22 2020-08-27 Huazhong University Of Science And Technology Deep learning heterogeneous computing method based on layer-wide memory allocation and system thereof
CN113609310A (en) * 2021-08-25 2021-11-05 上海交通大学 Single-machine large-scale knowledge graph embedding system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUNPENG LIU等: "Automated Feature Selection: A Reinforcement Learning Perspective", 《IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING》, vol. 35, no. 03, 24 September 2021 (2021-09-24), pages 2272 *
ZHE ZHOU等: "GCNear: A Hybrid Architecture for Efficient GCNTraining with Near-Memory Processing", 《ARXIV:2111.00680V1》, 1 November 2021 (2021-11-01), pages 1 - 14 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897664A (en) * 2022-05-19 2022-08-12 北京百度网讯科技有限公司 Graph model deployment method and device, GPU and storage medium
CN114897664B (en) * 2022-05-19 2024-05-07 北京百度网讯科技有限公司 Graph model deployment method and device, GPU and storage medium
CN118484296A (en) * 2024-04-17 2024-08-13 华南理工大学 A heuristic graph partitioning method based on GPU computing power awareness

Also Published As

Publication number Publication date
CN114118443B (en) 2025-09-26

Similar Documents

Publication Publication Date Title
US8990209B2 (en) Distributed scalable clustering and community detection
CN109816032B (en) Unbiased mapping zero sample classification method and device based on generative countermeasure network
US9836701B2 (en) Distributed stage-wise parallel machine learning
CN114503125A (en) Structured pruning method, system and computer readable medium
JP5950285B2 (en) A method for searching a tree using an instruction that operates on data having a plurality of predetermined bit widths, a computer for searching a tree using the instruction, and a computer thereof program
Mohan et al. A scalable method for link prediction in large real world networks
CN111930518B (en) Knowledge graph representation learning-oriented distributed framework construction method
CN106874478A (en) Parallelization random tags subset multi-tag file classification method based on Spark
US11640379B2 (en) Metadata decomposition for graph transformation
CN114118443B (en) Large-scale graph embedding training method and system based on Optane DIMM
WO2016095068A1 (en) Pedestrian detection apparatus and method
CN103412878B (en) Document theme partitioning method based on domain knowledge map community structure
Gothai et al. Map-reduce based distance weighted k-nearest neighbor machine learning algorithm for big data applications
CN110674326A (en) Neural network structure retrieval method based on polynomial distribution learning
Kansal et al. Comparative analysis of convolutional neural network in object detection
WO2025167876A1 (en) Object category recognition model training method and apparatus, and object category recognition method and apparatus
US20240395014A1 (en) Object recognition model updating method and apparatus, electronic device, storage medium, and computer program product
US11960520B2 (en) Hierarchical topic model with an interpretable topic hierarchy
He et al. Parallel outlier detection using kd-tree based on mapreduce
CN113392868B (en) Model training method, related device, equipment and storage medium
CN119599085A (en) A method and system for large model training resource evaluation and strategy recommendation
KR102585925B1 (en) Apparatus for automatically collecting learning data and method therefor
CN112347369B (en) Integrated learning dynamic social network link prediction method based on network characterization
JP7103987B2 (en) Information processing equipment, information processing methods, and programs
Narayanan et al. Semantic node embeddings of distributed graphs using apache spark

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant