Background
In a multivariate network, nodes and edges carry rich attribute information, such as name, age, gender, nationality, region, and occupation. In the field of visualization research, experts have designed various visualization methods for such multivariate network data, mapping attribute information by modifying the visual appearance (size, color, shape) of nodes in a force-directed layout, or combining the force-directed layout with auxiliary views (e.g., tables, parallel coordinates, radar charts) to cooperatively analyze topology and multidimensional attribute information. However, as the scale of multivariate networks continues to increase, simultaneous analysis of topology and attributes becomes a challenge: in force-directed layouts, visual clutter greatly hinders the visual perception of topology. Clustering is a more flexible strategy that allows users to classify nodes into different categories based on node attributes or network topology. Batagelj et al. (V. Batagelj, F. Brandenburg, W. Didimo, G. Liotta, P. Palladino, and M. Patrignani. Visual analysis of large graphs using (X,Y)-clustering and hybrid visualizations. IEEE Transactions on Visualization and Computer Graphics, 17:1587-1598, 2010. doi:10.1109/TVCG.2010.265) define topological properties of intra-class and inter-class graphs, allowing users to interactively explore graphs by expanding and contracting clusters. OnionGraph (L. Shi, Q. Liao, H. Tong, Y. Hu, Y. Zhao, and C. Lin. Hierarchical focus+context semantic network visualization. pp. 89-96, 2014. doi:10.1109/PacificVis.2014.44) takes both attributes and topology into account, clusters heterogeneous network nodes into different levels, and its top-down hierarchy provides more semantic information.
Clustering of multivariate networks falls into two categories: topology-based clustering (J. Abello, F. van Ham, and N. Krishnan. ASK-GraphView: A large scale graph visualization system. IEEE Transactions on Visualization and Computer Graphics, 12:669-676, 2006. doi:10.1109/TVCG.2006.120) and attribute-based clustering (M. Wattenberg. Visual exploration of multivariate graphs. pp. 811-819, 2006. doi:10.1145/1124772.1124891). However, in topology-based clustering methods, the attributes are randomly distributed within clusters, which makes it difficult to explain how the clusters form. In attribute-based clustering methods, the topology of the network is severely damaged, which makes the relationships between nodes difficult to explore and analyze. Therefore, integrating topology and attributes into the cluster-based visualization of large-scale multivariate networks remains a challenging task.
In addition, network representation learning methods can handle large networks with millions of nodes and edges by projecting the network nodes into a vector space that preserves their context features. Among them, learning algorithms based on random walks are simple to implement and easy to scale and parallelize, and various methods have been proposed, such as DeepWalk (B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. pp. 701-710, 2014) and node2vec (A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. pp. 855-864, 2016. doi:10.1145/2939672.2939754), both of which maintain the high-order proximity between nodes well.
Disclosure of Invention
The invention aims to provide a novel clustering-based simplified visual analysis method for large-scale multivariate network data, addressing the problems described in the background art; the use of attribute information in network representation learning is studied with particular emphasis, and the large-scale multivariate network is explored through network representation learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
Step (1): constructing an attribute-enhanced network representation learning model based on the original large-scale network data, and converting the nodes into high-dimensional vector representations embedding the topological structure and attribute information;
Step (2): constructing a multi-level clustering model using the attribute-enhanced network representation learning model, and dividing the nodes into hierarchical categories in the vectorized space according to structure compactness, attribute homogeneity, and the number of clusters;
Step (3): designing a simplified-representation visual analysis scheme and constructing a simplified visual analysis system for large-scale multivariate network data; the simplified visual analysis system forms its visual representation through a cluster view and a set of collaborative views.
Further, the step (1) is specifically:
(1-1) constructing a corpus based on attribute similarity:
quantify the attribute similarity between nodes: if two adjacent nodes have the same category, the similarity sim(v_i, v_{i,k'}) is set to 1, otherwise it is set to 0, where v_{i,k'} is the k'-th neighbor of the walk's initial node v_i, k' ∈ (1, K), and K is the number of neighboring nodes of v_i; the probability of jumping from v_i to a neighbor is proportional to this attribute similarity;
taking the node v_{i-w} located w positions before v_i as the starting point, an attribute-similarity-based walk rw(v_{i-w}) = (v_{i-w}, …, v_{i-1}, v_i, v_{i+1}, …, v_{i+w}) is generated by sampling among the neighbors of the last node on the current walk path until the maximum walk length L is reached, where L = 2w + 1;
walk paths starting from the other nodes are generated in the same way, so that every node serves as the initial node of an attribute-similarity-based walk.
(1-2) repeatedly traversing the nodes a fixed number of times T and generating a walk path of fixed length L starting from each node, thereby creating, for each attribute, a corpus consisting of all of its walk paths;
the corpora generated for the individual attributes are merged into a compound corpus, which is fed to the Skip-Gram model with the objective minimize_φ ( -log Pr({v_{i-w}, …, v_{i+w}} \ v_i | φ(v_i)) ); where φ(v_i) denotes the vector of v_i and Pr({v_{i-w}, …, v_{i+w}} \ v_i | φ(v_i)) denotes the probability that node v_i's context v_{i-w} to v_{i+w} appears; the Skip-Gram model maximizes the probability of inferring the context v_{i-w} to v_{i+w} from v_i, and the high-dimensional vector representation of each node is obtained.
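As a concrete illustration of step (1-1), the sketch below generates attribute-similarity-biased walks on a toy graph with a single categorical attribute. It is a minimal, hypothetical sketch rather than the patented implementation: with the 0/1 similarity defined above, it samples uniformly among the neighbors that share the current node's category and falls back to a uniform choice when no such neighbor exists; the function and variable names (attribute_walk, build_corpus) are illustrative.

```python
# Minimal sketch of attribute-similarity-biased walks (step 1-1), assuming a plain
# adjacency-list graph and one categorical attribute per node.
import random

def sim(attrs, u, v):
    """Attribute similarity: 1 if the two adjacent nodes share the same category, else 0."""
    return 1 if attrs[u] == attrs[v] else 0

def attribute_walk(adj, attrs, start, w):
    """Generate one walk of maximum length L = 2*w + 1 starting from `start`.

    At every step the next node is drawn from the neighbors of the last node,
    preferring neighbors with similarity 1; if none exists, sampling falls back
    to a uniform choice among all neighbors.
    """
    L = 2 * w + 1
    path = [start]
    while len(path) < L:
        current = path[-1]
        neighbors = adj[current]
        if not neighbors:
            break
        similar = [n for n in neighbors if sim(attrs, current, n) == 1]
        path.append(random.choice(similar if similar else neighbors))
    return path

def build_corpus(adj, attrs, T, w):
    """Corpus for one attribute: T walks starting from every node."""
    corpus = []
    for _ in range(T):
        for node in adj:
            corpus.append(attribute_walk(adj, attrs, node, w))
    return corpus

# Toy example: 4 nodes, one categorical attribute ("region").
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
attrs = {0: "EU", 1: "EU", 2: "US", 3: "EU"}
print(build_corpus(adj, attrs, T=2, w=2)[:3])
```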
Further, the step (2) is specifically:
(2-1) projecting the obtained high-dimensional node vectors into a two-dimensional space, applying k-means to the projected vectors to cluster the nodes, and generating a hierarchical clustering structure with a hierarchical clustering method, i.e. iteratively merging pairs of clusters until a single cluster remains;
(2-2) guiding the cutting of the hierarchy using three metrics: structure compactness, attribute homogeneity, and the number of clusters;
set the multi-objective function E(C) = w_1 E_sc(C) + w_2 E_ah(C) + w_3 E_str(C); where the set of clusters C = {C_1, C_2, …, C_k} generated from the multivariate network is the part of the hierarchy formed by a set of leaf nodes, C_{k''} is the k''-th leaf node, k'' = 1, 2, …, k, and w_1, w_2, w_3 are tuning parameters that balance the three metrics;
the structure-compactness term is E_sc(C) = 1 - density(C), where the cluster density density(C) represents the structural compactness of the clusters derived from C; the larger the cluster density, the smaller the structure-compactness term;
the attribute-homogeneity term E_ah(C) averages the information entropy of each attribute a_x within each cluster C_m, where N is the number of attributes; the lower the entropy, the more uniform the node attributes of a cluster, and the smaller the average entropy, the higher the probability that the cluster is kept as a leaf node;
the cluster-number term E_str(C) penalizes splitting the multivariate nodes into many small clusters; it is computed from dis(C_x, root), the distance in the hierarchy between cluster C_x and the root node, and from the average distance between C_x and its descendant leaf nodes, where p denotes one of the descendant leaf nodes of C_x.
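The sketch below shows one plausible way to evaluate the multi-objective score E(C) for a candidate set of clusters. Because the exact density and cluster-number formulas are not fully reproduced in this text, the internal edge density of each cluster and a simple cluster-count penalty are used here as stand-ins; only the overall weighted-sum structure E(C) = w_1 E_sc + w_2 E_ah + w_3 E_str follows the definition above.

```python
# Hypothetical sketch of the multi-objective cut score; the term definitions marked as
# "stand-in" are assumptions, not the invention's exact formulas.
import math
from collections import Counter

def internal_density(adj, cluster):
    """Fraction of possible intra-cluster edges that are present (stand-in for the cluster density)."""
    nodes = set(cluster)
    h = len(nodes)
    if h < 2:
        return 0.0
    y = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return 2 * y / (h * (h - 1))

def entropy(values):
    """Shannon entropy of a list of categorical attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def multi_objective(adj, attrs, clusters, w1=1.0, w2=5.0, w3=0.1):
    """E(C) = w1*Esc(C) + w2*Eah(C) + w3*Estr(C) for a candidate cut C given as a list of node lists."""
    k = len(clusters)
    attr_names = list(next(iter(attrs.values())).keys())
    e_sc = 1 - sum(internal_density(adj, c) for c in clusters) / k          # structure compactness
    e_ah = sum(entropy([attrs[v][a] for v in c])                            # attribute homogeneity
               for c in clusters for a in attr_names) / (k * len(attr_names))
    e_str = k / len(adj)                                                    # stand-in cluster-count penalty
    return w1 * e_sc + w2 * e_ah + w3 * e_str

# Toy usage: 4 nodes with one attribute, split into two clusters.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
attrs = {0: {"country": "US"}, 1: {"country": "US"}, 2: {"country": "DE"}, 3: {"country": "DE"}}
print(multi_objective(adj, attrs, [[0, 1], [2, 3]]))
```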
Further, in step (3), the cluster view provides a simplified representation of the multivariate graph using a glyph-embedded network graph: the leaf nodes are laid out with a force-directed model, the cluster corresponding to each leaf node is visualized with a glyph, and nodes with similar attributes are aggregated into the same cluster;
the collaborative views help the user intuitively explore the clusters of the multivariate graph; the high-dimensional vectors are projected into a two-dimensional space with the classical dimensionality-reduction model t-SNE; for each node, the neighboring nodes within a surrounding radius are found, the information entropy of these nodes over the selected attributes is computed, and the reciprocal of the entropy is used as the node's weight in the kernel density computation, which determines the final rendering of the heat map; the projection view is interactively linked with the tree graph so that clusters at different levels can be explored; and a river graph at the top visualizes how the multi-objective function values change with user interaction.
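As an illustration of the projection used by the collaborative views, the sketch below reduces the learned node vectors to two dimensions with an off-the-shelf t-SNE implementation from scikit-learn; the randomly generated node_vectors array is a placeholder for the embeddings produced in step (1), and the parameter values are illustrative defaults rather than the system's own settings.

```python
# Minimal sketch of the 2-D projection behind the projection view, assuming the node
# embeddings are available as a NumPy array (one row per node).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
node_vectors = rng.normal(size=(500, 128))   # placeholder for the learned node embeddings

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
xy = tsne.fit_transform(node_vectors)        # (500, 2) coordinates for the projection view
print(xy.shape)
```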
Compared with the prior art, the invention has the following beneficial effects: it provides an attribute-enhanced network representation learning model that extracts topology and attribute information simultaneously and represents the nodes as vectors preserving both topological connections and multi-attribute similarity; it defines a multi-objective energy function considering structure compactness, attribute homogeneity, and the number of clusters to obtain the optimal cut layer of the hierarchy, thereby generating multi-level clusters that let users explore a multivariate network hierarchically; and it develops a simplified visual analysis system for multivariate networks that visually expresses the attribute-enhanced network representation learning, the hierarchical clustering, and the cluster-based simplification, so that users can conveniently evaluate the node clusters and explore the semantic features of the original multivariate network.
Detailed Description
The simplified visual analysis method for large-scale multivariate network data of the present invention is further described below with reference to the accompanying drawings and examples.
The flow of this embodiment is shown in fig. 1, and includes the following steps:
Step (1): constructing an attribute-enhanced network representation learning model based on the original large-scale network data, and converting the nodes into high-dimensional vector representations embedding the topological structure and attribute information; the method comprises the following steps:
(1-1) constructing a corpus based on attribute similarity:
by increasing the probability that nodes with similar attribute values co-occur in the walk, the semantic relationship between the nodes is preserved. Suppose node v
iAs an initial point of walk, it is associated with K nodes v
i,1,v
i,2,…,v
i,KAdjacent to each other. Jump to node v in next round
i,k′Is dependent on v
i,KAnd v
iSimilarity of attributes therebetween.Quantifying the attribute similarity between nodes, and if the adjacent nodes have the same category, determining the similarity sim (v)
i,v
i,k′) Set to 1, otherwise set to 0, K' ∈ (1, K). When considering an attribute, the probability of a jump
By node viNode v of the front wi-wAs a starting point, vi-wMigration rw (v) based on attribute similarityi-w)=(vi-w,…,vi-1,vi,vi+1,…,vi+w) Uniformly sampling in the neighbor of the last node in the current walking path until the maximum walking step length L is reached, wherein L is 2w + 1; the wandering paths with other nodes as initial points are also generated in the same manner, wherein all nodes are used as initial points to perform the wandering based on the attribute similarity.
The original approach simply builds the corpus from purely random walks. The method of the invention instead biases the jump probability by attribute similarity, so that the walk preferentially jumps to nodes with similar attributes; the resulting corpus therefore gives a stronger characterization of nodes that are structurally associated and similar in attributes.
(1-2) Nodes often have multiple attributes, and the method designs a novel corpus generation strategy, as shown in fig. 2. For each attribute, walk paths are generated according to the attribute similarity between adjacent nodes. The nodes are repeatedly traversed a fixed number of times T, and the walk path starting from each node is generated with a fixed walk length L, thereby creating a corpus consisting of all walk paths for that attribute; the corpora for the other attributes are created in the same manner.
The corpora generated for the individual attributes are merged into a compound corpus, which is fed to the Skip-Gram model with the objective minimize_φ ( -log Pr({v_{i-w}, …, v_{i+w}} \ v_i | φ(v_i)) ); where φ(v_i) denotes the vector of v_i and Pr({v_{i-w}, …, v_{i+w}} \ v_i | φ(v_i)) denotes the probability that node v_i's context v_{i-w} to v_{i+w} appears. The Skip-Gram model maximizes the probability of inferring the context v_{i-w} to v_{i+w} from v_i, and the high-dimensional vector representation of each node is obtained.
Taking two adjacent nodes as an example, the probability of jumping from one node to the other is determined by the similarity of their multiple attributes. The more values they share across these attributes, the higher the probability that they appear in the same walk. When the co-occurrence probability between two nodes is very large, a pair of clearly similar node vectors is obtained; this constitutes the attribute-enhanced network representation learning model.
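For illustration, a minimal sketch of the Skip-Gram training step is shown below, assuming the walks from all attributes have already been merged into a compound corpus of node-identifier sequences; gensim's Word2Vec with sg=1 is used here as a stand-in Skip-Gram implementation, not as the invention's own code, and the toy corpus and parameter values are illustrative.

```python
# Minimal sketch: learn node vectors phi(v_i) from the compound walk corpus with Skip-Gram.
from gensim.models import Word2Vec

compound_corpus = [["v1", "v2", "v3", "v2", "v4"],
                   ["v3", "v1", "v5", "v1", "v2"]]   # toy walks; the real corpus is far larger

model = Word2Vec(
    sentences=compound_corpus,
    vector_size=128,   # dimensionality of phi(v_i)
    window=2,          # context window w
    sg=1,              # Skip-Gram: predict the context from the centre node
    min_count=0,
    epochs=5,
)
phi_v1 = model.wv["v1"]   # high-dimensional vector representation of node v1
print(phi_v1.shape)
```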
Step (2): constructing a multi-level clustering model using the attribute-enhanced network representation learning model, and dividing the nodes into hierarchical categories in the vectorized space according to structure compactness, attribute homogeneity, and the number of clusters; the method comprises the following steps:
(2-1) Project the obtained high-dimensional node vectors into a two-dimensional space and apply k-means to the projected vectors to cluster the nodes (in the present embodiment, k is initially set to 60); then generate a hierarchical clustering structure with a hierarchical clustering method, i.e. iteratively merge pairs of clusters until a single cluster remains.
(2-2) In order to obtain a meaningful multivariate network visualization, the hierarchical structure must be cut appropriately. In this method, three metrics are used to guide the cutting of the hierarchy: structure compactness, attribute homogeneity, and the number of clusters.
Set the multi-objective function E(C) = w_1 E_sc(C) + w_2 E_ah(C) + w_3 E_str(C), where the set of clusters C = {C_1, C_2, …, C_k} generated from the multivariate network is the part of the hierarchy formed by a set of leaf nodes, C_{k''} is the k''-th leaf node, k'' = 1, 2, …, k, and w_1, w_2, w_3 are tuning parameters that balance the three metrics.
The structure-compactness term is E_sc(C) = 1 - density(C), where the cluster density density(C) represents the structural compactness of the clusters derived from C; the larger the cluster density, the smaller the structure-compactness term.
The attribute-homogeneity term E_ah(C) averages the information entropy of each attribute a_x within each cluster C_m, where N is the number of attributes; the lower the entropy, the more uniform the node attributes of a cluster. Therefore, the smaller the average entropy, the higher the probability that the cluster becomes a leaf node.
The cluster-number term E_str(C) penalizes the behavior of splitting the multivariate nodes into many small clusters; it is computed from dis(C_x, root), the distance in the hierarchy between cluster C_x and the root node, and from the average distance between C_x and its descendant leaf nodes, where p denotes one of the descendant leaf nodes of C_x.
In this embodiment, the tuning parameters are initially set to w_1 = 1.0, w_2 = 5.0, w_3 = 0.1. To minimize the multi-objective function, a minimal-cut algorithm is applied to generate the required cut level of the hierarchy, so that the resulting set of leaf nodes balances the three metrics well. Fig. 3 is a schematic diagram of the multi-level clustering model, showing an optimal cut layer of the hierarchical clustering structure; the clusters are reconstructed by the optimal cut that minimizes the multi-objective function, so that nodes in the same cluster are closely connected and have similar attributes. Through the above steps, the optimal cut layer of the multi-level clustering is obtained.
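The sketch below illustrates one way steps (2-1) and (2-2) could be prototyped: k-means with k = 60 on the projected vectors, an agglomerative (Ward) hierarchy over the cluster centroids, and an exhaustive scan of candidate cut levels. The score_cut function is only a placeholder for the multi-objective function E(C), and the exhaustive scan stands in for the minimal-cut algorithm used by the invention.

```python
# Hypothetical prototype of the multi-level clustering step; the score function and
# the exhaustive cut scan are assumptions standing in for E(C) and the minimal cut.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
projected = rng.normal(size=(1000, 2))            # placeholder 2-D node projections

kmeans = KMeans(n_clusters=60, n_init=10, random_state=0).fit(projected)
centroids = kmeans.cluster_centers_

Z = linkage(centroids, method="ward")             # iterative pairwise merging of clusters

def score_cut(labels, k):
    """Placeholder for E(C): within-group scatter of the centroids plus a cluster-count penalty."""
    scatter = sum(centroids[labels == g].var() for g in set(labels))
    return scatter + 0.1 * k

best_k, best_score = None, float("inf")
for k in range(2, 60):
    labels = fcluster(Z, t=k, criterion="maxclust")   # cut the hierarchy into k clusters
    s = score_cut(labels, k)
    if s < best_score:
        best_k, best_score = k, s
print("selected cut level:", best_k)
```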
Step (3): designing a simplified-representation visual analysis scheme and constructing a simplified visual analysis system for large-scale multivariate network data; the simplified visual analysis system forms its visual representation through a cluster view and a set of collaborative views; the method comprises the following steps:
(3-1) Cluster view: a glyph-embedded network graph is employed to provide a simplified representation of the multivariate graph. First, the leaf nodes are laid out using a force-directed model. The cluster corresponding to each leaf node is visualized by a glyph consisting of an inner circle and an outer ring. Specifically, the topological pattern is represented by the inner circle, whose radius encodes the number of nodes inside the cluster. The shading in the lower-left corner represents the clustering coefficient, measured as 2y/(h(h-1)), where h and y are the number of nodes and edges in the cluster. The outer ring conveys the distribution characteristics of the various attributes. As described above, nodes with similar attributes are aggregated into a cluster, in which only a subset of the attributes plays a significant role. For these homogeneous nodes in a cluster, the clustering effect of each attribute is important for helping the user identify the important attributes, and it can be measured by information entropy. Thus, the outer ring is divided into several sectors with equal central angles according to the user-specified attribute set. Each sector is colored with a unique color, and its height encodes the inverse of the information entropy on this attribute. FIG. 4-1 is the design of the clustering glyph in multi-attribute mode: the category colors encode the different attributes, while the radius of the outer sector encodes the inverse of the entropy value; the larger the height, the lower the information entropy and the stronger the aggregation effect. If the user specifies only one attribute, the sectors convey the detailed distribution of the attribute values so that the aggregation effect of this attribute can be compared across clusters. In this case, the height of each sector encodes the number of nodes having the corresponding attribute value, as shown in FIG. 4-2.
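A minimal sketch of the per-cluster quantities this glyph encodes is given below: the node count behind the inner-circle radius, the internal density 2y/(h(h-1)) behind the shading, and the per-attribute information entropy whose inverse drives the height of each outer sector. The helper name glyph_metrics and the square-root radius scaling are illustrative choices, not part of the invention.

```python
# Hypothetical computation of the glyph quantities for one cluster.
import math
from collections import Counter

def glyph_metrics(adj, attrs, cluster, attribute_names):
    nodes = set(cluster)
    h = len(nodes)
    y = sum(1 for u in nodes for v in adj[u] if v in nodes) // 2   # intra-cluster edges
    density = 2 * y / (h * (h - 1)) if h > 1 else 0.0

    def entropy(values):
        counts = Counter(values)
        return -sum(c / h * math.log2(c / h) for c in counts.values())

    sector_heights = {}
    for a in attribute_names:
        e = entropy([attrs[v][a] for v in nodes])
        sector_heights[a] = 1.0 / e if e > 0 else float("inf")     # lower entropy -> taller sector
    return {"radius": math.sqrt(h), "density": density, "sectors": sector_heights}

# Toy usage with one attribute.
adj = {0: [1], 1: [0, 2], 2: [1]}
attrs = {0: {"country": "US"}, 1: {"country": "US"}, 2: {"country": "DE"}}
print(glyph_metrics(adj, attrs, [0, 1, 2], ["country"]))
```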
(3-2) Collaborative views: a set of collaborative views is provided to help users intuitively explore the clusters of the multivariate graph. The projection view displays the vectors obtained from the attribute-enhanced network representation learning model: the high-dimensional vectors are projected into a two-dimensional space using the classical dimensionality-reduction model t-SNE. Thanks to the properties of t-SNE, nodes with similar structure and attribute information are placed very close together in the vectorized space, so users can easily perceive the similarity between nodes. A heat map is also provided to enhance the representation of attribute homogeneity: for each node, the neighboring nodes within a surrounding radius are found, the information entropy of these nodes over multiple attributes is computed, and the reciprocal of the entropy is used as the point's weight in the kernel density computation, which determines the final rendering of the heat map. In addition, the projection view is interactively linked with the tree graph so that clusters at different levels can be explored. The tree graph visualizes the hierarchical clustering structure: the nodes of the currently selected cut level are highlighted, and the intermediate nodes represent the hierarchical clusters arranged from bottom to top. The leaf nodes are linked to rectangles at the bottom that show the characteristics of the attribute distribution; regions of different colors within a rectangle encode the information entropy of the multiple attributes, and if only one attribute is explored, the regions within the rectangle show the counts of its attribute values. Furthermore, a river graph at the top visualizes how the multi-objective function values change with user interaction; its three layers represent the structure compactness, the attribute homogeneity, and the energy variation of the tree structure. The user may adjust the parameters through a set of buttons listed at the top. When a cluster is selected in the tree view, the bounding box of the nodes contained in that cluster is highlighted in the projection view, and the detailed information of the multiple attributes of the cluster of interest is displayed in the attribute view. After a cluster is selected, the corresponding lines are highlighted in the parallel coordinates, so that the user can intuitively perceive the important attributes, easily evaluate the clustering effect, and gain a deeper understanding of the hidden semantics. To further indicate the proportion of the different attribute values of an attribute, bars are placed on the parallel axes, and their sizes encode the number of lines crossing them.
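A minimal sketch of the entropy-weighted density behind the heat map is shown below, under the assumption that a hand-rolled Gaussian kernel is acceptable for illustration: the weight of each projected point is the reciprocal of the information entropy of the attribute values found within its surrounding radius; the weight cap applied when the entropy is zero and the bandwidth value are arbitrary illustrative choices.

```python
# Hypothetical entropy-weighted kernel density for the heat map behind the projection view.
import math
import numpy as np
from collections import Counter

def entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def point_weights(xy, attr_values, radius):
    """Weight of each point = 1 / entropy of the attribute values inside its radius."""
    weights = np.empty(len(xy))
    for i, p in enumerate(xy):
        near = [attr_values[j] for j, q in enumerate(xy) if np.linalg.norm(p - q) <= radius]
        e = entropy(near)
        weights[i] = 1.0 / e if e > 0 else 2.0     # arbitrary cap for perfectly uniform neighborhoods
    return weights

def heat_value(grid_point, xy, weights, bandwidth=0.5):
    """Weighted Gaussian kernel density at one grid location of the heat map."""
    d2 = np.sum((xy - grid_point) ** 2, axis=1)
    return float(np.sum(weights * np.exp(-d2 / (2 * bandwidth ** 2))))

rng = np.random.default_rng(0)
xy = rng.normal(size=(200, 2))                      # placeholder t-SNE coordinates
attr_values = rng.choice(["A", "B", "C"], size=200) # one categorical attribute per node
w = point_weights(xy, attr_values, radius=0.5)
print(heat_value(np.array([0.0, 0.0]), xy, w))
```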
The effectiveness of the invention was evaluated as follows:
The attribute-enhanced network representation learning model was studied experimentally: combined with the multi-level clustering model (w_1 = 1.0, w_2 = 5.0, w_3 = 0.1), the clusters in the resulting optimal cut were compared with the clusters generated by three other methods:
(1) an attribute-based method (Attr-based), which treats the multivariate nodes as multidimensional vectors and ignores structural information; (2) DeepWalk, which embeds nodes into a low-dimensional space considering only structural information; (3) attribute-aware graph embedding (AAGE), which combines node attributes with the graph structure by quantifying the sum of the similarities over multiple attributes as edge weights and learning latent representations of the nodes of the weighted network. The default parameter values of our method are set as follows: the number of walk repetitions per node is T = 15 and the walk length is L = 20.
For the set of clusters C = {C_1, C_2, …, C_k} generated from the multivariate network by each of the above methods, the clustering quality is evaluated with the cluster density (formula 6) and the cluster entropy (formula 7). We first studied the different methods on a patent-network dataset and clustered the nodes using k-means. The patent network is a large network in which the nodes represent patents and the edges represent citations between patents; a connected subgraph containing 10,178 patents and 18,809 edges was extracted. Each patent has multiple attributes such as year of grant, country, patent category, technology category, and subcategory. The experimental results are as follows: as shown in fig. 5-1, the method outperforms AAGE, DeepWalk, and the attribute-based method in terms of cluster density for all values of k. Meanwhile, the entropy values of the graph clusters generated by the method are smaller than those of AAGE, and DeepWalk is readily observed to have the largest entropy, as shown in FIG. 5-2. This means that the graph clusters generated by the method and by AAGE have more uniform node attributes than those of DeepWalk, since both consider attributes as well as topology. In contrast, the attribute-based method has the smallest entropy and the smallest density values. The results show that for clusters that consider only attributes, attribute homogeneity is well maintained, but the structural compactness is poor.
The different network clustering methods described above were also studied using another dataset (the Vis network), and the experimental results are shown in FIGS. 6-1 and 6-2. The Vis network data was extracted from the ArnetMiner database and contains 2,592 papers published between 1990 and 2014 at three major conferences (SciVis, InfoVis, and VAST). There are three types of papers, namely C (journal recommendation corpus), M (monograph), and T (meeting record). The publication years are divided into five categories, each covering five years. There are 8,984 links in total, representing the citation relationships between these papers. It can be seen that the performance of DeepWalk is best when the number of clusters k is 3 (fig. 6-1). The performance of the method of the invention and of DeepWalk is similar for k = 5 and k = 7, and both have a slight advantage over AAGE. Whatever the value of k, the attribute-based method has the smallest entropy values and the smallest density values. The method of the invention and AAGE perform similarly in terms of attribute homogeneity (FIG. 6-2), and their entropy values are also significantly smaller than those of DeepWalk. The results show that, taking both the topological structure and the node attribute information into account, the attribute-enhanced network representation learning method produces clustering results of good quality with stable performance.