[go: up one dir, main page]

CN113392332A - Simplified visual analysis method for large-scale multi-element network data - Google Patents

Simplified visual analysis method for large-scale multi-element network data Download PDF

Info

Publication number
CN113392332A
CN113392332A CN202110535193.0A CN202110535193A CN113392332A CN 113392332 A CN113392332 A CN 113392332A CN 202110535193 A CN202110535193 A CN 202110535193A CN 113392332 A CN113392332 A CN 113392332A
Authority
CN
China
Prior art keywords
nodes
attribute
clustering
node
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110535193.0A
Other languages
Chinese (zh)
Other versions
CN113392332B (en
Inventor
周志光
张汝敏
胡淼鑫
刘玉华
王毅刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110535193.0A priority Critical patent/CN113392332B/en
Publication of CN113392332A publication Critical patent/CN113392332A/en
Application granted granted Critical
Publication of CN113392332B publication Critical patent/CN113392332B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了面向大规模多元网络数据的简化可视分析方法。本发明方法首先基于原始网络大规模数据,构建属性增强的网络表征学习模型,将节点转换成嵌入拓扑结构和属性信息的高维向量表示;然后利用属性增强的网络表征学习模型构建多层次聚类模型,在向量化空间中根据结构紧密度、属性同质性和聚类数量将节点划分为层次类别;最后设计简化表达可视分析方案,构建大规模多元网络数据的简化可视分析系统;所述的简化可视分析系统通过聚类视图、协同视图构成视觉表达。本发明方法对大规模多元网络数据进行视觉简化、探索和聚类,有效减少视觉混乱,并提高了大规模多元网络的可读性和分析效率。

Figure 202110535193

The invention discloses a simplified visual analysis method for large-scale multivariate network data. The method of the invention firstly builds an attribute-enhanced network representation learning model based on the large-scale data of the original network, and converts nodes into high-dimensional vector representations embedded in topology structure and attribute information; and then uses the attribute-enhanced network representation learning model to construct multi-level clustering In the vectorized space, nodes are divided into hierarchical categories according to structural tightness, attribute homogeneity and number of clusters; finally, a simplified expression visual analysis scheme is designed to build a simplified visual analysis system for large-scale multivariate network data; The simplified visual analysis system described above constitutes a visual representation through clustering views and collaborative views. The method of the invention performs visual simplification, exploration and clustering on large-scale multivariate network data, effectively reduces visual confusion, and improves the readability and analysis efficiency of the large-scale multivariate network.

Figure 202110535193

Description

Simplified visual analysis method for large-scale multi-element network data
Technical Field
The invention belongs to the technical field of graphics and visualization, and particularly relates to a simplified visual analysis method for large-scale multivariate network data.
Background
In a multi-element network, nodes and edges contain rich attribute information, such as attributes including name, age, gender, nationality, region, occupation, and the like. In the field of visualization research, experts have designed various visualization methods for such multivariate network data, mapping attribute information by modifying the visual appearance (size, color, shape) of nodes on the force guidance graph, or combining the force guidance graph with auxiliary views (e.g., tables, parallel coordinates, radar charts) to cooperatively analyze topology and multidimensional attribute information. However, as the scale of multivariate networks continues to increase, simultaneous analysis of topology and properties becomes a challenge. In force-guided maps, visual clutter largely hinders the visual perception of topology. Clustering is a more flexible strategy that allows users to classify nodes into different categories based on node attributes or network topology. Topological properties of intra-class and inter-class graphs are defined as by Batagelj et al (V.Batagelj, F.Brandenburg, W.Didimo, G.Liotta, P.Palladino, and M.Patrigani.visual analysis of large graphs using (x, y) -clustering and hybrid visualizations.IEEE transactions on visualization and computer graphics,17:1587-1598, 122010. doi:10.1109/TVCG.2010.265), allowing users to interactively explore graphs by expanding/contracting clusters. OnonionGraph (L.Shi, Q.Liao, H.Tong, Y.Hu, Y.ZHao, and C.Lin.Hierarchical focus + context semantic network visualization. pp.89-96, 032014. doi:10.1109/Pacific Vis.2014.44) takes into account attributes and topology, clusters heterogeneous network nodes to different levels, and the top-down hierarchy provides more semantic information.
Clustering of multivariate networks is divided into two categories, topology-based clustering (J.Abello, F.ham, and N.Krishan.ask-graph: A large scale graph visualization system. ie trans vis compilation. IEEE transactions on visualization and computer graphics,12: 669-676, 092006. doi:10.1109/TVCG.2006.120) and attribute-based clustering (M.Wattenberg.visual interpretation of multivariate graphs.811-819, 012. doi: 10.1145/1124772.1124891). However, in the topology-based clustering method, the attributes are randomly distributed, which makes it difficult to explain the formation of clusters. In the attribute clustering-based method, the topology of the network is severely damaged, which makes the relationships between the nodes difficult to explore and analyze. Therefore, integrating topology and attributes into the visualization clusters of large-scale multivariate networks remains a challenging task.
In addition, the network characterization learning method can handle large networks with millions of nodes and edges by preserving the context features to project the network nodes into the vector space. Among them, learning algorithms based on random walk are simple to implement, easy to expand and parallelize, and various methods have been proposed, such as deep walk (b.perozzi, r.al-Rfou, and s.skiiana.deep walk: Online learning of social representation.pp.701-710,2014) and node2vec (a.grover and j.leskovic.node 2vec: Scalable learning for networks.vol.2016, pp.855-864, 072016. doi:10.1145/2939672.2939754), which both well maintain the high-order proximity between nodes.
Disclosure of Invention
The invention aims to provide a novel clustering-based large-scale multivariate network data simplified visual analysis method aiming at the problems in the background art, the use of attribute information in network characterization learning is emphatically researched, and the large-scale multivariate network is explored through the network characterization learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
step (1) constructing an attribute-enhanced network representation learning model based on original network large-scale data, and converting nodes into high-dimensional vector representation embedded with a topological structure and attribute information;
constructing a multi-level clustering model by using an attribute-enhanced network representation learning model, and dividing nodes into hierarchical categories according to structure compactness, attribute homogeneity and clustering quantity in a vectorization space;
designing a simplified expression visual analysis scheme, and constructing a simplified visual analysis system of large-scale multi-element network data; the simplified visual analysis system forms visual expression through a cluster view and a collaborative view.
Further, the step (1) is specifically:
(1-1) constructing a corpus based on attribute similarity:
quantifying the attribute similarity between nodes, and if the adjacent nodes have the same category, determining the similarity sim (v)i,vi,k′) Set to 1, otherwise set to 0; v. ofi,k′For wandering an initial node viK 'th neighbor of (1, K), K' being viThe number of neighboring nodes of (2); probability of jumping
Figure BDA0003069574470000021
By node viNode v of the front wi-wAs a starting point, vi-wMigration rw (v) based on attribute similarityi-w)=(vi-w,…,vi-1,vi,vi+1,…,vi+w) Uniformly sampling in the neighbor of the last node in the current walking path until the maximum walking step length L is reached, wherein L is 2w + 1;
and generating a wandering path by taking other nodes as initial nodes in the same way, wherein all the nodes are taken as the initial nodes to carry out the wandering based on the attribute similarity.
(1-2) repeatedly traversing the nodes by a fixed number of times T, generating a walking path starting from each node by a fixed walking step length L, and creating a corpus consisting of all walking paths of each attribute;
forming a compound corpus minize from corpora generated by multiple attributesφ(-logPr({vi-w,…,vi+w}\vi|φ(vi) ()) as input to the Skip-Gram model; wherein phi (v)i) Denotes viVector, Pr ({ v)i-w,…,vi+w}\vi|φ(vi) Represents node v)iContext occurrence vi-wTo vi+wThe probability of (d); the Skip-Gram model enables the resulting vector to be maximized from viPresume its context vi-wTo vi+wAnd obtaining high-dimensional vector representation of the nodes.
Further, the step (2) is specifically:
(2-1) projecting the obtained high-dimensional vector of the node to a two-dimensional space, applying a k mean value to the projection vector to perform k-means clustering on the node, and generating a hierarchical clustering structure by using a hierarchical clustering method, namely performing iterative merging on paired clusters until one cluster is left;
(2-2) guiding the cutting of the hierarchy using three metrics, respectively: structure compactness, attribute homogeneity and clustering quantity;
setting a multi-objective function E (C) w1Esc(C)+w2Eah(C)+w3Estr(C) (ii) a Wherein a set of clusters C ═ { C ] generated from the multivariate network1,C2,…,CkIs part of a hierarchy with a set of leaf nodes, Ck″Set k "leaf nodes, k ═ 1,2, …, k; w is a1、w2、w3To adjust the parameters, to balance the metrics;
compactness function E of the structuresc(C) 1-dense (c); wherein the cluster density
Figure BDA0003069574470000031
Representing the structural compactness of clusters derived from C; the larger the clustering density is, the smaller the structural compactness is;
attribute homogeneity function
Figure BDA0003069574470000032
Wherein N is the number of attributes,
Figure BDA0003069574470000033
is CmProperty a ofxThe lower the information entropy is, the more uniform the node attribute of the cluster is; the smaller the average entropy value, the higher the probability that a cluster becomes a leaf node;
number of clusters
Figure BDA0003069574470000034
Representing a behavior of punishing to divide the multi-element nodes into a plurality of small clusters; wherein, dis (C)xRoot) represents C in the hierarchyxAnd the distance between the initial point root,
Figure BDA0003069574470000041
is represented by the formulaxAverage distance between leaf nodes of its descendants, p being CxOne of all descendant leaf nodes in the tree.
Further, in the step (3), the clustering view provides simplified expression of the multivariate graph by using a glyph embedded network graph, leaf nodes are arranged by using a force guide model, the clustering of each leaf node is visualized by using the glyph, and nodes with similar attributes are gathered into a cluster;
the collaborative view is used for providing a group of collaborative views to help a user to intuitively explore the clustering of the multivariate graph, and a high-dimensional vector is projected to a two-dimensional space by utilizing a traditional dimension reduction model t-SNE; for each node, finding adjacent nodes in the surrounding radius of the node, then calculating the information entropy of the nodes on a plurality of attributes, taking the reciprocal of the entropy as the weight of the node, and participating in the calculation of the density of the kernel, thereby determining the final drawing of the heat map; the projection view is interactively connected with the tree graph so as to explore clusters on different levels; and visualizing the process of changing the multi-objective function value along with the interaction of the user by adopting the river graph at the top.
Compared with the prior art, the invention has the beneficial effects that: the invention provides a network representation learning model with enhanced attributes, which extracts topology information and attribute information simultaneously and represents nodes as vectors for retaining topology connection and multi-attribute similarity; defining a multi-target energy function considering structure compactness, attribute homogeneity and clustering quantity to obtain an optimal cut layer of a hierarchical structure, so as to generate multi-level clusters for a user to explore a multi-element network in a hierarchical mode; a multivariate network simplified visual analysis system is developed, and attribute-enhanced network representation learning, hierarchical clustering and cluster-based simplification are visually expressed, so that a user can conveniently evaluate node clustering and explore semantic features of an original multivariate network.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of an attribute enhanced network characterization learning model;
FIG. 3 is a diagram of a multi-level clustering model;
FIG. 4-1 is a diagram of clustering glyph design of a multi-attribute model;
FIG. 4-2 is a diagram of clustering glyph design of a single attribute mode;
FIG. 5-1 is a schematic diagram of the evaluation of the clustering quality of the patent network in terms of clustering density;
FIG. 5-2 is a schematic diagram of the evaluation of the clustering quality of the patent network in the aspect of information entropy;
FIG. 6-1 is a schematic diagram of Vis network clustering quality evaluation in terms of clustering density;
FIG. 6-2 is a schematic diagram of Vis network cluster quality evaluation in the aspect of information entropy.
Detailed Description
The simplified visual analysis method for large-scale multivariate network data of the present invention is further described below with reference to the accompanying drawings and examples.
The flow of this embodiment is shown in fig. 1, and includes the following steps:
step (1) constructing an attribute-enhanced network representation learning model based on original network large-scale data, and converting nodes into high-dimensional vector representation embedded with a topological structure and attribute information; the method comprises the following steps:
(1-1) constructing a corpus based on attribute similarity:
by increasing the probability that nodes with similar attribute values co-occur in the walk, the semantic relationship between the nodes is preserved. Suppose node viAs an initial point of walk, it is associated with K nodes vi,1,vi,2,…,vi,KAdjacent to each other. Jump to node v in next roundi,k′Is dependent on vi,KAnd viSimilarity of attributes therebetween.Quantifying the attribute similarity between nodes, and if the adjacent nodes have the same category, determining the similarity sim (v)i,vi,k′) Set to 1, otherwise set to 0, K' ∈ (1, K). When considering an attribute, the probability of a jump
Figure BDA0003069574470000051
By node viNode v of the front wi-wAs a starting point, vi-wMigration rw (v) based on attribute similarityi-w)=(vi-w,…,vi-1,vi,vi+1,…,vi+w) Uniformly sampling in the neighbor of the last node in the current walking path until the maximum walking step length L is reached, wherein L is 2w + 1; the wandering paths with other nodes as initial points are also generated in the same manner, wherein all nodes are used as initial points to perform the wandering based on the attribute similarity.
The original method is to simply construct a corpus based on a random walk mode, the method of the invention changes the probability of random jump based on attribute similarity, and the nodes with similar attributes can be jumped to, so that the constructed corpus is more enhanced for the characteristic description of the nodes with similar attributes and structure association.
The (1-2) nodes often have multiple attributes, and the method designs a novel corpus generation strategy, as shown in fig. 2. For each attribute, a walk path is generated by the similarity of attributes between adjacent nodes. The nodes are repeatedly traversed by a fixed number of times T, the walk path starting with each node is generated by a fixed walk step L, thereby creating a corpus consisting of all the walk paths of the attribute, and a corpus of other attributes is created in the same manner.
Forming a compound corpus minize from corpora generated by multiple attributesφ(-logPr({vi-w,…,vi+w}\vi|φ(vi) ()) as input to the Skip-Gram model; wherein phi (v)i) Denotes viVector, Pr ({ v)i-w,…,vi+w}\vi|φ(vi) Is shown in (a)Node viContext occurrence vi-wTo vi+wThe probability of (c). The Skip-Gram model enables the resulting vector to be maximized from viPresume its context vi-wTo vi+wAnd obtaining high-dimensional vector representation of the nodes.
Taking two adjacent nodes as an example, the probability of jumping from one node to another is determined according to the similarity of multiple attributes. The more common values they share across multiple attributes, the higher the probability that they will appear in the same wander. When the co-occurrence probability between the two nodes is very large, a pair of obviously similar node vectors is obtained to serve as the network characterization learning model with enhanced attributes.
Constructing a multi-level clustering model by using an attribute-enhanced network representation learning model, and dividing nodes into hierarchical categories according to structure compactness, attribute homogeneity and clustering quantity in a vectorization space; the method comprises the following steps:
(2-1) projecting the obtained high-dimensional vector of the node to a two-dimensional space, applying k-means to the projection vector, and performing k-means clustering on the node (in the present embodiment, initially designated k is 60); and further generating a hierarchical clustering structure by using a hierarchical clustering method, namely performing iterative combination on paired clusters until one cluster is remained.
(2-2) in order to exhibit a meaningful multivariate network visualization effect, appropriate cutting of the hierarchical structure is required. In this method, three metrics are used to guide the slicing of the hierarchy, respectively: compactness of structure, homogeneity of attributes, and number of clusters.
Setting a multi-objective function E (C) w1Esc(C)+w2Eah(C)+w3Estr(C) (ii) a Wherein a set of clusters C ═ { C ] generated from the multivariate network1,C2,…,CkIs part of a hierarchy with a set of leaf nodes, Ck″Set k "is the k" th leaf node, k ═ 1,2, …, k. w is a1、w2、w3To adjust the parameters, the metrics are balanced.
Compactness function E of the structuresc(C)=1-density(C);
Wherein the cluster density
Figure BDA0003069574470000061
Representing the structural compactness of clusters derived from C; the larger the clustering density, the smaller the structural compactness.
Attribute homogeneity function
Figure BDA0003069574470000062
Wherein N is the number of attributes,
Figure BDA0003069574470000063
is CmProperty a ofxThe lower the information entropy, the more uniform the node attributes of the cluster. Therefore, the smaller the average entropy value, the higher the probability that the cluster becomes a leaf node.
Number of clusters
Figure BDA0003069574470000064
Representing a behavior of punishing to divide the multi-element nodes into a plurality of small clusters; wherein, dis (C)xRoot) represents C in the hierarchyxAnd the distance between the initial point root,
Figure BDA0003069574470000065
is represented by the formulaxAverage distance between leaf nodes of its descendants, p being CxOne of all descendant leaf nodes in the tree.
In this embodiment, the tuning parameter w is initially specified1=1.0、w2=5.0、w30.1. To minimize the multi-objective function, a minimal cut algorithm is applied to generate the required cut levels for the hierarchy, where the layout of the leaf nodes can better weight the metrics. Fig. 3 is a schematic diagram of a multi-level clustering model, which shows an optimal cut layer of a hierarchical clustering structure, and reconstructs clusters by an optimal segmentation method of a minimum multi-objective function, so that nodes in the same cluster are closely connected and have similar attributes. Through the steps, the optimal cutting layer in the multi-level clustering is obtained.
Designing a simplified expression visual analysis scheme, and constructing a simplified visual analysis system of large-scale multi-element network data; the simplified visual analysis system forms visual expression through clustering view and collaborative view; the method comprises the following steps:
(3-1) clustering view: a glyph (glyph) embedded network graph is employed to provide a simplified representation of the multivariate graph. First, leaf nodes are laid out using a force-guided model. The clustering of each leaf node is visualized by using a glyph which consists of an inner circle and an outer circle. In particular, the topological pattern is represented by an inner circle, where the radius encodes the number of nodes inside the cluster. The lower left corner shading represents the clustering coefficient measured in 2h/y (y-1), where h and y are the number of nodes and edges in the cluster. The outer ring side focuses on the distribution characteristics conveying various attributes. As described above, nodes with similar attributes are aggregated into a cluster, where only a subset of the attributes play a significant role. For these isomorphic nodes in a cluster, the clustering effect of each attribute is important to help the user identify important attributes, which can be measured by the entropy of the information. Thus, the outer circle is divided into a plurality of fan-shaped circles having equal center angles according to the user-specified attribute set. Each sector ring is colored with a unique color, the height of which encodes the information entropy value on this attribute. FIG. 4-1 is a design diagram of clustering glyph of the multi-attribute mode: the category colors encode different attributes, while the outer sector radius encodes the inverse of the entropy value; the larger the height, the lower the information entropy value, and the stronger the aggregation effect. If the user specifies only one attribute, the fan-shaped ring conveys a detailed distribution of attribute values to compare the aggregate effect of this attribute in the cluster. In this case, the height of each sector ring would encode the number of nodes having corresponding attribute values, as shown in FIG. 4-2.
(3-2) collaborative view: providing a set of collaborative views helps users intuitively explore the clustering of the multivariate graph. Wherein the projection view displays vectors obtained from the attribute enhanced network representation learning model. And projecting the high-dimensional vector to a two-dimensional space by using a traditional dimension reduction model t-SNE. Due to the advantages of t-SNE, nodes with similar structure and attribute information are distributed very closely in the vectorization space. Through visual perception, a user can easily perceive the similarity between nodes. A thermodynamic diagram is also provided to enhance the representation of the homogeneity of the attributes. For each node, its neighboring nodes are found within its surrounding radius, and then the information entropy of these nodes over multiple attributes is calculated. The reciprocal of the entropy value is used as the weight of the point and participates in the calculation of the kernel density, thereby determining the final drawing of the heat map. In addition, the projection view is also connected with the tree graph through interaction so as to explore the clusters on different levels. The tree graph is designed to visualize the hierarchical clustering structure, where nodes link the tree graph highlighting the nodes in the selected cut level, and the middle nodes represent hierarchical clusters distributed from bottom to top. Leaf nodes link rectangles at the bottom to show the characteristics of the attribute distribution. The regions of different colors within the rectangle encode the information entropy of the multiple attributes. If only one attribute is explored, the area within the rectangle will show the number of attribute values. In addition, a river graph is designed at the top to visualize the process of multi-objective function value change with user interaction. The three-layer distribution represents the structural compactness, property homogeneity and energy variation of the tree structure. The user may adjust the parameters through a set of buttons listed at the top. When a cluster is specified in the tree view, the bounding box of the node contained by the cluster will be highlighted in the projection view. The detailed information of the plurality of attributes in the cluster of interest is displayed in an "attribute view". After the clustering is specified, the corresponding lines are highlighted in the parallel coordinates, so that a user can intuitively perceive important attributes, easily evaluate the clustering effect and deeply understand hidden semantics. To further indicate the proportion of different attribute values on the attribute, bars are distributed on parallel axes, the size of which encodes the number of intersecting lines it contains.
The effectiveness of the invention was evaluated as follows:
the attribute enhanced network characterization learning model is subjected to experimental research and is combined with a clustering and multilevel clustering model (w) generated by other three methods1=1.0、w2=5.0、w30.1) the clusters in the generated optimal cut were compared:
(1) an attribute-based method (Attr-based) that takes multi-element nodes as multi-dimensional vectors, regardless of structural information; (2) deepwalk embeds nodes into a low-dimensional space only considering structural information; (3) attribute Awareness Graph Embedding (AAGE) combines node attributes with graph structures by quantifying the sum of multiple attribute similarities to weight values for edges, and learns implicit expressions for the weighted network nodes. The default parameter values in our method are set as follows: the repetition number T of the random walk for each node is 15, and the length L of the random walk is 20.
For a set of clusters C ═ { C) generated from a multivariate network based on the above method1,C2,…,CkAnd (5) evaluating the clustering quality by using the clustering density (formula 6) and the clustering entropy (formula 7). We first studied different methods in the dataset (patent network) and clustered the nodes using k-means. Wherein the patent network is a large network, wherein the nodes represent patents and the edges represent citations between the patents. And extracting a connected subgraph containing 10,178 patents and 18,809 edges. Each patent has multiple attributes such as year of grant, country, patent category, technology category, and subcategory. The experimental results are as follows: as in fig. 5-1, this approach outperforms AAGE, DeepWalk, and attribute-based approaches on all k, means in terms of cluster density. Meanwhile, entropy values of the graph clusters generated by the method are smaller than those of AAGE. Deepwalk is readily observed to have a maximum entropy value, as shown in FIG. 5-2. This means that the method and AAGE generated graph clusters have more uniform node properties compared to DeepWalk, since they consider both properties and topology. In contrast, the attribute-based approach has the smallest entropy and smallest density value. The results show that for clusters that only consider attributes, the homogeneity of the attributes is well maintained, but the structural compactness is poor.
The different network clustering methods described above were studied using another data set (Vis network) and the results of the experiments are shown in FIGS. 6-1 and 6-2. The Vis network data was abstracted from the ArnetMiner database, which contains 2592 papers from 1990 to 2014 at three major meetings (SciVis, InfoVis and VAST). There are three types of papers, namely C (journal recommendation corpus), M (monograph), and T (meeting record). The publication years are divided into five categories, each of which is in units of five years. There are a total of 8984 links to represent the reference relationships between these papers. It can be seen that the performance of DeepWalk is best when the number of clusters k is 3 (fig. 6-1). The performance of the method of the invention and Deepwalk is similar in the cases where k is 5 and k is 7, both of which have a slight advantage over AAGE. Whatever the value of k, the attribute-based approach has the smallest entropy value and the smallest density value. The process of the invention and AAGE have similar properties in terms of homogeneity of properties (FIG. 6-2). Their entropy values are also significantly smaller than Deepwalk. The result shows that the attribute enhanced network characterization learning method has good clustering result quality and stable performance under the condition of considering both the topological structure and the node attribute information.

Claims (4)

1.面向大规模多元网络数据的简化可视分析方法,其特征在于,该方法具体是:1. A simplified visual analysis method for large-scale multivariate network data, characterized in that the method is specifically: 步骤(1)基于原始网络大规模数据,构建属性增强的网络表征学习模型,将节点转换成嵌入拓扑结构和属性信息的高维向量表示;Step (1) build an attribute-enhanced network representation learning model based on the large-scale data of the original network, and convert nodes into high-dimensional vector representations embedded with topology and attribute information; 步骤(2)利用属性增强的网络表征学习模型构建多层次聚类模型,在向量化空间中根据结构紧密度、属性同质性和聚类数量将节点划分为层次类别;Step (2) constructing a multi-level clustering model by using the attribute-enhanced network representation learning model, and dividing the nodes into hierarchical categories in the vectorized space according to structural tightness, attribute homogeneity and number of clusters; 步骤(3)设计简化表达可视分析方案,构建大规模多元网络数据的简化可视分析系统;所述的简化可视分析系统通过聚类视图、协同视图构成视觉表达。Step (3) Design a simplified visualization analysis scheme, and construct a simplified visualization analysis system for large-scale multivariate network data; the simplified visualization analysis system constitutes a visual expression through a clustering view and a collaborative view. 2.如权利要求1所述的面向大规模多元网络数据的简化可视分析方法,其特征在于,步骤(1)具体是:2. the simplified visual analysis method for large-scale multivariate network data as claimed in claim 1, is characterized in that, step (1) is specifically: (1-1)构建基于属性相似性的语料库:(1-1) Build a corpus based on attribute similarity: 将节点之间的属性相似度进行量化,如果相邻节点拥有同一个类别时,则其相似性sim(vi,vi,k′)设置为1,否则设置为0;vi,k′为游走初始节点vi的第k′个相邻节点,k′∈(1,K),K为vi的相邻节点的数量;跳转概率
Figure FDA0003069574460000011
Quantify the attribute similarity between nodes. If the adjacent nodes have the same category, the similarity sim(vi,vi, k ' ) is set to 1, otherwise it is set to 0; vi ,k' is the k'th adjacent node of the initial node v i , k'∈(1,K), K is the number of adjacent nodes of v i ; the jump probability
Figure FDA0003069574460000011
以节点vi的前w的节点vi-w为起始点,vi-w基于属性相似性的游走rw(vi-w)=(vi-w,…,vi-1,vi,vi+1,…,vi+w),在当前游走路径中的最后一个节点的邻居中均匀地采样,直到达到最大游走步长L,L=2w+1;Taking the node v iw before the node v i as the starting point, the v iw walk based on attribute similarity rw(v iw )=(v iw ,...,v i-1 ,v i ,v i+1 ,... ,v i+w ), sample evenly in the neighbors of the last node in the current walk path, until the maximum walk step size L is reached, L=2w+1; 以其他节点为初始节点的游走路径采用相同方式生成,所有节点都要作为初始节点进行基于属性相似性的游走;The walk paths with other nodes as initial nodes are generated in the same way, and all nodes must be used as initial nodes to perform a walk based on attribute similarity; (1-2)以固定的次数T重复遍历节点,以固定的游走步长L生成以每个节点开始的游走路径,创建各个属性的所有游走路径组成的语料库;(1-2) Repeatedly traverse the nodes with a fixed number of times T, generate a walking path starting with each node with a fixed walking step L, and create a corpus composed of all the walking paths of each attribute; 由多个属性生成的语料构成一个复合语料库minimizeφ(-logPr({vi-w,…,vi+w}\vi|φ(vi))),作为Skip-Gram模型的输入;其中φ(vi)表示vi向量、Pr({vi-w,…,vi+w}\vi|φ(vi))表示节点vi上下文出现vi-w到vi+w的概率;Skip-Gram模型使得到的向量能最大化从vi推测出其上下文vi-w到vi+w的概率,得到节点的高维向量表示。The corpus generated by multiple attributes forms a composite corpus minimize φ (-logPr({v iw ,…,v i+w }\v i |φ(v i ))), as the input of the Skip-Gram model; where φ (v i ) represents the v i vector, Pr({v iw ,...,v i+w }\v i |φ(vi )) represents the probability that the node v i context appears from v iw to v i +w ; Skip- The Gram model makes the resulting vector maximize the probability of inferring its context v iw to v i+w from v i , and obtains a high-dimensional vector representation of the node.
3.如权利要求1所述的面向大规模多元网络数据的简化可视分析方法,其特征在于,步骤(2)具体是:3. the simplified visual analysis method for large-scale multivariate network data as claimed in claim 1, is characterized in that, step (2) is specifically: (2-1)将得到的节点的高维向量投影到一个二维空间,将k均值应用于投影向量对节点进行k-means聚类,利用层次聚类方法生成层次聚类结构,即对成对的聚类进行迭代合并,直到剩下一个聚类为止;(2-1) Project the high-dimensional vector of the obtained node into a two-dimensional space, apply k-means to the projection vector, perform k-means clustering on the nodes, and use the hierarchical clustering method to generate a hierarchical clustering structure, that is, pair pairs Iteratively merge the clusters until one cluster remains; (2-2)使用三个度量标准指导层次结构的切割,分别是:结构紧密度、属性同质性、聚类数量;(2-2) Use three metrics to guide the cutting of hierarchical structures, namely: structural tightness, attribute homogeneity, and number of clusters; 设定多目标函数E(C)=w1Esc(C)+w2Eah(C)+w3Estr(C);其中,从多元网络生成的聚类集合C={C1,C2,…,Ck}是层次结构的一部分,具有一组叶子结点,Ck″为第k″组叶子结点,k″=1,2,…,k;w1、w2、w3为调整参数,用来平衡度量;Set the multi-objective function E(C)=w 1 E sc (C)+w 2 E ah (C)+w 3 E str (C); wherein, the clustering set C={C 1 , C 2 , . _ w 3 is an adjustment parameter used to balance the measurement; 结构紧密度函数Esc(C)=1-density(C);Structural density function E sc (C)=1-density(C); 其中,聚类密度
Figure FDA0003069574460000021
表示由C派生的聚类的结构紧密度;聚类密度越大,结构紧密度越小;
Among them, the cluster density
Figure FDA0003069574460000021
Represents the structural compactness of clusters derived from C; the greater the clustering density, the smaller the structural compactness;
属性同质性函数
Figure FDA0003069574460000022
其中,N为属性的数量,
Figure FDA0003069574460000025
为Cm的属性ax上的信息熵,信息熵越低,聚类的节点属性越均匀;平均熵值越小的聚类成为叶子结点的概率越高;
property homogeneity function
Figure FDA0003069574460000022
where N is the number of attributes,
Figure FDA0003069574460000025
is the information entropy on the attribute a x of C m , the lower the information entropy, the more uniform the node attributes of the cluster; the smaller the average entropy value, the higher the probability that the cluster becomes a leaf node;
聚类数量
Figure FDA0003069574460000023
表示惩罚将多元节点划分成多个小聚类的行为;其中,dis(Cx,root)表示层次结构中Cx和初始点root之间的距离,
Figure FDA0003069574460000024
表示与Cx其后代叶子结点之间的平均距离,p为Cx中所有后代叶子结点之一。
Number of clusters
Figure FDA0003069574460000023
Represents the behavior of penalizing the division of multivariate nodes into multiple small clusters; where dis(C x ,root) represents the distance between C x and the initial point root in the hierarchy,
Figure FDA0003069574460000024
Represents the average distance from C x and its descendant leaf nodes, and p is one of all descendant leaf nodes in C x .
4.如权利要求1所述的面向大规模多元网络数据的简化可视分析方法,其特征在于:步骤(3)中,4. the simplified visual analysis method oriented to large-scale multivariate network data as claimed in claim 1, is characterized in that: in step (3), 所述的聚类视图是采用glyph嵌入的网络图来提供多元图的简化表达,使用力引导模型对叶子结点进行布局,每个叶子结点的聚类都使用glyph进行可视化,有相似属性的节点被聚集到一个聚类中;The clustering view uses the glyph-embedded network graph to provide a simplified representation of the multivariate graph, uses the force-guided model to lay out the leaf nodes, and the clustering of each leaf node is visualized using glyphs, which have similar attributes. Nodes are clustered into a cluster; 所述的协同视图是提供一组协同视图帮助用户直观地探索多元图的聚类,利用传统的降维模型t-SNE将高维向量投影到二维空间;对于每个节点,在其周围半径内找到其相邻节点,然后计算这些节点在多个属性上的信息熵,将熵值的倒数作为该点的权重,参与核密度的计算,从而确定热图的最终绘制;投影视图与树图通过交互连接,以在不同层次上探索聚类;在顶部采用河流图来可视化多目标函数值随着用户的交互而变化的过程。The collaborative view is to provide a set of collaborative views to help users intuitively explore the clustering of multivariate graphs, and use the traditional dimensionality reduction model t-SNE to project high-dimensional vectors into a two-dimensional space; for each node, the radius around it is Find its adjacent nodes, and then calculate the information entropy of these nodes on multiple attributes, take the reciprocal of the entropy value as the weight of the point, and participate in the calculation of the kernel density to determine the final drawing of the heat map; projection view and tree map Interactions are connected to explore clusters at different levels; a river plot is used on top to visualize how the value of the multi-objective function changes as the user interacts.
CN202110535193.0A 2021-05-17 2021-05-17 A simplified visual analysis method for large-scale multivariate network data Active CN113392332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110535193.0A CN113392332B (en) 2021-05-17 2021-05-17 A simplified visual analysis method for large-scale multivariate network data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110535193.0A CN113392332B (en) 2021-05-17 2021-05-17 A simplified visual analysis method for large-scale multivariate network data

Publications (2)

Publication Number Publication Date
CN113392332A true CN113392332A (en) 2021-09-14
CN113392332B CN113392332B (en) 2022-06-14

Family

ID=77617902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110535193.0A Active CN113392332B (en) 2021-05-17 2021-05-17 A simplified visual analysis method for large-scale multivariate network data

Country Status (1)

Country Link
CN (1) CN113392332B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114116881A (en) * 2021-12-01 2022-03-01 天津大学 A Visualization Control Method for Multidimensional Data Based on Representation Learning
CN115223661A (en) * 2022-06-24 2022-10-21 西湖大学 Biomics Data Analysis System
CN116822452A (en) * 2023-08-23 2023-09-29 芯行纪科技有限公司 Chip layout optimization method and related equipment
CN117573873A (en) * 2023-11-29 2024-02-20 中国医学科学院医学信息研究所 A name disambiguation method and related devices
CN120353533A (en) * 2025-06-24 2025-07-22 中国人民解放军国防科技大学 Large-scale complex network visualization method for community structure discovery

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080181503A1 (en) * 2007-01-30 2008-07-31 Alon Schclar Diffusion bases methods for segmentation and clustering
US20160110442A1 (en) * 2014-10-16 2016-04-21 Accenture Global Services Limited Segmentation Discovery, Evaluation and Implementation Platform
CN111260491A (en) * 2020-02-13 2020-06-09 南方科技大学 Method and system for discovering network community structure
CN111325326A (en) * 2020-02-21 2020-06-23 北京工业大学 A Link Prediction Method Based on Heterogeneous Network Representation Learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080181503A1 (en) * 2007-01-30 2008-07-31 Alon Schclar Diffusion bases methods for segmentation and clustering
US20160110442A1 (en) * 2014-10-16 2016-04-21 Accenture Global Services Limited Segmentation Discovery, Evaluation and Implementation Platform
CN111260491A (en) * 2020-02-13 2020-06-09 南方科技大学 Method and system for discovering network community structure
CN111325326A (en) * 2020-02-21 2020-06-23 北京工业大学 A Link Prediction Method Based on Heterogeneous Network Representation Learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴烨等: "一种高效的属性图聚类方法", 《计算机学报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114116881A (en) * 2021-12-01 2022-03-01 天津大学 A Visualization Control Method for Multidimensional Data Based on Representation Learning
CN115223661A (en) * 2022-06-24 2022-10-21 西湖大学 Biomics Data Analysis System
CN115223661B (en) * 2022-06-24 2023-04-14 西湖大学 Bio-omics data analysis system
CN116822452A (en) * 2023-08-23 2023-09-29 芯行纪科技有限公司 Chip layout optimization method and related equipment
CN116822452B (en) * 2023-08-23 2023-11-21 芯行纪科技有限公司 Chip layout optimization method and related equipment
CN117573873A (en) * 2023-11-29 2024-02-20 中国医学科学院医学信息研究所 A name disambiguation method and related devices
CN120353533A (en) * 2025-06-24 2025-07-22 中国人民解放军国防科技大学 Large-scale complex network visualization method for community structure discovery

Also Published As

Publication number Publication date
CN113392332B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN113392332A (en) Simplified visual analysis method for large-scale multi-element network data
US11048714B2 (en) Data analysis platform for visualizing data according to relationships
Gupta et al. Evolutionary clustering and analysis of bibliographic networks
Greene et al. Producing a unified graph representation from multiple social network views
US9779150B1 (en) Systems and methods for filtering data used in data visualizations that use relationships
US9613086B1 (en) Graphical user interface for generating and displaying data visualizations that use relationships
Zhu et al. DRGraph: An efficient graph layout algorithm for large-scale graphs by dimensionality reduction
US9710527B1 (en) Systems and methods of arranging displayed elements in data visualizations and use relationships
Chen et al. A survey on visualization approaches for exploring association relationships in graph data
Hodge et al. Hierarchical growing cell structures: TreeGCS
CN110909253B (en) Group relation mining and analyzing method based on specific users
CN109063094A (en) A method of establishing knowledge of TCM map
Itoh et al. Key-node-separated graph clustering and layouts for human relationship graph visualization
Chen et al. Community detection in subspace of attribute
EP4148593A1 (en) Methods and systems for extracting and visualizing patterns in large-scale data sets
CN112765490A (en) Information recommendation method and system based on knowledge graph and graph convolution network
Wang et al. Hierarchical and overlapping social circle identification in ego networks based on link clustering
Shen et al. Graph exploration with embedding-guided layouts
CN109766478B (en) A Semantic Enhanced Large-Scale Multivariate Graph Simplified Visualization Approach
Yang et al. Graph embedding via graph summarization
Dias et al. A hierarchical network simplification via non-negative matrix factorization
Bei et al. Summarizing scale-free networks based on virtual and real links
Kawanobe et al. Experimental study of characterizing frequent itemsets using representation learning
Rickard et al. Hypercube graph representations and fuzzy measures of graph properties
Salako et al. Toward Understanding Similarity of Visualization Techniques

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant