Disclosure of Invention
The invention provides a density peak value-based adaptive overlapping community discovery method, and aims to improve the problems.
The invention is realized in such a way that a self-adaptive overlapping community discovery method based on a density peak value comprises the following steps:
S1, calculating edge link strength and point link strength between nodes based on an adjacent matrix of a node network;
S2, constructing a distance matrix of the node network, and calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
S3, self-adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes;
S4, distributing the residual nodes to communities corresponding to at least one community center based on the probability vectors of the residual nodes, and determining overlapping communities according to the distributed community sets.
Further, the calculation formula of the edge link strength between the node i and the node j is as follows:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
Further, the calculation formula of the point link strength between the node i and the node j is as follows:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
Further, a distance dist ij between the node i and the node j is calculated, an element value of the j-th column of the i-th row in the distance matrix is formed, and a dist ij distance calculation formula is as follows:
wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, and ε is a constant.
Further, the community center selection method specifically comprises the following steps:
Normalizing the density rho and the shortest distance delta of all nodes to obtain rho * and delta *;
Calculating the product gamma of rho * and delta *, then carrying out ascending order on the gamma, defining gamma after ascending order as gamma *,γ* and defining an index corresponding to gamma as ind;
For each node i in gamma *, performing linear fitting on nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i;
The gamma difference delta gamma, delta gamma = gamma * -gamma' for node i is calculated,
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average valueThe smallest lower of the points of (a) is marked M,Adding nodes with point subscripts larger than M to a candidate community center set for the average value of the difference delta gamma of all the nodes;
Deleting candidate community centers which do not meet the following condition 1 in the candidate community center set to form a final community center set;
The condition 1 is that the local density and the shortest distance are respectively larger than the maximum value of the local density and the shortest distance of k adjacent nodes closest to the local density and the shortest distance.
Further, the distribution method of the remaining nodes specifically comprises the following steps:
Calculating a probability vector of each remaining node, namely the probability that the remaining node belongs to each community center;
The rest nodes are distributed to communities with the highest probability;
And calculating the divisor between the probability value of the rest node belonging to other communities and the maximum probability value, and distributing the rest node to the community corresponding to the divisor larger than the given threshold.
Further, the calculation formula of the probability vector p i of the remaining node i is specifically as follows:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
dist ij represents the distance between node i and node j, dist im the distance between node i and node m.
The invention is realized in that an adaptive overlapping community discovery system based on density peaks, the system comprising:
The social topic network modeling module filters key node words through the topic words, calculates the relativity among the key node words and constructs a node network;
The edge link strength calculation module is used for calculating the edge link strength between the nodes based on the adjacency matrix of the node network;
The point link strength calculation module is used for calculating the point link strength between the nodes based on the adjacency matrix of the node network;
the distance matrix calculation module calculates the distance between the nodes based on the edge link strength and the point link strength of each node, namely, a distance matrix is constructed;
The density peak value calculation module is used for calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
The community center generation module is used for adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes;
the node allocation module allocates the remaining nodes to communities corresponding to at least one community center based on probability vectors of the remaining nodes;
and the overlapped community discovery module is used for determining overlapped communities according to the distributed community set.
Further, the edge link strength calculation module calculates the edge link strength between the nodes based on the following formula:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
Further, the point link strength calculation module calculates the point link strength between the nodes based on the following formula:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
The method adopts the edge link and the point link strength to accurately calculate the distance between the nodes, adopts the linear fitting method to adaptively select the community center, defines probability vectors for each node in the distribution stage of the rest nodes to represent the probability of belonging to each community, has low complexity, has better discovery effect on overlapping communities and has better division result on a large-scale complex network.
Detailed Description
The following detailed description of the embodiments of the invention, given by way of example only, is presented in the accompanying drawings to aid in a more complete, accurate, and thorough understanding of the inventive concepts and aspects of the invention by those skilled in the art.
The invention relates to the term interpretation:
complex networks-complex networks refer to networks that have some or all of the properties of self-organization, self-similarity, attractors, the small world, scaleless.
Communities-communities are defined as a subset of the set of all nodes in a network, where the connections between nodes within the subset are tighter than with other nodes outside the subset.
Overlapping communities, namely nodes with overlapping among different communities.
Fig. 1 is a flowchart of a density peak value-based adaptive overlapping community discovery method according to an embodiment of the present invention, where the method includes the following steps:
S1, calculating edge link and point link strength between nodes by an adjacent matrix of a node network;
In the embodiment of the invention, the adjacency matrix acquisition process of the node network is specifically as follows:
Filtering key node words through subject words, calculating the correlation degree of the key node words, and constructing a node network, wherein the correlation degree of the key node words is the weight of the edge where two nodes are located in the node network, and the process of constructing the node network is specifically as follows:
After preprocessing the text data, combining with the choice of the subject words, defining word vectors and relativity, and constructing a topic network. The word vector definition formula is:
ωi→(pi1,…,pij,…,pik)
Wherein omega i is the i-th word selected, k represents the vocabulary amount in the whole sentence text, and p ij represents the number of sentences of co-occurrence of the i-th and j-th words in the total sentence text.
The words selected in the text become node words in the node network, and the correlation calculation formula between the node words omega i,ωj is as follows:
Wherein, p it represents the sentence number of the co-occurrence of the ith word and the t word, p jt represents the sentence number of the co-occurrence of the jth word and the t word, k represents the vocabulary total amount in the whole sentence text, and the adjacent matrix of the node network is obtained according to the correlation calculation value among the node words, wherein the node words in the node network are called nodes in the adjacent matrix;
Edge link lst ij is the edge link strength calculation formula between node i and node j:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, and is defined as the weight value of the edge where the node i and the node j are located in the node network, the value is R (w i,wj),Ait、Atj respectively represents the weight value of the edge where the node i and the node t are located and the weight value of the edge where the node j and the node t are located, w i、wj respectively represents the sum of the joint edge weights of the node i and the sum of the joint edge weights of the node j, R is the difference value between the maximum weight w max and the minimum value w min of the sum of the joint edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and if the node g is connected with the node i and the node j through the joint edge at the same time, the node g is the common point between the node i and the node j, eta is a constant, and the value is 1.
The calculation formula of the point link strength between the node i and the node j is as follows:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor nodes, N j + is a set of node j and its neighbor nodes, the other nodes on the border of node i are neighbor nodes of node i, V ij is a common point set between node i and node j, and λ is a non-negative real number.
S2, constructing a distance matrix of the node network, and calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
Calculating a distance dist ij between a node i and a node j to form an element value of an ith row and a jth column in a distance matrix, wherein a dist ij distance calculation formula is as follows:
Wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, ε is a constant, and ε is 1.
The density of the node i is represented by the local density ρ i, and the shortest distance from the node i to a node higher than the density thereof is represented by the shortest distance δ i. Because the distribution of the nodes in the complex network is relatively sparse and the nature of the community is a local structure, the influence of the global node on the local density of the nodes is considered when the local density is calculated, the influence of k neighbor nodes closest to the nodes on the local density of the nodes is only considered when the local density is calculated, and the definition of the local density rho i is shown as follows.
Wherein KNN i represents k nearest neighbor nodes to node i, k is an average value of the number of neighbor nodes in the network, and d c represents a cut-off distance;
the shortest distance delta i of the node i is calculated as follows:
Where δ i is the shortest distance between point i and the higher density point.
S3, adaptively selecting a community center in the network through the density rho and the shortest distance delta of the nodes;
In order to make the influence of rho and delta on community center selection have consistency, rho and delta need to have the same value range, so that the local density rho and the shortest distance delta are normalized after calculation is completed, and the density rho and the shortest distance delta of the nodes are normalized to obtain rho * and delta *;
Only nodes with relatively large local density ρ and shortest distance δ can be selected as cluster centers, and in order to improve the efficiency of cluster center selection, the product γ of ρ * and δ * is calculated according to the following calculation formula:
γ=ρ*δ*
Then ascending order is carried out on the calculated gamma, the gamma after ascending order is defined as gamma *,γ*, and the index corresponding to gamma is defined as ind;
For each node i in gamma *, performing linear fitting on the nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i, wherein the calculation formula is as follows:
γ′=a×γ*+b
Wherein a and b are parameters obtained by linear fitting;
after the calculation of γ' is completed, the difference Δγ is calculated, and the calculation formula is:
Δγ=γ*-γ′
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average value The smallest lower of the points of (a) is marked M,And adding nodes with point subscripts larger than M to the candidate community center set as an average value of all node difference values delta gamma, wherein the adding formula is as follows:
S0={indt|M≤t≤|ind}
Wherein S 0 is a candidate community center set.
Because the local density and the shortest distance of the community center should be higher than those of the neighboring nodes, according to the fact that the local density and the shortest distance of the community center should be respectively greater than the maximum value of the local densities and the shortest distances of k neighboring nodes closest to the community center, candidate community centers which do not meet the condition are removed from the candidate community center set S 0, and a final community center set S is obtained.
S4, distributing the unassigned nodes to a community center based on probability vectors of the unassigned nodes, and determining overlapping communities according to the distributed community set.
The traditional remaining node allocation mechanism is to allocate the nodes to clusters where cluster centers which are higher in local density and closest to the nodes are located, the allocation mechanism causes the nodes to only belong to one cluster, but the possibility that the nodes belong to a plurality of communities exists in a complex network, and therefore the allocation mechanism is not applicable to overlapping community detection.
In order for the node allocation mechanism to distinguish overlapping nodes, the probability vector defined by the present invention represents the probability that a node belongs to each community, e.g., the probability vector for node i is denoted as p i={pi1,pi2,…,pi|S|. When the node i is a community center and the community in which the node i is located is t, the probability vector p i of the node i is as follows.
I.e. the community center i is considered to be affiliated to community t and not to other communities. If node i is not a community center, its probability vector is determined by k nodes with higher local densities and closest distances thereto, and the probability vector p i for node i is as follows:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
The method comprises the steps of calculating the attribution degree of each community center, namely the probability of being affiliated to each community, allocating the node to the community with the largest probability, calculating the division value between the probability value affiliated to other communities and the maximum probability value, and allocating the node to other communities if the division value is larger than a given threshold value, wherein in the probability vector of the node i, if the community r=argmax (p i), the node i is allocated to the community r. Then the algorithm calculates the ratio of other community probabilities p iv and the maximum probability p ir in the probability vector, and if p iv meets p iv/pir not less than ζ, the node i is considered to be the node in the community v at the same time. Wherein ζ is a pre-specified threshold. And determining overlapping communities and hot topics according to the distributed community set.
Fig. 2 is a schematic structural diagram of an adaptive overlapping community discovery system based on density peaks according to an embodiment of the present invention, and for convenience of explanation, only a portion related to the embodiment of the present invention is shown, where the system includes:
The social topic network modeling module filters key node words through the subject words, calculates the relativity among the key node words and constructs a node network;
After preprocessing the text data, combining with the choice of the subject words, defining word vectors and relativity, and constructing a topic network. The word vector definition formula is:
ωi→(pi1,…,pij,…,pik)
Wherein omega i is the i-th word selected, k represents the vocabulary amount in the whole sentence text, and p ij represents the number of sentences of co-occurrence of the i-th and j-th words in the total sentence text.
The words selected in the text become node words in the node network, and the correlation calculation formula between the node words omega i,ωj is as follows:
The method comprises the steps of obtaining a node network, wherein p it represents the number of sentences in which an ith word and a t word co-occur, p jt represents the number of sentences in which a j-th word and a t-th word co-occur, k represents the total amount of vocabulary in the whole sentence text, and obtaining an adjacent matrix of the node network according to a correlation degree calculation value among the node words, wherein the node words in the node network are called nodes in the adjacent matrix.
The edge link strength calculation module calculates edge link strength between nodes based on an adjacency matrix of the node network, and the edge link strength between the node i and the node j is calculated based on the following formula:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
The point link strength calculation module calculates the point link strength between each node based on the adjacency matrix of the node network, and the point link strength between the node i and the node j is calculated based on the following formula:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
The distance matrix calculation module calculates the distance between the nodes based on the edge link strength and the point link strength of each node, namely, a distance matrix is constructed, the distance dist ij between the node i and the node j forms the element value of the ith row and the jth column in the distance matrix, and the distance dist ij has a calculation formula as follows:
wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, and ε is a constant.
The density peak value calculation module calculates the local density ρ and the shortest distance δ of each node based on the distance matrix, and the calculation formula of the local density ρ i is specifically as follows:
wherein KNN i represents k nearest neighbor nodes to node i, k is an average value of the number of neighbor nodes in the network, and d c represents a cut-off distance;
the shortest distance delta i of the node i is calculated as follows:
Where δ i is the shortest distance between point i and the higher density point.
The community center generation module is used for adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes, and the generation process of the community center is specifically as follows:
Normalizing the density rho and the shortest distance delta of all nodes to obtain rho * and delta *;
Calculating the product gamma of rho * and delta *, then carrying out ascending order on the gamma, defining gamma after ascending order as gamma *,γ* and defining an index corresponding to gamma as ind;
For each node i in gamma *, performing linear fitting on nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i;
The gamma difference delta gamma, delta gamma = gamma * -gamma' for node i is calculated,
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average valueThe smallest lower of the points of (a) is marked M,Adding nodes with point subscripts larger than M to a candidate community center set for the average value of the difference delta gamma of all the nodes;
Deleting candidate community centers which do not meet the following condition 1 in the candidate community center set to form a final community center set;
The condition 1 is that the local density and the shortest distance are respectively larger than the maximum value of the local density and the shortest distance of k adjacent nodes closest to the local density and the shortest distance.
The node allocation module allocates the remaining nodes to communities corresponding to at least one community center based on probability vectors of the remaining nodes, wherein the allocation method of the remaining nodes specifically comprises the following steps:
Calculating a probability vector of each remaining node, namely the probability that the remaining node belongs to each community center;
The rest nodes are distributed to communities with the highest probability;
And calculating the divisor between the probability value of the rest node belonging to other communities and the maximum probability value, and distributing the rest node to the community corresponding to the divisor larger than the given threshold.
The probability vector p i of the remaining node i has the following calculation formula:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
dist ij represents the distance between node i and node j, dist im the distance between node i and node m.
And the overlapped community discovery module is used for determining overlapped communities according to the distributed community sets, and nodes positioned in at least two communities are overlapped parts of the communities.
While the invention has been described above with reference to the accompanying drawings, it will be apparent that the invention is not limited to the above embodiments, but is capable of being modified or applied directly to other applications without modification, as long as various insubstantial modifications of the method concept and technical solution of the invention are adopted, all within the scope of the invention.