[go: up one dir, main page]

CN113987342B - An adaptive overlapping community discovery method and system based on density peak - Google Patents

An adaptive overlapping community discovery method and system based on density peak Download PDF

Info

Publication number
CN113987342B
CN113987342B CN202111246835.1A CN202111246835A CN113987342B CN 113987342 B CN113987342 B CN 113987342B CN 202111246835 A CN202111246835 A CN 202111246835A CN 113987342 B CN113987342 B CN 113987342B
Authority
CN
China
Prior art keywords
node
nodes
community
link strength
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111246835.1A
Other languages
Chinese (zh)
Other versions
CN113987342A (en
Inventor
叶涛
杜亭莉
孙丽萍
郑孝遥
俞庆英
罗永龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Normal University
Original Assignee
Anhui Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Normal University filed Critical Anhui Normal University
Priority to CN202111246835.1A priority Critical patent/CN113987342B/en
Publication of CN113987342A publication Critical patent/CN113987342A/en
Application granted granted Critical
Publication of CN113987342B publication Critical patent/CN113987342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9536Search customisation based on social or collaborative filtering
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开的一种基于密度峰值的自适应重叠社区发现方法,包括如下步骤:S1、基于节点网络的邻接矩阵计算各节点之间的边链接强度和点链接强度;S2、构造节点网络的距离矩阵,并基于距离矩阵计算每个节点的局部密度ρ和最短距离δ;S3、通过节点的局部密度ρ和最短距离δ自适应选取网络中的社区中心;S4、基于剩余节点的概率向量将剩余节点分配到至少一个社区中心对应的社区,根据分配后的社区集合确定重叠社区。本发明提供的基于密度峰值的自适应重叠社区发现方法复杂度低,对重叠社区发现的效果较好,并对大规模复杂网络进行划分的结果更好。

The present invention discloses an adaptive overlapping community discovery method based on density peak, comprising the following steps: S1, calculating the edge link strength and point link strength between nodes based on the adjacency matrix of the node network; S2, constructing the distance matrix of the node network, and calculating the local density ρ and the shortest distance δ of each node based on the distance matrix; S3, adaptively selecting the community center in the network through the local density ρ and the shortest distance δ of the node; S4, allocating the remaining nodes to the community corresponding to at least one community center based on the probability vector of the remaining nodes, and determining the overlapping community according to the allocated community set. The adaptive overlapping community discovery method based on density peak provided by the present invention has low complexity, good effect on overlapping community discovery, and better results for dividing large-scale complex networks.

Description

Self-adaptive overlapping community discovery method and system based on density peak value
Technical Field
The invention belongs to the technical field of computers and communication, and particularly relates to a density peak value-based self-adaptive overlapping community discovery method and system.
Background
In real life, many complex systems can be modeled as a complex network for analysis, such as a common blood-source network, an aviation network, a traffic network, a computer network, a literature reference network, a social network, and the like. The community discovery divides the community structure according to the connection tightness degree among the network nodes, and reasonably classifies the nodes in the network. Community overlap is an important feature in a network, namely nodes with overlap between different communities. The detection of the overlapped communities has important research value and scientific significance for network structure analysis, community division and the like.
The existing community discovery algorithm still has the following defects:
The overlapped community discovery algorithm using the density peak value mostly has the defect that the distance between the nodes cannot be accurately calculated, because most algorithms do not consider the influence of edges and public nodes on the distance calculation, or cannot adaptively select the community center, because the community center is required to be selected from a decision diagram subjectively after the density and the shortest distance of the nodes are calculated, the subjectivity is strong, or the defects of low detection efficiency, long time, incapability of detecting a large-scale complex network and the like are caused by higher complexity of the algorithm.
Disclosure of Invention
The invention provides a density peak value-based adaptive overlapping community discovery method, and aims to improve the problems.
The invention is realized in such a way that a self-adaptive overlapping community discovery method based on a density peak value comprises the following steps:
S1, calculating edge link strength and point link strength between nodes based on an adjacent matrix of a node network;
S2, constructing a distance matrix of the node network, and calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
S3, self-adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes;
S4, distributing the residual nodes to communities corresponding to at least one community center based on the probability vectors of the residual nodes, and determining overlapping communities according to the distributed community sets.
Further, the calculation formula of the edge link strength between the node i and the node j is as follows:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
Further, the calculation formula of the point link strength between the node i and the node j is as follows:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
Further, a distance dist ij between the node i and the node j is calculated, an element value of the j-th column of the i-th row in the distance matrix is formed, and a dist ij distance calculation formula is as follows:
wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, and ε is a constant.
Further, the community center selection method specifically comprises the following steps:
Normalizing the density rho and the shortest distance delta of all nodes to obtain rho * and delta *;
Calculating the product gamma of rho * and delta *, then carrying out ascending order on the gamma, defining gamma after ascending order as gamma ** and defining an index corresponding to gamma as ind;
For each node i in gamma *, performing linear fitting on nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i;
The gamma difference delta gamma, delta gamma = gamma * -gamma' for node i is calculated,
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average valueThe smallest lower of the points of (a) is marked M,Adding nodes with point subscripts larger than M to a candidate community center set for the average value of the difference delta gamma of all the nodes;
Deleting candidate community centers which do not meet the following condition 1 in the candidate community center set to form a final community center set;
The condition 1 is that the local density and the shortest distance are respectively larger than the maximum value of the local density and the shortest distance of k adjacent nodes closest to the local density and the shortest distance.
Further, the distribution method of the remaining nodes specifically comprises the following steps:
Calculating a probability vector of each remaining node, namely the probability that the remaining node belongs to each community center;
The rest nodes are distributed to communities with the highest probability;
And calculating the divisor between the probability value of the rest node belonging to other communities and the maximum probability value, and distributing the rest node to the community corresponding to the divisor larger than the given threshold.
Further, the calculation formula of the probability vector p i of the remaining node i is specifically as follows:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
dist ij represents the distance between node i and node j, dist im the distance between node i and node m.
The invention is realized in that an adaptive overlapping community discovery system based on density peaks, the system comprising:
The social topic network modeling module filters key node words through the topic words, calculates the relativity among the key node words and constructs a node network;
The edge link strength calculation module is used for calculating the edge link strength between the nodes based on the adjacency matrix of the node network;
The point link strength calculation module is used for calculating the point link strength between the nodes based on the adjacency matrix of the node network;
the distance matrix calculation module calculates the distance between the nodes based on the edge link strength and the point link strength of each node, namely, a distance matrix is constructed;
The density peak value calculation module is used for calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
The community center generation module is used for adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes;
the node allocation module allocates the remaining nodes to communities corresponding to at least one community center based on probability vectors of the remaining nodes;
and the overlapped community discovery module is used for determining overlapped communities according to the distributed community set.
Further, the edge link strength calculation module calculates the edge link strength between the nodes based on the following formula:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
Further, the point link strength calculation module calculates the point link strength between the nodes based on the following formula:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
The method adopts the edge link and the point link strength to accurately calculate the distance between the nodes, adopts the linear fitting method to adaptively select the community center, defines probability vectors for each node in the distribution stage of the rest nodes to represent the probability of belonging to each community, has low complexity, has better discovery effect on overlapping communities and has better division result on a large-scale complex network.
Drawings
FIG. 1 is a flow chart of a density peak-based adaptive overlapping community discovery method provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of an adaptive overlapping community discovery system based on density peaks according to an embodiment of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention, given by way of example only, is presented in the accompanying drawings to aid in a more complete, accurate, and thorough understanding of the inventive concepts and aspects of the invention by those skilled in the art.
The invention relates to the term interpretation:
complex networks-complex networks refer to networks that have some or all of the properties of self-organization, self-similarity, attractors, the small world, scaleless.
Communities-communities are defined as a subset of the set of all nodes in a network, where the connections between nodes within the subset are tighter than with other nodes outside the subset.
Overlapping communities, namely nodes with overlapping among different communities.
Fig. 1 is a flowchart of a density peak value-based adaptive overlapping community discovery method according to an embodiment of the present invention, where the method includes the following steps:
S1, calculating edge link and point link strength between nodes by an adjacent matrix of a node network;
In the embodiment of the invention, the adjacency matrix acquisition process of the node network is specifically as follows:
Filtering key node words through subject words, calculating the correlation degree of the key node words, and constructing a node network, wherein the correlation degree of the key node words is the weight of the edge where two nodes are located in the node network, and the process of constructing the node network is specifically as follows:
After preprocessing the text data, combining with the choice of the subject words, defining word vectors and relativity, and constructing a topic network. The word vector definition formula is:
ωi→(pi1,…,pij,…,pik)
Wherein omega i is the i-th word selected, k represents the vocabulary amount in the whole sentence text, and p ij represents the number of sentences of co-occurrence of the i-th and j-th words in the total sentence text.
The words selected in the text become node words in the node network, and the correlation calculation formula between the node words omega ij is as follows:
Wherein, p it represents the sentence number of the co-occurrence of the ith word and the t word, p jt represents the sentence number of the co-occurrence of the jth word and the t word, k represents the vocabulary total amount in the whole sentence text, and the adjacent matrix of the node network is obtained according to the correlation calculation value among the node words, wherein the node words in the node network are called nodes in the adjacent matrix;
Edge link lst ij is the edge link strength calculation formula between node i and node j:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, and is defined as the weight value of the edge where the node i and the node j are located in the node network, the value is R (w i,wj),Ait、Atj respectively represents the weight value of the edge where the node i and the node t are located and the weight value of the edge where the node j and the node t are located, w i、wj respectively represents the sum of the joint edge weights of the node i and the sum of the joint edge weights of the node j, R is the difference value between the maximum weight w max and the minimum value w min of the sum of the joint edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and if the node g is connected with the node i and the node j through the joint edge at the same time, the node g is the common point between the node i and the node j, eta is a constant, and the value is 1.
The calculation formula of the point link strength between the node i and the node j is as follows:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor nodes, N j + is a set of node j and its neighbor nodes, the other nodes on the border of node i are neighbor nodes of node i, V ij is a common point set between node i and node j, and λ is a non-negative real number.
S2, constructing a distance matrix of the node network, and calculating the local density rho and the shortest distance delta of each node based on the distance matrix;
Calculating a distance dist ij between a node i and a node j to form an element value of an ith row and a jth column in a distance matrix, wherein a dist ij distance calculation formula is as follows:
Wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, ε is a constant, and ε is 1.
The density of the node i is represented by the local density ρ i, and the shortest distance from the node i to a node higher than the density thereof is represented by the shortest distance δ i. Because the distribution of the nodes in the complex network is relatively sparse and the nature of the community is a local structure, the influence of the global node on the local density of the nodes is considered when the local density is calculated, the influence of k neighbor nodes closest to the nodes on the local density of the nodes is only considered when the local density is calculated, and the definition of the local density rho i is shown as follows.
Wherein KNN i represents k nearest neighbor nodes to node i, k is an average value of the number of neighbor nodes in the network, and d c represents a cut-off distance;
the shortest distance delta i of the node i is calculated as follows:
Where δ i is the shortest distance between point i and the higher density point.
S3, adaptively selecting a community center in the network through the density rho and the shortest distance delta of the nodes;
In order to make the influence of rho and delta on community center selection have consistency, rho and delta need to have the same value range, so that the local density rho and the shortest distance delta are normalized after calculation is completed, and the density rho and the shortest distance delta of the nodes are normalized to obtain rho * and delta *;
Only nodes with relatively large local density ρ and shortest distance δ can be selected as cluster centers, and in order to improve the efficiency of cluster center selection, the product γ of ρ * and δ * is calculated according to the following calculation formula:
γ=ρ*δ*
Then ascending order is carried out on the calculated gamma, the gamma after ascending order is defined as gamma **, and the index corresponding to gamma is defined as ind;
For each node i in gamma *, performing linear fitting on the nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i, wherein the calculation formula is as follows:
γ′=a×γ*+b
Wherein a and b are parameters obtained by linear fitting;
after the calculation of γ' is completed, the difference Δγ is calculated, and the calculation formula is:
Δγ=γ*-γ′
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average value The smallest lower of the points of (a) is marked M,And adding nodes with point subscripts larger than M to the candidate community center set as an average value of all node difference values delta gamma, wherein the adding formula is as follows:
S0={indt|M≤t≤|ind}
Wherein S 0 is a candidate community center set.
Because the local density and the shortest distance of the community center should be higher than those of the neighboring nodes, according to the fact that the local density and the shortest distance of the community center should be respectively greater than the maximum value of the local densities and the shortest distances of k neighboring nodes closest to the community center, candidate community centers which do not meet the condition are removed from the candidate community center set S 0, and a final community center set S is obtained.
S4, distributing the unassigned nodes to a community center based on probability vectors of the unassigned nodes, and determining overlapping communities according to the distributed community set.
The traditional remaining node allocation mechanism is to allocate the nodes to clusters where cluster centers which are higher in local density and closest to the nodes are located, the allocation mechanism causes the nodes to only belong to one cluster, but the possibility that the nodes belong to a plurality of communities exists in a complex network, and therefore the allocation mechanism is not applicable to overlapping community detection.
In order for the node allocation mechanism to distinguish overlapping nodes, the probability vector defined by the present invention represents the probability that a node belongs to each community, e.g., the probability vector for node i is denoted as p i={pi1,pi2,…,pi|S|. When the node i is a community center and the community in which the node i is located is t, the probability vector p i of the node i is as follows.
I.e. the community center i is considered to be affiliated to community t and not to other communities. If node i is not a community center, its probability vector is determined by k nodes with higher local densities and closest distances thereto, and the probability vector p i for node i is as follows:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
The method comprises the steps of calculating the attribution degree of each community center, namely the probability of being affiliated to each community, allocating the node to the community with the largest probability, calculating the division value between the probability value affiliated to other communities and the maximum probability value, and allocating the node to other communities if the division value is larger than a given threshold value, wherein in the probability vector of the node i, if the community r=argmax (p i), the node i is allocated to the community r. Then the algorithm calculates the ratio of other community probabilities p iv and the maximum probability p ir in the probability vector, and if p iv meets p iv/pir not less than ζ, the node i is considered to be the node in the community v at the same time. Wherein ζ is a pre-specified threshold. And determining overlapping communities and hot topics according to the distributed community set.
Fig. 2 is a schematic structural diagram of an adaptive overlapping community discovery system based on density peaks according to an embodiment of the present invention, and for convenience of explanation, only a portion related to the embodiment of the present invention is shown, where the system includes:
The social topic network modeling module filters key node words through the subject words, calculates the relativity among the key node words and constructs a node network;
After preprocessing the text data, combining with the choice of the subject words, defining word vectors and relativity, and constructing a topic network. The word vector definition formula is:
ωi→(pi1,…,pij,…,pik)
Wherein omega i is the i-th word selected, k represents the vocabulary amount in the whole sentence text, and p ij represents the number of sentences of co-occurrence of the i-th and j-th words in the total sentence text.
The words selected in the text become node words in the node network, and the correlation calculation formula between the node words omega ij is as follows:
The method comprises the steps of obtaining a node network, wherein p it represents the number of sentences in which an ith word and a t word co-occur, p jt represents the number of sentences in which a j-th word and a t-th word co-occur, k represents the total amount of vocabulary in the whole sentence text, and obtaining an adjacent matrix of the node network according to a correlation degree calculation value among the node words, wherein the node words in the node network are called nodes in the adjacent matrix.
The edge link strength calculation module calculates edge link strength between nodes based on an adjacency matrix of the node network, and the edge link strength between the node i and the node j is calculated based on the following formula:
Wherein A ij is the element value of the ith row and jth column in the adjacency matrix, w i、wj respectively represents the sum of the edge weights of the node i and the sum of the edge weights of the node j, r is the difference between the maximum weight w max and the minimum value w min of the sum of the edge weights of all the nodes in the node network, V ij is the common point set between the node i and the node j, and eta is a constant.
The point link strength calculation module calculates the point link strength between each node based on the adjacency matrix of the node network, and the point link strength between the node i and the node j is calculated based on the following formula:
Wherein vst ij is a point link strength value between node i and node j, N i + is a set of node i and its neighbor node, N j + is a set of node j and its neighbor node, V ij is a set of common points between node i and node j, and λ is a non-negative real number.
The distance matrix calculation module calculates the distance between the nodes based on the edge link strength and the point link strength of each node, namely, a distance matrix is constructed, the distance dist ij between the node i and the node j forms the element value of the ith row and the jth column in the distance matrix, and the distance dist ij has a calculation formula as follows:
wherein lst ij is the edge link strength between node i and node j, vst ij is the point link strength between point i and point j, and ε is a constant.
The density peak value calculation module calculates the local density ρ and the shortest distance δ of each node based on the distance matrix, and the calculation formula of the local density ρ i is specifically as follows:
wherein KNN i represents k nearest neighbor nodes to node i, k is an average value of the number of neighbor nodes in the network, and d c represents a cut-off distance;
the shortest distance delta i of the node i is calculated as follows:
Where δ i is the shortest distance between point i and the higher density point.
The community center generation module is used for adaptively selecting a community center in the network through the local density rho and the shortest distance delta of the nodes, and the generation process of the community center is specifically as follows:
Normalizing the density rho and the shortest distance delta of all nodes to obtain rho * and delta *;
Calculating the product gamma of rho * and delta *, then carrying out ascending order on the gamma, defining gamma after ascending order as gamma ** and defining an index corresponding to gamma as ind;
For each node i in gamma *, performing linear fitting on nodes arranged before the node i in gamma *, and further calculating an estimated value gamma' corresponding to the node i;
The gamma difference delta gamma, delta gamma = gamma * -gamma' for node i is calculated,
Judging the elements in gamma * from back to front, and making the difference delta gamma smaller than the average valueThe smallest lower of the points of (a) is marked M,Adding nodes with point subscripts larger than M to a candidate community center set for the average value of the difference delta gamma of all the nodes;
Deleting candidate community centers which do not meet the following condition 1 in the candidate community center set to form a final community center set;
The condition 1 is that the local density and the shortest distance are respectively larger than the maximum value of the local density and the shortest distance of k adjacent nodes closest to the local density and the shortest distance.
The node allocation module allocates the remaining nodes to communities corresponding to at least one community center based on probability vectors of the remaining nodes, wherein the allocation method of the remaining nodes specifically comprises the following steps:
Calculating a probability vector of each remaining node, namely the probability that the remaining node belongs to each community center;
The rest nodes are distributed to communities with the highest probability;
And calculating the divisor between the probability value of the rest node belonging to other communities and the maximum probability value, and distributing the rest node to the community corresponding to the divisor larger than the given threshold.
The probability vector p i of the remaining node i has the following calculation formula:
Where cc i represents a set of k nodes with a local density higher than and closest to node i, p j represents a probability vector for node j in cc i, w ij represents a probability weight for node j relative to node i, and the formula for probability weight w ij is as follows:
dist ij represents the distance between node i and node j, dist im the distance between node i and node m.
And the overlapped community discovery module is used for determining overlapped communities according to the distributed community sets, and nodes positioned in at least two communities are overlapped parts of the communities.
While the invention has been described above with reference to the accompanying drawings, it will be apparent that the invention is not limited to the above embodiments, but is capable of being modified or applied directly to other applications without modification, as long as various insubstantial modifications of the method concept and technical solution of the invention are adopted, all within the scope of the invention.

Claims (9)

1.一种基于密度峰值的自适应重叠社区发现方法,其特征在于,所述方法具体包括如下步骤:1. A density peak-based adaptive overlapping community discovery method, characterized in that the method specifically comprises the following steps: S1、通过主题词过滤筛选出关键节点词汇,对关键节点词汇间的相关度进行计算,构建节点网络,基于节点网络的邻接矩阵计算各节点之间的边链接强度和点链接强度;S1. Filter out key node words through subject words, calculate the correlation between key node words, build a node network, and calculate the edge link strength and point link strength between nodes based on the adjacency matrix of the node network; S2、构造节点网络的距离矩阵,并基于距离矩阵计算每个节点的局部密度ρ和最短距离δ;S2, construct the distance matrix of the node network, and calculate the local density ρ and the shortest distance δ of each node based on the distance matrix; S3、通过节点的局部密度ρ和最短距离δ自适应选取网络中的社区中心;S3, adaptively select the community center in the network through the local density ρ and the shortest distance δ of the node; S4、基于剩余节点的概率向量将剩余节点分配到至少一个社区中心对应的社区,根据分配后的社区集合确定重叠社区;S4, assigning the remaining nodes to communities corresponding to at least one community center based on the probability vectors of the remaining nodes, and determining overlapping communities according to the assigned community sets; 社区中心的选择方法具体包括如下步骤:The method for selecting a community center includes the following steps: 对所有节点的密度ρ和最短距离δ进行归一化处理,得到ρ*和δ*Normalize the density ρ and the shortest distance δ of all nodes to obtain ρ * and δ * ; 计算ρ*和δ*的乘积γ,再对γ进行升序排序,将升序排序后的γ定义为γ*,γ*对应于γ的索引定义为ind;Calculate the product γ of ρ * and δ * , then sort γ in ascending order, define the sorted γ in ascending order as γ * , and define the index of γ * corresponding to γ as ind; 针对γ*中的每个节点i,对在γ*中排列在节点i之前的节点进行线性拟合,进而计算节点i对应的估计值γ′;For each node i in γ * , perform linear fitting on the nodes that precede node i in γ * , and then calculate the estimated value γ′ corresponding to node i; 计算节点i的γ差值Δγ,Δγ=γ*-γ′,Calculate the γ difference Δγ of node i, Δγ=γ * -γ′, 将γ*中元素从后往前判断,将差值Δγ小于平均值的点的最小下标记为M,为所有节点差值Δγ的平均值,并将点下标大于M的节点添加到候选社区中心集;Judge the elements in γ * from the back to the front, and find the difference Δγ less than the average value The minimum mark of the point is M, is the average value of all node differences Δγ, and adds nodes with subscripts greater than M to the candidate community center set; 在候选社区中心集中删除不满足下列条件1的候选社区中心,形成最终的社区中心集;Delete the candidate community centers that do not meet the following condition 1 from the candidate community center set to form the final community center set; 条件1:局部密度和最短距离均分别大于与其距离最近的k个邻居节点的局部密度和最短距离的最大值。Condition 1: The local density and the shortest distance are both greater than the maximum values of the local density and the shortest distance of the k nearest neighboring nodes. 2.如权利要求1所述基于密度峰值的自适应重叠社区发现方法,其特征在于,节点i和节点j之间的边链接强度计算公式如下:2. The adaptive overlapping community discovery method based on density peak according to claim 1, wherein the edge link strength between node i and node j is calculated as follows: 其中,Aij为邻接矩阵中第i行第j列的元素值,wi、wj分别表示节点i的相接边权重总和、节点j的相接边权重总和,r为节点网络中所有节点相接边权重总和的最大权重wmax与最小值wmin的差值;Vij为节点i和节点j之间的公共点集合,η为常数。Among them, A ij is the element value of the i-th row and j-th column in the adjacency matrix, w i and w j represent the sum of the weights of the connected edges of node i and node j respectively, r is the difference between the maximum weight w max and the minimum value w min of the sum of the weights of the connected edges of all nodes in the node network; Vij is the set of common points between nodes i and j, and η is a constant. 3.如权利要求1所述基于密度峰值的自适应重叠社区发现方法,其特征在于,节点i和节点j之间点链接强度计算公式如下:3. The adaptive overlapping community discovery method based on density peak according to claim 1, characterized in that the point link strength calculation formula between node i and node j is as follows: 其中,vstij是节点i和节点j之间的点链接强度值;Ni +为节点i和其邻居节点的集合,Nj +为节点j和其邻居节点的集合,Vij为节点i和节点j之间的公共点集合,λ为非负实数。Among them, vst ij is the point link strength value between node i and node j; Ni + is the set of node i and its neighbor nodes, Nj + is the set of node j and its neighbor nodes, Vij is the set of common points between node i and node j, and λ is a non-negative real number. 4.如权利要求1所述基于密度峰值的自适应重叠社区发现方法,其特征在于,计算节点i和节点j之间的距离distij,构成距离矩阵中第i行第j列的元素值,distij距离计算公式为:4. The adaptive overlapping community discovery method based on density peak value as claimed in claim 1, characterized in that the distance dist ij between node i and node j is calculated to form the element value of the i-th row and j-th column in the distance matrix, and the dist ij distance calculation formula is: 其中,lstij为节点i和节点j之间的边链接强度;vstij为点i和点j之间的点链接强度,ε为常数。Among them, lst ij is the edge link strength between node i and node j; vst ij is the point link strength between point i and point j, and ε is a constant. 5.如权利要求1所述基于密度峰值的自适应重叠社区发现方法,其特征在于,剩余节点的分配方法具体如下:5. The method for discovering adaptive overlapping communities based on density peaks according to claim 1, wherein the method for allocating the remaining nodes is as follows: 计算每个剩余节点的概率向量,即剩余节点属于每个社区中心的概率;Calculate the probability vector of each remaining node, that is, the probability that the remaining node belongs to each community center; 将剩余节点分配到概率最大的社区中;Assign the remaining nodes to the community with the highest probability; 计算剩余节点隶属于其他社区的概率值与最大概率值之间的除值,将该剩余节点分配到大于给定阈值的除值所对应的社区。Calculate the division between the probability value of the remaining nodes belonging to other communities and the maximum probability value, and assign the remaining nodes to the community corresponding to the division value greater than the given threshold. 6.如权利要求5所述基于密度峰值的自适应重叠社区发现方法,其特征在于,剩余节点i的概率向量pi计算公式具体如下:6. The adaptive overlapping community discovery method based on density peak value as claimed in claim 5, characterized in that the probability vector pi of the remaining node i is calculated by the following formula: 其中,cci表示局部密度高于节点i且与节点i距离最近的k个节点的集合,pj表示cci中的节点j的概率向量,wij表示节点j相对于节点i的概率权重,概率权重wij的公式如下所示:Among them, cc i represents the set of k nodes with higher local density than node i and closest to node i, p j represents the probability vector of node j in cc i , w ij represents the probability weight of node j relative to node i, and the formula of probability weight w ij is as follows: distij表示节点i和节点j之间的距离,distim节点i和节点m之间的距离。dist ij represents the distance between node i and node j, and dist im represents the distance between node i and node m. 7.一种基于密度峰值的自适应重叠社区发现系统,其特征在于,所述系统包括:7. An adaptive overlapping community discovery system based on density peak, characterized in that the system comprises: 社交话题网络建模模块,通过主题词过滤筛选出关键节点词汇,对关键节点词汇间的相关度进行计算,构建节点网络;The social topic network modeling module selects key node words through keyword filtering, calculates the correlation between key node words, and constructs a node network; 边链接强度计算模块,基于节点网络的邻接矩阵计算各节点之间的边链接强度;The edge link strength calculation module calculates the edge link strength between nodes based on the adjacency matrix of the node network; 点链接强度计算模块,基于节点网络的邻接矩阵计算各节点之间的点链接强度;Point link strength calculation module, which calculates the point link strength between nodes based on the adjacency matrix of the node network; 距离矩阵计算模块,基于各节点的边链接强度及点链接强度计算各节点间的距离,即构建距离矩阵;The distance matrix calculation module calculates the distance between each node based on the edge link strength and point link strength of each node, that is, constructs a distance matrix; 密度峰值计算模块,基于距离矩阵计算每个节点的局部密度ρ和最短距离δ;Density peak calculation module, which calculates the local density ρ and shortest distance δ of each node based on the distance matrix; 社区中心生成模块,通过节点的局部密度ρ和最短距离δ自适应选取网络中的社区中心;The community center generation module adaptively selects the community center in the network through the local density ρ and the shortest distance δ of the nodes; 节点分配模块,基于剩余节点的概率向量将剩余节点分配到至少一个社区中心对应的社区;A node allocation module allocates the remaining nodes to a community corresponding to at least one community center based on the probability vectors of the remaining nodes; 重叠社区发现模块,根据分配后的社区集合确定重叠社区;Overlapping community discovery module, which determines overlapping communities based on the assigned community sets; 社区中心生成模块中的社区中心的选择方法具体如下:The method for selecting a community center in the community center generation module is as follows: 对所有节点的密度ρ和最短距离δ进行归一化处理,得到ρ*和δ*Normalize the density ρ and the shortest distance δ of all nodes to obtain ρ * and δ * ; 计算ρ*和δ*的乘积γ,再对γ进行升序排序,将升序排序后的γ定义为γ*,γ*对应于γ的索引定义为ind;Calculate the product γ of ρ * and δ * , then sort γ in ascending order, define the sorted γ in ascending order as γ * , and define the index of γ * corresponding to γ as ind; 针对γ*中的每个节点i,对在γ*中排列在节点i之前的节点进行线性拟合,进而计算节点i对应的估计值γ′;For each node i in γ * , perform linear fitting on the nodes that precede node i in γ * , and then calculate the estimated value γ′ corresponding to node i; 计算节点i的γ差值Δγ,Δγ=γ*-γ′,Calculate the γ difference Δγ of node i, Δγ=γ * -γ′, 将γ*中元素从后往前判断,将差值Δγ小于平均值的点的最小下标记为M,为所有节点差值Δγ的平均值,并将点下标大于M的节点添加到候选社区中心集;Judge the elements in γ * from the back to the front, and find the difference Δγ less than the average value The minimum mark of the point is M, is the average value of all node differences Δγ, and adds nodes with subscripts greater than M to the candidate community center set; 在候选社区中心集中删除不满足下列条件1的候选社区中心,形成最终的社区中心集;Delete the candidate community centers that do not meet the following condition 1 from the candidate community center set to form the final community center set; 条件1:局部密度和最短距离均分别大于与其距离最近的k个邻居节点的局部密度和最短距离的最大值。Condition 1: The local density and the shortest distance are both greater than the maximum values of the local density and the shortest distance of the k nearest neighboring nodes. 8.如权利要求7所述基于密度峰值的自适应重叠社区发现系统,其特征在于,边链接强度计算模块基于如下公式计算各节点之间的边链接强度:8. The adaptive overlapping community discovery system based on density peak value according to claim 7, characterized in that the edge link strength calculation module calculates the edge link strength between each node based on the following formula: 其中,Aij为邻接矩阵中第i行第j列的元素值,wi、wj分别表示节点i的相接边权重总和、节点j的相接边权重总和,r为节点网络中所有节点相接边权重总和的最大权重wmax与最小值wmin的差值;Vij为节点i和节点j之间的公共点集合,η为常数。Among them, A ij is the element value of the i-th row and j-th column in the adjacency matrix, w i and w j represent the sum of the weights of the connected edges of node i and node j respectively, r is the difference between the maximum weight w max and the minimum value w min of the sum of the weights of the connected edges of all nodes in the node network; Vij is the set of common points between nodes i and j, and η is a constant. 9.如权利要求7所述基于密度峰值的自适应重叠社区发现系统,其特征在于,点链接强度计算模块基于如下公式计算各节点之间的点链接强度:9. The adaptive overlapping community discovery system based on density peak value according to claim 7, characterized in that the point link strength calculation module calculates the point link strength between each node based on the following formula: 其中,vstij是节点i和节点j之间的点链接强度值;Ni +为节点i和其邻居节点的集合,Nj +为节点j和其邻居节点的集合,Vij为节点i和节点j之间的公共点集合,λ为非负实数。Among them, vst ij is the point link strength value between node i and node j; Ni + is the set of node i and its neighbor nodes, Nj + is the set of node j and its neighbor nodes, Vij is the set of common points between node i and node j, and λ is a non-negative real number.
CN202111246835.1A 2021-10-26 2021-10-26 An adaptive overlapping community discovery method and system based on density peak Active CN113987342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111246835.1A CN113987342B (en) 2021-10-26 2021-10-26 An adaptive overlapping community discovery method and system based on density peak

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111246835.1A CN113987342B (en) 2021-10-26 2021-10-26 An adaptive overlapping community discovery method and system based on density peak

Publications (2)

Publication Number Publication Date
CN113987342A CN113987342A (en) 2022-01-28
CN113987342B true CN113987342B (en) 2024-12-13

Family

ID=79741481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111246835.1A Active CN113987342B (en) 2021-10-26 2021-10-26 An adaptive overlapping community discovery method and system based on density peak

Country Status (1)

Country Link
CN (1) CN113987342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271800A (en) * 2022-07-19 2022-11-01 安徽师范大学 An adaptive urban nucleic acid detection point layout method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280472A (en) * 2018-01-18 2018-07-13 安徽师范大学 A kind of density peak clustering method optimized based on local density and cluster centre
CN110472677A (en) * 2019-08-06 2019-11-19 长沙理工大学 A kind of density peaks clustering method based on natural arest neighbors and shortest path

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280472A (en) * 2018-01-18 2018-07-13 安徽师范大学 A kind of density peak clustering method optimized based on local density and cluster centre
CN110472677A (en) * 2019-08-06 2019-11-19 长沙理工大学 A kind of density peaks clustering method based on natural arest neighbors and shortest path

Also Published As

Publication number Publication date
CN113987342A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
Cheng et al. A local information based multi-objective evolutionary algorithm for community detection in complex networks
CN106960390A (en) Overlapping community division method based on convergence degree
CN112311608B (en) Multilayer heterogeneous network space node characterization method
CN111626321B (en) Image data clustering method and device
Bello-Orgaz et al. Evolutionary clustering algorithm for community detection using graph-based information
CN113228059A (en) Cross-network-oriented representation learning algorithm
CN112767186A (en) Social network link prediction method based on 7-subgraph topological structure
CN116226467B (en) Community discovery method of graph convolution neural network based on node structural features
CN114511905B (en) A face clustering method based on graph convolutional neural network
CN115734274A (en) Cellular network fault diagnosis method based on deep learning and knowledge graph
CN113987342B (en) An adaptive overlapping community discovery method and system based on density peak
Dhumal et al. Survey on community detection in online social networks
Zhou et al. A Local Perspective-based Model for Overlapping Community Detection
Peng et al. Classifying multiclass relationships between ASes using graph convolutional network
CN102004801A (en) Information classification method
Shaydulin et al. Aggregative coarsening for multilevel hypergraph partitioning
CN116501937B (en) Overlapping community recommendation method and system based on dynamic evolution
CN104463704A (en) Reduction method and system for reliability evaluation indexes of power communication network
Machuca et al. Cluster-based LSTM models to improve Dengue cases forecast
CN116485232A (en) Water quality assessment method and system under unbalanced sample
CN115965805A (en) Multichannel graph pooling method based on semantic information
Tao et al. Structural identity representation learning of blockchain transaction network for metaverse
Guzmán-Ponce et al. Weighted complete graphs for condensing data
Pati et al. Constructing minimal spanning tree based on rough set theory for gene selection
CN113486916A (en) Clustering system and method based on boundary ring shrinkage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant