CN115049002A

CN115049002A - Complex network influence node identification method based on reverse generation network

Info

Publication number: CN115049002A
Application number: CN202210681512.3A
Authority: CN
Inventors: 刘小洋; 叶舒
Original assignee: Chongqing University of Technology
Current assignee: Chongqing University of Technology
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2022-09-13

Abstract

The invention provides a complex network influence node identification method based on a reverse generation network, which comprises the following steps: s1, community division: carrying out community division on the network by using a Louvain algorithm; s2, generating a candidate node set: node information is collected by using graph traversal so as to assist in constructing an improved reverse generation network, and then a part of nodes are selected to be added into a candidate node set; s3, selecting a seed node: and finding out the final k influential nodes from the candidate node set as seed nodes. The method can select key nodes in the network or nodes for maintaining the stability of the network as candidate nodes, the seed nodes selected by the method have higher infection speed and larger infection scale, and the seed node sets are dispersed on most networks.

Description

Influence node identification method of complex network based on reverse generative network

技术领域technical field

本发明涉及影响力节点识别技术领域，特别是涉及一种基于反向生成网络的复杂网络影响力节点识别方法。The present invention relates to the technical field of influence node identification, in particular to a complex network influence node identification method based on reverse generation network.

背景技术Background technique

网络科学的繁荣引发了复杂网络中识别一组有影响力节点的新潮，现实生活中的许多领域包括生物学、物理学、社会学、工程学等，都可以用复杂网络来表示。网络中的的关键节点或是影响力节点可用于维持网络拓扑结构的稳定性、决定网络中如谣言等信息传递的效率，这些节点直接决定了网络是否能够正常运转。通常将这些节点称之为影响力节点。而识别网络中一组有影响力的节点就是经典的影响力最大化问题。加强对影响力节点的控制，为加快信息传播提供新机遇，这对病毒式营销、识别药物靶标和必要蛋白质以及谣言控制等都具有十分重要的意义，例如在口碑营销中，选取最具影响力的用户进行推广以保证花费最少预算达到最大的传播效果，又或如在社会网络中检测舆情传播源头，可以控制舆情传播。随着大数据和5G时代的到来，网络的类型和规模也迅猛增长，这对旧有的节点影响力度量方法提出了新的挑战。The prosperity of network science has led to a new wave of identifying a set of influential nodes in complex networks. Many real-life fields, including biology, physics, sociology, engineering, etc., can be represented by complex networks. The key nodes or influential nodes in the network can be used to maintain the stability of the network topology and determine the efficiency of information transmission such as rumors in the network. These nodes directly determine whether the network can operate normally. These nodes are usually referred to as influence nodes. Identifying a group of influential nodes in the network is a classical influence maximization problem. Strengthening the control of influential nodes provides new opportunities for accelerating information dissemination, which is of great significance for viral marketing, identification of drug targets and essential proteins, and rumor control. For example, in word-of-mouth marketing, selecting the most influential Users can promote it to ensure that the least budget is spent to achieve the maximum communication effect, or, for example, by detecting the source of public opinion transmission in the social network, it can control the transmission of public opinion. With the advent of the era of big data and 5G, the types and scales of networks have also grown rapidly, which poses new challenges to the old method of measuring the influence of nodes.

基于社区发现的影响力节点识别方法在近几年引发了网络科学研究者的广泛关注。然而已有的方法只是简单地利用了社区结构这一特性，并且一些基于三段式的方法在选择候选节点集和种子节点时还存在一些不足。首先，现有的方法如随机游走、遗传算法或其他的一些启发式算法在选择候选节点集时需要遍历整个网络，并且选择的候选节点之间可能存在彼此聚集的情况。若从这些候选节点集中选择出来种子节点可能存在影响力范围重叠的问题。因此候选节点集的生成在整个种子节点选择过程中尤为重要。其次，在种子节点识别阶段时，大量算法利用贪婪算法进行精确选择，尽管在生成候选节点集中进行选择已大大提高了运行效率，但也是比较耗时的。Influence node identification methods based on community discovery have attracted extensive attention of network science researchers in recent years. However, the existing methods simply take advantage of the feature of community structure, and some methods based on three-stage method have some shortcomings in selecting candidate node sets and seed nodes. First, existing methods such as random walk, genetic algorithm or some other heuristic algorithms need to traverse the entire network when selecting the candidate node set, and the selected candidate nodes may gather with each other. If seed nodes are selected from these candidate node sets, there may be a problem of overlapping influence ranges. Therefore, the generation of candidate node sets is particularly important in the entire seed node selection process. Secondly, in the stage of seed node identification, a large number of algorithms use greedy algorithms for accurate selection. Although the selection in the generated candidate node set has greatly improved the operation efficiency, it is also time-consuming.

发明内容SUMMARY OF THE INVENTION

本发明旨在至少解决现有技术中存在的技术问题，特别创新地提出了一种基于反向生成网络的复杂网络影响力节点识别方法。The present invention aims to at least solve the technical problems existing in the prior art, and particularly innovatively proposes a complex network influence node identification method based on a reverse generation network.

为了实现本发明的上述目的，本发明提供了一种基于反向生成网络的复杂网络影响力节点识别方法，包括以下步骤：In order to realize the above-mentioned purpose of the present invention, the present invention provides a kind of complex network influence node identification method based on reverse generation network, comprises the following steps:

S1，社区划分：利用Louvain算法对网络进行社区划分，缩小种子节点的搜索空间；S1, community division: use the Louvain algorithm to divide the community into the network to reduce the search space of seed nodes;

S2，候选节点集生成：利用图遍历收集节点信息从而辅助构造改进的反向生成网络，在这种方法下构建网络时不必恢复原始网络即可选出高重要性节点；再选择一部分节点加入候选节点集；S2, candidate node set generation: use graph traversal to collect node information to assist in constructing an improved reverse generation network. In this method, high-importance nodes can be selected without restoring the original network when building the network; node set;

S3，选择种子节点：从候选节点集中找出最终的k个有影响力节点作为种子节点。S3, select seed nodes: find the final k influential nodes from the candidate node set as seed nodes.

进一步地，所述社区的规模至少具备以下条件：Further, the scale of the community at least meets the following conditions:

C_size＝size(G)*η (9)C_size=size(G)*η (9)

其中C_size表示社区大小；where C_size represents the community size;

size(G)表示网络G的节点数量；size(G) represents the number of nodes in network G;

η是一个可调参数，用于控制满足条件的社区大小。η is a tunable parameter that controls the size of the community that satisfies the condition.

进一步地，所述S2包括：Further, the S2 includes:

S2-1，计算剩余节点的加入代价，选择最小化代价函数的节点构建网络，代价函数为：S2-1, calculate the joining cost of the remaining nodes, and select the node that minimizes the cost function to build the network. The cost function is:

其中cost(u,n+1)表示节点u在(n+1)^th时间步的代价函数；where cost(u,n+1) represents the cost function of node u at the (n+1) ^th time step;

表示在(n+1)^th时间步，将节点u加入到网络G中的最大连通分量的大小，即

Represents the size of the largest connected component that adds node u to the network G at the (n+1) ^th time step, i.e.

AUC_u表示节点u的AUC值；AUC _u represents the AUC value of node u;

AUC_max表示所有节点中最大的AUC值；AUC _max represents the largest AUC value among all nodes;

AUC_min表示所有节点中最小的AUC值；AUC _min represents the smallest AUC value among all nodes;

ξ是一个足够小的正参数；ξ is a sufficiently small positive parameter;

S2-2，加入节点之后，需要更新网络中每个连通分量的大小，并记录网络中最大连通分量的大小；S2-2, after adding a node, it is necessary to update the size of each connected component in the network, and record the size of the largest connected component in the network;

S2-3，重复步骤S2-1～S2-2，直到剩余节点数量满足所需候选节点数量，与此同时停止构造网络。S2-3, repeat steps S2-1 to S2-2 until the number of remaining nodes meets the required number of candidate nodes, and at the same time stop constructing the network.

利用改进的反向生成网络来生成候选节点集有以下优点，首先，可以使得生成的候选节点集更加分散，节点少量聚集，并且都是网络中的关键节点。第二，选择的候选节点影响力范围重叠度低，在第三阶段也可以用启发式方法与贪婪方法平衡来选择种子节点。第三，这些候选节点保证了网络的健壮性，移除这些节点将很容易造成网络瓦解。第四，在生成候选节点时不用遍历整个网络，当未加入节点满足候选节点集数量时即可停止构建网络。Using the improved reverse generative network to generate candidate node sets has the following advantages. First, the generated candidate node sets can be more dispersed, with a small number of nodes aggregated, and all of them are key nodes in the network. Second, the influence range of the selected candidate nodes has a low degree of overlap. In the third stage, the heuristic method and the greedy method can also be used to balance the selection of seed nodes. Third, these candidate nodes ensure the robustness of the network, and removing these nodes will easily cause the network to collapse. Fourth, it is not necessary to traverse the entire network when generating candidate nodes, and the network construction can be stopped when the number of unjoined nodes meets the number of candidate node sets.

进一步地，所述S2还包括：Further, the S2 also includes:

为了进一步缩小搜索范围，并保证有质量合适的候选节点，在每个社区形成的独立网络中设置候选节点集的大小，其公式表示如下：In order to further narrow the search range and ensure that there are candidate nodes with suitable quality, the size of the candidate node set is set in the independent network formed by each community, and the formula is as follows:

其中cand_num_i表示第i个社区候选节点集的大小；where cand_num _i represents the size of the i-th community candidate node set;

(C_size_i-C_size_min)/(C_size_max-C_size_min)表示第i个社区在所有选择社区中的比例；(C_size _i -C_size _min )/(C_size _max -C_size _min ) represents the proportion of the i-th community in all selected communities;

β是一个放大参数；β is an amplification parameter;

k是最终需要选择的的种子节点数量。k is the final number of seed nodes to be selected.

进一步地，所述图遍历包括：Further, the graph traversal includes:

选择度中心性作为图遍历中的初始化中心分数，度中心性经过图遍历框架优化，可以产生更加细粒度的衡量节点影响力的AUC分数。The degree centrality is selected as the initial center score in graph traversal, and the degree centrality is optimized by the graph traversal framework to generate a more fine-grained AUC score that measures the influence of nodes.

进一步地，所述S3包括：Further, the S3 includes:

S3-1，通过度折扣算法在候选节点集中选择k₁个节点；S3-1, select k ₁ nodes in the candidate node set through the degree discount algorithm;

所述度折扣算法的计算公式为：The calculation formula of the degree discount algorithm is:

其中gdd_v表示节点v的度折扣；where gdd _v represents the degree discount of node v;

d_v表示节点v的度；d _v represents the degree of node v;

t_v表示节点v的感染邻居数量；t _v represents the number of infected neighbors of node v;

t_w表示节点v的易感邻居节点w的感染邻居数；t _w represents the number of infected neighbors of node w's susceptible neighbor node w;

p表示传染概率；p represents the infection probability;

S3-2，通过改进的子模性算法选择k₂个节点；S3-2, select k ₂ nodes through the improved submodularity algorithm;

k₂＝k-k₁ (13)k ₂ =kk ₁ (13)

其中k为最终需要选择的种子节点数量；Where k is the number of seed nodes that need to be selected in the end;

μ是一个可调参数以平衡贪心算法和启发式算法；μ is a tunable parameter to balance greedy and heuristic algorithms;

c表示网络中满足一定规模的社区的总数；c represents the total number of communities in the network that meet a certain size;

cand_num_i表示第i个社区候选节点集的大小；cand_num _i represents the size of the ith community candidate node set;

所述改进的子模性算法选择k₂个节点具体步骤为：若在每轮改进的子模性算法选择的过程中选择的节点u与之前选择的节点或启发式过程选择的节点相似，则不选择该节点，并将其从候选节点集中剔除。The specific steps for selecting k ₂ nodes by the improved submodularity algorithm are as follows: if the node u selected in each round of the improved submodularity algorithm selection process is similar to the node selected before or the node selected by the heuristic process, then The node is not selected and removed from the candidate node set.

改进的子模性算法相较于原来的子模性算法，进一步考虑了节点的位置信息以及节点之间的结构相似性。Compared with the original submodularity algorithm, the improved submodularity algorithm further considers the position information of nodes and the structural similarity between nodes.

进一步地，所述相似是通过下式来判断节点是否相似的，若满足下式，则节点u和节点v相似；Further, the similarity is to judge whether the nodes are similar by the following formula, if the following formula is satisfied, then the node u and the node v are similar;

其中sim表示节点u和节点v的相似性；where sim represents the similarity between node u and node v;

N(u)表示节点u的邻居节点集；N(u) represents the set of neighbor nodes of node u;

N(v)表示节点v的邻居节点集；N(v) represents the set of neighbor nodes of node v;

N(u)∩N(v)表示N(u)和N(v)共有的邻居数；N(u)∩N(v) represents the number of neighbors shared by N(u) and N(v);

|·|表示集合的数量；|·| indicates the number of sets;

ε为一个趋近于0的参数；ε is a parameter approaching 0;

abs(·)表示取绝对值；abs( ) means to take the absolute value;

ks(u)为节点u的归一化的k-壳指标；ks(u) is the normalized k-shell index of node u;

ks(v)为节点v的归一化的k-壳指标。ks(v) is the normalized k-shell index of node v.

进一步地，还包括通过以下性能指标对所述方法进行评价：鲁棒性值、累积分布函数、SIR模型以及平均最短路径长度。通过这四个性能指标能全面地判断本方法是否合理。Further, it also includes evaluating the method by the following performance indicators: robustness value, cumulative distribution function, SIR model, and average shortest path length. The four performance indicators can comprehensively judge whether the method is reasonable.

综上所述，由于采用了上述技术方案，本发明的优点如下：To sum up, due to the adoption of the above technical solutions, the advantages of the present invention are as follows:

(1)利用影响得分辅助构造反向生成网络，可以使生成的网络不必恢复到原始网络，未加入到网络中的节点直接加入候选节点集，大大减小了计算时间。(1) Using the influence score to assist in constructing the reverse generation network, the generated network does not have to be restored to the original network, and the nodes not added to the network are directly added to the candidate node set, which greatly reduces the computing time.

(2)考虑到节点与节点之间的共有邻居及位置关系会使选择出的节点集影响范围重叠，因此提出了改进的CELF算法。(2) Considering that the common neighbors and positional relationship between nodes will overlap the influence range of the selected node set, an improved CELF algorithm is proposed.

(3)考虑了网络的社区结构，利用社区对算法进行加速，并结合了网络的连通性、图遍历以及启发式算法和贪心算法的优点。(3) Considering the community structure of the network, using the community to accelerate the algorithm, and combining the network connectivity, graph traversal, and the advantages of heuristic algorithm and greedy algorithm.

由此，本发明更能选择出网络中的关键节点或维持网络稳定性的节点作为候选节点，且提出的方法选择的种子节点感染速度更快，感染规模更大，并且种子节点集之间在大部分网络上都是分散的。Therefore, the present invention can better select key nodes in the network or nodes that maintain network stability as candidate nodes, and the seed node selected by the proposed method has a faster infection speed and a larger infection scale, and the seed node set is between the Most of the network is decentralized.

本发明的附加方面和优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本发明的实践了解到。Additional aspects and advantages of the present invention will, in part, be set forth in the description that follows, and in part will become apparent from the description, or will be learned by practice of the invention.

附图说明Description of drawings

本发明的上述和/或附加的方面和优点从结合下面附图对实施例的描述中将变得明显和容易理解，其中：The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments in conjunction with the accompanying drawings, wherein:

图1是本发明鲁棒性值R曲线图。FIG. 1 is a graph of the robustness value R of the present invention.

图2是本发明图遍历中心性示意图。Fig. 2 is a schematic diagram of graph traversal centrality of the present invention.

图3是本发明CBGN框架的整体示意图。FIG. 3 is an overall schematic diagram of the CBGN framework of the present invention.

图4是本发明的一个玩具网络示意图。Figure 4 is a schematic diagram of a toy network of the present invention.

图5是本发明6个实验网络的度分布图。Fig. 5 is a degree distribution diagram of six experimental networks of the present invention.

图6是本发明6个真实网络在不同方法下的R曲线图。Fig. 6 is the R curve diagram of 6 real networks of the present invention under different methods.

图7是本发明2个真实网络社区中的R曲线.Fig. 7 is the R curve in two real network communities of the present invention.

图8是本发明Degree和TARank_degree在3个网络上的CDF分布图。FIG. 8 is a CDF distribution diagram of Degree and TARank_degree of the present invention on three networks.

图9是本发明不同算法选择的种子节点在SIR模型下的传播影响示意图。FIG. 9 is a schematic diagram of the propagation influence of seed nodes selected by different algorithms of the present invention under the SIR model.

图10是本发明不同感染率下的传播规模示意图。FIG. 10 is a schematic diagram of the transmission scale of the present invention under different infection rates.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，仅用于解释本发明，而不能理解为对本发明的限制。The following describes in detail the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary, only used to explain the present invention, and should not be construed as a limitation of the present invention.

本发明提供了一种基于反向生成网络的复杂网络影响力节点识别方法，其具体实施例如下：The present invention provides a complex network influence node identification method based on reverse generation network, and its specific embodiments are as follows:

S1，对舆情网络进行社区划分：利用Louvain算法对舆情网络进行社区划分，缩小舆情种子节点的搜索空间；S1, divide the public opinion network into a community: use the Louvain algorithm to divide the public opinion network into a community to reduce the search space of the public opinion seed nodes;

S2，舆情候选节点集生成：利用图遍历收集舆情节点信息从而辅助构造改进的反向生成网络，在这种方法下构建网络时不必恢复原始网络即可选出高重要性节点；再选择一部分节点加入舆情候选节点集；S2, public opinion candidate node set generation: use graph traversal to collect public opinion node information to assist in constructing an improved reverse generation network. In this method, high-importance nodes can be selected without restoring the original network when constructing the network; then select some nodes Join the public opinion candidate node set;

S3，选择舆情种子节点：从舆情候选节点集中找出最终的k个有影响力节点作为舆情种子节点。S3, select the public opinion seed nodes: find the final k influential nodes from the public opinion candidate node set as the public opinion seed nodes.

只需对筛选出的舆情种子节点对应的用户进行言论控制，就可以快速地控制舆情传播。It is only necessary to control the speech of the users corresponding to the selected public opinion seed nodes, and the spread of public opinion can be quickly controlled.

1.相关内容介绍1. Introduction of related content

1.1背景1.1 Background

影响力最大化问题最早被DomingosandRichardson研究，之后2003年Kempe等人将该问题形式化为离散最优化问题，并被证明为NP难问题。为高效获取影响力节点，从不同的角度去衡量节点，大量的指标接连提出，suchasDegree，ClusterRank，K-shell，H-index，NeighborhoodCoreness， PageRank，DegreeDiscount等。单纯利用高度中心性选取一组节点往往会造成影响力范围的重叠，而K-shell、H-index等直接选择节点也存在求解精度低的问题。PageRank算法同时考虑了节点的数目和质量，但当有悬挂页面出现时，会发生规范泄露。ClusterRank同时考虑了节点的直接邻居和其聚类系数，同样由于考虑网络局部信息而缺乏性能保证。Bae等人设想当一个节点有更多位于网络核心位置的邻居节点时，传播范围更广，提出了邻域核数中心性，该方法平衡了节点的度和位置关系，有效改善了K-shell方法的简并性。还有很多结合网络拓扑结构以启发式地选择节点，Beni 等人提出了一种IMT算法从图的密集部分选择种子节点，以便在最短的距离内访问更多的节点。Zhang等人考虑局部邻居信息和全局社区信息，设计出一种基于局部-全局影响力指标的约束进化算法(IICEA)，有效地解决了预算影响约束的问题。Mao等人提出了一种拓扑潜在方案，用于预测大规模在线社交网络中有影响力节点。基于网络拓扑结构的中心性虽然能在较短时间内寻找影响力节点，但这类算法往往稳定性不足，求解质量不高。The influence maximization problem was first studied by Domingos and Richardson, and in 2003 Kempe et al. formalized the problem as a discrete optimization problem, which was proved to be NP-hard. In order to efficiently obtain influential nodes and measure nodes from different perspectives, a large number of indicators have been proposed one after another, such asDegree, ClusterRank, K-shell, H-index, NeighborhoodCoreness, PageRank, DegreeDiscount, etc. Simply using height centrality to select a group of nodes often results in overlapping influence ranges, and direct selection of nodes such as K-shell and H-index also has the problem of low solution accuracy. The PageRank algorithm considers both the number and quality of nodes, but specification leakage occurs when dangling pages appear. ClusterRank also considers the direct neighbors of nodes and their clustering coefficients, and also lacks performance guarantees due to the consideration of local network information. Bae et al. envisaged that when a node has more neighbor nodes located at the core of the network, the propagation range is wider, and proposed the neighborhood kernel centrality, which balances the degree and position relationship of the node and effectively improves the K-shell Degeneracy of methods. There are also many combinations of network topology to heuristically select nodes, Beni et al. proposed an IMT algorithm to select seed nodes from dense parts of the graph in order to visit more nodes within the shortest distance. Considering local neighbor information and global community information, Zhang et al. designed a Constrained Evolutionary Algorithm (IICEA) based on the local-global influence index, which effectively solved the problem of budget influence constraints. Mao et al. proposed a topological latent scheme for predicting influential nodes in large-scale online social networks. Although based on the centrality of the network topology, it is possible to find influential nodes in a relatively short time, but such algorithms are often not stable enough and the solution quality is not high.

在最近的几十年里，一些经典的贪心算法相继被提出，如依次选择边际增益最大的节点的 greedy算法，以及在该算法上进行改进的CELF，CELF++等。尽管这些算法的影响扩散近似于因子(1-1/e-ε)内的最优解，ε为一个趋近于0的很小的参数，e为自然底数，但这些算法需要上万次的蒙特卡罗模拟，造成了很长的计算时间，使其难以扩展到大规模网络上。为解决这一现象，大量的网络科学研究者又提出了针对特定领域的启发式算法，牺牲一定的准确率大大减小了计算复杂度。近年来，一个新的指标---鲁棒性值(Robustnessvalue)被广泛用来衡量节点的重要性。鲁棒性起源于著名的渗透理论，即当移除网络中一部分节点时，会造成网络的崩溃。鲁棒性值用于衡量网络的连通性，更小的鲁棒性值意味着算法更好的性能。相比于传统的方法需要对全部节点进行排序，而鲁棒性仅考虑所有节点在网络中的重要性。另外，社区也是网络科学中一个重要的结构，相比于不同的社区，节点在相同社区可能联系更频繁。大量的基于社区的影响力最大化算法被提出，证明了社区划分的有效性。In recent decades, some classic greedy algorithms have been proposed, such as the greedy algorithm that sequentially selects the node with the largest marginal gain, and the improved CELF and CELF++ algorithms. Although the influence diffusion of these algorithms approximates the optimal solution within the factor (1-1/e-ε), where ε is a small parameter approaching 0, and e is the natural base, these algorithms require tens of thousands of iterations. Monte Carlo simulations incur long computation times, making it difficult to scale to large networks. In order to solve this phenomenon, a large number of network science researchers have proposed heuristic algorithms for specific fields, sacrificing a certain accuracy rate to greatly reduce the computational complexity. In recent years, a new indicator, Robustnessvalue, has been widely used to measure the importance of nodes. Robustness originates from the well-known penetration theory, which states that when a part of the network is removed, the network collapses. The robustness value is used to measure the connectivity of the network, and a smaller robustness value means better performance of the algorithm. Compared with traditional methods, all nodes need to be sorted, and robustness only considers the importance of all nodes in the network. In addition, community is also an important structure in network science. Compared with different communities, nodes in the same community may be connected more frequently. A large number of community-based influence maximization algorithms have been proposed, proving the effectiveness of community division.

基于以上讨论，在本发明专利中，提出了一种新的基于社区的反向生成网络(Community-basedBackwardGeneratingNetworks，CBGN)方法去识别复杂网络中的一组有影响力节点，其中包括三个步骤：(1)社区划分；(2)候选集生成；(3)选择种子节点。首先，利用Louvain算法对网络进行社区划分，缩小种子节点的搜索空间。在第二步，利用图遍历去收集节点的信息，利用这些节点信息不同，辅助构造反向生成网络，选择一部分节点加入候选节点集；最后，利用启发式算法和改进的经典贪婪算法从候选节点中找出最终的top-k有影响力节点。Based on the above discussion, in the patent of the present invention, a new community-based Backward Generating Networks (CBGN) method is proposed to identify a group of influential nodes in a complex network, which includes three steps: (1) community division; (2) candidate set generation; (3) selection of seed nodes. First, the Louvain algorithm is used to divide the network into a community to reduce the search space of seed nodes. In the second step, use graph traversal to collect the information of nodes, use the different information of these nodes to assist in constructing a reverse generation network, and select a part of the nodes to join the candidate node set; finally, use the heuristic algorithm and the improved classical greedy algorithm from the candidate nodes. to find the final top-k influential nodes.

1.2研究动机1.2 Research motivation

基于社区发现的影响力节点识别方法在近几年引发了网络科学研究者的广泛关注。然而已有的方法只是简单地利用了社区结构这一特性，并且一些基于三段式的方法在选择候选节点集和种子节点时还存在一些不足。首先，现有的方法如随机游走、遗传算法或其他的一些启发式算法在选择候选节点集时需要遍历整个网络，并且选择的候选节点之间可能存在彼此聚集的情况。若从这些候选节点集中选择出来种子节点可能存在影响力范围重叠的问题。因此候选节点集的生成在整个种子节点选择过程中尤为重要。其次，在种子节点识别阶段时，大量算法利用贪婪算法进行精确选择，尽管在生成候选节点集中进行选择已大大提高了运行效率，但也是比较耗时的。因此希望在第二阶段选择出的候选节点集能对第三阶段有所帮助，并且不仅仅是缩小搜索范围的作用。Influence node identification methods based on community discovery have attracted extensive attention of network science researchers in recent years. However, the existing methods simply take advantage of the feature of community structure, and some methods based on three-stage method have some shortcomings in selecting candidate node sets and seed nodes. First, existing methods such as random walk, genetic algorithm or some other heuristic algorithms need to traverse the entire network when selecting the candidate node set, and the selected candidate nodes may gather with each other. If seed nodes are selected from these candidate node sets, there may be a problem of overlapping influence ranges. Therefore, the generation of candidate node sets is particularly important in the entire seed node selection process. Secondly, in the stage of seed node identification, a large number of algorithms use greedy algorithms for accurate selection. Although the selection in the generated candidate node set has greatly improved the operation efficiency, it is also time-consuming. Therefore, it is hoped that the candidate node set selected in the second stage can help the third stage, and not only narrow the search scope.

2.相关概念2. Related concepts

让G＝(V,E)表示一个复杂网络图，V＝[v₁,v₂,...,v_n]表示一组节点集，E＝[e₁,e₂,...,e_m]表示一组边集。节点表示复杂网络中的个体，边表示网络中个体之间的关系。在本发明专利中，只关注无向和无权的简单网络，并且不允许存在自环。Let G=(V,E) denote a complex network graph, V=[v ₁ ,v ₂ ,...,v _n ] denote a set of nodes, E=[e ₁ ,e ₂ ,...,e _m ] represents a set of edge sets. Nodes represent individuals in a complex network, and edges represent relationships between individuals in the network. In the patent of the present invention, only the undirected and unauthorized simple network is concerned, and the existence of self-loop is not allowed.

定义1，影响力节点top-k：Top-k有影响力的节点定义为在特定场合中具有显著影响力的节点，并且数量被指定为k。Definition 1, Influential Nodes top-k: Top-k influential nodes are defined as nodes with significant influence in a specific occasion, and the number is designated as k.

定义2，影响力最大化问题：影响力最大化问题是指去寻找一组影响力节点，使得以该k个节点为源节点S，能够使得影响力在指定传播模型上传播时达到最大化，这可以表示为Definition 2: Influence maximization problem: Influence maximization problem refers to finding a group of influential nodes, so that the k nodes are used as source nodes S to maximize the influence when spreading on the specified propagation model, This can be expressed as

Influ(S)＝arg max_{S∈V,|S|＝k}σ(S) (1)Influ(S)=arg max _S∈V,|S|=k σ(S) (1)

其中Influ(S)表示源节点集S的影响力，σ(·)表示信息扩散函数，σ(S)表示以种子集S作为节点集合，在指定模型上传播后能够影响的预期节点数量。Among them, Influ(S) represents the influence of the source node set S, σ(·) represents the information diffusion function, and σ(S) represents the expected number of nodes that can be affected by the seed set S as the node set after spreading on the specified model.

定义3，鲁棒性值：鲁棒性值可以用来评估排序算法的性能。给定一个网络，在每个时间步移除一个节点并计算剩余网络中最大连通分量的大小直至网络为空。其网络鲁棒性值R定义为Definition 3, Robustness value: The robustness value can be used to evaluate the performance of the sorting algorithm. Given a network, remove a node at each time step and compute the size of the largest connected component in the remaining network until the network is empty. Its network robustness value R is defined as

其中n是G中节点的数目，σ_gcc(G)表示没有移除任何一个节点时网络的最大连通分量的大小， σ_gcc(G\{v₁,v₂,...,v_n})表示从网络中顺序移除集合K＝{v₁,v₂,...,v_k}中的节点后剩余网络中的巨大连通分量的大小。鲁棒性值R可以视为R曲线下的面积，如图1所示，水平轴为k/n，垂直轴为 σ_gcc(G\{v₁,v₂,...,v_n})/σ_gcc(G)；图1的x轴表示移除节点的比例，y轴表示移除节点后剩余网络最大连通分量的大小。where n is the number of nodes in G, σ _gcc (G) represents the size of the largest connected component of the network when no node is removed, σ _gcc (G\{v ₁ ,v ₂ ,...,v _n }) Represents the size of the huge connected components in the remaining network after sequentially removing nodes in the set K={v ₁ , v ₂ ,...,v _k } from the network. The robustness value R can be regarded as the area under the R curve, as shown in Figure 1, with k/n on the horizontal axis and σ _gcc on the vertical axis (G\{v ₁ ,v ₂ ,...,v _n }) /σ _gcc (G); The x-axis of Figure 1 represents the proportion of removed nodes, and the y-axis represents the size of the largest connected component of the remaining network after removing nodes.

在所有n个节点的网络中，最脆弱的网络为星型网络，最健壮的为全连接网络，在星型网络中，

在完全网络中，

因此，鲁棒性值的范围为

更小的鲁棒性值意味着算法更小的性能。Among all n-node networks, the most vulnerable network is the star network, and the most robust is the fully connected network. In the star network,

In a complete network,

Therefore, the range of robustness values is

A smaller robustness value means a smaller performance of the algorithm.

在这一节，将介绍如下3个方面：社区检测，反向生成网络，图遍历方法。In this section, the following three aspects will be introduced: community detection, reverse generative network, and graph traversal methods.

2.1.社区检测2.1. Community detection

通常认为社区是一组相互联系紧密的节点，关注内容相似或拥有相同兴趣爱好的用户。社区内部所有节点紧密聚集，而不同社区之间只是稀疏连接。研究网络中的社区结构检测问题，对于发现社交网络中的群组结构、理解网络群组结构对信息传播的影响、识别复杂网络中的关键节点等都具有重要意义。考虑现实世界网络特性，通过合理地利用社区结构，可以使算法复杂度大大减少。A community is usually thought of as a group of closely interconnected nodes that focus on users with similar content or with the same interests. All nodes within a community are closely clustered, while different communities are only sparsely connected. Research on community structure detection in networks is of great significance for discovering group structure in social networks, understanding the impact of network group structure on information dissemination, and identifying key nodes in complex networks. Considering the real-world network characteristics, the algorithm complexity can be greatly reduced by rationally utilizing the community structure.

近十年大量杰出的社区发现算法如FPMQA，BiLPA，CommunityGAN，SEAL等被提出，但 Newman算法和Louvain毋庸置疑仍是被广泛利用的社区检测算法。在本发明专利中，选择Louvain 算法对网络进行聚类，该算法是基于模块度的社区发现算法，其优化目标是最大化整个社区网络的模块度。网络模块度定义为In the past decade, a large number of outstanding community detection algorithms such as FPMQA, BiLPA, CommunityGAN, SEAL, etc. have been proposed, but Newman algorithm and Louvain are undoubtedly still widely used community detection algorithms. In the patent of the present invention, the Louvain algorithm is selected to cluster the network. This algorithm is a community discovery algorithm based on modularity, and its optimization goal is to maximize the modularity of the entire community network. The network modularity is defined as

其中Q表示网络模块度；where Q is the network modularity;

k_i表示节点i相连的边的权重之和；k _i represents the sum of the weights of the edges connected to node i;

k_j表示节点j相连的边的权重之和；k _j represents the sum of the weights of the edges connected to node j;

A_ij表示节点i、j之间边的权重；A _ij represents the weight of the edge between nodes i and j;

m表示所有边的权重之和；m represents the sum of the weights of all edges;

其中A表示网络邻接矩阵，代表了节点之间边的权重，网络不是带权图时，所有边的权重视为 1；k_i＝∑_jA_ij表示所有与节点i相连的边的权重之和，

表示所有边的权重之和。Louvain 算法主要包括Modularity Optimization和Community Aggregation两个阶段。前一阶段主要是将每个节点划分到与其邻接的节点所在的社区中，以使模块度不断增大；后一阶段主要是将第一阶段划分出来的社区聚合成一个点，再根据上一步的社区结构重新构造网络。阶段一需要计算模块度增益，其计算公式为：where A represents the network adjacency matrix, which represents the weight of edges between nodes. When the network is not a weighted graph, the weight of all edges is 1; k _i =∑ _j A _ij represents the sum of the weights of all edges connected to node i ,

represents the sum of the weights of all edges. The Louvain algorithm mainly includes two stages: Modularity Optimization and Community Aggregation. The former stage is mainly to divide each node into the community where its adjacent nodes are located, so as to increase the modularity; community structure to reconstruct the network. The first stage needs to calculate the modularity gain, and its calculation formula is:

上式可以简化为：The above formula can be simplified to:

其中∑_in表示社区C内部所有边的权值之和，k_i,in表示从节点i指向社区C的边的权值之和。 ∑_tot表示指向社区C中节点的边的权值和。Louvain算法可以获得网络更加自然的社区，并且也是相当快速的算法，因此利用该算法对网络进行社区划分来加速所提出的算法。where ∑ _in represents the sum of the weights of all edges within the community C, and k _i,in represents the sum of the weights of the edges from node i to community C. ∑ _tot represents the sum of the weights of the edges pointing to the nodes in the community C. The Louvain algorithm can obtain a more natural community of the network, and it is also a fairly fast algorithm, so the algorithm is used to divide the community of the network to speed up the proposed algorithm.

2.2.反向生成网络2.2. Reverse generative network

Lin等人提出了Backward Generating Networks(BGN)去识别复杂网络中的有影响力节点，该方法通过最小化鲁棒性值R来得到节点重要性的排序。BGN旨在找到一个节点序列，使网络尽快地崩溃，即网络中最大连通分量快速减小。BGN的核心是逆向过程，它没有选择从网络中删除节点，而是依据网络中巨大连通分量大小增长尽可能慢的需求，逐个向空网络中添加节点去构造原始网络。这样，节点的排序与添加的顺序相反，也就是说，后面添加的节点在维护网络连接方面更重要。Lin et al. proposed Backward Generating Networks (BGN) to identify influential nodes in complex networks, and the method obtains the ranking of node importance by minimizing the robustness value R. The purpose of BGN is to find a sequence of nodes, so that the network collapses as soon as possible, that is, the maximum connected component in the network decreases rapidly. The core of BGN is the reverse process. It does not choose to delete nodes from the network, but according to the requirement that the size of the huge connected components in the network grow as slowly as possible, add nodes to the empty network one by one to construct the original network. In this way, the nodes are ordered in the opposite order than they were added, that is, nodes added later are more important in maintaining network connectivity.

逆向过程从空网络G₀(V₀,E₀)开始，其中，

并且

在(n+1)^th时间步即第n+1个节点的时间步，将剩余节点中的一个节点添加到当前的网络G_n(V_n,E_n)中形成一个有n+1个节点的新网络，即G_n+1(V_n+1,E_n+1)，V_n表示网络G_n的节点集，E_n表示网络G_n的边集。重复此过程，直到网络恢复为原始网络。注意，在这一过程中所有进行中的网络G_n(n＝0,1,2,...,N)都是网络G的诱导子图，G_n表示n个节点的网络。根据BGN的策略，在每一个时间步中选择的节点应该尽可能最小化网络G_n+1中的最大连通分量的大小。The reverse process starts with an empty network G ₀ (V ₀ , E ₀ ), where,

and

At the (n+1) ^th time step, that is, the time step of the n+1 th node, one of the remaining nodes is added to the current network G _n (V _n , E _n ) to form a network with n+1 nodes The new network of , namely G _n+1 (V _n+1 , E _n ₊₁ ), V _n represents the node set of the network G _n , and En represents the edge set of the network G _n . Repeat this process until the network returns to the original network. Note that all in-progress networks Gn ( _n =0,1,2,...,N) in this process are induced subgraphs of network G, Gn representing a network of _n nodes. According to the BGN strategy, the nodes selected in each time step should minimize the size of the largest connected component in the network G _n+1 as much as possible.

2.3.图遍历中心性2.3. Graph Traversal Centrality

图遍历框架可以纳入不同类型的中心性以提高现有性能。该方法从图遍历的角度解决影响力节点识别问题，与现有方法完全不同。现有的任何中心性如度中心性，H-index等都可以通过该框架生成重要性分数。图遍历中心性如图2所示，首先，对于网络中的每一个节点，通过逐层遍历图来构建一个广度优先搜索树(BFS)，其中目标节点为根节点，如图2(A)所示。每一个节点都有一个初始中心性分数，可以从任意中心性度量方法获得。对于有影响力节点的树，一般在BFS树的顶层会有较多节点，这是因为顶层节点属于根节点的局部邻居。第二，从每棵BFS树中构建一个长度为h的累积分数向量vec＝[l₁,l₂,...,l_h]，其中h表示最大的层数(根节点的层数为1)。l_i表示层数不大于i的所有节点的分数之和，如图2(B)所示。第三，得分向量vec的前m个值被用来绘制曲线，用曲线下的面积来量化节点的影响，将曲线下的面积记为AUC值，以此来衡量每个节点的影响力；如图2(C)所示，其中m(1≤m≤h)是用户特定的参数，本图中m＝2；x轴表示BFS树的层数，y 轴表示每层的累积分数。Graph traversal frameworks can incorporate different types of centrality to improve existing performance. This method solves the problem of identifying influential nodes from the perspective of graph traversal, which is completely different from existing methods. Any existing centrality such as degree centrality, H-index, etc. can generate importance scores through this framework. The graph traversal centrality is shown in Figure 2. First, for each node in the network, a breadth-first search tree (BFS) is constructed by traversing the graph layer by layer, where the target node is the root node, as shown in Figure 2(A) Show. Each node has an initial centrality score, which can be obtained from any centrality metric. For a tree of influential nodes, there are generally more nodes at the top level of the BFS tree, because the top level nodes belong to the local neighbors of the root node. Second, construct a cumulative score vector vec=[l ₁ ,l ₂ ,...,l _h ] of length h from each BFS tree, where h represents the maximum number of levels (the number of levels of the root node is 1 ). l _i represents the sum of the scores of all nodes with layers not greater than i, as shown in Figure 2(B). Third, the first m values of the score vector vec are used to draw the curve, the area under the curve is used to quantify the influence of the node, and the area under the curve is recorded as the AUC value to measure the influence of each node; such as As shown in Figure 2(C), where m (1≤m≤h) is a user-specific parameter, m=2 in this figure; the x-axis represents the number of layers of the BFS tree, and the y-axis represents the cumulative score of each layer.

在给定的网络图G中，假设初始化中心性分数为CS＝{c₁,c₂,...,c_n}，其中c_i表示节点i的某种中心性(如度、H-index等)，n表示节点数量，c_n表示第n个节点中心性分数。那么每个节点生成的BFS 树在第m层的累积分数可以被计算如下：In a given network graph G, suppose the initial centrality score is CS={c ₁ ,c ₂ ,...,cn }, where c _i represents a certain centrality of node _i (such as degree, H-index etc.), n represents the number of nodes, and c _n represents the nth node centrality score. Then the cumulative score of the BFS tree generated by each node at the mth layer can be calculated as follows:

其中cum_score(m)表示每个节点生成的BFS树在第m层的累积分数；v_j表示节点j，T(q)表示某节点生成的BFS树中在第q层的所有节点，c_j表示节点j的某种中心性。AUC值通过生成的得分向量的前m项绘制曲线并计算面积获得，可以表示如下：where cum_score(m) represents the cumulative score of the BFS tree generated by each node in the mth layer; v _j represents node j, T(q) represents all nodes in the qth layer in the BFS tree generated by a node, and c _j represents Some centrality of node j. The AUC value is obtained by plotting the first m terms of the generated score vector and calculating the area, which can be expressed as follows:

AUC(m)值表示生成的得分向量的前m项绘制曲线并计算面积获得，表示一个节点的AUC值，以此衡量节点的重要性。The AUC(m) value represents the first m items of the generated score vector to draw the curve and calculate the area, which represents the AUC value of a node to measure the importance of the node.

通过图遍历获得的AUC值可以更好地提升如度中心性等的性能，使已有的中心性排序方法更加细粒度。The AUC value obtained by graph traversal can better improve the performance such as degree centrality, and make the existing centrality sorting methods more fine-grained.

3.提出的框架3. The proposed framework

在这一章节中，将提出新的CBGN框架以实现网络中的影响力最大化。其算法伪代码如算法 1所示。该框架如图3所示由三部分组成：(1)社区划分；在这一阶段上采用一种适合应用数据集的、考虑运行时间的社区检测算法(算法1line3)。(2)候选集生成；在反向网络生成中引入图遍历这一概念。利用图遍历进一步优化的中心性度量来逐个向空网络中添加节点。最后，还没有加入到网络中的节点即作为候选节点(算法1lines 4-10)。(3)选择种子节点。将启发式算法与贪婪算法平衡以快速准确选择节点(算法1line12)。下面将会给出每一步的详细描述。In this chapter, a new CBGN framework will be proposed to maximize influence in the network. The pseudocode of its algorithm is shown in Algorithm 1. The framework is composed of three parts as shown in Figure 3: (1) Community division; at this stage, a community detection algorithm (algorithm 1line3) suitable for the application dataset and considering the running time is adopted. (2) Candidate set generation; the concept of graph traversal is introduced in reverse network generation. Nodes are added to the empty network one by one using a centrality metric further optimized by graph traversal. Finally, the nodes that have not joined the network are regarded as candidate nodes (algorithm 1lines 4-10). (3) Select the seed node. Balance the heuristic algorithm with the greedy algorithm for fast and accurate node selection (Algorithm 1line12). A detailed description of each step will be given below.

3.1社区划分3.1 Community Division

Louvain算法被用来对本发明专利真实网络数据集进行划分。相比于其他的一些图分割方法、层次聚类方法、标签传播方法等，Louvain算法不需要关于社区数量的先验知识，可以发现更自然的社区。因此，由Louvain算法获得的社区更加接近真实网络的固有社区。并且该社区发现算法还可以应用到大规模网络上。另外，对于划分社区后的网络，并不是所有的社区都是足够有意义地容纳最终的种子节点，对于那些规模较小的社区，不将其送入候选节点选择阶段。考虑到规模更大的社区有更多的影响力，每个社区规模至少具备以下条件The Louvain algorithm is used to divide the real network data set of the patent of the present invention. Compared with some other graph segmentation methods, hierarchical clustering methods, label propagation methods, etc., the Louvain algorithm does not require prior knowledge about the number of communities and can find more natural communities. Therefore, the community obtained by Louvain's algorithm is closer to the intrinsic community of the real network. And the community discovery algorithm can also be applied to large-scale networks. In addition, for the network after the community is divided, not all the communities are meaningful enough to accommodate the final seed nodes, and for those smaller communities, they are not sent to the candidate node selection stage. Considering that larger communities have more influence, each community size has at least the following conditions

C_size＝size(G)*η (9)C_size=size(G)*η (9)

其中size(G)表示网络G的节点数量，η是一个可调参数，用于控制满足条件的社区大小。本发明专利设置η＝0.01。where size(G) represents the number of nodes in the network G, and η is an adjustable parameter to control the size of the community that satisfies the condition. The invention patent sets η=0.01.

3.2候选节点选择3.2 Candidate Node Selection

在这一阶段，候选节点将在相互独立的社区中产生。通过减少需要评估的候选节点的数量来找到最具影响力的节点，从而提高效率。在这一步中，每个社区将被看作一个诱导子图，一个独立的网络。在每个网络中，通过最小化鲁棒性值执行反向生成网络的过程。当将节点u加入网络时，记此时网络的最大连通分量为G[u]。根据反向生成网络策略，加入的节点u应最小化网络中最大化连通分量的大小，但是可能存在两个或以上节点同时满足此条件，例如图4中的绿色节点，图4是一个玩具网络.，图4(A)是在反向生成过程的第2个时间步，发现在保证最小化网络最大连通分量的情况下，再加入一个节点时，这里有多个节点(节点1，2，3，4，6)可选，图4(B)在反向生成网络的第4个时间步，这里有3个节点(节点1，3，4)可选。利用通过图遍历优化过的中心性来辅助构造反向生成网络。将最小化连通分量大小这一目标转换为最小化代价函数。将其代价函数定义为：At this stage, candidate nodes will be generated in mutually independent communities. Efficiency is improved by reducing the number of candidate nodes that need to be evaluated to find the most influential nodes. In this step, each community will be treated as an induced subgraph, an independent network. In each network, the process of back-generating the network is performed by minimizing the robustness value. When adding node u to the network, record the maximum connected component of the network as G[u]. According to the reverse generation network strategy, the added node u should minimize the size of the maximum connected component in the network, but there may be two or more nodes that satisfy this condition at the same time, such as the green nodes in Figure 4, which is a toy network ., Figure 4(A) is the second time step in the reverse generation process. It is found that when a node is added while ensuring that the maximum connected component of the network is minimized, there are multiple nodes (node 1, 2, 3, 4, 6) are optional, Fig. 4(B) is in the 4th time step of the reverse generative network, here there are 3 nodes (nodes 1, 3, 4) optional. The centrality optimized by graph traversal is used to assist in constructing a reverse generative network. Transform the objective of minimizing the size of connected components into minimizing the cost function. Define its cost function as:

其中，cost(u,n+1)表示节点u在(n+1)^th时间步的代价函数，AUC_u表示节点u的AUC值，AUC_max表示所有节点AUC值最大的，AUC_min表示所有节点AUC值最小的，

ξ是一个足够小的正参数，保证

这里选择度中心性作为图遍历中的初始化中心分数，度中心性经过图遍历框架优化，可以产生更加细粒度的衡量节点影响力的AUC分数。在每个时间步将使得代价函数最小的节点加入到网络中，当剩余未加入节点数量满足候选节点数量需求时，停止构造网络。其改进的在每个社区中用于生成候选节点的反向生成网络算法imp_BGN如算法2 所示。由于反向生成网络过程中先加入是较为不重要的节点，因此将剩余未用于构造原始网络的节点则传入候选节点集。为了进一步缩小搜索范围，并保证有质量合适的候选节点，在每个社区形成的独立网络中设置候选节点集的大小，其公式表示如下：Among them, cost(u,n+1) represents the cost function of node u at (n+1) ^th time step, AUC _u represents the AUC value of node u, AUC _max represents the largest AUC value of all nodes, and AUC _min represents all nodes with the smallest AUC value,

represents the size of the largest connected component that adds node u to the network G at the (n+1) ^th time step, namely

ξ is a positive parameter small enough to guarantee

Here, degree centrality is selected as the initial center score in graph traversal. The degree centrality is optimized by the graph traversal framework, which can generate a more fine-grained AUC score that measures the influence of nodes. At each time step, the node with the smallest cost function is added to the network, and the network is stopped when the number of remaining unjoined nodes meets the requirement of the number of candidate nodes. Its improved reverse generative network algorithm imp_BGN for generating candidate nodes in each community is shown in Algorithm 2. Since the less important nodes are added first in the process of reverse network generation, the remaining nodes that are not used to construct the original network are passed into the candidate node set. In order to further narrow the search range and ensure that there are candidate nodes with suitable quality, the size of the candidate node set is set in the independent network formed by each community. The formula is as follows:

其中，cand_num_i表示第i个社区候选节点集的大小，项(C_size_i-C_size_min)/(C_size_max-C_size_min)表示第i个社区在所有选择社区中的比例，其值定义在[0,1]。β是一个放大参数，控制候选节点集的规模大小。k是最终需要选择的的种子节点数量。利用改进的反向生成网络来生成候选节点集有以下优点，首先，可以使得生成的候选节点集更加分散，节点少量聚集，并且都是网络中的关键节点。第二，选择的候选节点影响力范围重叠度低，在第三阶段也可以用启发式方法与贪婪方法平衡来选择种子节点。第三，这些候选节点保证了网络的健壮性，移除这些节点将很容易造成网络瓦解。第四，在生成候选节点时不用遍历整个网络，当未加入节点满足候选节点集数量时即可停止构建网络。Among them, cand_num _i represents the size of the candidate node set of the ith community, and the item (C_size _i -C_size _min )/(C_size _max -C_size _min ) represents the proportion of the ith community in all selected communities, and its value is defined in [0 ,1]. β is a scaling parameter that controls the size of the candidate node set. k is the final number of seed nodes to be selected. Using the improved reverse generative network to generate candidate node sets has the following advantages. First, the generated candidate node sets can be more dispersed, with a small number of nodes clustered, and all of them are key nodes in the network. Second, the influence range of the selected candidate nodes has a low degree of overlap. In the third stage, the heuristic method and the greedy method can also be used to balance the selection of seed nodes. Third, these candidate nodes ensure the robustness of the network, and removing these nodes will easily cause the network to collapse. Fourth, it is not necessary to traverse the entire network when generating candidate nodes, and the network construction can be stopped when the number of unjoined nodes meets the number of candidate node sets.

综上，利用imp_BGN生成候选节点是很具有意义的。In summary, it is very meaningful to use imp_BGN to generate candidate nodes.

在算法2中，首先lines1-2行对算法进行初始化，之后lines3-11构建反向生成网络以生成候选节点集。每次在网络中加入节点时，需要计算剩余节点的加入代价(lines4-5)，选择最小化代价函数的节点(line7)构建网络。在加入节点之后，需要更新网络中每个派系即连通分量的大小，并记录网络中最大连通分量的大小(line8)。重复此过程，直到剩余节点数量满足所需候选节点数量。In Algorithm 2, first lines1-2 initialize the algorithm, and then lines3-11 construct a reverse generative network to generate candidate node sets. Each time a node is added to the network, the joining cost of the remaining nodes (lines4-5) needs to be calculated, and the node (line7) that minimizes the cost function is selected to construct the network. After adding a node, it is necessary to update the size of each faction in the network, that is, the size of the connected component, and record the size of the largest connected component in the network (line8). This process is repeated until the number of remaining nodes meets the required number of candidate nodes.

3.3选择种子节点3.3 Select the seed node

通过前面两阶段，搜索空间已大大缩减。由于第二阶段的候选节点集节点之间彼此较为分散，可以利用启发式算法部分选择节点来平衡算法效率。在这一阶段选择启发式方法结合贪婪方法一起选择种子影响力节点。利用启发式算法结合改进的经典贪婪算法或者利用启发式算法结合经典贪婪算法均可。优选用启发式算法结合改进的经典贪婪算法选择种子影响力节点，步骤如下所示：Through the first two stages, the search space has been greatly reduced. Since the nodes of the candidate node set in the second stage are relatively scattered with each other, the heuristic algorithm can be used to partially select nodes to balance the efficiency of the algorithm. At this stage, the selection heuristic method is combined with the greedy method to select the seed influence nodes. Either heuristic algorithm combined with improved classical greedy algorithm or heuristic algorithm combined with classical greedy algorithm can be used. It is preferable to use a heuristic algorithm combined with an improved classical greedy algorithm to select seed influence nodes. The steps are as follows:

整体的种子节点选择分为两步：步骤1)：通过度折扣算法在候选节点集中选择部分k₁节点，步骤2)：通过改进的子模性CELF算法选择部分k₂节点。令k₁、k₂满足以下条件：The overall seed node selection is divided into two steps: Step 1): select some k ₁ nodes in the candidate node set by the degree discount algorithm, and step 2): select some k ₂ nodes by the improved submodular CELF algorithm. Let k ₁ and k ₂ satisfy the following conditions:

k₂＝k-k₁ (13)k ₂ =kk ₁ (13)

其中，k为最终需要选择的种子节点数量，c表示网络中满足一定规模的社区的总数。μ是一个可调参数以平衡贪心算法和启发式算法。为方便，本发明专利设置μ＝0.5。在启发式算法选择过程中，令度折扣中的传染概率p大于网络的传播阈值。每个节点的广义折扣度用式(14)获得，将计算结果从高到低排序，选择前k₁个节点。Among them, k is the final number of seed nodes to be selected, and c is the total number of communities in the network that meet a certain scale. μ is a tunable parameter to balance greedy and heuristic algorithms. For convenience, the patent of the present invention sets μ=0.5. In the heuristic selection process, the contagion probability p in the degree discount is made larger than the network's propagation threshold. The generalized discount degree of each node is obtained by formula (14), the calculation results are sorted from high to low, and the first k ₁ nodes are selected.

其中d_v表示节点的度，t_v表示节点v的感染邻居数量，t_w表示节点v的易感邻居节点w的感染邻居数。在贪婪的CELF选择阶段，进一步考虑了节点的位置信息以及节点之间的结构相似性，若在每轮改进的CELF选择的过程中选择的节点u与之前选择的节点(选择阶段先选的节点，CELF 算法需要迭代地选择节点)或启发式过程选择的节点相似，即只要有一个节点满足，那么就不选择该节点，并将其从候选节点集中剔除。设计节点u和节点v相似性如下：where d _v represents the degree of the node, t _v represents the number of infected neighbors of node v, and t _w represents the number of infected neighbors of node w of node v's susceptible neighbors. In the greedy CELF selection stage, the location information of nodes and the structural similarity between nodes are further considered. , the CELF algorithm needs to iteratively select nodes) or the nodes selected by the heuristic process are similar, that is, as long as there is a node that satisfies, then the node is not selected, and it is eliminated from the candidate node set. Design node u and node v similarity as follows:

sim_loc＝1-abs(ks(u)-ks(v)) (16)sim_loc=1-abs(ks(u)-ks(v)) (16)

其中sim表示节点u和节点v的相似性，N(u)表示节点u的邻居节点集，N(v)表示节点v的邻居节点集，abs(·)表示取绝对值。ks(u)为节点u的归一化的k-壳指标，ks(v)为节点v的归一化的k -壳指标。式中前项表示节点之间的结构相似性，后项表示节点之间的位置相似性。由于两个节点之间的k-壳指标相等或相近，则他们应该位于网络的相近位置，这样的节点被认为在位置上相似。位置相似sim_loc性通过式(16)计算。显然，sim_loc越大则节点位置越相似。式中ε为一个平衡结构相似性和位置相似性的正参数，这里设置ε＝0.1。其选择种子节点算法伪代码如算法3所示。where sim represents the similarity between node u and node v, N(u) represents the neighbor node set of node u, N(v) represents the neighbor node set of node v, and abs( ) represents the absolute value. ks(u) is the normalized k-shell metric for node u, and ks(v) is the normalized k-shell metric for node v. The former term represents the structural similarity between nodes, and the latter term represents the positional similarity between nodes. Since the k-shell indices between two nodes are equal or similar, they should be located at similar locations in the network, and such nodes are considered to be similar in location. The location similarity sim_loc is calculated by equation (16). Obviously, the larger the sim_loc, the more similar the node positions. where ε is a positive parameter that balances structural similarity and positional similarity, and ε = 0.1 is set here. The pseudo-code of the algorithm for selecting seed nodes is shown in Algorithm 3.

其中，lines1-3对算法进行初始化，接着用度折扣算法选择部分k₁节点(line5)，最后用改进的 CELF算法选择部分k₂节点(lines12-19)，sim_value表示相似性阈值，两节点之间的相似性大于该相似性阈值，则认为他们是相似的。Among them, lines1-3 initialize the algorithm, then use the degree discount algorithm to select some k ₁ nodes (line5), and finally use the improved CELF algorithm to select some k ₂ nodes (lines12-19), sim_value represents the similarity threshold, the difference between the two nodes If the similarity between them is greater than the similarity threshold, they are considered to be similar.

4实验结果4 Experimental results

为了将提出的CBGN方法与已有的算法(Degree、K-shell、NC+、PageRank、ClusterRank和 BGN)进行比较，在Inf-USAir、CEnew、Power、Ca-GrQc、Hamster和Router这6个真实网络上进行了包括鲁棒性和SIR模型的传播规模的仿真实验。In order to compare the proposed CBGN method with existing algorithms (Degree, K-shell, NC+, PageRank, ClusterRank and BGN), six real networks including Inf-USAir, CEnew, Power, Ca-GrQc, Hamster and Router Simulation experiments including robustness and propagation scale of the SIR model were carried out on .

4.1数据集4.1 Dataset

本发明专利使用了6个不同类型和规模的真实网络数据集，其统计特性如表1。The patent of the present invention uses 6 real network data sets of different types and scales, and their statistical characteristics are shown in Table 1.

每个实验网络的度分布和网络社区划分如图5所示，分别为(a)Inf-USAir，(b)CEnew，(c)Power， (d)Hamster，(e)Ca-GrQc，(f)Router。其中横坐标表示网络中节点的度，纵坐标表示节点度出现的频数。小图为网络的社区可视化结果。The degree distribution and network community division of each experimental network are shown in Fig. 5, which are (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, (f) )Router. The abscissa represents the degree of nodes in the network, and the ordinate represents the frequency of node degrees. The small image shows the results of the community visualization of the network.

其中，1)Inf-USAir是一个美国航空网络，节点表示一个机场，边则表示两个机场之间有直飞的航线。2)CEnew是一个生物网络，描述的秀丽隐杆线虫代谢网络的边列表。3)Power是一个无向、无权网络，代表美国各州电网的拓扑结构，每个节点表示电力公司，每条边表示电力公司之间的关系。4)Hamster是一个描述网站”www.hamsterster.com”的用户之间的友谊关系。5)Ca-GrQc网络是一个科学合作网络，涵盖了提交给广义相对论和量子宇宙学类别的论文的作者之间的科学合作。 6)Router是自治系统级别上互联网结构的对称快照。Among them, 1) Inf-USAir is an American aviation network, the node represents an airport, and the edge represents a direct flight between two airports. 2) CEnew is a biological network that describes an edge list of the metabolic network of C. elegans. 3) Power is an undirected, weightless network, representing the topology of the power grids in the states of the United States, each node represents a power company, and each edge represents the relationship between power companies. 4) Hamster is a description of the friendship between users of the website "www.hamsterster.com". 5) The Ca-GrQc network is a scientific collaboration network covering scientific collaborations between authors of papers submitted to the general relativity and quantum cosmology categories. 6) A Router is a symmetric snapshot of the Internet structure at the autonomous system level.

表1网络的统计特性Table 1 Statistical characteristics of the network

NetworkNetwork |V||V| |E||E| <k><k> kmaxkmax <d><d> C_numC_num βminβmin Inf-USAirInf-USAir 332332 21262126 12.80712.807 139139 2.7382.738 77 0.02310.0231 CEnewCEnew 453453 20252025 8.948.94 237237 2.6642.664 99 0.02560.0256 PowerPower 685685 12821282 5.7435.743 1212 12.42212.422 1717 0.27780.2778 HamsterHamster 24262426 1663116631 13.71113.711 273273 3.673.67 168168 0.02410.0241 Ca-GrQcCa-GrQc 41584158 1342213422 6.4566.456 8181 6.0496.049 4040 0.05890.0589 RouterRouter 50225022 62586258 2.4922.492 106106 6.4496.449 5555 0.0786 0.0786

在表1中，|V|表示网络中节点总数，|E|表示网络中边的数量。<k>＝2|E|/|V|表示网络的平均度。 k_max表示网络最大度。<d>表示网络平均最短路径长度。C_num表示网络中的社区个数。β_min为网络的传播阈值，通过公式<k>/(<k²>-<k>)计算。In Table 1, |V| represents the total number of nodes in the network, and |E| represents the number of edges in the network. <k>=2|E|/|V| represents the average degree of the network. k _max represents the maximum degree of the network. <d> represents the average shortest path length of the network. C_num represents the number of communities in the network. β _min is the propagation threshold of the network, which is calculated by the formula <k>/(<k ² >-<k>).

4.2性能指标4.2 Performance indicators

(1)鲁棒性值(1) Robustness value

可以运用鲁棒性来评价算法的性能，给定一个网络，在每个时间步删除一个节点，并计算剩余网络中最大连通分量的大小。将这些每次加入节点时的最大连通分量加和，并用网络大小N归一化即可得鲁棒性值。鲁棒性值通过式(2)计算得到，并且值越小则说明排序算法越能给出正确的排序。Robustness can be used to evaluate the performance of the algorithm, given a network, delete a node at each time step, and calculate the size of the largest connected component in the remaining network. The robustness value can be obtained by summing the maximum connected components of these nodes each time they are added and normalized by the network size N. The robustness value is calculated by formula (2), and the smaller the value is, the better the sorting algorithm can give the correct sorting.

(2)累积分布函数(CDF)(2) Cumulative Distribution Function (CDF)

累积分布函数能完整描述一个随机变量X的概率分布，对于所有的实数x，累积分布函数定义如下：The cumulative distribution function can completely describe the probability distribution of a random variable X. For all real numbers x, the cumulative distribution function is defined as follows:

F_X(x)＝P(X≤x)for-∞<x<+∞ (17)F _X (x)=P(X≤x)for-∞<x<+∞ (17)

可以用CDF确定取自总体的随机观测值小于或等于特定值的概率。本发明专利用CDF曲线来测量排序算法区分节点重要性的能力。The CDF can be used to determine the probability that a random observation from a population is less than or equal to a particular value. The invention patent uses CDF curve to measure the ability of sorting algorithm to distinguish the importance of nodes.

(3)SIR模型(3) SIR model

SIR模型是一种常见的描述传染病的扩散模型，其基本假设是将网络中的节点分为三类：a)易感节点，指未被感染的但缺乏免疫能力的节点，b)感染节点，这类节点是已经被感染的节点，在每个时间步可以以β的概率去感染邻居易感节点，c)恢复节点，同样在每个时间步，每个感染节点会以γ的概率变为恢复状态，并且在之后不会参与感染和被感染过程。SIR模型常被使用来衡量节点的影响力规模。本发明专利使用SIR模型来衡量选择的种子节点的最终感染规模。优秀的传播者能迅速达到高感染水平，其在t时刻感染规模F(t)以及在感染过程中达到稳定状态的最终感染规模F(t_c)可以被表示为The SIR model is a common diffusion model for describing infectious diseases. Its basic assumption is to divide the nodes in the network into three categories: a) susceptible nodes, which refer to uninfected but lacking immunity nodes, b) infected nodes , such nodes are already infected nodes, which can infect neighboring susceptible nodes with probability β at each time step, c) restore nodes, also at each time step, each infected node will change with probability γ In order to restore the state, and will not participate in the infection and infection process afterward. The SIR model is often used to measure the influence scale of a node. The invention patent uses the SIR model to measure the final infection scale of the selected seed nodes. A good spreader can quickly reach a high infection level, and its infection scale F(t) at time t and the final infection scale F(t _c ) that reaches a steady state during the infection process can be expressed as

其中n_I(t)表示t时刻感染节点的数量，n_R(t)表示t时刻恢复节点数量，n是G中节点的数目，。where n _I (t) represents the number of infected nodes at time t, n _R (t) represents the number of recovered nodes at time t, and n is the number of nodes in G.

(4)平均最短路径长度(4) Average shortest path length

为了确保更广泛的覆盖范围，可以考虑所选择的种子节点散布在网络的各个部分。一般来说，选择的节点越分散即均匀散布在网络中，则节点之间的影响范围重叠度越小，可以期望的感染的范围就更大。通常可以用平均最短路径长度来判断节点的分散程度，种子节点之间的平均最短路径长度L_s可以通过式(19)计算。To ensure wider coverage, consider the chosen seed nodes to be spread across parts of the network. Generally speaking, the more dispersed the selected nodes are, that is, evenly distributed in the network, the smaller the overlapping degree of the influence range between nodes, and the larger the expected infection range. Generally, the average shortest path length can be used to judge the degree of dispersion of nodes, and the average shortest path length L _s between seed nodes can be calculated by formula (19).

其中S表示选择的种子节点集，d_u,v表示节点u到节点v的平均最短路径长度。where S represents the selected set of seed nodes, and d _{u, v} represents the average shortest path length from node u to node v.

4.3基线算法4.3 Baseline Algorithm

以6种先进的算法作为基准算法，分别在鲁棒性实验和传播规模实验与提出的CBGN方法进行对比，其6种算法简要介绍如下。Taking 6 advanced algorithms as benchmark algorithms, the robustness experiments and propagation scale experiments are compared with the proposed CBGN method. The 6 algorithms are briefly introduced as follows.

Degree：该算法选择最大的度作为种子节点，是一种简单的、直观的以及常用的标准算法。Degree: This algorithm selects the maximum degree as the seed node, which is a simple, intuitive and commonly used standard algorithm.

K-shell：通过K-壳分解可以获得每个节点的k-壳值，K-壳方法考虑了节点在网络中的位置关系。K-shell: The k-shell value of each node can be obtained by K-shell decomposition, and the K-shell method considers the positional relationship of nodes in the network.

Neighborhood coreness(NC)：该方法是在K-壳方法上的进一步改进，每个节点的邻域核数C_nc(v) 和扩展的邻域核数C_nc+(v)计算如下，其中ks(w)表示节点w的k-壳值。Neighborhood coreness(NC): This method is a further improvement on the K-shell method. The number of neighborhood cores C _nc (v) and the number of extended neighborhood cores C _nc+ (v) for each node are calculated as follows, where ks( w) represents the k-shell value of node w.

其中N(v)表示节点v的邻居数量。where N(v) represents the number of neighbors of node v.

PageRank：PageRank算法作为计算机互联网网页重要度的算法被提出。PageRank值越高，网页就越重要，在互联网搜索的排序中可能就被排在前面。如果一个网页被很多其他网页链接到的话，说明这个网页比较重要。如果一个PageRank值很高的网页链接到一个其他的网页，那么被链接到的网页的PageRank值会相应地因此而提高。PageRank: The PageRank algorithm is proposed as an algorithm for the importance of computer Internet pages. The higher the PageRank value, the more important a web page is and may be ranked higher in Internet search rankings. If a web page is linked to by many other web pages, it means that this web page is more important. If a page with a high PageRank links to another page, the PageRank of the linked page will increase accordingly.

ClusterRank：ClusterRank算法不仅考虑了节点本身的影响力，还考虑了节点的聚类系数。它考虑了网络的局部信息，缺乏性能保障。ClusterRank: The ClusterRank algorithm not only considers the influence of the node itself, but also considers the clustering coefficient of the node. It considers the local information of the network and lacks performance guarantee.

BGN：该算法是从网络鲁棒性的角度去考虑节点重要性的，考虑了网络的全局信息。BGN: This algorithm considers the importance of nodes from the perspective of network robustness, and considers the global information of the network.

4.4鲁棒性分析4.4 Robustness Analysis

为了验证改进的反向生成网络算法的有效性，将imp_BGN与六种基线算法进行对比，分析其鲁棒性。好的排序算法应该具备更小的鲁棒性值，即R曲线下的面积越小。图6可视化了6个网络中反向生成网络过程中生成的R曲线，6个网络分别是(a)Inf-USAir，(b)CEnew，(c)Power，(d) Hamster，(e)Ca-GrQc，(f)Router。表2中给出了不同方法在6个真实网络的鲁棒性值R。从图6 及表2中可以看出，除了在Inf-USAir网络中，Imp_BGN相较于其他6种基线方法，其鲁棒性值R 更小。由此可以说明利用该方法可以很好地选择出候选节点，这些候选节点有助于维持网络稳定性及连通性。考虑到反向生成网络是应用在候选节点选择阶段，候选节点是在以社区形成的子网络中进行的，为了进一步验证算法的有效性，在每个网络的规模较大的前2个社区中对算法鲁棒性进行分析，其结果统计如表3，C₁，C₂表示社区规模排名前2的两个社区。在图7中对CEnew和Power 这两个网络中的前两个社区的R曲线进行可视化，图7为2个真实网络社区中的R曲线，横轴表示种子节点比例，纵轴表示网络中最大连通分量的大小。图7(a)为CEnew中的最大的社区，图 7(b)CEnew中次大的社区，图7(c)为Power中最大的社区，图7(d)为Power中次大的社区。In order to verify the effectiveness of the improved reverse generative network algorithm, imp_BGN is compared with six baseline algorithms to analyze its robustness. A good sorting algorithm should have a smaller robustness value, that is, the smaller the area under the R curve. Figure 6 visualizes the R-curves generated during the reverse generation of the network in 6 networks, (a) Inf-USAir, (b) CEnew, (c) Power, (d) Hamster, (e) Ca -GrQc, (f)Router. The robustness values R of different methods on 6 real networks are given in Table 2. It can be seen from Figure 6 and Table 2 that, except in the Inf-USAir network, Imp_BGN has a smaller robustness value R than the other six baseline methods. It can be shown that candidate nodes can be well selected by this method, and these candidate nodes are helpful to maintain network stability and connectivity. Considering that the reverse generation network is applied in the candidate node selection stage, and the candidate nodes are carried out in the sub-network formed by the community, in order to further verify the effectiveness of the algorithm, in the first two communities with a larger scale of each network The robustness of the algorithm is analyzed, and the statistics of the results are shown in Table 3. C ₁ and C ₂ represent the top two communities in terms of community size. In Figure 7, the R curves of the first two communities in the two networks, CEnew and Power, are visualized. Figure 7 shows the R curves in the two real network communities. The horizontal axis represents the proportion of seed nodes, and the vertical axis represents the largest network in the network. The size of the connected components. Figure 7(a) is the largest community in CEnew, Figure 7(b) is the second largest community in CEnew, Figure 7(c) is the largest community in Power, and Figure 7(d) is the second largest community in Power.

从表3和图7中可以看出，imp_BGN算法在大部分网络的社区中表现也最出色，并且在 Inf-USAir网络的社区中也以微小优势胜过BGN算法。As can be seen from Table 3 and Figure 7, the imp_BGN algorithm also performs the best in the community of most networks, and also outperforms the BGN algorithm by a small margin in the community of the Inf-USAir network.

表2不同方法鲁棒性值RTable 2 Robustness value R of different methods

NetworkNetwork Inf-USAirInf-USAir CEnewCEnew PowerPower HamsterHamster Ca-GrQcCa-GrQc RouterRouter K-shellK-shell 0.16140.1614 0.18730.1873 0.42230.4223 0.18150.1815 0.23170.2317 0.02850.0285 ClusterRankClusterRank 0.11810.1181 0.13010.1301 0.21150.2115 0.13710.1371 0.10510.1051 0.01580.0158 PageRankPageRank 0.12270.1227 0.12290.1229 0.20190.2019 0.14210.1421 0.10270.1027 0.01350.0135 NC+NC+ 0.16430.1643 0.16230.1623 0.33290.3329 0.16920.1692 0.21430.2143 0.02020.0202 BGNBGN 0.08990.0899 0.11710.1171 0.06330.0633 0.10450.1045 0.06060.0606 0.00760.0076 DegreeDegree 0.12600.1260 0.12000.1200 0.22860.2286 0.13840.1384 0.13130.1313 0.01210.0121 Imp_BGNImp_BGN 0.09610.0961 0.07900.0790 0.04310.0431 0.08720.0872 0.05380.0538 0.0063 0.0063

表3不同算法在社区中的鲁棒性值RTable 3 Robustness values R of different algorithms in the community

另外，之所以选择经过图遍历优化过的度中心性来辅助构建反向生成网络，相比于直接用度中心性来辅助构建更有效。这是因为度中心性经过图遍历框架优化过后变得更加细粒度。可以用分辨率来区分节点重要性的能力，这样的分辨能力可以通过累积分布函数CDF来衡量。图8给出了在 3个网络中Degree和经图遍历优化过的度中心性TRank_degree的CDF曲线。3个网络分别为(a) Inf-USAir，(b)CEnew，(c)Power，图中x轴表示节点的等级，y轴表示各等级所占的比例。其CDF曲线与x轴夹角越小则表明算法效果越好。可以看出TRank_degree更能区分节点的重要性。这更加验证了算法imp_BGN的有效性。In addition, the reason for choosing the degree centrality optimized by graph traversal to assist the construction of the reverse generation network is more effective than directly using the degree centrality to assist the construction. This is because the degree centrality becomes more fine-grained after being optimized by the graph traversal framework. The ability to distinguish the importance of nodes can be measured by the resolution, which can be measured by the cumulative distribution function CDF. Figure 8 presents the CDF curves of Degree and graph traversal optimized degree centrality TRank_degree in three networks. The three networks are (a) Inf-USAir, (b) CEnew, (c) Power. The x-axis in the figure represents the level of nodes, and the y-axis represents the proportion of each level. The smaller the angle between the CDF curve and the x-axis, the better the algorithm effect. It can be seen that TRank_degree can better distinguish the importance of nodes. This further verifies the effectiveness of the algorithm imp_BGN.

4.5传播规模分析4.5 Analysis of Spread Scale

为验证所提出的CBGN方法选择影响力节点的能力，选用SIR模型来衡量不同算法选择的种子节点的最终感染规模F(t_c)。选取β应高于网络的传播阈值β_min，感染率设置为λ＝β/γ。由于模型存在随机性，实验结果通过对1000次独立实验求平均获得。选择的种子节点数量设置为网络规模的3％。实验结果如图9所示，其中横坐标表示感染的时间t，纵坐标F(t)表示t时刻累积感染的节点数，并且F(t)会随者时间的推移达到一个稳定值F(t_c)。在更少的时间达到更大的F(t_c)，则表明算法的性能更好。In order to verify the ability of the proposed CBGN method to select influential nodes, the SIR model is selected to measure the final infection scale F(t _c ) of the seed nodes selected by different algorithms. The selection β should be higher than the network propagation threshold β _min , and the infection rate is set as λ=β/γ. Due to the randomness of the model, the experimental results were obtained by averaging 1000 independent experiments. The number of seed nodes chosen is set to 3% of the network size. The experimental results are shown in Figure 9, where the abscissa represents the time t of infection, and the ordinate F(t) represents the cumulative number of infected nodes at time t, and F(t) will reach a stable value F(t) over time _c ). Reaching a larger F(t _c ) in less time indicates better performance of the algorithm.

观察图9的SIR模型时间步长实验，所提出的CBGN与另外的6个算法相比，在6个网络数据集中传播规模都是最佳的，这6个网络分别为(a)Inf-USAir，(b)CEnew，(c)Power，(d)Hamster， (e)Ca-GrQc，(f)Router。在Inf-USAir网络中，所提出的CBGN方法感染规模明显高于其他6种基线算法，而6种基线算法感染规模相当。在Power网络中，CBGN算法感染规模以0.84％的优势胜过最好的ClusterRank算法。在网络CEnew、Hamster、Ca-GrQC和Router中，CBGN方法的感染规模分别高于最好的BGN算法0.59％、1.05％、0.71％和0.14％。在这4个网络中，BGN算法都表现出优秀的能力，但还是次于CBGN。另外，在SIR模型中，不同的传染概率也会对传播规模造成一定的影响，因此对SIR模型的不同的传染率进行实验，设置λ范围为[1.0,2.0]，实验结果如图 10所示。同样地，实验结果由1000次独立实验的平均获得。其中x轴表示传染率λ，y轴表示在某一传染率下的稳定的最终感染规模F(t_c)。Observing the time step experiment of the SIR model in Figure 9, the proposed CBGN has the best propagation scale in the 6 network datasets compared with the other 6 algorithms. These 6 networks are (a) Inf-USAir , (b) CEnew, (c) Power, (d) Hamster, (e) Ca-GrQc, (f) Router. In the Inf-USAir network, the infection scale of the proposed CBGN method is significantly higher than that of the other 6 baseline algorithms, which are comparable in infection scale. In the Power network, the infection scale of the CBGN algorithm outperforms the best ClusterRank algorithm by 0.84%. In the networks CEnew, Hamster, Ca-GrQC and Router, the infection scale of the CBGN method is 0.59%, 1.05%, 0.71% and 0.14% higher than the best BGN algorithm, respectively. In these 4 networks, the BGN algorithm shows excellent ability, but it is still inferior to CBGN. In addition, in the SIR model, different infection probabilities will also have a certain impact on the transmission scale. Therefore, experiments are carried out on different infection rates of the SIR model, and the λ range is set to [1.0, 2.0], and the experimental results are shown in Figure 10. . Again, the experimental results were obtained from the average of 1000 independent experiments. where the x-axis represents the infection rate λ, and the y-axis represents the stable final infection scale F(t _c ) at a certain infection rate.

在不同的感染概率下，所提出的CBGN方法感染规模都优于6种基线算法。除在CEnew网络中，CBGN方法与BGN算法表现相似，在其余网络中都以较大的优势胜过BGN算法。此外，在选择候选节点时，改进的反向生成网络算法imp_BGN在Inf-USAir、Router上并未始终收获最小的鲁棒性值，但通过最终的CBGN方法选择出的种子节点却能成功感染最多的节点。由此可见，构建的CBGN框架是循序渐进的和有效的。The infection scale of the proposed CBGN method outperforms the six baseline algorithms under different infection probabilities. Except in the CEnew network, the CBGN method is similar to the BGN algorithm, and outperforms the BGN algorithm by a large advantage in other networks. In addition, when selecting candidate nodes, the improved reverse generation network algorithm imp_BGN does not always obtain the smallest robustness value on Inf-USAir and Router, but the seed nodes selected by the final CBGN method can successfully infect the most node. It can be seen that the constructed CBGN framework is step-by-step and effective.

4.6平均最短路径长度分析4.6 Average Shortest Path Length Analysis

通常来说，选择的种子节点集节点之间越分散，即平均最短路径越大，则扩散影响可以达到更广的范围，因此种子节点集之间的平均最短路径长度L_s通常作为一种衡量好坏的指标。L_s并不是一个绝对的指标，因为在选择节点时考虑的是节点的传播能力而不仅仅是节点的分散程度。Generally speaking, the more dispersed among the nodes of the selected seed node set, that is, the larger the average shortest path, the wider the diffusion effect can reach. Therefore, the average shortest path length L _s between the seed node sets is usually used as a measure Good and bad indicators. L _s is not an absolute indicator, because the spreading ability of nodes is considered when selecting nodes and not just the degree of dispersion of nodes.

表4平均最短路径长度Table 4 Average shortest path length

NetworkNetwork Inf-USAirInf-USAir CEnewCEnew PowerPower HamsterHamster Ca-GrQcCa-GrQc RouterRouter DegreeDegree 1.01.0 1.30771.3077 12.380912.3809 1.69291.6929 3.9363.936 3.63813.6381 K-shellK-shell 1.01.0 1.30771.3077 9.77629.7762 1.99581.9958 4.01174.0117 3.18193.1819 NC+NC+ 1.01.0 1.25271.2527 8.67148.6714 1.58711.5871 3.97453.9745 3.02533.0253 PageRankPageRank 1.13331.1333 1.31871.3187 11.214311.2143 1.73931.7393 3.38533.3853 3.64403.6440 ClusterRankClusterRank 1.01.0 1.31871.3187 10.652410.6524 1.65411.6541 2.92742.9274 3.06403.0640 BGNBGN 1.01.0 1.46151.4615 8.53818.5381 2.10082.1008 3.74633.7463 4.13904.1390 CBGNCBGN 1.21.2 1.52751.5275 12.285712.2857 2.26182.2618 3.98213.9821 4.0840 4.0840

表4给出了提出的CBGN方法与6种基线算法选择的种子节点之间的平均最短路径长度。可以看出，在一半的网络中，通过CBGN方法选择出的种子节点集都是最分散的。而Degree、K-shell 和BGN方法分别在Power、Ca-GrQc和Router网络中选择出最分散的节点。Table 4 presents the average shortest path lengths between the seed nodes selected by the proposed CBGN method and the 6 baseline algorithms. It can be seen that in half of the networks, the set of seed nodes selected by the CBGN method is the most dispersed. While Degree, K-shell and BGN methods select the most dispersed nodes in Power, Ca-GrQc and Router networks, respectively.

5结论5 Conclusion

本发明专利提出了一种基于社区的反向生成网络框架CBGN去解决影响力最大化问题。首先，利用适合应用数据集的、考虑运行时间的Louvain算法将网络划分成自然的社区，通过这一步缩小影响力节点的搜索范围。然后，将每个社区视为原图的诱导子图，在每个子图中运用图遍历优化的度中心性来辅助反向构建网络，每次向网络中加入最小化代价函数的节点，当剩余未加入网络的节点满足候选节点数量时停止构建网络，所有的这些候选节点被送入候选节点集。通过分析鲁棒性实验，改进的反向生成网络算法在整个网络或独立的社区中都能获得更小的鲁棒性值。这验证了改进的imp_BGN算法更能选择出网络中的关键节点或维持网络稳定性的节点作为候选节点。最后，利用度折扣和考虑网络结构和节点位置关系的贪婪算法在候选节点集中选择出最终的种子节点。通过算法的传播规模和平均最短路径长度实验证明，提出的CBGN方法选择的种子节点感染速度更快，感染规模更大，并且种子节点集之间在大部分网络上都是分散的。The patent of this invention proposes a community-based reverse generation network framework CBGN to solve the problem of maximizing influence. First, use the Louvain algorithm suitable for the application dataset and consider the running time to divide the network into natural communities, and narrow the search scope of influential nodes through this step. Then, each community is regarded as an induced subgraph of the original graph, and the degree centrality of graph traversal optimization is used in each subgraph to assist the reverse construction of the network, and each time a node that minimizes the cost function is added to the network, when the remaining When the nodes that do not join the network meet the number of candidate nodes, the network construction is stopped, and all these candidate nodes are sent to the candidate node set. By analyzing the robustness experiments, the improved reverse generative network algorithm can obtain smaller robustness values in the whole network or independent communities. This verifies that the improved imp_BGN algorithm can better select key nodes in the network or nodes that maintain network stability as candidate nodes. Finally, the utilization discount and the greedy algorithm considering the network structure and node position relationship select the final seed node from the candidate node set. The propagation scale and average shortest path length experiments of the algorithm prove that the seed nodes selected by the proposed CBGN method have a faster infection speed and larger infection scale, and the seed node sets are scattered on most of the network.

综上所述，本发明专利的主要贡献总结如下：To sum up, the main contributions of the patent of the present invention are summarized as follows:

(1)提出了一种新的最小化代价函数的反向生成网络方法Imp_BGN，利用图遍历这一新视角，通过构建一个目标节点为根节点的广度优先搜索树(BFS)来评估每一个节点，从BFS树中可以得到每一个节点的影响得分，利用影响得分来辅助构造反向生成网络，并且由于首先加入的是最不重要的节点，可以使生成的网络不必恢复到原始网络，未加入到网络中的节点直接加入候选节点集，大大减小了计算时间。(1) A new reverse generative network method, Imp_BGN, is proposed to minimize the cost function. Using the new perspective of graph traversal, each node is evaluated by constructing a breadth-first search tree (BFS) with the target node as the root node. , the influence score of each node can be obtained from the BFS tree, and the influence score can be used to assist in constructing the reverse generation network, and since the least important node is added first, the generated network can be saved from the original network without being added. The nodes in the network are directly added to the candidate node set, which greatly reduces the computation time.

(2)改进了CELF算法，考虑到节点与节点之间的共有邻居及位置关系会使选择出的节点集影响范围重叠，设计出了一种相似性评价指标，应用在了CELF选择种子节点的过程。(2) The CELF algorithm is improved. Considering that the common neighbors and positional relationships between nodes will overlap the influence of the selected node set, a similarity evaluation index is designed, which is applied to the CELF selection of seed nodes. process.

(3)构想出了一个基于社区的反向生成网络框架CBGN来选择复杂网络中一组有影响力的节点，考虑了网络的社区结构，利用社区对算法进行加速，并结合了网络的连通性、图遍历以及启发式算法和贪心算法的优点。(3) Conceived a community-based reverse generative network framework CBGN to select a group of influential nodes in a complex network, considering the community structure of the network, using the community to accelerate the algorithm, and combining the connectivity of the network , graph traversal, and the advantages of heuristics and greedy algorithms.

(4)对提出的CBGN方法进行鲁棒性、影响力传播规模和节点间平均最短路径长度等实验评估，在Inf-USAir等6个真实网络上的实验结果表明，提出的算法相比已有的先进方法更有竞争性。(4) The proposed CBGN method is experimentally evaluated on the robustness, influence propagation scale and average shortest path length between nodes. advanced methods are more competitive.

此外，对于影响力节点识别问题，从不同角度来看仍有很多挑战，例如，如何高效地挖掘大规模网络影响力节点，在时变网络上影响力节点可能会随着拓扑的变化而变化以及在多层网络中如何更好地结合不同层之间的信息等。未来的工作将更进一步地扩展到加权网络、时变网络、多层网络和异质网络上。In addition, for the problem of influential node identification, there are still many challenges from different perspectives, such as how to efficiently mine large-scale network influential nodes, on a time-varying network, the influential nodes may change with the change of topology and How to better combine information between different layers in a multi-layer network, etc. Future work will further extend to weighted networks, time-varying networks, multilayer networks, and heterogeneous networks.

尽管已经示出和描述了本发明的实施例，本领域的普通技术人员可以理解：在不脱离本发明的原理和宗旨的情况下可以对这些实施例进行多种变化、修改、替换和变型，本发明的范围由权利要求及其等同物限定。Although embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, The scope of the invention is defined by the claims and their equivalents.

Claims

1. A complex network influence node identification method based on a reverse generation network is characterized by comprising the following steps:

s1, community division: carrying out community division on the network by using a Louvain algorithm;

s2, generating a candidate node set: node information is collected by using graph traversal so as to assist in constructing an improved reverse generation network, and then a part of nodes are selected to be added into a candidate node set;

s3, selecting a seed node: and finding out the final k influential nodes from the candidate node set as seed nodes.

2. The method for identifying the complex network influence node based on the backward generation network according to claim 1, wherein the size of the community at least has the following condition:

C_size＝size(G)*η (9)

wherein C _ size represents the community size;

size (G) represents the number of nodes of network G;

eta is an adjustable parameter for controlling the size of the community that satisfies the condition.

3. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S2 comprises:

s2-1, calculating the adding cost of the remaining nodes, selecting the nodes with the minimized cost function to construct the network, wherein the cost function is as follows:

where cost (u, n +1) indicates that node u is at (n +1) ^th A cost function of the time step;

is shown in (n +1) ^th Time step, the size of the maximum connected component for adding node u to network G, i.e.

AUC _u AUC value representing node u;

AUC _max represents the maximum AUC value among all nodes;

AUC _min represents the smallest AUC value among all nodes;

ξ is a sufficiently small positive parameter;

s2-2, after the node is added, the size of each connected component in the network needs to be updated, and the size of the maximum connected component in the network is recorded;

s2-3, repeating the steps S2-1-S2-2 until the number of the remaining nodes meets the required candidate node number, and at the same time stopping constructing the network.

4. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S2 further comprises:

in order to further narrow the search range and ensure that there are candidate nodes with proper quality, the size of a candidate node set is set in an independent network formed by each community, and the formula is as follows:

where cand _ num _i Representing the size of the ith community candidate node set;

(C_size _i -C_size _min )/(C_size _max -C_size _min ) Representing the proportion of the ith community among all the selected communities;

β is an amplification parameter;

k is the number of seed nodes that ultimately need to be selected.

5. The complex network influence node identification method based on the backward generation network according to claim 1, wherein the graph traversal comprises:

the degree centrality is selected as an initial center score in graph traversal, and the degree centrality is optimized through a graph traversal framework, so that an AUC score which is finer in granularity and can measure the influence of the nodes can be generated.

6. The complex network impact node identification method based on the backward generation network of claim 1, wherein the S3 comprises:

s3-1, selecting k in the candidate node set through a degree discount algorithm ₁ A node;

the calculation formula of the degree discount algorithm is as follows:

gdd therein _v Represents a degree discount for node v;

d _v degree representing node v;

t _v represents the number of infected neighbors of node v;

t _w representing the number of infected neighbors of a susceptible neighbor node w of the node v;

p represents the probability of infection;

s3-2, selecting k through improved sub-model algorithm ₂ A node;

k ₂ ＝k-k ₁ (13)

k is the number of the seed nodes which need to be selected finally;

mu is an adjustable parameter to balance greedy algorithm and heuristic algorithm;

c represents the total number of communities in the network that meet a certain scale;

cand_num _i representing the size of the ith community candidate node set;

the improved sub-modular algorithm selection k ₂ The specific steps of each node are as follows: and if the node u selected in the process of each round of improved sub-model algorithm selection is similar to the node selected before or the node selected in the heuristic process, not selecting the node and removing the node from the candidate node set.

7. The complex network influence node identification method based on the backward generation network of claim 6, wherein the similarity is determined by the following formula to determine whether the nodes are similar, and if the following formula is satisfied, the node u is similar to the node v;

where sim represents the similarity of node u and node v;

n (u) represents a set of neighbor nodes for node u;

n (v) a set of neighbor nodes representing node v;

n (u) andn (v) represents the number of neighbors shared by N (u) and N (v);

| represents the number of sets;

ε is a parameter approaching 0;

abs (·) represents the absolute value;

ks (u) is the normalized k-shell index for node u;

ks (v) is a normalized k-shell index for node v.

8. The method for identifying the complex network influence node based on the backward generation network according to claim 1, further comprising evaluating the method by using the following performance indexes: robustness value, cumulative distribution function, SIR model, and average shortest path length.