
GB2545931A - Defining edges and their weights between nodes in a network - Google Patents


Info

Publication number
GB2545931A
GB2545931A GB1523224.2A GB201523224A GB2545931A GB 2545931 A GB2545931 A GB 2545931A GB 201523224 A GB201523224 A GB 201523224A GB 2545931 A GB2545931 A GB 2545931A
Authority
GB
United Kingdom
Prior art keywords
nodes
similarity
network
pair
measure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1523224.2A
Other versions
GB201523224D0 (en)
Inventor
Francis Murphy Dominic
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB1523224.2A priority Critical patent/GB2545931A/en
Publication of GB201523224D0 publication Critical patent/GB201523224D0/en
Publication of GB2545931A publication Critical patent/GB2545931A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14 Network analysis or design
    • H04L 41/142 Network analysis or design using statistical or mathematical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mathematical Optimization (AREA)
  • Algebra (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Defining edges and edge weights between nodes in a network to form a graph database. The similarity of nodes in the network is determined and an edge is defined between nodes whose similarity measure exceeds a threshold level 805, 806. The determined similarity between the nodes defines the weight of the edge 807. The nodes each have an associated attribute set to which a similarity function 804 is applied pairwise to give a similarity measure for each pair of nodes. Storing the similarity of nodes in this manner allows the influence of nodes to be determined and ranked, facilitating decisions such as which nodes should receive patches first or defining attack vectors. The nodes may be network nodes and the similarity may be based on attributes 801, 803 such as their features or capabilities, e.g. based on a node's UPnP XML description. Other embodiments for the nodes are also encompassed, such as citation networks, social networks, identity management and even fraud detection or protein sequence comparisons. The similarity measure may be based on a Jaccard similarity or Tversky index. An adjacency matrix may be determined with a threshold to reduce/prune the number of edges.

Description

Defining Edges and Their Weights Between Nodes in a Network
Introduction
The present invention relates to defining edges and the weights thereof between nodes in a network.
It is known to represent entities by way of a graph, in which the entities are the nodes of the graph and relationships between them are represented as edges. Much has been discovered in terms of the properties of graphs, and highly sophisticated mathematics and algorithms have been developed to allow inferences to be made from the graph. For example, graphs representing computer networks can be analysed to identify the spanning tree for the network, whilst citation networks can be analysed to identify the most important author. However, a problem still remains in terms of actually defining the graph from some information about the entities in a network, particularly in a deterministic and timely manner.
Thus according to an aspect of the present invention, there is provided a method of defining edges and the weights thereof between nodes in a network, where said nodes each have an associated attribute set, the method comprising, for each pair of nodes in the network: applying a similarity function to the attribute sets of each of the pair of nodes to obtain a similarity measure for the pair; comparing the similarity measure to a threshold value; and in response to the similarity measure for the pair being greater than the threshold value, defining an edge between the pair of nodes, and weighting said edge with the similarity measure, in response to the similarity measure for the pair being less than or equal to the threshold value, not defining an edge between the pair of nodes.
According to another aspect of the present invention, there is provided apparatus for defining edges and the weights thereof between nodes in a network, where said nodes each have an associated attribute set, the apparatus comprising memory, a network interface and a processing device configured to: create and store a graph database in memory; obtain and store in the graph database the identity of the nodes in said network and their associated attribute sets; and for each node in the network: apply a similarity function to the attribute sets of each pair of nodes to obtain a similarity measure for the pair, compare the similarity measure to a threshold value, and in response to the similarity measure for the pair being greater than the threshold value, create an edge between the pair of nodes in the graph database and assign the similarity measure as a property of the edge in the graph database; in response to the similarity measure for the pair being less than or equal to the threshold value, not create an edge between the pair of nodes in the graph database.
The invention will now be described by way of example only with reference to the accompanying drawings, of which:
Figure 1 shows an example of the application of the present invention to a collection of entities;
Figure 2 shows the process of deriving the principal left eigenvector by a power method;
Figure 3 shows an environment in which the invention may be deployed;
Figure 4 shows details of the hardware configuration of the central station 321;
Figure 5 shows an overview of procedures undertaken by the central station 321;
Figure 6 shows a mapping of data in memory at the central station 321;
Figure 7 shows steps undertaken to refresh the graph;
Figure 8 shows steps undertaken to form edges for a new node;
Figure 9 shows steps undertaken to update edges for a changed node; and
Figure 10 shows steps undertaken to rank the nodes.
Overview of the Invention
A collection of entities may each have a number of attributes associated with them. The present invention recognises that, as such, comparison between the entities in terms of their attributes can allow discovery of an underlying network connecting the entities. The results of the comparison, by using a suitable function, will be indicative of the relative similarity between the entities, allowing further analysis of the network to be performed.
Formally, we therefore define a weighted digraph G = (V, E, w) given by a set of nodes V representing the entities, which are connected by edges (u, v) ∈ E where u, v ∈ V. Each edge has a weight associated with it, which is defined by w(u, v) ∈ ℝ+.
In the present invention, the weight w(u, v) is generated by applying a similarity function to the attribute sets Au and Av. In principle, the attribute sets may be any representation of the attributes of the entities. As will be described in the context of the present embodiment of the invention, the attribute sets are XML documents detailing, in a structured manner according to a schema, the description of a computing device. Any other attributes could be compared, from the content of scholarly documents to protein sequences, for example.
In the present invention, the similarity function may be any real-valued function which quantifies the similarity between two objects. In an embodiment of the present invention, the function used is the Jaccard similarity coefficient, which calculates, for two sets A and B, the ratio of the number of elements of their intersection to the number of elements of their union:

J(A, B) = |A ∩ B| / |A ∪ B|
This returns a value between zero and unity, and has a value of zero when the sets are disjoint and one when they are equal. Use of the Jaccard similarity coefficient allows the use of efficient algorithms such as MinHash (a min-wise independent permutations locality sensitive hashing scheme) to arrive at the edge weights w. Thus, when used with large collections, the graph may be defined within a reasonable timeframe.
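By way of illustration only, the following Python sketch computes the exact Jaccard similarity coefficient and a MinHash-based estimate of it; the signature length and hash construction are assumptions made for the example and are not taken from the original text.

```python
import hashlib
from typing import List, Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity coefficient J(A, B) = |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minhash_signature(attributes: Set[str], num_hashes: int = 128) -> List[int]:
    """Build a MinHash signature by salting a base hash num_hashes times."""
    signature = []
    for seed in range(num_hashes):
        values = [int(hashlib.sha1(f"{seed}:{a}".encode()).hexdigest(), 16)
                  for a in attributes]
        signature.append(min(values) if values else 0)
    return signature

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions approximates J(A, B)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For example, jaccard({"a", "b"}, {"b", "c"}) returns 1/3, and the MinHash estimate converges towards the same value as the signature length grows.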
Whilst those skilled in the art will appreciate that similarity metrics such as the Jaccard similarity coefficient are inherently symmetric, in that the commutativity relationship J(A, B) = J(B, A) holds, the present invention's use of digraphs enables the application of asymmetric models of similarity such as the Tversky index. Such asymmetric similarity measures are again typically valued between zero and unity, but they are non-commutative. It has been shown that such asymmetric measures can produce better nearest neighbours, and correlate more strongly with experimental data on human judgments of similarity between entities. Other, more sophisticated similarity measures may be used, such as the most frequent k characters algorithm, which may be used in text mining applications.
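For reference, the Tversky index mentioned above is conventionally defined for sets A and B, with non-negative weighting parameters α and β, as:

S(A, B) = |A ∩ B| / (|A ∩ B| + α|A \ B| + β|B \ A|)

Setting α = β = 1 recovers the Jaccard similarity coefficient, whilst choosing α ≠ β yields the non-commutative behaviour discussed above.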
Given a weighted digraph G = (V, E, w), the weighted adjacency matrix is a |V| × |V| square matrix W. The entries of the matrix are given by Wuv = w(u, v) for (u, v) ∈ E where u, v ∈ V, and Wuv = 0 otherwise. The storage requirement for the weighted adjacency matrix representation is proportional to the square of the number of vertices.
Given a weighted digraph G = (V, E, w), the adjacency list for a node v ∈ V is a list whose entries are the nodes connected to v by edges (u, v) ∈ E as before, together with the weights w(u, v). There are therefore 2 deg(v) entries in the adjacency list. To store the full graph, there are therefore |V| adjacency lists, and the storage requirement for the adjacency list representation is therefore proportional to the number of edges and vertices in the graph.
Creating adjacency lists is important for two reasons. First, most networks do not contain many highly similar nodes. Indeed, most pairs of nodes are dissimilar, and so the adjacency matrix contains a large number of entries that are very low in value and do not define a meaningful relationship between the entities in the network. By thresholding, these edges are not created and the corresponding entries in the adjacency matrix are zero. The matrix is therefore sparse, and storing it in full would consume a large amount of memory for a disproportionately small amount of information. Thus, by using adjacency lists to store the edges and their weights between nodes, substantial savings in memory utilisation are made when the size of the network is large.
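A minimal sketch of this thresholded adjacency-list construction, assuming attribute sets held as Python sets keyed by node identifier (the function and parameter names are illustrative, not taken from the original text):

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

AdjacencyLists = Dict[str, List[Tuple[str, float]]]

def build_adjacency_lists(attribute_sets: Dict[str, Set[str]],
                          similarity: Callable[[Set[str], Set[str]], float],
                          threshold: float = 0.01) -> AdjacencyLists:
    """Create an edge, carrying its weight, only where similarity exceeds the threshold."""
    adjacency: AdjacencyLists = {node: [] for node in attribute_sets}
    for u, v in combinations(attribute_sets, 2):
        weight = similarity(attribute_sets[u], attribute_sets[v])
        if weight > threshold:
            # A symmetric measure lets both directed edges reuse one computation.
            adjacency[u].append((v, weight))
            adjacency[v].append((u, weight))
    return adjacency
```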
Figure 1
An example of the application of the present invention is therefore shown in Figure 1, with a collection of four entities 101 (shapes) having attributes represented in the illustration by their number of sides. Thus, shape 102 is a circle, shapes 103 and 104 are squares and shape 105 is a pentagon. For the purposes of this example, a similarity function 106 is defined which compares the number of sides of the shapes and returns a similarity measure W in accordance with the following discontinuous function:
where |a| denotes the number of sides of a shape a.
Thus, the adjacency matrix 107 for the collection is as follows:
Then, by comparing each weight to a threshold of, say, 0.5, the number of edges is pruned appropriately, giving a post-thresholding adjacency matrix 108:
Adjacency lists for the collection are therefore as follows:
A: B,1; C,1; D,1
B: C,1; D,0.8
C: B,1; D,1
D: B,0.8; C,0.8
As can be seen, the thresholding operation minimises the number of edges created. Whilst some information is clearly encoded by similarity measures close to zero, in large networks the value of that information tends not to be worth the detrimental performance of a graph that can no longer be stored using techniques for sparse matrices. By using adjacency lists, the amount of memory required, given the sparseness of the graph, is far less than if the graph were stored as a matrix. Some operations are simpler on adjacency matrices than on adjacency lists, such as computing anything that requires knowledge of the inbound edges of a node. However, techniques have been developed to shuffle and sort edges by their destination nodes, allowing computation over the inbound edges and mitigating this difficulty. In addition, for extremely large graphs, the adjacency list representation allows more efficient approaches to graph exploration to be used, such as MapReduce.
Figure 2
Following creation of the adjacency lists, the underlying network for the collection of entities can be analysed. In accordance with an embodiment of the present invention, the entities can be ranked in terms of their influence over the other entities in the collection. The influence is determined in part by the weighting of the edges between the nodes in the graph.
In a specific embodiment of the present invention, the ranking of the nodes in the graph is determined by the values of a principal left eigenvector of a modified adjacency matrix of the graph. In other words, the influence of an entity is determined by the influence of those entities it is connected to, which is moderated by the similarity of the entities (the edge weights). Thus, in this model, highly similar entities will pass and receive more influence between them than dissimilar ones. A power method can be used to find the eigenvector and thus the influence of each node, and convergence is reached quickly due to the eigengap of the modified adjacency matrix.
The reason for modifying the weighted adjacency matrix by a damping factor is the possibility of the graph having cycles and dangling nodes. A damping factor effectively prevents particular nodes from forming infinite sinks for influence in the network, which would lead to a distorted picture of the actual distribution of influence throughout the graph. A suitable damping factor has been found to be 0.85.
Figure 2 illustrates the process of deriving the principal left eigenvector by a power method.
Following modification of the adjacency matrix of the graph by a damping factor d, which, in an example, is 0.85, the principal left eigenvector may be approximated by first initialising the eigenvector values at 1 at step 201. Each eigenvector value IA, IB, IC and ID in this example is evaluated in turn at step 202. The process is repeated at step 203 until convergence is reached. The threshold for convergence is a trade-off between accuracy and the available computation time. It has been found that the eigenvector becomes stationary to a satisfactory degree within ten iterations.
In practice, due to the number of nodes in the network, more efficient techniques may be employed such as a MapReduce solver, an adaptive power method or via graph aggregation, as will be apparent to those skilled in the art.
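The following Python sketch illustrates a damped power iteration of this kind over the adjacency-list structure from the earlier sketch; the handling of dangling nodes is simplified here (their outgoing influence is simply dropped), which is an assumption of the example rather than part of the original description.

```python
from typing import Dict, List, Tuple

def rank_by_influence(adjacency: Dict[str, List[Tuple[str, float]]],
                      damping: float = 0.85,
                      iterations: int = 10) -> Dict[str, float]:
    """Approximate the principal left eigenvector of the damped, weighted adjacency structure."""
    nodes = list(adjacency)
    n = len(nodes)
    if n == 0:
        return {}
    influence = {node: 1.0 for node in nodes}  # step 201: initialise every value at 1
    out_weight = {node: sum(w for _, w in edges) for node, edges in adjacency.items()}
    for _ in range(iterations):                # step 203: iterate towards convergence
        incoming = {node: 0.0 for node in nodes}
        for u, edges in adjacency.items():
            if out_weight[u] == 0.0:
                continue  # dangling node: its influence is not propagated in this sketch
            for v, w in edges:
                incoming[v] += influence[u] * (w / out_weight[u])
        influence = {node: (1.0 - damping) / n + damping * incoming[node]
                     for node in nodes}        # step 202: re-evaluate each value
    return influence
```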
It will be appreciated that the use of the thresholding process in determining whether an edge is created between nodes in the graph representation of the collection assists in speeding up the ranking process.
This is because an edge between two barely similar entities takes just as long to process as an edge between highly similar entities, yet contributes so little to the ranking of the entities in terms of influence that it does not warrant the inefficiency and complexity it creates.
An Embodiment
As previously described, the principles of the present invention have application to any network in which the entities represented as nodes have attributes which may be compared by a similarity measure. A specific embodiment of the present invention involves the creation of a graph and the subsequent analysis thereof for distinct devices accessible via the Internet.
Figure 3
An environment in which the invention may be deployed is therefore illustrated in Figure 3. A plurality of computing devices 301 to 319 are connected to the Internet 320. In addition, a central station 321, embodying an aspect of the present invention, is also connected to the Internet 320, and maintains a database of the computing devices and connections between them in accordance with the methods of the present invention. In practice, many more computing devices would be connected to the Internet and monitored by the central station.
Each one of the computing devices 301 to 319 has a variety of capabilities, some of which are distinct, some of which match, and some of which are backwards compatible to some degree. In the present example, each one of the computing devices stores its capabilities in an XML document structured as a UPnP (RTM) Device Description. The UPnP (RTM) Device Description includes vendor-specific manufacturer information such as the model name and number, serial number, manufacturer name, and a list of embedded services. Each service description includes a list of the commands, or actions, to which the service responds, and the parameters, or arguments, for each action. The description for a service also includes a list of variables; these variables model the state of the service at run time and are described in terms of their data type, range and event characteristics.
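Purely as an illustration of how such a Device Description might be reduced to an attribute set for comparison, the sketch below flattens an XML document into a set of tag=value tokens; the element names in the example document and the flattening strategy are assumptions made for this example, not part of the original disclosure.

```python
import xml.etree.ElementTree as ET
from typing import Set

def attribute_set_from_device_description(xml_text: str) -> Set[str]:
    """Flatten a UPnP-style Device Description into a set of 'tag=value' tokens."""
    root = ET.fromstring(xml_text)
    attributes: Set[str] = set()
    for element in root.iter():
        tag = element.tag.split('}')[-1]      # strip any XML namespace from the tag name
        text = (element.text or "").strip()
        if text:
            attributes.add(f"{tag}={text}")
    return attributes

example = """<root>
  <device>
    <friendlyName>Living Room Printer</friendlyName>
    <manufacturer>ExampleCorp</manufacturer>
    <modelName>LaserJet-9000</modelName>
    <serviceList><service><serviceType>urn:example:service:Print:1</serviceType></service></serviceList>
  </device>
</root>"""
print(attribute_set_from_device_description(example))
```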
In the present example, the central station 321 is a server computer configured to maintain a graph database keeping track of a very large number of computing devices, allowing it to respond to queries from those computing devices as to which other computing devices form part of their own network. Thus, the central station 321 uses the method of the present invention to define edges and the weights thereof between the computing devices in the network. As will be appreciated by those skilled in the art, a graph database is one in which data is conceptualised and stored as a graph, making it well suited to analysing the interconnections.
In addition, the maintenance of such a graph database allows the central station to utilise the present invention's method of ranking the computing devices by their influence. This can assist in identifying the most important, and thus most critical, of the computing devices in the network. In the present example, these inferences can be used for identifying the likely vectors of network attacks, and can assist in making decisions as to the order in which to patch devices.
Figure 4
Details of the hardware configuration of the central station 321 are illustrated in Figure 4.
In order for central station 321 to execute instructions, it comprises a processor such as central processing unit (CPU) 401. In this instance, CPU 401 is a single multi-core Intel® Xeon® CPU. It is possible that in other configurations the processor comprises several such CPUs to provide a high degree of parallelism in the execution of instructions.
Memory is provided by 32 gigabytes of DDR3 random access memory (RAM) 402, which allows storage of frequently-used instructions and data structures by central station 321.
Permanent storage is provided by a storage device such as hard disk drive 403, which in this instance is a mechanical Serial Attached SCSI (SAS) hard disk drive and has a capacity of one terabyte. Hard disk drive 403 stores an operating system and application data. Alternatively, the storage device could be a solid state drive to provide higher performance. As will be appreciated by those skilled in the art, in alternative embodiments a number of hard disk drives could be provided and configured as a RAID array to improve data access times and/or data redundancy.
A network interface allows central station 321 to connect to and receive network traffic over an internal network (not shown) and thus ultimately the Internet 320. In this example, the network interface is provided by a 1 Gbps Ethernet network interface 404, but could alternatively be a 10 Gbps interface in cases where a larger degree of bandwidth is required.
Additionally, central station 321 comprises a number of human interfaces 405, including a keyboard, a mouse and a visual display unit, that allow an administrator to interact with and configure it. In alternative cases, central station 321 does not include human interfaces, and an administrator interacts with and configures it via another computer using a protocol such as secure shell (SSH).
Central station 321 also comprises an optical drive, such as a CD-ROM drive 406, into which an optical disk, such as a CD-ROM 407, can be inserted. CD-ROM 407 comprises computer-readable instructions that are installed on hard disk drive 403, loaded into RAM 402 and executed by CPU 401. Alternatively, the instructions (illustrated as 408) may be transferred from a network location via the internal network using network interface 404.
It is to be appreciated that the above system is merely an example of a configuration of system that can fulfil the role of central station 321. Any other system having a processor, memory, and a network interface could also be used.
Figure 5
An overview of procedures undertaken by the central station 321 is shown in Figure 5.
At step 501, the central station 321 is powered on, and at step 502 its operating system is loaded into main memory and boots. At step 503, graph database instructions are loaded and the database is initialised. If the instructions are not already present on the hard disk drive 403, they are obtained during step 503 either from a computer-readable medium such as CD-ROM 407 or from a network location as network-sourced instructions 408. Once the graph database instructions have been loaded into main memory, the network analysis instructions of the present invention are loaded ready for use at step 504. Again, if the instructions are not already present on the hard disk drive 403, they are obtained during step 504 either from a computer-readable medium such as CD-ROM 407 or from a network location as network-sourced instructions 408.
Control proceeds to step 505, whereupon the graph database is established in memory in accordance with the graph database instructions. The central station 321 is then in a position to begin monitoring for network state changes, which in the present embodiment include the addition of, removal of or making of changes to a computing device. Thus at step 506, a network state change occurs, resulting in an update to the graph being carried out at step 507. The procedures carried out during step 507, which depend on the nature of the change to the monitored network of devices, will be described with reference to Figures 7, 8 and 9.
As described previously, in an embodiment of the present invention, ranking of nodes in the graph is performed on the basis of their calculated influence in the network. Thus, following a change in the network state, the nodes are (re-)ranked accordingly at step 508. Control then returns to step 506 when a further network state change occurs.
By providing a graph database in this form, the central station 321 can be responsive to queries either from the connected computing devices 301 to 319 as to the existence of other devices in their own network, or from other sources of queries wishing to analyse the network. Provision of functionality in the present embodiment for ranking according to the influence of nodes in the network allows assessments to be made as to, for example, the most likely targets for attack, or the devices to prioritise for patching, etc.
Figure 6
A mapping of data in memory at the central station 321 following initialisation is shown in Figure 6.
The operating system, which in the present example is GNU/Linux (RTM), is shown at layer 601. Alternative operating systems may be used in other implementations, such as Microsoft Windows (RTM). Above the operating system layer 601, the graph database instructions are held in layer 602. In the present embodiment, the graph database is Neo4j, which is an open source database implemented in Java (RTM) and Scala. In this particular graph database, all data is stored in the form of either an edge, a node or an attribute thereof.
The network analysis instructions of the present invention are held at layer 603 and are invoked whenever a network state change occurs (step 506).
Above the operating system and instructions are the data structures created and accessed by the graph database and the network analysis instructions. At layer 604, a list of all of the nodes in the graph is held. In the present example, each node is stored with a node ID. This then links to the stored attribute lists for the nodes at layer 605. In an embodiment, the attribute sets are copies of the attribute sets submitted to the central station 321 by each network connected device, i.e. its UPnP (RTM) Device Description. In an alternative embodiment, the stored attribute sets are hashed versions of the Device Descriptions so as to achieve a degree of compression. Finally, the adjacency lists for the graph are held in memory at layer 606, storing, for each node, the connected nodes and the weight of the edge connecting them.
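As an illustration only, the three data layers described above (604 to 606) might be mirrored in Python as follows; the node identifiers, attribute tokens and weights are invented for the example.

```python
# Layer 604: the list of node identifiers held in the graph.
nodes = ["device-301", "device-302", "device-303"]

# Layer 605: attribute sets (or their MinHash signatures) keyed by node identifier.
attribute_sets = {
    "device-301": {"manufacturer=ExampleCorp", "modelName=LaserJet-9000"},
    "device-302": {"manufacturer=ExampleCorp", "modelName=LaserJet-9100"},
    "device-303": {"manufacturer=OtherCo", "modelName=Router-X"},
}

# Layer 606: adjacency lists storing, for each node, the connected nodes and edge weights.
adjacency = {
    "device-301": [("device-302", 0.5)],
    "device-302": [("device-301", 0.5)],
    "device-303": [],
}
```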
Figure 7
Steps undertaken to refresh the graph are set out in Figure 7.
Following identification of a change to the state of the network at step 506, step 507 is started and a question is asked at step 701 as to whether the change is caused by the addition of a new computing device. If so, control proceeds to step 702 where a new node is created in the graph database. In the present example, this means that a new entry is made in the list of nodes at memory layer 604.
At step 703, the attribute set for the new device is obtained, and in the present example stored in memory layer 605 at step 704. As described previously, in an alternative embodiment a hash is taken of the retrieved attribute set which is then stored instead. In an example, a MinHash is taken of the attribute set.
Control then proceeds to step 705 where appropriate edges are formed for the new node in the graph. This process will be described with reference to Figure 8. Following formation of the edges, step 507 is complete.
If the question asked at step 701 was answered in the negative, then control proceeds to step 706 where the attributes for the node are updated. Depending upon the implementation, this either involves overwriting the stored attribute set, or performing a hash such as a MinHash of the updated attributes and overwriting those already stored in memory layer 605.
Following updating of the attributes, the node's edges are updated at step 707. This process will be described with reference to Figure 9. Following appropriate updates to the edges, step 507 is complete.
Figure 8
Steps undertaken to form edges for a new node are set out in Figure 8.
The attribute set I for the new node i is loaded at step 801. At step 802, a temporary integer j is set to 1, after which the attribute set J for node j is loaded at step 803 (skipping the instance where the new node itself is selected). A similarity function, defined generally as S(A, B), is then applied at step 804 to the attribute sets I and J of the nodes i and j. In the present example, the similarity function is symmetric. In particular, the similarity function is the Jaccard similarity coefficient in the present embodiment. Thus, step 804 evaluates the ratio of the number of elements of the intersection to the number of elements of the union of the attribute sets I and J. A question is then asked at step 805 as to whether the similarity measure calculated at step 804 exceeds a threshold value. In the present embodiment, the threshold value is set at 0.01. Any other threshold may be used, however, depending upon the information in the attribute sets and the utility of similarities towards the lower end of the scale, balanced against the requirement to encourage a sparse adjacency matrix to improve storage efficiency.
If the measure does exceed the threshold, with the question asked at step 805 being answered in the affirmative, then control proceeds to step 806 in which an edge (i, j) is created in the graph database and the computed result of the similarity function is stored as a property (the weight) w(i, j) for the edge. As the similarity function in the present example is symmetric, such that S(A, B) = S(B, A), at step 807 another edge (j, i) is created and the same computed result of the similarity function is stored as the weight w(j, i). This means that the similarity function need not be calculated twice.
As will be appreciated by those skilled in the art, in alternative embodiments in which the similarity function is asymmetric, such as an application of the Tversky index, the procedure of applying the similarity function to the attribute sets I and J and comparing the result to the threshold value would need to be carried out twice, once in each direction.
In any event, following creation of the edges (i, j) and (j, i), or if the computed similarity measure did not exceed the threshold and edges were not created, control proceeds to step 808 where the integer j is incremented by one. A question is then asked at step 809 as to whether j is equal to the total number of nodes n. If not, control returns to step 803 where the next node j ≠ i is considered. However, if j is equal to n, then the node i has been compared to all other nodes and step 705 is complete.
Whilst it will be appreciated that the present invention employs directed edges in the graph so as to provide support for both symmetric and asymmetric similarity functions, it will be clear to those skilled in the art that if the implementation of the present invention only calls for a symmetric similarity function, then a non-directed graph may be created and so step 807 may be omitted.
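Steps 801 to 809 might be sketched as follows for a symmetric similarity function, continuing the in-memory structures used in the earlier sketches; a real implementation would issue graph database operations rather than dictionary updates.

```python
from typing import Callable, Dict, List, Set, Tuple

def add_node(new_node: str,
             new_attributes: Set[str],
             attribute_sets: Dict[str, Set[str]],
             adjacency: Dict[str, List[Tuple[str, float]]],
             similarity: Callable[[Set[str], Set[str]], float],
             threshold: float = 0.01) -> None:
    """Create a node and form weighted edges to every sufficiently similar existing node."""
    attribute_sets[new_node] = new_attributes          # steps 703-704: store the attribute set
    adjacency[new_node] = []                           # step 702: create the node entry
    for other, other_attributes in attribute_sets.items():
        if other == new_node:
            continue                                   # step 803: skip the new node itself
        weight = similarity(new_attributes, other_attributes)   # step 804
        if weight > threshold:                                   # step 805
            adjacency[new_node].append((other, weight))          # step 806: edge (i, j)
            adjacency[other].append((new_node, weight))          # step 807: edge (j, i)
```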
Figure 9
Steps undertaken to update edges for a changed node are set out in Figure 9.
In a similar way to step 705, the attribute set I for the updated node i is loaded at step 901. At step 902, a temporary integer j is set to 1, after which the attribute set J for node j is loaded at step 903 (skipping the instance where the updated node itself is selected).
The similarity function in use, S(A, B), is then applied at step 904 to the attribute sets I and J of the nodes i and j. Again, in this example, the similarity function is the Jaccard similarity coefficient. A question is then asked at step 905 as to whether the similarity measure calculated at step 904 exceeds a threshold value which, again, in the present embodiment, is set at 0.01. As previously described, given the symmetric nature of the Jaccard similarity coefficient in the present example, there is no need to re-compute S(I, J) and so the result can be used for both the edge (i, j) and the edge (j, i).
Thus, if the similarity measure exceeds the threshold value, then a further question is asked at step 906 as to whether an edge (i, j) exists. If not, then control proceeds to step 907 where an edge (i, j) and an edge (j, i) are created in the graph database. Control then proceeds to step 908, which is also invoked directly if the question asked at step 906 is answered in the affirmative, whereupon the computed result of the similarity function is stored as the weight w(i, j) for the edge (i, j), and as the weight w(j, i) for the edge (j, i).
If the question asked at step 905 was answered in the negative, to the effect that the computed similarity measure does not exceed the threshold, then control proceeds instead to step 909, which poses a question as to whether an edge (i, j) already exists due to a sufficient similarity prior to the update to the node i. If the question asked at step 909 is answered in the affirmative, then at step 910 the edges (i, j) and (j, i) are removed from the graph database.
Following completion of steps 908 or 910, or if the question asked at step 909 was answered in the negative, then processing of the relationship between the updated node i and the node j under consideration is complete. At step 911, the iterative variable j is incremented by one, and a question is then asked as to whether j is equal to the total number of nodes n. If not, control returns to step 903 where the next node j ≠ i is considered. However, if j is equal to n, then the node i has been compared to all other nodes and step 707 is complete.
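In the same illustrative style, the update procedure of steps 901 to 911 might look as follows; removing and re-inserting the affected edges collapses steps 906 to 910 into a single path, which is a simplification made for the sketch.

```python
from typing import Callable, Dict, List, Set, Tuple

def update_node(node: str,
                new_attributes: Set[str],
                attribute_sets: Dict[str, Set[str]],
                adjacency: Dict[str, List[Tuple[str, float]]],
                similarity: Callable[[Set[str], Set[str]], float],
                threshold: float = 0.01) -> None:
    """Re-evaluate every potential edge of a changed node, creating, re-weighting or removing edges."""
    attribute_sets[node] = new_attributes                        # step 706: update the attributes
    for other, other_attributes in attribute_sets.items():
        if other == node:
            continue                                             # step 903: skip the node itself
        weight = similarity(new_attributes, other_attributes)    # step 904
        # Drop any existing edges in both directions before deciding whether to recreate them.
        adjacency[node] = [(n, w) for n, w in adjacency[node] if n != other]
        adjacency[other] = [(n, w) for n, w in adjacency[other] if n != node]
        if weight > threshold:                                   # step 905
            adjacency[node].append((other, weight))              # steps 907-908
            adjacency[other].append((node, weight))
```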
It is worth noting that if the similarity measure chosen is linear, as with the Jaccard similarity coefficient for example, then a new device can adopt the network of another, similar node, with edges being weighted by adjusting the existing weights by the computed similarity coefficient. Updates to existing nodes and their effect on edge weights can also be effected in a similar manner, by comparing the new attribute set of a node to its old attribute set and adjusting the weight of each edge in its edge set by the computed similarity coefficient. Edges whose weight falls below the threshold can then be deleted. This leads to the possibility of highly efficient network discovery for the computing devices monitored by the central station 321.
Figure 10
Steps undertaken to rank the nodes are set out in Figure 10.
Following the network state change and the updating of the graph, step 508 is initiated and begins with the creation of modified adjacency lists using the damping factor d at step 1001. This process is carried out as described previously with reference to Figure 2, and the application of the damping factor d to the adjacency list representation of the graph, rather than to the adjacency matrix, will be immediately apparent to those skilled in the art, since the modification process uses only addition and multiplication. In the present example, the damping factor is set at 0.85, but other values may be used. A low damping factor will result in fewer iterations being required for convergence, but can detrimentally affect the accuracy of the ranking.
Following modification of the adjacency lists, computation of the principal left eigenvector is performed at step 1002. As described previously, for small graphs, the power method may be used to iteratively compute the values of the principal left eigenvector until convergence is reached to the desired accuracy. As the size of the graph increases, however, the power method becomes progressively less efficient than techniques which can distribute the computation between a number of processes, such as MapReduce implementations. Such techniques are well documented in the literature.
Following computation, at step 1003 the nodes in the graph are assigned a rank in accordance with their corresponding values in the principal left eigenvector. The values are indicative of their influence over the other nodes within the graph, and as such can assist in identifying the most important actors within the network. The assigned rank can then be used to sort query results received by the central station 321.
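A short illustration of this ranking step, reusing the rank_by_influence sketch given earlier (the node identifiers printed are whatever keys the adjacency structure contains):

```python
influence = rank_by_influence(adjacency, damping=0.85, iterations=10)
ranking = sorted(influence, key=influence.get, reverse=True)
for position, node in enumerate(ranking, start=1):
    print(f"{position}. {node}: influence {influence[node]:.4f}")
```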
It will be appreciated that the embodiment described herein is merely exemplary and the present invention can be applied to a multitude of use cases. It is contemplated, for example, that the present invention can be applied to systems for fraud detection, identity management and social network analysis.

Claims (20)

What we claim is:
1. A method of defining edges and the weights thereof between nodes in a network to form a graph, where said nodes each have an associated attribute set, the method comprising, for each pair of nodes in the network: applying a similarity function to the attribute sets of each of the pair of nodes to obtain a similarity measure for the pair; comparing the similarity measure to a threshold value; and in response to the similarity measure for the pair being greater than the threshold value, defining an edge between the pair of nodes, and weighting said edge with the similarity measure, in response to the similarity measure for the pair being less than or equal to the threshold value, not defining an edge between the pair of nodes.
2. The method of claim 1, in which the similarity function returns a similarity measure having a value between zero and unity.
3. The method of claim 1 or claim 2, in which the threshold value is 0.01.
4. The method of any one of claims 1 to 3, in which the nodes are network-connected devices and the associated attribute sets define the features and capabilities of the devices.
5. The method of any one of claims 1 to 4, in which the nodes, the edges between them and the weights thereof are stored in a graph database.
6. The method of any one of claims 1 to 5, in which the similarity function computes the Jaccard similarity coefficient.
7. The method of any one of claims 1 to 5, in which the edges are directed edges so as to enable the storage of asymmetric similarity relationships between nodes.
8. The method of claim 7, further comprising a step of defining an influence measure for each node.
9. The method of claim 8, in which defining the influence measure for each node includes: modifying the adjacency matrix of the network by a damping factor; obtaining the principal left eigenvector of the modified adjacency matrix.
10. The method of claim 9, in which the network is represented in the form of adjacency lists and the principal left eigenvector is obtained directly from said adjacency lists.
11. The method of claim 10, in which the principal left eigenvector is obtained by a power method.
12. The method of any one of claims 8 to 11, further comprising a step of ranking the nodes in accordance with the influence measure.
13. Instructions executable by a computer that, when executed by the computer, cause the computer to perform the method defined by any one of claims 1 to 12.
14. A computer system programmed with the instructions of claim 13.
15. A computer-readable medium encoded with instructions executable by a computer that, when executed by the computer, cause the computer to perform the method defined by any one of claims 1 to 12.
16. Apparatus for defining edges and the weights thereof between nodes in a network to form a graph, where said nodes each have an associated attribute set, the apparatus comprising memory, a network interface and a processing device configured to: create and store a graph database in memory; obtain and store in the graph database the identity of the nodes in said network and their associated attribute sets; and for each node in the network: apply a similarity function to the attribute sets of each pair of nodes to obtain a similarity measure for the pair, compare the similarity measure to a threshold value, and in response to the similarity measure for the pair being greater than the threshold value, create an edge between the pair of nodes in the graph database and assign the similarity measure as a property of the edge in the graph database; in response to the similarity measure for the pair being less than or equal to the threshold value, not create an edge between the pair of nodes in the graph database.
17. Apparatus according to claim 16, wherein the processing device is configured to obtain the identity of nodes in the network and their associated attribute sets by way of a network discovery protocol using the network interface.
18. Apparatus according to claim 16, wherein the processing device is configured to obtain the identity of nodes in the network and their associated attribute sets by loading them from memory or a computer-readable medium.
19. Apparatus according to any one of claims 16 to 18, wherein the processing device is further configured to define an influence measure for each node.
20. Apparatus according to claim 19, wherein the processing device is further configured to rank the nodes in accordance with the influence measure.
GB1523224.2A 2015-12-31 2015-12-31 Defining edges and their weights between nodes in a network Withdrawn GB2545931A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1523224.2A GB2545931A (en) 2015-12-31 2015-12-31 Defining edges and their weights between nodes in a network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1523224.2A GB2545931A (en) 2015-12-31 2015-12-31 Defining edges and their weights between nodes in a network

Publications (2)

Publication Number Publication Date
GB201523224D0 GB201523224D0 (en) 2016-02-17
GB2545931A true GB2545931A (en) 2017-07-05

Family

ID=55406584

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1523224.2A Withdrawn GB2545931A (en) 2015-12-31 2015-12-31 Defining edges and their weights between nodes in a network

Country Status (1)

Country Link
GB (1) GB2545931A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241968A (en) * 2019-07-19 2021-01-19 上海高德威智能交通系统有限公司 Target tracking method and device
CN111046299B (en) * 2019-12-11 2023-07-18 支付宝(杭州)信息技术有限公司 Feature information extraction method and device for relational network
CN112016836B (en) * 2020-08-31 2023-11-03 中国银联股份有限公司 A method and device for determining similarity between objects
DE102021210846A1 (en) * 2021-09-28 2023-03-30 Robert Bosch Gesellschaft mit beschränkter Haftung Method for generating a graph structure for training a graph neural network
CN116094943B (en) * 2023-04-07 2023-06-06 湖南快乐阳光互动娱乐传媒有限公司 PCDN node importance ranking method, device and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1591923A1 (en) * 2004-04-30 2005-11-02 Microsoft Corporation Method and system for ranking documents of a search result to improve diversity and information richness
US20060136428A1 (en) * 2004-12-16 2006-06-22 International Business Machines Corporation Automatic composition of services through semantic attribute matching
US20070143278A1 (en) * 2005-12-15 2007-06-21 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs
US20110225173A1 (en) * 2010-03-11 2011-09-15 Yahoo! Inc Method and system for determining similarity score
US20150220530A1 (en) * 2014-01-31 2015-08-06 Google Inc. Efficient similarity ranking for bipartite graphs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lecture Notes in Computer Science, Correct System Design, number 558, pages 638-653, 23/09/2013, "Continuous similarity computation over streaming graphs", Valeri E. et al. *

Also Published As

Publication number Publication date
GB201523224D0 (en) 2016-02-17

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)