CN104077279B - A kind of parallel communities discovery method and apparatus - Google Patents
A kind of parallel communities discovery method and apparatus Download PDFInfo
- Publication number
- CN104077279B CN104077279B CN201310096315.6A CN201310096315A CN104077279B CN 104077279 B CN104077279 B CN 104077279B CN 201310096315 A CN201310096315 A CN 201310096315A CN 104077279 B CN104077279 B CN 104077279B
- Authority
- CN
- China
- Prior art keywords
- matrix
- point
- stored
- hdfs
- original
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a kind of parallel communities to find method and apparatus, is related to the field of data mining.The method include the steps that reading in original social network data, converting thereof into the form of adjacency matrix and being stored in HDFS file system;The degree matrix D and Laplacian matrix of the adjacency matrix for the figure being stored on HDFS are calculated on the computing cluster configured with Hadoop environment;The parallel Lanczos numerical solution of characteristic value and feature vector is carried out to Laplacian matrix, obtain the corresponding feature vector of K maximum eigenvalue before matrix, and it constructs eigenvectors matrix and is normalized, the eigenvectors matrix that is standardized simultaneously extracts feature, regard every row as a point, K class is clustered into using clustering method;According to the corresponding relationship of point, equivalently divides the individual in original community into K class, complete the classification of community.The invention also discloses a kind of community discovery devices.Technical scheme has good adaptability for large-scale data.
Description
Technical field
The present invention relates in the parallel computation and social networks of the field of data mining more particularly to large-scale data
Community discovery scheme.
Background technique
Social networks is made of the relationship between individual and individual.Individual generally includes personal, tissue and other societies are real
Body can also indicate webpage, blog, mailbox, short message, paper and position etc.;Social relationships generally comprise friend, relatives and same
Relationship etc. can also indicate to click, concern, send the various actions such as message and reference.
There are community structure in social networks, close relation between the individual inside community, intercommunal relationship is then not
Closely.Community detection (also known as community discovery) is exactly that detection identifies these communities.Community discovery can be used as orientation recommender system
Basis, can be positioned at the point to play a crucial role inside class, can position and play a crucial role to the connection between class and class
Point.Also the problems such as can be applied to route searching, Relationship Prediction, scientific algorithm.
At present in the field of data mining, community discovery problem is mainly converted into clustering problem.It is traditional as K-means is clustered
Method is the algorithm based on distance, calculates the distance between individual and central point decision ownership by randomly selecting central point, then
Central point is updated, iteration is basically unchanged until central point.But this method is when abstract individual is converted into theorem in Euclid space point
Wait none suitable mapping function.And after being mapped to theorem in Euclid space, spherical community can only be found, and cannot find
The community structure of the spills such as crescent.
Based on the method for hierarchical clustering based on the similitude or intensity that connect between each node, network is divided into community.It should
Method according into network addition while or from network delete while can be divided into two classes: condensing method and splitting method.It is typical
Cohesion cluster CNM algorithm, principle is optimized to modularity based on Greedy idea.Requirement of the algorithm to memory and processor
Very high, when data scale is very big, single machine can not be handled at all.
Spectral clustering realizes that community discovery is got well than general method effect, because spectral clustering, which is equivalent to, realizes one preferably
The coordinate of abstract individual to spatial point map, but the numerical solution of characteristic value therein and feature vector is a problem,
Because of Calculation bottleneck, existing method can only be applied in small-scale data.
Summary of the invention
The technical problem to be solved by the invention is to provide a kind of parallel communities to find method and apparatus, existing to overcome
Calculation bottleneck problem present in technology.
In order to solve the above-mentioned technical problem, the invention discloses a kind of community discovery methods, comprising:
Original social network data is read in, the form of adjacency matrix is converted thereof into and is stored in HDFS file system;
The degree matrix D of the adjacency matrix for the figure being stored on HDFS is calculated on the computing cluster configured with Hadoop environment
With Laplacian matrix Lsym=I-D-1/2SD-1/2;
The parallel Lanczos numerical value of characteristic value and feature vector is carried out using Haoop frame to the Laplacian matrix
It solves, obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥λK, corresponding feature vector is expressed as V1, V2...,
VK;
By described eigenvector V1, V2..., VKIt is in line, constructs eigenvectors matrix and it is normalized, the eigenvectors matrix to be standardized
Feature extraction is carried out to the eigenvectors matrix Y of standardization, regards every row as a point, represents original abstract
The theorem in Euclid space of body maps, and is clustered into K class using the clustering method simply based on distance;
According to the corresponding relationship of point, (the corresponding of the individual being stored in the point and original community in the figure on HDFS is closed
System, it is one-to-one that a point, which corresponds to an individual), it equivalently divides the individual in original community into K class, completes society
The classification in area.
Preferably, reading in original social network data in the above method, the form of adjacency matrix and storage are converted thereof into
Refer in HDFS file system:
Original social network data is read in, be modeled as having the right undirected graph model or haves no right undirected graph model, will be built
The adjacency matrix W distributed storage of the figure obtained after mould is on HDFS.
Preferably, the original social network data is the text text being stored in HDFS file system in the above method
The format of part, every row is " 1 user name of user name, 2 relationship weight ", indicates the relationship strength between two users;Or
For the sequential file being stored in HDFS file system;Or
For the relation data being stored in the database of Hadoop platform.
Preferably, calculating the figure being stored on HDFS on the computing cluster configured with Hadoop environment in the above method
The degree matrix D of adjacency matrix refer to:
Parallel method based on MapReduce calculates the degree matrix of adjacency matrix.
Preferably, carrying out feature extraction in the above method to the eigenvectors matrix Y of standardization, regarding every row as one
Point represents the theorem in Euclid space mapping of original abstract individual, is clustered into K class using the clustering method simply based on distance
Refer to:
First sequence is chosen K point and is stored in array, and then all points of order traversal, replace number with certain probability
Some element in group;
All points are assigned to each Map node again, node calculates the distance of local data point and central point, chooses
Ownership of the nearest central point as data point, output the result is that<key, value>it is right, wherein key is ownership a little,
Value is seat target value and square value a little;Local sum is merged the coordinate and coordinate of the point of same home by Combine node
Square;The coordinate of all nodes and coordinate square are carried out summation merging by Reduce node, and calculate convergence whether, using square
Item calculates judgment criteria of each excentric extent of polymerization of classification as cluster quality.
The invention also discloses a kind of community discovery devices, comprising:
Input module reads in original social network data, converts thereof into the form of adjacency matrix and is stored in HDFS text
In part system;
Laplacian generation module calculates the figure being stored on HDFS on the computing cluster configured with Hadoop environment
Adjacency matrix degree matrix D and Laplacian matrix Lsym=I-D-1/2SD-1/2;
Feature decomposition module, to the Laplacian matrix using Haoop frame carry out characteristic value and feature vector and
Row Lanczos numerical solution obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥λK, corresponding feature vector table
It is shown as V1, V2..., VK, by described eigenvector V1, V2..., VKIt is in line, constructs eigenvectors matrixAnd it is normalized, the eigenvectors matrix to be standardized
Characteristic extracting module carries out feature extraction to the eigenvectors matrix Y of standardization, regards every row as a point, generation
The theorem in Euclid space of the original abstract individual of table maps, and is clustered into K class using the clustering method simply based on distance;
Output module equivalently divides the individual in original community into K class, completes community according to the corresponding relationship of point
Classification.
Preferably, the input module reads in original social network data in above-mentioned apparatus, adjacency matrix is converted thereof into
Form and being stored in HDFS file system refer to:
Original social network data is read in, be modeled as having the right undirected graph model or haves no right undirected graph model, will be built
The adjacency matrix W distributed storage of the figure obtained after mould is on HDFS.
Preferably, the original social network data is the text text being stored in HDFS file system in above-mentioned apparatus
The format of part, every row is " 1 user name of user name, 2 relationship weight ", indicates the relationship strength between two users;Or
For the sequential file being stored in HDFS file system;Or
For the relation data being stored in the database of Hadoop platform.
Preferably, the Laplacian generation module, the parallel method based on MapReduce calculates adjacent in above-mentioned apparatus
Connect the degree matrix of matrix.
Preferably, in above-mentioned apparatus, the characteristic extracting module carries out feature to the eigenvectors matrix Y of standardization and mentions
It takes, regards every row as a point, the theorem in Euclid space mapping of original abstract individual is represented, using the cluster side simply based on distance
Method is clustered into K class and is referred to:
First sequence is chosen K point and is stored in array, and then all points of order traversal, replace number with certain probability
Some element in group;
All points are assigned to each Map node again, node calculates the distance of local data point and central point, chooses
Ownership of the nearest central point as data point, output the result is that<key, value>it is right, wherein key is ownership a little,
Value is seat target value and square value a little;Local sum is merged the coordinate and coordinate of the point of same home by Combine node
Square;The coordinate of all nodes and coordinate square are carried out summation merging by Reduce node, and calculate convergence whether, using square
Item calculates judgment criteria of each excentric extent of polymerization of classification as cluster quality.
Technical scheme realizes the subsystem of the parallel spectrum calculating based on Lanczos method, based on MapReduce
The matrix-vector multiplication parallel method of frame and parallel K-Means cluster realizing method.These subsystems can solve institute above
The technological difficulties mentioned, and be used under other application scenarios and be not limited to this patent the embodiment described.
Specifically, technical scheme has the advantage that
For large-scale data have good adaptability, under the application background of mass data can effectively and rapidly into
Row one calculates in real time;
To the effect of community discovery by being then based on the algorithm improvement of spectral clustering, have the result as spectral clustering excellent
The characteristics of;
The Hadoop frame of use, so that system fault-tolerance with higher, works as machine for the failure of a fraction group of planes
The case where available effective processing, improve stability;
Load reaches good equilibrium.
It should also be noted that the usage scenario of technical scheme is including but not limited to following several:
1, the community discovery of large scale network structure.Large-scale community network meter may be implemented in technical scheme
Calculate, discovery wherein potential community structure, the community network especially suitable for social networks such as: microblogging, Renren Network,
facebook、twitter。
2, community discovery can be used as the basis of orientation recommender system, provide user personalized service.
3, it can be positioned at the point to play a crucial role inside class, can position and play a crucial role to the connection between class and class
Point.Commercial advertisement is launched with directive significance to accurate.
4, the problems such as also can be applied to route searching, Relationship Prediction, scientific algorithm.
Detailed description of the invention
Fig. 1 is the flow chart and system framework figure of the embodiment of the present invention;
Fig. 2 is the MapReduce flow chart of document processing module;
Fig. 3 be in feature decomposition module matrix-vector at MapReduce flow chart;
Fig. 4 is the MapReduce implementation flow chart of the k-means in characteristic extracting module;
Fig. 5 is the result for completing community discovery using the present invention to the social network data of a true Sina weibo
Figure.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, below in conjunction with attached drawing to skill of the present invention
Art scheme is described in further detail.It should be noted that in the absence of conflict, in embodiments herein and embodiment
Feature can arbitrarily be combined with each other.
Embodiment 1
The embodiment of the present invention provides a kind of community discovery method
Step 1) input module read in from HDFS (Hadoop Distributed File System) it is original will be social
The text file of network data is modeled as having the right undirected graph model or haves no right non-directed graph, and the similar matrix S of figure is (adjacent
Connecing matrix W) distributed storage is on HDFS;
The Laplacian generation module of step 2) the community discovery system, in the calculating collection for being configured with Hadoop environment
The degree matrix D and Laplacian matrix L of the adjacency matrix of figure are calculated on groupsym=I-D-1/2SD-1/2;
The feature decomposition module of step 3) the community discovery system carries out Laplacian matrix using Haoop frame
The parallel Lanczos numerical solution of characteristic value and feature vector obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥
λK, corresponding feature vector is expressed as V1, V2..., VK;
Obtained feature vector is in line by step 4) preceding feature decomposing module, constructs eigenvectors matrix, and it is normalized, and obtains standardization eigenvectors matrix
The characteristic extracting module of step 5) the community discovery system carries out feature extraction to the feature vector of standardization, will
Every row regards a point as, represents the theorem in Euclid space mapping of original abstract individual, such as using the clustering method simply based on distance
K-Means is clustered into K class.
Step 6) output module equivalently divides the individual in original community into K class, completes according to the corresponding relationship of point
The classification of community.
The community discovery method further include:
The social network data that the input module is read is every row with text files memory in HDFS file system
Format be " 1 user name of user name, 2 relationship weight ", illustrate the relationship strength between two users.It can also be with sequence
File is stored in the sequential file in HDFS file system.It can also be the relationship number being stored in the database of Hadoop platform
According to.The form that the input module converts the data into adjacency matrix is stored in HDFS file system.When conversion, use
Name in an account book should renumber into the number since 0 according to input sequence to adapt to the representation of matrix, for example user A is numbered
At 0, user C is numbered into 9, and bonding strength is 1 between them, then the 0th row the 9th column and the 9th row the 0th column are all corresponding in matrix
It is 1.
The Laplacian generation module, realization process are parallelizations.It calculates similarity matrix and is based on MapReduce
Parallel method it is as follows:
Map node: it is concurrently calculated on each node according to the numerical value expression of the attribute dimensions of each individual point itself
Similarity output between points.
Reduce node: purpose is that obtained similarity matrix is carried out a rarefaction, takes maximum t for each point
For a similarity point as the point for having connection, remaining is set as 0.
The feature decomposition module is also the parallel method under Hadoop frame, and the most time-consuming place of Lanzos method exists
It is multiplied in matrix-vector, the invention also includes:
1) a kind of parallel matrix multiplication based on Hadoop frame: for a vector, the community discovery device by its
It is sent to each calculate node as global information, for matrix, it is contemplated that the computation rule of matrix multiplication is by matrix by rows
For unit, each Map task of Map node does primary row and multiplies column vector, then using line label and product value as<key, value>
To being sent to Reduce node, by Reduce reduction operation will be obtained after the value group to correct position of each row result arrange to
Amount.
2) each processor a kind of method of load balancing: is limited under Hadoop frame using only a Map task, such as
Described in step 2, for community discovery system described in matrix file by Factorization algorithm at the file with processor number integral multiple
Number is stored on HDFS, and input in this way can reach extraordinary load balancing.
After above-mentioned multiplication is realized, the maximum K characteristic value of needs can be obtained using traditional Lanczos algorithm
And feature vector.
It is clustered in the characteristic extracting module using parallel improvement K-Means method, treatment process is divided into just
The iteration module of beginningization module and central point.
The characteristic extracting module specifically includes:
Initialization module K point of first sequence selection is stored in array, then all points of order traversal, with certain
Probability replace some element in array;
The iteration module of the central point is a MapReduce task module: the iteration module is by all points minute
It is fitted on each Map node, node calculates the distance of local data point and central point, chooses nearest central point as data point
Ownership, output the result is that<key, value>it is right, wherein key is ownership a little, value be seat target value a little and square
Value;Local sum is merged the coordinate and coordinate square of the point of same home by Combine node;Reduce node is by all nodes
Coordinate and coordinate square carry out summation merging, and whether calculate convergence, it is excentric partially to calculate each classification using quadratic term
From degree, extent of polymerization clusters fine or not judgment criteria as one in other words.
The data set that this example uses is the data of certain the true social networks grabbed by computer by network, and scale is
1,600,000 points and about 200,000,000 sides.For the convenience of introduction, the figure that we generate is only with mutual intrinsic pass
It is the weighting without considering interactive information, the figure of generation is to have no right digraph, has relationship then to have connection, the two-value square of W 0,1
Battle array.
It is the process signal of this method referring to Fig. 1.Input module, which is read, first generates data (step 101).At this
In embodiment, acquired data are in order to protect the privacy of user, and user name is just directly taken out when network obtains data
As at digital number 0-N, but their number can change because input sequence is different during renumbeing, this
One step is set to adapt to general true input;
The present embodiment second step is the Laplacian matrix expression for converting the data into figure, is stored in HDFS file system
On, symmetrical Laplacian matrix: L is used heresym=I-D-1/2SD-1/2
Therefore the process first step be will likely asymmetrical adjacency matrix be transformed into symmetrical adjacency matrix, guarantee matrix W
It is symmetrical.
Next will degree of solution matrix D, degree matrix only existence value on the diagonal, it is possible to linear list by degree matrix
It is stored in memory.
Process solves Laplacian matrix (step 102) in next step, this step will be spent matrix and is distributed to as global variable
Each node, and the Laplacian matrix acquired is stored in the HDFS file system of Hadoop with the document form serialized
On.
Every a line is all the sparse storage of serializing in the present embodiment, is illustrated:
If a line Laplacian matrix is expressed as 0000100100, without serializing previously stored data
It should be (4,1) (6,1).Sequential file on HDFS be by after the Data Serialization in the form of binary (rather than word
Symbol form) it saves hereof.
The third step of the present embodiment needs to solve the characteristic value and corresponding feature vector of a large-scale sparse matrix
(step 103).The first step is Lanczos iteration, and basic procedure is as follows:
1) it initializes
2) Lanczos iteration
For k=1,2 ..., m;
3) eigen vector is solved
It is available in this way:
QR algorithm seeks tridiagonal matrix TmmCharacteristic value and feature vector, acquire the characteristic value and feature of matrix A
Vector is
The step of being multiplied in the feature decomposition module to matrix-vector has carried out the parallelization of MapReduce.
A matrix is sparse matrix of million magnitudes multiplied by million magnitudes in the present embodiment.A can be regarded as million
A row vector composition, u is also the vector of a million magnitude dimensions.It is as follows:
It can see that in this way
Above-mentioned independent multiplication is assigned to each node to achieve the effect that acceleration by us.
The 4th step of the present embodiment is the result (feature vector) for the parallel Lanczos algorithm that will be improved with text file
Mode, which exports, gives characteristic extracting module (step 104).
Feature extraction uses k-means algorithm when classification number is greater than 2, in order to allow K-means convergence rate to add
Fastly, initial center point must be more discrete, if Relatively centralized or identical will appear the case where two classes focus on a class.
We use following two schemes in the present embodiment:
It is random to generate
K different points are randomly selected as initialization center in a concentration.
Random assortment
Random is divided into point set k class, calculates the center of each classification, the center as initialization.
The present embodiment final step is that cluster result is converted into community's classification results (step 105).This step is related to
User name mapping was once become number designation in a step 101 by the anti-number from label to user name, in this step, according to
Classification information belonging to number designation is changed into the information of user's generic by the mapping relations file on HDFS.
As shown in Fig. 2, to generate the detailed MapReduce process of Laplacian matrix.In the present embodiment, due to not having
The attribute dimensions information for thering is individual to put, therefore can generally have no right directly using the adjacency matrix of relationship as similarity matrix
Figure processing method is all in this way.Therefore the process for solving similar matrix and rarefaction can be skipped over, L is directly passed throughsym=I-D-1/2SD-1/2Laplacian matrix is solved, MapReduce is divided to for two steps here, and diagonal matrix premultiplication is one to capable behaviour
Make.Therefore the map task in Map node is to every a line all elements all multiplied by D-1/2The diagonal element of corresponding row,
Reduce node merges every one matrix of behavior;Second step Map node tasks be to each element of every row multiplied by it is corresponding (-
D-1/2) diagonal element and by diagonal element all plus 1;Obtained each row is merged into Laplacian square by Reduce node
Battle array.
As shown in figure 3, being the MapReduce schematic diagram of matrix-vector multiplication.For a vector, the community discovery device
It is sent to each calculate node as global information, for matrix, it is contemplated that the computation rule of matrix multiplication is by matrix
By behavior unit, each Map task of Map node does primary row and multiplies column vector, then using line label and product value as < key,
Value > be sent to Reduce node to by file passes through reduction operation for the value group of each row to just on Reduce node
Result column vector is obtained behind true position.As a result it is stored in HDFS file system.
As shown in figure 4, describing the MapReduce processing schematic of the characteristic extracting module of the community discovery system.
In the present embodiment, after an input fragment enters the characteristic extracting module, Map node tasks can be read in
The classification information of one step, is packaged into key-value pair.Wherein key indicates that the point in input fragment is returned according to the classification information of previous step
Class situation judges the generic of the point set;And the class that value will indicate that the vector of point information is packaged into.
In order to improve arithmetic speed, on the local node, needs to carry out first time merging using Combine task, will belong to
It is merged together in same category of point set, does not need the information for saving all the points here, it is only necessary to save fragment point set
Coordinate and and square etc. information.
Last Reduce task merges point same category of on different nodes, calculates new classification information.
An iteration process terminates.
As shown in figure 5, dividing last result for community.This figure is explained as follows:
If regarding each point as an individual in social networks, matrix indicates this network, such as fruit dot 1 and point 2 have
Connection, then the value of (1,2) and (2,1) two o'clock is 1 in matrix, and performance sets 1 on the diagram as black, and 0 is white, then will
First two stains out.
It is envisioned that if by contact close group and be artificially continuously placed on matrix lot (here it is
The thing that community discovery is done puts same class people together), since connection interpersonal inside its community is inevitable close, that
It just will appear the pattern for having obvious square blackening on diagonal line.Fig. 5 exactly indicates community with the mode of this image
It was found that result, it is seen then that the network of this 1,600,000 people is obviously divided into many small communities not of uniform size, and dark
Spot is obvious.Demonstrate reliability of the system under extensive mass data.
Embodiment 2
The present embodiment provides a kind of community discoveries to realize system, comprising:
Input module reads in original social network data, converts thereof into the form of adjacency matrix and is stored in HDFS text
In part system;
Specifically, input module reads in original social network data, is modeled as having the right undirected graph model or having no right
Undirected graph model, by the adjacency matrix W distributed storage of the figure obtained after modeling on HDFS.
Wherein, original social network data is the text file being stored in HDFS file system, and the format of every row is
" 1 user name of user name, 2 relationship weight ", indicates the relationship strength between two users;Or to be stored in HDFS file system
On sequential file;Or the relation data to be stored in the database of Hadoop platform.
Laplacian generation module calculates the figure being stored on HDFS on the computing cluster configured with Hadoop environment
Adjacency matrix degree matrix D and Laplacian matrix Lsym=I-D-1/2SD-1/2;
Specifically, above-mentioned Laplacian generation module can calculate adjacency matrix based on the parallel method of MapReduce
Degree matrix.
Feature decomposition module, to the Laplacian matrix using Haoop frame carry out characteristic value and feature vector and
Row Lanczos numerical solution obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥λK, corresponding feature vector table
It is shown as V1, V2..., VK, by described eigenvector V1, V2..., VKIt is in line, constructs eigenvectors matrixAnd it is normalized, the eigenvectors matrix to be standardized
Characteristic extracting module carries out feature extraction to the eigenvectors matrix Y of standardization, regards every row as a point, generation
The theorem in Euclid space of the original abstract individual of table maps, and is clustered into K class using the clustering method simply based on distance;
Specifically, features described above extraction module, first sequence is chosen K point and is stored in array, and then order traversal is all
Point, some element in array is replaced with certain probability;
All points are assigned to each Map node again, node calculates the distance of local data point and central point, chooses
Ownership of the nearest central point as data point, output the result is that<key, value>it is right, wherein key is ownership a little,
Value is seat target value and square value a little;Local sum is merged the coordinate and coordinate of the point of same home by Combine node
Square;The coordinate of all nodes and coordinate square are carried out summation merging by Reduce node, and calculate convergence whether, using square
Item calculates judgment criteria of each excentric extent of polymerization of classification as cluster quality.
Output module equivalently divides the individual in original community into K class, completes community according to the corresponding relationship of point
Classification.
From above-described embodiment as can be seen that technical scheme is more attractive to current application background, because working as
The social networks of preceding mass data is seen everywhere, whom does division to the network of even tens of thousands of people of thousands of people again without and feels very spine
Hand.Difficulty is that the scramble network to up to a million or even more than one hundred million a nodes divides, and obviously requirement is calculated using parallelization for this
Method and system complete, as this patent proposed invention scheme.
Those of ordinary skill in the art will appreciate that all or part of the steps in the above method can be instructed by program
Related hardware is completed, and described program can store in computer readable storage medium, such as read-only memory, disk or CD
Deng.Optionally, one or more integrated circuits can be used also to realize in all or part of the steps of above-described embodiment.Accordingly
Ground, each module/unit in above-described embodiment can take the form of hardware realization, can also use the shape of software function module
Formula is realized.The application is not limited to the combination of the hardware and software of any particular form.
The above, preferred embodiments only of the invention, is not intended to limit the scope of the present invention.It is all this
Within the spirit and principle of invention, any modification, equivalent substitution, improvement and etc. done should be included in protection model of the invention
Within enclosing.
Claims (10)
1. a kind of community discovery method, which is characterized in that this method comprises:
Original social network data is read in, the form of adjacency matrix is converted thereof into and is stored in HDFS file system;
Calculated on the computing cluster configured with Hadoop environment the adjacency matrix of figure being stored on HDFS degree matrix D and
Laplacian matrix Lsym=I-D-1/2SD-1/2;
The Laplacian matrix is asked using the parallel Lanczos numerical value that Hadoop frame carries out characteristic value and feature vector
Solution, obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥λK, corresponding feature vector is expressed as V1,V2,…,VK;
By described eigenvector V1,V2,…,VKIt is in line, constructs eigenvectors matrix U=[V1,V2,…,VK], and to it
It is normalized, the eigenvectors matrix to be standardized
Feature extraction is carried out to the eigenvectors matrix Y of standardization, regards every row as a point, represents original abstract individual
Theorem in Euclid space mapping, is clustered into K class using the clustering method simply based on distance;
According to the corresponding relationship of point, equivalently divides the individual in original community into K class, complete the classification of community;
Wherein, S is adjacency matrix.
2. the method as described in claim 1, which is characterized in that read in original social network data, convert thereof into adjacent square
The form and being stored in HDFS file system of battle array refers to:
Original social network data is read in, be modeled as having the right undirected graph model or haves no right undirected graph model, after modeling
The adjacency matrix W distributed storage of obtained figure is on HDFS.
3. method according to claim 1 or 2, which is characterized in that
The original social network data is the text file being stored in HDFS file system, and the format of every row is " user
1 user name of name, 2 relationship weight ", indicates the relationship strength between two users;Or
For the sequential file being stored in HDFS file system;Or
For the relation data being stored in the database of Hadoop platform.
4. method according to claim 1 or 2, which is characterized in that
The degree matrix D that the adjacency matrix for the figure being stored on HDFS is calculated on the computing cluster configured with Hadoop environment refers to:
Parallel method based on MapReduce calculates the degree matrix of adjacency matrix.
5. method according to claim 1 or 2, which is characterized in that
Feature extraction is carried out to the eigenvectors matrix Y of standardization, regards every row as a point, represents original abstract individual
Theorem in Euclid space mapping, is clustered into K class using the clustering method simply based on distance and is referred to:
First sequence is chosen K point and is stored in array, and then all points of order traversal, are replaced in array with certain probability
Some element;
All points are assigned to each Map node again, node calculates the distance of local data point and central point, chooses nearest
Ownership of the central point as data point, output the result is that<key, value>it is right, wherein key is ownership a little, and value is
The square value of the coordinate of point;Combine node sums the square value of the coordinate in this place;Reduce node will own
The summed result of Combine node merges, and whether calculate convergence, it is excentric poly- to calculate each classification using quadratic term
Judgment criteria of the conjunction degree as cluster quality.
6. a kind of community discovery device, which is characterized in that the device includes:
Input module reads in original social network data, converts thereof into the form of adjacency matrix and is stored in HDFS file system
On system;
Laplacian generation module calculates the neighbour for the figure being stored on HDFS on the computing cluster configured with Hadoop environment
Connect the degree matrix D and Laplacian matrix L of matrixsym=I-D-1/2SD-1/2;
Feature decomposition module carries out the parallel of characteristic value and feature vector using Haoop frame to the Laplacian matrix
Lanczos numerical solution obtains K maximum eigenvalue I=λ before matrix1≥λ2≥…≥λK, corresponding feature vector expression
At V1,V2,…,VK, by described eigenvector V1,V2,…,VKIt is in line, constructs eigenvectors matrix U=[V1,V2,…,
VK], and it is normalized, the eigenvectors matrix to be standardized
Characteristic extracting module carries out feature extraction to the eigenvectors matrix Y of standardization, regards every row as a point, represent original
The theorem in Euclid space mapping of the abstract individual come, is clustered into K class using the clustering method simply based on distance;
Output module equivalently divides the individual in original community into K class, completes point of community according to the corresponding relationship of point
Class;
Wherein, S is adjacency matrix.
7. device as claimed in claim 6, which is characterized in that the input module reads in original social network data, by it
It is converted into the form of adjacency matrix and is stored in HDFS file system to refer to:
Original social network data is read in, be modeled as having the right undirected graph model or haves no right undirected graph model, after modeling
The adjacency matrix W distributed storage of obtained figure is on HDFS.
8. device as claimed in claims 6 or 7, which is characterized in that
The original social network data is the text file being stored in HDFS file system, and the format of every row is " user
1 user name of name, 2 relationship weight ", indicates the relationship strength between two users;Or
For the sequential file being stored in HDFS file system;Or
For the relation data being stored in the database of Hadoop platform.
9. device as claimed in claims 6 or 7, which is characterized in that
The Laplacian generation module, the parallel method based on MapReduce calculate the degree matrix of adjacency matrix.
10. device as claimed in claims 6 or 7, which is characterized in that
The characteristic extracting module carries out feature extraction to the eigenvectors matrix Y of standardization, regards every row as a point, generation
The theorem in Euclid space of the original abstract individual of table maps, and is clustered into K class using the clustering method simply based on distance and referred to:
First sequence is chosen K point and is stored in array, and then all points of order traversal, are replaced in array with certain probability
Some element;
All points are assigned to each Map node again, node calculates the distance of local data point and central point, chooses nearest
Ownership of the central point as data point, output the result is that<key, value>it is right, wherein key is ownership a little, and value is
The square value of the coordinate of point;Combine node sums the square value of the coordinate in this place;Reduce node will own
The summed result of Combine node merges, and whether calculate convergence, it is excentric poly- to calculate each classification using quadratic term
Judgment criteria of the conjunction degree as cluster quality.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310096315.6A CN104077279B (en) | 2013-03-25 | 2013-03-25 | A kind of parallel communities discovery method and apparatus |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310096315.6A CN104077279B (en) | 2013-03-25 | 2013-03-25 | A kind of parallel communities discovery method and apparatus |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104077279A CN104077279A (en) | 2014-10-01 |
| CN104077279B true CN104077279B (en) | 2019-02-05 |
Family
ID=51598539
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201310096315.6A Active CN104077279B (en) | 2013-03-25 | 2013-03-25 | A kind of parallel communities discovery method and apparatus |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104077279B (en) |
Families Citing this family (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104408096B (en) * | 2014-11-17 | 2017-01-25 | 河南理工大学 | Network information retrieval method based on information bottleneck theory and community detection |
| CN104991912A (en) * | 2015-06-19 | 2015-10-21 | 四川大学 | Large scale map data clustering algorithm based on MapReduce architecture |
| CN105354298A (en) * | 2015-11-01 | 2016-02-24 | 长春理工大学 | Hadoop based method for analyzing large-scale social network and analysis platform thereof |
| CN105512242B (en) * | 2015-11-30 | 2018-09-21 | 浙江工业大学 | A kind of parallel recommendation method based on social network structure |
| CN106570082B (en) * | 2016-10-19 | 2019-11-05 | 浙江工业大学 | A kind of friends method for digging of combination network topology characteristic and user behavior characteristics |
| CN108205551B (en) * | 2016-12-16 | 2020-09-29 | 北京酷我科技有限公司 | Song recommendation method and song recommendation system |
| CN110110154B (en) * | 2018-02-01 | 2023-07-11 | 腾讯科技(深圳)有限公司 | Graph file processing method, device and storage medium |
| EP3824665A1 (en) * | 2018-07-19 | 2021-05-26 | Nokia Technologies Oy | Environment modeling and abstraction of network states for cognitive functions |
| CN110290165B (en) * | 2019-04-04 | 2022-01-28 | 平安科技(深圳)有限公司 | Method for regulating and controlling communication load between network hosts, electronic device and readable storage medium |
| CN110210248B (en) * | 2019-06-13 | 2020-12-25 | 重庆邮电大学 | Privacy protection-oriented network structure de-anonymization system and method |
| CN117112847B (en) * | 2023-10-20 | 2024-02-06 | 杭州悦数科技有限公司 | Data generation method and device of graph database based on community model |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101944045A (en) * | 2010-10-18 | 2011-01-12 | 中国人民解放军国防科学技术大学 | Method for distributing parallel discrete event simulation objects based on community characteristics |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7849509B2 (en) * | 2005-10-07 | 2010-12-07 | Microsoft Corporation | Detection of security vulnerabilities in computer programs |
-
2013
- 2013-03-25 CN CN201310096315.6A patent/CN104077279B/en active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101944045A (en) * | 2010-10-18 | 2011-01-12 | 中国人民解放军国防科学技术大学 | Method for distributing parallel discrete event simulation objects based on community characteristics |
Also Published As
| Publication number | Publication date |
|---|---|
| CN104077279A (en) | 2014-10-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104077279B (en) | A kind of parallel communities discovery method and apparatus | |
| Zhang et al. | Privacy-preserving double-projection deep computation model with crowdsourcing on cloud for big data feature learning | |
| CN111670457B (en) | Optimization of dynamic object instance detection, segmentation and structure mapping | |
| Kong et al. | A balanced power consumption algorithm based on enhanced parallel cat swarm optimization for wireless sensor network | |
| CN113822315B (en) | Property graph processing method, device, electronic device and readable storage medium | |
| CN111684490A (en) | Optimization of dynamic object instance detection, segmentation and structure mapping | |
| JP7527716B2 (en) | DATA PROCESSING METHOD, APPARATUS, ELECTRONIC DEVICE, AND COMPUTER PROGRAM | |
| Haut et al. | Cloud deep networks for hyperspectral image analysis | |
| CN104035978B (en) | Combo discovering method and system | |
| Roychowdhury et al. | Ag-mic: Azure-based generalized flow for medical image classification | |
| Lv | Construction of marine ship automatic identification system data mining platform based on big data | |
| Wright et al. | Help me to help you: machine augmented citizen science | |
| Liu et al. | WTFM layer: An effective map extractor for unsupervised shape correspondence | |
| CN114911778A (en) | Data processing method and device, computer equipment and storage medium | |
| CN109325480A (en) | The input method and terminal device of identity information | |
| CN109657950B (en) | Hierarchical analysis method, hierarchical analysis device, hierarchical analysis equipment and computer-readable storage medium | |
| JP6462611B2 (en) | Generating device, generating method, and generating program | |
| Singh | Facebook comment volume prediction | |
| Shanmuganathan et al. | An AI based approach to multiple census data analysis for feature selection | |
| Rizki et al. | Spark-based in-memory DEM creation from 3D LiDAR point clouds | |
| Kolici et al. | Scalability, memory issues and challenges in mining large data sets | |
| Saikumar et al. | A design and implementation of real-time product selection with matrix factorization, collaborative filtering | |
| Madhivanan et al. | An evaluation on the performance of privacy preserving split neural networks using EMNIST dataset | |
| Wang et al. | Rail Transit Prediction Based on Multi‐View Graph Attention Networks | |
| Demochkin et al. | Multi-label image set recognition in visually-aware recommender systems |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |