CN104573130B - Entity resolution method and device based on crowd computing
- Publication number
- CN104573130B (application number CN201510076586.4A)
- Authority
- CN
- China
- Prior art keywords
- record
- subsets
- cluster
- candidate
- subset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
 
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention provides an entity resolution method and device based on crowd computing. The method includes: first performing hierarchical clustering on the initial records in a database to obtain at least two cluster subsets; when a new record is detected to have been added to the database, obtaining from the at least two cluster subsets the at least two relevant cluster subsets most relevant to the new record, and determining the candidate record pairs respectively corresponding to the at least two relevant cluster subsets; judging, by means of crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs; and if all of the candidate record pairs are determined not to represent the same entity, establishing a new cluster subset for the new record and creating a label set for the new cluster subset. Entity resolution can thus be performed on both static and dynamic data sets, which improves resolution efficiency.
Description
Technical field
Embodiments of the present invention relate to computer technology, and in particular to an entity resolution method and device based on crowd computing.
Background
A database is a repository that organizes, stores, and manages data according to a data structure. With the development of information technology and the market, data management is no longer merely the storage and management of data, but has turned into the various forms of data management that users need. Entity resolution arose in the course of database management; its purpose is to identify the different records in a database that represent the same entity. With the arrival of the big-data era, more and more data must be matched or integrated before it can be further analyzed and processed, so the demand for high-quality entity resolution is growing rapidly.
Existing entity resolution methods mainly target static data sources (that is, they assume the data source does not change), and each resolution pass resolves the entire data source. In practice, however, data in a database is added, deleted, or modified from time to time; that is, most data sources change dynamically, for example the information users submit to social networking sites, product information on e-commerce sites, and bug repositories in software engineering. With existing entity resolution methods, the entire data source must be resolved again every time data is added to the database, which is costly; in other words, resolution efficiency is low.
Summary of the invention
Embodiments of the present invention provide an entity resolution method and device based on crowd computing that can perform entity resolution on both static and dynamic data sets and achieve relatively high recall and precision at relatively low cost, thereby improving resolution efficiency.
In a first aspect, an embodiment of the present invention provides an entity resolution method based on crowd computing, including:
performing hierarchical clustering on the initial records in a database using a crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets;
when a new record is detected to have been added to the database, acquiring feature information of the new record;
obtaining, from the at least two cluster subsets, the at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record; wherein the subset information of the at least two cluster subsets includes the label set information and index information of the cluster subsets;
determining the candidate record pairs respectively corresponding to the at least two relevant cluster subsets according to the relative similarity between the new record and each record in the at least two relevant cluster subsets;
judging, by means of crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs and updating the label set of the first cluster subset; if all of the candidate record pairs are determined not to represent the same entity, establishing a new cluster subset for the new record and creating a label set for the new cluster subset; wherein the first record and the new record form the first candidate record pair.
Optionally, performing hierarchical clustering on the initial records in the database using the crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets includes:
according to the probability that each pair of initial records represents the same entity, grouping the initial record pairs whose probability of representing the same entity is greater than an upper probability threshold into one class to form the corresponding primary cluster subsets, and creating a label set and an index for each primary cluster subset; wherein each pair of initial records forms an initial record pair;
merging the primary cluster subsets hierarchically, one merge at a time, by means of crowdsourced user annotation, until the minimum distance between the merged cluster subsets is greater than a lower threshold, finally obtaining at least two cluster subsets.
Optionally, grouping the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class, according to the probability that each pair of initial records represents the same entity, to form the corresponding primary cluster subsets includes:
obtaining the probability that each initial record pair represents the same entity;
grouping the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form the corresponding primary cluster subsets.
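The grouping step above can be illustrated with a minimal Python sketch. It is an illustration only, not the patented implementation: it assumes that "grouping into one class" means taking connected components over the pairs whose probability exceeds the upper threshold, and the names `primary_clusters`, `pair_probs`, and `upper` are hypothetical.

```python
def primary_clusters(n_records, pair_probs, upper=0.9):
    """Group record ids 0..n_records-1 whose pairwise match probability exceeds `upper`.

    Assumption: records linked through a chain of high-probability pairs
    end up in the same primary cluster subset (connected components).
    """
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), p in pair_probs.items():
        if p > upper:
            parent[find(i)] = find(j)

    groups = {}
    for r in range(n_records):
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


# Hypothetical probabilities, e.g. produced by a learned model.
pair_probs = {(0, 1): 0.97, (1, 2): 0.95, (2, 3): 0.40, (3, 4): 0.05}
clusters = primary_clusters(5, pair_probs)
print(clusters)
```

Records 0, 1, and 2 are linked by high-probability pairs and form one primary cluster subset; records 3 and 4 remain singletons and are left for the later crowd-assisted merging.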
Optionally, merging the primary cluster subsets hierarchically, one merge at a time, by means of crowdsourced user annotation, until the minimum distance between the merged cluster subsets is greater than the lower threshold, finally obtaining at least two cluster subsets, includes:
Step A: calculating the distance between each pair of primary cluster subsets, and selecting the pair of primary cluster subsets with the smallest distance as the two candidate merge subsets;
Step B: judging whether the distance between the two candidate merge subsets is less than the lower threshold; if the distance between the two candidate merge subsets is less than the lower threshold, selecting a second record from each of the two candidate merge subsets to form a second candidate record pair, and sending the second candidate record pair and the label sets of the two candidate merge subsets to a crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; wherein the second candidate record pair is the record pair, drawn from the two candidate merge subsets, with the greatest probability of representing the same entity;
Step C: receiving a first judgment result returned by the crowdsourcing platform, determining according to the first judgment result whether to merge the two candidate merge subsets, and sorting and/or filtering the labels in the label sets according to the number of likes each label received on the crowdsourcing platform; if it is determined according to the first judgment result that the two candidate merge subsets represent the same entity, merging the two candidate merge subsets into one cluster subset, updating the label set and index of that cluster subset, and treating the merged cluster subset as a primary cluster subset; if it is determined according to the first judgment result that the two candidate merge subsets do not represent the same entity, setting the distance between the two candidate merge subsets to 1;
Steps A to C are repeated until the distance between the two candidate merge subsets is greater than the lower threshold, at which point the at least two primary cluster subsets are taken as the resulting at least two cluster subsets.
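Steps A to C can be sketched as a loop in Python. This is a simplified illustration under stated assumptions, not the patent's implementation: `distance` and `ask_crowd` are hypothetical callables supplied by the caller, the threshold is assumed to lie in (0, 1], the handling of label sets and likes is omitted, and a "distinct" verdict is remembered only for the exact pair of subsets that was asked about.

```python
from itertools import combinations


def crowd_merge(clusters, distance, ask_crowd, threshold):
    """Merge cluster subsets bottom-up, consulting the crowd before each merge.

    clusters  : iterable of sets of record ids
    distance  : function (subset_a, subset_b) -> float in [0, 1]
    ask_crowd : function (subset_a, subset_b) -> bool, a stand-in for the
                crowdsourcing platform's verdict on the best record pair
    threshold : merging stops once the closest pair is not below this value
    """
    clusters = [frozenset(c) for c in clusters]
    rejected = set()  # pairs judged distinct; their distance is pinned to 1

    def dist(a, b):
        return 1.0 if frozenset((a, b)) in rejected else distance(a, b)

    while len(clusters) > 1:
        # Step A: find the closest pair of subsets.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        # Step B: stop when even the closest pair is too far apart.
        if not dist(a, b) < threshold:
            break
        # Step C: merge on a positive verdict, otherwise pin the distance to 1.
        if ask_crowd(a, b):
            clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        else:
            rejected.add(frozenset((a, b)))
    return sorted(sorted(c) for c in clusters)


# Toy data: each record has a 1-D position and a hidden true entity.
values = {0: 0.10, 1: 0.12, 2: 0.50, 3: 0.55, 4: 0.95}
truth = {0: "x", 1: "x", 2: "y", 3: "z", 4: "w"}


def distance(a, b):
    return min(abs(values[i] - values[j]) for i in a for j in b)


def ask_crowd(a, b):
    return truth[min(a)] == truth[min(b)]


merged = crowd_merge([{0}, {1}, {2}, {3}, {4}], distance, ask_crowd, threshold=0.2)
print(merged)
```

In this toy run, records 0 and 1 are merged after a positive verdict, while records 2 and 3 are close enough to trigger a question but are kept apart because the simulated crowd judges them distinct.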
Optionally, obtaining the probability that the initial record pair represents the same entity includes:
calculating the similarity of the initial record pair according to the similarity between the corresponding attributes of the two records;
calculating, based on a machine learning model, the probability that the initial record pair represents the same entity.
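The two sub-steps above might look as follows in Python. The patent does not name a similarity measure or a model, so everything specific here is an assumption made for illustration: token-level Jaccard similarity per attribute, a logistic function as the model, and hand-picked `weights` and `bias` standing in for parameters that would normally be learned from labelled pairs.

```python
import math


def attribute_similarities(a, b, attributes):
    """Token-level Jaccard similarity for each corresponding attribute."""
    sims = []
    for attr in attributes:
        ta = set(str(a.get(attr, "")).lower().split())
        tb = set(str(b.get(attr, "")).lower().split())
        union = ta | tb
        sims.append(len(ta & tb) / len(union) if union else 0.0)
    return sims


def match_probability(sims, weights, bias):
    """Logistic model over the similarity vector (parameters are hypothetical)."""
    z = bias + sum(w * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-z))


attrs = ["name", "city"]
r1 = {"name": "Joes Pizza", "city": "Boston"}
r2 = {"name": "Joes Pizza House", "city": "Boston"}
r3 = {"name": "Sushi Bar", "city": "Austin"}

weights, bias = [4.0, 2.0], -3.0  # illustrative values, not learned
p_same = match_probability(attribute_similarities(r1, r2, attrs), weights, bias)
p_diff = match_probability(attribute_similarities(r1, r3, attrs), weights, bias)
print(round(p_same, 3), round(p_diff, 3))
```

With these illustrative parameters, the near-duplicate pair scores well above 0.5 and the unrelated pair well below it.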
Optionally, calculating the distance between each pair of primary cluster subsets includes:
selecting, from each pair of primary cluster subsets, the record pair (r_i, r_j) with the greatest probability of representing the same entity, where r_i ∈ C_i, r_j ∈ C_j, C_i is one primary cluster subset of the pair, and C_j is the other primary cluster subset of the pair;
obtaining the distance between each pair of primary cluster subsets according to the formula [formula omitted in source]; where maxSimi is the probability that the record pair (r_i, r_j) represents the same entity, and cosinSimi is the cosine similarity of the pair of primary cluster subsets.
Optionally, obtaining, from the at least two cluster subsets, the at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record includes:
building an inverted index from the label set information and index information of the at least two cluster subsets obtained by performing hierarchical clustering on the initial records in the database using the crowdsourcing-based hierarchical clustering method;
searching according to the inverted index and the feature information of the new record, and obtaining from the at least two cluster subsets the at least two relevant cluster subsets most relevant to the new record.
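The retrieval step can be sketched in Python as follows. The sketch assumes that each cluster subset's label set is a set of strings, that the new record's feature information is a list of terms, and that relevance is the number of feature terms matching a subset's labels; none of these details is fixed by the text above, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(cluster_labels):
    """Map each label to the ids of the cluster subsets whose label set contains it."""
    index = defaultdict(set)
    for cluster_id, labels in cluster_labels.items():
        for label in labels:
            index[label].add(cluster_id)
    return index


def most_relevant(index, features, k=2):
    """Return up to k cluster ids, ranked by the number of matching features."""
    hits = Counter()
    for feature in features:
        for cluster_id in index.get(feature, ()):
            hits[cluster_id] += 1
    # Sort by hit count (descending), then by id so ties are deterministic.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cluster_id for cluster_id, _ in ranked[:k]]


cluster_labels = {
    "c1": {"pizza", "italian", "downtown"},
    "c2": {"pizza", "takeaway"},
    "c3": {"sushi", "japanese"},
}
index = build_inverted_index(cluster_labels)
relevant = most_relevant(index, ["pizza", "italian", "cheap"])
print(relevant)
```

Only the cluster subsets that share at least one label with the new record are touched, which is what makes the lookup cheaper than comparing the new record against every subset.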
Optionally, determining the candidate record pairs respectively corresponding to the at least two relevant cluster subsets according to the relative similarity between the new record and each record in the at least two relevant cluster subsets includes:
calculating the similarity between the new record and each record in the at least two relevant cluster subsets;
selecting, from each relevant cluster subset, the one record with the greatest similarity to the new record, and pairing it with the new record to form the candidate record pair corresponding to that relevant cluster subset; wherein the number of relevant cluster subsets equals the number of candidate record pairs.
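A minimal Python sketch of this selection follows. Token-level Jaccard similarity is used purely as a stand-in, since the similarity measure is not specified here, and the record strings and function names are invented for illustration.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two record strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def candidate_pairs(new_record, relevant_clusters):
    """For each relevant cluster subset, pair its most similar record with new_record."""
    pairs = {}
    for cluster_id, records in relevant_clusters.items():
        best = max(records, key=lambda r: jaccard(new_record, r))
        pairs[cluster_id] = (best, new_record)
    return pairs


new_record = "joes pizza 12 main st"
relevant_clusters = {
    "c1": ["joes pizza main street", "joe pizza"],
    "c2": ["pizza hut 40 main st", "pizza hut"],
}
pairs = candidate_pairs(new_record, relevant_clusters)
print(pairs)
```

One pair is produced per relevant cluster subset, so the number of questions later sent to the crowd equals the number of relevant subsets rather than the number of records.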
Optionally, judging, by means of crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs and updating the label set of the first cluster subset; if all of the candidate record pairs are determined not to represent the same entity, establishing a new cluster subset for the new record and creating a label set for the new cluster subset, includes:
sending all of the candidate record pairs to the crowdsourcing platform, so that the crowdsourcing platform judges whether each candidate record pair represents the same entity;
receiving a second judgment result returned by the crowdsourcing platform, and determining according to the second judgment result whether at least one of the candidate record pairs represents the same entity; if it is determined according to the second judgment result that the first candidate record pair represents the same entity, adding the new record to the first cluster subset to which the first record belongs and updating the label set of the first cluster subset; if it is determined according to the second judgment result that all of the candidate record pairs do not represent the same entity, establishing a new cluster subset for the new record and creating a label set for the new cluster subset.
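The decision described above can be sketched in Python. The sketch makes several assumptions that the text does not settle: cluster subsets are keyed by integer ids, the crowd verdict is modelled by a hypothetical callable `crowd_says_same`, and "updating the label set" is taken to mean adding the new record's labels to the subset's existing labels.

```python
def place_new_record(new_id, candidates, crowd_says_same, clusters, label_sets, new_labels):
    """Add new_id to a matching cluster subset, or open a new subset for it.

    candidates      : {cluster_id: existing_record_id}, one candidate pair per subset
    crowd_says_same : function (existing_record_id, new_id) -> bool
    Returns the id of the cluster subset that now holds new_id.
    """
    for cluster_id, existing in candidates.items():
        if crowd_says_same(existing, new_id):
            clusters[cluster_id].add(new_id)
            label_sets[cluster_id] |= set(new_labels)  # assumed update rule
            return cluster_id
    # No candidate pair was judged to be the same entity: open a new subset.
    fresh = max(clusters, default=-1) + 1
    clusters[fresh] = {new_id}
    label_sets[fresh] = set(new_labels)
    return fresh


clusters = {0: {"r1", "r2"}, 1: {"r3"}}
label_sets = {0: {"pizza"}, 1: {"sushi"}}
candidates = {0: "r1", 1: "r3"}

# Simulated crowd: only (r1, r4) is judged to be the same entity.
first = place_new_record("r4", candidates, lambda a, b: (a, b) == ("r1", "r4"),
                         clusters, label_sets, ["italian"])
# Simulated crowd rejects every pair, so r5 gets its own subset.
second = place_new_record("r5", candidates, lambda a, b: False,
                          clusters, label_sets, ["bakery"])
print(first, second, clusters)
```

Because only the affected subset is modified, the existing clustering does not have to be recomputed when a record arrives.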
In a second aspect, an embodiment of the present invention provides an entity resolution device based on crowd computing, including:
a hierarchical clustering module, configured to perform hierarchical clustering on the initial records in a database using a crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets;
a detection module, configured to acquire feature information of a new record when the new record is detected to have been added to the database;
a first determination module, configured to obtain, from the at least two cluster subsets, the at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record; wherein the subset information of the at least two cluster subsets includes the label set information and index information of the cluster subsets;
a second determination module, configured to determine the candidate record pairs respectively corresponding to the at least two relevant cluster subsets according to the relative similarity between the new record and each record in the at least two relevant cluster subsets;
a division module, configured to judge, by means of crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, to add the new record to the first cluster subset to which the first record belongs and update the label set of the first cluster subset; if all of the candidate record pairs are determined not to represent the same entity, to establish a new cluster subset for the new record and create a label set for the new cluster subset; wherein the first record and the new record form the first candidate record pair.
In the present invention, hierarchical clustering is performed on the initial records in a database using a crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets. When a new record is detected to have been added to the database, feature information of the new record is acquired. The at least two relevant cluster subsets most relevant to the new record are then obtained from the at least two cluster subsets according to the subset information of the at least two cluster subsets and the feature information of the new record, where the subset information includes the label set information and index information of the cluster subsets. Next, the candidate record pairs respectively corresponding to the at least two relevant cluster subsets are determined according to the relative similarity between the new record and each record in the at least two relevant cluster subsets. Finally, crowdsourced user annotation is used to judge whether at least one of the candidate record pairs represents the same entity: if a first candidate record pair is determined to represent the same entity, the new record is added to the first cluster subset to which the first record belongs and the label set of the first cluster subset is updated; if all of the candidate record pairs are determined not to represent the same entity, a new cluster subset is established for the new record and a label set is created for the new cluster subset; the first record and the new record form the first candidate record pair. Entity resolution can thus be performed on both static and dynamic data sets, achieving relatively high recall and precision at relatively low cost and thereby improving resolution efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of Embodiment 1 of the entity resolution method based on crowd computing according to the present invention;
FIG. 2 is a schematic flowchart of Embodiment 2 of the entity resolution method based on crowd computing according to the present invention;
FIG. 3 is a schematic flowchart of Embodiment 3 of the entity resolution method based on crowd computing according to the present invention;
FIG. 4 is a schematic structural diagram of Embodiment 1 of the entity resolution device based on crowd computing according to the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In some scenarios, different records that represent the same entity are usually not identical. The main task of entity resolution is to identify the different records in a database that represent the same entity, which is especially important when cleaning or integrating data from multiple data sources. For example, a mailing list may contain many records that actually refer to the same physical address, yet the records differ from one another because of spelling variations, missing information, and the like. As another example, a company may have several databases for storing user profile information (each belonging to a sub-department) and will generally want to integrate this information to obtain a more complete profile of each user. Because a user's information may appear in a different form in each database, that is, there is no unified identifier, identifying matching user information across multiple databases is not easy.
Machine learning is an interdisciplinary field that has emerged over the past 20-plus years, drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. Machine learning theory is mainly concerned with designing and analyzing algorithms that let computers "learn" automatically; that is, machine learning algorithms automatically derive regularities from data and use those regularities to make predictions about unseen data. Machine learning algorithms fall roughly into four categories: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. 1) Supervised learning learns a function from a given training data set, and when new data arrives, predicts the result according to this function; the training data set must include both inputs and outputs (that is, features and targets), and the targets are labelled by people; common supervised learning algorithms include regression analysis and statistical classification. 2) In unsupervised learning, by contrast, the training data set has no human-labelled results; clustering is a common unsupervised learning algorithm. 3) Semi-supervised learning lies between supervised and unsupervised learning. 4) Reinforcement learning learns through observation: every action affects the environment, and the learner makes its decisions based on the feedback it observes from the environment.
With the arrival of the big-data era, the demand for high-quality entity resolution is growing rapidly. Traditional entity resolution solutions are based on machine learning. Although a great deal of research has been done on machine-learning-based entity resolution, the process involves semantic analysis, domain knowledge, and relevant experience, so a machine judging whether different records are the same entity is less accurate than a human; that is, machine processing is still not accurate enough. With the rapid growth of the crowdsourcing market, manual annotation has been applied to entity resolution, that is, entity resolution is performed on a crowdsourcing platform. Although manual annotation is more accurate than machine judgment, it costs more in both time and money.
Current entity resolution methods based on crowd computing apply only to static databases; that is, they resolve the entire database every time. The idea of crowd computing is to combine crowdsourcing with machine learning, or artificial intelligence with cloud computing, and to solve problems by fusing the efficiency of computer processing with the accuracy of human intelligence. In practice, however, databases are dynamic, for example the landmark data set in Facebook and bug repositories in software engineering: new data is added to the database from time to time and must be resolved against the data already in the database. Traditional entity resolution methods, which suit only static data sources, therefore can no longer meet the needs of dynamic data sources.
By fusing the efficiency of computer processing with the accuracy of human intelligence, the present invention proposes a crowd-computing-based scheme that can perform entity resolution on both static and dynamic data sets and achieves relatively high recall and precision at relatively low cost.
FIG. 1 is a schematic flowchart of Embodiment 1 of the entity resolution method based on crowd computing according to the present invention. As shown in FIG. 1, the method of this embodiment may include:
S101: performing hierarchical clustering on the initial records in a database using a crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets.
In real-world data sources, most records are not duplicates of one another, and handing every record pair to the crowdsourcing platform for judgment is infeasible in both cost and time. Therefore, based on the duplicate probability of each record pair obtained through machine learning, upper and lower probability thresholds can be set to filter out the record pairs that represent the same entity with very high or very low probability: a record pair whose probability is greater than the upper probability threshold is considered to represent the same entity, and a record pair whose probability is less than the lower probability threshold is considered not to represent the same entity.
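The threshold filter described above can be written as a short Python sketch. The threshold values 0.1 and 0.9 and the function name `triage` are illustrative assumptions; the text does not fix specific values.

```python
def triage(pair_probs, lower=0.1, upper=0.9):
    """Split record pairs into confident matches, confident non-matches, and the rest.

    Only the pairs in the third list would need a crowd judgment.
    """
    same, different, uncertain = [], [], []
    for pair, p in pair_probs.items():
        if p > upper:
            same.append(pair)
        elif p < lower:
            different.append(pair)
        else:
            uncertain.append(pair)
    return same, different, uncertain


# Hypothetical duplicate probabilities produced by a learned model.
pair_probs = {("a", "b"): 0.97, ("a", "c"): 0.02, ("b", "c"): 0.55, ("c", "d"): 0.10}
same, different, uncertain = triage(pair_probs)
print(same, different, uncertain)
```

A pair whose probability equals a threshold exactly, such as ("c", "d") here, is neither above the upper bound nor below the lower bound and so stays in the uncertain group.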
In this embodiment of the present invention, a hierarchical clustering algorithm gathers the records that represent the same entity into the same cluster subset (that is, records in different cluster subsets represent different entities). Hierarchical clustering consists of partitional clusterings at different levels, and the partitions at successive levels are nested. Specifically, clustering follows a bottom-up strategy. First, primary cluster subsets are obtained through a filtering step. Then, according to the distance between each pair of primary cluster subsets and by means of crowdsourcing user annotation, the primary cluster subsets are iteratively and hierarchically merged, in a certain order, into larger cluster subsets until the minimum distance between the merged cluster subsets is greater than the lower threshold, that is, until the probability that records from any pair of cluster subsets represent the same entity is less than the lower probability threshold. (The smaller the distance between two cluster subsets, the greater the probability that a record pair drawn from the two cluster subsets represents the same entity, i.e. the greater the probability that the two cluster subsets are duplicates; the larger the distance, the smaller that probability.) Once the minimum distance between the merged cluster subsets is greater than the lower threshold, the probability that a record pair across any two cluster subsets represents the same entity is less than the lower probability threshold, so the cluster subsets do not represent the same entity and the clustering algorithm stops. Optionally, obtaining the primary cluster subsets through the filtering step includes: performing preliminary clustering on the initial records in the database according to the upper probability threshold, for example by clustering the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form a corresponding primary cluster subset; the records in each primary cluster subset all represent the same entity, and records in different primary cluster subsets do not represent the same entity.
Optionally, step S101 includes:
according to the probability that each pair of initial records represents the same entity, clustering the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form corresponding primary cluster subsets, and creating a label set and an index for each primary cluster subset, where every two initial records form an initial record pair;
sequentially and hierarchically merging the primary cluster subsets by means of crowdsourcing user annotation until the minimum distance between the merged cluster subsets is greater than the lower threshold, finally obtaining at least two cluster subsets.
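The first sub-step, forming primary cluster subsets from the pairs above the upper probability threshold, can be sketched as follows. Grouping the pairs transitively with a union-find structure is an assumption made for this sketch; the embodiment only states that such pairs are clustered into one class. The name `primary_clusters` and the sample probabilities are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def primary_clusters(records: Sequence[str],
                     pair_probs: Dict[Tuple[str, str], float],
                     upper: float) -> List[List[str]]:
    """Group records linked by pairs whose probability exceeds `upper`."""
    parent = {r: r for r in records}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pair_probs.items():
        if p > upper:
            parent[find(a)] = find(b)   # union the two groups

    groups: Dict[str, List[str]] = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())


# Illustrative input: a-b and b-c are confident matches, c-d is not.
records = ["a", "b", "c", "d"]
pair_probs = {("a", "b"): 0.9, ("b", "c"): 0.85, ("c", "d"): 0.3}
clusters = primary_clusters(records, pair_probs, upper=0.8)
```

Here `a`, `b` and `c` end up in one primary cluster subset and `d` in its own; the second sub-step would then decide, with crowd input, whether any primary cluster subsets should be merged further.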
In this embodiment of the present invention, optionally, the probability that each initial record pair represents the same entity is obtained first. Next, the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold are clustered into one class to form corresponding primary cluster subsets. Further, to support subsequent queries over dynamic data, a label set and an index are created for each primary cluster subset. Optionally, the label set of each primary cluster subset is created by machine or manual annotation. Optionally, the attribute values of each record in the primary cluster subset are tokenized, stripped of stop words, and stemmed, and the words (i.e. keywords) that recur across the records and have a large inverse document frequency (IDF) value over the whole data set are added to the label set of the primary cluster subset. (The IDF value measures the general importance of a word; the IDF value of a given word is the logarithm of the quotient obtained by dividing the total number of documents by the number of documents containing the word.) An index is created for each primary cluster subset according to its label set information, where the index information includes the keywords of the records in the primary cluster subset. Further, the primary cluster subsets are sequentially and hierarchically merged by means of crowdsourcing user annotation until the minimum distance between the merged cluster subsets is greater than the lower threshold, which means that the probability that a record pair across any two cluster subsets represents the same entity is less than the lower probability threshold, i.e. the cluster subsets do not represent the same entity. The clustering algorithm then stops, and at least two cluster subsets are finally obtained.
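The IDF-based label selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hypothetical one, stemming is omitted, "recurring" is taken to mean appearing in at least two records of the cluster, and the names `idf_table` and `label_set` and the `top_k` parameter are invented for the example. The IDF itself follows the definition in the text, log(total documents / documents containing the word).

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "by"}  # illustrative only


def tokenize(text: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def idf_table(all_records: Sequence[str]) -> Dict[str, float]:
    """IDF over the whole data set: log(N / document frequency)."""
    n = len(all_records)
    df = Counter()
    for text in all_records:
        df.update(set(tokenize(text)))
    return {w: math.log(n / c) for w, c in df.items()}


def label_set(cluster_records: Sequence[str], idf: Dict[str, float],
              top_k: int = 3) -> List[str]:
    """Words recurring within the cluster, ranked by IDF (ties by word)."""
    counts = Counter()
    for text in cluster_records:
        counts.update(set(tokenize(text)))
    recurring = [w for w, c in counts.items() if c >= 2]
    recurring.sort(key=lambda w: (-idf.get(w, 0.0), w))
    return recurring[:top_k]


dataset = ["Apple iPhone 6 16GB", "iPhone 6 by Apple 16GB",
           "Samsung Galaxy S5 16GB", "Galaxy S5 Samsung phone"]
labels = label_set(dataset[:2], idf_table(dataset))
```

The word `16gb` recurs in the cluster but appears in three of the four records, so its IDF is lower than that of the other recurring words and it is left out of the top three labels.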
Optionally, obtaining the probability that the initial record pair represents the same entity includes:
calculating the similarity of the initial record pair according to the similarity between corresponding attributes of the initial record pair;
calculating, based on a machine learning model, the probability that the initial record pair represents the same entity.
In this embodiment of the present invention, the similarity of each initial record pair (an initial record pair consists of two initial records) can be represented by a feature vector, each dimension of which represents the similarity of one attribute between the two initial records. Assuming that n similarity functions are used to measure m attributes, the feature vector has n*m dimensions; in this way the similarity of the initial record pair is calculated from the similarity between its corresponding attributes. Further, entity resolution based on a machine learning model can be regarded as a classification problem: for example, a positive result indicates that the two records represent the same entity, and a negative result indicates that the two records represent different entities. That is, the input of a typical classifier is the feature vector representing the similarity of a pair of records, and its output is a binary classification result. The present application, however, needs the probability that a pair of records represents the same entity. Therefore, this embodiment of the present invention calculates the probability that the initial record pair represents the same entity based on a machine learning model. Optionally, a classifier is trained on a training set that contains feature vectors of duplicate record pairs (i.e. records representing the same entity) and of non-duplicate record pairs (i.e. records representing different entities); the trained classifier can then give the probability that each initial record pair represents the same entity.
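A minimal sketch of the n*m feature vector and of a probability output follows. Everything concrete here is an assumption for illustration: the two similarity functions (token Jaccard and `difflib.SequenceMatcher` ratio), the attribute names `name` and `address`, and the logistic model with hand-set `WEIGHTS` and `BIAS`. In practice the weights would come from training on labelled duplicate and non-duplicate pairs, and the embodiment does not prescribe a particular model.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


SIM_FUNCS = [jaccard, edit_ratio]   # n = 2 similarity functions (assumed)
ATTRIBUTES = ["name", "address"]    # m = 2 attributes (assumed)


def feature_vector(r1: Dict[str, str], r2: Dict[str, str]) -> List[float]:
    """One dimension per (attribute, similarity function): n*m in total."""
    return [f(r1[attr], r2[attr]) for attr in ATTRIBUTES for f in SIM_FUNCS]


def match_probability(features: Sequence[float],
                      weights: Sequence[float], bias: float) -> float:
    """Logistic model: maps a feature vector to a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


WEIGHTS, BIAS = [2.0, 2.0, 2.0, 2.0], -4.0   # hypothetical, not trained

a = {"name": "Golden Dragon Restaurant", "address": "12 Main St"}
b = {"name": "Golden Dragon Restaurant", "address": "12 Main Street"}
c = {"name": "Blue Lagoon Cafe", "address": "98 Ocean Ave"}
p_near = match_probability(feature_vector(a, b), WEIGHTS, BIAS)
p_diff = match_probability(feature_vector(a, c), WEIGHTS, BIAS)
```

The point of the sketch is the shape of the computation: a pair of records becomes a fixed-length vector of attribute similarities, and the model returns a probability rather than a hard yes/no label, which is what the threshold filtering needs.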
Optionally, the upper probability threshold and the lower probability threshold can be determined according to the target recall and precision. If the upper probability threshold is set too low, precision decreases; if the lower probability threshold is set too high, recall decreases; if the upper probability threshold is too high or the lower probability threshold is too low, filtering efficiency suffers.
S102. When it is detected that a new record has been added to the database, acquire feature information of the new record.
In this embodiment of the present invention, when it is detected that a new record R has been added to the database, the feature information of the new record is acquired. Optionally, the attribute values of the new record are tokenized, stripped of stop words, and stemmed, and keywords (i.e. the feature information) are extracted according to term frequency-inverse document frequency (TF-IDF) values, so that the new record can be classified appropriately according to its feature information and the subset information of the cluster subsets already in the database: if it is determined that the new record and the records in one of the cluster subsets represent the same entity, the new record is merged into that cluster subset; if it is determined that the new record does not represent the same entity as the records in any cluster subset, a new cluster subset is established for the new record.
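Keyword extraction by TF-IDF for a new record can be sketched as below. The sketch makes several assumptions that are not in the embodiment: stop-word removal and stemming are omitted, the document collection used for the IDF is taken to be the existing records plus the new record (so that every word of the new record has a document frequency of at least one), and the name `tfidf_keywords` and the `top_k` cut-off are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(new_record: str, existing_records: Sequence[str],
                   top_k: int = 3) -> List[str]:
    """Rank the words of `new_record` by TF-IDF and keep the top `top_k`."""
    docs = [set(tokenize(t)) for t in existing_records]
    docs.append(set(tokenize(new_record)))      # assumed: corpus includes R
    n = len(docs)
    tokens = tokenize(new_record)
    tf = Counter(tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in docs if word in d)  # df >= 1 by construction
        scores[word] = (count / len(tokens)) * math.log(n / df)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


existing = ["apple iphone 6 16gb", "samsung galaxy s5 16gb",
            "apple ipad air 16gb"]
keywords = tfidf_keywords("apple iphone 6 plus 16gb", existing)
```

Words shared by every record, such as `16gb` here, get an IDF of zero and fall to the bottom of the ranking, while words that distinguish the new record rise to the top.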
S103. According to the subset information of the at least two cluster subsets and the feature information of the new record, obtain, from the at least two cluster subsets, at least two related cluster subsets most relevant to the new record.
In this embodiment of the present invention, the at least two related cluster subsets most relevant to the new record are obtained from the at least two cluster subsets by information retrieval, according to the subset information of the at least two cluster subsets and the feature information of the new record. Optionally, the subset information of the at least two cluster subsets includes the label set information and index information of the cluster subsets.
Optionally, step S103 includes:
establishing an inverted index according to the label set information and index information of the at least two cluster subsets obtained by performing hierarchical clustering on the initial records in the database with the crowdsourcing-based hierarchical clustering method;
searching according to the inverted index and the feature information of the new record, and obtaining, from the at least two cluster subsets, the at least two related cluster subsets most relevant to the new record.
In this embodiment of the present invention, an inverted index is established according to the label set information and index information of the at least two cluster subsets obtained by performing hierarchical clustering on the initial records in the database with the crowdsourcing-based hierarchical clustering method. Optionally, according to the label set information and index information of each cluster subset, every label in the label set of each cluster subset is used as a key and the storage address of that cluster subset is used as the value; that is, each entry in the inverted index includes an attribute value and the storage addresses of the records that have that attribute value (the index is called inverted because the attribute value determines the location of the record, rather than the record determining the attribute value). Further, retrieval is performed by information retrieval according to the inverted index and the feature information of the new record, and the at least two related cluster subsets most relevant to the new record, i.e. the at least two related cluster subsets most likely to represent the same entity as the new record, are obtained from the at least two cluster subsets.
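The inverted index and the retrieval step can be sketched as follows. In this sketch a cluster identifier such as `"c1"` stands in for the storage address of a cluster subset, and relevance is scored simply as the number of the new record's keywords that appear among a cluster's labels; both choices, along with the function names and the parameter `k`, are assumptions for illustration and not the embodiment's retrieval method.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(cluster_labels: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each label (key) to the clusters whose label set contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for cluster_id, labels in cluster_labels.items():
        for label in labels:
            index[label].add(cluster_id)
    return index


def most_relevant_clusters(index: Dict[str, Set[str]],
                           keywords: Iterable[str], k: int = 2) -> List[str]:
    """Return up to k cluster ids sharing the most keywords with the record."""
    hits = Counter()
    for kw in keywords:
        for cluster_id in index.get(kw, ()):
            hits[cluster_id] += 1
    ranked = sorted(hits, key=lambda cid: (-hits[cid], cid))
    return ranked[:k]


cluster_labels = {
    "c1": {"apple", "iphone"},
    "c2": {"apple", "ipad"},
    "c3": {"samsung", "galaxy"},
}
index = build_inverted_index(cluster_labels)
related = most_relevant_clusters(index, ["apple", "iphone", "plus"], k=2)
```

Because lookup goes from label to cluster, only clusters that share at least one keyword with the new record are ever touched; `c3` is never scored in this example.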
S104. Determine candidate record pairs respectively corresponding to the at least two related cluster subsets according to the similarity between the new record and each record in the at least two related cluster subsets.
In this embodiment of the present invention, according to the similarity between the new record R and each record in the at least two related cluster subsets, the record R' with the greatest similarity to the new record is selected from each related cluster subset, thereby determining the candidate record pairs (R, R') respectively corresponding to the at least two related cluster subsets.
Optionally, step S104 includes:
separately calculating the similarity between the new record and each record in the at least two related cluster subsets;
selecting, from each related cluster subset, the record with the greatest similarity to the new record, and forming with the new record a candidate record pair corresponding to that related cluster subset, where the number of related cluster subsets equals the number of candidate record pairs.
In this embodiment of the present invention, the similarity between the new record R and each record in the at least two related cluster subsets is calculated; further, the record R' with the greatest similarity to the new record R is selected from each related cluster subset and paired with the new record R to form the candidate record pair (R, R') corresponding to that related cluster subset. The number of related cluster subsets equals the number of candidate record pairs: for example, if the five related cluster subsets most relevant to the new record R are obtained in step S103, five candidate record pairs, one for each of the five related cluster subsets, are determined in step S104. Optionally, the similarity between the new record R and each record in the at least two related cluster subsets can be calculated with a text similarity algorithm.
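Selecting one candidate record pair per related cluster subset can be sketched as below. The text leaves the text similarity algorithm open; this sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, and the function names, cluster identifiers and sample records are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple


def text_similarity(a: str, b: str) -> float:
    """Stand-in text similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(new_record: str,
                    related_clusters: Dict[str, Sequence[str]]
                    ) -> List[Tuple[str, str, str]]:
    """For each related cluster, pair R with its most similar record R'."""
    pairs = []
    for cluster_id, records in related_clusters.items():
        best = max(records, key=lambda r: text_similarity(new_record, r))
        pairs.append((cluster_id, new_record, best))
    return pairs


R = "apple iphone 6 plus"
related_clusters = {
    "c1": ["apple iphone 6", "iphone six"],
    "c2": ["apple ipad air", "ipad"],
}
pairs = candidate_pairs(R, related_clusters)
```

The result contains exactly one pair per related cluster subset, so the number of questions sent to the crowd grows with the number of related cluster subsets rather than with the number of records they contain.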
S105. Determine, by means of crowdsourcing user annotation, whether at least one of the candidate record pairs represents the same entity; if it is determined that a first candidate record pair represents the same entity, add the new record to the first cluster subset to which the first record belongs and update the label set of the first cluster subset; if it is determined that none of the candidate record pairs represents the same entity, establish a new cluster subset for the new record and create a label set for the new cluster subset; where the first record and the new record form the first candidate record pair.
In this embodiment of the present invention, after the candidate record pairs respectively corresponding to the at least two related cluster subsets are determined according to the similarity between the new record and each record in the at least two related cluster subsets, the candidate record pairs (R, R') are sent to the crowdsourcing platform, and whether at least one candidate record pair (R, R') represents the same entity is determined by means of crowdsourcing user annotation. If it is determined that the first candidate record pair (R, R1') represents the same entity, the new record R is added to the first cluster subset to which the first record R1' belongs (the first cluster subset is one of the at least two related cluster subsets), and the label set of the first cluster subset is updated. If it is determined that none of the candidate record pairs (R, R') represents the same entity, a new cluster subset is established for the new record R and a label set is created for it; optionally, the label set of the new cluster subset is created by extracting the keywords of the new record. The first candidate record pair (R, R1') is one of the candidate record pairs (R, R').
Optionally, step S105 includes:
sending all the candidate record pairs to the crowdsourcing platform, so that the crowdsourcing platform judges whether each candidate record pair represents the same entity;
receiving the second judgment result returned by the crowdsourcing platform, and determining according to the second judgment result whether at least one of the candidate record pairs represents the same entity; if it is determined according to the second judgment result that the first candidate record pair represents the same entity, adding the new record to the first cluster subset to which the first record belongs and updating the label set of the first cluster subset; if it is determined according to the second judgment result that none of the candidate record pairs represents the same entity, establishing a new cluster subset for the new record and creating a label set for the new cluster subset.
In this embodiment of the present invention, all the candidate record pairs (R, R') determined in step S104 are sent to the crowdsourcing platform, so that the crowdsourcing platform judges whether each candidate record pair (R, R') represents the same entity; for example, the crowdsourcing platform indicates whether a candidate record pair (R, R') represents the same entity by annotating it. Further, the second judgment result returned by the crowdsourcing platform (for example, the platform's annotations of whether each candidate record pair represents the same entity) is received, and whether at least one candidate record pair represents the same entity is determined according to the second judgment result. Because the crowdsourcing platform includes multiple crowdsourcing users, each candidate record pair is annotated by multiple users; that is, the second judgment result consists of the annotations by multiple crowdsourcing users of whether each candidate record pair represents the same entity. Optionally, a voting algorithm is used to aggregate the crowdsourcing results, i.e. the answer that receives more than half of the votes is taken as the result: if more than half of the crowdsourcing users annotate a candidate record pair as representing the same entity, the candidate record pair is determined to represent the same entity; if only a few, or fewer than half, of the crowdsourcing users annotate it as representing the same entity, the candidate record pair is determined not to represent the same entity. If it is determined according to the second judgment result that the first candidate record pair represents the same entity, the new record is added to the first cluster subset to which the first record belongs, and the label set of the first cluster subset is updated; if it is determined according to the second judgment result that none of the candidate record pairs represents the same entity, a new cluster subset is established for the new record and a label set is created for the new cluster subset.
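The majority-vote aggregation and the resulting placement of the new record can be sketched as follows. The data layout is an assumption: each candidate record pair is represented by the identifier of its cluster subset together with a list of boolean votes (`True` meaning "same entity"). Taking the first pair that wins a majority, the `new-N` naming of a fresh cluster, and the omission of label set updates are simplifications of this sketch.

```python
from typing import Dict, List, Sequence, Tuple


def majority_same_entity(votes: Sequence[bool]) -> bool:
    """True only if more than half of the crowd voted 'same entity'."""
    return sum(votes) > len(votes) / 2


def assign_new_record(new_record: str,
                      candidate_votes: Sequence[Tuple[str, Sequence[bool]]],
                      clusters: Dict[str, List[str]]) -> str:
    """Add R to the first cluster whose pair wins a majority, else open a
    new cluster for R. Returns the id of the cluster that now holds R."""
    for cluster_id, votes in candidate_votes:
        if majority_same_entity(votes):
            clusters[cluster_id].append(new_record)
            return cluster_id
    new_id = f"new-{len(clusters)}"          # hypothetical naming scheme
    clusters[new_id] = [new_record]
    return new_id


clusters = {"c1": ["a"], "c2": ["b"]}
# Three crowd workers voted on each candidate pair.
placed = assign_new_record(
    "R", [("c1", [True, False, False]), ("c2", [True, True, False])], clusters)
# No pair reaches a majority, so a new cluster is created.
fresh = assign_new_record(
    "R2", [("c1", [False, False, True]), ("c2", [False, False, False])], clusters)
```

A tie (for example two votes each way among four workers) does not count as a majority under this rule, so the pair is treated as not representing the same entity.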
In this embodiment of the present invention, hierarchical clustering is performed on the initial records in the database with the crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets. When it is detected that a new record has been added to the database, the feature information of the new record is acquired. Then, according to the subset information of the at least two cluster subsets and the feature information of the new record, at least two related cluster subsets most relevant to the new record are obtained from the at least two cluster subsets, where the subset information of the at least two cluster subsets includes the label set information and index information of the cluster subsets. Next, candidate record pairs respectively corresponding to the at least two related cluster subsets are determined according to the similarity between the new record and each record in the at least two related cluster subsets. Finally, whether at least one of the candidate record pairs represents the same entity is determined by means of crowdsourcing user annotation: if it is determined that the first candidate record pair represents the same entity, the new record is added to the first cluster subset to which the first record belongs and the label set of the first cluster subset is updated; if it is determined that none of the candidate record pairs represents the same entity, a new cluster subset is established for the new record and a label set is created for the new cluster subset; the first record and the new record form the first candidate record pair. In this way, entity resolution can be performed on both static and dynamic data sets, achieving high recall and precision at low cost and thereby improving resolution efficiency.
Optionally, sequentially and hierarchically merging the primary cluster subsets by means of crowdsourcing user annotation until the minimum distance between the merged cluster subsets is greater than the lower threshold, finally obtaining at least two cluster subsets, includes:
Step A: calculating the distance between each pair of primary cluster subsets, and selecting the pair of primary cluster subsets with the smallest distance as the two candidate merge subsets;
Step B: determining whether the distance between the two candidate merge subsets is less than the lower threshold; if the distance between the two candidate merge subsets is less than the lower threshold, selecting a second record from each of the two candidate merge subsets to form a second candidate record pair, and sending the second candidate record pair and the label sets of the two candidate merge subsets to the crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; where the second candidate record pair is the record pair, across the two candidate merge subsets, with the greatest probability of representing the same entity;
Step C: receiving the first judgment result returned by the crowdsourcing platform, determining according to the first judgment result whether to merge the two candidate merge subsets, and sorting and/or filtering the labels in the label sets according to the number of likes the crowdsourcing platform gave to each label; if it is determined according to the first judgment result that the two candidate merge subsets represent the same entity, merging the two candidate merge subsets into one cluster subset, updating the label set and index of that cluster subset, and treating the merged cluster subset as a primary cluster subset; if it is determined according to the first judgment result that the two candidate merge subsets do not represent the same entity, setting the distance between the two candidate merge subsets to 1;
returning to Step A and repeating Steps A to C until the distance between the two candidate merge subsets is greater than the lower threshold, and then taking the at least two primary cluster subsets as the at least two cluster subsets obtained.
In this embodiment of the present invention, in Step A the distance between each pair of primary cluster subsets is calculated, and the pair with the smallest distance is selected as the two candidate merge subsets; the pair of primary cluster subsets with the smallest distance is the pair whose records are most likely to represent the same entity. Optionally, calculating the distance between each pair of primary cluster subsets includes: selecting, from each pair of primary cluster subsets, the record pair (r_i, r_j) with the greatest probability of representing the same entity, where r_i ∈ C_i, r_j ∈ C_j, C_i is one primary cluster subset of the pair and C_j is the other; and further obtaining the distance between the pair of primary cluster subsets according to the formula [formula not reproduced], where maxSimi is the probability that the record pair (r_i, r_j) represents the same entity and cosinSimi is the cosine similarity of the pair of primary cluster subsets. Optionally, the distance between each pair of primary cluster subsets may also be calculated in other ways, which are not described here.
Further, in Step B it is determined whether the distance between the two candidate merge subsets is less than the lower threshold (that is, whether the probability that the record pairs across the two candidate merge subsets represent the same entity is greater than the lower probability threshold). If the distance between the two candidate merge subsets is less than the lower threshold (i.e. the probability that the record pairs across the two candidate merge subsets represent the same entity is greater than the lower probability threshold), a second record is selected from each of the two candidate merge subsets to form the second candidate record pair, and the second candidate record pair and the label sets of the two candidate merge subsets are sent to the crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; for example, the crowdsourcing platform indicates whether the second candidate record pair represents the same entity by annotating it. The second candidate record pair is the record pair, across the two candidate merge subsets, with the greatest probability of representing the same entity: for example, if the two candidate merge subsets are C_1 and C_2, the record pair (r_1, r_2) with the greatest probability of representing the same entity is selected as the second candidate record pair, where r_1 is the second record in C_1 and r_2 is the second record in C_2.
Further, in step C, the first judgment result returned by the crowdsourcing platform is received. This result comprises the crowdsourcing platform's annotation of whether the second candidate record pair represents the same entity, together with the number of likes given to each label in the label sets of the two candidate merging subsets. Based on the first judgment result, it is determined whether to merge the two candidate merging subsets, and the labels in the label sets are sorted and/or filtered according to the number of likes they received. Because the crowdsourcing platform includes multiple crowdsourcing users, each of whom annotates whether the second candidate record pair represents the same entity, the first judgment result aggregates the annotations and like counts of multiple users. Optionally, a voting algorithm may be used to aggregate the crowdsourced results, that is, the answer that receives more than half of the votes is taken as the result: if more than half of the crowdsourcing users annotate the second candidate record pair as representing the same entity, it is determined that the second candidate record pair represents the same entity; if only a few, or fewer than half, of the crowdsourcing users annotate the pair as representing the same entity, it is determined that the pair does not represent the same entity.

If it is determined according to the first judgment result that the two candidate merging subsets represent the same entity, the two candidate merging subsets are merged into one cluster subset, the label set and index of that cluster subset are updated, and the merged cluster subset is treated as a primary cluster subset. If it is determined according to the first judgment result that the two candidate merging subsets do not represent the same entity, the distance between the two candidate merging subsets is set to 1.

Further, steps A to C are repeated until the distance between the two candidate merging subsets is greater than the lower threshold, which means that the probability that a record pair drawn from the two candidate merging subsets represents the same entity is below the lower probability threshold. Because the pair of primary cluster subsets with the smallest distance is always chosen as the candidate merging subsets, once the distance between the two candidate merging subsets exceeds the lower threshold, the distance between every other pair of primary cluster subsets must also exceed it; that is, no two primary cluster subsets represent the same entity. The clustering algorithm therefore stops, and the at least two primary cluster subsets existing at that point are taken as the resulting at least two cluster subsets. In this way, the primary cluster subsets are iteratively and hierarchically merged into larger cluster subsets in a definite order.
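The majority-vote aggregation and the like-based label ordering described for step C can be illustrated with a minimal sketch. The function names, the boolean encoding of an annotation, and the `min_likes` cutoff are assumptions made for illustration; the text does not prescribe a particular filtering rule.

```python
def majority_vote(answers):
    """Return True only if more than half of the annotations say 'same entity'.

    `answers` is a list of booleans, one per crowdsourcing user (assumed encoding).
    """
    yes = sum(1 for a in answers if a)
    return yes * 2 > len(answers)


def rank_labels(like_counts, min_likes=1):
    """Sort labels by like count (descending) and drop labels below `min_likes`.

    `like_counts` maps a label to the number of likes it received.
    `min_likes` is a hypothetical filter threshold.
    """
    kept = [(label, n) for label, n in like_counts.items() if n >= min_likes]
    kept.sort(key=lambda item: (-item[1], item[0]))  # ties broken alphabetically
    return [label for label, _ in kept]
```

Under this sketch a tie (exactly half of the votes) counts as "not the same entity", which is one reasonable reading of "more than half".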
Fig. 2 is a schematic flowchart of Embodiment 2 of the entity resolution method based on group computing of the present invention. Building on the above embodiment, it details the step of hierarchically clustering the initial records in the database with the crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets. As shown in Fig. 2, the method of this embodiment may include:
S201. Obtain the probability that each initial record pair represents the same entity, then perform step S202;
S202. Group into one class the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold, forming the corresponding primary cluster subsets, create a label set and an index for each primary cluster subset, then perform step S203;
S203. Calculate the distance between each pair of primary cluster subsets, select the pair of primary cluster subsets with the smallest distance as the two candidate merging subsets, then perform step S204;
S204. Determine whether the distance between the two candidate merging subsets is less than the lower threshold; if so, perform step S205; if not (that is, the distance between the two candidate merging subsets is greater than the lower threshold), perform step S206;
S205. Select a second record from each of the two candidate merging subsets to form a second candidate record pair, and send the second candidate record pair and the label sets of the two candidate merging subsets to the crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; then perform step S207; wherein the second candidate record pair is the record pair, across the two candidate merging subsets, with the highest probability of representing the same entity;
S206. Terminate the clustering, and take the at least two primary cluster subsets as the resulting at least two cluster subsets;
S207. Receive the first judgment result returned by the crowdsourcing platform, determine according to the first judgment result whether to merge the two candidate merging subsets, and sort and/or filter the labels in the label sets according to the number of likes the crowdsourcing platform gave them; if it is determined according to the first judgment result that the two candidate merging subsets represent the same entity, perform step S208; if it is determined according to the first judgment result that the two candidate merging subsets do not represent the same entity, perform step S209;
S208. Merge the two candidate merging subsets into one cluster subset, update the label set and index of that cluster subset, treat the merged cluster subset as a primary cluster subset, and return to step S203;
S209. Set the distance between the two candidate merging subsets to 1, and return to step S203.
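The control flow of steps S201 to S209 can be sketched as a single loop. This is an illustrative sketch under stated assumptions, not the patented implementation: `prob` (pair probability) and `ask_crowd` (aggregated crowd answer) are hypothetical callbacks, the threshold defaults are arbitrary, the cluster distance is assumed to be one minus the highest cross-cluster pair probability, and label sets and indexes are left out for brevity.

```python
from itertools import combinations


def crowd_hierarchical_clustering(records, prob, ask_crowd,
                                  upper_prob=0.8, lower_dist=0.5):
    """Sketch of S201-S209; records must be hashable and distinct."""
    assert lower_dist <= 1.0, "a rejected pair gets distance 1 and must end the loop"

    # S201-S202: group records whose pair probability exceeds the upper threshold.
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if prob(a, b) > upper_prob:
            parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    clusters = [frozenset(g) for g in groups.values()]

    rejected = set()  # cluster pairs whose distance was set to 1 (S209)

    def best_pair(ci, cj):
        # Record pair across the two clusters with the highest probability.
        return max(((a, b) for a in ci for b in cj), key=lambda p: prob(*p))

    def distance(ci, cj):
        if frozenset((ci, cj)) in rejected:
            return 1.0
        # Assumed distance (not the patent's formula): 1 - best pair probability.
        return 1.0 - prob(*best_pair(ci, cj))

    while len(clusters) >= 2:
        ci, cj = min(combinations(clusters, 2), key=lambda p: distance(*p))  # S203
        if distance(ci, cj) >= lower_dist:      # S204 "no" branch, then S206
            break
        a, b = best_pair(ci, cj)                # S205
        if ask_crowd(a, b):                     # S207
            clusters.remove(ci)                 # S208
            clusters.remove(cj)
            clusters.append(ci | cj)
        else:
            rejected.add(frozenset((ci, cj)))   # S209
    return clusters
```

Each iteration either merges two subsets or marks a pair as rejected, so the loop always terminates. Only pairs whose distance falls below the lower threshold are sent to the crowd, which is how the method limits the number of crowdsourced questions.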
Fig. 3 is a schematic flowchart of Embodiment 3 of the entity resolution method based on group computing of the present invention. Building on the above embodiments, as shown in Fig. 3, the method of this embodiment may include:
S301. When it is detected that a new record has been added to the database, obtain the feature information of the new record;
S302. According to the subset information of the at least two cluster subsets and the feature information of the new record, obtain from the at least two cluster subsets at least two relevant cluster subsets that are most relevant to the new record;
S303. Determine the candidate record pairs respectively corresponding to the at least two relevant cluster subsets, and send all the candidate record pairs to the crowdsourcing platform, so that the crowdsourcing platform judges whether the candidate record pairs represent the same entity;
S304. Receive the second judgment result returned by the crowdsourcing platform, and determine according to the second judgment result whether at least one of the candidate record pairs represents the same entity; if it is determined according to the second judgment result that a first candidate record pair represents the same entity, perform step S305; if it is determined according to the second judgment result that none of the candidate record pairs represents the same entity, perform step S306;
S305. Add the new record to the first cluster subset to which the first record belongs, and update the label set of the first cluster subset;
S306. Establish a new cluster subset for the new record, and create a label set for the new cluster subset.
In this embodiment of the present invention, the at least two cluster subsets in step S302 are the cluster subsets already present in the database at the time the addition of the new record is detected.
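The decision made in steps S304 to S306 can be sketched as follows. The data layout (a dict of cluster id to record list, a dict of cluster id to label set) and the callbacks `crowd_says_same` and `labels_of` are hypothetical names introduced for illustration only.

```python
def place_new_record(new_record, candidate_pairs, crowd_says_same,
                     clusters, label_sets, labels_of):
    """Assign `new_record` to an existing cluster subset or create a new one.

    candidate_pairs: list of (cluster_id, existing_record), one per relevant subset.
    crowd_says_same: hypothetical callback returning the aggregated crowd answer.
    labels_of:       hypothetical callback extracting labels from a record.
    Returns the id of the cluster subset that now holds the record.
    """
    for cluster_id, existing in candidate_pairs:
        if crowd_says_same(new_record, existing):          # S304 -> S305
            clusters[cluster_id].append(new_record)
            label_sets[cluster_id] |= labels_of(new_record)
            return cluster_id

    new_id = max(clusters, default=-1) + 1                 # S306
    clusters[new_id] = [new_record]
    label_sets[new_id] = set(labels_of(new_record))
    return new_id
```

The sketch stops at the first pair the crowd confirms; the text only requires that at least one candidate record pair be judged to represent the same entity.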
Fig. 4 is a schematic structural diagram of Embodiment 1 of the entity resolution device based on group computing of the present invention. As shown in Fig. 4, the entity resolution device 40 based on group computing provided in this embodiment may include: a hierarchical clustering module 401, a detection module 402, a first determination module 403, a second determination module 404, and a division module 405.
The hierarchical clustering module 401 is configured to hierarchically cluster the initial records in the database using the crowdsourcing-based hierarchical clustering method, to obtain at least two cluster subsets;
The detection module 402 is configured to obtain the feature information of a new record when it is detected that the new record has been added to the database;
The first determination module 403 is configured to obtain, from the at least two cluster subsets, at least two relevant cluster subsets that are most relevant to the new record, according to the subset information of the at least two cluster subsets and the feature information of the new record; wherein the subset information of the at least two cluster subsets includes the label set information and the index information of the cluster subsets;
The second determination module 404 is configured to determine the candidate record pairs respectively corresponding to the at least two relevant cluster subsets, according to the similarity between the new record and each record in the at least two relevant cluster subsets;
The division module 405 is configured to judge, by means of crowdsourcing user annotation, whether at least one of the candidate record pairs represents the same entity; if it is determined that a first candidate record pair represents the same entity, add the new record to the first cluster subset to which the first record belongs and update the label set of the first cluster subset; if it is determined that none of the candidate record pairs represents the same entity, establish a new cluster subset for the new record and create a label set for the new cluster subset; wherein the first record and the new record form the first candidate record pair.
Optionally, the hierarchical clustering module includes:
A primary clustering unit, configured to group into one class, according to the probability that each pair of initial records represents the same entity, the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold, forming the corresponding primary cluster subsets, and to create a label set and an index for each primary cluster subset; wherein each pair of initial records forms an initial record pair;
A hierarchical clustering unit, configured to merge the primary cluster subsets hierarchically and in sequence by means of crowdsourcing user annotation, until the minimum distance between the merged cluster subsets is greater than the lower threshold, finally obtaining at least two cluster subsets.
Optionally, the primary clustering unit is specifically configured to:
Obtain the probability that each initial record pair represents the same entity;
Group into one class the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold, forming the corresponding primary cluster subsets.
Optionally, the hierarchical clustering unit is specifically configured to perform:
Step A. Calculate the distance between each pair of primary cluster subsets, and select the pair of primary cluster subsets with the smallest distance as the two candidate merging subsets;
Step B. Determine whether the distance between the two candidate merging subsets is less than the lower threshold; if the distance between the two candidate merging subsets is less than the lower threshold, select a second record from each of the two candidate merging subsets to form a second candidate record pair, and send the second candidate record pair and the label sets of the two candidate merging subsets to the crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; wherein the second candidate record pair is the record pair, across the two candidate merging subsets, with the highest probability of representing the same entity;
Step C. Receive the first judgment result returned by the crowdsourcing platform, determine according to the first judgment result whether to merge the two candidate merging subsets, and sort and/or filter the labels in the label sets according to the number of likes the crowdsourcing platform gave them; if it is determined according to the first judgment result that the two candidate merging subsets represent the same entity, merge the two candidate merging subsets into one cluster subset, update the label set and index of that cluster subset, and treat the merged cluster subset as a primary cluster subset; if it is determined according to the first judgment result that the two candidate merging subsets do not represent the same entity, set the distance between the two candidate merging subsets to 1;
Repeat steps A to C until the distance between the two candidate merging subsets is greater than the lower threshold, then take the at least two primary cluster subsets as the resulting at least two cluster subsets.
Optionally, the primary clustering unit is further specifically configured to:
Calculate the similarity of each initial record pair according to the similarity between the corresponding attributes of the two records in the pair;
Calculate, based on a machine learning model, the probability that the initial record pair represents the same entity.
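As an illustration of these two operations, the sketch below computes one similarity per attribute and feeds the resulting vector to a logistic model. The choice of token-level Jaccard similarity, the logistic form, and the `weights` and `bias` parameters are all assumptions; the text only states that attribute similarities are computed and that a machine learning model produces the probability.

```python
import math


def jaccard(a, b):
    """Token-level Jaccard similarity between two attribute strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def attribute_similarities(r1, r2, attributes):
    """One similarity value per corresponding attribute of the two records (dicts)."""
    return [jaccard(r1.get(k, ""), r2.get(k, "")) for k in attributes]


def same_entity_probability(features, weights, bias):
    """Logistic model over the attribute similarities; weights would be learned."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

In practice the weights would come from a model trained on labelled record pairs; any classifier that outputs a probability could take the place of the logistic function.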
Optionally, the hierarchical clustering unit is further specifically configured to:
Select, from each pair of primary cluster subsets, the record pair (r_i, r_j) with the highest probability of representing the same entity, where r_i ∈ C_i, r_j ∈ C_j, C_i is one primary cluster subset of the pair, and C_j is the other primary cluster subset of the pair;
Obtain the distance between each pair of primary cluster subsets according to the formula (not reproduced in this text), where maxSimi is the probability that the record pair (r_i, r_j) represents the same entity, and cosinSimi is the cosine similarity of the pair of primary cluster subsets.
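The two quantities that enter the distance can each be sketched on their own. Because the combining formula is not reproduced in this text, the sketch deliberately does not combine them. Representing each cluster subset as a bag-of-words vector over its records' text is an assumption, as is the hypothetical `prob` callback.

```python
import math
from collections import Counter


def max_sim(ci, cj, prob):
    """Highest same-entity probability over record pairs (r_i, r_j), r_i in ci, r_j in cj."""
    return max(prob(a, b) for a in ci for b in cj)


def cosine_sim(ci, cj):
    """Cosine similarity of two cluster subsets, each viewed as a term-count vector."""
    va = Counter(tok for r in ci for tok in r.lower().split())
    vb = Counter(tok for r in cj for tok in r.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```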
Optionally, the first determination module is specifically configured to:
Build an inverted index from the label set information and index information of the at least two cluster subsets obtained by hierarchically clustering the initial records in the database with the crowdsourcing-based hierarchical clustering method;
Search according to the inverted index and the feature information of the new record, to obtain from the at least two cluster subsets the at least two relevant cluster subsets that are most relevant to the new record.
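A minimal sketch of this lookup follows, assuming that each cluster subset's label set is a set of strings, that the new record's feature information is a collection of terms, and that relevance is the number of matching labels. The function names and the parameter `k` are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(label_sets):
    """Map each label to the ids of the cluster subsets whose label set contains it."""
    index = defaultdict(set)
    for cluster_id, labels in label_sets.items():
        for label in labels:
            index[label].add(cluster_id)
    return index


def most_relevant_clusters(index, features, k=2):
    """Return the ids of the k cluster subsets sharing the most labels with `features`."""
    hits = Counter()
    for term in features:
        for cluster_id in index.get(term, ()):
            hits[cluster_id] += 1
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cluster_id for cluster_id, _ in ranked[:k]]
```

With the index in place, a new record is compared only against the few subsets it shares labels with rather than against every record in the database.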
Optionally, the second determination module is specifically configured to:
Calculate the similarity between the new record and each record in the at least two relevant cluster subsets;
Select, from each relevant cluster subset, the one record most similar to the new record, and pair it with the new record to form the candidate record pair corresponding to that relevant cluster subset; wherein the number of relevant cluster subsets equals the number of candidate record pairs.
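The selection of one candidate record pair per relevant cluster subset can be sketched as below. The `token_overlap` similarity is a stand-in chosen for illustration; the text does not fix the similarity measure, and the function names are hypothetical.

```python
def token_overlap(a, b):
    """Illustrative similarity: Jaccard overlap of whitespace tokens (an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def candidate_pairs(new_record, relevant_clusters, similarity=token_overlap):
    """For each relevant cluster subset, pair the new record with its most similar record.

    relevant_clusters: dict of cluster id -> non-empty list of records.
    Returns a list of (cluster_id, existing_record, new_record), one per subset.
    """
    pairs = []
    for cluster_id, records in relevant_clusters.items():
        best = max(records, key=lambda r: similarity(new_record, r))
        pairs.append((cluster_id, best, new_record))
    return pairs
```

Sending only the most similar record of each subset keeps the number of crowdsourced questions equal to the number of relevant cluster subsets.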
Optionally, the division module is specifically configured to:
Send all the candidate record pairs to the crowdsourcing platform, so that the crowdsourcing platform judges whether the candidate record pairs represent the same entity;
Receive the second judgment result returned by the crowdsourcing platform, and determine according to the second judgment result whether at least one of the candidate record pairs represents the same entity; if it is determined according to the second judgment result that a first candidate record pair represents the same entity, add the new record to the first cluster subset to which the first record belongs and update the label set of the first cluster subset; if it is determined according to the second judgment result that none of the candidate record pairs represents the same entity, establish a new cluster subset for the new record and create a label set for the new cluster subset.
The entity resolution device based on group computing of this embodiment may be used to execute the technical solution of any of the above embodiments of the entity resolution method based on group computing of the present invention; its implementation principle and technical effect are similar and are not repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201510076586.4A CN104573130B (en) | 2015-02-12 | 2015-02-12 | The entity resolution method and device calculated based on colony | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN104573130A CN104573130A (en) | 2015-04-29 | 
| CN104573130B true CN104573130B (en) | 2017-11-03 | 
Family
ID=53089191
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN201510076586.4A Expired - Fee Related CN104573130B (en) | 2015-02-12 | 2015-02-12 | The entity resolution method and device calculated based on colony | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN104573130B (en) | 
Families Citing this family (13)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN105608318B (en) * | 2015-12-18 | 2018-06-15 | 清华大学 | Crowdsourcing marks integration method | 
| US10956453B2 (en) * | 2017-05-24 | 2021-03-23 | International Business Machines Corporation | Method to estimate the deletability of data objects | 
| CN107292365B (en) * | 2017-06-27 | 2021-01-08 | 百度在线网络技术(北京)有限公司 | Method, device and equipment for binding commodity label and computer readable storage medium | 
| CN107832156B (en) * | 2017-11-28 | 2021-08-10 | 南京航空航天大学 | Group computing system supporting heterogeneous tasks and computing devices | 
| CN108090221B (en) * | 2018-01-02 | 2019-05-10 | 北京市燃气集团有限责任公司 | A kind of correlating method of combustion gas card data and user management data | 
| US11501111B2 (en) | 2018-04-06 | 2022-11-15 | International Business Machines Corporation | Learning models for entity resolution using active learning | 
| US10776269B2 (en) | 2018-07-24 | 2020-09-15 | International Business Machines Corporation | Two level compute memoing for large scale entity resolution | 
| CN109241513A (en) * | 2018-08-27 | 2019-01-18 | 上海宝尊电子商务有限公司 | A kind of method and device based on big data crowdsourcing model data mark | 
| CN109710736B (en) * | 2018-12-19 | 2020-08-14 | 浙江大学 | An Active Crowdsourcing Task Generation Method for Search Ranking | 
| US11875253B2 (en) | 2019-06-17 | 2024-01-16 | International Business Machines Corporation | Low-resource entity resolution with transfer learning | 
| US11556845B2 (en) * | 2019-08-29 | 2023-01-17 | International Business Machines Corporation | System for identifying duplicate parties using entity resolution | 
| US11544477B2 (en) | 2019-08-29 | 2023-01-03 | International Business Machines Corporation | System for identifying duplicate parties using entity resolution | 
| CN113971216B (en) * | 2021-10-22 | 2023-02-03 | 北京百度网讯科技有限公司 | Data processing method, device, electronic device and memory | 
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US6223186B1 (en) * | 1998-05-04 | 2001-04-24 | Incyte Pharmaceuticals, Inc. | System and method for a precompiled database for biomolecular sequence information | 
| CN102708100A (en) * | 2011-03-28 | 2012-10-03 | 北京百度网讯科技有限公司 | Method and device for digging relation keyword of relevant entity word and application thereof | 
| CN103488671A (en) * | 2012-06-11 | 2014-01-01 | 国际商业机器公司 | Method and system for querying and integrating structured and instructured data | 
| CN103902742A (en) * | 2014-04-25 | 2014-07-02 | 中国科学院信息工程研究所 | Access control determination engine optimization system and method based on big data | 
| CN104239553A (en) * | 2014-09-24 | 2014-12-24 | 江苏名通信息科技有限公司 | Entity recognition method based on Map-Reduce framework | 
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| EP2693346A1 (en) * | 2012-07-30 | 2014-02-05 | ExB Asset Management GmbH | Resource efficient document search | 
Non-Patent Citations (1)
| Title | 
|---|
| 基于众包的电子商务数据实体分类系统;叶晨等;《计算机研究与发展》;20131231;405-409 * | 
Also Published As
| Publication number | Publication date | 
|---|---|
| CN104573130A (en) | 2015-04-29 | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171103 |