CN106339369B - A method and system for identifying synonyms in a dataset - Google Patents
A method and system for identifying synonyms in a dataset
- Publication number: CN106339369B (CN201610772919.1A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- value
- occurrence
- data set
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
 
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Description
Technical Field
The present invention relates to the technical field of semantic recognition, and in particular to a method and system for identifying synonyms in a dataset.
Background
Human-computer interaction is the study of the interaction between systems and users. The system may be any of a variety of machines, or a computer system or software, for example an intelligent retrieval system or a semantic understanding system.
Synonyms are an important part of human-computer interaction, and automatic synonym identification is an important part of knowledge base research. There are many approaches to it; common ones include identification based on word-form similarity and identification based on definitions. The former can only identify synonyms with similar word forms and cannot identify synonyms whose forms are entirely different. The latter depends on text with a specific structure: if a keyword is not defined in the text, it cannot be identified. Both approaches are therefore severely limited in practical applications.
In summary, how to improve the effectiveness of synonym identification is a problem that remains to be solved.
Summary of the Invention
In view of this, the object of the present invention is to provide a method and system for identifying synonyms in a dataset that improve the effectiveness of synonym identification. The specific solution is as follows:
A method for identifying synonyms in a dataset, comprising:
acquiring a dataset comprising N documents, where N is a positive integer;
extracting all keywords from each document;
determining all co-occurring words of each keyword, where the co-occurring words of a keyword are the keywords that appear together with it in the same document;
calculating the co-occurrence value between each keyword and each of its co-occurring words, where the co-occurrence value measures the probability that two keywords appear in the same document;
determining the high-value word group of each keyword, where the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value;
identifying whether any two keywords in the dataset are synonyms to obtain a corresponding thesaurus, where, if the similarity between the high-value word group of a first keyword and the high-value word group of a second keyword is not less than a preset similarity threshold, and the co-occurrence value between the first keyword and the second keyword is 0, the first keyword and the second keyword are identified as synonyms.
Preferably, the co-occurrence value between any two keywords is calculated as:
$E_{ij} = C_{ij}^{2} / (C_i \times C_j)$;
where $C_{ij}$ is the total number of documents in the dataset in which keyword $K_i$ and keyword $K_j$ both appear, $C_i$ is the total number of documents in the dataset in which keyword $K_i$ appears, $C_j$ is the total number of documents in the dataset in which keyword $K_j$ appears, and $E_{ij}$ is the co-occurrence value between keyword $K_i$ and keyword $K_j$.
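As a hedged worked example of the formula above, suppose (with counts invented purely for illustration) that keyword $K_i$ appears in 10 documents, keyword $K_j$ appears in 12 documents, and 6 documents contain both:

```latex
E_{ij} = \frac{C_{ij}^{2}}{C_i \times C_j}
       = \frac{6^{2}}{10 \times 12}
       = \frac{36}{120}
       = 0.3
```

Because $C_{ij} \le C_i$ and $C_{ij} \le C_j$, the value always lies between 0 and 1. It is 0 when the two keywords never share a document and 1 only when each keyword appears exclusively in documents that also contain the other.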
Preferably, after the high-value word group of each keyword has been determined, the method further comprises:
calculating the co-occurrence index of each keyword, where the co-occurrence index of a keyword is the average of the co-occurrence values between that keyword and all co-occurring words in its high-value word group.
Preferably, when the similarity between the high-value word group of the first keyword and the high-value word group of the second keyword is not less than the preset similarity threshold, the method further comprises:
if the co-occurrence value between the first keyword and the second keyword is not 0, calculating the average of the co-occurrence index of the first keyword and the co-occurrence index of the second keyword to obtain an average co-occurrence index;
calculating the ratio of the average co-occurrence index to the current co-occurrence value, where the current co-occurrence value is the co-occurrence value between the first keyword and the second keyword;
determining whether the ratio is not less than a preset ratio threshold; if so, identifying the first keyword and the second keyword as synonyms, and if not, identifying them as non-synonyms.
Preferably, the preset similarity threshold is 80% and the preset ratio threshold is 10.
The present invention also discloses a system for identifying synonyms in a dataset, comprising:
a dataset acquisition module, configured to acquire a dataset comprising N documents, where N is a positive integer;
a keyword extraction module, configured to extract all keywords from each document;
a co-occurring word determination module, configured to determine all co-occurring words of each keyword, where the co-occurring words of a keyword are the keywords that appear together with it in the same document;
a co-occurrence value calculation module, configured to calculate the co-occurrence value between each keyword and each of its co-occurring words, where the co-occurrence value measures the probability that two keywords appear in the same document;
a high-value word group determination module, configured to determine the high-value word group of each keyword, where the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value;
a synonym identification module, configured to identify whether any two keywords in the dataset are synonyms to obtain a corresponding thesaurus, where, if the similarity between the high-value word group of a first keyword and the high-value word group of a second keyword is not less than a preset similarity threshold, and the co-occurrence value between the first keyword and the second keyword is 0, the first keyword and the second keyword are identified as synonyms.
Preferably, the co-occurrence value calculation module calculates the co-occurrence value between any two keywords as:
$E_{ij} = C_{ij}^{2} / (C_i \times C_j)$;
where $C_{ij}$ is the total number of documents in the dataset in which keyword $K_i$ and keyword $K_j$ both appear, $C_i$ is the total number of documents in the dataset in which keyword $K_i$ appears, $C_j$ is the total number of documents in the dataset in which keyword $K_j$ appears, and $E_{ij}$ is the co-occurrence value between keyword $K_i$ and keyword $K_j$.
Preferably, the synonym identification system further comprises:
a co-occurrence index calculation module, configured to calculate the co-occurrence index of each keyword after the high-value word group determination module has determined the high-value word group of each keyword, where the co-occurrence index of a keyword is the average of the co-occurrence values between that keyword and all co-occurring words in its high-value word group.
Preferably, the synonym identification module further comprises:
an average calculation unit, configured to calculate, when the similarity between the high-value word group of the first keyword and the high-value word group of the second keyword is not less than the preset similarity threshold and the co-occurrence value between the first keyword and the second keyword is not 0, the average of the co-occurrence index of the first keyword and the co-occurrence index of the second keyword to obtain an average co-occurrence index;
a ratio calculation unit, configured to calculate the ratio of the average co-occurrence index to the current co-occurrence value, where the current co-occurrence value is the co-occurrence value between the first keyword and the second keyword;
a ratio judgment unit, configured to determine whether the ratio is not less than a preset ratio threshold; if so, the first keyword and the second keyword are identified as synonyms, and if not, they are identified as non-synonyms.
Preferably, the preset similarity threshold is 80% and the preset ratio threshold is 10.
It can be seen that, before identifying whether any two keywords in the dataset are synonyms, the present invention first calculates the co-occurrence value between each keyword and each of its co-occurring words and determines the high-value word group of each keyword. The co-occurrence value measures the probability that two keywords appear in the same document, and the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value. If the high-value word groups of two keywords are similar, the two keywords are likely to be synonyms. On that basis, a co-occurrence value of 0 between them means that the two keywords never appear in the same document, which is what one would expect of synonyms, because the author of a single document is unlikely to use two different word forms in succession to express the same meaning. Therefore, once a comparison of high-value word groups shows that two keywords are likely to be synonyms, and the two keywords are further found never to appear in the same document, they are identified as synonyms. This identification process neither compares the word forms themselves nor relies on a specific text structure, so it can substantially improve the accuracy of synonym identification. The method is also not restricted by document type and can be applied to a wide range of text documents, and therefore has broad application prospects.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for identifying synonyms in a dataset disclosed in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for identifying synonyms in a dataset disclosed in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention discloses a method for identifying synonyms in a dataset. Referring to FIG. 1, the method comprises:
Step S11: acquire a dataset comprising N documents, where N is a positive integer.
The dataset in this embodiment includes thematic and/or subject-specific documents of various kinds obtained over a network and/or by manual collection, such as scientific literature, patent documents, medical records and cases, and factual data.
It should also be noted that the more documents the dataset contains, the higher the final synonym identification accuracy.
Step S12: extract all keywords from each document.
The keywords of each document may be manually indexed keywords or keywords indexed automatically by a back-end system.
Step S13: determine all co-occurring words of each keyword, where the co-occurring words of a keyword are the keywords that appear together with it in the same document.
In this embodiment, if keyword A and keyword B appear in the same document, keyword A is called a co-occurring word of keyword B and, likewise, keyword B is called a co-occurring word of keyword A. Keyword A and keyword B thus form a co-occurring word pair.
In addition, to make the co-occurring words easier to manage, this embodiment may store all of the determined co-occurring words in a relational database, or store them in matrix form as a co-occurring word matrix.
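A minimal sketch of step S13, assuming each document has already been reduced to a list of keywords. The function name `build_cooccurring_words` and the sample documents are illustrative and do not come from the patent; a dictionary of sets stands in here for the relational database or matrix mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurring_words(documents):
    """Map each keyword to the set of keywords sharing at least one document with it."""
    cooccurring = defaultdict(set)
    for keywords in documents:
        unique = set(keywords)
        for keyword in unique:
            cooccurring[keyword]  # register keywords even if they have no partner
        # Every unordered pair inside one document is a co-occurring word pair.
        for a, b in combinations(sorted(unique), 2):
            cooccurring[a].add(b)
            cooccurring[b].add(a)
    return dict(cooccurring)


# Hypothetical two-document dataset.
documents = [
    ["5-fluorouracil", "colorectal cancer", "chemotherapy"],
    ["5-FU", "colorectal cancer", "chemotherapy"],
]
cooccurring = build_cooccurring_words(documents)
```

In this toy dataset "5-FU" and "5-fluorouracil" share the same co-occurring words but are never co-occurring words of each other, which is the pattern the later steps look for.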
Step S14: calculate the co-occurrence value between each keyword and each of its co-occurring words, where the co-occurrence value measures the probability that two keywords appear in the same document.
In this embodiment, the co-occurrence value measures the probability that any two keywords appear in the same document, that is, the probability that the two keywords form a co-occurring word pair.
If all co-occurring words have been saved as a co-occurring word matrix or in the relational database before step S14, each co-occurrence value calculated in step S14 may further be recorded at the corresponding position in that matrix or database, forming a co-occurrence network made up of the co-occurring words and their co-occurrence values.
Step S15: determine the high-value word group of each keyword, where the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value.
In this embodiment, the high-value word group of a keyword is the word group obtained by arranging all co-occurring words of that keyword in descending order of co-occurrence value.
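Step S15 can be sketched as a plain sort, assuming the co-occurrence values of one keyword are available as a dictionary. The function name, the optional `top_k` cut-off, the alphabetical tie-break, and the sample values are all assumptions made for illustration; the text itself only requires a descending sort over all co-occurring words.

```python
def high_value_word_group(values_for_keyword, top_k=None):
    """Return co-occurring words sorted by descending co-occurrence value.

    values_for_keyword maps each co-occurring word to its co-occurrence value.
    top_k is a hypothetical cut-off; None keeps every co-occurring word.
    """
    ranked = sorted(values_for_keyword.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, _ in ranked]
    return words if top_k is None else words[:top_k]


# Invented values for a single keyword.
values = {"chemotherapy": 0.42, "colorectal cancer": 0.61, "capecitabine": 0.18}
group = high_value_word_group(values)
```

Breaking ties alphabetically keeps the ordering deterministic, which makes two word groups easier to compare later.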
For example, Table 1 below records the high-value word group of the keyword "5-fluorouracil" in document A, and Table 2 below records the high-value word group of the keyword "5-FU" in document B.
Table 1 shows that the co-occurrence value between the keywords "5-fluorouracil" and "5-FU" is 0, which means that the keyword "5-FU" does not appear in document A. Likewise, Table 2 shows that the co-occurrence value between the keywords "5-FU" and "5-fluorouracil" is 0, which means that the keyword "5-fluorouracil" does not appear in document B.
Table 1

Table 2

Step S16: identify whether any two keywords in the dataset are synonyms to obtain a corresponding thesaurus, where, if the similarity between the high-value word group of a first keyword and the high-value word group of a second keyword is not less than a preset similarity threshold, and the co-occurrence value between the first keyword and the second keyword is 0, the first keyword and the second keyword are identified as synonyms.
The first keyword and the second keyword may be any keywords in the dataset. In this embodiment, when the similarity between the high-value word group of the first keyword and the high-value word group of the second keyword is not less than the preset similarity threshold, the two keywords are identified as synonyms if the co-occurrence value between them is 0.
Taking Table 1 and Table 2 as an example, the high-value word groups of the keyword "5-fluorouracil" in Table 1 and the keyword "5-FU" in Table 2 are very similar. In that case the co-occurrence value between "5-fluorouracil" and "5-FU" is examined next, and Table 1 and Table 2 show that it is 0. Because the author of a single document is unlikely to use two different word forms in succession to express the same meaning, this embodiment identifies the keywords "5-fluorouracil" and "5-FU" as synonyms.
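The rule in step S16 can be sketched as follows. The text does not say how the similarity between two high-value word groups is measured, so this sketch assumes a Jaccard overlap of the two groups treated as sets; that choice, the function names, and the sample word groups are illustrative assumptions, not the patent's method.

```python
def group_similarity(group_a, group_b):
    """Jaccard overlap of two word groups (an assumed similarity measure)."""
    set_a, set_b = set(group_a), set(group_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def is_synonym_basic(group_a, group_b, cooccurrence_value, similarity_threshold=0.8):
    """Step S16 rule: similar high-value word groups and a co-occurrence value of 0."""
    similar = group_similarity(group_a, group_b) >= similarity_threshold
    return similar and cooccurrence_value == 0


# Invented word groups for the two keywords in the example.
group_fluorouracil = ["colorectal cancer", "chemotherapy", "leucovorin", "oxaliplatin"]
group_5fu = ["chemotherapy", "colorectal cancer", "oxaliplatin", "leucovorin"]
```

A rank-aware measure (for example, comparing only the top few entries of each group) would be another reasonable reading of "similarity"; the set-based version is used here only because it is the simplest to state.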
It should also be noted that the thesaurus created in step S16 can be applied directly to the organization and use of data resources, as well as to fields such as artificial intelligence.
It can be seen that, before identifying whether any two keywords in the dataset are synonyms, this embodiment first calculates the co-occurrence value between each keyword and each of its co-occurring words and determines the high-value word group of each keyword. The co-occurrence value measures the probability that two keywords appear in the same document, and the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value. If the high-value word groups of two keywords are similar, the two keywords are likely to be synonyms. On that basis, a co-occurrence value of 0 between them means that the two keywords never appear in the same document, which is what one would expect of synonyms, because the author of a single document is unlikely to use two different word forms in succession to express the same meaning. Therefore, once a comparison of high-value word groups shows that two keywords are likely to be synonyms, and the two keywords are further found never to appear in the same document, they are identified as synonyms. This identification process neither compares the word forms themselves nor relies on a specific text structure, so it can substantially improve the accuracy of synonym identification. The method is also not restricted by document type and can be applied to a wide range of text documents, and therefore has broad application prospects.
An embodiment of the present invention discloses a more specific method for identifying synonyms in a dataset. Relative to the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically:
In step S14 of the previous embodiment, the co-occurrence value between each keyword and each of its co-occurring words is calculated. In this embodiment, the co-occurrence value between any two keywords is calculated as:
$E_{ij} = C_{ij}^{2} / (C_i \times C_j)$;
where $C_{ij}$ is the total number of documents in the dataset in which keyword $K_i$ and keyword $K_j$ both appear, $C_i$ is the total number of documents in the dataset in which keyword $K_i$ appears, $C_j$ is the total number of documents in the dataset in which keyword $K_j$ appears, and $E_{ij}$ is the co-occurrence value between keyword $K_i$ and keyword $K_j$.
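The formula translates directly into document counting. The sketch below assumes the dataset is a list of keyword lists; the names `cooccurrence_values` and `lookup` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_values(documents):
    """Compute E_ij = C_ij^2 / (C_i * C_j) for every pair sharing a document."""
    doc_count = Counter()   # C_i: number of documents containing keyword i
    pair_count = Counter()  # C_ij: number of documents containing both i and j
    for keywords in documents:
        unique = sorted(set(keywords))  # count each keyword once per document
        doc_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): c_ab ** 2 / (doc_count[a] * doc_count[b])
        for (a, b), c_ab in pair_count.items()
    }


def lookup(values, a, b):
    """Symmetric lookup; pairs that never share a document have value 0."""
    key = (a, b) if a <= b else (b, a)
    return values.get(key, 0.0)


# Hypothetical dataset: A appears in 3 documents, B in 2, C in 2.
documents = [["A", "B"], ["A", "B", "C"], ["A"], ["C"]]
values = cooccurrence_values(documents)
```

Only pairs that actually co-occur are stored, so a missing pair is read as a co-occurrence value of 0, which is the case the synonym rule tests for.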
In step S15 of the previous embodiment, the high-value word group of each keyword is determined. In this embodiment, after the high-value word group of each keyword has been determined, the method may further comprise calculating the co-occurrence index of each keyword, where the co-occurrence index of a keyword is the average of the co-occurrence values between that keyword and all co-occurring words in its high-value word group.
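The co-occurrence index is a simple mean, sketched below. The function name and sample values are illustrative, and returning 0.0 for a keyword with no co-occurring words is an assumption, since the text does not cover that case.

```python
def cooccurrence_index(values_for_keyword):
    """Average co-occurrence value between a keyword and the words in its high-value group.

    values_for_keyword maps each co-occurring word to its co-occurrence value.
    An empty group yields 0.0 (an assumed convention).
    """
    if not values_for_keyword:
        return 0.0
    return sum(values_for_keyword.values()) / len(values_for_keyword)


# Invented values for one keyword.
index = cooccurrence_index({"x": 0.6, "y": 0.4, "z": 0.2})
```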
In addition, step S16 of the previous embodiment discloses the following technical solution: when the similarity between the high-value word group of the first keyword and the high-value word group of the second keyword is not less than the preset similarity threshold, and the co-occurrence value between the first keyword and the second keyword is 0, the first keyword and the second keyword are identified as synonyms.
However, it cannot be completely ruled out that the author of a single document uses two different keywords to express the same meaning. To further improve the accuracy of synonym identification, in this embodiment, when the similarity between the high-value word group of the first keyword and the high-value word group of the second keyword is not less than the preset similarity threshold, the method further comprises the following steps S17 to S19:
Step S17: if the co-occurrence value between the first keyword and the second keyword is not 0, calculate the average of the co-occurrence index of the first keyword and the co-occurrence index of the second keyword to obtain an average co-occurrence index;
Step S18: calculate the ratio of the average co-occurrence index to the current co-occurrence value, where the current co-occurrence value is the co-occurrence value between the first keyword and the second keyword;
Step S19: determine whether the ratio is not less than a preset ratio threshold; if so, identify the first keyword and the second keyword as synonyms, and if not, identify them as non-synonyms.
In this embodiment, the preset similarity threshold is preferably set to 80% and the preset ratio threshold is preferably set to 10.
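Steps S16 to S19 combine into one decision, sketched below. The function name and argument layout are assumptions; the default thresholds are the preferred values named in the text, and the similarity score is taken as an input because the text does not fix how it is computed.

```python
def is_synonym(similarity, value_ab, index_a, index_b,
               similarity_threshold=0.8, ratio_threshold=10.0):
    """Decide whether two keywords are synonyms.

    similarity : similarity between the two high-value word groups (0 to 1)
    value_ab   : co-occurrence value between the two keywords
    index_a/b  : co-occurrence index of each keyword
    """
    if similarity < similarity_threshold:
        return False
    if value_ab == 0:
        return True                           # S16: never in the same document
    average_index = (index_a + index_b) / 2   # S17
    ratio = average_index / value_ab          # S18
    return ratio >= ratio_threshold           # S19
```

For instance, with invented inputs, `is_synonym(0.9, 0.02, 0.3, 0.5)` gives a ratio of about 20 and accepts the pair, while `is_synonym(0.9, 0.2, 0.3, 0.5)` gives a ratio of 2 and rejects it. The intuition is that two keywords with similar contexts still count as synonyms if they co-occur far less often than each does with its usual companions.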
Correspondingly, an embodiment of the present invention also discloses a system for identifying synonyms in a dataset. Referring to FIG. 2, the system comprises:
a dataset acquisition module 21, configured to acquire a dataset comprising N documents, where N is a positive integer;
a keyword extraction module 22, configured to extract all keywords from each document;
a co-occurring word determination module 23, configured to determine all co-occurring words of each keyword, where the co-occurring words of a keyword are the keywords that appear together with it in the same document;
a co-occurrence value calculation module 24, configured to calculate the co-occurrence value between each keyword and each of its co-occurring words, where the co-occurrence value measures the probability that two keywords appear in the same document;
a high-value word group determination module 25, configured to determine the high-value word group of each keyword, where the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value;
a synonym identification module 26, configured to identify whether any two keywords in the dataset are synonyms to obtain a corresponding thesaurus, where, if the similarity between the high-value word group of a first keyword and the high-value word group of a second keyword is not less than a preset similarity threshold, and the co-occurrence value between the first keyword and the second keyword is 0, the first keyword and the second keyword are identified as synonyms.
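The module chain 21 to 26 can be sketched end to end in a single function. This is a rough illustration under stated assumptions, not the patented system: documents are assumed to arrive as pre-extracted keyword lists, group similarity is assumed to be a Jaccard overlap, only the zero-co-occurrence rule is applied, and the function name and sample data are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def identify_synonyms(documents, similarity_threshold=0.8):
    # Modules 21-22: the dataset is taken as one keyword list per document.
    docs = [set(keywords) for keywords in documents]

    # Modules 23-24: count documents per keyword and per pair, then score pairs.
    doc_count, pair_count = Counter(), Counter()
    for doc in docs:
        doc_count.update(doc)
        pair_count.update(combinations(sorted(doc), 2))
    values = defaultdict(dict)
    for (a, b), c_ab in pair_count.items():
        score = c_ab ** 2 / (doc_count[a] * doc_count[b])
        values[a][b] = score
        values[b][a] = score

    # Module 25: co-occurring words in descending order of co-occurrence value.
    groups = {
        kw: sorted(values[kw], key=lambda w: (-values[kw][w], w))
        for kw in doc_count
    }

    # Module 26: similar word groups plus a co-occurrence value of 0.
    synonyms = []
    for a, b in combinations(sorted(doc_count), 2):
        if b in values[a]:
            continue  # the pair shares a document, so its value is not 0
        set_a, set_b = set(groups[a]), set(groups[b])
        union = set_a | set_b
        if union and len(set_a & set_b) / len(union) >= similarity_threshold:
            synonyms.append((a, b))
    return synonyms


# Hypothetical three-document dataset.
documents = [
    ["5-fluorouracil", "colorectal cancer", "chemotherapy"],
    ["5-FU", "colorectal cancer", "chemotherapy"],
    ["aspirin", "headache"],
]
thesaurus = identify_synonyms(documents)
```

Comparing every keyword pair is quadratic in the vocabulary size; a production system would likely restrict candidates to pairs that share at least one co-occurring word.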
As can be seen, before identifying whether any two keywords in the data set are synonyms, this embodiment first calculates the co-occurrence value between each keyword and each of its co-occurring words and determines the high-value word group of each keyword. The co-occurrence value measures the probability that two keywords appear in the same document, and the high-value word group of a keyword is the word group obtained by sorting all of its co-occurring words in descending order of co-occurrence value. If the high-value word groups of two keywords are similar, the two keywords are likely to be synonyms. If, in addition, the co-occurrence value between them is 0, that is, the two keywords never appear in the same document, this further supports the conclusion, because the author of a single document will in most cases not use two different word forms in succession to express the same meaning. Therefore, once a comparison of high-value word groups indicates that two keywords are probably synonyms, and it is further found that they do not appear together in any document, the two keywords are identified as synonyms. This identification process does not compare the word forms themselves and does not depend on a particular text structure, which improves the accuracy of synonym identification. The method is also not limited to a particular type of document, so it can be applied to a wide range of text data and has broad application prospects.
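The basic identification rule described above can be sketched as follows. The sketch makes several assumptions that this passage does not fix: co-occurrence values are supplied as a precomputed nested dictionary `values`, each high-value word group is truncated to its top `top_n` entries, and group similarity is measured as Jaccard overlap. All function names are hypothetical, and the default threshold of 0.8 mirrors the preferred 80% given in the description.

```python
def high_value_group(values, keyword, top_n=5):
    """Co-occurring words of `keyword`, sorted by descending co-occurrence value.

    `values[keyword]` maps each co-occurring word to its co-occurrence value.
    Truncating to `top_n` is an assumption; the text only specifies the ordering.
    """
    ranked = sorted(values.get(keyword, {}).items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def group_similarity(group_a, group_b):
    """Jaccard overlap of two word groups (an assumed similarity measure)."""
    a, b = set(group_a), set(group_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def is_synonym_pair(values, first, second, threshold=0.8):
    """Basic rule: similar high-value word groups and a co-occurrence value of 0."""
    similarity = group_similarity(
        high_value_group(values, first), high_value_group(values, second)
    )
    never_together = values.get(first, {}).get(second, 0.0) == 0
    return similarity >= threshold and never_together


# Hypothetical co-occurrence values for a toy vocabulary.
values = {
    "tumor": {"therapy": 0.9, "dose": 0.7, "imaging": 0.5},
    "neoplasm": {"therapy": 0.8, "dose": 0.6, "imaging": 0.4},
    "therapy": {"tumor": 0.9, "neoplasm": 0.8, "dose": 0.5},
}
print(is_synonym_pair(values, "tumor", "neoplasm"))
print(is_synonym_pair(values, "tumor", "therapy"))
```

Here "tumor" and "neoplasm" share the same context words and never co-occur, so the rule accepts them; "tumor" and "therapy" co-occur and have different context words, so the rule rejects them.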
Further, when calculating the co-occurrence value between any two keywords, the co-occurrence value calculation module uses the formula:
$E_{ij} = C_{ij}^{2} / (C_i \times C_j)$;
where $C_{ij}$ is the total number of documents in the data set in which keyword $K_i$ and keyword $K_j$ both appear, $C_i$ is the total number of documents in the data set in which keyword $K_i$ appears, $C_j$ is the total number of documents in the data set in which keyword $K_j$ appears, and $E_{ij}$ is the co-occurrence value between keyword $K_i$ and keyword $K_j$.
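A direct transcription of this formula into Python might look like the sketch below. It assumes each document is represented as a set of keywords; the function name `co_occurrence_value` and the handling of keywords that are absent from the data set are illustrative choices, not part of the patent text.

```python
def co_occurrence_value(docs, ki, kj):
    """E_ij = C_ij^2 / (C_i * C_j), computed over a list of keyword sets."""
    c_i = sum(1 for d in docs if ki in d)               # documents containing K_i
    c_j = sum(1 for d in docs if kj in d)               # documents containing K_j
    c_ij = sum(1 for d in docs if ki in d and kj in d)  # documents containing both
    if c_i == 0 or c_j == 0:
        return 0.0  # assumption: a keyword absent from the data set co-occurs with nothing
    return c_ij ** 2 / (c_i * c_j)


# Toy data set: "a" and "b" each occur in 3 documents and share 2 of them.
docs = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}]
print(co_occurrence_value(docs, "a", "b"))  # 2^2 / (3 * 3)
print(co_occurrence_value(docs, "a", "c"))  # the two never share a document
```

Because $C_{ij}$ can exceed neither $C_i$ nor $C_j$, the value always lies between 0 and 1: it is 0 when the two keywords never share a document and 1 when they always appear together.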
In addition, the synonym identification system of this embodiment may further include:
a co-occurrence index calculation module, configured to calculate the co-occurrence index of each keyword after the high-value word group determination module has determined the high-value word group of each keyword, where the co-occurrence index of a keyword is the average of the co-occurrence values between the keyword and all co-occurring words in its high-value word group.
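The co-occurrence index is a plain average, as the following sketch shows. It assumes the co-occurrence values are available as a nested dictionary and that the high-value word group is the top `top_n` co-occurring words; both the data layout and the `top_n` cut-off are assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import mean


def co_occurrence_index(values, keyword, top_n=5):
    """Average co-occurrence value between `keyword` and its high-value word group.

    `values[keyword]` maps each co-occurring word to its co-occurrence value.
    Taking only the `top_n` largest values is an assumption.
    """
    top_values = sorted(values.get(keyword, {}).values(), reverse=True)[:top_n]
    return mean(top_values) if top_values else 0.0


# Hypothetical co-occurrence values for one keyword.
values = {"tumor": {"therapy": 0.75, "dose": 0.5, "imaging": 0.25}}
print(co_occurrence_index(values, "tumor"))           # (0.75 + 0.5 + 0.25) / 3 = 0.5
print(co_occurrence_index(values, "tumor", top_n=2))  # (0.75 + 0.5) / 2 = 0.625
```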
Further, the synonym identification module may include an average value calculation unit, a ratio calculation unit and a ratio judgment unit, where:
the average value calculation unit is configured to calculate, when the similarity between the high-value word group of the first keyword and that of the second keyword is not less than the preset similarity threshold and the co-occurrence value between the first keyword and the second keyword is not 0, the average of the co-occurrence index of the first keyword and the co-occurrence index of the second keyword, thereby obtaining the average co-occurrence index;
the ratio calculation unit is configured to calculate the ratio of the average co-occurrence index to the current co-occurrence value, where the current co-occurrence value is the co-occurrence value between the first keyword and the second keyword;
the ratio judgment unit is configured to judge whether the ratio is not less than a preset ratio threshold; if so, the first keyword and the second keyword are identified as synonyms; if not, they are identified as non-synonyms.
Preferably, the preset similarity threshold is 80% and the preset ratio threshold is 10.
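The three units above amount to a short fallback test for keyword pairs whose high-value word groups are similar but whose co-occurrence value is not 0. The sketch below assumes the two co-occurrence indices and the current co-occurrence value have already been computed; the function name is hypothetical, and the default threshold of 10 is the preferred value stated above.

```python
def passes_ratio_test(index_first, index_second, current_value, ratio_threshold=10.0):
    """Return True if the pair should be identified as synonyms by the ratio test.

    index_first, index_second: co-occurrence indices of the two keywords.
    current_value: co-occurrence value between the two keywords (must be non-zero).
    """
    if current_value <= 0:
        raise ValueError("the ratio test applies only when the co-occurrence value is non-zero")
    average_index = (index_first + index_second) / 2  # average value calculation unit
    ratio = average_index / current_value             # ratio calculation unit
    return ratio >= ratio_threshold                   # ratio judgment unit


# Hypothetical numbers: both keywords are strongly tied to their context words.
print(passes_ratio_test(0.6, 0.4, 0.04))  # ratio 12.5, accepted as synonyms
print(passes_ratio_test(0.6, 0.4, 0.1))   # ratio 5.0, rejected
```

One way to read the test: a pair that does occasionally share a document is still accepted if the two keywords appear together far less often than each appears with its own context words.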
Finally, it should be noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes that element.
The method and system for identifying synonyms in a data set provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201610772919.1A CN106339369B (en) | 2016-08-30 | 2016-08-30 | A method and system for identifying synonyms in a dataset | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201610772919.1A CN106339369B (en) | 2016-08-30 | 2016-08-30 | A method and system for identifying synonyms in a dataset | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN106339369A CN106339369A (en) | 2017-01-18 | 
| CN106339369B CN106339369B (en) | 2019-06-04 | 
Family
ID=57822802
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201610772919.1A Expired - Fee Related CN106339369B (en) | 2016-08-30 | 2016-08-30 | A method and system for identifying synonyms in a dataset | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN106339369B (en) | 
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN111414750B (en) * | 2020-03-18 | 2023-08-18 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for judging synonyms of a term | 
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN1223410A (en) * | 1998-01-13 | 1999-07-21 | 富士通株式会社 | Related term extraction apparatus, related term extraction method, and computer-readable recording medium having related term extration program recorded thereon | 
| US8392413B1 (en) * | 2007-02-07 | 2013-03-05 | Google Inc. | Document-based synonym generation | 
| WO2014002775A1 (en) * | 2012-06-25 | 2014-01-03 | 日本電気株式会社 | Synonym extraction system, method and recording medium | 
| JP2014132406A (en) * | 2013-01-07 | 2014-07-17 | Nec Corp | Synonym extraction system, method and program | 
| JP5754019B2 (en) * | 2011-07-11 | 2015-07-22 | 日本電気株式会社 | Synonym extraction system, method and program | 
Non-Patent Citations (3)
| Title | 
|---|
| Using cooccurrence statistics and the web to discover synonyms in a technical language; Marco Baroni, Sabrina Bisi; 《Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04)》; 2004-05-31; Section 3 * | 
| 共词分析法研究(一)——共词分析的过程与方式;钟伟金,李佳;《情报杂志》;20080531(第5期);第2.4节 * | 
| 基于共现"互斥互信"原理的同义词识别;钟伟金;《中华医学图书情报杂志》;20120531;第21卷(第5期);第1-2节图1 * | 
Also Published As
| Publication number | Publication date | 
|---|---|
| CN106339369A (en) | 2017-01-18 | 
Similar Documents
| Publication | Title | Publication Date | 
|---|---|---|
| CN104615593B (en) | Hot microblog topic automatic testing method and device | |
| CN104252445B (en) | Approximate repetitive file detection method and device | |
| US8315997B1 (en) | Automatic identification of document versions | |
| JP6231668B2 (en) | Keyword expansion method and system and classification corpus annotation method and system | |
| CN107818815B (en) | Electronic medical record retrieval method and system | |
| WO2023071118A1 (en) | Method and system for calculating text similarity, device, and storage medium | |
| CN103761264B (en) | Concept hierarchy establishing method based on product review document set | |
| CN103902619B (en) | A kind of network public-opinion monitoring method and system | |
| CN103186556B (en) | Obtain the method with searching structure semantic knowledge and corresponding intrument | |
| CN102760142A (en) | Method and device for extracting subject label in search result aiming at searching query | |
| CN108509490B (en) | A method and system for discovering hot topics on the Internet | |
| CN110489548A (en) | A kind of Chinese microblog topic detecting method and system based on semanteme, time and social networks | |
| CN107526792A (en) | A kind of Chinese question sentence keyword rapid extracting method | |
| CN105868347A (en) | Tautonym disambiguation method based on multistep clustering | |
| CN110162632A (en) | A method for discovering special news events | |
| CN105740232A (en) | Method and device for automatically extracting feedback hotspots | |
| CN112328735A (en) | Hot topic determination method and device and terminal equipment | |
| CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method | |
| CN104346411B (en) | The method and apparatus that multiple contributions are clustered | |
| CN106339369B (en) | A method and system for identifying synonyms in a dataset | |
| CN106408316A (en) | Method and device for identifying customers | |
| TWI807661B (en) | Method and device for identifying industry proper nouns from text | |
| KR101351555B1 (en) | classification-extraction system based meaning for text-mining of large data. | |
| CN106202405B (en) | A kind of compactedness Text Extraction based on text similarity relation | |
| CN112115237B (en) | Construction method and device of tobacco science and technology literature data recommendation model | 
Legal Events
| Code | Title | Description | 
|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2019-06-04; Termination date: 2020-08-30 |