CN104008119B

CN104008119B - A kind of one-to-many mixed characters string fusion comparison method

Info

Publication number: CN104008119B
Application number: CN201310746846.5A
Authority: CN
Inventors: 童晓阳; 甄威; 郑永康; 姜振超; 庄先涛; 吴继维; 张茜; 丁宣文
Original assignee: Southwest Jiaotong University; Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Current assignee: Southwest Jiaotong University; Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date: 2013-12-30
Filing date: 2013-12-30
Publication date: 2017-09-26
Anticipated expiration: 2033-12-30
Also published as: CN104008119A

Abstract

The invention discloses a one-to-many mixed character string fusion comparison method, which uses a source character string to find the most similar or matching target character string from a group of character strings to be compared. First use the improved GST* algorithm, and then use a partially ordered string comparison algorithm POC. Combining the characteristics of the above two algorithms in terms of string disorder and partially ordered matching, the matching degree values calculated by the two algorithms are weighted and fused to obtain the final matching degree. In addition, considering that synonymous strings have different expressions in different occasions, a string equivalent replacement strategy is adopted to equivalently replace synonymous character substrings in the source string and the string to be compared with the same string, Greatly improve the matching of two strings. By matching the source string with a group of strings to be matched respectively, and then sorting the matching degrees, and taking the string with the highest matching degree as the target string, the better matching of one-to-many mixed strings is realized.

Description

A One-to-Many Mixed String Fusion Comparison Method

技术领域technical field

本发明属于字符串智能比对技术领域，具体涉及一种新型一对多的混合字符串融合比对方法。The invention belongs to the technical field of character string intelligent comparison, and in particular relates to a novel one-to-many mixed character string fusion comparison method.

背景技术Background technique

字符串比对问题是计算机科学中的一个基本问题，其研究内容在信息检索、模式识别等众多领域均有着重要的应用价值^[1]-[4]。String comparison is a basic problem in computer science, and its research content has important application value in many fields such as information retrieval and pattern recognition ^[1]-[4] .

文献1分别研究中文字符串模糊匹配算法，文献2研究了一种基于汉字聚类特征的中文字符串相似度计算方法。文献3对LCS与GST算法做了比较，GST算法是一种贪婪字符串比对算法，也是一种无序匹配算法，目前应用较广，但该算法采用了两个字符串逐个字符比较的方法，所以算法的时间复杂度较大。文献4研究了对GST算法改进后RKR-GST算法，提高了GST算法的运行效率，但是RKR-GST算法中散列函数的选择对算法的运行影响很大。Document 1 studies the fuzzy matching algorithm of Chinese character strings respectively, and Document 2 studies a Chinese character string similarity calculation method based on Chinese character clustering features. Document 3 compares the LCS and GST algorithms. The GST algorithm is a greedy string comparison algorithm and an unordered matching algorithm. It is widely used at present, but the algorithm uses the method of character-by-character comparison of two strings. , so the time complexity of the algorithm is large. Document 4 studies the RKR-GST algorithm after the improvement of the GST algorithm, which improves the operating efficiency of the GST algorithm, but the choice of the hash function in the RKR-GST algorithm has a great impact on the operation of the algorithm.

现有的字符串比对方法往往只采用一种算法，没有能够充分利用无序字符子串和部分有序字符子串在匹配度计算时的各自特点，往往它们的比对效果并不理想。在一些某些混合字符串的实际应用中，不但要求比对的准确性高，而且要求比对的速度快。目前，通过单一的匹配度计算方法，往往很难精确地表达字符串的相似程度。Existing character string comparison methods often only use one algorithm, which fails to make full use of the respective characteristics of unordered character substrings and partially ordered character substrings in the calculation of matching degrees, and their comparison results are often not ideal. In some practical applications of some mixed character strings, not only high accuracy of comparison is required, but also high speed of comparison is required. At present, it is often difficult to accurately express the similarity of character strings through a single matching calculation method.

另外，现有的字符串比对方法没有考虑同义字符串可能存在不同表达方式的情况，使得现有字符串比对方法在此类情况下很难达到较准确、高匹配率的要求。In addition, the existing string comparison methods do not consider the situation that synonymous strings may have different expressions, which makes it difficult for the existing string comparison methods to meet the requirements of more accurate and high matching rates in such cases.

参考文献：references:

[1]陈开渠，赵洁，彭志威.快速中文字符串模糊匹配算法[J].中文信息学报，2003，18(2)：58-65[1] Chen Kaiqu, Zhao Jie, Peng Zhiwei. Fast Chinese String Fuzzy Matching Algorithm [J]. Chinese Journal of Information, 2003, 18(2): 58-65

[2]王静婷.基于汉字聚类特征的中文字符串相似度计算研究[J].现代图书情报技术，2011，20(2)：48-53[2] Wang Jingting. Research on Similarity Calculation of Chinese Strings Based on Chinese Character Clustering Features [J]. Modern Library and Information Technology, 2011, 20(2): 48-53

[3]于海英.字符串相似度度量中LCS和GST算法比较[J].电子科技，2011，24(3)：101-103[3] Yu Haiying. Comparison of LCS and GST Algorithms in String Similarity Measurement [J]. Electronic Science and Technology, 2011, 24(3): 101-103

[4]牛永洁.RKR_GST算法在_NET中的分析与实现[J].信息技术，2012，3：171-174[4] Niu Yongjie. Analysis and Implementation of RKR_GST Algorithm in _NET [J]. Information Technology, 2012, 3: 171-174

发明内容Contents of the invention

鉴于现有技术的以上不足，本发明的目的是提供一种更准确的混合字符串融合比对方法。解决了实际应用中以单一匹配度计算方法很难达到精确表达字符串之间相似程度、同义字符串存在不同表达方式情形下现有字符串比对方法几乎失效等问题。In view of the above deficiencies in the prior art, the purpose of the present invention is to provide a more accurate method for fusion and comparison of mixed character strings. It solves the problems that it is difficult to accurately express the similarity between strings with a single matching degree calculation method in practical applications, and the existing string comparison methods are almost invalid when synonymous strings have different expressions.

本发明的目的是通过以下的手段实现的：The purpose of the present invention is achieved by the following means:

一种一对多的混合字符串融合比对方法，对基于汉字聚类特征的由汉字、数字、英文字母组成的混合字符串的相似度进行融合比对，以提高表达字符串的相似的精确度，包括以下主要步骤：A one-to-many mixed string fusion comparison method, which performs fusion comparison on the similarity of mixed strings composed of Chinese characters, numbers, and English letters based on the clustering characteristics of Chinese characters, so as to improve the accuracy of expressing the similarity of strings degree, including the following main steps:

1）取出源字符串和一组待匹配字符串；1) Take out the source string and a set of strings to be matched;

2）读出事先在存储器中构建的字符串等价替换字典，对该组待匹配字符串中部分字符（子串）进行等价替换；利用等价替换字典，将上述在源字符串场合和待匹配字符串场合具有不同描述但含义相同的两种子串进行统一；2) Read out the string equivalent replacement dictionary built in memory in advance, and perform equivalent replacement of some characters (substrings) in the group of strings to be matched; Unify the two substrings with different descriptions but the same meaning in the string to be matched;

3）取出源字符串，依此取出等价替换后的该待匹配字符串数组中的一个待匹配字符串；3) Take out the source string, and then take out a string to be matched in the array of strings to be matched after equivalent replacement;

4）利用GST*算法计算源字符串与该待匹配字符串的匹配度a：4) Use the GST* algorithm to calculate the matching degree a between the source string and the string to be matched:

采用传统GST算法，得到两个字符串中各公共子串，将它们存入公共子串链表中。如果某个公共子串的字符长度与较长字符串字符长度的比值大于或等于0.33，则在计算匹配度时将该公共子串的字符个数乘以权重,该权重为大于1的常数；如果某个公共子串的字符长度与较长字符串字符长度的比值小于0.33、且公共子串的字符个数大于最小匹配长度，则计算匹配度时将该公共子串的字符个数直接带入计算；The traditional GST algorithm is used to obtain the common substrings in the two strings, and store them in the public substring linked list. If the ratio of the character length of a certain common substring to the character length of a longer character string is greater than or equal to 0.33, then when calculating the matching degree, the number of characters of the common substring is multiplied by a weight, and the weight is a constant greater than 1; If the ratio of the character length of a certain common substring to the character length of the longer string is less than 0.33, and the number of characters in the common substring is greater than the minimum matching length, then the number of characters in the common substring is directly taken into account when calculating the matching degree. into the calculation;

5）利用偏有序字符串匹配算法POC（Partial Order Comparison,POC）计算源字符串和待匹配字符串的匹配度b：5) Use the partial order string matching algorithm POC (Partial Order Comparison, POC) to calculate the matching degree b between the source string and the string to be matched:

将两个待匹配的含有汉字、数字和英文字母的混合字符串分别称为源字符串和待匹配字符串,The two mixed strings containing Chinese characters, numbers and English letters to be matched are respectively called the source string and the string to be matched,

首先，先搜索出源字符串与待匹配字符串中相同的字符或汉字，记录下它们的个数，;First, search for the same characters or Chinese characters in the source string and the string to be matched, and record their numbers;

其次，以源字符串和待匹配字符串中较长字符串为标准，求匹配度1(match_degree1)：Secondly, use the longer string in the source string and the string to be matched as the standard to find the matching degree 1 (match_degree1):

以其中较短字符串为标准，求匹配度2(match_degree2)：Using the shorter string as the standard, find the matching degree 2 (match_degree2):

公式（1）、（2）中[]表示取整;[ ] in formulas (1) and (2) means rounding;

再次，分别比较源字符串和待匹配字符串中第1个或第2个数字与字母，最后1个或倒数第2个数字与字母，如果其中1项相等，则调整匹配度2的match_degree2数值为match_degree2+1:Again, compare the first or second number and letter in the source string and the string to be matched, and the last or second-to-last number and letter, and if one of them is equal, adjust the match_degree2 value of matching degree 2 For match_degree2+1:

对匹配度1和匹配度2赋予不同的权重0.41、0.59，求源字符串和待匹配字符串的最终匹配值b：Assign different weights 0.41 and 0.59 to matching degree 1 and matching degree 2, and find the final matching value b of the source string and the string to be matched:

b＝match_degree1×0.41+match_degree2×0.59 （3）b=match_degree1×0.41+match_degree2×0.59 (3)

6）将步骤4）GST*计算所得的匹配度a和步骤5）POC计算所得的匹配度b进行加权融合，融合方法是，如果匹配度a大于匹配度b，则最终匹配度为a;如果匹配度a小于匹配度b，则最终匹配度等于(a+b)/2;6) Perform weighted fusion of the matching degree a obtained by step 4) GST* calculation and the matching degree b obtained by step 5) POC calculation. The fusion method is that if the matching degree a is greater than the matching degree b, the final matching degree is a; if Matching degree a is less than matching degree b, then the final matching degree is equal to (a+b)/2;

7）将源字符串和待匹配字符串数组中每个待匹配字符串计算所获得匹配度进行排序，把最大匹配度对应的待匹配字符串，作为与源字符串最匹配的目标字符串。7) Sort the matching degree calculated by the source string and each to-be-matched string in the to-be-matched string array, and use the to-be-matched string corresponding to the maximum matching degree as the target string that best matches the source string.

在步骤4）中，先搜索出源字符串与待匹配字符串相同的各公共子串，再对不同长度的公共子串赋予不同的权重，增大了较长公共字符子串的权重。In step 4), the common substrings that are the same as the source string and the string to be matched are first searched out, and then different weights are assigned to the common substrings of different lengths, which increases the weight of longer common character substrings.

本发明的GST*算法，针对传统GST算法存在的较短公共子串匹配度可能比较长公共子串的匹配度更大的现象，对其进行了改进：如果公共子串的字符长度与较长字符串字符长度的比值大于或等于0.33，则在计算匹配度时将该公共子串的字符个数乘以权重（大于1的常数）；如果公共子串的字符长度与较长字符串字符长度的比值小于0.33、且公共子串的字符个数大于最小匹配长度，则计算匹配度时将该公共子串的字符个数直接带入计算。The GST* algorithm of the present invention is aimed at the phenomenon that the matching degree of the shorter common substring that exists in the traditional GST algorithm may be larger than the matching degree of the long common substring, and it is improved: if the character length of the common substring is the same as the longer If the ratio of the character length of the string is greater than or equal to 0.33, multiply the number of characters of the common substring by the weight (a constant greater than 1) when calculating the matching degree; if the character length of the common substring is greater than the character length of the longer string The ratio of is less than 0.33, and the number of characters in the common substring is greater than the minimum matching length, then the number of characters in the common substring is directly brought into the calculation when calculating the matching degree.

在步骤5）中，将两个含有数字、字母、汉字的混合字符串分别作为源字符串和待匹配字符串；分别以其中较长字符串、较短字符串为标准，求出匹配度1和匹配度2；然后再比较第一个或多个数字与字母、最后一个或多个数字与字母是否相等，对匹配度2进行修改。最后对两种匹配度分别赋予不同权重，得到两个字符串之间的匹配度值。In step 5), two mixed strings containing numbers, letters, and Chinese characters are used as the source string and the string to be matched respectively; the longer string and the shorter string are used as the standard to find the matching degree 1 and matching degree 2; then compare whether the first or more numbers and letters, and the last one or more numbers and letters are equal, and modify the matching degree 2. Finally, different weights are assigned to the two matching degrees to obtain the matching degree value between the two strings.

本发明的偏有序字符串比对算法POC考虑到匹配度2更能反映实际匹配情况，因此赋予匹配度2稍微较大的权重。The partial ordered character string comparison algorithm POC of the present invention considers that the matching degree 2 can better reflect the actual matching situation, so the matching degree 2 is given a slightly larger weight.

本发明给出了字符串等价替换策略。比如，“高压侧”与“220KV侧”，“千伏”与“kV”，在含义上是等价的。采用现有各类比对算法均不能准确地反映它们之间的等价关系，因此提出字符串等价替换策略。事先构建一个字符子串等价替换字典，采用：待匹配中子串=等价的源侧子串的形式，比如千伏=kV，它表示等号两侧的字符子串在含义上是相同的，等号左侧子串代表待匹配字符串中的某子串，等号右侧子串代表与左侧等价的源字符串中子串。The invention provides a character string equivalent replacement strategy. For example, "high voltage side" and "220KV side", "kilovolt" and "kV" are equivalent in meaning. The existing comparison algorithms cannot accurately reflect the equivalence relationship between them, so a string equivalent replacement strategy is proposed. Build a character substring equivalent replacement dictionary in advance, using the form of: substring to be matched = equivalent source side substring, such as kilovolts = kV, which means that the character substrings on both sides of the equal sign are the same in meaning , the substring on the left side of the equal sign represents a substring in the string to be matched, and the substring on the right side of the equal sign represents a substring in the source string equivalent to the left side.

在做匹配度计算之前，先检查待匹配字符串中是否含有字符子串等价替换字典中各行中左侧的字符子串，如果有，则替换它为等号右侧的源侧字符子串。在此基础上，再运用本融合比对算法进行比对，计算得到相应的匹配度，这样大大提高了匹配的精确度，能够反映出参与比较两个字符串之间真实的匹配情况。Before doing the matching degree calculation, first check whether the string to be matched contains a character substring equivalent to the character substring on the left side of each line in the replacement dictionary, and if so, replace it with the source side character substring on the right side of the equal sign . On this basis, the fusion comparison algorithm is used to compare and calculate the corresponding matching degree, which greatly improves the matching accuracy and can reflect the real matching situation between the two character strings involved in the comparison.

本发明适用于一对多混合字符串的比对。分别计算源字符串与一组待匹配字符串的匹配度，并将得到的各匹配度进行排序，从中找出与源字符串匹配度最大的待匹配字符串，把它确定为目标字符串，从而实现了一对多字符串的较佳匹配。The invention is suitable for comparison of one-to-many mixed character strings. Calculate the matching degree of the source string and a group of strings to be matched respectively, and sort the obtained matching degrees, find out the string to be matched with the highest matching degree with the source string, and determine it as the target string, This enables better matching of one-to-many strings.

附图说明：Description of drawings:

图1为新型一对多字符串的融合比对方法的流程图。FIG. 1 is a flow chart of a novel one-to-many character string fusion comparison method.

图2为一对多混合字符串的融合比对方法的应用实例。Fig. 2 is an application example of the fusion comparison method of one-to-many mixed character strings.

具体实施方式detailed description

下面结合附图对本发明的方法作进一步的详述Below in conjunction with accompanying drawing, method of the present invention is described in further detail

下面结合附图对本发明做进一步地详细说明。本发明具体涉及一种混合字符串的融合比对方法。首先将待匹配的字符串分别称为源字符串和待匹配字符串。本发明更能适合于从一组待匹配字符串中寻找与源字符串最匹配的目标匹配字符串。The present invention will be described in further detail below in conjunction with the accompanying drawings. The invention specifically relates to a fusion comparison method of mixed character strings. Firstly, the character string to be matched is referred to as the source character string and the character string to be matched respectively. The present invention is more suitable for finding the target matching character string that best matches the source character string from a group of character strings to be matched.

具体实施方式如下。The specific implementation method is as follows.

1.取出源字符串和一组待匹配字符串；1. Take out the source string and a set of strings to be matched;

2.读出事先构建的字符串等价替换字典，对该组待匹配字符串中部分字符进行等价替换。例如“高压侧”与“220KV侧”等价，“千伏”与“kV”等价。在进行字符串匹配度计算之前，利用等价替换字典可将上述不同描述进行统一；2. Read out the string equivalent replacement dictionary built in advance, and perform equivalent replacement for some characters in the group of strings to be matched. For example, "high voltage side" is equivalent to "220KV side", and "kV" is equivalent to "kV". Before calculating the string matching degree, the above different descriptions can be unified by using the equivalent replacement dictionary;

3.取出源字符串，依此取出等价替换后的该组待匹配字符串数组中的一个待匹配字符串；3. Take out the source string, and accordingly take out a string to be matched in the array of strings to be matched after equivalent replacement;

4.利用GST*算法计算源字符串与待匹配字符串的匹配度。4. Use the GST* algorithm to calculate the matching degree between the source string and the string to be matched.

GST*算法与传统的GST算法的改进效果，通过以下举例说明。The improvement effect of the GST* algorithm and the traditional GST algorithm is illustrated by the following examples.

例如“abcde”与“qbcio”、“abcde”与“qbico”为两组待比较字符串，利用GST算法计算两组字符串匹配度均为40%。For example, "abcde" and "qbcio", "abcde" and "qbico" are two groups of character strings to be compared, and the GST algorithm is used to calculate the matching degree of both groups of character strings to be 40%.

而采用GST*算法计算两组字符串匹配度，结果分别为43.2%和40%。可看出GST*算法的比对结果更准确一些。The GST* algorithm is used to calculate the matching degree of the two groups of strings, and the results are 43.2% and 40% respectively. It can be seen that the comparison result of the GST* algorithm is more accurate.

GST*算法使具有较长公共子串的两字符串的匹配度更高。The GST* algorithm makes the matching degree of two strings with longer common substrings higher.

5.利用偏有序字符串匹配算法POC计算源字符串和待匹配字符串的匹配度。5. Use the partial ordered string matching algorithm POC to calculate the matching degree between the source string and the string to be matched.

将两个待匹配的含有汉字、数字和字母的混合字符串分别称为源字符串和待匹配字符串。The two mixed character strings containing Chinese characters, numbers and letters to be matched are respectively called the source string and the character string to be matched.

首先，先搜索出源字符串与待匹配字符串相同的字符，记录下它们的个数。First, search for the same characters in the source string and the string to be matched, and record their numbers.

公式（1）、（2）中[□]表示取整。[□] in formulas (1) and (2) means rounding.

再次，分别比较源字符串和待匹配字符串中第1个（或第2个）数字与字母，最后1个（或倒数第2个）数字与字母，如果其中1项相等，则调整匹配度2的match_degree2数值为match_degree2+1。Again, compare the first (or second) number and letter in the source string and the string to be matched, and the last (or second last) number and letter, and if one of them is equal, adjust the matching degree The match_degree2 value of 2 is match_degree2+1.

最后，由于在实际应用中，匹配度2更能反映实际匹配情况，因此赋予匹配度2更大的权重。Finally, because in practical applications, matching degree 2 can better reflect the actual matching situation, so matching degree 2 is given greater weight.

6）将步骤4）GST*计算所得的匹配度a和步骤5）POC计算所得的匹配度b进行加权融合，融合方法是，如果匹配度a大于匹配度b，则最终匹配度为a。如果匹配度a小于匹配度b，则最终匹配度等于(a+b)/2。6) Perform weighted fusion of the matching degree a obtained by step 4) GST* calculation and the matching degree b obtained by step 5) POC calculation. The fusion method is that if the matching degree a is greater than the matching degree b, the final matching degree is a. If the matching degree a is smaller than the matching degree b, the final matching degree is equal to (a+b)/2.

得到最终的比对结果，充分利用了两种算法的特点；Get the final comparison result, making full use of the characteristics of the two algorithms;

7.检查循环是否执行完毕；7. Check whether the loop is executed;

8.对各匹配度进行排序，找出匹配度最大对应的字符串，作为最匹配的目标字符串。8. Sort each matching degree, find out the character string corresponding to the highest matching degree, and use it as the most matching target character string.

图2为本发明的混合字符串一对多比对方法的应用实例。计算了一组混合字符串的匹配情况。图2中分别列出利用GST*算法、偏有序字符串匹配算法(POC算法)、两种算法加权融合方法后GST*_POC的匹配度。Fig. 2 is an application example of the mixed character string one-to-many comparison method of the present invention. Matches are computed for a set of mixed strings. Figure 2 lists the matching degree of GST*_POC after using the GST* algorithm, the partial ordered string matching algorithm (POC algorithm), and the weighted fusion method of the two algorithms.

可看到，图2中第1条比对、第2条比对中，第1条的待匹配字符串比第2条的待匹配字符串更接近于第1列的源字符串。It can be seen that in the first comparison and the second comparison in Fig. 2, the string to be matched in the first item is closer to the source string in the first column than the string to be matched in the second item.

从图2中结果能够得出结论，利用本发明算法得到了较为理想的比对结果。From the results in Figure 2, it can be concluded that a relatively ideal comparison result has been obtained by using the algorithm of the present invention.

Claims

1. a kind of one-to-many mixed characters string fusion comparison method, to based on Chinese characters clustering feature by Chinese character, numeral, English The similarity of the mixed characters string of letter composition carries out fusion ratio pair, to improve the similar accuracy of expression character string, including Following key step：

1) source string and one group of matching string are taken out；

2) the character string equivalencing dictionary built in memory in advance is read, to partial character in this group of matching string I.e. substring carries out equivalencing；Using equivalencing dictionary, will have not in source string occasion and matching string occasion Describe together but two kinds of substrings of implication identical are unified；

3) source string is taken out, a character to be matched in the matching string array after equivalencing is taken out successively String；

4) source string and the matching degree a of the matching string are calculated using GST* algorithms：

Using traditional GST algorithms, each public substring in two character strings is obtained, they are stored in public substring chained list, if The ratio of the character length of some public substring and longer character string character length is more than or equal to 0.33, then is calculating matching degree When the character number of the public substring is multiplied by weight, the weight is constant more than 1；If the character of some public substring is long Degree and the ratio of longer character string character length are less than 0.33 and the character number of public substring is more than smallest match length, then The character number of the public substring is brought directly to calculating when calculating matching degree；

5) source string and the matching degree b of matching string are calculated using partially orderly string matching algorithm POC：

Two mixed characters strings containing Chinese character, numeral and English alphabet to be matched are referred to as source string and to be matched Character string,

First, source string and identical character in matching string are searched out, their number is recorded, secondly, with source Longer character string is standard in character string and matching string, seeks matching degree 1：

, as standard, to seek matching degree 2 wherein compared with short character strings：

[] represents to round in formula (1), (2)；

Again, the 1st or the 2nd numeral and letter, last 1 or reciprocal in source string and matching string are respectively compared 2nd numeral and letter, if wherein 1 equal, the numerical value of adjustment matching degree 2 is match_degree2+1:

Different weights 0.41,0.59 are assigned to matching degree 1 and matching degree 2, the matching of source string and matching string is asked Spend b：

B=match_degree1 × 0.41+match_degree2 × 0.59 (3)

6) by step 4) GST* calculate obtained by matching degree a and step 5) POC calculate obtained by matching degree b be weighted fusion, Fusion method is, if matching degree a is more than matching degree b, final matching degree is a；If matching degree a is less than matching degree b, most Whole matching degree is equal to (a+b)/2；

7) each matching string calculating in source string and matching string array is obtained into matching degree to be ranked up, The corresponding matching string of maximum matching degree, the target string most matched with source string is used as.