[go: up one dir, main page]

CN104008119B - A kind of one-to-many mixed characters string fusion comparison method - Google Patents

A kind of one-to-many mixed characters string fusion comparison method Download PDF

Info

Publication number
CN104008119B
CN104008119B CN201310746846.5A CN201310746846A CN104008119B CN 104008119 B CN104008119 B CN 104008119B CN 201310746846 A CN201310746846 A CN 201310746846A CN 104008119 B CN104008119 B CN 104008119B
Authority
CN
China
Prior art keywords
string
matching
character
matching degree
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310746846.5A
Other languages
Chinese (zh)
Other versions
CN104008119A (en
Inventor
童晓阳
甄威
郑永康
姜振超
庄先涛
吴继维
张茜
丁宣文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Original Assignee
Southwest Jiaotong University
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University, Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd filed Critical Southwest Jiaotong University
Priority to CN201310746846.5A priority Critical patent/CN104008119B/en
Publication of CN104008119A publication Critical patent/CN104008119A/en
Application granted granted Critical
Publication of CN104008119B publication Critical patent/CN104008119B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种一对多的混合字符串融合比对方法,将一个源字符串从一组待比对字符串中找到最相似或匹配目标字符串。首先用改进的GST*算法,再使用一种偏有序的字符串比对算法POC。结合以上两种算法分别在字符串无序、部分有序匹配方面各自的特点,将两种算法计算得到的匹配度值进行加权融合求得最终的匹配度。另外,针对同义字符串在不同场合下具有不同的表达方式,采用字符串等价替换策略,对源字符串、待比对字符串中同义字符子串等价替换成相同的字符串,大大提高两个字符串的匹配度。通过将源字符串与一组待匹配字符串分别匹配,再将各匹配度进行排序,将最高匹配度的字符串作为目标字符串,实现了一对多的混合字符串的较佳匹配。

The invention discloses a one-to-many mixed character string fusion comparison method, which uses a source character string to find the most similar or matching target character string from a group of character strings to be compared. First use the improved GST* algorithm, and then use a partially ordered string comparison algorithm POC. Combining the characteristics of the above two algorithms in terms of string disorder and partially ordered matching, the matching degree values calculated by the two algorithms are weighted and fused to obtain the final matching degree. In addition, considering that synonymous strings have different expressions in different occasions, a string equivalent replacement strategy is adopted to equivalently replace synonymous character substrings in the source string and the string to be compared with the same string, Greatly improve the matching of two strings. By matching the source string with a group of strings to be matched respectively, and then sorting the matching degrees, and taking the string with the highest matching degree as the target string, the better matching of one-to-many mixed strings is realized.

Description

一种一对多的混合字符串融合比对方法A One-to-Many Mixed String Fusion Comparison Method

技术领域technical field

本发明属于字符串智能比对技术领域,具体涉及一种新型一对多的混合字符串融合比对方法。The invention belongs to the technical field of character string intelligent comparison, and in particular relates to a novel one-to-many mixed character string fusion comparison method.

背景技术Background technique

字符串比对问题是计算机科学中的一个基本问题,其研究内容在信息检索、模式识别等众多领域均有着重要的应用价值[1]-[4]String comparison is a basic problem in computer science, and its research content has important application value in many fields such as information retrieval and pattern recognition [1]-[4] .

文献1分别研究中文字符串模糊匹配算法,文献2研究了一种基于汉字聚类特征的中文字符串相似度计算方法。文献3对LCS与GST算法做了比较,GST算法是一种贪婪字符串比对算法,也是一种无序匹配算法,目前应用较广,但该算法采用了两个字符串逐个字符比较的方法,所以算法的时间复杂度较大。文献4研究了对GST算法改进后RKR-GST算法,提高了GST算法的运行效率,但是RKR-GST算法中散列函数的选择对算法的运行影响很大。Document 1 studies the fuzzy matching algorithm of Chinese character strings respectively, and Document 2 studies a Chinese character string similarity calculation method based on Chinese character clustering features. Document 3 compares the LCS and GST algorithms. The GST algorithm is a greedy string comparison algorithm and an unordered matching algorithm. It is widely used at present, but the algorithm uses the method of character-by-character comparison of two strings. , so the time complexity of the algorithm is large. Document 4 studies the RKR-GST algorithm after the improvement of the GST algorithm, which improves the operating efficiency of the GST algorithm, but the choice of the hash function in the RKR-GST algorithm has a great impact on the operation of the algorithm.

现有的字符串比对方法往往只采用一种算法,没有能够充分利用无序字符子串和部分有序字符子串在匹配度计算时的各自特点,往往它们的比对效果并不理想。在一些某些混合字符串的实际应用中,不但要求比对的准确性高,而且要求比对的速度快。目前,通过单一的匹配度计算方法,往往很难精确地表达字符串的相似程度。Existing character string comparison methods often only use one algorithm, which fails to make full use of the respective characteristics of unordered character substrings and partially ordered character substrings in the calculation of matching degrees, and their comparison results are often not ideal. In some practical applications of some mixed character strings, not only high accuracy of comparison is required, but also high speed of comparison is required. At present, it is often difficult to accurately express the similarity of character strings through a single matching calculation method.

另外,现有的字符串比对方法没有考虑同义字符串可能存在不同表达方式的情况,使得现有字符串比对方法在此类情况下很难达到较准确、高匹配率的要求。In addition, the existing string comparison methods do not consider the situation that synonymous strings may have different expressions, which makes it difficult for the existing string comparison methods to meet the requirements of more accurate and high matching rates in such cases.

参考文献:references:

[1]陈开渠,赵洁,彭志威.快速中文字符串模糊匹配算法[J].中文信息学报,2003,18(2):58-65[1] Chen Kaiqu, Zhao Jie, Peng Zhiwei. Fast Chinese String Fuzzy Matching Algorithm [J]. Chinese Journal of Information, 2003, 18(2): 58-65

[2]王静婷.基于汉字聚类特征的中文字符串相似度计算研究[J].现代图书情报技术,2011,20(2):48-53[2] Wang Jingting. Research on Similarity Calculation of Chinese Strings Based on Chinese Character Clustering Features [J]. Modern Library and Information Technology, 2011, 20(2): 48-53

[3]于海英.字符串相似度度量中LCS和GST算法比较[J].电子科技,2011,24(3):101-103[3] Yu Haiying. Comparison of LCS and GST Algorithms in String Similarity Measurement [J]. Electronic Science and Technology, 2011, 24(3): 101-103

[4]牛永洁.RKR_GST算法在_NET中的分析与实现[J].信息技术,2012,3:171-174[4] Niu Yongjie. Analysis and Implementation of RKR_GST Algorithm in _NET [J]. Information Technology, 2012, 3: 171-174

发明内容Contents of the invention

鉴于现有技术的以上不足,本发明的目的是提供一种更准确的混合字符串融合比对方法。解决了实际应用中以单一匹配度计算方法很难达到精确表达字符串之间相似程度、同义字符串存在不同表达方式情形下现有字符串比对方法几乎失效等问题。In view of the above deficiencies in the prior art, the purpose of the present invention is to provide a more accurate method for fusion and comparison of mixed character strings. It solves the problems that it is difficult to accurately express the similarity between strings with a single matching degree calculation method in practical applications, and the existing string comparison methods are almost invalid when synonymous strings have different expressions.

本发明的目的是通过以下的手段实现的:The purpose of the present invention is achieved by the following means:

一种一对多的混合字符串融合比对方法,对基于汉字聚类特征的由汉字、数字、英文字母组成的混合字符串的相似度进行融合比对,以提高表达字符串的相似的精确度,包括以下主要步骤:A one-to-many mixed string fusion comparison method, which performs fusion comparison on the similarity of mixed strings composed of Chinese characters, numbers, and English letters based on the clustering characteristics of Chinese characters, so as to improve the accuracy of expressing the similarity of strings degree, including the following main steps:

1)取出源字符串和一组待匹配字符串;1) Take out the source string and a set of strings to be matched;

2)读出事先在存储器中构建的字符串等价替换字典,对该组待匹配字符串中部分字符(子串)进行等价替换;利用等价替换字典,将上述在源字符串场合和待匹配字符串场合具有不同描述但含义相同的两种子串进行统一;2) Read out the string equivalent replacement dictionary built in memory in advance, and perform equivalent replacement of some characters (substrings) in the group of strings to be matched; Unify the two substrings with different descriptions but the same meaning in the string to be matched;

3)取出源字符串,依此取出等价替换后的该待匹配字符串数组中的一个待匹配字符串;3) Take out the source string, and then take out a string to be matched in the array of strings to be matched after equivalent replacement;

4)利用GST*算法计算源字符串与该待匹配字符串的匹配度a:4) Use the GST* algorithm to calculate the matching degree a between the source string and the string to be matched:

采用传统GST算法,得到两个字符串中各公共子串,将它们存入公共子串链表中。如果某个公共子串的字符长度与较长字符串字符长度的比值大于或等于0.33,则在计算匹配度时将该公共子串的字符个数乘以权重,该权重为大于1的常数;如果某个公共子串的字符长度与较长字符串字符长度的比值小于0.33、且公共子串的字符个数大于最小匹配长度,则计算匹配度时将该公共子串的字符个数直接带入计算;The traditional GST algorithm is used to obtain the common substrings in the two strings, and store them in the public substring linked list. If the ratio of the character length of a certain common substring to the character length of a longer character string is greater than or equal to 0.33, then when calculating the matching degree, the number of characters of the common substring is multiplied by a weight, and the weight is a constant greater than 1; If the ratio of the character length of a certain common substring to the character length of the longer string is less than 0.33, and the number of characters in the common substring is greater than the minimum matching length, then the number of characters in the common substring is directly taken into account when calculating the matching degree. into the calculation;

5)利用偏有序字符串匹配算法POC(Partial Order Comparison,POC)计算源字符串和待匹配字符串的匹配度b:5) Use the partial order string matching algorithm POC (Partial Order Comparison, POC) to calculate the matching degree b between the source string and the string to be matched:

将两个待匹配的含有汉字、数字和英文字母的混合字符串分别称为源字符串和待匹配字符串,The two mixed strings containing Chinese characters, numbers and English letters to be matched are respectively called the source string and the string to be matched,

首先,先搜索出源字符串与待匹配字符串中相同的字符或汉字,记录下它们的个数,;First, search for the same characters or Chinese characters in the source string and the string to be matched, and record their numbers;

其次,以源字符串和待匹配字符串中较长字符串为标准,求匹配度1(match_degree1):Secondly, use the longer string in the source string and the string to be matched as the standard to find the matching degree 1 (match_degree1):

以其中较短字符串为标准,求匹配度2(match_degree2):Using the shorter string as the standard, find the matching degree 2 (match_degree2):

公式(1)、(2)中[]表示取整;[ ] in formulas (1) and (2) means rounding;

再次,分别比较源字符串和待匹配字符串中第1个或第2个数字与字母,最后1个或倒数第2个数字与字母,如果其中1项相等,则调整匹配度2的match_degree2数值为match_degree2+1:Again, compare the first or second number and letter in the source string and the string to be matched, and the last or second-to-last number and letter, and if one of them is equal, adjust the match_degree2 value of matching degree 2 For match_degree2+1:

对匹配度1和匹配度2赋予不同的权重0.41、0.59,求源字符串和待匹配字符串的最终匹配值b:Assign different weights 0.41 and 0.59 to matching degree 1 and matching degree 2, and find the final matching value b of the source string and the string to be matched:

b=match_degree1×0.41+match_degree2×0.59 (3)b=match_degree1×0.41+match_degree2×0.59 (3)

6)将步骤4)GST*计算所得的匹配度a和步骤5)POC计算所得的匹配度b进行加权融合,融合方法是,如果匹配度a大于匹配度b,则最终匹配度为a;如果匹配度a小于匹配度b,则最终匹配度等于(a+b)/2;6) Perform weighted fusion of the matching degree a obtained by step 4) GST* calculation and the matching degree b obtained by step 5) POC calculation. The fusion method is that if the matching degree a is greater than the matching degree b, the final matching degree is a; if Matching degree a is less than matching degree b, then the final matching degree is equal to (a+b)/2;

7)将源字符串和待匹配字符串数组中每个待匹配字符串计算所获得匹配度进行排序,把最大匹配度对应的待匹配字符串,作为与源字符串最匹配的目标字符串。7) Sort the matching degree calculated by the source string and each to-be-matched string in the to-be-matched string array, and use the to-be-matched string corresponding to the maximum matching degree as the target string that best matches the source string.

在步骤4)中,先搜索出源字符串与待匹配字符串相同的各公共子串,再对不同长度的公共子串赋予不同的权重,增大了较长公共字符子串的权重。In step 4), the common substrings that are the same as the source string and the string to be matched are first searched out, and then different weights are assigned to the common substrings of different lengths, which increases the weight of longer common character substrings.

本发明的GST*算法,针对传统GST算法存在的较短公共子串匹配度可能比较长公共子串的匹配度更大的现象,对其进行了改进:如果公共子串的字符长度与较长字符串字符长度的比值大于或等于0.33,则在计算匹配度时将该公共子串的字符个数乘以权重(大于1的常数);如果公共子串的字符长度与较长字符串字符长度的比值小于0.33、且公共子串的字符个数大于最小匹配长度,则计算匹配度时将该公共子串的字符个数直接带入计算。The GST* algorithm of the present invention is aimed at the phenomenon that the matching degree of the shorter common substring that exists in the traditional GST algorithm may be larger than the matching degree of the long common substring, and it is improved: if the character length of the common substring is the same as the longer If the ratio of the character length of the string is greater than or equal to 0.33, multiply the number of characters of the common substring by the weight (a constant greater than 1) when calculating the matching degree; if the character length of the common substring is greater than the character length of the longer string The ratio of is less than 0.33, and the number of characters in the common substring is greater than the minimum matching length, then the number of characters in the common substring is directly brought into the calculation when calculating the matching degree.

在步骤5)中,将两个含有数字、字母、汉字的混合字符串分别作为源字符串和待匹配字符串;分别以其中较长字符串、较短字符串为标准,求出匹配度1和匹配度2;然后再比较第一个或多个数字与字母、最后一个或多个数字与字母是否相等,对匹配度2进行修改。最后对两种匹配度分别赋予不同权重,得到两个字符串之间的匹配度值。In step 5), two mixed strings containing numbers, letters, and Chinese characters are used as the source string and the string to be matched respectively; the longer string and the shorter string are used as the standard to find the matching degree 1 and matching degree 2; then compare whether the first or more numbers and letters, and the last one or more numbers and letters are equal, and modify the matching degree 2. Finally, different weights are assigned to the two matching degrees to obtain the matching degree value between the two strings.

本发明的偏有序字符串比对算法POC考虑到匹配度2更能反映实际匹配情况,因此赋予匹配度2稍微较大的权重。The partial ordered character string comparison algorithm POC of the present invention considers that the matching degree 2 can better reflect the actual matching situation, so the matching degree 2 is given a slightly larger weight.

本发明给出了字符串等价替换策略。比如,“高压侧”与“220KV侧”,“千伏”与“kV”,在含义上是等价的。采用现有各类比对算法均不能准确地反映它们之间的等价关系,因此提出字符串等价替换策略。事先构建一个字符子串等价替换字典,采用:待匹配中子串=等价的源侧子串的形式,比如千伏=kV,它表示等号两侧的字符子串在含义上是相同的,等号左侧子串代表待匹配字符串中的某子串,等号右侧子串代表与左侧等价的源字符串中子串。The invention provides a character string equivalent replacement strategy. For example, "high voltage side" and "220KV side", "kilovolt" and "kV" are equivalent in meaning. The existing comparison algorithms cannot accurately reflect the equivalence relationship between them, so a string equivalent replacement strategy is proposed. Build a character substring equivalent replacement dictionary in advance, using the form of: substring to be matched = equivalent source side substring, such as kilovolts = kV, which means that the character substrings on both sides of the equal sign are the same in meaning , the substring on the left side of the equal sign represents a substring in the string to be matched, and the substring on the right side of the equal sign represents a substring in the source string equivalent to the left side.

在做匹配度计算之前,先检查待匹配字符串中是否含有字符子串等价替换字典中各行中左侧的字符子串,如果有,则替换它为等号右侧的源侧字符子串。在此基础上,再运用本融合比对算法进行比对,计算得到相应的匹配度,这样大大提高了匹配的精确度,能够反映出参与比较两个字符串之间真实的匹配情况。Before doing the matching degree calculation, first check whether the string to be matched contains a character substring equivalent to the character substring on the left side of each line in the replacement dictionary, and if so, replace it with the source side character substring on the right side of the equal sign . On this basis, the fusion comparison algorithm is used to compare and calculate the corresponding matching degree, which greatly improves the matching accuracy and can reflect the real matching situation between the two character strings involved in the comparison.

本发明适用于一对多混合字符串的比对。分别计算源字符串与一组待匹配字符串的匹配度,并将得到的各匹配度进行排序,从中找出与源字符串匹配度最大的待匹配字符串,把它确定为目标字符串,从而实现了一对多字符串的较佳匹配。The invention is suitable for comparison of one-to-many mixed character strings. Calculate the matching degree of the source string and a group of strings to be matched respectively, and sort the obtained matching degrees, find out the string to be matched with the highest matching degree with the source string, and determine it as the target string, This enables better matching of one-to-many strings.

附图说明:Description of drawings:

图1为新型一对多字符串的融合比对方法的流程图。FIG. 1 is a flow chart of a novel one-to-many character string fusion comparison method.

图2为一对多混合字符串的融合比对方法的应用实例。Fig. 2 is an application example of the fusion comparison method of one-to-many mixed character strings.

具体实施方式detailed description

下面结合附图对本发明的方法作进一步的详述Below in conjunction with accompanying drawing, method of the present invention is described in further detail

下面结合附图对本发明做进一步地详细说明。本发明具体涉及一种混合字符串的融合比对方法。首先将待匹配的字符串分别称为源字符串和待匹配字符串。本发明更能适合于从一组待匹配字符串中寻找与源字符串最匹配的目标匹配字符串。The present invention will be described in further detail below in conjunction with the accompanying drawings. The invention specifically relates to a fusion comparison method of mixed character strings. Firstly, the character string to be matched is referred to as the source character string and the character string to be matched respectively. The present invention is more suitable for finding the target matching character string that best matches the source character string from a group of character strings to be matched.

具体实施方式如下。The specific implementation method is as follows.

1.取出源字符串和一组待匹配字符串;1. Take out the source string and a set of strings to be matched;

2.读出事先构建的字符串等价替换字典,对该组待匹配字符串中部分字符进行等价替换。例如“高压侧”与“220KV侧”等价,“千伏”与“kV”等价。在进行字符串匹配度计算之前,利用等价替换字典可将上述不同描述进行统一;2. Read out the string equivalent replacement dictionary built in advance, and perform equivalent replacement for some characters in the group of strings to be matched. For example, "high voltage side" is equivalent to "220KV side", and "kV" is equivalent to "kV". Before calculating the string matching degree, the above different descriptions can be unified by using the equivalent replacement dictionary;

3.取出源字符串,依此取出等价替换后的该组待匹配字符串数组中的一个待匹配字符串;3. Take out the source string, and accordingly take out a string to be matched in the array of strings to be matched after equivalent replacement;

4.利用GST*算法计算源字符串与待匹配字符串的匹配度。4. Use the GST* algorithm to calculate the matching degree between the source string and the string to be matched.

GST*算法与传统的GST算法的改进效果,通过以下举例说明。The improvement effect of the GST* algorithm and the traditional GST algorithm is illustrated by the following examples.

例如“abcde”与“qbcio”、“abcde”与“qbico”为两组待比较字符串,利用GST算法计算两组字符串匹配度均为40%。For example, "abcde" and "qbcio", "abcde" and "qbico" are two groups of character strings to be compared, and the GST algorithm is used to calculate the matching degree of both groups of character strings to be 40%.

而采用GST*算法计算两组字符串匹配度,结果分别为43.2%和40%。可看出GST*算法的比对结果更准确一些。The GST* algorithm is used to calculate the matching degree of the two groups of strings, and the results are 43.2% and 40% respectively. It can be seen that the comparison result of the GST* algorithm is more accurate.

GST*算法使具有较长公共子串的两字符串的匹配度更高。The GST* algorithm makes the matching degree of two strings with longer common substrings higher.

5.利用偏有序字符串匹配算法POC计算源字符串和待匹配字符串的匹配度。5. Use the partial ordered string matching algorithm POC to calculate the matching degree between the source string and the string to be matched.

将两个待匹配的含有汉字、数字和字母的混合字符串分别称为源字符串和待匹配字符串。The two mixed character strings containing Chinese characters, numbers and letters to be matched are respectively called the source string and the character string to be matched.

首先,先搜索出源字符串与待匹配字符串相同的字符,记录下它们的个数。First, search for the same characters in the source string and the string to be matched, and record their numbers.

其次,以源字符串和待匹配字符串中较长字符串为标准,求匹配度1(match_degree1):Secondly, use the longer string in the source string and the string to be matched as the standard to find the matching degree 1 (match_degree1):

以其中较短字符串为标准,求匹配度2(match_degree2):Using the shorter string as the standard, find the matching degree 2 (match_degree2):

公式(1)、(2)中[□]表示取整。[□] in formulas (1) and (2) means rounding.

再次,分别比较源字符串和待匹配字符串中第1个(或第2个)数字与字母,最后1个(或倒数第2个)数字与字母,如果其中1项相等,则调整匹配度2的match_degree2数值为match_degree2+1。Again, compare the first (or second) number and letter in the source string and the string to be matched, and the last (or second last) number and letter, and if one of them is equal, adjust the matching degree The match_degree2 value of 2 is match_degree2+1.

最后,由于在实际应用中,匹配度2更能反映实际匹配情况,因此赋予匹配度2更大的权重。Finally, because in practical applications, matching degree 2 can better reflect the actual matching situation, so matching degree 2 is given greater weight.

对匹配度1和匹配度2赋予不同的权重0.41、0.59,求源字符串和待匹配字符串的最终匹配值b:Assign different weights 0.41 and 0.59 to matching degree 1 and matching degree 2, and find the final matching value b of the source string and the string to be matched:

b=match_degree1×0.41+match_degree2×0.59 (3)b=match_degree1×0.41+match_degree2×0.59 (3)

6)将步骤4)GST*计算所得的匹配度a和步骤5)POC计算所得的匹配度b进行加权融合,融合方法是,如果匹配度a大于匹配度b,则最终匹配度为a。如果匹配度a小于匹配度b,则最终匹配度等于(a+b)/2。6) Perform weighted fusion of the matching degree a obtained by step 4) GST* calculation and the matching degree b obtained by step 5) POC calculation. The fusion method is that if the matching degree a is greater than the matching degree b, the final matching degree is a. If the matching degree a is smaller than the matching degree b, the final matching degree is equal to (a+b)/2.

得到最终的比对结果,充分利用了两种算法的特点;Get the final comparison result, making full use of the characteristics of the two algorithms;

7.检查循环是否执行完毕;7. Check whether the loop is executed;

8.对各匹配度进行排序,找出匹配度最大对应的字符串,作为最匹配的目标字符串。8. Sort each matching degree, find out the character string corresponding to the highest matching degree, and use it as the most matching target character string.

图2为本发明的混合字符串一对多比对方法的应用实例。计算了一组混合字符串的匹配情况。图2中分别列出利用GST*算法、偏有序字符串匹配算法(POC算法)、两种算法加权融合方法后GST*_POC的匹配度。Fig. 2 is an application example of the mixed character string one-to-many comparison method of the present invention. Matches are computed for a set of mixed strings. Figure 2 lists the matching degree of GST*_POC after using the GST* algorithm, the partial ordered string matching algorithm (POC algorithm), and the weighted fusion method of the two algorithms.

可看到,图2中第1条比对、第2条比对中,第1条的待匹配字符串比第2条的待匹配字符串更接近于第1列的源字符串。It can be seen that in the first comparison and the second comparison in Fig. 2, the string to be matched in the first item is closer to the source string in the first column than the string to be matched in the second item.

从图2中结果能够得出结论,利用本发明算法得到了较为理想的比对结果。From the results in Figure 2, it can be concluded that a relatively ideal comparison result has been obtained by using the algorithm of the present invention.

Claims (1)

1. a kind of one-to-many mixed characters string fusion comparison method, to based on Chinese characters clustering feature by Chinese character, numeral, English The similarity of the mixed characters string of letter composition carries out fusion ratio pair, to improve the similar accuracy of expression character string, including Following key step:
1) source string and one group of matching string are taken out;
2) the character string equivalencing dictionary built in memory in advance is read, to partial character in this group of matching string I.e. substring carries out equivalencing;Using equivalencing dictionary, will have not in source string occasion and matching string occasion Describe together but two kinds of substrings of implication identical are unified;
3) source string is taken out, a character to be matched in the matching string array after equivalencing is taken out successively String;
4) source string and the matching degree a of the matching string are calculated using GST* algorithms:
Using traditional GST algorithms, each public substring in two character strings is obtained, they are stored in public substring chained list, if The ratio of the character length of some public substring and longer character string character length is more than or equal to 0.33, then is calculating matching degree When the character number of the public substring is multiplied by weight, the weight is constant more than 1;If the character of some public substring is long Degree and the ratio of longer character string character length are less than 0.33 and the character number of public substring is more than smallest match length, then The character number of the public substring is brought directly to calculating when calculating matching degree;
5) source string and the matching degree b of matching string are calculated using partially orderly string matching algorithm POC:
Two mixed characters strings containing Chinese character, numeral and English alphabet to be matched are referred to as source string and to be matched Character string,
First, source string and identical character in matching string are searched out, their number is recorded, secondly, with source Longer character string is standard in character string and matching string, seeks matching degree 1:
, as standard, to seek matching degree 2 wherein compared with short character strings:
[] represents to round in formula (1), (2);
Again, the 1st or the 2nd numeral and letter, last 1 or reciprocal in source string and matching string are respectively compared 2nd numeral and letter, if wherein 1 equal, the numerical value of adjustment matching degree 2 is match_degree2+1:
Different weights 0.41,0.59 are assigned to matching degree 1 and matching degree 2, the matching of source string and matching string is asked Spend b:
B=match_degree1 × 0.41+match_degree2 × 0.59 (3)
6) by step 4) GST* calculate obtained by matching degree a and step 5) POC calculate obtained by matching degree b be weighted fusion, Fusion method is, if matching degree a is more than matching degree b, final matching degree is a;If matching degree a is less than matching degree b, most Whole matching degree is equal to (a+b)/2;
7) each matching string calculating in source string and matching string array is obtained into matching degree to be ranked up, The corresponding matching string of maximum matching degree, the target string most matched with source string is used as.
CN201310746846.5A 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method Expired - Fee Related CN104008119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310746846.5A CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310746846.5A CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Publications (2)

Publication Number Publication Date
CN104008119A CN104008119A (en) 2014-08-27
CN104008119B true CN104008119B (en) 2017-09-26

Family

ID=51368778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310746846.5A Expired - Fee Related CN104008119B (en) 2013-12-30 2013-12-30 A kind of one-to-many mixed characters string fusion comparison method

Country Status (1)

Country Link
CN (1) CN104008119B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732041B (en) * 2015-04-13 2017-09-29 国网四川省电力公司电力科学研究院 A kind of empty terminal table automatic generation method based on many SCD templates
CN105184713A (en) * 2015-07-17 2015-12-23 四川久远银海软件股份有限公司 Intelligent matching and sorting system and method capable of benefitting contrast of assigned drugs of medical insurance
CN107102998A (en) * 2016-02-22 2017-08-29 阿里巴巴集团控股有限公司 A kind of String distance computational methods and device
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 A kind of short text similarity calculating method and device
CN106919663A (en) * 2017-02-14 2017-07-04 华北电力大学 Character string matching method in the multi-source heterogeneous data fusion of power regulation system
CN109741745A (en) * 2019-01-28 2019-05-10 中国银行股份有限公司 A kind of transaction air navigation aid and device
CN112215216B (en) * 2020-09-10 2024-08-13 中国东方电气集团有限公司 System and method for fuzzy matching of character strings of image recognition results

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329680A (en) * 2008-07-17 2008-12-24 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100557606C (en) * 2003-03-03 2009-11-04 皇家飞利浦电子股份有限公司 Method and apparatus for finding a string

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329680A (en) * 2008-07-17 2008-12-24 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Unified Approach for Computing Document Similarity with Fingerprinting and Alignments;Jongkyu Seo等;《2012 IEEE 12th International Conference on Computer and Information Technology》;20121029;第448-455页 *
多种字符串相似度算法的比较研究;牛永洁 等;《计算机与数字工程》;20120320(第3期);第14-17页 *

Also Published As

Publication number Publication date
CN104008119A (en) 2014-08-27

Similar Documents

Publication Publication Date Title
CN104008119B (en) A kind of one-to-many mixed characters string fusion comparison method
CN107562824B (en) A text similarity detection method
Huang et al. Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search
Gagie et al. LZ77-based self-indexing with faster pattern matching
CN103218423B (en) Data query method and device
CN103425739B (en) Character string matching method
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN104680178B (en) Image classification method based on transfer learning multi attractor cellular automaton
CN110990596B (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN106557777B (en) An Improved Kmeans Document Clustering Method Based on SimHash
CN108427729A (en) Large-scale picture retrieval method based on depth residual error network and Hash coding
CN101329680B (en) Large scale rapid matching method of sentence surface
CN107194422A (en) A kind of convolutional neural networks relation sorting technique of the forward and reverse example of combination
CN105574212A (en) Image retrieval method for multi-index disk Hash structure
CN106570166B (en) A video retrieval method and device based on multiple locality-sensitive hash tables
CN114911999B (en) Name matching method and device
CN104731811B (en) A kind of clustering information evolution analysis method towards extensive dynamic short text
CN106339459B (en) The method that Chinese web page is presorted is carried out based on Keywords matching
CN115129871B (en) Text category determining method, apparatus, computer device and storage medium
CN105354264A (en) Locality-sensitive-hashing-based subject label fast endowing method
CN111832706A (en) Hash Center-Based Continuous Learning Method
US20220179890A1 (en) Information processing apparatus, non-transitory computer-readable storage medium, and information processing method
Yang et al. Graph attention hashing via contrastive learning for unsupervised cross-modal retrieval
CN111581332A (en) Similar judicial case matching method and system based on triple deep hash learning
CN112784040B (en) Text Classification Method for Vertical Industry Based on Corpus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170926

Termination date: 20181230

CF01 Termination of patent right due to non-payment of annual fee