CN113590895B - Character string retrieval method - Google Patents
Character string retrieval method Download PDFInfo
- Publication number
- CN113590895B CN113590895B CN202110860647.1A CN202110860647A CN113590895B CN 113590895 B CN113590895 B CN 113590895B CN 202110860647 A CN202110860647 A CN 202110860647A CN 113590895 B CN113590895 B CN 113590895B
- Authority
- CN
- China
- Prior art keywords
- string
- array
- character
- feature
- pattern string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Character Discrimination (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
本发明涉及计算机信息处理领域,具体是一种字符串检索方法,包括如下步骤:步骤1、获取模式串中的特征对;步骤2、获取记录模式串与目标串进行匹配的优化顺序PNext数组;步骤3、获取记录模式串在目标串中跳跃匹配距离的TNext数组和FNext数组;步骤4、模式串在目标串中进行从左到右匹配时,如果对应字符适配,根据步骤2事先获得的匹配顺序依次进行匹配,如果对应字符失配,根据步骤3事先获取的记录数组进行跳跃。解决了字符串检索过程中,一旦发生失配,无法快速获取较大的跳跃距离,而导致检索过程中频繁回溯的问题,并提升了检索效率。
The present invention relates to the field of computer information processing, in particular to a character string retrieval method, comprising the following steps: step 1, acquiring feature pairs in a pattern string; step 2, obtaining an optimized order PNext array for matching a record pattern string with a target string; Step 3, obtain the TNext array and the FNext array that record the jump matching distance of the pattern string in the target string; Step 4, when the pattern string is matched from left to right in the target string, if the corresponding character is adapted, according to the prior obtained in step 2 The matching sequence is matched sequentially, and if the corresponding characters do not match, the jump is performed according to the record array obtained in advance in step 3. It solves the problem that once a mismatch occurs in the string retrieval process, a large jump distance cannot be obtained quickly, which leads to frequent backtracking during the retrieval process, and the retrieval efficiency is improved.
Description
技术领域technical field
本发明涉及计算机信息处理领域,具体是指一种字符串检索方法。The invention relates to the field of computer information processing, in particular to a character string retrieval method.
背景技术Background technique
字符串检索是计算机领域的经典问题,在信息检索、病毒检测、基因序列匹配等许多场景都有着重要应用,现有字符串检索主要有精确检索和模糊检索。String retrieval is a classic problem in the computer field. It has important applications in many scenarios such as information retrieval, virus detection, and gene sequence matching. Existing string retrieval mainly includes precise retrieval and fuzzy retrieval.
自1977年,Boyer-Moore和Knuth-Morris-Pratt分别提出BM算法和KMP算法以来,精确检索算法受到许多学者的持续跟踪,KMP算法在传统朴素算法的基础上,提出了一种避免目标串回溯的方法,从而使算法效率有了一定程度的提高;BM算法采用由后往前匹配并且根据坏字符与好后缀获得较大的跳跃距离,将检索效率提升到次线性时间复杂度,也就是实际检索次数小于目标串长度;1980年,Boyer Moore Horspool在BM算法的基础上进行了改进提出BMH算法,BMH它不再像BM算法关注失配的字符,它的关注点在坏字符上,根据这个字符在模式串中出现的最后的位置算出偏移长度,否则偏移模式串的长度;此外,还出现了Karp-Rabin以及BM算法的诸多变体等;1990年Daniel M.Sunday在坏字符的思想的基础上提出Sunday算法,将字符串检索效率提升到一个新的高度。Since Boyer-Moore and Knuth-Morris-Pratt proposed the BM algorithm and the KMP algorithm respectively in 1977, the precise retrieval algorithm has been continuously tracked by many scholars. The KMP algorithm proposes a method to avoid target string backtracking based on the traditional simple algorithm. method, so that the efficiency of the algorithm has been improved to a certain extent; the BM algorithm uses back-to-front matching and obtains a larger jump distance according to bad characters and good suffixes, and improves the retrieval efficiency to sub-linear time complexity, that is, the actual The number of retrievals is less than the length of the target string; in 1980, Boyer Moore Horspool improved on the basis of the BM algorithm and proposed the BMH algorithm. BMH no longer focuses on mismatched characters like the BM algorithm. It focuses on bad characters. According to this The last position of the character in the pattern string to calculate the offset length, otherwise the length of the offset pattern string; in addition, there have been many variants of the Karp-Rabin and BM algorithms; in 1990, Daniel M.Sunday in the bad character On the basis of the idea, the Sunday algorithm is proposed, which raises the efficiency of string retrieval to a new level.
但是现有技术依然存在一个问题,即匹配过程中存在无效匹配,一旦发生失配,则无法快速获取足够大的跳跃距离,从而增加了检索过程中的频繁回溯。However, there is still a problem in the existing technology, that is, there are invalid matches in the matching process. Once a mismatch occurs, it is impossible to quickly obtain a sufficiently large jump distance, thereby increasing frequent backtracking in the retrieval process.
发明内容Contents of the invention
基于以上问题,本发明提供了一种字符串检索方法,解决了字符串检索过程中,一旦发生失配,无法快速获取跳跃距离,从而增加了检索过程中的频繁回溯的问题。Based on the above problems, the present invention provides a string retrieval method, which solves the problem that in the string retrieval process, once a mismatch occurs, the jump distance cannot be quickly obtained, thereby increasing the frequent backtracking in the retrieval process.
为解决以上技术问题,本发明采用的技术方案如下:In order to solve the above technical problems, the technical scheme adopted in the present invention is as follows:
一种字符串检索方法,包括如下步骤:A character string retrieval method, comprising the steps of:
步骤1、获取模式串中的特征对;
步骤2、获取记录模式串与目标串进行匹配的优化顺序PNext数组;
步骤3、获取记录模式串在目标串中跳跃匹配距离的TNext数组和FNext数组;
步骤4、模式串在目标串中进行从左到右匹配时,如果对应字符适配,根据步骤2事先获得的匹配顺序依次进行匹配,如果对应字符失配,根据步骤3事先获取的记录数组进行跳跃匹配。
进一步,所述步骤1中,特征对为模式串中出现次数最多的字符对,特征对的个数不少于2,特征对左边的字符称为特征前序,右边的字符称为特征后序,特征前序与特征后序是两个互不相同的字符。将模式串中所有特征对从右至左进行编号,编号从1开始,即最右边的特征对为第一特征对,如果次数最多的字符对有不止1个,选取它们中最右边的字符对作为特征对。获取方法具体如下:Further, in the
首先初始化一个足够大的记录数组,并将所有值初始化为0,然后使用指针i与指针i+1 对模式串从左至右进行遍历,其中i∈[0,PLen-2],PLen为模式串的长度,当指针i与指针i+1 遍历到任意一个满足PStr[i]≠PStr[i+1]条件的字符对时,将这两个字符在记录数组中唯一对应的数组元素值加1,其中PStr为模式串数组,如果该值大于或者等于目前出现最多次数字符对的个数num,则将num更新为当前数组元素的值,其中num的初始值为0,并使用特征对记录指针分别记录该字符对的特征前序‘Cs’与特征后序‘Ce’,同时更新指针FirstFeaStr,其中FirstFeaStr指针用于记录第一个特征后序的位置;当模式串全部遍历完成后,如果num 大于1,此时记录指针记录的字符对即为该模式串的特征对,num记录的个数即为特征对的总数,指针FirstFeaStr记录的位置即为模式串与目标处进行匹配时的匹配起点,并在模式串与目标串匹配之前构建一般的PNext、TNext和FNext数组;如果遍历完成之后num=1,则表明该模式串中不存在特征对,此时匹配起点设为模式串最右边字符,并跳过TNext数组与 PNext数组的一般构建步骤,进而构建特殊TNext数组与PNext数组。First initialize a sufficiently large record array and initialize all values to 0, then use pointer i and pointer i+1 to traverse the pattern string from left to right, where i∈[0,PLen-2], PLen is the pattern The length of the string, when pointer i and pointer i+1 traverse to any character pair that satisfies the condition of PStr[i]≠PSr[i+1], add the unique corresponding array element value of these two characters in the record array to 1, where PStr is an array of pattern strings, if the value is greater than or equal to the number num of character pairs that appear most frequently at present, then update num to the value of the current array element, where the initial value of num is 0, and use feature pairs to record The pointers respectively record the characteristic preorder 'C s ' and characteristic postorder 'C e ' of the character pair, and update the pointer FirstFeaStr at the same time, where the FirstFeaStr pointer is used to record the position of the first characteristic postorder; when all pattern strings are traversed , if num is greater than 1, the character pair recorded by the record pointer is the feature pair of the pattern string, the number of num records is the total number of feature pairs, and the position recorded by the pointer FirstFeaStr is when the pattern string matches the target The matching starting point of the pattern string, and construct the general PNext, TNext and FNext arrays before the pattern string matches the target string; if num=1 after the traversal is completed, it indicates that there is no feature pair in the pattern string, and the matching starting point is set to the pattern string The rightmost character, and skip the general construction steps of TNext array and PNext array, and then construct special TNext array and PNext array.
进一步,所述步骤2中,当成功获取特征对之后,PNext数组记录的匹配顺序包括模式串第一个特征对与目标串的匹配顺序、模式串中除第一个特征对外的特征对和非特征区与目标串的匹配顺序。构建与模式串等长的PNext数组,将第一个特征后序的位置作为匹配起点,将PNext数组中与匹配起点相同位置的元素赋值为第一个特征对的前序索引,将与第一个特征前序相同位置的元素赋值为下一个特征对的特征后序索引,以此类推;将与最后一个特征对前序位置相同的元素赋值为模式串最右边第一个非特征区字符索引,然后依次从右到左对非特征区字符对应位置进行赋值,每个字符位置对应元素赋值为其相邻的下一个非特征区字符索引。并将PNext数组中最后一个赋值的元素标记为结束标志。Further, in the
进一步,所述模式串第一个特征对与目标串的匹配顺序具体如下:Further, the matching order of the first feature pair of the pattern string and the target string is specifically as follows:
获取模式串第一特征对与目标串的匹配顺序即为模式串第一特征对在PNext数组中对应元素的计算、赋值过程。当成功获取特征对之后,指针FirstFeaStr记录的位置模式串与目标串匹配的起点,构建PNext数组时,使用指针i对模式串数组进行从右至左的遍历操作,同时使用指针P对PNext数组经赋值操作,遍历起点为FirstFeaStr指针所指向的数组索引。首先,将PNext数组中FirstFeaStr指针对应的数组元素赋值为FirstFeaStr-1,同时将指针i 和P同时赋值为第一个特征前序的位置,并且对数组元素Index[t]赋值为i,其中Index数组用于记录特征前序的位置,其中t为特征对序号,其中t的初值为1,每处理一个特征对将t 加1,并且将数组元素Sequence[i]与Sequence[i+1]分别赋值为1、2,其中Sequence数组用于标记特征对字符的位置;然后,使用指针i与i+1对PNext数组继续遍历,当指针i找到特征前序字符‘Cs’,并且指针i+1找到特征后序字符‘Ce’,即遍历到第二特征对时,将指针P指针指向的PNext数组元素赋值为i+1,最后,将P指针赋值为i+1。Obtaining the matching order of the first feature pair of the pattern string and the target string is the calculation and assignment process of the corresponding element of the first feature pair of the pattern string in the PNext array. After successfully obtaining the feature pair, the starting point where the pattern string recorded by the pointer FirstFeaStr matches the target string, when constructing the PNext array, use the pointer i to traverse the pattern string array from right to left, and use the pointer P to traverse the PNext array Assignment operation, the traversal starting point is the array index pointed to by the FirstFeaStr pointer. First, assign the array element corresponding to the FirstFeaStr pointer in the PNext array to FirstFeaStr-1, and assign the pointer i and P to the position of the first feature preorder at the same time, and assign the value to i to the array element Index[t], where Index The array is used to record the position of the preorder of the feature, where t is the sequence number of the feature pair, where the initial value of t is 1, and t is added by 1 for each feature pair processed, and the array element Sequence[i] and Sequence[i+1] Assign values of 1 and 2 respectively, where the Sequence array is used to mark the position of the character pair; then, use the pointer i and i+1 to continue traversing the PNext array, when the pointer i finds the character 'C s ', and the pointer i +1 finds the characteristic postorder character 'C e ', that is, when traversing to the second characteristic pair, assign the PNext array element pointed to by the pointer P to i+1, and finally assign the P pointer to i+1.
进一步,所述模式串中除第一个特征对外的特征对或者非特征区与目标串的匹配顺序即为对PNext数组中对应数组元素的计算、赋值过程。Further, the matching order of the feature pairs or non-feature regions and the target string except the first feature in the pattern string is the calculation and assignment process of the corresponding array elements in the PNext array.
当完成PNext数组中第一特征对对应数组元素的计算、赋值之后,指针i指向第二特征对的特征前序,指针P指向PNext数组中第二特征后序对应的数组位置,此时,完成除第一特征对之外的特征对对应的PNext数组元素的计算、赋值,其具体实现方法如下:After the calculation and assignment of the corresponding array elements of the first feature pair in the PNext array are completed, the pointer i points to the preorder of the feature of the second feature pair, and the pointer P points to the array position corresponding to the postorder of the second feature in the PNext array. At this time, completion The calculation and assignment of the PNext array elements corresponding to the feature pairs other than the first feature pair, the specific implementation method is as follows:
首先,将指针i向左移动一位,然后将PNext数组中P指针指向的数组元素赋值为i+1,再将P指针指向i+1对应的数组位置,然后,使用i指针与i+1指针继续进行向左遍历,当指针i找到特征前序字符‘Cs’,并且指针i+1找到特征后序字符‘Ce’,即获得下一个特征对时,将指针P指向的PNext数组元素赋值为i+1,最后将P指针赋值为i+1。与此同时,完成同获取模式串第一特征对与目标串的匹配顺序中对Sequence数组和Index数组相同的计算、赋值,并添加对记录数组Distance的计算、赋值,即将Distance[t-1]赋值为Index[t-1] -Index[t],其中Distance数组用于记录相邻两个特征前序间距。如此往复。First, move the pointer i one bit to the left, then assign the array element pointed to by the P pointer in the PNext array to i+1, then point the P pointer to the array position corresponding to i+1, and then use the i pointer and i+1 The pointer continues to traverse to the left. When the pointer i finds the characteristic preorder character 'C s ', and the pointer i+1 finds the characteristic postorder character 'C e ', that is, when the next characteristic pair is obtained, the pointer P points to the PNext array The element is assigned the value i+1, and finally the P pointer is assigned the value i+1. At the same time, complete the same calculation and assignment of the Sequence array and the Index array in the matching order of the first feature pair of the acquired pattern string and the target string, and add the calculation and assignment of the record array Distance, that is, Distance[t-1] The assignment is Index[t-1] -Index[t], where the Distance array is used to record the distance between two adjacent features. So back and forth.
当指针i对模式串进行向左遍历完成之后,P指针指向最左边的特征前序对应的PNext 数组位置,此时,再使用遍历指针i对模式串从右至左依次对非特征区元素进行遍历,进而完成对非特征区对应的PNext数组元素的计算、赋值,其具体实现方法如下:After pointer i completes the leftward traversal of the pattern string, the P pointer points to the PNext array position corresponding to the leftmost feature preorder. Traversing, and then completing the calculation and assignment of the PNext array elements corresponding to the non-characteristic area, the specific implementation method is as follows:
当i指针指向一个非特征区字符时,首先将P指针指向的PNext数组元素赋值为i,再将P指针赋值为i,然后,继续使用i指针进行向左遍历,寻找下一个非特征字符。如此往复。当完成所有非特征区对应的PNext数组元素的计算、赋值,i指针进行向左遍历至最左边的非特征区字符时,将i指针指向的PNext数组赋值为-1,其中-1为匹配结束标记,即表示该字符匹配成功时,所有模式串字符在目标串中匹配成功。When the i pointer points to a non-characteristic area character, first assign the PNext array element pointed to by the P pointer to i, then assign the P pointer to i, and then continue to use the i pointer to traverse to the left to find the next non-characteristic character. So back and forth. When the calculation and assignment of the PNext array elements corresponding to all non-characteristic areas are completed, and the i pointer traverses to the left to the leftmost non-characteristic area character, assign the PNext array pointed to by the i pointer to -1, where -1 is the end of the match mark, which means that when the character matches successfully, all pattern string characters are successfully matched in the target string.
进一步,通过分析模式串与目标串匹配过程可知,在模式串与目标处进行检索操作之前,计算模式串移动至目标串中任意位置时,模式串第一特征对与目标串对应字符失配时所能提供的最佳跳跃距离,FNext数组记录的最佳跳跃距离与模式串最右边两个或者三个字符所在位置对应目标串中的字符组成的字符串一一映射。最佳跳跃距离为模式串中从最右边第二个字符开始向左,第一个与模式串最右边两个或者三个字符所在位置对应目标串中的字符串相同的字符串,该字符串最右边字符与模式串最右边字符在模式串中所在位置之间的距离。因此,所述步骤3中,获取记录模式串在目标串中跳跃匹配距离的FNext数组的方法具体如下:Further, by analyzing the matching process between the pattern string and the target string, it can be known that when the calculation pattern string moves to any position in the target string before the search operation is performed at the pattern string and the target string, when the first feature pair of the pattern string does not match the corresponding character of the target string The best jumping distance that can be provided, the best jumping distance recorded in the FNext array and the character string composed of the characters in the target string corresponding to the positions of the rightmost two or three characters of the pattern string. The optimal jumping distance is from the second rightmost character in the pattern string to the left, and the first character string is the same as the string in the target string corresponding to the position of the rightmost two or three characters of the pattern string. The distance between the rightmost character and the position of the rightmost character of the pattern string in the pattern string. Therefore, in said
由于目标串中可能存在所有由目标串字符集中两个或者三个任意字符组成的字符串,因而,在模式串与目标处进行检索操作之前,计算出所有由目标串字符集中两个或者三个任意字符组成的字符串所能提供的最佳跳跃距离,并将这些字符串所能提供的最佳跳跃距离提前存储在由一个字符串与一个数组元素一一映射的FNext数组中。Since there may be all character strings composed of two or three arbitrary characters in the character set of the target string in the target string, before the retrieval operation is performed at the pattern string and the target, all character strings consisting of two or three characters in the character set of the target string are calculated. The best jumping distance that can be provided by a string composed of any characters, and store the best jumping distance that these strings can provide in advance in the FNext array that is mapped one by one by a string and an array element.
首先,获取目标串字符集,计算目标串字符集与自身的一重或二重笛卡尔积,逐一获取由该字符集组成的所有字符串集合,并一一映射到FNext数组中的一个数组元素,即每个字符串唯一对应FNext数组中的一个元素位置。同时对该位置的数组元素进行计算并赋初值,由于最佳跳跃距离为模式串中从最右边第二个字符开始向左,第一个与模式串最右边两个或者三个字符所在位置对应目标串中的字符串相同的字符串,该字符串最右边字符与模式串最右边字符在模式串中所在位置之间的距离。初始值计算时,假设模式串中不存在所有由目标串字符集中两个或者三个任意字符组成的字符串,最佳跳跃距离计算具体方法如下:First, obtain the character set of the target string, calculate the single or double Cartesian product of the character set of the target string and itself, obtain all the string sets composed of the character set one by one, and map them to an array element in the FNext array one by one, That is, each string uniquely corresponds to an element position in the FNext array. At the same time, calculate and assign the initial value to the array element at this position. Since the optimal jumping distance is from the second rightmost character in the pattern string to the left, the first and the rightmost two or three characters of the pattern string are located Corresponding to the same character string in the target string, the distance between the rightmost character of the string and the position of the rightmost character of the pattern string in the pattern string. When calculating the initial value, it is assumed that there are no strings consisting of two or three arbitrary characters in the character set of the target string in the pattern string. The specific method for calculating the optimal jump distance is as follows:
当以两个字符构建FNext数组时,若模式串首字符与获取的两个字符中第二个字符相同,将该字符串映射的数组元素赋值为PLen-1,否则赋值为PLen,其中PLen为模式串长度。When constructing the FNext array with two characters, if the first character of the pattern string is the same as the second character of the two characters obtained, the array element mapped to the string is assigned as PLen-1, otherwise it is assigned as PLen, where PLen is Pattern string length.
当以三个字符构建FNext数组时,若模式串中首字符、第二字符分别与获取的三个字符中第二个、第三个字符均相同,将该字符串唯一映射的数组元素赋值为PLen-2,若模式串首字符与获取的三个字符中第三个字符相同,将该字符串唯一映射的数组元素赋值为PLen-1,否则赋值为PLen。When constructing an FNext array with three characters, if the first character and the second character in the pattern string are the same as the second and third characters in the three characters obtained respectively, assign the array element uniquely mapped to the string to PLen-2, if the first character of the pattern string is the same as the third character among the three acquired characters, assign the array element uniquely mapped to the string to PLen-1, otherwise assign it to PLen.
然后,对模式串由左至右按长度为2或3的字符进行遍历,对遍历到的每一个字符串计算最佳跳跃距离,其值为该字符串最右边字符与模式串最右边字符在模式串中所在位置之间的距离,将该字符串与FNext数组映射的数组元素重新赋值为该最佳跳跃距离。Then, the pattern string is traversed from left to right according to the characters with a length of 2 or 3, and the optimal jump distance is calculated for each character string traversed, and its value is between the rightmost character of the string and the rightmost character of the pattern string. The distance between the positions in the pattern string, reassign the array elements mapped from the string to the FNext array as the optimal jump distance.
当由左至右进行遍历并重新赋值完成之后,FNext数组中的各元素即为模式串中第一个特征对与目标串匹配中失配时,模式串最右边两个或三个字符在目标串中对应位置上的任意字符串所能提供的最佳跳跃距离。After traversing from left to right and reassignment is completed, each element in the FNext array is the first feature pair in the pattern string that does not match the target string, and the rightmost two or three characters of the pattern string are in the target string The optimal jumping distance provided by any string at the corresponding position in the string.
构建FNext数组之前,判定使用两个或者三个字符进行构建。当字符集小于20时,使用三个字符构建FNext数组,当字符集大于20时,使用两个字符构建FNext数组;当目标串较长时,使用三个字符构建FNext数组,反之,使用两个字符构建FNext数组。Before constructing the FNext array, it is determined to use two or three characters for construction. When the character set is less than 20, use three characters to build the FNext array, when the character set is greater than 20, use two characters to build the FNext array; when the target string is longer, use three characters to build the FNext array, otherwise, use two characters Characters build an FNext array.
进一步,通过分析模式串与目标串匹配过程可知,模式串在目标串中移动至任意位置时,根据PNext数组所指导的匹配顺序,按照特征对优先匹配、非特征区字符在特征对全部适配之后依次从右至左进行匹配的方案进行匹配,当模式串除第一特征对字符与目标串发生失配,其他字符与目标串发生失配时,必然存在至少一个特征对已经与目标串字符适配。根据模式串特征对的性质可知,当模式串除第一特征对之外的字符与目标串字符失配,将模式串在目标串中进行右移操作时,如果已经匹配成功的特征对字符与模式串的非特征区字符对齐,必然导致下一轮模式串与目标串的匹配过程中发生失配。Further, by analyzing the matching process between the pattern string and the target string, it can be seen that when the pattern string moves to any position in the target string, according to the matching sequence guided by the PNext array, the feature pairs are matched first, and the characters in the non-feature area are matched after all the feature pairs are matched. The matching scheme is carried out from right to left in turn. When the pattern string does not match the first feature pair with the target string, and other characters do not match the target string, there must be at least one feature pair that has already matched the target string. match. According to the nature of the pattern string feature pair, when the characters of the pattern string except the first feature pair do not match the target string characters, and the pattern string is shifted right in the target string, if the successfully matched feature pair characters and The alignment of characters in the non-characteristic area of the pattern string will inevitably lead to a mismatch in the next round of matching between the pattern string and the target string.
针对上诉分析,本发明提出一种模式串除第一特征对字符与目标串字符发生失配时根据已经适配的特征对获取最佳跳跃距离的方案:当模式串除第一特征对字符与目标串字符发生失配,模式串在目标串中进行右移时,需满足已经匹配成功的特征对不存在与模式串非特征区对齐的字符,或者,模式串最左边字符与特征后序字符相等时,可以将模式串最左边字符移动至与已经匹配成功的最右边特征后序对齐,将所有满足上述条件的移动距离中最小的移动距离称为最佳跳跃距离。由于所有模式串除第一特征对字符之外的特征对字符或者非特征区字符与目标串字符发生失配时,所能提供的最佳跳跃距离都能在模式串与目标串进行匹配之前获得,因此,将所有模式串除第一特征对外的字符失配时所能提供的最佳跳跃距离使用TNext数组进行存储,当模式串与目标串进行匹配的过程中任意字符发生失配,可直接使用TNext数组获取最佳跳跃距离。Aiming at the appeal analysis, the present invention proposes a scheme to obtain the best jumping distance according to the adapted feature pair when the character of the pattern string except the first feature pair and the target string character do not match: when the pattern string removes the first feature pair character and There is a character mismatch in the target string. When the pattern string is shifted to the right in the target string, it must be satisfied that there is no character aligned with the non-characteristic area of the pattern string for the feature pair that has been successfully matched, or the leftmost character of the pattern string and the subsequent character of the feature When they are equal, the leftmost character of the pattern string can be moved to be aligned with the rightmost feature that has been successfully matched, and the smallest moving distance among all moving distances that meet the above conditions is called the optimal jumping distance. Since all pattern strings except the first feature pair characters or non-feature region characters are mismatched with the target string characters, the best jump distance that can be provided can be obtained before the pattern string matches the target string , therefore, the TNext array is used to store the best jumping distance that can be provided when all the characters in the pattern string except the first feature do not match. When any character mismatch occurs during the matching process between the pattern string and the target string, it can be directly Use the TNext array to get the best jumping distance.
所述步骤3中,当存在特征对时,获取记录模式串在目标串中跳跃匹配距离的TNext 数组的方法具体如下:In said
首先,获取特征对失配时所能提供的最佳跳跃距离:依次对TNext数组中与模式串第二个特征对对应位置相同的数组元素开始进行赋值,将模式串中正在赋值的特征对右边的所有字符称为已知字符串,该字符串中的特征对称为已知特征对,第一个已知特征对为基准特征对。使用Sequence、Index和Distance数组构建TNext数组。当已知特征对与模式串特征对完全对齐时,用第一个特征前序索引减去与基准特征对对齐的特征前序索引,得到正在赋值的特征对失配时的最佳跳跃距离;若模式串中各个特征对与基准特征对逐一对齐,但不满足已知特征对与模式串特征对完全对齐的条件,获得该模式串所能提供的最长跳跃距离,最长跳跃距离的获取方法为:判断特征后序与模式串最左边的字符是否相等,若相等,最长跳跃距离即为第一个特征后序的索引值,否则最长跳跃距离即为第一个特征后序的索引值加1,并将TNext数组所有后续待赋值数组元素全部赋值为该最长跳跃距离。First, obtain the best jump distance that can be provided when the feature pair is mismatched: sequentially assign values to the array elements in the TNext array that are at the same position as the second feature pair in the pattern string, and assign the feature that is being assigned in the pattern string to the right All the characters in are called known strings, the feature pairs in this string are called known feature pairs, and the first known feature pair is the reference feature pair. Build a TNext array using the Sequence, Index, and Distance arrays. When the known feature pair is completely aligned with the pattern string feature pair, the first feature preorder index is subtracted from the feature preorder index aligned with the reference feature pair to obtain the optimal jump distance when the feature pair being assigned is mismatched; If each feature pair in the pattern string is aligned with the reference feature pair one by one, but the condition that the known feature pair is completely aligned with the pattern string feature pair is not satisfied, obtain the longest jump distance that the pattern string can provide, and obtain the longest jump distance The method is: judge whether the character postorder and the leftmost character of the pattern string are equal, if they are equal, the longest jump distance is the index value of the first feature postorder, otherwise the longest jump distance is the index value of the first
其次,获取非特征区字符失配时所能提供的最佳跳跃距离:将模式串作为已知字符串,使用特征对失配时获取最佳跳跃距离的方法,获取非特征区字符失配时所能提供的最佳跳跃距离,并将所有TNext数组中与模式串中非特征区字符对应位置的数组元素都赋值为该距离。Secondly, the best jumping distance that can be provided when obtaining the non-characteristic area character mismatch: using the pattern string as a known character string, using the method of obtaining the best jumping distance when the feature pair is mismatched, when obtaining the non-characteristic area character mismatch The optimal jumping distance that can be provided, and all the array elements in the TNext array that correspond to the characters in the non-characteristic area in the pattern string are assigned the distance.
进一步,通过分析模式串与目标串匹配过程可知,模式串中从右向左第三个字符开始,当任意字符与目标串失配时,模式串最右边两个字符与目标串已经匹配成功,并且模式串中不存在另外一个与该字符对相同的字符对,因此,构建过程只需判断模式串最左边字符与模式串最右边字符是否相等,即可计算出从模式串倒数第三个字符开始,任意字符与目标串失配时的最佳跳跃距离。因此,所述步骤3中,当不存在特征对时,获取记录模式串在目标串中跳跃匹配距离的特殊PNext、TNext数组的方法具体如下:Further, by analyzing the matching process between the pattern string and the target string, it can be seen that starting from the third character from right to left in the pattern string, when any character does not match the target string, the rightmost two characters of the pattern string and the target string have been successfully matched. And there is no other character pair identical to this character pair in the pattern string. Therefore, the construction process only needs to judge whether the leftmost character of the pattern string is equal to the rightmost character of the pattern string, and then calculate the third character from the bottom of the pattern string Initially, the optimal jumping distance when any character does not match the target string. Therefore, in the
当模式串不存在特征对时,匹配起点为模式串最右边的字符位置,从匹配起点开始由右至左,依次构建特殊PNext数组,即模式串在目标串中移动至任意位置时,模式串字符与目标串字符从模式串最右边字符开始,依次向左进行匹配。当模式串最右边两个字符失配时,根据特殊FNext数组获取最佳跳跃距离,特殊FNext数组的构建与FNext数组一般构建方法相同。否则根据特殊TNext数组获取最佳跳跃距离,构建与模式串等长的TNext数组,其每一个数组元素记录对应模式串字符失配时所能提供的最佳跳跃距离。判断模式串最左边的字符与模式串最右边的字符是否相等,若相等,则将TNext数组所有值赋值为PLen-1,否则,赋值为PLen,即目标串在模式串中跳跃距离为模式串长度。When there is no feature pair in the pattern string, the matching starting point is the rightmost character position of the pattern string, and from right to left from the matching starting point, a special PNext array is constructed sequentially, that is, when the pattern string moves to any position in the target string, the pattern string characters The characters in the target string are matched from the rightmost character of the pattern string to the left. When the two rightmost characters of the pattern string do not match, the optimal jump distance is obtained according to the special FNext array. The construction of the special FNext array is the same as the general construction method of the FNext array. Otherwise, obtain the best jumping distance according to the special TNext array, construct a TNext array with the same length as the pattern string, and each array element records the best jumping distance that can be provided when the characters in the corresponding pattern string do not match. Determine whether the leftmost character of the pattern string is equal to the rightmost character of the pattern string. If they are equal, assign all the values of the TNext array to PLen-1, otherwise, assign the value to PLen, that is, the jump distance of the target string in the pattern string is the pattern string length.
进一步,所述步骤4中,获取PNext、TNext和FNext数组之后,将模式串在目标串中从左至右进行检索,检索过程中,使用模式串指针Pi与目标串指针Ti分别指向模式串的字符与目标串的字符进行匹配。当模式串移动至目标串某一位置时,从模式串匹配起点FirstFeaStr开始,从右至左与目标串对应字符进行跳跃匹配。Further, in the
当模式串指针与目标串指针指向的字符适配时,使用模式串指针获取PNext数组中对应的数组元素值PNext[Pi],并分别计算模式串与目标串指针指向的下一个匹配字符位置:先将 Ti指向Ti-Pi+PNext[Pi],再将Pi指向PNext[Pi]。When the pattern string pointer matches the character pointed to by the target string pointer, use the pattern string pointer to obtain the corresponding array element value PNext[Pi] in the PNext array, and calculate the next matching character position pointed to by the pattern string and the target string pointer respectively: First point Ti to Ti-Pi+PNext[Pi], then point Pi to PNext[Pi].
当模式串指针与目标串指针指向的字符失配时,如果模式串指针指向的字符是第一特征对,获取目标串中与模式串最后两个或者三个字符对齐的字符串,然后获取FNext数组中与该字符串唯一映射的数组元素值V,该值即为当前模式串第一特征对与目标串中对应字符失配的最佳跳跃距离,通过该最佳跳跃距离计算模式串与目标串指针指向的下一个匹配字符位置:先将Ti指向Ti-Pi+FirstFeaStr+V,再将Pi指向FirstFeaStr;如果模式串指针指向的字符不是第一特征对,使用模式串指针获取TNext数组中对应的数组元素值TNext[Pi],该值即为当前模式串字符与目标串中对应字符失配的最佳跳跃距离,通过该最佳跳跃距离计算模式串与目标串指针指向的下一个匹配字符位置:先将Ti指向Ti-Pi+FirstFeaStr+TNext[Pi],再将Pi指向FirstFeaStr。When the pattern string pointer does not match the character pointed to by the target string pointer, if the character pointed to by the pattern string pointer is the first feature pair, obtain the character string aligned with the last two or three characters of the pattern string in the target string, and then obtain FNext The array element value V uniquely mapped to the string in the array is the optimal jumping distance between the first feature pair of the current pattern string and the corresponding character in the target string, and the optimal jumping distance is used to calculate the pattern string and the target The position of the next matching character pointed to by the string pointer: first point Ti to Ti-Pi+FirstFeaStr+V, and then point Pi to FirstFeaStr; if the character pointed to by the pattern string pointer is not the first feature pair, use the pattern string pointer to obtain the corresponding character in the TNext array The array element value TNext[Pi] of TNext[Pi], this value is the best jumping distance between the characters in the current pattern string and the corresponding characters in the target string, and the next matching character pointed to by the pointer of the pattern string and the target string is calculated through the best jumping distance Position: first point Ti to Ti-Pi+FirstFeaStr+TNext[Pi], then point Pi to FirstFeaStr.
与现有技术相比,本发明的有益效果是:当任意字符出现失配时,根据提前计算的下一轮匹配的最佳跳跃距离进行跳跃,规避了现有技术中发生失配之后需要再次进行计算,才能获取一定的跳跃距离的弊端;本发明根据模式串本身特性计算跳跃匹配方式,在此基础上提出两种能够获得较大跳跃距离的加速匹配方案,达到加速匹配的目的,解决了字符串检索过程中,一旦发生失配,无法快速获取跳跃距离,从而增加了检索过程中的频繁回溯且计算耗时的问题。Compared with the prior art, the beneficial effect of the present invention is: when there is a mismatch in any character, jumping is carried out according to the best jumping distance of the next round of matching calculated in advance, avoiding the need to repeat the matching after the mismatch occurs in the prior art. Only by performing calculations can a certain jumping distance be obtained; the present invention calculates the jumping matching method according to the characteristics of the pattern string itself, and on this basis proposes two accelerated matching schemes that can obtain a larger jumping distance, so as to achieve the purpose of accelerated matching and solve the problem of During the string retrieval process, once a mismatch occurs, the jump distance cannot be quickly obtained, which increases the problem of frequent backtracking and time-consuming calculations during the retrieval process.
附图说明Description of drawings
图1为本实施例实施步骤流程图;Fig. 1 is the flowchart of implementation steps of the present embodiment;
图2为本实施例模式串划分示意图;Fig. 2 is a schematic diagram of pattern string division in this embodiment;
图3为本实施例模式串匹配原理示意图;FIG. 3 is a schematic diagram of the principle of pattern string matching in this embodiment;
图4为本实施例模式串与目标串匹配示意图;Fig. 4 is a schematic diagram of the matching between the pattern string and the target string in the present embodiment;
图5为本实施例记录数组的示意图;Fig. 5 is the schematic diagram of the record array of the present embodiment;
图6为本实施例指针滑动构建PNext数据模块的示意图;Fig. 6 is the schematic diagram that the pointer slides of the present embodiment constructs the PNext data module;
图7为本实施例特征对遍历完成示意图;Fig. 7 is a schematic diagram of completion of feature pair traversal in this embodiment;
图8为本实施例完整的PNext数组的示意图;Fig. 8 is the schematic diagram of the complete PNext array of the present embodiment;
图9为本实施例已知字符串示意图;FIG. 9 is a schematic diagram of known character strings in this embodiment;
图10为本实施例最佳跳跃距离计算失败示意图;FIG. 10 is a schematic diagram of the failure of calculating the optimal jump distance in this embodiment;
图11为本实施例最佳跳跃距离计算成功示意图;Figure 11 is a schematic diagram of the successful calculation of the best jumping distance in this embodiment;
图12为本实施例特殊情况下模式串数组的构建;Fig. 12 is the construction of pattern string array under special circumstances of the present embodiment;
图13为本实施例选取长度为100的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 13 is a schematic diagram of the result of searching in target strings of different magnitudes by selecting a pattern string with a length of 100 in this embodiment;
图14为本实施例选取长度为500的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 14 is a schematic diagram of the result of searching in target strings of different magnitudes by selecting a pattern string with a length of 500 in this embodiment;
图15为本实施例选取长度为1000的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 15 is a schematic diagram of the result of searching in target strings of different magnitudes by selecting a pattern string with a length of 1000 in this embodiment;
图16为本实施例选取长度为3000的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 16 is a schematic diagram of the result of searching in target strings of different orders of magnitude by selecting a pattern string with a length of 3000 in this embodiment;
图17为本实施例选取长度为5000的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 17 is a schematic diagram of the result of searching in target strings of different orders of magnitude by selecting a pattern string with a length of 5000 in this embodiment;
图18为本实施例选取长度为10000的模式串在不同数量级的目标串中进行检索的结果示意图;Fig. 18 is a schematic diagram of the result of searching in target strings of different magnitudes by selecting a pattern string with a length of 10000 in this embodiment;
图19为验证字符集大小对本实施例算法检索效率的结果示意图;Fig. 19 is a schematic diagram of the result of verifying the retrieval efficiency of the algorithm of the present embodiment by the size of the character set;
图20为字符集个数大于15本实施例算法检索效率的结果示意图。Fig. 20 is a schematic diagram of the retrieval efficiency results of the algorithm in this embodiment where the number of character sets is greater than 15.
具体实施方式Detailed ways
下面结合附图对本发明作进一步的说明,本发明的实施方式包括但不限于下列实施例。The present invention will be further described below in conjunction with the accompanying drawings, and the embodiments of the present invention include but are not limited to the following examples.
图1为本实施例实施步骤流程图,包括:Fig. 1 is a flow chart of the implementation steps of this embodiment, including:
步骤1、获取模式串中的特征对;
步骤2、获取记录模式串与目标串进行匹配的优化顺序PNext数组;
步骤3、获取记录模式串在目标串中跳跃匹配距离的TNext数组和FNext数组;
步骤4、模式串在目标串中进行从左到右匹配时,如果对应字符适配,根据步骤2获得的匹配顺序依次进行匹配,如果对应字符失配,根据步骤3事先获取的记录数组进行跳跃匹配。
本实施例中,模式串(PStr,长度为PLen)在目标串(TStr,长度为TLen)中进行检索之前,需要在模式串中对字符进行特征区和非特征区进行区分,从而提取出特征对,如图2所示,模式串PStr中,特征区由特征对Fi=“CsCe”组成,特征对Fi由两个有序且紧邻的字符‘Cs’和‘Ce’组成,分别称为特征前序Fi[1]=‘Cs’和特征后序Fi[2]=‘Ce’,‘Cs’≠‘Ce’;非特征区为特征对之外的区域,一个字符位的长度为1,长度为dk,dk≥0。模式串中特征对的总数为num(num≥2),Fi(i∈[1,num])。对特征对从右到左依次编号,即最右边的特征对为第一特征对。同时,模式串最左边的字符PStr[0]记做FirstChar。In this embodiment, before the pattern string (PStr, length is PLen) is retrieved in the target string (TStr, length is TLen), it is necessary to distinguish the characteristic area and the non-characteristic area of the character in the pattern string, thereby extracting the characteristic Yes, as shown in Figure 2, in the pattern string PStr, the feature area is composed of the feature pair F i = "C s C e ", and the feature pair F i consists of two ordered and adjacent characters 'C s ' and 'C e 'composition, respectively called feature preorder F i[1] ='C s ' and feature postorder F i[2] ='C e ', 'C s '≠'C e '; non-characteristic area For the area outside the feature pair, the length of a character bit is 1, The length is dk , and dk≥0 . The total number of feature pairs in the pattern string is num (num≥2), F i (i∈[1, num]). The feature pairs are numbered sequentially from right to left, that is, the rightmost feature pair is the first feature pair. At the same time, the leftmost character PStr[0] of the pattern string is recorded as FirstChar.
由上述可知,模式串具有以下性质:From the above, we can see that the pattern string has the following properties:
性质1:特征对是模式串中出现次数最多的字符对;Property 1: The feature pair is the most frequently occurring character pair in the pattern string;
性质2:特征前序和特征后序严格保序;Property 2: The pre-order of features and the post-order of features are strictly guaranteed;
性质3:互斥性,即非特征区不存在特征对。Property 3: Mutual exclusion, that is, there are no feature pairs in the non-feature area.
本实施例中,特征对的作用在于检索过程中快速定位模式串在目标串的可能位置,采取优先匹配特征对,特征对全部适配之后,对非特征区字符从右至左依次匹配的方案进行匹配。分析模式串与目标串匹配过程,当模式串字符与目标串字符发生失配时,由模式串性质 3可知,模式串中特征对以外的区域不存在特征对,根据该性质可知,模式串在目标串中进行右移的过程中,可能存在由于已经匹配成功的特征对字符与移动后的模式串非特征区字符进行匹配的移动距离,如果将模式串移动这些距离,必然导致失配。In this embodiment, the function of the feature pair is to quickly locate the possible position of the pattern string in the target string during the retrieval process, and adopt the scheme of matching the feature pair first, and after all the feature pairs are matched, the characters in the non-characteristic area are matched sequentially from right to left to match. Analyzing the matching process between the pattern string and the target string, when there is a mismatch between the pattern string characters and the target string characters, it can be known from the
根据上述分析可知,当模式串字符与目标串字符发生失配时,可根据已经匹配成功的模式串字符分析出避免无效匹配的下一轮匹配位置,从而加速模式串在目标串中的检索速度。显然,模式串特征对与目标串字符适配数量越多,指导加速匹配的信息越多,因此,为了能够得到较长的跳跃距离,需要尽可能的使模式串字符与目标串字符适配。可选取长度为D的字符串作为备选特征字符,D∈[2,PLen/2]。显然,匹配过程中,模式串中特征字符串越短,特征字符串在目标串中出现的概率就越大,匹配成功的可能性就越大,本实施例取D=2作为特征字符串长度,即特征对选用2个字符进行说明。将模式串中所有特征对从右至左进行编号,编号从1开始,即最右边的特征对为第一特征对,如果次数最多的字符对有不止1个,选取它们中最右边的字符对作为特征对。According to the above analysis, when there is a mismatch between the pattern string characters and the target string characters, the matching position of the next round to avoid invalid matching can be analyzed according to the successfully matched pattern string characters, thereby accelerating the retrieval speed of the pattern string in the target string . Obviously, the more pattern string feature pairs are adapted to the target string characters, the more information to guide accelerated matching. Therefore, in order to obtain a longer jump distance, it is necessary to make the pattern string characters and target string characters fit as much as possible. A character string of length D can be selected as an alternative feature character, D∈[2, PLen/2]. Obviously, during the matching process, the shorter the characteristic string in the pattern string, the greater the probability of the characteristic character string appearing in the target string, and the greater the possibility of successful matching. In this embodiment, D=2 is used as the characteristic character string length , that is, the feature pair uses 2 characters to describe. Number all feature pairs in the pattern string from right to left, starting from 1, that is, the rightmost feature pair is the first feature pair, if there is more than one character pair with the most number of times, select the rightmost character pair among them as feature pairs.
本实施例中,如图3所示,匹配原理是当模式串移动至目标串某一位置,从第一个特征对的特征后序开始,使用模式串字符与目标串字符开始进行匹配,若匹配成功,继续对该特征对的特征前序进行匹配,当特征前序也匹配成功时,跳跃到下一个特征对的特征后序进行匹配,如此循环,如图3上面一行所示,当模式串中所有特征对匹配成功时,跳跃至模式串最右边的非特征字符,依次对非特征区字符从右到左进行匹配,如图3下面一行所示,如果最左边的非特征字符匹配成功,则表明该模式串在目标串中检索成功。In this embodiment, as shown in Figure 3, the matching principle is that when the pattern string moves to a certain position of the target string, starting from the feature postorder of the first feature pair, the pattern string characters are used to start matching with the target string characters, if If the matching is successful, continue to match the feature preorder of the feature pair. When the feature preorder is also matched successfully, jump to the feature postorder of the next feature pair to match, and so on. As shown in the upper row of Figure 3, when the pattern When all the feature pairs in the string are successfully matched, jump to the rightmost non-characteristic character of the pattern string, and match the characters in the non-characteristic area from right to left in turn, as shown in the bottom row of Figure 3, if the leftmost non-characteristic character matches successfully , it indicates that the pattern string is retrieved successfully in the target string.
本实施例中,最佳跳跃距离的原理是按照匹配原理进行匹配,如图4所示,当模式串 PStr中F1、F2、F3(Fi表示第i个特征对)与目标串TStr匹配成功,而下一个与目标串进行匹配的特征对中的任意字符失配时,此前已经匹配成功的F′1、F′2、F′3为已知特征对,其中F′i表示已知特征对,F′i[1]与F′i[2]分别表示已知特征前序与特征后序。进行下次匹配时,若移动不恰当的距离,会导致匹配失败,匹配失败是因为已知特征对与模式串的非特征区进行匹配,或已知特征对与模式串中的特征对发生错位,上述匹配称为为无效匹配。如图4所示,移动 Try1、Try2和Try3进行的匹配均为无效匹配,Try4为有效匹配,Try5为过度移动的匹配。其中,进行Try1尝试模式串右移1位,出现Fi[1]与F′i[2]进行匹配,即特征对错位,从而导致失配(失配字符为Try1中‘×’填充);进行Try2尝试模式串右移2位,存在F′i与区域进行匹配,由性质3可知,区域不包含特征对,因此,F′i与对齐必然导致失配(失配字符为Try2中‘×’填充);进行Try3尝试模式串右移4位,此时F2与F′1对齐、F3与F′2对齐,但是区域存在与F′3对齐的字符,因此移动至此同样为无效匹配;进行Try4尝试模式串右移11 位,此时F4与F′1对齐,并且模式串不存在与已知特征对对齐的非特征字符,因此当模式串右移11位时不会产生无效匹配。因此,尝试移动不同距离进行匹配时,为避免无效匹配,需要得到下次匹配的最佳跳跃距离。任意字符失配之后,根据性质3可得,将模式串右移后满足特定条件的距离称为可跳跃距离:模式串中非特征字符与目标串中已知特征对不存在对齐的字符时(如图4中Try4和Try5)。显然,将模式串从左往右移动过程中,第一个可跳跃距离即为最佳跳跃距离(如图4中Try4)。若跳跃距离小于最佳跳跃距离时会产生无效匹配,而大于最佳跳跃距离则可能遗漏本该检索成功的区域,如图4中Try5过程移动14位会出现漏检。当上述有一个特例,即当且仅当FirstChar与F′i[2]对齐且F′i[2]=FirstChar这种特殊情况时,不能作为无效匹配的判定条件。In this embodiment, the principle of the optimal jumping distance is to match according to the matching principle. As shown in Figure 4, when F 1 , F 2 , F 3 (F i represents the i-th feature pair) in the pattern string PStr and the target string When TStr matches successfully, but any character in the next feature pair that matches the target string does not match, F′ 1 , F′ 2 , and F′ 3 that have been successfully matched before are known feature pairs, where F′ i represents Known feature pairs, F′ i[1] and F′ i[2] represent the known feature preorder and feature postorder respectively. When performing the next match, if you move an inappropriate distance, the matching will fail. The matching failure is because the known feature pair is matched with the non-characteristic area of the pattern string, or the known feature pair is misplaced with the feature pair in the pattern string. , the above match is called an invalid match. As shown in Figure 4, the matches performed by moving Try1, Try2 and Try3 are all invalid matches, Try4 is a valid match, and Try5 is an excessively moved match. Among them, the Try1 trial pattern string is shifted to the right by 1 bit, and F i[1] and F' i[2] appear to be matched, that is, the feature pair is misplaced, resulting in a mismatch (the mismatch character is filled with '×' in Try1); Carry out Try2 and try to shift the pattern string to the right by 2 bits, and there are F′ i and The area is matched, it can be seen from the
基于以上对本实施例进行原理解释后,一种字符串检索方法,包括如下步骤:After explaining the principle of this embodiment based on the above, a character string retrieval method includes the following steps:
步骤1、在模式串的字符中获取特征对。
其中,特征对获取的方法具体为:Among them, the method of feature pair acquisition is as follows:
首先开辟一个大小为SIZE×SIZE的二维Chars记录数组,SIZE为ASCII字符集大小, 并将所有值初始化为0,该二维数组中的各个元素与任意两个字符组成的字符对唯一映射,其映射方法为,任意字符对中其中一个字符的ASCII码对应Chars数组的横坐标,另一个字符的ASCII码对应Chars数组的纵坐标。使用指针i与指针i+1对模式串从左至右进行遍历,其中i∈[0,PLen-2],当指针i与指针i+1遍历到任一满足PStr[i]≠PStr[i+1]的字符对时,将这两个字符在记录数组中所映射的数组元素值加1,如果该值大于或者等于目前出现次数最多的字符对个数num,则将num更新为当前数组元素的值,并使用特征对记录指针分别记录该字符对的特征前序‘Cs’与特征后序‘Ce’,同时对指针FirstFeaStr赋值为i+1,指针FirstFeaStr 用于记录第一个特征后序的位置,作为模式串与目标处进行匹配时的起点。当模式串全部遍历完成后,如果num大于1,此时记录指针记录的字符对即为该模式串的特征对,num记录的个数即为特征对的总数,指针FirstFeaStr记录的位置即为模式串与目标处进行匹配时的匹配起点,并在模式串与目标串匹配之前构建一般的PNext、TNext和FNext数组;如果遍历完成之后num=1,则表明该模式串中不存在特征对,此时匹配起点设为模式串最右边字符,并跳过TNext数组与PNext数组的一般构建步骤,进而构建特殊TNext数组与PNext数组。First, create a two-dimensional Chars record array with a size of SIZE×SIZE. SIZE is the size of the ASCII character set, and initialize all values to 0. Each element in the two-dimensional array is uniquely mapped to a character pair composed of any two characters. The mapping method is that the ASCII code of one character in any character pair corresponds to the abscissa of the Chars array, and the ASCII code of the other character corresponds to the ordinate of the Chars array. Use pointer i and pointer i+1 to traverse the pattern string from left to right, where i∈[0,PLen-2], when pointer i and pointer i+1 traverse to any satisfying PStr[i]≠PSr[i +1], add 1 to the array element value mapped by these two characters in the record array, if the value is greater than or equal to the number num of character pairs that appear most frequently at present, update num to the current array The value of the element, and use the feature pair record pointer to respectively record the characteristic preorder 'C s ' and characteristic postorder 'C e ' of the character pair, and assign the value i+1 to the pointer FirstFeaStr, and the pointer FirstFeaStr is used to record the first The post-order position of the feature is used as the starting point for matching the pattern string with the target. When all pattern strings are traversed, if num is greater than 1, the character pair recorded by the record pointer is the feature pair of the pattern string, the number of num records is the total number of feature pairs, and the position recorded by the pointer FirstFeaStr is the pattern The matching starting point when the string matches the target, and before the pattern string matches the target string, the general PNext, TNext and FNext arrays are constructed; if num=1 after the traversal is completed, it indicates that there is no feature pair in the pattern string, and this When the matching starting point is set to the rightmost character of the pattern string, the general construction steps of the TNext array and the PNext array are skipped, and then the special TNext array and the PNext array are constructed.
如图5所示,获得特征对的同时,容易得到以下三个可用于快速计算最佳跳跃距离的记录数组:标识模式串中所有特征前序与特征后序所在位置的Sequence数组,如Sequence[2]=1与Sequence[3]=2分别表示特征前序与特征后序,非特征区数组值为0、依次记录所有特征前序索引的Index数组,如Index[1]=9,以及记录相邻特征前序间距的Distance 数组,如Distance[1]储存F1[1]与F2[1]之间的距离Δ1=4。为了提升算法效率,在获取特征对的同时,便对Sequence数组初始化为0。As shown in Figure 5, while obtaining the feature pair, it is easy to obtain the following three record arrays that can be used to quickly calculate the optimal jump distance: the Sequence array that identifies the positions of all feature preorders and feature postorders in the pattern string, such as Sequence[ 2]=1 and Sequence[3]=2 respectively represent the pre-order of the feature and the post-order of the feature, the non-characteristic area array value is 0, and the Index array of all the pre-order indexes of the features is recorded in turn, such as Index[1]=9, and the record The Distance array of the preceding distance between adjacent features, for example, Distance[1] stores the distance Δ 1 =4 between F 1[1] and F 2[1] . In order to improve the efficiency of the algorithm, the Sequence array is initialized to 0 while acquiring the feature pairs.
步骤2、模式串与目标串进行特征对和非特征区匹配并利用PNext数组获取匹配顺序。
其中,构建记录模式串与目标串匹配顺序的PNext数组,具体方法为:Among them, construct a PNext array that records the matching order of the pattern string and the target string, the specific method is:
首先,构建与模式串等长的PNext数组,由于PNext数组记录先后匹配的两个字符的位置关系,因此对模式串采用双指针同步遍历的方法构建该数组。使用指针P指向PNext数组中与正在处理的字符对应的位置,使用指针i对模式串进行从右至左的遍历,用于寻找特征对。为提高算法效率,计算PNext数组的特征区时,同时获取上述步骤1中的三个记录数组,如图6所示,以PStr="TTACTATTAGGATTAGCA"为例,"TA"为该模式串中的特征对。First, build a PNext array with the same length as the pattern string. Since the PNext array records the positional relationship between the two characters matched successively, the pattern string is constructed by double-pointer synchronous traversal. Use the pointer P to point to the position corresponding to the character being processed in the PNext array, and use the pointer i to traverse the pattern string from right to left to find the feature pair. In order to improve the algorithm efficiency, when calculating the feature area of the PNext array, obtain the three record arrays in the
本实施例中,PNext数组记录的匹配顺序包括模式串第一个特征对与目标串的匹配顺序、模式串除第一个特征对外的特征对或者非特征区与目标串的匹配顺序。In this embodiment, the matching sequence of the PNext array records includes the matching sequence of the first feature pair of the pattern string and the target string, and the matching sequence of the feature pairs of the pattern string except the first feature or the matching sequence of the non-feature region and the target string.
其中,计算模式串第一个特征对与目标串的匹配顺序的方法具体如下:Among them, the method for calculating the matching order of the first feature pair of the pattern string and the target string is as follows:
当完成特征对获取时,FirstFeaStr记录的索引即为第一个特征对的特征后序的位置,如图6(P1)所示,将指针i和指针P指向第一个特征对的特征前序,即FirstFeaStr-1,将 PNext[FirstFeaStr]赋值为i。同时,对Index[t](t为特征对序号,初值为1)赋值为i,用于记录第一个特征对的特征前序的位置,赋值完成后将t加1;Sequence[i]与Sequence[i+1]分别赋值为1、2,用于标记特征对的位置。由于前后相邻的两个特征对必然满足前者的特征后序与后者的特征前序不会重合,为提升计算效率,在处理完一个特征对后,将指针i向左移动两位。然后,如图6(P2)所示,使用指针i与i+1对PNext数组继续向左遍历,当指针i找到特征前序字符'T',并且指针i+1找到特征后序字符'A',即遍历到第二特征对时,将指针 P指针指向的PNext数组元素赋值为i+1,最后,将P指针赋值为i+1。When the feature pair acquisition is completed, the index recorded by FirstFeaStr is the position of the feature sequence of the first feature pair, as shown in Figure 6 (P1), point the pointer i and the pointer P to the feature sequence of the first feature pair , namely FirstFeaStr-1, assign PNext[FirstFeaStr] to i. At the same time, assign i to Index[t] (t is the serial number of the feature pair, the initial value is 1), which is used to record the position of the feature preorder of the first feature pair, and add 1 to t after the assignment is completed; Sequence[i] and Sequence[i+1] are assigned values of 1 and 2 respectively, which are used to mark the position of the feature pair. Since the two adjacent feature pairs must satisfy that the post-order of the former feature will not coincide with the pre-order of the latter feature, in order to improve the calculation efficiency, after processing a feature pair, move the pointer i to the left by two bits. Then, as shown in Figure 6 (P2), use the pointers i and i+1 to continue to traverse the PNext array to the left, when the pointer i finds the characteristic preorder character 'T', and the pointer i+1 finds the characteristic postorder character 'A ', that is, when traversing to the second feature pair, assign the PNext array element pointed to by the pointer P to i+1, and finally, assign the P pointer to i+1.
另外,计算模式串中除第一个特征对外的特征对或者非特征区与目标串的匹配顺序的方法具体如下:In addition, the method of calculating the matching order of feature pairs or non-feature regions and the target string except the first feature in the pattern string is as follows:
首先,将指针i向左移动一位,然后将PNext数组中P指针指向的数组元素赋值为i+1,再将P指针指向i+1对应的数组位置,然后,使用i指针与i+1指针继续进行向左遍历,当指针i找到特征前序字符'T',并且指针i+1找到特征后序字符'A',即获得下一个特征对时,将指针P指向的PNext数组元素赋值为i+1,最后将P指针赋值为i+1。与此同时,完成同获取模式串第一特征对与目标串的匹配顺序中对Sequence数组和Index数组相同的计算、赋值,并添加对记录数组Distance的计算、赋值,即将Distance[t-1]赋值为Index[t-1]-Index[t],其中Distance数组用于记录相邻两个特征前序间距。如此往复。First, move the pointer i one bit to the left, then assign the array element pointed to by the P pointer in the PNext array to i+1, then point the P pointer to the array position corresponding to i+1, and then use the i pointer and i+1 The pointer continues to traverse to the left. When the pointer i finds the characteristic preorder character 'T', and the pointer i+1 finds the characteristic postorder character 'A', that is, when the next characteristic pair is obtained, the PNext array element pointed to by the pointer P is assigned a value is i+1, and finally assign the P pointer to i+1. At the same time, complete the same calculation and assignment of the Sequence array and the Index array in the matching order of the first feature pair of the acquired pattern string and the target string, and add the calculation and assignment of the record array Distance, that is, Distance[t-1] The assignment is Index[t-1]-Index[t], where the Distance array is used to record the distance between two adjacent features. So back and forth.
如图7所示,当指针i对模式串进行向左遍历完成之后,P指针指向最左边的特征前序对应的PNext数组位置,此时,使用遍历指针i对模式串从右至左依次对非特征区元素进行遍历,进而完成对非特征区对应的PNext数组元素的计算、赋值,其具体实现方法如下:As shown in Figure 7, after the pointer i traverses the pattern string to the left, the P pointer points to the PNext array position corresponding to the leftmost feature preorder. At this time, use the traverse pointer i to traverse the pattern string from right to left The elements of the non-feature area are traversed, and then the calculation and assignment of the PNext array elements corresponding to the non-feature area are completed. The specific implementation method is as follows:
如图8所示,当i指针指向一个非特征区字符时,首先将P指针指向的PNext数组元素赋值为i,再将P指针赋值为i,然后,继续使用i指针进行向左遍历,寻找下一个非特征字符,通过Sequence数组判断i指针指向的字符是否为特征对。如此往复。当完成所有非特征区对应的PNext数组元素的计算、赋值,i指针进行向左遍历至最左边的非特征区字符时,将i指针指向的PNext数组赋值为-1,其中-1为匹配结束标记,即表示该字符匹配成功时,所有模式串字符在目标串中匹配成功。As shown in Figure 8, when the i pointer points to a character in a non-characteristic area, first assign the PNext array element pointed to by the P pointer to i, and then assign the P pointer to i, and then continue to use the i pointer to traverse to the left to find For the next non-characteristic character, judge whether the character pointed to by the i pointer is a characteristic pair through the Sequence array. So back and forth. When the calculation and assignment of the PNext array elements corresponding to all non-characteristic areas are completed, and the i pointer traverses to the left to the leftmost non-characteristic area character, assign the PNext array pointed to by the i pointer to -1, where -1 is the end of the match mark, which means that when the character matches successfully, all pattern string characters are successfully matched in the target string.
步骤3、获取记录模式串在目标串中跳跃匹配距离的FNext数组的方法具体如下:
为快速构建FNext数组,本实施例采取的方案是:获取目标串字符集后,首先,计算目标串字符集与自身的一重或二重笛卡尔积,逐一获取由该字符集组成的所有字符串集合, FNext数组记录的最佳跳跃距离与该集合中的字符串一一映射。计算笛卡尔积的同时对数组元素进行计算并赋初值,最佳跳跃距离表示模式串中从最右边第二个字符开始向左,第一个与模式串最右边两个或者三个字符所在位置对应目标串中的字符串相同的字符串,该字符串最右边字符与模式串最右边字符在模式串中所在位置之间的距离。初始值计算时,假设模式串中不存在所有由目标串字符集中两个或者三个任意字符组成的字符串,具体方法如下:In order to quickly construct the FNext array, the solution adopted in this embodiment is: after obtaining the character set of the target string, at first, calculate the single or double Cartesian product of the character set of the target string and itself, and obtain all character strings formed by the character set one by one Set, the best jumping distance recorded in the FNext array is mapped one by one to the strings in the set. Calculate the Cartesian product while calculating and assigning initial values to the array elements. The optimal jumping distance means that the pattern string starts from the second character on the right and goes to the left. The position corresponds to the distance between the rightmost character of the character string and the position of the rightmost character of the pattern string in the pattern string for the same character string in the target string. When calculating the initial value, it is assumed that there are no strings consisting of two or three arbitrary characters in the target string character set in the pattern string. The specific method is as follows:
当以两个字符构建FNext数组时,若模式串首字符与获取的两个字符中第二个字符相同,将该字符串映射的数组元素赋值为PLen-1,否则赋值为PLen,其中PLen为模式串长度。When constructing the FNext array with two characters, if the first character of the pattern string is the same as the second character of the two characters obtained, the array element mapped to the string is assigned as PLen-1, otherwise it is assigned as PLen, where PLen is Pattern string length.
当以三个字符构建FNext数组时,若模式串中首字符、第二字符分别与获取的三个字符中第二个、第三个字符均相同,将该字符串唯一映射的数组元素赋值为PLen-2,若模式串首字符与获取的三个字符中第三个字符相同,将该字符串唯一映射的数组元素赋值为PLen-1,否则赋值为PLen。When constructing an FNext array with three characters, if the first character and the second character in the pattern string are the same as the second and third characters in the three characters obtained respectively, assign the array element uniquely mapped to the string to PLen-2, if the first character of the pattern string is the same as the third character among the three acquired characters, assign the array element uniquely mapped to the string to PLen-1, otherwise assign it to PLen.
然后,对模式串由左至右按长度为2或3的字符进行遍历,对遍历到的每一个字符串计算最佳跳跃距离,其值为该字符串最右边字符与模式串最右边字符在模式串中所在位置之间的距离,将该字符串与FNext数组映射的数组元素重新赋值为该最佳跳跃距离。Then, the pattern string is traversed from left to right according to the characters with a length of 2 or 3, and the optimal jump distance is calculated for each character string traversed, and its value is between the rightmost character of the string and the rightmost character of the pattern string. The distance between the positions in the pattern string, reassign the array elements mapped from the string to the FNext array as the optimal jump distance.
当由左至右进行遍历并重新赋值完成之后,FNext数组中的各元素即为模式串中第一个特征对与目标串匹配中失配时,模式串最右边两个或三个字符在目标串中对应位置上的任意字符串所能提供的最佳跳跃距离。After traversing from left to right and reassignment is completed, each element in the FNext array is the first feature pair in the pattern string that does not match the target string, and the rightmost two or three characters of the pattern string are in the target string The optimal jumping distance provided by any string at the corresponding position in the string.
构建FNext数组之前,判定使用两个或者三个字符进行构建。当字符集小于20时,使用三个字符构建FNext数组,当字符集大于20时,使用两个字符构建FNext数组;当目标串较长时,使用三个字符构建FNext数组,反之,使用两个字符构建FNext数组。Before constructing the FNext array, it is determined to use two or three characters for construction. When the character set is less than 20, use three characters to build the FNext array, when the character set is greater than 20, use two characters to build the FNext array; when the target string is longer, use three characters to build the FNext array, otherwise, use two characters Characters build an FNext array.
步骤3、获取特征对成功时,获取记录模式串在目标串中跳跃匹配距离的TNext数组的方法具体如下:
构建如模式串等长的TNext数组,依次对该数组中与模式串特征对对应的数组元素、模式串非特征区数组元素进行计算、赋值处理。对于F2,F2对应的TNext数组元素表示该特征对于目标串发生失配时的最佳跳跃距离,当F2失配时,已知特征对仅有F′1,只能将模式串右移,使F2与与F′1对齐。当对Fi(i∈[3,num])进行处理时,通过以下方法对TNext数组中对应元素进行计算、赋值:Construct a TNext array of the same length as the pattern string, and perform calculation and assignment processing on the array elements corresponding to the feature pairs of the pattern string and the array elements of the non-characteristic area of the pattern string in the array. For F 2 , the TNext array element corresponding to F 2 indicates the optimal jump distance of the feature when the target string is mismatched. When F 2 is mismatched, the known feature pair is only F′ 1 , and the pattern string can only be Shift to align F2 with F'1 . When processing F i (i∈[3,num]), the corresponding elements in the TNext array are calculated and assigned by the following methods:
如图9所示,检索过程中,若Fi(i∈[3,num])发生失配,则Fi之前的特征对全部适配,即已知特征对为F′j(j∈[1,i-1]),将模式串中发生失配的特征对右边的所有字符称为已知字符串。由于已知字符串与模式串之间的匹配和目标串与模式串之间的匹配过程相同,因此,使用模式串与已知字符串之间的匹配计算最佳跳跃距离。As shown in Figure 9, during the retrieval process, if there is a mismatch in F i (i∈[3,num]), all feature pairs before F i are adapted, that is, the known feature pair is F′ j (j∈[ 1,i-1]), and all the characters on the right of the mismatched feature pair in the pattern string are called known strings. Since the matching process between the known string and the pattern string is the same as the matching process between the target string and the pattern string, the optimal jumping distance is calculated using the matching between the pattern string and the known string.
已知字符串中F′1称为基准特征对,使用模式串中Fi(i∈[2,num])依次与基准特征对对齐,判断Fi与对齐的模式串特征对序号,此时T等于2,已知字符串为F′j(j∈[1,3])覆盖的区域,与F′1第一次对齐的模式串特征对为FT,OSP指针指向已知特征前序,初始化为1,SSP指针根据Index数组获取与基准特征对对齐的模式串特征对的特征前序的位置。首先根据Distance数组获取OSP指针与下一个已知特征前序之间的距离,SSP指针减去该距离即为模式串中与下一个已知特征前序对齐的位置,然后使用SSP指针判断Sequence数组中该位置的元素值是否为1,用于判定是否存在错位字符,例如图10中Shift①所示,SSP指针移动至F′2[1]对齐的模式串字符位置,即模式串中F3[1]字符的位置,此时Sequence等于1,说明已知字符串中已知特征对与模式串中对应位置特征对对齐,OSP移动指向下一个特征前序(Shift②),继续判断是否存在错位字符;当SSP指针移动指向下一个已知特征前序对齐的模式串字符时(Shift③),F′3[1]对齐的字符为非特征字符,即对应Sequence数组中该位置元素不为1,因此该位置的匹配为无效匹配,结束此轮计算,将模式串下一个特征对与基准特征对对齐,即T加1、OSP恢复至1和SSP指向Index[T],继续下一轮进行计算,如此循环。F′ 1 in the known character string is called a reference feature pair, use the F i (i∈[2, num]) in the pattern string to align with the reference feature pair in turn, and judge the serial number of the F i and the aligned pattern string feature pair, at this time T is equal to 2, the known string is the area covered by F′ j (j∈[1,3]), the pattern string feature pair aligned with F′ 1 for the first time is F T , and the OSP pointer points to the preorder of the known feature , initialized to 1, and the SSP pointer obtains the position of the feature preorder of the pattern string feature pair aligned with the reference feature pair according to the Index array. First, according to the Distance array, obtain the distance between the OSP pointer and the preorder of the next known feature. Subtracting the distance from the SSP pointer is the position in the pattern string that is aligned with the preorder of the next known feature, and then use the SSP pointer to judge the Sequence array Whether the element value at this position is 1 is used to determine whether there is a misplaced character. For example, as shown by Shift① in Figure 10, the SSP pointer moves to the character position of the pattern string aligned with F′ 2[1] , that is, F 3[ in the pattern string 1] The position of the character. At this time, Sequence is equal to 1, indicating that the known feature pair in the known string is aligned with the corresponding position feature pair in the pattern string. OSP moves to the next feature preorder (Shift②), and continues to judge whether there is a misplaced character ; When the SSP pointer moves to point to the pattern string character (Shift ③) of the next known characteristic preorder alignment, the character aligned by F′ 3[1] is a non-characteristic character, that is, the position element in the corresponding Sequence array is not 1, so The match at this position is an invalid match. End this round of calculation, align the next feature pair of the pattern string with the reference feature pair, that is, add 1 to T, restore OSP to 1, and point to Index[T] with SSP, and continue the next round of calculation. So cycle.
当SSP减去OSP根据Distance获取的距离时,如果SSP<-1,代表SSP指针指向位置超出模式串2个以上字符位,显然,已知字符串的下一已知特征对不可能与模式串字符对齐,因此,模式串移动至此即为最佳跳跃距离,如果SSP=-1则说明模式串FirstChar字符与已知字符串已知特征对的特征后续对齐,只需判断Fi[2]与FirstChar是否相等,如果相等则为最佳跳跃距离,否则为无效匹配。如图11所示,如果已知字符串中所有已知特征对与模式串都不存在错位字符,说明该位置为可跳跃距离。When SSP subtracts the distance obtained by OSP according to Distance, if SSP<-1, it means that the SSP pointer points to a position beyond the pattern string by more than 2 character bits. Obviously, the next known feature pair of the known string cannot be matched with the pattern string Characters are aligned. Therefore, moving the pattern string to this point is the optimal jumping distance. If SSP=-1, it means that the FirstChar character of the pattern string is subsequently aligned with the known feature pair of the known character string. It is only necessary to judge F i[2] and Whether FirstChar is equal, if equal, it is the best jumping distance, otherwise it is an invalid match. As shown in Figure 11, if there is no misplaced character in all known feature pairs and pattern strings in the known character string, it means that the position is a jumpable distance.
TStart表示模式串中第一个与基准特征对对齐的特征对序号,初始值为2,T用于记录计算过程中与基准特征对对齐的模式串特征对序号,t表示此时正在处理的特征对序号。当已知特征对个数为K,若模式串中某一特征对与基准特征对对齐时存在错位字符,那么当已知特征对个数大于K时,该模式串特征对与基准特征对对齐时必然存在错位字符,因此,计算最佳跳跃距离时,模式串中第一次与基准特征对对齐的特征对应该是上一次计算完成时与基准特征对对齐的特征对,并非每次都须从F2开始计算。因此,T的初始值设为TStart,每当计算一次最佳跳跃距离,便将此次计算结束时的TStart传递给下一次计算。TStart indicates the serial number of the first feature pair aligned with the reference feature pair in the pattern string, the initial value is 2, T is used to record the serial number of the feature pair of the pattern string aligned with the reference feature pair during the calculation process, and t indicates the feature being processed at this time For the serial number. When the number of known feature pairs is K, if there is a misplaced character when a certain feature pair in the pattern string is aligned with the reference feature pair, then when the number of known feature pairs is greater than K, the pattern string feature pair is aligned with the reference feature pair Therefore, when calculating the optimal jumping distance, the feature pair that is first aligned with the reference feature pair in the pattern string should be the feature pair that was aligned with the reference feature pair when the last calculation was completed, not every time Count from F2 . Therefore, the initial value of T is set to TStart, and every time the optimal jumping distance is calculated, TStart at the end of this calculation is passed to the next calculation.
由于存在所有模式串特征对与基准特征对对齐,却不满足可跳跃距离条件,而TNext 数组的数组必须赋值,即函数必须返回一个最佳跳跃距离,因此当基准特征对与所有模式串特征对对齐时,仍不满足可跳跃距离条件,只需判断特征后序与FirstChar是否相等,若相等,将FirstChar移动至第一个特征后序;否则,将FirstChar移动至第一个特征后序的右边字符位置。Since there are all pattern string feature pairs that are aligned with the reference feature pairs, but do not meet the jumpable distance condition, the array of TNext array must be assigned, that is, the function must return an optimal jump distance, so when the reference feature pair and all pattern string feature pairs When aligning, if the jumpable distance condition is still not satisfied, it is only necessary to judge whether the sequence of the feature is equal to FirstChar, if they are equal, move FirstChar to the sequence of the first feature; otherwise, move FirstChar to the right of the sequence of the first feature character position.
若Fnum与基准特征对对齐同样为无效匹配,模式串将获得一个最长跳跃距离,该距离由模式串属性决定,因此,对TNext数组下一个元素赋值时无需使用计算最佳跳跃距离,只需将TNext数组之后所有元素全部赋值为该最长跳跃距离即可。基于上述方法,最长跳跃距离等于Fnum[1]和F1[2]的索引相加决定,其中F1[2]索引的大小占主导,因此,步骤1中特征对的选取时,如果出现次数最多的字符对存在多个时,选取模式串中较为靠右的字符对作为特征对。If the alignment of F num and the reference feature pair is also an invalid match, the pattern string will obtain a longest jump distance, which is determined by the pattern string attribute. Therefore, it is not necessary to use the calculation of the best jump distance when assigning a value to the next element of the TNext array, just All elements after the TNext array need to be assigned the longest jump distance. Based on the above method, the longest jumping distance is equal to the addition of the indices of F num[1] and F 1[2] , where the size of the index of F 1[2] dominates. Therefore, when selecting a feature pair in
当TNext数组中特征对对应区域元素完成赋值,需要对非特征区进行赋值,此时,将所有特征对作为已知特征对,同样采用同计算TNext数组中特征对对应数组元素值的方式计算最佳跳跃距离,只需将完整模式串作为已知字符串即可。为提高算法效率,PNext数组、 TNext数组中非特征区赋值在同一程序模块中实现。When the feature pairs in the TNext array are assigned values to the corresponding area elements, it is necessary to assign values to the non-feature areas. At this time, all feature pairs are regarded as known feature pairs, and the same method is used to calculate the corresponding array element values of the feature pairs in the TNext array. The best jumping distance, only need to use the complete pattern string as a known string. In order to improve the efficiency of the algorithm, the assignment of non-characteristic areas in the PNext array and TNext array is implemented in the same program module.
步骤3、获取特征对失败时,获取记录模式串在目标串中跳跃匹配距离的特殊PNext、 TNext数组的方法具体如下:
当模式串不存在特征对时,在现有检索过程的基础上,构建特殊的PNext数组和TNext 数组,可获得高效的检索效率。如图12所示,以PStr="ABCDEFGH"为例,PLen=8且不存在特征对,首先构建一组从右至左依次检索的PNext数组,匹配起点FirstFNext设为PLen–1。当前两个匹配的字符(G和H)失配时,根据FNext数组获取最佳跳跃距离;当其它字符失配时,由于此时模式串中不存在最右边两个字符组成的字符对,因此,只需判断FirstChar 与PStr[PLen–1]是否相等,若相等则将TNext数组所有值赋值为PLen-1,否则赋值为PLen。When there is no feature pair in the pattern string, on the basis of the existing retrieval process, construct special PNext array and TNext array to obtain high retrieval efficiency. As shown in Figure 12, taking PStr="ABCDEFGH" as an example, PLen=8 and there is no feature pair, first construct a set of PNext arrays that are retrieved sequentially from right to left, and the matching starting point FirstFNext is set to PLen–1. When the first two matched characters (G and H) do not match, the optimal jump distance is obtained according to the FNext array; when other characters do not match, since there is no character pair consisting of the rightmost two characters in the pattern string at this time, therefore , just judge whether FirstChar is equal to PStr[PLen–1], if they are equal, assign all values of TNext array to PLen-1, otherwise assign to PLen.
步骤4、若特征对和非特征区均匹配成功,则模式串在目标串中检索成功,并获取下一个匹配字符的位置,其中检索的过程具体如下:
获取PNext、TNext和FNext数组之后,将模式串在目标串中从左至右进行检索,检索过程中,使用模式串指针Pi与目标串指针Ti分别指向模式串的字符与目标串的字符进行匹配。当模式串移动至目标串某一位置时,从模式串匹配起点FirstFeaStr开始,从右至左与目标串对应字符进行跳跃匹配。After obtaining the PNext, TNext and FNext arrays, search the pattern string from left to right in the target string. During the retrieval process, use the pattern string pointer Pi and the target string pointer Ti to point to the characters of the pattern string and the characters of the target string respectively to match . When the pattern string moves to a certain position of the target string, start from the matching starting point FirstFeaStr of the pattern string, and perform jump matching with the corresponding character of the target string from right to left.
当模式串指针与目标串指针指向的字符适配时,使用模式串指针获取PNext数组中对应的数组元素值PNext[Pi],并分别计算模式串与目标串指针指向的下一个匹配字符位置:先将 Ti指向Ti-Pi+PNext[Pi],再将Pi指向PNext[Pi]。When the pattern string pointer matches the character pointed to by the target string pointer, use the pattern string pointer to obtain the corresponding array element value PNext[Pi] in the PNext array, and calculate the next matching character position pointed to by the pattern string and the target string pointer respectively: First point Ti to Ti-Pi+PNext[Pi], then point Pi to PNext[Pi].
当模式串指针与目标串指针指向的字符失配时,如果模式串指针指向的字符是第一特征对,获取目标串中与模式串最后两个或者三个字符对齐的字符串,然后获取FNext数组中与该字符串唯一映射的数组元素值V,该值即为当前模式串第一特征对与目标串中对应字符失配的最佳跳跃距离,通过该最佳跳跃距离计算模式串与目标串指针指向的下一个匹配字符位置:先将Ti指向Ti-Pi+FirstFeaStr+V,再将Pi指向FirstFeaStr;如果模式串指针指向的字符不是第一特征对,使用模式串指针获取TNext数组中对应的数组元素值TNext[Pi],该值即为当前模式串字符与目标串中对应字符失配的最佳跳跃距离,通过该最佳跳跃距离计算模式串与目标串指针指向的下一个匹配字符位置:先将Ti指向Ti-Pi+FirstFeaStr+TNext[Pi],再将Pi指向FirstFeaStr。When the pattern string pointer does not match the character pointed to by the target string pointer, if the character pointed to by the pattern string pointer is the first feature pair, obtain the character string aligned with the last two or three characters of the pattern string in the target string, and then obtain FNext The array element value V uniquely mapped to the string in the array is the optimal jumping distance between the first feature pair of the current pattern string and the corresponding character in the target string, and the optimal jumping distance is used to calculate the pattern string and the target The position of the next matching character pointed to by the string pointer: first point Ti to Ti-Pi+FirstFeaStr+V, and then point Pi to FirstFeaStr; if the character pointed to by the pattern string pointer is not the first feature pair, use the pattern string pointer to obtain the corresponding character in the TNext array The array element value TNext[Pi] of TNext[Pi], this value is the best jumping distance between the characters in the current pattern string and the corresponding characters in the target string, and the next matching character pointed to by the pointer of the pattern string and the target string is calculated through the best jumping distance Position: first point Ti to Ti-Pi+FirstFeaStr+TNext[Pi], then point Pi to FirstFeaStr.
当指针Pi遍历到最后一个字符(PNext[Pi]=-1),跳出循环。由于检索顺序为跳跃匹配,当模式串与目标串任意两个字符进行匹配之后,无论失败或是成功,目标串指针指向下一个匹配位置之前都需要恢复到当前模式串FirstChar对齐的目标串位置,然后再根据辅助数组获取下一个匹配的位置。When the pointer Pi traverses to the last character (PNext[Pi]=-1), jump out of the loop. Since the search sequence is jump matching, when the pattern string matches any two characters of the target string, no matter whether it fails or succeeds, the target string pointer needs to be restored to the target string position aligned with the FirstChar of the current pattern string before pointing to the next matching position. Then get the next matching position according to the auxiliary array.
为了进一步验证本实施例的稳定性和有效性,通过公开基因组数据库 https://www.ncbi.nlm.nih.gov/获得不同数量级的字符串(肿瘤基因组、病毒基因组、遗传病基因组),选取长度为100、500、1000、3000、5000和10000的模式串在不同数量级的目标串中分别进行检索,利用本实施例的算法ZK(2)、ZK(3)与现有文献提出的Sunday、BMH、KMP、 BF和BM算法进行比较,其中ZK(2)和ZK(3)分别为使用两个和三个字符构建FNext数组的算法。In order to further verify the stability and effectiveness of this example, strings of different magnitudes (tumor genome, virus genome, and genetic disease genome) were obtained through the public genome database https://www.ncbi.nlm.nih.gov/, and selected Pattern strings with lengths of 100, 500, 1000, 3000, 5000 and 10000 are retrieved respectively in target strings of different orders of magnitude, using the algorithms ZK(2) and ZK(3) of this embodiment and the Sunday, BMH, KMP, BF and BM algorithms are compared, among which ZK(2) and ZK(3) are algorithms for constructing FNext arrays using two and three characters respectively.
实验全过程均在内存中操作,CPU为Intel-i5-8400-2.80GHz,内存24G,采用微秒级计时函数"QueryPerformanceFrequency"统计检索时间。其中,基因组数据库去除了基因标签。The whole process of the experiment is operated in the memory, the CPU is Intel-i5-8400-2.80GHz, the memory is 24G, and the microsecond-level timing function "QueryPerformanceFrequency" is used to count the retrieval time. Among them, the genome database has removed gene tags.
如图13-图18所示为实验一,随目标串数量级增大,ZK(2)和ZK(3)算法对不同长度的模式串检索时间(ms)均远小于其它算法,并且算法稳定性好。目标串数量级越大,ZK(3)算法优势更加明显。如图17-图18所示,当目标串数量级较低时,ZK(2)算法明显优于ZK(3)以及其它算法,是因为ZK(3)算法在构建辅助数组时相对检索时间耗时较长。在实际应用中,当目标串较长时,建议选择ZK(3)算法,反之,选择ZK(2)算法。As shown in Figure 13-Figure 18,
为验证字符集大小对已有算法与ZK算法检索效率的影响,进行了实验二,如图19-20 所示,采取随机生成指定字符集大小的目标串与模式串的方法,TLen=5×10^8,PLen=2 ×10^3,纵坐标为检索时间,单位毫秒,横坐标表示字符集大小。图19中,对ZK(2)、ZK(3)与现有算法进行测试,并排除对数据分析影响较大的KMP和BF算法,对字符集5-60进行分析;图20中,随字符集的增大,当字符集小于20时,ZK(3)算法检索效率远高于其他算法,当字符集大于20时,辅助数组构建耗时较长。In order to verify the effect of character set size on the retrieval efficiency of existing algorithms and the ZK algorithm,
由于ZK算法采取跳跃匹配的方式进行匹配,并且TNext数组与FNext数组对应于任意字符失配时的最佳跳跃距都可以根据模式串属性计算得出,因此ZK算法采取提前计算最佳跳跃距离并保存在对应数组的方案,实际检索时,无论正在匹配的字符失配或者适配,可根据当前正在匹配字符的索引在对应的辅助数组中获取下一匹配位置或者最佳跳跃距离。本实施例的优点在于检索过程中无需进行过多计算,只需对相应辅助数组进行取数据操作,极大的优化了程序检索过程的时间消耗。Since the ZK algorithm uses jump matching for matching, and the optimal jump distance when the TNext array and the FNext array correspond to any character mismatch can be calculated according to the pattern string attributes, the ZK algorithm uses the calculation of the optimal jump distance in advance and The scheme saved in the corresponding array, in actual retrieval, regardless of the mismatch or adaptation of the character being matched, the next matching position or the best jumping distance can be obtained in the corresponding auxiliary array according to the index of the character currently being matched. The advantage of this embodiment is that there is no need to perform too many calculations in the retrieval process, and only need to perform a data fetch operation on the corresponding auxiliary array, which greatly optimizes the time consumption of the program retrieval process.
如上即为本发明的实施例。上述实施例以及实施例中的具体参数仅是为了清楚表述发明人的发明验证过程,并非用以限制本发明的专利保护范围,本发明的专利保护范围仍然以其权利要求书为准,凡是运用本发明的说明书及附图内容所作的等同结构变化,同理均应包含在本发明的保护范围内。The above is the embodiment of the present invention. The above examples and the specific parameters in the examples are only for clearly expressing the inventor's invention verification process, and are not used to limit the scope of patent protection of the present invention. The scope of patent protection of the present invention is still subject to its claims. The equivalent structural changes made in the specification and drawings of the present invention should be included in the protection scope of the present invention in the same way.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110860647.1A CN113590895B (en) | 2021-07-28 | 2021-07-28 | Character string retrieval method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110860647.1A CN113590895B (en) | 2021-07-28 | 2021-07-28 | Character string retrieval method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113590895A CN113590895A (en) | 2021-11-02 |
| CN113590895B true CN113590895B (en) | 2023-04-25 |
Family
ID=78251402
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110860647.1A Active CN113590895B (en) | 2021-07-28 | 2021-07-28 | Character string retrieval method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113590895B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116992088A (en) * | 2023-03-11 | 2023-11-03 | 陈伟峰 | Multi-level dictionary data structure and algorithm based on integers |
| CN118885644A (en) * | 2024-07-30 | 2024-11-01 | 北京酷车易美网络科技有限公司 | A vehicle model library name matching method based on string comparison algorithm |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1173684A (en) * | 1996-05-21 | 1998-02-18 | 株式会社日立制作所 | means for recognizing an input character string by reasoning |
| CN102760173A (en) * | 2012-07-02 | 2012-10-31 | 河海大学 | Bottom-up XML (eXtensible Markup Language) twig pattern matching method |
| CN103577598A (en) * | 2013-11-15 | 2014-02-12 | 曙光信息产业(北京)有限公司 | Matching method and device for pattern string and text string |
| CN103873317A (en) * | 2012-12-18 | 2014-06-18 | 中国科学院空间科学与应用研究中心 | Method and system for detecting CCSDS (consultative committee for space data system) space link protocol |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7359895B2 (en) * | 2004-11-18 | 2008-04-15 | Industrial Technology Research Institute | Spiral string matching method |
-
2021
- 2021-07-28 CN CN202110860647.1A patent/CN113590895B/en active Active
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1173684A (en) * | 1996-05-21 | 1998-02-18 | 株式会社日立制作所 | means for recognizing an input character string by reasoning |
| CN102760173A (en) * | 2012-07-02 | 2012-10-31 | 河海大学 | Bottom-up XML (eXtensible Markup Language) twig pattern matching method |
| CN103873317A (en) * | 2012-12-18 | 2014-06-18 | 中国科学院空间科学与应用研究中心 | Method and system for detecting CCSDS (consultative committee for space data system) space link protocol |
| CN103577598A (en) * | 2013-11-15 | 2014-02-12 | 曙光信息产业(北京)有限公司 | Matching method and device for pattern string and text string |
Non-Patent Citations (3)
| Title |
|---|
| Lei Zhang et al..An Improved String Matching Algorithm for HTTP Data Reduction.《2015 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP)》.2016,1-8. * |
| 徐禄 等.无线电监测软件总线的设计及应用.《西华大学学报(自然科学版)》.2014,6-10. * |
| 苏艳刚.基于多模式匹配的网络入侵检测系统关键技术实现.《中国优秀硕士学位论文全文数据库 信息科技辑》.2009,I139-190. * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113590895A (en) | 2021-11-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113590895B (en) | Character string retrieval method | |
| US8190591B2 (en) | Bit string searching apparatus, searching method, and program | |
| CN106228002B (en) | High-efficiency abnormal time sequence data extraction method based on secondary screening | |
| CN101609455A (en) | A Method for High-Speed Accurate Single-Pattern String Matching | |
| CN105138534B (en) | Great-leap-forward seed lookup algorithm based on FMD indexes and fast table | |
| CN107015951B (en) | Method and system for verifying correctness of suffix array | |
| CN109976806B (en) | Java Statement Block Cloning Detection Method Based on Bytecode Sequence Matching | |
| KR100959244B1 (en) | Fast String Pattern Detection Using Layered Shift Tables | |
| US8250089B2 (en) | Bit string search apparatus, search method, and program | |
| CN116483337A (en) | API completion method based on prompt learning and data enhancement | |
| CN105335624A (en) | Gene order fragment fast positioning method based on bitmap | |
| CN101908102A (en) | Method and device for predicting ribonucleic acid secondary structure based on stem region | |
| CN105025013B (en) | The method for building up of dynamic IP Matching Model based on priority Trie trees | |
| Wei et al. | A branch elimination-based efficient algorithm for large-scale multiple longest common subsequence problem | |
| US20120239664A1 (en) | Bit string search apparatus, search method, and program | |
| CN101923481A (en) | Program execution optimization method based on code duplication super block | |
| CN113065419B (en) | Pattern matching algorithm and system based on flow high-frequency content | |
| US8195667B2 (en) | Bit string search apparatus, search method, and program | |
| CN115329094B (en) | Knowledge graph embedding representation method and system based on neighborhood relation characterization vector | |
| CN115641911B (en) | A Method for Overlap Detection Between Sequences | |
| CN112825268A (en) | Sequencing result comparison method and application thereof | |
| KR20100105080A (en) | Query processing method and apparatus based on n-gram | |
| Yingfan et al. | Revisiting $ k $-Nearest Neighbor Graph Construction on High-Dimensional Data: Experiments and Analyses | |
| CN102402692B (en) | Method and system for recognizing feature character strings | |
| Kurniawan et al. | A new string matching algorithm based on logical indexing |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |