WO1992015067A1 - Methode de recherche de sous-chaines - Google Patents
Methode de recherche de sous-chaines Download PDFInfo
- Publication number
- WO1992015067A1 WO1992015067A1 PCT/GB1992/000333 GB9200333W WO9215067A1 WO 1992015067 A1 WO1992015067 A1 WO 1992015067A1 GB 9200333 W GB9200333 W GB 9200333W WO 9215067 A1 WO9215067 A1 WO 9215067A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pattern
- character
- cuπent
- match
- text
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Definitions
- the present invention relates to substring searching and, more particularly, to a method of locating within a character string a substring that matches a given character pattern of one or more characters, the characters making up the string and pattern being taken from the same character set.
- the character set may, for example, be the English alphabet; in this case the given character pattern may be an English word it is desired to locate in a body of text constituting the character string. More generally, the character set could be all possible 256 combinations of 8 binary quantities (ie 8-bit bytes), substring searching being used to find occurrences of given byte patterns within a body of computer data.
- the character set could simply be the binary character set of 0, 1 in which case substring searching becomes searching for a given binary sequence within a binary string.
- Substring searching has widespread uses in information processing ranging from pattern matching and feature recognition to text processing. As a result, there is a substantial body of prior art relevant to such searching.
- the simplest algorithm for locating the occurrences of a pattern within a text will be called the "naive algorithm M here.
- the text as a line of data to be scanned linearly from left to right.
- the text has to be checked for a complete occurrence of the pattern at all possible positions.
- the pattern is first checked ' with its leftmost end aligned with the leftmost end of the text.
- pattern starting with the leftmost and proceeding through the pattern to the right one character at a time, is compared with the character in the corresponding position in the text. Whenever a position is found in which the pattern and text characters do not match, then the algorithm has determined that the pattern does not occur at this alignment in the text, and the pattern is moved one character to the right against the text. A new comparison of each pattern character with the corresponding text character is then begun, in order to check for the pattern at this new position in the text. If the end of the pattern is reached after all characters have been found to match then the algorithm has successfully located an occurrence of the pattern in the text. If the end of the text is reached without finding an occurrence of the pattern, then the algorithm has determined that the pattern does not occur in the text. This naive algorithm is conceptually simple but inefficient.
- searching could equally as well have been effected from right to left as from left to right. This is generally the case for substring searching methods and references herein to searching leftwards or rightwards should be taken as by way of example rather than by way of limitation, except insofar as these references imply a relative direction of execution of associated actions.
- Knuth et al described a more efficient algorithm which will be called the "KMP" algorithm here (see: Knuth,D.E., Morris, J.H. and Pratt, V.R. Fast pattern matching in strings, Siam J Comput, Vol 6 No 2, June 1977, pp 323-350.)
- the KMP algorithm differs from the naive algorithm in exploiting the fact that if in comparing the pattern with the text in a given position a number of characters are found to match before a mismatch is found, then the pattern of known text characters that have already been matched restricts the positions of pattern occurrences in the text that are possible.
- the KMP algorithm searches for the pattern in the text from left to right, and scans characters in the pattern from left to right one character at a time.
- the algorithm can, in a preliminary phase, precompute from the pattern a table of pattern-shift data that indicates for each position in the pattern at which a mismatch between the pattern and the text is first found the number of characters by which the pattern should be shifted to the right in order to find the next possible occurrence of the pattern in the text. If the text is sufficiently long, the fixed overhead of precomputing the table is offset by the increased rate of testing possible positions for occurrence of the pattern in the main phase of the algorithm.
- the KMP algorithm is more efficient than the naive algorithm because it uses implicit information about the previously scanned characters to allow the pattern to be shifted more than one character at a time.
- Boyer and Moore have also described an algorithm that uses the idea of precomputing tables of pattern-shift data in a preliminary phase of the algorithm, and that uses information about the text implicit in the number of characters already matched between the text and pattern in the current pattern position (see: Boyer,R.S. and Moore,J.S. A fast string searching algorithm. Commun ACM 20, 10 (Oct 1977), pp 762-772.).
- Boyer and Moore's algorithm called the "BM” algorithm here, differed from the KMP algorithm in that the BM algorithm scans characters in the pattern from right to left, while still searching the text for pattern matches from left to right.
- the first text character tested at a particular pattern position in the text happens to be a character that does not occur in the pattern at all.
- the tested character is at the left hand end of the pattern, and the pattern is shifted one character to the right.
- the tested character is at the right hand end of the pattern, and the pattern is shifted to the right by the number of characters in the pattern. Any smaller shift of the pattern would place one of the characters in the pattern against the tested text character, which does not occur in the pattern.
- the pattern is shifted by using the text character that mismatched the pattern to index a precomputed table that gives the minimum pattern shift compatible with the value of the mismatched text character.
- a second precomputed table can be used to give the minimum pattern shift compatible with the fact that pattern characters previously tested in the cu ⁇ ent pattern position matched the text (and so these characters in the text must match any occu ⁇ ence of the pattern in another position). If both precomputed tables are used then the pattern can be shifted by the maximum of the shifts indicated by the two tables.
- the BM algorithm is more efficient than the KMP algorithm because it uses a right to left scanning order, which on average results in the pattern being shifted by a larger number of characters on each character mismatch found.
- Knuth et al in a postscript to his above-referenced article described a generalisation of the BM algorithm, which we will call the "BMKMP" algorithm.
- Knuth et al observed that when the BM algorithm shifts the pattern over the text, it "forgets” all that it already “knows” about the characters that have already been matched. They proposed the BMKMP algorithm which retains all of the knowledge of the text that is still relevant to the comparison process by encoding this knowledge into one of a finite number of states. Each state represents one of the possible combinations of known text characters at the current pattern position.
- the algorithm state determines the position of the next text character to be tested, and the value of that text character and the current state together determine the next state and by how many characters the pattern should be shifted against the text.
- This algorithm compares well with others in that it requires a low number of character comparisons between the pattern and the text.
- several factors make the BMKMP algorithm unattractive in practice :-
- the state table is typically rather large. If all characters in the pattern are distinct, then there are (m ⁇ 2 + m)/2 states, where there are m characters in the pattern and " A " denotes exponentiation. Worse, if all characters are not distinct then the only clear bound established on the size of the state table is 2 ⁇ m states, which could be too large to implement in many cases.
- Each state in the state table needs entries to define how the algorithm proceeds according to the character value found in the text. An entry is needed for each distinct character found in the pattern, and for all other characters together.
- the state table entries can be structured either to allow fast access to the entry for the character found in the text, or to allow efficient use of the memory used to store the state table. However, the aims of fast access and efficient use of memory are in conflict to some extent, and this will make the BMKMP algorithm more expensive to implement than some of the other algorithms described here.
- the state table must be computed from the pattern before the algorithm can begin searching the text.
- the size and complexity of the state table make this pre-computation an expensive operation, which could only be justified if a long text was to be searched.
- Sunday has described a family of substring searching algorithms (called the “Sunday” algorithms here) which are also related to the BM algorithm (see: Sunday,D.M. "A very fast substring search algorithm” Co mun ACM 33, 8 (Aug 1990), pp 132-142.). These algorithms differ from each other in the order in which the characters in the pattern are compared against the text, and these different comparison orders also result in different requirements for precomputing the tables of shift values.
- the BM algorithm uses one precomputed table to obtain a minimum value of pattern shift from the value of the text character that mismatched, and another to obtain a minimum value of pattern shift from the number of pattern characters successfully matched against the text at the current pattern position.
- Sunday's algorithms use one precomputed table to obtain a minimum value of pattern shift from all of the pattern characters tested against the text at the current pattern position, including the mismatched character (although the value of text character found at the mismatch position is not used.)
- Sunday's algorithms read the text character immediately following the end of the pattern in its cu ⁇ ent position, and use this text character to access a second precomputed table which gives a second minimum value of pattern shift.
- the pattern is then shifted by the maximum of the two minimum pattern shift distances.
- a method of locating within a character string a substring that matches a given character pattern of one or more characters, the characters making up the string and pattern being taken from the same character set comprising a preliminary phase during which pattern-shift data is derived from said given pattern, and a main phase in which the pattern is notionally positioned relative to said string at a succession of different current pattern positions in each of which one end of the pattern corresponds to a respective character position in said string and said pattern extends along the string therefrom in a predetermined direction, the main phase including the steps of :
- step (b) where step (a) ascertains a mismatch, notionally shifting the pattern along the string in said predetermined direction, by an amount derived from said pattern- shift data, to a new current pattern position, steps (a) and (b) being repeated as necessary until a match is found or the string has been fully searched; characterised in that:
- said pattern-shift data is in the form of a respective possible-match record for each of a plurality of different character items of said character set where a character item may comprise a single character of said character set and/or combinations of such characters and said plurality of items is such as to enable any possible string to be built up from combinations, including repeats, of said character items,
- each possible-match record indicates, for an assumed situation of the associated said character item being present in the string and said one end of the pattern being moved in said predetermined direction towards the character item, whether said character item matches the corresponding character or characters of said pattern for each of the N last pattern positions taken by said pattern as its said one end is moved up to said character item, N being a positive integer, and
- the main phase involves determining, from the respective possible-match record of at least one character item of the substring that corresponds to said pattern in its current position, the next pattern position at which the presence of the, or all of the, said at least one character items is compatible with a match between the pattern and a substring of the string, this determination taking account of the position in said string of the or each said at least one character item relative to said one end of the pattern, by using an appropriate portion of the character item's possible-match record.
- the possible-match records thus permit information on possible future matches derived from one or more character items to be utilised in determining the next pattern position.
- the use of possible-match records means that the order in which character items in the string are tested against the pattern can be varied.
- the possibility of combining possible-shift information derived in respect of several character items makes the method potentially powerful.
- the said character items will often be constituted exclusively by all the single characters of the character set. However, in certain circumstances (for example, in searching strings made up from the binary character set of 0,1), it may be advantageous to use-combinations of characters; thus for the binary character set case, the character items could be constituted by all possible 256 combinations of eight binary characters.
- step (a) of the main phase involves identifying the character item present at a particular location in the substring currently corresponding to said pattern, and checking whether this character item matches the character or characters in the corresponding location in the pattern, these operations being repeated for different locations in said substring until either a mismatch is found or the whole substring has been successfully matched to the pattern.
- step (a) also includes providing a current possible-match indicator giving a cumulative indication of which pattern positions yet to be reached, provide the possibility of a match, this cumulative indication being provided by the incorporation into said indicator, for the or each character item identified, of the relevant portion of its possible-match record, said current possible-match indicator being used to determine said next pattern position to which the pattern position is shifted in step (b) of the main phase.
- the said current possible-match indicator includes a possible match indication for the current pattern position as well as for pattern positions yet to be reached, this indication being cumulatively derived from the possible-match records of the or each character item identified for checking in step (a) of the main phase.
- This information means that this indicator can not only be used to determine the next pattern position should a mismatch be found at the current position, but also to actually check for a mismatch at the current position in respect of each character item identified for checking. This leads to an efficient implementation of the present method.
- the aforesaid current possible-match indicator can either be reset (initially, to indicate all pattern positions not yet passed are possible match positions), or modified to retain only the portion thereof relevant to the pattern positions not yet passed.
- each said possible-match record and said current possible-match indicator preferably take the form of an N-bit field in which each bit position indicates whether a match is possible for a respective pattern position, the updating of said cu ⁇ ent possible-match indicator being effected by bit shifting and/or logically combining the relevant bits of a possible-match records with corresponding bits of said indicator.
- the integer N will have a value equal to the character length of said pattern. However, it is also possible for N to have a value greater or less than the pattern length (the latter being required, for example, where the possible-match records are held in 32-bit memory locations and the pattern is more than 32 characters long).
- Figure 1 is a diagram illustrating a predetermined pattern and a text string to be searched for occurrences of the pattern, the diagram being used to elucidate certain basic terms used in describing the substring search method;
- Figure 2 is a diagram similar to Figure 1 but elucidating further terms used in describing the substring search method;
- Figure 3 is a diagram illustrating the information content of shift masks used by the substring search method;
- Figure 4 is a diagram illustrating the use of the shift masks to generate a cumulative current shift mask used by the substring search method;
- Figure 5 is a top level flow diagram of the substring search method;
- Figure 6 is a flow diagram of a shift-mask generation step of the Figure 5 flow diagram;
- Figure 7 is a flow diagram of a pattern shift calculation step of the Figure 5 flow diagram.
- both the text 11 and pattern 12 are ordered sequences of characters, which we will describe as starting at the leftmost character and ending at the rightmost character (in the Figures, the characters of the text are generally represented by dots for clarity).
- the present substring search method scans the text from left to right, searching for the leftmost occurrence of the given pattern. If the pattern is not present in the text, this information is returned.
- the search method can be reapplied to the text to find further occurrences of the given pattern if this is required.
- the substring search method tests for occurrence of the pattern with the leftmost character of the pattern aligned with the leftmost character of the text before any other alignment is tested.
- pattern test sequence the order in which the characters of the current text range are tested does not affect correct operation, and so this order can be chosen for maximum efficiency, and even varied throughout the operation of the method.
- a simple and reasonably effective choice of pattern test sequence is to test the characters in the current text range from rightmost to leftmost in sequence.
- the comparisons between the text characters and the pattern characters are not made directly. Instead, the pattern is processed to provide a table of possible-match records each of which indicates for a respective text character whether the character matches with the corresponding pattern character for all possible relevant positions of pattern occurrences relative to that particular text character. This precomputed table and the text are then used to locate occurrences of the pattern in the text, the pattern itself not being required during the search.
- the table will be described as the "shift mask table” here and its possible-match records will be referred to as "shift masks”. The shift mask table will now be described in more detail.
- the pattern 12 is illustrated in two pattern positions in Figure 2, these being a leftward pattern position, indicated by a ⁇ ow 22L, in which the rightmost pattern character is in alignment with the text character 21, and a rightward pattern position, indicated by a ⁇ ow 22R, which is the position of the text character 21.
- a ⁇ ow 22L a leftward pattern position
- a ⁇ ow 22R a rightward pattern position
- the pattern positions for which the text character will affect the operations of the search method are the M consecutive pattern positions starting with the pattern position indicated by arrow 22L and ending with the pattern position indicated by a ⁇ ow 22R.
- the shift mask table contains one shift mask for each character (that is, character value) in the set of characters from which the text can be composed.
- Each shift mask consists of M binary values called "mask bits".
- Each mask bit in a shift mask co ⁇ esponds to one pattern position within the mask range of a text character, and indicates whether a pattern occu ⁇ ence at that pattern position relative to the text character is consistent with the text character value being the same as the character value to which the shift mask relates.
- Figure 3 shows for each of three possible values of the test character 21 (namely, the values "A”, “B” and “C”), the possible positions of pattern occu ⁇ ence relative to occu ⁇ ences of that character in the text; in addition, Figure 3 also shows the co ⁇ esponding shift mask 31, 32, 33 for each of the three character values.
- a tick within a shift mask indicates that the position of the tick co ⁇ esponds to a possible position of a pattern occu ⁇ ence
- a cross indicates that the co ⁇ esponding pattern position is not a possible pattern occu ⁇ ence.
- the rightmost mask bit in each shift mask co ⁇ esponds to the rightmost pattern position within the mask range, in other words the pattern position for which the leftmost character of the pattern is aligned with the particular text character.
- the leftmost mask bit in each shift mask co ⁇ esponds to the leftmost pattern position within the mask range, which is the pattern position for which the rightmost character of the pattern is aligned with the particular text character.
- the shift mask for any chosen character value can be computed by reversing the order of the pattern right to left, and then setting each mask bit of the shift mask to indicate whether the co ⁇ esponding character in the reversed pattern is equal to the chosen character value.
- the co ⁇ esponding shift masks each have all their mark bits set to indicate that a pattern occu ⁇ ence is not possible.
- the substring search method also maintains a "cu ⁇ ent shift mask", which records for the cu ⁇ ent pattern position which text characters positions within the cu ⁇ ent text range could be pattern positions co ⁇ esponding to a pattern occu ⁇ ence in the text.
- the cu ⁇ ent shift mask consists of M binary bits, each bit co ⁇ esponding to one pattern position within the cu ⁇ ent text range.
- the leftmost mask bit in the cu ⁇ ent shift mask co ⁇ esponds to the cu ⁇ ent pattern position, and the rightmost mask bit in the cu ⁇ ent shift mask co ⁇ esponds to the pattern shifted (M-l) characters to the right of its cu ⁇ ent position, so that the leftmost character of the pattern would align with the same text character that the rightmost character of the pattern aligns with at the cu ⁇ ent pattern position.
- Positions of possible pattern occu ⁇ ences outside the cu ⁇ ent text range do not need to be recorded. Possible pattern occu ⁇ ences to the left of the cu ⁇ ent text range have already been checked, and at the cu ⁇ ent pattern position the search method has not tested any text characters that affect the possibility of pattern occu ⁇ ences to the right of the cu ⁇ ent text range. All pattern occu ⁇ ences to the right of the cu ⁇ ent text range are regarded as possible.
- the cu ⁇ ent shift mask is the basic indicator of pattern matching possibilities during the execution of the search method, the cu ⁇ ent shift mask being updated for each new text character examined by combining elements of the co ⁇ esponding character shift mask with the cu ⁇ ent shift mask.
- a general description of the operation of the search method will next be given followed by a specific example related to the pattern already illustrated in Figures 1 to 3.
- the search method begins to test the text for occu ⁇ ences of the pattern.
- the method tests for occu ⁇ ence of the pattern with the leftmost character of the pattern aligned with the leftmost character of the text before any other alignment is tested.
- the test for occu ⁇ ence of the pattern at the first pattern position begins by reading the value of the text character that is aligned with the first pattern character in the pattern test sequence.
- the text character is used to retrieve the shift mask co ⁇ esponding to that character from the shift mask table. This shift mask indicates the possible positions of pattern occu ⁇ ences relative to the text character tested.
- the tested character may, as in the present case, be located in the text such that some of the pattern positions to which the co ⁇ esponding shift mask relates are positioned to the left of the cu ⁇ ent pattern position and therefore of no relevance; however, since the tested character will be one lying within the cu ⁇ ent text range, there will always be a portion of the shift mask which is relevant to determining the possibility of pattern occu ⁇ ences within the cu ⁇ ent text range.
- the search method therefore now proceeds by taking the relevant portion of the shift mask of the tested character and combining it appropriately with the cu ⁇ ent shift mask so that indications in the two masks relating to the same pattern positions are brought together to generate a new cu ⁇ ent shift mask giving an updated indication on the possibility of pattern occu ⁇ ences at the pattern positions embraced by the cu ⁇ ent shift mask.
- the process of selecting the relevant portion of the shift mask to be combined with the 'cu ⁇ ent shift mask may be viewed in terms of initially aligning the two masks and then leftward shifting the shift mask of the tested character relative to the cu ⁇ ent shift mask until co ⁇ esponding bits in the two masks refer to the possibility of pattern occu ⁇ ence at the same pattern positions.
- the number of mask bits of shift required depend on the position of the text character read within the cu ⁇ ent text range.
- the bits of the shifted shift mask which align with bits of the cu ⁇ ent shift mask form the aforesaid relevant portion of the shift mask.
- co ⁇ esponding bits in the translated mask and cu ⁇ ent shift mask are logically combined to give a new value of the cu ⁇ ent shift mask, a pattern occu ⁇ ence at each pattern position only being possible if it was possible before the most recently text character was read and if the translated mask indicates that the pattern occu ⁇ ence at that pattern position is consistent with the newly known presence of the tested text character in the text.
- the logical combination could be implemented by bitwise ANDing or ORing of the two masks, depending on how the possibility of pattern occu ⁇ ence is coded into binary bit values.
- the new value of the current shift mask might or might not indicate that a pattern occu ⁇ ence at the current pattern position is possible.
- the search method proceeds to use the pattern test sequence to access further text characters for testing as described above. If all text characters in the cu ⁇ ent text range are tested and a pattern occu ⁇ ence at the cu ⁇ ent pattern position is still indicated as being possible in the current shift mask, then a pattern occu ⁇ ence at the cu ⁇ ent pattern position has indeed been found, and the cu ⁇ ent pattern position is returned. If the cu ⁇ ent shift mask indicates that a pattern occu ⁇ ence at the cu ⁇ ent pattern position is not possible at any stage before pattern occu ⁇ ence is confirmed, then the search method proceeds to test for pattern occu ⁇ ence at the next possible pattern position.
- the next possible pattern position is indicated by the leftmost bit of the cu ⁇ ent shift mask that indicates possible pattern occu ⁇ ence.
- the position of this leftmost bit within the cu ⁇ ent shift mask indicates the number of characters by which the pattern has to be shifted to the right along the text (called the "pattern shift" here).
- Figure 4 illustrates the way in which the search method uses the cu ⁇ ent shift mask.
- the pattern 12 is cu ⁇ ently at position 13 in the text 11.
- CSM cu ⁇ ent shift mask
- text character 41 is tested first, and is found to be an "A”.
- the shift mask 31 for A is retrieved and has to be shifted two bits to the left to reflect the position of character 41 as being two characters to the left of the rightmost character in the cu ⁇ ent text range.
- the resultant translated mask 51 has its two rightmost bits (shown bracketed) set to indicate possible pattern occu ⁇ ences at the co ⁇ esponding pattern positions.
- the initial cu ⁇ ent shift mask 50 and the translated mask 51 are now combined to form a final cu ⁇ ent shift mask 52 for the character 41 test.
- the leftmost bit of mask 52 indicates that the cu ⁇ ent pattern position is a possible position of a pattern occu ⁇ ence (because text character 41 matched the co ⁇ esponding character in the pattern 12); in other words, the pattern shift is found to be zero.
- the search is therefore continued without shifting the pattern position, the next step being to test the next text character in the pattern test sequence - character 42 in this example.
- the initial cu ⁇ ent shift mask 53 present at the start of character 42 testing is the same as the final cu ⁇ ent shift mask 52 for character 41 testing.
- text character 42 is found to be a "C”
- the shift mask 33 for C needs to be shifted by zero bits to form the co ⁇ esponding translated mask 54 (this is because text character 42 is zero bits to the left of the rightmost character in the cu ⁇ ent text range - it is the rightmost character).
- the initial cu ⁇ ent shift mask 53 is then combined with the translated mask 54, marking pattern positions as possible positions of pattern occu ⁇ ences only when both masks 33 and 43 indicated that the position was a possible pattern occu ⁇ ence.
- the result is the new cu ⁇ ent shift mask 55 (the final CSM for character 42 testing).
- This cu ⁇ ent shift mask 55 is then processed to find the value of the pattern shift which is five in this example.
- the cu ⁇ ent pattern position must now be shifted by five to the position of character 41 before testing is continued (it will be seen showing that this is indeed a possible position for a pattern occu ⁇ ence, as both text characters 41 and 42 match the pattern at this new pattern position).
- the cu ⁇ ent shift mask must also be co ⁇ espondingly shifted to form the initial cu ⁇ ent shift mask 56 for the next character test, the appropriate shift being five bits to the left to bring the leftmost bit indicating possible pattern occu ⁇ ence (bit 57) to the leftmost bit position of the cu ⁇ ent shift mask 56.
- the bits 58 shifted into the right hand end of the cu ⁇ ent shift mask are all set to indicate that pattern occu ⁇ ence at the co ⁇ esponding pattern positions is possible, reflecting the fact that nothing is yet known about the co ⁇ esponding parts of the text.
- This shifting procedure preserves all information about possible positions of pattern occu ⁇ ence that is still of potential value - in this example the two bits 59 of the final cu ⁇ ent shift mask 56 indicate two pattern positions at which there cannot be occu ⁇ ences of the pattern in the text.
- the pattern test sequence is re-initialised, and testing continued at the new pattern position beginning by reading the value of the text character that is now aligned with the first pattern character in the pattern test sequence.
- the procedure is now as previously described, the search method terminating either on finding a pattern occurrence or on reaching the end of the text.
- Figures 5, 6 and 7 are flow diagrams illustrating one possible implementation of the substring search method in program form.
- the notation "A ⁇ ⁇ B" is used to indicate that element A has been shifted to the left by B bit positions with the rightmost bits of element A being set to indicate the possibility of pattern occu ⁇ ence ("OK").
- the pattern test sequence (that is, the order in which characters in the cu ⁇ ent text range at a particular pattern position, are tested) is determined by a one dimensional array PATTERN_TEST_SEQUENCE, the value of the nth element of which indicates the location in the cu ⁇ ent text range of the nth character to be tested.
- the location in the cu ⁇ ent text range co ⁇ esponding to the cu ⁇ ent pattern position is assigned the value zero with successive locations having a successively incremented value.
- Figure 5 is a top level flow chart of the substring search method, the first step 61 of this method being to read in the text and pattern.
- the next step 62 is to generate the shift mask table for all the characters making up the text string; this table is constituted by an array CHARMASK Q where CHARMASK [N] represents the shift mask for the character "N".
- the generation of the array CHARMASK ⁇ is shown in more detail in Figure 6 to be described more fully hereinafter.
- step 63 The generation of the shift mask table is followed by a step 63 in which a variable CURRENT_SHIFT_MASK representing the cu ⁇ ent shift mask is initialized with all its bits set to represent possible pattern occu ⁇ ence ("ALL OK"); at the same time a variable TEXT_POINTER representing pattern position is initialized to the start of the text string.
- the search method then enters a double looped program structure, the outer loop of which (including steps 64 and 72) is concerned with advancing the search method along the text, and the inner loop of which (including steps 67 and 70) is concerned with testing for a match at the cu ⁇ ent pattern position. More particularly, step 64 ascertains whether there is text still left to be tested and if not, the result is returned at step 65 that no pattern has been found. However, if text is still left to be tested then a variable PATTERN_POINTER is initialized to one, this variable being used to indicate the number of the character next to be tested relative to how many characters have already been tested at the cu ⁇ ent pattern position. Thereafter, the inner loop concerning testing for a match at the cu ⁇ ent position is entered.
- the first step of the inner loop is to ascertain whether there are pattern positions still left to be tested (step 67); if this is not the case, then this will be because the whole pattern has been tested and found to match so that a result is returned at step 68 that a pattern has been found at the pattern position represented by the variable TEXT_POINTER. However, if there is still pattern left to test, then a next character within the current text range is selected (using the variable PATTERN POINTER to access the pattern test sequence held in the array PATTERN TEST SEQUENCE), is tested and the resultant pattern shift determined, all this being done in step 69. The details of step 69 are shown in Figure 7, to be described more fully hereinafter.
- step 70 If the determined pattern shift is zero (as tested at step 70) then this means that for the tested character, a pattern occu ⁇ ence at the cu ⁇ ent pattern position is possible so that testing can continue at the cu ⁇ ent pattern position; to this end the variable PATTERN_POINTER is incremented to indicate the number of the next character to be tested (relative to how many characters have so far been tested at the cu ⁇ ent pattern position). After the variable PATTERN_POINTER has been incremented, step 67 is re-entered.
- program step 72 is carried out to advance the pattern to a new pattern position by changing the pattern position pointer TEXTJPOINTER and, at the same time, appropriately shifting the contents in the cu ⁇ ent shift mask.
- the search method returns to step 64.
- the first operation is to initialize all entries of the shift mask table array CHARMASK [CHAR] to a "NOT OK" setting, this being effected by steps 80 to 83 (the size of the array being determined by the number of characters in the alphabet from which the text string is composed).
- variables PATTERN_LOCATION and MASK_POSmON are also initialised, the variable PATTERN_LOCATION being set to reference the first pattern character position and the variable MASK_POSITION being set to the pattern length.
- a loop is then entered to generate the shift masks on the basis of the characters in the pattern.
- the first step of this loop is to retrieve the pattern character at the location indicated by the value of the variable PATTERN_LOCATION (step 85).
- the shift mask of the retrieved pattern character is modified by the setting of the bit at position MASK_POSITTON within the mask to "OK" (step 86).
- step 87 If the retrieved character was the last character of the pattern (step 87), generation of the shift mask table is now complete. However, if the end of the pattern has not been reached then the variable PATTERN_LOCA ⁇ ON is incremented and the variable MASKJPOSr ⁇ ON is decremented (step 88) before control is returned to step 85.
- the first step of this process is to set the value of a variable PATTERN_LOCATION to the value of the location of the character in the cu ⁇ ent text string next to be tested, this location value being determined by using the variable PATTERN_POINTER to access the array PATTERN_TEST_SEQUENCE (step 90).
- the identity of the character at the text position offset from the cu ⁇ ent pattern position is assigned to a variable TEXT_CHAR (step 91); this is the text character next to be tested against the pattern by use of the corresponding shift mask.
- the co ⁇ esponding shift mask CHARMASK [TEXT_CHAR] is retrieved and the translated mask appropriate for the position of the character being tested is formed by shifting the retrieved shift mask to the left by a number of bits co ⁇ esponding to the value (PATTERN_LENGTH - PAT ⁇ ERN_LOCA ⁇ ON) with the bits shifted into the righthand end of the mark being set to indicate the possibility of pattern occu ⁇ ence.
- step 93 a new cu ⁇ ent shift mask is formed by combining the old cu ⁇ ent shift mask (CURRENT_SHIFT_MASK) with the translated mask TRANSLATED_MASK).
- step 94 the pattern shift value SHIFT is determined by examining the newly formed cu ⁇ ent shift mask to ascertain the distance from the left of the mask of the first "OK” position. If there are no "OK" bits in the cu ⁇ ent shift mask, then SHIFT is set to the value of PATTERN_LENGTH.
- the pattern testing order which is the order in which the text characters within the cu ⁇ ent text range are selected for reading, can be precomputed so as to obtain the maximum statistically expected pattern shift, considering the relative frequency of occu ⁇ ence of the different pattern characters in normal usage of the alphabet.
- the relative positions of repeated characters within the pattern string can also be considered to obtain a more precise optimisation.
- the pattern testing order could be dynamically optimised according to the pattern positions that are still possible positions for pattern occu ⁇ ences at any stage. This optimisation may be too complex to be of practical value however.
- the pattern testing order can be determined in any way which provides a suitable performance for the particular application of the searching method.
- a record of text characters already tested could be maintained so as to avoid retesting them.
- a text character that has still to be tested according to the pattern testing sequence at the current pattern position might have been tested at a previous pattern position. If so, that text character must match the co ⁇ esponding pattern character at the cu ⁇ ent pattern position, and so the text character does not have to be retested.
- the record of text characters already tested could take the form of a mask with a bit to indicate whether each character in the cu ⁇ ent text range had already been read. This mask would need to be shifted whenever the pattern and hence the cu ⁇ ent text range was shifted, and the appropriate bit of the mask would need to be set whenever a text character was read.
- the various masks are most efficiently held in a single computer word each. This would imply that the masks have a limited length, say W bits (32 bits corresponding to patterns of up to 32 characters would be a common value for W using present computer technology.) Longer patterns might be handled either by using multiple computer words for each mask, or by maintaining masks that only represent the leftmost W characters at any pattern position and so only allow the pattern to be shifted by a maximum of W bits over the text between accesses to the text.
- the pattern shift can be obtained from the cu ⁇ ent shift mask in various ways. If the search method of the invention was embodied in electronic hardware then dedicated circuitry could be used. If the search method was implemented in computer software then either a precomputed shift table can be used to relate the contents of the cu ⁇ ent shift mask (or parts of the contents of the cu ⁇ ent shift mask) to pattern shift values, or alternatively a binary interpolation method could be used to find the co ⁇ ect shift value. Both of these techniques and others are well known in the prior art, indeed, the instruction set of some computers includes an instruction that allows the position of the leftmost one bit in a word to be determined directly.
- the substring searching method can be used where the given pattern includes a wildcard indicator indicating that the character at the co ⁇ esponding position in the pattern may be any of a group of characters (which could be the whole character set or a specified subset of characters); in this case, the matching of character items with characters of said pattern will take account of which characters are included in said group.
- the method can also be used where the pattern includes a wildcard indicator indicating that the pattern includes an unspecified segment of one or more characters in length; in this case, the method of the invention would first be used to locate the pattern portion prefixing said segment and then implemented again to locate the pattern portion following the segment, this second search of the method being started from the end of the substring located by the first search.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Controls And Circuits For Display Device (AREA)
Abstract
On décrit une méthode de recherche de sous-chaînes pour localiser, dans une chaîne de caractères (11), une sous-chaîne qui correspond à une configuration donnée (12) composée d'un ou de plusieurs caractères composant la chaîne, ladite configuration donnée étant établie à partir du même jeu de caractères. La méthode est basée sur une phase préliminaire dans laquelle on prépare un relevé des correspondances possibles (31) pour chaque caractère du jeu de caractères. Pour chacune des dernières positions N obtenues au fur et à mesure que la configuration donnée est rapprochée théoriquement du caractère associé au relevé, dans le but d'aligner le début de la configuration donnée avec le caractère, ledit relevé indique l'existence d'une correspondance avec le caractère correspondant de la configuration donnée. Dans une phase ultérieure de la méthode de recherche de sous-chaînes, la configuration donnée (12) est théoriquement déplacée le long de la chaîne en cours d'examen (11) et à chaque position un test est effectué, pour un caractère à la fois, pour déterminer si la sous-chaîne alignée avec la configuration donnée (12) correspond à cette dernière. Lorsqu'un test de caractère indique une non correspondance, la configuration donnée est déplacée pour occuper une nouvelle position. Le test de détection d'une correspondance de caractères est réalisé en combinant le relevé des correspondances éventuelles (31) d'un caractère de la sous-chaîne en cours d'examen. Lorsqu'on utilise un masque de déplacement courant (50) qui indique l'état actuel de connaissances concernant les positions de correspondance possibles, cette opération de combinaison requiert la formation d'un masque intermédiaire (51) résultant d'une traduction. Le nouveau masque de déplacement (52) ainsi obtenu indique non seulement la possibilité d'une correspondance à la position actuelle compte tenu de la présence du caractère soumis à l'examen, mais aussi le déplacement nécessaire en cas d'indication d'une non correspondance.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9103937.0 | 1991-02-26 | ||
GB919103937A GB9103937D0 (en) | 1991-02-26 | 1991-02-26 | Substring search method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1992015067A1 true WO1992015067A1 (fr) | 1992-09-03 |
Family
ID=10690561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB1992/000333 WO1992015067A1 (fr) | 1991-02-26 | 1992-02-24 | Methode de recherche de sous-chaines |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB9103937D0 (fr) |
WO (1) | WO1992015067A1 (fr) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6718519B1 (en) | 1998-12-31 | 2004-04-06 | International Business Machines Corporation | System and method for outputting character sets in best available fonts |
US6760887B1 (en) | 1998-12-31 | 2004-07-06 | International Business Machines Corporation | System and method for highlighting of multifont documents |
US6785677B1 (en) * | 2001-05-02 | 2004-08-31 | Unisys Corporation | Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector |
US6813747B1 (en) | 1998-12-31 | 2004-11-02 | International Business Machines Corporation | System and method for output of multipart documents |
US7031002B1 (en) | 1998-12-31 | 2006-04-18 | International Business Machines Corporation | System and method for using character set matching to enhance print quality |
US7039637B2 (en) * | 1998-12-31 | 2006-05-02 | International Business Machines Corporation | System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search |
US7191114B1 (en) | 1999-08-27 | 2007-03-13 | International Business Machines Corporation | System and method for evaluating character sets to determine a best match encoding a message |
CN104750846A (zh) * | 2015-04-10 | 2015-07-01 | 浪潮集团有限公司 | 一种子串查找方法及装置 |
EP1330740B1 (fr) * | 2000-09-29 | 2019-05-08 | Atos IT Solutions and Services GmbH | Procede d'acces a une unite de memoire lors de la recherche de suites partielles de caracteres, et unite de memoire correspondante |
WO2020013881A1 (fr) * | 2018-07-10 | 2020-01-16 | Didi Research America, Llc | Reconnaissance d'expression utilisant un saut de caractères |
US11163948B2 (en) | 2018-07-10 | 2021-11-02 | Beijing Didi Infinity Technology And Development Co., Ltd. | File fingerprint generation |
US11250131B2 (en) | 2019-12-19 | 2022-02-15 | Beijing Didi Infinity Technology And Development Co., Ltd. | Multi-purpose agent for endpoint scanning |
US11557141B2 (en) | 2019-12-19 | 2023-01-17 | Beijing Didi Infinity Technology And Development Co., Ltd. | Text document categorization using rules and document fingerprints |
-
1991
- 1991-02-26 GB GB919103937A patent/GB9103937D0/en active Pending
-
1992
- 1992-02-24 WO PCT/GB1992/000333 patent/WO1992015067A1/fr active Application Filing
Non-Patent Citations (3)
Title |
---|
COMMUNICATIONS OF THE ACM vol. 20, no. 10, October 1977, pages 762 - 772; R.S. BOYER ET AL.: 'A fast string searching algorithm' cited in the application * |
COMMUNICATIONS OF THE ACM vol. 33, no. 8, August 1990, pages 132 - 142; D.M. SUNDAY: 'A very fast substring search algorithm' cited in the application * |
SIAM J COMPUT vol. 6, no. 2, June 1977, pages 323 - 350; D.E. KNUTH ET AL.: 'Fast pattern matching in strings' cited in the application * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7039637B2 (en) * | 1998-12-31 | 2006-05-02 | International Business Machines Corporation | System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search |
US6760887B1 (en) | 1998-12-31 | 2004-07-06 | International Business Machines Corporation | System and method for highlighting of multifont documents |
US6718519B1 (en) | 1998-12-31 | 2004-04-06 | International Business Machines Corporation | System and method for outputting character sets in best available fonts |
US6813747B1 (en) | 1998-12-31 | 2004-11-02 | International Business Machines Corporation | System and method for output of multipart documents |
US7031002B1 (en) | 1998-12-31 | 2006-04-18 | International Business Machines Corporation | System and method for using character set matching to enhance print quality |
US7191114B1 (en) | 1999-08-27 | 2007-03-13 | International Business Machines Corporation | System and method for evaluating character sets to determine a best match encoding a message |
EP1330740B1 (fr) * | 2000-09-29 | 2019-05-08 | Atos IT Solutions and Services GmbH | Procede d'acces a une unite de memoire lors de la recherche de suites partielles de caracteres, et unite de memoire correspondante |
US6785677B1 (en) * | 2001-05-02 | 2004-08-31 | Unisys Corporation | Method for execution of query to search strings of characters that match pattern with a target string utilizing bit vector |
CN104750846A (zh) * | 2015-04-10 | 2015-07-01 | 浪潮集团有限公司 | 一种子串查找方法及装置 |
CN104750846B (zh) * | 2015-04-10 | 2017-12-08 | 浪潮集团有限公司 | 一种子串查找方法及装置 |
WO2020013881A1 (fr) * | 2018-07-10 | 2020-01-16 | Didi Research America, Llc | Reconnaissance d'expression utilisant un saut de caractères |
US10956669B2 (en) | 2018-07-10 | 2021-03-23 | Beijing Didi Infinity Technology And Development Co., Ltd. | Expression recognition using character skipping |
US11163948B2 (en) | 2018-07-10 | 2021-11-02 | Beijing Didi Infinity Technology And Development Co., Ltd. | File fingerprint generation |
US11250131B2 (en) | 2019-12-19 | 2022-02-15 | Beijing Didi Infinity Technology And Development Co., Ltd. | Multi-purpose agent for endpoint scanning |
US11557141B2 (en) | 2019-12-19 | 2023-01-17 | Beijing Didi Infinity Technology And Development Co., Ltd. | Text document categorization using rules and document fingerprints |
Also Published As
Publication number | Publication date |
---|---|
GB9103937D0 (en) | 1991-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5664184A (en) | Method and apparatus for implementing Q-trees | |
Al-Suwaiyel et al. | Algorithms for trie compaction | |
WO1992015067A1 (fr) | Methode de recherche de sous-chaines | |
JP3540109B2 (ja) | データ圧縮方法及び装置 | |
US6115802A (en) | Efficient hash table for use in multi-threaded environments | |
Seward | On the performance of BWT sorting algorithms | |
US5357431A (en) | Character string retrieval system using index and unit for making the index | |
US5613145A (en) | Stored string data with element data units and pointer data units in distinct subranges of values | |
US7783855B2 (en) | Keymap order compression | |
US6070164A (en) | Database method and apparatus using hierarchical bit vector index structure | |
EP0268373B1 (fr) | Méthode et dispositif pour déterminer une adresse de banque de données | |
EP0250705B1 (fr) | Procédé et dispositif de recherche de chaînes de symboles dans des données | |
Chang | The study of an ordered minimal perfect hashing scheme | |
US6510505B1 (en) | System and method for allocating storage space using bit-parallel search of bitmap | |
US4084260A (en) | Best match content addressable memory | |
EP0677927A2 (fr) | Compression par comparaision de suites de caractères qui utilise un nombre de cycles minimal par caractère | |
EP0688104A2 (fr) | Méthode et dispositif de compression de données | |
WO1995034155A2 (fr) | Procede pour enregistrer et extraire des donnees, et systeme de memoire | |
US4896133A (en) | Parallel string processor and method for a minicomputer | |
US5727204A (en) | Database organization for rapid multi-set membership determination | |
US5081608A (en) | Apparatus for processing record-structured data by inserting replacement data of arbitrary length into selected data fields | |
US6718317B1 (en) | Methods for identifying partial periodic patterns and corresponding event subsequences in an event sequence | |
US5226148A (en) | Method and apparatus for validating character strings | |
Kim et al. | A compact index for Cartesian tree matching | |
US5398335A (en) | Virtually updating data records by assigning the update fractional addresses to maintain an ordinal relationship without renumbering original records |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB GR IT LU MC NL SE |
|
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
122 | Ep: pct application non-entry in european phase |