Character string matching method, system, storage medium and device
Technical Field
The invention relates to the fields of a character string matching algorithm, deep data packet detection, application traffic identification and network security, in particular to a character string matching method, a system, a storage medium and a device based on ranking.
Background
The character string matching algorithm is widely applied to network equipment based on deep packet inspection, such as an intrusion detection system, flow monitoring and the like. The algorithm finds out all matched characteristic character string rules by matching the data packet content with a characteristic character string rule set. With the increasing number of characteristic string rules, the string matching algorithm has become the performance bottleneck of network devices, and it is difficult to meet the performance and scalability requirements of deep packet inspection.
The Aho-Corasick algorithm (AC algorithm) is a multi-string matching algorithm which is most widely applied at present, a Deterministic Finite Automaton (DFA) is adopted to represent a group of characteristic strings, and the DFA state transition is utilized to detect whether input strings are matched or not. FIG. 2 is a DFA of a string set { eat, tea, are } and its state transition table. In the DFA on the left side of fig. 2, each circle represents a state, and each arrow represents a transition edge, i.e., a transition to a destination state when the current state reads in a character. The DFA initial state is 0 and the matching states are 3, 6 and 9, i.e. when migrating to the matching state, the corresponding matching character string is output. Note that: the DFA of fig. 2 omits the failing migration edge that migrates to the state with tree depth 0 and the restarting migration edge that migrates to the state with tree depth 1. The right side of fig. 2 is a state transition table corresponding to the DFA, wherein the first row represents the source state {0,1, …,9}, the first column represents the input characters { a, e, …, z }, and the other entries represent destination states to which the source state transits when reading one character. The DFA-based string matching process is: setting the current state as an initial state 0, reading in a character, migrating from the current state to a target state according to a DFA state migration table, and setting the target state as the current state; and sequentially reading each character string in the current state, migrating each character string to a corresponding target state, and outputting a corresponding matched character string if the current state is a matched state. For example, in the DFA of FIG. 2, when string { abreatea } is read in, starting from initial state 0, the state transitions performed by the DFA are 0- >7- >0- >0- >1- >2- >3- >5- >6, where matching states 3 and 6 output matching strings eat and tea.
AC algorithms face scalable challenges with large space overhead. The AC algorithm needs to store M × N migration edges, where M is the total number of DFA states and N is the DFA alphabet size (i.e., the number of unique characters). With the increasing and increasing complexity of the number of the rules of the characteristic character strings, the total number of DFA states constructed according to the characteristic character strings is exponentially and explosively increased, so that the storage cost of a DFA space is rapidly expanded. On the other hand, the space of a fast memory (such as a Static Random Access Memory (SRAM)) of the current network equipment is limited, the whole DFA state transition table of a large-scale string rule is difficult to store, and the string matching performance is limited.
The existing string matching algorithms also include a DFA compression algorithm D2FA based on default migration edges and a DFA compression algorithm Δ FA based on neighboring state migration edge merging.
D2FA is a DFA compression algorithm based on default migration edges. In D2FA, each state includes a default migration edge, which points to the default migration state, and each state records only the migration edge whose destination state is different from the default migration state, thereby significantly reducing the number of redundant migration edges. If the current state does not find the transition edge corresponding to the read-in character, D2FA uses the default transition edge to transition to the designated default state until the target state matching the read-in character is found. FIG. 3 is a DFA and its state transition table for the D2FA algorithm, where the DFA red edge is the default transition edge.
Δ FA is a DFA compression algorithm based on adjacent state transition edge merging. In Δ FA, each state holds only the transition edges of the different destination states from the neighboring states. When matching, the delta FA maintains a transition table of the current state for each input character, and records all transition edges of the current state. And according to the transition table of the current state, the delta FA transitions to the target state, and the transition edge of the target state, which is different from the transition edge of the adjacent state, is adopted to update the transition of the next state. Fig. 4 is a DFA of the Δ FA algorithm and its state transition table. FIG. 5 is a Δ FA string matching process and its migration table updates.
With the rapid increase of network bandwidth and the increasing of feature rules, the existing character string matching algorithm has the problems of large space overhead, long construction time, low matching throughput and the like. The AC algorithm spatial complexity is O (M N), where M represents the total number of DFA states and N represents the number of unique characters in the alphabet. As the number of string rules increases, the total number of DFA states (M) grows explosively, resulting in a dramatic increase in DFA memory overhead, beyond the existing fast memory capacity of the network device. The existing string matching compression algorithms D2FA and Δ FA reduce the DFA space overhead, but have the following problems: d2FA calculates the time complexity of the default migration edge as O (M2logM), the DFA of a large-scale character string set is difficult to process, in the matching process, when one character is read in each state, the default migration edge may need to be searched for multiple times, and the matching throughput is low; each state transition of the Δ FA requires updating the transition table of the next state once, resulting in low string matching throughput. Therefore, the existing string matching algorithm cannot simultaneously meet the requirements of small string matching storage space, short construction time and high matching throughput.
Disclosure of Invention
The invention aims to solve the problems of large space overhead, long construction time, low matching throughput and the like of the conventional string matching algorithm, provides a string matching method, a system, a storage medium and a device based on ranking, is generally called as a string matching algorithm RDFA, not only reduces the space overhead of DFA (distributed feedback architecture) storage, but also improves the construction time and the matching throughput of a DFA compression algorithm.
Aiming at the defects of the prior art, the invention provides a character string matching method, which comprises the following steps:
step 1, constructing a corresponding DFA state transition table according to a preset characteristic character string set, searching for a transition edge with the same number and the largest number of target states by traversing all source states of each input character in the state transition table, and storing the transition edge in a global transition table;
step 2, constructing a local migration table for each state in the state migration table, wherein the local migration table stores a migration edge different from a target state in the global migration table, and constructs a bit map for each state, each bit in the bit map is used for recording whether the target state of the migration edge of the local migration table is the same as the target state of the global migration table, if so, the bit is set to be '0', otherwise, the bit is set to be '1';
step 3, acquiring a character string to be matched, sequentially using the characters in the character string to be matched as characters to be matched, judging whether a bit corresponding to the character to be matched in the bit map is 1 or not by inquiring the bit map of the current state, counting the number of 1 before the corresponding bit if the bit is 1, using the number as the position of a target state in a local migration table, inquiring the local migration table according to the position to obtain the target state, and otherwise, directly inquiring the global migration table to obtain the target state;
and 4, outputting the matching character strings corresponding to all the target states as the matching result of the character strings to be matched.
The character string matching method, wherein the step 1 comprises:
step 11, setting the number k of the currently constructed global migration table as 1, and setting the number of the global migration table as n, wherein n is a positive integer;
step 12, traversing all source states aiming at each input character in the state transition table, finding the transition edges with the same number and the maximum number of the target states, and recording the input character and the target states in a global transition table with k being 1;
step 13, constructing a first bit map, wherein each bit in the bit map is used for recording whether the target state corresponding to the input character is consistent with the target state in the global migration table with k being 1, and if so, clearing the bit; otherwise, the bit is "1";
step 14, adding 1 to the number k of the currently constructed global migration table;
step 15, if the number of the constructed global migration tables is smaller than the preset number n, turning to step 16, otherwise, turning to step 28;
step 16, traversing all the source states aiming at each input character, finding the k-th and most-th transition edges with the same number of the target states, and recording the input character and the target states in a k-th global transition table;
step 17, constructing a kth bit map for each state, wherein each bit in the kth bit map is used for recording whether a target state corresponding to an input character of 1 in the kth-1 bit map is consistent with a target state in the kth global migration table, if so, the bit is cleared, otherwise, the bit is set to be 1; go to step 14;
the step 2 comprises the following steps: and step 28, building a local migration table for each state, wherein the local migration table sequentially places migration edges different from the target states in all the global migration tables.
The character string matching method, wherein the step 3 comprises:
step 31, initializing the sequence number k of the currently inquired bitmap to 1, wherein each state has n bitmaps;
step 32, if k is larger than n, go to step 36, otherwise go to step 33;
step 33, inquiring the kth bit map of the current state, and judging whether the bit corresponding to the character to be matched is 1 at present;
step 34, if the bit is 1, k +1 and go to step 32; if the bit is 0, the destination state is recorded in the kth global migration table, and go to step 35;
step 35, inquiring the kth global migration table to obtain the target state, and going to step 38;
step 36, calculating the number of '1' before the bit corresponding to the input character in the kth bit map, thereby obtaining the position of the target state in the current state local migration table;
step 37, inquiring a local migration table of the current state according to the position to obtain the target state;
step 38, the destination state is updated to the current state, the next input character is read in, and the process goes to step 31.
The character string matching method comprises the following steps: and the network system receives the data packet through the I/O equipment and obtains the character string to be matched of the data packet according to the analysis of the network protocol.
The invention also provides a character string matching system, which comprises:
the method comprises the steps that a module 1 constructs a corresponding DFA state transition table according to a preset characteristic character string set, searches for transition edges with the same number and the largest number of target states by traversing all source states of each input character in the state transition table, and stores the transition edges in a global transition table;
module 2, building a local migration table for each state in the state migration table, where the local migration table stores a migration edge different from a destination state in the global migration table, and building a bitmap for each state, where each bit in the bitmap is used to record whether the destination state of the migration edge of the local migration table is the same as the destination state of the global migration table, and if so, setting the bit to "0", otherwise, setting the bit to "1";
the module 3 acquires a character string to be matched, sequentially uses the characters in the character string to be matched as characters to be matched, judges whether a bit corresponding to the character to be matched in the bit map is 1 or not by inquiring the bit map of the current state, counts the number of 1 before the corresponding bit if the bit is 1, uses the bit as the position of a target state in a local migration table, inquires the local migration table according to the position to obtain the target state, and directly inquires the global migration table if the bit is not 1;
and the module 4 outputs the matching character strings corresponding to all the target states as the matching result of the character strings to be matched.
The character string matching system, wherein the module 1 comprises:
the module 11 sets the number k of the currently constructed global migration table to be 1, sets the number of the global migration table to be n, and sets n to be a positive integer;
the module 12 traverses all source states for each input character in the state transition table, finds the transition edges with the same number as the destination state and the maximum number, and records the input character and the destination state in the global transition table with k being 1;
the module 13 constructs a first bitmap, each bit in the bitmap is used for recording whether the destination state corresponding to the input character is consistent with the destination state in the global migration table with k being 1, and if so, the bit is cleared; otherwise, the bit is "1";
module 14, adding 1 to the number k of the currently constructed global migration table;
if the number of the constructed global migration tables is smaller than the preset number n, the module 15 turns to the module 16, otherwise, the module 28 turns to;
module 16, for each input character, traversing all source states, finding the k-th most migration edges with the same number of destination states, and recording the input character and the destination state in the k-th global migration table;
a module 17, constructing a kth bit map for each state, where each bit in the kth bit map is used to record whether a target state corresponding to an input character of 1 in the kth-1 bit map is consistent with a target state in the kth global migration table, and if so, clearing the bit, otherwise, setting the bit to be 1; turning to the module 14;
the module 2 comprises: module 28 builds a local migration table for each state, where the local migration table sequentially places a migration edge different from the destination state in all global migration tables.
The character string matching system, wherein the module 3 comprises:
the module 31 initializes the sequence number k of the currently inquired bitmap to 1, and each state has n bitmaps;
block 32, if k is greater than n, go to block 36, otherwise go to block 33;
the module 33 queries the kth bit map of the current state, and determines whether the bit corresponding to the current character to be matched is 1;
block 34, if the bit is 1, then k +1 goes to block 32; if the bit is 0, the destination state is recorded in the kth global migration table, and go to the module 35;
module 35, query the kth global migration table to obtain the destination state, and go to module 38;
module 36, calculate the number of "1" before the corresponding bit of the input character in the kth bit map, and thus obtain the position of the target state in the current state local migration table;
module 37, query the local migration table of the current state according to the position to obtain the destination state;
block 38, the destination status is updated to the current status, the next input character is read in, and the process goes to block 31.
The character string matching system, wherein the system for acquiring the character string to be matched comprises: and the network system receives the data packet through the I/O equipment and obtains the character string to be matched of the data packet according to the analysis of the network protocol.
The present invention also proposes a storage medium in which a program for executing the character string matching method of any one of claims 1 to 4 is stored.
The invention also proposes a data processing device comprising a processing unit and a storage medium as claimed in claim 9, the processing unit calling and executing a program in the storage medium.
According to the scheme, the invention has the advantages that:
1. ranking migration edges of the same target state of the same input character in a DFA state migration table by adopting a ranking idea, and compressing a plurality of migration edges of n before the same number of the migration edges are ranked into n migration edges through a global migration table and a local migration table of each state, so that the cost of a DFA storage space is reduced;
2. the same migration edges of 2 top-ranked bits are compressed, and the maximum benefit is brought on space storage;
3. the bitmap is used for realizing compression of the local migration table, and the bitmap can be used for judging the storage position of the target state when the current state and the input characters are known, and can be used for calculating the offset position of the target state in the local migration table under the condition that the target state is stored in the local migration table, so that the DFA storage space overhead is reduced, and the matching throughput is improved.
The RDFA algorithm uses the global migration table and the local migration table as basic data structures, a plurality of migration edges with the same purpose state and the same purpose state of input characters are compressed into one through the global migration table, and the local migration table stores the migration edges different from those in the global migration table, so that the DFA migration edge number is reduced, and the overall storage cost is reduced.
As shown in fig. 15, compared to the AC algorithm, the RDFA algorithm stores the number of migration edges reduced by more than 92%, the reduction ratio is higher than Δ FA and lower than D2FA and ADFA, and the number of migration edges is reduced more as n is increased. As shown in fig. 16, compared to the AC algorithm, the RDFA algorithm reduces the memory space by 90% or more, and the reduction ratio is higher than Δ FA and lower than D2FA and ADFA, where n is 2, the memory space is reduced most.
When the RDFA algorithm is used for matching, the storage position of the target state is judged by using the bitmap, then the target state is searched in the corresponding migration table according to the judgment result, state migration can be carried out once only by two times of inquiry, and the matching throughput is high.
As shown in fig. 17, the matching throughput of the RDFA algorithm is improved by two orders of magnitude compared to other string matching compression algorithms.
Drawings
FIG. 1 is a flow chart of network system work based on string matching;
FIG. 2 is a DFA and state transition table for the set of strings { eat, tea, are };
FIG. 3 is a DFA and state transition table for the D2FA algorithm;
FIG. 4 is a DFA of the Δ FA algorithm and its state transition table;
FIG. 5 is a diagram of the matching process of the Δ FA algorithm and its migration table update;
FIG. 6 is a diagram of an example data structure of the RDFA algorithm;
FIG. 7 is a pseudo-code diagram of the RDFA algorithm construction process;
FIG. 8 is a flow chart of the matching process of the RDFA algorithm;
FIG. 9 is a pseudo code diagram of the RDFA algorithm matching process;
FIG. 10 is a flow chart of the construction process of the RDFA extension algorithm;
FIG. 11 is a diagram of an example of a data structure of the RDFA extension algorithm;
FIG. 12 is a pseudo-code diagram of the RDFA extension algorithm building process;
FIG. 13 is a flow chart of the matching process of the RDFA extension algorithm;
FIG. 14 is a pseudo code diagram of the RDFA extension algorithm matching process;
FIG. 15 is a graph of the migration edge reduction ratio of a prior art string matching compression algorithm compared to the AC algorithm;
FIG. 16 is a graph of the memory reduction ratio of a prior art string matching compression algorithm compared to the AC algorithm;
FIG. 17 is a graph comparing the matching throughput of the AC algorithm with the existing string matching algorithm;
FIG. 18 is a diagram illustrating an example of the construction process of the RDFA algorithm;
FIG. 19 is a diagram illustrating an exemplary matching process of the RDFA algorithm;
FIG. 20 is a diagram illustrating an example of the construction process of the RDFA extension algorithm;
FIG. 21 is a diagram illustrating an exemplary matching process of the RDFA extension algorithm;
fig. 22 is a diagram of a data processing apparatus.
Detailed Description
Aiming at the problems of large space overhead, long construction time, low matching throughput and the like of the existing string matching algorithm, the invention provides a string matching algorithm RDFA based on ranking, which not only obviously compresses DFA storage space, but also solves the problems of long construction time and low matching throughput of the existing string matching compression algorithm. The RDFA algorithm is applied to network systems based on character string matching, such as an intrusion detection system, an application flow identification system, a Web application firewall and the like. Fig. 1 is a network system workflow based on string matching: firstly, a network system receives a data packet through I/O equipment; secondly, the data packet content is subjected to character string matching through network protocol analysis.
The core idea of the RDFA algorithm is as follows: firstly, constructing a global migration table, and storing migration edges of the same input character in the same target state; secondly, a local migration table of each state is constructed, migration edges of different target states in the global migration table are stored aiming at each input character, and the local migration edge table is further compressed by adopting a bitmap. The complexity of the construction time of the global migration table is O (M multiplied by N), M represents the total number of DFA states, N represents the number of unique characters in an alphabet, and the construction time is shorter than that of the existing algorithm; meanwhile, a large number of redundant migration edges are reduced by the global migration table, so that the DFA storage space is obviously compressed; when matching, the RDFA only needs to search the local migration table and the global migration table of the destination state for each read-in character, thereby improving the string matching throughput.
The RDFA algorithm comprises an RDFA construction process, an RDFA matching process and an RDFA expansion algorithm.
The construction process of the RDFA algorithm comprises the following steps:
step 1: and constructing a DFA state transition table corresponding to the characteristic character string set according to the AC algorithm. The characteristic character string set of the present embodiment is obtained from the existing intrusion detection system snort, but not limited thereto, other detection systems such as suricata and modsecurity may also have their own corresponding rule sets.
Step 2: and traversing all the source states aiming at each input character, finding out the migration edges with the same number and the maximum number of the target states, and storing the migration edges in a global migration table.
And step 3: building a local migration table for each state, wherein the local migration table stores a migration edge different from a target state in the global migration table; and meanwhile, constructing a bit bitmap for each state, wherein each bit is used for recording whether the destination state of the migration edge of the local migration table is the same as that of the global migration table or not, if the destination state is the same, the bit is 0, and if the destination state is not the same, the bit is 1.
Fig. 6 is a data structure example of the RDFA algorithm, which is divided into a global migration table and a local migration table. The global migration table stores the migration edges with the same input character and the same destination state and the largest number. The local migration table stores migration edges different from the destination state in the global migration table, and these different migration edges are recorded with 1 in the bitmap of each state.
FIG. 7 is pseudo code of an RDFA build process, lines 1-4 first part, depicting the build process of a global migration table; lines 5-15 describe the process of building a local migration table for each state from the global migration table, in the second section.
As shown in fig. 8, the matching process of the RDFA algorithm:
step 1: inquiring a bitmap of the current state, and judging whether a bit corresponding to the input character in the bitmap is 1 or not;
step 2: if the bit is 1, the destination state is recorded in the local migration table of the current state, and the step 3 is carried out; if the bit is 0, the destination state is recorded in the global migration table, and the step 5 is carried out;
and step 3: calculating the number of 1 before the corresponding bit of the input character in the bit bitmap, thereby obtaining the position of the target state in the current state local migration table;
and 4, step 4: and 6, inquiring the local migration table of the current state to obtain the target state, and turning to the step 6. One of the transition edges is a line segment with two end points from the head to the tail, the first end point is the current state, and the second end point is the destination state. Referring to fig. 2, a left side 01 of fig. 2 can be regarded as a transition edge, which corresponds to a first state 0 in a first row of the table on the right side, and when a second character e in the first row is read, a destination state 1 in a third row and a second column is jumped to.
And 5: inquiring the global migration table to obtain a target state;
step 6: and updating the target state to the current state, reading the next input character, and turning to the step 1.
FIG. 9 is pseudo code for an RDFA algorithm matching process, lines 2-5 first, describing the process of looking up a destination state in a migration-in-place table; lines 6-8 describe the process of looking up the destination state in the global migration table in the second section.
The RDFA expansion algorithm adopts a plurality of global migration tables to further compress DFA storage space.
As shown in fig. 10, the construction process of the RDFA expansion algorithm:
step 1: setting the number k of the currently constructed global migration table as 1, and setting the number of the global migration table as n.
Step 2: and traversing all the source states aiming at each input character, finding the migration edge with the maximum number of the destination states, and recording the input character and the destination states in a first global migration table.
And step 3: and constructing a first bit bitmap for each state, wherein each bit is used for recording whether the destination state corresponding to different input characters is consistent with the destination state in the first global migration table or not. If the bit is consistent, the bit is cleared; if not, the bit is set to 1.
And 4, step 4: the currently constructed global migration table number k is incremented by 1.
And 5: if the number of the constructed global migration tables is smaller than the preset number n, turning to the step 6; otherwise, go to step 8.
Step 6: for each input character, traversing all source states, finding the k-th and most-th migration edges with the same number of target states, and recording the input character and the target states in a global migration table k;
and 7: and constructing a kth bit bitmap for each state, wherein each bit is used for recording whether a destination state corresponding to the input character with 1 in the kth-1 bit bitmap is consistent with a destination state in the kth global transition table. If the bit is consistent, the bit is cleared; if not, the bit is set to 1. Go to step 4.
And 8: and constructing a local migration table for each state, wherein the local migration table sequentially places migration edges different from the target states in all the global migration tables.
FIG. 11 is an example of a data structure of the RDFA extension algorithm, which is divided into n global migration tables and one local migration table. The global migration table records the destination states corresponding to different input characters, and the kth global migration table is suitable for destination state query that the bit position corresponding to the input character in the kth-1 bit bitmap is 1 and the bit position corresponding to the kth bit bitmap is 0. Each bit of the kth bit map records whether the target state corresponding to the input character of one in the kth-1 th bit map can be checked in the kth global migration table, and if yes, the bit is 0; if not, the bit is 1. The local migration table records destination states inconsistent with all the global migration tables.
FIG. 12 is pseudo code of an RDFA build process, lines 1-6 first part, depicting the build process of n global migration tables; lines 7-23 describe the process of building a local migration table for each state from n global migration tables, in the second section.
As shown in fig. 13, the matching process of the RDFA expansion algorithm:
step 1: the sequence number k of the currently inquired bitmap is initialized to 1, and each state has n bitmaps;
step 2: if k is larger than n, go to step 6; otherwise, turning to the step 3;
and step 3: inquiring the kth bit map of the current state, and judging whether the bit corresponding to the current input character is 1;
and 4, step 4: if the bit is 1, increasing the value of k by 1, and turning to step 2; if the bit is 0, the destination state is recorded in the kth global migration table, and the step 5 is carried out;
and 5: inquiring the kth global migration table to obtain a target state, and turning to the step 8;
step 6: calculating the number of 1 before the bit corresponding to the input character in the kth bit map, thereby obtaining the position of the target state in the current state local migration table;
and 7: inquiring a local migration table of the current state to obtain a target state;
and 8: and updating the target state to the current state, reading the next input character, and turning to the step 1.
FIG. 14 is a pseudo code of the RDFA extension algorithm matching process, and the lines 4-10 describe the process of finding the destination state when the bit corresponding to the current input character in the bitmap is 1; lines 11-13 describe the process of looking up the destination state in the global migration table when the bit corresponding to the currently input character in the bitmap is 0.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
And (4) considering the characteristic character string set as { eat, tea, are }, and performing a character string matching process by using an RDFA algorithm. The construction process of the RDFA is shown in fig. 18. Fig. 18(a) shows a state transition table constructed according to the AC algorithm, where the red part represents the transition edge with the largest number of destination states for the same input character, and states 3, 6, and 9 are matching states, which match eat, tea, and are. Taking the input character a as an example, the destination states of seven states 0, 2, 3, 4, 6, 7 and 8 are 7, the destination states of two states 1 and 9 are 2, and the destination state of one state 5 is 6, so when the input character is a, 7 is written in the global transition table as the destination state, and the same applies to other input characters, and fig. 18(b) shows the constructed global transition table. Then, a local transition table of each state is constructed, taking state 2 as an example, the target state when character a is input in state 2 is 7, the target state is the same as the target state when character a is input in the global transition table, and the first bit of the bitmap is 0; the target state when the character e is input is 1, the target state is the same as the target state when the character e is input in the global migration table, and the second bit of the bitmap is 0; when the character r is input, the target state is 8, which is different from the target state when the character r is input in the global migration table, the third bit of the bitmap is 1, and the state 8 is added into the first bit of the local migration table; when the character t is input, the target state is 3, different from the target state when the character t is input in the global migration table, the fourth bit of the bitmap is 1, and the state 3 is added into the second bit of the local migration table; other input characters are the same. The constructed bitmap and migration-in-place table are shown in fig. 18(c) and 18(d), respectively.
When the input string is { eat }, the RDFA matching process is as shown in FIG. 19. Reading a character e from an initial state 0, wherein a bit corresponding to e in a bit bitmap of the 0 state is 0, a target state exists in a global transition table, the global transition table is searched, a target state 1 corresponding to an input character e is found, and the state 1 is converted into a current state; reading in a character a, wherein a bit corresponding to a in a bitmap of a 1 state is 1, the number of 1 in the bitmap before a is 0, the target state has the first bit of a local migration table, searching the local migration table to obtain a target state 2, and converting the state 2 into the current state; reading a bit corresponding to t in a bitmap of a character t, 2 state, wherein the bit corresponding to t is 1, the number of 1 in the bitmap before t is 1, the destination state has a second bit of the local migration table, searching the local migration table to obtain a destination state 3, and converting the state 3 into the current state; and the state 3 is a matching state, and the matching character string eat is output.
When n is 2, the RDFA expansion algorithm is constructed as shown in fig. 20. When k is 1, the process of constructing the global migration table is the same as that in fig. 18. When k is 2, the blue portion in fig. 20(a) represents the transition edge whose destination state is the same number of times as many as the same input character. Taking an input character a as an example, the destination states of seven states 0, 2, 3, 4, 6, 7 and 8 are 7, the destination states of two states 1 and 9 are 2, and the destination state of one state 5 is 6, so when the input character is a, 2 is written into the second global transition table as the destination state, and the other input characters are the same, and fig. 20(b) shows the second global transition table after being constructed. And then, constructing a second bitmap and a local transition table of each state, taking state 2 as an example, wherein only two bits corresponding to the characters r and t in the first bitmap of state 2 are 1, so that the second bitmap of state 2 records whether the target states corresponding to the characters r and t are consistent with the second global transition table by using only two bits. When the character r is input, the target state is 8, which is the same as the target state when the character r is input in the second global transition table, and the first bit of the second bitmap of the state 2 is 0; the target state when the character t is input is 3, the target state is the same as the target state when the character t is input in the second global transition table, and the second bit of the second bitmap of the state 2 is 0; since the destination states corresponding to all input characters in the state 2 can be found in the global migration table, no information needs to be recorded in the local migration table in the state 2. The constructed bitmap and migration-in-place table are shown in fig. 20(c) and 20(d), respectively.
When the input string is { eat }, the RDFA expansion algorithm matching process of n ═ 2 is as shown in fig. 21. Reading a character e from an initial state 0, wherein a bit corresponding to e in a first bit bitmap of the 0 state is 0, a target state exists in a first global transition table, searching the first global transition table, finding a target state 1 corresponding to the input character e, and converting the state 1 into a current state; reading a character a, wherein a bit corresponding to a in a first bitmap of a state 1 is 1, the number of the 1's before a in the first bitmap is 0, a corresponds to the first bit in a second bitmap, the bit is 0, a target state exists in a second global transition table, searching the second global transition table, finding a target state 2 corresponding to an input character a, and converting the state 2 into the current state; reading a character t, wherein a bit corresponding to t in a first bitmap of a state 2 is 1, the number of 1 bits before t in the first bitmap is 1, t corresponds to a second bit in a second bitmap, the bit is 0, a target state exists in a second global transition table, searching the second global transition table, finding a target state 3 corresponding to the input character t, and converting the state 3 into the current state; and the state 3 is a matching state, and the matching character string eat is output.
The following are system examples corresponding to the above method examples, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments are still valid in this embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a character string matching system, which comprises:
the method comprises the steps that a module 1 constructs a corresponding DFA state transition table according to a preset characteristic character string set, searches for transition edges with the same number and the largest number of target states by traversing all source states of each input character in the state transition table, and stores the transition edges in a global transition table;
module 2, building a local migration table for each state in the state migration table, where the local migration table stores a migration edge different from a destination state in the global migration table, and building a bitmap for each state, where each bit in the bitmap is used to record whether the destination state of the migration edge of the local migration table is the same as the destination state of the global migration table, and if so, setting the bit to "0", otherwise, setting the bit to "1";
the module 3 acquires a character string to be matched, sequentially uses the characters in the character string to be matched as characters to be matched, judges whether a bit corresponding to the character to be matched in the bit map is 1 or not by inquiring the bit map of the current state, counts the number of 1 before the corresponding bit if the bit is 1, uses the bit as the position of a target state in a local migration table, inquires the local migration table according to the position to obtain the target state, and directly inquires the global migration table if the bit is not 1;
and the module 4 outputs the matching character strings corresponding to all the target states as the matching result of the character strings to be matched.
The character string matching system, wherein the module 1 comprises:
the module 11 sets the number k of the currently constructed global migration table to be 1, sets the number of the global migration table to be n, and sets n to be a positive integer;
the module 12 traverses all source states for each input character in the state transition table, finds the transition edges with the same number as the destination state and the maximum number, and records the input character and the destination state in the global transition table with k being 1;
the module 13 constructs a first bitmap, each bit in the bitmap is used for recording whether the destination state corresponding to the input character is consistent with the destination state in the global migration table with k being 1, and if so, the bit is cleared; otherwise, the bit is "1";
module 14, adding 1 to the number k of the currently constructed global migration table;
if the number of the constructed global migration tables is smaller than the preset number n, the module 15 turns to the module 16, otherwise, the module 28 turns to;
module 16, for each input character, traversing all source states, finding the k-th most migration edges with the same number of destination states, and recording the input character and the destination state in the k-th global migration table;
a module 17, constructing a kth bit map for each state, where each bit in the kth bit map is used to record whether a target state corresponding to an input character of 1 in the kth-1 bit map is consistent with a target state in the kth global migration table, and if so, clearing the bit, otherwise, setting the bit to be 1; turning to the module 14;
the module 2 comprises: module 28 builds a local migration table for each state, where the local migration table sequentially places a migration edge different from the destination state in all global migration tables.
The character string matching system, wherein the module 3 comprises:
the module 31 initializes the sequence number k of the currently inquired bitmap to 1, and each state has n bitmaps;
block 32, if k is greater than n, go to block 36, otherwise go to block 33;
the module 33 queries the kth bit map of the current state, and determines whether the bit corresponding to the current character to be matched is 1;
block 34, if the bit is 1, then k +1 goes to block 32; if the bit is 0, the destination state is recorded in the kth global migration table, and go to the module 35;
module 35, query the kth global migration table to obtain the destination state, and go to module 38;
module 36, calculate the number of "1" before the corresponding bit of the input character in the kth bit map, and thus obtain the position of the target state in the current state local migration table;
module 37, query the local migration table of the current state according to the position to obtain the destination state;
block 38, the destination status is updated to the current status, the next input character is read in, and the process goes to block 31.
The character string matching system, wherein the system for acquiring the character string to be matched comprises: and the network system receives the data packet through the I/O equipment and obtains the character string to be matched of the data packet according to the analysis of the network protocol.
The present invention also proposes a storage medium in which a program for executing the character string matching method of any one of claims 1 to 4 is stored.
The present invention also proposes a data processing apparatus including a processing unit and a storage medium according to claim 9, the processing unit calling and executing a program in the storage medium, as shown in fig. 22. The data processing device can also be provided with a storage device, a routing device, an input/output interface, a power supply module and the like according to actual needs.