CN110955855B - Information interception method, device and terminal
- Publication number
- CN110955855B (application CN201811132493.9A)
- Authority
- CN
- China
- Prior art keywords
- information
- category
- level
- data
- character string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The embodiment of the invention provides an information interception terminal that may comprise a processor, a transceiver, a memory, and a plurality of applications, which cause the terminal to perform the following steps: starting a browser to access a webpage; acquiring information of the accessed webpage; matching that information against first data arranged in a tree structure, where the first data is used to determine whether the information of the accessed webpage includes target information; and, when it does, intercepting the target information. In this scheme, the terminal intercepts target information in the browser page using the tree-structured first data. The tree structure divides the character strings in the first data into finely distinguished groups, which effectively reduces the number of matching operations between the information of the accessed webpage and the first data. This avoids the growth in matching operations that occurs when many character strings are used to intercept target information and no systematic matching scheme is applied.
Description
Technical Field
The embodiment of the invention relates to the technical field of webpage analysis and interception, in particular to a method, a device and a terminal for information interception.
Background
With the rapid growth of the internet, more and more web pages have a wide variety of advertisements inserted into them. To spare users the inconvenience these advertisements cause when browsing web pages in a browser, the advertisements in the web pages need to be intercepted.
At present, a user's webpage access request is generally sent to a server for processing. The server caches the webpage content, loads the EasyList rule list, hides the advertisement elements according to that list, and then returns the webpage content with the advertisement elements hidden to the client for display. The EasyList rule list is a set of advertisement-interception rules published by an open-source organization; it comprises a plurality of character strings and defines which elements of a webpage are advertisements and should be intercepted.
Disclosure of Invention
The embodiments of the invention provide an information interception method, device and terminal in which advertisement interception is performed on the terminal and the rule-matching scheme is optimized, so as to solve the problem that the number of matching operations grows because there are many advertisement-interception rules and no systematic matching scheme.
In a first aspect, an embodiment of the present invention provides an information interception terminal, where the terminal may include: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the terminal, cause the terminal to:
Starting a browser to access a webpage;
acquiring information of access to a webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not;
when the information of the access webpage comprises target information, the target information is intercepted.
In this scheme, the terminal intercepts target information in the browser page using the tree-structured first data. The tree structure divides the character strings in the first data into finely distinguished groups, which effectively reduces the number of matching operations between the information of the accessed webpage and the first data. This avoids the growth in matching operations that occurs when many character strings are used to intercept target information and no systematic matching scheme is applied.
In an alternative implementation, the "tree structure" may include:
a plurality of nodes, comprising a root node and at least one level of child nodes, where each level of the at least one level of child nodes comprises at least two child nodes;
the nodes of each level have a parent-child relationship with their associated nodes in the next level, and the first data is distributed over the plurality of nodes of the tree structure according to preset rules.
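The shape described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `Node`, the helper names, the node labels and the sample strings are all hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the tree; `strings` holds the part of the first data stored here."""
    label: str
    strings: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def levels(root: Node) -> List[List[Node]]:
    """Return the child nodes grouped by level (level 1 first), excluding the root."""
    result, current = [], root.children
    while current:
        result.append(current)
        current = [child for node in current for child in node.children]
    return result


def is_well_formed(root: Node) -> bool:
    """Check the stated shape: at least one level, and at least two nodes per level."""
    grouped = levels(root)
    return len(grouped) >= 1 and all(len(level) >= 2 for level in grouped)


# Hypothetical example: two levels below the root, two child nodes in each level.
root = Node("root", children=[
    Node("whitelist", strings=["example.test/adopt"]),
    Node("blacklist", children=[
        Node("positioning", strings=["/ads/"]),
        Node("preset", strings=["ad."]),
    ]),
])
```

Here `is_well_formed(root)` only checks the node counts per level; how strings are assigned to nodes is left to the preset rules described later.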
In another alternative implementation, the terminal may specifically perform the following steps:
matching the information of the accessed webpage level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed webpage includes target information.
Because the information of accessed webpages varies in length, longer information cannot be matched directly in a single comparison. Matching the information level by level allows it to be matched completely, which improves the accuracy of intercepting target information.
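The level-by-level matching described above might look like the following sketch. It assumes, purely for illustration, that each node carries one test applied to the URL and that a URL counts as target information when some root-to-leaf path accepts it; the class `Node`, the function `match_stepwise` and the example tests are hypothetical.

```python
from typing import Callable, List, Optional


class Node:
    """A tree node with a test on the URL and optional child nodes (assumed layout)."""

    def __init__(self, test: Callable[[str], bool],
                 children: Optional[List["Node"]] = None):
        self.test = test
        self.children = children or []


def match_stepwise(url: str, node: Node) -> bool:
    """Descend from parent to child; stop early when a node rejects the URL."""
    if not node.test(url):
        return False   # prune: none of this node's children is examined
    if not node.children:
        return True    # leaf reached with every level on the path satisfied
    return any(match_stepwise(url, child) for child in node.children)


# Hypothetical tree: the root accepts everything, then two branches.
root = Node(lambda u: True, [
    Node(lambda u: "/ads/" in u, [
        Node(lambda u: u.endswith(".js")),
        Node(lambda u: u.endswith(".gif")),
    ]),
    Node(lambda u: "track" in u),
])
```

The point of the descent is that a URL rejected by a parent node is never compared with the strings held in that parent's subtree.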
In still another optional implementation manner, the tree structure may specifically include m-level sub-nodes, where each level of sub-nodes in the m-level sub-nodes is divided according to different preset rules in n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
the j-th level of child nodes is divided according to 1 preset rule selected from f preset rules, where the f preset rules are those of the n preset rules that remain after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level, the j-th level of child nodes is any level among the m levels of child nodes, and j and f are integers greater than or equal to 1;
Each of the n preset rules includes at least two categories of character strings;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m levels of sub-nodes, each sub-node in the m levels of sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
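The selection of one rule per level can be sketched as follows. The sketch assumes that every level uses a different rule, so that f = n - (j - 1) rules remain when level j chooses; taking the rules in list order is an arbitrary choice made here, and the function name and rule labels are hypothetical.

```python
from typing import List, Sequence


def assign_rules(rules: Sequence[str], m: int) -> List[str]:
    """Assign one distinct preset rule to each of m levels (requires 1 <= m <= n)."""
    n = len(rules)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    remaining = list(rules)
    chosen = []
    for j in range(1, m + 1):
        # Before level j chooses, f = n - (j - 1) rules are still available.
        f = len(remaining)
        if f != n - (j - 1):
            raise RuntimeError("bookkeeping error")
        chosen.append(remaining.pop(0))  # assumed policy: take the next rule in order
    return chosen


# Hypothetical labels for the four rules named in the text.
RULES = ["black_white_list", "positioning_preset", "tag_attribute", "character"]
```

With n = 4 rules and m = 3 levels, one rule is left unused, which is permitted because n only has to be greater than or equal to m.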
Because each terminal or operator has different definitions on target information, the method and the device provide various preset rules and categories, and the preset rules can be selected according to requirements, so that the flexibility of the tree structure can be improved, and the method and the device are suitable for more scenes.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules.
In yet another alternative implementation, the "black-and-white list rule" described above may include:
a whitelist category and a blacklist category, where the level-1 child nodes of the m levels of child nodes are divided according to the black-and-white list rule, and the character strings in the first data belonging to the whitelist category and those belonging to the blacklist category each correspond to one of the level-1 child nodes.
In yet another alternative implementation, the terminal may perform the following steps:
matching the information of the accessed webpage against the character strings of the whitelist category, and, when the information of the accessed webpage includes a character string of the whitelist category, determining that the information of the accessed webpage does not include target information.
Some information of an accessed webpage may contain "ad" without being target information for a given operator. Placing such character strings in the whitelist category rules out cases where a string contains "ad" but is not target information (i.e. an advertisement), which improves the accuracy of interception.
In yet another alternative implementation, the terminal may further perform the steps of:
when the information of the access webpage does not comprise the character strings of the category of the white list, matching the information of the access webpage with the character strings of the category of the black list;
when the information of the access webpage does not comprise the character strings of the category of the blacklist, the terminal determines that the information of the access webpage does not comprise the target information;
when the information of the accessed webpage includes a character string of the blacklist category, the terminal matches the information level by level against the child nodes that are in a parent-child relationship with the child node holding the blacklist-category character strings, until matching of the information is determined to be complete, and the terminal then intercepts the target information in the information of the accessed webpage.
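The whitelist-then-blacklist order above can be sketched as a short decision function. Substring containment is assumed as the meaning of "includes a character string", and the deeper levels of the tree are represented by a caller-supplied function; the names `is_target` and `match_deeper` and the sample strings are hypothetical.

```python
from typing import Callable, Iterable


def contains_any(url: str, strings: Iterable[str]) -> bool:
    """True if any of the given strings occurs in the URL."""
    return any(s in url for s in strings)


def is_target(url: str,
              whitelist: Iterable[str],
              blacklist: Iterable[str],
              match_deeper: Callable[[str], bool] = lambda url: True) -> bool:
    """Decide whether a URL is target information (e.g. an advertisement)."""
    if contains_any(url, whitelist):
        return False            # whitelist hit: not target information
    if not contains_any(url, blacklist):
        return False            # no blacklist hit either: not target information
    return match_deeper(url)    # continue with the children of the blacklist node


# Hypothetical sample data: "/adopt" contains "ad" but is not an advertisement.
WHITELIST = ["/adopt"]
BLACKLIST = ["/ad"]
```

Checking the whitelist first means a URL such as `http://pets.test/adopt/cat` is cleared before the blacklist string `/ad` ever gets a chance to match it.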
In yet another alternative implementation, the "positioning and preset matching rule" may specifically include:
a positioning-matching category and a preset-matching category, where the level-2 child nodes of the m levels of child nodes are divided according to the positioning and preset matching rule, the character strings in the first data belonging to the positioning-matching category and those belonging to the preset-matching category each correspond to one of the level-2 child nodes, and every level-2 child node is in a parent-child relationship with the level-1 child node holding the blacklist-category character strings.
In still another optional implementation manner, the "category of location matching" may be used to screen at least one of information that a character string exists at a first preset location or information that a separator exists at a second preset location in information of the access webpage;
the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
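The two categories can be illustrated with four small predicates. The set of separator characters and the function names are assumptions made for this sketch; the patent text does not list the separators or the preset positions.

```python
def string_at(url: str, s: str, pos: int) -> bool:
    """Positioning match: the string s occurs in the URL exactly at index pos."""
    return url.startswith(s, pos)


def separator_at(url: str, pos: int, separators: str = "/?&=:") -> bool:
    """Positioning match: a separator character sits at index pos (assumed set)."""
    return 0 <= pos < len(url) and url[pos] in separators


def has_prefix(url: str, prefix: str) -> bool:
    """Preset match on the beginning of the URL."""
    return url.startswith(prefix)


def has_suffix(url: str, suffix: str) -> bool:
    """Preset match on the end of the URL."""
    return url.endswith(suffix)


# Hypothetical URL used to exercise the predicates.
URL = "http://ads.example.test/banner.gif"
```

For the sample URL, index 7 is the first character after `http://`, so `string_at(URL, "ads.", 7)` tests whether the host begins with `ads.`.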
In yet another alternative implementation, the "tag attribute rule" may specifically include:
a category with a tag and a category without a tag, where the level-3 child nodes of the m levels of child nodes are divided according to the tag attribute rule, the character strings in the first data belonging to the category with a tag and those belonging to the category without a tag each correspond to one of the level-3 child nodes, and every level-3 child node is in a parent-child relationship with one of the level-2 child nodes.
In still another alternative implementation, the "category with a tag" may be used to screen information of the accessed webpage that includes a tag attribute, and the category without a tag is used to screen information of the accessed webpage that does not include a tag attribute; wherein
the category provided with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
Because the information of the access webpage is diversified, the method can provide more possibility and more accurate interception of target information.
In yet another alternative implementation, the "character rule" described above may include:
a first-character-string category and a preset-character-string category, where the level-4 child nodes of the m levels of child nodes are divided according to the character rule, the character strings in the first data belonging to the first-character-string category and those belonging to the preset-character-string category each correspond to one of the level-4 child nodes, and every level-4 child node is in a parent-child relationship with one of the level-3 child nodes.
In yet another alternative implementation, the "category of first string" may be used to screen information of the accessed webpage whose first character is the same as the first character of a character string in that category;
the category of the preset character string is used for screening that the information of the access webpage is the same as the information of the preset character string.
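One way to read the first-character screening above is as an index keyed by first character, so that a piece of webpage information is compared only with the strings that share its first character. This reading, the function names and the sample strings are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_first_char_index(strings: Iterable[str]) -> Dict[str, List[str]]:
    """Group non-empty strings by their first character."""
    index: Dict[str, List[str]] = defaultdict(list)
    for s in strings:
        if s:
            index[s[0]].append(s)
    return index


def candidates(token: str, index: Dict[str, List[str]]) -> List[str]:
    """Return only the strings whose first character equals the token's."""
    return index.get(token[:1], [])


def matches_preset(token: str, index: Dict[str, List[str]]) -> bool:
    """Preset-string match: the token equals one of the candidate strings."""
    return token in candidates(token, index)


# Hypothetical rule strings.
INDEX = build_first_char_index(["ads.js", "analytics.js", "banner.gif", "track.gif"])
```

With this index, a token starting with `b` is compared with one string instead of four, which is the kind of reduction in matching operations the tree structure aims for.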
In yet another alternative implementation, the "information for accessing a web page" may include: the user accesses the URL of the page or the URL of each element of the web page, and the target information is advertisement information.
In still another optional implementation manner, the "first data" is obtained after the server performs tree transformation processing according to second data, where the second data includes an effective string and a custom string of the browser, and the effective string is a string with a usage rate greater than a preset threshold value determined by screening an open source string in an open source website and historical data reported by a terminal in a preset time period.
Because the terminal downloads the first data from the server and the whole matching process is carried out on the terminal, the speed at which the terminal matches information is greatly improved, which solves the prior-art problem that page content can be processed quickly only by a server with higher performance.
In a second aspect, an embodiment of the present invention provides a server for data processing, including: one or more processors, a transceiver, a memory, a plurality of applications, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the server, cause the server to perform the steps of:
performing tree transformation processing on the second data to determine first data;
the server transmits the first data to the terminal so that the terminal can determine whether the access webpage contains target information or not.
In this scheme, performing tree transformation processing on the second data converts its character strings into a tree structure that distinguishes them finely, which effectively reduces the number of matching operations between the information of the accessed webpage and the first data.
In an alternative implementation, the "target information" may be advertisement information;
the information for accessing the web page includes: the user accesses at least one of a URL of the page or a URL of each element of the web page.
In another alternative implementation, the server may perform the following steps:
periodically acquiring at least one open-source character string from an open-source website;
selecting, as effective character strings, a plurality of character strings whose access count is greater than a first threshold, from the at least one open-source character string and the historical data reported by clients within a preset time period;
acquiring a custom character string of a browser server;
and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
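The server-side steps above can be sketched as follows. The sketch assumes the reported historical data is a flat list of rule strings that were hit on clients, and that "access count" means the number of occurrences in that list; the function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_second_data(open_source: Iterable[str],
                      reported_hits: Iterable[str],
                      custom: Iterable[str],
                      threshold: int) -> List[str]:
    """Keep open-source strings used more than `threshold` times, then add custom ones."""
    usage = Counter(reported_hits)
    effective = [s for s in open_source if usage[s] > threshold]
    # Merge with the custom strings, dropping duplicates while preserving order.
    return list(dict.fromkeys(effective + list(custom)))


# Hypothetical inputs.
OPEN_SOURCE = ["/ads/", "/banner/", "/popup/"]
REPORTED = ["/ads/"] * 5 + ["/banner/"]
CUSTOM = ["/promo/", "/ads/"]

SECOND_DATA = build_second_data(OPEN_SOURCE, REPORTED, CUSTOM, threshold=2)
```

Strings that clients rarely or never hit are dropped, so the tree built from the second data stays small.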
Because browser servers generally apply different criteria (the same information may be defined as advertisement information by website A but not by website B), the custom character strings of the browser server are added when the second data is generated, so that the second data used for matching is flexible and widely applicable.
In yet another alternative implementation, the server may perform the following steps:
dividing the plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different;
each of n preset rules respectively comprises at least two categories of character strings, and each layer in m levels is divided into at least two child nodes according to the categories of the character strings;
The second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the kth level child nodes has a father-son relationship with one child node in the k-1 level, the k level child nodes are any one level child node in the m level child nodes, and k is an integer greater than or equal to 1.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules;
the server performs the steps of:
dividing the plurality of sub-nodes into m-level sub-nodes according to a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule and a character rule.
In yet another alternative implementation, the server may specifically perform the following steps:
when the black-and-white list rule includes the category of the white list and the category of the black list, dividing the 1 st level of m level sub-nodes into two sub-nodes according to the category of the white list and the category of the black list, wherein one sub-node of the two sub-nodes includes a character string belonging to the category of the white list in the second data, and the other sub-node includes a character string belonging to the category of the black list in the second data.
In yet another alternative implementation, the server may specifically perform the following steps:
when the locating and preset matching rules comprise locating matching categories and preset matching categories, dividing the 2 nd level of m level sub-nodes into two sub-nodes according to the locating matching categories and the preset matching categories, wherein one sub-node of the two sub-nodes comprises character strings belonging to the locating matching categories in the second data, and the other sub-node comprises character strings belonging to the preset matching categories in the second data, and the two sub-nodes in the 2 nd level are in a father-son relationship with the node where the character strings belonging to the blacklist categories in the 1 st level are located.
In yet another alternative implementation, the server may specifically perform the following steps:
when the tag attribute rule comprises a category with a tag and a category without a tag, dividing the 3rd level of the m levels of child nodes into two child nodes according to the category with a tag and the category without a tag, wherein one of the two child nodes comprises the character strings in the second data belonging to the category with a tag, the other child node comprises the character strings in the second data belonging to the category without a tag, and every child node in the 3rd level is in a parent-child relationship with one of the level-2 child nodes.
In yet another alternative implementation, the "tagged category" described above may include: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
In yet another alternative implementation, the server may specifically perform the following steps:
when the character rule comprises the category of the first character string and the category of the preset character string, dividing the 4 th level of m level sub-nodes into two sub-nodes according to the category of the first character string and the category of the preset character string, wherein one sub-node of the two sub-nodes comprises the character string belonging to the category of the first character string in the second data, the other sub-node comprises the character string belonging to the category of the preset character string in the second data, and any sub-node in the 4 th level and one sub-node in the 3 rd level sub-node are in a father-son relationship.
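The tree transformation of the second data can be sketched as repeated partitioning, one classifier per level. This is a simplification: it applies every level to every branch, whereas the text only subdivides below the blacklist node. The classifiers, category names and sample strings are hypothetical and only stand in for the preset rules.

```python
from typing import Callable, Dict, List, Sequence, Union

Classifier = Callable[[str], str]
Tree = Union[List[str], Dict[str, "Tree"]]


def build_tree(strings: Sequence[str], classifiers: Sequence[Classifier]) -> Tree:
    """Split the strings by the first classifier, then recurse with the rest."""
    if not classifiers:
        return list(strings)          # leaf: the strings stored at this node
    first, rest = classifiers[0], classifiers[1:]
    groups: Dict[str, List[str]] = {}
    for s in strings:
        groups.setdefault(first(s), []).append(s)
    return {category: build_tree(group, rest) for category, group in groups.items()}


def black_white(s: str) -> str:
    """Level 1 (assumed convention): strings starting with '@@' are whitelist entries."""
    return "whitelist" if s.startswith("@@") else "blacklist"


def positioning_or_preset(s: str) -> str:
    """Level 2 (assumed convention): a leading or trailing '|' marks a prefix/suffix rule."""
    return "preset" if s.startswith("|") or s.endswith("|") else "positioning"


# Hypothetical second data.
SECOND_DATA = ["@@||good.test^", "||ads.test^", "/banner/ad.", "|http://popup."]
TREE = build_tree(SECOND_DATA, [black_white, positioning_or_preset])
```

Each additional classifier adds one level to the tree, matching the idea that the m levels are divided by m different preset rules.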
In a third aspect, an embodiment of the present invention provides a method for intercepting information, where the method may be performed based on a terminal, and the method may include the following steps:
Starting a browser to access a webpage;
acquiring information of access to a webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not;
when the information of the access webpage comprises target information, the target information is intercepted.
In this method, target information in the browser page is intercepted using the tree-structured first data. The tree structure divides the character strings in the first data into finely distinguished groups, which effectively reduces the number of matching operations between the information of the accessed webpage and the first data. This avoids the growth in matching operations that occurs when many character strings are used to intercept target information and no systematic matching scheme is applied, and the overall matching speed can be improved by more than 40%.
In an alternative implementation, the "tree structure" may include a plurality of nodes, where the plurality of nodes includes a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each stage have a parent-child relationship with the associated next stage node, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
In another optional implementation manner, in the step of matching the information of the access webpage with the first data arranged in the tree structure, the method specifically may include:
matching the information of the accessed webpage level by level, from the first data of a parent node of the tree structure to the first data of the child nodes in a parent-child relationship with that parent node, until it is determined whether the information of the accessed webpage includes target information.
Because the information of accessed webpages varies in length, longer information cannot be matched directly in a single comparison. Matching the information level by level allows it to be matched completely, which improves the accuracy of intercepting target information.
In yet another alternative implementation, the "tree structure" described above may include:
m-level child nodes, wherein each level of child nodes in the m-level child nodes is divided according to different preset rules in n preset rules, n and m are integers more than or equal to 1, and n is more than or equal to m;
the j-th level of child nodes is divided according to 1 preset rule selected from f preset rules, where the f preset rules are those of the n preset rules that remain after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level, the j-th level of child nodes is any level among the m levels of child nodes, and j and f are integers greater than or equal to 1;
Each of the n preset rules includes at least two categories of character strings;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m levels of sub-nodes, each sub-node in the m levels of sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
Because each terminal or operator has different definitions on target information, the method and the device provide various preset rules and categories, and the preset rules can be selected according to requirements, so that the flexibility of the tree structure can be improved, and the method and the device are suitable for more scenes.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules.
In still another optional implementation manner, the "black-and-white list rule" may include a category of a white list and a category of a black list, and the level 1 sub-node in the m level sub-nodes is divided according to the black-and-white list rule, and a character string belonging to the category of the white list and a character string belonging to the category of the black list in the first data correspond to one sub-node in the level 1 sub-node respectively.
In still another optional implementation manner, in the step of matching the information of the accessed web page with the first data arranged in the tree structure, the method specifically may include:
matching the information of the access webpage with the character strings of the category of the white list, and determining that the information of the access webpage does not comprise the target information when the information of the access webpage comprises the character strings of the category of the white list.
Some information of an accessed webpage may contain "ad" without being target information for a given operator. Placing such character strings in the whitelist category rules out cases where a string contains "ad" but is not target information (i.e. an advertisement), which improves the accuracy of interception.
In still another optional implementation manner, in the step of matching the information of the accessed web page with the first data arranged in the tree structure, the method specifically may include: when the information of the access webpage does not comprise the character strings of the category of the white list, matching the information of the access webpage with the character strings of the category of the black list;
when the information of the access webpage does not comprise the character strings of the category of the blacklist, determining that the information of the access webpage does not comprise target information;
when the information of the accessed webpage includes a character string of the blacklist category, matching the information level by level against the child nodes that are in a parent-child relationship with the child node holding the blacklist-category character strings, until matching of the information is determined to be complete, and then intercepting the target information in the information of the accessed webpage.
In still another optional implementation manner, the positioning and preset matching rule may specifically include a positioning matching type and a preset matching type, where the 2 nd level child node in the m level child nodes is divided according to the positioning and preset matching rule, and a string belonging to the positioning matching type and a string belonging to the preset matching type in the first data correspond to one child node in the 2 nd level child nodes respectively, where any child node in the 2 nd level child node and a child node in the 1 st level child node belonging to the character string of the blacklist type are in a father-child relationship.
In still another optional implementation manner, the "category of location matching" may be used to screen at least one of information that a character string exists at a first preset location or information that a separator exists at a second preset location in information of the access webpage;
The preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
In still another optional implementation manner, the "tag attribute rule" may include a category with a tag and a category without a tag, the 3 rd level child node in the m level child nodes is divided according to the tag attribute rule, and a string belonging to the category with the tag and a string belonging to the category without the tag in the first data correspond to one child node in the 3 rd level child node respectively, where any child node in the 3 rd level child node and one child node in the 2 nd level child node are in a parent-child relationship.
In still another optional implementation manner, the "category with tag" may be used to screen information of the accessed web page, where the information includes tag attribute, and the category without tag is used to screen information of the accessed web page, where the information does not include tag attribute;
the category provided with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
Because the information of the access webpage takes many forms, these categories cover more cases and allow the target information to be intercepted more accurately.
In still another optional implementation manner, the "character rule" may include a category of the first character string and a category of the preset character string, the 4 th level child node in the m level child nodes is divided according to the character rule, the character string belonging to the category of the first character string and the character string of the category of the preset character string in the first data correspond to one child node in the 4 th level child node respectively, and any child node in the 4 th level child node and one child node in the 3 rd level child node are in a father-child relationship.
In yet another alternative implementation, the "category of first string" may be used to screen information of the access webpage whose first character is the same as the first character of a character string in the category of the first character string;
the category of the preset character string is used for screening information of the access webpage that is the same as a preset character string.
In still another alternative implementation, the "information for accessing the web page" may include a URL of a user accessing the web page or a URL of each element of the web page, and the target information is advertisement information.
In still another optional implementation manner, the "first data" is obtained after the server performs tree transformation processing according to second data, where the second data includes an effective string and a custom string of the browser, and the effective string is a string whose usage rate is greater than a preset threshold value determined by screening an open source string in an open source website and reported historical data in a preset time period.
In a fourth aspect, an embodiment of the present invention provides a method for data processing, where the method may be performed by a server and may specifically include the following steps:
performing tree transformation processing on the second data to determine first data;
and sending the first data to the terminal so that the terminal can determine whether the access webpage contains the target information or not.
In the scheme, the tree structure can be used for deeply distinguishing the character strings in the second data by performing tree transformation processing on the second data, so that the character strings are transformed into the tree structure with very high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
In an alternative implementation, the "target information" may be advertisement information;
The information for accessing the web page includes: the user accesses at least one of a URL of the page or a URL of each element of the web page.
In another optional implementation, before the step of performing the tree transformation processing on the second data to determine the first data, the method may further include: periodically acquiring at least one open source character string from an open source website;
selecting a plurality of character strings with access quantity larger than a first threshold value from at least one open source character string and historical data reported by a terminal in a preset time period as effective character strings;
acquiring a custom character string of a browser server;
and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
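The selection steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_second_data`, the representation of reported history as a flat list of matched strings, and the use of a simple hit count as the "access quantity" are all assumptions made for the example.

```python
from collections import Counter

def build_second_data(open_source_strings, reported_history, custom_strings, first_threshold):
    """Combine valid strings (hit count above the threshold) with custom strings.

    reported_history is assumed to hold one entry per match reported by a
    terminal during the preset time period.
    """
    hits = Counter(reported_history)
    candidates = set(open_source_strings) | set(hits)
    # Keep only strings whose access quantity exceeds the first threshold.
    valid = sorted(s for s in candidates if hits[s] > first_threshold)
    # Append the browser server's custom strings, dropping duplicates.
    second_data, seen = [], set()
    for s in valid + list(custom_strings):
        if s not in seen:
            seen.add(s)
            second_data.append(s)
    return second_data
```

In this sketch an open source string that was never reported has a hit count of zero and is dropped, which mirrors the aim of filtering out stale strings.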
Since each browser server generally has different criteria (that is, the same information may be defined as advertisement information at website A but not at website B), custom character strings of the browser server are added when generating the second data, so that the second data used for matching is flexible and can be widely used.
In yet another alternative implementation manner, in the step of performing tree transformation processing on the second data to determine the first data, the method specifically may include: dividing the plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different;
Each of n preset rules respectively comprises at least two categories of character strings, and each layer in m levels is divided into at least two child nodes according to the categories of the character strings;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the kth level child nodes has a father-son relationship with one child node in the k-1 level, the k level child nodes are any one level child node in the m level child nodes, and k is an integer greater than or equal to 1.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules;
the server performs the steps of:
dividing the plurality of sub-nodes into m-level sub-nodes according to a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule and a character rule.
In still another alternative implementation manner, in the step of performing the tree transformation processing on the second data to determine the first data, the method specifically may include:
when the black-and-white list rule includes the category of the white list and the category of the black list, dividing the 1 st level of m level sub-nodes into two sub-nodes according to the category of the white list and the category of the black list, wherein one sub-node of the two sub-nodes includes a character string belonging to the category of the white list in the second data, and the other sub-node includes a character string belonging to the category of the black list in the second data.
In still another alternative implementation manner, in the step of performing the tree transformation processing on the second data to determine the first data, the method specifically may include:
when the locating and preset matching rules comprise locating matching categories and preset matching categories, dividing the 2 nd level of m level sub-nodes into two sub-nodes according to the locating matching categories and the preset matching categories, wherein one sub-node of the two sub-nodes comprises character strings belonging to the locating matching categories in the second data, and the other sub-node comprises character strings belonging to the preset matching categories in the second data, and the two sub-nodes in the 2 nd level are in a father-son relationship with the node where the character strings belonging to the blacklist categories in the 1 st level are located.
In still another alternative implementation manner, in the step of performing the tree transformation processing on the second data to determine the first data, the method specifically may include:
when the tag attribute rule comprises a category with a tag and a category without the tag, dividing the 3 rd level of m-level child nodes into two child nodes according to the category with the tag and the category without the tag, wherein one child node of the two child nodes comprises a character string belonging to the category with the tag in the second data, and the other child node comprises a character string belonging to the category without the tag in the second data, and any child node in the 3 rd level and one child node in the 2 nd level child node are in a father-son relationship.
In still another alternative implementation, the "category with tag" may specifically include: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
In still another alternative implementation manner, in the step of performing the tree transformation processing on the second data to determine the first data, the method specifically may include:
when the character rule comprises the category of the first character string and the category of the preset character string, dividing the 4 th level of m level sub-nodes into two sub-nodes according to the category of the first character string and the category of the preset character string, wherein one sub-node of the two sub-nodes comprises the character string belonging to the category of the first character string in the second data, the other sub-node comprises the character string belonging to the category of the preset character string in the second data, and any sub-node in the 4 th level and one sub-node in the 3 rd level sub-node are in a father-son relationship.
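The four-level division described above can be sketched as a nested dictionary. The sketch assumes that each character string already carries its category under every rule; the `Rule` record and its field names are hypothetical, and how a real server assigns these categories is not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    text: str
    list_type: str   # level 1: "white" or "black"
    match_type: str  # level 2: "position" or "preset"
    tagged: bool     # level 3: category with tag / without tag
    char_type: str   # level 4: "first" or "preset"

def build_tree(rules):
    """Distribute strings over a 4-level tree; only the blacklist branch is subdivided."""
    tree = {"white": [], "black": {}}
    for r in rules:
        if r.list_type == "white":
            tree["white"].append(r.text)
            continue
        level2 = tree["black"].setdefault(r.match_type, {})
        level3 = level2.setdefault("tagged" if r.tagged else "untagged", {})
        level3.setdefault(r.char_type, []).append(r.text)
    return tree
```

Each additional level splits the strings of its parent node further, so a lookup that follows one branch only compares against the strings in that branch.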
In a fifth aspect, embodiments of the present invention provide an apparatus, which may include:
The processing module is used for starting the browser to access the webpage;
the receiving and transmitting module is used for acquiring information of the access webpage;
the processing module is also used for matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not; when the information of the access webpage comprises target information, the target information is intercepted.
In the scheme, the device intercepts target information in the browser page through the first data with the tree structure, the tree structure can deeply distinguish character strings in the first data, and the matching times of the information of the access webpage and the first data are effectively reduced, so that the problem that the matching times are increased due to the fact that the number of character strings for intercepting the target information is large and a rationalized matching mode is not adopted is avoided, and the overall matching speed can be improved by more than 40%.
In an alternative implementation, the "tree structure" may include a plurality of nodes, where the plurality of nodes includes a root node and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes;
the nodes of each stage have a parent-child relationship with the associated next stage node, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
In another optional implementation manner, the "processing module" may be specifically configured to match information of the accessed web page from first data of a parent node of the tree structure to first data of a child node in a parent-child relationship with the parent node step by step until it is determined whether the information of the accessed web page includes target information.
Because the information of the access webpage varies in length, longer information cannot be matched directly; matching it step by step allows the information to be matched completely and improves the accuracy of intercepting the target information.
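Step-by-step matching from a parent node to its child nodes can be sketched as a recursive descent. The `Node` class, its `screen` predicate, and the use of substring containment as the leaf-level comparison are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    name: str
    screen: Callable[[str], bool]  # does this node's category apply to the URL?
    strings: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def match_stepwise(url: str, node: Node, visited: Optional[list] = None) -> Optional[str]:
    """Descend parent to child; return the first matching string, or None."""
    if not node.screen(url):
        return None  # the whole subtree is skipped without comparing its strings
    if visited is not None:
        visited.append(node.name)
    for s in node.strings:
        if s in url:
            return s
    for child in node.children:
        hit = match_stepwise(url, child, visited)
        if hit is not None:
            return hit
    return None
```

A subtree whose screen fails is never entered, which is where the reduction in the number of comparisons comes from in this sketch.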
In yet another optional implementation manner, the tree structure may include m-level sub-nodes, where each level of sub-nodes in the m-level sub-nodes is divided according to different preset rules in n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
the j-th level of child nodes is divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any level among the m levels of child nodes, and j and f are integers greater than or equal to 1;
Each of the n preset rules includes at least two categories of character strings;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m levels of sub-nodes, each sub-node in the m levels of sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
Because each terminal or operator has different definitions on target information, the method and the device provide various preset rules and categories, and the preset rules can be selected according to requirements, so that the flexibility of the tree structure can be improved, and the method and the device are suitable for more scenes.
In yet another alternative implementation, the "n preset rules" may include at least one of the following rules:
black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules.
In still another optional implementation manner, the "black-and-white list rule" may include a category of a white list and a category of a black list, and the level 1 sub-node in the m level sub-nodes is divided according to the black-and-white list rule, and a character string belonging to the category of the white list and a character string belonging to the category of the black list in the first data correspond to one sub-node in the level 1 sub-node respectively.
In still another alternative implementation, the "processing module" may specifically be configured to match the information of the accessed web page with a string of the category of the whitelist, and determine that the information of the accessed web page does not include the target information when the information of the accessed web page includes the string of the category of the whitelist.
Some information of the access webpage may contain "ad" and yet not be target information for some operators. Setting character strings of the whitelist category excludes the cases in which a character string contains "ad" but is not target information (i.e. an advertisement), thereby improving the accuracy of interception.
In yet another alternative implementation, the "processing module" may be specifically configured to match the information of the accessed web page with the character string of the category of the blacklist when the information of the accessed web page does not include the character string of the category of the whitelist;
when the information of the access webpage does not comprise the character strings of the category of the blacklist, determining that the information of the access webpage does not comprise target information;
when the information of the access webpage comprises a character string of the blacklist category, the information of the access webpage is matched step by step against the child nodes that are in a father-son relationship with the child node holding the character strings of the blacklist category, until the information of the access webpage is confirmed to be matched, and the target information in the information of the access webpage is intercepted.
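The order of the whitelist and blacklist checks can be sketched as follows. The function name `should_intercept`, the substring comparison, and the `match_children` callback (standing in for the deeper levels of the tree) are hypothetical.

```python
def should_intercept(url, whitelist, blacklist, match_children):
    """Whitelist first, then blacklist, then the blacklist node's descendants."""
    if any(s in url for s in whitelist):
        return False  # whitelisted: not target information
    if not any(s in url for s in blacklist):
        return False  # no blacklist string present: not target information
    # A blacklist string is present; continue step by step through child nodes.
    return match_children(url)
```

For example, with a whitelist of `["adobe.test"]` and a blacklist of `["ad"]`, a URL on `adobe.test` is released at the first check even though it contains "ad".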
In still another optional implementation manner, the "positioning and preset matching rule" may include a positioning matching category and a preset matching category. The 2nd-level child nodes among the m levels of child nodes are divided according to the positioning and preset matching rule, and the character strings in the first data belonging to the positioning matching category and those belonging to the preset matching category each correspond to one of the 2nd-level child nodes, where every 2nd-level child node is in a father-son relationship with the 1st-level child node that holds the character strings of the blacklist category.
In still another optional implementation manner, the "positioning matching category" may be used to screen, in the information of the access webpage, at least one of information in which a character string exists at a first preset position or information in which a separator exists at a second preset position;
the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
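The two categories at this level can be illustrated with plain string operations. This is a sketch under stated assumptions: the function names, the set of separator characters, and the use of zero-based character indexes as "preset positions" are not taken from the text.

```python
def position_match(url, token, position):
    """Positioning match: token occurs in url starting at the preset index."""
    return url.startswith(token, position)

def separator_at(url, position, separators="/?&=:"):
    """Positioning match: a separator character sits at the preset index."""
    return 0 <= position < len(url) and url[position] in separators

def preset_match(url, prefix=None, suffix=None):
    """Preset match: url begins with the given prefix or ends with the given suffix."""
    return (prefix is not None and url.startswith(prefix)) or (
        suffix is not None and url.endswith(suffix)
    )
```

Both kinds of check inspect a fixed part of the URL rather than scanning the whole string.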
In still another optional implementation manner, the "tag attribute rule" may include a category with a tag and a category without a tag, the 3 rd level child node in the m level child nodes is divided according to the tag attribute rule, and a string belonging to the category with the tag and a string belonging to the category without the tag in the first data correspond to one child node in the 3 rd level child node respectively, where any child node in the 3 rd level child node and one child node in the 2 nd level child node are in a parent-child relationship.
In still another optional implementation manner, the "category with tag" may be used to screen information of the accessed web page, where the information includes tag attribute, and the category without tag is used to screen information of the accessed web page, where the information does not include tag attribute; wherein
the category provided with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
Because the information of the access webpage takes many forms, these categories cover more cases and allow the target information to be intercepted more accurately.
In still another optional implementation manner, the "character rule" may include a category of the first character string and a category of the preset character string, the 4 th level child node in the m level child nodes is divided according to the character rule, the character string belonging to the category of the first character string and the character string of the category of the preset character string in the first data correspond to one child node in the 4 th level child node respectively, and any child node in the 4 th level child node and one child node in the 3 rd level child node are in a father-child relationship.
In yet another alternative implementation, the "category of first string" may be used to screen information of the access webpage whose first character is the same as the first character of a character string in the category of the first character string;
the category of the preset character string is used for screening information of the access webpage that is the same as a preset character string.
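One possible reading of the character rule is sketched below: strings in the first category are bucketed by their first character so that only one bucket is compared, while strings in the preset category are compared for equality. The function names, and the idea of applying the rule to a URL fragment such as a host name, are assumptions.

```python
from collections import defaultdict

def index_by_first_char(strings):
    """Group strings by their first character."""
    buckets = defaultdict(list)
    for s in strings:
        if s:
            buckets[s[0]].append(s)
    return dict(buckets)

def match_first_char(fragment, buckets):
    """Compare only against strings that share the fragment's first character."""
    if not fragment:
        return False
    return any(fragment.startswith(s) for s in buckets.get(fragment[0], ()))

def match_preset(fragment, preset_strings):
    """Preset string category: the fragment must equal a preset string."""
    return fragment in preset_strings
```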
In still another alternative implementation, the "information for accessing the web page" may include a URL of a user accessing the web page or a URL of each element of the web page, and the target information is advertisement information.
In still another optional implementation manner, the "first data" may be obtained after the server performs tree transformation processing according to second data, where the second data includes an effective string and a custom string of the browser, and the effective string is a string whose usage rate is determined to be greater than a preset threshold by screening an open source string in an open source website and reported historical data in a preset time period.
In a sixth aspect, an embodiment of the present invention provides an apparatus for data processing, including:
the processing module is used for performing tree transformation processing on the second data and determining first data;
and the receiving and transmitting module is used for sending the first data to the terminal so that the terminal can determine, according to the first data, whether the access webpage contains the target information.
In the scheme, the tree structure can be used for deeply distinguishing the character strings in the second data by performing tree transformation processing on the second data, so that the character strings are transformed into the tree structure with very high distinguishing degree, and the matching times of the information of the access webpage and the first data are effectively reduced.
In an alternative implementation, the "target information" may be advertisement information;
the information for accessing the web page includes: the user accesses at least one of a URL of the page or a URL of each element of the web page.
In another optional implementation manner, the foregoing "transceiver module" may be further configured to periodically obtain at least one open source character string from an open source website;
the processing module may be further configured to select, from at least one open-source character string and the historical data reported by the client in a preset time period, a plurality of character strings with access amounts greater than a first threshold as valid character strings;
the transceiver module can also be used for acquiring the custom character string of the browser server;
the processing module may be further configured to determine the second data according to an effective string and a custom string, where the effective string and the custom string each include at least one string.
Since each browser server generally has different criteria (that is, the same information may be defined as advertisement information at website A but not at website B), custom character strings of the browser server are added when generating the second data, so that the second data used for matching is flexible and can be widely used.
In yet another optional implementation manner, the "processing module" may be specifically configured to divide the plurality of sub-nodes into m levels according to n preset rules, where the preset rules of each level in the m levels of sub-nodes are different;
each of n preset rules respectively comprises at least two categories of character strings, and each layer in m levels is divided into at least two child nodes according to the categories of the character strings;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m;
each child node in the kth level child nodes has a father-son relationship with one child node in the k-1 level, the k level child nodes are any one level child node in the m level child nodes, and k is an integer greater than or equal to 1.
In another alternative implementation, the "n preset rules" may include at least one of the following rules: black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules;
The above-mentioned "processing module" may also be used to divide the plurality of sub-nodes into m-level sub-nodes according to black-and-white list rules, positioning and preset matching rules, tag attribute rules and character rules.
In still another alternative implementation manner, the "processing module" may be specifically configured to, when the black-and-white list rule includes a class of a white list and a class of a black list, divide the level 1 of the m level child nodes into two child nodes according to the class of the white list and the class of the black list, where one of the two child nodes includes a string in the second data that belongs to the class of the white list, and the other child node includes a string in the second data that belongs to the class of the black list.
In still another optional implementation manner, the "processing module" may specifically be configured to divide, when the locating and preset matching rule includes a locating matching category and a preset matching category, the level 2 of the m level child nodes into two child nodes according to the locating matching category and the preset matching category, where one child node of the two child nodes includes a string in the second data that belongs to the locating matching category, and the other child node includes a string in the second data that belongs to the preset matching category, and where the two child nodes in the level 2 and the node in the level 1 where the string in the category that belongs to the blacklist are in a parent-child relationship.
In still another alternative implementation manner, the "processing module" may specifically be configured to divide, when the tag attribute rule includes a category with a tag and a category without a tag, a 3 rd level of m level sub-nodes into two sub-nodes according to the category with a tag and the category without a tag, where one of the two sub-nodes includes a string in the second data that belongs to the category with a tag, and the other sub-node includes a string in the second data that belongs to the category without a tag, and any one of the 3 rd level sub-nodes and one of the 2 nd level sub-nodes are in a parent-child relationship.
In still another alternative implementation, the "category with tag" may specifically include: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
In still another alternative implementation manner, the "processing module" may specifically be configured to divide, when the character rule includes a category of a first character string and a category of a preset character string, a 4 th level of m level subnodes into two subnodes according to the category of the first character string and the category of the preset character string, where one of the two subnodes includes a character string belonging to the category of the first character string in the second data, and the other subnode includes a character string belonging to the category of the preset character string in the second data, and any one of the 4 th level subnodes and one of the 3 rd level subnodes are in a parent-child relationship.
In a seventh aspect, embodiments of the present invention provide a computer readable storage medium, which may include instructions that when executed on a computer, cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring the information of the access webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not;
and intercepting the target information when the information of the access webpage comprises the target information.
In an eighth aspect, embodiments of the present invention provide a computer readable storage medium comprising instructions that when run on a computer cause the computer to perform the steps of:
performing tree transformation processing on the second data to determine first data;
and sending the first data to the terminal so that the terminal can determine, according to the first data, whether the access webpage contains the target information.
In a ninth aspect, embodiments of the present invention provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring the information of the access webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not;
and intercepting the target information when the information of the access webpage comprises the target information.
In a tenth aspect, embodiments of the present invention provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of:
performing tree transformation processing on the second data to determine first data;
and sending the first data to the terminal so that the terminal can determine, according to the first data, whether the access webpage contains the target information.
Drawings
FIG. 1 is a schematic illustration of an application scenario for advertisement interception;
FIG. 2 is a schematic diagram of another application scenario for advertisement interception;
FIG. 3 is a schematic diagram of an application scenario of advertisement interception according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for data processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a matching result of URLs of elements accessed by a browser client according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a tree structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a tree structure based on black-and-white list rule division according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a tree structure divided based on positioning and preset matching rules according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a statistical classification structure based on label attribute rule or character rule division according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a tree structure based on rule division according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a tree structure based on sub-classification according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a tree structure divided based on black-and-white list rules, positioning and preset matching rules and tag attribute rules according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a tree structure based on character rule division according to an embodiment of the present invention;
FIG. 14 is a flowchart of an information interception method according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of a terminal structure for information interception according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of an information interception device according to an embodiment of the present invention;
FIG. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
To facilitate understanding of the embodiments of the present invention, specific embodiments are further described below with reference to the accompanying drawings; these embodiments are not intended to limit the embodiments of the invention.
Currently, one technology for advertisement interception uses an Opera server. As shown in fig. 1, the Opera server may include a browser server, a web page cache, and a page processing server. Specifically, when a client (for example, a mobile phone, a tablet computer and the like) uses the Opera browser to browse a webpage, the client sends a webpage access request to the server; the browser server receives the webpage access request and sends a query with the webpage information to the web page cache; the web page cache looks up the corresponding data according to the webpage information and sends it to the browser server; and the browser server returns the webpage content.
The data stored in the web page cache is prepared by the page processing server, which periodically sends webpage access requests, receives the webpage content information, and processes it. The processing may include at least one of picture compression, text compression, or advertisement filtering. The processed content information is then compressed and sent to the web page cache for storage, so that the browser server can query it. This method is therefore based on hiding advertisements at the server side and then returning the webpage content with the advertisements hidden to the client for display. It requires caching a large number of pages at the server and analyzing the entire content of each web page, so the servers must have high performance in order to process page content quickly, which places very high demands on hardware performance and storage.
Another advertisement interception technique is applied to a browser system (as shown in fig. 2). In this system, the browser server downloads an easylist rule list, and the browser client periodically downloads advertisement interception character strings from the browser server (it should be noted that the easylist rule list includes the advertisement interception character strings). When the browser client accesses a web page, the uniform resource locator (URL) of each element in the accessed page is matched against the advertisement interception character strings, and the element in the accessed page that corresponds to a matched character string is hidden.
Although this approach intercepts advertisements at the browser client, it presents at least two problems. First, the easylist rule list downloaded through the browser server ages. For example, the easylist rule list currently contains approximately 45,000 (4.5W) URLs and keeps growing. Volunteers tend to contribute only by adding new rules and rarely do work that brings them no credit, such as deleting old URLs from the easylist rule list; deleting old rules is also risky. As a result, the number of URLs in the easylist rule list keeps increasing. It should be noted that the advertisement interception character strings are determined by the URLs in the easylist rule list. Meanwhile, many URLs in the easylist rule list were proposed long ago, and the original websites have since changed how their pages are implemented, so these URLs are outdated and cannot provide effective advertisement interception character strings for the browser client. Second, matching the URLs in the visited web page against the URLs in the easylist rules performs poorly. As mentioned above, there are about 45,000 URL rules in the easylist rules, and the home page of some large websites issues more than 100 network requests, in some cases as many as 430. Intercepting advertisements on such a page therefore requires about 45,000 × 100 = 4.5 million matching operations, or even tens of millions, which inevitably and noticeably degrades the performance of the device running the browser client.
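The matching cost described above can be checked with a short calculation. The following is only an illustrative sketch: the function name and constant are hypothetical, and the figures (about 45,000 rules, 100 to 430 requests per page) are the ones quoted in the text.

```python
# Figures quoted in the text: about 45,000 rules ("4.5W"),
# and 100 to 430 network requests for the home page of a large website.
RULE_COUNT = 45_000


def naive_match_count(requests_per_page: int, rule_count: int = RULE_COUNT) -> int:
    """Comparisons needed when every request URL is tested against every rule."""
    return requests_per_page * rule_count


for requests in (100, 430):
    print(requests, "requests ->", naive_match_count(requests), "comparisons")
```

With 100 requests the product is 4.5 million comparisons, and with 430 requests it approaches 20 million, which is the order of magnitude that motivates the tree structure introduced below.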
Therefore, in view of the above problems, the embodiments of the present invention provide a client-based information interception method, apparatus, and terminal. The terminal intercepts target information in a browser page by means of first data having a tree structure. The tree structure distinguishes the character strings in the first data at multiple levels and effectively reduces the number of matching operations between the information of the accessed web page and the first data, thereby avoiding the increase in matching operations that occurs when there are many character strings for intercepting target information and no rational matching method is used.
For convenience of description, the embodiment of the invention uses advertisement information as an example of the target information in the accessed webpage. The method provided by the embodiment of the invention can also be used for information other than advertisement information, for example, consultation information, web page addresses, and the like.
Fig. 3 is a schematic diagram of an application scenario of advertisement interception according to an embodiment of the present invention. As shown in fig. 3, the scenario may include a client and a server, where the client may be specifically a browser client, and the server may be specifically a browser server.
Specifically, the method may include two processes. In the first process, the browser server determines the first data. The browser server obtains at least one of the URLs of a number of pages visited by users or the URLs of page elements, where a page element may include at least one of text, a link, or a picture. The browser server also periodically obtains an open source list (e.g., an easylist rule list) from an open source website; the open source list may include character strings for advertisement interception. Based on the obtained URLs and the open source list, the browser server learns by using a browser server learning mechanism (for example, the cloud side learning mechanism in fig. 3) and determines valid character strings (for example, character strings whose access count within a preset number of days is larger than a preset threshold, such as the top 10,000 ("top1w") character strings by access count over X days in fig. 3). The purpose of this step is to remove invalid or rarely accessed character strings, which effectively reduces the number of subsequent matching operations.
The browser server merges the valid character strings with the custom character strings of the browser (e.g., the self-operated interception rules shown in fig. 3) to determine the second data. The valid character strings and the custom character strings each include at least one character string. The browser server converts the second data into a tree-shaped private format to generate the first data, stores the first data with the tree structure in a private format preference rule base, and synchronizes it to the browser client. The browser client periodically downloads the first data with the tree structure from the browser server. When a third webpage is accessed, the browser client matches the webpage information of the third webpage against the first data with the tree structure and determines a matching result. If the webpage information matches a character string in the first data, the browser client intercepts the matched target information in the webpage information of the third webpage; the target information is generally advertisement information.
In summary, on the one hand, by collecting statistics on a large amount of user access data, the method removes character strings in the original open source list that are invalid or rarely accessed, which ensures the validity of the rules and also reduces the number of matching targets. On the other hand, through a deep understanding of the second data, the character strings are classified according to the corresponding rules to form a tree structure, which greatly reduces the number of matching operations for a single piece of information during matching (i.e., the information of each element in the accessed third webpage, where an element generally refers to text, pictures, videos, and the like).
The information interception method provided by the embodiment of the present invention is further described below with reference to fig. 4 to 13. First, the process of processing data (i.e., determining the first data) at the browser server side is described, as shown in fig. 4 to 13:
Fig. 4 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in fig. 4, the method may include steps S410-S470, as follows:
S410: The browser server receives an instruction of the browser client to access the webpage.
The instruction of the browser client to access the webpage may be instructions of a large number of users to access a plurality of webpages through the browser, or instructions of a large number of users to access the same webpage through the browser. Specifically, in response to the users' instructions to access webpages, the browser client records at least one of the URL of the accessed page or the URLs of the page elements, and compresses the URLs recorded within a preset time; the compressed file is the instruction of the browser client to access the webpage. This message (the second message) does not carry any user identifier, in order to protect user privacy.
It should be noted that, because there are a large number of users accessing, the browser client may send the instruction of accessing the web page by the browser client to the browser server multiple times.
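The reporting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the format used by the embodiment: the JSON layout, the gzip compression, and the names `build_report` and `read_report` are all hypothetical. The only properties taken from the text are that the recorded URLs are compressed and that no user identifier is included.

```python
import gzip
import json


def build_report(page_urls, element_urls):
    """Pack the recorded URLs into a compressed payload with no user identifier."""
    record = {"page_urls": list(page_urls), "element_urls": list(element_urls)}
    return gzip.compress(json.dumps(record).encode("utf-8"))


def read_report(payload):
    """Server side: decompress a report back into its URL lists."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


# Made-up URLs, used only to show the round trip.
payload = build_report(
    ["http://sina.cn/"],
    ["http://sina.cn/cm/sinaads_a.js"],
)
print(sorted(read_report(payload).keys()))
```

Because the payload contains only URL lists, the server can count rule hits without being able to attribute a record to a particular user.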
S420: The browser server periodically obtains the latest open source list (e.g., an easylist character string list or a list containing easylist character strings) from the open source website; for example, the server obtains the latest open source list from the open source website at 12 am every day.
S430: The browser server determines the valid character strings according to the open source list and the instruction of the browser client to access the webpage, that is, it screens out the rules with a high hit rate.
Specifically, each time the browser client reports an instruction to access the webpage, the server extracts from it at least one of the URL of the page accessed by the user or the URLs of the page elements, and matches them against the character strings in the open source list. Whenever a character string is matched, its count is incremented by 1. These steps are repeated until the browser server has matched all records in the reported URLs, after which the reported URLs are placed in a backup directory.
The browser server saves the count of each character string, totals the access counts of each character string over a preset time period (for example, the latest 30 days), and sorts the totals from large to small. A character string whose access count is larger than a first threshold (for example, N=2000) is a valid character string. A character string that is rarely accessed (for example, one whose access count is smaller than a second threshold, N=100) or that has become invalid (for example, one whose access count is zero) is directly rejected.
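The counting and screening in S430 can be sketched as below. This is a simplified illustration rather than the server implementation: `simple_match` is a stand-in matcher that only strips the leading "||" and tests for a substring, the function names are hypothetical, and the sample URLs are made up. The default first threshold of 2000 is the example value given in the text.

```python
from collections import Counter


def simple_match(rule, url):
    # Stand-in matcher (an assumption): strip the leading "||" anchor and
    # test for a plain substring. Real rule matching is more involved.
    return rule.lstrip("|") in url


def count_hits(rules, reported_urls, match=simple_match):
    """Add 1 to a rule's count each time a reported URL matches it."""
    hits = Counter({rule: 0 for rule in rules})
    for url in reported_urls:
        for rule in rules:
            if match(rule, url):
                hits[rule] += 1
    return hits


def select_valid(hits, first_threshold=2000):
    """Sort counts from large to small and keep rules above the first threshold."""
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    return [rule for rule, count in ranked if count > first_threshold]


rules = ["||sina.cn/cm/sinaads_", "||sina.cn/api/article/news_banner?"]
urls = [  # made-up reported URLs
    "http://sina.cn/cm/sinaads_a.js",
    "http://sina.cn/cm/sinaads_b.js",
    "http://sina.cn/index.html",
]
hits = count_hits(rules, urls)
print(select_valid(hits, first_threshold=1))
```

In this toy run the threshold is lowered to 1 so that the rule hit twice is kept and the rule with zero hits is rejected; with real traffic the counts would be accumulated over the preset period before screening.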
For example, the easylist character string list contains the following rules for sina.cn:
||mobile.sina.cn/public/files/image/600x150_
||mobile.sina.cn/public/files/image/620x300_
||sina.cn/api/article/news_banner?
||sina.cn/cm/sinaads_
||sina.cn^*/impress?
When the home page of sina.cn is opened, the server matches the accessed URLs collected by the browser client against the character strings in the easylist character string list, as shown in fig. 5. As can be seen from fig. 5, the rule ||sina.cn^*/impress? was hit 4 times and the rule ||sina.cn/cm/sinaads_ was hit once. It is thus known which of the 5 character strings given in the easylist character string list are frequently accessed and which are rarely or never accessed. Fig. 5 shows the result of a single access; if the access instructions of millions of users are collected, it is possible to determine which character strings are valid and which are invalid.
S440: The browser server merges the valid character strings with the custom character strings of the browser to determine the second data.
Specifically, the valid character strings are screened from an open source list (e.g., an easylist rule list), so the valid character strings are open source. Meanwhile, different browsers have some custom rules of their own when they are operated, namely the custom character strings of the browser.
In another possible implementation, the step S440 may be preceded by obtaining a custom string of the browser server.
S450: The browser server performs tree transformation processing on the second data to determine the first data.
The first data may include character strings for matching target information, where the target information refers to advertisement information in the present application. The browser client uses the first data to intercept advertisement information in the accessed page.
Specifically, the browser server divides the second data into m levels according to n preset rules, wherein the preset rules of each level in the m levels of child nodes are different; each of the n preset rules respectively comprises at least two categories of character strings, and each layer in the m levels is divided into at least two child nodes according to the categories of the character strings; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are larger than or equal to 1, and n is larger than or equal to m.
It may also be understood that the browser server divides the second data into m levels (where m is a positive integer greater than 0 and n is greater than or equal to m) according to n preset rules (where n is a positive integer greater than 0), each of the m levels includes at least two sub-nodes, each of the n preset rules includes at least two categories, the at least two sub-nodes in each level are divided according to at least two categories (i.e., each sub-node in each level corresponds to one category), and the at least two sub-nodes in each level include a plurality of character strings having one category.
When the preset rules are selected, each of the m levels of child nodes uses a different preset rule. In one option, the rules are applied in their order among the n preset rules; in another option, any two or three of the n preset rules are selected, but no fewer than two.
By way of example, as shown in fig. 6, the browser server divides the second data into 4 levels (the fourth level is not shown) according to 4 preset rules, and each of the 4 levels includes at least two child nodes. The 4 preset rules may include the black-and-white list rule, the positioning and preset matching rule, the tag attribute rule, and the character rule. It should be noted that the preset rules may also include other possibilities (e.g., identifiers, fixed sentences, etc.); the above rules are used merely as examples in the present application, which is not limited to these 4 possibilities. When only 2 or 3 of the division modes are selected for the tree transformation processing, the matching speed is still improved compared with the prior art, but the division is coarser because fewer division modes are used.
Each preset rule of the 4 preset rules may include at least two categories, and each child node in each stage corresponds to one category, where the at least two child nodes in each stage are divided according to the at least two categories.
For example, when the black-and-white list rule and the character rule are selected, black-and-white list division is applied first and character division second. When the positioning and preset matching rule, the tag attribute rule, and the character rule are selected, the positioning and preset matching rule is applied first, then the tag attribute rule, and finally the character rule. It should be understood that when all 4 rules are selected, they are arranged downward in the above order; if only some of the above rules are selected, the number of levels is arranged according to the actual selection.
For example, when the black-and-white list rule includes a whitelist category and a blacklist category, the 1st level of the m levels of child nodes is divided into two child nodes according to the whitelist category and the blacklist category (1a in fig. 6 corresponds to the BLACK child node in fig. 7, and 1b in fig. 6 corresponds to the WHITE child node in fig. 7). One of the two child nodes includes the character strings in the second data that belong to the whitelist category (the content in the frame under the WHITE child node in fig. 7), and the other includes the character strings in the second data that belong to the blacklist category (the content in the frame under the BLACK child node in fig. 7).
Specifically, the browser server divides the second data into a first child node and a second child node according to the whitelist category and the blacklist category in the black-and-white list rule, where the first child node (for example, the 1a child node) includes the character strings belonging to the blacklist category, and the second child node (for example, the 1b child node) includes the character strings belonging to the whitelist category.
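The ordering constraint described above (a selected subset of rules is always applied in the same top-down order) can be sketched as follows. The rule identifiers and the function name are hypothetical labels chosen for illustration; only the order itself comes from the text.

```python
# Top-down order of the four example rules; the identifiers are illustrative.
RULE_ORDER = [
    "black_white_list",
    "positioning_and_preset_matching",
    "tag_attribute",
    "character",
]


def levels_for(selected):
    """Return the selected rules in tree-level order (level 1 first)."""
    unknown = set(selected) - set(RULE_ORDER)
    if unknown:
        raise ValueError(f"unknown rules: {sorted(unknown)}")
    if len(set(selected)) < 2:
        raise ValueError("at least two rules must be selected")
    return [rule for rule in RULE_ORDER if rule in selected]


print(levels_for({"character", "black_white_list"}))
```

Whatever subset is chosen, the resulting list gives one rule per level, so the number of levels m equals the number of selected rules.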
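The first-level division can be sketched as a simple partition. The text does not state how a whitelist character string is recognized; the sketch below assumes the Adblock Plus filter convention, in which exception (whitelist) rules begin with "@@". The function name and the sample whitelist rule are hypothetical.

```python
def split_black_white(rules):
    """Partition rules into the level-1 BLACK and WHITE child nodes.

    Assumption: a rule starting with "@@" is a whitelist (exception) rule,
    as in Adblock Plus filter syntax; everything else is a blacklist rule.
    """
    tree = {"BLACK": [], "WHITE": []}
    for rule in rules:
        if rule.startswith("@@"):
            tree["WHITE"].append(rule[2:])  # drop the "@@" marker
        else:
            tree["BLACK"].append(rule)
    return tree


rules = [
    "||sina.cn/cm/sinaads_",      # rule quoted in the text
    "@@||example.com/ok.js",      # made-up whitelist rule
]
print(split_black_white(rules))
```

After this split, a URL only needs to be compared with the character strings of one category at a time, which is the starting point for the deeper levels.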
When the positioning and preset matching rule includes a positioning matching category and a preset matching category, the 2nd level of the m levels of child nodes is divided into two child nodes (for example, the 2a child node and the 2b child node in fig. 6) according to these two categories. One of the two child nodes includes the character strings in the second data that belong to the positioning matching category, and the other includes the character strings in the second data that belong to the preset matching category. The two child nodes in the 2nd level are in a parent-child relationship with the node in the 1st level that holds the character strings of the blacklist category. In another possible embodiment, a 3rd child node in the 2nd level (e.g., the 2c child node in fig. 6) is in a parent-child relationship with the node in the 1st level that holds the character strings of the whitelist category. In the embodiment of the present invention, all the child nodes in the 2nd level are nodes having a parent-child relationship with the node that holds the character strings of the blacklist category. Specifically, the child node of the positioning matching category includes at least one of information that a character string exists at a first preset position or information that a separator exists at a second preset position; the child node of the preset matching category includes at least one of information for screening a prefix existing in the information of the accessed webpage or information for screening a suffix.
The child nodes of the positioning matching category and the child nodes of the preset matching category are described in detail below.
The child nodes of the positioning matching category are divided mainly according to characters at fixed positions. Specifically, a "*" at the first preset position indicates that any character string may appear at that position; alternatively, a "^" at the second preset position indicates that a separator is present at that position (a separator may be any character other than a letter, a digit, "_", "-", ".", or "%").
For example, in the following address of a page accessed by the browser client, ":", "/", "?", "&" and "=" can be regarded as separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, this address can be filtered by any of the positioning matching rules ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^, or ^foo.bar^.
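Positioning matching with wildcard and separator markers can be sketched by translating a rule into a regular expression. This is a simplified illustration, assuming the Adblock Plus style conventions that "*" stands for any character string, that "^" stands for a single separator character or the end of the address, and that a separator is any character other than a letter, a digit, "_", "-", "." or "%". The function names are hypothetical, and other rule syntax (such as "||" or "$" options) is not handled.

```python
import re

# One separator character, or the end of the address (assumed convention).
SEPARATOR = r"(?:[^A-Za-z0-9_\-.%]|$)"


def rule_to_regex(rule):
    """Translate a rule containing '^' and '*' markers into a compiled regex."""
    parts = []
    for ch in rule:
        if ch == "^":
            parts.append(SEPARATOR)
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(rule, url):
    return rule_to_regex(rule).search(url) is not None


url = "http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82"
for rule in ("^example.com^", "^%D1%82%D0%B5%D1%81%D1%82^", "^foo.bar^", "^foo^"):
    print(rule, matches(rule, url))
```

The first three rules match the example address, while "^foo^" does not, because "foo" is followed by ".", which is not treated as a separator under the stated assumption.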
In addition, the preset matching categories are divided according to common patterns, and may include at least one of prefix matching or suffix matching. The case where both are present is described below.
As described above, white.plane and white.glob refer to the prefix matching categories among the preset matching categories, while white.plane and black.plane refer to the suffix matching categories among the preset matching categories. For example, for the sina-related character strings above, the branching is as shown in fig. 8. Because the number of character strings is limited, only 3 child nodes appear (e.g., white.plane, black.plane, and black.glob), and the content of the boxes below white.plane, black.plane, and black.glob is the character strings of the corresponding category contained in each child node. That is, the second-level child nodes having a parent-child relationship with the first-level 1a node may include 2a and 2b, while the second-level child nodes having a parent-child relationship with the first-level 1b node may also include 2a and 2b, or 2c (this possibility is not shown in fig. 6).
The preset matching categories may be divided into 2 branches (i.e., prefix matching and suffix matching). In a possible embodiment, the preset matching categories may be combined with the first level into one layer; that is, the root node (ROOT) may directly have 4 child nodes, for example: white.plane, black.plane, white.glob, and black.glob.
In another possible embodiment, the above-mentioned second level may have 4 sub-nodes, and the 4 sub-nodes may include a sub-node having information of the first preset position presence string, a sub-node having information of the second preset position presence string, a sub-node having information of the presence prefix, and a sub-node having information of the presence suffix.
In still another possible embodiment, 8 child nodes may occur in the second level, and the 8 child nodes may be divided into at least two groups. One group is the 4 child nodes having a parent-child relationship with the first-level 1a node; these 4 child nodes may include a child node having information that a character string exists at the first preset position, a child node having information that a character string exists at the second preset position, a child node having information that a prefix exists, and a child node having information that a suffix exists. The other group is the 4 child nodes having a parent-child relationship with the first-level 1b node, which may include the same four kinds of child nodes.
When the tag attribute rule includes a tag-including category and a tag-not-including category, dividing a 3 rd level of the m-level child nodes into two child nodes (e.g., 3a and 3 b) according to the tag-including category and the tag-not-including category, one of the two child nodes including a character string belonging to the tag-including category in the second data, and the other child node including a character string belonging to the tag-not-including category in the second data, wherein any one of the 3 rd level child nodes and one of the 2 nd level child nodes have a parent-child relationship, for example: as shown in fig. 6, the 3a and 3b child nodes are in a parent-child relationship with the 2a child node, and the 3c child node is in a parent-child relationship with the 2b child node.
The following description is given with reference to fig. 9. The tag attribute rules may be of various types. In the embodiment of the present invention, two types are provided: a category with a tag (for example, the content shown in fig. 9) and a category without a tag. The category with a tag may specifically include at least one of: a category of host name only, a category of only the host information of the advertisement attribution, a category of two-level classification by host and domain name, a category of uniform resource locator (URL) information of the host and the advertisement, or a category in which only the domain name and the URL information of the advertisement differ. First, the category with a tag (for example, the content of fig. 9) is described as follows:
MIME_TYPE of request content:
"other": 1
"xbl": 1
"ping": 1
"dtd": 1
"script": 2
"image": 4
"background": 4
"stylesheet": 8
"object": 16
"subdocument": 32
"document": 64
"xmlhttprequest": 2048
"object_subrequest": 4096
"media": 16384
"font": 32768
"popup": 0x1000000
The left column lists the label categories by which the character strings of the category with a tag are divided, and the right column gives the number assigned to each label category (set in the standard).
The above character strings are further divided according to the label category. Here the above 4 sub-categories (such as "script", "image" and "document" in fig. 10) are selected for the label category division, which is shown in fig. 10 only for the black.plane example. A character string in the second data that carries the label "image" is placed on the child node of the "image" label category, and so on. Fig. 10 also contains a category without a tag (e.g., the character strings in the box below the "×" node in fig. 10 belong to the category without a tag). Nodes of other label categories also exist (because the space of the illustration is limited, the other label categories are denoted by '… …'), and a large number of character strings are attached to the nodes of the different label categories. Since character strings of the "script" and "image" categories generally account for a high proportion, fig. 10 gives a large number of example character strings in the child nodes of the "script" and "image" label categories.
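Because the numbers in the table are powers of two, one plausible use is as bit flags, so that a rule carrying several labels is stored as a single integer. The text only states that the numbers are set in the standard, so the bitmask interpretation and the function names below are assumptions made for illustration.

```python
# Label categories and numbers copied from the table above.
TYPE_FLAGS = {
    "other": 1, "xbl": 1, "ping": 1, "dtd": 1,
    "script": 2, "image": 4, "background": 4, "stylesheet": 8,
    "object": 16, "subdocument": 32, "document": 64,
    "xmlhttprequest": 2048, "object_subrequest": 4096,
    "media": 16384, "font": 32768, "popup": 0x1000000,
}


def type_mask(labels):
    """Combine several label categories into one integer (assumed bitmask use)."""
    mask = 0
    for label in labels:
        mask |= TYPE_FLAGS[label]
    return mask


def applies_to(rule_mask, request_type):
    """True if a rule with this mask covers a request of the given type."""
    return bool(rule_mask & TYPE_FLAGS[request_type])


mask = type_mask(["script", "image"])
print(mask, applies_to(mask, "image"), applies_to(mask, "stylesheet"))
```

Note that some labels share a number in the table ("image" and "background" are both 4), so under this interpretation a rule labeled "image" would also cover "background" requests.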
Second, in one possible embodiment, the categories with labels can be specifically classified into: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement. Specific categories among the categories having the tag (such as those related to reference numerals 2 to 6 in fig. 9) will be described in detail.
The host may include the following 4 types:
First kind: direct classification
Specifically, only the host name is included (i.e., the description part of number 2 in fig. 9 contains only the host information). For example, a character string such as 9377os.com (the example part of number 2 in fig. 9) may subsequently be further divided according to the character string of the host name.
Second kind: third classification
Specifically, only the host information of the advertisement attribution is included (i.e., the description part of number 3 in fig. 9 contains only the information that a third-party website accesses the advertisement attribution website). For example, a character string such as 116b.com$third-party (see the corresponding number in fig. 9) may subsequently be further divided under this classification according to the character string of the host name.
Third kind: domain_direct classification
Specifically, the character division is performed according to the two-stage classification of the host and domain (i.e., the description part of the number 4 in fig. 9 contains the domain of the current web page and the information of the host of the advertisement web page).
Fourth kind: domain_Filter classification
Specifically, url information of the host and the advertisement (i.e., information of domain and advertisement content is included in the description part of the number 5 in fig. 9), for example, the following 5 character strings:
according to the above 5 strings, the above 5 strings may be further divided according to domain_filter:
the string containing the advertisement hostname may be: cdndm.com/12/2016/$domain=1kkk.com|dm5.com
The string containing no hostname and advertisement path may be: com/tps/$domain=ocucn.com
The string that does not contain the file name of the host containing the advertisement may be:
/static/media/curl.swf$domain=duba.com
according to the above further division, domain_filter may be further divided into 3 sub-nodes (e.g., as shown in fig. 11), namely, a node containing a string of an advertisement hostname (e.g., 111 in fig. 11), a node containing no string of a hostname containing an advertisement path (e.g., 113 in fig. 11), and a node containing no string of a hostname containing an advertisement (e.g., 112 in fig. 11).
The above classification processing method may also perform the same processing on domain_filters under the attribute classification of the home host. For example, for advertisements containing pictures, this classification method may be used:
Fifth, THIRD_FILTERS Classification
Specifically, for example: the book is a com. Tw/exep/ap/$thiard-party string.
The difference between this and the fourth domain_filter is that only domains currently accessed by the user and hosts of advertisements are different, and advertisement information processing portions thereof are the same. The same can be done according to the fourth domain_filter section.
In addition, a fifth method may be included: type_filter classification
Specifically, the domain_filter and the third_filter may be combined into a type_filter, and the type_filter may include two sub-categories of domain_filter and third_filter when only advertisement content information exists.
In summary, according to the black-and-white list division, the matching mode division and the rule category division, the second data may be subjected to tree transformation processing, and the transformed tree structure may be as shown in fig. 12, specifically, taking a black.plane node as an example, and combining the matching mode division and the rule category division to form the tree structure of fig. 12.
For the sub-nodes 120-129 in fig. 12, the character strings may be further divided by domain, by the host of the advertisement, by the path of the advertisement object, and by the host name (name). Specifically, for the classification of at least one sub-node under direct or threaded in fig. 12, the host names may be divided according to whether the first character is in 0-9, a-z, A-Z, or others (as shown in detail in fig. 13). In fig. 13 the original 4 character strings are divided into 3 sub-nodes according to these categories. This is only one example (division by host name); the rest (domains, the host of the advertisement, and the path of the advertisement object) can be divided in the same way, which is not described in detail herein.
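The first-character division can be sketched as a bucketing function. The function names are hypothetical; the four classes (0-9, a-z, A-Z, others) follow the description, and the sample host names are ones that appear elsewhere in the text.

```python
def first_char_class(name):
    """Classify a host name by its first character: 0-9, a-z, A-Z, or others."""
    if not name:
        return "others"
    c = name[0]
    if "0" <= c <= "9":
        return "0-9"
    if "a" <= c <= "z":
        return "a-z"
    if "A" <= c <= "Z":
        return "A-Z"
    return "others"


def bucket_by_first_char(host_names):
    """Group host names into child nodes keyed by first-character class."""
    buckets = {}
    for name in host_names:
        buckets.setdefault(first_char_class(name), []).append(name)
    return buckets


print(bucket_by_first_char(["9377os.com", "116b.com", "cdndm.com", "duba.com"]))
```

At matching time, only the bucket whose class matches the first character of the requested host name needs to be searched.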
The browser server may further divide the third level, 120-129 subnodes in fig. 12, to the fourth level again, where the selected division preset rule may be a character rule, specifically, when the character rule includes a category of a first character string and a category of a preset character string, according to the category of the first character string and the category of the preset character string, divide the 4 th level of the m level subnodes into two subnodes, where one of the two subnodes includes a character string belonging to the category of the first character string in the second data, and the other subnode includes a character string belonging to the category of the preset character string in the second data, and any one of the 4 th level subnodes and one of the 3 rd level subnodes are in a parent-child relationship.
It should be noted that each child node in the kth level child node has a parent-child relationship with one child node in the k-1 level, the k level child node is any one level child node in the m level child nodes, and k is an integer greater than or equal to 1. In connection with the above example, it will be understood that, in the case where n=4, if k=2 (where k is equal to or less than n and is a positive integer), each of the level 2 child nodes (e.g., 2a and 2 b) has a parent-child relationship with one child node (e.g., 1 b) in the first level; alternatively, k=3, i.e., each of the level 3 child nodes (e.g., 3a, 3b, and 3 c) has a parent-child relationship with one child node in the second level (e.g., 2a or 2 b). Where each child node may have a child node at a next level that has a parent-child relationship with the node.
It should be noted that, in another possible embodiment, only two or three of the above 4 division modes may be selected.
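The level-by-level division described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the server's actual implementation: the Node class, the function names and the three classification tests (a leading @@ for white-list strings, the presence of ^ or * for positioning matching, and the presence of $ for tagged strings) are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the tree: a category name, its strings and its children."""
    name: str
    strings: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def child(self, name: str) -> "Node":
        # Create the child node on first use, then reuse it.
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]


# Assumed classification tests, one per level (hypothetical syntax).
def list_category(rule: str) -> str:
    return "white" if rule.startswith("@@") else "black"


def match_category(rule: str) -> str:
    return "location" if ("^" in rule or "*" in rule) else "preset"


def tag_category(rule: str) -> str:
    return "tagged" if "$" in rule else "untagged"


LOWER_LEVEL_RULES = [match_category, tag_category]


def build_tree(rules):
    """Distribute the strings over the levels; only the black node is subdivided."""
    root = Node("ROOT")
    for rule in rules:
        node = root.child(list_category(rule))
        if node.name == "black":
            for classify in LOWER_LEVEL_RULES:
                node = node.child(classify(rule))
        node.strings.append(rule)
    return root
```

Each string ends up in exactly one leaf, so a later lookup only has to inspect the strings stored along a single path from the root.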
The functions of each child node of the tree structure include: matching at least one of the URL of the user access page or the URL of each element of the access page according to the character strings included in each child node; and distributing the sub-node matched with the URL of the access page of the user or the character string characteristics contained in the URL of each element of the access page.
Next, when the browser client starts the browser to access a web page, it intercepts the target information (i.e. the advertisement information) in the accessed web page according to the first data with the tree structure by executing the following steps. Fig. 14 is a flowchart of an information interception method according to an embodiment of the present invention. As shown in fig. 14, the method includes steps S1410-S1440, as follows:
S1410: The browser client launches a browser to access the web page.
Specifically, before this step, the method may further include receiving the first data. The first data may be downloaded from the browser server, typically periodically (e.g., automatically at 12:00 every day when the terminal is connected to the network). The first data is obtained after the server performs tree transformation processing on the second data, where the second data includes effective character strings and custom character strings of the browser, and the effective character strings are character strings whose usage rate is greater than a preset threshold, determined by screening the open source character strings from open source websites against the historical data reported by the terminal within a preset time period.
S1420: The browser client acquires the information of the access webpage.
Specifically, the access web page may also refer to web site information.
The information for accessing the web page may include: the user accesses the URL of the page or accesses the URLs of the various elements of the web page. The information of the access web page may or may not include target information, where the target information generally refers to advertisement information in the embodiments provided in the present application.
S1430: the browser client matches the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not.
The specific matching process may be as follows:
First, the tree structure is described (the tree structure may be the one determined by the tree transformation of the browser server); it may be described in detail with reference to fig. 6. The tree structure includes a plurality of nodes, which include a ROOT node (ROOT) and at least one level of child nodes, and each level of the at least one level of child nodes includes at least two child nodes. The nodes of each level have a parent-child relationship with the nodes of the next level, and the first data is distributed over the plurality of nodes in the tree structure according to a preset rule. The information of the access webpage is matched step by step, from the first data of a parent node of the tree structure to the first data of the child node in a parent-child relationship with that parent node, until it is determined whether the information of the access webpage includes the target information.
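The step-by-step descent from a parent node to one of its child nodes can be illustrated with the following minimal Python sketch. The MatchNode class, the accepts predicate and the descend function are hypothetical names introduced only for this illustration, and the simple predicates in the test tree stand in for the real string matching at each node.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MatchNode:
    """A tree node whose `accepts` predicate stands in for its string matching."""
    name: str
    accepts: Callable[[str], bool]
    children: List["MatchNode"] = field(default_factory=list)


def descend(root: MatchNode, url: str) -> List[str]:
    """Return the names of the nodes visited while matching `url` level by level."""
    path = [root.name]
    node = root
    while node.children:
        # Take the first child at this level whose strings match the URL.
        nxt = next((c for c in node.children if c.accepts(url)), None)
        if nxt is None:
            break  # no child matches: leave the tree
        path.append(nxt.name)
        node = nxt
    return path
```

Only the children of the node matched at the previous level are examined, which is what keeps the number of comparisons small.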
Specifically, the tree structure may include m-level child nodes, where each level of child nodes in the m-level child nodes is divided according to different preset rules in n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m; the j-th level of child nodes select 1 preset rule from f preset rules to divide, wherein f preset rules are the rest preset rules selected for the previous j-1 level of child nodes in the n preset rules, the j-1 level of child nodes are the last level of child nodes of the j level of child nodes, the j level of child nodes are any level of child nodes in the m level of child nodes, and both j and f are integers greater than or equal to 1; each of the n preset rules respectively comprises at least two categories of character strings;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
Wherein, the n preset rules may include at least one rule of: black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules.
The embodiment provided in this application performs the division and the matching according to the above 4 rules.
The level 1 in the tree structure comprises two sub-nodes, wherein a first sub-node in the level 1 sub-node comprises a plurality of character strings of a category with a white list, and a second sub-node in the level 1 sub-node comprises a plurality of character strings of a category with a black list. Wherein, the two child nodes are divided according to the rule of the black-and-white list.
In the matching process, if the information of the access webpage is matched in the first sub-node, the matching is directly finished, and a large number of character strings are not needed to be matched in the second sub-node.
For example, as shown in fig. 7, a (black) child node and a (white) child node are determined for the website related to sina, where the (white) child node may include a @ @ |sina.com/litong/close string, and the (black) child node includes character strings of the category of the black list. If the URL of a picture (an element) is https://sina.com/litong/180528/close.jpg, it is first matched against the strings in the (white) child node; once it matches the "@ @ sina.com/litong// close" string, it is no longer necessary to match the strings in the (black) child node. It can be seen that the strings of the (white) child node are used to screen out information that is not advertisement information: when such a match is completed, the information is not an advertisement, it is not intercepted, and the matching process is terminated by jumping out of the tree structure.
However, when the information of the access webpage does not include a character string of the category of the white list, the information of the access webpage is matched with the character strings of the category of the black list, that is, in the second child node. When the information of the access webpage does not include a character string of the category of the black list either, the terminal determines that the information of the access webpage does not include the target information: the information is not an advertisement and is not intercepted, that is, the terminal jumps out of the tree structure and the matching process is terminated.
When the information of the access webpage includes a character string of the category of the blacklist, the terminal matches the information of the access webpage step by step against the child nodes that are in a parent-child relationship with the child node holding the character strings of the blacklist category, until it determines that the information of the access webpage is completely matched, and the terminal then intercepts the target information in the information of the access webpage.
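The first-level decision described above (white list first, then black list) might look like the following Python sketch. The function name, the return labels and the use of plain substring containment as the matching test are assumptions made for illustration; the example strings in the usage are hypothetical.

```python
def first_level_decision(url, white_strings, black_strings):
    """Decide what to do with `url` at level 1 of the tree.

    Returns "not_target" when matching can stop without interception,
    or "continue" when matching must go on below the black-list node.
    Substring containment is an assumed stand-in for the real matching.
    """
    if any(s in url for s in white_strings):
        return "not_target"   # white-list hit: the black list is never consulted
    if not any(s in url for s in black_strings):
        return "not_target"   # neither list matches: not an advertisement
    return "continue"         # black-list hit: descend to the level-2 child nodes
```

Because the white list is checked first, a white-list hit ends the matching even when a black-list string would also have matched.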
The 2 nd level in the tree structure comprises two child nodes, wherein any child node in the 2 nd level child node and the child node of the character string belonging to the category of the blacklist in the 1 st level child node are in a father-son relationship.
The first child node in level 2 includes a string having a category locating a match, and the second child node in level 2 includes a string having a category of a preset match. The category of the positioning match is used for screening at least one of information of character strings existing at a first preset position or information of separators existing at a second preset position in the information of the access webpage; the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
For example: the child nodes with the category of positioning matching are mainly divided according to characters at fixed positions. Specifically, a * exists at the first preset position, where * represents that any character string may appear at the first preset position; alternatively, a ^ exists at the second preset position, where ^ represents the presence of a separator at the second preset position (a separator may be any character other than letters, numbers, _, -, or %). For the web address accessed by the browser client below, ://, :, /, ?, & and = can be regarded as separators:
http://example.com:8000/foo.bar?a=12&b=%D1%82%D0%B5%D1%81%D1%82
Therefore, this address can be filtered by any of the rules ^example.com^, ^%D1%82%D0%B5%D1%81%D1%82^ or ^foo.bar^ in the rule list of positioning matching.
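A positioning-matching rule of this kind can be checked with a regular expression, as in the following minimal Python sketch. It assumes that the separator placeholder is written ^ and the any-string placeholder is written *, as in common ad-filter syntax, that a dot is not a separator, and that ^ may also match the end of the address; rule_to_regex and rule_matches are hypothetical helper names.

```python
import re

# Assumption: a separator is any character other than a letter, a digit,
# "_", "-", "." or "%".
SEPARATOR = r"[^A-Za-z0-9_\-.%]"


def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate a rule containing * and ^ placeholders into a regex."""
    parts = []
    for ch in rule:
        if ch == "*":
            parts.append(".*")                      # any character string
        elif ch == "^":
            parts.append(f"(?:{SEPARATOR}|$)")      # a separator or the end
        else:
            parts.append(re.escape(ch))             # literal character
    return re.compile("".join(parts))


def rule_matches(rule: str, url: str) -> bool:
    """Return True when the rule matches anywhere in the URL."""
    return rule_to_regex(rule).search(url) is not None
```

With the example address above, the rules ^example.com^, ^foo.bar^ and ^%D1%82%D0%B5%D1%81%D1%82^ all match under these assumptions, while ^example^ does not, because the character following "example" is a dot rather than a separator.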
The prefix information or the suffix information is handled in the same way: when a corresponding character string appears at a preset position, the information can be identified as matched. For example, for the white, black, and black, glob, if the information of the access webpage includes the same prefix or suffix characters, it can be identified as matched. When the matching at this level is completed, it is determined whether the information of the access webpage has been completely matched. If not, the matching continues to the 3rd level. If so, the information is an advertisement, the terminal intercepts the target information corresponding to the URL, and the matching process is terminated.
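Prefix and suffix screening of the preset-matching category can be sketched in Python as follows. The function name and its parameters are hypothetical, and the prefixes and suffixes in the usage are illustrative values, not strings taken from the first data.

```python
def matches_preset(url: str, prefixes=(), suffixes=()) -> bool:
    """Return True when `url` starts with a listed prefix or ends with a listed suffix."""
    # str.startswith / str.endswith accept a tuple of alternatives;
    # an empty tuple never matches.
    return url.startswith(tuple(prefixes)) or url.endswith(tuple(suffixes))
```

For instance, matches_preset("http://ads.example/a.gif", prefixes=["http://ads."]) evaluates to True, because the URL begins with the listed prefix.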
The first child node in level 3 includes a string of tagged categories and the second child node includes a string of untagged categories. Any one of the level 3 child nodes and one of the level 2 child nodes are in a parent-child relationship.
The information with the tag is used for screening information including the tag attribute in the information of the access webpage, and the information without the tag is used for screening information not including the tag attribute in the information of the access webpage. The category with the tag can be further divided into: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
The specific matching process can be matching according to the following classification method:
First kind: direct matching
Specifically, the string only includes the host name (i.e., the description part of number 2 in fig. 9 includes only the host information), for example the string 9377os.com (example section of number 2 in fig. 9); subsequent matching may be performed further according to the character string of the host name.
Second kind: third matching
Specifically, only host information of advertisement attribution (i.e., the description part of the number 3 in fig. 9 only contains information of the third party website accessing the advertisement attribution website), for example: (sequence number in FIG. 9) 116b.com$third-party may be further matched under classification by the string of hostnames.
Third kind: domain_direct matching
Specifically, character matching is performed according to the two-stage classification of the host and domain (i.e., the description part of the number 4 in fig. 9 contains the domain of the current web page and the information of the host of the advertisement web page).
Fourth kind: domain_Filter matching
Specifically, url information of the host and the advertisement (i.e., information of domain and advertisement content is included in the description part of the number 5 in fig. 9), for example, the following 5 character strings:
according to the 5 character strings, the matching can be further performed according to domain_Filter:
The string containing the advertisement hostname may be: cdndm.com/12/2016/$domain=1kkk.com|dm5.com
The string containing no hostname and advertisement path may be: com/tps/$domain=ocucn.com
The string that does not contain the file name of the host containing the advertisement may be:
/static/media/curl.swf$domain=duba.com
the matching method can also process domain_filters under attribute classification of the home host in the same way. For example, for advertisements containing pictures, this matching method may be used:
Fifth kind: third_filter matching
Specifically, for example: whether the information of the access web page contains the ||books.com.tw/exep/ap/$third-party string.
This differs from the fourth kind (domain_filter) only in that the domain currently accessed by the user and the host of the advertisement are different; the advertisement information is processed in the same way, so the processing can follow the fourth kind (domain_filter).
In addition, a sixth kind may be included: type_filter matching
Specifically, the domain_filter and the third_filter may be combined into a type_filter, and when the advertisement content information is included in the accessed page, the type_filter may include two subclasses of the domain_filter and the third_filter.
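The sub-classes listed above can be told apart from the shape of a rule string, as the following Python sketch suggests. It assumes that options follow a $ sign, that a domain= option names the pages on which the rule applies, that third-party marks third-party requests, and that a pattern without a slash is a bare host name; classify_rule and the label "other" are hypothetical.

```python
def classify_rule(rule: str) -> str:
    """Assign a rule string to one of the matching sub-classes described above."""
    pattern, _, options = rule.partition("$")
    pattern = pattern.lstrip("|")          # drop leading anchor characters
    host_only = "/" not in pattern         # assumption: no slash means host name only
    opts = options.split(",") if options else []

    if any(o.startswith("domain=") for o in opts):
        return "domain_direct" if host_only else "domain_filter"
    if "third-party" in opts:
        return "third" if host_only else "third_filter"
    return "direct" if host_only else "other"
```

Under these assumptions, 9377os.com is classified as direct, 116b.com$third-party as third, /static/media/curl.swf$domain=duba.com as domain_filter and ||books.com.tw/exep/ap/$third-party as third_filter.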
When the matching at this level is completed, it is determined whether the information of the access webpage has been completely matched. If not, the matching continues to the 4th level. If so, the information is an advertisement, the terminal intercepts the target information corresponding to the URL, and the matching process is terminated.
The first child node in level 4 includes a string of the category of the first string and the second child node includes a string of the category of the preset string. Any one of the level 4 child nodes and one of the level 3 child nodes are in a parent-child relationship.
Specifically, in the matching process, the category of the first character string is used for screening information of the access webpage that has the same first character as a character string of the category of the first character string; the category of the preset character string is used for screening information of the access webpage that contains the same preset character string as a character string of the category of the preset character string.
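Screening by first character can be sketched in Python as follows. The four character classes (0-9, a-z, A-Z, other) follow the division by first character described for fig. 13; the function names and the index layout are assumptions made for this illustration.

```python
def first_char_class(text: str) -> str:
    """Return the class of the first character: "0-9", "a-z", "A-Z" or "other"."""
    if not text:
        return "other"
    c = text[0]
    if "0" <= c <= "9":
        return "0-9"
    if "a" <= c <= "z":
        return "a-z"
    if "A" <= c <= "Z":
        return "A-Z"
    return "other"


def build_index(strings):
    """Group the strings of a node by the class of their first character."""
    index = {}
    for s in strings:
        index.setdefault(first_char_class(s), []).append(s)
    return index


def candidates(host: str, index) -> list:
    """Return only the strings that share the first-character class of `host`."""
    return index.get(first_char_class(host), [])
```

A host name starting with a digit is then compared only with the strings in the 0-9 group instead of with every string of the parent node.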
When the matching at this level is completed, it is determined whether the information of the access webpage has been completely matched. If not, the matching continues to the 5th level, and so on, until the information of the access webpage is completely matched. If so, the information is an advertisement, the terminal intercepts the target information corresponding to the URL, and the matching process is terminated.
In summary, the functions of each child node in the tree structure include: matching at least one of the URL of the user access page or the URL of each element of the access page according to the character strings included in each child node; and distributing the sub-node matched with the URL of the access page of the user or the character string characteristics contained in the URL of each element of the access page.
S1440: when the information of the access webpage comprises target information, the target information is intercepted.
Specifically, when it is determined in S1430 that the information of the access webpage is completely matched, the information is an advertisement; the terminal intercepts (which may include deleting or hiding) the target information corresponding to the URL in the access webpage, and the matching process is terminated. As a result, the user is unaware of the presence of the advertisement.
If it is determined that the information of the access webpage is not matched, that is, the length of the matched URL does not exceed the preset threshold, the access page of the user does not contain advertisement information, and the browser client can directly display the access page to the user.
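The interception step itself can be pictured with the short Python sketch below. It assumes that the page is represented by the list of URLs of its elements and that a predicate is_target reports the outcome of the tree matching; both the function name and this representation are hypothetical simplifications.

```python
def intercept_elements(element_urls, is_target):
    """Split page elements into those to display and those to intercept.

    `is_target` is an assumed callback that returns True when the tree
    matching has identified the URL as target (advertisement) information.
    """
    shown, intercepted = [], []
    for url in element_urls:
        if is_target(url):
            intercepted.append(url)   # deleted or hidden, never shown to the user
        else:
            shown.append(url)         # displayed as usual
    return shown, intercepted
```

When no element is identified as target information, the intercepted list stays empty and the whole page is displayed unchanged.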
According to the scheme, the target information in the browser page is intercepted through the first data with the tree structure. The tree structure can deeply distinguish the character strings in the first data, which effectively reduces the number of times the information of the access webpage is matched against the first data, and thus avoids the increase in the number of matches that occurs when a large number of character strings is used for intercepting the target information without a rational matching mode. In addition, the character strings in the acquired open source list are counted in order to remove invalid or rarely accessed character strings, which reduces the number of rules and thus effectively reduces the number of subsequent matches.
Fig. 15 is a schematic diagram of a terminal structure for information interception according to an embodiment of the present invention. As shown in fig. 15, the terminal 15 may include: one or more processors 1502, transceivers 1501, and applications (not shown in the figure) in memory 1503; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the terminal, cause the terminal to:
starting a browser to access a webpage;
acquiring information of access to a webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not;
when the information of the access webpage comprises target information, the target information is intercepted.
The tree structure may include: the system comprises a plurality of nodes, wherein the nodes comprise root nodes and at least one level of sub-nodes, and each level of the at least one level of sub-nodes comprises at least two sub-nodes; the nodes of each stage have a parent-child relationship with the associated next stage node, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
The terminal may specifically perform the following steps:
and matching the information of the access webpage step by step from the first data of the father node of the tree structure to the first data of the child node in father-son relation with the father node until determining whether the information of the access webpage comprises target information.
The tree structure specifically may include m-level child nodes, where each level of child nodes in the m-level child nodes is divided according to different preset rules in n preset rules, where n and m are integers greater than or equal to 1, and n is greater than or equal to m; the j-th level of child nodes select 1 preset rule from f preset rules to divide, wherein f preset rules are that the previous j-1 level of child nodes in n preset rules select the rest preset rules, the j-1 level of child nodes are the previous level of child nodes in the j level of child nodes, the j level of child nodes are any level of child nodes in the m level of child nodes, and j and f are integers which are more than or equal to 1; each of the n preset rules includes at least two categories of character strings; the first data comprises a plurality of character strings, the character strings of the first data are divided according to m levels of sub-nodes, each sub-node in the m levels of sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
Wherein, the n preset rules may include at least one rule of: black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules.
Specifically, the black-and-white list rule may include: the class of the white list and the class of the black list, the 1 st level of the m level of the child nodes are divided according to the black and white list rule, and the character strings belonging to the class of the white list and the character strings belonging to the class of the black list in the first data correspond to one of the 1 st level of the child nodes respectively.
The terminal may perform the steps of: and matching the information of the access webpage with the character strings of the category of the white list, wherein when the information of the access webpage comprises the character strings of the category of the white list, the terminal determines that the information of the access webpage does not comprise target information, and the terminal does not intercept the target information.
The terminal may also perform the steps of: when the information of the access webpage does not comprise the character strings of the category of the white list, matching the information of the access webpage with the character strings of the category of the black list; when the information of the access webpage does not comprise the character strings of the category of the blacklist, the terminal determines that the information of the access webpage does not comprise the target information, and the terminal does not intercept the target information; when the information of the access webpage comprises the character strings of the blacklist category, the terminal matches the information of the access webpage with the child nodes of the character strings belonging to the blacklist category in a father-son relationship step by step until the fact that the information of the access webpage is matched is determined to be finished, and the terminal intercepts target information in the information of the access webpage.
The positioning and preset matching rules may specifically include: the method comprises the steps that a category of positioning matching and a category of preset matching are divided by a 2 nd-level child node in m-level child nodes according to positioning and preset matching rules, character strings belonging to the category of positioning matching and character strings belonging to the category of preset matching in first data correspond to one child node in the 2 nd-level child nodes respectively, and any child node in the 2 nd-level child nodes and child nodes of the character strings belonging to the category of blacklists in the 1 st-level child nodes are in a father-son relationship.
The category of the positioning match can be used for screening at least one of information of character strings existing at a first preset position or information of separators existing at a second preset position in the information of the access webpage; the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
The tag attribute rule may specifically include: the method comprises the steps that a class with a label and a class without the label are divided according to a label attribute rule, and a character string belonging to the class with the label and a character string without the label in first data correspond to one of the 3 rd-level sub-nodes respectively, wherein any one of the 3 rd-level sub-nodes and one of the 2 nd-level sub-nodes are in a father-son relationship.
The category with the tag can be used for screening information which includes tag attributes in the information of the accessed webpage, and the category without the tag is used for screening information which does not include tag attributes in the information of the accessed webpage; the category provided with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
The character rule may include: the method comprises the steps that the class of a first character string and the class of a preset character string are divided according to character rules, the class 4 of m-level sub-nodes, the character string belonging to the class of the first character string and the character string of the class of the preset character string in first data correspond to one of the class 4 sub-nodes respectively, and any one of the class 4 sub-nodes and one of the class 3 sub-nodes are in a father-son relationship.
The category of the first character string can be used for screening information of the access webpage that has the same first character as a character string of the category of the first character string; the category of the preset character string is used for screening information of the access webpage that contains the same preset character string.
In the above step, the information of accessing the web page may include: the user accesses the URL of the page or the URL of each element of the web page, and the target information is advertisement information. The first data are obtained after the server side performs tree transformation processing according to the second data, wherein the second data comprise effective character strings and custom character strings of a browser, and the effective character strings are character strings with the utilization rate larger than a preset threshold value determined by screening open source character strings in open source websites and historical data in a preset time period, which are reported by a terminal.
Because the first data is downloaded from the server to the terminal, the whole matching process is carried out in the terminal, so that the speed at which the terminal matches information is greatly improved, and the problem in the prior art that page content can be processed quickly only by a server with higher performance is solved.
In the scheme, the terminal intercepts target information in the browser page through the first data with the tree structure, the tree structure can deeply distinguish character strings in the first data, and the matching times of the information of the access webpage and the first data are effectively reduced, so that the problem that the matching times are increased due to the fact that the number of character strings for intercepting the target information is large and a rationalized matching mode is not adopted is avoided.
Fig. 16 is a schematic structural diagram of a data processing server according to an embodiment of the present invention. As shown in fig. 16, the server 16 may include: one or more processors 1601, transceivers 1602, and memory 1603; and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the server, cause the server to perform the steps of:
performing tree transformation processing on the second data to determine first data;
the server transmits the first data to the terminal so that the terminal can determine whether the access webpage contains target information or not.
Wherein, the target information can be advertisement information; the information for accessing the web page includes: the user accesses at least one of a URL of the page or a URL of each element of the web page.
The server may perform the specific steps of: periodically acquiring at least one open source character string from an open source website; selecting a plurality of character strings with access quantity larger than a first threshold value from at least one open source character string and historical data in a preset time period, which are reported by a client, as effective character strings; acquiring a custom character string of a browser server; and determining second data according to the effective character string and the custom character string, wherein the effective character string and the custom character string respectively comprise at least one character string.
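The selection of the effective character strings and their combination with the custom character strings can be sketched in Python as follows. The function name, the usage_counts mapping (string to number of accesses in the reported history) and the order-preserving de-duplication are assumptions made for this illustration.

```python
def build_second_data(open_source_strings, usage_counts, custom_strings, threshold):
    """Keep open source strings used more than `threshold` times, then add custom strings.

    `usage_counts` is an assumed summary of the historical data reported by
    clients within the preset time period.
    """
    effective = [s for s in open_source_strings if usage_counts.get(s, 0) > threshold]
    second_data = []
    for s in effective + list(custom_strings):
        if s not in second_data:      # avoid duplicates, keep first occurrence
            second_data.append(s)
    return second_data
```

Strings that never appear in the reported history are treated as having a count of zero and are therefore dropped, which is how rarely used rules leave the second data in this sketch.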
The server may perform the specific steps of: dividing the plurality of sub-nodes into m levels according to n preset rules, wherein the preset rules of each level in the m levels of sub-nodes are different; each of n preset rules respectively comprises at least two categories of character strings, and each layer in m levels is divided into at least two child nodes according to the categories of the character strings; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m; each child node in the kth level child nodes has a father-son relationship with one child node in the k-1 level, the k level child nodes are any one level child node in the m level child nodes, and k is an integer greater than or equal to 1.
The n preset rules may include at least one of the following rules: black and white list rules, positioning and presetting matching rules, tag attribute rules or character rules; the server performs the steps of: dividing the plurality of sub-nodes into m-level sub-nodes according to a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule and a character rule.
The server may perform the specific steps of: when the black-and-white list rule includes the category of the white list and the category of the black list, dividing the 1 st level of m level sub-nodes into two sub-nodes according to the category of the white list and the category of the black list, wherein one sub-node of the two sub-nodes includes a character string belonging to the category of the white list in the second data, and the other sub-node includes a character string belonging to the category of the black list in the second data.
The server may perform the specific steps of: when the locating and preset matching rules comprise locating matching categories and preset matching categories, dividing the 2 nd level of m level sub-nodes into two sub-nodes according to the locating matching categories and the preset matching categories, wherein one sub-node of the two sub-nodes comprises character strings belonging to the locating matching categories in the second data, and the other sub-node comprises character strings belonging to the preset matching categories in the second data, and the two sub-nodes in the 2 nd level are in a father-son relationship with the node where the character strings belonging to the blacklist categories in the 1 st level are located.
The server may perform the specific steps of: when the tag attribute rule comprises a category with a tag and a category without the tag, dividing the 3 rd level of m-level child nodes into two child nodes according to the category with the tag and the category without the tag, wherein one child node of the two child nodes comprises a character string belonging to the category with the tag in the second data, and the other child node comprises a character string belonging to the category without the tag in the second data, and any child node in the 3 rd level and one child node in the 2 nd level child node are in a father-son relationship.
The category having the tag may include: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
The server may perform the specific steps of: when the character rule comprises the category of the first character string and the category of the preset character string, dividing the 4 th level of m level sub-nodes into two sub-nodes according to the category of the first character string and the category of the preset character string, wherein one sub-node of the two sub-nodes comprises the character string belonging to the category of the first character string in the second data, the other sub-node comprises the character string belonging to the category of the preset character string in the second data, and any sub-node in the 4 th level and one sub-node in the 3 rd level sub-node are in a father-son relationship.
In this scheme, tree transformation processing is performed on the second data so that the character strings in the second data are distinguished in depth and converted into a highly discriminative tree structure, which effectively reduces the number of matches between the information of the accessed web page and the first data.
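As an illustrative sketch only, and not the implementation of the embodiment, the level-by-level division described above could be expressed in Python as follows. The entry fields (`pattern`, `list`, `match`, `tag`, `char`), the category labels, and the function name `build_tree` are hypothetical names introduced for illustration. The sketch assumes that every character string in the second data has already been labeled with its category under each of the four rules.

```python
# Hypothetical labels: each entry already carries its category under every rule.
BLACKLIST_LEVELS = ("match", "tag")  # levels 2 and 3, below the blacklist node


def build_tree(entries):
    """Arrange labeled character strings into a nested, tree-shaped dict."""
    # Level 1: the black-and-white list rule yields two child nodes.
    tree = {"white": [], "black": {}}
    for entry in entries:
        if entry["list"] == "white":
            tree["white"].append(entry["pattern"])
            continue
        node = tree["black"]
        # Levels 2 and 3: positioning/preset matching, then tag attribute.
        for level in BLACKLIST_LEVELS:
            node = node.setdefault(entry[level], {})
        # Level 4: the character rule; the strings sit at the ends of the tree.
        node.setdefault(entry["char"], []).append(entry["pattern"])
    return tree


# Made-up sample data, for illustration only.
entries = [
    {"pattern": "trusted.example", "list": "white"},
    {"pattern": "ads.", "list": "black", "match": "preset",
     "tag": "untagged", "char": "first_char"},
    {"pattern": "/banner/", "list": "black", "match": "positioning",
     "tag": "tagged", "char": "preset_string"},
]
tree = build_tree(entries)
```

In this sketch the whitelist node is a leaf, consistent with the statement that the two 2nd-level child nodes hang from the node holding the blacklist-category character strings.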
Fig. 17 is a schematic structural diagram of an information interception device according to an embodiment of the present invention. As shown in fig. 17, the apparatus 17 may include:
a processing module 1702 for launching a browser to access a web page;
the transceiver module 1701 is configured to obtain information of accessing a web page;
the processing module is also used for matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not; when the information of the access webpage comprises target information, the target information is intercepted.
The tree structure may include a plurality of nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes; the nodes of each stage have a parent-child relationship with the associated next stage node, and the first data are distributed on a plurality of nodes in a tree structure according to a preset rule.
The processing module may be specifically configured to match information of the accessed web page from first data of a parent node of the tree structure to first data of a child node in a parent-child relationship with the parent node step by step until it is determined whether the information of the accessed web page includes target information.
The tree structure may include m levels of child nodes, where each level of the m levels of child nodes is divided according to a different one of n preset rules, n and m are integers greater than or equal to 1, and n is greater than or equal to m. The j-th level of child nodes is divided according to 1 preset rule selected from f preset rules, where the f preset rules are the preset rules remaining among the n preset rules after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any level among the m levels of child nodes, and j and f are integers greater than or equal to 1. Each of the n preset rules includes at least two categories of character strings. The first data comprises a plurality of character strings, which are divided according to the m levels of child nodes; each child node in the m levels of child nodes corresponds to a different character string category in the n preset rules, and each child node comprises a plurality of character strings of different character string categories.
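Under the reading that each of the preceding j-1 levels has consumed exactly one of the n preset rules (an interpretation of the paragraph above, not an explicit statement in it), the number f of rules still available to the j-th level can be written as:

```latex
f = n - (j - 1), \qquad 1 \le j \le m \le n \;\Longrightarrow\; f \ge n - m + 1 \ge 1 .
```

For example, with n = 4 preset rules and m = 4 levels, the 3rd level would choose among f = 4 - 2 = 2 remaining rules.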
The n preset rules may include at least one of the following rules: a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
The black-and-white list rule may include a category of a white list and a category of a black list, and the 1 st level child node in the m level child nodes is divided according to the black-and-white list rule, where a character string belonging to the category of the white list and a character string belonging to the category of the black list in the first data correspond to one child node in the 1 st level child node respectively.
The processing module may be specifically configured to match information of the accessed web page with a character string of a class of the whitelist, and determine that the information of the accessed web page does not include the target information and does not intercept the target information when the information of the accessed web page includes the character string of the class of the whitelist.
The processing module may be specifically configured to match the information of the accessed web page against the character strings of the blacklist category when the information does not include a character string of the whitelist category; when the information of the accessed web page does not include a character string of the blacklist category, to determine that the information does not include target information and not to intercept it; and when the information of the accessed web page includes a character string of the blacklist category, to match the information step by step against the child nodes that are in a parent-child relationship with the child node holding the blacklist-category character strings, until matching of the information of the accessed web page is determined to be complete, and to intercept the target information in the information of the accessed web page.
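The matching order described above (whitelist first, then blacklist, then step-by-step descent) might be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class and the functions `matches` and `should_intercept` are hypothetical, substring containment stands in for the unspecified match test, and a failure to match at a deeper level is treated as "do not intercept", which is only one possible reading of the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One child node: the strings it holds plus its children in the next level."""
    strings: list = field(default_factory=list)
    children: list = field(default_factory=list)


def matches(node, url):
    # Assumption: substring containment stands in for the real match test.
    return any(s in url for s in node.strings)


def should_intercept(whitelist, blacklist, url):
    """Return True when the URL is judged to contain target information."""
    if matches(whitelist, url):
        return False  # whitelisted: no target information, nothing intercepted
    if not matches(blacklist, url):
        return False  # not blacklisted: no target information
    node = blacklist
    # Descend level by level, following a child whose strings still match.
    while node.children:
        nxt = next((c for c in node.children if matches(c, url)), None)
        if nxt is None:
            return False  # assumption: no deeper match means no interception
        node = nxt
    return True


# Made-up sample data, for illustration only.
wl = Node(strings=["trusted.example"])
leaf = Node(strings=["/banner"])
bl = Node(strings=["ads.example"], children=[leaf])
```

Because the whitelist check comes first, a URL that contains a whitelisted string is never compared against any blacklist string, which is where the reduction in matching operations comes from in this sketch.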
The positioning and preset matching rule may include a positioning matching type and a preset matching type, the 2 nd level child node in the m level child nodes is divided according to the positioning and preset matching rule, and a character string belonging to the positioning matching type and a character string belonging to the preset matching type in the first data correspond to one child node in the 2 nd level child nodes respectively, wherein any child node in the 2 nd level child node and a child node of the character string belonging to the blacklist type in the 1 st level child node are in a father-son relationship.
The category of the positioning match can be used for screening at least one of information of character strings existing at a first preset position or information of separators existing at a second preset position in the information of the access webpage; the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
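The two 2nd-level categories can be illustrated with a short sketch. The function names and parameters below are hypothetical; the sketch assumes that a "preset position" is a character index into the URL and that prefix and suffix screening are plain string comparisons.

```python
def positioning_match(url, token, position):
    """True if `token` (a character string or a separator) starts at `position`."""
    return url.startswith(token, position)


def preset_match(url, prefix=None, suffix=None):
    """True if the URL starts with `prefix` or ends with `suffix`."""
    starts = prefix is not None and url.startswith(prefix)
    ends = suffix is not None and url.endswith(suffix)
    return starts or ends
```

For instance, `positioning_match("https://ads.example/x", "ads.", 8)` checks only the characters that follow the eight-character `https://` scheme, so the rest of the URL is not scanned.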
The tag attribute rule may include a category with a tag and a category without a tag, the 3rd-level child nodes in the m levels of child nodes are divided according to the tag attribute rule, and the character strings in the first data belonging to the category with the tag and the character strings belonging to the category without the tag each correspond to one child node in the 3rd-level child nodes, wherein any child node in the 3rd-level child nodes is in a parent-child relationship with one child node in the 2nd-level child nodes.
The category with the tag can be used for screening information which includes tag attributes in the information of the accessed webpage, and the category without the tag is used for screening information which does not include tag attributes in the information of the accessed webpage; the category provided with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
The character rule may include a category of a first character string and a category of a preset character string, the 4 th level child node in the m level child nodes is divided according to the character rule, the character string belonging to the category of the first character string and the character string of the category of the preset character string in the first data correspond to one child node in the 4 th level child node respectively, wherein any child node in the 4 th level child node and one child node in the 3 rd level child node are in a father-child relationship.
The category of the first character string may be used to screen for information of the accessed web page whose first character is the same as the first character of a character string in the category of the first character string; the category of the preset character string is used to screen for information of the accessed web page that is the same as a preset character string.
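One way to picture the first-character category is as an index keyed by first character, so that only character strings sharing the first character of the information are compared. The following is a sketch under that assumption; `index_by_first_char` and `first_char_candidates` are hypothetical names.

```python
from collections import defaultdict


def index_by_first_char(strings):
    """Group character strings by their first character."""
    buckets = defaultdict(list)
    for s in strings:
        if s:  # skip empty strings, which have no first character
            buckets[s[0]].append(s)
    return dict(buckets)


def first_char_candidates(buckets, info):
    """Return only the strings whose first character equals that of `info`."""
    return buckets.get(info[:1], [])


# Made-up sample data, for illustration only.
buckets = index_by_first_char(["ads.", "adserver", "banner"])
```

With this index, information beginning with `a` is compared against two strings rather than all three, and information beginning with any other unlisted character is compared against none.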
The information for accessing the web page may include a URL of a user accessing the web page or a URL of each element of the web page, and the target information is advertisement information. The first data can be obtained after the server performs tree transformation processing according to second data, wherein the second data comprises an effective character string and a custom character string of the browser, and the effective character string is a character string with the utilization rate larger than a preset threshold value determined by screening an open source character string in an open source website and reported historical data in a preset time period.
In this scheme, the device intercepts target information in the browser page by means of the first data arranged in a tree structure. Because the tree structure distinguishes the character strings in the first data in depth, the number of matches between the information of the accessed web page and the first data is effectively reduced. This avoids the increase in matching operations that arises when a large number of character strings are used to intercept target information without a reasonable matching scheme, and the overall matching speed can be improved by more than 40%.
Fig. 18 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 18, the apparatus 18 includes:
the processing module 1802 performs tree transformation processing on the second data to determine first data;
the transceiver module 1801 transmits the first data to the terminal, so that the terminal can determine whether the access webpage contains the target information.
Wherein, the target information can be advertisement information; the information for accessing the web page includes: the user accesses at least one of a URL of the page or a URL of each element of the web page.
The transceiver module may be further configured to periodically obtain at least one open-source character string from an open-source website, and to acquire the custom character string of the browser server. The processing module is further configured to select, from the at least one open-source character string and the historical data reported by the client in a preset time period, a plurality of character strings with access counts greater than a first threshold as valid character strings. The processing module is further configured to determine the second data according to the valid character strings and the custom character strings, where the valid character strings and the custom character strings each include at least one character string. The processing module is further configured to divide the plurality of child nodes into m levels according to n preset rules, wherein the preset rule of each level in the m levels of child nodes is different; each of the n preset rules comprises at least two categories of character strings, and each of the m levels is divided into at least two child nodes according to the categories of the character strings; the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers greater than or equal to 1, and n is greater than or equal to m; each child node in the k-th level of child nodes has a parent-child relationship with one child node in the (k-1)-th level, the k-th level of child nodes is any level among the m levels of child nodes, and k is an integer greater than or equal to 1.
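The selection of valid character strings and the assembly of the second data described above could look roughly like the following sketch. The function names, the shape of the reported history (one list element per recorded hit of a character string), and the de-duplication step are assumptions made for illustration.

```python
from collections import Counter


def select_valid_strings(open_source, reported_hits, threshold):
    """Keep open-source strings whose hit count exceeds `threshold`."""
    counts = Counter(reported_hits)  # missing strings count as 0
    return [s for s in open_source if counts[s] > threshold]


def build_second_data(valid, custom):
    """Combine valid and custom strings, dropping duplicates but keeping order."""
    return list(dict.fromkeys([*valid, *custom]))


# Made-up sample data, for illustration only.
valid = select_valid_strings(["a", "b", "c"], ["a", "a", "b", "a", "c", "c"], 1)
second_data = build_second_data(valid, ["c", "z"])
```

Here the string `b` is dropped because it was hit only once in the reported history, which does not exceed the threshold of 1.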
The n preset rules may include at least one of the following rules: a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule; the processing module may be further configured to divide the plurality of child nodes into m levels of child nodes according to the black-and-white list rule, the positioning and preset matching rule, the tag attribute rule, and the character rule.
The processing module may be specifically configured to divide, when the black-and-white list rule includes a category of the white list and a category of the black list, a 1 st level of m level sub-nodes into two sub-nodes according to the category of the white list and the category of the black list, where one of the two sub-nodes includes a string of a category belonging to the white list in the second data, and the other sub-node includes a string of a category belonging to the black list in the second data.
The processing module may be specifically configured to divide, when the positioning and preset matching rule includes a positioning matching category and a preset matching category, the 2nd level of the m levels of child nodes into two child nodes according to the positioning matching category and the preset matching category, where one of the two child nodes includes the character strings in the second data belonging to the positioning matching category, and the other child node includes the character strings in the second data belonging to the preset matching category; the two child nodes in the 2nd level are in a parent-child relationship with the 1st-level node that holds the character strings belonging to the blacklist category.
The processing module may be specifically configured to divide, when the tag attribute rule includes a category with a tag and a category without a tag, a 3 rd level of the m level of child nodes into two child nodes according to the category with the tag and the category without the tag, where one of the two child nodes includes a string belonging to the category with the tag in the second data, and the other child node includes a string belonging to the category without the tag in the second data, and any one of the 3 rd level of child nodes and one of the 2 nd level of child nodes are in a parent-child relationship.
The category having the tag may specifically include: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
The processing module may be specifically configured to divide, when the character rule includes a category of a first character string and a category of a preset character string, a 4 th level of m level subnodes into two subnodes according to the category of the first character string and the category of the preset character string, where one of the two subnodes includes a character string belonging to the category of the first character string in the second data, and the other subnode includes a character string belonging to the category of the preset character string in the second data, and any one of the 4 th level subnodes and one of the 3 rd level subnodes are in a parent-child relationship.
In this scheme, tree transformation processing is performed on the second data so that the character strings in the second data are distinguished in depth and converted into a highly discriminative tree structure, which effectively reduces the number of matches between the information of the accessed web page and the first data.
The embodiments of the invention provide a method, a device, and a terminal for information interception. By compiling statistics on the large number of character strings in an open-source list, invalid character strings and character strings with a small access count are removed, which reduces the number of character strings. On this basis, the second data is converted into first data with a tree structure, which is used to intercept target information in a browser page. Because the tree structure distinguishes the character strings in the first data in depth, the number of matches between the information of the accessed web page and the first data is effectively reduced. This avoids the increase in matching operations that arises when a large number of character strings are used to intercept target information without a reasonable matching scheme; in actual statistics, the overall matching speed can be improved by more than 40%. Specifically, when the tree structure is divided, the character strings are analyzed in a tree-like manner and the second data is divided by using a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule. This distinguishes the character strings in depth and converts them into a highly discriminative tree structure, which greatly increases the speed at which the browser client intercepts advertisements and effectively improves the user experience.
The foregoing embodiments have been provided to illustrate the general principles of the present invention in further detail and are not to be construed as limiting the scope of the invention; any modifications, equivalents, improvements, etc. made on the basis of the teachings of the invention are intended to be covered by it.
Claims (28)
1. A terminal for information interception, comprising: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the terminal, cause the terminal to perform the steps of:
starting a browser to access a webpage;
acquiring the information of the access webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not; the first data comprises a plurality of character strings and a plurality of categories, the plurality of character strings are divided into a plurality of parts based on the plurality of categories, the plurality of parts are respectively used as the tail ends of the tree structure, and the plurality of categories are used as branches of the tree structure to the tail ends of the tree structure;
and intercepting the target information when the information of the access webpage comprises the target information.
2. The terminal of claim 1, wherein the tree structure comprises a plurality of nodes including a root node and at least one level of child nodes, each level of the at least one level of child nodes including at least two child nodes;
the nodes of each stage have a parent-child relationship with the nodes of the next stage, and the first data are distributed on the plurality of nodes in the tree structure according to a preset rule.
3. The terminal according to claim 2, characterized in that it performs the following steps:
and matching the information of the access webpage step by step from the first data of the father node of the tree structure to the first data of the child node in father-son relation with the father node until determining whether the information of the access webpage comprises the target information.
4. A terminal according to claim 3, wherein the tree structure comprises m levels of sub-nodes, each level of sub-nodes in the m levels of sub-nodes being divided according to different preset rules among n preset rules, n and m being integers greater than or equal to 1, n being greater than or equal to m;
the j-th level of child nodes is divided according to 1 preset rule selected from f preset rules, wherein the f preset rules are the preset rules remaining among the n preset rules after the preceding j-1 levels of child nodes have made their selections, the (j-1)-th level of child nodes is the level immediately above the j-th level of child nodes, the j-th level of child nodes is any level among the m levels of child nodes, and both j and f are integers greater than or equal to 1;
each of the n preset rules respectively comprises at least two categories of character strings;
the first data comprises a plurality of character strings, the character strings of the first data are divided according to m-level sub-nodes, each sub-node in the m-level sub-nodes corresponds to different character string categories in n preset rules respectively, and each sub-node comprises a plurality of character strings with different character string categories.
5. The terminal of claim 4, wherein the n preset rules include at least one of the following rules:
a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule.
6. The terminal according to claim 5, wherein the black-and-white list rule includes a category of a white list and a category of a black list, a 1 st level child node of the m level child nodes is divided according to the black-and-white list rule, and a character string belonging to the category of the white list and a character string belonging to the category of the black list in the first data respectively correspond to one child node of the 1 st level child nodes.
7. The terminal according to claim 6, characterized in that it performs the following steps:
and the terminal matches the information of the access webpage with the character string of the category of the white list, and when the information of the access webpage comprises the character string of the category of the white list, the terminal determines that the information of the access webpage does not comprise the target information.
8. The terminal according to claim 7, characterized in that the terminal further performs the steps of: when the information of the access webpage does not comprise the character strings of the category of the white list, matching the information of the access webpage with the character strings of the category of the black list;
when the information of the access webpage does not comprise the character strings of the category of the blacklist, the terminal determines that the information of the access webpage does not comprise the target information;
when the information of the access webpage comprises a character string of the category of the blacklist, the terminal matches the information of the access webpage step by step against the child nodes that are in a parent-child relationship with the child node of the character strings belonging to the category of the blacklist, until matching of the information of the access webpage is determined to be complete, and the terminal intercepts the target information in the information of the access webpage.
9. The terminal according to claim 8, wherein the positioning and preset matching rule includes a positioning matching category and a preset matching category, a 2 nd level child node in the m level child nodes is divided according to the positioning and preset matching rule, a character string belonging to the positioning matching category and a character string belonging to the preset matching category in the first data respectively correspond to one child node in the 2 nd level child nodes, and any one child node in the 2 nd level child nodes is in a parent-child relationship with a child node of a character string belonging to the blacklist category in the 1 st level child nodes.
10. The terminal of claim 9, wherein the category of location matching is used for screening at least one of information of character strings existing at a first preset position or information of separators existing at a second preset position in the information of the access web page;
the preset matching category is used for screening at least one of prefix information or suffix information in the information of the access webpage.
11. The terminal of claim 9, wherein the tag attribute rule includes a category with a tag and a category without a tag, a 3 rd level child node of the m level child nodes is divided according to the tag attribute rule, and a string belonging to the category with a tag and a string belonging to the category without a tag in the first data correspond to one child node of the 3 rd level child nodes, respectively, wherein any one child node of the 3 rd level child nodes is in a parent-child relationship with one child node of the 2 nd level child nodes.
12. The terminal of claim 11, wherein the category with the tag is used for screening information, in the information of the access webpage, that includes tag attributes, and the category without the tag is used for screening information, in the information of the access webpage, that does not include tag attributes; wherein
the category with the tag specifically comprises: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator URL information of host and advertisement, or a class of different URL information of only domain name and advertisement.
13. The terminal of claim 11, wherein the character rule includes a category of a first character string and a category of a preset character string, a 4 th level child node of the m level child nodes is divided according to the character rule, and a character string belonging to the category of the first character string and a character string belonging to the category of the preset character string in the first data correspond to one child node of the 4 th level child nodes respectively, wherein any one child node of the 4 th level child nodes is in a parent-child relationship with one child node of the 3 rd level child nodes.
14. The terminal according to claim 13, wherein the category of the first character string is used for screening information of the access webpage whose first character is the same as the first character of a character string of the category of the first character string;
the category of the preset character string is used for screening information of the access webpage that is the same as the preset character string of a character string of the category of the preset character string.
15. The terminal according to any of the claims 1-14, wherein the information for accessing the web page comprises: the user accesses the URL of the webpage or the URL of each element of the accessed webpage, and the target information is advertisement information.
16. The terminal of claim 1, wherein the first data is obtained after the server performs tree transformation processing according to second data, and the second data includes a valid string and a custom string of the browser, where the valid string is a string with a usage rate greater than a preset threshold value determined by screening an open source string in an open source website and historical data reported by the terminal in a preset time period.
17. A server for data processing, comprising: one or more processors, a transceiver, a memory, a plurality of application programs, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the server, cause the server to perform the steps of:
performing tree transformation processing on the second data to determine first data; the first data comprises a plurality of character strings and a plurality of categories, the plurality of character strings are divided into a plurality of parts based on the plurality of categories, the plurality of parts are respectively used as the tail ends of the tree structure, and the plurality of categories are used as branches of the tree structure leading to the tail ends of the tree structure;
and the server sends the first data to the terminal so that the terminal can determine whether the access webpage contains target information according to the first data.
18. The server according to claim 17, wherein the target information is advertisement information;
the information of the access webpage comprises: at least one of the URL of the access page or the URL of each element of the access page.
19. The server according to claim 17, wherein the server performs the steps of:
periodically acquiring at least one open-source character string from an open-source website;
selecting, from the at least one open-source character string and the historical data reported by the client within a preset time period, a plurality of character strings whose access count is greater than a first threshold as valid character strings;
acquiring a custom character string of a browser server;
and determining the second data according to the valid character strings and the custom character string, wherein the valid character strings and the custom character string each comprise at least one character string.
20. The server according to claim 17 or 19, wherein the server performs the steps of:
dividing the second data into m levels according to n preset rules, wherein the preset rules of each level in the m levels of child nodes are different;
each of the n preset rules respectively comprises at least two categories of character strings, and each layer in the m levels is divided into at least two child nodes according to the categories of the character strings;
the second data comprises a plurality of character strings, each child node comprises a plurality of character strings belonging to different character string categories, n and m are integers which are larger than or equal to 1, and n is larger than or equal to m;
each child node in the kth level of child nodes has a father-son relationship with one child node in the k-1 level, the k level of child nodes are any level of child nodes in the m level of child nodes, and k is an integer greater than or equal to 1.
21. The server of claim 20, wherein the n preset rules include at least one of the following: a black-and-white list rule, a positioning and preset matching rule, a tag attribute rule, or a character rule; the server performs the steps of:
dividing the plurality of child nodes into m levels of child nodes according to the black-and-white list rule, the positioning and preset matching rule, the tag attribute rule and the character rule.
22. The server according to claim 21, wherein the server performs the steps of:
when the black-and-white list rule includes a category of a white list and a category of a black list, dividing the 1st level of the m level sub-nodes into two sub-nodes according to the category of the white list and the category of the black list, wherein one sub-node of the two sub-nodes includes a character string belonging to the category of the white list in the second data, and the other sub-node includes a character string belonging to the category of the black list in the second data.
23. The server according to claim 22, wherein the server performs the steps of:
when the locating and preset matching rule comprises a locating matching category and a preset matching category, dividing the 2nd level of the m-level sub-nodes into two sub-nodes according to the locating matching category and the preset matching category, wherein one sub-node of the two sub-nodes comprises a character string belonging to the locating matching category in the second data, and the other sub-node comprises a character string belonging to the preset matching category in the second data, and the two sub-nodes of the 2nd level and the node of the character string belonging to the blacklist category in the 1st level are in a parent-child relationship.
24. The server according to claim 23, wherein the server performs the steps of:
when the tag attribute rule includes a category with a tag and a category without a tag, dividing the 3rd level of the m-level child nodes into two child nodes according to the category with the tag and the category without the tag, wherein one child node of the two child nodes includes a character string belonging to the category with the tag in the second data, and the other child node includes a character string belonging to the category without the tag in the second data, and any child node of the 3rd level and one child node of the 2nd level child nodes are in a parent-child relationship.
25. The server according to claim 24, wherein the category with the tag specifically includes: at least one of a class of host name only, a class of host information only with advertisement attributes, a class of two-level classification of host and domain name, a class of uniform resource locator (URL) information of host and advertisement, or a class of different URL information of only domain name and advertisement.
26. The server according to claim 24 or 25, characterized in that the server performs the following steps:
when the character rule includes the category of the first character string and the category of the preset character string, dividing the 4th level of the m level sub-nodes into two sub-nodes according to the category of the first character string and the category of the preset character string, wherein one sub-node of the two sub-nodes includes the character string belonging to the category of the first character string in the second data, and the other sub-node includes the character string belonging to the category of the preset character string in the second data, and any sub-node of the 4th level and one sub-node of the 3rd level sub-node are in a parent-child relationship.
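The level-by-level division described in claims 20 to 26 can be illustrated with a small sketch. It assumes that every character string already carries a category label for each rule that applies to it; the record layout, the rule keys (`list`, `match`, `tag`, `char`) and the label values are hypothetical names chosen for the example, and the claims do not specify how a label is assigned. Following claim 23, only the blacklist node is subdivided further, so whitelist records here carry no deeper labels and end up as a leaf at level 1.

```python
from typing import Any, Dict, List, Tuple

# Rule order follows claims 22 to 26; the keys are hypothetical names.
RULES: Tuple[str, ...] = ("list", "match", "tag", "char")


def build_tree(records: List[Dict[str, str]], rules: Tuple[str, ...] = RULES) -> Any:
    """Partition records one rule per level; leaves are lists of strings."""
    # Stop when no rule is left or when a record has no label for the next
    # rule (for example, whitelist entries that are not subdivided further).
    if not rules or any(rules[0] not in record for record in records):
        return [record["string"] for record in records]
    rule, remaining = rules[0], rules[1:]
    groups: Dict[str, List[Dict[str, str]]] = {}
    for record in records:
        # Each category of the current rule becomes one child node.
        groups.setdefault(record[rule], []).append(record)
    return {category: build_tree(group, remaining) for category, group in groups.items()}


records = [
    {"string": "news.example.com", "list": "white"},
    {"string": "ads.example.com", "list": "black", "match": "locating",
     "tag": "tagged", "char": "first"},
    {"string": "/banner/", "list": "black", "match": "locating",
     "tag": "untagged", "char": "preset"},
    {"string": "popup", "list": "black", "match": "preset",
     "tag": "untagged", "char": "preset"},
]
tree = build_tree(records)
print(tree["black"]["locating"]["tagged"]["first"])
# ['ads.example.com']
```

The categories become the branches and the character strings sit at the ends of the branches, which matches the tree shape described for the first data in claims 27 and 28.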
27. A computer readable storage medium comprising instructions that when run on a computer cause the computer to perform the steps of:
starting a browser to access a webpage;
acquiring the information of the access webpage;
matching the information of the access webpage with first data arranged in a tree structure, wherein the first data is used for determining whether the information of the access webpage comprises target information or not; the first data comprises a plurality of character strings and a plurality of categories, the plurality of character strings are divided into a plurality of parts based on the plurality of categories, the plurality of parts are respectively used as the tail ends of the tree structure, and the plurality of categories are used as branches of the tree structure to the tail ends of the tree structure;
and intercepting the target information when the information of the access webpage comprises the target information.
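The matching and interception steps of claim 27 can be sketched on the terminal side as below. This is a deliberately simplified assumption-laden example: the two-level `TREE`, its category names (`white`, `black`, `host`, `path`), and the functions `contains_target` and `intercept` are hypothetical, and real first data would have more levels. The point it illustrates is that each resource URL is compared only with the strings on the branch that applies to it, rather than with every string in a flat list.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

# Hypothetical first data: categories are branches, string lists are leaves.
TREE: Dict[str, Any] = {
    "white": ["news.example.com"],
    "black": {
        "host": ["ads.example.com"],
        "path": ["/banner/", "/popup/"],
    },
}


def contains_target(url: str, tree: Dict[str, Any]) -> bool:
    """Return True when the URL matches a blacklist leaf and no whitelist leaf."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Whitelist branch first: a whitelisted host is never treated as a target.
    if host in tree["white"]:
        return False
    black = tree["black"]
    # Host branch: exact host comparison only.
    if host in black["host"]:
        return True
    # Path branch: substring comparison against the path fragments only.
    return any(fragment in parts.path for fragment in black["path"])


def intercept(resource_urls: List[str], tree: Dict[str, Any]) -> List[str]:
    """Drop the resources identified as target information, keep the rest."""
    return [url for url in resource_urls if not contains_target(url, tree)]


urls = [
    "https://news.example.com/banner/top.png",
    "https://ads.example.com/x.js",
    "https://cdn.example.org/banner/a.gif",
    "https://cdn.example.org/app.js",
]
print(intercept(urls, TREE))
# ['https://news.example.com/banner/top.png', 'https://cdn.example.org/app.js']
```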
28. A computer readable storage medium comprising instructions that when run on a computer cause the computer to perform the steps of:
performing tree transformation processing on the second data to determine first data; the first data comprises a plurality of character strings and a plurality of categories, the plurality of character strings are divided into a plurality of parts based on the plurality of categories, the plurality of parts are respectively used as the tail ends of the tree shape, and the plurality of categories are used as branches of the tree shape to be led to the tail ends of the tree shape;
and sending the first data to a terminal so that the terminal can determine whether the access webpage contains target information according to the first data.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811132493.9A CN110955855B (en) | 2018-09-27 | 2018-09-27 | Information interception method, device and terminal |
| PCT/CN2019/106728 WO2020063448A1 (en) | 2018-09-27 | 2019-09-19 | Information blocking method, device and terminal |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811132493.9A CN110955855B (en) | 2018-09-27 | 2018-09-27 | Information interception method, device and terminal |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110955855A (en) | 2020-04-03 |
| CN110955855B (en) | 2023-06-02 |
Family
ID=69951180
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811132493.9A Active CN110955855B (en) | 2018-09-27 | 2018-09-27 | Information interception method, device and terminal |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN110955855B (en) |
| WO (1) | WO2020063448A1 (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112073374B (en) * | 2020-08-05 | 2023-03-24 | 长沙市到家悠享网络科技有限公司 | Information interception method, device and equipment |
| CN113641911B (en) * | 2021-08-19 | 2024-03-08 | 郑州阿帕斯数云信息科技有限公司 | Advertisement interception rule base establishing method, device, equipment and storage medium |
| CN114036434A (en) * | 2021-10-14 | 2022-02-11 | 深圳市世强元件网络有限公司 | Page access amount statistical method and system |
| CN117093777B (en) * | 2023-08-22 | 2024-10-29 | 北京领雁科技股份有限公司 | Method and device for intercepting browser page, electronic equipment and storage medium |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102332028A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | A Webpage-Oriented Bad Web Content Recognition Method |
| JP2015118466A (en) * | 2013-12-17 | 2015-06-25 | ケーディーアイコンズ株式会社 | Information processing apparatus and program |
| CN106033450A (en) * | 2015-03-17 | 2016-10-19 | 中兴通讯股份有限公司 | Method and device for blocking advertisement, and browser |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9215212B2 (en) * | 2009-06-22 | 2015-12-15 | Citrix Systems, Inc. | Systems and methods for providing a visualizer for rules of an application firewall |
| CN105100904A (en) * | 2014-05-09 | 2015-11-25 | 深圳市快播科技有限公司 | Video advertisement blocking method, device and browser |
| US9578042B2 (en) * | 2015-03-06 | 2017-02-21 | International Business Machines Corporation | Identifying malicious web infrastructures |
| US20160335298A1 (en) * | 2015-05-12 | 2016-11-17 | Extreme Networks, Inc. | Methods, systems, and non-transitory computer readable media for generating a tree structure with nodal comparison fields and cut values for rapid tree traversal and reduced numbers of full comparisons at leaf nodes |
| CN105824972A (en) * | 2016-04-15 | 2016-08-03 | 广东欧珀移动通信有限公司 | Network advertisement blocking method and device |
| CN107193889A (en) * | 2017-05-02 | 2017-09-22 | 努比亚技术有限公司 | Ad blocking method, terminal and computer-readable recording medium |
| CN107437026B (en) * | 2017-07-13 | 2020-12-08 | 西北大学 | A malicious webpage advertisement detection method based on advertisement network topology |
| CN108170810A (en) * | 2017-12-29 | 2018-06-15 | 南京邮电大学 | A kind of commercial detection method based on dynamic behaviour |
- 2018-09-27: CN application CN201811132493.9A, patent CN110955855B (en), status: Active
- 2019-09-19: WO application PCT/CN2019/106728, publication WO2020063448A1 (en), status: Ceased
Non-Patent Citations (1)
| Title |
|---|
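| 信学峰 (Xin Xuefeng). Research on detection and interception technology based on rogue software. China Master's Theses Full-text Database, Information Science and Technology Series. 2015, Vol. 2015 (No. 03), I138-233. * |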
Also Published As
| Publication number | Publication date |
|---|---|
| CN110955855A (en) | 2020-04-03 |
| WO2020063448A1 (en) | 2020-04-02 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN102722563B (en) | Method and device for displaying page | |
| US9928301B2 (en) | Classifying uniform resource locators | |
| US8903800B2 (en) | System and method for indexing food providers and use of the index in search engines | |
| US10250526B2 (en) | Method and apparatus for increasing subresource loading speed | |
| US8478701B2 (en) | Locating a user based on aggregated tweet content associated with a location | |
| CN110955855B (en) | Information interception method, device and terminal | |
| CN104715064B (en) | Method and server for marking keywords on a webpage | |
| US20150244670A1 (en) | Browser and method for domain name resolution by the same | |
| US20130219255A1 (en) | Authorized Syndicated Descriptions of Linked Web Content Displayed With Links in User-Generated Content | |
| US11423096B2 (en) | Method and apparatus for outputting information | |
| US20090043815A1 (en) | System and method for processing downloaded data | |
| US20130227394A1 (en) | Method, system and computer program product for replacing banners with widgets | |
| US10311120B2 (en) | Method and apparatus for identifying webpage type | |
| CN102708174A (en) | Method and device for displaying rich media information in browser | |
| CN102054003A (en) | Methods and systems for recommending network information and creating network resource index | |
| CN102750352A (en) | Method and device for classified collection of historical access records in browser | |
| US20130305131A1 (en) | Method, system and computer storage medium for pre-reading network data | |
| CN104090923B (en) | Method and device for displaying rich media information in a browser | |
| KR101816205B1 (en) | Server and computer readable recording medium for providing internet content | |
| CN104065736A (en) | URL redirection method, device, and system | |
| CN104573120A (en) | Recommendation information obtaining method and device for terminal | |
| Gali et al. | Extracting representative image from web page | |
| CN107153674B (en) | A method and system for displaying live room information | |
| CN104750752B (en) | Method and apparatus for determining a user group with internet-surfing preferences | |
| CN108108381A (en) | Page monitoring method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20220507. Address after: 523799 Room 101, building 4, No. 15, Huanhu Road, Songshanhu Park, Dongguan City, Guangdong Province. Applicant after: Petal cloud Technology Co.,Ltd. Address before: 523808 Southern Factory Building (Phase I) Project B2 Production Plant-5, New Town Avenue, Songshan Lake High-tech Industrial Development Zone, Dongguan City, Guangdong Province. Applicant before: HUAWEI DEVICE Co.,Ltd. |
| | GR01 | Patent grant | |