CN103914478B - Webpage training method and system, webpage prediction method and system

Info

Publication number: CN103914478B
Application number: CN201310003765.6A
Authority: CN
Inventors: 陈俊波; 薛贵荣; 李玉龙; 严孝伟; 李华康; 韩定; 韩定一
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-01-06
Filing date: 2013-01-06
Publication date: 2018-05-08
Anticipated expiration: 2033-01-06
Also published as: CN103914478A

Abstract

This application relates to a webpage training method and system and a webpage prediction method and system, comprising: obtaining a prior probability table of classification keywords according to existing data associated with the classification keywords; preprocessing a webpage to be trained to obtain the webpage text to be trained; extracting features from the webpage text to be trained according to the prior probability table to obtain an incidence relation feature vector representation F1 between the webpage to be trained and a specified category; and performing model training on the incidence relation feature vector representation F1 to obtain the classification result of the webpage to be trained. The application can handle highly heterogeneous category systems and can handle a very large category system using relatively little training data. In addition, by collecting users' browsing/searching behavior across the whole web, rather than only behavior within a single website, it largely solves the data sparsity problem.

Description

Webpage training method and system and webpage prediction method and system

Technical Field

The application relates to the field of internet, in particular to classification and prediction of internet access behaviors of a user.

Background

With the continuous popularization of computer technology, modern society has relied heavily on the convenience brought by information technology. With the increasing efficiency, security and reliability of computer and network technologies, more and more wholesalers, retailers and consumers choose to trade goods over the internet. Specific web sites are becoming the most commercially valuable service providers on the internet.

The user can browse, search, compare, buy, pay, and evaluate on a specific website to purchase the goods according to his commercial intention.

The size and number of specific websites have grown as well: for example, Taobao, Tmall, Jingdong, Amazon, Dangdang, and a large number of small and medium-sized specific websites.

The result of commercial intention analysis enables a specific website to provide personalized advertisements and recommendations for each user, and to personalize the website entrance. By providing products and services that meet the commercial intention of the user, the transaction volume and the transaction amount can be greatly increased. Therefore, the analysis of the user's commercial intention is very important for a specific website.

For a particular web site, there are difficulties in analyzing the commercial intention of its user:

(1) The number of a user's actions at a particular website is very small relative to the user's overall internet access behavior. This data sparsity may make it impossible to fully judge the user's commercial intention; moreover, with so few data samples, the confidence intervals of the estimated probability distribution parameters become too wide for accurate prediction.

(2) The commodity category systems of specific websites have very strong heterogeneity, that is, the category systems of any two specific websites are very different. If a classical webpage classification method is used, for any specific website, large-scale labeled data (the number of the labeled data is in direct proportion to the number of categories) needs to be collected, and features in different forms are extracted according to different category systems, and finally a machine learning classification model is obtained through learning. This entails expensive costs.

(3) The commodity category system of a particular website tends to be very large. In one example, the commodity category system is defined by a 5-level category tree in which the number of leaf nodes reaches about 20,000. An overly large category system leads to three problems. First, the amount of training data to be labeled is very large. Second, the accuracy of category prediction is greatly reduced. Third, each webpage must be given a multi-category label prediction, and multi-category labels in turn further reduce accuracy.

In view of the foregoing, there is a need for a solution for analyzing the commercial intention of a user using user behavior that overcomes the three problems described above.

Disclosure of Invention

The main purpose of the present application is to provide a web page training scheme and a web page prediction scheme, so as to solve the problems of website heterogeneity, large category system and data sparsity, which are difficult to overcome in the prior art.

In order to solve the technical problem, the purpose of the present application is achieved by the following technical solutions:

the application discloses a webpage training method, which is used for classifying webpages and comprises the following steps: obtaining a prior probability table of the classified keywords according to existing data associated with the classified keywords; preprocessing a webpage to be trained to obtain a webpage text to be trained; extracting features from the webpage text to be trained according to the prior probability table to obtain an incidence relation feature vector representation F1 between the webpage to be trained and the specified category; and performing model training on the incidence relation feature vector representation F1 to obtain a classification result of the webpage to be trained.

Preferably, in the method according to the present application, the existing data associated with the classification keyword includes a website category system and a user click log.

Preferably, in the method according to the present application, the step of obtaining the prior probability table of the classification keyword according to the existing data associated with the classification keyword includes: obtaining search keywords from the user click logs, and filtering to obtain classified keywords; calculating the occurrence frequency of each classification keyword in different classifications; and traversing each non-leaf node category in the website category system by adopting a bottom-up method, and calculating the information gain and chi-square test of each classified keyword.

Preferably, in the method according to the present application, the preprocessing of the webpage to be trained includes one or more of the following processes: webpage parsing; text extraction; word segmentation/stop-word filtering.

Preferably, in the method according to the application, the step of extracting features from the text of the web page to be trained according to the prior probability table to obtain the association relationship feature vector representation F1 between the web page to be trained and the specified category includes: calculating the importance scores of the classified keywords in the web pages to be trained; discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; obtaining a two-dimensional feature matrix according to the discretization importance of the classified keywords and the discretization prior probability of the classified keywords; and respectively calculating the two-dimensional feature matrix in different parts of the web page text to be trained, and sequentially arranging all matrix elements to obtain a feature vector with a fixed length.

Preferably, in the method according to the present application, the different parts of the text of the web page to be trained include a web page title, a web page body, and a web page hyperlink text.

Preferably, in the method according to the present application, the step of performing model training on the relevance feature vector representation F1 to obtain the classification result of the web page to be trained includes: performing regression model training on the labeled data according to a preset regression model algorithm; and calculating corresponding ROC curves aiming at the trained regression models, and setting different segmentation threshold values based on the ROC curves to obtain a classification result.

The application also discloses a webpage training system for classifying webpages, comprising: means for obtaining a prior probability table of classification keywords according to existing data associated with the classification keywords; means for preprocessing a webpage to be trained to obtain the webpage text to be trained; means for extracting features from the webpage text to be trained according to the prior probability table to obtain an incidence relation feature vector representation F1 between the webpage to be trained and the specified category; and means for performing model training on the incidence relation feature vector representation F1 to obtain a classification result of the webpage to be trained.

Preferably, in the web page training system, the existing data associated with the classification keyword includes a website category system and a user click log.

Preferably, in the web page training system, the means for obtaining the prior probability table of the classification keywords according to the existing data associated with the classification keywords includes: means for obtaining search keywords from the user click logs and filtering them to obtain classification keywords; means for calculating the frequency of occurrence of each classification keyword in the different categories; and means for traversing each non-leaf node category in the website category system bottom-up and calculating the information gain and chi-square test of each classification keyword.

Preferably, in the web page training system, the device for preprocessing the web page to be trained includes: a web page parsing device; text extraction means; a word segmentation/stop word filter.

Preferably, in the web page training system, the means for extracting features from the text of the web page to be trained according to the prior probability table to obtain an association relationship feature vector representation F1 between the web page to be trained and the specified category includes: means for calculating an importance score of the classified keyword in the web page to be trained; means for discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; the device is used for obtaining a two-dimensional characteristic matrix according to the discretization importance of the classified keywords and the discretization prior probability of the classified keywords; and the device is used for calculating the two-dimensional characteristic matrix in different parts of the web page text to be trained respectively and sequentially arranging all matrix elements to obtain a characteristic vector with a fixed length.

Preferably, in the web page training system, different parts of the text of the web page to be trained include a web page title, a web page body and a web page hyperlink text.

Preferably, in the web page training system, the means for performing model training on the association feature vector representation F1 to obtain the classification result of the web page to be trained includes: the regression modeling device is used for carrying out regression model training on the labeled data according to a preset regression model algorithm; and the ROC analysis device is used for calculating corresponding ROC curves aiming at the trained regression model and setting different segmentation threshold values based on the ROC curves so as to obtain a classification result.

The application discloses a webpage prediction method, which is used for predicting categories aiming at user intentions by utilizing classification results obtained by training of a webpage training method and comprises the following steps: collecting user access logs; obtaining html source codes of the webpage from the user access log; preprocessing the html source code of the webpage to obtain a webpage text; extracting features from the webpage text according to the prior probability table to obtain an incidence relation feature vector representation F2 between the webpage and the specified category; and obtaining a user intention category prediction result according to the classification result and the incidence relation feature vector representation F2.

Preferably, in the method according to the present application, the step of obtaining html source code of the webpage from the user access log includes: subjecting the user access log to url screening; and crawling the filtered url to obtain html source codes of the webpage.

Preferably, in the method according to the present application, the step of url screening the user access log includes: computing a deduplicated url set from the user access log; calculating a webpage importance score for each url; and sorting all deduplicated urls by webpage importance score and taking the top several urls as the set of urls to be crawled.

Preferably, in the method according to the present application, the step of preprocessing the html source code of the web page includes: webpage parsing; text extraction; word segmentation/stop-word filtering.

Preferably, in the method according to the present application, the step of extracting features from the web page text according to the prior probability table to obtain an association feature vector representation F2 between the web page and the specified category includes: calculating the importance scores of the classified keywords in the web pages; discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; obtaining a two-dimensional feature matrix according to the discretization importance of the classified keywords and the discretization prior probability of the classified keywords; and respectively calculating the two-dimensional feature matrix in different parts of the webpage text, and sequentially arranging all matrix elements to obtain a feature vector with a fixed length.

Preferably, in the method according to the present application, the step of obtaining the user intention category prediction result according to the classification result and the incidence relation feature vector representation F2 includes: obtaining a webpage-level analysis result according to the classification result and the incidence relation feature vector representation F2; and joining the webpage-level analysis result with the user access log to obtain the intention indicated by each webpage access behavior of the user, and performing an aggregation operation along the user dimension to obtain the user intention category prediction result.

The application also discloses a webpage prediction system for predicting categories of user intention by using the classification result obtained through training by the webpage training system, comprising: means for collecting a user access log; means for obtaining html source code of a web page from the user access log; means for preprocessing the html source code of the web page to obtain web page text; means for extracting features from the web page text according to a prior probability table to obtain an incidence relation feature vector representation F2 between the web page and the specified category; and means for obtaining a user intention category prediction result according to the classification result and the incidence relation feature vector representation F2.

Preferably, in the system, the means for obtaining html source code of the web page from the user access log includes: means for url screening the user access log; and means for crawling the screened urls to obtain the html source code of the web page.

Preferably, in the system, the means for url screening of the user access log includes: means for computing a deduplicated url set for the user access log; means for calculating a web page importance score for each url; and the device is used for sequencing all the url after the duplication is removed according to the webpage importance scores, and taking a plurality of first urls as the url set to be crawled.

Preferably, in the system, the means for preprocessing the html source code of the web page includes: a web page parsing device; text extraction means; a word segmentation/stop word filter.

Preferably, in the system, the means for extracting features from the web page text according to the prior probability table to obtain the association relationship feature vector representation F2 between the web page and the specified category includes: means for calculating an importance score of the classified keyword in the web page; means for discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; the device is used for obtaining a two-dimensional characteristic matrix according to the discretization importance of the classified keywords and the discretization prior probability of the classified keywords; and the device is used for calculating the two-dimensional characteristic matrix in different parts of the webpage text respectively and sequentially arranging all matrix elements to obtain a characteristic vector with a fixed length.

Preferably, in the system, the means for obtaining the prediction result of the user intention category according to the classification result and the incidence relation feature vector representation F2 includes: a device for obtaining the analysis result of the webpage level according to the classification result and the incidence relation feature vector representation F2; and the device is used for connecting the webpage level analysis result with the user access log to obtain the intention indicated by each webpage access behavior of the user, and performing aggregation operation in user dimensions to obtain a user intention category prediction result.

Compared with the prior art, the technical solution of the present application can handle highly heterogeneous category systems, can handle a large-scale category system using relatively little training data, and, by collecting users' browsing/searching behavior across the whole web rather than behavior on a single website, solves the data sparsity problem to a great extent.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a flow chart of a web page training method according to an embodiment of the present application;

FIG. 2 is a ROC curve of an embodiment of the present application;

FIG. 3 is a block diagram of a web page training system according to an embodiment of the present application;

FIG. 4 is a flowchart of a web page prediction method according to an embodiment of the present application;

FIG. 5 is a block diagram of a web page prediction system of an embodiment of the present application;

FIG. 6 is an architecture diagram of an exemplary system for efficiently identifying a user's commercial intention in accordance with an embodiment of the present application.

Detailed Description

The main idea of the application is that, by uniformly processing users' browsing/searching behavior on the Internet, together with a universal data interface and a classification algorithm capable of automatic adaptation, the three major problems of data sparsity, category system heterogeneity, and an excessively large number of categories can be addressed, and services can be provided to numerous websites through a unified process.

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings and specific embodiments.

The user's intent in browsing and searching activities on the internet may be commercial or non-commercial in nature, wherein the commercial intent may be further divided according to the specific commodity category system of a particular web site.

The identification of a user's online commercial intention is a very important function for a particular web site. Through the recommendation and advertisement mode, it can directly promote the volume and amount of transaction of specific web site.

On the other hand, the access behavior inside any specific website only occupies a small part of the browsing/searching behavior of the users on the whole network, and a large amount of browsing/searching behavior occurs outside the specific website. Identifying the commercial intention of a user by utilizing the behavior of the user on the whole network brings great value to a specific website.

Any specific website has a commodity category system. For that website, the user commercial intention to be analyzed should be consistent with its commodity category system; otherwise, consistent recommendations and advertisements cannot be given. It is particularly noted that the commodity category system may be very large. In the example given, the commodity category system is a category tree of up to 5 levels, with the number of leaf nodes reaching 20,000.

The application is a platform system for commercial intention analysis, which collects users' browsing/searching behavior on the internet and opens a data interface to specific websites. Through the data interface, a specific website uploads its commodity category system, user search click logs, and training data samples. The platform records users' behavior across the whole web, combines it with the information specific to a website, extracts features related to commercial intention, and uses a machine learning classifier built on those features to automatically detect the user's Online Commercial Intention (OCI), expressed in the terms of any given specific website.

According to an embodiment of the application, a web page training method is provided for classifying web pages.

Referring to FIG. 1, FIG. 1 is a flowchart of a web page training method 100 according to an embodiment of the present application. The training process aims to train a commercial intention classification result for a website from the category system data, user search click data, and labeled data that the website uploads through the open data interface.

In step S102, a prior probability table of the classification keywords is obtained from the existing data associated with the classification keywords. The existing data associated with the classification keywords may include a website category system and a user click log. This step is in effect a log processing flow that obtains the prior probability table from the user click log and the website category system.

For example, a category system for a specific website, user search click logs, and labeled training data are first obtained through a "specific website open interface" (e.g., an e-commerce website open interface). The category system and the user search click logs are associated with the classification keywords and are used to obtain the prior probability table of the classification keywords. The labeled training data will be used in a later step for model training.

Category system: for example, for a tree-structured category system, any structured description language that can express a tree structure can be used. For example, in one application instance, the category system is described using Extensible Markup Language (XML):

User click log: for example, a table structure in which each row contains the following fields: the keyword searched by the user, and the category corresponding to the commodity clicked in the search results.

Training data: for example, more than a certain number of web pages may be provided that are labeled corresponding to a certain number of matching categories and non-matching categories in the category hierarchy. Each row of training data contains the following fields: url, html source code, category id, label result. For example:

url	html source code	category id	label result
http://www.star.com.cn/mama/a10102/	html source code	4.3	match
http://www.star.com.cn/mama/a10102/	html source code	3.1	no match
http://www.star.com.cn/mama/a10102/	html source code	4.1	partial match

Wherein the annotation result "match" identifies a completely matched category, "no match" identifies a completely unrelated category, and "partial match" identifies a partially related category.

Step S102 may include the steps of: obtaining search keywords from a user click log, and filtering to obtain classified keywords; calculating the occurrence frequency of each classification keyword in different classifications; and traversing each non-leaf node category in the website category system by adopting a bottom-up method, and calculating the information gain and chi-square test of each classified keyword.

For example, the data input by the user is preprocessed: traditional-to-simplified Chinese conversion, case conversion, transcoding to UTF-8, and stop-word filtering.

The length of search keywords is then limited: any keyword whose length is <= 4 bytes is filtered out unless it belongs to a pre-collected entity word list (consisting of brand words, product attribute descriptors, and the like). (Note: in UTF-8 encoding, >= 4 bytes corresponds to >= 2 Chinese characters or >= 4 English characters.)
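The filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the sample stop-word set, and the entity word list are hypothetical, and traditional-to-simplified conversion is omitted because it needs a conversion table that the text does not specify.

```python
def normalize(query):
    """Trim whitespace and lower-case the query (case conversion only;
    traditional-to-simplified conversion is deliberately left out)."""
    return query.strip().lower()


def keep_keyword(keyword, entity_words, max_filtered_bytes=4):
    """Keep entity words unconditionally; otherwise drop keywords whose
    UTF-8 length is <= max_filtered_bytes, as the text describes."""
    if keyword in entity_words:
        return True
    return len(keyword.encode("utf-8")) > max_filtered_bytes


def filter_keywords(queries, entity_words, stop_words):
    """Return the normalized search keywords that survive both filters."""
    kept = []
    for query in queries:
        keyword = normalize(query)
        if keyword and keyword not in stop_words and keep_keyword(keyword, entity_words):
            kept.append(keyword)
    return kept
```

With the hypothetical entity list `{"tv"}`, the two-byte keyword "tv" survives, while a four-byte keyword outside the list is dropped.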

The occurrence frequency of each keyword in the different categories is then calculated, and the results are aggregated with the keyword as the primary key, yielding a data format with the keyword as the primary key and the category distribution as its attributes.
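A minimal sketch of this aggregation, assuming the click log is available as (keyword, category id) pairs. The function names are hypothetical, and the normalization to P(category | keyword) is only one plausible reading of how frequencies become a prior probability table.

```python
from collections import Counter, defaultdict


def aggregate_clicks(click_log):
    """Group click-log rows by keyword: keyword -> Counter of category ids."""
    table = defaultdict(Counter)
    for keyword, category in click_log:
        table[keyword][category] += 1
    return table


def to_prior_table(table):
    """Assumed normalization: P(category | keyword) = count / keyword total."""
    priors = {}
    for keyword, counts in table.items():
        total = sum(counts.values())
        priors[keyword] = {cat: n / total for cat, n in counts.items()}
    return priors
```

For example, three clicks for one keyword split 2:1 across categories "4.3" and "4.1" give prior values of 2/3 and 1/3 under this assumption.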

Each non-leaf node (that is, category) in the category structure is traversed bottom-up, and the information gain and chi-square test of each keyword are calculated. For example, the Information Gain dimension-reduction value of each keyword is calculated; if the value is smaller than a certain threshold, all frequencies of the keyword in the sub-categories are summed into the current category, and the frequencies of the keyword in all the sub-categories are cleared. If the current node is the root node, the keyword is deleted.
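The bottom-up traversal can be sketched as below. The data layout (a dict from category id to child ids, and a dict from keyword to per-category frequencies) and all names are assumptions for illustration. The scoring function is passed in as a parameter, and `max_share` is only a toy stand-in for the information gain or chi-square value, not the measure used by the application.

```python
def max_share(k_counts, other_counts):
    """Toy stand-in score: share of the keyword's mass in its largest sub-category."""
    total = sum(k_counts)
    return max(k_counts) / total if total else 0.0


def roll_up(tree, root, freq, score, threshold):
    """Bottom-up pass over non-leaf categories; modifies and returns freq.

    tree : dict category id -> list of child category ids
    freq : dict keyword -> dict category id -> frequency
    score: callable(k_counts, other_counts) -> float, one count per child
    """
    def subtree(kw, node):
        # frequency of kw at node plus everything below it
        return freq[kw].get(node, 0) + sum(subtree(kw, c) for c in tree.get(node, []))

    def clear_below(kw, node):
        for child in tree.get(node, []):
            freq[kw].pop(child, None)
            clear_below(kw, child)

    def visit(node):
        children = tree.get(node, [])
        if not children:
            return  # leaf category: nothing to merge
        for child in children:
            visit(child)  # children first, i.e. bottom-up
        totals = {kw: [subtree(kw, c) for c in children] for kw in freq}
        column = [sum(t[i] for t in totals.values()) for i in range(len(children))]
        for kw, k_counts in totals.items():
            others = [column[i] - k_counts[i] for i in range(len(children))]
            if score(k_counts, others) >= threshold:
                continue  # keyword discriminates between the sub-categories
            if node == root:
                del freq[kw]  # uninformative even at the top level
            elif sum(k_counts):
                clear_below(kw, node)
                freq[kw][node] = freq[kw].get(node, 0) + sum(k_counts)

    visit(root)
    return freq
```

A keyword spread evenly over two sibling leaves is merged into their parent, while a keyword spread evenly over the top-level categories is removed entirely.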

For example, if K is defined as a keyword and C as all categories under the current node, the information gain IG of K is calculated as follows:

$$IG(K) = H(C) - H(C \mid K)$$

where H is the information entropy. With $P(x_i)$ defined as the probability that the variable $X$ takes the value $x_i$, H is obtained by the following equation:

$$H(X) = -\sum_i P(x_i) \log P(x_i)$$
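As an illustration, IG(K) = H(C) - H(C|K) can be computed from a 2 x n frequency table whose first row holds the frequencies of K in each category and whose second row holds the frequencies of all other keywords. This table layout, the base-2 logarithm, and the function names are assumptions of the sketch; the text does not state how the probabilities are estimated.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(k_counts, other_counts):
    """IG(K) = H(C) - H(C|K) for a 2 x n keyword/category table.

    k_counts[i]     : frequency of keyword K in category C_i
    other_counts[i] : frequency of all other keywords in category C_i
    """
    total_k = sum(k_counts)
    total_other = sum(other_counts)
    total = total_k + total_other
    if total == 0:
        return 0.0
    h_c = entropy([a + b for a, b in zip(k_counts, other_counts)])
    h_c_given_k = ((total_k / total) * entropy(k_counts)
                   + (total_other / total) * entropy(other_counts))
    return h_c - h_c_given_k
```

A keyword that appears only in one of two equally frequent categories yields a gain of 1 bit, while a keyword distributed exactly like all other keywords yields a gain of 0.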

The chi-square (χ²) value is calculated for keywords meeting given conditions. If the dimension-reduction weight is smaller than a certain threshold, all frequencies of the keyword in the sub-categories are summed into the current category, and the frequencies of the keyword in all the sub-categories are cleared. If the current node is the root node, the keyword is deleted.

Define K as a keyword, K̄ as all other keywords, and {C_i} as the given category system.

The statistics are as follows (contingency table):

The chi-square value is calculated as follows:

where T_i,j is the theoretical value for the i-th row and j-th column:

Conditions for keywords participating in the calculation:

and

|{(i, j) | T_i,j = 0}| = 0
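A sketch of the chi-square computation on a 2 x n contingency table (row 1: keyword K, row 2: all other keywords; one column per category). Because the formulas themselves are not reproduced in this text, the sketch assumes the standard Pearson statistic, in which the theoretical value T_i,j is (row total x column total) / grand total and the statistic sums (observed - T_i,j)² / T_i,j over all cells. Only the condition that no theoretical value is zero is enforced; the other condition is not recoverable from the text and is omitted.

```python
def chi_square(k_counts, other_counts):
    """Pearson chi-square for a 2 x n keyword/category contingency table.

    Returns None when the keyword does not qualify, i.e. when any
    theoretical (expected) cell value is zero.
    """
    rows = [list(k_counts), list(other_counts)]
    grand_total = sum(sum(row) for row in rows)
    if grand_total == 0:
        return None
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    if any(t == 0 for row in expected for t in row):
        return None  # violates |{(i, j) | T_i,j = 0}| = 0
    return sum(
        (rows[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(len(rows))
        for j in range(len(col_totals))
    )
```

A keyword concentrated in one category scores high, and a keyword distributed like the rest of the vocabulary scores zero.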

In step S104, the web page to be trained is preprocessed to obtain the web page text to be trained. The purposes of the preprocessing are: 1) parsing the html source code into structured data; 2) filtering out non-text noise data; and 3) segmenting the text into words and filtering out stop words as noise data.

That is, step S104 may include one or more of the following processes: webpage parsing; text extraction; word segmentation/stop-word filtering.

In the webpage parsing step, the input is the html source code. The html source code is parsed to obtain the position information and semantic information of each element. For example, given html source code, it can be determined which text appears in a hyperlink, which hyperlink text appears in a header, which hyperlink text appears in a footer, and so on. The output of the webpage parsing flow is a Document Object Model (DOM) tree.

In the text extraction step, the input is the DOM tree. Based on the output of the webpage parsing, headers, footers, navigation, advertisements, and sidebars are filtered out to obtain the body text of the webpage. The output of the text extraction flow is a DOM tree with label information: each element in the tree carries a label indicating whether the element belongs to the body of the webpage.

In the word segmentation/stop-word filtering step, the input is the DOM tree with body-text labels. A preset Chinese word segmentation algorithm is adopted; many such algorithms exist, such as CDWS, n-gram, hidden Markov model, and CRF (conditional random field). In one application example, CRF may be used as the Chinese word segmentation algorithm. Function words occur with high frequency in text but carry almost no meaning for classification; they interfere with classification and make the text dimension too high, which affects classification efficiency. This part of the vocabulary is filtered out using a stop-word list. The output of the word segmentation/stop-word filtering flow is a DOM tree in which the text of each element has undergone word segmentation and stop-word filtering.
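The three preprocessing stages can be sketched with the Python standard library as follows. This is a simplified illustration under several assumptions: text is sorted into three flat regions (title, hyperlink text, body) instead of a labeled DOM tree; header, footer, navigation, and sidebar content is dropped by tag name only; and whitespace splitting stands in for a Chinese word segmenter such as CRF. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RegionTextParser(HTMLParser):
    """Collects text for the title, hyperlink anchors, and remaining body."""

    # crude stand-in for filtering scripts, headers, footers, navigation, sidebars
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.regions = {"title": [], "link": [], "body": []}
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # pop back to the matching open tag, tolerating unclosed tags
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in self.SKIP for t in self._stack):
            return
        if "title" in self._stack:
            self.regions["title"].append(text)
        elif "a" in self._stack:
            self.regions["link"].append(text)
        else:
            self.regions["body"].append(text)


def tokenize(text, stop_words):
    """Whitespace tokenization (a stand-in for word segmentation) plus stop-word filtering."""
    return [word for word in text.lower().split() if word not in stop_words]


def preprocess(html, stop_words):
    """Return region name -> list of tokens for one page."""
    parser = RegionTextParser()
    parser.feed(html)
    parser.close()
    return {name: tokenize(" ".join(chunks), stop_words)
            for name, chunks in parser.regions.items()}
```

Keeping the regions separate at this stage matters because the feature extraction that follows computes one matrix per region.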

In step S106, features are extracted from the web page text to be trained according to the prior probability table to obtain an incidence relation feature vector representation F1 between the web page to be trained and the specified category. This step is in effect a feature extraction flow: using the category prior probability table obtained by the log processing flow, the preprocessed web page (which has undergone structuring, word segmentation, noise filtering, and the like) is analyzed to obtain a feature vector.

Step S106 may include the steps of: calculating the importance scores of the classified keywords in the web page to be trained; discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; obtaining a two-dimensional feature matrix according to the discretized importance and the discretized prior probability of the classified keywords; and calculating the two-dimensional feature matrix separately in different parts of the web page text to be trained, and arranging all matrix elements in sequence to obtain a feature vector of fixed length. The different parts of the text of the web page to be trained may comprise the web page title, the web page body, and the web page hyperlink text.

For example, the importance score of a keyword in the web page is calculated by a regression model that jointly considers the keyword's TF/IDF information, position information, and font color information. The score lies in the range [0, 1].

The importance scores and the prior probabilities of the keywords are then discretized: thresholds are set so that the importance score of each keyword takes one of four possible values (high, relatively high, relatively low, low), and thresholds are likewise set so that the prior probability score of each keyword takes one of the same four possible values.

Then, from the importance of each keyword in the web page and the prior probability table obtained by the log processing flow, a two-dimensional feature matrix is calculated as follows (rows: importance; columns: prior probability):

Importance \ Prior probability	High	Relatively high	Relatively low	Low
High	n_1,1	n_1,2	n_1,3	n_1,4
Relatively high	n_2,1	n_2,2	n_2,3	n_2,4
Relatively low	n_3,1	n_3,2	n_3,3	n_3,4
Low	n_4,1	n_4,2	n_4,3	n_4,4

For example, n_2,3 indicates the number of keywords with relatively high importance and relatively low prior probability.

Finally, the feature matrix is calculated separately for the web page title, the web page body, and the web page hyperlink text. This yields three 4 × 4 matrices with 3 × 4 × 4 = 48 elements in total. These 48 elements are arranged in sequence to obtain a feature vector of fixed length, which is the output of the feature extraction process.
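The feature extraction described above can be sketched as follows. This is only an illustrative sketch: the threshold values, the part names ("title", "body", "anchor"), and all function names are assumptions that do not come from the application, and the importance scores are taken as given rather than produced by a regression model.

```python
from typing import Dict, List

LEVELS = 4  # index 0 = high, 1 = relatively high, 2 = relatively low, 3 = low


def discretize(score: float, thresholds=(0.75, 0.5, 0.25)) -> int:
    """Map a score in [0, 1] to a level index; the thresholds are assumed values."""
    for level, t in enumerate(thresholds):
        if score >= t:
            return level
    return len(thresholds)


def feature_matrix(importance: Dict[str, float], prior: Dict[str, float]) -> List[List[int]]:
    """Count keywords per (importance level, prior probability level) cell."""
    m = [[0] * LEVELS for _ in range(LEVELS)]
    for keyword, score in importance.items():
        if keyword in prior:  # only keywords present in the prior probability table count
            m[discretize(score)][discretize(prior[keyword])] += 1
    return m


def feature_vector(parts: Dict[str, Dict[str, float]], prior: Dict[str, float]) -> List[int]:
    """Concatenate the title, body and hyperlink matrices row by row (3 x 4 x 4 = 48)."""
    vec: List[int] = []
    for part in ("title", "body", "anchor"):
        for row in feature_matrix(parts.get(part, {}), prior):
            vec.extend(row)
    return vec
```

Because the vector only counts keywords per cell, its length stays at 48 regardless of how many keywords or categories are involved.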

In step S108, model training is performed on the incidence relation feature vector representation F1 to obtain a classification result of the web page to be trained.

Step S108 may include the steps of: regression modeling and ROC analysis. The regression modeling may include performing regression model training on the labeled data according to a preset regression model algorithm. The ROC analysis may include calculating the corresponding ROC curve for the trained regression model and setting different segmentation thresholds based on the ROC curve to obtain classification results. In the application scenario considered here (a large number of categories), combining the regression model with ROC analysis makes it possible to handle the multi-category prediction problem and improves both the precision and the recall of category prediction.

For example, model training is performed on the commercial intention according to a preset regression model algorithm by combining the labeled data uploaded by a specific website and the feature vector obtained by the feature extraction module. Regression algorithms are of many kinds, such as SVM (support vector machine), Logistic Regression, etc. In one example, an SVM regression algorithm is applied for training.

After fitting of the regression model is completed, the corresponding ROC (receiver operating characteristic) curve is calculated for the model on the test set. In one example, the resulting ROC curve is shown in fig. 2.

The ROC curve has FPR (False Positive Rate) on the horizontal axis and TPR (True Positive Rate) on the vertical axis. They are defined as follows:

FPR = FP / (FP + TN)

TPR = TP / (TP + FN)

wherein:

FP (False Positive): the number of data points in the test set whose real label is false but which the model considers true.

TP (True Positive): the number of data points in the test set whose real label is true and which the model considers true.

FN (False Negative): the number of data points in the test set whose real label is true but which the model considers false.

TN (True Negative): the number of data points in the test set whose real label is false and which the model considers false.

FPR serves as an indicator of the model's precision: the lower it is, the better. TPR measures the model's recall: the higher it is, the better. Different segmentation thresholds can be set for different application scenarios, converting the regression problem back into a classification problem to obtain a web page classification result that meets the requirements.

For example, in an advertising application scenario, a higher recall is desired and somewhat lower precision can be tolerated. Conversely, in an anti-cheating application, high precision is required, so the threshold can be set to guarantee a low FPR while tolerating a certain number of missed detections.
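The threshold selection just described can be illustrated with a small sketch. It assumes that the regression model outputs one score per test page and that a page is predicted positive when its score is at least the threshold; the function names and the max_fpr parameter are hypothetical and only serve the illustration.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each distinct score, highest threshold first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fpr = fp / neg if neg else 0.0
        tpr = tp / pos if pos else 0.0
        points.append((t, fpr, tpr))
    return points


def pick_threshold(points: List[Tuple[float, float, float]], max_fpr: float) -> float:
    """Choose the lowest threshold (hence the highest TPR) whose FPR stays within max_fpr."""
    ok = [p for p in points if p[1] <= max_fpr]
    if not ok:
        return float("inf")  # no threshold qualifies: nothing is predicted positive
    return min(ok, key=lambda p: p[0])[0]
```

Under these assumptions, an anti-cheating deployment would call pick_threshold with a small max_fpr, while an advertising deployment would allow a larger max_fpr in exchange for a higher TPR.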

According to an embodiment of the present application, a web page training system is provided for classifying web pages.

Referring to fig. 3, fig. 3 is a block diagram of a web page training system 300 according to an embodiment of the present application. The web page training system 300 comprises means 302 for obtaining a prior probability table for the classification keyword based on existing data associated with the classification keyword; means 304 for preprocessing the web page to be trained to obtain a text of the web page to be trained; means 306 for extracting features from the web page text to be trained according to the prior probability table to obtain an incidence relation feature vector representation F1 between the web page to be trained and the specified category; and means 308 for performing model training on the incidence relation feature vector representation F1 to obtain a classification result of the web page to be trained.

The existing data associated with the classification keyword may include a website category system and a user click log. The means 302 for obtaining a prior probability table for the classification keyword based on existing data associated with the classification keyword may comprise: means for obtaining search keywords from the user click log and filtering them to obtain classification keywords; means for calculating the frequency of occurrence of each classification keyword in the different categories; and means for traversing each non-leaf node category in the website category system bottom-up and calculating the information gain and chi-square test of each classification keyword.
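As an illustration of the chi-square test mentioned above, the statistic for one keyword and one category can be computed from a 2 × 2 contingency table of click counts. The sketch below uses the standard chi-square formula; how the counts are drawn from the click log, and the function name, are assumptions and not details given in the application.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2 x 2 contingency table.

    a: clicks containing the keyword that fall in the category
    b: clicks containing the keyword that fall in other categories
    c: clicks without the keyword that fall in the category
    d: clicks without the keyword that fall in other categories
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: keyword or category never (or always) occurs
    return n * (a * d - b * c) ** 2 / denom
```

A keyword that is distributed evenly across categories scores 0, while a keyword that occurs only in one category scores high and is therefore a strong indicator for that category.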

The means 304 for preprocessing the web page to be trained may include: web page parsing means; text extraction means; and word segmentation/stop-word filtering means.

The means 306 for extracting features from the text of the web page to be trained according to the prior probability table to obtain an association relationship feature vector representation F1 between the web page to be trained and the specified category may include: means for calculating the importance scores of the classified keywords in the web page to be trained; means for discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; means for obtaining a two-dimensional feature matrix according to the discretized importance and the discretized prior probability of the classified keywords; and means for calculating the two-dimensional feature matrix separately in different parts of the web page text to be trained and arranging all matrix elements in sequence to obtain a feature vector of fixed length. The different parts of the text of the web page to be trained may include the web page title, the web page body, and the web page hyperlink text.

The means 308 for performing model training on the incidence relation feature vector representation F1 to obtain the classification result of the web page to be trained may comprise regression modeling means and ROC analysis means. The regression modeling means may be used to perform regression model training on the labeled data according to a preset regression model algorithm; the ROC analysis means may be used to calculate the corresponding ROC curve for the trained regression model and to set different segmentation thresholds based on the ROC curve to obtain a classification result.

The training process in the application solves the problems of the heterogeneity of the category system and the excessively large number of categories mentioned in the background art through the following innovations:

(1) Mining the category system of a specific website and the user search click logs to obtain a prior probability table of the keywords.

(2) Generating an incidence relation feature vector representation between the web page and the specified category by combining the prior probability table of the keywords with the text features of the web page.

(3) Converting the classification problem into a regression problem on the incidence relation between the web page and the specified category, and training a regression model on the obtained incidence relation feature vector representation.

(4) Calculating the ROC curve of the regression model results against the classification labeling results, choosing a reasonable segmentation threshold, and converting the regression problem back into a classification problem.

Because the feature vector representation of the incidence relation between the web page and the category is independent of the specific category information, the method can handle strongly heterogeneous category systems simultaneously and can handle a very large category system using relatively little training data.

According to an embodiment of the present application, a web page prediction method is provided, which is used for performing category prediction for a user's intention by using a classification result obtained by training in the web page training method 100. The goal of the predictive process is to provide comprehensive business intent analysis results for any given particular web site based on the user's browsing/searching behavior across the web.

Referring to fig. 4, fig. 4 is a flowchart of a web page prediction method 400 according to an embodiment of the present application.

In step S402, a user access log is collected.

For example, a global cookie is first deployed across the entire network. A cookie is a technology that allows a web server to store a small amount of data on, or read data from, the client's hard disk. A cookie is a very small text file placed on the user's hard disk by a web server when the user browses a website; it can record the user ID, password, browsed web pages, dwell time, and so on. Cookies are commonly used to identify users.

Then, when the user accesses a web page carrying the global cookie, the module returns the user's access log to the back-end server.

The back-end server is then responsible for collecting and storing the user access logs. Access log format: the access log is a record of the user's internet browsing and comprises the following fields: cookie id, IP, user, access Url, Refer Url, access time, search key.

In step S404, html source code of the web page is obtained from the user access log.

Step S404 may include the steps of: performing url screening on the user access log; and crawling the screened urls to obtain the html source code of the web pages. The url screening step may include the following steps: calculating a deduplicated url set from the user access log; calculating a web page importance score for each url; and sorting all deduplicated urls by web page importance score and taking the top several urls as the url set to be crawled. The url screening process screens the url list parsed from the user access log table in order to select the urls considered to be of high quality.

For example, a deduplicated url set is first calculated from the collected user access logs.

The following attributes are then calculated for each url: PV (Page View, i.e., how many times the web page has been visited), UV (User View, i.e., how many users have visited the web page), PageRank (a web page importance indicator calculated from the hyperlink relationships between web pages), and PageType (one of: 404 page, landing page, list page, content page).

The web page importance score is then calculated:

S＝(α×sigmoid(PV)+β×sigmoid(UV))×PR×PT

wherein:

α, β: linear adjustment coefficients satisfying the constraints α > 0, β > 0, α + β = 1

PR: the PageRank value, ranging from 0 to 10

PT: the PageType, a discrete value that is 0 for a 404 page or a landing page, 0.5 for a list page, and 1 for a content page.

The importance score is calculated according to the above formula, all urls are sorted by score, and the Top N urls are taken as the url set to be crawled.

Finally, the corresponding html source code is crawled and stored according to the given url list.
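The scoring and Top N selection can be sketched as follows. The default α = 0.5, the dictionary layout of the per-url statistics, and the function names are illustrative assumptions; the application does not state whether PV and UV are rescaled before the sigmoid is applied, so raw counts are used here as-is.

```python
import math
from typing import Dict, List

# PT values as defined above: 404 and landing pages 0, list pages 0.5, content pages 1
PAGE_TYPE_WEIGHT = {"404": 0.0, "landing": 0.0, "list": 0.5, "content": 1.0}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def importance(pv: float, uv: float, pagerank: float, page_type: str, alpha: float = 0.5) -> float:
    """S = (alpha * sigmoid(PV) + beta * sigmoid(UV)) * PR * PT, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    return (alpha * sigmoid(pv) + beta * sigmoid(uv)) * pagerank * PAGE_TYPE_WEIGHT[page_type]


def top_n_urls(stats: Dict[str, dict], n: int) -> List[str]:
    """Sort the deduplicated urls by importance score and keep the first n."""
    ranked = sorted(stats, key=lambda url: importance(**stats[url]), reverse=True)
    return ranked[:n]
```

Since PT is 0 for 404 pages and landing pages, such urls always score 0 and fall to the bottom of the ranking regardless of their traffic.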

In step S406, the html source code of the web page is preprocessed to obtain web page text.

Similar to step S104, step S406 may include the steps of: web page parsing; text extraction; and word segmentation/stop-word filtering.

In the web page parsing step, the html source code is parsed to obtain the position information and semantic information of each element. For example, given the html source code, it can be determined which text appears in a hyperlink, which hyperlink text appears in a header, which hyperlink text appears in a footer, and so on. The output of the web page parsing flow is a Document Object Model tree.

In the text extraction step, headers, footers, navigation, advertisements, and sidebars are filtered out according to the output of web page parsing to obtain the body text of the web page.

In the word segmentation/stop-word filtering step, a preset Chinese word segmentation algorithm is applied; many such algorithms exist, for example CDWS, n-gram, hidden Markov model, and CRF (conditional random field). In one application example, CRF may be used as the Chinese word segmentation algorithm. Function words occur very frequently in text but carry almost no meaning for classification; they interfere with classification and inflate the text dimensionality, which reduces classification efficiency. These words are filtered out using a stop-word list.
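A minimal sketch of the text extraction step is given below. It assumes that noise regions are marked with HTML5 semantic tags (header, footer, nav, aside); real pages often mark such regions with class or id attributes instead, and the application does not specify how they are detected. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed noise containers; script and style content is dropped as well
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any noise container."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise containers are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```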

In step S408, features are extracted from the web page text according to the prior probability table to obtain an association relationship feature vector representation F2 between the web page and the specified category. Preferably, the regression model and the segmentation threshold obtained by the training process are combined with the feature vector representation F2 obtained in the above step to obtain the classification result of the web page. Note that a web page may be classified into several categories, or into none.

Similar to step S106, step S408 may include the steps of: calculating the importance scores of the classified keywords in the web page; discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; obtaining a two-dimensional feature matrix according to the discretized importance and the discretized prior probability of the classified keywords; and calculating the two-dimensional feature matrix separately in different parts of the web page text, and arranging all matrix elements in sequence to obtain a feature vector of fixed length.

For example, the matrix element n_2,3 indicates the number of keywords with relatively high importance and relatively low prior probability.

In step S410, a user intention category prediction result is obtained from the classification result and the association relation feature vector representation F2.

Step S410 may include the steps of: obtaining a web-page-level analysis result according to the classification result and the association relationship feature vector representation F2; and joining the web-page-level analysis result with the user access log to obtain the intention indicated by each web page access of the user, and performing an aggregation operation in the user dimension to obtain a user intention category prediction result.

For example, machine learning classification is performed first: the business intention preference of the web page is classified according to a preset classifier algorithm, using the feature vector obtained by the feature extraction module together with the model file generated in the training process.

Next, aggregation of user dimensions is performed.

First, the user access log is converted into a table in a relational database containing the fields CookieId, access Url, and access time. The output of the machine learning classifier module is likewise converted into a table in the relational database containing the fields Url and business intention analysis result. That is, the aggregation flow for the user dimension has two inputs: (1) the web page analysis result table, which has two fields: web page url, web page business intention classification result; (2) the user access log, which has three fields: user id, access url, access time.

A Join operation is then performed on the two tables with Url as the key to obtain a joined table containing the fields CookieId, access Url, access time, and business intention analysis result.

Finally, an aggregation operation is performed with CookieId as the key, aggregating the business intention analysis results of the same CookieId. The aggregation algorithm takes a time decay factor into account; the following formula describes the aggregation result of the ith business intention classification weight of a given CookieId:

wherein,

W_i: the aggregation result of the ith commercial intention classification weight of the given CookieId;

W_i,j: the weight of the ith commercial intention classification of the given CookieId on the jth page visited by the user;

D_j: the number of days between the time at which the given CookieId accessed the jth page and the current time.

That is, the output of the aggregation process for the user dimension is a user business intention analysis result table, which has two fields: user id, user commercial intention classification result.
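The join and aggregation can be sketched as follows. The exact form of the time decay is not reproduced here, so the sketch assumes an exponential decay, W_i = Σ_j W_i,j × exp(−λ × D_j); this decay form, the rate λ (the decay parameter), and all function and variable names are illustrative assumptions rather than the definition used in the application.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_intent(
    access_log: List[Tuple[str, str, float]],    # (cookie_id, url, days_ago)
    page_results: Dict[str, Dict[str, float]],   # url -> {category: weight W_i,j}
    decay: float = 0.1,                           # assumed decay rate per day
) -> Dict[str, Dict[str, float]]:
    """Join the access log with page-level results on url, then sum per cookie id."""
    profiles = defaultdict(lambda: defaultdict(float))
    for cookie_id, url, days_ago in access_log:
        weights = page_results.get(url)           # join on url
        if weights is None:
            continue                               # page was never crawled or classified
        factor = math.exp(-decay * days_ago)       # assumed time decay factor
        for category, w in weights.items():
            profiles[cookie_id][category] += w * factor
    return {cookie: dict(cats) for cookie, cats in profiles.items()}
```

With any decay of this kind, recent page visits contribute more to the user's profile than old ones, which is the stated purpose of the time decay factor.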

In a preferred embodiment, the training process obtains a fitted regression model file according to the known training data and the processing process described above, and obtains a segmentation threshold according to the ROC curve analysis. In a machine learning classifier module of the prediction process, a regression model file obtained by the training process and a segmentation threshold are applied to obtain a prediction result of the webpage level. The prediction process obtains the business intention analysis result of the user level by aggregating the webpage prediction results according to the user.

According to the embodiment of the application, a webpage prediction system is provided, which is used for performing category prediction aiming at the user intention by using the classification result obtained by the training of the webpage training system.

Referring to fig. 5, fig. 5 is a block diagram of a web page prediction system 500 according to an embodiment of the present application. The web page prediction system 500 comprises means 502 for collecting a log of user accesses; means 504 for obtaining html source code for the web page from the user access log; means 506 for preprocessing html source code of the web page to obtain text of the web page; means 508 for extracting features from the web page text according to the prior probability table to obtain an association relationship feature vector representation F2 between the web page and the specified category; means 510 for obtaining a user intention category prediction result from the classification result and the association feature vector representation F2.

The means 504 for obtaining html source code for a web page from a user access log may include: means for performing url screening on the user access log; and means for crawling the screened urls to obtain the html source code of the web pages. The means for performing url screening on the user access log may include: means for calculating a deduplicated url set from the user access log; means for calculating a web page importance score for each url; and means for sorting all deduplicated urls by web page importance score and taking the top several urls as the url set to be crawled.

The means 506 for preprocessing html source code of the web page may include: web page parsing means; text extraction means; and word segmentation/stop-word filtering means.

The means 508 for extracting features from the web page text according to the prior probability table to obtain the association relationship feature vector representation F2 between the web page and the specified category may include: means for calculating the importance scores of the classified keywords in the web page; means for discretizing the importance scores of the classified keywords and the prior probabilities of the classified keywords; means for obtaining a two-dimensional feature matrix according to the discretized importance and the discretized prior probability of the classified keywords; and means for calculating the two-dimensional feature matrix separately in different parts of the web page text and arranging all matrix elements in sequence to obtain a feature vector of fixed length.

The means 510 for obtaining the user intention category prediction result according to the classification result and the association relationship feature vector representation F2 may include: means for obtaining a web-page-level analysis result according to the classification result and the association relationship feature vector representation F2; and means for joining the web-page-level analysis result with the user access log to obtain the intention indicated by each web page access of the user, and performing an aggregation operation in the user dimension to obtain a user intention category prediction result.

The prediction process in the present application solves the problem of data sparsity mentioned in the background art by the following innovations:

(1) Collecting the browsing/searching behavior of the user across the whole network, not just on one website, largely solves the problem of data sparsity; in one application example, 5 billion pieces of user browsing log information can be collected every day.

(2) Extracting the web pages with commercial value browsed by the user from the user browsing log information, and performing in-depth analysis of the web page content.

(3) Providing an embedded interface through which the customization requirements of multiple specific websites can be embedded in the prediction process.

(4) Adaptively producing a customized commercial intention analysis result for any given specific website from the same user data.

FIG. 6 illustrates an architecture diagram of an exemplary system of the present application that effectively identifies a user's commercial intention. From the perspective of system hierarchy, the system is divided into three layers: a data collection layer, a data preprocessing layer, and a data analysis layer. From the perspective of data flow, the system is divided into two flows: a training flow and a prediction flow.

In summary, the present application solves the three major problems listed in the background art. Specifically:

(1) According to the method and the system, all browsing and searching records of the user across the whole network are collected and analyzed, and a complete user profile is established based on the analysis result, so that the problem of data sparsity is well solved.

(2) The method and the system provide a uniform data interface for any potential specific website, can well adapt to the situation of strong heterogeneity of a category system through the uniform interface and a processing flow, and can quickly establish accurate classifiers for different category systems on the premise that the training data scale is relatively small.

(3) The method and the device are suitable for a very large category system, do not require that training data linearly increase along with the increase of the number of categories, can well process the situations of multiple categories and no categories, and ensure the accuracy of category prediction.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above description is only an example of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims

1. A web page training method is used for classifying web pages, and is characterized by comprising the following steps:

obtaining a category prior probability table of the classification key words according to existing data associated with the classification key words, wherein the existing data associated with the classification key words comprises a website category system;

preprocessing a webpage to be trained to obtain a webpage text to be trained;

extracting features from the webpage text to be trained according to the category prior probability table to obtain an incidence relation feature vector representation F1 between the webpage to be trained and the specified category;

and performing model training on the incidence relation feature vector representation F1 to obtain a classification result of the webpage to be trained.

2. The method of claim 1, wherein the existing data associated with the classification key comprises a user click log.

3. The method of claim 2, wherein the step of obtaining a prior probability table for the category of the classification key based on existing data associated with the classification key comprises:

obtaining search keywords from the user click logs, and filtering to obtain classified keywords;

calculating the occurrence frequency of each classification keyword in different classifications;

and traversing each non-leaf node category in the website category system by adopting a bottom-up method, and calculating the information gain and chi-square test of each classified keyword.

4. The method of claim 1, wherein preprocessing the web page to be trained comprises one or more of:

web page parsing;

text extraction;

word segmentation/stop-word filtering.

5. The method of claim 1, wherein the step of extracting features from the text of the web page to be trained according to the category prior probability table to obtain an association relationship feature vector representation F1 between the web page to be trained and the specified category comprises:

calculating the importance scores of the classified keywords in the web pages to be trained;

discretizing the importance scores of the classified keywords and the category prior probability of the classified keywords;

obtaining a two-dimensional feature matrix according to the discretization importance of the classified key words and the discretization category prior probability of the classified key words;

and respectively calculating the two-dimensional feature matrix in different parts of the web page text to be trained, and sequentially arranging all matrix elements to obtain a feature vector with a fixed length.

6. The method of claim 5, wherein the different portions of the text of the web page to be trained comprise a web page title, a web page body, and web page hyperlink text.

7. The method according to claim 1, wherein the step of model training the incidence relation feature vector representation F1 to obtain the classification result of the web page to be trained comprises:

performing regression model training on the labeled data according to a preset regression model algorithm; and

and calculating corresponding ROC curves aiming at the trained regression models, and setting different segmentation threshold values based on the ROC curves to obtain a classification result.

8. A web page training system for classifying web pages, comprising:

means for obtaining a category prior probability table for a classification keyword based on existing data associated with the classification keyword, the existing data associated with the classification keyword including a website category hierarchy;

means for preprocessing the web page to be trained to obtain the web page text to be trained;

means for extracting features from the web page text to be trained according to the category prior probability table to obtain an incidence relation feature vector representation F1 between the web page to be trained and the specified category;

and means for performing model training on the incidence relation feature vector representation F1 to obtain a classification result of the web page to be trained.

9. A web page prediction method for performing category prediction with respect to a user's intention by using a classification result obtained by training according to the web page training method of any one of claims 1 to 7, comprising:

collecting user access logs;

obtaining html source codes of the webpage from the user access log;

preprocessing the html source code of the webpage to obtain a webpage text;

extracting features from the webpage text according to the category prior probability table to obtain an incidence relation feature vector representation F2 between the webpage and the specified category;

and obtaining a user intention category prediction result according to the classification result and the incidence relation feature vector representation F2.

10. The method of claim 9, wherein the step of obtaining html source code for the web page from the user access log comprises:

subjecting the user access log to url screening;

and crawling the filtered url to obtain html source codes of the webpage.

11. The method of claim 10, wherein the step of url screening the user access log comprises:

calculating a deduplicated url set from the user access log;

calculating a web page importance score for each url;

and sorting all deduplicated urls by web page importance score, and taking the top several urls as the url set to be crawled.

12. The method of claim 9, wherein the step of extracting features from the web page text according to the category prior probability table to obtain the associated feature vector representation F2 between the web page and the specified category comprises:

calculating the importance scores of the classified keywords in the web pages;

and respectively calculating the two-dimensional feature matrix in different parts of the webpage text, and sequentially arranging all matrix elements to obtain a feature vector with a fixed length.

13. The method according to claim 9, wherein the step of obtaining the prediction result of the user intention category according to the classification result and the incidence relation feature vector representation F2 comprises:

obtaining a web-page-level analysis result according to the classification result and the incidence relation feature vector representation F2;

and connecting the webpage-level analysis result with the user access log to obtain the intention indicated by each webpage access behavior of the user, and performing aggregation operation on user dimensions to obtain a user intention category prediction result.

14. A web page prediction system for predicting categories of user's intentions by using the classification result trained by the web page training system according to claim 8, comprising:

means for collecting a user access log;

means for obtaining html source code for a web page from the user access log;

means for preprocessing the web page html source code to obtain web page text;