CN117828216A

CN117828216A - Multi-mode Web information retrieval static ordering learning method, system, equipment and medium

Info

Publication number: CN117828216A
Application number: CN202311669809.9A
Authority: CN
Inventors: 耿光刚; 黄衍铭; 张继连; 冯丙文; 刘志全
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-04-05

Abstract

The invention discloses a multimodal Web information retrieval static ranking learning method, system, device and medium. The method comprises: obtaining a webpage screenshot and webpage information of a target webpage, and obtaining webpage text information, HTML text, HTML tags and webpage statistical feature information from them; mapping and extracting features from the webpage text information, HTML text information, HTML tag information and webpage statistical feature information respectively, to obtain a text feature vector, an HTML text feature vector, an HTML tag feature vector and a webpage statistical feature vector; combining the four feature vectors to obtain a combined feature vector; and performing evaluation and grading according to the combined feature vector to generate a predicted evaluation grade for the webpage. The invention explores the intrinsic characteristics of webpage information in depth from multiple dimensions, evaluates the Web content quality of the target webpage more accurately, objectively and reasonably, improves evaluation accuracy, and is less susceptible to various cheating methods.

Description

Multimodal Web information retrieval static ranking learning method, system, device and medium

Technical Field

The present invention belongs to the technical field of webpage information retrieval, and in particular relates to a multimodal Web information retrieval static ranking learning method, system, device and medium.

Background

Web information refers to the large number of webpages and related content on the Internet, including websites, blogs, social media posts, news articles and so on. Web information contains not only high-quality information but also false, deceptive, cheating and abusive information. With web content so abundant and so uneven in quality, identifying high-quality web content has become increasingly important and urgent.

Static Ranking Learning for Web Information Retrieval, also known as Evaluation of Web Content Quality, Query-Independent Ranking or Static Ranking, is, as the name suggests, used to evaluate the quality of Web content and to classify or rank web content of different quality. Static ranking learning is a core, foundational algorithm of search engines, recommendation systems and intelligent dialogue services. For a search engine, its evaluation results serve as the "importance" reference for search ranking and can also be used to guide large-scale webpage crawling.

Static ranking for Web information retrieval has already produced a body of research and applied a variety of techniques, such as the link analysis algorithms PageRank, TrustRank and Truncated PageRank, and link cheating has been well suppressed. However, there is still no effective solution that handles hidden cheating, embedded cheating, redirection cheating and cloaking at the same time. Meanwhile, as Internet content takes ever more diverse forms and data volume keeps growing, and especially as profit-driven abuse such as Web cheating and false advertising becomes increasingly rampant, efficiently identifying Web content of different quality is becoming technically harder, while the demand for high-quality information from Internet users and from large models such as LLMs keeps growing. Mainstream static ranking methods are easily affected by such abuse, so their ranking accuracy falls short of expectations, which poses new challenges for existing static ranking methods for Web information retrieval.

Summary of the Invention

The main purpose of the present invention is to overcome the shortcomings of the prior art, namely low ranking accuracy and susceptibility to various cheating methods, and to propose a multimodal Web information retrieval static ranking learning method, system, device and medium.

To achieve the above purpose, the present invention adopts the following technical solutions:

A multimodal Web information retrieval static ranking learning method comprises the following steps:

obtaining a webpage screenshot of a target webpage, obtaining webpage text information based on the webpage screenshot, and mapping the webpage text information into a first text vector sequence;

obtaining the HTML text of the target webpage, and mapping the HTML text information into a first HTML text vector sequence;

obtaining the HTML tags of the target webpage, and mapping the HTML tag information into a first HTML tag vector sequence;

mapping the webpage statistical feature information of the target webpage into a first webpage statistical feature vector sequence, the webpage statistical features including link analysis features, content heuristic features, website attribution features and time-series features;

using one-hot encoding and word embedding, obtaining a second text feature vector matrix, a second HTML text feature vector matrix and a second HTML tag feature vector matrix from the first text vector sequence, the first HTML text vector sequence and the first HTML tag vector sequence; and using a Transformer encoder model to extract features from the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix, to obtain a text feature vector, an HTML text feature vector and an HTML tag feature vector;

based on a feature combination and concatenation model, concatenating the first webpage statistical feature vector sequence to obtain a second webpage statistical feature vector matrix; and using a DNN model to extract features from the second webpage statistical feature vector matrix to obtain a webpage statistical feature vector;

combining the text feature vector, the HTML text feature vector, the HTML tag feature vector and the webpage statistical feature vector to obtain a combined feature vector;

performing evaluation and grading based on the combined feature vector to generate a predicted evaluation grade for the webpage.

The present invention also includes a multimodal Web information retrieval static ranking learning system. The system applies the multimodal Web information retrieval static ranking learning method provided by the present invention and includes an information extraction module, a feature extraction module, a feature combination module and a quality evaluation module;

the information extraction module is used to obtain a webpage screenshot and webpage information of a target webpage, and to obtain webpage text information, HTML text, HTML tags and webpage statistical feature information based on the webpage screenshot and webpage information;

the feature extraction module is used to map the webpage text information, HTML text information, HTML tag information and webpage statistical feature information into a first text vector sequence, a first HTML text vector sequence, a first HTML tag vector sequence and a first webpage statistical feature vector sequence, respectively; it is also used to perform feature extraction on these four sequences to obtain a text feature vector, an HTML text feature vector, an HTML tag feature vector and a webpage statistical feature vector, respectively;

the feature combination module is used to combine the text feature vector, the HTML text feature vector, the HTML tag feature vector and the webpage statistical feature vector to obtain a combined feature vector;

the quality evaluation module is used to perform evaluation and grading based on the combined feature vector to generate a predicted evaluation grade for the webpage;

the feature extraction module is provided with a DNN model and a Transformer encoder model;

the feature combination module is provided with a fully connected layer. Through this layer, the text feature vector carrying text context information, the HTML text feature vector carrying HTML text context information, the HTML tag feature vector carrying HTML tag context information and the webpage statistical feature vector carrying webpage statistical information are combined; the layer learns the best way to fuse them and produces a feature representation suitable for classification;

the quality evaluation module is equipped with a DNN model, which evaluates and grades Web content quality based on the combined feature vector to determine the predicted evaluation grade of the current webpage.

The present invention also includes a computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the multimodal Web information retrieval static ranking learning method provided by the present invention when executing the computer program.

The present invention also includes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the multimodal Web information retrieval static ranking learning method provided by the present invention.

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The present invention considers multimodal factors as a whole. In addition to link analysis features, content heuristic features, webpage attribution features and time-series features, it incorporates semantic information from two different modalities: the text content and HTML tag information seen from the machine's perspective, and the OCR text of the webpage screenshot seen from the user's perspective. A point-wise deep learning architecture mines this multi-perspective discriminative information, so that link cheating, hidden cheating, embedded cheating, redirection cheating, cloaking and other cheating factors are handled together, while website content, attribution and time-series characteristics are also taken into account. Compared with the prior art, the invention explores the intrinsic characteristics of webpage information more deeply from multiple dimensions, evaluates the Web content quality of the target webpage more accurately, objectively and reasonably, and improves evaluation accuracy.

Brief Description of the Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of the method of the present invention;

FIG. 3 is a schematic diagram of the text feature vector transformation in the embodiment;

FIG. 4 is a schematic diagram of the HTML text feature vector transformation in the embodiment;

FIG. 5 is a schematic diagram of the HTML tag feature vector transformation in the embodiment;

FIG. 6 is a schematic diagram of the webpage statistical feature vector transformation in the embodiment;

FIG. 7 is a schematic flow diagram of feature extraction using the Transformer encoder model in the embodiment;

FIG. 8 is a schematic diagram of the system of the present invention.

Detailed Description

The present invention is described in further detail below with reference to an embodiment and the drawings, but the embodiments of the present invention are not limited thereto.

Embodiment

As shown in FIG. 1 and FIG. 2, the multimodal Web information retrieval static ranking learning method of the present invention comprises the following steps:

Obtain a webpage screenshot of the target webpage, obtain webpage text information based on the webpage screenshot, and map the webpage text information into a first text vector sequence, specifically:

The acquired webpage text information is defined as input data $x_{text}$ and treated as a text sequence. $x_{text}$ is divided into a series of discrete text units, $x_{text} = \{t_1, t_2, t_3, \dots, t_L\}$, where $L$ is the original length of the input data and $t_i$ denotes each discrete text unit in $x_{text}$. With a fixed length $L_c$, truncation or padding yields a new character element sequence $\hat{x}_{text} = \{\hat{t}_1, \hat{t}_2, \dots, \hat{t}_{L_c}\}$, which is the first text vector sequence, where $\hat{t}_i$ denotes each discrete text unit in $\hat{x}_{text}$.

Obtain the HTML text of the target webpage, and map the HTML text information into a first HTML text vector sequence, specifically:

The acquired webpage HTML text information is defined as input data $x_{htmlText}$ and treated as an HTML text sequence. $x_{htmlText}$ is divided into a series of discrete HTML text units, $x_{htmlText} = \{h_1, h_2, h_3, \dots, h_{hL}\}$, where $hL$ is the original length of the input data and $h_i$ denotes each discrete HTML text unit in $x_{htmlText}$. With a fixed length $L_h$, truncation or padding yields a new HTML text element sequence $\hat{x}_{htmlText} = \{\hat{h}_1, \hat{h}_2, \dots, \hat{h}_{L_h}\}$, which is the first HTML text vector sequence, where $\hat{h}_i$ denotes each discrete HTML text unit in $\hat{x}_{htmlText}$.

Obtain the HTML tags of the target webpage, and map the HTML tag information into a first HTML tag vector sequence, specifically:

The acquired webpage HTML tag information is defined as input data $x_{htmlTag}$ and treated as an HTML tag sequence. $x_{htmlTag}$ is divided into a series of discrete HTML tag units, $x_{htmlTag} = \{g_1, g_2, g_3, \dots, g_{gL}\}$, where $gL$ is the original length of the input data and $g_i$ denotes each discrete HTML tag unit in $x_{htmlTag}$. With a fixed length $L_g$, truncation or padding yields a new HTML tag element sequence $\hat{x}_{htmlTag} = \{\hat{g}_1, \hat{g}_2, \dots, \hat{g}_{L_g}\}$, which is the first HTML tag vector sequence, where $\hat{g}_i$ denotes each discrete HTML tag unit in $\hat{x}_{htmlTag}$.
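The truncation-or-padding step shared by the three sequences can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the helper name `fit_length`, the `<pad>` token and the right-padding choice are assumptions, since the text only states that a fixed length is enforced.

```python
from typing import List, Sequence

PAD = "<pad>"  # hypothetical padding token; the source does not name one


def fit_length(units: Sequence[str], fixed_len: int, pad: str = PAD) -> List[str]:
    """Truncate or right-pad a sequence of discrete units to exactly fixed_len items."""
    if fixed_len < 0:
        raise ValueError("fixed_len must be non-negative")
    clipped = list(units[:fixed_len])                      # truncation when too long
    return clipped + [pad] * (fixed_len - len(clipped))    # padding when too short


text_units = ["static", "ranking", "for", "web", "pages"]
tag_units = ["html", "body", "div"]

first_text_seq = fit_length(text_units, 4)  # 5 units cut down to a fixed length of 4
first_tag_seq = fit_length(tag_units, 4)    # 3 units padded up to a fixed length of 4
```

The same helper would serve the text, HTML text and HTML tag sequences, each with its own fixed length.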

In this embodiment, obtaining a webpage screenshot of the target webpage and obtaining webpage text information based on the screenshot includes:

capturing the webpage screenshot from the link address (URL); performing text box detection on the screenshot to obtain all text box regions in it; and applying an image recognition algorithm to the text box regions to recognize the text, yielding ordered webpage text information.

In this embodiment, the visual information in the webpage is fully exploited. Working from the webpage user's point of view with technologies such as OCR makes it possible to bypass various cheating techniques to a certain extent and to directly extract the webpage text information and webpage image information that are actually visible.

In some embodiments, the screenshot of the target webpage is captured using strategies such as delayed waiting; this screenshot is the image information needed in later steps.

In some embodiments, a webpage text detection algorithm detects every region of the screenshot that contains text and expresses each one as rectangular coordinates [x1, y1, x2, y2, x3, y3, x4, y4], the horizontal and vertical coordinates of the lower-left, upper-left, lower-right and upper-right corners of the region; the rectangle enclosed by these four points is called a text box.

In some embodiments, an OCR algorithm is used as the image recognition algorithm.

In some embodiments, image processing functions from the OpenCV library are used as the image recognition algorithm.

In some embodiments, to ensure text recognition accuracy, text boxes whose area is too small are filtered out. Image processing functions from the OpenCV library are then used to crop the text box regions from the screenshot one by one according to their coordinates and keep them in memory; if a text box region is tilted, it is rectified to horizontal with an affine transformation to improve recognition accuracy. Once all text box regions have been obtained, multiple regions are read from memory in parallel using multithreading and fed into an algorithm model (such as YOLOv4), which recognizes the corresponding text from its text features. After text recognition has been completed for all text box regions, the recognized text is reassembled in order into the complete webpage text, which completes the extraction of the webpage text information.
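The filtering and ordering of text boxes can be sketched in plain Python. Several details here are assumptions, because the text does not fix them: the area threshold `min_area`, the use of the axis-aligned bounding area as the size measure, and a top-to-bottom, left-to-right reading order for reassembly. The helper names are hypothetical.

```python
from typing import List, Tuple

# [x1, y1, x2, y2, x3, y3, x4, y4]: lower-left, upper-left, lower-right, upper-right
Box = Tuple[float, float, float, float, float, float, float, float]


def bounding_area(box: Box) -> float:
    """Area of the axis-aligned rectangle enclosing the four corner points."""
    xs, ys = box[0::2], box[1::2]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def filter_and_order(boxes: List[Box], min_area: float) -> List[Box]:
    """Drop boxes smaller than min_area, then sort the rest into reading order."""
    kept = [b for b in boxes if bounding_area(b) >= min_area]
    # Image coordinates: y grows downward, so a smaller y means higher on the page.
    return sorted(kept, key=lambda b: (min(b[1::2]), min(b[0::2])))


paragraph = (10, 40, 10, 20, 110, 40, 110, 20)   # 100 x 20 region lower on the page
headline = (10, 15, 10, 5, 200, 15, 200, 5)      # 190 x 10 region at the top
speck = (300, 12, 300, 10, 303, 12, 303, 10)     # 3 x 2 region, too small to keep

ordered = filter_and_order([paragraph, speck, headline], min_area=50)
```

Cropping each kept region and correcting its tilt would then be handled by the image library before recognition; that part is omitted here.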

In this embodiment, as shown in FIG. 3, FIG. 4 and FIG. 5, the mapping proceeds as follows:

Using one-hot encoding and word embedding, each discrete text unit of the first text vector sequence is mapped to a $d$-dimensional vector, so that the first text vector sequence as a whole is mapped into a vector matrix $M_{text} \in \mathbb{R}^{L_c \times d}$. The matrix $M_{text}$ is the second text feature vector matrix generated by feature construction; specifically, $d$ is set to 100;

Using one-hot encoding and word embedding, each discrete HTML text unit of the first HTML text vector sequence is mapped to a $d$-dimensional vector, so that the first HTML text vector sequence as a whole is mapped into a vector matrix $M_{htmlText} \in \mathbb{R}^{L_h \times d}$. The matrix $M_{htmlText}$ is the second HTML text feature vector matrix generated by feature construction;

Using one-hot encoding and word embedding, each discrete HTML tag unit of the first HTML tag vector sequence is mapped to a $d$-dimensional vector, so that the first HTML tag vector sequence as a whole is mapped into a vector matrix $M_{htmlTag} \in \mathbb{R}^{L_g \times d}$. The matrix $M_{htmlTag}$ is the second HTML tag feature vector matrix generated by feature construction.
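The relationship between one-hot encoding and word embedding can be illustrated with a small NumPy sketch. The toy vocabulary and the randomly initialised embedding table `E` are assumptions for illustration only; in a trained model the table would be learned. Only the embedding size of 100 comes from the text.

```python
import numpy as np

vocab = {"<pad>": 0, "static": 1, "ranking": 2, "web": 3}  # hypothetical toy vocabulary
d = 100                                                     # embedding size given in the text
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d))    # embedding table; random here, learned in practice

sequence = ["static", "ranking", "web", "<pad>"]   # a fixed-length sequence of 4 units
ids = np.array([vocab[unit] for unit in sequence])

one_hot = np.eye(len(vocab))[ids]   # shape (4, 4): one one-hot row per unit
M_text = one_hot @ E                # shape (4, 100): one d-dimensional row per unit

# A one-hot row times E selects a single row of E, so a direct lookup gives the same matrix.
lookup = E[ids]
```

Because the product and the lookup coincide, implementations normally skip the explicit one-hot matrix and index the table directly.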

As shown in FIG. 7, a Transformer encoder model is used to extract features from the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix, specifically:

The Transformer encoder model includes several encoders, each comprising a multi-head self-attention mechanism, a feedforward layer and layer normalization;

The process of feature extraction using the Transformer encoder model is as follows:

Position embedding is applied to the feature vector matrix to obtain a feature vector matrix with embedded position information. On its own, the Transformer encoder model cannot make use of the order of words, so in this embodiment position embedding is introduced before feature extraction to record each word's position in the sequence. This lets the Transformer encoder model distinguish words at different positions and thereby capture order information.

The feature vector matrix with embedded position information is passed in turn through the multi-head self-attention mechanism, feedforward layer and layer normalization of each encoder, finally yielding the feature vector;

For the feature extraction of the second text feature vector matrix, the process expression is:

$$h_0 = \tilde{M}_{text}, \qquad h_i = \mathrm{Encoder}_i(h_{i-1}), \quad i = 1, \dots, l$$

where $h_i$ denotes the vector matrix output by the $i$-th encoder, $l$ denotes the number of encoders in the Transformer encoder model, and $\tilde{M}_{text}$ denotes the second text feature vector matrix with embedded position information;

For the feature extraction of the second HTML text feature vector matrix, the process expression is:

$$h_0 = \tilde{M}_{htmlText}, \qquad h_i = \mathrm{Encoder}_i(h_{i-1}), \quad i = 1, \dots, l$$

where $\tilde{M}_{htmlText}$ denotes the second HTML text feature vector matrix with embedded position information;

For the feature extraction of the second HTML tag feature vector matrix, the process expression is:

$$h_0 = \tilde{M}_{htmlTag}, \qquad h_i = \mathrm{Encoder}_i(h_{i-1}), \quad i = 1, \dots, l$$

where $\tilde{M}_{htmlTag}$ denotes the second HTML tag feature vector matrix with embedded position information;

In all three expressions, $h_i$ denotes the vector matrix output by the $i$-th encoder, $l$ denotes the number of encoders in the Transformer encoder model, and $h_0$ is the corresponding word vector matrix;

The position embedding process is specifically:

$$\tilde{M}_{text} = M_{text} + M_{peText}, \qquad \tilde{M}_{htmlText} = M_{htmlText} + M_{peHtml}, \qquad \tilde{M}_{htmlTag} = M_{htmlTag} + M_{peHtmlTag}$$

where pos, the position number of a word, and i, the dimension number within the word vector, index the entries of the position information matrices; $M_{peText}$, $M_{peHtml}$ and $M_{peHtmlTag}$ are the position information matrices of the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix respectively, and their dimensions equal those of $M_{text}$, $M_{htmlText}$ and $M_{htmlTag}$ respectively; $+$ denotes matrix addition;
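The formula that fills the position information matrices did not survive in this text. The sketch below assumes the sinusoidal encoding of the original Transformer, which is indexed by a position `pos` and a dimension index `i` in the same way; treat that choice, and the function name, as assumptions rather than the patent's stated method.

```python
import numpy as np


def sinusoidal_position_matrix(length: int, d: int) -> np.ndarray:
    """Return a (length, d) matrix with sin on even columns and cos on odd columns."""
    pos = np.arange(length)[:, None]       # position of each unit, shape (length, 1)
    i = np.arange(d)[None, :]              # column (dimension) index, shape (1, d)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    pe = np.empty((length, d))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions
    return pe


L_c, d = 6, 100
M_text = np.zeros((L_c, d))                       # placeholder for the embedded sequence
M_peText = sinusoidal_position_matrix(L_c, d)     # same shape as M_text
M_text_pos = M_text + M_peText                    # matrix addition of the two
```

Because the position matrix has the same shape as the feature matrix, the addition is element-wise and leaves the dimensions unchanged.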

The self-attention mechanism is scaled dot-product attention. The multi-head self-attention mechanism repeats the self-attention computation $h$ times and then concatenates the results; its process expression is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h)$$

$$head_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where $head_i$ denotes the result of the $i$-th self-attention computation, $h$ denotes the number of repetitions of the self-attention mechanism and is set to 10; $\mathrm{MultiHead}(Q, K, V)$ denotes the result of the multi-head self-attention mechanism; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ denote different linear transformation matrices; $Q$, $K$ and $V$ denote the query, key and value respectively; and $d_k$ denotes the number of columns of the $Q$ and $K$ matrices, that is, the vector dimension.

In any encoder, the multi-head self-attention mechanism produces a vector matrix $A$, which is then fed into the feedforward layer. This layer consists of several fully connected layers and applies a mapping and affine transformation to $A$. After a residual connection and layer normalization, the output vector matrix $h_i$ of the encoder is obtained and passed on as the input of the next encoder. The output $h_l$ of the last encoder is the text feature vector, the HTML text feature vector or the HTML tag feature vector.
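The encoder computation described above can be sketched with NumPy. This is a simplified illustration under stated assumptions: weights are random, the feedforward layer is taken to be two fully connected layers with a ReLU in between, layer normalization has no learned scale or shift, and `init_params`, `d_ff` and the choice of two stacked encoders are hypothetical. Only the 10 heads and the embedding size of 100 come from the text.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def multi_head(X, W_q, W_k, W_v):
    """Run attention once per head on projections of X and concatenate the results."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1)


def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    mean, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def encoder(X, p):
    """One encoder: attention, feedforward, residual connection, layer normalization."""
    A = multi_head(X, p["W_q"], p["W_k"], p["W_v"])
    ff = np.maximum(0.0, A @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]  # ReLU is assumed
    return layer_norm(A + ff)


def init_params(rng, d, h, d_ff):
    """Random weights for one encoder; each head works in d // h dimensions."""
    d_k = d // h
    shape = (h, d, d_k)
    return {
        "W_q": rng.normal(scale=0.1, size=shape),
        "W_k": rng.normal(scale=0.1, size=shape),
        "W_v": rng.normal(scale=0.1, size=shape),
        "W1": rng.normal(scale=0.1, size=(d, d_ff)), "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d)), "b2": np.zeros(d),
    }


rng = np.random.default_rng(0)
L_c, d, h = 6, 100, 10
h_i = rng.normal(size=(L_c, d))   # stands in for the position-embedded feature matrix
for params in [init_params(rng, d, h, d_ff=200) for _ in range(2)]:  # two encoders
    h_i = encoder(h_i, params)    # each output becomes the next encoder's input
feature = h_i                     # output of the last encoder
```

With 10 heads of 10 dimensions each, the concatenated attention output has the same width as the input, which is what makes the residual connection and the stacking of encoders possible.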

Based on the feature combination and concatenation model, the first webpage statistical feature vector sequence is concatenated to obtain the second webpage statistical feature vector matrix. The webpage statistical feature information is defined as input data $x_{statistics}$ and treated as a webpage statistical feature sequence, which is a one-dimensional numerical sequence containing link analysis features $x_{link}$ (TrustRank, SpamRank, Truncated PageRank, etc.), content heuristic features $x_{content}$ (hot-word coverage, text length, etc.), webpage attribution features $x_{attribution}$ (domain name holder, IP address, etc.) and time-series features $x_{time}$;

As shown in FIG. 6, the above four kinds of features are concatenated in order to generate the feature vector sequence $x_{statistics} = \{x_{link}; x_{content}; x_{attribution}; x_{time}\}$; a dimension conversion is then applied to this sequence to generate the feature vector matrix $M_{statistics} = \{M_{link}; M_{content}; M_{attribution}; M_{time}\}$;

where $M_{statistics} \in \mathbb{R}^{L \times 1}$, $L$ is the total number of webpage statistical features, and the 1 indicates that every element of the generated feature vector matrix is a 1-dimensional vector. The matrix $M_{statistics}$ is the second webpage statistical feature vector matrix generated by feature construction.
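The concatenation and dimension conversion can be shown in a few lines of NumPy. All feature values below are made up for illustration, and attribution features are assumed to have been encoded as numbers already, since the text does not say how a domain holder or IP address is turned into a numeric value.

```python
import numpy as np

# Hypothetical, already-numeric feature groups for one webpage.
x_link = [0.82, 0.05, 0.31]     # e.g. TrustRank, SpamRank, Truncated PageRank scores
x_content = [0.40, 1250.0]      # e.g. hot-word coverage, text length
x_attribution = [1.0, 0.0]      # e.g. encoded domain-holder and IP indicators
x_time = [3.5]                  # e.g. a time-series statistic

# Sequential concatenation into a one-dimensional numerical sequence.
x_statistics = np.concatenate([x_link, x_content, x_attribution, x_time])

# Dimension conversion: L features become an L x 1 matrix.
M_statistics = x_statistics.reshape(-1, 1)
```

In practice the groups differ widely in scale (a rank score versus a text length), so some normalisation before the DNN would usually be needed; the text does not describe one.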

A DNN model is used to extract features from the second webpage statistical feature vector matrix to obtain the webpage statistical feature vector, where the DNN model includes several fully connected networks;

Feature extraction from the second webpage statistical feature vector matrix using the DNN model is specifically:

The second webpage statistical feature vector matrix is passed through each fully connected network in turn for matrix operations, finally yielding the webpage statistical feature vector. The process expression is:

$$M_0 = M_{statistics}, \qquad M_i = W_i M_{i-1} + b_i, \quad i = 1, \dots, l$$

where $M_i$ denotes the vector matrix output by the $i$-th fully connected network and $l$ denotes the number of fully connected networks in the DNN model; $M_{statistics}$ denotes the second webpage statistical feature vector matrix; $W_i \in \mathbb{R}^{L_i \times L_{i-1}}$ is the weight matrix of the $i$-th fully connected network, $L_i$ is the number of neurons in the $i$-th fully connected network (with $L_0 = L$); $b_i$ denotes the bias term of the neurons in the $i$-th fully connected network; the output $M_l$ of the last fully connected network in the DNN model is the webpage statistical feature vector.
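A stack of fully connected networks applied to an L x 1 feature matrix can be sketched as follows. The layer sizes, the random weights and the ReLU between layers are assumptions made for illustration: the text names weights and biases but no activation function, and `dense_stack` is a hypothetical helper name.

```python
import numpy as np


def dense_stack(M0, weights, biases):
    """Apply W @ M + b layer by layer, with an assumed ReLU after every layer but the last."""
    M = M0
    for k, (W, b) in enumerate(zip(weights, biases)):
        M = W @ M + b
        if k < len(weights) - 1:
            M = np.maximum(0.0, M)   # assumed non-linearity; not stated in the text
    return M


rng = np.random.default_rng(0)
sizes = [8, 16, 4]                            # L = 8 inputs, then 16 and 4 neurons
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

M_statistics = rng.normal(size=(8, 1))        # stands in for the L x 1 feature matrix
V_statistics = dense_stack(M_statistics, weights, biases)
```

Each weight matrix has as many rows as the layer has neurons and as many columns as the previous layer's output, so the final output has one row per neuron of the last layer.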

The text feature vector, the HTML text feature vector, the HTML tag feature vector and the webpage statistical feature vector are combined to obtain the combined feature vector, specifically:

The text feature vector V_text, the HTML text feature vector V_htmlText, the HTML tag feature vector V_htmlTag and the webpage statistical feature vector V_statistics are concatenated in order by a vector concatenation operation to obtain the concatenated feature vector V_c = [V_text; V_htmlText; V_htmlTag; V_statistics];

A single fully connected layer then performs the feature combination operation on the concatenated feature vector. Its purpose is to reduce the feature dimension while removing redundant variables and finding the intrinsic connections between the feature vectors. Its expression is:

V = W_c V_c + b_c

where V ∈ R^f denotes the feature combination vector and f is the dimension of the low-dimensional feature space; the weight matrix W_c of the fully connected layer has f rows, f being the number of neurons in the layer; b_c denotes the bias term of the neurons in the layer.
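The concatenation and the single fully connected layer can be sketched as follows. The vector lengths, the weights and the helper names `concat` and `fully_connected` are hypothetical choices made only for illustration; in practice W_c and b_c are learned during training.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def concat(*vectors: Vector) -> Vector:
    """Concatenate feature vectors in the given order: V_c = [V_1; V_2; ...]."""
    out: Vector = []
    for v in vectors:
        out.extend(v)
    return out


def fully_connected(w_c: Matrix, b_c: Vector, v_c: Vector) -> Vector:
    """Compute V = W_c V_c + b_c; W_c has one row per neuron (f rows)."""
    return [sum(w * x for w, x in zip(row, v_c)) + b for row, b in zip(w_c, b_c)]


# Toy feature vectors (lengths are illustrative only).
v_text = [1.0, 2.0]
v_html_text = [3.0]
v_html_tag = [4.0]
v_statistics = [5.0, 6.0]

v_c = concat(v_text, v_html_text, v_html_tag, v_statistics)  # length 6

# Hypothetical weights with f = 2, so W_c has shape 2 x 6.
w_c = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]
b_c = [0.0, 1.0]
v = fully_connected(w_c, b_c, v_c)  # reduced from 6 dimensions to f = 2
print(v)
```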

Evaluation and grading are performed on the combined feature vector V to generate the predicted evaluation grade of the webpage, specifically:

A DNN model composed of a plurality of fully connected networks evaluates and grades the combined feature vector V. The combined feature vector V is the input of the DNN model and passes through the fully connected networks in turn for matrix operations; the final output is the predicted evaluation grade for the combined feature vector.

Here, V_i denotes the vector matrix output by the i-th fully connected network, and l_v denotes the number of fully connected networks in the DNN model; f_i is the number of neurons in the i-th fully connected network, which sets the size of its weight matrix; b_i denotes the bias term of the neurons in the i-th fully connected network; the output of the last fully connected network in the DNN model is the quality evaluation vector; the quality evaluation vector yields a predicted probability for each evaluation grade, from which the predicted evaluation grade is obtained;
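The step from quality evaluation vector to predicted grade can be sketched as follows. The text does not name the functions used, so this sketch assumes the common choice of a softmax to obtain per-grade probabilities followed by an argmax to pick the grade; `softmax`, `predict_grade` and the toy scores are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1 (assumed normalization)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grade(quality_vector: List[float]) -> Tuple[int, List[float]]:
    """Return the most probable grade index together with all grade probabilities."""
    probs = softmax(quality_vector)
    grade = max(range(len(probs)), key=probs.__getitem__)
    return grade, probs


# Toy quality evaluation vector with 4 grades (the patent uses at least 10).
grade, probs = predict_grade([0.1, 2.0, 0.3, -1.0])
print(grade, probs)
```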

In the iterative training process of the multimodal Web information retrieval static ranking learning method, a multi-class cross-entropy function is used as the loss function, and the weight parameters of all models in the method are updated iteratively through backpropagation.

In the loss function, y⁽ⁱ⁾ denotes the true evaluation grade of the i-th webpage during training; the predicted probability of each evaluation grade for the i-th webpage is the prediction term of the loss; N denotes the total number of training samples; i denotes the index of the current sample.
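A multi-class cross-entropy loss over N samples can be sketched as follows. The sketch assumes the usual form, the negative log of the probability assigned to the true grade, averaged over the N samples; whether the patent averages or sums is not recoverable from the text, and `cross_entropy` is a hypothetical name.

```python
import math
from typing import List


def cross_entropy(true_grades: List[int], predicted_probs: List[List[float]]) -> float:
    """Mean of -log p(true grade) over all samples (averaging is an assumption)."""
    n = len(true_grades)
    total = 0.0
    for y, probs in zip(true_grades, predicted_probs):
        total -= math.log(probs[y])  # only the true grade's probability contributes
    return total / n


# Two toy samples with 3 grades each; both assign probability 0.5 to the true grade.
loss = cross_entropy([0, 2], [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]])
print(loss)
```

The loss is zero only when every sample assigns probability 1 to its true grade, and it grows as that probability shrinks.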

In some embodiments, y⁽ⁱ⁾ ∈ [0, R], that is, y⁽ⁱ⁾ is a quality assessment value in the interval [0, R]. In the present invention, following the ECML/PKDD Discovery Challenge rules and aiming to widen the quality differences as far as possible, R is not less than 9, that is, there are at least 10 quality grades.

Those skilled in the art should understand that, in the iterative training process of the multimodal Web information retrieval static ranking learning method, the loss function is computed from the difference between the predicted evaluation grade and the true evaluation grade, its gradient is backpropagated to the parameters of all models, and an optimization algorithm such as gradient descent minimizes the loss function, continuously adjusting the parameters so that they approach their optimal values over successive iterations.
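The iterative update can be illustrated with a deliberately small sketch. It assumes a softmax output with cross-entropy loss, for which the gradient with respect to the raw scores is the predicted probabilities minus the one-hot true grade, and it applies plain gradient descent directly to the scores; a real model would backpropagate this gradient further into the weights of every layer. All names and the learning rate are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(scores: List[float], true_grade: int) -> Tuple[float, List[float]]:
    """Cross-entropy loss and its gradient with respect to the raw scores."""
    probs = softmax(scores)
    grad = [p - (1.0 if k == true_grade else 0.0) for k, p in enumerate(probs)]
    return -math.log(probs[true_grade]), grad


def train(scores: List[float], true_grade: int, learning_rate: float = 0.5,
          steps: int = 20) -> Tuple[List[float], List[float]]:
    """Run plain gradient descent and record the loss before each update."""
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(scores, true_grade)
        history.append(loss)
        scores = [s - learning_rate * g for s, g in zip(scores, grad)]
    return scores, history


final_scores, history = train([0.0, 0.0, 0.0], true_grade=2)
print(history[0], history[-1])
```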

In another embodiment, a multimodal Web information retrieval static ranking learning system is provided. The system uses the multimodal Web information retrieval static ranking learning method of the above embodiment. As shown in FIG. 8, the system comprises an information extraction module, a feature extraction module, a feature combination module and a quality evaluation module;

In this embodiment, the feature extraction module is provided with a DNN model and a Transformer encoder model;

In this embodiment, the feature combination module is provided with a fully connected layer, which combines the text feature vector carrying text context feature information, the HTML text feature vector carrying HTML text context feature information, the HTML tag feature vector carrying HTML tag context feature information and the webpage statistical feature vector carrying webpage statistical feature information, learning the best fusion of the four and producing a feature result suitable for classification analysis;

In this embodiment, the quality evaluation module is equipped with a DNN model, which evaluates and grades the quality of the Web content based on the combined feature vector to determine the predicted evaluation grade of the current webpage.

In another embodiment, a computer device is provided, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the multimodal Web information retrieval static ranking learning method of the above embodiment.

The hardware entity of the computer device comprises a processor, a memory and a communication interface, wherein the processor generally controls the overall operation of the computer device; the communication interface enables the computer device to communicate with other terminals or servers through a network; the memory is configured to store instructions and applications executable by the processor, and can also cache data that is to be processed or has been processed by the processor and the modules of the computer device (including but not limited to image data, audio data, voice communication data and video communication data), and can be implemented by flash memory (FLASH) or random access memory (RAM).

Data can be transmitted between the processor, the communication interface and the memory via a bus. The bus may include any number of interconnected buses and bridges, and connects the various circuits of one or more processors and memories together.

In another embodiment, a computer-readable storage medium is provided, which stores a computer program. When the computer program is executed by a processor, the multimodal Web information retrieval static ranking learning method of the above embodiment is implemented.

The storage medium may be transitory or non-transitory. By way of example, the storage medium includes, but is not limited to, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store computer program code.

By way of example, the processor may be a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or the like.

In another embodiment, a computer program is also provided, comprising computer-readable code. When the computer-readable code runs in a computer device, a processor in the computer device executes some or all of the steps of the multimodal Web information retrieval static ranking learning method of the above embodiment.

In another embodiment, a computer program product is also provided, which can be implemented in hardware, software or a combination thereof. As a non-limiting example, the computer program product can be embodied as a storage medium or as a software product, such as an SDK (Software Development Kit).

It should also be noted that, in this specification, the terms "comprise", "include" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device comprising that element.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-mode Web information retrieval static ordering learning method is characterized by comprising the following steps:

acquiring a webpage screenshot of a target webpage, acquiring webpage text information based on the webpage screenshot, and mapping the webpage text information into a first text vector sequence;

acquiring an HTML text of a target webpage, and mapping the HTML text information into a first HTML text vector sequence;

acquiring an HTML tag of a target webpage, and mapping HTML tag information into a first HTML tag vector sequence;

mapping the webpage statistical feature information into a first webpage statistical feature vector sequence based on the webpage statistical features of the target webpage; the webpage statistical features comprise link analysis features, content heuristic features, website attribution features and time sequence features;

based on one-hot encoding and word embedding, obtaining a second text feature vector matrix, a second HTML text feature vector matrix and a second HTML tag feature vector matrix according to the first text vector sequence, the first HTML text vector sequence and the first HTML tag vector sequence; performing feature extraction on the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix by adopting a Transformer encoder model to obtain a text feature vector, an HTML text feature vector and an HTML tag feature vector;

vector stitching is carried out on the first webpage statistical feature vector sequence based on the feature combination and stitching model, and a second webpage statistical feature vector matrix is obtained; performing feature extraction on the second webpage statistical feature vector matrix by adopting a DNN model to obtain a webpage statistical feature vector;

combining the text feature vector, the HTML tag feature vector and the webpage statistical feature vector to obtain a combined feature vector;

and performing evaluation and grading according to the combined feature vector, and generating a predicted evaluation grade related to the webpage.

2. The method for learning static rank of multimodal Web information retrieval according to claim 1, wherein mapping the webpage text information into a first text vector sequence is specifically:

the acquired webpage text information is defined as input data x_text and treated as a text sequence; x_text is divided into a series of discrete text units, i.e. x_text = {t_1, t_2, t_3, …, t_L}, wherein L is the original length of the input data and t_i denotes each discrete text unit of x_text; by setting a fixed length L_c, a new text unit sequence of length L_c is obtained after a truncation or padding operation, i.e. the first text vector sequence, whose elements are the discrete text units of the new sequence;

mapping the HTML text information into a first HTML text vector sequence, specifically:

the acquired webpage HTML text information is defined as input data x_htmlText and treated as an HTML text sequence; x_htmlText is divided into a series of discrete HTML text units, i.e. x_htmlText = {h_1, h_2, h_3, …, h_hL}, wherein hL is the original length of the input data and h_i denotes each discrete HTML text unit of x_htmlText; by setting a fixed length L_h, a new HTML text unit sequence of length L_h is obtained after a truncation or padding operation, i.e. the first HTML text vector sequence, whose elements are the discrete HTML text units of the new sequence;

mapping the HTML tag information into a first HTML tag vector sequence, specifically:

the acquired webpage HTML tag information is defined as input data x_htmlTag and treated as an HTML tag sequence; x_htmlTag is divided into a series of discrete HTML tag units, i.e. x_htmlTag = {g_1, g_2, g_3, …, g_gL}, wherein gL is the original length of the input data and g_i denotes each discrete HTML tag unit of x_htmlTag; by setting a fixed length L_g, a new HTML tag unit sequence of length L_g is obtained after a truncation or padding operation, i.e. the first HTML tag vector sequence, whose elements are the discrete HTML tag units of the new sequence.
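The truncation or padding operation of claim 2 can be sketched as follows. The padding unit `"<pad>"` and the function name `fit_length` are assumptions for illustration; the claim does not say which unit is used for padding or from which end a sequence is truncated, and this sketch keeps the leading units.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding unit; not specified in the claim


def fit_length(units: List[str], fixed_length: int, pad: str = PAD) -> List[str]:
    """Truncate or pad a sequence of discrete units to exactly fixed_length."""
    if len(units) >= fixed_length:
        return units[:fixed_length]  # keep the leading units
    return units + [pad] * (fixed_length - len(units))


# Toy HTML tag sequence x_htmlTag with original length gL = 5.
tags = ["html", "head", "title", "body", "div"]
short = fit_length(tags, 3)  # truncation to L_g = 3
long = fit_length(tags, 7)   # padding to L_g = 7
print(short, long)
```

The same helper applies unchanged to the text sequence (fixed length L_c) and the HTML text sequence (fixed length L_h).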

3. The method for learning static ordering of multimodal Web information retrieval according to claim 2, wherein the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix are obtained from the first text vector sequence, the first HTML text vector sequence and the first HTML tag vector sequence based on one-hot encoding and word embedding, specifically:

each discrete text unit of the first text vector sequence is mapped, by one-hot encoding and word embedding, to a d-dimensional vector space representation, so that the first text vector sequence is finally mapped into a vector matrix M_text; the matrix M_text is the second text feature vector matrix generated by feature construction; d is 100;

each discrete HTML text unit of the first HTML text vector sequence is mapped, by one-hot encoding and word embedding, to a d-dimensional vector space representation, so that the first HTML text vector sequence is finally mapped into a vector matrix M_htmlText; the matrix M_htmlText is the second HTML text feature vector matrix generated by feature construction;

each discrete HTML tag unit of the first HTML tag vector sequence is mapped, by one-hot encoding and word embedding, to a d-dimensional vector space representation, so that the first HTML tag vector sequence is finally mapped into a vector matrix M_htmlTag; the matrix M_htmlTag is the second HTML tag feature vector matrix generated by feature construction.
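The mapping of claim 3 can be sketched as follows: a unit is one-hot encoded over a vocabulary, and multiplying that one-hot vector by an embedding table selects the unit's d-dimensional row. The vocabulary, the table values and d = 4 are hypothetical (the claim sets d to 100, and the table would normally be learned).

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def one_hot(index: int, size: int) -> Vector:
    """Vector of length `size` with a single 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed(unit: str, vocab: Dict[str, int], table: Matrix) -> Vector:
    """Multiply the unit's one-hot vector by the embedding table (a row lookup)."""
    h = one_hot(vocab[unit], len(vocab))
    d = len(table[0])
    return [sum(h[r] * table[r][c] for r in range(len(table))) for c in range(d)]


# Hypothetical vocabulary and embedding table with d = 4.
vocab = {"<pad>": 0, "div": 1, "a": 2}
table = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]

sequence = ["div", "a", "<pad>"]  # a first HTML tag vector sequence of length 3
m_html_tag = [embed(u, vocab, table) for u in sequence]  # 3 x 4 matrix
print(m_html_tag)
```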

4. The method according to claim 1, wherein the webpage statistical feature information is defined as input data and regarded as a webpage statistical feature sequence, the feature sequence being a one-dimensional numerical sequence that includes the link analysis feature x_link, the content heuristic feature x_content, the website attribution feature x_attribution and the time sequence feature x_time;

the 4 features are concatenated in order to generate the feature vector sequence x_statistics = {x_link; x_content; x_attribution; x_time}; dimension conversion is performed on the feature vector sequence to generate the feature vector matrix M_statistics = {M_link; M_content; M_attribution; M_time};

wherein M_statistics ∈ R^{L×1}, L is the total number of webpage statistical features, 1 indicates that each element of the generated feature vector matrix is a 1-dimensional vector, and the matrix M_statistics is the second webpage statistical feature vector matrix generated by feature construction.
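The construction in claim 4 amounts to concatenating four numeric feature groups and reshaping the result into an L×1 matrix. A minimal sketch, with invented feature values and a hypothetical function name `build_statistics_matrix`:

```python
from typing import List


def build_statistics_matrix(x_link: List[float], x_content: List[float],
                            x_attribution: List[float],
                            x_time: List[float]) -> List[List[float]]:
    """Concatenate the four feature groups in order and reshape to L x 1."""
    x_statistics = x_link + x_content + x_attribution + x_time
    return [[value] for value in x_statistics]  # each element becomes a 1-d row


# Invented example values, for illustration only.
m_statistics = build_statistics_matrix(
    x_link=[0.3, 12.0],
    x_content=[0.8],
    x_attribution=[1.0],
    x_time=[365.0, 2.0],
)
print(len(m_statistics), m_statistics)
```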

5. The method for learning static rank of multimodal Web information retrieval according to claim 1, wherein feature extraction is performed on the second text feature vector matrix, the second HTML text feature vector matrix and the second HTML tag feature vector matrix by adopting a Transformer encoder model, specifically:

the Transformer encoder model comprises a plurality of encoders, wherein each encoder comprises a multi-head self-attention mechanism, a feedforward layer and layer normalization;

the process of feature extraction using the Transformer encoder model is:

performing position embedding on the feature vector matrix to obtain a feature vector matrix embedded with position information;

passing the feature vector matrix embedded with position information through the multi-head self-attention mechanism, the feedforward layer and the layer normalization of each encoder in sequence to finally obtain the feature vector;

for feature extraction of the second text feature vector matrix, the second text feature vector matrix embedded with position information is the input of the first encoder, and each subsequent encoder takes the output of the preceding encoder as its input, wherein h_i denotes the vector matrix output by the i-th encoder and l denotes the number of encoders in the Transformer encoder model;

for feature extraction of the second HTML text feature vector matrix, the same process is applied to the second HTML text feature vector matrix embedded with position information;

for feature extraction of the second HTML tag feature vector matrix, the same process is applied to the second HTML tag feature vector matrix embedded with position information;

the position embedding process is specifically: the position information matrix is added to the corresponding feature vector matrix, wherein pos denotes the position number of the word, i denotes the dimension index within the word vector, and M_peText, M_peHtml and M_peHtmlTag respectively denote the position information matrix of the second text feature vector matrix, the position information matrix of the second HTML text feature vector matrix and the position information matrix of the second HTML tag feature vector matrix, whose dimensions are respectively equal to those of the second text feature vector matrix M_text, the second HTML text feature vector matrix M_htmlText and the second HTML tag feature vector matrix; + denotes a matrix addition operation;

the self-attention mechanism is a scaled dot-product attention mechanism; the multi-head self-attention mechanism repeats the calculation process of the self-attention mechanism h times and then concatenates the calculation results; the process expression of the multi-head self-attention mechanism is:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, …, head_h)

wherein head_i denotes the calculation result of the i-th self-attention mechanism; h denotes the number of repetitions of the self-attention mechanism, and the value of h is 10; MultiHead(Q, K, V) denotes the computation of the multi-head self-attention mechanism; W_i^Q, W_i^K and W_i^V denote different linear transformation matrices; Q W_i^Q, K W_i^K and V W_i^V respectively denote the query, the key and the value; d_k denotes the number of columns of the matrices Q W_i^Q and K W_i^K, i.e. the vector dimension;

in any encoder, the multi-head self-attention mechanism yields a vector matrix A, which is then input to the feedforward layer, wherein the feedforward layer consists of a plurality of fully connected layers and is responsible for mapping and affine transformation of the vector matrix A; after residual connection and layer normalization, the output vector matrix h_i of the encoder is obtained and is used as the input of the next encoder; the output h_l of the last encoder is the text feature vector or the HTML tag feature vector.
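The multi-head self-attention step of claim 5 can be sketched in plain Python as follows. It uses the standard scaled dot-product form softmax(Q K^T / √d_k) V with Q = K = V = x for self-attention, and concatenates the heads without a final output projection, since the claim only mentions concatenation. The toy input, the two 2×1 projection matrices per head and all function names are hypothetical (the claim uses h = 10 heads); position embedding, the feedforward layer, residual connections and layer normalization are omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head(x: Matrix, heads: List[Tuple[Matrix, Matrix, Matrix]]) -> Matrix:
    """Run one attention computation per head and concatenate results row-wise."""
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    return [[value for head in outputs for value in head[r]] for r in range(len(x))]


# Toy input: 2 tokens with 2 dimensions; each head projects onto one dimension.
x = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]
a = multi_head(x, heads)  # 2 tokens x (2 heads * 1 dimension)
print(a)
```

Each output row is a weighted average of the value rows, with weights given by how strongly that token's query matches each key.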

6. The multi-modal Web information retrieval static ordering learning method according to claim 1, wherein the DNN model comprises a plurality of fully connected networks;

and adopting a DNN model to extract the characteristics of the second webpage statistical characteristic vector matrix, wherein the method specifically comprises the following steps:

and sequentially carrying out matrix operation on the second webpage statistical feature vector matrix in each fully-connected network to finally obtain the webpage statistical feature vector, wherein the process expression is as follows:

wherein M_i denotes the vector matrix output by the i-th fully connected network, and l denotes the number of fully connected networks in the DNN model; M_statistics denotes the second webpage statistical feature vector matrix; L_i is the number of neurons in the i-th fully connected network, which sets the size of its weight matrix; b_i denotes the bias term of the neurons in the i-th fully connected network; the output M_l of the last fully connected network in the DNN model is the webpage statistical feature vector.

7. The method for learning static rank of multimodal Web information retrieval according to claim 1, wherein the feature combination of the text feature vector, the HTML tag feature vector and the Web page statistics feature vector is specifically:

the text feature vector V_text, the HTML text feature vector V_htmlText, the HTML tag feature vector V_htmlTag and the webpage statistical feature vector V_statistics are concatenated in order by a vector concatenation operation to obtain the concatenated feature vector V_c = [V_text; V_htmlText; V_htmlTag; V_statistics];

a single fully connected layer performs the feature combination operation on the concatenated feature vector, with the aim of reducing the feature dimension while eliminating redundant variables and finding the intrinsic connections between the feature vectors, wherein the expression is:

V = W_c V_c + b_c

wherein V ∈ R^f denotes the feature combination vector and f denotes the dimension of the low-dimensional feature space; the weight matrix W_c of the fully connected layer has f rows, f being the number of neurons in the fully connected layer; b_c denotes the bias term of the neurons in the fully connected layer;

performing evaluation and grading according to the combined feature vector V to generate a predicted evaluation grade related to the webpage, wherein the predicted evaluation grade is specifically as follows:

the method comprises the steps of evaluating and grading a combined feature vector V by adopting a DNN model, wherein the DNN model consists of a plurality of fully-connected networks, specifically, taking the combined feature vector V as input of the DNN model, sequentially carrying out matrix operation through the plurality of fully-connected networks contained in the model, and finally outputting the predicted evaluation grade of the combined feature vector, wherein the expression is as follows:

wherein V_i denotes the vector matrix output by the i-th fully connected network, and l_v denotes the number of fully connected networks in the DNN model; f_i is the number of neurons in the i-th fully connected network, which sets the size of its weight matrix; b_i denotes the bias term of the neurons in the i-th fully connected network; the output of the last fully connected network in the DNN model is the quality evaluation vector; the quality evaluation vector yields a prediction probability for each evaluation grade, from which the predicted evaluation grade is obtained;

in the iterative training process of the multi-mode Web information retrieval static ordering learning method, a multi-class cross-entropy function is adopted as the loss function, and the weight parameters of all models in the method are updated iteratively through backpropagation;

in the loss function, y⁽ⁱ⁾ denotes the real evaluation grade of the i-th webpage in the training process; the prediction probability of the i-th webpage for each evaluation grade is the prediction term of the loss function; N denotes the total number of samples in the training process; i denotes the index number of the current sample;

in the iterative training process of the multi-mode Web information retrieval static ordering learning method, each parameter in all models is counter-propagated according to the difference between the predicted evaluation level and the real evaluation level and the result of the loss function, the loss function is minimized by using different algorithms such as gradient descent, and each parameter is dynamically adjusted continuously, so that the optimal parameter is obtained in continuous iteration.

8. A multi-mode Web information retrieval static ordering learning system, characterized in that the system adopts the multi-mode Web information retrieval static ordering learning method according to any one of claims 1-7, and comprises an information extraction module, a feature extraction module, a feature combination module and a quality evaluation module;

the information extraction module is used for acquiring webpage screenshot and webpage information of the target webpage and obtaining webpage text information, HTML text, HTML labels and webpage statistical characteristic information based on the webpage screenshot and the webpage information;

the feature extraction module is used for mapping the webpage text information, the HTML tag information and the webpage statistical feature information into a first text vector sequence, a first HTML tag vector sequence and a first webpage statistical feature vector sequence respectively; the method is also used for respectively carrying out feature extraction on the first text vector sequence, the first HTML tag vector sequence and the first webpage statistical feature vector sequence to respectively obtain a text feature vector, an HTML tag feature vector and a webpage statistical feature vector;

the feature combination module is used for carrying out feature combination on the text feature vector, the HTML tag feature vector and the webpage statistical feature vector to obtain a combined feature vector;

the quality evaluation module is used for performing evaluation and grading according to the combined feature vectors and generating a predicted evaluation grade related to the webpage;

the feature extraction module is provided with a DNN model and a Transformer encoder model;

the feature combination module is provided with a full connection layer, and combines the text feature vector with text context feature information, the HTML text feature vector with HTML text context feature information, the HTML tag feature vector with HTML tag context feature information and the webpage statistics feature vector with webpage statistics feature information through the full connection layer, so as to calculate and obtain an optimal fusion mode, and obtain a feature result suitable for classification analysis;

the quality evaluation module is loaded with a DNN model, and evaluates and classifies the quality of the Web content based on the combined feature vector so as to judge the predicted evaluation grade of the current webpage.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the multimodal Web information retrieval static ordering learning method of any of claims 1-7 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the multimodal Web information retrieval static ranking learning method of any one of claims 1-7 is implemented when the computer program is executed by a processor.