
CN113590819B - Large-scale category hierarchical text classification method

Info

Publication number: CN113590819B
Authority: CN (China)
Prior art keywords: text, word, information, classifier, category
Legal status: Active
Application number: CN202110743721.1A
Other languages: Chinese (zh)
Other versions: CN113590819A
Inventors: 谭军, 潘嵘, 毕宁, 任天宇, 黄嘉树
Current Assignee: Sun Yat Sen University
Original Assignee: Sun Yat Sen University
Filing events: application filed by Sun Yat Sen University; publication of CN113590819A; application granted; publication of CN113590819B

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F40/126 Handling natural language data, use of codes for handling textual entities: Character encoding
    • G06F40/289 Natural language analysis: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Natural language analysis: Semantic analysis
    • G06N3/044 Neural networks: Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: Combinations of networks
    • G06N3/08 Neural networks: Learning methods

Abstract

The present invention provides a large-scale category hierarchical text classification method. The method applies deep learning networks to both the flat-classifier approach and the global-classifier approach and performs classification with each. On the one hand, the flat-classifier setting is used to compare classification performance against machine learning methods; on the other hand, within the deep learning setting, the global classifier is compared against the flat classifier to test whether learning hierarchical dependency information gives better performance. When the flat classifier is used, consistent with the machine learning methods, the hierarchy is ignored and the loss function considers only the empirical loss on the training set; when the global classifier is used, the hierarchy is taken into account and a regularization penalty is added to the loss function. Building on CNN and RNN, the classic neural network models for text classification, the method combines an attention mechanism with the RNN and CNN models to capture the key information of the text, which helps further advance the informatization and automation of government affairs.

Description

A large-scale category hierarchical text classification method

Technical field

The present invention relates to the field of hierarchical text classification, and more particularly to a large-scale category hierarchical text classification method.

Background art

Government procurement refers to the purchase, by state organs, public institutions and organizations at all levels using fiscal funds, of goods, construction works and services that fall within the centralized procurement catalogue formulated in accordance with the law or exceed the procurement quota threshold. Compared with the procurement behavior of many other social entities, government procurement differs significantly: its funds come mainly from government fiscal revenue, whose ultimate source is the taxpaying public, so it has a strong public character and must be accountable to taxpayers; at the same time, government procurement is non-commercial, aiming not at profit but at meeting the basic needs of government operations or providing basic, inclusive public services to society. Therefore, how to ensure the openness and standardization of procurement has always been a major concern of the government. Although information-system management has brought many conveniences to the administration of government procurement, many serious challenges remain unsolved. To make better use of publicly disclosed government procurement information and tap its value, the present invention, building on research in hierarchical text classification, explores how to classify procurement projects appropriately from procurement project information, so as to further advance the informatization and automation of government affairs.

Facing the trend toward open government and the demands of the big-data era, it is of practical significance to analyze massive open government data in depth and explore its value. The main purpose of the present invention is to investigate the classification of government procurement projects and to find an effective hierarchical classification method, so that past procurement project data can be used to give automatic classification suggestions and help future staff classify procurement projects more efficiently and accurately. Based on this practical background, the project names, item categories and other procurement information of a large number of government procurement projects are obtained from official government websites as the data set used in this invention, and the data set is analyzed and preprocessed.

In hierarchical text classification, training samples are labeled mainly by building hierarchical classifiers. The last category node at the end of each branch of the hierarchical category structure is defined as a leaf node, and the remaining nodes are defined as internal (trunk) nodes. The present invention weighs the advantages and disadvantages of flat classifiers, local classifiers and global classifiers: among the machine learning methods, the flat-classifier approach is used as the baseline, i.e. all leaf nodes are placed in the same plane for classification, while deep learning networks are applied to both the flat-classifier and the global-classifier approaches and used for classification. On the one hand, the flat-classifier setting is compared against the machine learning methods; on the other hand, within deep learning, the global classifier is compared against the flat classifier to test whether learning hierarchical dependency information gives better performance. To address the long-distance dependency problem of deep learning methods, Wang et al. proposed a relation extraction method combining LSTM with an attention mechanism to classify a corpus. Hao et al. proposed the improved weighting formula P-IDF for large-scale text classification to improve the performance of traditional machine learning classifiers. Wu et al. proposed the NMF-SMVM text classification algorithm based on a hierarchical model and the SEAN attention mechanism to address the lack of hierarchical structural features in text classification. However, these methods either use machine learning algorithms or ignore the hierarchical structure in large-scale text classification.

Summary of the invention

The present invention provides a large-scale category hierarchical text classification method. Based on CNN and RNN, the classic neural network models for text classification, the method combines an attention mechanism with the RNN and CNN models to capture the key information of the text, which helps further advance the informatization and automation of government affairs.

To achieve the above technical effects, the technical solution of the present invention is as follows:

A large-scale category hierarchical text classification method, comprising the following steps:

S1: collect public government procurement announcement data;

S2: classify the data collected in step S1 using machine learning methods;

S3: taking the classification results of step S2 as the baseline, classify the data collected in step S1 using the ARCNN model.

Further, the specific process of step S1 is:

Collect public government procurement announcement data, including award announcements and bidding announcements, and preprocess the collected data as follows (a minimal sketch of these steps follows the list):

(1) remove illegal characters from the text data and convert traditional Chinese characters into simplified characters;

(2) segment the Chinese text into words and remove stop words;

(3) train word vectors.
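The preprocessing steps above can be illustrated with a minimal sketch. It is a reconstruction, not the patent's own code: the library choices (opencc for Traditional-to-Simplified conversion, jieba for segmentation) and the tiny stop-word list are assumptions.

```python
# Hypothetical preprocessing sketch: clean the procurement text, convert
# Traditional to Simplified Chinese, segment with jieba, drop stop words.
import re
import jieba
from opencc import OpenCC

cc = OpenCC('t2s')                      # Traditional -> Simplified
STOPWORDS = {'的', '了', '和', '及'}     # placeholder stop-word list

def preprocess(text: str) -> list[str]:
    text = cc.convert(text)
    # keep Chinese characters, letters and digits only ("remove illegal characters")
    text = re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', ' ', text)
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in STOPWORDS]

print(preprocess('某市政府办公设备采购项目'))
```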

Further, in step S2 the flat-classifier approach to the hierarchical classification problem is used. In the feature engineering stage, BOW and TF-IDF are chosen as word features; in the classifier training stage, a Naive Bayes classifier, a logistic regression classifier, a support vector machine classifier and a decision tree classifier are used to classify the experimental data. The flat classifier does not consider the hierarchical structure, so the loss function of this part considers only the empirical loss on the training set.

Further, in step S3, Chinese word segmentation is performed on the collected government procurement announcement data to obtain a word sequence D = [w_1; w_2; ...; w_n]. Let w denote the set of parameters in the network structure and p(k|D, w) the probability that a text instance belongs to category k. The word sequence D = [w_1; w_2; ...; w_n] is mapped into the word vector space through word embedding, giving the vector representation of the word sequence: [e(w_1); e(w_2); ...; e(w_n)];

For the input word sequence, a bidirectional recurrent neural network (Bi-RNN) is used to learn the context of the current word w_i, giving the left-context information c_l(w_i) and the right-context information c_r(w_i), where e(w_i) is a dense vector with |e| real-valued elements, and c_l(w_i) and c_r(w_i) are dense vectors with |c| real-valued elements.

Further, in step S3, based on the structure of the bidirectional recurrent neural network Bi-RNN, the context information c_l(w_i), c_r(w_i) is the hidden-layer information, computed as follows:

c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1}))

c_r(w_i) = f(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1}))

where c_l(w_{i-1}) is the hidden-layer information at the preceding word w_{i-1}, e(w_{i-1}) is the word embedding of the preceding word w_{i-1}, W^{(l)} is the transition matrix that carries hidden-layer information to the next position, W^{(sl)} is the transition matrix that carries the input word-embedding information to the next position, and f is a nonlinear activation function that activates the aggregated preceding hidden-layer information and input information before passing it to the current hidden layer. For the first word in a document, since no preceding context exists, the shared parameter c_l(w_1) is used.

Further, after the bidirectional recurrent neural network Bi-RNN, through the forward and backward passes, for the current word w_i the current input information e(w_i) and its hidden-layer information c_l(w_i), c_r(w_i) are obtained; concatenating the current input information with the hidden-layer information gives the text representation:

x_i = [c_l(w_i); e(w_i); c_r(w_i)].
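As an illustration of how the x_i = [c_l(w_i); e(w_i); c_r(w_i)] representation can be built, the following PyTorch sketch uses GRU cells for the recurrent transitions (the description allows plain RNN, LSTM or GRU) and learned boundary parameters c_l(w_1), c_r(w_n). All dimensions and names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ContextConcat(nn.Module):
    """Sketch of x_i = [c_l(w_i); e(w_i); c_r(w_i)] with GRU transitions."""
    def __init__(self, vocab_size=10000, emb_dim=200, ctx_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.GRU(emb_dim, ctx_dim, batch_first=True)
        self.bwd = nn.GRU(emb_dim, ctx_dim, batch_first=True)
        # shared boundary parameters c_l(w_1) and c_r(w_n)
        self.cl0 = nn.Parameter(torch.zeros(1, 1, ctx_dim))
        self.cr0 = nn.Parameter(torch.zeros(1, 1, ctx_dim))

    def forward(self, ids):                       # ids: (B, T)
        e = self.emb(ids)                         # (B, T, E)
        B = e.size(0)
        h_fwd, _ = self.fwd(e)                    # state after reading w_1..w_i
        h_bwd, _ = self.bwd(e.flip(1))            # run over the reversed sequence
        h_bwd = h_bwd.flip(1)                     # state after reading w_n..w_i
        # c_l(w_i) depends only on words before w_i, so shift right by one;
        # c_r(w_i) depends only on words after w_i, so shift left by one.
        cl = torch.cat([self.cl0.expand(B, 1, -1), h_fwd[:, :-1]], dim=1)
        cr = torch.cat([h_bwd[:, 1:], self.cr0.expand(B, 1, -1)], dim=1)
        return torch.cat([cl, e, cr], dim=2)      # (B, T, 2*ctx_dim + E)

x = ContextConcat()(torch.randint(0, 10000, (2, 7)))
print(x.shape)   # torch.Size([2, 7, 328])
```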

Further, although context information is now available at every word, it is not all exploited to the same degree, and not all words contribute equally to the representation of the text's meaning. To let the model focus on the key parts of the text, a self-attention model is applied to the output x_i of the Bi-RNN module to extract the more important words in the text and obtain the attention-weighted text representation:

The text representation x_i obtained from the RNN module is encoded, and the linear mappings f_q, f_k, f_v give the query encoding q_i, the key encoding k_i and the value encoding v_i respectively:

q_i = f_q(x_i)

k_i = f_k(x_i)

v_i = f_v(x_i)

The similarity between query and key is computed to obtain an attention score as a measure of how much attention each word deserves; this score is then normalized with the softmax function to obtain the attention weights:

e_{i,j} = a(q_i, k_j)

a_{i,j} = softmax(e_{i,j})

According to the resulting normalized weights, a weighted sum of the text values is taken, giving the attention-weighted text representation x̂_i = Σ_j a_{i,j} v_j.
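The query/key/value computation above corresponds to standard dot-product self-attention. The sketch below is one possible PyTorch rendering, with the 16-dimensional attention size taken from the hyper-parameters quoted later; everything else is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotSelfAttention(nn.Module):
    """Minimal dot-product self-attention over the Bi-RNN outputs x_i."""
    def __init__(self, in_dim=328, att_dim=16):
        super().__init__()
        self.fq = nn.Linear(in_dim, att_dim)   # f_q
        self.fk = nn.Linear(in_dim, att_dim)   # f_k
        self.fv = nn.Linear(in_dim, in_dim)    # f_v

    def forward(self, x):                      # x: (B, T, in_dim)
        q, k, v = self.fq(x), self.fk(x), self.fv(x)
        scores = torch.matmul(q, k.transpose(1, 2))   # e_{i,j} = q_i · k_j
        weights = F.softmax(scores, dim=-1)           # a_{i,j}
        return torch.matmul(weights, v)               # x̂_i = Σ_j a_{i,j} v_j

att = DotSelfAttention()
print(att(torch.randn(2, 7, 328)).shape)   # torch.Size([2, 7, 328])
```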

Further, the current word w_i now has the attention-transformed text representation x̂_i. This is further convolved, with the tanh function as activation, to obtain the latent semantic vector of w_i: y_i^{(2)} = tanh(W^{(2)} x̂_i + b^{(2)}).

In the convolutional layer the filter kernel size is set to 1, because x̂_i already contains the context information of w_i. After the latent semantic vectors of all words are obtained, a max-pooling layer is applied to compute y^{(3)}, whose k-th element is the maximum of the k-th elements of y_i^{(2)} over all positions i: y^{(3)}_k = max_i y^{(2)}_{i,k}.

The pooling layer converts text representations of various lengths into a fixed-length vector and captures information from the whole text; max pooling is used. After max pooling, the output layer is reached, and a fully connected layer followed by the softmax function gives the probability that the text instance belongs to each category: p = softmax(W^{(4)} y^{(3)} + b^{(4)}).
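A hedged sketch of the convolution, max-over-time pooling and softmax output described above, with kernel size 1 as in this paragraph (the experiments later use kernels [2, 3, 4]); all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvPoolOutput(nn.Module):
    """Sketch of convolution + max-over-time pooling + softmax output."""
    def __init__(self, in_dim=328, hidden=64, num_classes=100):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=1)   # y_i^(2) = tanh(W^(2) x̂_i + b^(2))
        self.fc = nn.Linear(hidden, num_classes)               # W^(4), b^(4)

    def forward(self, x_att):                  # x_att: (B, T, in_dim)
        h = torch.tanh(self.conv(x_att.transpose(1, 2)))   # (B, hidden, T)
        h = h.max(dim=2).values                            # y^(3): max over time
        return torch.softmax(self.fc(h), dim=-1)           # p(k | D, w)

m = ConvPoolOutput()
print(m(torch.randn(2, 7, 328)).shape)   # torch.Size([2, 100])
```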

In the above forward propagation, the parameters that must be learned through model training are:

w = {E, b^{(2)}, b^{(4)}, c_l(w_1), c_r(w_n), W^{(2)}, W^{(4)}, W^{(l)}, W^{(r)}, W^{(sl)}, W^{(sr)}}.

Further, a category hierarchy tree N is considered, whose nodes are the category labels at all levels, and the mandatory leaf-node prediction setting is adopted. The M text instances in the input training set are written as TRAIN = {(x_i, t_i)}_{i=1}^{M}, where x_i ∈ X is the text-embedding instance space, t_i ∈ T denotes the category, and T is the set of all leaf nodes in the hierarchy. A function π: N → N is defined such that π(n) is the ancestor node of node n; let C_n denote the set of all child nodes of node n, and define y_in = (2I(t_i = n) - 1) ∈ {-1, 1} as a binary variable indicating whether sample x_i belongs to category n ∈ T. The prediction model is written as a function f: X → T, and the goal is to obtain f with minimum loss on the given TRAIN.
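The notation can be made concrete with a toy example; the tree below is invented purely to illustrate π(n), C_n, the leaf set and the targets y_in.

```python
# Hypothetical category tree: π(n) as a parent map, C_n as child sets,
# leaves T, and the binary targets y_in = 2*I(t_i = n) - 1.
parent = {            # π(n): ancestor (parent) of node n; 'root' has no parent
    'goods': 'root', 'services': 'root',
    'office': 'goods', 'vehicles': 'goods', 'cleaning': 'services',
}
children = {}
for n, p in parent.items():
    children.setdefault(p, set()).add(n)          # C_n
leaves = [n for n in parent if n not in children] # nodes with no children

def y(t_i: str, n: str) -> int:
    return 2 * int(t_i == n) - 1                  # +1 if sample i has label n, else -1

print(leaves)                             # ['office', 'vehicles', 'cleaning']
print([y('office', n) for n in leaves])   # [1, -1, -1]
```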

Further, the loss function used consists of the empirical risk on the data set and a regularization term that penalizes the complexity of the prediction function f, defined as follows:

Loss = λ(w) + C × R_emp

where λ(w) is the regularization term, R_emp is the empirical risk on the data set, and C is a weight parameter that balances the empirical risk against controlling the complexity of f;

For the parameter set w, besides the forward-propagation view, it is decomposed over the nodes to form a node-based parameter set w = {w_n : n ∈ N}, i.e. each node n in the hierarchy is associated with a parameter group w_n. For the regularization term, the hierarchical dependencies of the tree structure are introduced into the regularization of w so that the hierarchy is learned during training; the regularization term takes the form

λ(w) = (1/2) Σ_{n ∈ N} ||w_n - w_{π(n)}||²

For the empirical loss function, the loss incurred by instances at the leaf nodes of the category hierarchy is

R_emp = Σ_{i=1}^{M} Σ_{n ∈ T} L(y_in, w_n · x_i)

where L can be any convex loss function from different models.
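A possible implementation of the recursive regularization term, under the assumption that it penalizes the squared distance between each node's parameters and its ancestor's, as the description of its behavior suggests; the scaling constant and tensor shapes are assumptions.

```python
import torch

def recursive_regularizer(node_w: dict, parent: dict) -> torch.Tensor:
    """λ(w): for every non-root node n, penalise ||w_n - w_{π(n)}||^2 / 2,
    encouraging nearby nodes in the hierarchy to share parameters."""
    reg = torch.zeros(())
    for n, p in parent.items():
        if p in node_w:   # skip edges whose parent carries no parameters
            reg = reg + 0.5 * torch.sum((node_w[n] - node_w[p]) ** 2)
    return reg

# toy usage: one 4-dim parameter vector per node of the tree sketched earlier
nodes = ['root', 'goods', 'services', 'office', 'vehicles', 'cleaning']
node_w = {n: torch.randn(4, requires_grad=True) for n in nodes}
parent = {'goods': 'root', 'services': 'root',
          'office': 'goods', 'vehicles': 'goods', 'cleaning': 'services'}
reg = recursive_regularizer(node_w, parent)   # added to C * R_emp for the total loss
reg.backward()
print(float(reg))
```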

Compared with the prior art, the beneficial effects of the technical solution of the present invention are:

The present invention classifies the experimental data with deep learning methods, applying deep learning networks to both the flat-classifier and the global-classifier approaches and performing classification with each. On the one hand, the flat-classifier setting is compared against machine learning methods in classification performance; on the other hand, within deep learning, the global classifier is compared against the flat classifier to test whether learning hierarchical dependency information gives better performance. When the flat classifier is used, consistent with the machine learning methods, the hierarchy is ignored and the loss function considers only the empirical loss on the training set; when the global classifier is used, the hierarchy is taken into account and a regularization penalty is added to the loss function. Based on CNN and RNN, the classic neural network models for text classification, the attention mechanism is combined with the RNN and CNN models to capture the key information of the text, which helps further advance the informatization and automation of government affairs.

Description of the drawings

Figure 1 is a schematic diagram of the topology of the ARCNN model in the present invention;

Figure 2 is a schematic diagram of the self-attention mechanism applied to the RNN output in the present invention;

Figure 3 is a diagram defining the category hierarchy tree and the related function sets.

Detailed description of the embodiments

The drawings are for illustration only and are not to be construed as limiting this patent;

To better illustrate this embodiment, some components in the drawings may be omitted, enlarged or reduced and do not represent the dimensions of the actual product;

Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

As shown in Figure 1, a large-scale category hierarchical text classification method builds on the RNN model's extraction of context information and the CNN model's capture of local correlations, and introduces an attention mechanism to identify the more critical word information within the internal structure, which also improves the intuitiveness and interpretability of the model. The method mainly comprises the following steps:

(1) Obtain public government procurement announcement data, which usually includes two types of announcements: award announcements and bidding announcements, and preprocess the text data.

(2) Classify the experimental data with common machine learning methods to serve as the baseline for the data set, as a reference for comparing the model performance of the subsequent deep learning methods and the ARCNN model adopted by the present invention.

(3) Classify the experimental data with the ARCNN model.

The overall framework of the algorithm of the present invention is shown in Figure 1. The RNN model is combined with the CNN model: after the context information of the text is learned through the RNN structure, the text information is fed into the CNN structure for convolution.

The input of the model is text data, and the output is the hierarchical category label corresponding to the text data. The input text instances in the present invention are Chinese text, so Chinese word segmentation is first performed as preprocessing, giving a word sequence D = [w_1; w_2; ...; w_n] (for English text, the word sequence is obtained naturally from the whitespace). Let w denote the set of parameters in the network structure and p(k|D, w) the probability that the text instance belongs to category k.

The word sequence D = [w_1; w_2; ...; w_n] is mapped into the word vector space through word embedding, giving the vector representation of the word sequence: [e(w_1); e(w_2); ...; e(w_n)]. At the same time, for the input word sequence, a bidirectional recurrent neural network (Bi-RNN) is used to learn the context of the current word w_i, giving the left-context information c_l(w_i) and the right-context information c_r(w_i), where e(w_i) is a dense vector with |e| real-valued elements, and c_l(w_i) and c_r(w_i) are dense vectors with |c| real-valued elements.

Based on the structure of the Bi-RNN module, the context information c_l(w_i), c_r(w_i) is the hidden-layer information, computed as follows:

c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1}))

c_r(w_i) = f(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1}))

where c_l(w_{i-1}) is the hidden-layer information at the preceding word w_{i-1}, e(w_{i-1}) is the word embedding of the preceding word w_{i-1}, W^{(l)} is the transition matrix that carries hidden-layer information to the next position, W^{(sl)} is the transition matrix that carries the input word-embedding information to the next position, and f is a nonlinear activation function that activates the aggregated preceding hidden-layer information and input information before passing it to the current hidden layer. For the first word in a document, since no preceding context exists, the shared parameter c_l(w_1) is used.

Likewise, c_r(w_i) is computed analogously to c_l(w_i), except that it uses the hidden-layer information and input at the following word w_{i+1} and passes the information in the opposite direction. For the last word in a document, since no following context exists, the shared parameter c_r(w_n) is used.

Here, the bidirectional recurrent process of the neural network can use more sophisticated RNN structures, such as LSTM or GRU, to learn the context information of w_i.

After the Bi-RNN module, through the forward and backward passes, for the current word w_i the current input (word) information e(w_i) and its hidden-layer (context) information c_l(w_i), c_r(w_i) are obtained. Concatenating the current input information with the hidden-layer information gives a text representation:

x_i = [c_l(w_i); e(w_i); c_r(w_i)]

Compared with traditional neural network models that use only part of the information in the text, this representation, by using the context, can better resolve the ambiguity a word w_i may have.

Although context information is now available at every word, it is not all exploited to the same degree, and not all words contribute equally to the representation of the text's meaning. Therefore, to let the model focus on the key parts of the text, a self-attention model (see Figure 2) is applied to the output x_i of the Bi-RNN module to extract the more important words in the text and obtain the attention-weighted text representation.

The computation in the self-attention module is as follows. First, the text representation x_i obtained from the RNN module is encoded, and the linear mappings f_q, f_k, f_v give the query encoding q_i, the key encoding k_i and the value encoding v_i respectively:

q_i = f_q(x_i)

k_i = f_k(x_i)

v_i = f_v(x_i)

Then the similarity between query and key is computed to obtain an attention score as a measure of how much attention each word deserves. This score is normalized with the softmax function to give the attention weights:

e_{i,j} = a(q_i, k_j)

a_{i,j} = softmax(e_{i,j})

According to the resulting normalized weights, a weighted sum of the text values gives the attention-weighted text representation x̂_i = Σ_j a_{i,j} v_j.

After the attention module, the current word w_i has the attention-transformed text representation x̂_i. This is further convolved, with the tanh function as activation, to obtain the latent semantic vector of w_i: y_i^{(2)} = tanh(W^{(2)} x̂_i + b^{(2)}).

In the convolutional layer the filter kernel size can be set to 1, because x̂_i already contains the context information of w_i. In practice, however, larger kernel sizes (such as [2, 3, 4]) achieve better results, probably because a filter window larger than 1 can learn the context of w_i further.

After the latent semantic vectors of all words are obtained, a max-pooling layer is applied to compute y^{(3)}, whose k-th element is the maximum of the k-th elements of y_i^{(2)} over all positions i: y^{(3)}_k = max_i y^{(2)}_{i,k}.

The pooling layer converts text representations of various lengths into a fixed-length vector and captures information from the whole text. Max pooling is used here because the model only needs to attend to the most important latent semantic factors in the text.

After max pooling, the output layer is reached; a fully connected layer followed by the softmax function gives the probability that the text instance belongs to each category: p = softmax(W^{(4)} y^{(3)} + b^{(4)}).

In the above forward propagation, the parameters that must be learned through model training are:

w = {E, b^{(2)}, b^{(4)}, c_l(w_1), c_r(w_n), W^{(2)}, W^{(4)}, W^{(l)}, W^{(r)}, W^{(sl)}, W^{(sr)}}

The text classification task in the present invention is large-scale hierarchical-category text classification, with a severe class imbalance. To address this, the present invention adopts a recursive regularization term designed for large-scale hierarchical text classification, which exploits the hierarchical dependencies between class labels to improve performance while keeping the model scalable across large-scale hierarchical categories (i.e. mitigating the class-imbalance problem).

The category hierarchy studied by the present invention (see Figure 3) is regarded as a category hierarchy tree N whose nodes are the category labels at all levels, and the mandatory leaf-node prediction setting is studied. The M text instances in the input training set are written as TRAIN = {(x_i, t_i)}_{i=1}^{M}, where x_i ∈ X is the text-embedding instance space, t_i ∈ T denotes the category, and T is the set of all leaf nodes in the hierarchy. A function π: N → N is defined such that π(n) is the ancestor node of node n. Let C_n denote the set of all child nodes of node n, and define y_in = (2I(t_i = n) - 1) ∈ {-1, 1} as a binary variable indicating whether sample x_i belongs to category n ∈ T. The prediction model in the present invention is written as a function f: X → T, and the goal is to obtain f with minimum loss on the given TRAIN.

The loss function used in the present invention consists of the empirical risk on the data set and a regularization term that penalizes the complexity of the prediction function f, defined as follows:

Loss = λ(w) + C × R_emp

where λ(w) is the regularization term, R_emp is the empirical risk on the data set, and C is a weight parameter that balances the empirical risk against controlling the complexity of f.

For the parameter set w, besides the forward-propagation view, it can also be decomposed over the nodes to form a node-based parameter set w = {w_n : n ∈ N}, i.e. each node n in the hierarchy is associated with a parameter group w_n.

For the regularization term, the hierarchical dependencies of the tree structure are introduced into the regularization of w so that the hierarchy is learned during training. The regularization term takes the form

λ(w) = (1/2) Σ_{n ∈ N} ||w_n - w_{π(n)}||²

This regularization term models the hierarchical dependencies: as its form shows, it encourages the parameters of a node to be as similar as possible to those of its ancestor node and of its neighboring nodes (nodes sharing the same ancestor at a lower level). On the one hand, this helps the model exploit category-hierarchy information while training its parameters; on the other hand, it helps share statistical strength across the whole hierarchy, so that classes with few samples can also benefit from information in classes with many samples, yielding a better classification model when samples are limited.

The empirical loss function is defined as the loss incurred by instances at the leaf nodes of the category hierarchy:

R_emp = Σ_{i=1}^{M} Σ_{n ∈ T} L(y_in, w_n · x_i)

where L can be any convex loss function from different models.

The basic characteristics of the obtained government procurement data were first examined. The government procurement data contain 1,313,025 records in total, each comprising 7 fields: announcement title, procurement project name, item category, purchasing unit, administrative region, announcement time, extended information and URL. In the text classification task of the present invention, the procurement project name is used as the sample input of the classification task and the item category as the corresponding category output. Before the experiments, the basic characteristics of the experimental data were examined and the data were preprocessed. Before the classification task, the present invention splits the experimental data set in a 7:2:1 ratio into a training set, a validation set and a test set, and the evaluation metrics on the test set are used as the reference standard.
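The 7:2:1 split can be realized, for example, with scikit-learn as sketched below; this is an assumed implementation, not the authors' code.

```python
# Sketch of the 7:2:1 train/validation/test split described above.
from sklearn.model_selection import train_test_split

def split_721(samples, labels, seed=42):
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=0.7, random_state=seed)
    # the remaining 30% is split 2:1 into validation and test
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=1/3, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

train, val, test = split_721(list(range(100)), [i % 3 for i in range(100)])
print(len(train[0]), len(val[0]), len(test[0]))   # 70 20 10
```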

The present invention uses common machine learning methods to classify the experimental data as the baseline of the data set, as a reference for comparing the model performance of the subsequent deep learning methods and the ARCNN model adopted by the present invention. In building the baseline, the flat-classifier approach to the hierarchical classification problem is used, placing all leaf nodes in the same plane for classification. In the feature engineering stage, BOW and TF-IDF are chosen as word features. In the classifier training stage, a Naive Bayes classifier, a logistic regression classifier, a support vector machine classifier and a decision tree classifier are used to classify the experimental data. Since the flat classifier does not consider the hierarchical structure, the loss function of this part considers only the empirical loss on the training set.
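One of the flat baselines (TF-IDF features with logistic regression) might look as follows in scikit-learn; the BOW variant would simply swap in CountVectorizer, and the hyper-parameters are assumptions.

```python
# Sketch of a flat-classifier baseline: TF-IDF features + logistic regression,
# evaluated with Micro-F1 and Macro-F1 as in the experiments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def train_flat_baseline(train_texts, train_labels, test_texts, test_labels):
    # texts are assumed to be pre-segmented words joined by spaces,
    # so a simple whitespace token pattern is enough
    model = make_pipeline(TfidfVectorizer(token_pattern=r'\S+'),
                          LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    pred = model.predict(test_texts)
    return (f1_score(test_labels, pred, average='micro'),
            f1_score(test_labels, pred, average='macro'))
```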

The present invention also uses popular deep learning methods proposed in recent years to classify the experimental data, as a reference for the classification performance of the proposed ARCNN model. The deep learning networks are applied to both the flat-classifier and the global-classifier approaches and used for classification: on the one hand, the flat-classifier setting is compared against the machine learning methods; on the other hand, within deep learning, the global classifier is compared against the flat classifier to test whether learning hierarchical dependency information gives better performance. When the flat classifier is used, consistent with the machine learning methods, the hierarchy is ignored and the loss function considers only the empirical loss on the training set. When the global classifier is used, the hierarchy is taken into account and a regularization penalty is added to the loss function.

The parameters of the deep learning methods used in the present invention are set as follows:

(1) Word vector model parameters

After word segmentation preprocessing of the procurement project names, the resulting tokens are represented as word vectors. The present invention uses the Skip-gram model of Word2Vec to pre-train word vectors on all segmented project names in the government procurement data, obtaining 200-dimensional word vectors; during training the context window is set to 2, i.e. a total of 4 context words are considered.
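With gensim (the 4.x API is assumed), the pre-training step with these settings can be sketched as follows; the toy corpus and min_count are placeholders.

```python
# Skip-gram Word2Vec pre-training: sg=1, 200-dimensional vectors,
# context window of 2 words on each side, matching the settings above.
from gensim.models import Word2Vec

segmented = [['办公', '设备', '采购'], ['市政', '道路', '维修', '服务']]  # toy corpus
w2v = Word2Vec(sentences=segmented, vector_size=200, sg=1, window=2, min_count=1)
print(w2v.wv['采购'].shape)   # (200,)
```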

(2) Deep learning training parameters

In the neural networks, the present invention sets the training batch size to 64, the number of epochs to 5, the learning-rate decay step to 1000, the learning-rate decay rate to 0.1, and the gradient threshold to 100. Adam is used as the optimizer of the neural network, with a learning rate of 0.008. During training, the dropout of the hidden layers is set to 0.5. The activation function of the fully connected layer is set to tanh and the activation function of the output layer to sigmoid. The L2 lambda in the loss function is set to 0, and a loss function with the recursive regularization term is used, where the empirical risk on the training set uses the binary cross-entropy loss. At prediction time, the prediction threshold is set to 0.5.
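In PyTorch terms, the training configuration above roughly corresponds to the sketch below; the loop body and the way the regularization term is folded into the loss are assumptions.

```python
# Optimizer, step-wise learning-rate decay, gradient clipping and BCE loss
# matching the quoted hyper-parameters; `model` is any of the networks discussed.
import torch
import torch.nn as nn

def make_training_objects(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.008)
    # decay the learning rate by a factor of 0.1 every 1000 steps
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
    criterion = nn.BCELoss()   # binary cross-entropy on the sigmoid outputs
    return optimizer, scheduler, criterion

def training_step(model, batch_x, batch_y, optimizer, scheduler, criterion, reg_term=0.0):
    optimizer.zero_grad()
    probs = model(batch_x)                       # sigmoid outputs in [0, 1]
    loss = reg_term + criterion(probs, batch_y)  # Loss = λ(w) + C·R_emp, C folded in
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=100)  # gradient threshold
    optimizer.step()
    scheduler.step()
    return loss.item()
```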

In the hidden layers of the present invention, several neural network model structures are used, including FastText, TextCNN, TextRNN, RCNN and ARCNN. Except for the ARCNN model proposed by the present invention, the parameters of the other models are those that gave the best classification results after repeated tuning; since these parameters are not the focus of the present invention, they are not described further.

The core of the ARCNN model consists of three layers. When compared with the other models, the parameter settings are as follows. In the RNN layer, one bidirectional RNN module is used with 64 hidden-layer neurons, and the structure of the RNN module is set to Bi-GRU. In the attention layer, the attention dimension is set to 16, and the attention method is self-attention using dot attention. In the CNN layer, the number of convolution-kernel channels is set to 64, the kernel sizes to [2, 3, 4], and the pooling layer uses max pooling.
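Putting the pieces together, an ARCNN-style module with these hyper-parameters could be sketched as below. This is a reconstruction from the description, not the authors' code; details such as the concatenation order and the sigmoid output follow the experimental settings quoted above, and all other sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARCNN(nn.Module):
    """Sketch of an ARCNN-style network: Bi-GRU (64 units), 16-dim dot
    self-attention, 64-channel convolutions with kernels [2, 3, 4],
    max-over-time pooling, sigmoid output."""
    def __init__(self, vocab_size=50000, emb_dim=200, rnn_dim=64,
                 att_dim=16, conv_channels=64, kernel_sizes=(2, 3, 4),
                 num_classes=1000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, rnn_dim, batch_first=True, bidirectional=True)
        feat = emb_dim + 2 * rnn_dim                      # [c_l; e; c_r]
        self.fq = nn.Linear(feat, att_dim)
        self.fk = nn.Linear(feat, att_dim)
        self.fv = nn.Linear(feat, feat)
        self.convs = nn.ModuleList(
            nn.Conv1d(feat, conv_channels, k) for k in kernel_sizes)
        self.out = nn.Linear(conv_channels * len(kernel_sizes), num_classes)
        self.dropout = nn.Dropout(0.5)

    def forward(self, ids):                               # ids: (B, T)
        e = self.emb(ids)
        ctx, _ = self.rnn(e)                              # (B, T, 2*rnn_dim)
        x = torch.cat([ctx, e], dim=2)                    # context + word embedding
        q, k, v = self.fq(x), self.fk(x), self.fv(x)      # dot self-attention
        a = F.softmax(q @ k.transpose(1, 2), dim=-1)
        x_att = (a @ v).transpose(1, 2)                   # (B, feat, T)
        pooled = [torch.tanh(c(x_att)).max(dim=2).values for c in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))
        return torch.sigmoid(self.out(h))                 # per-class probabilities

model = ARCNN()
print(model(torch.randint(0, 50000, (2, 12))).shape)      # torch.Size([2, 1000])
```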

The present invention combines each of the two feature extraction methods with each machine learning classifier and computes the Micro-F1 and Macro-F1 scores of each combination. The results are shown in Table 1:

Table 1

Applying the flat-classifier strategy with machine learning algorithms to the experimental data set, the results show that machine learning algorithms have certain difficulties with large-scale class-imbalanced data: the highest Micro-F1 is achieved by the BOW + Logistic Regression model, reaching 0.4617. In terms of feature engineering, the BOW method performs better than the TF-IDF method on this data set; apart from the slightly better results of the TF-IDF method on the Naive Bayes model, the BOW method has a clear advantage on the other models, with a gap of about 0.04 on the Logistic Regression and DT models in particular. In terms of the classifiers, the Logistic Regression model achieves the best results, followed by the DT model, the Naive Bayes model and the KNN model, while the SVM model, when faced with large-scale class-imbalanced data, not only performs very poorly in computational speed but also gives unsatisfactory classification results, with a Micro-F1 of only 0.13.

On the basis of the same deep learning training and prediction parameters, the present invention computes the Micro-F1 and Macro-F1 of the different neural networks on the prediction set. The results are shown in Table 2:

Table 2

The flat-classifier strategy and the global-classifier strategy are applied to the experimental data set with neural network algorithms. The results show that, for the flat-classifier strategy, when the problem is cast as a multi-class single-label problem, neural networks clearly outperform traditional machine learning algorithms overall on large-scale class-imbalanced data. Comparing the results, the Micro-F1 of the weakest neural network classifier, FastText, under flat classification is almost the same as that of the best machine learning combination, BOW + Logistic Regression, while most neural network classifiers achieve a Micro-F1 under flat classification about 0.05 higher than the best machine learning method.

Comparing the neural network algorithms themselves across strategies, hierarchical classification with the global classifier clearly outperforms flat classification with the flat classifier. As Table 2 shows, for every neural network method the prediction-set Micro-F1 of hierarchical classification is generally 0.02 to 0.05 higher than that of flat classification; evidently the neural networks in deep learning can learn and exploit the hierarchical dependency information well and thereby improve prediction performance.

Comparing the different neural networks on hierarchical classification, FastText, with its relatively simple structure, achieves a Micro-F1 of 0.4961, better than the machine learning methods but worse than the more complex networks that exploit more information. TextCNN reaches a Micro-F1 of 0.5580 and TextRNN 0.5798, showing that the CNN and RNN models can indeed achieve good results in hierarchical text classification, while the RCNN model, which fuses the CNN and RNN models, improves classification performance further, with a Micro-F1 of 0.5824. The ARCNN model proposed by the present invention, building on the advantages of the RNN and CNN models and introducing the attention mechanism, achieves the best classification performance with a Micro-F1 of 0.5864, a further improvement over the RCNN model and better than all the other models.

The same or similar reference numerals correspond to the same or similar components;

The positional relationships described in the drawings are for illustration only and are not to be construed as limiting this patent;

Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly and are not intended to limit the embodiments of the present invention. Those of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (5)

1.一种大规模类别层级文本分类方法,其特征在于,包括以下步骤:1. A large-scale category-level text classification method, which is characterized by including the following steps: S1:采集政府采购公示数据;S1: Collect government procurement announcement data; S2:利用机器学习的方式对步骤S1采集的数据进行分类;S2: Use machine learning to classify the data collected in step S1; S3:将步骤S2的分类结果作为基准,使用ARCNN模型对步骤S1采集的数据进行分类;S3: Use the classification result of step S2 as the benchmark, and use the ARCNN model to classify the data collected in step S1; 所述步骤S3中,对采集的政府采购公示数据进行中文文本分词,得到组词序列D=[w1;w2;…;wn],令w表示网络结构中的参数集合,p(k|D,w)表示文本实例属于类别k的概率,将词序列D=[w1;w2;…;wn]经过Word Embedding映射到词向量空间中,得到词序列的向量表示:[e(w1);e(w2);…;e(wn)];In the step S3, Chinese text segmentation is performed on the collected government procurement announcement data to obtain the word sequence D = [w 1 ; w 2 ; ...; w n ], let w represent the parameter set in the network structure, p(k |D, w) represents the probability that the text instance belongs to category k. The word sequence D = [w 1 ; w 2 ; ...; w n ] is mapped to the word vector space through Word Embedding, and the vector representation of the word sequence is obtained: [e (w 1 ); e(w 2 );…; e(w n )]; 对输入的词序列,采用双向循环神经网络模型Bi-RNN分别学习当前词wi的上下文信息,得到上文信息cl(wi)和下文信息cr(wi),其中,e(wi)为具有|e|个实值元素的非稀疏向量,cl(wi)、cr(wi)均为具有|c|个实值元素的非稀疏向量;For the input word sequence, the bidirectional recurrent neural network model Bi-RNN is used to learn the contextual information of the current word w i , and obtain the upper information c l ( wi ) and the lower information c r (wi ) , where, e(w i ) is a non-sparse vector with |e| real-valued elements, c l (w i ) and c r (wi ) are both non-sparse vectors with |c| real-valued elements; 所述步骤S3中,基于双向循环神经网络模型Bi-RNN的结构,上下文信息cl(wi)、cr(wi)为隐藏层信息,其计算方式如下:In step S3, based on the structure of the bidirectional recurrent neural network model Bi-RNN, the context information c l ( wi ) and cr ( wi ) are hidden layer information, and their calculation method is as follows: cl(wi)=f(W(l)cl(wi-1)+W(sl)e(wi-1))c l ( wi )=f(W (l) c l ( wi-1 )+W (sl) e( wi-1 )) cr(wi)=f(W(r)cr(wi+1)+W(sr)e(wi+1))c r ( wi )=f(W (r) c r ( wi+1 )+W (sr) e(wi +1 )) 其中,cl(wi-1)来自于上文词语wi-1处的隐藏层信息,e(wi-1)来自于上文词语wi-1处自身的Word Embedding值,W(l)是将隐藏层信息向后传递的转移矩阵,W(sl)是将输入的WordEmbedding值信息向后传递的转移矩阵,cr(wi+1)来自于下文词语wi+1处的隐藏层信息,e(wi+1)来自于下文词语wi+1处自身的WordEmbedding值,W(r)是将隐藏层信息向前传递的转移矩阵,W(sr)是将输入的Word Embedding值信息向前传递的转移矩阵,f是非线性激活函数,对聚合后的上文隐藏层信息和输入信息进行激活后向后传递到当前隐藏层,对于文档中的第一个词语而言,由于不存在上文,故使用共享参数cl(w1);Among them, c l (wi -1 ) comes from the hidden layer information at the above word w i-1 , e(wi -1 ) comes from the Word Embedding value of the above word w i-1 , W ( l) is the transfer matrix that transfers the hidden layer information backwards, W (sl) is the transfer matrix that transfers the input WordEmbedding value information backwards, c r (wi +1 ) comes from the following word w i+1 Hidden layer information, e(w i+1 ) comes from the WordEmbedding value of the word w i+1 below, W (r) is the transfer matrix that transmits the hidden layer information forward, and W (sr) is the input Word The transfer matrix for forward transmission of Embedding value information, f is a nonlinear activation function, which activates the aggregated above hidden layer information and input information and then passes it backward to the current hidden layer. 
For the first word in the document, Since the above does not exist, the shared parameter c l (w 1 ) is used; 经过基于双向循环神经网络模型Bi-RNN,通过正向和反向的循环,对于当前词wi,可以获得当前输入信息e(wi)及其隐藏层信息cl(wi)、cr(wi);将当前输入信息和隐藏层信息进行级联拼接,得到文本表示方式xiBased on the bidirectional recurrent neural network model Bi-RNN, through forward and reverse cycles, for the current word w i , the current input information e(wi ) and its hidden layer information c l (wi ) , c r can be obtained ( wi ); cascade and splice the current input information and hidden layer information to obtain the text representation x i : xi=[cl(wi);e(wi);cr(wi)];x i =[c l ( wi ); e ( wi ); c r ( wi )]; 现在每个词语处都获取了上下文信息,但其利用程度并非完全相同,而且并非所有词语都对文本含义的表示有同等的贡献,为了让模型更好的聚焦于文本的关键部分,对于Bi-RNN模块的输出xi,采用自注意力模型,来提取文本中重要性更高的词语,得到注意力标记后的文本表示:Now the contextual information is obtained at each word, but the degree of utilization is not exactly the same, and not all words contribute equally to the representation of text meaning. In order to allow the model to better focus on the key parts of the text, for Bi- The output x i of the RNN module uses the self-attention model to extract more important words in the text, and obtain the text representation after attention marking: 将RNN模块中所得的文本表示xi进行编码,通过线性映射fq、fk、fv分别得到查询编码Query:qi、键编码Key:ki和值编码Value:viThe text representation x i obtained in the RNN module is encoded, and the query encoding Query: q i , key encoding Key: k i and value encoding Value: v i are obtained through linear mapping f q , f k , and f v respectively; qi=fq(xi)q i =f q (x i ) ki=fk(xi)k i = f k ( xi ) vi=fv(xi)v i = f v (x i ) 计算Query和Key之间的相似程度,得到注意力分数作为词语重要性关注度的度量;接着,将该分数采用softmax函数进行归一化处理,得到注意力权重值:Calculate the similarity between Query and Key to obtain the attention score as a measure of word importance and attention; then, use the softmax function to normalize the score to obtain the attention weight value: ei,j=a(qi,kj)e i, j = a (q i , k j ) ai,j=softmax(ei,j)a i, j = softmax(e i, j ) 根据所得归一化权重,对文本的Value进行加权求和,得到注意力标记后的文本表示 According to the obtained normalized weight, the value of the text is weighted and summed to obtain the text representation after attention marking. 当前词wi得到了注意力转化后的文本表示将其进一步进行卷积,并使用tanh函数作为激活函数,得到wi的潜在语义向量/>其中W(2)是权重矩阵,b(2)是偏置向量:The current word w i has obtained the text representation after attention transformation. It is further convolved and the tanh function is used as the activation function to obtain the latent semantic vector of w i /> where W (2) is the weight matrix and b (2) is the bias vector: 卷积层中,令filter的kernel size=1,这是因为实际上包含了wi的上下文信息,获得所有单词的潜在语义向量后,采用最大池化层进行计算,得到/>其中/>的第k个元素是/>的第k个元素的最大值:In the convolutional layer, let the kernel size of the filter=1. This is because It actually contains the context information of w i . After obtaining the latent semantic vectors of all words, the maximum pooling layer is used to calculate, and we get/> Among them/> The kth element of is/> The maximum value of the kth element: 经过池化层,将各种不同长度的文本表示转换为固定长度的向量,并且捕获整个文本中的信息,采用最大池化层,经过池化层的最大池化后得到进入输出层,通过一个全连接层和softmax函数得到文本实例属于类别的概率pi如下,其中,W(4)是权重矩阵,b(4)是偏置向量:After the pooling layer, text representations of various lengths are converted into fixed-length vectors, and the information in the entire text is captured. The maximum pooling layer is used. 
entering the output layer, the probability p_i that the text instance belongs to category i is obtained through a fully connected layer and the softmax function as follows, where W^(4) is the weight matrix and b^(4) is the bias vector:
p_i = softmax(W^(4) y^(3) + b^(4))_i
in the forward-propagation process, the parameters that need to be learned through model training include the following, where c_l(w_{i-1}) is the hidden-layer information at the preceding word w_{i-1}, e(w_{i-1}) is the Word Embedding of the preceding word w_{i-1}, W^(l) is the transfer matrix that passes hidden-layer information backward, W^(sl) is the transfer matrix that passes the input Word Embedding backward, c_r(w_{i+1}) is the hidden-layer information at the following word w_{i+1}, e(w_{i+1}) is the Word Embedding of the following word w_{i+1}, W^(r) is the transfer matrix that passes hidden-layer information forward, W^(sr) is the transfer matrix that passes the input Word Embedding forward, W^(2) is the weight matrix of the convolutional layer, b^(2) is the bias vector of the convolutional layer, W^(4) is the weight matrix of the fully connected layer, and b^(4) is the bias vector of the fully connected layer:
w = {b^(2), b^(4), c_l(w_1), c_r(w_n), W^(2), W^(4), W^(l), W^(r), W^(sl), W^(sr)}.
2. The large-scale category hierarchical text classification method according to claim 1, characterized in that the specific process of step S1 is:
collecting government procurement announcement data, including transaction (award) announcements and bidding announcements, and preprocessing the collected government procurement announcement data:
(1) removing illegal characters from the text data and converting traditional Chinese characters into simplified characters;
(2) segmenting the Chinese text into words and removing stop words;
(3) training word vectors.
3. The large-scale category hierarchical text classification method according to claim 2, characterized in that in step S2, the flat-classifier approach to the hierarchical classification problem is used for classification; in the feature-engineering stage, BOW and TF-IDF are selected as word features; in the classifier-training stage, a naive Bayes classifier, a logistic regression classifier, a support vector machine classifier or a decision tree classifier is selected to classify the experimental data; the flat classifier does not consider the hierarchical structure, so the loss function of this part considers only the empirical loss on the training set.
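For illustration only (not part of the claim text): the flat-classifier baseline of step S2 (claims 2 and 3) maps naturally onto a scikit-learn pipeline. The sketch assumes the announcements have already been cleaned and segmented into space-separated tokens as in claim 2 and that leaf-category labels are used directly; LogisticRegression is shown as one of the four classifiers listed in claim 3, and the names flat_clf, docs and labels are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# docs:   list of segmented texts (space-separated tokens)
# labels: leaf category id of each document
flat_clf = Pipeline([
    ("bow", CountVectorizer()),                   # bag-of-words counts
    ("tfidf", TfidfTransformer()),                # TF-IDF weighting
    ("clf", LogisticRegression(max_iter=1000)),   # or MultinomialNB / LinearSVC / DecisionTreeClassifier
])

def train_flat_baseline(docs, labels):
    """Fit the flat (hierarchy-agnostic) baseline on the training split."""
    return flat_clf.fit(docs, labels)

def predict_categories(docs):
    """Predict a leaf category for each document."""
    return flat_clf.predict(docs)
```

The predictions of this baseline serve as the benchmark against which the ARCNN model of step S3 is compared.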
4. The large-scale category hierarchical text classification method according to claim 1, characterized in that a category hierarchy tree N is given, whose nodes are the category labels of all levels, and the mandatory leaf-node prediction approach is adopted; the M text instances contained in the input training set are denoted TRAIN = {(x_i, t_i)}_{i=1}^M, where x_i ∈ X is the instance space of the text embedding, t_i ∈ T denotes the category, and T is the set of all leaf nodes in the hierarchical structure; the function π: N → N is defined such that π(n) is the ancestor node of node n; C_n denotes the set of all child nodes of node n; y_{in} = 2·I(t_i = n) − 1 ∈ {−1, 1} is defined as a binary variable indicating whether sample x_i belongs to category n ∈ T, where I(·) is the indicator function; the prediction model is written as a prediction function f_p: X → T, and the objective is to obtain, for the given TRAIN, the prediction function f_p with minimum loss.
5. The large-scale category hierarchical text classification method according to claim 4, characterized in that the loss function used consists of the empirical risk on the data set and a regularization term that penalizes the complexity of the prediction function f_p, defined as follows:
Loss = λ(w) + C × R_emp
where λ(w) is the regularization term, R_emp is the empirical risk on the data set, and C is a weight parameter controlling the balance between the empirical risk and the complexity of the prediction function f_p;
for the parameter set w, in addition to the forward-propagation form, it is decomposed onto each node to form a node-based parameter set w = {w_n : n ∈ N}, i.e. every node n in the hierarchical structure is associated with a parameter group w_n, where the function π: N → N denotes that π(n) is the ancestor node of node n; for the regularization term, the hierarchical dependencies of the tree structure are introduced into the regularization term of w, so that the hierarchical structure is learned during training;
for the empirical loss function, the empirical risk is the loss induced by the instances at the leaf nodes of the category hierarchy, where x_i is the instance space of the text embedding, w_n is the node-based parameter group, y_{in} indicates whether sample x_i belongs to category n, and L is any convex loss function of the different models.
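For illustration only (not part of the claim text): a sketch of the loss in claim 5 under explicit assumptions. The claim states only that the regularizer introduces the parent–child dependencies of the tree into w and that L is a convex loss; here the regularizer is assumed to take the recursive form (1/2)·Σ_n ||w_n − w_{π(n)}||² that ties each node's parameters to its ancestor's, and the logistic loss is used as one concrete convex choice of L. These choices and the data layout are assumptions for the sketch, not the patent's prescribed formulas.

```python
import numpy as np

def hierarchical_loss(W, parent, X, Y, leaves, C=1.0):
    """Loss = lambda(w) + C * R_emp with node-wise parameters.

    W      : dict node -> (d,) parameter vector w_n for every node in the tree
    parent : dict node -> pi(n); the root maps to None
    X      : (M, d) text-embedding instances x_i
    Y      : dict leaf node n -> (M,) array of y_in in {-1, +1}
    leaves : iterable of the leaf nodes T
    """
    # Assumed recursive regularizer: tie each w_n to its ancestor's parameters.
    reg = 0.0
    for n, w_n in W.items():
        w_anc = W[parent[n]] if parent[n] is not None else np.zeros_like(w_n)
        reg += 0.5 * np.sum((w_n - w_anc) ** 2)

    # Empirical risk from instances at the leaf nodes (logistic loss as L).
    emp = 0.0
    for n in leaves:
        margins = Y[n] * (X @ W[n])
        emp += np.sum(np.logaddexp(0.0, -margins))

    return reg + C * emp
```

Minimizing this jointly over all w_n pulls each node's parameters toward its ancestor's, which is one way the hierarchical structure can be learned during training, as the claim describes.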
CN202110743721.1A 2021-06-30 2021-06-30 Large-scale category hierarchical text classification method Active CN113590819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110743721.1A CN113590819B (en) 2021-06-30 2021-06-30 Large-scale category hierarchical text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110743721.1A CN113590819B (en) 2021-06-30 2021-06-30 Large-scale category hierarchical text classification method

Publications (2)

Publication Number Publication Date
CN113590819A CN113590819A (en) 2021-11-02
CN113590819B true CN113590819B (en) 2024-01-02

Family

ID=78245517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110743721.1A Active CN113590819B (en) 2021-06-30 2021-06-30 Large-scale category hierarchical text classification method

Country Status (1)

Country Link
CN (1) CN113590819B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419646B (en) * 2022-01-17 2024-06-28 马上消费金融股份有限公司 Image classification method, device, electronic equipment and storage medium
WO2024134471A1 (en) * 2022-12-20 2024-06-27 3M Innovative Properties Company Machine learning framework for detection of anomalies and insights for procurement system
CN117499325B (en) * 2023-12-29 2024-03-15 湖南恒茂信息技术有限公司 Switch service message distribution method and system based on artificial intelligence
CN118673152B (en) * 2024-08-22 2024-10-29 山东省齐鲁大数据研究院 Text classification method, system, terminal and medium based on self-adaptive rewarding mechanism
CN120633480A (en) * 2025-08-12 2025-09-12 中国人民解放军国防科技大学 Multi-gamma-source azimuth inversion method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777011A (en) * 2016-12-07 2017-05-31 中山大学 A kind of file classification method based on depth multi-task learning
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A Text Classification Method Based on Bidirectional Recurrent Attention Neural Network
CN110472053A (en) * 2019-08-05 2019-11-19 广联达科技股份有限公司 A kind of automatic classification method and its system towards public resource bidding advertisement data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7958067B2 (en) * 2006-07-12 2011-06-07 Kofax, Inc. Data classification methods using machine learning techniques
EP3640855A1 (en) * 2018-10-19 2020-04-22 Tata Consultancy Services Limited Systems and methods for conversational based ticket logging

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777011A (en) * 2016-12-07 2017-05-31 中山大学 A kind of file classification method based on depth multi-task learning
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A Text Classification Method Based on Bidirectional Recurrent Attention Neural Network
CN110472053A (en) * 2019-08-05 2019-11-19 广联达科技股份有限公司 A kind of automatic classification method and its system towards public resource bidding advertisement data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Improving text classification with weighted word embeddings via a multi-channel TextCNN model";Bao Guo 等;《Neurocomputing》;第363卷;全文 *

Also Published As

Publication number Publication date
CN113590819A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN113590819B (en) Large-scale category hierarchical text classification method
Zulqarnain et al. Efficient processing of GRU based on word embedding for text classification
Bach et al. Snorkel drybell: A case study in deploying weak supervision at industrial scale
Bai Text classification based on LSTM and attention
Liang et al. Two-stage three-way enhanced technique for ensemble learning in inclusive policy text classification
CN112015863B (en) Multi-feature fusion Chinese text classification method based on graphic neural network
Zhang et al. Aspect-based sentiment analysis for user reviews
CN113255321B (en) Financial field chapter-level event extraction method based on article entity word dependency relationship
Fu et al. Listening to the investors: A novel framework for online lending default prediction using deep learning neural networks
CN113806547B (en) Deep learning multi-label text classification method based on graph model
Zhao et al. The study on the text classification for financial news based on partial information
CN110083836B (en) A Key Evidence Extraction Method for Text Prediction Results
CN108694476A (en) A kind of convolutional neural networks Stock Price Fluctuation prediction technique of combination financial and economic news
CN114936277B (en) Similar question matching method and user similar question matching system
Liang et al. Exploring ensemble oversampling method for imbalanced keyword extraction learning in policy text based on three-way decisions and SMOTE
CN114398488A (en) A BILSTM Multi-Label Text Classification Method Based on Attention Mechanism
CN118708711B (en) An intelligent query method and system for electric power scientific research knowledge based on knowledge graph
Zou et al. Deep field relation neural network for click-through rate prediction
CN112925907A (en) Microblog comment viewpoint object classification method based on event graph convolutional neural network
Chen et al. A deep learning method for judicial decision support
CN111460200A (en) Image retrieval method and model based on multitask deep learning and construction method thereof
Kashid et al. Bi-RNN and Bi-LSTM based text classification for amazon reviews
Meng et al. Concept-concept association information integration and multi-model collaboration for multimedia semantic concept detection
Xiao et al. Multi-Task CNN for classification of Chinese legal questions
Selvi et al. Sentimental analysis of movie reviews in tamil text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant