CN114153803B

CN114153803B - Government file attribution province classification method based on pre-training model

Info

Publication number: CN114153803B
Application number: CN202111470389.2A
Authority: CN
Inventors: 沈超; 朱皓宬; 周亚东; 刘晓明
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-12-03
Filing date: 2021-12-03
Publication date: 2024-07-19
Anticipated expiration: 2041-12-03
Also published as: CN114153803A

Abstract

The invention discloses a method for classifying government files by attribution province based on a pre-training model, comprising the following steps: 1) extracting a feature dictionary from csv and excel files; 2) generating sentence vectors for all texts in the feature dictionary; 3) performing regional entity recognition training on the sentence vectors to obtain a regional named entity recognition model; 4) performing region-province mapping training to obtain a region-province mapping model. The method can effectively classify government csv and excel files by attribution province, effectively avoids the problem of province overlapping within the same file, and offers high prediction accuracy, small error, low computational complexity and high practical value.

Description

Government file attribution province classification method based on pre-training model

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a government affair file attribution province classification method based on a pre-training model.

Background

Text classification uses a computer to automatically classify and label a set of texts (or other entities or objects) according to a given classification system or standard. From a labeled set of training documents, a model of the relation between document features and document categories is learned, and the learned model is then used to judge the category of new documents. Text classification is gradually moving from knowledge-based methods to methods based on statistics and machine learning.

Text classification generally comprises text representation, classifier selection and training, and evaluation of and feedback on the classification results, wherein text representation can be subdivided into text preprocessing, indexing and statistics, feature extraction and other steps. Text classification does not differ in essence from other classification problems: it comes down to matching on certain features of the data to be classified. Since a perfect match is unlikely, the best match under some evaluation criterion must be selected to complete the classification.

The knowledge engineering approach, which developed somewhat later, relies on professionals to define a large number of inference rules for each category; a document that satisfies the rules of a category is judged to belong to it. The drawbacks of this approach are evident. First, the quality of classification depends heavily on the quality of the rules, that is, on the people who write them. Second, the rule writers must be experts, so labor costs rise sharply and often become unaffordable. Most seriously, knowledge engineering does not generalize at all: a classification system built for the financial field cannot be extended to related fields such as medical care or social insurance and has to be rebuilt from scratch, which often wastes a great deal of knowledge and money.

Disclosure of Invention

To solve the problem of classifying government files by province in the prior art, the invention aims to provide a government file attribution province classification method based on a pre-training model, which classifies government csv and excel files by attribution province and effectively avoids the problem of province overlapping within the same file.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

A government file attribution province classification method based on a pre-training model, wherein the government file is a csv and/or excel file, and the classification method comprises the following steps:

Step 1: extracting features of the government file from five dimensions, namely file name, table header, row attributes, column attributes and full table text, to generate a corresponding feature dictionary;

Step 2: generating embedding vectors of semantic information from the text data in the feature dictionary by using the self-encoder in the pre-training model, capturing the semantic contribution relation between words, with the position embedding information of the words provided by a built-in function of the pre-training model Bert; integrating the semantic information and the position embedding information to generate sentence vectors of the text data in the feature dictionary;

Step 3: training a regional named entity recognition model model₁ using the sentence vectors obtained in step 2;

Step 4: extracting the regions in all feature dictionaries by using the regional named entity recognition model model₁ trained in step 3, marking the corresponding province labels according to a Chinese administrative region planning table, and training the region-province mapping to obtain a region-province mapping model model₂;

Step 5: performing province label classification on new excel and csv files by using model₁ and model₂.

In one embodiment, the step 1 includes:

Step 1.1: representing the table in a csv file as a dictionary with five key-value pairs, whose five keys are name_chineseall, head_attribute, row_attribute, column_attribute and allcsv_chinese, wherein name_chineseall represents all Chinese characters in the original file name, head_attribute represents the table header of the original file, row_attribute represents all row attributes of the table in the original file, column_attribute represents all column attributes of the table in the original file, and allcsv_chinese represents all Chinese content in the original file;

Step 1.2: generating n temporary csv files according to the number n of sheets contained in the excel file, generating a feature dictionary for each temporary csv file by the method of step 1.1, and concatenating the corresponding values of the n feature dictionaries key by key to generate a total feature dictionary, which is the feature dictionary of the excel file;

Step 1.3: storing all government files and their corresponding feature dictionaries in a json file in index order.
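The extraction in step 1.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expression used to define "Chinese characters", and the choice of which cells count as row and column attributes are all assumptions made for the example.

```python
import csv
import io
import re

# Assumption: "Chinese characters" means CJK unified ideographs.
CHINESE = re.compile(r"[\u4e00-\u9fff]+")


def chinese_only(text):
    """Keep only the Chinese characters of a string."""
    return "".join(CHINESE.findall(text))


def feature_dictionary(file_name, csv_text):
    """Build the five-key feature dictionary for one csv table."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header = rows[0] if rows else []
    body = rows[1:]
    return {
        "name_chineseall": chinese_only(file_name),
        "head_attribute": " ".join(header),
        # Assumption: row attributes are the cells of the first data row.
        "row_attribute": " ".join(body[0]) if body else "",
        # Assumption: column attributes are the cells of the first column.
        "column_attribute": " ".join(row[0] for row in body),
        "allcsv_chinese": chinese_only(csv_text),
    }


features = feature_dictionary("藤县学校信息_0.csv", "学校,地址\n藤县中学,广西藤县\n")
```

Keeping the five keys separate, rather than flattening the file into one string, is what later allows the keys to be examined one at a time in priority order.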

In one embodiment, the step 2 includes:

Step 2.1: segmenting the values corresponding to the five keys in each feature dictionary into words, randomly masking 15% of the segmented text data, adding identifiers marking the beginning and the end of the sentence at the start and end of the text data, and predicting each masked character from the unmasked characters on both sides of it, wherein the intermediate vector used to predict the masked character, which contains no position information, is the semantic information embedding vector of that character;

Step 2.2: generating position index embedding information for the position of each character in the values corresponding to the five keys processed in step 2.1, integrating it with the semantic information embedding vectors generated in step 2.1, and finally generating a sentence vector for the value corresponding to each of the five keys.

In one embodiment, the step 3 includes:

Step 3.1: passing the sentence vector [c₁, c₂, c₃, c₄, c₅, c₆, …] generated in step 2 through 4 Bi-LSTM layers to generate the hidden layer vector [h₁, h₂, h₃, h₄, …], which captures the forward and backward dependencies of the sentence vector simultaneously;

Step 3.2: passing the hidden layer vector [h₁, h₂, h₃, h₄, …] through a CRF layer, which outputs the predicted labeling sequence that satisfies the label transition constraints with the highest likelihood, and normalizing with the softmax function to generate the labeling probability sequence [p₁, p₂, p₃, p₄, …], wherein the largest probability value in the labeling probability sequence corresponds to the predicted regional entity, thereby obtaining the regional named entity recognition model model₁.

In one embodiment, in step 3.1, the forward LSTM layer captures the long-distance dependency from c₁ to c_n and the backward LSTM layer captures the long-distance dependency from c_n to c₁, so that the forward and backward dependencies of the sentence vector are captured simultaneously and the hidden layer vector [h₁, h₂, h₃, h₄, …] is generated. The LSTM has three gates in total to maintain and adjust the cell state: a forget gate, an input gate and an output gate. The cell state, forget gate, input gate and output gate are defined as follows:

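$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$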

Wherein the cell state represents the information retained from the embedded vector. The forget gate receives h_{t-1} and x_t and outputs a value f_t between 0 and 1 that determines how much information is forgotten; this value acts on the previous cell state c_{t-1}, where 1 means "completely retained" and 0 means "completely forgotten". The input gate receives h_{t-1} and x_t and outputs a value i_t between 0 and 1 that determines how much information is retained, after which the cell state is updated to c_t. The output gate receives h_{t-1} and x_t and outputs a value o_t between 0 and 1, and h_t finally determines how much of the current state c_t is output. σ denotes the sigmoid function. Because the text data in government files contains much information unrelated to geographical features, the information-screening property of the LSTM makes it well suited to processing government files.
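To make the gate equations concrete, the following sketch evaluates one LSTM step with scalar inputs and weights, then runs the same cell in both directions over a sequence. It mirrors the equations exactly as written above, including the reuse of the input-gate weights inside the tanh term. The dictionary layout of the weights, the function names and the single-layer, scalar setting are simplifying assumptions; a practical model would use vectors, matrices and 4 stacked layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step following the gate equations in the text."""
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + w["cf"] * c_prev + b["f"])
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + w["ci"] * c_prev + b["i"])
    c_t = f_t * c_prev + i_t * math.tanh(w["xi"] * x_t + w["hi"] * h_prev + b["c"])
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + w["co"] * c_t + b["o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def run_direction(xs, w, b):
    """Run the cell over a sequence and collect the hidden states."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, w, b)
        outputs.append(h)
    return outputs


def bi_lstm(xs, w_fwd, b_fwd, w_bwd, b_bwd):
    """Pair forward and backward hidden states position by position."""
    forward = run_direction(xs, w_fwd, b_fwd)
    backward = run_direction(xs[::-1], w_bwd, b_bwd)[::-1]
    return list(zip(forward, backward))


zero_w = {k: 0.0 for k in ("xf", "hf", "cf", "xi", "hi", "ci", "xo", "ho", "co")}
zero_b = {k: 0.0 for k in ("f", "i", "c", "o")}
h_t, c_t = lstm_step(1.0, 0.0, 2.0, zero_w, zero_b)
```

With all weights set to zero every gate outputs 0.5, so the step halves the previous cell state, which makes the role of the forget gate easy to check by hand.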

In one embodiment, in step 3.2, a CRF (conditional random field) layer scores all possible paths using the sequence labeling transition matrix and outputs the predicted labeling sequence, wherein the sequence score of the CRF layer is computed as follows:

the emission score and the transition score of each y_i in the labeling sequence [y₁, y₂, y₃, y₄, …, y_i, …, y_n] are added up over the whole sequence to obtain score(x, y).
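The scoring described here matches the usual linear-chain CRF form, in which the score of a labeling is the sum of one emission score per position and one transition score per adjacent pair of labels. The sketch below assumes that form; the function names, the list-of-lists layout and the omission of start and stop transitions are assumptions. The exhaustive search over all labelings is used only for clarity, where a practical implementation would typically use the Viterbi algorithm.

```python
from itertools import product


def sequence_score(emissions, transitions, labels):
    """Sum emission and transition scores along one labeling."""
    score = 0.0
    for i, label in enumerate(labels):
        score += emissions[i][label]  # emission score of y_i
        if i > 0:
            score += transitions[labels[i - 1]][label]  # transition y_{i-1} -> y_i
    return score


def best_labeling(emissions, transitions):
    """Score every possible labeling and return the highest-scoring one."""
    n_labels = len(emissions[0])
    candidates = product(range(n_labels), repeat=len(emissions))
    return max(candidates, key=lambda labels: sequence_score(emissions, transitions, labels))


emissions = [[1.0, 0.0], [0.0, 2.0]]       # emissions[position][label]
transitions = [[0.5, -1.0], [0.0, 0.5]]    # transitions[previous][current]
best = best_labeling(emissions, transitions)
```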

In one embodiment, the step 4 includes:

Step 4.1: extracting the values corresponding to the five keys in all feature dictionaries, extracting all regional entities contained in the text data with the regional named entity recognition model model₁ trained in step 3, and at the same time traversing the Chinese administrative region planning table to find the province label corresponding to each regional entity, so as to form region-province key-value pairs {entity_n: province_n}, wherein entity_n represents a regional entity and province_n represents the corresponding province label;

Step 4.2: performing classification training on the region-province key-value pairs {entity_n: province_n} with the pre-training model Bert followed by 1 fully connected layer, thereby obtaining the region-province mapping model model₂.
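The construction of the region-province key-value pairs in step 4.1 amounts to a table lookup, sketched below. The table excerpt, its province-to-regions layout and the function name are hypothetical, and the entity list stands in for the output of model₁. The training in step 4.2 is not shown.

```python
# Hypothetical excerpt of an administrative region table: province -> regions.
ADMIN_TABLE = {
    "Guangxi": ["Teng County", "Nanning"],
    "Shaanxi": ["Xi'an"],
}


def region_province_pairs(entities, admin_table):
    """Label each recognized regional entity with its province."""
    lookup = {}
    for province, regions in admin_table.items():
        lookup[province] = province  # a province name maps to itself
        for region in regions:
            lookup[region] = province
    # Entities missing from the table are skipped rather than guessed.
    return {entity: lookup[entity] for entity in entities if entity in lookup}


pairs = region_province_pairs(["Teng County", "Xi'an", "Atlantis"], ADMIN_TABLE)
```

The resulting pairs serve as labeled examples for the classifier, so that the trained model can also map region names that are written differently from the table entries.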

In one embodiment, the step 5 includes:

Step 5.1: generating a feature dictionary for the file to be predicted according to step 1, and passing the text data corresponding to each of the five keys in the feature dictionary through model₁ to generate a temporary dictionary {X: [entity₁, entity₂, …]}, wherein X represents a key and [entity₁, entity₂, …] are the regional entities extracted from the value corresponding to X;

Step 5.2: for the same file, prediction proceeds level by level, starting from the file name name_chineseall. The file name is sent to model₂ for prediction; if a prediction result exists, the prediction ends and that result is taken directly as the final result. If the file name contains no regional entity, the header is sent to model₂ for prediction, followed in order by the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese; at each level, if a prediction result exists, the prediction ends and that result is taken directly as the final result. If two or more different province labels are recognized at any of the above levels, all levels are predicted and the province label that occurs most frequently among all recognition results is taken as the final result. If no level contains a regional entity, the file is predicted as belonging to another region.
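The level-by-level prediction of step 5.2 can be sketched as a short cascade. In this illustration `extract_regions` stands in for model₁ and `to_province` for model₂; both are passed in as plain functions, and the substring matcher and lookup table used in the usage lines are toy stand-ins. The function name, the key name used for the header and the fallback label "other" are assumptions.

```python
from collections import Counter

# Keys in decreasing priority, file name first.
PRIORITY = ["name_chineseall", "head_attribute", "row_attribute",
            "column_attribute", "allcsv_chinese"]


def classify_file(features, extract_regions, to_province, default="other"):
    """Return the province label of one file using the priority cascade."""
    def labels_for(key):
        return [to_province(e) for e in extract_regions(features.get(key, ""))]

    for key in PRIORITY:
        labels = labels_for(key)
        distinct = set(labels)
        if len(distinct) == 1:
            return labels[0]  # unambiguous result at this level: stop here
        if len(distinct) > 1:
            # Conflicting labels: predict every level and take the majority.
            votes = Counter(l for k in PRIORITY for l in labels_for(k))
            return votes.most_common(1)[0][0]
    return default  # no regional entity found at any level


# Toy stand-ins for model₁ and model₂.
REGIONS = {"Teng County": "Guangxi", "Nanning": "Guangxi", "Xi'an": "Shaanxi"}
extract = lambda text: [region for region in REGIONS if region in text]
lookup = REGIONS.get

simple = classify_file({"name_chineseall": "Teng County school list"}, extract, lookup)
conflict = classify_file(
    {"name_chineseall": "school list",
     "head_attribute": "Teng County Xi'an",
     "allcsv_chinese": "Nanning Teng County"},
    extract, lookup)
empty = classify_file({}, extract, lookup)
```

Stopping at the first unambiguous level keeps most predictions cheap, since the full table text only has to be examined when the higher-priority fields are empty or contradictory.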

Compared with the prior art, the invention has the beneficial effects that:

(1) The invention uses the pre-training model to extract the semantic and position information of the feature dictionary to generate sentence vectors, and performs named entity recognition on all the sentence vectors to extract all regional entities.

(2) The invention pairs regional entities with provinces through the Chinese administrative region planning table to form region-province relation pairs, and trains on all region-province relation pairs to produce a general region-province recognition model.

(3) The method for classifying government csv and excel files by attribution province based on a pre-training model can effectively classify government csv and excel files by Chinese province, effectively avoids the problem of province overlapping within the same file, and offers high prediction accuracy, small error, low computational complexity and high practical value.

Drawings

FIG. 1 is a flow chart of the method for classifying government csv and excel files by attribution province based on a pre-training model.

FIG. 2 shows the Bi-LSTM feature extraction process.

FIG. 3 is the full flow chart of government file prediction.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.

Statistical learning methods have become the absolute mainstream in the field of text classification. The main reasons are that many of these techniques have a solid theoretical basis (in contrast, the knowledge engineering approach depends largely on the subjective judgment of experts), that clear evaluation criteria exist, and that they perform well in practice. Once a statistical classification algorithm has converted the sample data into a vector representation, the computer can begin the actual learning process. Government files are mainly csv files and excel files. The invention classifies these two types of government files using a deep-learning text classification method, where classification by attribution province means determining which province's government released the file.

Specifically, as shown in fig. 1, the government affair file attribution province classification method based on a pre-training model comprises the following steps:

Step 1: extracting features of the government file from five dimensions, namely file name, table header, row attributes, column attributes and full table text, to generate a corresponding feature dictionary. One possible specific procedure is as follows:

Step 1.1: in this embodiment, the data set consists of 70809 government files crawled by the inventors from provinces across the country. The tables in all csv files are represented as dictionaries with five key-value pairs, whose five keys are name_chineseall, header, row_attribute, column_attribute and allcsv_chinese, wherein name_chineseall represents all Chinese characters in the original file name, header represents the table header of the original file, row_attribute represents all row attributes of the table in the original file, column_attribute represents all column attributes of the table in the original file, and allcsv_chinese represents all Chinese characters in the original file. For example, for the file "Teng County primary and secondary school organization information_0.xls", the extracted name_chineseall is "Teng County primary and secondary school organization information", header is "number name of school in school number and school address graduation rate", row_attribute is "Guangxi Teng County ancient Long Zhengu center school Teng County ……", column_attribute is "3802_ Teng County", and so on, and allcsv_chinese holds all the Chinese text of the file;

Step 1.2: generating n temporary csv files according to the number n of sheets contained in the excel file, generating a feature dictionary for each temporary csv file by the method of step 1.1, and concatenating the corresponding values of the n feature dictionaries key by key to generate a total feature dictionary, which is the feature dictionary of the excel file. For example, the file "Teng County primary and secondary school organization information_0.xls" contains 5 sheets of school information, so dictionary features are extracted from each of the 5 sheets and then merged to form the dictionary features of the original file;

Step 1.3: storing all government files and their corresponding feature dictionaries in a json file in index order, for convenient later use.
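The merging in step 1.2 and the storage in step 1.3 can be sketched as follows. The function names, the space used as separator when concatenating values, and the layout of the json store are assumptions made for the example; reading the sheets out of the excel file is not shown.

```python
import json


def merge_sheet_dictionaries(sheet_dicts):
    """Concatenate the values of the per-sheet dictionaries key by key."""
    merged = {}
    for sheet in sheet_dicts:
        for key, value in sheet.items():
            merged[key] = (merged[key] + " " + value) if key in merged else value
    return merged


def dump_feature_store(files):
    """Serialize (file name, feature dictionary) pairs in index order."""
    store = {str(index): {"file": name, "features": features}
             for index, (name, features) in enumerate(files)}
    return json.dumps(store, ensure_ascii=False, indent=2)


merged = merge_sheet_dictionaries([
    {"header": "a", "row_attribute": "x"},
    {"header": "b", "row_attribute": "y"},
])
store_text = dump_feature_store([("f.xls", merged)])
```

Passing `ensure_ascii=False` keeps Chinese text readable in the stored file instead of escaping it.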

Step 2: generating an embedded vector of semantic information from text data in a feature dictionary by utilizing a self-encoder in a pre-training model, capturing semantic contribution relation between words, and providing position embedded information of the words by a built-in function of the pre-training model Bert; and integrating the semantic information and the position embedded information to generate sentence vectors of the text data in the feature dictionary. One possible specific procedure is as follows:

Step 3: training a regional named entity recognition model model₁ using the sentence vectors obtained in step 2. One possible specific procedure is as follows:

Step 3.1: as shown in FIG. 2, the sentence vector [c₁, c₂, c₃, c₄, c₅, c₆, …] generated in step 2 is passed through 4 Bi-LSTM layers, which capture the forward and backward dependencies [v_f0|v_b0, v_f1|v_b1, v_f2|v_b2, v_f3|v_b3, …] of the sentence vector simultaneously and generate the hidden layer vector [h₁, h₂, h₃, h₄, …]. For example, the five keys of the feature dictionary of the file "Teng County primary and secondary school organization information_0.xls" correspond to 5 sentence vectors, and these 5 sentence vectors are each sent to the Bi-LSTM layers for feature extraction.

In the Bi-LSTM layers, the forward LSTM layer captures the long-distance dependency from c₁ to c_i and the backward LSTM layer captures the long-distance dependency from c_i to c₁, so that the forward and backward dependencies of the sentence vector are captured simultaneously and the hidden layer vector [h₁, h₂, h₃, h₄, …] is generated. The LSTM has three gates in total to maintain and adjust the cell state: a forget gate, an input gate and an output gate. The cell state, forget gate, input gate and output gate are defined as follows:

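$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$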

Step 3.2: the hidden layer vector [h₁, h₂, h₃, h₄, …, h_j] is passed through a CRF layer, which outputs the predicted labeling sequence that satisfies the label transition constraints with the highest likelihood; softmax normalization then generates the labeling probability sequence [p₁, p₂, p₃, p₄, …], and the largest probability value in the labeling probability sequence corresponds to the predicted regional entity, thereby obtaining the regional named entity recognition model model₁.

Specifically, all possible real paths are scored by using a sequence labeling transfer matrix through a CRF (conditional random field) layer, so as to output a prediction labeling sequence, wherein the sequence score calculation method of the CRF layer is defined as follows:

Step 4: extracting the regions in all feature dictionaries by using the regional named entity recognition model model₁ trained in step 3, marking the corresponding province labels according to a Chinese administrative region planning table, generating a training set and a test set, and training the region-province mapping to obtain a region-province mapping model model₂. One possible specific procedure is as follows:

Step 5: performing province label classification on new excel and csv files with a step-by-step prediction mechanism by using model₁ and model₂. One possible specific procedure is as follows:

Step 5.1: generating a feature dictionary for the file to be predicted according to step 1, and passing the text data corresponding to each of the five keys in the feature dictionary through model₁ to generate a temporary dictionary {X: [entity₁, entity₂, …]}, wherein X represents a key and [entity₁, entity₂, …] are the regional entities extracted from the value corresponding to X; for example, the temporary dictionary generated for the file name is {name_chineseall: [entity₁, entity₂, …]}, where [entity₁, entity₂, …] are the regional entities extracted from the text content corresponding to name_chineseall.

Step 5.2: as shown in FIG. 3, for the same file, prediction starts from the file name name_chineseall. The file name is sent to model₂ for prediction; if a prediction result exists, the prediction ends and that result is taken directly as the final result. If the file name contains no regional entity, the header is sent to model₂ for prediction, followed in order by the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese; at each level, if a prediction result exists, the prediction ends and that result is taken directly as the final result. If two or more different province labels are recognized at any of the above levels, all levels are predicted and the province label that occurs most frequently among all recognition results is taken as the final result.
That is, a progressive prediction method is adopted for each file: the file name name_chineseall has the highest priority, followed in order by the header, the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese. If data of higher priority yields a prediction result, the prediction ends immediately and that result is taken as the final result. If several province labels appear at some level, the data of all levels is predicted and the label with the largest number of occurrences is taken as the prediction result. If no level yields a result, the file is predicted as belonging to another region.

The experimental results of this example are as follows:

The accuracy (AUC) on the test set was stable at 0.9995.

The experimental results show that the proposed method for classifying government csv and excel files by attribution province based on a pre-training model uses named entity recognition to extract all region names from the text data, generates a training set and a test set from each region and its corresponding province according to the Chinese administrative region planning table, and classifies the original files with a step-by-step prediction mechanism. It can effectively classify government csv and excel files, effectively avoids the problem of province overlapping within the same file, and offers high accuracy, small error, low computational complexity and high practical value.

In conclusion, the method for classifying government csv and excel files by attribution province based on a pre-training model can effectively classify such files, effectively avoids the problem of province overlapping within the same file, and offers high prediction accuracy, small error, low computational complexity and high practical value.

While the foregoing describes illustrative embodiments of the invention to help those skilled in the art understand it, it should be understood that the invention is not limited to the scope of these embodiments; all changes within the spirit and scope of the invention as defined by the appended claims are protected.

Claims

1. A government file attribution province classification method based on a pre-training model, wherein the government file is a csv and/or excel file, characterized in that the classification method comprises the following steps:

Step 3: training a regional named entity recognition model model₁ using the sentence vectors obtained in step 2, comprising:

Step 3.2: passing the hidden layer vector [h₁, h₂, h₃, h₄, …] through a CRF layer, which outputs the predicted labeling sequence that satisfies the label transition constraints with the highest likelihood, and normalizing with the softmax function to generate the labeling probability sequence [p₁, p₂, p₃, p₄, …], wherein the largest probability value in the labeling probability sequence corresponds to the predicted regional entity, thereby obtaining the regional named entity recognition model model₁;

Step 4: extracting the regions in all feature dictionaries by using the regional named entity recognition model model₁ trained in step 3, marking the corresponding province labels according to a Chinese administrative region planning table, and training the region-province mapping to obtain a region-province mapping model model₂, comprising:

Step 4.2: performing classification training on the region-province key-value pairs {entity_n: province_n} with the pre-training model Bert followed by 1 fully connected layer, thereby obtaining the region-province mapping model model₂;

Step 5: performing province label classification on new excel and csv files by using model₁ and model₂, comprising:

2. The method for classifying government documents according to claim 1, wherein the step 1 comprises:

3. The method for classifying government documents according to claim 2, wherein the step 2 comprises:

4. The government file attribution province classification method based on a pre-training model according to claim 1, wherein in step 3.1, the forward LSTM layer captures the long-distance dependency from c₁ to c_i and the backward LSTM layer captures the long-distance dependency from c_i to c₁, so that the forward and backward dependencies of the sentence vector are captured simultaneously and the hidden layer vector [h₁, h₂, h₃, h₄, …] is generated; the LSTM has three gates in total to maintain and adjust the cell state, namely a forget gate, an input gate and an output gate, wherein the cell state, forget gate, input gate and output gate are defined as follows:

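$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$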

5. The method for classifying government documents belonging to provinces based on pre-training models according to claim 1, wherein in the step 3.2, all possible real paths are scored by using a sequence annotation transfer matrix through a CRF layer, so as to output a prediction annotation sequence, and the sequence score calculation method of the CRF layer is defined as follows: