
CN114153803B - Government file attribution province classification method based on pre-training model - Google Patents

Government file attribution province classification method based on pre-training model

Info

Publication number
CN114153803B
CN114153803B CN202111470389.2A CN202111470389A CN114153803B CN 114153803 B CN114153803 B CN 114153803B CN 202111470389 A CN202111470389 A CN 202111470389A CN 114153803 B CN114153803 B CN 114153803B
Authority
CN
China
Prior art keywords
file
model
region
province
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111470389.2A
Other languages
Chinese (zh)
Other versions
CN114153803A (en)
Inventor
沈超
朱皓宬
周亚东
刘晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202111470389.2A
Publication of CN114153803A
Application granted
Publication of CN114153803B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a government file attribution province classification method based on a pre-training model, which comprises the following steps: 1) extracting a feature dictionary from csv and excel files; 2) generating sentence vectors for all texts in the feature dictionary; 3) performing regional entity recognition training on the sentence vectors to obtain a regional named entity recognition model; 4) performing region-province mapping training to obtain a region-province mapping model. The method can effectively classify government csv and excel files by attribution province, avoids the problem of overlapping provinces within the same file, and gives prediction results with high accuracy, small error and low computational complexity, so it has high practical value.

Description

Government file attribution province classification method based on pre-training model
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a government affair file attribution province classification method based on a pre-training model.
Background
Text classification uses a computer to automatically classify and label a set of texts (or other entities or objects) according to a given classification system or standard. From a labeled set of training documents, a model of the relation between document features and document categories is learned, and the learned model is then used to judge the category of new documents. Text classification is gradually moving from knowledge-based methods to methods based on statistics and machine learning.
Text classification generally comprises text representation, classifier selection and training, and evaluation of and feedback on the classification results; text representation can be further divided into text preprocessing, indexing and statistics, feature extraction and other steps. Text classification does not differ in essence from other classification problems: it comes down to matching on certain features of the data to be classified. Because a complete match is unlikely, the best matching result must be selected according to some evaluation criterion, which completes the classification.
The knowledge engineering method, developed somewhat later, defines a large number of inference rules for each category with the help of professionals; if a document satisfies these rules, it is judged to belong to that category. The disadvantages of this approach are evident. First, classification quality depends heavily on the quality of the rules, that is, on the quality of the people who make them. Second, the people who make the rules are experts, and the resulting sharp increase in labor cost is often unbearable. The most fatal weakness of knowledge engineering is that it does not generalize at all: a classification system built for the financial field cannot be extended to related fields such as medical care or social insurance, and has to be discarded and rebuilt from scratch, which often wastes a great deal of knowledge and money.
Disclosure of Invention
To solve the problem of classifying government files by province in the prior art, the invention aims to provide a government file attribution province classification method based on a pre-training model, which can classify government csv and excel files by attribution province and effectively avoids the problem of overlapping provinces within the same file.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A government file attribution province classification method based on a pre-training model, wherein the government file is a csv and/or excel file, and the classification method comprises the following steps:
Step 1: extracting features of the government file from five dimensions, namely the file name, the table header, the row attributes, the column attributes and the full text of the table, to generate a corresponding feature dictionary;
Step 2: generating embedded vectors of semantic information from the text data in the feature dictionary by using the self-encoder in the pre-training model, capturing the semantic contribution relation between words, with the position embedding information of the words provided by a built-in function of the pre-training model Bert; integrating the semantic information and the position embedding information to generate sentence vectors of the text data in the feature dictionary;
Step 3: training a regional named entity recognition model, model 1, on the sentence vectors obtained in step 2;
Step 4: extracting the regions in all feature dictionaries with the regional named entity recognition model 1 trained in step 3, labeling them with the corresponding province labels according to the Chinese administrative region planning table, and training the region-province mapping to obtain a region-province mapping model, model 2;
Step 5: performing province label classification on new excel and csv files by using model 1 and model 2.
In one embodiment, the step 1 includes:
Step 1.1: representing the table in a csv file as a dictionary with five key-value pairs, whose keys are name_chineseall, head_attribute, row_attribute, column_attribute and allcsv_chinese, wherein name_chineseall represents all Chinese characters in the original file name, head_attribute represents the table header of the original file, row_attribute represents all row attributes of the table in the original file, column_attribute represents all column attributes of the table in the original file, and allcsv_chinese represents all Chinese content in the original file;
Step 1.2: for an excel file containing n sheets, generating n temporary csv files, generating a feature dictionary for each temporary csv file by the method of step 1.1, and concatenating the values of the n feature dictionaries key by key to produce a total feature dictionary, which is the feature dictionary of the excel file;
Step 1.3: storing all government files and their corresponding feature dictionaries in a json file in index order.
In one embodiment, the step 2 includes:
Step 2.1: performing word segmentation on the values of the five keys in each feature dictionary, randomly masking 15% of the segmented text data t, and adding identifiers marking the beginning and the end of the sentence at the beginning and the end of the text data; each masked character is predicted from the unmasked characters on both sides of it, and the intermediate vector used for this prediction, which contains no position information, is the semantic information embedded vector of the masked character;
Step 2.2: generating position index embedding information for the position of each character in the values of the five keys processed in step 2.1, integrating it with the semantic information embedded vectors generated in step 2.1, and finally generating a sentence vector for the value of each of the five keys.
In one embodiment, the step 3 includes:
Step 3.1: passing the sentence vector [c_1, c_2, c_3, c_4, c_5, c_6, …] generated in step 2 through 4 Bi-LSTM layers to generate the hidden layer vector [h_1, h_2, h_3, h_4, …], which captures the dependencies of each element of the sentence vector on both the preceding and the following elements;
Step 3.2: passing the hidden layer vector [h_1, h_2, h_3, h_4, …] through a CRF layer, which outputs the most probable predicted label sequence that satisfies the label transition constraints; normalization with the softmax function yields the label probability sequence [p_1, p_2, p_3, p_4, …], whose largest probability value corresponds to the predicted regional entity, thereby obtaining the regional named entity recognition model 1.
In one embodiment, in the step 3.1, the forward LSTM layer captures the long-distance dependencies from c_1 to c_n and the backward LSTM layer captures the long-distance dependencies from c_n to c_1, so that the dependencies of the sentence vector in both directions are captured simultaneously and the hidden layer vector [h_1, h_2, h_3, h_4, …] is generated. The LSTM has three gates in total to maintain and adjust the cell state: a forget gate, an input gate and an output gate. The cell state, forget gate, input gate and output gate are defined as follows:
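$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$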
Here the cell state represents the information retained from the embedded vector. The forget gate receives h_{t-1} and x_t and outputs a value f_t between 0 and 1 that determines how much information is forgotten; this value acts on the previous cell state c_{t-1}, where 1 means "completely retained" and 0 means "completely forgotten". The input gate receives h_{t-1} and x_t and outputs a value i_t between 0 and 1 that determines how much new information is retained, after which the cell state is updated to c_t. The output gate receives h_{t-1} and x_t and outputs a value o_t between 0 and 1, and h_t finally determines how much of the current state c_t is output. σ denotes the sigmoid function. Because the text data in government files contains much information unrelated to geographical features, the information-screening property of the LSTM makes it well suited to processing government files.
In one embodiment, in the step 3.2, all possible label paths are scored by a CRF (conditional random field) layer using the label transition matrix, so as to output the predicted label sequence, where the sequence score of the CRF layer is defined as follows:
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} \left( E_{i, y_i} + T_{y_{i-1}, y_i} \right)$$
where $E_{i, y_i}$ and $T_{y_{i-1}, y_i}$ denote respectively the emission score and the transfer score of y_i in the label sequence [y_1, y_2, y_3, y_4, …, y_i, …, y_n], and summing over the whole sequence gives score(x, y).
In one embodiment, the step 4 includes:
Step 4.1: extracting the values of the five keys in all feature dictionaries, extracting all regional entities contained in the text data with the regional named entity recognition model 1 trained in step 3, and at the same time traversing the Chinese administrative region planning table to find the province label of each regional entity, forming region-province key-value pairs {entity_n: province_n}, wherein entity_n represents a regional entity and province_n represents the corresponding province label;
Step 4.2: performing classification training on the region-province key-value pairs {entity_n: province_n} with the pre-training model Bert followed by one fully connected layer, thereby obtaining the region-province mapping model 2.
In one embodiment, the step 5 includes:
Step 5.1: generating a feature dictionary for the file to be predicted according to step 1, and passing the text data of each of the five keys in the feature dictionary through model 1 to generate a temporary dictionary {X: [entity_1, entity_2, …]}, wherein X represents a key and [entity_1, entity_2, …] are the regional entities extracted from the value corresponding to X;
Step 5.2: for a given file, prediction proceeds key by key, starting from the file name name_chineseall. The regional entities extracted from the file name are sent to model 2; if a prediction result exists, prediction ends and that result is taken as the final result. If the file name contains no regional entity, the header is sent to model 2 in the same way, followed in turn by the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese, stopping at the first key that yields a prediction result. If two or more different province labels are recognized at any of these steps, all keys are predicted and the province label that occurs most often across all of them is taken as the final result; if no key contains a regional entity, the file is classified as belonging to another region.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention extracts the semantic and position information of the feature dictionary with the pre-training model to generate sentence vectors, and performs named entity recognition on all sentence vectors to extract all regional entities.
(2) The invention pairs regional entities with provinces through the Chinese administrative region planning table to form region-province relation pairs, and trains on all region-province relation pairs to produce a general region-province recognition model.
(3) The method can effectively classify government csv and excel files by Chinese province, avoids the problem of overlapping provinces within the same file, and gives prediction results with high accuracy, small error and low computational complexity, so it has high practical value.
Drawings
FIG. 1 is a flow chart of a classification method of the attribution provinces of the government affair class csv and excel files based on a pre-training model.
FIG. 2 is a Bi-LSTM feature extraction process.
Fig. 3 is a full flow chart of government file prediction.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Statistical learning methods have become the absolute mainstream in the field of text classification. The main reasons are that many of these techniques have a solid theoretical basis (in contrast, the knowledge engineering method relies largely on the subjective judgment of experts), that clear evaluation criteria exist, and that they perform well in practice. Once the statistical classification algorithm has converted the sample data into a vector representation, the computer can begin the actual learning process. Government files are mainly csv files and excel files. The invention classifies both kinds of file with a text classification method based on deep learning, where province attribution classification means determining which provincial government released a file.
Specifically, as shown in fig. 1, the government affair file attribution province classification method based on a pre-training model comprises the following steps:
Step 1: extracting features of the government file from five dimensions, namely the file name, the table header, the row attributes, the column attributes and the full text of the table, to generate a corresponding feature dictionary. One possible specific procedure is as follows:
Step 1.1: in this embodiment, the data set consists of 70809 government files crawled by the inventors from provinces across the country. The table in each csv file is represented as a dictionary with five key-value pairs, whose keys are name_chineseall, header, row_attribute, column_attribute and allcsv_chinese: name_chineseall represents all Chinese characters in the original file name, header represents the table header of the original file, row_attribute represents all row attributes of the table in the original file, column_attribute represents all column attributes of the table in the original file, and allcsv_chinese represents all Chinese characters in the original file. For example, for the file 'Teng County primary and secondary school organization information_0.xls', the extracted name_chineseall is 'Teng County primary and secondary school organization information', header is 'number name of school in school number and school address graduation rate', row_attribute is 'Guangxi Teng County ancient Long Zhengu center school Teng County ……', column_attribute is '3802_ Teng County', and so on, and allcsv_chinese is the full Chinese text of the file;
Step 1.2: for an excel file containing n sheets, generating n temporary csv files, generating a feature dictionary for each temporary csv file by the method of step 1.1, and concatenating the values of the n feature dictionaries key by key to produce a total feature dictionary, which is the feature dictionary of the excel file. For example, the file 'Teng County primary and secondary school organization information_0.xls' contains 5 sheets of school information; dictionary features are extracted from each of the 5 sheets and then merged to form the dictionary features of the original file;
Step 1.3: storing all government files and their corresponding feature dictionaries in a json file in index order for convenient later use.
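Steps 1.1 to 1.3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how the header, row attributes and column attributes are located in a table, so the sketch assumes that the first row is the table header, the second row holds the column attributes, and the first cell of each later row is a row attribute. The function names (`chinese_only`, `feature_dict`, `merge_feature_dicts`, `to_json`) and the sample data are hypothetical.

```python
import csv
import io
import json
import re

CJK = re.compile(r"[\u4e00-\u9fff]+")  # basic CJK Unified Ideographs range


def chinese_only(text):
    """Return the Chinese characters of `text`, concatenated."""
    return "".join(CJK.findall(text))


def feature_dict(file_name, csv_text):
    """Build the five-key dictionary for one table (table layout is assumed)."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header = rows[0] if rows else []            # assumed: first row = header
    columns = rows[1] if len(rows) > 1 else []  # assumed: second row = columns
    body = rows[2:]
    return {
        "name_chineseall": chinese_only(file_name),
        "header": " ".join(header),
        "row_attribute": " ".join(r[0] for r in body),  # assumed: first cell
        "column_attribute": " ".join(columns),
        "allcsv_chinese": chinese_only(csv_text),
    }


def merge_feature_dicts(dicts):
    """Step 1.2: concatenate the values of several sheet dictionaries by key."""
    return {key: " ".join(d[key] for d in dicts) for key in dicts[0]}


def to_json(indexed_dicts):
    """Step 1.3: serialize the index -> feature dictionary mapping."""
    return json.dumps(indexed_dicts, ensure_ascii=False)
```

For an excel file, each sheet would first be exported to csv text, passed through `feature_dict`, and the results combined with `merge_feature_dicts`.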
Step 2: generating embedded vectors of semantic information from the text data in the feature dictionary by using the self-encoder in the pre-training model, capturing the semantic contribution relation between words, with the position embedding information of the words provided by a built-in function of the pre-training model Bert; and integrating the semantic information and the position embedding information to generate sentence vectors of the text data in the feature dictionary. One possible specific procedure is as follows:
Step 2.1: performing word segmentation on the values of the five keys in each feature dictionary, randomly masking 15% of the segmented text data t, and adding identifiers marking the beginning and the end of the sentence at the beginning and the end of the text data; each masked character is predicted from the unmasked characters on both sides of it, and the intermediate vector used for this prediction, which contains no position information, is the semantic information embedded vector of the masked character;
Step 2.2: generating position index embedding information for the position of each character in the values of the five keys processed in step 2.1, integrating it with the semantic information embedded vectors generated in step 2.1, and finally generating a sentence vector for the value of each of the five keys.
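The data flow of steps 2.1 and 2.2 can be illustrated with a toy Python sketch. It does not use a real pre-trained model: the embedding tables are random stand-ins, the delimiter tokens `[CLS]` and `[SEP]` and the `[MASK]` token follow the usual Bert convention, and "integrating" the two embeddings is assumed to mean element-wise addition, as in Bert's input layer. The names `mask_and_wrap` and `sentence_vectors` are hypothetical.

```python
import random


def mask_and_wrap(chars, rate=0.15, seed=0):
    """Mask about `rate` of the characters and add sentence delimiters."""
    rng = random.Random(seed)
    n_mask = min(len(chars), max(1, round(len(chars) * rate)))
    masked = set(rng.sample(range(len(chars)), n_mask))
    body = ["[MASK]" if i in masked else ch for i, ch in enumerate(chars)]
    return ["[CLS]"] + body + ["[SEP]"], sorted(masked)


def random_vector(rng, dim):
    """Stand-in for a learned embedding row."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sentence_vectors(tokens, dim=4, seed=0):
    """Add a token embedding and a position embedding for every token."""
    rng = random.Random(seed)
    token_table = {}  # one shared vector per distinct token
    vectors = []
    for token in tokens:
        if token not in token_table:
            token_table[token] = random_vector(rng, dim)
        position_vec = random_vector(rng, dim)  # one vector per position
        vectors.append(
            [t + p for t, p in zip(token_table[token], position_vec)]
        )
    return vectors
```

In a real system both tables would come from the pre-trained model's weights; the sketch only shows where masking, the delimiters and the position information enter.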
Step 3: training a regional named entity recognition model, model 1, on the sentence vectors obtained in step 2. One possible specific procedure is as follows:
Step 3.1: as shown in fig. 2, the sentence vector [c_1, c_2, c_3, c_4, c_5, c_6, …] generated in step 2 is passed through 4 Bi-LSTM layers, which capture the forward and backward dependencies [v_f0|v_b0, v_f1|v_b1, v_f2|v_b2, v_f3|v_b3, …] of the sentence vector at the same time and generate the hidden layer vector [h_1, h_2, h_3, h_4, …]. For example, the five keys of the dictionary features of the file "Teng County primary school organization information_0.xls" correspond to 5 sentence vectors, and each of the 5 sentence vectors is sent to the Bi-LSTM layers for feature extraction.
In the Bi-LSTM layers, the forward LSTM layer captures the long-distance dependencies from c_1 to c_i and the backward LSTM layer captures the long-distance dependencies from c_i to c_1, so that the dependencies of the sentence vector in both directions are captured simultaneously and the hidden layer vector [h_1, h_2, h_3, h_4, …] is generated. The LSTM has three gates in total to maintain and adjust the cell state: a forget gate, an input gate and an output gate. The cell state, forget gate, input gate and output gate are defined as follows:
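$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$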
Here the cell state represents the information retained from the embedded vector. The forget gate receives h_{t-1} and x_t and outputs a value f_t between 0 and 1 that determines how much information is forgotten; this value acts on the previous cell state c_{t-1}, where 1 means "completely retained" and 0 means "completely forgotten". The input gate receives h_{t-1} and x_t and outputs a value i_t between 0 and 1 that determines how much new information is retained, after which the cell state is updated to c_t. The output gate receives h_{t-1} and x_t and outputs a value o_t between 0 and 1, and h_t finally determines how much of the current state c_t is output. σ denotes the sigmoid function. Because the text data in government files contains much information unrelated to geographical features, the information-screening property of the LSTM makes it well suited to processing government files.
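The five gate equations can be traced with a small Python sketch. To keep it short it assumes a scalar input and a scalar state, so every weight is a single number; the weight names in the dictionary `w` mirror the subscripts of the equations as printed (including the reuse of the `xi` and `hi` weights in the cell update), and the helper names `lstm_step`, `run` and `bilstm` are hypothetical. A real Bi-LSTM uses weight matrices and stacks several layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with peephole terms, scalar input and scalar state."""
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])
    c = f * c_prev + i * math.tanh(w["xi"] * x + w["hi"] * h_prev + w["bc"])
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + w["bo"])
    h = o * math.tanh(c)
    return h, c


def run(seq, w):
    """Run the cell over a sequence, starting from a zero state."""
    h = c = 0.0
    outputs = []
    for x in seq:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)
    return outputs


def bilstm(seq, w_forward, w_backward):
    """Pair each forward hidden value with the backward one at that position."""
    forward = run(seq, w_forward)
    backward = run(seq[::-1], w_backward)[::-1]
    return list(zip(forward, backward))
```

With the forget gate saturated near 1 and the input gate near 0, `lstm_step` carries the previous cell state through almost unchanged, which is the "completely retained" case described above.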
Step 3.2: the hidden layer vector [h_1, h_2, h_3, h_4, …, h_j] is passed through a CRF layer, which outputs the most probable predicted label sequence that satisfies the label transition constraints; normalization with the softmax function yields the label probability sequence [p_1, p_2, p_3, p_4, …], whose largest probability value corresponds to the predicted regional entity, thereby obtaining the regional named entity recognition model 1.
Specifically, all possible label paths are scored by a CRF (conditional random field) layer using the label transition matrix, so as to output the predicted label sequence, where the sequence score of the CRF layer is defined as follows:
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} \left( E_{i, y_i} + T_{y_{i-1}, y_i} \right)$$
where $E_{i, y_i}$ and $T_{y_{i-1}, y_i}$ denote respectively the emission score and the transfer score of y_i in the label sequence [y_1, y_2, y_3, y_4, …, y_i, …, y_n], and summing over the whole sequence gives score(x, y).
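The path scoring can be sketched in Python as below. The sketch assumes that `emissions[i][k]` holds the emission score of label k at position i and `transitions[a][b]` the transfer score from label a to label b, and that the first label carries no transfer term; these layouts and the function names are assumptions. The best path is found by brute force over all label sequences for clarity; practical CRF layers use the Viterbi algorithm for this search.

```python
from itertools import product


def sequence_score(emissions, transitions, labels):
    """Sum the emission and transfer scores along one label path."""
    total = 0.0
    for i, label in enumerate(labels):
        total += emissions[i][label]
        if i > 0:  # assumed: no transfer term for the first label
            total += transitions[labels[i - 1]][label]
    return total


def best_sequence(emissions, transitions):
    """Score every possible label path and return the highest-scoring one."""
    n_labels = len(transitions)
    paths = product(range(n_labels), repeat=len(emissions))
    return max(
        paths, key=lambda path: sequence_score(emissions, transitions, path)
    )
```

A transition matrix with strongly negative entries for forbidden label pairs is how the transfer constraint mentioned above steers the search away from invalid sequences.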
Step 4: extracting the regions in all feature dictionaries with the regional named entity recognition model 1 trained in step 3, labeling them with the corresponding province labels according to the Chinese administrative region planning table, generating a training set and a test set, and training the region-province mapping to obtain a region-province mapping model, model 2. One possible specific procedure is as follows:
Step 4.1: extracting the values of the five keys in all feature dictionaries, extracting all regional entities contained in the text data with the regional named entity recognition model 1 trained in step 3, and at the same time traversing the Chinese administrative region planning table to find the province label of each regional entity, forming region-province key-value pairs {entity_n: province_n}, wherein entity_n represents a regional entity and province_n represents the corresponding province label;
Step 4.2: performing classification training on the region-province key-value pairs {entity_n: province_n} with the pre-training model Bert followed by one fully connected layer, thereby obtaining the region-province mapping model 2.
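The data preparation in step 4.1 can be sketched in Python as follows. `ADMIN_TABLE` is a three-entry stand-in for the Chinese administrative region planning table, and the function names, the 20% test share and the fixed seed are assumptions made for illustration. Training the Bert classifier of step 4.2 is outside the scope of the sketch.

```python
import random

# Tiny stand-in for the administrative region table: region -> province.
ADMIN_TABLE = {
    "Teng County": "Guangxi",
    "Wuzhou": "Guangxi",
    "Xi'an": "Shaanxi",
}


def build_pairs(entities, admin_table):
    """Form {entity: province} pairs, skipping entities not in the table."""
    return {e: admin_table[e] for e in entities if e in admin_table}


def split_pairs(pairs, test_ratio=0.2, seed=0):
    """Shuffle the pairs and split them into a training and a test list."""
    items = sorted(pairs.items())
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_ratio)) if items else 0
    return items[n_test:], items[:n_test]
```

The resulting (entity, province) pairs are what a classifier such as the one in step 4.2 would be trained and evaluated on.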
Step 5: performing province label classification on new excel and csv files with a step-by-step prediction mechanism, using model 1 and model 2. One possible specific procedure is as follows:
Step 5.1: generating a feature dictionary for the file to be predicted according to step 1, and passing the text data of each of the five keys in the feature dictionary through model 1 to generate a temporary dictionary {X: [entity_1, entity_2, …]}, wherein X represents a key and [entity_1, entity_2, …] are the regional entities extracted from the value corresponding to X; for example, the temporary dictionary generated for the file name is {name_chineseall: [entity_1, entity_2, …]}, where [entity_1, entity_2, …] are the regional entities extracted from the text content corresponding to name_chineseall.
Step 5.2: as shown in fig. 3, for a given file, prediction proceeds key by key, starting from the file name name_chineseall. The regional entities extracted from the file name are sent to model 2; if a prediction result exists, prediction ends and that result is taken as the final result. If the file name contains no regional entity, the header is sent to model 2 in the same way, followed in turn by the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese, stopping at the first key that yields a prediction result. If two or more different province labels are recognized at any of these steps, all keys are predicted and the province label that occurs most often across all of them is taken as the final result.
That is, a progressive prediction method is adopted for each file: the file name name_chineseall has the highest priority, followed in order by the header, the row attribute row_attribute, the column attribute column_attribute and the file content allcsv_chinese. If data of higher priority yields a prediction result, prediction ends immediately and that result is final; if several province labels appear at some level, the data of all levels are predicted and the label with the largest number of occurrences is taken as the prediction result; and if no level yields a result, the file is predicted as belonging to other areas.
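The progressive mechanism can be sketched in Python as follows. The sketch assumes that the regional entities for each key have already been extracted, and it takes the region-to-province mapping as a plain function `to_province` that returns `None` for unknown regions; in the method described here that role is played by model 2. The names `PRIORITY` and `predict_province` and the label string `"other"` are hypothetical.

```python
from collections import Counter

# Keys in decreasing priority, as listed in the text.
PRIORITY = [
    "name_chineseall",
    "header",
    "row_attribute",
    "column_attribute",
    "allcsv_chinese",
]


def predict_province(entities_by_key, to_province):
    """Return the province of a file from its per-key regional entities."""
    for key in PRIORITY:
        provinces = {to_province(e) for e in entities_by_key.get(key, [])}
        provinces.discard(None)
        if len(provinces) == 1:
            return provinces.pop()  # unambiguous result: stop here
        if len(provinces) > 1:
            # Conflict at this level: vote over the entities of all levels.
            votes = Counter(
                p
                for k in PRIORITY
                for p in map(to_province, entities_by_key.get(k, []))
                if p is not None
            )
            return votes.most_common(1)[0][0]
    return "other"  # no level contained a recognizable region
```

If two provinces tie in the vote, `Counter.most_common` returns the one encountered first, so the higher-priority key wins; the text does not specify a tie-break rule, so this is a design choice of the sketch.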
The experimental results of this example are as follows:
The accuracy (AUC) on the test set was stable at 0.9995.
The experimental results show that the pre-training-model-based method provided by the invention for classifying the attribution province of government csv and excel files extracts all region names from the text data by named entity recognition, generates a training set and a test set from each region and its corresponding province according to the Chinese administrative region planning table, and classifies the original files with a step-by-step prediction mechanism. The method classifies government csv and excel files effectively, avoids the problem of several provinces overlapping in the same file, and has high accuracy, small error, low computational complexity and high practical value.
In conclusion, the pre-training-model-based method for classifying the attribution province of government csv and excel files classifies such files effectively, avoids the problem of province overlap within the same file, and gives prediction results with high accuracy, small error, low computational complexity and high practical value.
While the foregoing describes illustrative embodiments of the present invention to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments. Various changes that fall within the spirit and scope of the present invention as defined by the appended claims are protected.

Claims (5)

1. A government affair file attribution province classification method based on a pre-training model, wherein the government affair file is a csv and/or excel file, characterized in that the classification method comprises the following steps:
Step 1: extracting features of the government file from the five dimensions of file name, table header, row attributes, column attributes and full text of the table to generate a corresponding feature dictionary;
Step 2: generating an embedded vector of semantic information from text data in a feature dictionary by utilizing a self-encoder in a pre-training model, capturing semantic contribution relation between words, and providing position embedded information of the words by a built-in function of the pre-training model Bert; integrating the semantic information and the position embedding information to generate sentence vectors of text data in the feature dictionary;
Step 3: training a region named entity recognition model, model 1, using the sentence vectors obtained in step 2, comprising:
Step 3.1: passing the sentence vectors [c_1, c_2, c_3, c_4, c_5, c_6, …] generated in step 2 through a 4-layer Bi-LSTM to generate hidden layer vectors [h_1, h_2, h_3, h_4, …], the hidden layer vectors capturing the forward and backward dependencies of the sentence vectors at the same time;
Step 3.2: passing the hidden layer vectors [h_1, h_2, h_3, h_4, …] through a CRF layer, which outputs the predicted label sequence that satisfies the label transition constraints, namely the most probable one, and normalizing with a softmax function to generate a label probability sequence [p_1, p_2, p_3, p_4, …], wherein the largest probability value in the label probability sequence corresponds to the predicted region entity, thereby obtaining the region named entity recognition model 1;
Step 4: extracting the regions in all feature dictionaries using the region named entity recognition model 1 trained in step 3, marking the corresponding province labels according to the Chinese administrative region planning table, and training the region-province mapping to obtain a region-province mapping model 2, comprising:
Step 4.1: extracting the values corresponding to the five keys in all feature dictionaries, extracting all region entities contained in the text data through the region named entity recognition model 1 trained in step 3, and at the same time traversing the Chinese administrative region planning table to find the province label corresponding to each region entity, forming region-province key-value pairs {entity_n: province_n}, wherein entity_n represents a region entity and province_n represents the corresponding province label;
Step 4.2: performing classification training on the region-province key-value pairs {entity_n: province_n} with the pre-training model Bert connected to one fully connected layer, thereby obtaining the region-province mapping model 2;
Step 5: classifying the province labels of new excel and csv files using model 1 and model 2, comprising:
Step 5.1: generating a feature dictionary for the file to be predicted according to step 1, and passing the text data corresponding to each of the five keys in the feature dictionary through model 1 to generate a temporary dictionary {X: [entity_1, entity_2, …]}, wherein X represents a key and [entity_1, entity_2, …] are the region entities extracted from the value corresponding to X;
Step 5.2: for the same file, starting from the file name name_chineseall, sending the file name to model 2 for prediction, and if a prediction result exists, ending the prediction and taking the prediction result directly as the final result; if the file name contains no region entity, sending the header head_attribute to model 2 for prediction, and if a prediction result exists, ending the prediction and taking the prediction result directly as the final result; if the header contains no region entity, sending the row attribute row_attribute to model 2 for prediction, and if a prediction result exists, ending the prediction and taking the prediction result directly as the final result; if the row attribute contains no region entity, sending the column attribute column_attribute to model 2 for prediction, and if a prediction result exists, ending the prediction and taking the prediction result directly as the final result; if the column attribute contains no region entity, sending the file content allcsv_chinese to model 2 for prediction, and if a prediction result exists, taking the prediction result directly as the final result; and if two or more different region labels are recognized in any of the above recognition steps, collecting the recognition results of all recognition steps and taking the region label that appears most frequently as the final recognition result.
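As an illustration of how the region-province key-value pairs of step 4.1 might be assembled, the sketch below traverses a small administrative table and looks up each recognized entity. The three-column row layout (province, city, county), the sample rows and the function name `build_region_province_pairs` are assumptions made for the example; the claim does not specify the schema of the Chinese administrative region planning table.

```python
def build_region_province_pairs(entities, division_rows):
    """Form {entity_n: province_n} pairs for the recognized region entities.

    division_rows: iterable of (province, city, county) tuples, a hypothetical
    layout for the administrative table.
    """
    index = {}
    for province, city, county in division_rows:
        # Every name on a row, including the province itself, maps to that
        # row's province; the first occurrence of a name wins.
        for name in (province, city, county):
            index.setdefault(name, province)
    # Entities that do not appear in the table are left out of the pairs.
    return {entity: index[entity] for entity in entities if entity in index}
```

The resulting pairs are the kind of labeled data on which step 4.2 trains the classifier; the dictionary itself is only the label source, not a replacement for model 2.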
2. The method for classifying government documents according to claim 1, wherein the step 1 comprises:
Step 1.1: representing the table in a csv file as a dictionary with five key-value pairs, the five keys being name_chineseall, head_attribute, row_attribute, column_attribute and allcsv_chinese, wherein name_chineseall represents all Chinese characters in the original file name, head_attribute represents the table header in the original file, row_attribute represents all row attributes in the table of the original file, column_attribute represents all column attributes in the table of the original file, and allcsv_chinese represents all Chinese content in the original file;
Step 1.2: generating n temporary csv files according to the number n of sheets contained in an excel file, generating a feature dictionary for each temporary csv file according to the method in step 1.1, and concatenating the corresponding values of the n feature dictionaries by key to generate a total feature dictionary, which is the feature dictionary corresponding to the excel file;
Step 1.3: storing all government affair files and their corresponding feature dictionaries in a json file in index order.
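The key-wise merge of step 1.2 and the json storage of step 1.3 can be sketched as below. The sketch assumes that each value in a feature dictionary is a plain string and that missing keys count as empty; the helper names `merge_feature_dicts` and `dump_features` are illustrative only.

```python
import json

# The five keys named in step 1.1.
KEYS = ("name_chineseall", "head_attribute", "row_attribute",
        "column_attribute", "allcsv_chinese")


def merge_feature_dicts(sheet_dicts):
    """Concatenate the per-sheet values key by key (step 1.2)."""
    return {key: "".join(d.get(key, "") for d in sheet_dicts) for key in KEYS}


def dump_features(files_to_features):
    """Serialize {file name: feature dictionary} as json text (step 1.3).

    ensure_ascii=False keeps Chinese characters readable in the output.
    """
    return json.dumps(files_to_features, ensure_ascii=False, indent=2)
```

Because Python dictionaries preserve insertion order, adding the files in index order is enough for the serialized json to follow that order.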
3. The method for classifying government documents according to claim 2, wherein the step 2 comprises:
Step 2.1: segmenting the values corresponding to the five keys in each feature dictionary into words, randomly masking 15% of the segmented text data t, adding identifiers marking the beginning and the end of a sentence at the beginning and the end of the text data, and predicting each masked character from the unmasked characters on both sides of it, wherein the intermediate vector used to predict the masked character, which contains no position information, is the semantic information embedding vector of that character;
Step 2.2: generating position index embedding information for the position of each character in the values corresponding to the five keys processed in step 2.1, integrating it with the semantic information embedding vectors generated in step 2.1, and finally generating the sentence vectors of the values corresponding to the five keys respectively.
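The masking in step 2.1 can be illustrated with a short sketch. It assumes the token strings `[CLS]`, `[SEP]` and `[MASK]` that BERT conventionally uses for the sentence-start, sentence-end and mask identifiers, and it simply replaces every selected position with the mask token. The function name `mask_tokens` and the rule of masking at least one token are assumptions of the example.

```python
import random


def mask_tokens(tokens, ratio=0.15, seed=None):
    """Mask roughly `ratio` of the tokens and add sentence identifiers.

    Returns the masked sequence and the sorted list of masked positions
    (indices into the original `tokens`, before [CLS] is prepended).
    """
    rng = random.Random(seed)
    # Mask at least one token of a non-empty input (an illustrative choice).
    n_mask = max(1, round(len(tokens) * ratio)) if tokens else 0
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else tok
              for i, tok in enumerate(tokens)]
    return ["[CLS]"] + masked + ["[SEP]"], sorted(positions)
```

The model is then trained to recover the tokens at the returned positions from the unmasked context on both sides.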
4. The pre-training model-based government document attribution province classification method according to claim 1, wherein in the step 3.1, a forward LSTM layer is used to capture the long-distance dependency from c_1 to c_i and a backward LSTM layer is used to capture the long-distance dependency from c_i to c_1, thereby capturing the forward and backward dependencies of the sentence vectors simultaneously and generating the hidden layer vectors [h_1, h_2, h_3, h_4, …]; the LSTM has three gates in total to maintain and adjust the cell state, namely a forget gate, an input gate and an output gate, wherein the cell state, forget gate, input gate and output gate are defined as follows:
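$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xi} x_t + W_{hi} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$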
Wherein the cell state represents the information retained from the embedded vector; the forget gate receives h_{t-1} and x_t and outputs a value f_t between 0 and 1 that determines how much information is to be forgotten, this value acting on the previous cell state c_{t-1}, where 1 represents "completely retained" and 0 represents "completely forgotten"; the input gate receives h_{t-1} and x_t and outputs through i_t a value between 0 and 1 that determines how much information is to be retained, after which the cell state is updated through c_t; the output gate receives h_{t-1} and x_t and outputs through o_t a value between 0 and 1, and finally h_t determines how much of the information in the current state c_t is output; σ represents the sigmoid function; since the text data in government files contains much information unrelated to geographical features, the information-screening property of the LSTM makes it well suited to processing government files.
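To make the gate equations concrete, the following sketch evaluates one LSTM time step for scalar inputs and states. It is a simplification for illustration: real layers use vectors and weight matrices. It also passes the weights of the tanh candidate term separately as `W_xc` and `W_hc`, which is an assumption of this example, whereas the claim writes that term with the input-gate subscripts. The names `PARAM_NAMES` and `lstm_step` are hypothetical.

```python
import math

# Scalar parameters of one LSTM cell, named after the subscripts in the claim.
PARAM_NAMES = ("W_xf", "W_hf", "W_cf", "b_f",
               "W_xi", "W_hi", "W_ci", "b_i",
               "W_xc", "W_hc", "b_c",
               "W_xo", "W_ho", "W_co", "b_o")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step; p maps each name in PARAM_NAMES to a float."""
    # Forget gate: how much of the previous cell state to keep.
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    # Input gate: how much new information to admit.
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    # Cell state update.
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    # Output gate looks at the updated cell state c_t.
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t
```

With all parameters set to zero every gate evaluates to 0.5, so half of the previous cell state is kept, which gives a quick sanity check of the update rule.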
5. The method for classifying government documents belonging to provinces based on pre-training models according to claim 1, wherein in the step 3.2, all possible paths are scored by the CRF layer using a sequence label transfer matrix so as to output the predicted label sequence, and the sequence score of the CRF layer is calculated as follows: for each y_i in the label sequence [y_1, y_2, y_3, y_4, …, y_i, …, y_n], an emission score and a transfer score are obtained, and these are summed over the whole sequence to give score(x, y).
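The sequence score described in claim 5 can be sketched as the sum of per-position emission scores and the transfer scores between adjacent labels. The sketch assumes the common convention that `emissions[i][y]` is the emission score of label y at position i and `transitions[a][b]` is the transfer score from label a to label b, and it omits start and end transitions. These conventions and the function names are assumptions of the example, and the brute-force search is for illustration only; practical CRF layers use dynamic programming.

```python
from itertools import product


def sequence_score(emissions, transitions, labels):
    """score(x, y): emission scores plus transfer scores along one label path."""
    score = 0.0
    for i, y in enumerate(labels):
        score += emissions[i][y]                      # emission score of y_i
        if i > 0:
            score += transitions[labels[i - 1]][y]    # transfer y_{i-1} -> y_i
    return score


def best_path(emissions, transitions):
    """Score every possible label path and return the highest-scoring one."""
    n_labels = len(transitions)
    paths = product(range(n_labels), repeat=len(emissions))
    return max(paths, key=lambda path: sequence_score(emissions, transitions, path))
```

Enumerating all paths is exponential in the sequence length, so this form is only suitable for checking small cases by hand.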
CN202111470389.2A 2021-12-03 2021-12-03 Government file attribution province classification method based on pre-training model Active CN114153803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470389.2A CN114153803B (en) 2021-12-03 2021-12-03 Government file attribution province classification method based on pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470389.2A CN114153803B (en) 2021-12-03 2021-12-03 Government file attribution province classification method based on pre-training model

Publications (2)

Publication Number Publication Date
CN114153803A CN114153803A (en) 2022-03-08
CN114153803B true CN114153803B (en) 2024-07-19

Family

ID=80452587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470389.2A Active CN114153803B (en) 2021-12-03 2021-12-03 Government file attribution province classification method based on pre-training model

Country Status (1)

Country Link
CN (1) CN114153803B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091007A (en) * 2020-03-23 2020-05-01 杭州有数金融信息服务有限公司 Method for identifying relationships among multiple enterprises based on public sentiment and enterprise portrait
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8407165B2 (en) * 2011-06-15 2013-03-26 Ceresis, Llc Method for parsing, searching and formatting of text input for visual mapping of knowledge information
CN111125365B (en) * 2019-12-24 2022-01-07 京东科技控股股份有限公司 Address data labeling method and device, electronic equipment and storage medium
CN113609859B (en) * 2021-08-04 2024-12-03 浙江工业大学 A Chinese named entity recognition method for special equipment based on pre-training model

Also Published As

Publication number Publication date
CN114153803A (en) 2022-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant