Disclosure of Invention
The embodiments of the present application aim to provide a text multi-label classification method, apparatus, computer device, and storage medium, so as to solve the technical problems of poor accuracy and stability of text multi-label classification in the related art.
In order to solve the above technical problems, an embodiment of the present application provides a text multi-label classification method, which adopts the following technical scheme:
acquiring an original text dataset, and preprocessing the original text dataset to obtain a preprocessed text dataset;
acquiring a classification label for each preprocessed text in the preprocessed text dataset, annotating the preprocessed text dataset according to the classification labels to obtain an annotated text dataset, and randomly dividing the annotated text dataset into a training set and a verification set;
constructing a low-rank adaptive classification model according to the classification labels, and inputting the training set into the low-rank adaptive classification model;
obtaining a semantic feature extraction result and a low-rank decomposition result of the training set through the low-rank adaptive classification model, and fusing the semantic feature extraction result and the low-rank decomposition result to obtain a classification prediction result;
fine-tuning the low-rank adaptive classification model based on the classification prediction result, and continuing iterative training until an iteration stop condition is met, so as to obtain a fine-tuned classification model;
inputting the verification set into the fine-tuned classification model for verification to obtain a verification result, and outputting the fine-tuned classification model as the final text label classification model when the verification result meets a preset threshold condition;
and acquiring a text to be classified, and inputting the text to be classified into the text label classification model for classification to obtain a text classification result.
Further, the step of acquiring a classification label for each preprocessed text in the preprocessed text dataset comprises:
extracting the keyword features of each preprocessed text in the preprocessed text dataset using a TF-IDF algorithm;
classifying the preprocessed texts according to the keyword features through a clustering algorithm to obtain different text categories;
and generating a classification label for each preprocessed text according to the keyword features of the cluster center of each text category.
Further, the low-rank adaptive classification model includes an input layer, a pre-trained ultra-large language model, a low-rank adaptation network, and a plurality of classifiers, and the step of obtaining a semantic feature extraction result and a low-rank decomposition result of the training set through the low-rank adaptive classification model, and fusing the semantic feature extraction result and the low-rank decomposition result to obtain a classification prediction result includes:
preprocessing the training set through the input layer to obtain input text data;
inputting the input text data into the pre-trained ultra-large language model for semantic feature extraction to obtain a text semantic representation vector;
performing low-rank decomposition on the input text data through the low-rank adaptation network to obtain a text low-rank adaptation matrix;
concatenating the text semantic representation vector and the text low-rank adaptation matrix to obtain text semantic features;
and inputting the text semantic features into the plurality of classifiers respectively for classification, so as to obtain a classification prediction result.
Further, the ultra-large language model includes an embedding layer and a multi-level feature extraction network, and the step of inputting the input text data into the pre-trained ultra-large language model for semantic feature extraction to obtain a text semantic representation vector includes:
performing vector conversion on the input text data through the embedding layer to obtain a text embedding vector;
and inputting the text embedding vector into the multi-level feature extraction network to extract features at different scales, so as to obtain a multi-scale fused text semantic representation vector.
Further, the low-rank adaptation network includes a vector conversion layer and a low-rank adaptation matrix corresponding to each classification label, and the step of performing low-rank decomposition on the input text data through the low-rank adaptation network to obtain a text low-rank adaptation matrix includes:
performing vector conversion on the text in the input text data through the vector conversion layer to obtain text word vectors;
inputting the text word vectors into each low-rank adaptation matrix respectively for matrix decomposition to obtain a text feature matrix corresponding to each classification label;
and concatenating all the text feature matrices to obtain the text low-rank adaptation matrix.
Further, the step of inputting the text semantic features into the plurality of classifiers respectively for classification to obtain a classification prediction result includes:
inputting the text semantic features into the plurality of classifiers respectively, wherein the classifiers include a classifier corresponding to each classification label;
performing classification prediction on the text semantic features through each classifier to obtain the prediction probability of each classification label;
and outputting the classification labels whose prediction probability is greater than or equal to a preset confidence threshold as the classification prediction result.
Further, the step of fine-tuning the low-rank adaptive classification model based on the classification prediction result includes:
calculating a loss value between the classification prediction result and the classification labels according to a preset loss function;
and adjusting parameters of the low-rank adaptation network and the classifiers through a back-propagation algorithm based on the loss value.
In order to solve the above technical problems, an embodiment of the present application further provides a text multi-label classification apparatus, which adopts the following technical scheme. The apparatus comprises:
a preprocessing module, configured to acquire an original text dataset and preprocess the original text dataset to obtain a preprocessed text dataset;
a labeling module, configured to acquire a classification label for each preprocessed text in the preprocessed text dataset, annotate the preprocessed text dataset according to the classification labels to obtain an annotated text dataset, and randomly divide the annotated text dataset into a training set and a verification set;
a construction module, configured to construct a low-rank adaptive classification model according to the classification labels and input the training set into the low-rank adaptive classification model;
a prediction module, configured to obtain a semantic feature extraction result and a low-rank decomposition result of the training set through the low-rank adaptive classification model, and fuse the semantic feature extraction result and the low-rank decomposition result to obtain a classification prediction result;
an iteration module, configured to fine-tune the low-rank adaptive classification model based on the classification prediction result and continue iterative training until an iteration stop condition is met, so as to obtain a fine-tuned classification model;
a verification module, configured to input the verification set into the fine-tuned classification model for verification to obtain a verification result, and output the fine-tuned classification model as the final text label classification model when the verification result meets a preset threshold condition;
and a classification module, configured to acquire a text to be classified and input the text to be classified into the text label classification model for classification to obtain a text classification result.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
The computer device comprises a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the text multi-label classification method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
The computer readable storage medium has computer readable instructions stored thereon which, when executed by a processor, implement the steps of the text multi-label classification method described above.
Compared with the prior art, the application has the following main beneficial effects:
The application provides a text multi-label classification method. Preprocessing the original text dataset improves data quality and model generalization. Acquiring classification labels for the preprocessed texts and annotating them enhances the expressiveness of the classification features and helps capture the correlations among different classification labels, enabling accurate multi-label classification. Inputting the training set into the low-rank adaptive classification model yields a semantic feature extraction result and a low-rank decomposition result, which are fused to obtain the classification prediction result. This makes full use of the model's semantic representation capability to capture deep semantic information of the text, while the low-rank decomposition identifies the key information of the text with fewer parameters and captures the correlations among different labels, improving the parameter efficiency of the model as well as the accuracy and stability of text multi-label classification.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", and the like in the description, the claims, or the above drawings are used to distinguish between different objects and not necessarily to describe a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, and a server 103, where the terminal device 101 may be a notebook 1011, a tablet 1012, or a cell phone 1013. Network 102 is the medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices that have a display screen and support web browsing. Besides the notebook 1011, the tablet 1012, or the mobile phone 1013, the terminal device 101 may be an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, a desktop computer, or the like.
The server 103 may be a server providing various services, such as a background server providing support for pages displayed on the terminal device 101.
It should be noted that, the text multi-label classification method provided by the embodiment of the application is generally executed by a server/terminal device, and correspondingly, the text multi-label classification device is generally arranged in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow chart of one embodiment of a text multi-label classification method according to the present application is shown, comprising the steps of:
step S201, an original text data set is acquired, and the original text data set is preprocessed, so that a preprocessed text data set is obtained.
A data acquisition request is initiated according to the service requirement, and a large amount of service data in the corresponding service scenario is acquired according to the request to form the original text dataset. Sources of raw text data include, but are not limited to, financial databases and third-party research institutions, internal data of financial institutions, financial media and news websites, regulatory agencies, and the like.
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the text multi-label classification method operates may initiate the data acquisition request through a wired or wireless connection. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G, WiFi, Bluetooth, WiMAX, ZigBee, UWB (ultra wideband), and other now known or later developed wireless connections.
The obtained original text dataset is preprocessed. The preprocessing operations include cleaning, normalization, and word segmentation. Cleaning includes removing irrelevant characters and special symbols, removing stop words, spell checking, grammar correction, removing duplicate content, and filling or deleting missing values; normalization scales data values into the range [0,1] using a normalization method. Specifically, the original text data in the original text dataset is cleaned to obtain a cleaned text dataset, the cleaned text dataset is normalized to obtain a normalized text dataset, and the normalized text dataset is segmented through a word segmentation technique to obtain a preprocessed text dataset consisting of word segmentation units.
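By way of illustration only, the following is a minimal sketch of such a preprocessing pipeline, assuming jieba as the word segmentation library and a hypothetical stop-word list; spell checking, grammar correction, missing-value handling, and the [0,1] normalization of numeric fields are omitted for brevity:

```python
import re

import jieba  # assumed Chinese word-segmentation library

STOP_WORDS = {"的", "了", "和"}  # hypothetical stop-word list

def preprocess(raw_texts):
    """Clean, deduplicate and segment raw texts into word units."""
    seen, processed = set(), []
    for text in raw_texts:
        text = re.sub(r"[^\w\s]", "", text).strip().lower()  # strip special symbols
        if not text or text in seen:                         # drop empty / duplicate texts
            continue
        seen.add(text)
        tokens = [t for t in jieba.lcut(text) if t not in STOP_WORDS]
        processed.append(tokens)
    return processed
```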
Step S202, a classification label for each preprocessed text in the preprocessed text dataset is acquired, the preprocessed text dataset is annotated according to the classification labels to obtain an annotated text dataset, and the annotated text dataset is randomly divided into a training set and a verification set.
In this embodiment, the preprocessed text dataset is classified to obtain the classification label corresponding to each preprocessed text, and a professional text annotation tool such as LabelImg or Labelme is used to attach the classification labels to the text data in the preprocessed text dataset, so as to obtain an annotated text dataset containing the text data and the classification labels.
In some alternative implementations, the step of acquiring a classification label for each preprocessed text in the preprocessed text dataset includes:
extracting the keyword features of each preprocessed text in the preprocessed text dataset using a TF-IDF algorithm;
classifying the preprocessed texts according to the keyword features through a clustering algorithm to obtain different text categories;
and generating a classification label for each preprocessed text according to the keyword features of the cluster center of each text category.
Specifically, the TF-IDF algorithm is used to compute, for each word segmentation unit in the preprocessed text dataset, the ratio of its frequency within the text it belongs to against its frequency across all texts, i.e., its importance weight, and the word segmentation units whose importance weight is greater than or equal to a preset weight threshold are selected as the keyword features of the corresponding preprocessed text. Then, a K-means clustering algorithm determines K initial cluster centers from the keyword features, traverses all keyword features, assigns each to its corresponding group, and updates each group's cluster center to obtain its actual center. The distances between the keyword features and each actual cluster center are computed repeatedly and the cluster centers are continuously updated until the groups no longer change, yielding different keyword categories. For each preprocessed text, similarity scores between all of its keyword features and each keyword category are calculated, and every keyword category whose similarity score is greater than or equal to a preset similarity threshold is taken as a text category to which the preprocessed text belongs. Finally, a classification label is automatically generated for each text category from the keyword features of its cluster center, thereby producing the classification labels of each preprocessed text; a single preprocessed text may correspond to multiple classification labels.
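A minimal sketch of this label-generation step using scikit-learn is given below. It assumes the texts are whitespace-joined segmented strings; the cluster count k, the number of keywords per label, and the hard cluster assignment (in place of the similarity-threshold multi-label assignment described above) are simplifying assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def generate_labels(texts, k=10, top_n=3):
    """Cluster texts by TF-IDF keyword features and name each cluster
    by the top keywords of its center."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)        # importance weight of each word
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    terms = vectorizer.get_feature_names_out()
    labels = []
    for center in km.cluster_centers_:             # keyword features of each center
        top = np.argsort(center)[::-1][:top_n]
        labels.append("_".join(terms[i] for i in top))
    return labels, km.labels_                      # label names, per-text assignment
```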
By acquiring the classification labels of the preprocessed text data set, the expression capability of classification features can be enhanced, so that the model can better capture key information of different classification labels, the generalization capability and training speed of the model can be improved, and the precision, robustness and efficiency of text classification can be improved.
In this embodiment, the annotated text dataset is randomly divided into a training set and a verification set according to a preset ratio, which can be set according to actual needs; for example, a preset ratio of 8:2 means the training set and the verification set account for 80% and 20% of the data respectively.
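As a one-line sketch of this split, assuming scikit-learn and placeholder variables texts and labels holding the annotated dataset:

```python
from sklearn.model_selection import train_test_split

# texts and labels are the annotated dataset from the previous step
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # 8:2 train/verification split
```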
It is emphasized that to further ensure the privacy and security of the annotated text dataset, the annotated text dataset may also be stored in a node of a blockchain.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S203, a low-rank adaptive classification model is constructed according to the classification labels, and the training set is input into the low-rank adaptive classification model.
In this embodiment, all classification labels are obtained, and a low-rank adaptation matrix corresponding to each classification label is constructed through feature extraction and dimensionality-reduction algorithms. All low-rank adaptation matrices are integrated to obtain a low-rank adaptation network, and a classifier corresponding to each classification label is constructed according to the classification labels. A pre-trained ultra-large language model is obtained and an input layer is constructed; the input layer, the pre-trained ultra-large language model, the low-rank adaptation network, and all classifiers are integrated to obtain the low-rank adaptive classification model, which is then trained with the training set.
In order to better adapt to the text multi-label classification task, a low-rank adaptation network is introduced for fine-tuning, and a low-rank adaptation matrix is constructed for each classification label. Each low-rank adaptation matrix comprises two low-rank matrices B and A, where B ∈ R^(M×N), A ∈ R^(N×M), N is the rank of the two low-rank matrices, and the rank N is smaller than M.
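The following is a minimal PyTorch sketch of one such per-label low-rank pair, following the B ∈ R^(M×N), A ∈ R^(N×M) shapes above; the feature dimension m and the rank n are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """One per-label low-rank adaptation pair: B ∈ R^(M×N), A ∈ R^(N×M)."""
    def __init__(self, m=768, n=8):                      # rank n smaller than m
        super().__init__()
        self.B = nn.Parameter(torch.zeros(m, n))         # zero-initialised, so the
        self.A = nn.Parameter(torch.randn(n, m) * 0.01)  # update starts near zero

    def forward(self, x):                                # x: (batch, m) features
        return x @ self.B @ self.A                       # low-rank update x·B·A
```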
In this embodiment, introducing the low-rank adaptation network lets the model capture the correlations among multiple labels while reducing the number of parameters, and supports subsequent rapid fine-tuning to produce the text classification model.
Step S204, a semantic feature extraction result and a low-rank decomposition result of the training set are obtained through the low-rank adaptive classification model, and a classification prediction result is obtained by fusing the semantic feature extraction result and the low-rank decomposition result.
The low-rank adaptive classification model comprises an input layer, a pre-trained ultra-large language model, a low-rank adaptation network, and a plurality of classifiers. The training set is input into the low-rank adaptive classification model and preprocessed through the input layer to obtain input text in a data format suitable for model processing. The input text is fed into the pre-trained ultra-large language model and the low-rank adaptation network respectively, the results produced by the two are added to obtain the final semantic result, and the final semantic result is input into the classifiers of the different classification labels for classification training to obtain a classification prediction result.
Specifically, the training set is preprocessed through the input layer to obtain input text data; the input text data is input into the pre-trained ultra-large language model for semantic feature extraction to obtain a text semantic representation vector; low-rank decomposition is performed on the input text data through the low-rank adaptation network to obtain a text low-rank adaptation matrix; the text semantic representation vector and the text low-rank adaptation matrix are concatenated to obtain text semantic features; and the text semantic features are input into the plurality of classifiers respectively for classification to obtain the classification prediction result.
In this embodiment, performing text classification by training the low-rank adaptive classification model makes full use of the ultra-large language model's strength in semantic understanding to better capture the information in the text data, while the low-rank adaptation network captures the key features of different classification labels, distinguishing and capturing the correlations among different labels and improving the accuracy and stability of multi-label classification.
In some optional implementations, the ultra-large language model includes an embedding layer and a multi-level feature extraction network, and the step of inputting the input text data into the pre-trained ultra-large language model for semantic feature extraction to obtain the text semantic representation vector includes:
performing vector conversion on the input text data through the embedding layer to obtain a text embedding vector;
and inputting the text embedding vector into the multi-level feature extraction network to extract features at different scales, so as to obtain a multi-scale fused text semantic representation vector.
In this embodiment, the vector conversion of the embedding layer includes position embedding, word embedding, and character embedding of the input text data. Position embedding provides information about the position of words in sentences, which helps in understanding sentence grammar and context; word embedding captures the semantic information of words so that the model can understand their meaning in context; and character embedding captures character-level information, which helps the model handle unknown words or spelling variants and improves its ability to process complex data. The text embedding vector is a concatenation of the position embedding vector, the word embedding vector, and the character embedding vector.
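A minimal sketch of such an embedding layer in PyTorch follows; the vocabulary sizes, embedding dimension, and the simplification of one representative character id per token are assumptions:

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """Concatenate position, word and character embeddings per token."""
    def __init__(self, vocab=30000, chars=6000, max_len=512, dim=128):
        super().__init__()
        self.word = nn.Embedding(vocab, dim)
        self.char = nn.Embedding(chars, dim)   # one representative char id per token
        self.pos = nn.Embedding(max_len, dim)

    def forward(self, word_ids, char_ids):     # both: (batch, seq_len)
        pos_ids = torch.arange(word_ids.size(1), device=word_ids.device)
        pos_emb = self.pos(pos_ids).unsqueeze(0).expand(word_ids.size(0), -1, -1)
        # concatenation of position, word and character embedding vectors
        return torch.cat([pos_emb, self.word(word_ids), self.char(char_ids)], dim=-1)
```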
The multi-level feature extraction network may adopt an encoder structure, an encoder-decoder structure, or a decoder structure to extract features at different scales.
When the multi-level feature extraction network adopts an encoder structure, the structure is formed by stacking multiple encoder layers, which are divided into a bottom-layer feature unit, a middle-layer feature unit, and a high-layer feature unit. Surface features are extracted from the text embedding vector by the bottom-layer feature unit and fed into both the middle-layer and high-layer feature units. The middle-layer feature unit extracts syntactic features from the surface features and passes them to the high-layer feature unit, which extracts semantic features from the surface and syntactic features. The surface, syntactic, and semantic features are then concatenated to obtain the text semantic representation vector.
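For the encoder-structure case, the following sketch with Hugging Face Transformers concatenates hidden states from a lower, a middle, and the top encoder layer as the surface, syntactic, and semantic features described above; the model name and the chosen layer indices are assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

enc = tok("待分类的示例文本", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).hidden_states                 # per-layer hidden states
# surface (lower), syntactic (middle) and semantic (top) [CLS] features
semantic_vector = torch.cat(
    [hidden[2][:, 0], hidden[6][:, 0], hidden[12][:, 0]], dim=-1)
```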
When the multi-level feature extraction network adopts an encoder-decoder structure, the structure comprises multiple stacked encoders and multiple stacked decoders. The semantic and contextual features of the text embedding vector are extracted through the encoder's multi-head attention mechanism to obtain text encoding features, which are input into the decoder; the global dependencies among the text encoding features are captured through multi-head attention and cross-attention mechanisms to obtain the text semantic representation vector. The encoder and decoder handle text features at different scales through the multi-head attention mechanism.
Performing semantic feature extraction through the ultra-large language model better captures the deep semantic information of the text and identifies the classification features in the text data, which in turn allows label correlations to be captured effectively and improves the efficiency and accuracy of text classification.
In some optional implementations, the low-rank adaptation network includes a vector conversion layer and a low-rank adaptation matrix corresponding to each classification label, and the step of performing low-rank decomposition on the input text data through the low-rank adaptation network to obtain the text low-rank adaptation matrix includes:
performing vector conversion on the text in the input text data through the vector conversion layer to obtain text word vectors;
inputting the text word vectors into each low-rank adaptation matrix respectively for matrix decomposition to obtain a text feature matrix corresponding to each classification label;
and concatenating all the text feature matrices to obtain the text low-rank adaptation matrix.
In this embodiment, the input text data is converted into text word vectors, which are input into each low-rank adaptation matrix respectively. The text word vectors are decomposed by singular value decomposition (SVD), the key features of each classification label are captured, and the text low-rank adaptation matrix is formed.
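A minimal sketch of the per-label SVD step in PyTorch; the rank n is an assumption:

```python
import torch

def low_rank_approximation(word_vectors, n=8):
    """Keep the top-n singular components of a (seq_len, dim) word-vector
    matrix, capturing the key features for one classification label."""
    U, S, Vh = torch.linalg.svd(word_vectors, full_matrices=False)
    return U[:, :n] @ torch.diag(S[:n]) @ Vh[:n, :]   # rank-n approximation
```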
Performing low-rank decomposition through the low-rank adaptation network identifies the distinctions and correlations among different classification labels while reducing the number of parameters that need to be adjusted, thereby improving model training efficiency and classification performance.
The text low-rank adaptation matrix and the text semantic representation vector are fused by element-wise addition or multiplication to obtain the final text semantic features, which are then input into the classifiers of the different classification labels for classification prediction.
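Below is a minimal sketch of this fusion step together with the per-label classifier heads, using additive fusion (multiplication works analogously); the label count, feature dimension, and two-class softmax heads are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_labels, dim = 10, 768                                 # assumed sizes
heads = nn.ModuleList(nn.Linear(dim, 2) for _ in range(num_labels))

def predict(semantic_vec, low_rank_vec):
    """Fuse the two representations by addition, then score every label."""
    fused = semantic_vec + low_rank_vec                   # additive fusion
    # softmax over {absent, present}; keep the "present" probability per label
    return [torch.softmax(h(fused), dim=-1)[..., 1] for h in heads]
```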
In this embodiment, the step of inputting the text semantic features into the plurality of classifiers respectively for classification to obtain a classification prediction result includes:
inputting the text semantic features into the plurality of classifiers respectively, wherein the classifiers include a classifier corresponding to each classification label;
performing classification prediction on the text semantic features through each classifier to obtain the prediction probability of each classification label;
and outputting the classification labels whose prediction probability is greater than or equal to a preset confidence threshold as the classification prediction result.
In this embodiment, each classifier may adopt a fully connected layer that maps the text semantic features to its corresponding classification label through a Softmax activation function, yielding a classification probability distribution for each classification label, from which the prediction probability of each classification label is obtained.
The prediction probability is generally expressed as a prediction confidence; for example, if a classifier predicts its corresponding classification label with probability 0.9, then 0.9 can be regarded as the classifier's confidence in that prediction.
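A sketch of the confidence-threshold output step; the per-label probabilities and the 0.5 threshold are hypothetical values:

```python
# hypothetical per-label prediction confidences from the classifiers
probs = [0.92, 0.13, 0.78, 0.40]
threshold = 0.5                                   # assumed confidence threshold
predicted = [i for i, p in enumerate(probs) if p >= threshold]
print(predicted)                                  # -> [0, 2]
```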
In some optional implementations of this embodiment, the classifier may adopt a support vector machine (SVM). The SVM separates texts of different classification labels by finding an optimal decision hyperplane: the text semantic features are input into the SVM classifier, and the classification label to which the text belongs is determined from the position relative to the decision hyperplane, completing the text classification task.
Through training the classifier, the text multi-label classification task can be completed more quickly, the processing efficiency is improved, and meanwhile, the accuracy of text classification can be improved under the condition of reducing misjudgment and missed judgment.
Step S205, the low-rank adaptive classification model is fine-tuned based on the classification prediction result, and iterative training continues until an iteration stop condition is met, thereby obtaining a fine-tuned classification model.
A loss value between the classification prediction result and the classification labels is calculated according to a preset loss function, and the parameters of the low-rank adaptation network and the classifiers are adjusted through a back-propagation algorithm based on the loss value.
In this embodiment, the preset loss function may be a cross-entropy loss function, which measures the difference between the model's prediction and the true labels. The loss value between the classification prediction result and the classification labels is calculated, the gradient of the loss is propagated back to the model parameters through the back-propagation algorithm, and the parameters of the low-rank adaptation network and the classifiers are adjusted, realizing the fine-tuning of the model. The fine-tuned model continues iterative training until the iteration stop condition is met, yielding a converged fine-tuned classification model. The iteration stop condition is that a preset maximum number of iterations is reached, or that the current loss value no longer decreases compared with the previous round.
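A minimal PyTorch sketch of this fine-tuning loop follows, under the assumptions that the frozen ultra-large language model is represented by pre-computed features, that BCEWithLogitsLoss stands in as the multi-label form of the cross-entropy loss, and that the optimizer, learning rate, and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AdapterWithHeads(nn.Module):
    """Trainable part of the model: low-rank adapter plus classifier heads.
    The ultra-large language model itself stays frozen and only supplies
    the input features."""
    def __init__(self, dim=768, rank=8, num_labels=10):
        super().__init__()
        self.B = nn.Parameter(torch.zeros(dim, rank))
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.heads = nn.Linear(dim, num_labels)          # one logit per label

    def forward(self, x):                                # x: frozen-LLM features
        return self.heads(x + x @ self.B @ self.A)

def fine_tune(features, targets, max_epochs=100):
    model = AdapterWithHeads()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()                     # multi-label cross entropy
    prev = float("inf")
    for _ in range(max_epochs):                          # stop at max iterations, or
        loss = loss_fn(model(features), targets.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() >= prev:                          # when loss stops decreasing
            break
        prev = loss.item()
    return model
```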
In some alternative implementations of this embodiment, the model's overall recognition behavior can be adjusted effectively by adding or subtracting the low-rank matrices.
Since further training and fine-tuning of the ultra-large language model would require a large amount of computing resources, this embodiment freezes its parameters and adjusts only the parameters of the low-rank adaptation network and the classifiers. Fine-tuning in this way better adapts the model to the multi-label classification task while reducing training time and computing resources, lessening the dependence on large amounts of annotated data, and enhancing the generalization capability of the model.
Step S206, the verification set is input into the fine-tuned classification model for verification to obtain a verification result, and the fine-tuned classification model is output as the final text label classification model when the verification result meets a preset threshold condition.
The verification set is verified through the fine-tuned classification model to obtain a verification classification result, and the classification accuracy between the verification classification result and the corresponding true classification labels is calculated as the verification result. When the classification accuracy is greater than or equal to a preset threshold, the verification result meets the preset threshold condition, and the current fine-tuned classification model is output as the final text label classification model. When the classification accuracy is smaller than the preset threshold, the verification result does not meet the preset threshold condition, and the process returns to step S204 to retrain the low-rank adaptive classification model.
Step S207, a text to be classified is acquired and input into the text label classification model for classification to obtain a text classification result.
Multi-label text classification is performed on the text to be classified using the trained text label classification model to obtain a text classification result, which includes the classification labels to which the text to be classified belongs.
In this embodiment, the text label classification model is periodically adjusted and optimized so as to obtain a model with stronger generalization capability and robustness, adapting to complex and changeable financial business scenarios.
In summary, preprocessing the original text dataset improves data quality and model generalization; acquiring classification labels for the preprocessed texts and annotating them enhances the expressiveness of the classification features and helps capture the correlations among different classification labels, enabling accurate multi-label classification; and inputting the training set into the low-rank adaptive classification model and fusing the semantic feature extraction result with the low-rank decomposition result to obtain the classification prediction result makes full use of the model's semantic representation capability to capture deep semantic information of the text, while the low-rank decomposition identifies the key information of the text with fewer parameters and captures the correlations among different labels, improving the parameter efficiency of the model as well as the accuracy and stability of text multi-label classification.
The embodiments of the application can acquire and process the relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by computer readable instructions stored in a computer readable storage medium which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text multi-label classification apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 3, the text multi-label classification device 300 according to the present embodiment includes a preprocessing module 301, a labeling module 302, a construction module 303, a prediction module 304, an iteration module 305, a verification module 306, and a classification module 307. Wherein:
The preprocessing module 301 is configured to acquire an original text dataset and preprocess the original text dataset to obtain a preprocessed text dataset;
the labeling module 302 is configured to acquire a classification label for each preprocessed text in the preprocessed text dataset, annotate the preprocessed text dataset according to the classification labels to obtain an annotated text dataset, and randomly divide the annotated text dataset into a training set and a verification set;
the construction module 303 is configured to construct a low-rank adaptive classification model according to the classification labels and input the training set into the low-rank adaptive classification model;
the prediction module 304 is configured to obtain a semantic feature extraction result and a low-rank decomposition result of the training set through the low-rank adaptive classification model, and fuse the semantic feature extraction result and the low-rank decomposition result to obtain a classification prediction result;
the iteration module 305 is configured to fine-tune the low-rank adaptive classification model based on the classification prediction result and continue iterative training until an iteration stop condition is met, so as to obtain a fine-tuned classification model;
the verification module 306 is configured to input the verification set into the fine-tuned classification model for verification to obtain a verification result, and output the fine-tuned classification model as the final text label classification model when the verification result meets a preset threshold condition;
the classification module 307 is configured to acquire a text to be classified and input the text to be classified into the text label classification model for classification to obtain a text classification result.
It is emphasized that to further ensure the privacy and security of the annotated text dataset, the annotated text dataset may also be stored in a node of a blockchain.
The text multi-label classification device 300 improves data quality and model generalization by preprocessing the original text dataset; enhances the expressiveness of the classification features by acquiring and annotating the classification labels of the preprocessed texts, helping capture the correlations among different classification labels for accurate multi-label classification; and, by inputting the training set into the low-rank adaptive classification model and fusing the semantic feature extraction result with the low-rank decomposition result to obtain the classification prediction result, makes full use of the model's semantic representation capability to capture deep semantic information of the text, while the low-rank decomposition identifies the key information of the text with fewer parameters and captures the correlations among different labels, improving the parameter efficiency of the model as well as the accuracy and stability of text multi-label classification.
In some alternative implementations of this embodiment, the labeling module 302 is further configured to:
extract the keyword features of each preprocessed text in the preprocessed text dataset using a TF-IDF algorithm;
classify the preprocessed texts according to the keyword features through a clustering algorithm to obtain different text categories;
and generate a classification label for each preprocessed text according to the keyword features of the cluster center of each text category.
By acquiring the classification labels of the preprocessed text data set, the expression capability of classification features can be enhanced, so that the model can better capture key information of different classification labels, the generalization capability and training speed of the model can be improved, and the precision, robustness and efficiency of text classification can be improved.
In some alternative implementations, the low-rank adaptive classification model includes an input layer, a pre-trained ultra-large language model, a low-rank adaptation network, and a plurality of classifiers, and the prediction module 304 includes:
an input sub-module, configured to preprocess the training set through the input layer to obtain input text data;
a semantic feature extraction sub-module, configured to input the input text data into the pre-trained ultra-large language model for semantic feature extraction to obtain a text semantic representation vector;
a low-rank adaptation sub-module, configured to perform low-rank decomposition on the input text data through the low-rank adaptation network to obtain a text low-rank adaptation matrix;
a concatenation sub-module, configured to concatenate the text semantic representation vector and the text low-rank adaptation matrix to obtain text semantic features;
and a classification sub-module, configured to input the text semantic features into the plurality of classifiers respectively for classification to obtain a classification prediction result.
The text classification is carried out by training the low-rank adaptive classification model, the advantages of the ultra-large language model in terms of semantic understanding can be fully utilized, information in text data can be captured better, key features of different classification labels are captured through the low-rank adaptive network, so that the relevance between different labels is distinguished and captured, and the accuracy and stability of multi-label classification are improved.
In some optional implementations of this embodiment, the ultra-large language model includes an embedding layer and a multi-level feature extraction network, and the semantic feature extraction sub-module is further configured to:
perform vector conversion on the input text data through the embedding layer to obtain a text embedding vector;
and input the text embedding vector into the multi-level feature extraction network to extract features at different scales, so as to obtain a multi-scale fused text semantic representation vector.
The semantic feature extraction is carried out through the ultra-large language model, so that deep semantic information of the text can be captured better, classification features in the text data are identified, further, label relevance can be captured effectively, and the efficiency and accuracy of text classification are improved.
In some optional implementations of this embodiment, the low-rank adaptation network includes a vector conversion layer and a low-rank adaptation matrix corresponding to each classification label, and the low-rank adaptation sub-module is further configured to:
perform vector conversion on the text in the input text data through the vector conversion layer to obtain text word vectors;
input the text word vectors into each low-rank adaptation matrix respectively for matrix decomposition to obtain a text feature matrix corresponding to each classification label;
and concatenate all the text feature matrices to obtain the text low-rank adaptation matrix.
The low-rank decomposition is carried out through the low-rank adaptation network, so that the distinction and the relevance of different classification labels can be identified, and meanwhile, the number of parameters needing to be adjusted is reduced, so that the model training efficiency and the classification effect are improved.
In some optional implementations of this embodiment, the classification sub-module is further configured to:
input the text semantic features into the plurality of classifiers respectively, wherein the classifiers include a classifier corresponding to each classification label;
perform classification prediction on the text semantic features through each classifier to obtain the prediction probability of each classification label;
and output the classification labels whose prediction probability is greater than or equal to a preset confidence threshold as the classification prediction result.
Through training the classifier, the text multi-label classification task can be completed more quickly, the processing efficiency is improved, and meanwhile, the accuracy of text classification can be improved under the condition of reducing misjudgment and missed judgment.
In some alternative implementations of the present embodiment, the iteration module 305 includes:
a loss calculation sub-module, configured to calculate a loss value between the classification prediction result and the classification labels according to a preset loss function;
and an adjustment sub-module, configured to adjust the parameters of the low-rank adaptation network and the classifiers through a back-propagation algorithm based on the loss value.
Through fine tuning, the model can be better adapted to multi-label classification tasks, training time and calculation resources are reduced, dependence on a large amount of marked data is reduced, and generalization capability of the model is enhanced.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the memory 41, the processor 42, and the network interface 43 is shown in the figure, but it should be understood that not all of the illustrated components are required and that more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store the operating system and various application software installed on the computer device 4, such as the computer readable instructions of the text multi-label classification method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the text multi-label classification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
By preprocessing the original text dataset, data quality and model generalization are improved; by acquiring and annotating the classification labels of the preprocessed texts, the expressiveness of the classification features is enhanced, which helps capture the correlations among different classification labels for accurate multi-label classification; and by inputting the training set into the low-rank adaptive classification model and fusing the semantic feature extraction result with the low-rank decomposition result to obtain the classification prediction result, the model's semantic representation capability is fully utilized to capture deep semantic information of the text, while the low-rank decomposition identifies the key information of the text with fewer parameters and captures the correlations among different labels, improving the parameter efficiency of the model as well as the accuracy and stability of text multi-label classification.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the text multi-label classification method as described above.
By preprocessing the original text dataset, data quality and model generalization are improved; by acquiring and annotating the classification labels of the preprocessed texts, the expressiveness of the classification features is enhanced, which helps capture the correlations among different classification labels for accurate multi-label classification; and by inputting the training set into the low-rank adaptive classification model and fusing the semantic feature extraction result with the low-rank decomposition result to obtain the classification prediction result, the model's semantic representation capability is fully utilized to capture deep semantic information of the text, while the low-rank decomposition identifies the key information of the text with fewer parameters and captures the correlations among different labels, improving the parameter efficiency of the model as well as the accuracy and stability of text multi-label classification.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described method embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware, though in many cases the former is preferable. Based on this understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods of the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application; the preferred embodiments shown in the drawings do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application is thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. All equivalent structures made based on the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.