Disclosure of Invention
Embodiments of the present application provide a text classification method, a text classification apparatus, an electronic device and a storage medium, which can reduce the influence of the value of K on the classification result, mitigate the problem of inaccurate classification caused by differing data distributions or unbalanced data, and improve the accuracy of text classification.
In a first aspect, an embodiment of the present application provides a text classification method, including:
obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category and comprises feature vectors of text data belonging to the same category;
for each text set, obtaining K nearest feature vectors of the target feature vector from each text set, obtaining an aggregation degree between the target feature vector and each text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result of the aggregation degree and the category aggregation degree of each text set; the aggregation degree represents the density between the target feature vector and the K nearest feature vectors, and the category aggregation degree represents the density of the feature vector distribution within the same text set;
determining a target text set from the at least two text sets based on a comparison result corresponding to each text set;
and determining the category of the target text set as the category of the text to be processed.
Optionally, the obtaining K nearest feature vectors of the target feature vector from each text set specifically includes:
obtaining the similarity between each feature vector in each text set and the target feature vector;
and determining, in descending order of similarity, the top-ranked K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the obtaining, based on the K nearest feature vectors and the target feature vector, an aggregation degree between the target feature vector and each text set specifically includes:
obtaining the similarity between each of the K nearest feature vectors and the target feature vector;
and determining the average value of the similarities corresponding to the K nearest feature vectors as the aggregation degree between the target feature vector and each text set.
Optionally, the comparison result is a ratio of the aggregation degree to the category aggregation degree, and the determining a target text set from the at least two text sets based on the comparison result corresponding to each text set specifically includes:
determining the classification probability of the text to be processed belonging to each text set based on the comparison result corresponding to each text set;
and determining the text set with the maximum classification probability as a target text set.
Optionally, the category aggregation degree of each text set is obtained by:
for each feature vector in each text set, obtaining the similarity between the feature vector and the other feature vectors in the text set, and determining an aggregation degree between the feature vector and the text set based on the top-ranked K similarities, the obtained similarities being sorted in descending order;
and determining the category aggregation degree corresponding to each text set based on the aggregation degree corresponding to each feature vector in the text set.
Optionally, the determining, based on the top-ranked K similarities, an aggregation degree between each feature vector and each text set specifically includes:
determining the average value of the top-ranked K similarities as the aggregation degree between each feature vector and each text set.
Optionally, the determining, based on the aggregation degree corresponding to each feature vector in each text set, a category aggregation degree corresponding to each text set specifically includes:
determining the average value of the aggregation degrees corresponding to the feature vectors in each text set as the category aggregation degree corresponding to each text set.
Optionally, each category corresponds to a user intent.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
the acquisition module is used for acquiring a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category, and each text set comprises feature vectors of text data belonging to the same category;
the aggregation degree calculation module is used for obtaining, for each text set, K nearest feature vectors of the target feature vector from each text set, obtaining the aggregation degree between the target feature vector and each text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result of the aggregation degree and the category aggregation degree of each text set; the aggregation degree represents the density between the target feature vector and the K nearest feature vectors, and the category aggregation degree represents the density of the feature vector distribution within the same text set;
and the classification module is used for determining a target text set from the at least two text sets based on the comparison result corresponding to each text set, and determining the category of the target text set as the category of the text to be processed.
Optionally, the aggregation degree calculation module is specifically configured to: obtain the similarity between each feature vector in each text set and the target feature vector; and determine, in descending order of similarity, the top-ranked K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the aggregation degree calculation module is specifically configured to: obtain the similarity between each of the K nearest feature vectors and the target feature vector; and determine the average value of the similarities corresponding to the K nearest feature vectors as the aggregation degree between the target feature vector and each text set.
Optionally, the comparison result is a ratio of the aggregation degree to the category aggregation degree, and the classification module is specifically configured to: determine the classification probability that the text to be processed belongs to each text set based on the comparison result corresponding to each text set; and determine the text set with the maximum classification probability as the target text set.
Optionally, the text classification apparatus further includes a training module, configured to obtain a category aggregation degree of each text set by:
for each feature vector in each text set, obtaining the similarity between the feature vector and the other feature vectors in the text set, and determining the aggregation degree between the feature vector and the text set based on the top-ranked K similarities, the obtained similarities being sorted in descending order;
and determining the category aggregation degree corresponding to each text set based on the aggregation degree corresponding to each feature vector in the text set.
Optionally, the training module is specifically configured to: determine the average value of the top-ranked K similarities as the aggregation degree between each feature vector and each text set.
Optionally, the training module is specifically configured to: determine the average value of the aggregation degrees corresponding to the feature vectors in each text set as the category aggregation degree corresponding to each text set.
Optionally, each category corresponds to a user intent.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods described above when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of any of the methods described above.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the steps of any of the methods described above.
The text classification method, the text classification apparatus, the electronic device and the storage medium provided by the embodiments of the present application can automatically learn, from the text sets, the distribution of text data of the same category in the feature space, and obtain a category aggregation degree representing the density of the feature vector distribution within each text set. When text classification is performed, the text data of the different categories in the training set are divided into text sets corresponding to those categories, the K pieces of text data most similar to the text to be processed are searched for within the text set of each category, the aggregation degree of the text to be processed with respect to each category is calculated, and the most suitable category for the text to be processed is determined by comparing, per category, that aggregation degree with the category aggregation degree. On the one hand, because the search for the K nearest pieces of text data and the subsequent processing are carried out separately within the text sets of the different categories, the influence of the value of K on the classification result is reduced; on the other hand, because classification compares the aggregation degree of the text to be processed against the category aggregation degree of each category, the problem of inaccurate classification caused by differing data distributions or unbalanced data is mitigated. Through these two optimizations, the accuracy of text classification is improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
Stop words refer to words or phrases that are automatically filtered out before or after processing natural language data (or text), in order to save storage space and improve search efficiency in information retrieval. Stop words are entered manually rather than generated automatically, and the entered stop words form a stop-word list.
Chinese word segmentation is the process of dividing a Chinese character sequence into independent words, i.e. recombining a continuous character sequence into a word sequence according to a certain standard. A common Chinese word segmentation tool is jieba.
Word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes a word as input and guesses the words in adjacent positions, and under the bag-of-words assumption used in word2vec, the order of the words is unimportant. After training, the word2vec model can be used to map each word to a vector, which can represent word-to-word relationships; the vector is the hidden layer of the neural network.
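For illustration only (not part of the claimed subject matter), a minimal sketch of training word vectors with the gensim library's Word2Vec implementation; the library choice and the toy corpus are assumptions:

```python
# A minimal sketch, assuming gensim's Word2Vec; the two-sentence corpus is
# purely illustrative.
from gensim.models import Word2Vec

sentences = [["I", "want", "to", "transfer", "money"],
             ["I", "want", "to", "withdraw", "cash"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
vector = model.wv["transfer"]  # the learned 50-dimensional word vector
```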
The basic idea of the KNN algorithm is as follows: given the classified text data of a known training set, the feature vector of the text to be processed is compared with the feature vectors of the text data in the training set, the K pieces of text data in the training set most similar to the text to be processed are found, and the category occurring most frequently among those K pieces is taken as the category of the text to be processed. Taking fig. 1 as an example, each point in fig. 1 represents the feature vector of a piece of text data: the triangular points belong to category one, the circular points belong to category two, and the square point represents the feature vector of the text to be processed. Assuming K is 3, the KNN algorithm finds the three points closest to the square point (i.e., the points within the dotted circle in fig. 1) and determines which category each belongs to; since more of the three points belong to category one, the text to be processed (i.e., the square point) is classified into category one.
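For clarity, a minimal sketch of the plain KNN classification just described, assuming cosine similarity over pre-computed feature vectors; all names are illustrative:

```python
# A sketch of majority-vote KNN over feature vectors, assuming cosine similarity.
from collections import Counter
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(target, vectors, labels, k=3):
    # Rank all training vectors by similarity to the target vector.
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine_sim(target, vectors[i]), reverse=True)
    # Majority vote among the K most similar vectors.
    top_labels = [labels[i] for i in order[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```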
However, the classification result of a text classification model based on the KNN algorithm is extremely sensitive to the choice of K and to the distribution of the text data in the training set.
In practical applications, different values of K may produce different classification results. Taking fig. 1 as an example, when K is 3, the KNN algorithm finds the three points closest to the feature vector of the text to be processed (i.e., the square point) and classifies the text to be processed into category one (i.e., the category of the triangular points). Taking fig. 2 as an example, when K is 5, the KNN algorithm finds the five points closest to the feature vector of the text to be processed, and if 3 of the five points are circular, the text to be processed is classified into category two.
Further, the distributions of text data of different categories may differ, or an unbalanced number of samples may lead to differing amounts and densities of data across categories; the KNN algorithm takes none of this into account. Taking fig. 3 as an example, each dot represents a piece of text data: the black dots belong to category one and are distributed over a narrow, dense region, while the white dots belong to category two and are distributed over a wide, sparse region. For the gray point to be classified, the nearest neighbors found by KNN are all black dots, so the gray point is classified into category one. However, analyzing the distribution of the dots in fig. 3 shows that the black dots sit right next to one another while the gray point is clearly distant from the nearest black dots; conversely, the spacing between the gray point and the white dots matches the distribution of the white dots. The gray point should therefore be classified into category two.
Therefore, in application scenarios with few samples, the text data of the categories are prone to uneven distribution, which seriously reduces the accuracy of text classification.
Based on this, an embodiment of the present application provides a text classification method, which specifically includes the following steps with reference to fig. 4:
S401, obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category, and each text set comprises feature vectors of text data belonging to the same category.
In a specific implementation, the text data used or appearing in the application scene can be collected, divided into a plurality of categories, and the text data of the same category placed into the same text set. The categories may be set according to application requirements; taking the intelligent customer service of a shopping website as an example, the text input by the user may be divided into commodity consultation, price consultation, payment operation consultation, activity consultation, after-sale consultation, and the like. The embodiment of the present application is not limited in this respect.
In a specific implementation, the text data in each text set may be preprocessed in advance to remove noise data. The preprocessing may specifically include: removing special characters (characters other than Chinese and English) and stop words from the text data, and then performing word segmentation on the text data to obtain a plurality of word segments. The word segments are then converted into a feature vector, i.e. the feature vector of the text data, and the feature vector of each piece of text data is stored in the corresponding text set for convenient use during text classification.
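A hedged sketch of this preprocessing step, assuming jieba for word segmentation and a caller-supplied stop-word list; the regular expression keeping only Chinese characters, English letters and digits is likewise an assumption:

```python
# Preprocessing sketch: strip special characters, segment, drop stop words.
import re
import jieba

def preprocess(text, stop_words):
    # Keep Chinese characters, English letters, digits and spaces only.
    text = re.sub(r"[^\u4e00-\u9fa5a-zA-Z0-9 ]", "", text)
    # Segment the cleaned text and filter out stop words.
    return [w for w in jieba.lcut(text) if w.strip() and w not in stop_words]
```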
Specifically, the text can be converted into a feature vector by combining a FastText word vector model with SIF weighting, according to the formula v = Σ α · FastText(w): first, the FastText word vector model converts every word segment in the text data into a word vector; then each word vector is multiplied by its SIF weight (the SIF weight is computed as α = a / (a + P(w)), where a = 0.01 and P(w) is the occurrence probability of the word segment); finally all weighted word vectors are summed, yielding the vector representation of the text data. Similarly, the text to be processed can be converted into a feature vector in the same FastText-plus-SIF manner, yielding the target feature vector.
Of course, other ways may also be adopted to convert the text into a feature vector, such as a word2vec model; the embodiment of the present application is not limited thereto.
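The FastText-plus-SIF scheme above can be sketched as follows, assuming a gensim FastText model and a pre-computed word-occurrence-probability table; both are assumptions about the surrounding tooling, not part of the disclosure:

```python
# SIF-weighted sentence vector: v = sum over words of alpha * FastText(w),
# with alpha = a / (a + P(w)) and a = 0.01, as described above.
import numpy as np

def sif_sentence_vector(words, fasttext_model, word_prob, a=0.01):
    vec = np.zeros(fasttext_model.vector_size)
    for w in words:
        alpha = a / (a + word_prob.get(w, 0.0))  # SIF weight
        vec += alpha * fasttext_model.wv[w]      # FastText covers OOV via subwords
    return vec
```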
S402, for each text set, obtaining K nearest feature vectors of the target feature vector from the text set, obtaining an aggregation degree between the target feature vector and the text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result of the aggregation degree and the category aggregation degree of the text set.
The aggregation degree represents the density between the target feature vector and the K nearest feature vectors, i.e., how tightly the text to be processed clusters with the text set. The category aggregation degree represents the density of the feature vector distribution within the same text set.
The K nearest feature vectors are the K feature vectors closest to the target feature vector. In a specific implementation, obtaining the K nearest feature vectors of the target feature vector from the text set in step S402 specifically includes: obtaining the similarity between each feature vector in the text set and the target feature vector, and determining, in descending order of similarity, the top-ranked K feature vectors as the K nearest feature vectors of the target feature vector. Alternatively, the K nearest feature vectors may be found using a kd-tree, which is prior art and will not be described in detail.
In a specific implementation, obtaining the aggregation degree between the target feature vector and the text set based on the K nearest feature vectors and the target feature vector in step S402 specifically includes: obtaining the similarity between each of the K nearest feature vectors and the target feature vector, and determining the average value of the similarities corresponding to the K nearest feature vectors as the aggregation degree between the target feature vector and the text set.
Taking the text set V_c of a category c as an example: the similarity between the target feature vector J and each feature vector in V_c is computed, where the similarity between J and the i-th feature vector v_{c,i} in V_c is denoted S_{c,i} = Sim(v_{c,i}, J), and all S_{c,i} form the set S_c. Then the K most similar feature vectors are taken from S_c, giving the set of K nearest similarities K_s = Max(S_c, K). Finally, the average of the similarities of the K nearest feature vectors is computed as the aggregation degree between the target feature vector J and the text set V_c: h_{J,c} = Mean(K_s).
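A short sketch of this computation, using cosine similarity as the Sim() function (an assumption; the disclosure does not fix a particular similarity measure):

```python
# Aggregation degree h_{J,c}: mean similarity between the target vector and
# its K nearest feature vectors in one text set.
import numpy as np

def aggregation_degree(target, text_set_vectors, k):
    sims = sorted(
        (float(np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v)))
         for v in text_set_vectors),
        reverse=True,
    )
    return float(np.mean(sims[:k]))  # h_{J,c} = Mean(K_s)
```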
S403, determining a target text set from at least two text sets based on the comparison result corresponding to each text set.
In a specific implementation, the ratio of the aggregation degree to the category aggregation degree can be used as the comparison result. The classification probability that the text to be processed belongs to each text set can then be determined based on the comparison result corresponding to each text set, and the text set with the maximum classification probability is determined as the target text set.
For example, suppose the text set V_c has category aggregation degree a_c, and the aggregation degree between the target feature vector J and the text set V_c is h_c; then the comparison result of the aggregation degree h_c and the category aggregation degree a_c is h_c/a_c. Based on the Softmax function P = Softmax(h_c/a_c), the classification probability that the target feature vector J belongs to category c is obtained, and the text set with the maximum classification probability is determined as the target text set.
In practice, the closer h_c/a_c is to 1, the closer the way the target feature vector J sits within the text set V_c is to the way the feature vectors in V_c are distributed. Therefore, among the comparison results h_c/a_c corresponding to the text sets, the h_c/a_c closest to 1 can be selected, and the text set corresponding to that h_c/a_c is determined as the target text set.
Dividing the aggregation degree between the target feature vector J and the text set V_c by the category aggregation degree of V_c converts the absolute aggregation degree between J and V_c into a relative quantity. Classifying based on the ratio of the two fully takes the distribution of each category's text data into account and improves the accuracy of text classification.
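A sketch of step S403 under the Softmax variant, turning the per-category ratios h_c/a_c into classification probabilities; the dictionary-based interface is an illustrative assumption:

```python
# Softmax over the comparison results h_c / a_c; the category with the
# largest probability wins.
import numpy as np

def classify(ratios):
    # ratios: dict mapping each category c to h_c / a_c.
    cats = list(ratios)
    scores = np.array([ratios[c] for c in cats])
    probs = np.exp(scores - scores.max())  # numerically stable softmax
    probs /= probs.sum()
    return cats[int(np.argmax(probs))], dict(zip(cats, probs.tolist()))
```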
S404, determining the category of the target text set as the category of the text to be processed.
In specific implementation, referring to fig. 5, the category aggregation degree of each text set may be obtained as follows:
S501, for each feature vector in the text set, obtaining the similarity between the feature vector and the other feature vectors in the text set, and determining the aggregation degree between the feature vector and the text set based on the top-ranked K similarities, the obtained similarities being sorted in descending order.
Specifically, the average of the top-ranked K similarities may be determined as the aggregation degree between the feature vector and the text set. Alternatively, the median of the top-ranked K similarities may be taken as the aggregation degree between the feature vector and the text set.
S502, determining the category aggregation degree corresponding to the text set based on the aggregation degree corresponding to each feature vector in the text set.
Specifically, the average of the aggregation degrees corresponding to the feature vectors in the text set may be determined as the category aggregation degree corresponding to the text set. Alternatively, the median of the aggregation degrees corresponding to the feature vectors in the text set may be taken as the category aggregation degree corresponding to the text set.
For example, assume that K is 3 and the text set V_c contains the feature vectors of 20 pieces of text data. Taking one of the feature vectors in V_c as an example, the similarities between this feature vector and the other 19 feature vectors are calculated, the 3 largest similarities are taken, and their average is used as the aggregation degree between this feature vector and the text set. The aggregation degrees corresponding to all 20 feature vectors are obtained in this way, and the average of these 20 aggregation degrees is then used as the category aggregation degree corresponding to the text set.
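A sketch of steps S501 and S502, again assuming cosine similarity; it requires the text set to contain more than K feature vectors:

```python
# Category aggregation degree: average, over all vectors in the set, of the
# mean of each vector's top-K similarities to the other vectors.
import numpy as np

def category_aggregation_degree(text_set_vectors, k):
    degrees = []
    for i, v in enumerate(text_set_vectors):
        others = [u for j, u in enumerate(text_set_vectors) if j != i]
        sims = sorted(
            (float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
             for u in others),
            reverse=True,
        )
        degrees.append(np.mean(sims[:k]))  # per-vector aggregation degree (S501)
    return float(np.mean(degrees))          # category aggregation degree (S502)
```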
Taking fig. 6 as an example, each point represents a piece of text data; the black points belong to category one and the white points to category two. Text data of the same category are distributed similarly in the feature space and aggregate into a category region: the black points occupy a narrow, densely populated region, while the white points occupy a wide, sparsely populated region. Following the steps shown in fig. 5, the category aggregation degrees of category one and category two are calculated; they are illustrated by line segments in fig. 6, where the length of a segment is inversely related to the category aggregation degree. That is, the longer the segment, the smaller the category aggregation degree and the sparser the text data distribution of that category: the densely distributed black points yield a high category aggregation degree, and the sparsely distributed white points yield a low one. Then, following the steps shown in fig. 4, the aggregation degree h1 between the gray point and category one and the aggregation degree h2 between the gray point and category two are calculated; h1 is compared with the category aggregation degree of category one, and h2 with the category aggregation degree of category two. The comparison shows that h2 is closer to the category aggregation degree of category two, i.e., the gray point better matches the distribution of the white points, so the gray point is assigned to category two.
The text classification method can automatically learn, from the text sets, the distribution of text data of the same category in the feature space, and obtain a category aggregation degree representing the density of the feature vector distribution within each text set. When text classification is performed, the text data of the different categories in the training set are divided into text sets corresponding to those categories, the K pieces of text data most similar to the text to be processed are searched for within the text set of each category, the aggregation degree of the text to be processed with respect to each category is calculated, and the most suitable category for the text to be processed is determined by comparing, per category, that aggregation degree with the category aggregation degree. On the one hand, because the search for the K nearest pieces of text data and the subsequent processing are carried out separately within the text sets of the different categories, the influence of the value of K on the classification result is reduced; on the other hand, because classification compares the aggregation degree of the text to be processed against the category aggregation degree of each category, the problem of inaccurate classification caused by differing data distributions or unbalanced data is mitigated. Through these two optimizations, the accuracy of text classification is improved.
The text classification method can be applied to application scenarios such as intelligent customer service, news recommendation and intention recognition; a text classification model with high classification accuracy can be quickly constructed based on the text data of the application scenario, and texts can be classified well even in few-sample scenarios.
Next, the classification method according to the embodiment of the present application will be described, taking banking business as an example.
In intention recognition, each category in the classification method corresponds to a user intention, and the text set of each category contains text data expressing the same intention.
For example, consider the unmanned banking service of a bank, where the intention of the user must be identified from a sentence input by voice or text, so as to provide the corresponding service or guide the user through the corresponding operation. The intention categories may be determined according to application requirements; for example, 6 categories may include transfer, deposit, withdraw, confirm, modify, and cancel. Text data corresponding to the various intentions are then collected and stored in the corresponding text sets. Here, the text data of the three categories transfer, deposit, and withdraw are long sentences, large in amount and sparsely distributed, while the text data of the three categories confirm, modify, and cancel are short phrases or single words, small in amount and densely distributed.
Training is performed on the basis of the text set corresponding to each intention, so as to obtain the category aggregation degree corresponding to each intention.
Firstly, preprocessing text data in a text set.
The preprocessing aims to remove noise data in the text data, wherein the preprocessing specifically comprises the following steps: the method comprises the steps of removing special characters (other characters except Chinese and English) and stop words in text data, and then performing word segmentation processing on the text data to obtain a plurality of word segments.
And secondly, converting the preprocessed text data into a feature vector.
The word segments of the text data are converted into a feature vector, i.e. the feature vector of the text data, and the feature vector of each piece of text data is stored in the corresponding text set for convenient use during text classification.
And thirdly, calculating the category aggregation degree.
The aggregation degree of each piece of text data in the text set of each intention is calculated. Taking category c as an example, let V_c denote the text set of category c; in the manner shown in fig. 5, the aggregation degree of each piece of text data in V_c with respect to V_c is calculated, and the average of these aggregation degrees is then taken as the category aggregation degree of category c. In this way, the category aggregation degrees of all categories are obtained.
Intention recognition is then performed on the text to be processed input by the user, based on the category aggregation degrees obtained in the training stage.
Firstly, preprocessing a text to be processed input by a user.
The preprocessing follows the same procedure as in the training stage.
And secondly, converting the preprocessed text into a feature vector.
The text to be processed is represented as a vector using the same method as in the training stage, generating the vector v of the text to be processed.
And thirdly, calculating the aggregation degree of the text to be processed with respect to each intention category.
Refer specifically to step S402 in fig. 4.
And fourthly, determining an intention recognition result.
Based on the Softmax function P = Softmax(h_c/a_c), the classification probability that the text to be processed belongs to each intention category is obtained, and the text set with the maximum classification probability is determined as the target text set. The intention recognition result returned by the Softmax function can be represented as: {"class_0": p0, "class_1": p1, ...}, where class_0 and class_1 denote intention categories and p0 and p1 denote classification probabilities.
For example, the text entered by the user is: "I want to transfer money". The word segmentation result obtained through data preprocessing is ["I", "want", "transfer money"]. Vector characterization of the word segmentation result gives the target feature vector v. The category aggregation degrees are calculated, giving the category aggregation degree vector [a_0, a_1, a_2, a_3, a_4, a_5], whose corresponding intention categories are ["transfer", "deposit", "withdraw", "cancel", "modify", "confirm"]. Comparison against the category aggregation degrees yields the classification probability vector [0.91, 0.04, 0.02, 0.01, 0.01, 0.01], so the final intention recognition result is: {"transfer": 0.91, "deposit": 0.04, "withdraw": 0.02, "cancel": 0.01, "modify": 0.01, "confirm": 0.01}, i.e. the user's intention is recognized as "transfer".
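Tying the stages together, a hypothetical end-to-end run of this intent-recognition example, reusing the helper functions sketched earlier; text_sets, cat_deg, ft_model, word_prob and stop_words are all assumed to come from the training stage:

```python
# Hypothetical glue code; every name is either from the earlier sketches or an
# assumed training-stage artifact.
intents = ["transfer", "deposit", "withdraw", "cancel", "modify", "confirm"]

words = preprocess("I want to transfer money", stop_words)
target = sif_sentence_vector(words, ft_model, word_prob)

ratios = {c: aggregation_degree(target, text_sets[c], k=3) / cat_deg[c]
          for c in intents}
best, probs = classify(ratios)
print(best, probs)  # e.g. "transfer", {"transfer": 0.91, "deposit": 0.04, ...}
```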
As shown in fig. 7, based on the same inventive concept as the text classification method, the embodiment of the present application further provides a text classification device 70, which specifically includes:
the acquiring module 701 is configured to acquire a target feature vector of a text to be processed and at least two text sets, where each text set corresponds to one category and each text set includes feature vectors of text data belonging to the same category;
an aggregation degree calculation module 702, configured to obtain, for each text set, K nearest feature vectors of the target feature vector from each text set, obtain an aggregation degree between the target feature vector and each text set based on the K nearest feature vectors and the target feature vector, and obtain a comparison result of the aggregation degree and the category aggregation degree of each text set; the aggregation degree represents the density between the target feature vector and the K nearest feature vectors, and the category aggregation degree represents the density of the feature vector distribution within the same text set;
a classification module 703, configured to determine, based on a comparison result corresponding to each text set, a target text set from the at least two text sets, and determine a category of the target text set as a category of the text to be processed.
Optionally, the aggregation degree calculation module 702 is specifically configured to: obtain the similarity between each feature vector in each text set and the target feature vector; and determine, in descending order of similarity, the top-ranked K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the aggregation degree calculation module 702 is specifically configured to: obtain the similarity between each of the K nearest feature vectors and the target feature vector; and determine the average value of the similarities corresponding to the K nearest feature vectors as the aggregation degree between the target feature vector and each text set.
Optionally, the comparison result is a ratio of the aggregation degree to the category aggregation degree, and the classification module 703 is specifically configured to: determine the classification probability that the text to be processed belongs to each text set based on the comparison result corresponding to each text set; and determine the text set with the maximum classification probability as the target text set.
Optionally, the text classification device 70 further includes a training module, configured to obtain a category aggregation degree of each text set by:
for each feature vector in each text set, obtaining the similarity between the feature vector and the other feature vectors in the text set, and determining the aggregation degree between the feature vector and the text set based on the top-ranked K similarities, the obtained similarities being sorted in descending order;
and determining the category aggregation degree corresponding to each text set based on the aggregation degree corresponding to each feature vector in the text set.
Optionally, the training module is specifically configured to: determine the average value of the top-ranked K similarities as the aggregation degree between each feature vector and each text set.
Optionally, the training module is specifically configured to: determine the average value of the aggregation degrees corresponding to the feature vectors in each text set as the category aggregation degree corresponding to each text set.
Optionally, each category corresponds to a user intent.
The text classification apparatus and the text classification method provided by the embodiments of the present application are based on the same inventive concept and can achieve the same beneficial effects, which are not repeated here.
Based on the same inventive concept as the text classification method, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 8, the electronic device 80 may include a processor 801 and a memory 802.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the text classification method disclosed in the embodiments of the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules within the processor.
The memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random-access memory (RAM), static random-access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 802 of the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
Embodiments of the present application provide a computer-readable storage medium for storing computer program instructions for the electronic device, which includes a program for executing the text classification method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above embodiments are only used to describe the technical solutions of the present application in detail; they are intended to help understand the method of the embodiments of the present application and should not be construed as limiting the embodiments of the present application. Modifications and substitutions readily apparent to those skilled in the art are intended to fall within the scope of the embodiments of the present application.