
CN111949789B - Text classification method and text classification system - Google Patents


Info

Publication number
CN111949789B
CN111949789B (application CN201910407126.3A)
Authority
CN
China
Prior art keywords
text
vector
library
clustering
target
Prior art date
Legal status
Active
Application number
CN201910407126.3A
Other languages
Chinese (zh)
Other versions
CN111949789A
Inventor
孙金辉
陈生泰
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910407126.3A
Publication of CN111949789A
Application granted
Publication of CN111949789B
Status: Active

Classifications

    • G06F16/35 — Information retrieval of unstructured textual data: Clustering; Classification
    • G06F16/3347 — Query execution using a vector-based model
    • G06N3/044 — Neural networks: Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a text classification method and a text classification system. The method comprises the following steps: acquiring a target text and encoding it into a first text vector with a convolutional network; dividing the source texts in each text library into a plurality of categories with a clustering model, each category being represented by the second text vector corresponding to the cluster center of that category; and calculating the distances between the first text vector and the plurality of second text vectors, and using a ranking model to take the text libraries whose cluster centers correspond to the K nearest second text vectors as candidate text libraries of the target text, where K is a positive integer. By encoding the target text into a first text vector and computing its distance to the second text vectors corresponding to the cluster centers of each text library, the K text libraries closest to the target text serve as its candidate libraries, so the target text can be classified quickly and efficiently and the accuracy of text classification is improved.

Description

Text classification method and text classification system
Technical Field
The invention relates to the technical field of computers, in particular to a text classification method and a text classification system.
Background
With the development of big-data technology, more and more fields use computers for text matching and classification. As text data sets grow rapidly, text classification algorithms multiply and their computational complexity keeps rising.
Conventional text classification methods classify based on association rules or keywords: a text is assigned to a class according to the keywords it contains or according to association relations. When a target text contains a preselected keyword it can be classified reasonably well, but if it contains none it cannot be classified at all, so many target texts cannot be matched to a suitable class. Keyword-based matching is also coarse: it largely ignores semantic information, so the classification results are inaccurate.
Therefore, the inventor believes that the above text classification method has great limitations: classifying target texts based on keywords is time-consuming, labor-intensive, and inaccurate.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text classification method and a text classification system. A convolutional network encodes the target text into a first text vector; a clustering model represents the source texts in each text library as a plurality of second text vectors; and the text libraries nearest to the first vector, by computed inter-vector distance, are taken as the candidate libraries of the target text. This completes the classification and avoids the problem that a target text without keywords cannot be classified accurately.
According to a first aspect of the present invention, there is provided a text classification method comprising:
Acquiring a target text, and encoding the target text into a first text vector by adopting a convolution network;
Dividing source texts in each text library into a plurality of categories by adopting a clustering model, wherein each category is represented by a second text vector corresponding to a clustering center of the category; and
calculating the distances between the first text vector and the plurality of second text vectors respectively, and taking the text libraries where the cluster centers corresponding to the K nearest second text vectors are located as candidate text libraries of the target text by adopting a ranking model, where K is a positive integer.
Preferably, the text classification method further comprises: a plurality of text libraries are obtained from a storage system, each text library comprising a plurality of source texts.
Preferably, obtaining the target text and encoding it into the first text vector using the convolutional network comprises:
performing word segmentation on the target text;
mapping the word set corresponding to the target text into word vectors through a mapping function;
padding each word vector to a fixed dimension, filling with random values; and
performing convolution and pooling on each word vector and concatenating the results to obtain the first text vector.
Preferably, dividing the source texts in each text library into a plurality of categories with a clustering model, each category represented by the second text vector corresponding to its cluster center, comprises:
Setting the number of the clustering centers corresponding to each text library;
Clustering all the source texts in each text library into a set number of categories by adopting the clustering model;
Representing each corresponding category by using the coordinates of the clustering center as a second text vector;
And storing the clustering result of each text library.
Preferably, the number of the cluster centers is 4.
Preferably, the distance between the first text vector and the second text vector is represented by a text similarity.
Preferably, the text similarity includes cosine similarity.
Preferably, using the ranking model to take the text library where the cluster centers corresponding to the K nearest second text vectors are located as a candidate text library of the target text includes:
Sorting the cosine similarity by adopting a sorting function;
and selecting the text libraries where the cluster centers corresponding to the top-K second text vectors by cosine similarity are located as candidate text libraries of the target text.
Preferably, the convolutional network comprises a textCNN network and the clustering model comprises a kmeans clustering model.
According to a second aspect of the present invention, there is provided a text classification system comprising:
the target text acquisition unit is used for acquiring a target text and encoding the target text into a first text vector by adopting a convolution network;
the clustering unit is used for dividing the source texts in each text library into a plurality of categories by adopting a clustering model, and each category is represented by a second text vector corresponding to the clustering center of the category; and
and the screening unit, configured to calculate the distances between the first text vector and the plurality of second text vectors and to use a ranking model to take the text libraries where the cluster centers corresponding to the K nearest second text vectors are located as candidate text libraries of the target text, where K is a positive integer.
Preferably, the text classification system further comprises:
And the text library acquisition unit is used for acquiring a plurality of text libraries from the storage system, wherein each text library comprises a plurality of source texts.
Preferably, the target text acquisition unit includes:
the word segmentation unit is used for carrying out word segmentation operation on the target text;
The mapping unit is used for mapping the word segmentation set corresponding to the target text into word vectors through a mapping function;
the filling unit is used for dimension filling of each word vector and giving a random value;
and the splicing unit, configured to perform convolution and pooling on each word vector and concatenate the results to obtain the first text vector.
Preferably, the clustering unit includes:
the quantity setting unit is used for setting the quantity of the clustering centers corresponding to each text library;
a classification unit, configured to cluster all the source texts in each text library into a set number of categories by using the clustering model;
a vector representation unit, configured to represent each corresponding category by using the coordinates of its cluster center as a second text vector;
and the clustering storage unit is used for storing the clustering result of each text library.
Preferably, the number of the cluster centers is 4.
Preferably, the distance between the first text vector and the second text vector is represented by a text similarity.
Preferably, the text similarity includes cosine similarity.
Preferably, the screening unit comprises:
the ordering unit is used for ordering the cosine similarity by adopting an ordering function;
and the selecting unit, configured to select the text libraries where the cluster centers corresponding to the top-K second text vectors by cosine similarity are located as candidate text libraries of the target text.
According to a third aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions which, when executed, implement the text classification method described above.
According to a fourth aspect of the present invention, there is provided a text classification apparatus comprising: a memory for storing computer instructions; and a processor coupled to the memory and configured to execute the stored computer instructions to implement the text classification method described above.
Embodiments of the present invention have the following advantages. The text classification method and system encode the target text into a first text vector with a network structure, cluster the source texts of a plurality of text libraries with a clustering model so that each category is represented by the second text vector at its cluster center, compute the distances between the first text vector and the second text vectors, and select the text libraries holding the nearest second text vectors as candidate libraries of the target text, completing its classification. By encoding the target text and clustering the text libraries, the classification problem is converted into computation between vectors: the relation between the target text and the text libraries is known simply by computing inter-vector distances. There is no need to match the target text against every source text in a huge library or to compute similarities one by one, which simplifies the computation, saves work, classifies text quickly, efficiently, and accurately, and avoids the drawbacks of keyword-based classification.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing embodiments thereof with reference to the following drawings in which:
FIG. 1 shows a flow chart of a text classification method in an embodiment of the invention;
FIG. 2 shows a flowchart of a more complete text classification method in an embodiment of the invention;
fig. 3a shows a specific flowchart of step S101 shown in fig. 1;
FIG. 3b is a schematic diagram showing the encoding process of target text in an embodiment of the present invention;
Fig. 4 shows a specific flowchart of step S102 shown in fig. 1;
FIG. 5 shows a block diagram of a text classification system in an embodiment of the invention;
FIG. 6 shows a block diagram of a more complete text classification system in an embodiment of the invention;
Fig. 7 shows a specific configuration diagram of a target text acquisition unit 501 in the text classification system of the embodiment of the present invention;
FIG. 8 shows a specific block diagram of the clustering unit 502 in the text classification system in an embodiment of the invention;
Fig. 9 shows a block diagram of a text classification apparatus according to an embodiment of the present invention.
Detailed Description
The present invention is described below based on embodiments, but it is not limited to these embodiments. Certain specific details are set forth in the following description; those skilled in the art can fully understand the invention even without them. Well-known methods, procedures, and flows are not described in detail so as not to obscure the essence of the invention. The figures are not necessarily drawn to scale.
Fig. 1 shows a flowchart of a text classification method in an embodiment of the invention, and specific steps include S101-S103.
In step S101, a target text is acquired and encoded into a first text vector using a convolutional network.
In step S102, the source text in each text library is divided into a plurality of categories by using a clustering model, and each category is represented by a second text vector corresponding to a clustering center of the category.
In step S103, distances between the first text vector and the plurality of second text vectors are calculated, and a text library where the cluster centers corresponding to the K nearest second text vectors are located is used as a candidate text library of the target text by using the ranking model, where K is a positive integer.
In this embodiment, the target text to be classified is represented as a vector, and the distances between it and the vectors corresponding to the source texts of each text library in the storage system are computed. A small distance represents high similarity, so the several most similar text libraries are selected as the libraries to which the target text may belong; other methods can then classify the target text precisely within them.
In step S101, a target text is acquired and encoded into a first text vector using a convolutional network.
The text classification idea adopted in this embodiment is to map texts to text vectors first and then judge how well texts match by comparing the similarity of their vectors, so the texts must first be expressed as text vectors.
Given a target text, classifying it into the most likely text library requires comparing its similarity to the source texts of each library. The target text is first converted into a first text vector by a series of functions; in this embodiment a convolutional network such as textCNN encodes the target text into a feature vector of length 192. Other networks such as LSTM or BiLSTM could also be used, but a CNN model can be more advantageous for characterizing the target text as a vector, so the encoding part of this embodiment uses a textCNN network.
A convolutional neural network is a feedforward neural network whose artificial neurons respond to cells within a local receptive field; it performs excellently on large-scale image processing and comprises convolutional layers and pooling layers. The textCNN model is a variant of the convolutional neural network for natural language processing. Its main differences are that during convolution the kernel width equals the full length of the word-embedding vector and the pooling stage pools all convolution results at once; the other operations are essentially the same as in a standard convolutional neural network.
In this embodiment a convolutional network encodes the target text: first word segmentation, word embedding, and similar operations are performed, then convolution, pooling, and concatenation, forming a vector of length 192. The encoding process is described in detail with reference to FIGS. 3a and 3b.
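The pipeline just described (segment, embed, convolve, max-pool, concatenate) can be sketched roughly with numpy. This is only an illustration under assumptions: the random weights, the toy vocabulary, and the name `encode_text` are not the patent's implementation, and a real textCNN would use trained parameters; it merely shows how three kernel widths with 64 filters each yield a length-192 vector.

```python
import numpy as np

def encode_text(tokens, vocab, emb_dim=128, kernel_widths=(2, 3, 4), n_filters=64, seed=0):
    """textCNN-style encoder sketch: embed, convolve per kernel width,
    max-pool over time, concatenate. Weights are random placeholders."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(vocab), emb_dim))          # embedding table
    x = emb[[vocab[t] for t in tokens]]                   # (seq_len, emb_dim)
    pooled = []
    for w in kernel_widths:
        filt = rng.normal(size=(n_filters, w, emb_dim))   # filters span the full embedding width
        if x.shape[0] < w:                                # pad short texts so one window exists
            x_pad = np.vstack([x, np.zeros((w - x.shape[0], emb_dim))])
        else:
            x_pad = x
        windows = np.stack([x_pad[i:i + w] for i in range(x_pad.shape[0] - w + 1)])
        feats = np.einsum('nwd,fwd->nf', windows, filt)   # convolution over word windows
        pooled.append(feats.max(axis=0))                  # max-over-time pooling
    return np.concatenate(pooled)                         # 3 widths x 64 filters = 192

vocab = {'fresh': 0, 'imported': 1, 'fruit': 2}
vec1 = encode_text(['imported', 'fruit'], vocab)
print(vec1.shape)  # (192,)
```

The fixed output length depends only on the number of kernel widths and filters, not on the text length, which is what allows texts of any size to be compared as vectors.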
In step S102, the source text in each text library is divided into a plurality of categories by using a clustering model, and each category is represented by a second text vector corresponding to a clustering center of the category.
The target text was vectorized in the previous step. The source texts are stored in different text libraries: all source texts in one library share a certain similarity, and different libraries represent different categories. Although source texts within the same library are relatively similar, their large number allows a finer-grained division. Comparing the target text directly with every source text of every library would be hugely expensive, while matching the target text directly against library names could be inaccurate. Therefore the source texts of each library are divided into several categories, each with a central source text; comparing the target text with these centers greatly reduces the computation and makes the matching more accurate.
In this embodiment a clustering model performs this division, clustering all source texts of each library at one level. The number of classes per library can be set in advance; for example, dividing each library into 4 classes yields 4 cluster centers. Kmeans clustering divides n points (each an observation or a sample instance) into k clusters: cluster centers are first initialized randomly, all points are assigned to their nearest center, the center of each resulting cluster is recomputed, and the assignment is repeated; clustering ends when the centers converge.
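The kmeans loop described above might look like the following minimal numpy sketch; the function name and the toy two-dimensional data are illustrative assumptions, and a production system would use a tuned library implementation.

```python
import numpy as np

def kmeans(points, k=4, n_iter=20, seed=0):
    """Minimal kmeans sketch. Returns the cluster-center coordinates
    (the 'second text vectors' of one library) and the point labels."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged: centers stopped moving
            break
        centers = new
    return centers, labels

pts = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])
centers, labels = kmeans(pts, k=2)
print(centers.shape)  # (2, 2)
```

With k=4 per library, each library contributes four center vectors, so a query compares against 4 x (number of libraries) vectors instead of every source text.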
The clustering model here is only exemplary: the number of clusters can be determined from various indices, and hierarchical clustering, spectral clustering, and the like can also be used.
After clustering, the source texts of a library are divided into 4 categories, each with a cluster center; the source texts within a category are highly similar, and the coordinates of the cluster center serve as the second text vector characterizing the category. Vectorization is thus complete for both sides: the text vector of the target text is the first text vector, denoted vec1, and the text vector of a cluster center of the source texts is a second text vector, denoted vec2.
In step S103, distances between the first text vector and the plurality of second text vectors are calculated, and a text library where the cluster centers corresponding to the K nearest second text vectors are located is used as a candidate text library of the target text by using the ranking model, where K is a positive integer.
After the target text is converted into the first text vector vec1 of length 192, the distance between vec1 and each cluster center of each text library, i.e. each second text vector vec2, is computed. A short distance represents high similarity and strong association; a long distance represents a large difference.
In one embodiment, the distance between the first and second text vectors is represented by a text similarity, which includes cosine similarity; for example, cos(a, b) measures the distance between vectors a and b. The cosine similarity between the first text vector vec1 and the second text vector vec2 thus determines the degree of association between the target text and the source texts. Expressed as a formula: sim1 = cos(vec1, vec2), where cos() denotes the cosine-similarity function and sim1 denotes the cosine similarity of the two compared texts. To some extent, the higher the similarity, the stronger the association between the two texts.
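The formula sim1 = cos(vec1, vec2) is the standard dot-product-over-norms form; a small sketch (the function name is an assumption, not from the patent):

```python
import numpy as np

def cosine_sim(vec1, vec2):
    """sim1 = cos(vec1, vec2): dot product divided by the product of norms."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

same = cosine_sim(np.array([1.0, 0.0]), np.array([1.0, 0.0]))        # identical direction
orth = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))        # orthogonal vectors
print(same, orth)  # 1.0 0.0
```

Because cosine similarity depends only on direction, two texts encoded with different magnitudes but similar feature patterns still score as related.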
Representing the target text and the source texts as vectors and expressing the inter-vector distance by cosine similarity shortens computation time and improves classification efficiency.
After the distances between the text vectors are computed, the target text is matched to suitable libraries according to the results. The distances are first sorted: a small distance means high similarity and a high probability that the target text and the source texts of that library belong to the same class. A ranking model therefore takes the text libraries holding the cluster centers of the K nearest second text vectors as candidate libraries of the target text, K being a positive integer.
In one embodiment, using the ranking model to select the candidate libraries comprises: sorting the cosine similarities with a sorting function, such as a top-K function, and selecting the libraries whose cluster centers correspond to the top-K second text vectors as candidate libraries of the target text. For example, the libraries corresponding to the 15 second text vectors with the highest cosine similarity are selected as candidates, completing the classification.
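The top-K selection described above might be sketched as follows. The input format (one (library name, similarity) pair per cluster center), the deduplication of library names, and all example values are illustrative assumptions.

```python
def top_k_libraries(sims, k=15):
    """sims: list of (library_name, cosine_similarity) pairs, one per
    cluster center. Returns the libraries holding the K most similar
    centers, deduplicated in rank order."""
    ranked = sorted(sims, key=lambda p: p[1], reverse=True)[:k]
    seen, libs = set(), []
    for lib, _ in ranked:
        if lib not in seen:        # a library may own several top centers
            seen.add(lib)
            libs.append(lib)
    return libs

sims = [('fruit', 0.92), ('vegetable', 0.40), ('fruit', 0.88), ('cooked', 0.10)]
print(top_k_libraries(sims, k=2))  # ['fruit']
```

Note that the K nearest centers may belong to fewer than K libraries when one library dominates the ranking, which is itself a strong classification signal.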
According to the text classification method of this embodiment, the target text is encoded into a first text vector with a network structure, the source texts of a plurality of text libraries are clustered so that each category is represented by the second text vector at its cluster center, the distances between the first text vector and the second text vectors are computed, and the libraries holding the nearest second text vectors are selected as candidate libraries of the target text, completing its classification. Encoding the target text and clustering the libraries converts the classification problem into computation between vectors: the relation between the target text and the libraries is known simply from inter-vector distances, without matching the target text against every source text in a huge library or computing similarities one by one. This simplifies the computation, saves work, classifies text quickly, efficiently, and accurately, and avoids the drawbacks of keyword-based classification.
The text classification method of this embodiment can be applied in practice. For example, goods in many supermarkets are displayed by a three-level classification: food divides into fresh and cooked; fresh food into vegetables, fruit, fresh meat, processed food, and so on, and cooked food into hot dishes and cold dishes; vegetables include leaf vegetables, flower-and-fruit vegetables, fungi, and seasonings, fruit includes domestic fruit, imported fruit, specialty fruit, etc., hot dishes include barbecued, fried, and steamed items, and cold dishes include sushi and cold plates. Here fresh and cooked food are first-level classes, vegetables and fruit second-level classes, and imported fruit and domestic fruit third-level classes. When a new product arrives and needs a third-level classification, the text classification method of this embodiment can be adopted.
The name of the product serves as the target text, and each classified third-level category serves as a text library: the number of third-level categories is the number of libraries, and the names of all products under a third-level category are its source texts, so each library contains different source texts. The source texts under each third-level category are clustered into a set number of cluster centers; the centers and the target text are expressed as vectors and their distances computed. A small distance means higher similarity, so the third-level categories whose cluster-center vectors rank nearest to the vector of the new product's name are taken as the candidate classifications of the new product.
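The product-classification walkthrough above can be tied together in a short end-to-end sketch: score the encoded name of a new product against the stored cluster-center vectors of each third-level category and return the categories of the top-K centers. Everything here is hypothetical for illustration: the two categories, the 2-dimensional center vectors (standing in for length-192 vectors), and the stand-in encoded vector.

```python
import numpy as np

# Hypothetical precomputed data: each third-level category keeps the
# cluster-center vectors of its source product names.
centers = {
    'imported fruit': np.array([[0.9, 0.1], [0.8, 0.2]]),
    'leaf vegetable': np.array([[0.1, 0.9], [0.2, 0.8]]),
}

def candidate_categories(vec1, centers, k=2):
    """Rank every (category, center) pair by cosine similarity to the
    new product's vector; return the categories of the top-K centers."""
    scored = []
    for name, vecs in centers.items():
        for v in vecs:
            sim = float(np.dot(vec1, v) / (np.linalg.norm(vec1) * np.linalg.norm(v)))
            scored.append((sim, name))
    scored.sort(reverse=True)          # highest similarity first
    out = []
    for _, name in scored[:k]:
        if name not in out:            # deduplicate category names
            out.append(name)
    return out

new_product_vec = np.array([0.95, 0.05])   # stand-in for the encoded product name
print(candidate_categories(new_product_vec, centers))  # ['imported fruit']
```

The cost per query is one similarity per stored center rather than one per stored product name, which is the saving the method claims.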
Applied to product classification in practice, the text classification method of the embodiment uses a textCNN network structure to represent product names, selects representative points of the third-level categories through a clustering model, computes the distance between the new product and each category center, and finally realizes third-level classification.
Fig. 2 shows a flowchart of a more complete text classification method in an embodiment of the present invention, with the following steps.
In step S201, a target text is acquired and encoded into a first text vector using a convolutional network.
In step S202, a plurality of text libraries, each comprising a plurality of source text, are retrieved from a storage system.
In step S203, the source text in each text library is divided into a plurality of categories by using a clustering model, and each category is represented by a second text vector corresponding to the clustering center of the category.
In step S204, distances between the first text vector and the plurality of second text vectors are calculated, and a text library where the cluster centers corresponding to the K nearest second text vectors are located is used as a candidate text library of the target text by using the ranking model, where K is a positive integer.
This embodiment is a more complete text classification method than the previous one. Step S201 and steps S203-S204 are the same as steps S101-S103 in fig. 1 and are not repeated here.
In step S202, a plurality of text libraries are retrieved from a storage system, each text library comprising a plurality of source texts.
Because the text libraries contain a huge amount of data, they are stored in a storage system and must be retrieved before the target text can be classified and matched. Each text library contains a plurality of different source texts, the source texts within a library are closely related to one another, and each text library represents one category.
Through the text classification method of this embodiment, the several text libraries most likely to match the target text can be selected. After the candidate text libraries of the target text are selected, the classification result can be fed back to the initiator of the text classification request.
In combination with the above embodiment, when the text classification method of this embodiment is applied to the classification of a new product, not only must the first text vector corresponding to the new product name be obtained, but also the third-level categories of a large number of existing products: the product data organized by third-level category is stored in a storage system, the data is extracted from the storage system, and the target text is then classified.
The extracted data may be regarded as model data. As shown in table 1 below, it is, for example, commodity data provided by third-party merchants, including commodity names, the third-level category name corresponding to each commodity, commodity codes, and the like. About 3300 third-level categories are stored in the system, and about 2900 remain after data cleaning; sampled partial data is shown in table 1, where the first column is the third-level category name (the name of the third-level classification) and the second column is a sample of the commodities under it.
TABLE 1
As can be seen from table 1, two third-level categories are shown, representing two text libraries, namely LED light sources and USB flash drives. Each category corresponds to a plurality of commodities, whose names can be regarded as source texts; the source texts of each category can in turn be clustered into several classes, each with its own cluster center.
The text classification algorithm of this embodiment combines a textCNN network structure with kmeans clustering, so that new products can be classified quickly. In the computer implementation, the user input is the commodity name data, and the model output is the TOP-K third-level categories to which the new commodity belongs, where the value of K can be set.
To vectorize the name of a new product, i.e. the target text, the new commodity name is encoded into a 192-dimensional vector through the textCNN model; that is, the commodity name is digitized. The commodity vectors within each third-level category are then clustered, the cluster centers are taken as representatives of the category, the distances between the 192-dimensional vector and the cluster-center vectors of each third-level category are calculated, and the K nearest third-level categories are returned as the candidate categories.
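The pipeline described above — encode the name into a vector, represent each third-level category by its cluster centers, and return the K nearest categories — can be sketched as follows. This is a minimal illustration, not the patented implementation: the category names and 3-dimensional vectors are toy stand-ins for the 192-dimensional textCNN vectors.

```python
import math

def cosine_distance(u, v):
    # distance = 1 - cosine similarity; smaller means more similar
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def top_k_categories(target_vec, centers_by_category, k=2):
    """centers_by_category: {category_name: [center_vector, ...]};
    each category is scored by its nearest cluster center."""
    scored = []
    for category, centers in centers_by_category.items():
        best = min(cosine_distance(target_vec, c) for c in centers)
        scored.append((best, category))
    scored.sort()
    return [category for _, category in scored[:k]]

# Toy 3-dimensional stand-ins for the 192-dimensional textCNN vectors.
centers = {
    "LED light source": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "USB flash drive":  [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "glass film":       [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
print(top_k_categories([0.05, 0.1, 0.95], centers, k=2))
# → ['glass film', 'USB flash drive']
```

Only one distance per cluster center is computed, so the cost per query depends on the number of categories and centers, not on the number of commodities.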
Fig. 3a shows a specific flowchart of step S101 shown in fig. 1. The method specifically comprises the following steps.
In step S1011, a word segmentation operation is performed on the target text.
A word segmentation operation is performed on the target text, i.e. the target text is split into a plurality of characters or words. Assuming a certain target text can be divided into n segmented words, the word set after segmentation is W = {w1, w2, …, wn}, where n is the total number of words in the text. Therefore, after word segmentation, the first word segmentation set corresponding to the target text is expressed as W1 = {w1, w2, …, wn}.
Fig. 3b shows a schematic diagram of the encoding process of the target text in the embodiment of the present invention. In combination with fig. 3a and fig. 3b, the commodity-name encoding part adopts a textCNN network and finally converts a commodity name consisting of a plurality of words into a concatenated output vector. In fig. 3b, the commodity name "pure-color mobile phone screen explosion-proof tempered glass film" is taken as an example to illustrate the whole network calculation process.
First, when the user inputs "pure-color mobile phone screen explosion-proof tempered glass film", the jieba word segmentation package or another word segmentation method is used to split the name into words: "pure color", "mobile phone", "screen", "explosion proof" and "tempered glass"; the next operation is then performed.
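One classical segmentation strategy that dictionary-based segmenters such as jieba build on is forward maximum matching. The sketch below is a toy stand-in, not jieba itself: the vocabulary is hypothetical and hand-picked for this one example, whereas jieba ships a large dictionary and a more sophisticated algorithm, so its actual segmentation may differ.

```python
# Hypothetical toy vocabulary; jieba uses its own large dictionary.
VOCAB = {"纯色", "手机", "屏幕", "防爆", "钢化玻璃"}

def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

print(forward_max_match("纯色手机屏幕防爆钢化玻璃膜", VOCAB))
# → ['纯色', '手机', '屏幕', '防爆', '钢化玻璃', '膜']
```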
In step S1012, the word segmentation set corresponding to the target text is mapped to a word vector by a mapping function.
In this embodiment, each segmented word is converted into a vector through a mapping function, for example by mapping the word into a vector using word2vec features.
In step S1013, dimension filling is performed for each word vector, and a random value is given.
After word segmentation and word embedding, a utility function is used to pad the segmentation result of each commodity name so that all names have the same token length after padding. The five words are then each assigned a random 128-dimensional vector with values between -1 and 1.
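The padding and random-initialization steps can be sketched as follows. The pad token, the fixed length of 8, and the seed are illustrative assumptions; the embodiment specifies only the 128-dimensional vectors with values in [-1, 1], which in training would be tuned (or seeded from word2vec).

```python
import random

PAD = "<pad>"        # hypothetical padding token
EMBED_DIM = 128      # per the embodiment: 128-dim values in [-1, 1]

def pad_tokens(tokens, target_len):
    # Right-pad every segmented product name to the same length.
    return tokens + [PAD] * (target_len - len(tokens))

def build_embeddings(vocab, dim=EMBED_DIM, seed=0):
    # Assign each word a random vector uniform in [-1, 1].
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

tokens = ["纯色", "手机", "屏幕", "防爆", "钢化玻璃"]
padded = pad_tokens(tokens, 8)
table = build_embeddings(set(padded))
matrix = [table[w] for w in padded]   # an 8 x 128 input to the CNN
print(len(matrix), len(matrix[0]))
# → 8 128
```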
In this embodiment, the commodity name data is digitized through a word embedding matrix and the commodity name is encoded with the convolutional network textCNN: the commodity name characters are converted into 32-dimensional vectors and then output. In implementation, the output of the penultimate layer of the convolutional model, a vector of length 192, is extracted as the final output, and this 192-length vector is used as the first text vector.
In step S1014, a convolution and pooling operation is performed on each word vector, and then stitching is performed to obtain a first text vector.
In the convolution operation, convolution windows of different sizes cover different numbers of words; because the contexts of a word are semantically correlated, convolving with windows of different sizes allows as many context words as possible to be covered. A pooling operation is then performed on the convolution result: pooling reduces the dimension of the convolution result, and maximum pooling selects the maximum value among the outputs of a convolution window of a given size as its final result. The pooled outputs of the different convolution windows are then concatenated into one vector, and the concatenated vector is taken as the digital representation of the commodity name.
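The convolve–max-pool–concatenate step can be sketched in miniature. This is a hedged illustration of the textCNN idea only: the window sizes (2, 3, 4), the uniform toy filter weights, and the tiny 4-token, 4-dimensional input are all assumptions; the real model learns many filters per window size and outputs the 192-length penultimate-layer vector.

```python
def conv1d_valid(matrix, window, weights):
    """Slide a filter of `window` rows over the token matrix;
    each position yields one scalar (dot product with the filter)."""
    dim = len(matrix[0])
    out = []
    for i in range(len(matrix) - window + 1):
        s = 0.0
        for r in range(window):
            for c in range(dim):
                s += matrix[i + r][c] * weights[r][c]
        out.append(s)
    return out

def textcnn_features(matrix, windows=(2, 3, 4)):
    dim = len(matrix[0])
    feats = []
    for w in windows:
        # Toy averaging filter; a trained model learns these weights.
        weights = [[1.0 / (w * dim)] * dim for _ in range(w)]
        feats.append(max(conv1d_valid(matrix, w, weights)))  # max pool
    return feats  # one pooled value per window size, concatenated

matrix = [[0.1] * 4, [0.3] * 4, [0.2] * 4, [0.4] * 4]  # 4 tokens x 4 dims
print(textcnn_features(matrix))
```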
This flow is a specific description of the vectorization of the target text. In this embodiment the description uses commodity names as an example, but other kinds of text may be classified in the same way.
The classification flow of the clustering model of the text classification method according to the embodiment of the present invention will be briefly described with reference to fig. 4.
Fig. 4 shows a specific flowchart of step S102 shown in fig. 1.
In step S1021, the number of clustering centers corresponding to each text library is set.
In this embodiment, the kmeans method is used, for example, to cluster within each third-level category: the number of cluster centers into which each text library is to be divided is set first, and clustering is then performed. For example, the number of cluster centers is 4: 4 clusters are selected in each third-level category, and the coordinates of the 4 cluster centers of each third-level category are retained.
In step S1022, all source texts in each text library are clustered into a set number of categories using a clustering model.
After the number of cluster centers is determined, clustering is performed several times according to that number, and finally four cluster centers are selected from each text library, respectively representing four classes of source texts.
In step S1023, each corresponding category is represented with coordinates of the cluster center as a second text vector.
In this embodiment, the cluster center is represented as a vector: for example, the coordinates of the cluster center may be used as the second text vector, and if the length of the vector is insufficient, it may be padded to the required length by some means.
In step S1024, the clustering result of each text library is stored.
The clustering result of each text library is stored to facilitate retrieval in subsequent calculations. Because the data volume is huge, storing the results allows them to be fetched at any time without fear of data loss.
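The clustering flow above (S1021–S1024) can be sketched with a plain kmeans implementation. This is an illustrative sketch rather than the patented code: the toy 2-dimensional points stand in for the 192-dimensional source-text vectors, and the fixed seed and iteration count are assumptions.

```python
import random

def kmeans(points, k=4, iters=20, seed=0):
    """Plain k-means: returns k center coordinates, i.e. the second
    text vectors that represent the categories of one text library."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            buckets[j].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old center if a cluster empties out
                dim = len(bucket[0])
                centers[j] = [sum(p[d] for p in bucket) / len(bucket)
                              for d in range(dim)]
    return centers

# Toy 2-D stand-ins for the 192-dimensional source-text vectors.
library = [[0, 0], [0.1, 0], [5, 5], [5.1, 5],
           [0, 5], [0.1, 5], [5, 0], [5.1, 0]]
centers = kmeans(library, k=4)
print(len(centers))
# → 4
```

The returned center coordinates are what step S1024 would store for later distance calculations.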
During clustering, the clustering results are cached as initial data, so that when the similarity between a new commodity and the cluster centers is calculated, only the distances to the 4 centers of each of the roughly 2900 categories need to be computed, rather than the similarity to every commodity, which greatly reduces the amount of computation required for text matching. The clustering model also scales well: even if the number of commodity SKUs provided by third-party merchants increases suddenly, the amount of computation remains relatively stable, and the model can respond quickly in the form of an interface service. The method greatly reduces both the amount and the difficulty of computation in text classification, and avoids the situation in which a text cannot be classified accurately because the commodity name contains no keyword, thereby improving the accuracy and precision of text classification.
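The fixed per-query cost claimed above follows directly from the figures given in the embodiment:

```python
# ~2900 third-level categories after cleaning, 4 centers each:
# a new product is compared against a constant number of vectors,
# independent of how many SKUs each category already holds.
num_categories = 2900
centers_per_category = 4
comparisons_per_query = num_categories * centers_per_category
print(comparisons_per_query)
# → 11600
```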
It should be noted that the present invention is not limited to the text vector acquisition algorithm and the clustering algorithm provided in the above embodiments, and other algorithms may also be used to practice the text classification method provided in the embodiments of the present invention.
Fig. 5 shows a block diagram of a text classification system in an embodiment of the invention.
The text classification system 500 includes a target text acquisition unit 501, a clustering unit 502, and a screening unit 503.
The target text obtaining unit 501 is configured to obtain a target text, and encode the target text into a first text vector by using a convolutional network;
The clustering unit 502 is configured to divide the source text in each text library into a plurality of categories by using a clustering model, where each category is represented by a second text vector corresponding to a clustering center of the category;
The screening unit 503 is configured to calculate distances between the first text vector and the plurality of second text vectors, and use a text library where the cluster centers corresponding to the K nearest second text vectors are located as a candidate text library of the target text by using the ranking model, where K is a positive integer.
In this embodiment, the text classification system 500 encodes the target text into a first text vector using a network structure, clusters the source texts of a plurality of text libraries using a clustering model so that each category is represented by the second text vector of its cluster center, calculates the distances between the first text vector and the plurality of second text vectors, and selects the text libraries whose cluster centers are nearest to the first text vector as the candidate text libraries of the target text, thereby completing the classification. By encoding the target text and clustering the text libraries, the text classification problem is converted into computation between vectors: the relation between the target text and the text libraries is known simply by calculating inter-vector distances. The target text does not need to be matched against every source text in a huge text library, and similarities do not need to be calculated one by one, which simplifies the calculation steps, saves computation, completes text classification quickly and accurately, improves classification efficiency, and avoids the drawbacks of classification by keywords.
In one embodiment, the distance between the first text vector and the second text vector is represented by a text similarity, and the text similarity includes cosine similarity; the screening unit 503 includes a sorting unit (not shown in the figure) and a selecting unit (not shown in the figure).
The sorting unit is used for sorting the cosine similarities with a sorting function; the selecting unit is used for selecting the text libraries where the cluster centers corresponding to the second text vectors whose cosine similarity ranks in the first K positions are located as the candidate text libraries of the target text.
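The work of the sorting and selecting units can be sketched as follows. This is an illustrative sketch under stated assumptions: the library names, toy 2-dimensional vectors, and the deduplication of repeated library names are choices made for the example, not details fixed by the patent.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def rank_top_k(first_vec, second_vecs, k):
    """second_vecs: list of (library_name, center_vector); returns
    the libraries whose centers rank in the first K by similarity."""
    sims = [(cosine_similarity(first_vec, vec), name)
            for name, vec in second_vecs]
    sims.sort(reverse=True)  # the sorting function of the sorting unit
    seen, libraries = set(), []
    for _, name in sims:     # the selecting unit picks the libraries
        if name not in seen:
            seen.add(name)
            libraries.append(name)
        if len(libraries) == k:
            break
    return libraries

centers = [("A", [1.0, 0.0]), ("A", [0.9, 0.1]),
           ("B", [0.0, 1.0]), ("C", [0.7, 0.7])]
print(rank_top_k([1.0, 0.1], centers, k=2))
# → ['A', 'C']
```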
Fig. 6 shows a block diagram of a text classification system according to another embodiment of the present invention; the embodiment shown in fig. 6 adds a text library obtaining unit 601 to the embodiment of fig. 5.
The text library obtaining unit 601 is configured to obtain a plurality of text libraries from a storage system, where each text library includes a plurality of source texts.
Fig. 7 shows a specific configuration diagram of the target text acquisition unit 501 in the text classification system of the embodiment of the present invention.
The target text acquisition unit 501 of the text classification system 500 includes a word segmentation unit 5011, a mapping unit 5012, a filling unit 5013, and a concatenation unit 5014.
The word segmentation unit 5011 is used for performing word segmentation operation on the target text;
the mapping unit 5012 is configured to map the word segmentation set corresponding to the target text into a word vector through a mapping function;
the filling unit 5013 is used for dimension filling of each word vector and giving a random value;
The stitching unit 5014 is configured to perform a convolution and pooling operation on each word vector, and then stitch the word vectors to obtain a first text vector.
Fig. 8 shows a specific structural diagram of the clustering unit 502 in the text classification system according to the embodiment of the present invention.
The clustering unit 502 of the text classification system 500 includes a number setting unit 5021, a classification unit 5022, a vector display unit 5023, and a cluster storage unit 5024.
The number setting unit 5021 is used for setting the number of clustering centers corresponding to each text library;
the classifying unit 5022 is used for clustering all source texts in each text library into a set number of categories by adopting a clustering model;
the vector display unit 5023 is used for representing each corresponding category by taking the coordinates of the clustering center as a second text vector;
the cluster storage unit 5024 is used for storing a cluster result of each text library.
In one embodiment, the number of cluster centers is 4.
It should be understood that the systems and methods of the embodiments of the present invention correspond to each other, and the description of the systems is therefore relatively abbreviated.
Fig. 9 shows a block diagram of a text classification apparatus according to an embodiment of the present invention. The apparatus shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention in any way.
Referring to fig. 9, the text classification apparatus 900 includes a processor 901, a memory 902, and an input-output device 903 connected by a bus. The memory 902 includes a read-only memory (ROM) and a random access memory (RAM); the memory 902 stores the computer instructions and data necessary for performing the system functions, and the processor 901 reads the computer instructions from the memory 902 to perform various appropriate actions and processes. The input-output device includes an input section including a keyboard, a mouse, etc.; an output section including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, etc.; a storage section including a hard disk or the like; and a communication section including a network interface card such as a LAN card or a modem. The memory 902 also stores the following computer instructions to perform the operations specified by the text classification method of the embodiment of the invention: acquiring a target text, and encoding it into a first text vector using a convolutional network; dividing the source texts in each text library into a plurality of categories using a clustering model, wherein each category is represented by the second text vector corresponding to its cluster center; and respectively calculating the distances between the first text vector and the plurality of second text vectors, and taking the text libraries where the cluster centers corresponding to the K nearest second text vectors are located as candidate text libraries of the target text using a ranking model, where K is a positive integer.
Accordingly, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed, perform operations specified by the text classification method described above.
The flowcharts and block diagrams in the figures illustrate the possible architecture, functions, and operation of the systems, methods, and apparatus of the embodiments of the present invention. A block in a flowchart or block diagram may represent a module, a program segment, or a code segment containing executable instructions for implementing the specified logical function(s). It should also be noted that the executable instructions implementing the specified logic functions may be recombined to produce new modules and program segments. The blocks of the drawings and their order are therefore merely intended to better illustrate the processes and steps of the embodiments and should not be taken as limiting the invention itself.
The various modules or units of the system may be implemented in hardware, firmware, or software. The software includes, for example, programs written in programming languages such as JAVA, C/C++/C#, SQL, and the like. Although the steps of the embodiments of the present invention are presented in a particular order in the methods and apparatus, the executable instructions implementing the specified logical functions of the steps may be rearranged to produce new steps. The order of the steps is not limited to the order shown in the method illustrations and may be adjusted as the functions require; for example, some of the steps may be performed in parallel or in reverse order.
Systems and methods according to the present invention may be deployed on a single server or on multiple servers. For example, different modules may be deployed on different servers to form dedicated servers, or the same functional unit, module, or system may be distributed across multiple servers to relieve load pressure. The server includes, but is not limited to, PCs, PC servers, blade servers, supercomputers, and the like, connected on the same local area network or through the Internet.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of text classification, comprising:
Acquiring a target text, and encoding the target text into a first text vector by adopting a convolution network;
setting the number of the clustering centers corresponding to each text library; clustering all the source texts in each text library into a set number of categories by adopting the clustering model; representing each corresponding category by using the coordinates of the clustering center as a second text vector; storing the clustering result of each text library, wherein the number of the text libraries corresponds to the category number of the lowest-level classification; and
And respectively calculating the distances between the first text vector and the plurality of second text vectors, adopting a sequencing model to take the text library where the clustering centers corresponding to the K nearest second text vectors are located as candidate text libraries of the target text, wherein K is a positive integer larger than 1.
2. The text classification method of claim 1, further comprising: a plurality of text libraries are obtained from a storage system, each text library comprising a plurality of source texts.
3. The text classification method of claim 1 wherein obtaining the target text and encoding it into the first text vector using a convolutional network comprises:
word segmentation operation is carried out on the target text;
mapping the word segmentation set corresponding to the target text into word vectors through a mapping function;
performing dimension filling on each word vector, and giving a random value;
And carrying out convolution and pooling operation on each word vector, and then splicing to obtain a first text vector.
4. The text classification method of claim 1, wherein the number of cluster centers is 4.
5. The text classification method of claim 1, wherein a distance between the first text vector and the second text vector is represented by a text similarity.
6. The text classification method of claim 5, wherein the text similarity comprises cosine similarity.
7. The text classification method according to claim 6, wherein using the text library where the cluster center corresponding to the K nearest second text vectors is located as the candidate text library of the target text by using a ranking model comprises:
Sorting the cosine similarity by adopting a sorting function;
And selecting the text library where the cluster center corresponding to the second text vector with the ordered cosine similarity at the first K bits is located as a candidate text library of the target text.
8. The text classification method of claim 1, wherein the convolutional network comprises a textCNN network and the cluster model comprises a kmeans cluster model.
9. A text classification system, comprising:
the target text acquisition unit is used for acquiring a target text and encoding the target text into a first text vector by adopting a convolution network;
the clustering unit is used for setting the number of the clustering centers corresponding to each text library; clustering all the source texts in each text library into a set number of categories by adopting the clustering model; representing each corresponding category by using the coordinates of the clustering center as a second text vector; storing the clustering result of each text library, wherein the number of the text libraries corresponds to the category number of the lowest-level classification; and
And the screening unit is used for respectively calculating the distances between the first text vector and the plurality of second text vectors, adopting a sorting model to take the text library where the clustering center corresponding to the K nearest second text vectors is located as a candidate text library of the target text, wherein K is a positive integer greater than 1.
10. A computer readable storage medium storing computer instructions which when executed implement the text classification method of any of claims 1 to 8.
11. A text classification device, comprising:
A memory for storing computer instructions;
A processor coupled to the memory, the processor configured to perform a text classification method according to any of claims 1 to 8 based on computer instructions stored by the memory.
CN201910407126.3A 2019-05-16 2019-05-16 Text classification method and text classification system Active CN111949789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407126.3A CN111949789B (en) 2019-05-16 2019-05-16 Text classification method and text classification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910407126.3A CN111949789B (en) 2019-05-16 2019-05-16 Text classification method and text classification system

Publications (2)

Publication Number Publication Date
CN111949789A CN111949789A (en) 2020-11-17
CN111949789B true CN111949789B (en) 2024-07-19

Family

ID=73336837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407126.3A Active CN111949789B (en) 2019-05-16 2019-05-16 Text classification method and text classification system

Country Status (1)

Country Link
CN (1) CN111949789B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120060B (en) * 2021-11-25 2025-02-14 海信集团控股股份有限公司 Book grading method and equipment
CN113987137B (en) * 2021-12-01 2025-04-11 喜大(上海)网络科技有限公司 Classification and identification method, device, electronic device and computer-readable storage medium
CN116842200B (en) * 2023-03-29 2024-06-28 全景智联(武汉)科技有限公司 Event file aggregation management method
CN116720124A (en) * 2023-08-11 2023-09-08 之江实验室 An educational text classification method, device, storage medium and electronic equipment
CN117708322B (en) * 2023-10-17 2025-02-18 航天信息股份有限公司 A text classification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021667A (en) * 2017-12-05 2018-05-11 新华网股份有限公司 A kind of file classification method and device
CN108228576A (en) * 2017-12-29 2018-06-29 科大讯飞股份有限公司 Text interpretation method and device
CN109726391A (en) * 2018-12-11 2019-05-07 中科恒运股份有限公司 The method, apparatus and terminal of emotional semantic classification are carried out to text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9348901B2 (en) * 2014-01-27 2016-05-24 Metricstream, Inc. System and method for rule based classification of a text fragment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021667A (en) * 2017-12-05 2018-05-11 新华网股份有限公司 A kind of file classification method and device
CN108228576A (en) * 2017-12-29 2018-06-29 科大讯飞股份有限公司 Text interpretation method and device
CN109726391A (en) * 2018-12-11 2019-05-07 中科恒运股份有限公司 The method, apparatus and terminal of emotional semantic classification are carried out to text

Also Published As

Publication number Publication date
CN111949789A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN111949789B (en) Text classification method and text classification system
US11776036B2 (en) Generating and utilizing classification and query-specific models to generate digital responses to queries from client device
CN108734212B (en) A method for determining classification results and related device
CN110347908B (en) Voice shopping method, device, medium and electronic equipment
WO2019108276A1 (en) Method and apparatus for providing personalized self-help experience
CN111797622B (en) Method and device for generating attribute information
CN112085565A (en) Deep learning-based information recommendation method, device, equipment and storage medium
CN111737473B (en) Text classification method, device and equipment
CN111931002A (en) Matching method and related equipment
CN110929764A (en) Picture auditing method and device, electronic equipment and storage medium
CN110990563A (en) A method and system for constructing traditional cultural material library based on artificial intelligence
CN111078842A (en) Method, device, server and storage medium for determining query result
CN107203558B (en) Object recommendation method and device, and recommendation information processing method and device
CN108229358B (en) Index establishing method and device, electronic equipment and computer storage medium
CN118468061A (en) Automatic algorithm matching and parameter optimizing method and system
Wang et al. A transformer-based mask R-CNN for tomato detection and segmentation
CN113987168A (en) Business review analysis system and method based on machine learning
CN110851571A (en) Data processing method, apparatus, electronic device, and computer-readable storage medium
CN120067177A (en) Query method, processor, processing system, storage medium, and program product
CN112487141A (en) Method, device and equipment for generating recommended file and storage medium
CN115129871B (en) Text category determining method, apparatus, computer device and storage medium
CN116244442A (en) Text classification method and device, storage medium and electronic equipment
KR20230023600A (en) Device and method for artwork trend data prediction using artificial intelligence
CN110580285B (en) Product label determination method and device and electronic equipment
Sun et al. Classification and recognition of the Nantong blue calico pattern based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant