
CN110807101A - Scientific and technical literature big data classification method - Google Patents

Scientific and technical literature big data classification method

Info

Publication number
CN110807101A
CN110807101A CN201911066136.1A CN201911066136A
Authority
CN
China
Prior art keywords
classification
documents
sentences
keywords
topological relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911066136.1A
Other languages
Chinese (zh)
Inventor
Zhang Xiaodan
Liang Bing
Wang Li
Bai Haiyan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Original Assignee
INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Publication of CN110807101A


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a scientific and technical literature big data classification method, belonging to the technical field of big data text mining. The method comprises: S1, constructing a topological relation graph composed of nodes and edges, where the nodes are documents, sentences, and keywords in STKOS, and the edges are the relations between documents and sentences, documents and keywords, sentences and sentences, sentences and keywords, and keywords and keywords; S2, converting the topological relation graph into a topological relation matrix; S3, training the classification model with the training data and the topological relation matrix constructed from the training data; S4, document classification: inputting batches of documents to be classified into the trained classification model to obtain the probability that each document belongs to each category. Compared with the prior art, the sentences in the constructed topological relation graph take word order into account and the keywords are terms indexed by experts, which improves classification accuracy; the adopted classification model needs no repeated training and samples the input of each convolution layer, which improves classification efficiency.

Description

Scientific and technical literature big data classification method
Technical Field
The invention relates to a classification method for scientific and technical literature big data, in particular a deep learning classification method, and belongs to the technical field of big data text mining. The invention constructs a topological relation graph from documents, sentences, and keywords and realizes document classification through a FASTGCN graph neural network model. The method improves the accuracy and efficiency of classifying scientific and technical literature big data.
Background
Big data mining of scientific and technological literature is currently a hot problem in data mining research, and a key question in this field is how to classify scientific and technological literature big data accurately and efficiently. Deep learning is a big data mining approach that has emerged in recent years and has made progress in document big data classification. Commonly used deep learning methods for document big data include word embeddings, convolutional neural networks (CNN), and LSTM. Although these methods have achieved certain classification results, each has limitations: even after optimization and improvement, word embedding methods remain limited in handling sequential context; CNN methods can only handle input data that forms a regular (grid-structured) matrix; and LSTM methods perform well mainly on short-text classification.
The graph neural network is a model for graph classification developed in the last two years and is currently one of the hot topics in deep learning research. It can process irregular matrices, making up for the limitation of the CNN model. The model performs graph convolution operations on a constructed topological relation graph to extract features and realize classification, and has achieved good classification results in fields such as computer vision and machine translation.
The input of a graph neural network is a topological relation graph, so different topological relation graphs lead to different classification results; the construction of the graph therefore strongly influences the result. Existing graph neural network text classification methods mainly build the topological relation graph from texts, from sentences, or from texts together with extracted words. The method that builds the graph from texts and extracted words achieves high classification accuracy, but because GCN is a transductive graph neural network model that must be retrained at classification time, classification tasks with real-time requirements cannot be guaranteed; moreover, this method does not consider word order when constructing the topological relation graph, which slightly harms accuracy. The invention proposes a new solution aimed mainly at these efficiency and accuracy problems of the model.
Disclosure of Invention
The invention aims to provide a graph neural network classification method that solves the accuracy and efficiency problems of scientific and technical literature big data classification.
The invention is realized by the following technical scheme.
A method for classifying scientific and technical literature big data comprises the following steps:
step 1, constructing a topological relation graph:
The topological relation graph is composed of nodes and edges. The nodes are documents, sentences, and keywords: a document node consists of the title, document keywords, and abstract of a document; a sentence node is a sentence with word-order characteristics extracted from the document abstract; a keyword node is a term in STKOS, a super dictionary developed by the national book literature center;
preferably, the sentence extraction algorithm employs an LSTM method.
Edges are the relations between nodes: documents and sentences, documents and keywords, sentences and sentences, sentences and keywords, and keywords and keywords;
Preferably, the relation between a document and a sentence is described by the similarity of their word2vec representations; the relation between a document and a keyword is described by TFIDF; the relation between sentences is described by the similarity of their word2vec representations; the relation between a sentence and a keyword is described by CHI; and the relation between keywords is described by PMI.
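As an illustration of the word2vec-based similarity named above, the sketch below averages per-word vectors into a sentence vector and compares two sentences with cosine similarity. It is a minimal stand-in, not the patent's implementation: the embedding table, its values, and the simple averaging scheme are all hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sentence_vector(tokens, embeddings):
    """Average the word vectors of a sentence (a stand-in for word2vec output)."""
    return np.mean([embeddings[t] for t in tokens if t in embeddings], axis=0)

# Toy 3-dimensional "word2vec" table (hypothetical values, for illustration only).
emb = {
    "graph":   np.array([0.9, 0.1, 0.0]),
    "network": np.array([0.8, 0.2, 0.1]),
    "music":   np.array([0.0, 0.1, 0.9]),
}

s1 = sentence_vector(["graph", "network"], emb)
s2 = sentence_vector(["music"], emb)
sim = cosine(s1, s2)
print(round(sim, 3))
```

The same cosine measure applies to document-sentence pairs by embedding the document text (title, keywords, abstract) the same way.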
Step 2, converting the topological relation graph into a topological matrix;
The topological matrix is a two-dimensional matrix whose rows and columns correspond to documents, sentences, and keywords; the matrix entries are the relation values between the corresponding nodes;
Step 3, training the classification model by using the training data together with the topological relation matrix constructed from the training data in steps 1 and 2, to obtain a trained classification model;
Preferably, the classification model adopts a FASTGCN model with 3 convolution layers; the activation function is ReLU; the classification function is softmax; and the error function is the cross-entropy function. The model classification result is compared with the labeled input document classification to obtain the error, and the model parameters are trained by back-propagating the error with gradient descent until the error falls within a preset threshold range.
Preferably, in order to improve efficiency, data input to each convolutional layer is sampled and input.
Preferably, a Markov-chain algorithm is selected for sampling.
Step 4, classifying the documents to be classified: a topological relation graph of the batch of documents to be classified is constructed as in step 1 and converted into a matrix as in step 2, and the matrix together with the documents to be classified is input into the FASTGCN model trained in step 3 for classification; the probability that each document belongs to each category is obtained, and the category with the maximum probability is selected as the document classification.
Advantageous effects
To improve classification accuracy, a topological relation graph is constructed from scientific and technical documents, sentences, and keywords: the extracted sentences take word order into account, making up for a deficiency of text GCN, and the keywords are terms indexed by experts in the scientific and technical knowledge organization system STKOS developed by the national book literature center (Knowledge organization system construction thought facing foreign scientific and technical document information, Sun Tan, Liu Wu, Books and Information, 2013.1.(1)). To improve classification efficiency, the FASTGCN classification model is adopted, which overcomes the need for repeated training in the GCN model; in addition, the input of each convolution layer is sampled, which greatly improves classification efficiency. Therefore, the topological relation graph and the FASTGCN classification model of the method achieve accurate and efficient classification.
Drawings
FIG. 1 is a topological relation diagram of scientific and technical literature constructed by the invention.
FIG. 2 is a schematic diagram of a scientific and technical literature classification model constructed by the present invention.
Fig. 3 is a flowchart illustrating a scientific and technical literature classification method according to an embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and embodiments, together with the technical problems solved and the advantages achieved by its technical solutions. The described embodiments are intended only to aid understanding of the invention and do not limit it in any way.
Examples
As an implementation of the object of the present invention, as shown in fig. 3, the process of the scientific and technical literature big data classification method is as follows:
1) constructing a document big data classification topological relation graph
The topological relation graph consists of nodes and edges and is represented as G = (V, E), where V is the set of nodes and E is the set of relations.
As shown in fig. 1, the nodes are divided into three classes, drawn as circles of different sizes: document, sentence, and keyword nodes. Document nodes consist of the title, keywords, and abstract of a document. Sentence nodes are sentences with word-order characteristics extracted from the document abstract; many methods exist for this extraction, such as naive Bayes and maximum entropy, and the LSTM method is adopted here. Keyword nodes adopt terms in STKOS, a super dictionary developed by the national book literature center.
Edges are the different relations among nodes and, according to the relation type, are divided into five kinds: documents and keywords, keywords and keywords, keywords and sentences, sentences and documents, and sentences and sentences.
Then: V = {document, sentence, and keyword nodes}, and E = {document-keyword, keyword-keyword, keyword-sentence, document-sentence, and sentence-sentence relations}.
Many algorithms are available to describe the above relations, for example: pointwise mutual information (PMI), TF-IDF (term frequency-inverse document frequency), mutual information (MI), CHI (chi-square), sen2vec, word2vec, and so on.
PMI is mainly used to calculate the semantic similarity between words. The basic idea is to count the probability that two words occur together in a text: the higher the probability, the tighter the correlation and the higher the relevance. The PMI value of two words word1 and word2 is calculated as follows:

PMI(word1, word2) = log( P(word1, word2) / ( P(word1) · P(word2) ) )

where P(·) denotes the probability of occurrence in a document.
TF-IDF is a commonly used weighting technique in information retrieval and text mining, widely applied in search, document classification, and related fields. Its main idea is that if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, it is considered to have good category-distinguishing ability and is suitable for classification. Term frequency (TF) is the number of times a given term appears in the document. The main idea of inverse document frequency (IDF) is that the fewer the documents containing a given term, the larger the IDF and the better the term distinguishes categories. See (A review of TF-IDF algorithm research, Computer Applications, 2009, 29(z1)).
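A minimal TF-IDF weight in the sense just described can be sketched as follows. The toy corpus and the plain logarithmic IDF (no smoothing) are illustrative assumptions, not the patent's exact formula.

```python
import math

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one document of a collection.
    TF = count of the term in the document / document length;
    IDF = log(N / number of documents containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

docs = [
    ["graph", "convolution", "graph"],
    ["music", "theory"],
    ["graph", "music"],
]
w = tf_idf("graph", docs[0], docs)
print(round(w, 4))
```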
MI is used to measure the amount of information shared between feature words and document classes (Zhang Feng, Xu Xin. Research advances in machine-learning-based text classification technology [J]. Journal of Software, 2006, (9)).
The CHI feature selection algorithm uses the basic idea of hypothesis testing in statistics: first assume that a feature word is unrelated to the category; if the test statistic computed from the CHI distribution deviates far enough from the threshold, the null hypothesis is rejected with high confidence and the alternative hypothesis is accepted, namely that the feature word is highly associated with the category. For details see (A comparative study of feature extraction methods in Chinese text classification, Journal of Chinese Information Processing, 2004, 18(1)).
Preferably, the following algorithms are used in this embodiment to describe the relations between different nodes:
Let the relation value of E be y, and let x1 and x2 be neighboring nodes. According to the relation type, y is respectively:
y = PMI(x1, x2): similarity between keywords;
y = CHI(x1, x2): similarity between keywords and sentences;
y = COS(word2vec(x1), word2vec(x2)): similarity between sentences, and between sentences and documents;
y = TFIDF(x1, x2): relevance of a keyword to a document.
A document is composed of its title, keywords, and abstract. Sentences are obtained by extraction from the document abstract with an LSTM (long short-term memory) model and contain word-order relations. Keywords come from the terminology layer of the super dictionary STKOS developed by the national book literature center.
2) Construction of FASTGCN Classification model
The classification model consists of convolution layers, a fully connected layer, and a classification layer. Scientific and technical documents are learned through the convolution layers to obtain document features, which are input to the fully connected layer and then classified by the classification layer to obtain the final classification result.
As shown in fig. 2, the document topological relation graph is converted into a matrix and input to the classification model for classification, obtaining the category (drawn from the category layer of STKOS) to which each document belongs.
The FASTGCN model cannot directly process the document topological relation graph; therefore, before being input to the FASTGCN model, the graph must be converted into a matrix that the model can process.
The constructed topological structure graph is converted into a topological relation matrix and input to the classification model. The topological relation matrix is the set of relations between the nodes (documents, keywords, and sentences): document-keyword, keyword-keyword, keyword-sentence, sentence-document, and sentence-sentence.
A relation matrix is constructed from the relation values Y, where the set Y = { PMI(x1, x2) (keyword-keyword relation), CHI(x1, x2) (keyword-sentence relation), COS(word2vec(x1), word2vec(x2)) (sentence-sentence and sentence-document relations), TFIDF(x1, x2) (keyword-document relation) }. The columns and rows are documents, sentences, and keywords arranged in order; the value at each position is the relation value between the corresponding pair of documents, sentences, or keywords; the diagonal entries are 1; and, since there is no relation between documents, the matrix elements corresponding to document pairs are set to 0.
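The matrix layout just described can be sketched as follows. The node names and relation values below are hypothetical placeholders; only the layout rules come from the description (rows/columns ordered as documents, sentences, keywords; diagonal entries 1; document-document entries 0).

```python
import numpy as np

# Hypothetical node ordering: documents first, then sentences, then keywords.
nodes = ["doc1", "doc2", "sent1", "sent2", "kw1"]
n = len(nodes)
idx = {name: i for i, name in enumerate(nodes)}

A = np.zeros((n, n))
np.fill_diagonal(A, 1.0)  # diagonal entries are 1, as in the description

# Example relation values (illustrative numbers, not computed here):
relations = [
    ("doc1", "sent1", 0.8),   # document-sentence: word2vec cosine similarity
    ("doc1", "kw1", 0.5),     # document-keyword: TFIDF
    ("sent1", "sent2", 0.6),  # sentence-sentence: word2vec cosine similarity
    ("sent2", "kw1", 0.3),    # sentence-keyword: CHI
]
for a, b, v in relations:
    A[idx[a], idx[b]] = A[idx[b], idx[a]] = v  # relations are symmetric

# Document-document entries stay 0 (no relation between documents).
print(A[idx["doc1"], idx["doc2"]])
```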
A data set is prepared for training, of which 85% is extracted as training data and the remaining 15% as test data. The documents and labels of the training data, the documents and labels of the test data, and the relation matrix constructed from the training data set are used as inputs to train the FASTGCN classification model. For the FASTGCN classification model used in this embodiment, the ReLU function is selected as the activation function, which quickly and accurately handles data transfer between convolution layers; the classification function is softmax; and, so that the neural network converges to the error interval, the cross-entropy function is chosen as the error function. The model classification result (category 1 ... category n in STKOS) is compared with the labeled input document classification to obtain the error, and the model parameters are trained by back-propagating the error with gradient descent until the error falls within the preset threshold range, yielding the trained FASTGCN model.
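A graph-convolution forward pass with ReLU, softmax, and a cross-entropy loss, as configured above, can be sketched as follows. This is a toy illustration, not the FASTGCN model itself (FASTGCN additionally samples nodes per layer); the graph, features, weights, and labels are random data fabricated for shape-checking only.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} used by GCN-style models."""
    d = A.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(d))
    return D @ A @ D

n, f, c = 6, 4, 3                                  # nodes, features, classes
A = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)   # toy chain graph + self-loops
A_hat = normalize(A)
X = rng.normal(size=(n, f))
W1 = rng.normal(scale=0.1, size=(f, 8))
W2 = rng.normal(scale=0.1, size=(8, c))

def forward(A_hat, X):
    H = relu(A_hat @ X @ W1)        # graph convolution layer with ReLU
    return softmax(A_hat @ H @ W2)  # classification layer with softmax

P = forward(A_hat, X)
labels = np.array([0, 0, 1, 1, 2, 2])
loss = -np.log(P[np.arange(n), labels]).mean()     # cross-entropy error
print(P.shape, round(float(loss), 3))
```

Training would back-propagate this loss through W1 and W2 with gradient descent until it falls within the preset threshold, as the embodiment states.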
Preferably, to improve efficiency, the data input to each convolution layer is first sampled to reduce the amount of data. Any sampling algorithm may be used, such as the Box-Muller algorithm, Monte Carlo methods, or Markov-chain methods; in this embodiment the Markov-chain Gibbs algorithm is used (Decomposable Markov network structure learning with missing data, Chinese Journal of Computers, 2004, 27(9)).
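Per-layer node sampling to cut the convolution cost can be sketched as below. Note this uses FastGCN-style importance sampling by squared column norms as a simplified stand-in for the Gibbs/Markov-chain sampler named in the embodiment, and the relation matrix is random toy data.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_layer_nodes(A, size, rng):
    """Sample nodes for one convolution layer with probability proportional
    to the squared column norm of the (normalized) adjacency matrix, as in
    FastGCN-style importance sampling. This is a simplified stand-in for
    the Gibbs sampler cited in the embodiment."""
    q = (A ** 2).sum(axis=0)
    q = q / q.sum()
    return rng.choice(A.shape[0], size=size, replace=False, p=q)

n = 10
A = rng.random((n, n))
A = (A + A.T) / 2  # symmetric toy relation matrix
picked = sample_layer_nodes(A, size=4, rng=rng)
print(sorted(picked.tolist()))
```

Each convolution layer then aggregates only over the sampled nodes, which is what reduces the amount of data per layer.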
3) Classification process
The classification flow is shown in fig. 3.
Documents to be classified are input in batches. A topological relation graph of the batch is constructed and converted into a topological matrix, and the matrix together with the documents to be classified is input into the trained FASTGCN classification model to obtain the category of each document.
To improve classification efficiency, the invention adopts the FASTGCN model for the document big data classification task. The model is an inductive graph neural network model that classifies nodes on a topological relation graph; once trained, it need not be retrained for formal classification, so it is efficient, though on its own its accuracy is lower. To address accuracy, the topological relation graph constructed by the invention uses existing keywords, sentences with word-order relations, and documents as nodes. The keyword nodes adopt existing terms in the super dictionary STKOS developed by the national book literature center, which ensures their accuracy. The sentence nodes are sentences with word-order characteristics extracted from document abstracts, so the topological relation graph contains word-order information. The document nodes are texts composed of the titles, keywords, and abstracts of scientific documents. Feature extraction and classification of scientific and technical literature big data are realized through FASTGCN (the classification model is shown in fig. 2). In this way, classification accuracy is guaranteed to the greatest extent while classification efficiency is also guaranteed. To further improve efficiency, the method samples the input layer and each convolution layer of FASTGCN.
Results of the experiment
The data used in this experiment come from real data of the national book literature center and fall into five categories: general social science theory, military, medicine and health, industrial technology, and aerospace. Each category contains 20,000 records, of which 15,000 are used for training and 5,000 for testing. The document data are in TXT format. The models compared in the experiment are text GCN, text FASTGCN, and the model proposed by the invention.
The topological relation graphs adopted by text GCN and text FASTGCN consist of documents and words extracted from the documents, while the method of the invention uses documents, sentences extracted from the documents, and existing STKOS keywords. The loss function is cross entropy, the classification function is softmax, and the activation function is ReLU.
Through experiments, the classification results are shown in table 1.
TABLE 1 comparison of classification results for classification models
[Table 1 image not reproduced: precision, recall, F1 value, and classification time for text GCN, text FASTGCN, and the proposed model]
The evaluation metrics are precision, recall, F1 value, and classification time. As the table shows, text FASTGCN has the lowest classification accuracy, reaching at most 0.5586, but the shortest classification time.
Text GCN reaches an accuracy of at most 0.9030 but has the longest classification time.
The method proposed by the invention has the highest accuracy, 0.9471, with a classification time greater than that of text FASTGCN and lower than that of text GCN; overall, the proposed method performs best in combined efficiency and accuracy.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A scientific and technical literature big data classification method, characterized in that the method comprises the following steps:
S1, constructing a topological relation graph: the topological relation graph is composed of nodes and edges; the nodes are documents, sentences, and keywords, where a document node consists of the title, document keywords, and abstract of a document, a sentence node is a sentence with word-order characteristics extracted from the document abstract, and a keyword node is a term in STKOS, a super dictionary developed by the national book literature center; the edges are the relations between nodes: documents and sentences, documents and keywords, sentences and sentences, and sentences and keywords;
s2, converting the topological relation graph into a topological relation matrix;
s3, training the classification model by using the training data and the topological relation matrix constructed through S1 and S2 based on the training data to obtain a trained classification model;
S4, classifying documents to be classified: a topological relation graph of the batch of documents to be classified is constructed as in S1 and converted into a matrix as in S2, and the matrix together with the documents to be classified is input into the classification model trained in S3 for classification; the probability that each document belongs to each category is obtained, and the category with the maximum probability is selected as the document classification.
2. The method of claim 1, wherein: the sentence extraction algorithm uses the LSTM method.
3. The method of claim 1, wherein: the relation between a document and a sentence is described by word2vec similarity; the relation between a document and a keyword is described by TFIDF; the relation between sentences is described by word2vec similarity; the relation between a sentence and a keyword is described by CHI; and the relation between keywords is described by PMI.
4. The method of claim 1, wherein: the classification model adopts a FASTGCN model with 3 convolution layers; the activation function is ReLU; the classification function is softmax; the error function is the cross-entropy function; the model classification result is compared with the labeled input document classification to obtain an error, and the model parameters are trained by back-propagating the error with gradient descent until the error falls within a preset threshold range.
5. The method according to any one of claims 1 to 4, wherein: in order to improve efficiency, the data input to each convolutional layer is sampled and then input.
6. The method of claim 5, wherein: and the Markov algorithm is selected for sampling.
CN201911066136.1A 2019-10-15 2019-11-04 Scientific and technical literature big data classification method Pending CN110807101A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910975978 2019-10-15
CN2019109759782 2019-10-15

Publications (1)

Publication Number Publication Date
CN110807101A true CN110807101A (en) 2020-02-18

Family

ID=69501051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066136.1A Pending CN110807101A (en) 2019-10-15 2019-11-04 Scientific and technical literature big data classification method

Country Status (1)

Country Link
CN (1) CN110807101A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170140030A1 (en) * 2008-03-05 2017-05-18 Kofax, Inc. Systems and methods for organizing data sets
US20160078038A1 (en) * 2014-09-11 2016-03-17 Sameep Navin Solanki Extraction of snippet descriptions using classification taxonomies
CN109977223A (en) * 2019-03-06 2019-07-05 中南大学 A method of the figure convolutional network of fusion capsule mechanism classifies to paper

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIE CHEN et al.: "《FastGCN: fast learning with graph convolutional networks via importance sampling》", 《ARXIV》 *
LIANG YAO et al.: "《Graph Convolutional Networks for Text Classification》", 《ARXIV》 *
ZHU XIANG: "《Research on Text Classification and Automatic Summarization Methods Based on Distributed Representation》", 《China Master's Theses Full-text Database, Information Science and Technology》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814842B (en) * 2020-06-17 2023-11-03 北京邮电大学 Object classification method and device based on multichannel graph convolution neural network
CN111814842A (en) * 2020-06-17 2020-10-23 北京邮电大学 Object classification method and device based on multi-channel graph convolutional neural network
CN112163069A (en) * 2020-09-27 2021-01-01 广东工业大学 Text classification method based on graph neural network node feature propagation optimization
CN112163069B (en) * 2020-09-27 2024-04-12 广东工业大学 Text classification method based on graph neural network node characteristic propagation optimization
CN112231476B (en) * 2020-10-14 2023-06-06 中国科学技术信息研究所 Improved graph neural network scientific and technical literature big data classification method
CN112231476A (en) * 2020-10-14 2021-01-15 中国科学技术信息研究所 Improved graph neural network scientific and technical literature big data classification method
CN112380345A (en) * 2020-11-20 2021-02-19 山东省计算中心(国家超级计算济南中心) COVID-19 scientific literature fine-grained classification method based on GNN
CN112380345B (en) * 2020-11-20 2022-03-29 山东省计算中心(国家超级计算济南中心) A GNN-based method for fine-grained classification of COVID-19 scientific literature
CN112434134A (en) * 2020-12-04 2021-03-02 中国科学院深圳先进技术研究院 Search model training method and device, terminal equipment and storage medium
WO2022116324A1 (en) * 2020-12-04 2022-06-09 中国科学院深圳先进技术研究院 Search model training method, apparatus, terminal device, and storage medium
CN112434134B (en) * 2020-12-04 2023-10-20 中国科学院深圳先进技术研究院 Search model training method, device, terminal equipment and storage medium
WO2022193627A1 (en) * 2021-03-15 2022-09-22 华南理工大学 Markov chain model-based paper collective classification method and system, and medium
CN113536508B (en) * 2021-07-30 2023-11-21 齐鲁工业大学 A manufacturing network node classification method and system
CN113536508A (en) * 2021-07-30 2021-10-22 齐鲁工业大学 Method and system for classifying manufacturing network nodes
CN114511027B (en) * 2022-01-29 2022-11-11 重庆工业职业技术学院 English remote data extraction method through big data network
CN114511027A (en) * 2022-01-29 2022-05-17 重庆工业职业技术学院 Method for extracting English remote data through big data network
CN114860937A (en) * 2022-05-17 2022-08-05 海南大学 Sentence classification method and system based on Chinese bionic document abstract
CN114860937B (en) * 2022-05-17 2024-08-06 海南大学 A sentence classification method and system based on Chinese bionic literature abstracts
CN116304110A (en) * 2023-03-30 2023-06-23 重庆工业职业技术学院 Working method for constructing knowledge graph by using English vocabulary data
CN116304110B (en) * 2023-03-30 2023-09-08 重庆工业职业技术学院 Working methods for building knowledge graphs using English vocabulary data
CN119227012A (en) * 2024-12-03 2024-12-31 南京信息工程大学 A quantitative method for correlation degree of literature research fields based on multi-dimensional feature fusion

Similar Documents

Publication Publication Date Title
CN110807101A (en) Scientific and technical literature big data classification method
Abubakar et al. Sentiment classification: Review of text vectorization methods: Bag of words, Tf-Idf, Word2vec and Doc2vec
CN105183833B (en) A user model-based microblog text recommendation method and recommendation device
Etzioni et al. Open information extraction from the web
Adeleke et al. Comparative analysis of text classification algorithms for automated labelling of Quranic verses
CN111178053B (en) Text generation method for generating abstract extraction by combining semantics and text structure
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
Ali et al. Named entity recognition using deep learning: A review
Li et al. Stacking-based ensemble learning on low dimensional features for fake news detection
Monika et al. Machine learning approaches for sentiment analysis: A survey
Kandhro et al. Classification of Sindhi headline news documents based on TF-IDF text analysis scheme
Li et al. Combination of multiple feature selection methods for text categorization by using combinatorial fusion analysis and rank-score characteristic
CN112231476B (en) Improved graphic neural network scientific literature big data classification method
Khekare et al. Text normalization and summarization using advanced natural language processing
Sharma et al. Resume classification using elite bag-of-words approach
CN116756346A (en) An information retrieval method and device
Yelmen Multi-class document classification based on deep neural network and Word2Vec
Aalaa Abdulwahab et al. Documents classification based on deep learning
Alshammary et al. Evaluating The Impact of Feature Extraction Techniques on Arabic Reviews Classification
Alharithi Performance analysis of machine learning approaches in automatic classification of Arabic language
Liu et al. Chinese news text classification and its application based on combined-convolutional neural network
Yao et al. Method and dataset mining in scientific papers
Elnadree et al. Performance Investigation of Features Extraction and Classification Approaches for Sentiment Analysis Systems
Haque et al. Sentiment analysis in low-resource Bangla text using active learning
Nasrullah et al. Sentiment analysis in arabic language using machine learning: Iraqi dialect case study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200218
