
CN119227012B - Literature research field association degree quantification method based on multidimensional feature fusion - Google Patents


Info

Publication number
CN119227012B
Authority
CN
China
Prior art keywords
document
documents
association
field
literature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411755579.2A
Other languages
Chinese (zh)
Other versions
CN119227012A (en
Inventor
韩进
王志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202411755579.2A
Publication of CN119227012A
Application granted
Publication of CN119227012B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for quantifying the degree of association between literature research fields based on multidimensional feature fusion, and belongs to the technical field of literature research association calculation. By constructing a plurality of undirected graphs, the method comprehensively captures the relations among document features and strengthens the association representation of documents. By experimentally comparing and analyzing several feature aggregation operations, it selects the best-performing feature fusion scheme for extracting association features between documents, which improves the prediction accuracy of association degree quantification. A minimum n-dimensional sphere model is used to assess cross-field scientific research capability, which helps researchers identify the contributions and influence of different scholars in multidisciplinary cross research, and an accuracy with error tolerance is used as the evaluation index, which improves flexibility and adaptability under different precision requirements. The method helps researchers quickly locate the most relevant material in a large body of literature, so that literature research better meets actual scientific research needs and interdisciplinary cooperation and innovation are promoted.

Description

Literature research field association degree quantification method based on multidimensional feature fusion
Technical Field
The invention relates to the technical field of literature research association calculation, and in particular to a method for quantifying the association degree of literature research fields based on multidimensional feature fusion.
Background
Methods for quantifying the association degree of literature research fields fall mainly into two groups: methods based on single-relation modeling and methods based on multidimensional feature enhancement. Traditional single-relation methods focus on one type of association, such as co-authorship or citation relations. They struggle to capture the multidimensional cross-associations among documents and, in particular, cannot meet multi-level analysis requirements in highly interdisciplinary research settings, which limits their use in document retrieval and interdisciplinary research.
Existing methods based on multidimensional feature enhancement rely on basic node attributes and their connection relations in a knowledge graph. Their feature representations lack depth and do not readily reveal feature differences among nodes, so the accuracy of the association degree calculation is low, the results lack support from empirical data, and they are hard to interpret. In recent years, graph neural networks (GNNs) have been increasingly used in graph data analysis because they strengthen feature expression.
However, most existing GNN methods adopt a single graph structure and focus on extracting the direct field features of documents for document field classification tasks. They cannot deeply mine the complex association features among documents and apply them to the finer-grained task of calculating the association degree of literature research fields. In addition, the prior art does not effectively distinguish the characteristics of scholars when evaluating interdisciplinary research capability, so in document retrieval and multidisciplinary research it is difficult to accurately reflect the contributions and influence of different scholars in cross research.
Disclosure of Invention
The problem to be solved by the invention is to provide a method for quantifying the association degree of literature research fields based on multidimensional feature fusion, which quickly locates the most relevant material in a large body of literature, reduces computation time, and accelerates the research process, so that literature research better meets actual scientific research needs and the characteristics of scholars are effectively distinguished.
The technical scheme adopted by the invention is a method for quantifying the association degree of literature research fields based on multidimensional feature fusion, comprising the following steps:
Step 1, acquiring target document data from a document database, extracting document metadata, preprocessing it, generating initialized document vectors and literature research field association degree labels, and balancing the target document data set;
Step 2, constructing a plurality of undirected graphs and adjacency matrices based on the document metadata to represent the multidimensional association relations among documents;
Step 3, constructing a graph convolutional neural network model, performing multidimensional feature fusion on the document nodes, and selecting an optimal aggregation strategy to enhance the association representation capability of the document nodes, obtaining enhanced document node features;
Step 4, calculating the field correlation among documents based on the enhanced document node features, dynamically updating the document node feature vectors by iteratively training the model, and evaluating the validity of the association degree quantification for literature research fields;
Step 5, combining the minimum n-dimensional sphere model to quantitatively compare the cross-field scientific research capability of scholars and effectively distinguish the characteristics of scholars.
Preferably, the step 1 method is as follows:
Step 1.1, acquiring target document data from a document database, and extracting key metadata comprising the titles, abstracts, keywords, authors, and publication information of the documents;
Step 1.2, preprocessing the document metadata to obtain initialized document vectors:
removing preset common stop words from the titles and abstracts of the documents using a stop word list, performing Chinese word segmentation on the titles and abstracts using the jieba word segmentation tool, and training a Doc2Vec model on the segmented text, with the model's vector-size parameter set to the document vector dimension, to generate a semantic feature vector for each document;
Step 1.3, preprocessing the document metadata: extracting the field classification number from the document metadata, normalizing it into a fixed-length character string that represents the field major class and the sub-research field, performing prefix matching on the normalized field classification numbers, and generating the literature research field association degree labels;
Step 1.4, performing stratified sampling during balanced sampling: for each prefix category, if the number of samples is less than or equal to the set number of samples, all samples are retained; if the number of samples is greater than the set number, a random seed is set and the specified number of samples is randomly drawn from that category, forming a balanced data set; the balanced data set is then divided into a test set and a training set by prefix category, so that the proportion of each prefix category is consistent between the training set and the test set.
Preferably, the step 2 method is as follows:
Step 2.1, constructing a plurality of different undirected graphs to capture the multidimensional association relations between documents, wherein each undirected graph is represented as $G = (V, E)$, each node $v \in V$ represents a document carrying its document metadata, each edge $e \in E$ represents an association between two documents, and the weight of the edge reflects the strength of that type of association;
Step 2.2, generating a corresponding adjacency matrix $A$ for each undirected graph: initializing an $N \times N$ zero matrix $A$, wherein $A_{ij}$ represents the strength of association between documents $i$ and $j$ in the dimension of that graph and $N$ represents the number of documents, then traversing each pair of documents and filling in the corresponding elements of the adjacency matrix.
Preferably, the step 3 method is as follows:
Step 3.1, combining the semantic feature vectors generated in step 1 with the adjacency matrix information of the different undirected graphs from step 2 to form the initial feature vector of each document node, which is used as the input of the graph convolutional neural network model;
Step 3.2, within each layer of the graph convolutional neural network, aggregating the features of neighbor nodes to update each node;
Step 3.3, selecting an optimal aggregation strategy to enhance the association representation capability of the document nodes, wherein the candidate aggregation strategies comprise a summation strategy, an averaging strategy, and a pooling strategy, thereby obtaining the enhanced document node features.
Preferably, the step 4 method is as follows:
Step 4.1, calculating the field correlation among documents: the Euclidean distance between each pair of enhanced document node vectors is calculated and used as the predicted value of the literature research field association degree quantification method;
Step 4.2, comparing the predicted values with the preset labels, introducing the accuracy with error tolerance as the evaluation index, and evaluating the association degree of literature research fields at different precisions by setting an error threshold;
Step 4.3, repeating step 4.1 and step 4.2 until the model converges; iterative training and updating of the document node feature vectors improves the accuracy and effectiveness of the association degree quantification method.
Preferably, step 5 combines the minimum $n$-dimensional sphere model to quantitatively compare the cross-field scientific research capability of scholars, as follows:
Step 5.1, integrating the feature vectors of all documents into one set $X = \{x_1, x_2, \dots, x_m\}$, wherein each feature vector $x_i \in \mathbb{R}^n$ represents the features of one document, $m$ is the number of document feature vectors, and $n$ is the dimension of the document vectors; the mean of the feature vectors is used as the center of the minimum $n$-dimensional sphere;
Step 5.2, calculating the maximum Euclidean distance from the feature vectors to the feature vector mean and using it as the radius, which defines the minimum $n$-dimensional sphere of the feature vector set;
Step 5.3, for each scholar, extracting all of that scholar's published documents and repeating from step 5.1 to generate the scholar's minimum $n$-dimensional sphere model, then comparing the volumes of the minimum $n$-dimensional spheres of different scholars; the larger the volume, the wider the corresponding scholar's research field and the stronger the scholar's cross-field scientific research capability.
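Steps 5.1 to 5.3 can be illustrated with a minimal sketch. It follows the description literally (mean as center, largest distance to the mean as radius) and compares scholars by the logarithm of the n-ball volume, $\ln V = \tfrac{n}{2}\ln\pi + n\ln r - \ln\Gamma(\tfrac{n}{2}+1)$. Working in log space is an assumption made here to avoid overflow at high dimensions; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math

def centroid(vectors):
    # Component-wise mean of the document vectors (sphere center, step 5.1).
    n = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n)]

def sphere_radius(vectors):
    # Largest Euclidean distance from any vector to the mean (step 5.2).
    c = centroid(vectors)
    return max(math.dist(v, c) for v in vectors)

def log_sphere_volume(radius, n):
    # ln V = (n/2) ln(pi) + n ln(r) - ln Gamma(n/2 + 1)
    if radius <= 0:
        return float("-inf")
    return 0.5 * n * math.log(math.pi) + n * math.log(radius) - math.lgamma(0.5 * n + 1)

# Hypothetical scholars, each with two 2-dimensional document vectors.
scholar_a = [[0.0, 0.0], [2.0, 0.0]]
scholar_b = [[0.0, 0.0], [6.0, 8.0]]

radius_a = sphere_radius(scholar_a)
radius_b = sphere_radius(scholar_b)
log_vol_a = log_sphere_volume(radius_a, 2)
log_vol_b = log_sphere_volume(radius_b, 2)

# The scholar with the larger volume is read as having the broader research field.
broader = "b" if log_vol_b > log_vol_a else "a"
```

Note that a sphere centered on the mean encloses all vectors but is not necessarily the smallest enclosing sphere in the geometric sense; the sketch keeps the construction the steps describe.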
The technical scheme of the invention also provides electronic equipment, which comprises:
One or more processors;
a storage device having one or more programs stored thereon;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the methods for quantifying relevance in the literature study field based on multidimensional feature fusion described above.
The technical scheme of the invention also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the program is executed by a processor, the steps in any of the above method for quantifying the association degree in the literature research field based on multidimensional feature fusion are realized.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
1. The method trains a graph neural network on undirected graphs built along a plurality of association dimensions, obtains the optimal multidimensional feature aggregation scheme through experimental comparison and analysis, and, by examining how the feature ranges of the different dimensions influence the result, determines how association feature extraction can be trained to advantage, thereby improving the accuracy of association degree quantification for literature research fields.
2. By combining multidimensional feature fusion with training on a plurality of text association graphs, the method improves adaptability to the association patterns of different research fields among texts and balances the combined influence of the multiple association dimensions; it is reasonably stable and has a clear advantage in fine-grained association degree calculation.
3. By combining the minimum n-dimensional sphere model, the method quantifies the contributions and influence of scholars in multidisciplinary cross research more scientifically. By constructing a multidimensional feature space, it captures the overall performance of scholars across different disciplines, which provides a more comprehensive and interpretable solution to the problem of comparing cross-field research capability among scholars from different fields.
Drawings
FIG. 1 is a flow chart of a method for quantifying the association degree in the literature research field of the invention;
FIG. 2 is a comparison of different feature aggregation operations provided by embodiments of the present invention;
FIG. 3 is a graph of results of quantification of literature study correlation at different accuracies provided by an embodiment of the present invention;
Fig. 4 is a graph of a model performance quantization result under single-dimensional modeling according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the application are further elaborated below in conjunction with the accompanying drawings. The described embodiments are only some of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the invention. The step numbers in the embodiments are set for convenience of illustration and do not limit the order of the steps; the execution order of the steps can be adjusted as understood by those skilled in the art.
In one embodiment of the present invention, the method for quantifying the association degree of literature research fields based on multidimensional feature fusion uses network data as its source and, as shown in FIG. 1, comprises the following steps:
Step 1, acquiring literature data and preprocessing metadata to generate an initialized literature feature vector and an automatically generated literature research field association degree label, and simultaneously ensuring the balance of a literature data set;
The method comprises the following steps:
Step 1.1, acquiring target document data from a world wide web document database, and extracting key metadata including the titles, abstracts, keywords, authors, and publication information of the documents.
Step 1.2, preprocessing document metadata to obtain an initialized document vector:
First, documents lacking key metadata are discarded to ensure data integrity. Next, common stop words are removed from the titles and abstracts by matching a custom stop word list with regular expressions, which improves text quality. The titles and abstracts are then segmented into Chinese words with the jieba word segmentation tool to produce segmented text. Finally, a Doc2Vec model is trained on the segmented text, with the model's vector-size parameter set to the document vector dimension $n$, generating a semantic feature vector for each document.
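The stop-word removal step can be sketched with the standard `re` module. This is only an illustration of the regular-expression stage: the stop word list and the sample title are hypothetical, and the jieba segmentation and Doc2Vec training that follow are not shown.

```python
import re

def build_stopword_pattern(stopwords):
    # Longest entries first so multi-character stop words win over their prefixes.
    ordered = sorted(set(stopwords), key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))

def strip_stopwords(text, pattern):
    # Delete every occurrence of a stop word from the raw text.
    return pattern.sub("", text)

# Hypothetical custom stop word list and document title.
STOPWORDS = ["的", "了", "和", "基于"]
pattern = build_stopword_pattern(STOPWORDS)
cleaned = strip_stopwords("基于图神经网络的文献关联度量化方法", pattern)
```

Removing stop words as raw substrings before segmentation, as described, can in principle cut into longer words that contain a stop word, so the list would need to be chosen with that in mind.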
Step 1.3, preprocessing document metadata and automatically generating the literature research field association degree labels:
A field classification number is extracted from the document metadata; the content format of this field is not fixed. The present embodiment normalizes it into a string of length 4, such as G420, where the first letter represents the field major class and the last three digits represent the sub-research field.
Prefix matching is then performed on the normalized field classification numbers: given a document i and a document j, their classification numbers are compared character by character, position by position; a matching position adds 1 and a mismatching position adds 0. The prefix match length represents the distance D(i, j) between documents i and j. This distance is converted into a corresponding discrete association level label for subsequent quantitative evaluation; the larger the number, the higher the association degree between the research fields of the two documents.
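A minimal sketch of the label computation, assuming the count stops at the first mismatching character (which is what "prefix" suggests) and that both codes have already been normalized to four characters. The function name is hypothetical.

```python
def prefix_match_length(code_i, code_j):
    # Count identical leading characters; stop at the first mismatch.
    length = 0
    for a, b in zip(code_i, code_j):
        if a != b:
            break
        length += 1
    return length

# Association level labels D(i, j) for a few hypothetical code pairs.
labels = {
    ("G420", "G420"): prefix_match_length("G420", "G420"),
    ("G420", "G423"): prefix_match_length("G420", "G423"),
    ("G420", "G250"): prefix_match_length("G420", "G250"),
    ("G420", "H420"): prefix_match_length("G420", "H420"),
}
```

With four-character codes the label takes one of five values, 0 through 4, where 4 means identical classification numbers and 0 means different major classes.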
Step 1.4, stratified sampling is used during balanced sampling. Specifically, for each prefix category, if the number of samples is less than or equal to the set number of samples, all samples are retained; if the number of samples is greater than the set number, a random seed is set and the specified number of samples is randomly drawn from that category, so as to form a balanced data set.
The balanced data set is then grouped by prefix category, and 20% of the samples in each group are drawn as the test set, with the remaining samples forming the training set. This keeps the proportion of each prefix category consistent between the training set and the test set, so that the balanced distribution of sample categories is maintained.
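The capping and the 20% per-category split can be sketched with the standard library. The cap of 10, the seed, and the use of the first two characters of the classification number as the prefix category are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def balance_by_prefix(samples, key, cap, seed=42):
    # Group samples by prefix category, then cap each group at `cap` items.
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    balanced = {}
    for prefix in sorted(groups):
        items = groups[prefix]
        balanced[prefix] = items if len(items) <= cap else rng.sample(items, cap)
    return balanced

def stratified_split(balanced, test_ratio=0.2, seed=42):
    # Draw the same fraction from every prefix category for the test set.
    rng = random.Random(seed)
    train, test = [], []
    for prefix in sorted(balanced):
        items = balanced[prefix][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_ratio)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical, heavily imbalanced set of classification numbers.
codes = ["G420"] * 50 + ["G250"] * 10 + ["H310"] * 5
balanced = balance_by_prefix(codes, key=lambda c: c[:2], cap=10)
train, test = stratified_split(balanced)
```

Here the over-represented category is cut from 50 to 10 samples, the other two are kept whole, and each category contributes one fifth of its samples to the test set.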
Step 2, constructing a plurality of undirected graphs based on document metadata to represent multidimensional association relations among documents;
The method comprises the following steps:
Step 2.1, constructing four different undirected graphs to capture the multidimensional association relations among documents and so enrich the feature representation of the nodes. Each undirected graph $G = (V, E)$ is defined as follows:
Each node $v \in V$ represents a document carrying its document metadata, each edge $e \in E$ represents an association between two documents, and the weight of the edge reflects the strength of that type of association.
Preferably, the four different undirected graphs of this embodiment have the following graph structures:
Title word-segmentation graph: if the segmented titles of two documents intersect, that is, they share at least one segmented word, an edge is constructed between the corresponding nodes. The initial weight of the edge is set to the number of shared segmented words, which quantifies the similarity of the titles and extracts field relevance information among the documents.
Common keyword graph: when two documents have the same keyword in their keyword lists, they are connected by an edge in the graph. The weight of the edge is set to the number of shared keywords, which represents the importance of the keywords in the field association.
Journal name association graph: if two documents are published in the same academic journal, an edge is established with weight 1, reflecting the contribution of the journal to the field association.
Field classification graph: if two documents have the same field classification number, an edge is established with weight 1, reflecting the influence of field classification information on the document association degree.
Step 2.2, after each graph structure has been constructed, generating the corresponding adjacency matrix $A$.
An $N \times N$ zero matrix $A$ is initialized, where $A_{ij}$ represents the strength of association between documents $i$ and $j$ in the dimension of that graph and $N$ represents the number of documents. Each pair of documents is then traversed, and the corresponding matrix elements are filled in according to the conditions defined for the graph structures above.
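The four graph definitions and the adjacency-matrix fill can be sketched as follows. The document records, field names, and helper names are hypothetical; plain nested lists stand in for the matrices.

```python
from itertools import combinations

def build_adjacency(docs, weight_fn):
    # Start from an N x N zero matrix and fill it symmetrically pair by pair.
    n = len(docs)
    adj = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        adj[i][j] = adj[j][i] = weight_fn(docs[i], docs[j])
    return adj

def shared_count(field):
    # Edge weight = number of items the two documents share in `field`.
    return lambda a, b: len(set(a[field]) & set(b[field]))

def same_value(field):
    # Edge weight = 1 when the two documents agree on `field`, else 0.
    return lambda a, b: 1 if a[field] == b[field] else 0

# Hypothetical document records after preprocessing.
docs = [
    {"title_tokens": ["graph", "neural", "network"], "keywords": ["GCN", "fusion"],
     "journal": "J1", "code": "G420"},
    {"title_tokens": ["graph", "feature", "fusion"], "keywords": ["fusion"],
     "journal": "J1", "code": "G250"},
    {"title_tokens": ["soil", "moisture"], "keywords": ["hydrology"],
     "journal": "J2", "code": "G420"},
]

graphs = {
    "title": build_adjacency(docs, shared_count("title_tokens")),
    "keyword": build_adjacency(docs, shared_count("keywords")),
    "journal": build_adjacency(docs, same_value("journal")),
    "classification": build_adjacency(docs, same_value("code")),
}
```

A zero entry means no edge, so each matrix encodes both the structure and the edge weights of its graph.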
Step 3, carrying out multidimensional feature fusion processing on the document nodes by using a graph neural network, selecting optimal aggregation operation, and effectively improving the association representation capability of the document nodes;
The method comprises the following steps:
Step 3.1, the semantic feature vectors generated in step 1 and the adjacency matrix information of the different undirected graphs from step 2 together form the initial feature vector of each document node, which is used as the input of the graph convolutional neural network (GCN) model.
Step 3.2, within each layer, the graph convolutional neural network model aggregates the features of neighbor nodes to update each node.
For each node, the feature vectors of its neighboring nodes are aggregated. The aggregation function may be chosen from different operations, such as summation, averaging, or pooling. The node update is:
$$H^{(l+1)} = \operatorname{AGG}_{k}\left(H_k^{(l+1)}\right)$$
where $H_k^{(l+1)}$ denotes the node update on the $k$-th association graph, calculated as:
$$H_k^{(l+1)} = \sigma\left(D_k^{-1/2} A_k D_k^{-1/2} H^{(l)} W_k^{(l)}\right)$$
In layer $l$, information propagation and updating yield a node feature matrix $H_k^{(l+1)}$ for each graph structure, where $A_k$ is the adjacency matrix of the $k$-th graph, $D_k$ is the degree matrix of $A_k$, $W_k^{(l)}$ is the weight matrix, and $\sigma$ is the activation function. $\operatorname{AGG}$ denotes the aggregation operation, and $H^{(l+1)}$ is the feature matrix of layer $l+1$ after feature fusion.
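The fusion across graphs can be illustrated with a deliberately simplified sketch. It propagates features over each graph with a plain product of adjacency matrix and feature matrix, then fuses the per-graph results element-wise. Degree normalization, learned weight matrices, and the activation function are omitted, so this is not a trainable GCN layer; all names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

FUSERS = {
    "sum": sum,
    "mean": lambda vals: sum(vals) / len(vals),
    "max": max,
    "min": min,
}

def fuse(per_graph, mode="sum"):
    # Element-wise aggregation of the per-graph feature matrices.
    op = FUSERS[mode]
    rows, cols = len(per_graph[0]), len(per_graph[0][0])
    return [[op([g[r][c] for g in per_graph]) for c in range(cols)] for r in range(rows)]

def propagate_and_fuse(adjs, feats, mode="sum"):
    # Neighbor aggregation on each graph, then fusion across graphs.
    return fuse([matmul(a, feats) for a in adjs], mode)

# Two hypothetical graphs over two documents, with 2-dimensional features.
adj_title = [[0, 2], [2, 0]]
adj_journal = [[0, 1], [1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0]]

fused_sum = propagate_and_fuse([adj_title, adj_journal], feats, "sum")
fused_mean = propagate_and_fuse([adj_title, adj_journal], feats, "mean")
fused_max = propagate_and_fuse([adj_title, adj_journal], feats, "max")
```

In this toy case summation keeps the combined magnitude of both graphs, averaging halves it, and max keeps only the stronger graph's contribution.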
Step 3.3, to improve the effect of feature fusion, the optimal aggregation strategy is selected by experimentally comparing different aggregation modes.
In this embodiment, the aggregation function is chosen from summation, averaging, or max pooling. The comparison of different feature aggregation operations is shown in FIG. 2, and the results of literature research association degree quantification at different precisions are shown in FIG. 3.
As can be seen from FIG. 3(a), the summation strategy gives the highest accuracy and is adopted as the optimal strategy for multi-graph feature fusion. Summation accumulates the features from the different graph structures directly, which avoids shrinking the feature amplitude and preserves the stronger feature contributions in each graph. By adding the features of the different graphs, the GCN can extract richer inter-node association information from each graph structure, so the nodes express the multidimensional associations among documents more accurately.
The other strategies in this embodiment compare as follows:
Averaging strategy: FIG. 3(b) shows the averaging strategy used for multi-graph feature fusion. The loss of the model on the training and test sets is high at the start, and the accuracy is low. Averaging narrows the dynamic range of the features and dilutes the expressive power of the features from the different graph structures, making it difficult for the nodes to learn the association features effectively early in training. In the first 50 rounds of training the accuracy improves slowly, and although it rises somewhat later, the final accuracy still does not exceed 0.2, which indicates that this strategy cannot fully capture the associations among document nodes.
Maximum and minimum strategies: FIG. 3(c) and FIG. 3(d) show the effect of using the maximum and the minimum, respectively, as the feature fusion strategy. FIG. 3(c) shows that the maximum strategy raises the accuracy faster at the start of training, but as the number of training rounds increases the model overfits severely and the accuracy gradually decreases. The maximum strategy is very sensitive to local peaks and outliers, so the model mistakes local features for global ones, which impairs generalization. In contrast, FIG. 3(d) shows that although the minimum strategy has lower accuracy early in training, it is insensitive to outliers and focuses more on generalizable patterns in the data, so the accuracy rises steadily and gradually reaches a stable level as training progresses. The minimum strategy performs poorly early on but generalizes better after long training.
The final experimental results show that the summation strategy is the optimal strategy for multi-graph feature fusion. During aggregation, each node integrates all of its neighbor nodes (including directly and indirectly associated nodes) when updating, so global information is fused and information loss is avoided. The association features among nodes in the different graph structures are preserved to the greatest extent, and the compression of the dynamic range and the excessive influence of outliers are both avoided, so the GCN learns the multidimensional association features of document nodes more effectively.
Step 4, calculating the field correlation among documents based on the enhanced document node characteristics, dynamically updating document node characteristic vectors through an iterative training model, and evaluating the validity of the document research field correlation degree quantification;
The method comprises the following steps:
Step 4.1, after multidimensional feature fusion, the association features of the document nodes are enhanced. Euclidean distances computed over the set of document nodes serve as the predicted values of the literature research field association degree quantification method. The Euclidean distance between each pair of document vectors is calculated as:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where $p_i$ and $q_i$ are the coordinates of the two points in the $i$-th dimension, $n$ is the dimension of the document vector, and $d(p, q)$ is the predicted value obtained by quantifying the association degree between the documents, which provides the basis for subsequent analysis.
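The pairwise distance computation can be sketched directly from the formula. The three two-dimensional vectors are hypothetical stand-ins for enhanced document node features.

```python
import math
from itertools import combinations

def euclidean(p, q):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise_distances(vectors):
    # One predicted value per unordered pair of documents (i < j).
    return {
        (i, j): euclidean(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }

# Hypothetical enhanced document vectors.
vectors = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = pairwise_distances(vectors)
```

For N documents this produces N(N-1)/2 values, so for large collections the pairs would normally be computed in batches.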
Step 4.2, after the field correlation of each pair of document nodes has been calculated, the predicted value is compared with the preset label to evaluate the accuracy of the quantification.
Because the association degree labels automatically generated in step 1 are discrete numeric levels, traditional regression metrics (such as the mean squared error, MSE, and the mean absolute error, MAE) provide the deviation between the predicted value and the true label but cannot directly reflect the accuracy of the model at a specific association level.
Since practical applications in the literature research field generally do not require an exactly accurate predicted value but rather care whether the predicted value falls within the correct level range, this embodiment introduces the accuracy with error tolerance (TAA) as the evaluation index. By setting a reasonable error threshold $\epsilon$, the index is suitable for evaluating the association degree of the literature research field at different precisions.
Let $y_i$ denote the preset label value, $\hat{y}_i$ the predicted value, and $\epsilon$ the error range. The accuracy $\mathrm{TAA}$ of the literature research field association degree is calculated as:

$$\mathrm{TAA} = \frac{1}{N} \sum_{i=1}^{N} I\left(\lvert y_i - \hat{y}_i \rvert \le \epsilon\right)$$

where $N$ is the total number of samples and $I$ is the indicator function that checks whether the predicted value lies within the allowable error range: a prediction is counted as correct only if the error between $y_i$ and $\hat{y}_i$ is within $\epsilon$.
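The accuracy with error tolerance can be sketched in a few lines. The function name `tolerance_accuracy` and the sample values are hypothetical; the sketch only assumes that a prediction counts as correct when its absolute error does not exceed the threshold.

```python
from typing import Sequence


def tolerance_accuracy(
    labels: Sequence[float], preds: Sequence[float], eps: float
) -> float:
    """Fraction of predictions whose absolute error is within eps."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and equally long")
    hits = sum(1 for y, y_hat in zip(labels, preds) if abs(y - y_hat) <= eps)
    return hits / len(labels)


# Hypothetical discrete labels and continuous predictions.
labels = [1.0, 2.0, 3.0, 4.0]
preds = [1.2, 2.6, 2.9, 4.0]
taa = tolerance_accuracy(labels, preds, eps=0.25)
```

Widening the threshold raises the score, which is why the metric is reported together with the chosen error range.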
Step 4.3, repeating step 4.1 and step 4.2 until the model converges.
By continuously iterating the training and updating the document node feature vectors, the accuracy and effectiveness of the association degree quantification method in the literature research field are improved to a certain extent.
Step 5, combining a minimum $n$-dimensional sphere model to quantitatively compare the cross-field research capability of scholars;
The method comprises the following steps:
Step 5.1, integrating the feature vectors of all documents into one set $S = \{v_1, v_2, \dots, v_m\}$, where each feature vector $v_i \in \mathbb{R}^n$ represents the features of one document, $m$ is the number of document feature vectors, and the $n$ dimensions of the minimum $n$-dimensional sphere correspond to the dimensions of the document vectors.
The mean of the feature vectors is used as the center $c$ of the minimum $n$-dimensional sphere. Because the vectors are produced by the association degree quantification method, the mean is not a single numerical value but a vector that reflects the multidimensional field association formed by the document set:

$$c = \frac{1}{m} \sum_{i=1}^{m} v_i$$

Step 5.2, calculating the maximum Euclidean distance from all feature vectors to the feature vector mean as the radius $r$, which defines the minimum $n$-dimensional sphere covering the whole feature vector set. The radius is calculated as follows:

$$r = \max_{1 \le i \le m} \lVert v_i - c \rVert_2$$

Step 5.3, extracting all published documents of each scholar and repeating step 5.1 to generate the minimum $n$-dimensional sphere model of each scholar. The volumes of the minimum $n$-dimensional spheres of different scholars are compared, the volume being calculated as follows:

$$V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} r^{n}$$

where $\Gamma$ is the gamma function, which reduces to a factorial when its argument is an integer. The larger the volume, the wider the scholar's research field and the stronger the cross-field research capability.
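Steps 5.1 to 5.3 can be sketched as follows, assuming the standard volume formula for an n-dimensional ball. The function name `scholar_ball` and the two toy document sets are hypothetical. Note that a sphere centered on the mean with the maximum distance as radius covers the set but is not necessarily the smallest enclosing sphere in the strict geometric sense.

```python
import math
from typing import List, Sequence, Tuple


def scholar_ball(
    vectors: Sequence[Sequence[float]],
) -> Tuple[List[float], float, float]:
    """Return (center, radius, volume) of the covering n-ball of a document set."""
    m, n = len(vectors), len(vectors[0])
    # Step 5.1: the center is the mean of the document feature vectors.
    center = [sum(v[d] for v in vectors) / m for d in range(n)]
    # Step 5.2: the radius is the largest Euclidean distance to the center.
    radius = max(math.dist(v, center) for v in vectors)
    # Step 5.3: volume of an n-ball with that radius.
    volume = math.pi ** (n / 2) / math.gamma(n / 2 + 1) * radius ** n
    return center, radius, volume


# Hypothetical 2-dimensional document vectors for two scholars.
narrow_scholar = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
broad_scholar = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]

narrow_center, narrow_radius, narrow_volume = scholar_ball(narrow_scholar)
broad_center, broad_radius, broad_volume = scholar_ball(broad_scholar)
```

For high-dimensional document vectors the raw volume can overflow or underflow in floating point; comparing log-volumes computed with `math.lgamma` would be one way to keep the comparison stable.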
Further, in this embodiment, the association degree quantification in the literature research field is experimentally verified on different models, and the experimental results are shown in the following table:
Table 1. Comparison of experimental results for different models
Model | Mean squared error (MSE) | Mean absolute error (MAE) | Accuracy
Document vector (Doc2Vec) | 0.098±0.002 | 0.255±0.001 | 0.011±0.005
Graph convolutional network (GCN) | 0.036±0.001 | 0.134±0.002 | 0.600±0.010
Graph attention network (GAT) | 0.047±0.004 | 0.150±0.001 | 0.667±0.001
Method of the invention (Doc2Vec + GCN + multidimensional fusion) | 0.034±0.001 | 0.122±0.003 | 0.679±0.001
From the table, it can be seen that the method of the invention performs best on all three indicators: MSE, MAE and accuracy. In particular, its accuracy reaches 0.679, higher than that of the other models. The proposed method can therefore quantify the association degree in the literature research field accurately and improve the efficiency and accuracy of literature retrieval.
Further, an ablation experiment is used to verify the effect of the multidimensional feature fusion method.
Single-dimension modeling and the multidimensional feature fusion method were each tested in the ablation experiment, and the results are shown in Fig. 4.
Each sub-graph of Fig. 4 shows the effectiveness of the method under a different single association dimension. Fig. 4 (a) represents the keyword association dimension between two documents: the association degree of the research field is quantified by measuring how strongly the keywords shared by the documents reveal a common subject. Fig. 4 (b) represents the title-level association dimension between two documents: the association degree is quantified by comparing the words shared by the segmented titles, which reflects the consistency of the documents in research direction. Fig. 4 (c) and (d) represent the association dimensions of journal name and field classification name, respectively: whether two documents are published in the same academic journal, and whether their field classification numbers indicate the same field classification name, is used to judge the consistency of their research fields and thereby quantify the association degree of the literature research field.
It can be seen that single-dimension modeling yields poor stability and low accuracy, whereas the multidimensional feature fusion method performs well on multiple indicators. When association features are extracted from only a single-dimension graph structure, model accuracy drops markedly, because a single type of graph structure can learn only the association pattern specific to that dimension and struggles to learn generalized association features; the resulting information loss lowers prediction accuracy. The multidimensional feature fusion method instead trains on several text association graphs jointly, which improves the model's adaptability to the association patterns of different research fields, balances the combined influence of the several association dimensions and gives the model a certain stability.
An embodiment of the invention also provides an electronic device comprising one or more processors and a storage device, the storage device being used to store one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the method for quantifying the association degree of the literature research field based on multidimensional feature fusion.
In an embodiment of the present invention, there is further provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in any of the methods for quantifying the association degree in the literature research field based on multidimensional feature fusion in the above embodiment.
In summary, the method of the invention constructs undirected graphs for a plurality of association dimensions, trains them with a graph neural network and, through an optimal multidimensional feature aggregation scheme chosen according to the influence of the feature range, determines how association features extracted from multiple dimensions can best contribute to training. This effectively improves the accuracy of quantifying the association degree in the literature research field and offers clear advantages in fine-grained association degree calculation. The method can also capture a scholar's overall performance across different academic fields and quantifies interdisciplinary research capability by the size of the scholar's minimum $n$-dimensional sphere, which is both comprehensive and interpretable across multiple dimensions.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these are intended to fall within the scope of protection of the present invention.

Claims (9)

1. A literature research field association degree quantification method based on multidimensional feature fusion is characterized by comprising the following steps:
Step 1, acquiring target document data from a document database, extracting document metadata, preprocessing, generating an initialized document vector and a document research field association degree label, and balancing a target document data set;
Step 2, constructing a plurality of undirected graphs and adjacency matrices based on document metadata to represent multidimensional association relations among documents;
step 3, constructing a graph convolutional neural network model, carrying out multidimensional feature fusion on document nodes, selecting an optimal aggregation strategy to enhance the association representation capability of the document nodes, and obtaining the enhanced document node features, wherein the method comprises the following steps:
Step 3.1, combining the semantic feature vector generated in step 1 with the adjacency matrix information of the different undirected graphs from step 2 to form the initial feature vector of each document node, which is used as input to the graph convolutional neural network model;
Step 3.2, performing an aggregation operation on neighbor nodes within a layer of the graph convolutional neural network to update the nodes, wherein the feature vectors of the neighbor nodes of each node are aggregated, and the update process is as follows:

$$H^{(l+1)} = \bigoplus_{k} H_k^{(l)}$$

wherein $\bigoplus$ denotes the aggregation mode, $H_k^{(l)}$ denotes the node update process of each association graph $k$, and $H^{(l+1)}$ is the feature matrix of layer $l+1$ after feature fusion;
Step 3.3, selecting an optimal aggregation strategy to enhance the association representation capability of the document node, wherein the aggregation strategy comprises a summation strategy, an average strategy and a pooling strategy, and the enhanced document node characteristics are obtained;
step 4, calculating the field correlation among the documents based on the enhanced document node characteristics, dynamically updating document node characteristic vectors through an iterative training model, and evaluating the validity of the association degree quantification in the document research field;
Step 5, combining a minimum $n$-dimensional sphere model to quantitatively compare the cross-field research capability of scholars and effectively distinguish the characteristics of the scholars, as follows:
Step 5.1, integrating the feature vectors of all documents into one set $S = \{v_1, v_2, \dots, v_m\}$, wherein each feature vector $v_i \in \mathbb{R}^n$ represents the features of one document, $m$ is the number of document feature vectors, and $n$ represents the dimension of the document vectors, and using the feature vector mean as the center $c$ of the minimum $n$-dimensional sphere:

$$c = \frac{1}{m} \sum_{i=1}^{m} v_i$$

Step 5.2, calculating the maximum Euclidean distance from all feature vectors to the feature vector mean as the radius of the minimum $n$-dimensional sphere representing the feature vector set, the radius $r$ being calculated as follows:

$$r = \max_{1 \le i \le m} \lVert v_i - c \rVert_2$$

Step 5.3, extracting all published documents of each scholar, repeating step 5.1 to generate the minimum $n$-dimensional sphere model of each scholar, and comparing the volumes of the minimum $n$-dimensional spheres of different scholars:

$$V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} r^{n}$$

wherein $\Gamma$ is the gamma function, which reduces to a factorial when its argument is an integer; the larger the volume of the minimum $n$-dimensional sphere, the wider the research field of the corresponding scholar and the stronger the cross-field research capability.
2. The method for quantifying the association degree of a literature study field based on multidimensional feature fusion according to claim 1, wherein step 1 acquires target literature data and performs preprocessing, and the method comprises the following steps:
Step 1.1, acquiring target document data from a document database, and extracting key metadata comprising titles, abstracts, keywords, authors and publishing information of documents;
step 1.2, preprocessing document metadata to obtain an initialized document vector:
Removing documents with missing key metadata, removing preset common stop words in the document title and abstract by using a stop word list through a regular expression, performing Chinese word segmentation on the document title and abstract by using a jieba word segmentation tool, and training the segmented text by using a Doc2Vec model, wherein model parameter size is document vector dimension, and generating semantic feature vectors of each document;
step 1.3, preprocessing document metadata, extracting field classification number fields from the document metadata, uniformly processing the field classification number fields into a fixed-length character string which is used for representing field major classes and sub-research fields, performing prefix matching on the processed field classification numbers, and generating a document research field association degree label;
Step 1.4, performing stratified sampling in the balanced sampling process: for each prefix category, if the number of samples is less than or equal to the set number, all samples are kept; if it is greater than the set number, a random seed is set and the specified number of samples is randomly drawn from that category, forming a balanced data set; the data set is then grouped by prefix category and divided into a test set and a training set so that the proportion of each prefix category is consistent between the training set and the test set.
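The capped, stratified sampling of step 1.4 can be sketched as follows. This is only an illustration under stated assumptions: samples are `(doc_id, prefix_category)` pairs, and the function name, the cap, the split ratio, the seed and the category strings are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (document id, prefix category)


def balance_and_split(
    samples: List[Sample], cap: int, test_ratio: float, seed: int
) -> Tuple[List[Sample], List[Sample]]:
    """Cap each prefix category, then split every category with the same ratio."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, prefix in samples:
        groups[prefix].append(doc_id)

    train_set: List[Sample] = []
    test_set: List[Sample] = []
    for prefix in sorted(groups):
        docs = groups[prefix]
        if len(docs) > cap:
            docs = rng.sample(docs, cap)  # randomly keep `cap` documents
        else:
            docs = docs[:]  # keep all documents of a small category
            rng.shuffle(docs)
        n_test = round(len(docs) * test_ratio)
        test_set += [(d, prefix) for d in docs[:n_test]]
        train_set += [(d, prefix) for d in docs[n_test:]]
    return train_set, test_set


# Ten documents in one hypothetical category, four in another.
samples = [(f"a{i}", "TP3") for i in range(10)] + [(f"b{i}", "G25") for i in range(4)]
train_set, test_set = balance_and_split(samples, cap=6, test_ratio=0.5, seed=42)
```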
3. The method for quantifying the association degree of the literature research domain based on multidimensional feature fusion according to claim 1, wherein in step 1.3, prefix matching is performed on the processed domain classification number based on a prefix matching algorithm;
According to the prefix matching algorithm, the field classification numbers of document i and document j are compared character by character, position by position; a matching character adds 1 and a mismatch adds 0. The prefix matching length determined in this way gives the distance D(i, j) between the field classification numbers of document i and document j; D(i, j) is converted into the corresponding discrete association level to generate the document research field association degree label, a larger number indicating a higher research field association degree between the documents.
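The prefix matching described in this claim can be sketched as follows. The claim does not spell out how D(i, j) is mapped to a discrete level, so this sketch simply uses the matched prefix length as the label; that mapping, the function names and the example classification numbers are assumptions made for illustration.

```python
def prefix_match_length(code_i: str, code_j: str) -> int:
    """Count leading characters shared by two classification numbers."""
    length = 0
    for a, b in zip(code_i, code_j):
        if a != b:
            break  # the common prefix ends at the first mismatch
        length += 1
    return length


def association_label(code_i: str, code_j: str) -> int:
    """Assumed mapping: the label equals the matched prefix length."""
    return prefix_match_length(code_i, code_j)


# Hypothetical fixed-length classification numbers.
same_subfield = association_label("TP391", "TP393")
same_major_class = association_label("TP391", "TN911")
unrelated = association_label("TP391", "G2500")
```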
4. The method for quantifying the association degree of a literature study field based on multidimensional feature fusion according to claim 2, wherein step 2 constructs a plurality of undirected graphs based on literature metadata, and the method is as follows:
Step 2.1, constructing a plurality of different undirected graphs, each having the structure $G = (V, E)$ and used to capture a multidimensional association between documents, wherein a node $v \in V$ represents a document carrying document metadata, an edge $e \in E$ represents an association between documents, and the weight of the edge reflects the strength of that type of association;
Step 2.2, generating a corresponding adjacency matrix $A$ for each undirected graph structure: initializing an $N \times N$ zero matrix $A$, wherein $A_{ij}$ represents the strength of association of documents $i$ and $j$ in the dimension of the graph structure and $N$ represents the number of documents, then traversing each pair of documents and filling in the corresponding elements of the adjacency matrix.
5. The method for quantifying the association degree of the literature research field based on multidimensional feature fusion according to claim 4, wherein the step 2.1 is to construct a plurality of different undirected graphs including a title word segmentation graph, a common keyword graph, a journal name association graph and a field classification association graph;
If the word-segmented titles of two documents intersect, that is, the two titles share at least one segmented word, an edge is constructed between the corresponding nodes, and the initial weight of the edge is set to the number of shared words, which quantifies the similarity of the titles;
If the keyword lists of two documents contain the same keyword, the two documents are connected in the graph by an edge, and the weight of the edge is set to the number of shared keywords, which represents the importance of the keywords in the field association;
if two documents are published in the same academic journal, establishing an edge, and setting the weight of the edge to be 1 for reflecting the contribution of the journal to which the document belongs to the field association;
If two documents belong to the same domain classification number, establishing an edge, setting the weight of the edge to be 1, and reflecting the influence of domain classification information on the document association degree.
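The four association graphs can be sketched as adjacency matrices built by traversing each pair of documents. The field names (`title_tokens`, `keywords`, `journal`, `cls`), the helper names and the toy documents are hypothetical; only the weighting rules follow the text above.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
WeightFn = Callable[[Doc, Doc], int]


def build_adjacency(docs: List[Doc], weight_fn: WeightFn) -> List[List[int]]:
    """Fill a symmetric N x N adjacency matrix by visiting every document pair."""
    n = len(docs)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = weight_fn(docs[i], docs[j])
    return adj


def shared_count(field: str) -> WeightFn:
    """Edge weight = number of items two documents share in a list field."""
    return lambda a, b: len(set(a[field]) & set(b[field]))


def same_value(field: str) -> WeightFn:
    """Edge weight = 1 when two documents have the same value, else 0."""
    return lambda a, b: 1 if a[field] == b[field] else 0


docs = [
    {"title_tokens": ["graph", "neural", "network"], "keywords": ["GCN", "fusion"],
     "journal": "J1", "cls": "TP391"},
    {"title_tokens": ["graph", "fusion", "method"], "keywords": ["GCN", "fusion", "retrieval"],
     "journal": "J1", "cls": "TP391"},
    {"title_tokens": ["ocean", "climate"], "keywords": ["climate"],
     "journal": "J2", "cls": "P4"},
]

title_adj = build_adjacency(docs, shared_count("title_tokens"))
keyword_adj = build_adjacency(docs, shared_count("keywords"))
journal_adj = build_adjacency(docs, same_value("journal"))
cls_adj = build_adjacency(docs, same_value("cls"))
```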
6. The method for quantifying the association degree in the literature study field based on multidimensional feature fusion according to claim 1, wherein in step 3.2, $H_k^{(l)}$ is calculated as follows:

$$H_k^{(l)} = \sigma\left(\tilde{D}_k^{-\frac{1}{2}} \tilde{A}_k \tilde{D}_k^{-\frac{1}{2}} H^{(l)} W_k^{(l)}\right)$$

wherein $H_k^{(l)}$ represents the node feature matrix of each graph structure $k$ obtained through information propagation and updating at layer $l$, $H^{(l)}$ is the node feature matrix input to layer $l$, $\tilde{A}_k$ is the adjacency matrix, $\tilde{D}_k$ is the degree matrix of $\tilde{A}_k$, $W_k^{(l)}$ is the weight matrix, and $\sigma$ is the activation function.
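A per-graph propagation step followed by summation fusion can be sketched in plain Python. The sketch assumes the commonly used symmetric normalization with added self-loops and a ReLU activation; neither choice is confirmed by the text, and the function names and toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (self-loops are an assumption)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One propagation step on a single association graph with ReLU (assumed)."""
    z = matmul(matmul(normalize_adjacency(adj), h), w)
    return [[max(0.0, v) for v in row] for row in z]


def fused_layer(adjs: List[Matrix], h: Matrix, weights: List[Matrix]) -> Matrix:
    """Run every graph separately, then fuse the results by summation."""
    outputs = [gcn_layer(a, h, w) for a, w in zip(adjs, weights)]
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) for j in range(cols)] for i in range(rows)]


# Two documents, one graph with an edge between them and one without.
adj_linked = [[0.0, 1.0], [1.0, 0.0]]
adj_empty = [[0.0, 0.0], [0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

fused = fused_layer([adj_linked, adj_empty], features, [identity, identity])
```

In the linked graph each document's features are averaged with its neighbor's, while the empty graph leaves them unchanged; summation keeps both contributions in the fused matrix.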
7. The method for quantifying the association degree of the literature study field based on multidimensional feature fusion according to claim 4, wherein step 4 evaluates the validity of the quantification of the association degree of the literature study field, and the method comprises the following steps:
Step 4.1, calculating the field correlation among documents: the Euclidean distance is calculated from the enhanced document node set as the predicted value of the document research field association degree quantification method, wherein the Euclidean distance between each pair of document vectors is calculated as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

wherein $x_i$ and $y_i$ are the coordinates of the two points in the $i$-th dimension, $n$ represents the dimension of the document vectors, and $d(x, y)$ is the predicted value obtained by quantifying the association degree among documents;
Step 4.2, comparing the predicted value with a preset label, introducing the accuracy with error tolerance as an evaluation index, and evaluating the association degree of the literature research field at different precisions by setting an error threshold;
Using $y_i$ to denote the preset label value, $\hat{y}_i$ the predicted value, and $\epsilon$ the error range, the accuracy $\mathrm{TAA}$ of the literature research field association degree is calculated as follows:

$$\mathrm{TAA} = \frac{1}{N} \sum_{i=1}^{N} I\left(\lvert y_i - \hat{y}_i \rvert \le \epsilon\right)$$

wherein $N$ is the total number of samples and $I$ is an indicator function for checking whether the predicted value is within the allowable error range; when the error between $y_i$ and $\hat{y}_i$ falls within $\epsilon$, the prediction is considered correct;
Step 4.3, repeating step 4.1 and step 4.2 until the model converges, and improving the accuracy and effectiveness of the association degree quantification method in the literature research field by continuously iterating the training and updating the document node feature vectors.
8. An electronic device, comprising:
One or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for quantifying relevance in a literature study area based on multidimensional feature fusion as recited in any one of claims 1 to 7.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps in the document research field association degree quantization method based on multidimensional feature fusion as claimed in any one of claims 1 to 7.
CN202411755579.2A 2024-12-03 2024-12-03 Literature research field association degree quantification method based on multidimensional feature fusion Active CN119227012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411755579.2A CN119227012B (en) 2024-12-03 2024-12-03 Literature research field association degree quantification method based on multidimensional feature fusion

Publications (2)

Publication Number Publication Date
CN119227012A CN119227012A (en) 2024-12-31
CN119227012B true CN119227012B (en) 2025-03-04

Family

ID=93943658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411755579.2A Active CN119227012B (en) 2024-12-03 2024-12-03 Literature research field association degree quantification method based on multidimensional feature fusion

Country Status (1)

Country Link
CN (1) CN119227012B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364141A (en) * 2020-11-05 2021-02-12 天津大学 Scientific literature key content potential association mining method based on graph neural network
CN115758132A (en) * 2022-10-21 2023-03-07 淮阴工学院 A vectorization method of scientific and technological literature based on deep learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452725B2 (en) * 2008-09-03 2013-05-28 Hamid Hatami-Hanza System and method of ontological subject mapping for knowledge processing applications
CN110807101A (en) * 2019-10-15 2020-02-18 中国科学技术信息研究所 Scientific and technical literature big data classification method
CN114741519B (en) * 2022-02-18 2024-12-03 北京邮电大学 A paper relevance analysis method based on graph convolutional neural network and knowledge base
CN116561591B (en) * 2023-07-10 2023-10-31 北京邮电大学 Scientific and technological literature semantic feature extraction model training method, feature extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant