
CN120068882B - Method and system for analyzing and predicting the trend of scientific and technological literature - Google Patents


Info

Publication number
CN120068882B
CN120068882B (application CN202510550556.6A)
Authority
CN
China
Prior art keywords
topic
text data
literature
topics
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510550556.6A
Other languages
Chinese (zh)
Other versions
CN120068882A (en)
Inventor
邓明森
王志翔
喻曦
陈旭
安冯竞
宋富洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University of Finance and Economics
Original Assignee
Guizhou University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University of Finance and Economics
Priority to CN202510550556.6A
Publication of CN120068882A
Application granted
Publication of CN120068882B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/194 Calculation of difference between files
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a system for analyzing and predicting topic trends in scientific and technical literature. The method first collects scientific literature text data and preprocesses it; obtains document semantic vectors using a pre-trained semantic analysis model; performs topic clustering on the document semantic vectors to extract multiple topics and their topic representations; hierarchically clusters the topics into a hierarchical structure; divides the text data into time windows and analyzes the change characteristics of each topic to construct an evolution time series; and finally predicts the trend of topic intensity. The method can accurately capture the deep semantics of the literature, effectively identify its focus areas, predict its future evolution, and provide forward-looking guidance for science and technology decision-making.

Description

Method and system for analyzing and predicting topic trend of scientific and technical literature
Technical Field
The application relates to the technical fields of text mining and literature analysis, and in particular to a method and system for analyzing and predicting topic trends of scientific and technical literature.
Background
In recent years, scientific literature research has encountered new opportunities, but it also faces the explosive growth of scientific literature and the challenge of rapidly iterating research methods.
Traditional scientific literature analysis relies mainly on manual reading, summarization, and qualitative methods such as content analysis and comparative research. While these methods allow a deep understanding of individual texts, they are inefficient for large-scale document collections and make it difficult to objectively quantify the laws of literature evolution. In recent years, text-mining methods based on topic models such as LDA (Latent Dirichlet Allocation) have been introduced into document analysis; they can automatically identify and cluster topics from large numbers of document texts.
The most relevant existing technology is literature topic analysis based on text mining, which preprocesses document texts, extracts features, and performs topic modeling with specific algorithms, thereby revealing the distribution and evolution of literature topics. Its principle is to convert unstructured document text into structured data that can be quantitatively analyzed, using computational linguistics and statistical methods, so as to mine the semantic information and topic patterns contained in the text.
However, the prior art has significant limitations. First, literature content is heterogeneous and scattered, and traditional topic models struggle to accurately understand the deep semantics of literature text. Second, existing methods clearly fail to prioritize frontier technological developments and have difficulty accurately identifying the focus areas of the literature. Third, there is no effective model for predicting the future evolution of the literature, so forward-looking guidance cannot be provided for science and technology decisions. These problems severely limit the ability of scientific literature research to support government decision-making.
Disclosure of Invention
In view of the above, the application provides a method and a system for analyzing and predicting literature topic trends, which address three problems of the prior art: traditional topic models struggle to accurately understand the deep semantics of literature text, the focus areas of the literature are hard to identify accurately, and an effective model for predicting the future evolution of the literature is lacking.
The embodiment of the application provides a method for analyzing and predicting the topic trend of scientific and technical literature, which comprises the following steps:
collecting scientific literature text data, and preprocessing the scientific literature text data to obtain target text data;
Encoding the target text data by utilizing a pre-trained semantic analysis model to obtain a document semantic vector;
Performing topic clustering on the target text data according to the document semantic vector, and extracting a plurality of topics and topic representations of each topic;
hierarchical clustering is carried out on the topics according to the topic representation, so as to form a topic hierarchical structure;
Dividing the target text data into a plurality of time windows, analyzing the change characteristics of the topics in each time window, and constructing a topic evolution time series from the change characteristics, wherein the change characteristics include occurrence frequency and intensity change;
And predicting the intensity change trend of the topics from the topic evolution time series.
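The six claimed steps can be wired together as a data-flow skeleton. The sketch below is illustrative only: every helper is a toy stand-in (the real method uses a BERT encoder, density clustering, and a dynamic topic model), and only the S1 to S6 data flow matches the claim.

```python
def analyze_and_predict(raw_docs, years):
    """Data-flow skeleton of steps S1-S6; each stage is a toy stand-in."""
    target = [d.strip().lower() for d in raw_docs]                   # S1 preprocess
    vectors = [[len(d.split())] for d in target]                     # S2 "semantic" vectors (stub)
    topics = [0 if v[0] <= 4 else 1 for v in vectors]                # S3 topic clustering (stub)
    hierarchy = {"root": sorted(set(topics))}                        # S4 topic hierarchy (stub)
    windows = sorted(set(years))                                     # S5 time windows
    series = {t: [sum(1 for tt, y in zip(topics, years) if tt == t and y == w)
                  for w in windows] for t in set(topics)}
    forecast = {t: s[-1] + (s[-1] - s[-2] if len(s) > 1 else 0)      # S6 linear extrapolation
                for t, s in series.items()}
    return hierarchy, series, forecast

hierarchy, series, forecast = analyze_and_predict(
    ["ai model", "ai deep model training here now", "bio gene"],
    [2021, 2022, 2022])
```

Each stub would be replaced by the corresponding module described below; the skeleton only shows how the outputs of one step feed the next.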
The encoding the target text data by using the pre-trained semantic analysis model to obtain a document semantic vector comprises the following steps:
Segmenting the target text data to obtain text fragments with specified lengths;
Acquiring semantic representations of the text fragments by adopting the semantic analysis model;
integrating the semantic representations of the text segments into document semantic vectors, wherein the document semantic vectors are used for indicating semantic information and context of the target text data.
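As a hedged illustration of the segment/encode/pool flow, the sketch below chunks a document into fixed-length pieces, embeds each chunk with a placeholder hashing encoder (standing in for the pre-trained semantic analysis model), and mean-pools the chunk vectors into one document semantic vector.

```python
import math

def segment(text, max_len=128):
    """Split text into fixed-length chunks (stand-in for token-level windowing)."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)] or [""]

def embed_chunk(chunk, dim=8):
    """Placeholder encoder: hashed character bigram counts instead of BERT."""
    vec = [0.0] * dim
    for i in range(len(chunk) - 1):
        vec[hash(chunk[i:i + 2]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def document_vector(text):
    """Mean-pool chunk embeddings into one document semantic vector."""
    chunks = [embed_chunk(c) for c in segment(text)]
    return [sum(col) / len(chunks) for col in zip(*chunks)]

vec = document_vector("Scientific literature on topic evolution and trend prediction.")
```

A real implementation would swap `embed_chunk` for a transformer encoder; the chunk-then-pool structure is the part the text describes.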
Performing topic clustering on the target text data according to the document semantic vector, and extracting a plurality of topics and topic representations of each topic, wherein the topic clustering comprises the following steps:
Performing dimensionality reduction on the document semantic vector using a preset dimensionality-reduction model to obtain a low-dimensional semantic vector;
based on the low-dimensional semantic vector, performing density clustering on the target text data by using a preset clustering model to form a plurality of topics;
Calculating word characteristic values of words in each topic, and extracting keywords according to the word characteristic values, wherein the word characteristic values comprise word frequency and inverse document frequency;
and optimizing the keywords by adopting a preset maximum similarity matching algorithm to obtain the topic representation of the topic.
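The keyword-extraction step resembles the class-based TF-IDF used by BERTopic-style pipelines. The sketch below is a minimal pure-Python version of that idea, not the patented algorithm: each topic's documents are concatenated into one "class document" and words are scored by in-class frequency weighted against their corpus-wide frequency.

```python
import math
from collections import Counter

def ctfidf_keywords(topic_docs, top_k=3):
    """Class-based TF-IDF: score word w in topic c as
    tf(w, c) * log(1 + avg_class_len / total_freq(w))."""
    classes = [Counter(" ".join(docs).split()) for docs in topic_docs]
    total = Counter()
    for c in classes:
        total.update(c)
    avg_len = sum(sum(c.values()) for c in classes) / len(classes)
    keywords = []
    for c in classes:
        scored = {w: tf * math.log(1 + avg_len / total[w]) for w, tf in c.items()}
        keywords.append(sorted(scored, key=scored.get, reverse=True)[:top_k])
    return keywords

kws = ctfidf_keywords([
    ["deep learning model training", "neural network training"],
    ["quantum computing qubits", "quantum error correction"],
])
```

Words frequent inside one topic but rare across topics rank highest, which is what makes the extracted keywords distinctive topic representations.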
The hierarchical clustering of the topics according to the topic representation to form a topic hierarchical structure comprises the following steps:
Calculating semantic similarity based on the topic representation, and constructing a similarity matrix between topics according to the topic representation and the semantic similarity;
Hierarchical clustering is carried out on the topics by adopting a preset hierarchical clustering model, and a hierarchical system expressed as a tree structure is obtained, wherein the hierarchical system comprises a plurality of primary topics and a plurality of secondary topics corresponding to the primary topics;
and optimizing and adjusting the hierarchical system to form the topic hierarchical structure.
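A minimal sketch of the hierarchy-building step, assuming topic representations are vectors: build a cosine similarity matrix between topics, then repeatedly merge the most similar clusters. Single linkage is shown as one illustrative choice; the preset hierarchical clustering model in the text is not specified.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def similarity_matrix(topic_vecs):
    n = len(topic_vecs)
    return [[cosine(topic_vecs[i], topic_vecs[j]) for j in range(n)] for i in range(n)]

def merge_most_similar(sim, clusters):
    """One agglomerative step: merge the pair of clusters containing the
    most similar topic pair (single linkage)."""
    best, pair = -1.0, None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
            if s > best:
                best, pair = s, (a, b)
    merged = clusters[pair[0]] | clusters[pair[1]]
    return [c for k, c in enumerate(clusters) if k not in pair] + [merged]

sim = similarity_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
clusters = merge_most_similar(sim, [{0}, {1}, {2}])
```

Repeating the merge step until one cluster remains yields the dendrogram, from which primary and secondary topic levels can be cut.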
Dividing the target text data into a plurality of time windows, analyzing the change characteristics of the topics in each time window, and constructing a topic evolution time series from the change characteristics, comprises the following steps:
dividing the target text data along the time dimension according to its release time, to obtain a plurality of consecutive time windows;
performing topic mining on the target text data in each time window using a preset topic mining model;
calculating the topic characteristics of each topic in the different time windows from the topic mining results, wherein the topic characteristics comprise occurrence frequency and intensity;
analyzing the topic characteristics with a preset dynamic topic model to obtain the change characteristics of the topics, wherein the change characteristics indicate how a topic's semantics vary over time;
and generating the topic evolution time series from the change characteristics.
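The window-wise change characteristics can be illustrated as follows. The sketch assumes the topic-mining step yields (year, topic) pairs and uses a topic's share of the window's documents as a simple proxy for intensity; the actual intensity measure is not specified in the text.

```python
from collections import defaultdict

def topic_time_series(records, window_years=1):
    """Build per-topic evolution series from (year, topic_id) records.
    Returns, per topic, sorted (window, frequency, intensity) triples,
    where intensity = topic share of all documents in the window."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for year, topic in records:
        window = year - year % window_years
        counts[topic][window] += 1
        totals[window] += 1
    return {
        topic: sorted((w, c, c / totals[w]) for w, c in per_window.items())
        for topic, per_window in counts.items()
    }

series = topic_time_series([(2021, "AI"), (2021, "AI"), (2021, "bio"),
                            (2022, "AI"), (2022, "bio"), (2022, "bio")])
```

The resulting per-topic series is exactly the kind of input a time-series forecaster can extrapolate in the final prediction step.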
Preprocessing the scientific literature text data to obtain target text data, wherein the preprocessing comprises the following steps:
Evaluating the scientific literature text data by using a preset large language model to obtain a complexity score, wherein the complexity score is used for indicating the expertise, the intersection and the structural complexity of the scientific literature text data;
Classifying the scientific literature text data according to the complexity score to obtain a first type text and a second type text, wherein the first type text is used for indicating that the complexity score is lower than a preset score threshold value, and the second type text is used for indicating that the complexity score is higher than the score threshold value;
performing lightweight analysis on the first type text and performing deep analysis on the second type text to obtain a preliminary processing result;
And obtaining target text data according to the preliminary processing result.
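The score-based triage reduces to a threshold split over the complexity scores. The sketch below is a trivial illustration; the threshold value of 6.0 is an assumption, not taken from the text.

```python
def route_by_complexity(docs_with_scores, threshold=6.0):
    """Split documents into lightweight vs. deep-analysis queues by the
    LLM complexity score (threshold of 6.0 is an illustrative choice)."""
    light = [d for d, s in docs_with_scores if s < threshold]
    deep = [d for d, s in docs_with_scores if s >= threshold]
    return light, deep

light, deep = route_by_complexity([("survey", 3.2), ("cross-domain theory", 8.7)])
```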
The step of evaluating the scientific literature text data by using a preset large language model to obtain a complexity score comprises the following steps:
collecting training texts with complexity labels, and performing fine tuning training on a preset initial large language model to obtain the large language model;
Calculating analysis features of the scientific literature text data by using the large language model, wherein the analysis features comprise professional vocabulary density, domain coverage range and logic level depth;
calculating a text complexity score according to the analysis features, and generating an evaluation basis description according to the text complexity score;
And obtaining the complexity score of the scientific literature text data according to the evaluation basis description.
Performing topic clustering on the target text data according to the document semantic vector, extracting a plurality of topics and topic representations of each topic, and further comprising:
based on the document semantic vector, establishing a scoring index system comprising document grade classification, release time and reference frequency;
constructing a cost-sensitive decision tree from the document semantic vector and the scoring index system, wherein the cost-sensitive decision tree assigns increased weight to samples whose importance reaches a preset importance threshold;
Extracting classification rules from the cost-sensitive decision tree, the classification rules comprising a feature threshold and a classification path;
optimizing initial center point distribution of the topic clustering according to the classification rule, and performing topic clustering on the target text data according to an optimization result.
After predicting the intensity change trend of the topics from the topic evolution time series, the method further comprises the following steps:
Based on the topic intensity change trend, constructing a knowledge graph containing literature entities, influence paths and effect quantification indexes;
establishing a multi-level analysis framework by utilizing the knowledge graph;
establishing a retrieval and reasoning system according to the multi-level analysis framework, wherein the retrieval and reasoning system comprises a reasoning chain from literature to influence;
Adjusting the effect quantization index, and executing multi-scenario prediction by combining the adjusted effect quantization index and the retrieval reasoning system to obtain development trends under different conditions;
Wherein, the inference chain is established by the following steps:
Based on the multi-level analysis framework, decomposing the influence analysis task into a plurality of subtasks with logic dependency relations;
processing the subtasks by using the large language model to obtain a progressive reasoning chain;
Retrieving associated data related to each node of the progressive inference chain in the knowledge graph;
and verifying and correcting an inference result by combining the association data and an inference process indicated by the large language model, and generating an inference chain.
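The subtask decomposition with logical dependencies amounts to processing tasks in dependency order, each step consuming its prerequisites' results. In the sketch below, `solve` stands in for the large-language-model call and the subtask names are hypothetical; only the dependency-ordered chain construction is shown.

```python
def build_inference_chain(subtasks, depends_on, solve):
    """Process subtasks in dependency order, feeding each subtask the
    results of its prerequisites; `solve(name, prereq_results)` stands in
    for the LLM call described in the text."""
    done, chain = {}, []

    def run(name):
        if name in done:
            return done[name]
        prereqs = {d: run(d) for d in depends_on.get(name, [])}
        done[name] = solve(name, prereqs)
        chain.append((name, done[name]))
        return done[name]

    for t in subtasks:
        run(t)
    return chain

chain = build_inference_chain(
    ["identify_entities", "map_influence_paths", "quantify_effects"],
    {"map_influence_paths": ["identify_entities"],
     "quantify_effects": ["map_influence_paths"]},
    solve=lambda name, prereqs: f"result({name})")
```

Verification against the knowledge graph would then be a pass over the finished chain, checking each node's result against retrieved associated data.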
The embodiment of the application also provides a device for analyzing and predicting the topic trend of the scientific literature, which comprises the following steps:
The data preprocessing module is used for acquiring scientific literature text data and preprocessing the scientific literature text data to obtain target text data;
the semantic coding module is used for coding the target text data by utilizing a pre-trained semantic analysis model to obtain a document semantic vector;
The topic clustering module is used for topic clustering the target text data according to the document semantic vector, and extracting a plurality of topics and topic representations of each topic;
The hierarchical analysis module is used for hierarchical clustering of the topics according to the topic representation to form a topic hierarchical structure;
The time evolution analysis module is used for dividing the target text data into a plurality of time windows, analyzing the change characteristics of the topics in each time window, and constructing a topic evolution time series from the change characteristics, wherein the change characteristics include occurrence frequency and intensity change;
and the trend prediction module is used for predicting the intensity change trend of the topics from the topic evolution time series.
The embodiment of the application also provides a computer device, which comprises:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method for analyzing and predicting topic trends of scientific and technical literature.
The embodiment of the application also provides a computer readable storage medium which stores computer instructions for causing a computer to execute the method for analyzing and predicting the topic trend of the scientific literature.
The embodiment of the application also provides a computer program product, which comprises computer instructions, wherein the computer instructions realize the steps of the method for analyzing and predicting the topic trend of the technical literature when being executed by a processor.
The application has the following technical effects:
The method innovatively combines the BERT semantic model with a topic model to build a BERTopic-based scientific literature analysis framework. It designs a multi-level topic classification system, organizing topics through a hierarchical clustering algorithm to support analysis from macroscopic literature trends down to specific measures. It provides a dynamic evolution analysis method for literature topics, quantifying their evolution laws through time-window division and dynamic topic modeling. Finally, it develops a literature trend prediction model based on topic evolution patterns that combines time-series analysis with deep learning, providing forward-looking references for future literature guidance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below. The drawings are incorporated in and constitute a part of the specification; they show embodiments consistent with the present disclosure and, together with the description, serve to illustrate its technical solutions. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other relevant drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for analyzing and predicting trends of a subject of a scientific literature according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation flow of a text representation module based on BERT semantic embedding according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an implementation flow of a BERTopic-based scientific literature topic mining module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation flow of a document topic hierarchical analysis module provided by an embodiment of the present application;
Fig. 5 is a schematic implementation flow diagram of a literature topic time evolution analysis module provided by an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" herein merely describes an association between objects, indicating that three relationships may exist; for example, A and/or B covers three cases: A alone, both A and B, and B alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set formed by A, B, and C.
As shown in fig. 1, an embodiment of the present application provides a method for analyzing and predicting a trend of a subject of a scientific literature, including:
S1, acquiring scientific literature text data, and preprocessing the scientific literature text data to obtain target text data;
The system first collects scientific literature text data through web-crawler technology from channels such as national and local government websites and the official websites of science and technology departments. The collected data include core information such as document titles, issuing institutions, issue dates, and full text, ensuring authoritative and comprehensive data sources. After collection, the system builds a structured document text database to support subsequent analysis. Next, the system cleans the collected text: it removes special characters, punctuation marks, and stop words, unifies text formats, and performs Chinese word segmentation with a professional segmentation tool. Meanwhile, the system extracts metadata such as release time, department, and region, which are important for subsequent time-series analysis and regional comparison. Finally, the preprocessed text data are standardized to ensure quality and consistency, laying a solid foundation for semantic analysis. Because the quality of preprocessing directly affects the accuracy of later stages, the system applies strict data-quality control at this step to guarantee high-quality input data.
S1 specifically comprises:
S1.1, evaluating the scientific and technological literature text data by using a preset large language model to obtain a complexity score, wherein the complexity score is used for indicating the expertise, the intersection and the structural complexity of the scientific and technological literature text data;
This step underpins the intelligent data triage mechanism and aims at a comprehensive quantitative assessment of literature text complexity. The system designs a multi-dimensional complexity assessment scheme focused on three core dimensions: professional depth, interdisciplinarity, and structural complexity. Professional depth evaluates how deeply the literature uses technical terms and scientific concepts; highly professional literature typically contains many domain-specific terms and profound theoretical content. Interdisciplinarity evaluates the breadth of disciplines or technical fields involved; highly interdisciplinary literature often fuses the knowledge and methods of several disciplines. Structural complexity evaluates the complexity of a document's organization, logical relations, and mode of expression; structurally complex literature may contain multi-layer nested clauses, complex logical dependencies, or non-linear content organization. Through deep semantic understanding of the whole document, the large language model can accurately capture these complexity characteristics and generate a comprehensive complexity score. Compared with traditional keyword- or rule-based methods, this deep-learning-based assessment understands the inherent complexity of text more accurately and provides a more reliable basis for subsequent processing decisions.
S1.1.1, collecting training texts with complexity labels, and performing fine tuning training on a preset initial large language model to obtain the large language model;
Constructing the training data set is the key step here: the system collects a large number of scientific literature samples and invites domain experts to annotate their complexity. The annotation considers multiple dimensions, such as the density of terms of art, conceptual depth, complexity of logical structure, and degree of interdisciplinarity. These annotated samples are used to fine-tune pre-trained large language models such as BERT or GPT. Fine-tuning uses supervised learning, optimizing model parameters by minimizing the difference between predicted complexity scores and expert annotations. To enhance generalization, the training data cover scientific literature from different periods, fields, and types, ensuring the model can handle the complexity-assessment needs of diverse texts. Through thorough training and validation, the system obtains a high-performance large language model dedicated to scientific literature complexity assessment, providing a powerful tool for subsequent scoring.
In the fine tuning training process of a large language model, the following technical scheme is adopted specifically:
Training data construction: 10,000 documents of varying complexity are selected from a public scientific literature repository, and 5 domain experts annotate their complexity on a 1 to 10 point scale, yielding a training set with an annotation consistency of 0.85 (kappa coefficient).
Model architecture: a pre-trained language model with a 12-layer Transformer structure, 12 attention heads, a hidden dimension of 768, and about 110 million parameters in total. To adapt to the characteristics of scientific literature, 3,000 technical terms are added to the vocabulary.
Training parameters: the Adam optimizer with a learning rate of 3e-5, a linear learning-rate warm-up and decay schedule, a batch size of 16, and 4 training epochs. The loss function combines mean squared error (MSE) and cross-entropy, optimizing both scoring accuracy and classification accuracy.
Validation: 5-fold cross-validation with mean absolute error (MAE) and accuracy as evaluation metrics; the model reaches an MAE of 0.72 and a complexity-classification accuracy of 85% on the validation set.
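The linear warm-up and decay schedule mentioned above can be written down directly; the warm-up length below is an assumption, since the text does not specify one.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=3e-5):
    """Linear warm-up to base_lr over warmup_steps, then linear decay to 0,
    matching the described fine-tuning setup (lr 3e-5; warm-up fraction
    is an illustrative assumption)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Learning rate rises for the first 100 steps, then decays to zero at step 1000.
schedule = [linear_warmup_decay(s, 1000, 100) for s in range(0, 1001, 100)]
```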
S1.1.2 calculating analysis characteristics of the scientific literature text data by using the large language model, wherein the analysis characteristics comprise professional vocabulary density, domain coverage range and logic hierarchy depth;
These features form the basis of complexity scoring and include the key indicators of professional vocabulary density, domain coverage, and logic hierarchy depth. Professional vocabulary density is computed by recognizing the proportion of technical terms, scientific concepts, and academic vocabulary in the text, reflecting the professional depth of the literature. Leveraging the strong semantic understanding of a large language model, the system can accurately identify technical terms in every field, even emerging scientific concepts. Domain coverage measures the breadth of disciplines or technical fields a document involves; the system judges its interdisciplinary character by analyzing the frequency and distribution of concepts from different domains, and highly cross-fused documents receive a higher complexity assessment. Logic hierarchy depth is an important indicator of structural complexity: the system analyzes the document's hierarchical structure, clause relations, and logical dependencies to quantify its organizational complexity. In addition, the system analyzes auxiliary features such as syntactic complexity, semantic density, and reasoning depth to comprehensively capture a document's complexity. Together these analysis features form a multidimensional feature space that provides a rich information basis for accurate complexity assessment.
S1.1.3 calculating a text complexity score according to the analysis characteristics, and generating an evaluation basis description according to the text complexity score;
The calculation uses a weighted composite scoring method, assigning weights according to the importance of the different features to obtain the final complexity score. The weights are based on domain-expert experience and the statistics of many document analyses, reflecting each feature's contribution to complexity. To improve interpretability, the system also generates a detailed description of the assessment basis, including per-dimension scores, key influencing factors, and typical feature examples. For example, for a document rated highly complex, the system might explain: "the document contains a high density of artificial intelligence terms (35% of its words are terms of art), involves cross-domain content from computer science, statistics, and cognitive science, presents a multi-layer nested structure, and has complex logical dependencies between clauses." Such detailed explanations help users understand the source and meaning of the complexity score, enhancing the credibility and practicality of the results.
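A minimal sketch of the weighted composite score with a generated assessment basis; the weights, the [0, 1] feature normalization, and the mapping to a 1 to 10 range are illustrative assumptions, not the expert-derived values described in the text.

```python
def complexity_score(features, weights=None):
    """Weighted composite of the analysis features, mapped to 1-10, plus a
    human-readable basis string. Weights here are illustrative defaults."""
    weights = weights or {"vocab_density": 0.4, "domain_coverage": 0.3,
                          "logic_depth": 0.3}
    raw = sum(weights[k] * features[k] for k in weights)   # raw in [0, 1]
    score = round(1 + 9 * raw, 1)                          # map to 1-10
    basis = ", ".join(f"{k}={features[k]:.2f} (w={weights[k]})" for k in weights)
    return score, f"score basis: {basis}"

score, basis = complexity_score({"vocab_density": 0.35, "domain_coverage": 0.8,
                                 "logic_depth": 0.5})
```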
S1.1.4 obtaining the complexity score of the scientific literature text data according to the evaluation basis description.
Considering the above analysis results comprehensively and according to preset scoring criteria, document complexity is converted into a specific score, generally on a 1-10 scale or a low-medium-high grading. The complexity score not only reflects the overall complexity of the document but also contains refined evaluation results for each dimension, providing accurate guidance for the subsequent diversion processing. The system divides documents into simple documents and complex documents according to the scoring result: simple documents generally have a low complexity score, straightforward content, and a clear structure, and are suitable for lightweight processing; complex documents have a high complexity score and may contain deep professional content, complex logical structures, or cross-domain fused knowledge, so deeper analysis is needed. This complexity-based classification provides a scientific basis for optimizing the distribution of system resources, allowing the system to concentrate more computing resources and analysis capacity on complex documents that genuinely need advanced processing while rapidly processing relatively simple documents, improving overall processing efficiency and analysis quality.
Through this implementation, the system completes the complexity evaluation of the scientific and technical literature text data and lays a foundation for subsequent strategic diversion processing. Compared with traditional methods based on rules or simple features, the complexity evaluation method based on a large language model has higher accuracy and adaptability and can effectively process scientific and technological documents of various types and fields. Meanwhile, by generating detailed evaluation basis descriptions, the system enhances the interpretability and credibility of the complexity evaluation, so that a user can understand and verify the evaluation result. High-quality complexity evaluation provides a reliable guarantee for subsequent optimized resource allocation and processing strategy selection, and effectively improves the efficiency and quality of the whole literature analysis system.
S1.2, classifying the scientific literature text data according to the complexity score to obtain a first type text and a second type text, wherein the first type text is used for indicating that the complexity score is lower than a preset score threshold, and the second type text is used for indicating that the complexity score is higher than the score threshold;
the system sets a preset scoring threshold value (the threshold value can be dynamically adjusted according to the processing capacity of the system and the actual application requirement), classifies documents with complexity scores lower than the threshold value as a first type of text, the documents are generally clear in semantics, clear in structure and moderate in expertise and suitable for lightweight processing, and classifies documents with complexity scores higher than the threshold value as a second type of text, wherein the documents possibly contain deep professional contents, complex structural organization or cross-domain knowledge fusion, and deeper analysis is needed. In a practical implementation, the system considers the multidimensional nature of the complexity score, possibly using a multi-threshold classification method, i.e. the thresholds are set according to the complexity of different dimensions, respectively, and documents are classified as the first type of text only when they are below the corresponding thresholds in all key dimensions. The careful classification strategy ensures effective identification of complex documents and avoids erroneous judgment possibly caused by simple classification. Meanwhile, the system records the complexity details and the classification basis of each document, provides references for subsequent processing, and is also convenient for continuous optimization of classification strategies.
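The multi-threshold classification rule described above (first-type only when every key dimension falls below its threshold) can be sketched as follows; the dimension names and threshold values are illustrative assumptions.

```python
# Hypothetical per-dimension thresholds; the system adjusts these
# dynamically according to processing capacity and application needs.
THRESHOLDS = {"term_density": 0.5, "domain_coverage": 0.5, "logic_depth": 0.5}

def classify(doc_scores: dict) -> str:
    """A document takes the lightweight path (first type) only when it is
    below the threshold in every key dimension; otherwise it is second type."""
    if all(doc_scores.get(dim, 0.0) < thr for dim, thr in THRESHOLDS.items()):
        return "first_type"
    return "second_type"
```

Exceeding even a single dimension (e.g., high term density in an otherwise simple text) routes the document to deep analysis, which is what avoids the misjudgment that a single aggregate threshold could cause.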
S1.3, performing lightweight analysis on the first type text and performing deep analysis on the second type text to obtain a preliminary processing result;
This is the core implementation link of the intelligent data filtering mechanism, realizing optimized allocation of computing resources and improved processing efficiency through a differentiated processing strategy. For the first type of text (low complexity), the system employs lightweight processing methods such as basic text cleansing, simple semantic analysis, and conventional feature extraction. These methods have a small computational load and high processing speed and are suitable for batch processing of relatively simple literature texts. Specifically, the system may adopt a simplified language model, a shallow neural network, or an efficient statistical analysis method to rapidly extract the core content and topic characteristics of the text. For the second type of text (high complexity), the system initiates a deep analysis mode, applying more powerful analysis tools and algorithms such as full-parameter large language models, deep semantic parsing, and complex logical structure analysis. The deep analysis mode invests more computing resources and processing time; it can understand the connotation and structure of complex documents more accurately and capture fine-grained semantic features and logical relations. The system may also apply special analysis strategies for different types of complex documents (e.g., interdisciplinary documents, highly specialized documents, structurally complex documents), further improving the pertinence and efficiency of processing. Through this differentiated processing, the system significantly improves overall processing efficiency and achieves a reasonable distribution of computing resources while guaranteeing analysis quality.
And S1.4, obtaining target text data according to the preliminary processing result.
This step completes the preprocessing of the scientific and technical literature text data. The results of the lightweight analysis and the deep analysis are integrated and standardized to ensure that subsequent analysis receives input data with uniform format and consistent quality. First, the system performs a quality inspection on the processing results of the two types of texts, evaluating the completeness of key-information extraction, the accuracy of semantic understanding, and the soundness of structural analysis, ensuring that the processing results meet preset quality standards. If the processing results of some documents are found to be unsatisfactory, the system marks them as requiring reprocessing, possibly adjusting the processing strategy or parameters, or reassigning text originally classified as the first type to the second type for further analysis. Second, the system standardizes the processing results that meet the quality standards, including format unification, feature normalization, and metadata supplementation, to ensure consistency of data form. Finally, the system organizes the standardized data into a target text data set containing text content, structural information, semantic features, complexity attributes, and other information, providing high-quality input for subsequent semantic encoding and topic clustering. Through this series of processing, the system not only completes the basic preprocessing of document text data but also improves overall processing efficiency through strategic diversion and differentiated processing while ensuring quality consistency of the results.
The LLM-based intelligent data filtering mechanism successfully realizes efficient preprocessing of the scientific literature text data. Its core advantages are: first, accurate assessment of document complexity by the large language model provides a scientific decision basis for subsequent processing; second, classifying documents into different complexity categories enables strategic allocation of processing resources; third, differentiated processing strategies for documents of different categories significantly improve processing efficiency while ensuring analysis quality; and fourth, strict quality control and standardized processing ensure the consistency and reliability of the preprocessing results. This innovative preprocessing method enables the system to cope more effectively with large-scale heterogeneous scientific literature text data and lays a solid foundation for subsequent deep analysis.
S2, coding the target text data by utilizing a pre-trained semantic analysis model to obtain a document semantic vector;
The system adopts a pre-trained Chinese BERT model as a semantic encoder, and the model has stronger semantic understanding capability through pre-training on massive Chinese corpora. In view of the professional features of the scientific literature, the system performs additional fine tuning on the BERT model, so that the technical terms and expression modes in the scientific literature can be better understood. For longer document texts, the system adopts a segmentation processing strategy to divide the texts into fragments with proper lengths, and the fragments are integrated after semantic representations are respectively acquired, so that the problem of input length limitation of the BERT model when processing long texts is skillfully solved. In this way, the system can generate vector representations for each document that express rich semantic information, which can capture deep semantic and contextual relationships of the document text, providing a high quality feature representation for subsequent topic clusters. Compared with the traditional Word bag model or Word2Vec and other methods, the BERT semantic vector can capture the meaning of words in specific contexts more accurately and understand the semantic essence of the document content, so that semantic understanding challenges caused by document content isomerization and specialization are effectively solved.
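The segment-then-integrate strategy for long documents can be sketched as follows; the overlap size is an assumption, and the encoder is stubbed (standing in for the fine-tuned Chinese BERT model) so the chunking and pooling logic can run standalone.

```python
def chunk_tokens(tokens, max_len=510, stride=128):
    """Split a token list into overlapping windows that fit BERT's
    512-token input limit (510 content tokens + [CLS]/[SEP])."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride  # overlap preserves cross-boundary context
    return chunks

def document_vector(tokens, embed):
    """Mean-pool the per-segment vectors into one document vector.
    `embed` maps a token chunk to a fixed-length vector (here a stub)."""
    segs = [embed(c) for c in chunk_tokens(tokens)]
    dim = len(segs[0])
    return [sum(v[i] for v in segs) / len(segs) for i in range(dim)]
```

Mean pooling is only one plausible integration scheme; length-weighted pooling or attention over segments are common alternatives.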
S3, subject clustering is carried out on the target text data according to the document semantic vector, and a plurality of subjects and subject representation of each subject are extracted;
The system performs topic clustering on the target text data according to the document semantic vectors and extracts a plurality of topics and their representations. This step is the core link of the invention and adopts a topic mining method based on BERTopic. First, the system uses the UMAP (Uniform Manifold Approximation and Projection) algorithm to reduce the dimensionality of the high-dimensional semantic vectors, mapping BERT vectors of hundreds or even thousands of dimensions to a low-dimensional space (typically 2-10 dimensions) while preserving the semantic relationships between documents. The UMAP algorithm is based on Riemannian geometry and algebraic topology; it preserves both the local and global structures of the data to the greatest extent during dimensionality reduction, and compared with traditional algorithms such as t-SNE it has higher computational efficiency and better preserves the semantic distance relationships between data points. In the reduced vector space, the system applies the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to density-cluster the documents. HDBSCAN is a hierarchical extension of the DBSCAN algorithm that can identify clusters of arbitrary shape and effectively handle noisy data, making it particularly suitable for the irregular topic distributions found in literature texts. Through clustering, the system groups semantically similar documents to form a preliminary topic structure. The system then calculates characteristic values for the words in each cluster, mainly using the c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) method: each cluster is treated as a single virtual document, and the importance of each word in the cluster relative to other clusters is computed.
Finally, the system optimizes the keyword set using the Maximal Marginal Relevance (MMR) algorithm, reducing redundancy among keywords while ensuring high relevance between the keywords and the topic, so as to form topic representations that accurately characterize the topic content and cover its different aspects.
The core parameter setting and optimization strategy of BERTopic model is as follows:
UMAP dimensionality reduction parameters: n_neighbors is set to 15 to control the size of the local neighborhood; min_dist is set to 0.1 to balance local and global structure; n_components is set to 5, determining the dimensionality of the reduced vectors. Parameters are optimized over different document sets by grid search to maximize the balance of clustering quality and computational efficiency.
HDBSCAN clustering parameters: min_cluster_size is set to 10, ensuring that each topic contains enough samples; min_samples is set to 5, controlling cluster stability; cluster_selection_epsilon is set to 0.5, optimizing the recognition of small clusters. The system evaluates clustering quality using a combination of the silhouette coefficient and the Davies-Bouldin index, enabling dynamic parameter adjustment.
c-TF-IDF optimization: smoothing is applied in the term-frequency calculation to reduce long-text bias, and logarithmic scaling is applied in the IDF calculation to balance the weights of common and rare words; the 20 highest-weighted words are extracted as keywords, and the lambda parameter of the MMR algorithm is set to 0.6, balancing relevance and diversity.
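The c-TF-IDF weighting and MMR keyword selection above can be sketched in pure Python as follows; this is an illustrative simplification (word counts only, log scaling of the inverse frequency), not BERTopic's exact implementation.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: {topic_id: [word, ...]} where each cluster's documents
    have been concatenated into one virtual document (bag of words).
    Weight = tf(word, cluster) * log(1 + avg_words_per_cluster / total_freq)."""
    tf = {c: Counter(words) for c, words in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(len(w) for w in clusters.values()) / len(clusters)
    return {c: {t: n * math.log(1 + avg_words / total[t])
                for t, n in counts.items()}
            for c, counts in tf.items()}

def mmr(candidates, sim_to_topic, sim_between, top_n=5, lam=0.6):
    """Greedy Maximal Marginal Relevance: trade topic relevance against
    redundancy with already-selected keywords (lambda = 0.6 as above)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < top_n:
        best = max(pool, key=lambda w: lam * sim_to_topic[w]
                   - (1 - lam) * max((sim_between[w][s] for s in selected),
                                     default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

In the MMR step, a highly relevant word that is nearly synonymous with one already chosen loses out to a slightly less relevant but more distinct word, which is how redundancy is suppressed.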
S4, hierarchical clustering is carried out on the topics according to the topic representation to form a topic hierarchical structure;
The system constructs a similarity matrix between topics based on the semantic similarity of topic keywords and the distribution of documents across different topics. The system calculates the cosine similarity or Jaccard similarity coefficient between topic keyword sets while also considering the proportion of documents shared between different topics, combining them by weighting into a comprehensive topic similarity matrix. The system then applies a hierarchical clustering algorithm (e.g., Ward's method) to cluster the topics hierarchically. At each merge step, Ward's method selects the two clusters whose combination yields the smallest increase in within-group sum of squares; it tends to produce clusters of similar size and is particularly suitable for organizing scientific literature topics. The clustering result can be expressed as a tree structure (i.e., a dendrogram), and the system cuts a multi-level topic structure out of the dendrogram by setting an appropriate threshold. Based on the hierarchical clustering results, the system defines a multi-level classification system of scientific literature topics, generally comprising primary topics (macroscopic directions such as "scientific work", "scientific enterprises", and "scientific finance"), secondary topics (specific fields such as "project management" and "business incubators"), and tertiary topics (such as specific literature measures). Finally, the system optimizes and adjusts the hierarchical topic system in combination with domain-expert knowledge to ensure the scientific soundness and practicality of the topic classification, which may include merging semantically similar topics, splitting topics whose content is too broad, or reassigning the hierarchical attribution of certain topics. In this way, the system finally builds a multi-level topic structure that captures macroscopic directions while reaching down to specific measures.
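The composite topic similarity matrix can be sketched as follows; here Jaccard similarity stands in for both the keyword-overlap and shared-document components, and the 0.7/0.3 weighting is an assumption for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def topic_similarity(keywords, doc_sets, alpha=0.7):
    """keywords / doc_sets: {topic_id: iterable}. Returns a symmetric
    {(i, j): similarity} matrix, a weighted mix of keyword similarity
    and document-overlap similarity, usable as hierarchical-clustering input."""
    topics = sorted(keywords)
    sim = {}
    for i in topics:
        for j in topics:
            s = (alpha * jaccard(keywords[i], keywords[j])
                 + (1 - alpha) * jaccard(doc_sets[i], doc_sets[j]))
            sim[(i, j)] = round(s, 4)
    return sim
```

Converting this matrix to distances (1 - similarity) makes it directly usable with Ward-style agglomerative clustering routines.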
S5, dividing the target text data into a plurality of time windows, analyzing the change characteristics of the theme in each time window, and constructing a theme evolution time sequence according to the change characteristics, wherein the change characteristics comprise the occurrence frequency and the intensity change;
The system divides the target text data into a plurality of time windows according to release time, analyzes the change characteristics of topics within each time window, and constructs a topic evolution time series. First, the system divides the text data into a series of consecutive time windows according to document release dates; the window division can be set flexibly according to research needs, for example by year, by quarter, or by five-year planning period. Within each time window, the system independently applies the BERTopic model for topic mining, performing the full series of processing steps such as document embedding, dimensionality reduction, clustering, and topic representation. This window-by-window independent modeling captures the literature topic structure unique to each period without interference from data of other periods. The system then calculates the occurrence frequency and intensity of each topic in the different time windows. Topic frequency is measured as the ratio of the number of documents belonging to the topic to the total number of documents in the window, and topic intensity is quantified through a weighted combination of factors such as the topic's document count, document length, and the level of the issuing institution. These frequency and intensity data form a topic evolution time series reflecting how interest in different document topics rises and falls over time. In addition, the system introduces Dynamic Topic Model (DTM) techniques to analyze how the semantics of topic content change over time.
DTM assumes that the same topic in adjacent periods has some continuity, but allows the word distribution of the topic to change gradually, in this way, the system can track the evolution trend of keywords in the topic, identify newly added words and attenuated words, and reveal subtle changes in literature language and points of interest. Finally, the system tracks the continuation, differentiation and fusion processes of the topics by calculating the similarity matrix of the topics between adjacent time windows, and builds a complete topic evolution map.
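The per-window frequency computation described above can be sketched as follows (the intensity weighting by document length and institution level is omitted for brevity; input shapes are illustrative assumptions).

```python
def topic_frequency_series(docs, windows):
    """docs: [(year, topic_id)]; windows: [(start_year, end_year)].
    Frequency = documents on the topic / all documents in the window,
    yielding one time series per topic."""
    series = {}
    # Register every topic that appears in any window with a zero series.
    for lo, hi in windows:
        for t in {t for y, t in docs if lo <= y <= hi}:
            series.setdefault(t, [0.0] * len(windows))
    # Fill each window's frequency share.
    for w, (lo, hi) in enumerate(windows):
        in_win = [t for y, t in docs if lo <= y <= hi]
        if not in_win:
            continue
        for t in series:
            series[t][w] = in_win.count(t) / len(in_win)
    return series
```

Each topic's list is the raw frequency series that the later ARIMA/LSTM prediction step (S6) would consume.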
S6, predicting the intensity change trend of the theme according to the theme evolution time sequence.
The system first uses the topic intensity time series data obtained from the temporal evolution analysis and applies traditional time series analysis methods for preliminary trend prediction. Common methods include the autoregressive integrated moving average (ARIMA) model and exponential smoothing. The ARIMA model predicts future values by analyzing the autocorrelation, differenced stationarity, and moving average characteristics of the time series and constructing a mathematical model, while exponential smoothing predicts future values by weighted averaging of historical data, with the most recent data weighted most heavily and weights decaying exponentially over time. These methods are particularly suited to capturing linear trends and seasonal fluctuations in topic intensity. However, considering the nonlinear characteristics of technological literature evolution, the system also incorporates deep learning approaches to build more complex predictive models. A long short-term memory network (LSTM), a specialized recurrent neural network, can learn long-term dependencies and is particularly suitable for processing time series data, while a Transformer model can process sequence data in parallel and use an attention mechanism to capture associations between different time points. These deep learning models learn the complex patterns of topic evolution through training on large amounts of historical data, providing more accurate nonlinear predictions. In addition to predicting the intensity changes of existing topics, the system also identifies emerging and fading topics through topic evolution profile analysis, predicting the future focus of literature attention.
Finally, the system integrates external factors (such as technological development dynamics, international literature environment changes, major social events and the like) to correct and adjust the prediction result, so that the accuracy and reliability of prediction are improved. Through multidimensional analysis and correction, the system can finally provide comprehensive and accurate trend prediction of scientific and technological literature and prospective reference information for decision makers.
In topic trend prediction, the system adopts the following deep learning model architecture and training strategy:
The LSTM network structure comprises 2 layers of bidirectional LSTM, 128 hidden units in each layer, the dropout rate is 0.3, and the input layer is connected with Batch Normalization to accelerate training. The time window length is set to 8, and the topic intensity change of the future 4 time windows is predicted.
The Transformer structure adopts a 4-layer Transformer encoder with 8 attention heads and a feed-forward network dimension of 512; positional encoding uses sine and cosine functions, giving the model a global receptive field for capturing long-term dependencies.
Model training and evaluation: the data were divided into training and test sets at an 80%/20% ratio, an early-stopping strategy was employed to avoid overfitting, and the evaluation criteria include root mean square error (RMSE) and mean absolute percentage error (MAPE). The system implements model ensembling, synthesizing the prediction results of ARIMA and the deep learning models and generating the final prediction by weighted average, which significantly improves prediction accuracy.
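The classical-forecast-plus-ensemble step can be sketched as follows, with simple exponential smoothing standing in for the ARIMA component and a stubbed deep-model forecast; the smoothing constant and ensemble weights are assumptions, not the system's tuned values.

```python
def exp_smooth_forecast(series, alpha=0.5, steps=4):
    """Simple exponential smoothing: the level is a weighted average of
    history with exponentially decaying weights; the h-step forecast is flat."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return [level] * steps

def ensemble(classical, deep, w_classical=0.4):
    """Weighted average of classical and deep-model forecasts, mirroring
    the ARIMA + deep-learning integration described above."""
    return [w_classical * c + (1 - w_classical) * d
            for c, d in zip(classical, deep)]
```

The 4-step horizon matches the "predict the topic intensity of the next 4 time windows" setting given for the LSTM above.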
S7, constructing a knowledge graph containing literature entities, influence paths and effect quantification indexes based on the topic intensity change trend;
The system constructs a domain knowledge graph of scientific and technological literature influence analysis as a knowledge base of follow-up reasoning analysis. The knowledge graph consists of three main components, namely a literature entity network, a scientific and technological innovation ecological network and a literature-influence causal link network. The document entity network includes entities such as document documents, issuing institutions, implementation objects, key stakeholders and the like, and relationships such as "issuing", "supervising", "implementation" and the like among the entities reflect basic information and management structures of documents. The technological innovation ecological network comprises entities such as scientific research institutions, enterprise bodies, innovation projects, technological fields and the like, and relationships among the entities such as collaboration, competition, resource flow and the like reflect the technological innovation environment of the literature. The literature-influence causal link network records causal relation between the historical literature and the observed influence thereof, including information such as paths, time delays, influence intensities and the like of direct influence and indirect influence, and provides experience basis for literature influence reasoning. The system constructs the knowledge graph by integrating multi-source data, wherein the data sources comprise literature texts, scientific and technological project databases, scientific and research output statistics, enterprise innovation investigation data and the like. In order to ensure the accuracy and the integrity of the knowledge graph, the system adopts a semi-automatic knowledge extraction method and combines expert auditing for verification. 
In the construction process of the knowledge graph, special attention is paid to capturing the association modes between the literature subject and the specific influence, and the association modes become important basis for subsequent reasoning analysis.
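A toy triple store illustrating the document-entity and literature-influence networks might look like the following; the entity and relation names are hypothetical, and this in-memory structure only sketches what the system stores in a real graph database.

```python
from collections import defaultdict

class TripleStore:
    """Minimal (head, relation, tail) store with relation-filtered lookup."""
    def __init__(self):
        self.out = defaultdict(list)   # head -> [(relation, tail)]

    def add(self, head, relation, tail):
        self.out[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Tails reachable from `head`, optionally restricted to one relation."""
        return [t for r, t in self.out[head]
                if relation is None or r == relation]

# Illustrative fragment of a literature influence network.
kg = TripleStore()
kg.add("MinistryOfST", "issues", "DocumentA")
kg.add("DocumentA", "implements", "TechEnterprises")
kg.add("DocumentA", "influences", "R&D_Investment")
```

Relation-filtered lookups (`neighbors(h, "influences")`) correspond to the relation path queries used later in the retrieval step.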
S8, establishing a multi-level analysis framework by utilizing the knowledge graph;
The framework comprises four main layers: a direct impact layer, a system response layer, an innovation result layer, and a socioeconomic layer. The direct impact layer analyzes a document's direct effects on specific objects, such as funding support, tax incentives, and regulatory requirements, and the direct changes these effects produce in the objects' behavior. This layer focuses on effects observable immediately after document implementation, typically measured directly through government funding data, tax deduction statistics, and the like. The system response layer analyzes the responses and interactions of the elements of the innovation system to the document, including resource reallocation, organizational structure adjustment, and changes in behavior patterns. This layer focuses on the internal dynamic adjustment process of the innovation system, reflecting the document's conduction mechanism. The innovation result layer analyzes the document's ultimate influence on technological innovation outcomes, covering quantitative indicators such as research and development investment, patent output, technological breakthroughs, and product innovation, reflecting the document's actual effect in promoting innovation activities. The socioeconomic layer analyzes the document's influence at the broader socioeconomic level, including industrial structure, employment change, economic growth, and sustainable development, reflecting the document's long-term comprehensive benefits. Within this multi-level framework, the system defines clear evaluation indicators and metrics for each layer, forming a structured literature impact evaluation system. Through the hierarchical analysis framework, the system can evaluate a document's influence paths and effects from different dimensions and depths, effectively addressing the complexity and multidimensional challenges of document impact evaluation.
S9, establishing a retrieval and reasoning system according to the multi-level analysis framework, wherein the retrieval and reasoning system comprises a reasoning chain from literature to influence;
The system implements an innovative Reason-while-Retrieval (RwR) framework that organically combines the retrieval and reasoning processes, so that the reasoning result rests on logical deduction while retaining a factual basis. First, the system decomposes a complex document impact analysis problem into a series of sub-problems, forming a problem tree with logical dependencies between the sub-problems. For example, analyzing the impact of an innovation incentive document can be broken down into sub-problems concerning research investment, talent attraction, technical collaboration, and so on. The system then builds a progressive inference chain based on the large language model, with each reasoning step corresponding to a sub-problem. The large language model generates inference hypotheses or intermediate conclusions based on the current sub-problem and the available information. This progressive reasoning supports the analysis of complex causal relationships and is suited to the multi-path conduction of literature influence. At each reasoning step, the system retrieves the most relevant knowledge from the knowledge graph based on the current sub-problem and reasoning state. Retrieval uses a hybrid strategy combining semantic similarity search, relation path querying, and reasoning-relevance scoring to ensure the retrieved knowledge is both relevant and useful. The system integrates the retrieved knowledge with the large language model's reasoning process, verifying or adjusting the inference hypotheses. If the retrieved evidence supports a hypothesis, its credibility is strengthened; if a conflict exists, re-examination or adjustment of the hypothesis is triggered. Throughout the reasoning process, the system explicitly tracks and quantifies the sources of uncertainty in the inference chain, including missing knowledge, conflicting evidence, and reasoning leaps, providing a confidence assessment for the final conclusion.
By constructing this interactive inference chain, the system can carry out well-grounded and traceable deep analysis of the influence of complex documents, greatly improving the reliability and interpretability of the reasoning results.
The technical implementation details of RwR framework are as follows:
Knowledge graph construction: a Neo4j graph database is used to store the literature influence knowledge network, comprising about 50,000 entity nodes and 200,000 relationship edges. Entity extraction combines Named Entity Recognition (NER) with distant supervision, and relation extraction uses a BERT-based relation classification model whose F1 score reaches 0.83.
The progressive inference chain is realized by constructing a reasoning framework based on a 16-billion-parameter large language model, with inference templates designed through few-shot prompt engineering, implementing 9 basic reasoning modes including conditional reasoning, comparative reasoning, and counterfactual reasoning. The temperature parameter is set to 0.3 during reasoning to maintain determinism and consistency.
Retrieval-reasoning fusion mechanism: a bidirectional attention mechanism is designed so that the retrieval results and the reasoning process mutually reinforce each other. Retrieval uses a hybrid strategy combining BM25 and vector similarity, attending to both exact matching and semantic relevance. Reasoning fusion adopts a confidence-weighting method, automatically adjusting the reasoning weights according to the degree of evidence support.
Uncertainty quantification: a confidence interval is calculated for each reasoning step based on a Bayesian framework, Monte Carlo sampling is used to analyze the robustness of predictions, and knowledge gaps are quantified via information entropy, providing the user with a transparent reliability assessment.
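The hybrid retrieval scoring in the fusion mechanism above (BM25 combined with vector similarity) can be sketched as follows; the score normalization and the 0.5 mixing weight are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(bm25, bm25_max, q_vec, d_vec, w_lex=0.5):
    """Lexical BM25 score normalized to [0, 1] against the best score in
    the candidate set, mixed with semantic cosine similarity."""
    lex = bm25 / bm25_max if bm25_max else 0.0
    return w_lex * lex + (1 - w_lex) * cosine(q_vec, d_vec)
```

A candidate that matches the query terms only moderately (lexical 0.5) but is semantically identical to it (cosine 1.0) scores 0.75, illustrating how the mix rewards both exact matching and semantic relevance.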
Wherein, the inference chain is established by the following steps:
S9.1, decomposing an influence analysis task into a plurality of subtasks with logic dependency relations based on the multi-level analysis framework;
Task decomposition is the key first step in processing a complex reasoning problem. The system adopts a structured task decomposition method, dividing the overall impact analysis problem into a series of subtasks with clear boundaries and dependencies. First, the system determines the major dimensions of impact analysis based on the previously constructed four-layer impact analysis framework (direct impact layer, system response layer, innovation result layer, and socioeconomic layer). Within each dimension, the system further refines the problem into specific subtasks. For example, analyzing the impact of an innovation document on enterprise research and development investment can be broken down into subtasks such as: what research and development resource support the document provides to the enterprise, what the enterprise's response mechanism to this support is, how research and development resources translate into growth in research and development investment, and what the relationship is between investment growth and innovation output. These subtasks form a Directed Acyclic Graph (DAG), in which each node represents a subtask and each edge represents a dependency between subtasks. Through this task decomposition, the system converts the complex impact analysis problem into a series of manageable sub-problems, each with explicit inputs, outputs, and evaluation criteria. Task decomposition not only reduces the complexity of a single reasoning step but also defines the reasoning logic path, providing a clear guiding framework for the subsequent reasoning process.
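The subtask DAG and a valid execution order can be sketched with Kahn's topological sort; the subtask names below are illustrative, mirroring the research-and-development example above.

```python
from collections import deque

def topo_order(deps):
    """deps: {task: [prerequisite tasks]}. Returns an order in which every
    subtask runs after all its prerequisites; raises if the graph has a cycle
    (i.e., is not a DAG)."""
    indeg = {t: len(p) for t, p in deps.items()}
    children = {t: [] for t in deps}
    for t, prereqs in deps.items():
        for p in prereqs:
            children[p].append(t)
    queue = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    if len(order) != len(deps):
        raise ValueError("dependency cycle: not a DAG")
    return order

# Hypothetical subtask chain for the R&D-investment example.
deps = {
    "resource_support": [],
    "enterprise_response": ["resource_support"],
    "rd_investment_growth": ["enterprise_response"],
    "innovation_output": ["rd_investment_growth"],
}
```

The resulting order is exactly the sequence in which step S9.2 feeds subtasks to the language model, with earlier results available as context for later ones.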
S9.2, processing the subtasks by using the large language model to obtain a progressive reasoning chain;
The system sequentially inputs the decomposed subtasks into a pre-trained large language model, using the model's reasoning capability to generate a solution for each subtask. The process adopts a context accumulation strategy: each subsequent subtask takes the results of the preceding subtasks as context input, ensuring the coherence and consistency of the reasoning process. To improve reasoning quality, the system applies "reasoning prompt engineering", designing a dedicated prompt template for each type of subtask to guide the model toward normalized, structured reasoning results. In addition, the system implements a "multi-step thinking" mechanism, requiring the model to first analyze the problem and make its approach explicit, then deduce step by step, and finally summarize its conclusions; this explicit thinking process helps reduce logical jumps and errors in the inference. To handle uncertainty, the system employs "confidence labeling", requiring the model to label a confidence level for each inference step and to state the assumptions and constraints it relies on. Through these techniques, the system generates a progressive inference chain consisting of multiple inference nodes, each containing the subtask's analysis process, inference results, and confidence assessment. This chain represents a preliminary reasoning path from literature to impact, but it is based only on the general knowledge of the language model and must be validated and enhanced through fact retrieval.
S9.3, searching associated data related to each node of the progressive inference chain in the knowledge graph;
This step provides a factual basis for the language model's reasoning results by linking the reasoning to actual data through the retrieval capability of the knowledge graph. The system constructs a multi-level search query for each node in the inference chain. First, it extracts key entities and relationships from the inference node, such as document names, implementing bodies, and impact types, as basic search conditions. The system then expands the search scope based on the content and context of the inference, covering similar documents, related fields, and comparable impact cases. During retrieval, the system adopts a hybrid strategy that combines exact matching with semantic similarity search, ensuring both the comprehensiveness and the relevance of the results. The retrieved results are then scored for relevance and deduplicated, preferentially retaining the data most relevant to the current inference node and of highest evidential value. Retrieval is not limited to direct relations in the knowledge graph; multi-hop queries are also supported, allowing indirect but meaningful associated evidence to be found. For example, if the reasoning involves a document's effect on enterprise development, the system searches not only direct document-enterprise relationships but also multi-hop paths such as document-fund-enterprise or document-talent-enterprise, capturing complex impact mechanisms. Through this deep search, the system collects relevant factual data for each inference node, providing an objective basis for inference verification.
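The multi-hop path search described above can be sketched as a bounded depth-first enumeration over a small relation graph. The entity and relation names below are purely illustrative, and the hop limit is an assumed parameter:

```python
def multi_hop_paths(graph, start, goal, max_hops=3):
    """Enumerate relation paths of at most max_hops edges linking two entities.

    graph maps an entity to a list of (relation, neighbor) edges; indirect
    evidence such as document -> fund -> enterprise surfaces as a 2-hop path.
    """
    paths = []

    def walk(node, path, visited):
        if node == goal and path:
            paths.append(list(path))
            return
        if len(path) >= max_hops:
            return
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                path.append((relation, neighbor))
                walk(neighbor, path, visited)
                path.pop()
                visited.discard(neighbor)

    walk(start, [], {start})
    return paths

# Hypothetical mini knowledge graph (names are illustrative only)
kg = {
    "doc_A":    [("funds", "fund_X"), ("trains", "talent_Y")],
    "fund_X":   [("invests_in", "enterprise_Z")],
    "talent_Y": [("joins", "enterprise_Z")],
}
```

Here `multi_hop_paths(kg, "doc_A", "enterprise_Z")` finds both the document-fund-enterprise and the document-talent-enterprise paths mentioned in the text.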
And S9.4, verifying and correcting the inference results by combining the associated data with the reasoning process of the large language model, and generating the final inference chain.
This step is the core link of the RwR framework, realizing the deep fusion of reasoning and retrieval. First, the system matches the retrieved association data to the corresponding inference node and analyzes the consistency between the data and the inference. For each inference node, the system calculates an "evidence support degree" that quantifies how strongly the factual data supports the inference conclusion. Based on this evaluation, the system classifies the reasoning results: for reasoning fully supported by evidence, the original conclusion is retained and specific factual grounds are attached to improve its credibility; for reasoning only partially supported, the conclusion is corrected or qualified so that it better conforms to the constraints of the factual data; and for reasoning with insufficient or contradictory evidence, the result is marked as "to be verified" and the system attempts to resolve the contradiction through additional retrieval or by adjusting the reasoning path. When handling evidence conflicts, the system applies an evidence-weighting strategy that considers the authority, timeliness, and direct relevance of each data source. Through this verification and correction process, the system generates a complete inference chain grounded in both logical reasoning and factual support, in which every inference step carries an explicit source label: reasoning based on the language model, direct evidence from factual data, or a comprehensive judgment combining the two. This transparent source labeling makes the reasoning process traceable and verifiable, greatly enhancing the reliability and interpretability of the reasoning results.
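The three-way classification by evidence support degree can be sketched as follows. The weighting scheme and the two thresholds are illustrative assumptions, not values disclosed in the invention:

```python
def classify_inference(evidence_items, support_threshold=0.7, partial_threshold=0.3):
    """Score evidence support for one inference node and decide how to treat it.

    Each evidence item is (weight, agrees), where weight in (0, 1] reflects
    the source's authority/timeliness/relevance and agrees says whether the
    evidence supports the conclusion. Returns (decision, support_degree).
    """
    total = sum(w for w, _ in evidence_items)
    if total == 0:
        return "to_be_verified", 0.0   # no evidence found at all
    support = sum(w for w, agrees in evidence_items if agrees) / total
    if support >= support_threshold:
        return "retain_with_evidence", support
    if support >= partial_threshold:
        return "qualify_conclusion", support
    return "to_be_verified", support
```

A node backed by two strong agreeing sources and one weak dissenting one would thus be retained with its factual grounds attached, while a 50/50 split triggers correction or qualification of the conclusion.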
The steps above build a knowledge reasoning system that tightly combines reasoning with retrieval. Compared with pure language-model reasoning or simple information retrieval, this system has several notable advantages. First, through task decomposition and progressive reasoning, it handles complex impact analysis problems effectively, making the reasoning process clearer and more controllable. Second, by combining the reasoning capability of the large language model with the factual base of the knowledge graph, it retains the flexibility and generalization of the language model while avoiding its "hallucination" problem, improving the accuracy and reliability of reasoning. Furthermore, the reasoning process is transparent and interpretable: every conclusion has a clear reasoning path and factual basis, making it easy for users to understand and verify. Finally, the system supports interactive reasoning and can dynamically adjust the reasoning path in response to new information or questions, giving it strong adaptability and extensibility. This RwR-based retrieval reasoning system provides powerful intelligent support for analyzing the impact of scientific and technological literature: it can extract key information from massive documents and data, construct reliable causal reasoning chains, effectively address the "why" and "how" questions, and provide deep insight and grounds for science and technology decision-making.
S10, adjusting the effect quantification indices, and performing multi-scenario prediction by combining the adjusted indices with the retrieval reasoning system to obtain development trends under different conditions;
Based on the reasoning results of the RwR framework, the system further conducts scenario simulation and sensitivity analysis of literature impact. First, by adjusting key parameters of a document (e.g., fund size, scale of tax preferences, admission thresholds), the system simulates the potential impact differences of different document designs. Such parameter-tuning simulation helps decision makers understand how subtle changes in document design affect the final effect, providing a quantitative reference for document optimization. Second, the system evaluates the robustness and adaptability of literature impact under different external-environment assumptions (e.g., economic growth rate, international competitive situation, stage of technological development). By varying external condition parameters, the system can identify how sensitive the literature's effect is to environmental factors, helping to design more flexible and adaptable schemes. In addition, the system simulates how literature impact evolves over time, covering short-term effects, medium-term adjustments, and long-term effects, revealing the temporal characteristics of the impact. This time-dynamic simulation pays particular attention to the persistence and decay of literature effects, helping to design rational implementation cycles and update mechanisms. Finally, the system performs factor sensitivity analysis, identifying the key factors and conditions to which the literature's impact is most sensitive, and provides targeted suggestions for optimization. Simulation and analysis results are presented visually, including impact path diagrams, sensitivity heat maps, and scenario comparison charts, so that decision makers can intuitively understand the impact characteristics and optimization space of a given scheme.
Through this multi-scenario predictive analysis, the system provides rich decision-support information for document makers, helping to design more efficient and better-targeted science and technology documents.
As shown in fig. 2, S2 specifically includes:
s2.1, carrying out segmentation processing on the target text data to obtain text fragments with specified lengths;
This addresses the limited ability of pre-trained language models to handle long text. Pre-trained models such as BERT generally have an input length limit (typically 512 tokens), while scientific literature is often long and cannot be fed to the model directly. To address this, the system employs an intelligent segmentation strategy that divides long text into segments of appropriate length. Segmentation takes semantic integrity into account, preferentially cutting at natural paragraph or sentence boundaries so that complete semantic units are not split apart. For particularly long paragraphs, the system uses a sliding-window technique with an appropriate overlap region (typically 15-20% of the text length) to preserve continuity of context. For text containing special structures (e.g., tables, charts, formulas), the system applies specialized processing rules that either preserve the structural characteristics or transform them into a form the model can understand. Through this intelligent segmentation, the original long text is converted into a series of text fragments of moderate length and complete semantics, providing suitable input units for subsequent semantic encoding.
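A minimal sketch of boundary-aware segmentation with tail overlap is given below. It simplifies token counting to character counting and assumes the text has already been split into sentences; the length and overlap parameters follow the typical values quoted above:

```python
def segment_text(sentences, max_len=512, overlap_ratio=0.15):
    """Split a sentence list into segments of at most max_len characters.

    Cuts only at sentence boundaries; when a segment closes, its trailing
    sentences (up to ~overlap_ratio of max_len) are repeated at the start of
    the next segment to preserve context continuity.
    """
    segments, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + len(sent) > max_len:
            segments.append(current)
            budget = int(max_len * overlap_ratio)
            carried, carried_len = [], 0
            for s in reversed(current):          # carry tail sentences
                if carried_len + len(s) > budget:
                    break
                carried.insert(0, s)
                carried_len += len(s)
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += len(sent)
    if current:
        segments.append(current)
    return segments
```

With 60-character sentences and `max_len=300`, each new segment starts with the last sentence of the previous one, giving the sliding-window overlap described in the text.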
S2.2, acquiring semantic representations of the text fragments by adopting the semantic analysis model;
Pre-trained large language models (e.g., BERT, RoBERTa) are used as semantic encoders; pre-training on massive text endows them with powerful semantic understanding. To better fit the scientific-literature domain, the system also performs domain-adaptive fine-tuning of the base model, using a large amount of scientific-literature data for additional training to strengthen the model's grasp of technical terms and scientific expressions. During encoding, the system feeds each text segment into the model to obtain a deep semantic representation. Specifically, the system extracts hidden-state vectors from the model's last layers, which contain rich semantic and contextual information. For BERT-style models, the system typically uses the last layer's [CLS] token vector as the semantic representation of the whole segment, or the mean of all token vectors. These high-dimensional vectors (typically 768 or 1024 dimensions) capture deep semantic features of the text, including word sense, syntactic relationships, and topic information at multiple levels, providing a high-quality semantic basis for subsequent document representations.
And S2.3, integrating the semantic representation of the text segment into a document semantic vector, wherein the document semantic vector is used for indicating semantic information and context of the target text data.
This step solves the problem of how to synthesize the local representations of multiple segments into a global representation of the entire document. For this purpose, the system designs a hierarchical semantic integration method. First, for each text segment the system obtains its semantic vector representation, capturing segment-level semantic information. The system then combines these segment vectors into a document-level overall representation using a weighted integration mechanism. The integration considers several factors, including segment position (segments at the beginning and end of a document may carry more summary information), segment content importance (assessed via indicators such as keyword density), and semantic relatedness between segments. The system implements an attention-based integration algorithm that automatically learns the importance weights of different segments, giving key segments a larger influence. In addition, the system retains the mapping between segment-level vectors and the document-level vector, making it easy to trace key information back to its source in subsequent analysis. The resulting document semantic vector is a high-dimensional dense vector that comprehensively represents the semantic content and internal structure of the document, providing an ideal input form for subsequent topic mining.
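The core of the integration step is a weighted average of segment vectors. The sketch below assumes the importance weights (learned by attention or derived from position/keyword density) are supplied by the caller; it is a simplification of the mechanism described, not the invention's exact algorithm:

```python
def integrate_segments(segment_vectors, weights=None):
    """Combine segment-level vectors into one document vector by weighted
    averaging; equal weights are used when none are given."""
    if weights is None:
        weights = [1.0] * len(segment_vectors)
    total = sum(weights)
    dim = len(segment_vectors[0])
    doc_vec = [0.0] * dim
    for vec, w in zip(segment_vectors, weights):
        for i in range(dim):
            doc_vec[i] += w * vec[i] / total
    return doc_vec
```

A segment judged three times as important as another contributes three times as much to each coordinate of the final document vector.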
After semantic representation is completed, the system enters the topic clustering stage, identifying the topic distribution of the scientific literature based on the documents' semantic vectors. Because the semantic vectors produced by BERT encoding are generally high-dimensional (e.g., 768 dimensions), clustering directly in the high-dimensional space may suffer from the "curse of dimensionality", which degrades clustering quality.
As shown in fig. 3, S3 specifically includes:
S3.1, performing dimension reduction processing on the document semantic vector by adopting a preset dimension reduction model to obtain a low-dimension semantic vector;
The UMAP (Uniform Manifold Approximation and Projection) algorithm is used as the primary dimension-reduction tool. UMAP is grounded in Riemannian geometry and algebraic topology and can preserve both the local and the global structure of the data while reducing dimensionality. Compared with traditional methods such as PCA and t-SNE, UMAP offers high computational efficiency and strong scalability, making it particularly suitable for large-scale literature data. In practice, the system typically reduces the high-dimensional vectors to a 5-15 dimensional space, balancing information retention against computational efficiency. The choice of reduction parameters accounts for the data's scale and distribution; the system uses a dynamic parameter-tuning strategy that automatically optimizes the settings according to the actual data distribution, ensuring that the reduced representation retains key information while remaining convenient for subsequent processing.
S3.2, based on the low-dimensional semantic vector, performing density clustering on the target text data by using a preset clustering model to form a plurality of topics;
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is chosen as the primary clustering algorithm. HDBSCAN is a hierarchical extension of DBSCAN with several key advantages: it can identify clusters of arbitrary shape, is robust to noisy data, determines the optimal number of clusters automatically, and handles clusters of differing density well. These characteristics make HDBSCAN particularly suited to topic clustering of scientific literature, whose topic distribution is generally uneven, with both core and fringe topics and significant density differences. During clustering, the system tunes HDBSCAN's key parameters: the minimum cluster size is adjusted dynamically according to the total number of documents, ensuring that the identified topics are sufficiently representative, while the core distance parameter is set adaptively based on the local density distribution of the data, balancing clustering granularity against coverage. Through this optimized density clustering, the system naturally identifies semantically coherent topic groups from the document data, each group representing a relatively independent literature topic.
S3.3, calculating word characteristic values of words in each topic, and extracting keywords according to the word characteristic values, wherein the word characteristic values comprise word frequency and inverse document frequency;
To accurately represent the core content of each topic, the system applies the c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) method. Unlike traditional TF-IDF, c-TF-IDF treats each topic as a "virtual document" and computes how important each word is within that topic relative to the other topics. Specifically, the TF (term frequency) part measures how often a word occurs within a particular topic, reflecting its contribution to that topic, while the IDF (inverse document frequency) part measures how widely the word is spread across all topics, giving higher weight to highly specific words that appear in only a few topics. Through the c-TF-IDF calculation, the system generates a ranked word list for each topic; the high-weight words usually reflect the topic's core content and distinctive features accurately. The system further incorporates the semantic representations of words: by computing the similarity between word vectors and the topic's center vector, it captures keywords that are semantically highly relevant even when their frequency is modest, strengthening the semantic accuracy of the topic representation.
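A minimal c-TF-IDF sketch follows. It uses a common variant of the weighting (normalized in-topic frequency scaled by `log(1 + avg_words_per_topic / total_word_freq)`, as popularized by BERTopic); the exact formula used by the invention is not specified, so this is an assumption:

```python
import math
from collections import Counter

def c_tf_idf(topic_docs):
    """Class-based TF-IDF: merge each topic's documents into one 'virtual
    document' and weight each word by in-topic frequency times cross-topic
    rarity. topic_docs maps topic -> list of document strings."""
    topic_counts = {t: Counter(w for doc in docs for w in doc.split())
                    for t, docs in topic_docs.items()}
    total_freq = Counter()
    for counts in topic_counts.values():
        total_freq.update(counts)
    avg_words = sum(total_freq.values()) / len(topic_counts)
    scores = {}
    for t, counts in topic_counts.items():
        n = sum(counts.values())
        scores[t] = {w: (c / n) * math.log(1 + avg_words / total_freq[w])
                     for w, c in counts.items()}
    return scores
```

Words that are both frequent within a topic and rare elsewhere (here, "quantum" for the quantum topic) rise to the top of the ranked list.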
And S3.4, optimizing the keywords by adopting a preset maximum similarity matching algorithm to obtain the topic representation of the topic.
The keywords are optimized using a preset Maximal Marginal Relevance (MMR) algorithm to obtain the topic representation. MMR is a selection method that balances relevance and diversity: when choosing keywords it considers both a word's relevance to the topic and its diversity with respect to the already-selected word set. Specifically, the system defines an objective function combining relevance and diversity, MMR(w) = λ·sim(w, t) − (1−λ)·max_{s∈S} sim(w, s), where w is the candidate word, t is the topic center, S is the set of selected keywords, sim is a similarity function, and λ is a balance parameter (typically set to 0.5-0.7). The system iteratively selects keywords in descending order of MMR value until a predetermined number (typically 10-20) is reached. This approach ensures that the selected keyword set is not only highly relevant to the topic but also covers different aspects of it, avoiding content duplication or a one-sided representation. Finally, each topic is represented by a set of optimized keywords that together form its semantic feature representation, providing a basis for subsequent topic hierarchy and evolution analysis.
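The MMR objective above can be implemented directly. The sketch below assumes precomputed similarity scores are passed in as dictionaries (in practice these would come from word-vector cosine similarities); the example data are illustrative:

```python
def mmr_select(candidates, sim_to_topic, sim_between, k=10, lam=0.6):
    """Greedy MMR: pick keywords maximizing
    lam * sim(w, topic) - (1 - lam) * max_{s in selected} sim(w, s)."""
    def pair_sim(a, b):
        return sim_between.get((a, b), sim_between.get((b, a), 0.0))

    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda w: lam * sim_to_topic[w]
            - (1 - lam) * max((pair_sim(w, s) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a near-duplicate pair ("innovation"/"innovate"), the diversity penalty steers the second pick toward the distinct word "funding" rather than the redundant synonym.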
Wherein S3 further comprises:
s3.5, establishing a grading index system comprising document grade classification, release time and reference frequency based on the document semantic vector;
This index system considers several importance dimensions of a document: the grade classification reflects its official attributes and authority (e.g., national level, provincial level, local level); the release time reflects its timeliness, as recently released documents may have higher reference value; and the citation frequency reflects its influence and acceptance, since highly cited documents usually indicate important directions or research focuses. The system integrates these metrics into a unified importance score, assigning each document a weight value that plays an important role in the subsequent clustering process.
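One simple way to combine the three dimensions into a single score is a weighted sum with a recency decay and a saturating citation term. The weights, the decay form, and the saturation constant below are illustrative assumptions; the invention does not disclose its exact formula:

```python
def importance_score(level, years_since_publication, citations,
                     weights=(0.4, 0.2, 0.4)):
    """Combine grade level (0-1, e.g. national=1.0), recency, and citation
    count into one importance weight in [0, 1]."""
    recency = 1.0 / (1.0 + years_since_publication)   # newer -> closer to 1
    cited = citations / (citations + 10.0)            # saturating citation score
    w_level, w_recency, w_cited = weights
    return w_level * level + w_recency * recency + w_cited * cited
```

A recent, highly cited national-level document thus scores near 1, while an older, rarely cited local document scores near 0, which later biases clustering toward the important documents.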
S3.6, constructing a cost sensitive decision tree according to the document semantic vector and the scoring index system, wherein the cost sensitive decision tree is used for indicating the weight given by the increase of a sample with the importance reaching a preset importance threshold;
The system treats the mis-clustering of documents as a kind of "cost", with the mis-clustering of important documents costing more. Through the cost-sensitive decision tree, the system increases the weight given to samples whose importance reaches a preset importance threshold, ensuring that important documents are handled more accurately during clustering. The construction of the decision tree takes both semantic features and importance indices into account, learning decision rules that effectively distinguish between different types of documents by minimizing a weighted error rate.
S3.7, extracting classification rules from the cost-sensitive decision tree, wherein the classification rules comprise characteristic thresholds and classification paths;
Each path from the root node to a leaf node of the decision tree represents one classification rule, comprising a series of feature conditions (feature thresholds) and a final classification result. The system extracts these rules to form a rule set, each rule describing the feature pattern and category of one type of document. These rules not only serve subsequent cluster optimization but also provide an interpretable basis for document classification, enabling the system to state clearly why a document is assigned to a particular topic. Rule extraction also includes rule simplification and optimization (merging similar rules, removing redundant conditions, and adjusting threshold ranges), so that the rule set is concise yet broad in coverage.
And S3.8, optimizing initial center point distribution of the subject clustering according to the classification rule, and performing subject clustering on the target text data according to an optimization result.
Conventional density clustering algorithms (e.g., HDBSCAN) may not adequately account for document importance when documents differ in importance. To address this, the system applies the classification rules of the cost-sensitive decision tree to the cluster initialization process. Specifically, the system uses the classification rules to identify the set of important documents and, taking these documents as cores, optimizes the clustering algorithm's initial density estimation and core-point selection. This ensures that important documents become clustering seed points, guiding the formation of cluster boundaries and improving the identification accuracy of the topics they represent. After initialization, the system executes an improved version of the HDBSCAN algorithm, organizing all documents into semantically coherent topic clusters while maintaining accurate clustering of the important documents. Finally, the system obtains an optimized clustering result that accounts for both semantic similarity and document importance.
The above implements the conversion from text data to topic representations. The innovations of this process are threefold: it combines BERT's deep semantic understanding with the flexible grouping capability of density clustering to achieve precise topic identification in scientific literature; it introduces the c-TF-IDF and MMR algorithms to obtain relevant and diverse topic representations; and it applies cost-sensitive learning to improve the clustering accuracy of important documents. Together, these innovations form an efficient and accurate topic-mining framework for scientific literature, laying a solid foundation for the subsequent hierarchy analysis and evolution prediction.
As shown in fig. 4, S4 specifically includes:
s4.1, calculating semantic similarity based on the topic representation, and constructing a similarity matrix between topics according to the topic representation and the semantic similarity;
The system builds a topic vector representation for each topic based on two pieces of key information: the topic's keyword set and the centroid of the semantic vectors of the documents belonging to the topic. For the keyword set, the system takes a weighted average of each keyword's word vector (weighted by its c-TF-IDF value) to produce a keyword vector representation; for the document set, it takes a weighted average of the semantic vectors of all documents under the topic (weighted by document importance scores) to produce a document vector representation. The two are then fused into a unified topic vector for subsequent similarity calculation. For similarity measurement, the system adopts a multi-angle approach: besides the common cosine similarity, it also computes the Jaccard similarity of the topic keyword sets and the overlap of the topics' document distributions. These three similarities are combined into a comprehensive similarity score that more fully reflects the strength of association between topics. Finally, the system organizes the similarity values of all topic pairs into a symmetric similarity matrix, in which each element represents the semantic similarity between the corresponding pair of topics. This matrix is the key input to the subsequent hierarchical clustering, determining which topics should be grouped into higher-level categories.
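The blended similarity can be sketched with two of the three measures (cosine on topic vectors, Jaccard on keyword sets); the mixing weight `alpha` is an illustrative assumption, and document-distribution overlap is omitted for brevity:

```python
import math

def topic_similarity(vec_a, vec_b, kw_a, kw_b, alpha=0.6):
    """Blend cosine similarity of topic vectors with Jaccard similarity of
    keyword sets into one comprehensive score in [0, 1]."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(x * x for x in vec_a))
            * math.sqrt(sum(y * y for y in vec_b)))
    cosine = dot / norm if norm else 0.0
    union = len(kw_a | kw_b)
    jaccard = len(kw_a & kw_b) / union if union else 0.0
    return alpha * cosine + (1 - alpha) * jaccard
```

Computing this score for every pair of topics and placing it at positions (i, j) and (j, i) yields the symmetric similarity matrix fed to hierarchical clustering.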
S4.2, hierarchical clustering is carried out on the topics by adopting a preset hierarchical clustering model, a hierarchical system expressed as a tree structure is obtained, and the hierarchical system comprises a plurality of primary topics and a plurality of secondary topics corresponding to the primary topics;
The Ward hierarchical clustering method is selected; when merging clusters it minimizes the increase in within-cluster variance and tends to produce clusters of similar size, which suits the organizational requirements of scientific-literature topics. Hierarchical clustering is a bottom-up iterative merging process: initially each topic is treated as an independent cluster, and in each iteration the system finds and merges the two most similar clusters, continuing until all topics are merged into one cluster or a preset stopping condition is reached. The clustering process can be visualized as a dendrogram (hierarchical cluster tree) whose leaf nodes are the original topics, whose internal nodes represent higher-level topic categories, and whose root node contains all topics. By cutting the dendrogram at an appropriate position, the system obtains a multi-level topic classification system. The choice of cutting position considers the number of clusters, the internal coherence of each cluster, and the separation between clusters. In the present invention, the system generally constructs a two- or three-level topic classification system comprising several primary topics (macro categories such as "technological innovation", "industrial development", and "science and technology finance") and, under each primary topic, several secondary topics (specific fields such as "basic research" and "transformation of scientific achievements" under "technological innovation"). This hierarchical system provides a clear organizational structure for scientific-literature topics and helps clarify the subordination and relatedness between topics.
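The bottom-up merging under Ward's criterion can be sketched as follows. This naive O(n³) version is for illustration only (production code would use an optimized library routine), and the variance-increase formula is the standard Ward distance between cluster centroids:

```python
def ward_distance(cluster_a, cluster_b):
    """Ward's criterion: the within-cluster variance increase caused by
    merging two clusters, (|A||B| / (|A|+|B|)) * ||centroid_A - centroid_B||^2."""
    na, nb = len(cluster_a), len(cluster_b)
    dim = len(cluster_a[0])
    ca = [sum(p[i] for p in cluster_a) / na for i in range(dim)]
    cb = [sum(p[i] for p in cluster_b) / nb for i in range(dim)]
    sq = sum((ca[i] - cb[i]) ** 2 for i in range(dim))
    return na * nb / (na + nb) * sq

def agglomerate(points, target_clusters):
    """Start from singleton clusters and repeatedly merge the pair with the
    smallest Ward distance until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = ward_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording each merge instead of stopping at a target count yields the dendrogram; cutting it at a chosen height recovers the multi-level topic hierarchy.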
And S4.3, optimizing and adjusting the layering system to form a theme layering structure.
The automatically generated hierarchy may contain some unreasonable aspects and requires further optimization to improve its scientific soundness and practicality. The system's optimization covers several aspects: first, oversized or undersized clusters are split or merged to keep topics balanced at each level; second, the assignment of some topics is adjusted based on semantic relations among them, ensuring the semantic consistency of the classification; furthermore, external domain knowledge bases (such as subject classification systems and technical-field classification standards) are used to standardize topic naming and organization, improving the professionalism and understandability of the classification. In practical applications, the system also supports an expert-intervention mode, allowing domain experts to review and modify the automatically generated hierarchy based on their own knowledge, further improving the practical value of the topic classification. The result is a multi-level classification system for scientific-literature topics that preserves data-driven objectivity while benefiting from expert guidance, providing an effective semantic framework for the organization and retrieval of scientific literature.
After the topic hierarchy is built, the system enters the topic evolution analysis stage, focusing on how topics change over time. This analysis is of great importance for grasping the dynamics of technological development.
As shown in fig. 5, S5 specifically includes:
S5.1, dividing the time dimension of the target text data according to the release time of the target text data to obtain a plurality of continuous time windows;
Dividing the time dimension is the basis of time-series analysis; the system designs a flexible time-partitioning strategy according to research requirements and data characteristics. Typically, natural time units (e.g., month, quarter, year) or literature cycles (e.g., five-year planning periods) serve as the basis for dividing time windows. When the data volume is large, smaller time units (months or quarters) can be chosen to capture finer changes; when the data volume is small or long-term trends are the concern, larger units (single or multiple years) can be chosen. The system supports dynamic time-window settings, allowing the user to adjust the window size or sliding step according to specific needs. The window settings ensure that the amount of data within each window is statistically meaningful while meeting the required time resolution. In actual processing, the system first orders all data chronologically by the documents' release times, then divides them into a series of consecutive periods according to the preset window parameters; the documents within each period form that time window's dataset. This time-window division provides the basic temporal frame for subsequent topic evolution analysis.
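Grouping documents into fixed-width windows by publication year can be sketched as below; the year-granularity and the `(doc_id, year)` input shape are simplifying assumptions (real data would carry full dates and support sliding steps):

```python
def build_time_windows(documents, window_years=1):
    """Group documents into consecutive windows by publication year.

    documents: list of (doc_id, year) pairs.
    Returns {window_start_year: [doc_ids]} with windows of window_years width.
    """
    if not documents:
        return {}
    start = min(year for _, year in documents)
    windows = {}
    for doc_id, year in documents:
        key = start + ((year - start) // window_years) * window_years
        windows.setdefault(key, []).append(doc_id)
    return windows
```

With `window_years=2`, documents from 2018-2019 fall into the 2018 window and a 2021 document into the 2020 window, giving the consecutive periods used by the evolution analysis.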
S5.2, performing topic mining on the target text data in the time window by using a preset topic mining model;
To track how topics change over time, the system must identify the topic structure within each time window. The invention adopts two topic-mining strategies: independent modeling and incremental updating. Under independent modeling, the system applies the BERTopic model separately to the data of each window, running the complete topic-mining pipeline (document semantic encoding, dimensionality reduction, clustering, and topic-representation extraction). This captures the topic patterns specific to each period without interference from other periods, but may produce inconsistent topic identities across windows. Under incremental updating, the system processes each new window's data incrementally on top of the topic model from the previous window, updating only the topic distributions and representations. This preserves topic continuity and facilitates comparison across time, but may miss emerging topics or overlook significant changes in topic content. To combine the advantages of both strategies, the system adopts a hybrid approach: it performs independent modeling periodically (e.g., every year or every literature cycle) while applying incremental updates in the intermediate periods, and it sets up a new-topic discovery mechanism that identifies emerging topics promptly during incremental updates. With this hybrid strategy, the system effectively captures dynamic changes in the topic structure while maintaining topic continuity.
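The hybrid schedule described above can be sketched as a simple plan marking which windows receive a full independent refit and which are updated incrementally. The period length (`full_refit_every`) and the window labels are illustrative assumptions; the patent does not name a specific API, though BERTopic's online mode (`partial_fit`) is one plausible way to realize the incremental step.

```python
def modeling_plan(window_labels, full_refit_every=4):
    """Hybrid schedule: a full independent model refit every N windows
    (e.g. yearly when windows are quarters), incremental updates between.
    N=4 here is an illustrative choice, not specified by the patent."""
    plan = []
    for i, label in enumerate(window_labels):
        mode = "independent" if i % full_refit_every == 0 else "incremental"
        plan.append((label, mode))
    return plan

quarters = ["2022-Q1", "2022-Q2", "2022-Q3", "2022-Q4", "2023-Q1"]
plan = modeling_plan(quarters, full_refit_every=4)
# → the first and fifth windows get an independent refit, the rest incremental
```

A new-topic discovery check, as mentioned above, would run inside the incremental branch to flag documents that the carried-over model leaves unassigned.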
S5.3, calculating the theme characteristics of each theme in different time windows according to the theme mining result, wherein the theme characteristics comprise the occurrence frequency and the intensity;
Topic features are the basic indicators that quantify a topic's change over time. Topic frequency reflects a topic's popularity and is usually measured as the proportion of documents in a window that belong to the topic. Topic intensity reflects the topic's importance or the attention it receives; the system computes it by combining several factors, including the number of topic documents, their average length, their average importance score (based on the scoring-index system constructed earlier), and the concentration of topic keywords within the documents. This intensity calculation balances document count against document importance and therefore reflects a topic's actual influence more accurately. The system computes these feature values for every topic in every time window and organizes the results into time-series data, forming temporal profiles of the topic features. These profiles are the data basis for the subsequent evolution analysis, reflecting how topics change over time.
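One plausible way to compute the two basic topic features is sketched below: frequency as the share of window documents assigned to the topic, and intensity as frequency weighted by the topic's mean document-importance score. The exact weighting is an assumption for illustration; the patent combines further factors (document length, keyword concentration) without fixing a formula.

```python
def topic_features(assignments, importance, topic_id):
    """Frequency: share of the window's documents assigned to the topic.
    Intensity (illustrative weighting): frequency scaled by the topic's
    mean document-importance score."""
    total = len(assignments)
    members = [d for d, t in assignments.items() if t == topic_id]
    if not members or total == 0:
        return 0.0, 0.0
    freq = len(members) / total
    mean_importance = sum(importance[d] for d in members) / len(members)
    return freq, freq * mean_importance

# hypothetical window: four documents, two topics, precomputed importance scores
assignments = {"a": 0, "b": 0, "c": 1, "d": 0}
importance = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.8}
freq, intensity = topic_features(assignments, importance, topic_id=0)
# → freq = 0.75, intensity ≈ 0.6 (0.75 × mean importance 0.8)
```

Running this for every topic in every window yields the temporal feature profiles described above.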
S5.4, analyzing the theme characteristics by adopting a preset dynamic theme model to obtain variation characteristics of the theme, wherein the variation characteristics are used for indicating semantic variation of the theme along with time;
In addition to variations in topic frequency and intensity, the topic's content structure also evolves over time, and this semantic variation reflects subtle shifts in topic focus. To capture this variation, the system applies dynamic topic model (Dynamic Topic Models, DTM) techniques to analyze the temporal variation of the word distribution within the topic. The DTM model assumes that the same topic in adjacent epochs has some continuity, but allows the word distribution of topics to change gradually. The system compares the keyword distribution of each topic in each time window, and identifies newly added words, disappeared words and words with obvious change. These vocabulary changes reflect the evolution trend of the subject matter, such as the appearance of new technical concepts, the fade-out of old concepts, the change of focus, etc. The system also calculates the semantic drift degree inside the theme, namely the distance between semantic representations of the same theme at different time points, and quantifies the speed and the amplitude of theme change. In addition, the system analyzes the interaction among topics, such as semantic penetration, topic differentiation, fusion and other phenomena among topics, and reveals the dynamic evolution rule of the topic network. These varying features together constitute a multi-dimensional description of the evolution of the subject, reflecting not only how much the variation is, but also what the variation is.
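The semantic-drift measure mentioned above, the distance between a topic's semantic representations at consecutive time points, can be sketched with cosine distance. The two-dimensional vectors below are toy values for illustration; a real system would use the topic embeddings produced by the semantic model.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity between two topic embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def semantic_drift(topic_vectors_by_window):
    """Drift series: distance between the same topic's semantic
    representation in consecutive time windows."""
    return [cosine_distance(prev, cur)
            for prev, cur in zip(topic_vectors_by_window,
                                 topic_vectors_by_window[1:])]

vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
drift = semantic_drift(vecs)
# → [0.0, 1.0]: no change between the first two windows, orthogonal shift after
```

Large drift values would flag windows where the topic's content structure shifted sharply, complementing the keyword-level comparison described above.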
S5.5, generating the topic-evolution time series according to the change characteristics.
The change characteristics from the preceding analyses are integrated into structured time-series data, providing the input for trend prediction. The system organizes three classes of topic-evolution time series: topic-popularity series, which record each topic's frequency and intensity at each time point; topic-content series, which record changes in topic keyword distributions; and topic-relation series, which record interactions among topics and the dynamics of structural reorganization. These series are stored in several forms, including numeric series (e.g., changes in topic intensity), vector series (e.g., changes in topic semantic representations over time), and structured series (e.g., the evolution of the topic network topology). The system also attaches rich metadata, such as time granularity, data source, and processing method, to the time-series data to facilitate later interpretation and use. All time-series data are organized into a unified database supporting multi-dimensional query and analysis, providing high-quality training data for the final trend prediction.
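A time series with its attached metadata could be represented by a record such as the following; all field names and the sample source string are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TopicSeries:
    """One topic-evolution time series plus metadata, as described above.
    The fields are a minimal sketch of the three series classes
    (popularity, content, relation) and their metadata."""
    topic_id: int
    kind: str                 # "popularity" | "content" | "relation"
    granularity: str          # e.g. "quarter"
    source: str               # data source label (illustrative)
    points: list = field(default_factory=list)  # (window_label, value) pairs

s = TopicSeries(topic_id=7, kind="popularity",
                granularity="quarter", source="abstract corpus")
s.points.append(("2023-Q1", 0.12))
s.points.append(("2023-Q2", 0.18))
```

A collection of such records, indexed by `topic_id` and `kind`, would support the multi-dimensional queries mentioned above and feed directly into the trend-prediction stage.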
The invention thus achieves a comprehensive analysis of the topic structure and evolution patterns of scientific literature. The analysis not only reveals how topics are organized but also captures how they change over time, providing an important basis for understanding trends in technological development. In particular, through the hierarchical organization of topics and the analysis of their temporal changes, the system builds a panoramic view spanning from microscopic topics to macroscopic fields and from static structure to dynamic evolution, helping to grasp the overall trends and internal laws of technological development from multiple angles and levels. This structured, dynamic analysis method overcomes the limitations of traditional literature analysis and provides more systematic and deeper knowledge support for science-and-technology decision-making.
An embodiment of the application also provides a device for analyzing and predicting scientific-literature topic trends, which comprises:
a data preprocessing module, configured to collect scientific-literature text data and preprocess it to obtain target text data;
a semantic encoding module, configured to encode the target text data with a pre-trained semantic analysis model to obtain document semantic vectors;
a topic clustering module, configured to perform topic clustering on the target text data according to the document semantic vectors and extract a plurality of topics and a topic representation of each topic;
a hierarchical analysis module, configured to hierarchically cluster the topics according to the topic representations to form a topic hierarchy;
a time-evolution analysis module, configured to divide the target text data into a plurality of time windows, analyze the change characteristics of the topics in each time window, and construct topic-evolution time series according to the change characteristics, wherein the change characteristics include occurrence frequency and intensity changes;
and a trend prediction module, configured to predict the intensity-change trend of the topics according to the topic-evolution time series.
An embodiment of the application also provides a computer device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for analyzing and predicting scientific-literature topic trends described above.
The embodiment of the application also provides a computer readable storage medium which stores computer instructions for causing a computer to execute the method for analyzing and predicting the topic trend of the scientific literature.
The embodiment of the application also provides a computer program product, which comprises computer instructions, wherein the computer instructions realize the steps of the method for analyzing and predicting the topic trend of the technical literature when being executed by a processor.
The application has the following technical effects:
The method innovatively combines the BERT semantic model with topic modeling to build a BERTopic-based framework for scientific-literature analysis; it designs a multi-level topic classification scheme that organizes topics through hierarchical clustering, enabling comprehensive analysis from macroscopic fields down to microscopic topics; it provides a dynamic topic-evolution analysis method that quantifies the evolution of literature topics through time-window partitioning and dynamic topic modeling; and it develops a trend-prediction model based on topic-evolution patterns that combines time-series analysis with deep learning to provide forward-looking references for future research directions.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the scientific literature topic trend analysis and prediction method described in the above method embodiments. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
In addition, the embodiments of the present disclosure provide a computer program product storing a computer program; when the computer program is executed by a processor, the steps of the method for analyzing and predicting scientific-literature topic trends provided in any of the foregoing embodiments are implemented. For details, reference may be made to the foregoing method embodiments, which are not repeated here.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, which may be a volatile or nonvolatile computer readable storage medium. In another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and device described above may be found in the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus, device, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
It should be noted that the foregoing embodiments are merely specific implementations of the present disclosure and are not intended to limit its protection scope. Although the disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications, variations, or equivalent substitutions of some of the technical features described therein may still be made without departing from the spirit and scope of the technical solutions of the embodiments of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (9)

1. A method for analyzing and predicting scientific-literature topic trends, characterized by comprising: collecting scientific-literature text data and preprocessing the scientific-literature text data to obtain target text data; encoding the target text data with a pre-trained semantic analysis model to obtain document semantic vectors; performing topic clustering on the target text data according to the document semantic vectors, and extracting a plurality of topics and a topic representation of each topic; hierarchically clustering the topics according to the topic representations to form a topic hierarchy; dividing the target text data into a plurality of time windows, analyzing the change characteristics of the topics in each time window, and constructing topic-evolution time series according to the change characteristics, wherein the change characteristics include occurrence frequency and intensity changes; and predicting the intensity-change trend of the topics according to the topic-evolution time series; wherein the dividing of the target text data into a plurality of time windows, the analyzing of the change characteristics of the topics in each time window, and the constructing of the topic-evolution time series according to the change characteristics comprise: dividing the target text data along the time dimension according to its publication time to obtain a plurality of consecutive time windows; performing topic mining on the target text data within each time window using a preset topic-mining model; calculating, according to the topic-mining results, the topic features of each topic in the different time windows, the topic features including occurrence frequency and intensity; analyzing the topic features with a preset dynamic topic model to obtain the change characteristics of the topics, the change characteristics being used to indicate the semantic change of the topics over time; and generating the topic-evolution time series according to the change characteristics.

2. The method according to claim 1, characterized in that encoding the target text data with the pre-trained semantic analysis model to obtain document semantic vectors comprises: segmenting the target text data to obtain text segments of a specified length; obtaining semantic representations of the text segments with the semantic analysis model; and integrating the semantic representations of the text segments into a document semantic vector, the document semantic vector being used to indicate the semantic information and contextual relations of the target text data.

3. The method according to claim 2, characterized in that performing topic clustering on the target text data according to the document semantic vectors and extracting a plurality of topics and a topic representation of each topic comprise: reducing the dimensionality of the document semantic vectors with a preset dimensionality-reduction model to obtain low-dimensional semantic vectors; performing density clustering on the target text data with a preset clustering model based on the low-dimensional semantic vectors to form a plurality of topics; calculating word feature values of the words in each topic, the word feature values including term frequency and inverse document frequency, and extracting keywords according to the word feature values; and optimizing the keywords with a preset maximal-similarity matching algorithm to obtain the topic representations of the topics.

4. The method according to claim 3, characterized in that hierarchically clustering the topics according to the topic representations to form a topic hierarchy comprises: computing semantic similarity based on the topic representations, and constructing a similarity matrix among the topics according to the topic representations and the semantic similarity; hierarchically clustering the topics with a preset hierarchical clustering model to obtain a hierarchy represented as a tree structure, the hierarchy containing a plurality of first-level topics and a plurality of second-level topics corresponding to the first-level topics; and optimizing and adjusting the hierarchy to form the topic hierarchy.

5. The method according to claim 1, characterized in that preprocessing the scientific-literature text data to obtain target text data comprises: evaluating the scientific-literature text data with a preset large language model to obtain a complexity score, the complexity score being used to indicate the professionalism, interdisciplinarity, and structural complexity of the scientific-literature text data; classifying the scientific-literature text data according to the complexity score into a first class of texts, whose complexity scores are below a preset score threshold, and a second class of texts, whose complexity scores are above the score threshold; performing lightweight analysis on the first class of texts and deep analysis on the second class of texts to obtain preliminary processing results; and obtaining the target text data according to the preliminary processing results.

6. The method according to claim 5, characterized in that evaluating the scientific-literature text data with the preset large language model to obtain the complexity score comprises: collecting training texts with complexity annotations and fine-tuning a preset initial large language model to obtain the large language model; computing analysis features of the scientific-literature text data with the large language model, the analysis features including professional-vocabulary density, domain coverage, and logical depth; calculating a text-complexity score according to the analysis features and generating an evaluation-basis explanation according to the text-complexity score; and obtaining the complexity score of the scientific-literature text data according to the evaluation-basis explanation.

7. The method according to claim 6, characterized in that performing topic clustering on the target text data according to the document semantic vectors and extracting a plurality of topics and a topic representation of each topic further comprise: establishing, based on the document semantic vectors, a scoring-index system containing literature-grade classification, publication time, and citation frequency; constructing a cost-sensitive decision tree according to the document semantic vectors and the scoring-index system, the cost-sensitive decision tree being used to indicate the increased weights assigned to samples whose importance reaches a preset importance threshold; extracting classification rules from the cost-sensitive decision tree, the classification rules including feature thresholds and classification paths; and optimizing the initial cluster-center distribution of the topic clustering according to the classification rules, and performing topic clustering on the target text data according to the optimization results.

8. The method according to claim 7, characterized by further comprising, after predicting the intensity-change trend of the topics according to the topic-evolution time series: constructing, based on the topic intensity-change trend, a knowledge graph containing literature entities, influence paths, and effect-quantification indicators; establishing a multi-level analysis framework with the knowledge graph; establishing a retrieval-reasoning system according to the multi-level analysis framework, the retrieval-reasoning system containing a reasoning chain from literature to impact; and adjusting the effect-quantification indicators, and performing multi-scenario prediction with the adjusted effect-quantification indicators and the retrieval-reasoning system to obtain development trends under different conditions; wherein the reasoning chain is established through the following steps: decomposing the impact-analysis task into a plurality of logically dependent subtasks based on the multi-level analysis framework; processing the subtasks with the large language model to obtain a progressive reasoning chain; retrieving, from the knowledge graph, associated data related to each node of the progressive reasoning chain; and verifying and correcting the reasoning results by combining the associated data with the reasoning process indicated by the large language model, to generate the reasoning chain.

9. A device for analyzing and predicting scientific-literature topic trends, characterized by comprising: a data preprocessing module, configured to collect scientific-literature text data and preprocess it to obtain target text data; a semantic encoding module, configured to encode the target text data with a pre-trained semantic analysis model to obtain document semantic vectors; a topic clustering module, configured to perform topic clustering on the target text data according to the document semantic vectors and extract a plurality of topics and a topic representation of each topic; a hierarchical analysis module, configured to hierarchically cluster the topics according to the topic representations to form a topic hierarchy; a time-evolution analysis module, configured to divide the target text data into a plurality of time windows, analyze the change characteristics of the topics in each time window, and construct topic-evolution time series according to the change characteristics, wherein the change characteristics include occurrence frequency and intensity changes, the module being configured to: divide the target text data along the time dimension according to its publication time to obtain a plurality of consecutive time windows; perform topic mining on the target text data within each time window using a preset topic-mining model; calculate, according to the topic-mining results, the topic features of each topic in the different time windows, the topic features including occurrence frequency and intensity; analyze the topic features with a preset dynamic topic model to obtain the change characteristics of the topics, the change characteristics being used to indicate the semantic change of the topics over time; and generate the topic-evolution time series according to the change characteristics; and a trend prediction module, configured to predict the intensity-change trend of the topics according to the topic-evolution time series.
CN202510550556.6A 2025-04-29 2025-04-29 Method and system for analyzing and predicting the trend of scientific and technological literature Active CN120068882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510550556.6A CN120068882B (en) 2025-04-29 2025-04-29 Method and system for analyzing and predicting the trend of scientific and technological literature


Publications (2)

Publication Number Publication Date
CN120068882A CN120068882A (en) 2025-05-30
CN120068882B true CN120068882B (en) 2025-07-25

Family

ID=95804729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510550556.6A Active CN120068882B (en) 2025-04-29 2025-04-29 Method and system for analyzing and predicting the trend of scientific and technological literature

Country Status (1)

Country Link
CN (1) CN120068882B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120336416B (en) * 2025-06-19 2025-08-19 解螺旋(上海)科技有限公司 Artificial intelligence-based document structured extraction method and system
CN120372325A (en) * 2025-06-27 2025-07-25 浙江吉利控股集团有限公司 Quality evaluation method, device, equipment, medium and program product for data set
CN120429486A (en) * 2025-07-08 2025-08-05 杭州市滨江区浙工大人工智能创新研究院 A hot topic monitoring method and system based on improved HDBSCAN clustering algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046167A (en) * 2019-11-07 2020-04-21 武汉大学 Inference method of subject theme evolution combined with time-delay calculation in scientific and technological intelligence analysis
CN111694930A (en) * 2020-06-11 2020-09-22 中国农业科学院农业信息研究所 Dynamic knowledge hotspot evolution and trend analysis method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field
CN119691159B (en) * 2024-12-02 2025-07-29 武汉理工大学 Technological topic evolution stage prediction method and system based on multiple graph representation



Liang et al. DeepDiveAI: Identifying AI-Related Documents in Large Scale Literature Dataset
CN120653775B (en) Scientific and technological public text intelligent classification and service method and device based on deep learning
KR102666388B1 (en) Apparatus and method for generating predictive information on development possibility of promising technology
CN120234387B (en) Ore finding prediction method based on multi-agent technology
CN120234427B (en) Electronic government platform management method and system based on cloud data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant