CN118471540B - A method and system for processing cardiovascular case data - Google Patents
- Publication number
- CN118471540B (application CN202410925105.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- features
- data
- model
- random forest
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Public Health (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Epidemiology (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Computational Linguistics (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a method and a system for processing cardiovascular case data, and relates to the technical field of data processing. The processing method comprises the following steps: determining key factors influencing cardiovascular disease risk according to a feature ranking table and dynamic factors; according to the key factors, learning the disease risk information in a patient's long-term metabolic features by using a causal-stability, time-aware long short-term memory (LSTM) network, so as to construct a processing model; within the processing model, performing feature interaction between the individual's features and the disease risk information learned by the LSTM network to obtain a feature interaction result; and, according to the feature interaction result, using a fully connected network in the processing model to perform the final disease risk processing and outputting the processing result. The invention can make full use of the multidimensional information in electronic medical record data and improves working efficiency and accuracy.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a cardiovascular case data processing method and system.
Background
With the rapid development of modern medical technology, the prevention and study of cardiovascular diseases have become an important global issue. Cardiovascular diseases, especially hypertension, are among the major diseases affecting human health, and their morbidity and mortality are high worldwide. To better prevent, diagnose and treat cardiovascular diseases, large amounts of electronic medical record data are collected and used in related studies.
However, cardiovascular case data typically contain a large amount of feature information, such as a patient's physiological indicators, lifestyle habits and family medical history. Complex nonlinear relationships may exist among these features, and the impact of some features on disease risk may vary over time. Traditional data analysis methods often struggle to fully mine the underlying information in these data, so cardiovascular disease risk is not processed accurately enough.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a processing method and a processing system for cardiovascular case data, which can fully utilize multidimensional information in electronic medical record data and improve the working efficiency and accuracy.
In order to solve the technical problems, the technical scheme of the invention is as follows:
In a first aspect, a method of processing cardiovascular case data comprises:
preprocessing a hypertension electronic medical record data set to obtain preprocessed data;
constructing a random forest model by using the preprocessed data, and training the random forest model to obtain a trained random forest model;
calculating an importance score for each feature through the trained random forest model;
sorting the features according to the importance score of each feature to obtain a feature ranking table;
determining key factors influencing cardiovascular disease risk according to the feature ranking table and dynamic factors;
determining a hidden state sequence through an LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assigning a weight to each hidden state according to the hidden state sequence;
performing second-order feature interaction on the individual's original features by using a factorization machine to obtain an interaction result; fusing the interaction result with the hidden state sequence to obtain a feature interaction result;
inputting the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; training a series of decision trees for a gradient boosting decision tree model, wherein, after training is completed, the output data of the gradient boosting decision tree model is the weighted sum of the prediction results of all the decision trees; integrating the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtaining a final disease risk processing result according to the integrated data.
Further, preprocessing the hypertension electronic medical record data set to obtain preprocessed data, including:
randomly selecting k initial cluster centers, and iterating until the clustering result is stable, wherein the cluster centers are updated according to the following formula:
[formula image not reproduced];
wherein the terms of the formula denote, in order: the cluster label; the i-th data point; the total number of data points in the data set; an index; the squared Euclidean distance from a data point to a cluster center; and a constant;
for the data in each cluster, calculating a covariance matrix C between features according to the following formula:
[formula image not reproduced];
wherein the terms of the formula denote, in order: the weight of the i-th data point; the matrix transpose; the weight of the j-th data point; the j-th data point; and the index j;
performing eigendecomposition on the covariance matrix C to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix C;
sorting the eigenvectors according to the magnitudes of their eigenvalues;
obtaining the preprocessed data by multiplying the raw data matrix by the selected principal component matrix.
Further, constructing a random forest model by using the preprocessing data, training the random forest model to obtain a trained random forest model, including:
determining the number of decision trees, the maximum depth of each decision tree, and the minimum number of samples required at a leaf node;
drawing samples from the original training data set by bootstrap sampling to generate a new training subset, and constructing a decision tree on the new training subset;
at each node of each decision tree, evaluating the importance of each feature and determining the feature used to split the node;
growing the decision tree until a preset stopping condition is reached, wherein, during growth, each node is split according to the selected feature and child nodes are generated recursively;
repeating the above operations, constructing a decision tree from a different bootstrap subset each time, until the specified number of decision trees has been generated;
integrating all the constructed decision trees together to form a random forest model;
Training the random forest model to obtain a trained random forest model.
Further, calculating an importance score for each feature through the trained random forest model comprises:
acquiring the feature importance attribute of the trained random forest model;
a feature importance score is calculated based on the feature importance attributes.
Further, sorting the features according to the importance score of each feature to obtain a feature sorting table, including:
according to the importance scores of the features, preliminarily ranking the features by quicksort to obtain a preliminary ranking list;
analyzing the correlation between the features in the preliminary ranking list, and adjusting the preliminary ranking list according to the correlation to obtain an optimized ranking result, including: for each pair of features (F_h, F_m) in the preliminary ranking list, where F_h and F_m denote the h-th and m-th features respectively, calculating the correlation coefficient between F_h and F_m according to the following formula:
[formula image not reproduced];
wherein the terms of the formula denote: the weight of a sample; the standard deviations of the h-th and m-th features; the frequency with which a sample occurs in the data set; the value of a sample on the h-th feature; and the number of samples; constructing a correlation value from the correlation coefficient, wherein the terms of its formula (also not reproduced) include the smoothing factor of the exponentially weighted moving average (EWMA) and the time; constructing a correlation matrix R from the correlation values; traversing the features in the preliminary ranking list and, for each feature F_h, finding the set H_h of features that are highly correlated with it; according to a greedy strategy, for each feature F_h, exchanging the positions of F_h and the features in H_h to obtain the optimized ranking result;
and generating a characteristic sorting table according to the optimized sorting result.
Further, determining key factors affecting cardiovascular disease risk according to the feature ranking table and the dynamic factors, including:
according to the importance scores of the features and the numerical values of the dynamic factors, calculating the comprehensive score of each feature, wherein the calculation formula of the comprehensive score is as follows:
[formula image not reproduced];
wherein the terms of the formula denote, in order: the composite score of a feature; the number of features; the weight of a static feature; the normalized value of the feature on a sample; the dynamic factor; the weight at a time point; the observed value of the feature at that time point; the time window size of the moving average; the time; and an index;
And determining key factors influencing the cardiovascular disease risk according to the magnitude of the comprehensive score.
Further, determining key factors affecting cardiovascular disease risk according to the magnitude of the composite score, including:
sequencing all the features according to the comprehensive score;
and selecting the features with the comprehensive score not less than the threshold value from the sorted feature list according to the preset threshold value, and determining the features with the comprehensive score not less than the threshold value as key factors influencing the cardiovascular disease risk.
In a second aspect, a system for processing cardiovascular case data includes:
an acquisition module, configured to preprocess a hypertension electronic medical record data set to obtain preprocessed data; construct a random forest model by using the preprocessed data, and train the random forest model to obtain a trained random forest model; calculate an importance score for each feature through the trained random forest model; and sort the features according to the importance score of each feature to obtain a feature ranking table;
a processing module, configured to determine key factors influencing cardiovascular disease risk according to the feature ranking table and dynamic factors; determine a hidden state sequence through an LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assign a weight to each hidden state according to the hidden state sequence; perform second-order feature interaction on the individual's original features by using a factorization machine to obtain an interaction result; fuse the interaction result with the hidden state sequence to obtain a feature interaction result; input the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; train a series of decision trees for a gradient boosting decision tree model, wherein, after training is completed, the output data of the gradient boosting decision tree model is the weighted sum of the prediction results of all the decision trees; and integrate the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtain a final disease risk processing result according to the integrated data.
In a third aspect, a computing device includes:
one or more processors;
And a storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method.
In a fourth aspect, a computer readable storage medium has a program stored therein, which when executed by a processor, implements the method.
The above scheme of the invention at least comprises the following beneficial effects.
By preprocessing the hypertension electronic medical record data set, the invention cleans and normalizes the original data and removes noise and redundant information, thereby obtaining high-quality preprocessed data.
By using the random forest model, the invention calculates an importance score for each feature and ranks the features by score, so that medical workers can quickly identify the key factors with the greatest influence on cardiovascular risk, which improves working efficiency and accuracy.
When determining the key factors, the invention considers not only the importance scores of the features but also dynamic factors, so that the model adapts more flexibly to changes across patients and time points, which makes the processing more individualized.
By adopting a causal-stability, time-aware long short-term memory (LSTM) network, the invention can effectively learn the disease risk information in a patient's long-term metabolic features. The LSTM enables the model to capture long-term dependencies in time-series data and thereby predict cardiovascular disease risk more accurately.
The invention performs feature interaction between the individual's features and the disease risk information learned by the LSTM. This step strengthens the association between features, so that the processing model takes the influence of the various factors into account more comprehensively, which improves processing accuracy.
Drawings
Fig. 1 is a flowchart of a method for processing cardiovascular case data according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a cardiovascular case data processing system according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, an embodiment of the present invention proposes a method for processing cardiovascular case data, the method comprising the steps of:
Step 11, preprocessing a hypertension electronic medical record data set to obtain preprocessed data;
Step 12, constructing a random forest model by using the preprocessed data, and training the random forest model to obtain a trained random forest model;
Step 13, calculating an importance score for each feature through the trained random forest model;
Step 14, sorting the features according to the importance score of each feature to obtain a feature ranking table;
Step 15, determining key factors influencing cardiovascular disease risk according to the feature ranking table and dynamic factors;
Step 16, determining a hidden state sequence through an LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assigning a weight to each hidden state according to the hidden state sequence;
Step 17, performing second-order feature interaction on the individual's original features by using a factorization machine to obtain an interaction result; fusing the interaction result with the hidden state sequence to obtain a feature interaction result;
Step 18, inputting the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; training a series of decision trees for a gradient boosting decision tree model, wherein, after training is completed, the output data of the gradient boosting decision tree model is the weighted sum of the prediction results of all the decision trees; integrating the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtaining a final disease risk processing result according to the integrated data.
In the embodiment of the invention, preprocessing the hypertension electronic medical record data set cleans and normalizes the original data and removes noise and redundant information, thereby obtaining high-quality preprocessed data. By using the random forest model, the invention calculates an importance score for each feature and ranks the features by score, so that medical workers can quickly identify the key factors with the greatest influence on cardiovascular risk, which improves working efficiency and accuracy. When determining the key factors, the invention considers not only the importance scores of the features but also dynamic factors, so that the model adapts more flexibly to changes across patients and time points, which makes the processing more individualized. By adopting a causal-stability, time-aware long short-term memory (LSTM) network, the invention can effectively learn the disease risk information in a patient's long-term metabolic features. The LSTM enables the model to capture long-term dependencies in time-series data and thereby predict cardiovascular disease risk more accurately. The invention performs feature interaction between the individual's features and the disease risk information learned by the LSTM. This step strengthens the association between features, so that the processing model takes the influence of the various factors into account more comprehensively, which improves processing accuracy.
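The second-order feature interaction of step 17 can be illustrated with the standard factorization-machine pairwise term. The following is a minimal sketch only: the latent vectors `v`, the sample `x` and the function names are hypothetical, and the text does not state how the embodiment parameterizes or trains its factorization machine.

```python
from itertools import combinations


def fm_second_order(x, v):
    """Second-order factorization-machine term for one sample.

    x: list of n feature values.
    v: list of n latent vectors (one per feature), each of length k.
    Returns the sum over i < j of <v[i], v[j]> * x[i] * x[j],
    computed with the usual O(n * k) reformulation.
    """
    k = len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total


def fm_second_order_naive(x, v):
    """Same quantity via the explicit pairwise sum, used as a cross-check."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(v[i], v[j]))
        total += dot * x[i] * x[j]
    return total


# Hypothetical sample with three features and two latent dimensions.
x = [1.0, 2.0, 3.0]
v = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
pairwise = fm_second_order(x, v)
```

Both functions return the same value; the first avoids enumerating all feature pairs, which matters when the number of features is large.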
In a preferred embodiment of the present invention, the step 11 may include:
Step 111, randomly selecting k initial cluster centers, and iterating until the clustering result is stable, wherein the cluster centers are updated according to the following formula:
[formula image not reproduced];
wherein the terms of the formula denote, in order: the cluster label; the i-th data point; the total number of data points in the data set; an index; the squared Euclidean distance from a data point to a cluster center; and a constant;
Step 112, for the data in each cluster, calculating a covariance matrix C between features according to a formula (not reproduced) whose terms denote, in order: the weight of the i-th data point; the matrix transpose; the weight of the j-th data point; the j-th data point; and the index j;
Step 113, performing eigendecomposition on the covariance matrix C to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix C;
Step 114, sorting the eigenvectors according to the magnitudes of their eigenvalues;
Step 115, obtaining the preprocessed data by multiplying the raw data matrix by the selected principal component matrix.
In the embodiment of the invention, the clustering algorithm and principal component analysis effectively reduce the dimensionality of the original hypertension electronic medical record data set and remove redundant information and noise, so that the core features of the data are extracted, the computational complexity of the subsequent model is reduced, and the accuracy of risk processing is improved. The cluster center update formula adopted in step 111 locates the center of each cluster more accurately and ensures that the clustering result is stable and reliable, which helps subsequent analysis identify the commonalities and differences of different patient populations more accurately. Calculating the covariance matrix (step 112) quantifies the correlation between different features, so that multicollinear or redundant features can be identified. Eigendecomposition (step 113) and sorting the eigenvectors by eigenvalue (step 114) extract the principal components of the data, that is, the components that contribute most to the variance, so that the preprocessed data retain as much of the useful information in the original data as possible. Multiplying the original data matrix by the principal component matrix (step 115) yields preprocessed data of lower dimension from which noise and redundant information have been removed, which improves the generalization ability of the processing model built later and keeps its performance stable on new data.
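Steps 111 to 115 can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the embodiment's exact procedure: it uses the standard k-means mean update and an unweighted covariance matrix, because the weighted formulas of steps 111 and 112 are not reproduced in the text. The data are centered before projection, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial centers, iterate until labels are stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels, centers


def pca_project(X, n_components):
    """Project X onto its top principal components via eigendecomposition."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)          # covariance matrix between features
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh, because C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
    W = eigvecs[:, order[:n_components]]  # selected principal component matrix
    return Xc @ W


# Synthetic stand-in for the record data: two groups of 50 samples, 4 features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(8.0, 1.0, size=(50, 4))])
labels, centers = kmeans(X, k=2)
reduced = {c: pca_project(X[labels == c], n_components=2) for c in range(2)}
```

Each entry of `reduced` holds one cluster's samples expressed in two principal components instead of the original four features.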
In a preferred embodiment of the present invention, the step 12 may include:
Step 121, determining the number of decision trees, the maximum depth of each decision tree, and the minimum number of samples required at a leaf node;
Step 122, drawing samples from the original training data set by bootstrap sampling to generate a new training subset, and constructing a decision tree on the new training subset, which specifically includes: according to the number of samples to be drawn from the original training data set, repeatedly selecting a sample at random from the original training data set (that is, a randomly selected row of data with its corresponding features and target value) and copying the selected sample into the new training subset; determining the decision tree parameters, such as the maximum depth of the decision tree, the minimum number of samples required at a leaf node, and the minimum impurity decrease required to split a node; creating a root node and assigning the entire new training subset to it; starting from the root node, recursively applying the following steps until a stopping condition is met (for example, the maximum depth is reached or a leaf node contains sufficiently few samples): selecting an optimal feature for splitting, dividing the training subset of the current node into two or more subsets according to the selected feature, creating a new child node for each subset, and assigning the corresponding subset to that child node; marking the current node as an internal node, and setting the branch condition according to the value of the splitting feature;
Step 123, at each node of each decision tree, evaluating the importance of each feature and determining the feature used to split the node, which specifically includes: for the selected feature subset, calculating the evaluation index of each feature at the current node; according to the calculated values, selecting the feature with the best evaluation index (for example, the maximum information gain or the maximum reduction in Gini impurity) as the splitting feature of the current node; determining a split threshold for the selected splitting feature; after the splitting feature is selected, updating the value of that feature in the global feature importance list; recording the selected splitting feature and the corresponding splitting rule in the current node;
Step 124, growing the decision tree until a preset stopping condition is reached, wherein, during growth, each node is split according to the selected feature and child nodes are generated recursively;
Step 125, repeating the above operations, constructing a decision tree from a different bootstrap subset each time, until the specified number of decision trees has been generated;
Step 126, integrating all the constructed decision trees together to form a random forest model;
And step 127, training the random forest model to obtain a trained random forest model.
In the embodiment of the invention, drawing samples from the original data set by bootstrap sampling (step 122) generates multiple different training subsets, so the decision trees built on them differ from one another. When the trees are combined into the random forest model, this diversity averages out the bias and variance of the individual decision trees and improves overall generalization. Because each decision tree is built on a different training subset and each node takes feature importance into account (step 123), the random forest model effectively avoids overfitting a specific data set during training, which helps it maintain good performance on unseen data. During the construction of each decision tree, the importance of each feature is evaluated and nodes are split accordingly (step 123); this importance-based splitting strategy keeps the model focused on the features that clearly influence the target variable, which improves the accuracy of feature selection and the interpretability of the model.
Setting the maximum depth of each decision tree, the minimum number of samples required at a leaf node, and similar parameters (step 121) effectively controls the complexity of each decision tree in the random forest model, which helps balance model complexity against computational efficiency and ensures good performance under limited computing resources. Because the decision trees in the random forest model are constructed independently of one another, their construction is easy to parallelize (step 125), which can significantly improve the efficiency of model construction and training, especially on large data sets. Integrating the independent decision trees into a random forest model (step 126) makes full use of what each tree has learned and fuses that information effectively; this ensemble learning strategy can significantly improve the overall performance and stability of the model.
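The bootstrap sampling of steps 122 and 125 can be sketched in plain Python. This is a minimal illustration that only generates the row indices of each training subset; the function names are hypothetical, and building the trees themselves is left out.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; also return the unused rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def build_training_subsets(n_rows, n_trees, seed=0):
    """One bootstrap subset per decision tree in the forest."""
    rng = random.Random(seed)
    return [bootstrap_sample(n_rows, rng) for _ in range(n_trees)]


# Hypothetical data set of 10 rows and a forest of 3 trees.
subsets = build_training_subsets(n_rows=10, n_trees=3)
for in_bag, out_of_bag in subsets:
    print("in-bag rows:", in_bag, "out-of-bag rows:", out_of_bag)
```

Because rows are drawn with replacement, some rows repeat within a subset and others are left out, which is what makes the trees differ. In scikit-learn, the settings of step 121 correspond roughly to the `n_estimators`, `max_depth` and `min_samples_leaf` parameters of `RandomForestClassifier`; the text does not say which library the embodiment uses.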
In a preferred embodiment of the present invention, the step 13 may include:
Step 131, obtaining the feature importance attribute of the trained random forest model, which specifically includes accessing the corresponding attribute of the trained random forest model, wherein the attribute contains the feature importance information; extracting feature importance data from the attributes, wherein the feature importance data are provided in the form of an array, a list or a dictionary, and each element corresponds to an importance value of a feature;
Step 132, calculating a feature importance score based on the feature importance attributes, which specifically includes: normalizing the feature importance values so that the relative importance of different features can be compared, where the normalization converts the values to a common scale, for example the range 0 to 1, in which 0 represents the least important and 1 the most important; assigning different weights to the features according to their stability, reliability or business relevance, and calculating weighted feature importance scores; and ranking the features by the calculated feature importance scores so that the most important features can be identified quickly.
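As an illustration of steps 131 and 132, the following is a minimal sketch, not the claimed implementation. It assumes that the importance values have already been read from the trained model into a dictionary, that min-max scaling is the chosen normalization, and that per-feature weights default to 1.0. The feature names, the values and the `reliability` weights are hypothetical.

```python
def normalize_importances(importances):
    """Min-max scale raw importance values to the range [0, 1]."""
    lo, hi = min(importances.values()), max(importances.values())
    span = hi - lo
    if span == 0:
        # All features equally important: map everything to 1.0 by convention.
        return {name: 1.0 for name in importances}
    return {name: (value - lo) / span for name, value in importances.items()}


def weighted_scores(normalized, weights):
    """Multiply each normalized importance by a per-feature weight (default 1.0)."""
    return {name: value * weights.get(name, 1.0) for name, value in normalized.items()}


def rank_features(scores):
    """Return feature names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical importance values, as they might be read from a trained model.
raw = {"systolic_bp": 0.40, "age": 0.25, "ldl": 0.20, "bmi": 0.15}
# Hypothetical reliability weights; features not listed keep weight 1.0.
reliability = {"bmi": 0.5}

scores = weighted_scores(normalize_importances(raw), reliability)
ranking = rank_features(scores)
print(ranking)
```

With min-max scaling the least important feature always receives a score of 0, so any weight applied to it has no effect; a different normalization (for example dividing by the sum) may be preferable when that is undesirable.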
In the embodiment of the invention, acquiring the feature importance attribute of the trained random forest model (step 131) shows directly which features play a key role in model prediction. This helps to screen out the features that have the greatest impact on cardiovascular risk and improves model interpretability, making the processing results easier for doctors and patients to understand. The calculation of feature importance scores (step 132) is based on the feature selection mechanism inside the model, which means that features with higher scores have more discriminative power in distinguishing patients of different risk levels; ranking and selecting features according to these scores therefore helps to improve the accuracy of the final processing of the cardiovascular case data. The feature importance scores can be used not only by the current risk processing model but also in subsequent in-depth analysis. For example, for features with higher scores, the biological or clinical link between the feature and cardiovascular risk can be explored further, providing more basis for formulating disease prevention and treatment strategies.
In a preferred embodiment of the present invention, the step 14 may include:
Step 141, performing a preliminary ranking of the features with quicksort according to their importance scores to obtain a preliminary ranking list, which specifically includes: ordering the features by importance score from high to low (or from low to high); selecting a reference value (pivot), which may be the first element, the last element, the middle element or a randomly selected element of the list; dividing the list into two sub-lists, one of elements less than the reference value and one of elements greater than or equal to the reference value; recursively quicksorting the two sub-lists until each sub-list is empty or contains only one element; taking the quicksorted list as the input of a merge sort, that is, dividing it into a plurality of ordered sub-lists; creating an auxiliary list for the merge operation; comparing the first elements of the sub-lists in turn, putting the smaller element into the auxiliary list, and advancing the pointer of that sub-list; repeating this operation until all elements of the sub-lists have been merged into the auxiliary list; and updating the original list with the auxiliary list;
Step 142, analyzing the correlation between the features according to the preliminary ranking list, and adjusting the preliminary ranking list according to the correlation to obtain an optimized ranking result;
Step 143, generating a feature sorting table according to the optimized ranking result.
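The quicksort part of step 141 can be sketched as follows. This is an illustrative sketch only: it assumes the middle element is used as the reference value and that the ranking runs from high to low, and it omits the subsequent merge pass described in the step. The function name and the feature names are hypothetical.

```python
def quicksort_desc(items):
    """Sort (feature, score) pairs by score, highest first, using quicksort."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    pivot = items[mid]  # middle element as the reference value
    rest = items[:mid] + items[mid + 1:]
    # Partition around the pivot score; ties go to the "higher" side.
    higher = [item for item in rest if item[1] >= pivot[1]]
    lower = [item for item in rest if item[1] < pivot[1]]
    return quicksort_desc(higher) + [pivot] + quicksort_desc(lower)


# Hypothetical feature importance scores.
scored = [("age", 0.4), ("bmi", 0.0), ("systolic_bp", 1.0), ("ldl", 0.2)]
preliminary_ranking = quicksort_desc(scored)
print([name for name, _ in preliminary_ranking])
```

Choosing the middle or a random element as the reference value avoids the quadratic worst case that a first-element pivot hits on input that is already sorted.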
In the embodiment of the invention, the preliminary ranking of the features by quicksort (step 141) quickly identifies the features that have a larger influence on cardiovascular disease risk, so that subsequent analysis can focus on these key features, which improves the efficiency and accuracy of feature selection. Based on the preliminary ranking, the correlation between features is further analyzed and the ranking is adjusted accordingly (step 142). This helps to avoid redundant or highly correlated features being selected into the model together, thereby improving the stability and generalization ability of the model. The generated feature sorting table (step 143) not only ranks the features by importance but also implicitly carries correlation information between the features. The feature sorting table can serve as an important reference for subsequent model construction; for example, when building more complex machine learning models, features may be introduced selectively according to the table to optimize the performance and interpretability of the model.
In another preferred embodiment of the present invention, the step 142 may include:
Step 1421, for each pair of features (F_h, F_m) in the preliminary ranking list, where F_h and F_m denote the h-th and m-th features respectively, calculating the correlation coefficient between F_h and F_m according to the following formula:
;
Wherein the formula uses the following quantities: the weight of each sample; the standard deviations of the h-th and m-th features, respectively; the frequency with which each sample occurs in the data set; the value of each sample on the h-th feature; and the number of samples.
Step 1422, constructing a correlation value from the correlation coefficient by means of an exponentially weighted moving average (EWMA), which has a smoothing factor and is indexed by time; constructing a correlation matrix R from the correlation values; traversing the features in the preliminary ranking list and, for each feature F_h, finding the set H_h of features that are highly correlated with it; according to a greedy strategy, for each feature F_h, considering exchanging the position of F_h with the positions of features in H_h, so as to reduce the close arrangement of highly correlated features and increase feature diversity; and selecting an exchange strategy, for example exchanging the feature pair whose exchange reduces the correlation sum the most, the optimized ranking result being obtained after this adjustment.
In the embodiment of the invention, because the correlation coefficient takes the sample weights and the standard deviations of the features into account, it reflects the correlation between features more accurately, which helps the importance of the features to be handled more accurately during ranking and improves the accuracy of the ranking result. After the correlation coefficients are obtained, a correlation matrix R can be constructed in which each element (h, m) is the correlation coefficient between features F_h and F_m. Since correlation is symmetric (that is, the correlation between F_h and F_m equals that between F_m and F_h), the correlation matrix is a symmetric matrix, and each diagonal element (h = m) is always 1. The correlation matrix is updated using the EWMA so that the ranking method can adapt dynamically to changes in the data: over time, new data points affect the correlation values, and the ranking result is updated to better fit the characteristics of the current data. Exchanging feature positions through the greedy strategy reduces the close arrangement of highly correlated features and increases feature diversity, which helps to introduce more information in feature selection or model training and improves the generalization ability and performance of the model. The method provides several adjustable parameters, such as the sample weights and the EWMA smoothing factor, which can be tuned to the specific application scenario and data characteristics; this flexibility allows the method to accommodate different needs and improves its utility. The ranking result may be further optimized by the choice of exchange strategy, for example exchanging the feature pair whose exchange reduces the correlation sum the most, which helps to reduce redundancy between features while preserving important features and improves the overall quality of the feature list.
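The exact correlation formula is not legible in this text, so the following sketch substitutes the standard weighted Pearson correlation as an assumption and pairs it with a plain EWMA update. The function names, the sample values and the weights are hypothetical, and the greedy position exchange is not shown.

```python
import math


def weighted_corr(x, y, w):
    """Weighted Pearson correlation of two equally long value lists.

    Assumption: this standard form stands in for the formula of step 1421.
    Constant features (zero variance) are not handled.
    """
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)


def ewma_update(previous, current, alpha):
    """Blend the newest correlation into the running value (0 < alpha <= 1)."""
    return alpha * current + (1 - alpha) * previous


# Hypothetical values of two features on four samples, with sample weights.
feature_h = [1.0, 2.0, 3.0, 4.0]
feature_m = [2.0, 4.0, 6.0, 8.0]
sample_weights = [1.0, 1.0, 2.0, 1.0]

corr_now = weighted_corr(feature_h, feature_m, sample_weights)
smoothed = ewma_update(previous=0.2, current=corr_now, alpha=0.5)
print(corr_now, smoothed)
```

A larger smoothing factor makes the correlation value follow recent data more closely, while a smaller one keeps more of the history.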
In a preferred embodiment of the present invention, the step 15 may include:
step 151, calculating a comprehensive score of each feature according to the importance score of the feature and the numerical value of the dynamic factor, wherein the calculation formula of the comprehensive score is as follows:
;
Wherein the formula uses the following quantities: the composite score of each feature; the number of features; the weight of each static feature; the normalized value of each feature on each sample; the dynamic factor; the weight at each time point; the observed value of each feature at each time point; the time window size of the moving average; the time; an index; a smoothing factor between 0 and 1; and a proportionality constant;
and step 152, determining key factors influencing cardiovascular disease risk according to the magnitude of the comprehensive score.
In the embodiment of the present invention, the calculation formula of the composite score in step 15 considers not only the importance score of the static features (reflecting the importance of a feature in the whole data set) but also the value of the dynamic factor (reflecting how a feature changes over time). This enables the processing method to capture more fully the various factors that affect cardiovascular disease risk. The introduction of the dynamic factor allows the processing method to adjust flexibly the relative importance of the static features and the dynamic component according to the actual situation. When the dynamic factor is larger, the method focuses more on the variation of features over time; when the dynamic factor is smaller, the method places more emphasis on the importance of the features in the overall data set. This flexibility helps to accommodate the processing requirements of cardiovascular case data in different scenarios. Determining the key factors affecting cardiovascular disease risk according to the magnitude of the composite score helps doctors and researchers to identify the most important risk factors quickly, which is of great importance for formulating targeted prevention and treatment strategies and can improve the effectiveness and efficiency of intervention. By calculating a composite score for each feature, step 15 provides a quantitative basis for determining key factors, which makes the processing result more objective and reliable.
In a preferred embodiment of the present invention, the step 152 may include:
step 1521, sorting all the features according to the composite score;
Step 1522, selecting the features with the comprehensive score equal to or greater than the threshold value from the ranked feature list according to the preset threshold value, and determining the features with the comprehensive score equal to or greater than the threshold value as key factors affecting the cardiovascular disease risk.
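Steps 1521 and 1522 amount to a sort followed by a threshold filter, which can be sketched as below. The composite scores, the feature names and the threshold value are hypothetical and chosen only for illustration.

```python
def select_key_factors(composite_scores, threshold):
    """Rank features by composite score and keep those at or above the threshold."""
    ranked = sorted(composite_scores.items(), key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical composite scores for four features.
composite = {"systolic_bp": 0.82, "ldl": 0.47, "age": 0.61, "bmi": 0.30}
key_factors = select_key_factors(composite, threshold=0.47)
print(key_factors)
```

Because the comparison is "equal to or greater than", a feature whose score sits exactly on the threshold is kept.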
In embodiments of the present invention, by ranking the composite scores of all features (step 1521), the importance of features in affecting cardiovascular disease risk can be intuitively compared. Screening in combination with the preset threshold (step 1522) ensures that the selected feature has not only a high importance in its entirety, but also a significant impact above a certain threshold. This approach helps to improve the accuracy of key factor identification. The setting of the preset threshold value makes the determination of the key factors more clear and simplified. Doctors and researchers do not need to analyze all the features one by one, and only need to pay attention to the features with the comprehensive scores exceeding the threshold value. This helps to reduce the complexity of the decision and improve the decision efficiency. The introduction of the threshold provides a quantitative criterion for the screening of key factors. This makes the screening process more objective, repeatable and reduces the bias from subjective judgment. Meanwhile, the threshold value can be adjusted according to actual conditions so as to adapt to the processing requirements of different cardiovascular case data. The determined key factors can provide important guidance for subsequent cardiovascular disease research and analysis. For example, the biological or clinical link to cardiovascular risk can be further explored for these key factors, providing more basis for the formulation of disease prevention and treatment strategies.
In another preferred embodiment of the present invention, the step 16 may include:
Step 161, taking the long-term metabolic feature sequence of the patient as input data of an LSTM network, wherein the LSTM network can capture long-term dependency relationship in the sequence data so as to learn association between metabolic features and disease risks, and the output data of the LSTM is a hidden state sequence which contains information related to the disease risks; according to a hidden state sequence output by the LSTM, a weight is distributed to each hidden state, wherein the calculation formula of the weight is as follows:
,
Wherein a rectified linear unit (ReLU) activation function is used; V, W, U and a further term are parameters; an additional feature is included; tanh is the hyperbolic tangent function; the formula also uses the length of the sequence, the hidden state at each time, and the attention scores at two times; the prediction result of the LSTM model and the prediction result of an auxiliary model are then integrated to improve the overall prediction performance, where the auxiliary model may be a model trained on a different feature subset; by combining the predictions of a plurality of models, this ensemble learning method reduces the bias and variance of a single model, thereby improving the accuracy and robustness of the prediction.
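The weight formula itself is not legible in this text. As a hedged sketch of how a weight can be assigned to each hidden state, the code below uses a common additive attention form: each hidden state is scored through a tanh projection and the scores are normalized with a softmax. This is an assumption, not the formula of the embodiment; the U term, the additional feature and the ReLU mentioned above are omitted, and all parameter values are hypothetical.

```python
import math


def attention_scores(hidden_states, W, v, b):
    """Score each hidden state h_t as v . tanh(W h_t + b)."""
    scores = []
    for h in hidden_states:
        projected = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * p_i for v_i, p_i in zip(v, projected)))
    return scores


def softmax(scores):
    """Turn scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical hidden states for three time steps (dimension 2).
hidden = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection matrix
v = [1.0, 1.0]                # hypothetical scoring vector
b = [0.0, 0.0]                # hypothetical bias

weights = softmax(attention_scores(hidden, W, v, b))
print(weights)
```

The softmax guarantees that the weights are positive and sum to 1, so they can be used directly to form a weighted combination of the hidden states.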
In another preferred embodiment of the present invention, the step 17 may include:
Step 171, performing second-order feature interaction on the original features of the individual by using a factorization machine to obtain an interaction result $\hat{y}_{FM}$, wherein the factorization machine calculates the second-order feature interactions by:
$\hat{y}_{FM}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$;
Wherein $w_0$ and $w_i$ are the parameters of the linear part, $v_i$ and $v_j$ are the hidden vectors corresponding to features $x_i$ and $x_j$, used to compute the interaction between features, $\langle \cdot, \cdot \rangle$ denotes the vector inner product, and $n$ denotes the number of features;
Step 172, fusing the interaction result with the disease risk information learned by the long short-term memory (LSTM) network in a time-aware manner to obtain a feature interaction result, wherein the calculation formula of the feature interaction result is as follows:
;
Wherein the formula uses the following quantities: the attention weight of the LSTM output; a parameter; the disease risk information output by the LSTM network at each time step; and the total number of time steps.
In the embodiment of the invention, introducing a factorization machine to perform second-order feature interaction on the original features of the individual allows the model to learn the combination relations among features automatically. This is particularly useful when features are multicollinear or when potential relations among features need to be explored, and it can capture information that a traditional linear model may ignore. Fusing the interaction result with the sequential disease risk information learned by the LSTM allows the model to take both static features and dynamic time-series information into account. This is critical for data with time-series properties, such as biomarker changes during disease progression, because it enables time-varying risk patterns to be captured. By fusing information of different sources and types (original feature interactions and time-series risk information), the model obtains a richer, higher-dimensional feature representation, which helps to improve its ability to model complex nonlinear relationships and thus its prediction accuracy. Introducing the interaction result also enhances the flexibility of the model, because the factorization machine can handle real-valued feature vectors without explicit preprocessing or feature engineering. At the same time, the second-order interaction terms of the interaction result give the model a degree of interpretability, since examining the weights of the interaction terms shows which feature combinations have a significant impact on the prediction result. Using the attention weights of the LSTM output in feature fusion means that the model can adaptively focus on the part of the time-series information most relevant to the prediction result, which not only improves the model's ability to process variable-length sequences but also enhances its sensitivity to information at different points in time.
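The second-order interaction of a factorization machine can be illustrated with the sketch below, which assumes the standard formulation (a bias, a linear part, and pairwise terms weighted by inner products of hidden vectors). It computes the result once with the direct double loop and once with the well-known linear-time rearrangement, so the two can be compared. All parameter values are hypothetical, and training of the parameters is not shown.

```python
def fm_predict(x, w0, w, V):
    """Direct evaluation: bias + linear part + all pairwise interactions."""
    n = len(x)
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(V[i], V[j]))  # <v_i, v_j>
            pairwise += inner * x[i] * x[j]
    return linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Equivalent evaluation in O(k * n) using the sum-of-squares identity."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Hypothetical parameters: 3 features, hidden vectors of dimension 2.
x = [1.0, 2.0, 0.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [2.0, 1.0]]

slow = fm_predict(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(slow, fast)
```

The rearranged form matters in practice because clinical feature vectors are often sparse, and the per-factor sums only need to visit the non-zero features.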
In another preferred embodiment of the present invention, the step 18 may include:
Step 181, inputting the feature interaction result into a fully connected network for high-level feature abstraction and combination according to the feature interaction result to obtain fully connected network output data, wherein the calculation formula of the fully connected network output data is as follows:
;
;
;
Wherein the formulas use three weight matrices and three bias vectors; the input is the feature interaction result; and the formulas define the output data of the first hidden layer, the output data of the second hidden layer, and the output data of the fully connected network;
Step 182, passing the feature interaction result as input to a gradient boosting decision tree (GBDT) model, which specifically includes training a series of decision trees for the GBDT model, wherein, after training, the output data of the GBDT model is the weighted sum of the prediction results of all the decision trees, calculated as follows:
$F(x) = \sum_{m=1}^{M} \eta \, f_m(x)$;
Wherein $\eta$ is the learning rate, $M$ is the total number of decision trees, $f_m$ is the prediction function of the $m$-th decision tree, $x$ is the input feature interaction result, and $F(x)$ is the final output data of the GBDT model;
Step 183, integrating the output of the fully connected network and the output of the gradient boosting decision tree model to obtain integrated data, and obtaining a final disease risk processing result according to the integrated data, wherein the calculation formula of the integrated data is as follows:
;
Wherein the formula uses a weight parameter and yields the final integrated data.
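Steps 181 to 183 can be sketched together as follows. Several points are assumptions because the formulas are not legible in this text: the hidden layers use ReLU, the output layer is linear, the tree ensemble output is the learning rate times the sum of the tree predictions, and the integration is a convex combination controlled by a weight `alpha`. The weights, the stand-in "trees" (simple threshold functions) and all values are hypothetical.

```python
def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vec]


def dense(W, b, x):
    """Affine layer: W x + b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def mlp_forward(x, layers):
    """Forward pass: ReLU after each hidden layer, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = relu(dense(W, b, h))
    W, b = layers[-1]
    return dense(W, b, h)


def gbdt_predict(x, trees, learning_rate):
    """Learning-rate-weighted sum of the individual tree predictions."""
    return sum(learning_rate * tree(x) for tree in trees)


def integrate(y_fc, y_gbdt, alpha):
    """Assumed convex combination of the two model outputs."""
    return alpha * y_fc + (1 - alpha) * y_gbdt


# Hypothetical feature interaction result used as input to both models.
z = [1.0, -1.0]

# Two hidden layers and one output layer with hypothetical weights.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
    ([[2.0]], [0.0]),
]
# Stand-ins for trained trees: simple threshold rules.
trees = [
    lambda x: 1.0 if x[0] > 0 else -1.0,
    lambda x: 2.0 if x[1] > 0 else 0.0,
]

y_fc = mlp_forward(z, layers)[0]
y_gbdt = gbdt_predict(z, trees, learning_rate=0.5)
y_final = integrate(y_fc, y_gbdt, alpha=0.5)
print(y_fc, y_gbdt, y_final)
```

Keeping the integration weight between 0 and 1 makes the final value an interpolation between the two models, which is one simple way to keep the combined output on the same scale as its inputs.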
As shown in fig. 2, an embodiment of the present invention further provides a processing system 20 for cardiovascular case data, including:
An acquisition module 21, configured to: preprocess the hypertension electronic medical record data set to obtain preprocessed data; construct a random forest model using the preprocessed data, and train the random forest model to obtain a trained random forest model; calculate an importance score for each feature through the trained random forest model; and sort the features according to the importance score of each feature to obtain a feature sorting table;
A processing module 22, configured to: determine key factors affecting cardiovascular disease risk according to the feature sorting table and the dynamic factors; determine a hidden state sequence through the LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assign a weight to each hidden state according to the hidden state sequence; perform second-order feature interaction on the original features of the individual by using a factorization machine to obtain an interaction result; fuse the interaction result and the hidden state sequence to obtain a feature interaction result; input the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; train a series of decision trees for the gradient boosting decision tree model, wherein, after training is completed, the output data of the gradient boosting decision tree model is the weighted sum of the prediction results of all the decision trees; and integrate the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtain a final disease risk processing result according to the integrated data.
Optionally, preprocessing the hypertension electronic medical record data set to obtain preprocessed data, including:
Randomly selecting k initial cluster centers and iterating until the clustering result is stable, wherein the update formula of the cluster centers is:
;
Wherein the formula uses the following quantities: a cluster label; the i-th data point; the total number of data points in the data set; an index; the squared Euclidean distance from a data point to a cluster center; and a constant;
for the data in each cluster, calculating a covariance matrix C between features, wherein the calculation uses the weight of the i-th data point, the weight of the j-th data point, the j-th data point, and the matrix transpose, j being an index;
Performing eigendecomposition on the covariance matrix C to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix C;
Sorting the eigenvectors according to the magnitudes of the eigenvalues;
Obtaining the preprocessed data by multiplying the raw data matrix by the selected principal component matrix.
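The preprocessing above combines clustering with a principal component projection. The sketch below is a simplified stand-in under stated assumptions: it uses plain k-means with unweighted mean updates rather than the weighted update formula of the embodiment (which is not legible in this text), starts from fixed initial centers instead of random ones so that the run is reproducible, uses an unweighted covariance matrix, and extracts only the leading eigenvector by power iteration. All data values and function names are hypothetical.

```python
import math


def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations from the given initial centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else list(c)
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # assignments are stable
            break
        centers = new_centers
    return centers, clusters


def covariance(rows):
    """Unweighted sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered


def top_component(cov, steps=200):
    """Leading eigenvector by power iteration.

    The start vector is the first axis; the iteration fails if that axis
    happens to be orthogonal to the leading eigenvector.
    """
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(steps):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(value * value for value in w))
        v = [value / norm for value in w]
    return v


# Hypothetical two-feature records forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centers, clusters = kmeans(points, [[0.0, 0.0], [6.0, 6.0]])

cov, centered = covariance(clusters[0])
component = top_component(cov)
# Project the centered records of the first cluster onto the leading component.
projected = [sum(c_j * v_j for c_j, v_j in zip(row, component)) for row in centered]
print(centers, component, projected)
```

Keeping more than one component would require a full eigendecomposition (or repeated power iteration with deflation); the single-component case is shown only to keep the sketch short.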
Optionally, constructing a random forest model using the preprocessed data, and training the random forest model to obtain a trained random forest model, includes:
determining the number of decision trees, the maximum depth of each decision tree, and the number of samples required at the leaf nodes;
drawing samples from the original training data set by the bootstrap sampling method to generate a new training subset, and constructing a decision tree on the new training subset;
at the nodes of each decision tree, evaluating the importance of each feature and determining the feature finally used for node splitting;
continuing to grow the decision tree until a preset stopping condition is reached, wherein, during growth, each node is split according to the selected feature and child nodes are generated recursively; repeating this operation, constructing each decision tree on a different bootstrap subset, until the specified number of decision trees has been generated;
integrating all the constructed decision trees together to form a random forest model;
training the random forest model to obtain a trained random forest model.
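The bootstrap sampling and integration steps listed above can be sketched as follows. The sketch only shows how the training subsets are drawn and how the predictions of finished trees might be combined by majority vote; growing the trees themselves is omitted, and majority voting is an assumption about how the trees are integrated. The stand-in "trees", the seed and the sizes are hypothetical.

```python
import random
from collections import Counter


def bootstrap_subsets(n_samples, n_trees, seed=0):
    """Draw one bootstrap sample (with replacement) per tree.

    Returns, for each tree, the in-bag indices and the out-of-bag indices
    that were never drawn for that tree.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_trees):
        in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
        out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
        subsets.append((in_bag, out_of_bag))
    return subsets


def forest_predict(trees, x):
    """Combine the class predictions of all trees by majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


subsets = bootstrap_subsets(n_samples=10, n_trees=5)

# Stand-ins for trained trees: each maps a record to a risk class (0 or 1).
trees = [
    lambda record: 1 if record["systolic_bp"] > 140 else 0,
    lambda record: 1 if record["age"] > 60 else 0,
    lambda record: 1 if record["ldl"] > 4.0 else 0,
]
prediction = forest_predict(trees, {"systolic_bp": 150, "age": 55, "ldl": 4.5})
print(prediction)
```

The out-of-bag indices are returned because they give each tree a set of records it never saw during construction, which is commonly used to estimate generalization error without a separate validation set.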
Optionally, calculating the importance score of each feature through the trained random forest model includes:
acquiring the feature importance attribute of the trained random forest model;
a feature importance score is calculated based on the feature importance attributes.
Optionally, sorting the features according to the importance score of each feature to obtain a feature sorting table, including:
According to the importance scores of the features, performing a preliminary ranking of the features with quicksort to obtain a preliminary ranking list; according to the preliminary ranking list, analyzing the correlation between the features and adjusting the preliminary ranking list according to the correlation to obtain an optimized ranking result, including: for each pair of features (F_h, F_m) in the preliminary ranking list, where F_h and F_m denote the h-th and m-th features respectively, calculating the correlation coefficient between F_h and F_m according to the following formula:
;
Wherein the formula uses the following quantities: the weight of each sample; the standard deviations of the h-th and m-th features, respectively; the frequency with which each sample occurs in the data set; the value of each sample on the h-th feature; and the number of samples; constructing a correlation value from the correlation coefficient by means of an exponentially weighted moving average (EWMA), which has a smoothing factor and is indexed by time; constructing a correlation matrix R from the correlation values; traversing the features in the preliminary ranking list and, for each feature F_h, finding the set H_h of features that are highly correlated with it; according to a greedy strategy, for each feature F_h, exchanging the position of F_h with the positions of features in H_h to obtain an optimized ranking result;
and generating a feature sorting table according to the optimized ranking result.
Optionally, determining the key factors affecting the cardiovascular disease risk according to the feature ranking table and the dynamic factors includes:
according to the importance scores of the features and the numerical values of the dynamic factors, calculating the comprehensive score of each feature, wherein the calculation formula of the comprehensive score is as follows:
;
Wherein the formula uses the following quantities: the composite score of each feature; the number of features; the weight of each static feature; the normalized value of each feature on each sample; the dynamic factor; the weight at each time point; the observed value of each feature at each time point; the time window size of the moving average; the time; and an index;
and determining key factors affecting cardiovascular disease risk according to the magnitude of the composite score.
Optionally, determining a key factor affecting the risk of cardiovascular disease according to the magnitude of the composite score includes:
sequencing all the features according to the comprehensive score;
and selecting the features with the comprehensive score not less than the threshold value from the sorted feature list according to the preset threshold value, and determining the features with the comprehensive score not less than the threshold value as key factors influencing the cardiovascular disease risk.
It should be noted that the system corresponds to the above method, and all implementations in the above method embodiments are applicable to this system embodiment and achieve the same technical effects.
Embodiments of the present invention also provide a computing device comprising: a processor, a memory storing a computer program which, when executed by the processor, performs the method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
Embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform a method as described above. All the implementation manners in the method embodiment are applicable to the embodiment, and the same technical effect can be achieved.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
Furthermore, it should be noted that in the apparatus and method of the present invention, the components or steps may obviously be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent aspects of the present invention. Also, the steps of the series of processes described above may naturally be performed in chronological order in the order of description, but need not be; some steps may be performed in parallel or independently of each other. It will be appreciated by those of ordinary skill in the art that all or any of the steps or components of the methods and apparatus of the present invention may be implemented in hardware, firmware, software, or any combination thereof in any computing device (including processors, storage media, etc.) or network of computing devices, which those of ordinary skill in the art can do with their basic programming skills after reading the present specification.
The object of the invention can thus also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. The object of the invention can likewise be achieved merely by providing a program product containing program code for implementing said method or apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. The storage medium may be any known storage medium or any storage medium developed in the future.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.
Claims (2)
1. A method for processing cardiovascular case data, comprising:
preprocessing the hypertension electronic medical record data set to obtain preprocessed data;
constructing a random forest model by using the preprocessed data, and training the random forest model to obtain a trained random forest model;
calculating an importance score for each feature through the trained random forest model;
sorting the features according to the importance scores of each feature to obtain a feature sorting table;
determining key factors influencing cardiovascular disease risk according to the feature ordering table and the dynamic factors;
determining a hidden state sequence through the LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assigning a weight to each hidden state according to the hidden state sequence; the calculation formula of the weight is as follows:
;
wherein the formula involves: a rectified linear unit (ReLU) activation function; the parameters v, W and U and one further parameter; an additional feature; the hyperbolic tangent function; the length of the sequence; the hidden state at a given time; and the attention scores at two respective times; the prediction result of the LSTM model is integrated with the prediction result of an auxiliary model to improve overall prediction performance, the auxiliary model being a model trained on a different feature subset, and the ensemble learning method reduces the bias and variance of a single model by combining the predictions of multiple models, thereby improving the accuracy and robustness of the prediction;
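The weighting step can be illustrated with a short sketch. The weight formula itself is not reproduced in the text, so the code below assumes a plain softmax over per-time-step attention scores, which may differ from the claimed formula; the function names `attention_weights` and `weighted_context` are hypothetical.

```python
import math


def attention_weights(scores):
    """Softmax-normalize raw attention scores into weights that sum to 1.

    Assumption: plain softmax; the claimed formula may apply further
    transformations (for example a ReLU) that are not shown here.
    """
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_context(hidden_states, weights):
    """Weighted sum of hidden-state vectors, one weight per time step."""
    dim = len(hidden_states[0])
    return [
        sum(w * h[d] for h, w in zip(hidden_states, weights))
        for d in range(dim)
    ]
```

A higher attention score yields a larger weight, so hidden states judged more relevant to disease risk contribute more to the summary vector.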
performing second-order feature interaction on the original features of the individuals using a factorization machine to obtain an interaction result; fusing the interaction result with the hidden state sequence to obtain a feature interaction result;
inputting the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; passing the feature interaction result as input to a gradient boosting decision tree model, which specifically comprises training a series of decision trees for the gradient boosting decision tree model, wherein after training is completed, the output data of the gradient boosting decision tree model is a weighted sum of the prediction results of all the decision trees; integrating the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtaining a final disease risk processing result from the integrated data;
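As an illustration of the second-order feature interaction, the sketch below computes the standard pairwise term of a factorization machine. It is a minimal example under stated assumptions, not the claimed implementation: the latent factor matrix `v` is supplied by hand rather than learned, and the function names are hypothetical.

```python
def fm_second_order(x, v):
    """Second-order FM term: sum over i < j of <v[i], v[j]> * x[i] * x[j].

    x is a list of n feature values; v is an n-by-k list of latent factors
    (assumed given here; in practice they are learned).
    Uses the well-known O(n * k) identity instead of the double loop.
    """
    n, k = len(x), len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        total += 0.5 * (s * s - s_sq)
    return total


def fm_second_order_naive(x, v):
    """Reference double loop over all feature pairs, for checking."""
    n, k = len(x), len(v[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(v[i][f] * v[j][f] for f in range(k))
            total += dot * x[i] * x[j]
    return total
```

Both functions return the same value; the first scales linearly in the number of features, which matters when the feature vector is wide.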
wherein preprocessing the hypertension electronic medical record data set to obtain preprocessed data comprises:
randomly selecting k initial cluster centers and iterating until the clustering result is stable, wherein the update formula for each cluster center is:
;
wherein the formula involves: the cluster label; the i-th data point; the total number of data points in the data set; an index; the squared Euclidean distance from a data point to a cluster center; and a constant;
for the data in each cluster, calculating a covariance matrix C between features, wherein the calculation involves: the weight of the i-th data point; the matrix transpose; the weight of the j-th data point; the j-th data point; and the index j;
performing eigendecomposition on the covariance matrix C to obtain the eigenvalues and corresponding eigenvectors of the covariance matrix C;
sorting the eigenvectors by the magnitude of their eigenvalues;
multiplying the original data matrix by the selected principal component matrix to obtain the preprocessed data;
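The covariance, eigendecomposition and projection steps above can be sketched as follows. This is an illustration under assumptions: it uses an unweighted sample covariance (the claim applies per-point weights whose formula is not reproduced in the text), centers the data before projecting, and finds only the leading eigenvector by power iteration rather than a full decomposition. All function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centered rows).

    Assumption: every data point has equal weight.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return cov, centered


def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the eigenvector with the largest eigenvalue.

    Starts from a uniform vector, so it can fail if that vector happens
    to be orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in nxt))
        if norm == 0.0:
            break
        vec = [c / norm for c in nxt]
    return vec


def project(centered, component):
    """Project each centered row onto a single principal component."""
    return [sum(c[j] * component[j] for j in range(len(component))) for c in centered]
```

Keeping more than one component would require deflation or a full eigendecomposition; the single-component case is enough to show how the projection reduces dimensionality.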
wherein constructing a random forest model by using the preprocessed data, and training the random forest model to obtain a trained random forest model, comprises:
determining the number of decision trees, the maximum depth of each decision tree, and the number of samples required for a leaf node;
drawing samples from the original training data set by bootstrap sampling to generate a new training subset, and constructing a decision tree on the new training subset;
at each node of each decision tree, evaluating the importance of each feature and determining the final feature used to split the node;
continuing to grow the decision tree until a preset stopping condition is reached, wherein during growth each node is split according to the selected feature and recursively generates child nodes;
repeating the above operations, constructing a decision tree from a different bootstrap subset each time, until the specified number of decision trees has been generated;
integrating all the constructed decision trees to form the random forest model;
training the random forest model to obtain the trained random forest model;
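The bootstrap sampling and integration steps above can be sketched as follows. This is a minimal illustration that leaves out tree construction entirely; the function names, the `seed` parameter and the use of majority voting to combine tree outputs are assumptions, not details taken from the claim.

```python
import random
from collections import Counter


def bootstrap_subsets(n_samples, n_trees, seed=0):
    """Draw one bootstrap index list (sampling with replacement) per tree.

    Also returns the out-of-bag indices, i.e. samples a tree never saw.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_trees):
        drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
        out_of_bag = sorted(set(range(n_samples)) - set(drawn))
        subsets.append((drawn, out_of_bag))
    return subsets


def majority_vote(predictions):
    """Combine per-tree class predictions by taking the most common label.

    Assumption: classification by voting; regression forests average instead.
    """
    return Counter(predictions).most_common(1)[0][0]
```

Because each tree sees a different resampled subset, the trees disagree on some inputs, and combining them reduces the variance of any single tree.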
wherein calculating the importance score of each feature through the trained random forest model comprises:
acquiring the feature importance attribute of the trained random forest model;
calculating a feature importance score according to the feature importance attribute;
wherein sorting the features according to the importance score of each feature to obtain a feature sorting table comprises:
performing a preliminary ranking of the features by quicksort according to their importance scores to obtain a preliminary ranking list;
analyzing the correlation between the features according to the preliminary ranking list, and adjusting the preliminary ranking list according to the correlation to obtain an optimized ranking result, comprising: for each pair of features (F_h, F_m) in the preliminary ranking list, where F_h and F_m denote the h-th and m-th features respectively, calculating the correlation coefficient between F_h and F_m according to the following formula:
;
wherein the formula involves: the weight of each sample; the standard deviations of the h-th and m-th features; the frequency with which a sample occurs in the data set; the value of a sample on the h-th feature; and the number of samples; constructing a correlation value from the correlation coefficient, wherein the construction involves a smoothing factor of the exponentially weighted moving average (EWMA) and the time; constructing a correlation matrix R from the correlation values; traversing the features in the preliminary ranking list and, for each feature F_h, finding the feature set H_h of features highly correlated with it; according to a greedy strategy, for each feature F_h, swapping the positions of F_h and the features in the feature set H_h to obtain the optimized ranking result;
generating the feature sorting table according to the optimized ranking result;
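The ranking steps above rely on three building blocks that can be sketched briefly: ordering features by importance, a correlation coefficient between two features, and EWMA smoothing. The sketch makes several assumptions: it uses the ordinary unweighted Pearson coefficient (the claim weights samples by frequency, with a formula not reproduced in the text), the common EWMA recurrence, and Python's built-in sort in place of quicksort. The greedy position swap is omitted because its exact rule is not stated. All function names are hypothetical.

```python
import math


def rank_features(scores):
    """Feature names ordered by importance score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


def pearson(xs, ys):
    """Unweighted Pearson correlation; returns 0.0 for a constant series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def ewma(values, alpha):
    """Exponentially weighted moving average; alpha is the smoothing factor.

    Assumption: s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with x_0.
    """
    smoothed = values[0]
    for value in values[1:]:
        smoothed = alpha * value + (1.0 - alpha) * smoothed
    return smoothed
```

Smoothing a sequence of correlation coefficients over time with `ewma` gives a single correlation value per feature pair, which could then populate a matrix such as R.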
wherein determining key factors influencing cardiovascular disease risk according to the feature sorting table and the dynamic factors comprises:
calculating a composite score for each feature according to the importance scores of the features and the values of the dynamic factors, wherein the calculation formula of the composite score is as follows:
;
wherein the formula involves: the composite score of each feature; the number of features; the weight of each static feature; the normalized value of a feature on a sample; the dynamic factor; the weight at each time point; the observed value of a feature at a time point; the time window size of the moving average; the time; an index; a smoothing factor between 0 and 1; and a proportionality constant;
determining the key factors influencing cardiovascular disease risk according to the composite score;
wherein determining the key factors influencing cardiovascular disease risk according to the magnitude of the composite score comprises:
sorting all the features by composite score;
selecting, from the sorted feature list and according to a preset threshold, the features whose composite score is not less than the threshold, and determining those features to be the key factors influencing cardiovascular disease risk.
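The final selection step of claim 1 (sort by composite score, keep features at or above a preset threshold) can be sketched in a few lines. The function name and the feature names in the usage example are hypothetical, and the composite scores are taken as given since the scoring formula is not reproduced in the text.

```python
def select_key_factors(composite_scores, threshold):
    """Return feature names whose composite score is >= threshold.

    The result is ordered from highest to lowest composite score.
    """
    ranked = sorted(
        composite_scores.items(), key=lambda item: item[1], reverse=True
    )
    return [name for name, score in ranked if score >= threshold]


# Hypothetical feature names and scores, for illustration only.
example_scores = {"systolic_bp": 0.9, "age": 0.6, "height": 0.1}
key_factors = select_key_factors(example_scores, 0.6)
```

Using "not less than" means a feature whose score equals the threshold exactly is kept, which matches the wording of the claim.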
2. A system for processing cardiovascular case data, for use in the method of claim 1, comprising:
an acquisition module, configured to preprocess the hypertension electronic medical record data set to obtain preprocessed data; construct a random forest model by using the preprocessed data, and train the random forest model to obtain a trained random forest model; calculate an importance score for each feature through the trained random forest model; and sort the features according to the importance score of each feature to obtain a feature sorting table;
a processing module, configured to determine key factors influencing cardiovascular disease risk according to the feature sorting table and the dynamic factors; determine a hidden state sequence through the LSTM network according to the key factors, wherein the hidden state sequence contains information related to disease risk; assign a weight to each hidden state according to the hidden state sequence; perform second-order feature interaction on the original features of the individuals using a factorization machine to obtain an interaction result; fuse the interaction result with the hidden state sequence to obtain a feature interaction result; input the feature interaction result into a fully connected network for feature abstraction and combination to obtain fully connected network output data; train a series of decision trees for the gradient boosting decision tree model, wherein after training is completed, the output data of the gradient boosting decision tree model is the weighted sum of the prediction results of all the decision trees; and integrate the output data of the fully connected network and of the gradient boosting decision tree model to obtain integrated data, and obtain a final disease risk processing result from the integrated data.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410925105.1A CN118471540B (en) | 2024-07-11 | 2024-07-11 | A method and system for processing cardiovascular case data |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118471540A (en) | 2024-08-09 |
| CN118471540B (en) | 2024-10-15 |
Family
ID=92170885
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410925105.1A Active CN118471540B (en) | 2024-07-11 | 2024-07-11 | A method and system for processing cardiovascular case data |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118471540B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119400440A (en) * | 2024-10-28 | 2025-02-07 | 江苏法迈生医学科技有限公司 | Clinical trial information management method and system based on big data |
| CN119905261B (en) * | 2025-02-06 | 2025-08-29 | 华南农业大学 | A hypertension prediction method based on random forest and LSTM neural network |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117253614A (en) * | 2023-11-14 | 2023-12-19 | 天津医科大学朱宪彝纪念医院(天津医科大学代谢病医院、天津代谢病防治中心) | Diabetes risk early warning method based on big data analysis |
| CN117457217A (en) * | 2023-12-22 | 2024-01-26 | 天津医科大学朱宪彝纪念医院(天津医科大学代谢病医院、天津代谢病防治中心) | Risk assessment method and system for diabetic nephropathy |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106874663A (en) * | 2017-01-26 | 2017-06-20 | 中电科软件信息服务有限公司 | Cardiovascular and cerebrovascular disease Risk Forecast Method and system |
| AU2020100709A4 (en) * | 2020-05-05 | 2020-06-11 | Bao, Yuhang Mr | A method of prediction model based on random forest algorithm |
| CN116864139A (en) * | 2023-06-30 | 2023-10-10 | 平安科技(深圳)有限公司 | Disease risk assessment method, device, computer equipment and readable storage medium |
| CN117593142A (en) * | 2024-01-18 | 2024-02-23 | 辰风策划(深圳)有限公司 | Financial risk assessment management method and system |
| CN117850601B (en) * | 2024-03-08 | 2024-05-14 | 南昌大学第二附属医院 | System and method for automatically detecting vital signs for handheld PDA |
- 2024-07-11: CN application CN202410925105.1A, patent CN118471540B (en), status Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN118471540A (en) | 2024-08-09 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN118471540B (en) | A method and system for processing cardiovascular case data | |
| CN111967495B (en) | Classification recognition model construction method | |
| US20250104813A1 (en) | Genome-wide prediction method based on deep learning by using genome-wide data and bioinformatics features | |
| CN110111885B (en) | Attribute prediction method, attribute prediction device, computer equipment and computer readable storage medium | |
| Bihis et al. | A generalized flow for multi-class and binary classification tasks: An Azure ML approach | |
| CN116805533A (en) | Cerebral hemorrhage operation risk prediction system based on data collection and simulation | |
| CN116226629B (en) | Multi-model feature selection method and system based on feature contribution | |
| CN117476183B (en) | Construction system of autism children rehabilitation effect AI evaluation model | |
| CN116759093A (en) | Individual health status quantitative evaluation system based on intestinal microorganism composition | |
| CN118230952A (en) | Psychological assessment method and system based on BPRS (Business Process reference System) concise psychosis table | |
| CN119207562B (en) | Index sequence generation method and device, electronic equipment and storage medium | |
| CN119418855B (en) | A generative artificial intelligence-assisted method and system for low back pain rehabilitation | |
| CN118312816B (en) | Cluster weighted clustering integrated medical text processing method and system based on member selection | |
| CN119397349A (en) | A multi-stage classification method based on CIP pneumonia multimodal data | |
| CN117312958B (en) | A method for predicting oxygenation index based on continuous non-invasive parameters | |
| Saroja et al. | Data‐Driven Decision Making in IoT Healthcare Systems—COVID‐19: A Case Study | |
| AU2021102593A4 (en) | A Method for Detection of a Disease | |
| Prasanth et al. | Prognostication of diabetes diagnosis based on different machine learning classification algorithms | |
| EP4609406A1 (en) | Cancer progression assessment method and system thereof | |
| Hakim | Performance Evaluation of Machine Learning Techniques for Early Prediction of Brain Strokes | |
| CN119380975B (en) | Knowledge-graph-based personalized treatment decision system for depression | |
| CN119128077B (en) | A matching method and system for semantic understanding and question answering | |
| Qian | Predicting Parkinson's Disease Progression with Random Forests | |
| CN117877737B (en) | Method, system and device for constructing primary lung cancer risk prediction model | |
| CN118820815B (en) | Indirect estimation method and system for reference intervals of children's test indicators |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |