A tumor hypoxia level assessment model and its application and detection kit

Info

Publication number
CN120119001B
Authority
CN
China
Prior art keywords
hypoxia
score
breast cancer
gene
tumor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510610465.7A
Other languages
Chinese (zh)
Other versions
CN120119001A (en
Inventor
吕海泉
赵智群
季凯
王正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202510610465.7A
Publication of CN120119001A
Application granted
Publication of CN120119001B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/118 Prognosis of disease development
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/136 Screening for pharmacological compounds
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/158 Expression markers

Abstract

The application relates to a tumor hypoxia level assessment model, its application, and a detection kit, and belongs to the technical field of biomedicine. In this scheme, a reliable breast cancer hypoxia gene set is established from large-scale transcriptome data; the hypoxia characteristics of clinical samples are characterized with this gene set and a hypoxia score is calculated; the relationship between the hypoxia score of clinical samples and clinical prognosis is explored; a prediction model with clinical application value is established; and drug sensitivity is predicted from the hypoxia score of clinical samples. The scheme has important application prospects in fields such as tumor treatment and drug screening.

Description

Tumor hypoxia level assessment model and application and detection kit thereof
Technical Field
The application relates to a tumor hypoxia level assessment model, application and a detection kit thereof, and belongs to the technical field of biomedicine.
Background
Hypoxia is a typical feature of many solid malignant tumors: abnormal vascularization at the tumor periphery leaves the interior of the tumor starved of oxygen and nutrients. The oxygen level of tumor tissue can be defined by measuring the partial pressure of oxygen. In general, oxygen levels below 10 mmHg (1.3 kPa) in tumor tissue are defined as a hypoxic environment. In addition, tumor tissue typically contains normoxic regions (near functional blood vessels), hypoxic regions (100 μm from functional blood vessels) and necrotic regions (150 μm from functional blood vessels), and these regions help define the oxygen level of the tumor tissue. Hypoxia affects many aspects of tumor cells and the tumor microenvironment, and the heterogeneity of hypoxia within a tumor may cause the tumor cells of different patients to show different degrees of sensitivity to different treatment modalities or drugs. Understanding this heterogeneity can help formulate more personalized treatment strategies and target treatment to the hypoxic tumor microenvironment, thereby improving the effect of tumor treatment.
Determining the hypoxia status of a patient's tumor tissue plays an important role in clinical diagnosis and medication guidance, so a technical method that accurately determines a patient's hypoxia status is needed. The detection methods developed so far include:
1) Oxygen electrode probe (Eppendorf): electrodes are placed directly on the tumor surface, and the track length is chosen according to the tumor size determined clinically and by MRI scanning. Data are presented as the hypoxic fraction, defined as the percentage of pO2 readings below 5 mmHg, and as the median pO2.
2) Immunohistochemistry: tissue samples are stained with specific antibodies, such as antibodies against HIF-1α or CA9, to identify the presence of hypoxia-associated proteins.
3) Gene expression profiling: analysis of gene expression patterns in tumor samples can provide information on gene activity associated with hypoxia.
4) PET imaging: a positron emission tomography (PET) radiotracer consists of a radioisotope and a hypoxia-reactive molecule specific to the hypoxic microenvironment. The most widely studied PET strategy relies on the trapping of 2-nitroimidazole probes in hypoxic cells through their bioreductive metabolism. Another strategy relies on radiolabeled antibodies against carbonic anhydrase 9 (CA9); CA9 may be considered a specific HIF-1 reporter, so this approach can also monitor hypoxia and is valuable for selecting patients for CA9-targeted therapy. In addition, 18F-labeled fluorodeoxyglucose probes and similar tracers are useful for non-invasive PET imaging.
Studies using invasive methods such as oxygen electrode probes first established that a hypoxic tumor environment is associated with poor patient prognosis, but broader clinical application to patient stratification still requires less invasive methods such as PET imaging, endogenous markers and endogenous probes. Minimally invasive plasma-based molecular marker diagnosis has great potential for identifying patient hypoxia; however, because tumor hypoxia is highly complex and heterogeneous, a stable, accurate and noninvasive method for characterizing patient hypoxia is still needed to aid cancer diagnosis, prognosis and medication guidance.
Disclosure of Invention
To solve the above problems, the application provides a tumor hypoxia level assessment model, its application, and a detection kit. The breast cancer hypoxia gene set established with a random forest can represent the hypoxia characteristics of clinical samples; the prediction model built on the hypoxia score predicts the prognosis of breast cancer patients well; and the sensitivity of patients to 139 FDA-approved drugs is predicted from the hypoxia characteristics of clinical samples. The scheme therefore has important application value and broad application potential.
The application provides a biomarker for assessing tumor hypoxia level, which comprises the following 44 genes:
The application provides a kit for evaluating tumor hypoxia level, which comprises a reagent for detecting gene expression quantity of the biomarker.
Optionally, the reagent is used to detect the mRNA expression level of the biomarker, and the mRNA expression level represents the gene expression level of the biomarker.
Optionally, the detection sample of the kit is a breast cancer tumor sample.
Optionally, the breast cancer tumor sample is an intraoperative puncture sample, an intraoperative fresh sample or a liquid nitrogen frozen sample.
The application provides an evaluation model for evaluating the tumor hypoxia level, wherein the evaluation model takes the gene expression level of the biomarker as an input variable, and evaluates the hypoxia level of the tumor by calculating a hypoxia score;
the hypoxia score is calculated by the following formula: [formula not reproduced in the source text];
wherein gene i is a biomarker as described above;
wherein the two remaining symbols in the formula are constants, calculated as follows:
[formulas not reproduced in the source text];
wherein J is the set of 1118 breast cancer samples in the TCGA-BRCA database.
Optionally, the expression level of gene i is the mRNA expression level of gene i.
Optionally, a sample is assigned to the high-hypoxia group when the hypoxia score is greater than or equal to 0.5, to the low-hypoxia group when the hypoxia score is less than or equal to -0.5, and to the medium-hypoxia group when the hypoxia score is greater than -0.5 and less than 0.5.
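The grouping rule above can be sketched as a small function. This is a minimal illustration, assuming the medium group covers every score strictly between -0.5 and 0.5; the function name and the sample scores are hypothetical and are not taken from the application.

```python
def hypoxia_group(score: float) -> str:
    """Map a hypoxia score to the group named in the text.

    Assumption: scores strictly between -0.5 and 0.5 form the medium group.
    """
    if score >= 0.5:
        return "high"
    if score <= -0.5:
        return "low"
    return "medium"


# Hypothetical scores for three samples, for illustration only.
scores = {"sample_A": 0.82, "sample_B": -0.5, "sample_C": 0.1}
groups = {name: hypoxia_group(value) for name, value in scores.items()}
print(groups)
```

Both boundary values are inclusive on the outer side, so a score of exactly 0.5 is treated as high and a score of exactly -0.5 as low.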
The present application provides a system for assessing tumor hypoxia level, comprising:
The data acquisition module is used for acquiring the gene expression level of the biomarker;
The evaluation module is used for evaluating the tumor hypoxia level according to the gene expression level obtained by the data acquisition module and outputting a hypoxia score, and comprises an evaluation model for evaluating the tumor hypoxia level, wherein the evaluation model calculates the hypoxia score by the following formula: [formula not reproduced in the source text];
wherein gene i is a biomarker as described above, and the two remaining symbols in the formula are constants, calculated as follows:
[formulas not reproduced in the source text];
wherein J is the set of 1118 breast cancer samples in the TCGA-BRCA database;
and the output module outputs a prediction result according to the hypoxia score.
The present application provides a computer readable storage medium storing a computer program which, when executed by a processor, performs the functions of the system described above.
The application provides the use of the above-described assessment model or the above-described system or the above-described computer-readable storage medium in any of the following:
1) Use in predicting the clinical prognosis of a tumor patient;
2) Use in screening hypoxia-sensitive drugs;
3) Use in screening hypoxia-resistant drugs.
The beneficial effects of the application include, but are not limited to:
1. In the tumor hypoxia level assessment model, its application and detection kit, k-means unsupervised learning is used to identify hypoxic and normoxic populations, a classifier is constructed with a random forest algorithm (training set: specificity 0.944, sensitivity 0.833; validation set: specificity and sensitivity both 1), and a breast cancer hypoxia gene set is obtained by screening on feature importance.
2. The breast cancer hypoxia gene set is used to perform GSVA on TCGA, METABRIC and GSE20685 breast cancer clinical samples to obtain the hypoxia score of each sample, and a hierarchical clustering algorithm is then used to divide the samples into high, medium and low hypoxia score groups, so that the relationship between the hypoxia score and clinical prognosis can be explored. Kaplan-Meier survival analysis shows that breast cancer patients in the high hypoxia score group have a worse prognosis than those in the low hypoxia score group, and univariate and multivariate Cox proportional hazards regression shows that the hypoxia score, as a continuous variable, is a risk factor for death.
3. Drugs to which hypoxic tumors are sensitive or resistant are screened by calculating the correlation between the hypoxia scores of breast cancer clinical samples in the TCGA database and the responses to 139 FDA-approved drugs, and the results are verified at the cellular level with a CCK-8 assay.
4. By calculating the correlation between the hypoxia score and the drug response of the TCGA-BRCA clinical samples, hypoxia-sensitive drugs for breast cancer, such as AZD6482 and NSC.87877, and hypoxia-resistant drugs, such as AZD8055 and Vorinostat, are obtained through screening.
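The correlation screen in items 3 and 4 can be illustrated with a short Spearman calculation. This is a minimal sketch, assuming the hypoxia score and the estimated drug response are available as paired values per sample. The sign convention (a negative coefficient read as hypoxia-sensitive, a positive one as hypoxia-resistant) follows the description of FIG. 11; the function names and the toy values are illustrative and are not data from the application.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def label_drug(rho):
    """Apply the sign convention from the description of FIG. 11."""
    if rho < 0:
        return "hypoxia-sensitive"
    if rho > 0:
        return "hypoxia-resistant"
    return "no association"


# Hypothetical paired values for five samples and two placeholder drugs.
hypoxia_scores = [-0.9, -0.3, 0.1, 0.6, 1.2]
estimated_response = {
    "drug_X": [4.1, 3.8, 3.0, 2.2, 1.9],  # response falls as the score rises
    "drug_Y": [1.0, 1.4, 1.3, 2.5, 3.1],  # response rises with the score
}
for drug, response in estimated_response.items():
    rho = spearman(hypoxia_scores, response)
    print(drug, round(rho, 3), label_drug(rho))
```

In practice a significance test and multiple-testing correction over the 139 drugs would accompany the coefficient; those steps are omitted here.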
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 shows the heterogeneity of the hypoxia response across breast cancer cell lines according to the application (a: heat map of the genes found by DESeq2 analysis to be differentially expressed between the normoxic (20% O2) and hypoxic (1% O2) groups of 31 cell lines; b: fold-change scatter plot of the top 20 differentially expressed genes across the 31 cell lines);
FIG. 2 shows the identification of normoxic and hypoxic populations by unsupervised learning according to the application (a: unsupervised k-means clustering on the top 20 RNAs from FIG. 1b identifies normoxic and hypoxic populations; b: the product-d calculated for the 31 cell lines; c: heat map of the expression of the 20 differential genes in the 8 cell lines with a positive product-d, of which one cell line, HCC1806, falls in cluster 1 (red label) and the remaining 7 cell lines, which show a weaker hypoxia response, fall in cluster 2 (blue label));
FIG. 3 shows the construction of the random forest model (a: flow chart for constructing the random forest model; b: error of the random forest model against the number of decision trees; c, d: confusion matrices of the random forest model, where c is the training set (75%) and d is the test set (25%));
FIG. 4 shows the construction of the breast cancer hypoxia gene set according to the present application (a: the top 30 features ranked by the random forest feature importance parameters MeanDecreaseAccuracy (left) and MeanDecreaseGini (right); b: Venn diagram showing the intersection of features with MeanDecreaseAccuracy > 1 and MeanDecreaseGini > 0.05; c: heat map showing the expression of the 48 genes from (b) in 23 cell lines);
FIG. 5 shows the construction of the breast cancer hypoxia gene set according to the present application (a-c: dimension reduction analysis using the 44 genes in the breast cancer hypoxia gene set, where a is a PCA plot, b is a UMAP plot and c is a t-SNE plot);
FIG. 6 shows the validation of the breast cancer hypoxia gene set according to the present application (a: GO enrichment analysis of the breast cancer hypoxia gene set; b: correlation between the GSVA enrichment scores of clinical samples in the TCGA-BRCA database calculated with the breast cancer hypoxia gene set and with 6 other hypoxia-related gene sets; c: box plot showing the hypoxia scores of the different PAM50 subtypes for clinical samples in the TCGA-BRCA database; d: proportions of the different PAM50 subtypes in the high, medium and low hypoxia score groups for clinical samples in the TCGA-BRCA database);
FIG. 7 shows the classification of clinical samples by hierarchical clustering according to the present application (a-c: hierarchical clustering on the breast cancer hypoxia gene set divides the clinical samples of the TCGA-BRCA (a), METABRIC (b) and GSE20685 (c) databases into three groups; d-f: according to the hierarchical clustering results, the groups of clinical samples showing high, medium and low expression of the breast cancer hypoxia gene set are defined as the high, medium and low hypoxia score groups, and the hypoxia scores of the three groups in TCGA-BRCA (d), METABRIC (e) and GSE20685 (f) correspond to the defined groups);
FIG. 8 shows the results of GSEA gene enrichment analysis of the high and low hypoxia score groups in the TCGA-BRCA database;
FIG. 9 shows the Kaplan-Meier survival curves of the high and low hypoxia score groups (in the TCGA-BRCA, METABRIC and GSE20685 databases, the Kaplan-Meier survival curves show that the high hypoxia score group has worse survival; a two-sided log-rank test p-value < 0.05 is considered statistically significant);
FIG. 10 shows the hazard ratios and 95% confidence intervals from the multivariate Cox analysis according to the present application (a uses clinical data from the TCGA-BRCA database; b uses clinical data from the breast cancer database METABRIC);
FIG. 11 shows the results of the hypoxia-related drug responsiveness analysis according to the present application (a-f: correlation scatter plots showing the Spearman correlation between the hypoxia score of breast cancer clinical samples and the estimated drug response, where a negative Spearman correlation coefficient represents drug sensitivity and a positive coefficient represents drug resistance; a is drug AZD6482, b is drug NSC.87877, c is drug AKT.inhibitor.VIII, d is drug Vorinostat, e is drug S.trityl.L.cysteine, and f is drug AZD8055);
FIG. 12 shows the results of the hypoxia-related drug responsiveness analysis according to the present application (a-d: breast cancer cell lines SUM159, Hs578T and MDA-MB-231 were treated with drug AZD6482 or AZD8055 under normoxic (purple, n=4) and hypoxic (green, n=4) conditions, and cell viability was normalized to the DMSO-treated group; a is SUM159 treated with drug AZD6482, b is Hs578T treated with drug AZD6482, c is MDA-MB-231 treated with drug AZD8055, d is Hs578T treated with drug AZD8055; error bars show mean ± s.d.).
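The viability normalization described for FIG. 12 can be sketched as follows. This is a minimal illustration, assuming each treated reading is divided by the mean of the DMSO control wells from the same oxygen condition and that the reported s.d. is the sample standard deviation; the absorbance readings below are hypothetical and are not measurements from the application.

```python
from statistics import mean, stdev


def normalize_to_control(treated, control):
    """Express each treated reading as a fraction of the mean control reading."""
    control_mean = mean(control)
    return [value / control_mean for value in treated]


# Hypothetical CCK-8 absorbance readings, n = 4 wells per condition.
dmso_wells = [1.00, 1.10, 0.90, 1.00]
drug_wells = [0.50, 0.60, 0.40, 0.50]

viability = normalize_to_control(drug_wells, dmso_wells)
print(f"relative viability: {mean(viability):.2f} ± {stdev(viability):.2f}")
```

Normalizing within each oxygen condition separates the effect of the drug from the effect hypoxia alone has on cell growth.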
Detailed Description
The present application will be described in detail with reference to examples, but the present application is not limited to these examples, and the raw materials and reagents in the examples of the present application are commercially available unless otherwise specified.
The hypoxic microenvironment is one of the important features of tumors. Tumor cells adapt to hypoxia, which contributes to tumorigenesis and is associated with resistance to chemotherapy, radiation therapy, drug therapy, immunotherapy and other treatments. Understanding the effects of hypoxia on molecular markers is therefore critical to improving the outcome of cancer treatment, but difficulties remain along the whole path from characterizing hypoxia in breast cancer patients to guiding clinical medication.
First, tumor samples in large-scale databases lack direct measurements of hypoxia status, i.e. oxygen levels, so a patient's tumor hypoxia level can only be assessed by indirect methods such as establishing hypoxia gene markers, which deviates from the actual condition inside the patient's tumor tissue. Second, hypoxia levels are heterogeneous both between breast cancer patients and within the tumor tissue of a single patient, so although each patient's hypoxia level can be assessed, the patient's tumor hypoxia profile cannot be fully characterized, and this heterogeneity can affect drug sensitivity. Furthermore, the impact of hypoxia on tumor cells is manifold, as are the responses of tumor cells to hypoxia; the presence of hypoxia is a necessary but not sufficient condition, as there are other critical determinants of susceptibility. Although much is known about molecular responses under hypoxic conditions, the identification of the most useful molecular targets in hypoxic cells is far from complete. It is therefore of considerable importance to further develop hypoxia diagnosis as a predictive biomarker.
On this basis, the GSE111653 dataset was obtained from the GEO database. Starting from transcriptome data of breast cancer cell lines treated at 20% and 1% oxygen concentration, the response of the different breast cancer cell lines to hypoxia was analyzed first, and large differences were found between cell lines: 23 breast cancer cell lines showed a pronounced hypoxia response and 8 showed a less distinct one. The 23 breast cancer cell lines with a pronounced hypoxia response were therefore selected to construct a classifier.
With the rapid development of biological big data, machine learning plays an important role in understanding and exploring biomarkers and in identifying molecular patterns. The gene expression matrix of the 46 samples obtained by normoxic and hypoxic treatment of the 23 breast cancer cell lines was processed with a random forest algorithm to build a simple classifier that distinguishes normoxic from hypoxic breast cancer cell line samples by the expression of a series of genes. A random forest is an ensemble learning method built on decision trees: it constructs multiple decision trees by randomly selecting features and samples, combines them into a "forest", and predicts by voting or averaging. Its main advantages are: 1) reduced risk of overfitting, because features and samples are selected randomly; 2) high accuracy, because multiple decision trees are combined; 3) suitability for large-scale data, since large data sets can be processed effectively without much preprocessing; 4) interpretability, since it provides an importance ranking of the features, which helps explain the basis of a prediction.
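The voting idea described above can be sketched in a few lines. The code below is a deliberately reduced stand-in for a full random forest: each "tree" is a one-split decision stump fitted on a bootstrap sample with one randomly chosen feature, and the forest predicts by majority vote. All names and the toy expression values are hypothetical; the application does not disclose its implementation at this level of detail.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that best separates classes 0 and 1."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for above in (0, 1):  # class predicted when the value exceeds the threshold
            errors = sum(
                (above if row[feature] > threshold else 1 - above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    if best is None:  # the feature is constant in this bootstrap sample
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    _, threshold, above = best
    return (feature, threshold, above, 1 - above)


def predict_stump(stump, row):
    feature, threshold, above, below = stump
    return above if row[feature] > threshold else below


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))             # random feature
        forest.append(
            fit_stump([rows[i] for i in picks], [labels[i] for i in picks], feature)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy expression values for two hypothetical genes; 0 = normoxic, 1 = hypoxic.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1],
        [3.0, 5.0], [3.2, 5.5], [2.9, 4.8], [3.1, 5.2]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [1.05, 2.0]), predict_forest(forest, [3.05, 5.1]))
```

A real random forest grows full decision trees and samples a subset of features at every split; the stump version keeps only the two sources of randomness and the vote.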
Random forests are widely used in classification and regression problems and are particularly suitable for medical research and the construction of clinical prediction models. Using the feature importance calculated by the random forest, 44 hypoxia-up-regulated genes were selected based on two parameters, MeanDecreaseAccuracy and MeanDecreaseGini, and defined as the breast cancer hypoxia gene set. GO enrichment analysis was performed on the gene set and its correlation with other hypoxia-related gene sets was calculated; the results show that the established breast cancer hypoxia gene set is biologically meaningful. The next step was to use the gene set to stratify clinical samples and explore the relationship between the stratified samples and clinical prognosis.
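The two-parameter selection can be sketched as a set intersection. This is a minimal illustration, assuming the thresholds quoted in the description of FIG. 4 (MeanDecreaseAccuracy > 1 and MeanDecreaseGini > 0.05); the gene names and importance values are placeholders, and the further restriction to hypoxia-up-regulated genes is not shown.

```python
def select_features(importance, min_accuracy=1.0, min_gini=0.05):
    """Return the features that pass both importance thresholds."""
    by_accuracy = {
        gene for gene, metrics in importance.items()
        if metrics["mean_decrease_accuracy"] > min_accuracy
    }
    by_gini = {
        gene for gene, metrics in importance.items()
        if metrics["mean_decrease_gini"] > min_gini
    }
    return sorted(by_accuracy & by_gini)


# Placeholder importance values for four hypothetical genes.
importance = {
    "GENE_A": {"mean_decrease_accuracy": 2.4, "mean_decrease_gini": 0.31},
    "GENE_B": {"mean_decrease_accuracy": 1.6, "mean_decrease_gini": 0.02},
    "GENE_C": {"mean_decrease_accuracy": 0.4, "mean_decrease_gini": 0.12},
    "GENE_D": {"mean_decrease_accuracy": 1.1, "mean_decrease_gini": 0.08},
}
print(select_features(importance))
```

Requiring both measures guards against a gene that ranks highly on one importance parameter only.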
Because the breast cancer hypoxia gene set is a relative marker set, hierarchical clustering on the gene set was used to divide the tumor samples of the three breast cancer databases TCGA, METABRIC and GSE20685 into high, medium and low hypoxia score groups, with the comparison focused on the high and low groups. In parallel, GSVA was performed on each sample with the established breast cancer hypoxia gene set as the selected gene set, and the resulting GSVA pathway score is called the hypoxia score of the sample.
Next, survival analysis was performed on the high and low hypoxia score groups. Kaplan-Meier survival analysis used a categorical variable, namely the clinical grouping based on the breast cancer hypoxia gene set, and showed that the high hypoxia score group had worse survival than the low hypoxia score group. Univariate and multivariate Cox survival analysis used a continuous variable, namely the patient's hypoxia score, and indicated that the hypoxia score is a risk factor for death. The established breast cancer hypoxia gene set can therefore predict the prognosis of breast cancer patients, which lays a foundation for determining the hypoxia characteristics of breast cancer patients with endogenous markers.
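The Kaplan-Meier comparison can be illustrated with the product-limit estimator. This is a minimal sketch, assuming follow-up times with an event flag (1 = death observed, 0 = censored); the two small groups below are hypothetical and are not patient data from TCGA, METABRIC or GSE20685.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate: (time, survival probability) at each event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in months; event 1 = death, 0 = censored.
high_group = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
low_group = kaplan_meier([10, 25, 30, 40, 45], [0, 1, 0, 0, 0])
print("high hypoxia score:", high_group)
print("low hypoxia score:", low_group)
```

A log-rank test, as mentioned in the description of FIG. 9, would then assess whether the two curves differ significantly; it is omitted here for brevity.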
The aim of stratifying breast cancer patients according to the breast cancer hypoxia gene set is to enable precision medicine, selecting different drug treatments according to the hypoxia characteristics of each patient.
The main experimental reagents and consumables used in the application are shown in Table 1 below.
Table 1. Main experimental reagents and consumables
The cell freezing medium used in the application is prepared from 7 mL of culture medium, 2 mL of fetal calf serum and 1 mL of dimethyl sulfoxide. In addition, the main instruments and equipment used in the application are shown in Table 2 below.
Table 2. Main instruments and equipment
The English abbreviations used in the application, with their English and Chinese full names, are shown in Table 3 below.
Table 3. English abbreviations
The study establishes a breast cancer hypoxia gene set based on the GSE111653 dataset and uses the gene expression data and clinical data of breast cancer patients in the TCGA, METABRIC and GSE20685 databases to analyze the relationship between hypoxia and patient prognosis and drug response. It concludes that: 1) the breast cancer hypoxia gene set constructed with a random forest algorithm can characterize the hypoxia characteristics of breast cancer patients; 2) a survival prediction model built on the hypoxia score predicts the prognosis of breast cancer patients well; 3) the sensitivity of patients to 139 FDA-approved drugs can be predicted from the hypoxia characteristics of clinical samples.
The following describes the inventive solution by means of specific examples.
Example 1 Data collection
The data used in this study were from the GEO, TCGA and METABRIC databases.
1) Downloading of GEO database
GEO (Gene Expression Omnibus) is an international public database that stores and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by research teams.
The GSE111653 and GSE20685 datasets in the GEO database were downloaded using the R package "GEOquery". The GSE111653 dataset contains RNA sequencing data from 31 breast cancer cell lines treated for 24 hours at 20% or 1% O2. The GSE20685 dataset contains gene expression data, follow-up and clinical information for 327 breast cancer samples. GSE111653 uses the GPL20301 platform and GSE20685 uses the GPL570 platform. The R packages "Biobase" and "biomaRt" were used to extract and annotate gene information, which included removing probes with missing gene names and probes for which several probes map to the same gene, and then annotating the gene names.
2) Downloading of TCGA database
TCGA (The Cancer Genome Atlas) is a cancer genomics project launched by the National Cancer Institute and the National Human Genome Research Institute, which has molecularly characterized over 20,000 cancer and matched normal samples spanning 33 cancer types.
The TCGA-BRCA dataset was retrieved, downloaded, preprocessed, analyzed and integrated using the R package "TCGAbiolinks". Transcriptome data of 1118 breast cancer patients in the TCGA-BRCA dataset were downloaded and organized, gene names were set as row names according to the gene annotation information, and the follow-up and clinical information of the patients were downloaded at the same time.
3) Downloading of METABRIC database
METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) is an international research project comprising clinical features, gene expression, copy number variation and single nucleotide polymorphism genotypes derived from breast tumor samples, and it is of great value for molecular subtyping studies of breast cancer.
Transcriptome data, along with follow-up and clinical information, for 1980 breast cancer patients in the project were downloaded from the cBioPortal platform.
Table 4 Summary of breast cancer dataset characteristics
Example 2 Screening of differentially expressed genes
The GSE111653 dataset was downloaded to obtain transcriptome count data for 62 samples, and the probe IDs were converted to standard gene names according to the GPL20301 platform. Differentially expressed genes (DEGs) were screened from the gene expression matrix using the R package "DESeq2". According to the GSE111653 experimental design, a design of 31 cell lines and treatment conditions was established, yielding the DEGs of the GSE111653 dataset between the 1% O2 and 20% O2 conditions. An adjusted p-value < 0.05 and log2(FoldChange) > 2 were used as the screening criteria.
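The screening criteria above can be illustrated with a minimal Python sketch. The study itself used the R package "DESeq2"; the rows below are hypothetical values made up for illustration, and the function name screen_degs is an assumption, not part of the application.

```python
# Hypothetical DESeq2-style result rows: (gene, log2 fold change, adjusted p-value).
# Gene names and numbers are invented for illustration only.
results = [
    ("GENE_A", 3.1, 0.001),
    ("GENE_B", 1.2, 0.0004),   # significant, but fold change too small
    ("GENE_C", 2.6, 0.20),     # large fold change, but not significant
    ("GENE_D", 4.0, 0.03),
    ("GENE_E", 2.5, None),     # adjusted p-value unavailable
]

PADJ_CUTOFF = 0.05
LOG2FC_CUTOFF = 2.0


def screen_degs(rows, padj_cutoff=PADJ_CUTOFF, log2fc_cutoff=LOG2FC_CUTOFF):
    """Keep genes with adjusted p-value < cutoff and log2 fold change > cutoff."""
    kept = []
    for gene, log2fc, padj in rows:
        if padj is None:
            continue  # genes without an adjusted p-value cannot pass the filter
        if padj < padj_cutoff and log2fc > log2fc_cutoff:
            kept.append(gene)
    return kept


degs = screen_degs(results)
```

With these made-up rows, only GENE_A and GENE_D satisfy both thresholds.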
Example 3 Unsupervised clustering to identify normoxic and hypoxic populations
K-means cluster analysis is an unsupervised machine learning algorithm that groups unlabeled samples into clusters based on the similarity of the data. The algorithm initializes k cluster centers randomly, then iteratively assigns each data point to the nearest cluster center and updates each center to the mean of all points assigned to it. This process continues until the cluster centers no longer move or the maximum number of iterations is reached. Using the 'kmeans' function of the R package "cluster", unsupervised cluster analysis was performed on the 62 samples based on the differential genes screened from the 31 breast cancer cell lines, with the number of clusters set to 2. The resulting two-cluster sample assignments were compared with the experimental groups to determine the hypoxic and normoxic populations.
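The assignment and update steps described above can be sketched in plain Python as follows. This is an illustrative sketch only, not the R routine used in the study; the two-dimensional points stand in for samples described by differential gene expression and are hypothetical, as are the function names.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k=2, max_iter=100, seed=0):
    """Basic k-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                new_centers.append(centers[c])  # keep the center of an empty cluster
                continue
            dim = len(members[0])
            new_centers.append(
                [sum(m[i] for m in members) / len(members) for i in range(dim)]
            )
        if new_centers == centers:
            break  # centers no longer move
        centers = new_centers
    return labels, centers


# Hypothetical samples: three "normoxic-like" and three "hypoxic-like" points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = kmeans(points, k=2)
```

Which cluster receives label 0 or 1 depends on the random initialization, which is why the study compares the cluster assignments with the experimental groups to decide which cluster is the hypoxic population.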
The distance parameter product-d is calculated for each cell line; it characterizes whether the normoxic and hypoxic samples of a given cell line belong to the same population. With k=2, i.e., with the 62 samples divided into 2 groups according to differential gene expression data, product-d is calculated as follows: 1) the Euclidean distance d1 between each sample and the cluster center of group 1 and the Euclidean distance d2 between each sample and the cluster center of group 2 are calculated; 2) the difference d between the two Euclidean distances is calculated (d = d1 - d2); 3) since each cell line contains two samples, one normoxic and one hypoxic, the product of the differences d of the two samples of the same cell line is calculated as product-d (product-d = d_Normoxia × d_Hypoxia).
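The product-d calculation can be sketched as below. The cluster centers and sample coordinates are hypothetical and chosen only to make the arithmetic easy to follow; the function names are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_difference(sample, center1, center2):
    """d = d1 - d2, the difference of the distances to the two cluster centers."""
    return euclidean(sample, center1) - euclidean(sample, center2)


def product_d(normoxia_sample, hypoxia_sample, center1, center2):
    """product-d = d_Normoxia x d_Hypoxia for one cell line."""
    d_normoxia = distance_difference(normoxia_sample, center1, center2)
    d_hypoxia = distance_difference(hypoxia_sample, center1, center2)
    return d_normoxia * d_hypoxia


# Hypothetical cluster centers of group 1 and group 2.
center1 = (0.0, 0.0)
center2 = (10.0, 0.0)

# Cell line A: the two samples lie near different centers, so product-d is negative.
pd_a = product_d((9.0, 0.0), (1.0, 0.0), center1, center2)   # (9 - 1) * (1 - 9) = -64

# Cell line B: both samples lie near the same center, so product-d is positive.
pd_b = product_d((8.0, 0.0), (9.0, 0.0), center1, center2)   # (8 - 2) * (9 - 1) = 48
```

A sample closer to center 1 has d < 0 and a sample closer to center 2 has d > 0, so the sign of the product shows whether the normoxic and hypoxic samples of a cell line fall on the same side.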
Example 4 Building a classifier based on random forest
Random forest is a popular machine learning algorithm that can be used for classification and regression problems. During training the algorithm builds multiple decision trees, each trained on a different subset of the data, and aggregates the predictions of the individual trees (majority vote for classification, averaging for regression) as the result, which improves the prediction accuracy of the model and helps control overfitting. As an ensemble learning method that combines multiple classifiers to solve complex problems and improve model performance, random forest is widely applied to large-scale biological data.
The gene expression matrix of the GSE111653 dataset was subjected to supervised analysis using the R package "randomForest". When building the decision trees, the random forest samples with replacement using the bootstrap method, and the samples that are not drawn are recorded as out-of-bag (OOB) samples and used for calculating feature importance. Because random forest randomly draws samples and selects features while building decision trees, the computed feature importance is mainly affected by two parameters: the number of decision trees, ntree, and the number of variables tried at each node split, mtry. However, transcriptomic data contain many variables (genes) and therefore much noise, which affects the calculated feature importance. To reduce the influence of noise, a random forest model was first built using all genes, the top 600 genes by feature importance were selected and the features of low importance were removed, and the random forest calculation was then repeated to build a reduced classifier.
The random forest model was built as follows. 1. Data preprocessing: the R package "caret" was used to 1) remove variables with zero or near-zero variance and 2) remove highly correlated variables. 2. Data splitting: 75% of the samples in the dataset were randomly assigned to the training set and 25% to the validation set. 3. Model building: a random forest model was built with all genes remaining after preprocessing as variables, the feature importance measures Mean Decrease Accuracy and Mean Decrease Gini were calculated, and the top 600 genes by feature importance were selected as a new variable set to build a reduced classifier with the random forest algorithm again. 4. A confusion matrix was drawn according to Table 5.
The model was evaluated comprehensively by calculating the specificity (SPC), sensitivity (SEN), positive predictive value (PPV), accuracy (ACC) and F1 score of the random forest binary classification model on the training set and the test set. The calculation formulas of these indices are shown in Table 6. Specificity, also called the true negative rate, is the proportion of all negative samples that are correctly predicted as negative, and characterizes how well the classification model identifies negative samples. Sensitivity, also called recall or the true positive rate, is the proportion of all positive samples that are correctly predicted as positive; it is the complement of the type II error rate and characterizes how well the classification model identifies positive samples. The positive predictive value, also called precision, is the proportion of samples predicted as positive that are truly positive. Accuracy is the proportion of all samples that are predicted correctly. The F1 score is the harmonic mean of precision and recall.
Table 5 Confusion matrix
Table 6 Calculation formulas of the model evaluation indices
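The evaluation indices defined above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the results reported in the application.

```python
def evaluate(tp, fp, tn, fn):
    """Compute standard binary-classification indices from confusion-matrix counts.

    tp: true positives, fp: false positives, tn: true negatives, fn: false negatives.
    """
    spc = tn / (tn + fp)                    # specificity (true negative rate)
    sen = tp / (tp + fn)                    # sensitivity (recall, true positive rate)
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    f1 = 2 * ppv * sen / (ppv + sen)        # harmonic mean of precision and recall
    return {"SPC": spc, "SEN": sen, "PPV": ppv, "ACC": acc, "F1": f1}


# Hypothetical counts for illustration only.
metrics = evaluate(tp=8, fp=2, tn=18, fn=2)
```

For these made-up counts the specificity is 0.9, the sensitivity and positive predictive value are both 0.8, and the accuracy is 26/30.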
Example 5 Dimensionality reduction analysis
Three dimensionality reduction algorithms, PCA, t-SNE and UMAP, were used in this study.
1)PCA
PCA (Principal Component Analysis) linearly transforms the data to a new coordinate system so that the directions (principal components) capturing the largest variation in the data can be easily identified. The method summarizes continuous (i.e., quantitative) multivariate data by constructing a small number of mutually uncorrelated principal component variables, reducing the dimensionality of the data without losing important information. Principal component analysis was performed on the gene expression matrix using the 'prcomp' function in R to obtain the two directions of largest variance, PC1 and PC2; the variance contribution rate of each was calculated, and a two-dimensional PCA plot was drawn.
2)t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a powerful nonlinear dimensionality reduction technique. The algorithm maps high-dimensional data into a lower-dimensional space (typically 2D or 3D) while preserving the local similarity between data points. The gene expression matrix was reduced using the R package "Rtsne" with the output dimension set to 3, yielding three variables tSNE1, tSNE2 and tSNE3, and a three-dimensional t-SNE plot was drawn using the R package "scatterplot3d".
3)UMAP
UMAP (Uniform Manifold Approximation and Projection) is a manifold learning technique for dimensionality reduction based on Riemannian geometry and algebraic topology. UMAP is a nonlinear, non-parametric algorithm that aims to learn a low-dimensional representation of the data that balances preservation of local and global structure. The result is a practical and scalable algorithm suitable for real-world data. The gene expression matrix was reduced using the R package "umap" to obtain two variables UMAP1 and UMAP2, and a two-dimensional UMAP plot was drawn.
Example 6 Hypoxia score based on gene set enrichment score
GSVA (Gene Set Variation Analysis) is a non-parametric, unsupervised, single-sample gene set enrichment method for estimating the variation of gene set enrichment across the samples of an expression dataset. By performing a change of coordinate system, GSVA converts the gene-by-sample matrix into a gene set-by-sample matrix and scores pathway enrichment for each sample, enabling pathway-centric analysis of molecular data. Using the R package "GSVA", the gene expression matrices of the TCGA-BRCA, METABRIC and GSE20685 clinical datasets were imported, a gene list named the "breast cancer hypoxia-related gene set" was created from the above 44 genes, and the gene set enrichment score of each clinical sample was calculated, output, and defined as the hypoxia score of that sample.
Example 7 Hierarchical clustering to stratify clinical samples
Hierarchical clustering aims to identify objects or patterns in a dataset that have similar features. The goal is to group observations with similar characteristics according to certain criteria and to organize the data points into a hierarchy. Similarity between observations is characterized by a distance measure, such as Euclidean distance or a correlation-based distance. The calculation proceeds as follows: the Euclidean distances between all pairs of samples are calculated first; the two samples with the smallest distance are merged into a new cluster; the step is then repeated on the resulting clusters until all samples are merged into one cluster or a termination condition is reached. The hierarchical clustering result is presented as a dendrogram.
Using the hierarchical clustering algorithm in the R package "cluster", the Euclidean distance matrix of the expression of the breast cancer hypoxia gene set in the clinical samples of the TCGA-BRCA, METABRIC and GSE20685 databases was calculated first; the 'hclust' function was then used to cluster the samples and return the tree structure of the clustering result; finally, the dendrogram was cut into 3 groups to obtain a 3-group classification of the patients, which was used to identify the molecular characteristics of patients with good or poor prognosis.
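The merge procedure described above can be sketched in plain Python. The application does not state which linkage rule was used, so complete linkage (largest pairwise distance between two clusters) is assumed here; stopping when 3 clusters remain plays the role of cutting the dendrogram into 3 groups. The sample vectors and function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(samples, n_groups=3):
    """Bottom-up clustering with complete linkage (an assumption of this sketch).

    Starts with every sample in its own cluster and repeatedly merges the two
    closest clusters until n_groups clusters remain. Returns lists of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(samples[i], samples[j]) for i in c1 for j in c2)

    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Hypothetical expression vectors for six samples (low, medium, high expression).
samples = [(0.1, 0.2), (0.0, 0.3), (2.0, 2.1), (2.2, 1.9), (5.0, 5.2), (5.1, 4.9)]
groups = agglomerative(samples, n_groups=3)
```

In this toy example the three resulting groups correspond to low, medium and high expression, which mirrors how the patient groups are later labeled as low, medium and high hypoxia score.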
Example 8 Gene enrichment analysis
1) GO enrichment analysis
GO (Gene Ontology) enrichment analysis is a gene ontology-based method for the functional analysis of gene sets. The gene ontology covers three aspects: molecular function, biological process and cellular component. GO enrichment analysis of the breast cancer hypoxia gene set was performed using the online tool Metascape.
2) GSEA enrichment analysis
GSEA (Gene Set Enrichment Analysis) is a computational method for determining whether a predefined set of genes shows statistically significant, concordant differences between two biological states (e.g., phenotypes). For the high and low hypoxia score groups of TCGA-BRCA clinical samples defined by hierarchical clustering, all genes were ranked based on their expression levels to obtain a ranked gene list. Gene set enrichment analysis was performed on the ranked list using the R package "clusterProfiler", and the normalized enrichment score (NES) and adjusted p-value of each GOBP (GO biological process) gene set were calculated. Gene sets enriched in the high hypoxia score group of TCGA-BRCA were screened with NES > 1.8 and p.adj < 0.05 as thresholds, and gene sets enriched in the low hypoxia score group were screened with NES < -1.8 and p.adj < 0.05 as thresholds.
Example 9 Hypoxia-related drug sensitivity and resistance analysis
The Spearman rank correlation coefficient is a non-parametric measure of the correlation between two variables. It can capture monotonic relationships, including nonlinear ones, and takes values between -1 and 1, where 1 represents a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation.
The estimated drug responsiveness data were first downloaded. The data contain the responsiveness of TCGA-BRCA clinical samples to 139 FDA-approved drugs. A larger drug responsiveness value indicates greater resistance to the corresponding drug, and a smaller value indicates higher sensitivity. Spearman correlation analysis was performed between the hypoxia scores of the TCGA-BRCA clinical samples and the responsiveness data of the 139 drugs to obtain a Spearman correlation coefficient for each drug. A positive coefficient indicates that hypoxia is associated with resistance to the drug, and a negative coefficient indicates that hypoxia is associated with sensitivity to the drug.
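The Spearman coefficient can be computed as the Pearson correlation of the ranks of the two variables. The sketch below shows this in plain Python; the hypoxia scores and drug responsiveness values are hypothetical, and drug_a and drug_b are placeholder names rather than drugs from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical hypoxia scores and drug responsiveness values for five samples.
hypoxia_score = [-0.4, -0.1, 0.0, 0.3, 0.6]
drug_a = [1.0, 1.5, 2.2, 2.0, 3.1]   # rises with hypoxia score
drug_b = [5.0, 4.0, 3.5, 2.0, 1.0]   # falls with hypoxia score

rho_a = spearman(hypoxia_score, drug_a)
rho_b = spearman(hypoxia_score, drug_b)
```

Under the convention stated above, the positive coefficient for drug_a would be read as hypoxia-associated resistance and the negative coefficient for drug_b as hypoxia-associated sensitivity.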
Example 10 Survival and prognosis analysis
1) Kaplan-Meier survival analysis
Kaplan-Meier survival analysis estimates the proportion of patients surviving for a given period of time after treatment. It can only be applied to categorical variables and compares the survival times of two or more groups of patients by the log-rank test. Survival analysis and plotting were performed using the R packages "survival" and "survminer". The high and low hypoxia score groups of breast cancer patients from the hierarchical clustering results were used as the two groups for survival analysis. Overall survival analysis was performed according to the overall survival (OS) time and corresponding events in the clinical data, and relapse-free survival analysis was performed according to the relapse-free survival (RFS) time and corresponding events. The log-rank p-value and the number-at-risk table were obtained.
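For one group of patients, the Kaplan-Meier estimate multiplies, at each event time, the fraction of at-risk patients who survive that time. The sketch below is a minimal plain-Python illustration with hypothetical follow-up data; it does not implement the log-rank test, and the function name is an assumption.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate for one group.

    times:  follow-up time of each patient.
    events: 1 if the event (e.g. death) was observed, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed  # events and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times (months) and event indicators for five patients.
times = [5, 8, 12, 12, 20]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

Here the estimate drops to 4/5 at month 5 and then to 4/5 × 1/3 at month 12, because the patient censored at month 8 has already left the risk set.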
2) Cox proportional hazards regression model
The Cox proportional hazards regression model is used to study the relationship between patient survival time and one or more predictor variables, and can be used with both continuous and categorical variables. It allows the effects of several predictor variables on survival to be analyzed simultaneously and provides the effect size of each. Cox survival analysis was performed using the R packages "survival" and "survminer", and a forest plot was drawn. Using the hypoxia score of breast cancer patients calculated by GSVA and other predictor variables provided by the clinical data, such as lymph node status, pathological stage and molecular subtype, univariate Cox regression analysis was first performed on all included predictor variables; the predictor variables with a p-value less than 0.05 were then entered into multivariate Cox regression analysis to obtain the hazard ratio (HR) and 95% confidence interval (CI) of each variable.
Example 11 CCK-8 in vitro experiments
1) Cell thawing and recovery
The water bath was turned on and set to 37 ℃; while it was warming up, 5 mL of culture medium was added to a 15 mL centrifuge tube. Once the water bath reached temperature, a cryovial containing the cell line was taken out of the -80 ℃ freezer, immediately placed in the water bath, and gently shaken so that it thawed as quickly as possible. After thawing, the cells were transferred to the centrifuge tube containing culture medium, the cap was tightened, and the tube was centrifuged at 1000 g for 3 min. The supernatant was discarded, and the cells were resuspended in culture medium, mixed evenly by pipetting, and added to a culture dish. An appropriate amount of culture medium was added according to the size of the culture dish, the dish was shaken gently to distribute the cells, and the cells were cultured in an incubator, with the medium replaced periodically according to the condition of the cells.
2) Cell passage
When the cells have grown to cover about 90% of the bottom of the culture dish, they need to be passaged. The old culture medium in the dish was aspirated with a vacuum aspirator, and an appropriate amount of PBS was added to gently wash the cells and remove residual medium. After the PBS was aspirated, an appropriate amount of 0.25% trypsin was added, and the cells were digested in an incubator or at room temperature until they rounded up and detached, taking care to avoid over-digestion. Digestion was terminated by adding an equal volume of culture medium to the dish. The cells were collected in a 15 mL centrifuge tube and centrifuged at 1000 g for 3 min, the supernatant was discarded, and the cells were resuspended in culture medium, mixed evenly by pipetting and counted. The required number of cells was added to a culture dish, an appropriate amount of culture medium was added according to the size of the dish, the dish was shaken gently to distribute the cells, and the cells were cultured in an incubator, with the medium replaced periodically according to the condition of the cells.
3) Cell cryopreservation
When the cells have grown to cover about 90% of the bottom of the culture dish, they can be cryopreserved. The cryopreservation medium was prepared, the old culture medium in the dish was aspirated with a vacuum aspirator, and an appropriate amount of PBS was added to gently wash the cells and remove residual medium. After the PBS was aspirated, an appropriate amount of 0.25% trypsin was added, and the cells were digested in an incubator or at room temperature until they rounded up and detached, taking care to avoid over-digestion. Digestion was terminated by adding an equal volume of culture medium to the dish. The cells were collected in a 15 mL centrifuge tube and centrifuged at 1000 g for 3 min, the supernatant was discarded, and the cells were resuspended in cryopreservation medium. The cryovials were labeled with the cell information, 1 mL of cell suspension was added to each cryovial, the caps were tightened, and the cryovials were placed in a controlled-rate freezing container and stored at -80 ℃; for long-term storage they can be transferred to liquid nitrogen.
4)CCK-8
Plating: each cell line was divided into a normoxic group and a hypoxic group, and each 96-well plate included a cell-free blank control group and a DMSO (dimethyl sulfoxide) group (to exclude interference of DMSO with the experiment). The cells used were SUM159, MDA-MB-231 and Hs578T. Each group had 4 replicate wells. The cells were seeded in 96-well plates at a density of 2000 cells per well in a total volume of 100 μL of cell suspension per well, and then cultured overnight in a cell incubator to allow the cells to attach.
Drug addition: after confirming that the cells had attached and were growing well, drugs at different concentrations (0, 0.01, 0.1, 1, 10 and 100 μg/mL) were added to the hypoxic and normoxic groups. The hypoxic group was placed in a hypoxia chamber for hypoxia treatment (1% oxygen, 5% carbon dioxide and 94% nitrogen), and the normoxic group was placed directly in the cell incubator; culture was continued for 48 hours.
CCK-8 reagent addition: after 48 h of drug treatment, culture medium was mixed with CCK-8 (Cell Counting Kit-8) reagent so that every 100 μL of culture medium contained 10 μL of the viability detection reagent. The supernatant in the 96-well plate was completely aspirated, 100 μL of the CCK-8/medium mixture was added to each well, and the plate was returned to the incubator for incubation. The incubation time differs between cell lines; detection can be performed once the color has turned orange.
OD value measurement: the absorbance at 450 nm, i.e., the OD (optical density) value, was measured with a microplate reader. The OD value is generally between 0.5 and 2.5, typically between 0.8 and 1.5. The mean OD value of the cell-free blank control group was calculated and subtracted from the OD values of the experimental groups, and the resulting values were used to calculate cell viability and draw line graphs. Cell viability (%) = experimental OD / control OD × 100%. The results were plotted using GraphPad Prism software.
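The viability calculation can be sketched as follows. The OD readings are hypothetical, and the sketch assumes that the control wells (for example the 0 μg/mL or DMSO wells) are blank-corrected in the same way as the treated wells, which the text above does not state explicitly.

```python
from statistics import mean


def viability_percent(treated_od, control_od, blank_od):
    """Cell viability (%) from replicate OD450 readings.

    The mean of the cell-free blank wells is subtracted from every reading,
    then viability = corrected treated OD / corrected control OD x 100.
    Blank-correcting the control wells is an assumption of this sketch.
    """
    blank = mean(blank_od)
    treated = mean([od - blank for od in treated_od])
    control = mean([od - blank for od in control_od])
    return treated / control * 100


# Hypothetical readings from 4 replicate wells per group.
blank_wells = [0.10, 0.10, 0.10, 0.10]
control_wells = [1.10, 1.10, 1.10, 1.10]
treated_wells = [0.60, 0.60, 0.60, 0.60]

viability = viability_percent(treated_wells, control_wells, blank_wells)
```

With these made-up readings the corrected treated and control values are 0.5 and 1.0, giving a viability of about 50%.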
The experimental results of the application are as follows:
1) The hypoxia response of different breast cancer cell lines is heterogeneous.
The study first performed differential gene expression analysis between the normoxic and hypoxic groups of the 31 cell lines of GSE111653 using the DESeq2 package. With p-value < 0.05 and log2(FoldChange) > 2 as the screening thresholds, the 20 genes with the largest fold change were selected, and a heat map of the expression of these 20 genes in the normoxic and hypoxic groups of the 31 cell lines was drawn. The heat map showed that, in the hypoxic group, the expression of hypoxia-related genes was highly consistent among the cell lines MCF10A, SUM, SUM159, HCC1569, MCF12A, MDA-MB-468, MCF7, HCC1806 and HME, while the other 22 cell lines showed different expression patterns in response to hypoxia.
Further, a scatter plot of the fold changes of the 20 genes across the 31 cell lines was drawn (FIG. 1b), indicating that the responses of different cell lines to hypoxia are heterogeneous.
2) Identification of hypoxic and normoxic populations based on unsupervised learning
The k-means clustering algorithm was used to distinguish the hypoxic population from the normoxic population. Because the responses of different breast cancer cell lines to hypoxia are heterogeneous, the 62 samples were classified by the k-means clustering algorithm and displayed in a dimensionality reduction plot (FIG. 2a). Circular markers represent hypoxic samples, which are mainly concentrated in group 1 (the red sample region), so group 1 represents the hypoxic population; triangular markers represent normoxic samples, which are mainly concentrated in group 2 (the blue sample region), so group 2 represents the normoxic population. At the same time, product-d was calculated (FIG. 2b): a positive value indicates that the two samples of the same cell line were grouped into the same population, and a negative value indicates that they were grouped into different populations. The heat map (FIG. 2c) indicated that 7 cell lines, namely BT474, DU4475, HCC38, MDA-MB-157, MDA-MB-175, SKBR3 and SUM185, had samples assigned to the normoxic population whose background expression under normoxia was very close to that of the samples under hypoxia, while 1 cell line, HCC1806, had samples assigned to the normoxic population whose normoxic background expression was not apparent. The samples of these 8 cell lines were therefore removed when the random forest model was subsequently built.
3) Building a classifier based on the random forest algorithm
Based on the transcriptomic data of 23 cell lines, a classifier was built using the random forest algorithm to distinguish the normoxic and hypoxic populations (FIG. 3a). A reduced classifier was built using the R package "randomForest", in which the parameter to be tuned is the number of decision trees, ntree. The more decision trees there are, the longer the program runs, and once the number of trees reaches a certain level the model is difficult to optimize further, so more trees is not always better. ntree was first set to a large value of 1500, and the relationship between model error and the number of decision trees was plotted (FIG. 3b). At ntree = 300 the classification errors of the random forest model on normoxic and hypoxic samples were at their minimum and stable, so ntree = 300 was finally chosen to build the reduced random forest classifier. The confusion matrices of the established random forest model were plotted (FIG. 3c and 3d), and the evaluation indices of the model were calculated from the confusion matrices (Table 7).
Table 7 Evaluation of prediction results of the random forest model
The results show that the classifier established by the application has high specificity and sensitivity: on the training set the specificity is 0.944, the sensitivity is 0.833, the positive predictive value is 0.938, the accuracy is 0.917 and the F1 score is 0.882, and on the validation set all indices are 1.
3) Calculating feature importance to establish breast cancer hypoxia gene set
Two measures of feature importance, Mean Decrease Accuracy and Mean Decrease Gini, were calculated for the random forest model established above using the R package "randomForest". FIG. 4a shows the top 30 genes by feature importance. There were 133 feature variables with Mean Decrease Accuracy > 1 and 66 feature variables with Mean Decrease Gini > 0.05, and the intersection of the two yielded the 48 genes with the highest feature importance (FIG. 4b). A heat map of the expression of these 48 genes in the 23 cell lines was drawn (FIG. 4c); it shows that the expression patterns of the 48 genes are opposite in the normoxic and hypoxic groups, so the 48 genes discriminate well between the two groups. Meanwhile, the R package "DESeq2" was used to calculate the differential expression of these genes between the normoxic and hypoxic groups of the 23 cell lines (Table 8).
The analysis found that, of the 48 genes, 44 were up-regulated and 4 were down-regulated under hypoxia; the 44 hypoxia up-regulated genes were defined as the breast cancer hypoxia gene set.
Table 8 Summary of fold change, adjusted p-value and feature importance for the 48 genes
Table 8 (continued)
The ability of the breast cancer hypoxia gene set to distinguish normoxic and hypoxic cells was analyzed with three dimensionality reduction algorithms, namely principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) (FIG. 5a, 5b and 5c). Blue represents the normoxic group and red represents the hypoxic group. The results show that the samples separate in two opposite directions, indicating that each of the two groups has distinct characteristics.
In conclusion, the results show that the established random forest model can divide most samples into normoxic groups and hypoxic groups according to the gene expression of the samples, and only a few samples cannot be classified, so that the model has higher sensitivity.
4) Verification of the Breast cancer hypoxia Gene set
First, GO gene enrichment analysis (fig. 6 a) was performed on the breast cancer hypoxia gene set, and the enrichment results showed that the genes were mainly enriched in the processes of hypoxia response, carbohydrate metabolism, mitochondrial changes during apoptosis, regulation of synthesis and energy production of metabolic precursors, response to nutrient levels, etc.
The results show that the established breast cancer hypoxia gene set is closely related to hypoxia microenvironment.
From a literature review, 6 hypoxia-related gene sets were compiled: 1) the BUFFA_HYPOXIA_METAGENE gene set, genes regulated by hypoxia in clinical cohort analyses of head and neck cancer and breast cancer and closely related to clinical prognosis in multiple cancer types; 2) the WINTER_HYPOXIA_METAGENE gene set, hypoxia-regulated genes summarized from a literature search; 3) the FARDIN_HYPOXIA_11 gene set, a hypoxia gene set established by analyzing 11 neuroblastoma cell lines after normoxic and hypoxic treatment; 4) the ELVIDGE_HYPOXIA_UP gene set, genes up-regulated in the MCF-7 cell line under hypoxic conditions; 5) the HALLMARK_HYPOXIA gene set, genes up-regulated in response to hypoxia; and 6) the Chi-common hypoxia genes gene set, genes up-regulated under hypoxia in tubular epithelial cells, mammary epithelial cells, smooth muscle cells and endothelial cells. The enrichment scores of the 7 gene sets (these 6 and the breast cancer hypoxia gene set established here) were calculated for the TCGA-BRCA clinical samples using GSVA, and the Spearman correlation coefficients between the hypoxia-related gene sets were calculated and plotted as a heat map (FIG. 6b).
The results show that the established breast cancer hypoxia gene set has higher similarity with other 6 gene sets.
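The pairwise comparison of gene-set scores can be illustrated with a minimal sketch. It assumes each gene set already has one enrichment score per sample (for example from GSVA); the function names and the toy scores below are hypothetical and are not taken from the patent.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(scores):
    """Build a symmetric {set_a: {set_b: rho}} table for a heat map."""
    names = list(scores)
    matrix = {a: {b: 1.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        rho = spearman(scores[a], scores[b])
        matrix[a][b] = matrix[b][a] = rho
    return matrix


# Hypothetical per-sample enrichment scores (five samples, three gene sets).
toy_scores = {
    "breast_cancer_hypoxia": [0.61, -0.20, 0.35, -0.48, 0.05],
    "HALLMARK_HYPOXIA": [0.55, -0.10, 0.40, -0.52, 0.12],
    "BUFFA_HYPOXIA_METAGENE": [0.30, -0.25, 0.58, -0.40, 0.02],
}
matrix = correlation_matrix(toy_scores)
```

The resulting nested dictionary can be passed to any plotting tool to draw a heat map of the kind described for fig. 6b.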
The TCGA-BRCA samples were typed according to their gene expression into the five PAM50 (Prediction Analysis of Microarray 50) subtypes, namely Normal, LumB, LumA, HER2 and Basal, and a box plot of the hypoxia scores of the clinical samples of the five subtypes was drawn (fig. 6c).
The results show that the Basal group, i.e. triple-negative breast cancer, has the highest hypoxia score. The proportions of high, medium and low hypoxia scores in the samples of the five subtypes were also plotted (fig. 6d); only in the Basal group does the proportion of high hypoxia scores exceed 50%, which is significantly higher than in the other subtypes.
5) Construction and evaluation of prognosis models
5.1 Clinical sample stratification based on hierarchical clustering
Hierarchical clustering was performed on the breast cancer patients according to their expression of the breast cancer hypoxia gene set, dividing the patients into 3 groups, which were defined as the high, medium and low hypoxia score groups according to high, medium and low gene expression (fig. 7a, 7b, 7c). The clustering results show that the numbers of patients with high, medium and low hypoxia scores are 305, 453 and 360 in the TCGA-BRCA database, 259, 754 and 967 in the METABRIC database, and 43, 104 and 180 in the GSE20685 database, respectively. Next, based on this grouping, a box plot of the hypoxia scores of each group was drawn (fig. 7d, fig. 7e and fig. 7f).
The results show that the hierarchical clustering groups have consistency with the hypoxia scores calculated by GSVA, and can be used for subsequent survival and prognosis analysis.
5.2 Gene set enrichment analysis
Gene set enrichment analysis (GSEA) was performed between the high and low hypoxia score groups of the TCGA-BRCA clinical samples (FIG. 8).
The enrichment results show that the high hypoxia score group is enriched in pathways such as positive regulation of stem cell population maintenance, regulation of spindle assembly, DNA replication initiation, negative regulation of translational initiation, and regulation of translational initiation in response to stress (NES > 1.8, p.adj < 0.01), whereas the low hypoxia score group is enriched in pathways such as positive regulation of serine phosphorylation of STAT protein, pituitary development, regulation of odontogenesis, and the aromatic amino acid catabolic process (NES < -1.8, p.adj < 0.01).
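The two cut-offs quoted above (|NES| > 1.8 and p.adj < 0.01) amount to a simple filter over a GSEA result table. The sketch below assumes each result row is a `(pathway, NES, adjusted p)` tuple; the function name and the numeric values in the toy rows are hypothetical, and only the pathway names and thresholds come from the text.

```python
def split_enriched_pathways(results, nes_cutoff=1.8, padj_cutoff=0.01):
    """Split GSEA rows into pathways enriched in the high and low groups.

    results: iterable of (pathway_name, nes, adjusted_p_value) tuples.
    A positive NES is taken to mean enrichment in the high hypoxia score group.
    """
    high, low = [], []
    for name, nes, padj in results:
        if padj >= padj_cutoff:
            continue  # not significant after multiple-testing correction
        if nes > nes_cutoff:
            high.append(name)
        elif nes < -nes_cutoff:
            low.append(name)
    return high, low


# Hypothetical NES and adjusted p-values, for illustration only.
toy_rows = [
    ("DNA replication initiation", 2.1, 0.002),
    ("regulation of spindle assembly", 1.9, 0.03),
    ("pituitary development", -2.0, 0.004),
    ("aromatic amino acid catabolic process", -1.2, 0.001),
]
high_group, low_group = split_enriched_pathways(toy_rows)
```

In this toy input the second row fails the adjusted p-value cut-off and the fourth row fails the NES cut-off, so each group keeps one pathway.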
5.3 Kaplan-Meier survival analysis
Kaplan-Meier survival analyses were performed between the high and low hypoxia score groups of the breast cancer databases TCGA-BRCA, METABRIC and GSE20685 (FIG. 9). In TCGA-BRCA patients, the median OS was 95.1 months (95% CI, 82.3-NA) in the high hypoxia group and 115.4 months (95% CI, 93.3-NA) in the low group. In METABRIC patients, the median OS was 75.2 months (95% CI, 63.2-91.1) in the high hypoxia group and 117.1 months (95% CI, 107.8-124.1) in the low group. In METABRIC patients, the median RFS was 75.1 months (95% CI, 53.1-117.7) in the high hypoxia group, and the median OS was 135.8 months (95% CI, 115.3-174.4) in the low group. The log-rank test showed that survival in the high hypoxia score group was significantly lower than in the low hypoxia score group, with all p-values below 0.05.
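How a median survival time is read off a Kaplan-Meier curve can be shown with a minimal sketch. It assumes follow-up times in months and an event flag per patient (1 for an observed event, 0 for censoring); the function names and the toy cohort are hypothetical and do not reproduce the patent's data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """Smallest time at which survival is 0.5 or lower, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical cohort: follow-up in months, 1 = event observed, 0 = censored.
toy_times = [5, 8, 12, 12, 20, 25]
toy_events = [1, 0, 1, 1, 0, 1]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_median = median_survival(toy_curve)
```

When the curve never falls to 0.5 the median is undefined and the function returns `None`; confidence bounds reported as NA in survival software arise in a similar way, when the corresponding bound of the curve does not reach 0.5.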
5.5 Single- and multi-factor Cox survival analysis
5.5.1) TCGA-BRCA
Using the clinical data of the TCGA-BRCA database, a risk score model was established by single- and multi-factor Cox survival analysis, and it was investigated whether the hypoxia score (hScore), treated as a continuous variable, could serve as a prognostic factor independent of clinicopathological stage, lymph nodes and other variables of breast cancer. In the clinical information of TCGA-BRCA, the continuous variables are age and number of lymph nodes, and the classification variables are pathological stage, PR, ER, HER2 and PAM50 typing; the specific classification criteria are shown in Table 9.
Table 9 Classification variables of the clinical information in the TCGA-BRCA and METABRIC databases
Table 9 (continued)
The results are shown in Table 10 and fig. 10a. In the single-factor Cox survival analysis, hypoxia score, age, pathological stage and number of lymph nodes were significantly correlated with the survival of breast cancer patients (p < 0.05): higher patient age, higher hypoxia score and TNM stage III were associated with a higher risk of death (HR > 1), whereas a higher number of lymph nodes was associated with a lower risk of death (HR < 1). The other variables, including estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor-2 (HER2) and PAM50 typing of breast cancer, showed no significant correlation with patient survival (p > 0.05).
Table 10 Cox proportional hazards regression analysis results for the TCGA-BRCA database
The 4 potential breast cancer prognostic factors were then included in a multi-factor Cox survival analysis, and the results were plotted as a forest plot.
The results show that, after adjustment for the other variables, the hypoxia score remains significantly correlated with poor prognosis of breast cancer patients (p = 1.4e-04, HR = 4.70); in addition, age and pathological stage are also prognostic factors independent of the other factors.
Taken together, the results show that the hypoxia score hScore of breast cancer can be used as a potential prognostic factor independent of clinical pathology stage, lymph node, etc. of breast cancer.
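For orientation, the hazard ratios reported for the Cox analysis can be related to the regression coefficients as follows. This is the standard Cox proportional hazards formulation, which the text does not spell out; the symbols $h_0$, $\beta_k$ and $x_k$ are generic notation assumed here, not notation from the patent.

```latex
% Hazard of a patient with covariates x_1, ..., x_p (hScore, age, stage, ...):
h(t \mid x) = h_0(t)\,\exp\!\Big(\sum_{k=1}^{p} \beta_k x_k\Big)

% Hazard ratio for a one-unit increase in covariate x_k, all others held fixed:
\mathrm{HR}_k = \frac{h(t \mid x_k + 1)}{h(t \mid x_k)} = e^{\beta_k}

% Example with the multi-factor TCGA-BRCA result for the hypoxia score:
\mathrm{HR}_{\mathrm{hScore}} = 4.70 \;\Longrightarrow\; \beta_{\mathrm{hScore}} = \ln 4.70 \approx 1.55
```

Under this reading, HR > 1 corresponds to a positive coefficient (higher risk of death as the variable increases) and HR < 1 to a negative one.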
5.5.2) METABRIC
Using the clinical data of the breast cancer database METABRIC, a risk score model was established by single- and multi-factor Cox survival analysis, and it was investigated whether the hypoxia score hScore, treated as a continuous variable, could serve as a prognostic factor independent of variables such as CLAUDIN typing and number of lymph nodes. In the clinical information of METABRIC, the continuous variables are age and number of lymph nodes, and the classification variables are HER2, ER, whether chemotherapy was performed, and CLAUDIN typing; the specific classification criteria are shown in Table 9.
The results are shown in Table 11 and fig. 10b. In the single-factor Cox survival analysis, hypoxia score, age, CLAUDIN typing, whether chemotherapy was performed, number of lymph nodes and HER2 positivity were significantly correlated with the survival of breast cancer patients (p < 0.05): higher patient age, higher hypoxia score, higher number of lymph nodes, HER2 positivity and claudin-low typing were associated with a higher risk of death (HR > 1), whereas the estrogen receptor variable showed no significant correlation with patient survival (p > 0.05).
Table 11 Cox proportional hazards regression analysis results for the METABRIC database
Next, the 6 potential breast cancer prognostic factors described above were included in a multi-factor Cox survival analysis, and the results were plotted as a forest plot.
The results show that, after adjustment for the other variables, the hypoxia score remains correlated with poor prognosis of breast cancer patients (p = 0.00219, HR = 1.4893); in addition, age, number of lymph nodes, whether chemotherapy was performed and HER2 positivity are also prognostic factors independent of the other factors.
Taken together, the results show that the hypoxia score hScore of breast cancer can be used as a potential prognostic factor independent of breast cancer CLAUDIN type, number of lymph nodes, etc.
6) Hypoxic drug reactivity analysis
6.1 Screening drugs for hypoxia sensitivity and resistance based on spearman correlation coefficients
The correlation between the hypoxia score of breast cancer patients and the response data of 139 FDA-approved drugs in the TCGA database was analyzed (Table 12). The hypoxia-sensitive drugs (figure 11), i.e. drugs with a negative Spearman correlation coefficient, were AZD6482 (r = -0.178), which targets PI3Kβ in the PI3K/MTOR pathway; NSC.87877 (r = -0.163), which targets SHP-1 (PTPN6) and SHP-2 (PTPN11); and AKT.inhibitor.VIII (r = -0.128), which targets AKT1, AKT2 and AKT3 in the PI3K/MTOR pathway.
The hypoxia-resistant drugs (fig. 12), i.e. drugs with a positive Spearman correlation coefficient, were also obtained: AZD8055 (r = 0.272), listed as targeting HDAC class I, IIA, IIB and IV; S.trityl.L.cysteine (r = 0.272), which targets KIF11 in the mitotic process; and AZD8055 (r = 0.272), which targets MTORC1 and MTORC2 in the PI3K/MTOR pathway.
Table 12 Hypoxia-sensitive and hypoxia-resistant drugs
Two drugs targeting the PI3K/AKT/mTOR signaling pathway, AZD6482 and AZD8055, were then selected for further in vitro experimental verification.
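The screening rule described in this subsection (negative Spearman coefficient means hypoxia-sensitive, positive means hypoxia-resistant) can be sketched as a small sorting step. The coefficients below are the ones quoted in the text; the function name and the dictionary layout are assumptions made for illustration.

```python
def classify_by_correlation(correlations):
    """Split drugs by the sign of their correlation with the hypoxia score.

    Returns (sensitive, resistant): sensitive drugs ordered from the most
    negative coefficient, resistant drugs from the most positive.
    """
    sensitive = sorted(
        (drug for drug, r in correlations.items() if r < 0),
        key=correlations.get,
    )
    resistant = sorted(
        (drug for drug, r in correlations.items() if r > 0),
        key=correlations.get,
        reverse=True,
    )
    return sensitive, resistant


# Spearman coefficients as quoted in the text above.
quoted_r = {
    "AZD6482": -0.178,
    "NSC.87877": -0.163,
    "AKT.inhibitor.VIII": -0.128,
    "AZD8055": 0.272,
    "S.trityl.L.cysteine": 0.272,
}
sensitive_drugs, resistant_drugs = classify_by_correlation(quoted_r)
```

A drug with a coefficient of exactly zero falls into neither list, which seems a reasonable default for a screen based purely on sign.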
6.2 Verifying drug sensitivity and resistance at the cellular level
Based on the computational results, a CCK-8 kit was used to measure the death of breast cancer cells treated with the drugs AZD6482 and AZD8055 for 48 hours under 20% O2 and 1% O2, and the drug dose-response curves were plotted.
The results show that the drug AZD6482 (fig. 11) shows hypoxia sensitivity in the SUM159 and Hs578T cell lines: the IC50 (half maximal inhibitory concentration) values of AZD6482 under hypoxic and normoxic conditions were 21.6 μM and 14.9 μM, respectively, in the SUM159 cell line, and 15.1 μM and 5.9 μM in the Hs578T cell line.
The drug AZD8055 (FIG. 12) shows hypoxia resistance in the MDA-MB-231 and Hs578T cell lines: the IC50 values of AZD8055 under hypoxic and normoxic conditions were 0.502 μM and 1.585 μM, respectively, in the MDA-MB-231 cell line, and 0.051 μM and 0.071 μM in the Hs578T cell line. The results of the in vitro experiments are consistent with the computational results.
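The text does not state how the IC50 values were derived from the dose-response curves; a four-parameter logistic fit is common practice. As a simpler stand-in, the sketch below estimates IC50 by linear interpolation on log10(dose) between the two measured points that bracket 50% viability. The function name and all dose and viability numbers are hypothetical.

```python
import math


def estimate_ic50(doses_um, viability_pct):
    """Interpolate the dose (in uM) at which viability crosses 50%.

    doses_um must be positive; viability_pct is percent of untreated control.
    Returns None when the measured curve never crosses 50%.
    """
    points = sorted(zip(doses_um, viability_pct))
    for (d_lo, v_lo), (d_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50 >= v_hi and v_lo != v_hi:
            # Fraction of the way from the lower to the higher dose.
            frac = (v_lo - 50) / (v_lo - v_hi)
            log_ic50 = math.log10(d_lo) + frac * (
                math.log10(d_hi) - math.log10(d_lo)
            )
            return 10 ** log_ic50
    return None


# Hypothetical CCK-8 read-outs for one drug under two oxygen conditions.
toy_doses = [1, 100]
toy_condition_a = estimate_ic50(toy_doses, [75, 25])
toy_condition_b = estimate_ic50(toy_doses, [95, 60])
```

Comparing the estimates obtained under 20% O2 and 1% O2 for the same drug and cell line gives the kind of IC50 contrast reported above.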
It can be seen that the hypoxic microenvironment significantly affects the sensitivity of breast cancer to different drugs. The application derives a breast cancer hypoxia gene set, characterizes the hypoxia features of clinical samples based on this gene set and calculates their hypoxia scores. This has important practical significance: for example, it can be used to explore the relationship between the hypoxia score of clinical samples and clinical prognosis, to establish a prediction model with clinical application value, and to predict drug sensitivity from the hypoxia score of clinical samples.
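Once a hypoxia score is available for a sample, grouping it is a threshold check. The sketch below uses the cut-offs stated in claim 7 (score of 0.5 or above is high, -0.5 or below is low, anything in between is medium); the score itself is taken as given, and the function name is hypothetical.

```python
def hypoxia_group(score, cutoff=0.5):
    """Assign a hypoxia score to the high, medium or low hypoxia group."""
    if score >= cutoff:
        return "high"
    if score <= -cutoff:
        return "low"
    return "medium"


# Hypothetical scores for four samples.
toy_groups = [hypoxia_group(s) for s in (0.8, 0.5, 0.0, -0.7)]
```

Both boundaries are inclusive for the outer groups, so a score of exactly 0.5 is high and a score of exactly -0.5 is low.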
The above description is only an example of the present application, and the scope of the present application is not limited to the specific examples, but is defined by the claims of the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the technical idea and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A biomarker for assessing tumor hypoxia levels, characterized in that the biomarker consists of the following 44 genes: [gene list omitted].
2. A kit for assessing tumor hypoxia levels, characterized in that it comprises a reagent for detecting the gene expression level of the biomarker according to claim 1.
3. The kit for assessing tumor hypoxia levels according to claim 2, characterized in that the reagent is used to detect the mRNA expression level of the biomarker, and the mRNA expression level of the biomarker is used to represent the gene expression level of the biomarker.
4. The kit for assessing tumor hypoxia levels according to claim 2, characterized in that the test sample of the kit is a breast cancer tumor sample.
5. An evaluation model for assessing tumor hypoxia levels, characterized in that the evaluation model takes the gene expression levels of the biomarker according to claim 1 as input variables and assesses the hypoxia level of the tumor by calculating a hypoxia score; the hypoxia score is calculated by the formula: [formula omitted]; wherein [definition omitted], and gene i is the biomarker according to claim 1; wherein [symbols omitted] are constants, calculated as follows: [formulas omitted]; wherein [definition omitted], and j denotes the 1118 breast cancer samples in the TCGA-BRCA database.
6. The evaluation model for assessing tumor hypoxia levels according to claim 5, characterized in that the expression level of gene i is the mRNA expression level of gene i.
7. The evaluation model for assessing tumor hypoxia levels according to claim 5, characterized in that when the hypoxia score is ≥ 0.5, the sample is assigned to the high hypoxia group; when the hypoxia score is ≤ -0.5, the sample is assigned to the low hypoxia group; and when -0.5 < hypoxia score < 0.5, the sample is assigned to the medium hypoxia group.
8. A system for assessing tumor hypoxia levels, characterized in that it comprises: a data acquisition module for acquiring the gene expression levels of the biomarker according to claim 1; an evaluation module, which assesses the tumor hypoxia level based on the gene expression levels obtained by the data acquisition module and outputs a hypoxia score, the evaluation module comprising an evaluation model for assessing tumor hypoxia levels, the evaluation model calculating the hypoxia score by the formula: [formula omitted]; wherein [definition omitted], and gene i is the biomarker according to claim 1; [symbols omitted] are constants, calculated as follows: [formulas omitted]; wherein [definition omitted], and j denotes the 1118 breast cancer samples in the TCGA-BRCA database; and an output module, which outputs a prediction result according to the hypoxia score.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the functions of the system according to claim 8 are realized.
10. Use of the evaluation model according to any one of claims 5 to 7, the system according to claim 8, or the computer-readable storage medium according to claim 9 in any one of the following: 1) hypoxia-sensitive drug screening; 2) hypoxia-resistant drug screening.
CN202510610465.7A 2025-05-13 2025-05-13 A tumor hypoxia level assessment model and its application and detection kit Active CN120119001B (en)

Publications (2)

Publication Number Publication Date
CN120119001A CN120119001A (en) 2025-06-10
CN120119001B true CN120119001B (en) 2025-08-22

Family

ID=95921357

Country Status (1)

Country Link
CN (1) CN120119001B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014089055A1 (en) * 2012-12-03 2014-06-12 Aveo Pharmaceuticals, Inc. Tivozanib response prediction

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0922437D0 (en) * 2009-12-22 2010-02-03 Cancer Rec Tech Ltd Hypoxia tumour markers
US20130143753A1 (en) * 2010-03-01 2013-06-06 Adelbio Methods for predicting outcome of breast cancer, and/or risk of relapse, response or survival of a patient suffering therefrom
NL2024482B1 (en) * 2019-12-17 2021-09-07 Univ Maastricht Method of training a machine learning data processing model, method of determining a hypoxia status of a neoplasm in a human or animal body, and system therefore.
CN115651983A (en) * 2022-10-13 2023-01-31 温州医科大学 Biomarkers and biomarker-based prediction models for assessing prognosis, immunity or treatment efficacy of breast cancer
CN116751858A (en) * 2023-06-16 2023-09-15 瀚星(苏州)医学科技有限公司 Novel gastric cancer prognosis marker and construction method of gastric cancer prognosis model thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant