Lung adenocarcinoma-related gene signature and application thereof
Technical Field
The invention belongs to the technical field of tumor gene detection, and particularly relates to a lung adenocarcinoma-related gene signature and application thereof.
Background
Lung cancer is the leading cause of cancer-related death worldwide. China accounts for approximately 20% of the world's population, yet its lung cancer deaths account for one third of the world's total. Several factors have driven the dramatic increase in lung cancer in China, notably air pollution and a large smoking population. Non-small cell lung cancer (NSCLC) is the most common form of lung cancer, with adenocarcinoma being its most common subtype. Combination chemotherapy can prolong the life of patients with advanced lung cancer, and survival can be further extended by targeted drugs such as anti-angiogenic agents and epidermal growth factor receptor (EGFR) inhibitors. Lung cancer treatment is rapidly moving into an era of personalized medicine, in which the molecular characteristics of an individual patient's tumor determine the optimal treatment modality. For example, NSCLC patients with EGFR mutations respond markedly to treatment with tyrosine kinase inhibitors (gefitinib or erlotinib). However, although current lung cancer treatments have improved substantially, our understanding of the genetic factors of lung cancer has grown, and patients can now be classified molecularly, the 5-year survival rate of NSCLC patients is still only about 21%.
Lung adenocarcinoma is a polygenic disease, and patient groupings based on histopathological markers, immunohistochemistry and other molecular factors have been evaluated to improve treatment regimens for patients with lung adenocarcinoma. The large cancer genomic databases now in common use make it possible to identify multigene features important in tumor progression in an unbiased manner. Several gene signatures based on microarray analysis have been shown to predict prognosis or response to treatment in NSCLC patients. However, these gene signatures were typically developed from incomplete genome annotations or simply from prior knowledge. Therefore, there is a need for a comprehensive and unbiased whole-genome selection of genes associated with lung cancer prognosis.
In current cancer research, microarray technology and second-generation sequencing have become important tools for studying the heterogeneity and complexity of lung adenocarcinoma, and they provide a wealth of information for developing biomarkers related to diagnosis, treatment and prognosis. Gene expression analysis allows the same tumor type to be divided into different subtypes whose prognosis can then be studied. With the help of gene expression analysis, gene association networks can be constructed, and such networks have proved important for studying the occurrence and development of cancer.
In other tumors, Oncotype DX (a 21-gene signature) developed by Genomic Health and MammaPrint (a 70-gene signature) developed by Agendia can evaluate the prognosis of breast cancer recurrence and metastasis, provide guidance on whether a patient needs chemotherapy, and show good application value and prospects in guiding clinical treatment decisions. Both tests were approved by the FDA in the United States for marketing. Oncotype DX is recommended in the NCCN guidelines and is a breast cancer test covered by U.S. medical insurance. Genomic Health has also developed Oncotype DX gene tests for prostate and colon cancer. However, to date, no similar commercial test for lung adenocarcinoma prognosis exists anywhere in the world.
Disclosure of Invention
Technical problem to be solved: the invention provides a lung adenocarcinoma-related gene signature and application thereof. It can be used to assist treatment selection for lung adenocarcinoma patients and to predict response to treatment intervention, thereby estimating how much a patient will benefit from chemotherapy or targeted therapy, avoiding overuse of medication and reducing medical costs.
Technical solution: a lung adenocarcinoma-related gene signature, wherein the lung adenocarcinoma-related genes are FAM83A, STK32A, TRPC6, DEFA1, TMEM4, CDC25C, PRKAR2B, TMEM100, CNTN4, HOOK1, INPP5A, TRHDE, RSPO2, LDB3, SLC24A3, VEPH1, SLC1A1, GPM6A, TMEM106B, FOXP1, NTN4, PALD1, F12, FHL1, TIMP1, IGSF9 and KLF9.
The lung adenocarcinoma-related gene signature further comprises 5 control genes: ACTB, GAPDH, PPIB, GUSB and TFRC.
Use of a set of probes or primers directed against the gene signature in the preparation of products for diagnosing and predicting the metastasis, staging and recurrence of human lung adenocarcinoma.
Use of the gene signature in the preparation of products for diagnosing and predicting the metastasis, staging and recurrence of human lung adenocarcinoma.
The product detects the mRNA expression levels of the target genes by real-time fluorescent quantitative PCR, gene chip, second-generation high-throughput sequencing, Panomics or NanoString technology.
A kit for measuring the expression levels of the lung adenocarcinoma prognostic gene signature, comprising the probes or primers.
Specifically, the invention provides a 27-gene signature and scoring system for evaluating lung adenocarcinoma prognosis. The invention comprises 27 lung adenocarcinoma prognosis-related genes and the detection of their expression levels in clinical samples, after which the clinical prognosis is predicted by calculating a prognosis score.
As a preferred embodiment, the present invention first identifies genes that are significantly differentially expressed in lung adenocarcinoma by comparing normal tissue with lung adenocarcinoma tissue. We developed a multi-step strategy (FIG. 1A) to find key gene signatures that can distinguish the prognosis of patients with lung adenocarcinoma. Using three publicly available human lung cancer transcriptome datasets generated on Affymetrix chips (GSE31210, GSE19188 and GSE19804), we found a total of 1327 genes that met our selection criteria, i.e. an expression change of 5-fold or more and an adjusted p-value < 0.0001 in all three datasets, including 884 down-regulated genes and 543 up-regulated genes.
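The selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the per-dataset statistics are made-up toy values, the gene names are placeholders, and the requirement that a gene change in the same direction in every dataset is an assumption that the text does not spell out.

```python
def passes(fold_change, adj_p, min_fold=5.0, max_p=1e-4):
    """True if the change is at least min_fold in either direction and adj_p is small enough."""
    if fold_change <= 0:
        return False
    magnitude = max(fold_change, 1.0 / fold_change)
    return magnitude >= min_fold and adj_p < max_p


def select_common(datasets, min_fold=5.0, max_p=1e-4):
    """Keep genes that pass in every dataset; split them into up- and down-regulated.

    Each dataset maps gene -> (tumor/normal fold change, adjusted p-value).
    Assumption: the direction of change must agree across all datasets.
    """
    common = set.intersection(*(set(d) for d in datasets))
    up, down = [], []
    for gene in sorted(common):
        stats = [d[gene] for d in datasets]
        if not all(passes(fc, p, min_fold, max_p) for fc, p in stats):
            continue
        if all(fc > 1 for fc, _ in stats):
            up.append(gene)
        elif all(fc < 1 for fc, _ in stats):
            down.append(gene)
    return up, down


# Toy stand-ins for the three datasets (hypothetical values, not real data).
toy_datasets = [
    {"GENE_A": (8.0, 1e-6), "GENE_B": (0.10, 1e-7), "GENE_C": (6.0, 1e-3)},
    {"GENE_A": (5.5, 1e-5), "GENE_B": (0.15, 1e-6), "GENE_C": (7.0, 1e-6)},
    {"GENE_A": (9.0, 1e-8), "GENE_B": (0.18, 1e-5), "GENE_C": (5.2, 1e-6)},
]
up_genes, down_genes = select_common(toy_datasets)
```

In this toy input, GENE_C is dropped because its adjusted p-value misses the threshold in one dataset, which mirrors the "in all three datasets" requirement.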
As a preferred approach, we further evaluated the importance of the differential expression of the above 1327 genes in the clinical progression of lung adenocarcinoma. Using an online survival and prognosis tool (http:// kplot. com/analysis/index. phpp. service & cancer), Kaplan-Meier curves and the log-rank test were applied to assess the prognostic value of each gene in a large public database of clinical lung adenocarcinoma microarray data. For each gene, samples were divided into a high-expression group and a low-expression group based on its expression level. Kaplan-Meier curves (FIG. 1B) were then used to show the effect of high or low expression of these genes on the five-year survival rate of lung adenocarcinoma patients. It was found that 600 of the 1327 genes were significantly associated with the overall survival of lung adenocarcinoma patients (adjusted p-value < 0.005): 406 genes had a hazard ratio (HR) < 1 (high expression associated with good prognosis) and 194 genes had HR > 1 (high expression associated with poor prognosis) (Table 1).
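The high/low comparison for a single gene can be sketched as below. This is a minimal illustration, not the online tool's implementation: the median cut-off is an assumption (the text does not state how the two groups are split), and the survival times, event flags and expression values are invented toy data.

```python
import math


def split_by_median(values):
    """Return True for samples whose expression is above the median (assumed cut-off)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])
    return [v > median for v in values]


def logrank_test(times, events, in_high):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 degree of freedom."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = n_high = d = d_high = 0
        for ti, ei, high in zip(times, events, in_high):
            if ti < t:
                continue  # this sample is no longer at risk at time t
            n += 1
            n_high += high
            if ti == t and ei:
                d += 1
                d_high += high
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Toy data: months of follow-up, death indicator (1 = died), expression of one gene.
times = [5, 8, 12, 20, 30, 45, 50, 60]
events = [1, 1, 1, 1, 0, 1, 0, 0]
expression = [9.1, 8.7, 9.5, 8.9, 3.2, 2.8, 3.9, 2.5]

high_group = split_by_median(expression)
chi2, p_value = logrank_test(times, events, high_group)
```

Repeating this test for every candidate gene and correcting the p-values for multiple testing would give a list comparable in spirit to Table 1; the correction method is not specified in the text.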
To reveal the biological functions of these genes and the molecular mechanisms of lung adenocarcinoma development, the invention uses ClueGO to determine which Gene Ontology (GO) categories are statistically over-represented among the 600 genes. A significant enrichment of genes associated with cell cycle, adhesion, cell death, angiogenesis, metabolism and kinase activity was observed, all of which are hallmarks of cancer.
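Over-representation of a category is commonly assessed with a hypergeometric test. The sketch below assumes a one-sided test without multiple-testing correction; the exact statistics and settings used by ClueGO are not described in the text, and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p(population, annotated, selected, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of genes in the background
    annotated:  background genes belonging to the GO category
    selected:   size of the gene list being tested
    hits:       genes in the list that belong to the category
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 600 selected genes out of 20000, 400 annotated, 40 hits.
p_example = enrichment_p(population=20000, annotated=400, selected=600, hits=40)
```

With 12 genes expected by chance in this hypothetical case, 40 hits would yield a very small p-value, which is what "statistically over-represented" means here.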
Preferably, based on the above results, the present invention designs a strategy for developing a prognostic scoring system for lung adenocarcinoma based on gene expression characteristics (FIG. 2). We first used a resampling method to divide 517 lung adenocarcinoma patients from The Cancer Genome Atlas (TCGA), a database established by RNA sequencing, into 100 training sets (350 patients each) and 100 test sets (167 patients each). We then performed multivariate Cox regression analysis on all 100 training sets to find, among the 600 genes, those that independently predict overall survival. Genes selected in at least 30% of the 100 training sets were included in our final 27-gene signature (Table 2): FAM83A, STK32A, TRPC6, DEFA1, TMEM4, CDC25C, PRKAR2B, TMEM100, CNTN4, HOOK1, INPP5A, TRHDE, RSPO2, LDB3, SLC24A3, VEPH1, SLC1A1, GPM6A, TMEM106B, FOXP1, NTN4, PALD1, F12, FHL1, TIMP1, IGSF9 and KLF9. Five genes were used as controls: ACTB, GAPDH, PPIB, GUSB and TFRC.
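The resampling and frequency-based selection can be sketched as follows. The functions `resample_splits`, `stable_genes` and `toy_selector` are hypothetical names introduced for illustration; in particular, `toy_selector` merely stands in for the multivariate Cox regression step, which is not implemented here, and simple random splitting without replacement is an assumption about the resampling method.

```python
import random
from collections import Counter


def resample_splits(patient_ids, n_train, n_splits, seed=0):
    """Repeatedly shuffle the cohort and cut it into a training and a test set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(patient_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


def stable_genes(splits, select_genes, min_fraction=0.30):
    """Keep genes that select_genes() returns in at least min_fraction of training sets."""
    counts = Counter()
    for train_ids, _ in splits:
        counts.update(set(select_genes(train_ids)))
    return sorted(g for g, c in counts.items() if c / len(splits) >= min_fraction)


def toy_selector(train_ids):
    """Hypothetical stand-in for per-training-set Cox regression gene selection."""
    genes = ["GENE_A"]
    if 0 in train_ids:
        genes.append("GENE_B")
    return genes


patients = list(range(517))
splits = resample_splits(patients, n_train=350, n_splits=100)
signature = stable_genes(splits, toy_selector)
```

Requiring a gene to recur across many resampled training sets favors genes whose association with survival does not depend on one particular split of the cohort.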
Preferably, the lung adenocarcinoma prognosis scoring system uses a prediction score to calculate the patient's probability of survival. The prediction score is defined as a linear combination of gene expression levels based on a canonical discriminant function. The formula for calculating the prognosis score is as follows:
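As a minimal sketch of such a linear combination, the score can be written as the sum over the 27 genes of coefficient times expression level, using the coefficients listed in Table 2. Two points are assumptions and are not stated here: that the expression values are already normalized (for example against the control genes), and that there is no intercept term. Gene symbols follow the spelling used in Table 2.

```python
# Coefficients copied from Table 2 (gene symbols as spelled in that table).
COEFFICIENTS = {
    "FAM83A": 0.20995771,
    "STK32A": -0.45049286,
    "TRPC6": 0.382016798,
    "DEFA1B": 0.298967835,
    "TMEM47": 0.220892566,
    "CDC25C": 0.338527972,
    "PRKAR2B": 0.035274941,
    "TMEM100": 0.101858155,
    "CNTN4": 0.120687495,
    "HOOK1": 0.079775222,
    "INPP5A": -0.220656803,
    "TRHDE": 0.363592887,
    "RSPO2": 0.092585398,
    "LDB3": 0.127095987,
    "SLC24A3": -0.336677565,
    "VEPH1": 0.164080783,
    "SLC1A1": 0.192834044,
    "GPM6A": 0.086279146,
    "TMEM106B": 0.105899244,
    "FOXP1": 0.249725361,
    "NTN4": 0.159188986,
    "PALD1": 0.167148577,
    "F12": 0.158275055,
    "FHL1": -0.869024553,
    "TIMP1": 0.14597252,
    "IGSF9": 0.078902808,
    "KLF9": 0.32007008,
}


def prognosis_score(expression):
    """Weighted sum of expression values (assumed normalized; no intercept assumed)."""
    missing = sorted(set(COEFFICIENTS) - set(expression))
    if missing:
        raise KeyError("missing expression values for: " + ", ".join(missing))
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Hypothetical patient with every normalized expression value equal to 1.0.
example_expression = {gene: 1.0 for gene in COEFFICIENTS}
example_score = prognosis_score(example_expression)
```

Raising an error on missing genes, rather than silently treating them as zero, avoids reporting a score computed from an incomplete panel.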
In each training set, the patient population was divided into three equal parts (good, medium and low prognosis) based on the prognostic scores of the lung adenocarcinoma patients, and the prognostic scores at the cut-off points were recorded. Kaplan-Meier analysis was then performed, and the log-rank test was used to determine significant differences in overall survival between the groups across all training sets (FIG. 3A). The hazard ratio (HR) of the "medium" group and of the "low" group was each calculated relative to the "good" group (FIG. 3B). In all test sets, the overall survival of patients carrying the poor-prognosis gene signature was significantly shorter than that of the "good" group (HR confidence interval above 1) (FIG. 3B, bottom panel), and in more than 70% of the test sets the "medium" group also had significantly shorter survival than the "good" group (FIG. 3B, top panel), indicating that the prognosis scoring system distinguishes well between good and poor prognosis. The scoring system achieved 100% accuracy of prognosis prediction. Similar accuracy was obtained using data from the GSE42127, GSE31210, GSE37745 and GSE30219 datasets (see Example 2, FIG. 4 and Table 3). In addition, we compared three published gene signatures that predict NSCLC prognosis using the same multivariate Cox regression analysis, and concluded that the 27-gene signature of the present invention is significantly superior in predicting overall survival in patients with lung adenocarcinoma (FIG. 5, see Example 3).
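The three-way division with recorded cut-off points can be sketched as follows. The group labels are deliberately neutral ("low_score", "middle_score", "high_score") because which end of the score scale corresponds to a good prognosis is not stated here; applying the training-set cut-offs unchanged to new patients is likewise an assumption, and the scores are toy values.

```python
def tertile_cutoffs(train_scores):
    """Return the two score values that split the training set into three equal parts."""
    ordered = sorted(train_scores)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def assign_group(score, cutoffs):
    """Place a score into one of three groups using previously recorded cut-offs."""
    lower, upper = cutoffs
    if score < lower:
        return "low_score"
    if score < upper:
        return "middle_score"
    return "high_score"


# Toy training scores (hypothetical), then classification of new patients.
train_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cutoffs = tertile_cutoffs(train_scores)
train_groups = [assign_group(s, cutoffs) for s in train_scores]
new_patient_groups = [assign_group(s, cutoffs) for s in (0.5, 4.2, 12.0)]
```

Recording the cut-offs on the training set and reusing them keeps the grouping of test-set patients independent of the test data itself.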
As a preferred scheme, RNA is collected from the tumor tissue of lung adenocarcinoma patients, including but not limited to fresh biopsy tissue, postoperative tissue, fixed tissue and paraffin-embedded tissue, and a corresponding assay kit and scoring system are designed and developed for each detection technology platform, including but not limited to real-time fluorescent quantitative PCR, gene chips, second-generation high-throughput sequencing, Panomics and NanoString technologies. For each platform, the kit developed by the invention provides the corresponding gene primers (real-time fluorescent quantitative PCR) or probes (gene chip, second-generation sequencing, Panomics and NanoString technologies).
Beneficial effects: the invention uses multi-omics data to identify a group of 27 important biomarker genes for predicting the overall survival of lung adenocarcinoma patients, and establishes for the first time a prognosis scoring system based on this 27-gene signature. Using other databases, we also independently demonstrated that the prediction scores of the system clearly distinguish between good and poor prognosis and show a significant advantage over the three published prognostic gene signatures for NSCLC in the current literature. The invention can be used to assist treatment selection for lung adenocarcinoma patients and to predict response to treatment intervention, thereby estimating how much a patient will benefit from chemotherapy and targeted therapy, avoiding overuse of medication, reducing medical costs and ultimately achieving the goal of individualized medical treatment.
Drawings
FIG. 1 shows the identification of related genes and survival curves: (A) flow chart of the identification and validation of genes involved in lung adenocarcinoma prognosis according to the present invention; (B) an example Kaplan-Meier survival curve for an individual gene significantly associated with overall survival in patients with lung adenocarcinoma. The p-value was obtained by the log-rank test comparing the two groups.
FIG. 2 is a flow chart of the Cox regression analysis used to generate the 27-gene signature associated with overall survival in patients with lung adenocarcinoma.
FIG. 3 shows overall survival curves and model calculations: (A) Kaplan-Meier overall survival curves of two representative test sets stratified by the 27-gene signature prognostic score; (B) hazard ratios (HR) and 95% confidence intervals calculated by the Cox model for the 100 test sets (good vs. low: top; middle and low: bottom).
FIG. 4 is an independent validation of the lung adenocarcinoma 27-gene signature. Kaplan-Meier overall survival curves generated from four independent cohorts of lung adenocarcinoma patients, stratified by the prognostic score of the 27-gene signature, show that the prognostic score is significantly correlated with overall survival in all cohorts.
FIG. 5 shows (A) a comparison of hazard ratios (HR) between the 27-gene signature and 3 existing gene signatures reported in the literature (100 test sets); (B) HR and 95% confidence intervals for the three existing gene signatures.
Detailed Description
The present invention is further illustrated by the following figures and specific embodiments. It is to be understood that these embodiments are illustrative only and do not limit the scope of the invention; modifications made by those of ordinary skill in the art after reading the present disclosure fall within the scope defined by the appended claims.
Example 1
System validation using lung adenocarcinoma patients from the TCGA public database:
The prognostic scoring system was applied to 517 TCGA lung adenocarcinoma patients with survival data (FIG. 3). The prognosis score is used to predict the probability of survival for each individual patient. Based on the 27-gene signature prognostic score, we divided the patients into three groups, i.e. good, medium and low prognosis. As shown in FIG. 3B, in both exemplary test sets, the overall survival of patients carrying a "good" prognostic gene signature was significantly longer than that of the "low" group (HR confidence interval above 1) (FIG. 3B, bottom panel). In the former group more than 60% of the patients survived 75 months, while in the latter either all patients died within 50 months or only 10% of the patients survived.
Example 2
Survival analysis of lung adenocarcinoma patients using GSE public datasets:
Using the same approach, we validated the utility of the prognostic scoring system in four public lung adenocarcinoma datasets: GSE42127, GSE31210, GSE37745 and GSE30219 (FIG. 4 and Table 3). Unlike TCGA, which was established by RNA sequencing, the tissue gene expression values in these datasets were measured with Affymetrix chip technology. FIG. 4A shows Kaplan-Meier overall survival curves demonstrating that the scoring system of the present invention can be used to predict the prognosis of the lung adenocarcinoma patients in the above datasets. Finally, we used Cox regression to test whether the prognostic score of the present invention is independent of other clinical information, including patient age, gender and tumor stage (Table 3), and concluded that our prognostic score is a significant independent predictor of patient survival.
Example 3
Comparing the performance of other lung adenocarcinoma prognostic gene signatures with the 27-gene signature of the present invention:
Several gene signatures have been reported in the literature to correlate with lung adenocarcinoma prognosis on the basis of gene expression differences. One key question is whether our 27-gene scoring system outperforms these signatures. Using the same samples and methods, we calculated prediction scores with three previously reported lung adenocarcinoma gene signatures: a 15-gene signature (Zhu CQ et al., Prognostic and predictive gene signature for adjuvant chemotherapy in resected non-small-cell lung cancer. Journal of Clinical Oncology. 2010; 28:4417-24), a 14-gene signature (Kratz JR et al., A practical molecular assay to predict survival in resected non-squamous, non-small-cell lung cancer: development and international validation studies. Lancet. 2012; 379:823-32) and a 31-gene signature (Wistuba II et al., Validation of a proliferation-based expression signature as prognostic marker in early stage lung adenocarcinoma. Clinical Cancer Research. 2013; 19). As shown in FIG. 5, the median HR of the 27-gene signature was on average 2.2-fold higher for the "middle" versus "good" comparison and 5.0-fold higher for the "low" versus "good" comparison than that of any of the comparison signatures above (FIG. 5A). Our signature is therefore significantly superior in predicting the prognosis of patients with lung adenocarcinoma.
Example 4
Predicting the prognosis of clinical lung adenocarcinoma patients:
Tumor tissue from clinically admitted lung adenocarcinoma patients, which may include fresh biopsy tissue, postoperative tissue, fixed tissue and paraffin-embedded tissue, is collected and RNA is extracted. The kit developed by the invention and a corresponding instrument are then used to quantify the expression levels of the 27 genes and the 5 control genes of the prognosis scoring system. The expression levels of the genes are entered into the prognostic scoring formula established in the present invention.
After the patient's prediction score is calculated, the physician predicts the patient's prognosis, such as 5-year survival, based on the score. At present, the model has been established through retrospective research and successfully validated on different databases, and a prospective study has been initiated to further refine the scoring system.
Example 5
Predicting the response of clinical lung adenocarcinoma patients to chemotherapeutic drugs:
The overall response rate of current lung adenocarcinoma chemotherapy is about 30%. To reduce ineffective or excessive medication and lower medical costs, the present invention predicts the response of clinical lung adenocarcinoma patients to chemotherapeutic drugs by the following scheme:
Tumor tissue from clinically admitted lung adenocarcinoma patients, which may include fresh biopsy tissue, postoperative tissue, fixed tissue and paraffin-embedded tissue, is collected and RNA is extracted. The kit developed by the invention and a corresponding instrument are then used to quantify the expression levels of the 27 genes and the 5 control genes. The expression levels of the genes are entered into the prognosis scoring formula established by the invention.
After the patient's prediction score is calculated, the physician considers, based on the score, whether the patient should receive chemotherapy and at what intensity. For patients whose prediction score indicates a good prognosis, the physician may be advised to consider the necessity of chemotherapy or its dose/cycle as appropriate. For patients whose prediction score indicates a poor prognosis, the physician may be advised to consider increasing the intensity of chemotherapy as appropriate.
Table 1. Summary of the Kaplan-Meier analysis results: genes significantly associated with overall survival (OS)
Table 2. Canonical discriminant function coefficients
| Gene | Cox regression coefficient |
| --- | --- |
| FAM83A | 0.20995771 |
| STK32A | -0.45049286 |
| TRPC6 | 0.382016798 |
| DEFA1B | 0.298967835 |
| TMEM47 | 0.220892566 |
| CDC25C | 0.338527972 |
| PRKAR2B | 0.035274941 |
| TMEM100 | 0.101858155 |
| CNTN4 | 0.120687495 |
| HOOK1 | 0.079775222 |
| INPP5A | -0.220656803 |
| TRHDE | 0.363592887 |
| RSPO2 | 0.092585398 |
| LDB3 | 0.127095987 |
| SLC24A3 | -0.336677565 |
| VEPH1 | 0.164080783 |
| SLC1A1 | 0.192834044 |
| GPM6A | 0.086279146 |
| TMEM106B | 0.105899244 |
| FOXP1 | 0.249725361 |
| NTN4 | 0.159188986 |
| PALD1 | 0.167148577 |
| F12 | 0.158275055 |
| FHL1 | -0.869024553 |
| TIMP1 | 0.14597252 |
| IGSF9 | 0.078902808 |
| KLF9 | 0.32007008 |
Table 3. Independent validation data for the lung adenocarcinoma 27-gene signature
(Hazard ratios (HR) and 95% confidence intervals calculated with the Cox model, using tumor stage (I-IV), gender, age at diagnosis and prognosis score as covariates)