CN117316295B - A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions - Google Patents
- Publication number
- CN117316295B (application CN202311177280.9A)
- Authority
- CN
- China
- Prior art keywords
- gene
- cell
- genes
- label
- small sample
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Data Mining & Analysis (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biotechnology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Organic Chemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Immunology (AREA)
- Microbiology (AREA)
- Software Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Biochemistry (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions. The invention aims to overcome the limitations of existing cell function identification methods. The method comprises: step 1, extracting cell-associated gene features; step 2, amplifying cell-associated genes; step 3, predicting cell heterogeneity genes; and step 4, identifying endocrine disease cell functions. The invention belongs to the technical field of endocrine disease cell function identification.
Description
Technical Field
The invention relates to a cell identification method for endocrine diseases, and belongs to the technical field of cell function identification for endocrine diseases.
Background
Various endocrine cell subtypes have been found to be associated with major human diseases: for example, β cells are closely related to type 2 diabetes, α cells to type 1 diabetes, and T cells to polycystic ovary syndrome. The main approach to cell function identification is to analyze differences in gene expression between cells using gene sequencing. Conventional population (bulk) sequencing provides an average of gene expression over the whole cell population and may therefore mask cell-to-cell heterogeneity and differences between individual cells, whereas single-cell sequencing captures gene expression at the level of individual cells, which makes it possible to analyze cell heterogeneity and reveal differences between cell types and subtypes.
At present, cell function identification based on single-cell sequencing mainly annotates and analyzes cells using differentially expressed genes and known biomarkers. For example, SCDE identifies cell function by using a Bayesian mixture model to calculate the similarity of the expression levels of differentially expressed genes in different cells, and PUSeqCluster identifies cell function by clustering gene expression data with a combined mixture t-distribution clustering model. However, these approaches still have limitations: 1) because prior knowledge is lacking, annotation by known biomarkers and gene expression profiles alone cannot accurately and effectively identify the association between disease and cell function; 2) because of the technical barriers of single-cell sequencing and the imbalance in cell numbers in endocrine tissues, identifying cell heterogeneity genes by analyzing only cell-function-related gene features is affected by this imbalance.
Identifying cell heterogeneity functions depends not only on gene feature extraction; addressing the imbalance of cell-associated genes can also improve the accuracy of cell classification, so that cell functions can be identified effectively on the basis of endocrine disease pathway functions. The invention therefore provides a machine-learning-based method for amplifying cell heterogeneity gene features and identifying disease-related cell functions. The method uses an improved SMOTE algorithm to amplify the features and labels of cell heterogeneity genes, uses a RAkEL multi-label classification model to identify cell heterogeneity genes, and improves cell function identification by analyzing the endocrine disease pathway functions involved.
Disclosure of Invention
The invention aims to overcome the limitations of existing cell function identification methods, and accordingly provides a method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions.
The technical solution adopted by the invention to solve the above problem comprises the following steps:
Step 1, extracting cell associated gene characteristics;
Step 2, amplifying cell-associated genes;
Step 3, predicting cell heterogeneity genes;
Step 4, identifying endocrine disease cell functions.
Further, the step of extracting the cell-associated gene features in step 1 comprises:
Step 101, obtaining the significantly differentially expressed genes of the cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the positional information of the genes, wherein for each gene the significance p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
Step 102, taking the p-value and the log2FC value of the matched SNPs as the differential expression information of the gene in the tissue, wherein the eQTL data are derived from pancreas, adipose, blood and muscle tissue;
Step 103, representing the KEGG pathway association information of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
Step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network.
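By way of illustration only, the binary pathway encoding of Step 103 can be sketched as follows. The function name `kegg_binary_vector`, the four-entry pathway catalogue and the pathway identifiers are assumptions made for this example; the text itself specifies only that the vector has 343 dimensions, one per pathway.

```python
from typing import Dict, List, Set


def kegg_binary_vector(gene_pathways: Set[str],
                       pathway_index: Dict[str, int]) -> List[int]:
    """Encode a gene's pathway memberships as a 0/1 vector.

    pathway_index maps each pathway identifier to a fixed position;
    in the described method this catalogue would hold 343 pathways.
    """
    vector = [0] * len(pathway_index)
    for pathway in gene_pathways:
        position = pathway_index.get(pathway)
        if position is not None:  # ignore pathways outside the catalogue
            vector[position] = 1
    return vector


# Toy catalogue of 4 pathways (illustrative identifiers, not the real 343).
catalogue = ["hsa04910", "hsa04930", "hsa04940", "hsa04950"]
pathway_index = {pathway: i for i, pathway in enumerate(catalogue)}

vec = kegg_binary_vector({"hsa04930", "hsa04950"}, pathway_index)
print(vec)
```

Fixing the pathway order once in `pathway_index` keeps every gene's vector aligned dimension by dimension, which is what allows the vectors to be concatenated with the other features of Steps 101 to 104.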
Further, in step 2, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts the multi-label classification problem into a single-label classification problem by treating each label combination of a sample as a new single label. The specific steps comprise:
Step 201, analyzing the distribution of gene label combinations and selecting small-sample genes based on the imbalance ratio IR of the label combinations, IR being defined by formula (1);
In formula (1), $L$ denotes the label set, $L_1$ the 1st label, $|L|$ the number of labels, $N$ the number of genes, and $Y_i$ the label set corresponding to the $i$-th gene;
The mean of the IR values of all labels, mean(IR), represents the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
Step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, its set of k nearest neighbor nodes is selected, the distance between small-sample gene feature vectors being measured by the Euclidean distance; to generate the gene labels, the number of times each label occurs in the small-sample gene and its neighbor nodes is counted, and a threshold is set to synthesize the new labels, the labels of the synthesized gene being expressed as:
$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \tag{2}$$
In formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthesized gene is set to 1, and otherwise to 0;
Step 203, to generate the gene features, a neighbor node is selected at random as the reference neighbor gene for generating the synthetic sample features, and the gene feature $F_{syn}$ is synthesized by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \tag{3}$$
In formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the features of the small-sample gene, and $F_{ref}$ denotes the features of the reference neighbor gene node;
Step 204, selecting the amplification number of the small-sample genes by comparing, across different amplification multiples, the mean(IR) values and the similarity of the gene label and label combination distributions, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
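As a non-limiting sketch of the selection in Step 201, the code below computes a per-label imbalance ratio and picks the small-sample genes. Formula (1) is not reproduced in this text, so the sketch assumes the per-label imbalance ratio commonly used in multi-label learning: the occurrence count of the most frequent label divided by the occurrence count of the label in question. That assumption, the function names and the toy cell-type labels are illustrative only.

```python
from typing import Dict, List, Sequence, Set


def imbalance_ratios(label_sets: Sequence[Set[str]],
                     labels: Sequence[str]) -> Dict[str, float]:
    """Assumed IR(l): count of the most frequent label / count of label l."""
    counts = {l: sum(1 for y in label_sets if l in y) for l in labels}
    max_count = max(counts.values())
    return {l: (max_count / c if c > 0 else float("inf"))
            for l, c in counts.items()}


def small_sample_genes(label_sets: Sequence[Set[str]],
                       ir: Dict[str, float],
                       threshold: float = 10.0) -> List[int]:
    """Indices of genes carrying at least one label with IR(l) > threshold."""
    rare = {l for l, value in ir.items() if value > threshold}
    return [i for i, y in enumerate(label_sets) if y & rare]


# Toy data: 21 genes annotated with three hypothetical cell-type labels.
labels = ["alpha", "beta", "delta"]
label_sets = ([{"beta"}] * 16 + [{"alpha", "beta"}] * 3
              + [{"alpha"}, {"beta", "delta"}])

ir = imbalance_ratios(label_sets, labels)
mean_ir = sum(ir.values()) / len(ir)   # degree of imbalance of the dataset
rare_genes = small_sample_genes(label_sets, ir)
print(ir, mean_ir, rare_genes)
```

Recomputing `mean_ir` after each candidate amplification multiple is one simple way to carry out the comparison described in Step 204.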
Further, the step of predicting cell heterogeneity genes in step 3 comprises:
Step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
Step 302, constructing m binary base classifiers, each of which performs binary classification on one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $L_j$ in it obtains a score; after all classifiers have been trained, the final score of each label $L_j$ is calculated as the average of its scores; if this average is above a threshold, the gene is considered to have significant functional expression in the $j$-th cell type $Cell_j$, and otherwise the gene is considered to have no significant functional expression in that cell type;
Step 303, running the test set data through each of the trained classifiers, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the cells of endocrine disease genes, wherein the parameters of RAkEL are k and m, and m is 14.
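The score averaging and thresholding of Steps 302 and 303 can be sketched as below. This is a minimal illustration, not the patented implementation: the function names, the toy label subsets and scores, and the threshold of 0.5 are assumptions, since the text states only that the averaged score is compared with a threshold.

```python
from typing import List, Sequence, Tuple


def average_label_scores(subsets: Sequence[Tuple[int, ...]],
                         scores: Sequence[Tuple[float, ...]],
                         n_labels: int) -> List[float]:
    """Average, per label, the scores given by every classifier
    whose label subset contains that label."""
    totals = [0.0] * n_labels
    votes = [0] * n_labels
    for subset, subset_scores in zip(subsets, scores):
        for label, score in zip(subset, subset_scores):
            totals[label] += score
            votes[label] += 1
    # A label covered by no classifier receives a score of 0.
    return [t / v if v else 0.0 for t, v in zip(totals, votes)]


def final_labels(avg_scores: Sequence[float],
                 threshold: float = 0.5) -> List[int]:
    """1 if the averaged score exceeds the threshold, else 0."""
    return [1 if s > threshold else 0 for s in avg_scores]


# Three classifiers, each responsible for a subset of 2 out of 3 labels.
subsets = [(0, 1), (1, 2), (0, 2)]
# Per-classifier scores for one gene, aligned with the subsets above.
scores = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.0)]

avg = average_label_scores(subsets, scores, n_labels=3)
predicted = final_labels(avg)
print(avg, predicted)
```

Here label 0 is supported by both classifiers that cover it and is kept, while labels 1 and 2 each receive one vote out of two and fall at, not above, the assumed threshold.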
Further, in step 4, after the cell-associated gene set is obtained, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, which are, respectively, the pathway set obtained by enrichment analysis of the gene set identified by the invention in the i-th cell type and the pathway set obtained from the original gene set. New cell-associated pathways can thus be identified. Because cell function is ultimately reflected in the biological pathways associated with the cell, cell function identification is carried out by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes; the pathway analysis is implemented with the enrichR package in R.
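The set operations of step 4 can be illustrated with the following sketch. The enrichment analysis itself, which the text performs with enrichR in R, is not reproduced here; the enriched pathway sets are taken as given inputs, and all gene, pathway and cell-type names are hypothetical placeholders.

```python
from typing import Dict, Set


def integrate_gene_sets(identified: Dict[str, Set[str]],
                        original: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Merge, per cell type, the newly identified genes with the original genes."""
    cell_types = set(identified) | set(original)
    return {c: original.get(c, set()) | identified.get(c, set())
            for c in cell_types}


def new_pathways(enriched_new: Dict[str, Set[str]],
                 enriched_orig: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Per cell type: pathways enriched for the new gene set
    but not for the original gene set (set difference)."""
    return {c: enriched_new[c] - enriched_orig.get(c, set())
            for c in enriched_new}


# Hypothetical gene sets for one cell type.
original_genes = {"beta": {"GENE_1", "GENE_2"}}
identified_genes = {"beta": {"GENE_2", "GENE_3"}}
merged = integrate_gene_sets(identified_genes, original_genes)

# Hypothetical enrichment results (would come from an external tool).
enriched_orig = {"beta": {"pathway_A", "pathway_B"}}
enriched_new = {"beta": {"pathway_A", "pathway_B", "pathway_C"}}

gained = new_pathways(enriched_new, enriched_orig)
print(merged, gained)
```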
The beneficial effects of the invention are as follows: first, multiple gene features that affect cell function are extracted, and features of genes related to endocrine cell function are generated using a generative adversarial network; second, a SMOTE-based gene amplification method is provided, which amplifies the small-sample genes in the gene label combinations and thereby addresses the low classification accuracy caused by an excess of small-sample genes; finally, a RAkEL-based cell function identification method is provided, in which cell function is identified by analyzing the endocrine disease pathway functions associated with cell heterogeneity genes. Comparison with existing similar methods shows that the method effectively improves the accuracy of identifying endocrine disease cell heterogeneity genes and cell functions.
Drawings
FIG. 1 is a flow diagram of the method of the present invention;
FIG. 2 is a diagram comparing the identification results of the present invention with those of the reference methods.
Detailed Description
Embodiment 1: referring to FIG. 1 and FIG. 2, the method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions according to this embodiment comprises:
Step 1, extracting cell-associated gene features;
Step 2, amplifying cell-associated genes;
Step 3, predicting cell heterogeneity genes;
Step 4, identifying endocrine disease cell functions.
Embodiment 2: referring to FIG. 1 and FIG. 2, in the method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions according to this embodiment, the step of extracting the cell-associated gene features in step 1 comprises:
Step 101, obtaining the significantly differentially expressed genes of the cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the positional information of the genes, wherein for each gene the significance p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
Step 102, taking the p-value and the log2FC value of the matched SNPs as the differential expression information of the gene in the tissue, wherein the eQTL data are derived from pancreas, adipose, blood and muscle tissue;
Step 103, representing the KEGG pathway association information of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
Step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network.
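The selection in Step 101 of the 5 most significant mutation-site p-values per gene can be sketched as follows. The input format (pairs of gene name and p-value), the function name and the padding of genes that have fewer than 5 matched sites with a p-value of 1.0 are assumptions of this example; the text does not say how such genes are handled.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def top_site_pvalues(associations: Iterable[Tuple[str, float]],
                     n_sites: int = 5,
                     pad_value: float = 1.0) -> Dict[str, List[float]]:
    """For each gene, keep the n_sites smallest (most significant) p-values.

    Genes with fewer matched sites are padded with pad_value
    (an assumption: 1.0 stands for "no evidence of association").
    """
    by_gene: Dict[str, List[float]] = {}
    for gene, p_value in associations:
        by_gene.setdefault(gene, []).append(p_value)

    features: Dict[str, List[float]] = {}
    for gene, p_values in by_gene.items():
        best = heapq.nsmallest(n_sites, p_values)
        best += [pad_value] * (n_sites - len(best))
        features[gene] = best
    return features


# Hypothetical gene-to-site association p-values.
associations = [
    ("GENE_A", 1e-8), ("GENE_A", 0.03), ("GENE_A", 1e-5),
    ("GENE_A", 0.2), ("GENE_A", 1e-3), ("GENE_A", 0.5),
    ("GENE_B", 0.01), ("GENE_B", 1e-4),
]
site_features = top_site_pvalues(associations)
print(site_features)
```

Returning a fixed-length list per gene keeps this block of features the same width for every gene, so it can sit alongside the expression, pathway and localization features.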
Embodiment 3: referring to FIG. 1 and FIG. 2, in step 2 of the method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions according to this embodiment, an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts the multi-label classification problem into a single-label classification problem by treating each label combination of a sample as a new single label. The specific steps comprise:
Step 201, analyzing the distribution of gene label combinations and selecting small-sample genes based on the imbalance ratio IR of the label combinations, IR being defined by formula (1);
In formula (1), $L$ denotes the label set, $L_1$ the 1st label, $|L|$ the number of labels, $N$ the number of genes, and $Y_i$ the label set corresponding to the $i$-th gene;
The mean of the IR values of all labels, mean(IR), represents the degree of imbalance of the dataset. A label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
Step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, its set of k nearest neighbor nodes is selected, the distance between small-sample gene feature vectors being measured by the Euclidean distance; to generate the gene labels, the number of times each label occurs in the small-sample gene and its neighbor nodes is counted, and a threshold is set to synthesize the new labels, the labels of the synthesized gene being expressed as:
$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \tag{2}$$
In formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthesized gene is set to 1, and otherwise to 0;
Step 203, to generate the gene features, a neighbor node is selected at random as the reference neighbor gene for generating the synthetic sample features, and the gene feature $F_{syn}$ is synthesized by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \tag{3}$$
In formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the features of the small-sample gene, and $F_{ref}$ denotes the features of the reference neighbor gene node;
Step 204, selecting the amplification number of the small-sample genes by comparing, across different amplification multiples, the mean(IR) values and the similarity of the gene label and label combination distributions, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained.
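Steps 202 and 203 can be sketched, under stated assumptions, as follows. The feature synthesis follows formula (3) exactly as written, which places the synthetic point on the side of the seed away from the reference neighbor; classic SMOTE interpolation uses $(F_{ref} - F_{seed})$ instead, and the sign can be changed in `synth_features` if that behaviour is intended. The function names, the toy data and the choice of half the group size as the label threshold are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(seed: int, features: Sequence[Sequence[float]],
              candidates: Sequence[int], k: int) -> List[int]:
    """Indices of the k small-sample genes closest to the seed gene."""
    others = [i for i in candidates if i != seed]
    others.sort(key=lambda i: euclidean(features[seed], features[i]))
    return others[:k]


def synth_labels(seed: int, neighbours: Sequence[int],
                 labels: Sequence[Sequence[int]], threshold: float) -> List[int]:
    """Formula (2): a label is 1 if it occurs more than `threshold`
    times in the seed gene and its neighbours, else 0."""
    group = [seed, *neighbours]
    n_labels = len(labels[seed])
    return [1 if sum(labels[i][j] for i in group) > threshold else 0
            for j in range(n_labels)]


def synth_features(f_seed: Sequence[float], f_ref: Sequence[float],
                   r: float) -> List[float]:
    """Formula (3) as written: F_syn = F_seed + r * (F_seed - F_ref)."""
    return [s + r * (s - t) for s, t in zip(f_seed, f_ref)]


# Toy small-sample genes: 2-D features and 3 binary labels each.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
labels = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 0]]

rng = random.Random(0)
seed = 0
neighbours = k_nearest(seed, features, candidates=[0, 1, 2, 3], k=2)
new_labels = synth_labels(seed, neighbours, labels,
                          threshold=(len(neighbours) + 1) / 2)
reference = rng.choice(neighbours)   # random reference neighbour
new_features = synth_features(features[seed], features[reference], rng.random())
print(neighbours, new_labels, new_features)
```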
Embodiment 4: referring to FIG. 1 and FIG. 2, in the method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions according to this embodiment, the step of predicting cell heterogeneity genes in step 3 comprises:
Step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
Step 302, constructing m binary base classifiers, each of which performs binary classification on one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $L_j$ in it obtains a score; after all classifiers have been trained, the final score of each label $L_j$ is calculated as the average of its scores; if this average is above a threshold, the gene is considered to have significant functional expression in the $j$-th cell type $Cell_j$, and otherwise the gene is considered to have no significant functional expression in that cell type;
Step 303, running the test set data through each of the trained classifiers, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the cells of endocrine disease genes, wherein the parameters of RAkEL are k and m, and m is 14.
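One possible reading of the label grouping in Step 301, combined with the label-combination encoding that RAkEL applies within each group, is sketched below. Drawing the m subsets at random from all k-label combinations, the function names and the toy values of k and m are assumptions of this example and are not taken from the text.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple


def draw_labelsets(n_labels: int, k: int, m: int,
                   seed: int = 0) -> List[Tuple[int, ...]]:
    """Draw m distinct label subsets of size k.

    Enumerating all combinations is only practical for a small
    number of labels, which is sufficient for this illustration.
    """
    all_subsets = list(combinations(range(n_labels), k))
    if m > len(all_subsets):
        raise ValueError("m exceeds the number of distinct k-label subsets")
    return random.Random(seed).sample(all_subsets, m)


def powerset_class(label_vector: Sequence[int],
                   subset: Sequence[int]) -> str:
    """Treat the label combination restricted to one subset
    as a single class, written as a bit string."""
    return "".join(str(label_vector[j]) for j in subset)


# Toy setting: 5 labels, subsets of k = 3 labels, m = 4 classifiers.
labelsets = draw_labelsets(n_labels=5, k=3, m=4)
gene_labels = [1, 0, 1, 1, 0]
classes = [powerset_class(gene_labels, s) for s in labelsets]
print(labelsets, classes)
```

Each base classifier is then trained on the single-label problem defined by its own subset, and the per-label scores are combined across classifiers as described in Step 302.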
Embodiment 5: referring to FIG. 1 and FIG. 2, in the method for identifying endocrine disease cells according to this embodiment, after the cell-associated gene set is obtained in step 4, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, which are, respectively, the pathway set obtained by enrichment analysis of the gene set identified by the invention in the i-th cell type and the pathway set obtained from the original gene set. New cell-associated pathways can thus be identified. Because cell function is ultimately reflected in the biological pathways associated with the cell, cell function identification is carried out by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes; the pathway analysis is implemented with the enrichR package in R.
The present invention is not limited to the preferred embodiments described above. Although the invention has been described with respect to specific examples, including presently preferred modes of carrying it out, those skilled in the art will appreciate that numerous variations and permutations of the above embodiments fall within the spirit and scope of the invention as set forth in the appended claims.
Claims (1)
1. A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions, characterized in that the method comprises the following steps:
Step 1, extracting cell-associated gene features; the step of extracting the cell-associated gene features comprises:
Step 101, obtaining the significantly differentially expressed genes of the cells from single-cell sequencing data, and matching them against disease mutation site data and tissue-specific differential expression data according to the positional information of the genes, wherein for each gene the significance p-values of the 5 most significantly associated mutation sites are selected as the disease mutation site information of the cell-associated gene;
Step 102, taking the p-value and the log2FC value of the matched SNPs as the differential expression information of the gene in the tissue, wherein the eQTL data are derived from pancreas, adipose, blood and muscle tissue;
Step 103, representing the KEGG pathway association information of the gene as a 343-dimensional binary vector, wherein each dimension indicates whether the gene is associated with the corresponding pathway;
Step 104, analyzing the subcellular localization information of the cell-associated genes, selecting the top 10 subcellular structures to represent the subcellular localization of each gene, and generating cell-function-related gene features using a generative adversarial network;
Step 2, amplifying cell-associated genes; an ensemble multi-label classification model is constructed using the RAkEL framework, which is based on the problem transformation method; RAkEL converts the multi-label classification problem into a single-label classification problem by treating each label combination of a sample as a new single label, the specific steps comprising:
Step 201, analyzing the distribution of gene label combinations and selecting small-sample genes based on the imbalance ratio IR of the label combinations, IR being defined by formula (1);
In formula (1), $L$ denotes the label set, $L_1$ the 1st label, $|L|$ the number of labels, $N$ the number of genes, and $Y_i$ the label set corresponding to the $i$-th gene; the mean of the IR values of all labels, mean(IR), represents the degree of imbalance of the dataset; a label with IR(l) > 10 is generally regarded as a small-sample label, and a gene carrying such a label is referred to as a small-sample gene;
Step 202, synthesizing gene samples from the small-sample genes: for each small-sample gene, its set of k nearest neighbor nodes is selected, the distance between small-sample gene feature vectors being measured by the Euclidean distance; to generate the gene labels, the number of times each label occurs in the small-sample gene and its neighbor nodes is counted, and a threshold is set to synthesize the new labels, the labels of the synthesized gene being expressed as:
$$\mathrm{Label}_{\mathrm{synthGene}} = \{L_1, L_2, \ldots, L_{|L|}\} \tag{2}$$
In formula (2), when the number of occurrences of the $i$-th label among the neighbor nodes is greater than the set threshold, the label $L_i$ of the synthesized gene is set to 1, and otherwise to 0;
Step 203, to generate the gene features, a neighbor node is selected at random as the reference neighbor gene for generating the synthetic sample features, and the gene feature $F_{syn}$ is synthesized by interpolation, expressed as:
$$F_{syn} = F_{seed} + r \times (F_{seed} - F_{ref}) \tag{3}$$
In formula (3), $r$ is a random number in (0, 1), $F_{seed}$ denotes the features of the small-sample gene, and $F_{ref}$ denotes the features of the reference neighbor gene node;
Step 204, selecting the amplification number of the small-sample genes by comparing, across different amplification multiples, the mean(IR) values and the similarity of the gene label and label combination distributions, so that the uniformity of the gene distribution is improved while the main information of the label distribution is retained;
Step 3, predicting cell heterogeneity genes; the step of predicting cell heterogeneity genes comprises:
Step 301, dividing the labels corresponding to the gene samples into m groups, each group being a label subset of k labels;
Step 302, constructing m binary base classifiers, each of which performs binary classification on one label subset and gives a preliminary prediction for its labels: if the label subset corresponding to the $i$-th classifier $P_i$ is $L_i$, each label $L_j$ in it obtains a score; after all classifiers have been trained, the final score of each label $L_j$ is calculated as the average of its scores; if this average is above a threshold, the gene is considered to have significant functional expression in the $j$-th cell type $Cell_j$, and otherwise the gene is considered to have no significant functional expression in that cell type;
Step 303, running the test set data through each of the trained classifiers, each sample obtaining a result from each classifier, and obtaining the final labels of the sample by voting, thereby classifying the cells of endocrine disease genes, wherein the parameters of RAkEL are k and m, and m is 14;
Step 4, identifying endocrine disease cell functions; after the cell-associated gene set is obtained, it is merged with the cell genes in the original dataset to form a new cell-associated gene set, and KEGG pathway enrichment analysis is performed on the new gene set to obtain the set of pathways associated with each cell type; the result of comparing the pathway enrichment analysis of the new cell gene set with that of the original cell gene set is expressed as the difference set between the two pathway sets, which are, respectively, the pathway set obtained by enrichment analysis of the gene set identified in the i-th cell type and the pathway set obtained from the original gene set; new cell-associated pathways can thus be identified; because cell function is ultimately reflected in the biological pathways associated with the cell, cell function identification is carried out by analyzing the endocrine disease pathway functions associated with the cell heterogeneity genes, and the pathway analysis is implemented with the enrichR package in R.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311177280.9A CN117316295B (en) | 2023-09-13 | 2023-09-13 | A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311177280.9A CN117316295B (en) | 2023-09-13 | 2023-09-13 | A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117316295A CN117316295A (en) | 2023-12-29 |
| CN117316295B CN117316295B (en) | 2024-11-15 |
Family
ID=89261267
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311177280.9A Active CN117316295B (en) | 2023-09-13 | 2023-09-13 | A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117316295B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106874706A (en) * | 2017-01-18 | 2017-06-20 | 湖南大学 | Disease association factor identification method and system based on functional module |
| CN116564410A (en) * | 2023-05-23 | 2023-08-08 | 浙江大学 | Method, equipment and medium for predicting mutation site cis-regulatory gene |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2239205A1 (en) * | 1998-05-29 | 1999-11-29 | Gabrielle Boulianne | Extension of lifespan by overexpression of a gene that increases reactive oxygen metabolism |
| US8865657B2 (en) * | 2005-08-19 | 2014-10-21 | New York University | Compositions and methods for inactivating or suppressing inflammatory cells |
| CN109033748A (en) * | 2018-08-14 | 2018-12-18 | 齐齐哈尔大学 | A kind of miRNA identification of function method based on multiple groups |
| CN115427585A (en) * | 2020-02-20 | 2022-12-02 | 居里研究所 | Method for identifying functional disease-specific regulatory T cells |
| CN114627961A (en) * | 2020-12-14 | 2022-06-14 | 北京致成生物医学科技有限公司 | A system for screening and distinguishing molecular markers related to osteoarthritis and rheumatoid arthritis |
| CN112951413B (en) * | 2021-03-22 | 2023-07-21 | 江苏大学 | An Asthma Diagnosis System Based on Decision Tree and Improved SMOTE Algorithm |
| CN113470743A (en) * | 2021-07-16 | 2021-10-01 | 哈尔滨星云医学检验所有限公司 | Differential gene analysis method based on BD single cell transcriptome and proteome sequencing data |
| CN115148286B (en) * | 2022-06-24 | 2025-06-27 | 山东大学 | Cancer synergistic driver module identification system based on single-cell data |
| CN115798593B (en) * | 2022-12-02 | 2025-08-12 | 中国科学院深圳先进技术研究院 | Single cell identification method and device based on self-supervision clustering of graphic neural network |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117316295A (en) | 2023-12-29 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| d’Errico et al. | Automatic topography of high-dimensional data sets by non-parametric density peak clustering | |
| Ratnasingham et al. | A DNA-based registry for all animal species: the Barcode Index Number (BIN) system | |
| CN112926045B (en) | Group control equipment identification method based on logistic regression model | |
| Ypma et al. | Calculating LRs for presence of body fluids from mRNA assay data in mixtures | |
| CN106250925B (en) | A Zero-Shot Video Classification Method Based on Improved Canonical Correlation Analysis | |
| Pouyan et al. | Clustering single-cell expression data using random forest graphs | |
| CN118072825B (en) | A method for identifying and analyzing microorganisms in soil | |
| CN112784921A (en) | Task attention guided small sample image complementary learning classification algorithm | |
| CN114266321A (en) | Weak supervision fuzzy clustering algorithm based on unconstrained prior information mode | |
| CN114550831A (en) | Gastric cancer proteomics typing framework identification method based on deep learning feature extraction | |
| CN118507059A (en) | Risk assessment method and system for IDH mutant glioma based on pathological images | |
| CN104156503A (en) | Disease risk gene recognition method based on gene chip network analysis | |
| Zhao et al. | Automatic individual recognition of wild Crested Ibis based on hybrid method of self-supervised learning and clustering | |
| CN105139037B (en) | Integrated multi-target evolution automatic clustering method based on minimum spanning tree | |
| CN117316295B (en) | A method for identifying endocrine disease cells based on cell heterogeneity genes and pathway functions | |
| CN117437976B (en) | Disease risk screening method and system based on gene detection | |
| Jeong et al. | Effective single-cell clustering through ensemble feature selection and similarity measurements | |
| CN101894216B (en) | Method of discovering SNP group related to complex disease from SNP information | |
| JP3936851B2 (en) | Clustering result evaluation method and clustering result display method | |
| Li et al. | A novel algorithm for training hidden Markov models with positive and negative examples | |
| Salem et al. | A new gene selection technique based on hybrid methods for cancer classification using microarrays | |
| CN118486374B (en) | Tumor cell identification method based on single-cell sequencing data | |
| Giurcărneanu et al. | Fast iterative gene clustering based on information theoretic criteria for selecting the cluster structure | |
| CN116168761B (en) | Method and device for determining characteristic region of nucleic acid sequence, electronic equipment and storage medium | |
| CN118016167B (en) | Cell clustering method, device and medium for unbalanced single-cell RNA-seq data |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |