WO2005006179A1 - A method for identifying biomarkers using fractal genomics modeling - Google Patents
- Publication number
- WO2005006179A1 (application PCT/US2004/022157)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- genes
- gene
- group
- small
- fgm
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Definitions
- The present invention relates to methods of manipulation, storage, modeling, visualization and quantification of datasets.
- One application of the present invention is related to developing point-models of datasets represented by the various points in a multi-dimensional map.
- The invention can be adapted to genomic analysis by Fractal Genomics Modeling (FGM), which can be used to identify biomarkers to develop treatments, diagnoses or prognoses of disease by exploiting the map of interactions and causality (the pathway conjecture) rendered by this technology.
- FGM Fractal Genomics Modeling
- FGM is computationally efficient because the method is performed incrementally, is almost perfectly parallel, and is substantially linear. Consequently, there is no scaling problem with FGM. Furthermore, FGM can be used to identify biomarkers and develop systems for diagnoses or prognoses of disease by exploiting the map of interactions and causality (the pathway conjecture) rendered by this technology.
- FIGS. 1A and 1B are a flow chart of the operational steps for manipulation, storage, modeling, visualization and quantification of datasets;
- FIG. 2 is a flow chart of the operational steps for an iterative algorithm and processing which provides a comparison string;
- FIG. 3 is a model showing an efficient, robust, network structure for information transmission of the kind that has been found in many complex networks, including gene regulatory networks;
- FIG. 4 is a model showing clinical expression of acute lymphoblastic leukemia (ALL) based on gene expression patterns in the ALL genetic network;
- ALL acute lymphoblastic leukemia
- FIG. 7 is a chart showing gene group models used for a 7-gene diagnostic test in Example 3;
- FIG. 8 is a chart showing gene group models used for 5-gene diagnostic tests in Example 3;
- FIG. 9 is a chart showing gene group models used for 10-gene diagnostic tests in Example 3; and
- FIG. 10 is a flow chart showing a downstream causality from two diagnostic 7-gene groups used in the 7-gene test in Example 3.
- FIGS. 1A and 1B show a flow chart of the method of the present invention for generating a multi-dimensional map of one or more target strings in which the target strings can be represented by marked points in the map.
- The target strings correspond to datasets to be analyzed. Each point marked in the map serves as a point-model for one or more target strings.
- The method can be used in the manipulation, storage, modeling, visualization and quantification of datasets in the target strings.
- The dataset in each target string consists of a sequence of numbers of length N*.
- One example of a dataset to be analyzed and its corresponding target string is the yearly income of a population, the target string being each person's income listed in a sequence.
- Another example is the body temperature readings of a group of patients in a hospital ward, with the target string being those readings listed in a sequence.
- A further example is a DNA sequence, in which each type of base (A, C, T, G) is labeled with a number (0, 1, 2, 3), producing a target string with a corresponding numerical sequence.
- A further example is a protein sequence, in which each type of amino acid in the protein chain is labeled with a different number, producing a target string with a corresponding numerical sequence.
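The labeling described in the two examples above can be sketched as follows. This is only an illustration: the function names and the ordering of the amino-acid alphabet are assumptions made here, and only the A, C, T, G to 0, 1, 2, 3 assignment comes from the text.

```python
# Label table from the text: each base maps to one number.
DNA_LABELS = {"A": 0, "C": 1, "T": 2, "G": 3}

def make_labels(alphabet):
    """Assign a distinct integer to each symbol, in the order given."""
    return {symbol: index for index, symbol in enumerate(alphabet)}

# The 20 standard amino acids in one-letter code; this ordering is arbitrary
# and is an assumption of this sketch.
AMINO_ACID_LABELS = make_labels("ACDEFGHIKLMNPQRSTVWY")

def encode_sequence(sequence, labels):
    """Turn a symbol sequence into a numeric target string (a list of ints)."""
    return [labels[symbol] for symbol in sequence.upper()]

target_string = encode_sequence("GATTACA", DNA_LABELS)
print(target_string)  # [3, 0, 2, 2, 0, 1, 0]
```

Any one-to-one labeling would serve equally well; the point is only that the target string becomes a sequence of numbers.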
- As an example for FIGS. 1A and 1B, suppose each dataset to be analyzed is a string of measurements resulting from an experiment involving several thousand genes. Further suppose that there is a number connected with the experimental result from each gene. Such a number could be the gene expression ratio, which represents the differences in fluorescence calculated from the gene combined with some other chemical on a biochip or on a slide.
- The method starts (step 101) by providing a set of M such target strings of length N* (step 103).
- A region, R, is selected (step 104) such that each point in the region can serve as the domain of an iterative function.
- The iterative algorithm calculates the comparison string from a point, p, in some region, R.
- The region, R, is in the complex plane corresponding to the area in and around the Mandelbrot Set.
- Although the Mandelbrot Set is used in the preferred embodiment of the present invention, other sets, such as Julia Sets, may also be used.
- Every point within the Mandelbrot Set can be made to correspond to a data sequence of arbitrary length. Because the Mandelbrot Set is made up of an infinite number of points, the method allows any number of datasets containing any number of values to be compared by mapping the datasets to points in or near the Mandelbrot Set.
- The Mandelbrot Set is an extremely complex fractal.
- The term fractal is used to describe non-regular geometric shapes that have the same degree of non-regularity on all scales. It is this property of self-similarity that allows pictures of artificial systems built from fractals to resemble complex natural systems.
- A comparison string of length N is also provided (step 107).
- The comparison string is generated from a point, p, in the region, R, by applying an iterative algorithm N times, which yields a comparison string of length N.
- The comparison string is also a data string and may be of any length relative to the target string.
- FIG. 2 shows an example of the steps involved in an iterative algorithm to generate a comparison string of length N provided in step 107 of FIG. 1A.
- The algorithm of FIG. 2 for the Mandelbrot Set is an example of an algorithm that can be used. If a set of points from a different iterative domain is used in this method instead of the Mandelbrot Set, a different algorithmic function would be used for that set of points.
- The algorithm starts (201), and a counter, n, is initialized to zero (step 221).
- A variable to be used in the algorithm, z_0, is initialized to zero (step 227).
- A point, p, is chosen from region R, preferably the region corresponding to the area in and around the Mandelbrot Set (step 231). An example of choosing such a point might be to overlay a grid upon the Mandelbrot Set and then choose one of the points in the grid.
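One way to picture the grid overlay mentioned above is the sketch below. The rectangle bounds, the grid resolution and the function name `grid_points` are assumptions for illustration; the text only says that a grid is laid over the Mandelbrot Set and a point is chosen from it.

```python
def grid_points(re_min, re_max, im_min, im_max, steps):
    """Return steps x steps complex points evenly spaced over a rectangle.

    `steps` must be at least 2 so that both edges of the rectangle are hit.
    """
    points = []
    for i in range(steps):
        for j in range(steps):
            re = re_min + (re_max - re_min) * i / (steps - 1)
            im = im_min + (im_max - im_min) * j / (steps - 1)
            points.append(complex(re, im))
    return points

# Assumed bounds: a rectangle that encloses the Mandelbrot Set.
region = grid_points(-2.0, 0.5, -1.25, 1.25, steps=5)

# Pick one grid point as the candidate point p (here the centre of the grid).
p = region[12]
print(p)  # (-0.75+0j)
```

A finer grid simply yields more candidate points, each of which can produce its own comparison string.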
- n = N (step 241), or when the iteration is stopped because the absolute value of z_{n+1} is greater than 2.0, or
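The loop described in these steps might look like the sketch below. Several parts are assumptions rather than statements from this excerpt: the update rule z_{n+1} = z_n^2 + p is the standard Mandelbrot recurrence, which the excerpt does not spell out; recording the absolute value of z at each step is just one possible way to obtain a number per iteration; and returning `None` when the orbit escapes is a choice made here.

```python
def comparison_string(p, length):
    """Iterate z -> z*z + p from z = 0 and record |z| after each step.

    Returns a list of `length` values, or None if |z| exceeds 2.0 before
    `length` values have been collected (the orbit has escaped).
    """
    z = 0j          # z_0 is initialized to zero
    values = []
    for _ in range(length):   # the counter n runs from 0 up to N
        z = z * z + p         # assumed recurrence: z_{n+1} = z_n^2 + p
        if abs(z) > 2.0:      # stopping condition named in the text
            return None
        values.append(abs(z))
    return values

inside = comparison_string(complex(-0.1, 0.1), 8)   # stays bounded
outside = comparison_string(complex(1.0, 1.0), 8)   # escapes at the second step
```

A point whose orbit escapes early cannot supply a full-length comparison string, which is why the escape test sits inside the loop.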
- The numbers in the comparison string may need to be transformed to have values within a value set of interest (step 281).
- Suppose the numbers in the example target string representing gene expression ratios are real numbers between 0 and 10. If we wish to explore the similarities between the comparison string and the target string, the value set of interest would be the real numbers between 0 and 10. The numbers of the comparison string may need to undergo some transformation to produce real numbers in this range.
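The text leaves the transformation open. A simple candidate, shown here purely as an assumption, is a linear min-max rescaling onto the interval from 0 to 10; the function name `rescale` and the sample values are hypothetical.

```python
def rescale(values, low=0.0, high=10.0):
    """Linearly map values so the smallest becomes `low` and the largest `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant string has no spread; place every value at the midpoint.
        return [(low + high) / 2.0 for _ in values]
    scale = (high - low) / (v_max - v_min)
    return [low + (v - v_min) * scale for v in values]

raw = [0.2, 0.5, 1.7, 1.1]   # hypothetical comparison-string values
scaled = rescale(raw)        # now spans the value set of interest, 0 to 10
```

Because a correlation coefficient is unchanged by positive linear rescaling, this particular transformation matters mainly when the score depends on absolute values.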
- An optional step is to determine if certain properties of the comparison string should be marked (step 109). Examples of properties that might be marked are the mean value of the comparison string or the Shannon entropy. If certain properties of the comparison string should be marked (step 109), mark the properties of the comparison string (step 111). Optionally, the comparison string can also be checked to determine if it meets pre-scoring criteria (step 113), regardless of whether the properties of the comparison string are marked. This step involves preliminary testing of the comparison string's properties alone as criteria to initiate scoring.
- Examples of pre-scoring criteria are measuring the mean value of the comparison string to see if it is higher or lower than desired and determining if the Shannon entropy of the comparison string is too low or too high.
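A pre-scoring check along these lines could be sketched as follows. The acceptance ranges are placeholders, and computing the entropy over rounded values is an assumption made here, since the text does not say how real-valued strings are binned.

```python
import math
from collections import Counter

def mean_value(values):
    """Arithmetic mean of the comparison string."""
    return sum(values) / len(values)

def shannon_entropy(values, decimals=0):
    """Entropy in bits of the empirical distribution of the rounded values."""
    counts = Counter(round(v, decimals) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def meets_prescoring(values, mean_range=(2.0, 8.0), entropy_range=(1.0, 4.0)):
    """True if both the mean and the entropy fall inside their (assumed) ranges."""
    m = mean_value(values)
    h = shannon_entropy(values)
    return (mean_range[0] <= m <= mean_range[1]
            and entropy_range[0] <= h <= entropy_range[1])

print(meets_prescoring([1, 3, 5, 7, 9, 2, 4, 6]))  # True
print(meets_prescoring([5, 5, 5, 5]))              # False
```

The second string fails because a constant string has zero entropy, so it is rejected before any scoring against target strings takes place.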
- A subregion of the region may, for example, be part of a grid. It may be determined that the rest of the points in that subregion will not be considered, even though the original intent was to consider all points in the region.
- If the comparison string is pre-scored as described above and it does not meet the pre-scoring criteria (step 113), then the current comparison string is no longer under consideration. Another comparison string is instead provided (step 107).
- The new comparison string is generated using the exemplary iterative algorithm of FIG. 2 on a new point, p, from region R. If the comparison string is pre-scored and it meets the pre-scoring criteria (step 113), then scoring of the comparison string is performed (step 121). Scoring refers to some test of the comparison string using the target string. Scoring of the comparison string can also be performed without marking the properties of the comparison string or pre-scoring the comparison string. In the example of real numbers r falling in the range between 0 and 10 described above, the score could be the correlation coefficient between the comparison string consisting of numbers r and the target string.
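As a sketch of the correlation-based score mentioned above, the Pearson coefficient can be computed directly. Two choices here are assumptions: strings of unequal length are truncated to their common length, and a constant string is given a score of zero.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here: a constant string scores zero
    return cov / math.sqrt(var_x * var_y)

def score(comparison, target):
    """Score a comparison string against a target string by correlation."""
    k = min(len(comparison), len(target))  # assumed handling of N != N*
    return correlation(comparison[:k], target[:k])
```

For instance, `score([1, 2, 3, 4], [2, 4, 6, 8])` is 1 up to rounding, since the second string is a positive multiple of the first.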
- the comparison string generated from this marked point with the iterative algorithm represents the target string, M.
- Marking can be used in an environment where a pixel or character corresponds to point p on a visual display or marking can refer to annotating the coordinates of point p in some memory, a database or a table.
- the point is marked by changing some graphical property of the corresponding pixel, such as color, or changing the corresponding character.
- the point may also be marked by annotating the coordinates of point p in some memory, a database or a table based on the score.
- point p can be marked, either additionally or solely, according to quantification of properties of the comparison string, without regard to the score.
- Such properties can be general, such as using some color or annotation to reflect the mean value of the string being in a certain range or markings reflecting the number of 3's in the string, or the value of the Shannon entropy.
- marking can be used as an aid in searching for preliminary criteria for scoring.
- marking point p it may be determined that an entire subregion of the region has a large number of points that do not meet the scoring criteria or other properties. For example, this subregion may be part of a grid. It may be determined that the rest of the points in that subregion will not be considered, even though the original intent was to consider all points in the region.
- step 129 determine if a sufficient number of the M target strings have been checked for the comparison string derived from point p (step 129). For instance, in our gene expression example, there may be several experiments or datasets that are being scored against each comparison string. If more of the M target strings should be checked, the comparison string is scored against another of the M target strings (step 121). The comparison string can be used to compare to all M target strings. Not all of the target strings may exhibit similarity to a comparison string, and, therefore, not all target strings may be marked. Also, more than one target string may demonstrate some homology with a comparison string. Moreover, target strings may be marked multiple times, exhibiting correlative relationships to multiple comparison strings.
- step 129 determines if a sufficient number or all of the M target strings have been checked (step 129). If more of the points corresponding to comparison strings should be checked, provide another comparison string (step 107).
- the new comparison string is generated using the same iterative algorithm as used in generating the previous comparison string, such as the one detailed illustratively in FIG 2, on a new point, p, from region R. Any number of the same M target strings will then be used to score the new comparison string. If a sufficient number of points corresponding to comparison strings has been checked (step 133), the scoring process stops. In the case of determining the points, p, from a grid, this could be the number of points in the grid.
- the highest scoring point or points are then mapped (step 137).
- Mapping refers to placing the coordinates of highest scoring point or points in memory, a database or a table.
- the target string or strings, represented by the coordinates may also be visually marked on a visual display.
- Target strings may be analyzed and/or compared by examining, either visually or mathematically, their relative locations and/or absolute locations within the region, R.
- when scoring uses similarity measures between the comparison strings and the target strings, target strings with greater similarity are generally mapped closer to each other, in terms of Euclidean distance on the map. This is because comparison strings with greater similarity are generally closer to each other on the map. However, this is not always true because the metrics involved are more complicated.
- shading of points corresponding to comparison strings with high scores for a given target string represents a metric which shows similarity between this target string and others mapped in this shaded region.
- the target strings in this case, however, may not appear close together on the map or display but can be identified as being similar.
- determine whether points in region R should be marked (in a similar manner as previously described) based on their relative scores or properties compared to other points in region R (step 139). If it is determined that the points should be marked (step 139), mark the points (step 141). For example, one might wish to mark all the points whose score falls within 10% of the highest score of a chosen target string, or mark points whose comparison strings have the lowest or highest Shannon entropy for the region.
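The "within 10% of the highest score" example can be sketched as below. The reading of "within 10%" as a relative cutoff on positive scores is an assumption, and the point coordinates and scores are hypothetical.

```python
def mark_near_best(scores, tolerance=0.10):
    """Return the points whose score is within `tolerance` (relative) of the best score.

    Assumes all scores are positive, so the cutoff is best * (1 - tolerance).
    """
    best = max(scores.values())
    cutoff = best * (1.0 - tolerance)
    return sorted(p for p, s in scores.items() if s >= cutoff)

# Hypothetical scores for four points of region R against one target string.
scores = {
    (0.0, 0.0): 0.50,
    (0.1, 0.0): 0.91,
    (0.1, 0.1): 0.99,
    (0.2, 0.1): 0.88,
}

marked = mark_near_best(scores)  # [(0.1, 0.0), (0.1, 0.1)]
```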
- this subregion may be part of a grid. This may be used to determine whether this subregion is of interest or not.
- determine if a subregion of R is of interest (step 143). If a subregion of R is of interest (step 143), then this subregion is examined with higher resolution, called zooming (step 147). The subregion of R replaces the previous region R (step 104 of FIG. 1A).
- Comparison strings will be generated from the new subregion of R and will be scored against any number of the same set of M target strings originally provided. Points in a subregion of interest, which were previously unchecked, will be examined because the new region, R, is a higher resolution version of the subregion of interest. The points in the subregion will tend to produce a greater percentage of similar comparison strings to those previously examined in region R. If the subregion of interest is a high scoring region this will, in general, produce a greater percentage of high scores and some differences will emerge to produce higher scores or properties which are closer to some desired criteria.
- the target strings and comparison strings may optionally be transformed to attempt to improve the precision and resolution of the mapping and marking in the method.
- the target string values, instead of being real numbers from 0 to 10, were binned into 10 contiguous intervals, such that the first bin corresponds to real number values from 0 to 1, the second bin to real number values from 1 to 2, etc.
- these bins were labeled 0 to 9.
- the target string would then be a string of integers with values from 0 through 9.
- a similar transformation was done on the transformed comparison strings.
- the method is performed and after zooming (step 147), the gene expression ratios and comparison strings are split into 20 such bins from 0 to 0.5, 0.5 to 1.0, etc.
- the target and comparison strings will be re-scaled before repeating the process in the new subregion (104 of FIG. 1A).
- This re-scaling can improve the precision and accuracy of the mapping and marking in the method.
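The binning transformation and its refinement after zooming can be sketched as follows. The treatment of bin edges (half-open intervals, with the top edge folded into the last bin) is an assumption; the specification gives only the interval endpoints.

```python
def bin_values(values, bin_width, n_bins):
    """Map each real value to an integer bin label in 0..n_bins-1."""
    labels = []
    for v in values:
        label = int(v // bin_width)
        labels.append(min(max(label, 0), n_bins - 1))  # clamp, e.g. 10.0 goes to the last bin
    return labels

# Hypothetical string of real values in the range 0 to 10.
values = [0.3, 1.0, 4.7, 9.99, 10.0]

coarse = bin_values(values, 1.0, 10)  # [0, 1, 4, 9, 9]     ten bins of width 1
fine = bin_values(values, 0.5, 20)    # [0, 2, 9, 19, 19]   twenty bins of width 0.5
```

The same function would be applied to both the target string and the comparison strings so that they remain on a common scale.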
- 1B ends (step 199). This generally results when there is no improvement in the score after some number of zooms. It should be apparent to one skilled in the art that this technique can be used to study the behavior of any (scoring) function that uses the target strings and the comparison strings as variables. Attempting to find the highest value of the similarity measure scoring function is a particular case of this. As such, this method could be used to attempt to optimize any scoring function, using a target string or multiple target strings and comparison strings as variables, to find the function's minima and maxima. In addition, each comparison string can simply be used alone as input into the variables of a scoring function for such a purpose. It should be apparent to one skilled in the art that this method can be used for data compression.
- the model of the target string represented by a comparison string is sufficiently similar to the target string, and the coordinates of the point p corresponding to that comparison string can be represented in a more compact way than the target string, then the target string can be replaced with its more compact representation in the form of the coordinates of point p.
- This method has special applicability to multiple large datasets. Uses for this method include analysis of DNA sequence data, protein sequence data, and gene expression datasets. The method can also be used with demographic data, statistical data, and clinical (patient) data.
- uses for this method are not limited to these datasets, however, and may be applied to any type of data or heterogeneous mixtures of different data types within datasets. Some of the steps of this method can involve determinations and interventions made by a user of the method or they can be automated.
- FGM Fractal Genomics Modeling
- FIG. 3 illustrates an efficient, robust, network structure for information transmission of the kind that has been found in many complex networks, including gene regulatory networks.
- the points represent what are called nodes in the networks, and the lines represent what are called links.
- the nature of the type of network shown in FIG. 3 is that there are few nodes with many direct links, forming hubs, and many nodes with only a small number of direct links.
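The hub structure described above shows up in the degree distribution P(k), the fraction of nodes having exactly k direct links. The sketch below computes P(k) for a small invented network; the node names and edges are hypothetical and serve only to show the calculation.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes having exactly k direct links."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n_nodes = len(degree)
    nodes_per_k = Counter(degree.values())
    return {k: c / n_nodes for k, c in sorted(nodes_per_k.items())}

# Hypothetical toy network: a hub "H" (5 links), a smaller hub "S" (3 links),
# and six peripheral nodes with a single link each.
edges = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"),
         ("H", "S"), ("S", "e"), ("S", "f")]

p_k = degree_distribution(edges)  # {1: 0.75, 3: 0.125, 5: 0.125}
```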
- FIG. 3 can also represent a gene and the lines can represent correlated behavior to other genes in a genetic network. The most "connected" genes, or genes with the most direct links, would be ones in the center of this picture. As an example, FIG.
- ALL acute lymphoblastic leukemia
- SRG super-regulatory genes
- SRG/Carcinogenic Due to mutation, biochemical or environmental factors, or chance, a group of genes residing somewhere in the ringed area labeled SRG/Carcinogenic begins a cascade through the network that propagates into a clinical expression of cancer. Further downstream are nodes in the network that define the clinical outcome as a specific type of cancer, illustrated by the group of genes labeled leukemia and, still further downstream, as a subclass of leukemia, ALL (and extending out to genes not seen). It should be noted that this is not a simple cascade from the center outward. Many interconnecting pathways are involved with both promoter and inhibitory links between genes. FGM is a hybrid technique that blends some of the concepts of wavelet analysis with cluster analysis.
- FGM "wavelets" are a series of real-valued numbers derived from complex logistic maps, such as Julia sets, generated from iterations of a single point in the complex plane. FGM searches points on the complex plane for the model that gives the greatest Pearson correlation with the actual localized data, using a minimum cutoff correlation whose absolute value is > 0.95.
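A rough sketch of such a search is given below. It is not the algorithm of FIG. 2, which is not reproduced in this section. As assumptions for illustration, the comparison string is taken to be the sequence of magnitudes |z| produced by iterating the quadratic map z -> z*z + c from the starting point p, with a fixed and arbitrarily chosen constant c, and the search is a plain grid scan. All names and parameter values are hypothetical.

```python
import math

def comparison_string(p, c=complex(-0.4, 0.6), length=7, escape=1e6):
    """Iterate z -> z*z + c starting at p and record |z| after each step."""
    z, out = p, []
    for _ in range(length):
        z = z * z + c
        if abs(z) > escape:      # orbit diverged: no usable string for this point
            return None
        out.append(abs(z))
    return out

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def best_point(target, grid_n=41, span=1.5, cutoff=0.95):
    """Scan a square grid and return (point, r) with the largest |r| above the cutoff."""
    best_p, best_r = None, 0.0
    for i in range(grid_n):
        for j in range(grid_n):
            p = complex(-span + 2 * span * i / (grid_n - 1),
                        -span + 2 * span * j / (grid_n - 1))
            s = comparison_string(p, length=len(target))
            if s is None:
                continue
            r = pearson(s, target)
            if abs(r) > abs(best_r):
                best_p, best_r = p, r
    return (best_p, best_r) if abs(best_r) > cutoff else (None, best_r)

# Synthetic target (not gene expression data): a string generated from a known point.
target = comparison_string(complex(0.3, -0.15))
point, r = best_point(target)
```

Because the absolute value of the correlation is used, a point whose string is strongly negatively correlated with the target also qualifies as a model.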
- the similarity metric between point-models on the complex plane found in this way is very intricate but, in general, similar models tend to cluster in similar areas on the surface. This is particularly true if the point-models fall within a given "threshold" determined by Euclidean measure.
- FGM can be used to model the gene expression of small groups of genes, each having n number of genes (for example, n is 7 or 14 genes) from a much larger gene pool.
- the larger gene pool can be a sample of an organism's genome or of an organism's entire genome, such as the entire human genome.
- the genes in the gene pool can be arranged randomly in microarrays of commercial gene chips (e.g., Affymetrix Human Genome U95A chips consisting of about 12,000 genes) to measure the gene expression levels of the genes.
- at least one small characterizing group of genes must exist.
- the FGM method for modeling gene expression of a small group of genes in a genetic network of a subject comprises the following steps: (a) providing a dataset of gene expression values of the small group of genes from the subject; (b) providing a surface wherein each point on the surface can serve as a domain for an iterative algorithm; (c) selecting a point on the surface; (d) generating a comparison string from the selected point using the iterative algorithm; (e) scoring the comparison string against the gene expression values in the dataset; (f) determining if the score of the comparison string meets a predetermined condition or property; and (g) marking the point if the score meets the pre- determined condition or property to generate a fractal genomics modeling (FGM) model of the target string on the surface.
- the method above can include the additional steps of repeating the steps (c) through (g) for a plurality of gene expression values from a plurality of small groups of gene in the genetic network to generate a plurality of FGM models on the surface.
- Identifying Biomarkers within the point-models (also known as FGM models) on the FGM surface clusters are found containing models of the same gene groups corresponding to only one of the phenotypes. If such a gene group is found, it is then individually tested across all datasets to verify that between these n-gene patterns the Pearson correlation, or any other suitable correlations, is markedly different depending on the phenotype from which the dataset is drawn. If such a gene group is found, further testing is done to choose the n-gene group from the sample within the cluster that produces the most marked difference. Such a gene group or its FGM model then becomes a candidate biomarker for the particular phenotype being studied and provides insight into the biochemical pathways linked to the phenotype present.
- the biomarker can then be used to develop treatments, diagnoses or prognoses of diseases.
- a diagnostic test can also be designed to diagnose a disease of a test subject by comparing the gene expression values of the phenotype of the test subject against the biomarker .
- the method for identifying the biomarker for a phenotype includes the steps of: (a) identifying clusters containing FGM models of the small group of genes corresponding to the phenotype; (b) individually testing each of the small group of genes across all datasets to verify that the pre-determined condition or property between the small groups of genes is markedly different with regard to the phenotype; and (c) selecting the small group of genes that produces the most marked difference in the pre-determined condition or property as a biomarker for the particular phenotype.
- the method for identifying the biomarker for a phenotype includes the steps of: (a) providing a plurality of datasets of gene expression values wherein each dataset is from a small group of genes, and the plurality of datasets is from one or more subjects having the phenotype; (b) providing a surface wherein each point on the surface can be served as a domain for an iterative algorithm; (c) selecting a point on the surface; (d) generating a comparison string from the selected point using the iterative algorithm; (e) scoring the comparison string against the gene expression values in the dataset; (f) determining if the score of the comparison string meets a pre-determined Pearson correlation value; (g) marking the point if the score meets the pre-determined Pearson correlation value to generate a FGM model of the target string on the surface; (h) repeating steps (c) through (g) for a plurality of the datasets to generate FGM models for said plurality of datasets; (i
- Down's Syndrome This example demonstrates the use of FGM both to provide evidence of scale-free genetic network in Down's Syndrome and to identify specific small gene groupings, consisting of 7 genes, that can serve as biomarkers relating to Down's Syndrome.
- FGM was used to model small groups of 7 genes from much larger microarrays (Affymetrix Human Genome U95A chips) consisting of 12,558 genes. The data was derived from fibroblasts of 4 subjects with and 4 subjects without Down's Syndrome, totaling 8 subjects. The number of genes within the groups, in this case 7, was decided using the criterion of picking a relatively small number, in the range of 5-20, that divides 12,558 evenly (12,558 / 7 = 1,794 groups).
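The group-size criterion can be checked with a few lines of arithmetic. The sketch below lists every group size between 5 and 20 that divides the 12,558 genes on the chip evenly, together with the resulting number of groups.

```python
TOTAL_GENES = 12_558  # genes on the Affymetrix U95A chip, as stated in the text

# Group sizes in the range 5-20 that leave no remainder.
candidates = [n for n in range(5, 21) if TOTAL_GENES % n == 0]
groups = {n: TOTAL_GENES // n for n in candidates}

print(candidates)              # [6, 7, 13, 14]
print(groups[7], groups[14])   # 1794 897
```

The sizes 7 and 14 used in Examples 1 and 2 are two of the four sizes that satisfy the criterion.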
- FGM models were scored based on their overall Pearson correlation, using a minimum cutoff correlation of absolute value > 0.95.
- FIG. 5 is the log-log plot of k vs. P(k) of gene expression data for an arbitrarily chosen Downs and Control sample.
- FIG. 6 is the same plot derived from clusters for all samples on the same FGM map. This is, in effect, a picture of the combined genome for all the data. The picture conveyed from FIGS. 5 and 6 together brings to light further notions of universal constructs within such complex networks. Using the method described above, a 7-gene group was discovered that corresponded only to subjects with Down's syndrome. The corresponding results are shown in Tables 1 and 2.
- Table 1 Ranked absolute values of the Pearson Correlation for 7-gene FGM models with the 7-gene biomarker candidate model (left) and the corresponding correlations with actual expression values (right). Down's subject marked with "D” (The model/actual values of the genes from subject marked with * were used.) Subject Pearson Subject Pearson 6-194D* 6-194D 1 4213-34D 1 42135-34D 0.97 5-186D 1 5-186D 0.97 7-197D 0.92 10-AOlC 0.89 8-367C 0.87 7-197D 0.88 10-AOlC 0.83 3648FC 0.85 9-367C 0.62 8-367C 0.84 3648FC No model found 9-367C 0.71
- Table 2 The 7-gene Down's Syndrome biomarker candidate. Model and actual gene expression values for subject 6-194D (produced highest correlation in Table 1) and description of genes in the group. *Denotes the fact that this model is negatively correlated to the actual values (absolute Pearson used in model scoring).
- FGM Model Values* and Actual Values:
  FGM Model Values*   Actual Values
  57.5                200.9
  112.3               22.5
  70.9                170.7
  106.9               7.9
  103.3               8.2
  99.7                14.5
  112.9               4.7
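The negative correlation noted for Table 2 can be checked directly from the listed values. The sketch below assumes the fourteen numbers alternate between model value and actual value, one pair per gene; that pairing is inferred from the column headings and is not stated explicitly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Table 2 values for subject 6-194D, read as (model, actual) pairs (assumed pairing).
model = [57.5, 112.3, 70.9, 106.9, 103.3, 99.7, 112.9]
actual = [200.9, 22.5, 170.7, 7.9, 8.2, 14.5, 4.7]

r = pearson(model, actual)
print(f"r = {r:.3f}, |r| > 0.95: {abs(r) > 0.95}")
```

Under this pairing the coefficient is strongly negative, and its absolute value exceeds the 0.95 cutoff used for model scoring.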
- Example 2 Identification of Biomarkers in Human Immunodeficiency Virus (HIV) Infection
- FGM was used to model small groups of 14 genes from much larger microarrays (Affymetrix Human Genome U95A chips) consisting of 12,558 genes.
- the data was derived from the brain tissue of 5 HIV-1 negative and 4 HIV-1 infected subjects, totaling 9 subjects.
- the number of genes within the groups, in this case 14, was decided using the criteria of picking a relatively small number — in the range of 5-20 — that goes evenly into 12,558.
- 897 14-gene groups were established.
- Table 3 Ranked absolute values of the Pearson Correlation for 14-gene FGM models with the 14-gene biomarker candidate model (left) and the corresponding correlations with actual expression values (right). HIV-1 positive subjects are marked with "+". (The model/actual values for the genes from the subject marked with * were used.)
  Subject (model)   Pearson           Subject (actual)   Pearson
  G0036+*                             G0036+*
  G0017+            0.98              D97 2916-          0.97
  H0011+            0.94              G0017+             0.96
  H0002+            0.91              H0011+             0.94
  G0010+            0.86              G0010+             0.92
  BTB 3455-         No model found    H0002+             0.89
  BTB 3648-         No model found    BTB 3648-          0.88
  BTB 3749-         No model found    BTB 3455-          0.72
  D97 2916-         No model found    BTB 3749-          0.71
- Table 4 HIV-1 brain biomarker candidate. Model and actual gene expression values for subject G0036+ (produced highest correlation in Table 3) and description of genes in the group.
- FGM Model
- U39067 Homo sapiens translation initiation factor eIF3 p36 Cluster Incl.
- AL050106 Homo sapiens mRNA; cDNA DKFZp586I1319 Cluster Incl.
- AF047181 Homo sapiens NADH-ubiquinone oxidoreductase Cluster Incl.
- AF007872 Homo sapiens torsinB (DQ1) Cluster Incl.
- AF007871 Homo sapiens torsinA Cluster Incl.
- AB011116 Homo sapiens mRNA for KIAA0544
- AF032456 Homo sapiens ubiquitin conjugating enzyme
- D87454 Human mRNA for KIAA0265 gene Cluster Incl.
- AF001383 Homo sapiens amphiphysin II mRNA Cluster Incl.
- U69263 Human matrilin-2 precursor mRNA Cluster Incl.
- D31889 Human mRNA for KIAA0072 gene Cluster Incl.
- AL050265 Homo sapiens mRNA Cluster Incl.
- Example 3 Genetic Network and Biomarkers in Leukemia Input data from the study produced by Golub et al. (Golub T. R., et al, Science, Vol. 286, pp. 531-536, 1999) are used in this example in order to further demonstrate the utility of the present invention.
- the data in the Golub study contained Affymetrix gene expression data for 7070 genes acquired from patients diagnosed with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
- ALL acute lymphoblastic leukemia
- AML acute myeloid leukemia
- the data was composed of a training set of data from 27 ALL patients and 11 AML patients to develop diagnostic approaches based on the Affymetrix data and an independent set of 34 patients for testing.
- the second of the 7-gene groups contains the following genes: 1. Onconeural ventral antigen-1 (Nova-1) mRNA 2. Ini1 mRNA 3. RORA RAR-related orphan receptor A 4. FUSE binding protein mRNA 5. Rar protein mRNA 6. Fetal ALZ-50-reactive clone 1 (FAC1) mRNA 7. MB-1 gene. These two 7-gene group models were used for a 7-gene diagnostic test. The two 7-gene group model values from two patients in the training set (above) were used to characterize ALL in the independent set.
- the test was an OR test, where if the corresponding 7-gene models in the independent set patients had a Pearson correlation with either of these 7-gene model values such that the absolute value was > 0.95, the patient was classified as ALL.
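The OR test can be sketched as below. The patterns and patient models are invented 7-value strings used only to exercise the rule; labeling a patient "AML" when neither pattern matches is an assumption based on the two-class dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def classify_or(patient_models, patterns, cutoff=0.95):
    """Return "ALL" if any pattern correlates with the patient's corresponding
    7-gene model with |r| > cutoff; otherwise "AML" (assumed default)."""
    for model, pattern in zip(patient_models, patterns):
        if abs(pearson(model, pattern)) > cutoff:
            return "ALL"
    return "AML"

# Hypothetical diagnostic patterns for gene group 1 and gene group 2.
patterns = [[1, 2, 3, 4, 5, 6, 7], [7, 1, 6, 2, 5, 3, 4]]

# Hypothetical patients: each has one model per gene group.
patient_a = [[2, 4, 6, 8, 10, 12, 14], [1, 2, 3, 4, 5, 6, 7]]       # matches pattern 1
patient_b = [[4, 1, 7, 2, 6, 3, 5], [1, 2, 3, 4, 5, 6, 7]]          # matches neither
patient_c = [[4, 1, 7, 2, 6, 3, 5], [-7, -1, -6, -2, -5, -3, -4]]   # inverse of pattern 2

labels = [classify_or(p, patterns) for p in (patient_a, patient_b, patient_c)]
```

Patient C is labeled ALL because the rule uses the absolute value of the correlation.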
- the results for the 7-gene grouping are as follows: Overall Accuracy 0.853 ALL only 0.95 AML only 0.714
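The three reported figures are mutually consistent, as the arithmetic below shows. The class split of the 34-patient independent set (20 ALL, 14 AML) is taken from the Golub et al. study rather than from this text, and the counts of correctly classified patients (19 and 10) are inferred from the reported rates.

```python
# Assumption: class split of the independent set in Golub et al. (1999).
N_ALL, N_AML = 20, 14

def accuracies(correct_all, correct_aml):
    """Return (overall, ALL-only, AML-only) accuracy, rounded to 3 decimals."""
    overall = (correct_all + correct_aml) / (N_ALL + N_AML)
    return (round(overall, 3),
            round(correct_all / N_ALL, 3),
            round(correct_aml / N_AML, 3))

# Inferred counts: 19 of 20 ALL and 10 of 14 AML patients classified correctly.
print(accuracies(19, 10))  # (0.853, 0.95, 0.714)
```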
- Pathways related to this result comprise the Ras-Independent pathway in NK cell- mediated cytotoxicity.
- the gene of special interest from this result is MB-1 gene.
- the second 7-gene group above allows for the differentiation of patients with ALL into those who have T-cell ALL and those who have B-cell ALL.
- the test using this 7-gene group model was 100% accurate in the test set in classifying B-cell vs. T-cell ALL (see Table 7).
- the gene segments used are summarized in Table 8.
- Onconeural ventral antigen- 1 (Nova-1) mRNA U04840_at 828.711548 93
- Fetal Alz-50-reactive clone 1 (FAC1) mRNA U05237_at 1616.58411 635
- FIG. 8 is a chart showing gene group models used for 5-gene diagnostic tests. Five different gene model value sets consisting of four 5-gene groups each (20 genes total) were used to create five different 5-gene diagnostic tests.
- Results from 5-gene test 1 Overall Accuracy 0.824 ALL only 0.8 AML only 0.857 Results from 5-gene test 2: Overall Accuracy 0.735 ALL only 0.8 AML only 0.643 Results from 5-gene test 3: Overall Accuracy 0.824 ALL only 0.8 AML only 0.857 Results from 5-gene test 4: Overall Accuracy 0.765 ALL only 0.8 AML only 0.714 Results from 5-gene test 5: Overall Accuracy 0.735 ALL only 0.75 AML only 0.714 Pathways related to this result comprise: Regulation of hematopoiesis by cytokines, IL-2 Receptor Beta Chain in T cell Activation, Tumor Suppressor Arf Inhibits Ribosomal Biogenesis, Neuropeptides VIP and PACAP inhibit the apoptosis of activated T cells, FAS signaling pathway ( CD95 ), HIV-I Nef: negative effector of Fas and TNF, Fc Epsilon Receptor I Signaling in Mast Cell, p38 MAPK Signal
- FIG. 9 is a chart showing gene group models used for 10-gene diagnostic tests. Two different gene model values sets consisting of two 10-gene groups each (50 genes total) were used to create two different 10-gene diagnostic tests. The gene group models used are listed in FIG. 9. Results from 10-gene test 1 Overall Accuracy 0.735 ALL only 0.65 AML only 0.857 Results from 10-gene test 2 Overall Accuracy 0.676 ALL only 0.55 AML only 0.857 Pathways related to this result comprise Free Radical Induced Apoptosis, PDGF Signaling Pathway, Rac 1 cell motility signaling pathway, and Selective expression of chemokine receptors during T-cell polarization.
- Genes of special interest from this result are SOD1, Sm protein F, Sm protein G, and HOXA9.
- Transmission Pattern within the Network of ALL In order to determine if a particular transmission pattern within this network (gene expression pattern) can be identified with acute lymphoblastic leukemia (ALL), point models (also known as FGM models) from all 7-gene groups for all 38 patients were clustered. Clusters were examined that contained only 7-gene groups from the patients with ALL. Two 7-gene group model patterns were identified, which correlated with the largest number of corresponding models in other ALL patients and with none of the AML patients. To test how accurately these two patterns classified ALL patients, correlations were also tested with this diagnostic/classification method on the Golub independent data.
- This method distinguished ALL patients from AML patients with approximately 85% accuracy (see the Results section). This both gives credence to the method as a diagnostic technique and lends significance to the gene models used. The probability of these two gene group model patterns producing an 85% result by chance is roughly 1 in 50,000. Similarly, tests were performed on the 5- and 10-gene groups. The diagnostic accuracy varied from 67.6 to 82.4%. Many pathways and genes were identified as being significant in the course of this test. Several of these appeared to mesh with current knowledge in the field (see the Results section). The test cited above identified a particular group of genes and a gene expression pattern within them that appears to identify ALL. This does not necessarily mean, however, that this group of genes is in the hypothetical ALL ring within a network of the kind illustrated in FIG. 2.
- the results of the 7-gene grouping all models to all models diagnostic test are as follows: Overall Accuracy 0.824 ALL only 0.85 AML only 0.786
- the results of the 10-gene grouping all models to all models diagnostic test are as follows: Overall Accuracy 0.735 ALL only 0.7 AML only 0.786
- the results for all 7-gene models to 7-gene group 1 model pattern diagnostic test are as follows: Overall Accuracy 0.765 ALL only 0.9 AML only 0.571
- ALL and AML have yet to reach genes which will determine their specific clinical expression.
- There was one 7-gene grouping whose models correlated with one of the ALL diagnostic patterns in all patients, both ALL and AML.
- a patient was identified as ALL if the average correlation was greater than the highest average AML correlation from the training set. This test identified ALL with approximately 76% accuracy. The diagnostic score is somewhat low, but the probability of chance occurrence is roughly 1 in 1,000. This provides statistical evidence that not only can large-scale gene expression be seen in ALL patients, but a single pattern can be seen as being transmitted through a large section of a genetic network involved in the clinical expression of ALL.
- Most common upstream gene Groups correlated to 7-gene model patterns which can be used in a diagnostic test are: Group 1 GAA gene extracted from Human lysosomal alpha-glucosidase gene exon 1 AGA Aspartylglucosaminidase 2-19 gene (2-19 protein) extracted from H.
- FIG. 10 shows a preliminary diagram of downstream causality from two diagnostic 7-gene groups used in the 7-gene test.
- Pathways related to downstream causality groups comprise ALK in cardiac myocytes, WNT Signaling Pathway, BCR Signaling Pathway, Fc Epsilon Receptor I Signaling in Mast Cell, Neuropeptides VIP and PACAP inhibit the apoptosis of activated T cells, Regulation of hematopoiesis by cytokines, Cytokines and Inflammatory Response, Integrin Signaling Pathway, AKT Signaling Pathway, Regulation of transcriptional activity by PML, mTOR Signaling Pathway, and Regulation of eIF4e and p70 S6 Kinase.
- Genes of special interest from this result include: FEZ1, EIF4A
- a partial picture of causality going downstream from the 7-gene diagnostic groups was constructed using a combination of correlations with the actual diagnostic patterns and correlations with the actual 7-gene diagnostic group models for each patient.
- a 7-gene group was considered a candidate for a downstream link if the gene model did not correlate with the corresponding model in any of the ALL patients and its 7-gene model correlated with one of the two diagnostic patterns.
- Downstream causality was considered found when the last condition only occurred when there was a correlation between its 7- gene model and the diagnostic group 7-gene models.
- this example describes a method of pathway conjecture and diagnosis using fractal genomics modeling (FGM).
Abstract
The present invention relates to methods of manipulation, storage, modeling, visualization and quantification of datasets. One application of the present invention is related to developing FGM models of datasets represented by the various points in a multi-dimensional map. The invention can be adapted to genomic analysis by Fractal Genomics Modeling (FGM), which can be used to identify biomarkers to develop treatments, diagnoses or prognoses of disease by exploiting the map of interactions and causality (pathway conjecture) rendered by this technology.
Description
A METHOD FOR IDENTIFYING BIOMARKERS USING FRACTAL GENOMICS MODELING
CROSS-REFERENCE TO RELATED APPLICATIONS: This application claims priority to Provisional Application Serial No. 60/486,233, filed July 10, 2003 which is incorporated herein in its entirety and made a part hereof. This application is also a continuation-in-part of U.S. Patent Application Serial No. 09/766,247, filed January 19, 2001, which claims priority to Provisional Application Serial No. 60/177,544 filed January 21, 2000 which are incorporated herein in their entirety and made a part hereof.
FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT: Not Applicable.
BACKGROUND OF THE INVENTION: Technical Field This present invention relates to methods of manipulation, storage, modeling, visualization and quantification of datasets. One application of the present invention is related to developing point-models of datasets represented by the various points in a multi- dimensional map. The invention can be adapted to genomic analysis by Fractal Genomics Modeling (FGM) which can be used to identify biomarkers to develop treatments, diagnoses or prognoses of disease by exploiting the map of interactions and causality — pathway conjecture — rendered by this technology.
Background Art The standard techniques currently employed to analyze large datasets are Cluster
Analysis and Self-Organizing Maps. These approaches can be effective in identifying broad groupings of genes connected with well understood phenotypes but fall short in identifying more complex gene interactions and phenotypes, which are less well defined. They do not allow for the fingerprinting and visualization of an entire dataset, and missing values are not easily accommodated. The computational requirements are high for these techniques, and the mapping time increases exponentially with the size of the dataset. Furthermore, the
current data must be reanalyzed when new datasets are added to the analysis, and vastly different results can occur for each new dataset or group of datasets added. In order to take full advantage of the information in multiple, large sets of data, we need new, innovative tools. There is a need for methods that more easily enable identification and visualization of potentially significant similarities and differences between multiple datasets in their entirety. There is also a need for methods to intelligently store and model large datasets. Recent studies have revealed genome-wide gene expression patterns in relation to many diseases and physiological processes. These patterns indicate a complex network interaction involving many genes and gene pathways, over varying periods of times. On a parallel track, recent studies involving mathematical models and biophysical analysis have shown evidence of an efficient, robust, network structure for information transmission when these networks are examined as large-scale gene groups. The problem comes in producing analysis of information transmission and network structure on the scale of individual genes and genetic pathways. Fractal Genomics Modeling (FGM) solves this problem by taking advantage of universal principles of organization. From the Internet, to social relations, to biochemical pathways, the fundamental patterns are similar. The natural relationship among many different types of networks, when mathematically represented, enables the extrapolation of vast quantities of data, capable of computerized analysis. FGM is computationally efficient because the method is performed incrementally, is almost perfectly parallel, and is substantially linear. Consequently, there is no scaling problem with FGM. 
Furthermore, of significant interest, FGM can be used to identify biomarkers and develop systems for diagnoses or prognoses of disease by exploiting the map of interactions and causality (pathway conjecture) rendered by this technology.
BRIEF DESCRIPTION OF THE DRAWINGS: These and other aspects and attributes of the present invention will be discussed with reference to the following drawings and accompanying specification. FIGS. 1A and 1B are a flow chart of the operational steps for manipulation, storage, modeling, visualization and quantification of datasets; FIG. 2 is a flow chart of the operational steps for an iterative algorithm and processing which provides a comparison string;
FIG. 3 is a model showing an efficient, robust, network structure for information transmission of the kind that has been found in many complex networks, including gene regulatory networks; FIG. 4 is a model showing clinical expression of acute lymphoblastic leukemia (ALL) based on gene expression patterns in the ALL genetic network; FIG. 5 is a log-log probability distribution for lined FGM models derived from an arbitrarily chosen gene expression data from a sample of Control/Normal subject and Down's Syndrome subject. Dashed line = scale-free fit; FIG. 6 is a log-log probability distribution for lined FGM models derived from all gene expression data of Control/Normal subject and Down's Syndrome subject in the study. Dashed line = scale-free fit; FIG. 7 is a chart showing gene group models used for a 7-gene diagnostic test in Example 3; FIG. 8 is a chart showing gene group models used for 5-gene diagnostic tests in Example 3; FIG. 9 is a chart showing gene group models used for 10-gene diagnostic tests in Example 3; and FIG. 10 is a flow chart showing a downstream causality from two diagnostic 7-gene groups used in the 7-gene test in Example 3.
DETAILED DESCRIPTION OF THE INVENTION: The present invention is susceptible to embodiments in many different forms. Preferred embodiments of the invention are disclosed with the understanding that the present disclosure is to be considered as exemplifications of the principles of the invention and are not intended to limit the broad aspects of the invention to the embodiments illustrated.
Generation of Point-Models of Datasets in a Multi-Dimensional Map The present invention relates to methods of manipulation, storage, modeling, visualization and quantification of datasets. FIGS. 1A and 1B show a flow chart of the method of the present invention for generating a multi-dimensional map of one or more target strings in which the target strings can be represented by marked points in the map.
The target strings correspond to datasets to be analyzed. Each point marked in the map
serves as a point-model for one or more target strings. The method can be used in the manipulation, storage, modeling, visualization and quantification of datasets in the target strings. The dataset in each target string consists of a sequence of numbers of length N*. One example of a dataset to be analyzed and its corresponding target string is the yearly income of a population, the target string being each person's income listed in a sequence. Another example is the body temperature readings of a group of patients in a hospital ward, with the target string being those readings listed in a sequence. A further example is a DNA sequence, such that each different type of base (A, C, T, G) is labeled with a number (0, 1, 2, 3), producing a target string with a corresponding numerical sequence. A further example is a protein sequence, such that each type of amino acid in the protein chain is labeled with a different number, producing a target string with a corresponding numerical sequence. For FIGS. 1A and 1B, suppose each dataset to be analyzed is a string of measurements resulting from an experiment involving several thousand genes. Further suppose that there is a number connected with the experimental result from each gene. Such a number could be the gene expression ratio, which represents the differences in fluorescence calculated from the gene combined with some other chemical on a biochip or on a slide. This calculation is not a part of the present invention but provides the numbers in the example target strings. The number of numbers in the example target strings is N*. Starting with FIG. 1A, the method starts (step 101) by providing a set of M such target strings of length N* (step 103). A region, R, is selected (step 104) such that each point in the region can serve as the domain of an iterative function. The iterative algorithm calculates the comparison string from a point, p, in some region, R.
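As a small illustration of the DNA example above, the following sketch turns a base sequence into a numerical target string using the labeling given in the text (A, C, T, G mapped to 0, 1, 2, 3). The names `BASE_CODES` and `dna_to_target_string` are hypothetical and are not part of the described method.

```python
# Labeling taken from the text: A, C, T, G -> 0, 1, 2, 3.
# BASE_CODES and dna_to_target_string are hypothetical names for illustration.
BASE_CODES = {"A": 0, "C": 1, "T": 2, "G": 3}


def dna_to_target_string(sequence):
    """Return the numerical target string (length N*) for a DNA sequence."""
    return [BASE_CODES[base] for base in sequence.upper()]


target = dna_to_target_string("ACTGGA")
print(target)  # [0, 1, 2, 3, 3, 0]
```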
Preferably, the region, R, is in the complex plane corresponding to the area in and around the Mandelbrot Set. Although the Mandelbrot Set is used in the preferred embodiment of the present invention, other sets, such as Julia Sets, may also be used. Using this iterative method, every point within the Mandelbrot Set can be made to correspond to a data sequence of arbitrary length. Because the Mandelbrot Set is made up of an infinite number of points, the method allows any number of datasets containing any number of values to be compared by mapping the datasets to points in or near the Mandelbrot Set. The Mandelbrot Set is an extremely complex fractal. The term fractal is used to describe non-regular geometric shapes that have the same degree of non-regularity on all
scales. It is this property of self-similarity that allows pictures of artificial systems built from fractals to resemble complex natural systems.
A comparison string of length N is also provided (step 107). The comparison string is generated from a point, p, in the region, R, by using an iterative algorithm N times to generate the comparison string having a length of N. The comparison string is also a data string and may be of any length relative to the target string. FIG. 2 shows an example of the steps involved in an iterative algorithm to generate a comparison string of length N provided in step 107 of FIG. 1A. The algorithm of FIG. 2 for the Mandelbrot Set is an example of an algorithm that can be used. If a set of points from a different iterative domain is used in this method instead of the Mandelbrot Set, a different algorithmic function would instead be used for that different set of points.
The algorithm starts (step 201), and a counter, n, is initialized to zero (step 221). A variable to be used in the algorithm, $z_0$, is initialized to zero (step 227). A point, p, is chosen from region R, preferably the region corresponding to the area in and around the Mandelbrot Set (step 231). An example of choosing such a point might be to overlay a grid upon the Mandelbrot Set and, then, choose one of the points in the grid. Determine if N numbers, which constitute the comparison string, have been calculated (step 241). In other words, check if n = N. If all the numbers of the comparison string have not yet been calculated (step 241), then the point, p, is used as input to the iterative algorithm $z_{n+1} = z_n^2 + p$ (step 251). For example, the first iteration based on a point, p, is $z_1 = z_0^2 + p$, or $z_1 = 0 + p$, or $z_1 = p$. Since p is a complex number of the form a + bi when decomposed into its real and imaginary parts, $z_2$ takes the form $z_2 = (a^2 + 2iab - b^2) + a + bi$, or $z_2 = (a^2 - b^2 + a) + i\,b(2a + 1)$.
If the absolute value of $z_{n+1}$ is greater than 2.0, or $|z_{n+1}| > 2.0$ (step 261), the iteration is stopped because it is unbounded, and $z_{n+1}$ will become infinitely large. Thus, point, p, is no longer under consideration. Instead, n is initialized to zero (step 221), $z_0$ is initialized to zero (step 227), and another point is instead chosen from the region R (step 231), preferably in and/or near the Mandelbrot Set. This prematurely stopped string, however, may be used as a comparison string with a length of less than N. If the absolute value of $z_{n+1}$ is equal to 2.0 or less, increment n by one (step 271) and check if N numbers have been calculated which constitute the comparison string (step 241). In other words, the algorithm iterates until n = N. If n < N, then perform the next iteration on point p (step 251). This next iteration will calculate the next number in the string of numbers comprising the comparison string. The process iterates until a string of variables, $z_1$ through $z_N$, can be produced that is of length N, so long as $|z_{n+1}| \le 2.0$. If n = N (step 241), or when the iteration is stopped because the absolute value of $z_{n+1}$ is greater than 2.0, or $|z_{n+1}| > 2.0$ (step 261), then the comparison string has been generated.
However, the numbers in the comparison string may need to be transformed to have values within a value set of interest (step 281). Suppose the numbers in the example target string representing gene expression ratios are real numbers between 0 and 10. If we wish to explore the similarities between the comparison string and the target string, the value set of interest would be the real numbers between 0 and 10. The numbers of the comparison string may need to undergo some transformation to produce real numbers in this range. One way to produce such a real number is the function $r = 10.0 \cdot b / |z|$. This will produce real numbers r falling in the range between 0 and 10 for z = a + bi when b is non-negative. Provide the comparison string (step 291), and the algorithm ends (step 299).
Referring to FIG. 1A, an optional step is to determine if certain properties of the comparison string should be marked (step 109). Examples of properties that might be marked are the mean value of the comparison string or the Shannon entropy. If certain properties of the comparison string should be marked (step 109), mark the properties of the comparison string (step 111). Optionally, the comparison string can also be checked to determine if it meets pre-scoring criteria (step 113), regardless of whether the properties of the comparison string are marked. This step involves preliminary testing of the comparison string's properties alone as criteria to initiate scoring.
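The iteration of FIG. 2 described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `comparison_string` and its parameters are hypothetical, and the transformation uses the absolute value of the imaginary part (an assumption made here so that every value stays between 0 and 10 regardless of the sign of b).

```python
def comparison_string(p, n_values, scale=10.0):
    """Iterate z_{n+1} = z_n**2 + p from z_0 = 0 (steps 221-271 of FIG. 2).

    Returns up to n_values real numbers in [0, scale]. The list is shorter
    than n_values when |z| exceeds 2.0 and the iteration is stopped (step 261).
    """
    z = 0j
    values = []
    for _ in range(n_values):
        z = z * z + p                      # step 251
        if abs(z) > 2.0:                   # step 261: orbit is unbounded
            break
        # Step 281 (transformation). Assumption: abs(z.imag) is used instead
        # of the signed imaginary part so that r is never negative.
        r = scale * abs(z.imag) / abs(z) if z != 0 else 0.0
        values.append(r)
    return values


print(comparison_string(-0.5 + 0.5j, 5))   # a bounded point: five values
print(comparison_string(1 + 1j, 5))        # escapes after one iteration
```

A caller that wants to discard escaping points entirely, as the flow chart does by default, can simply check whether the returned list is shorter than `n_values`.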
Examples of pre-scoring criteria are measuring the mean value of the comparison string to see if it is higher or lower than desired and determining if the Shannon entropy of the comparison string is too low or too high. When marking prior to scoring, it may be determined that an entire subregion of the region has a large number of points that do not meet the pre-scoring criteria. For example, this subregion may be part of a grid. It may be determined that the rest of the points in that subregion will not be considered, even though the original intent was to consider all points in the region. If the comparison string is pre-scored as described above and it does not meet the pre-scoring criteria (step 113), then the current comparison string is no longer under
consideration. Another comparison string is instead provided (step 107). The new comparison string is generated using the exemplary iterative algorithm of FIG. 2 on a new point, p, from region R. If the comparison string is pre-scored and it meets the pre-scoring criteria (step 113), then scoring of the comparison string is performed (step 121). Scoring refers to some test of the comparison string using the target string. Scoring of the comparison string can also be performed without marking the properties of the comparison string or pre-scoring the comparison string. In the example of real numbers r falling in the range between 0 and 10 described above, the score could be the correlation coefficient between the comparison string consisting of numbers r and the target string. A simple example of scoring might be counting the number of one-to-one matches between the comparison string and the target string over some length L where L <= N*, where N* is the length of the target string. Alternatively, a one-to-one comparison between numbers in the comparison and target strings may be performed for a non-contiguous number L of the numbers. For example, compare the second, fourth, and sixteenth numbers for a number L = 3. Determine if the point, p, corresponding to the comparison string should be marked depending on the score or other properties (step 123). If it is determined that the point should be marked (step 123), mark the point, p, in the region, R (step 127). The marked point is a point-model in the region, R, to represent the target string, M. The comparison string generated from this marked point with the iterative algorithm represents the target string, M. Marking can be used in an environment where a pixel or character corresponds to point p on a visual display or marking can refer to annotating the coordinates of point p in some memory, a database or a table.
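The two scoring examples mentioned above, a correlation coefficient and a count of one-to-one matches at chosen positions, might look like the following sketch. The function names `pearson` and `match_count` are hypothetical, and positions are 0-based here, so the second, fourth, and sixteenth numbers are indices 1, 3, and 15.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def match_count(comparison, target, positions=None):
    """Count one-to-one matches, either over the shared prefix or at given indices."""
    if positions is None:
        positions = range(min(len(comparison), len(target)))
    return sum(1 for i in positions if comparison[i] == target[i])


comparison = list(range(16))
target = list(range(16))
target[3] = 99                                     # one deliberate mismatch
print(match_count(comparison, target, [1, 3, 15]))  # L = 3 non-contiguous positions
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```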
The point is marked by changing some graphical property of the corresponding pixel, such as color, or changing the corresponding character. The point may also be marked by annotating the coordinates of point p in some memory, a database or a table based on the score. Optionally, point p can be marked, either additionally or solely, according to quantification of properties of the comparison string, without regard to the score. Such properties can be general, such as using some color or annotation to reflect the mean value of the string being in a certain range or markings reflecting the number of 3's in the string, or the value of the Shannon entropy. Such marking can be used as an aid in searching for preliminary criteria for scoring. When marking point p, it may be determined that an entire subregion of the region has a large number of points that do not meet the scoring criteria or other properties. For example, this
subregion may be part of a grid. It may be determined that the rest of the points in that subregion will not be considered, even though the original intent was to consider all points in the region. If it is determined that the point should not be marked (step 123), determine if a sufficient number of the M target strings have been checked for the comparison string derived from point p (step 129). For instance, in our gene expression example, there may be several experiments or datasets that are being scored against each comparison string. If more of the M target strings should be checked, the comparison string is scored against another of the M target strings (step 121). The comparison string can be used to compare to all M target strings. Not all of the target strings may exhibit similarity to a comparison string, and, therefore, not all target strings may be marked. Also, more than one target string may demonstrate some homology with a comparison string. Moreover, target strings may be marked multiple times, exhibiting correlative relationships to multiple comparison strings. If a sufficient number or all of the M target strings have been checked (step 129), determine if a sufficient number of points corresponding to comparison strings have been checked (step 133). If more of the points corresponding to comparison strings should be checked, provide another comparison string (step 107). The new comparison string is generated using the same iterative algorithm as used in generating the previous comparison string, such as the one detailed illustratively in FIG 2, on a new point, p, from region R. Any number of the same M target strings will then be used to score the new comparison string. If a sufficient number of points corresponding to comparison strings has been checked (step 133), the scoring process stops. In the case of determining the points, p, from a grid, this could be the number of points in the grid. 
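The loop over points and target strings described above (steps 107 through 133) can be sketched as a grid search. This is an illustrative sketch under stated assumptions: the helper names, the grid layout, the choice to discard escaping points, and the use of the Pearson correlation as the score are all choices made for this example.

```python
import math


def comparison_string(p, n):
    """Comparison string of length n for point p, or None if the orbit escapes."""
    z, out = 0j, []
    for _ in range(n):
        z = z * z + p
        if abs(z) > 2.0:
            return None                    # assumption: drop unbounded points
        out.append(10.0 * abs(z.imag) / abs(z) if z != 0 else 0.0)
    return out


def pearson(xs, ys):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def grid_points(re_min, re_max, im_min, im_max, steps):
    """Yield the points of a steps-by-steps grid laid over the region."""
    for i in range(steps):
        for j in range(steps):
            yield complex(re_min + (re_max - re_min) * i / (steps - 1),
                          im_min + (im_max - im_min) * j / (steps - 1))


def best_points(targets, region, steps=40):
    """Return (best score, point) for each of the M target strings."""
    n = len(targets[0])
    best = [(-2.0, None)] * len(targets)
    for p in grid_points(*region, steps):          # step 133 loop
        comp = comparison_string(p, n)             # step 107
        if comp is None:
            continue
        for k, target in enumerate(targets):       # step 129 loop
            score = pearson(comp, target)          # step 121
            if score > best[k][0]:
                best[k] = (score, p)
    return best


targets = [[1.2, 7.5, 3.3, 9.1, 0.4, 5.6, 2.2, 8.0],
           [4.0, 4.5, 1.0, 0.2, 6.6, 7.1, 3.0, 2.5]]
for score, point in best_points(targets, (-2.0, 0.5, -1.25, 1.25)):
    print(round(score, 3), point)
```

Because each point is scored independently of every other point, the outer loop can be split across workers without coordination.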
The highest scoring point or points are then mapped (step 137). Mapping refers to placing the coordinates of highest scoring point or points in memory, a database or a table. The target string or strings, represented by the coordinates, may also be visually marked on a visual display. Target strings may be analyzed and/or compared by examining, either visually or mathematically, their relative locations and/or absolute locations within the region, R. When scoring similarity measures between the comparison strings and the target strings, target strings with greater similarity are generally mapped closer to each other based on Euclidean distance on the map. This is because comparison strings with greater similarity
are generally closer to each other on the map. However, this is not always true because the metrics involved are more complicated. For example, shading of points corresponding to comparison strings with high scores for a given target string represents a metric which shows similarity between this target string and others mapped in this shaded region. The target strings in this case, however, may not appear close together on the map or display but can be identified as being similar. Continuing to FIG. 1B, determine whether points in region R should be marked (in a similar manner as previously described) based on their relative scores or properties compared to other points in region R (step 139). If it is determined that the points should be marked (step 139), mark the points (step 141). For example, one might wish to mark all the points whose score falls within 10% of the highest score of a chosen target string, or mark points whose comparison strings have the lowest or highest Shannon entropy for the region. When marking points, it may be determined that an entire subregion of the region has a large number of points that do not meet the relative score criteria or other properties. For example, this subregion may be part of a grid. This may be used to determine whether this subregion is of interest or not. In one embodiment of the present invention, once the decision has been made as to whether such points should be marked (step 139), determine if a subregion of R is of interest (step 143). If a subregion of R is of interest (step 143), then this subregion is examined with higher resolution, called zooming (step 147). The subregion of R replaces the previous region R (step 104 of FIG. 1A). Comparison strings will be generated from the new subregion of R and will be scored against any number of the same set of M target strings originally provided.
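Zooming (step 147) can be illustrated by repeatedly re-gridding a smaller square centered on the best point found so far. The sketch below is one possible reading, with hypothetical names (`search`, `zoom_search`) and assumed parameters (a 15-by-15 grid, a shrink factor of 0.25, a starting square around -0.75 + 0i). An odd grid size keeps the previous best point on the new grid, so the best score cannot decrease from one zoom to the next.

```python
import math


def comparison_string(p, n):
    """Comparison string of length n for point p, or None if the orbit escapes."""
    z, out = 0j, []
    for _ in range(n):
        z = z * z + p
        if abs(z) > 2.0:
            return None
        out.append(10.0 * abs(z.imag) / abs(z) if z != 0 else 0.0)
    return out


def pearson(xs, ys):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def search(target, center, half_width, steps=15):
    """Score every grid point in the square around center; return (score, point)."""
    best_score, best_p = -2.0, None
    for i in range(steps):
        for j in range(steps):
            p = complex(center.real + half_width * (2 * i / (steps - 1) - 1),
                        center.imag + half_width * (2 * j / (steps - 1) - 1))
            comp = comparison_string(p, len(target))
            if comp is None:
                continue
            score = pearson(comp, target)
            if score > best_score:
                best_score, best_p = score, p
    return best_score, best_p


def zoom_search(target, center=-0.75 + 0j, half_width=1.25, zooms=4, shrink=0.25):
    """Replace region R by a smaller subregion around the best point, zooms times."""
    history = []
    for _ in range(zooms):
        score, p = search(target, center, half_width)
        if p is None:
            break
        history.append((score, p))
        center, half_width = p, half_width * shrink   # the subregion becomes R
    return history


target = [1.2, 7.5, 3.3, 9.1, 0.4, 5.6, 2.2, 8.0]
for score, point in zoom_search(target):
    print(round(score, 4), point)
```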
Points in a subregion of interest, which were previously unchecked, will be examined because the new region, R, is a higher resolution version of the subregion of interest. The points in the subregion will tend to produce a greater percentage of similar comparison strings to those previously examined in region R. If the subregion of interest is a high scoring region this will, in general, produce a greater percentage of high scores and some differences will emerge to produce higher scores or properties which are closer to some desired criteria. After zooming (step 147) and before examining the subregion, the target strings and comparison strings may optionally be transformed to attempt to improve the precision and resolution of the mapping and marking in the method. Suppose in the gene expression example, the target strings values, instead of real numbers from 0 to 10, were binned into 10
contiguous intervals, such that the first bin corresponds to real number values from 0 to 1, the second bin to real number values from 1 to 2, etc. Suppose these bins were labeled 0 to 9. The target string would then be a string of integers with values from 0 through 9. Suppose that a similar transformation was done on the transformed comparison strings. Suppose the method is performed and after zooming (step 147), the gene expression ratios and comparison strings are split into 20 such bins from 0 to 0.5, 0.5 to 1.0, etc. Thus, the target and comparison strings will be re-scaled before repeating the process in the new subregion (104 of FIG. 1A). This re-scaling can improve the precision and accuracy of the mapping and marking in the method. There are several well-studied methodologies that can be used to approach such a re-scaling to improve the precision and resolution of the mapping and marking process as zooming is performed. These include, but are not limited to, methodologies such as Simulated Annealing, Hill Climbing Algorithms, Genetic Algorithms, or Evolutionary Programming Methods. If no other subregions of R are of interest (step 143), the method of FIGS. 1A and
1B ends (step 199). This generally results when there is no improvement in the score after some number of zooms. It should be apparent to one skilled in the art that this technique can be used to study the behavior of any (scoring) function that uses the target strings and the comparison strings as variables. Attempting to find the highest value of the similarity measure scoring function is a particular case of this. As such, this method could be used to attempt to optimize any scoring function, using a target string or multiple target strings and comparison strings as variables, to find the function's minima and maxima. In addition, each comparison string can simply be used alone as input into the variables of a scoring function for such a purpose. It should be apparent to one skilled in the art that this method can be used for data compression. If the model of the target string represented by a comparison string is sufficiently similar to the target string, and the coordinates of the point p corresponding to that comparison string can be represented in a more compact way than the target string, then the target string can be replaced with its more compact representation in the form of the coordinates of point p. This is because the comparison string generation algorithm can then be used to recreate a sufficiently similar representation of the target string from point p. This method has special applicability to multiple large datasets. Uses for this method include analysis of DNA sequence data, protein sequence data, and gene expression
datasets. The method can also be used with demographic data, statistical data, and clinical (patient) data. The uses for this method are not limited to these datasets, however, and may be applied to any type of data or heterogeneous mixtures of different data types within datasets. Some of the steps of this method can involve determinations and interventions made by a user of the method or they can be automated.
Fractal Genomics Modeling (FGM) The previously described method can be adapted for use in a new data analysis technique, Fractal Genomics Modeling (FGM), to explore the structure of genetic networks. It is possible to produce hypotheses for unknown gene interactions, for proposed pathways, and for pathway interconnections of large-scale gene expression through Fractal Genomics Modeling (FGM). By virtue of its correlational power, FGM inherently results in the discovery of putative biomarkers that can classify disease. Such disease indicators are discovered by the rendering and ordering of the underlying genetic elements that engender the illness, as it progresses and changes over time. Three distinct disease models ensue, each exemplifying the predictive capability of FGM: Down's Syndrome, Human Immunodeficiency Virus (HIV) infection, and leukemia. The conventional approach to analyze large-scale gene expression has been cluster analysis and self-organizing maps. This approach can be effective in identifying broad groupings of genes connected with well understood phenotypes, but falls short in identifying more complex gene interactions and phenotypes which are less well defined. When applying cluster analysis to microarrays, typically a function is applied to every gene expression value in such a way that similar values cluster in similar locations on (usually) a two dimensional surface. With FGM, a special modeling method is used so that every point on a surface uses its own function to represent a cluster model of gene expression values, effectively "clustering the clusters." This allows for much greater insight into gene expression patterns and the similarities between them. By using FGM, the analysis moves from conventional approaches of examining gene expression values to examining gene expression patterns. FIG. 
3 illustrates an efficient, robust, network structure for information transmission of the kind that has been found in many complex networks, including gene regulatory networks. The points represent what are called nodes in the networks, and the lines represent what are called links. The nature of the type of network shown in FIG. 3 is that
there are few nodes with many direct links, forming hubs, and many nodes with only a small number of direct links. These types of networks are often called scale-free or power-law networks. They are characterized by the fact that the number of nodes with a given number of links falls off as a power law. For example, there may be twice as many nodes with 2 links as with 4 links. The robust nature of this type of network comes from the fact that if one removes or disables one of the nodes, it is more likely to be one with only a few links and cause less harm to the network as a whole. Suppose the World Wide Web is organized this way. The points around the center would be web sites like Yahoo or Google, the points slightly further from the center might be web sites like Amazon.com or Expedia, and the outside points might be personal web sites (obviously this requires a much larger picture to show this accurately!). The flow of information tends to go from the inside out. For example, information flows easily from Yahoo to the rest of the network because it has so many direct links. Information flow from a personal web site to the rest of the web is possible, but less likely. One can see the robust nature of the web in the fact that sites and servers go offline all the time without affecting the network. Of course the occasional times when an "inner" site such as Yahoo goes offline can have a very large impact! Each node in FIG. 3 can also represent a gene and the lines can represent correlated behavior to other genes in a genetic network. The most "connected" genes, or genes with the most direct links, would be ones in the center of this picture. As an example, FIG. 4 represents how acute lymphoblastic leukemia (ALL) might express itself in the genetic network of an actual patient. In this model, the flow of information, in this case biochemical interactions, begins far upstream, in what could be called "super-regulatory genes" (SRG) due to their importance.
This area is labeled SRG/Normal. In this patient and in most people, these SRG behave as they would in a normal, healthy individual. As the biochemical gene expression patterns propagate out of the center through downstream links, however, something occurs which causes a divergence from a normal, healthy pattern. Due to mutation, biochemical or environmental factors, or chance, a group of genes residing somewhere in the ringed area labeled SRG/Carcinogenic begins a cascade through the network that propagates into a clinical expression of cancer. Further downstream are nodes in the network that define the clinical outcome as a specific type of cancer, illustrated by the group of genes labeled leukemia and, still further downstream, as a
subclass of leukemia, ALL (and extending out to genes not seen). It should be noted that this is not a simple cascade from the center outward. Many interconnecting pathways are involved with both promoter and inhibitory links between genes. FGM is a hybrid technique that blends some of the concepts of wavelet analysis with cluster analysis. FGM "wavelets" are a series of real-valued numbers derived from complex logistic maps, such as Julia sets, generated from iterations of a single point in the complex plane. FGM searches points on the complex plane for the model that gives the greatest Pearson correlation with the actual localized data, using a minimum cutoff correlation whose absolute value is greater than 0.95. The similarity metric between point-models on the complex plane found in this way is very intricate but, in general, similar models tend to cluster in similar areas on the surface. This is particularly true if the point-models fall within a given "threshold" determined by Euclidean measure. Since a genome-wide pattern is mirrored in a small number of genes due to underlying fractal structure, FGM can be used to model the gene expression of small groups of genes, each having n number of genes (for example, n is 7 or 14 genes) from a much larger gene pool. The larger gene pool can be a sample of an organism's genome or of an organism's entire genome, such as the entire human genome. Illustratively, the genes in the gene pool can be arranged randomly in microarrays of commercial gene chips (e.g., Affymetrix Human Genome U95A chips consisting of about 12,000 genes) to measure the gene expression levels of the genes. Significantly, at least one small characterizing group of genes must exist. Since FGM models are usually scored based on their Pearson correlation, the overall magnitude of gene expression within these small groups does not matter in probing for similar patterns throughout the array, only the relative expression patterns within the groups.
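The statement above that overall magnitude does not matter follows from a property of the Pearson correlation: it is unchanged when one string is multiplied by a positive constant or shifted by an offset. The short sketch below checks this on made-up expression values for a hypothetical 7-gene group; the numbers and names are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


# Made-up values for a hypothetical 7-gene group and a model of it.
group = [2.0, 5.5, 1.0, 7.2, 3.3, 6.1, 4.0]
model = [2.4, 5.0, 1.3, 7.0, 3.0, 6.5, 4.2]

# The same group measured at ten times the intensity plus a constant offset.
brighter = [10.0 * v + 3.0 for v in group]

r_original = pearson(model, group)
r_brighter = pearson(model, brighter)
print(abs(r_original - r_brighter) < 1e-9)   # the score does not change
print(abs(r_original) > 0.95)                # cutoff test described in the text
```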
Other mathematical relations may be used other than Pearson's correlation. When comparing patterns of gene expression between these groups, we sometimes worked with only the models of these gene groups (in "model" space), and we sometimes worked with the actual gene expression values. Unless noted, we usually compare model values and not actual gene expression values although they are often similar. Choosing gene expression values from small groups of arbitrarily chosen genes in a network is the same as a series of short, random walks of random step-size on a modular structure. By analogy, one should see a comparative distribution of gene expression values
between such "walks" much different than if genes were randomly linked within the genome or acting largely independently. Similarities between the gene expression patterns in these groups should reveal information about the genetic network structure with correlations between gene groups skewed around gene groups chosen that align with the inherent modularity. Clusters on the FGM surface can serve to identify and to analyze such a skewed distribution. The FGM method for modeling gene expression of a small group of genes in a genetic network of a subject comprises the following steps: (a) providing a dataset of gene expression values of the small group of genes from the subject; (b) providing a surface wherein each point on the surface can serve as a domain for an iterative algorithm; (c) selecting a point on the surface; (d) generating a comparison string from the selected point using the iterative algorithm; (e) scoring the comparison string against the gene expression values in the dataset; (f) determining if the score of the comparison string meets a predetermined condition or property; and (g) marking the point if the score meets the predetermined condition or property to generate a fractal genomics modeling (FGM) model of the target string on the surface. The method above can include the additional steps of repeating the steps (c) through (g) for a plurality of gene expression values from a plurality of small groups of genes in the genetic network to generate a plurality of FGM models on the surface.
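Steps (a) through (g) above can be put together in a compact sketch. It is an illustration under several assumptions, not the method as practiced: gene groups are drawn at random from a made-up expression table, the surface is a fixed grid around the Mandelbrot Set, the predetermined condition is an absolute Pearson correlation above 0.95, and only the best-scoring point per group is kept. All names (`fgm_models`, `expression`, and so on) are hypothetical.

```python
import math
import random


def comparison_string(p, n):
    """Comparison string of length n for point p, or None if the orbit escapes."""
    z, out = 0j, []
    for _ in range(n):
        z = z * z + p
        if abs(z) > 2.0:
            return None
        out.append(10.0 * abs(z.imag) / abs(z) if z != 0 else 0.0)
    return out


def pearson(xs, ys):
    """Pearson correlation (0.0 if either sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def fgm_models(expression, group_size=7, n_groups=5, steps=30, cutoff=0.95, seed=0):
    """Return {gene group: (point, correlation)} for groups that pass the cutoff."""
    rng = random.Random(seed)
    genes = sorted(expression)
    models = {}
    for _ in range(n_groups):
        group = tuple(rng.sample(genes, group_size))            # (a) small group
        values = [expression[g] for g in group]
        best = None
        for i in range(steps):                                  # (b), (c) surface points
            for j in range(steps):
                p = complex(-2.0 + 2.5 * i / (steps - 1),
                            -1.25 + 2.5 * j / (steps - 1))
                comp = comparison_string(p, group_size)         # (d)
                if comp is None:
                    continue
                r = pearson(comp, values)                       # (e)
                if abs(r) > cutoff and (best is None or abs(r) > abs(best[1])):  # (f)
                    best = (p, r)
        if best is not None:
            models[group] = best                                # (g) mark the point
    return models


# Made-up expression table of 40 genes with values between 0 and 10.
_rng = random.Random(1)
expression = {f"gene{i}": _rng.uniform(0.0, 10.0) for i in range(40)}
for group, (point, r) in fgm_models(expression).items():
    print(group, point, round(r, 3))
```

Groups for which no grid point passes the cutoff are simply left without a model in this sketch; a finer grid or a zooming pass would be the natural next step for them.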
Identifying Biomarkers Within the point-models (also known as FGM models) on the FGM surface, clusters are found containing models of the same gene groups corresponding to only one of the phenotypes. If such a gene group is found, it is then individually tested across all datasets to verify that between these n-gene patterns the Pearson correlation, or any other suitable correlations, is markedly different depending on the phenotype from which the dataset is drawn. If such a gene group is found, further testing is done to choose the n-gene group from the sample within the cluster that produces the most marked difference. Such a gene group or its FGM model then becomes a candidate biomarker for the particular phenotype being studied and provides insight into the biochemical pathways linked to the phenotype present. The biomarker can then be used to develop treatments, diagnoses or prognoses of diseases. A diagnostic test can also be designed to diagnose a disease of a test subject by
comparing the gene expression values of the phenotype of the test subject against the biomarker.
In a preferred embodiment, the method for identifying the biomarker for a phenotype includes the steps of:
(a) identifying clusters containing FGM models of the small group of genes corresponding to the phenotype;
(b) individually testing each of the small group of genes across all datasets to verify that the pre-determined condition or property between the small groups of genes is markedly different with regard to the phenotype; and
(c) selecting the small group of genes that produces the most marked difference in the pre-determined condition or property as a biomarker for the particular phenotype.
In another preferred embodiment, the method for identifying the biomarker for a phenotype, such as the phenotype of a disease, includes the steps of:
(a) providing a plurality of datasets of gene expression values wherein each dataset is from a small group of genes, and the plurality of datasets is from one or more subjects having the phenotype;
(b) providing a surface wherein each point on the surface can serve as a domain for an iterative algorithm;
(c) selecting a point on the surface;
(d) generating a comparison string from the selected point using the iterative algorithm;
(e) scoring the comparison string against the gene expression values in the dataset;
(f) determining if the score of the comparison string meets a pre-determined Pearson correlation value;
(g) marking the point if the score meets the pre-determined Pearson correlation value to generate a FGM model of the target string on the surface;
(h) repeating steps (c) through (g) for a plurality of the datasets to generate FGM models for said plurality of datasets;
(i) identifying clusters containing FGM models of the same small group of genes corresponding to the phenotype;
(j) individually testing each of the small group of genes across all datasets to verify that the Pearson correlation between the small groups of genes is markedly different with regard to the phenotype; and
(k) selecting the small group of genes that produces the most marked difference in the Pearson correlation as a biomarker for the particular phenotype.
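The final selection step can be sketched as follows, purely as an illustration. The specification does not define how the "most marked difference" is measured, so the sketch assumes it is the gap between the mean absolute Pearson correlations of the two phenotypes. The function name and the data layout are hypothetical.

```python
from statistics import mean

def most_marked_group(correlations, phenotype):
    """Pick the gene group whose mean |r| differs most between two phenotypes.

    correlations: {group_id: {subject_id: absolute Pearson correlation}}
    phenotype:    {subject_id: phenotype label}, exactly two distinct labels
    Returns (group_id, gap). ASSUMPTION: "most marked" means the largest gap
    between the per-phenotype means.
    """
    labels = sorted(set(phenotype.values()))
    if len(labels) != 2:
        raise ValueError("expected exactly two phenotype labels")
    best = None
    for group, by_subject in correlations.items():
        first = [r for s, r in by_subject.items() if phenotype[s] == labels[0]]
        second = [r for s, r in by_subject.items() if phenotype[s] == labels[1]]
        gap = abs(mean(first) - mean(second))
        if best is None or gap > best[1]:
            best = (group, gap)
    return best

# Hypothetical toy data: two candidate groups, two subjects per phenotype.
phenotype = {"s1": "disease", "s2": "disease", "s3": "control", "s4": "control"}
correlations = {
    "g1": {"s1": 0.99, "s2": 0.98, "s3": 0.50, "s4": 0.40},
    "g2": {"s1": 0.90, "s2": 0.80, "s3": 0.85, "s4": 0.75},
}
best_group, best_gap = most_marked_group(correlations, phenotype)
```

Other measures of difference, such as a rank-based statistic, could be substituted without changing the overall procedure.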
Examples
Example 1: Evidence of Scale-free Genetic Network and Identification of Biomarkers in Down's Syndrome
This example demonstrates the use of FGM both to provide evidence of a scale-free genetic network in Down's Syndrome and to identify specific small gene groupings, consisting of 7 genes, that can serve as biomarkers relating to Down's Syndrome. In this study, FGM was used to model small groups of 7 genes from much larger microarrays (Affymetrix Human Genome U95A chips) consisting of 12,558 genes. The data was derived from fibroblasts of 4 subjects with and 4 subjects without Down's Syndrome, totaling 8 subjects. The number of genes within the groups, in this case 7, was decided using the criterion of picking a relatively small number (in the range of 5-20) that divides 12,558 evenly. Thus, by arbitrarily grouping the genes in the order in which they appeared on the gene chip, 1,794 7-gene groups were established. Consequently, 14,352 (1,794 gene groups x 8 subjects) target strings, M, each with 7 gene expression values, were provided for FGM analysis. Comparison strings were generated from points in the multi-dimensional map or complex plane for each target string and were scored against each of the target strings. These comparison strings served as potential FGM models for the target strings. These FGM models were scored based on their overall Pearson correlation, using a minimum cutoff correlation of absolute value > 0.95. Among the point-models (also known as the FGM models) on the FGM surface, clusters were found containing models of the same gene groups corresponding to only one of the phenotypes.
In order to test a genetic network for the threshold requirements of scale-free and modular behavior, a log-log plot of k vs. P(k) of gene expression data from a Control/Normal sample and a Down's Syndrome subject is graphed. P(k) is the probability of finding a 7-gene group with k links to another 7-gene group. A group is considered linked to another group if it falls within the same FGM cluster of a given size. FIG. 5 is the log-log plot of k vs. P(k) of gene expression data for an arbitrarily chosen Down's and Control sample. The resulting plot is linear, demonstrating both modular and scale-free characteristics. The network organization appears to be hierarchical in nature for smaller clusters but deviates from linearity for larger clusters. This could be due to an effect called saturation that limits how large a cluster can get in real-world networks, due to physical constraints and stability.
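As an illustration of the k vs. P(k) analysis described above, the following sketch derives P(k) from a list of clusters and fits a straight line on log-log axes. It assumes that the number of links k of a gene group is the number of distinct other groups sharing a cluster with it, and that each cluster is given as a set of hypothetical group identifiers. Neither assumption is taken from the specification.

```python
import math
from collections import Counter

def degree_distribution(clusters):
    """Return {k: P(k)} where k counts the other groups sharing a cluster (ASSUMED)."""
    neighbours = {}
    for cluster in clusters:
        for group in cluster:
            others = (g for g in cluster if g != group)
            neighbours.setdefault(group, set()).update(others)
    counts = Counter(len(n) for n in neighbours.values() if n)
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}

def loglog_slope(pk):
    """Least-squares slope of log10 P(k) against log10 k (needs two or more k values)."""
    xs = [math.log10(k) for k in pk]
    ys = [math.log10(p) for p in pk.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical clusters of gene-group identifiers.
clusters = [{"a", "b"}, {"c", "d"}, {"e", "f", "g"}]
pk = degree_distribution(clusters)
slope = loglog_slope(pk)
```

A roughly constant negative slope over a range of k is the signature of scale-free behavior that the plot is intended to reveal.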
FIG. 6 is the same plot derived from clusters for all samples on the same FGM map. This is, in effect, a picture of the combined genome for all the data. The picture conveyed from FIGS. 5 and 6 together brings to light further notions of universal constructs within such complex networks. Using the method described above, a 7-gene group was discovered that corresponded only to subjects with Down's syndrome. The corresponding results are shown in Tables 1 and 2.
Table 1: Ranked absolute values of the Pearson correlation for 7-gene FGM models with the 7-gene biomarker candidate model (left) and the corresponding correlations with actual expression values (right). Down's subjects are marked with "D". (The model/actual values of the genes from the subject marked with * were used.)
FGM model values                  Actual expression values
Subject      Pearson              Subject      Pearson
6-194D*                           6-194D       1
4213-34D     1                    42135-34D    0.97
5-186D       1                    5-186D       0.97
7-197D       0.92                 10-AOlC      0.89
8-367C       0.87                 7-197D       0.88
10-AOlC      0.83                 3648FC       0.85
9-367C       0.62                 8-367C       0.84
3648FC       No model found       9-367C       0.71
Table 2: The 7-gene Down's Syndrome biomarker candidate. Model and actual gene expression values for subject 6-194D (produced highest correlation in Table 1) and description of genes in the group. *Denotes the fact that this model is negatively correlated to the actual values (absolute Pearson used in model scoring).
FGM (Model) Values*    Actual Values
57.5                   200.9
112.3                  22.5
70.9                   170.7
106.9                  7.9
103.3                  8.2
99.7                   14.5
112.9                  4.7
Genes in the group:
Cluster Incl. D11466: Homo sapiens mRNA for PIG-A protein
Cluster Incl. D10925: Human mRNA for HM145
Cluster Incl. U13395: Human oxidoreductase
Human scavenger receptor cysteine rich Sp alpha mRNA
Homo sapiens properdin (PFC) gene
H. sapiens mRNA for BMPR-II
Human apoptotic cysteine protease
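The negative correlation noted for Table 2 can be checked directly from the listed values. The short sketch below assumes that the model and actual values pair up row by row in the order in which they are listed. The helper name pearson is illustrative.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Values as listed in Table 2 for subject 6-194D (row pairing is an assumption).
model = [57.5, 112.3, 70.9, 106.9, 103.3, 99.7, 112.9]
actual = [200.9, 22.5, 170.7, 7.9, 8.2, 14.5, 4.7]

r = pearson(model, actual)
# r is negative, and its absolute value exceeds the 0.95 cutoff used for scoring.
meets_cutoff = abs(r) > 0.95
```

Under that pairing the correlation comes out at about -0.97, which is consistent with the footnote to Table 2 and with the use of the absolute Pearson value in model scoring.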
The 7-gene Down's Syndrome biomarker candidate found was located within some of the larger clusters (which did not contain any control samples of the same gene group) on the FGM surface. This could be significant when exploring linkages to larger gene groups. To test for artifacts from the FGM surface, a "random" U-95A mock sample, produced from 12,558 uniformly distributed random numbers from 0-10,000, was analyzed as 7-gene groups. Only one cluster of three genes and 23 pair-clusters were found in the entire sample.
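The grouping used in this example, and the construction of the "random" mock sample, can be sketched as follows. The function name and the fixed seed are hypothetical. The sketch only builds the 7-gene groups and does not perform the FGM clustering itself.

```python
import random

def mock_groups(n_genes=12558, group_size=7, low=0.0, high=10000.0, seed=0):
    """Build a mock chip of uniform random values and split it into consecutive groups."""
    rng = random.Random(seed)                 # fixed seed only for repeatability
    values = [rng.uniform(low, high) for _ in range(n_genes)]
    return [values[i:i + group_size] for i in range(0, n_genes, group_size)]

groups = mock_groups()
n_groups = len(groups)               # 12,558 / 7 = 1,794 groups
n_target_strings = n_groups * 8      # 8 subjects in Example 1
```

Because 12,558 is divisible by 7, every group is complete and no values are left over.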
Example 2: Identification of Biomarkers in Human Immunodeficiency Virus (HIV) Infection
In this example, FGM was used to model small groups of 14 genes from much larger microarrays (Affymetrix Human Genome U95A chips) consisting of 12,558 genes. The data was derived from the brain tissue of 5 HIV-1 negative and 4 HIV-1 infected subjects, totaling 9 subjects. The number of genes within the groups, in this case 14, was decided using the criterion of picking a relatively small number (in the range of 5-20) that goes evenly into 12,558. Thus, by arbitrarily grouping the genes in the order in which they appeared on the gene chip, 897 14-gene groups were established. Consequently, 8,073 (897 gene groups x 9 subjects) target strings, M, each with 14 gene expression values, were provided for FGM analysis. Comparison strings were generated for each target string, as previously described. These FGM models were scored based on their overall Pearson correlation, using a minimum cutoff correlation of absolute value > 0.95. Therefore, the overall magnitude of gene expression within these small groups did not matter in probing for similar patterns throughout the array, only the relative expression patterns within the groups. When comparing gene expression between the gene groups, the models of comparison strings were most often used, though sometimes the actual gene expression values were used. Among the point-models (also known as FGM models) on the FGM surface, clusters were found containing models of the same gene groups corresponding to only one of the phenotypes. One 14-gene group was discovered that corresponded only to HIV-1 infected subjects. This 14-gene group was then individually tested across all data for each subject in order to verify that between these n-gene patterns (n = 14 in this case) the Pearson correlation was noticeably different depending on the phenotype from which the data sample was drawn. The 14-gene group from the sample within the cluster that produced the most noticeable difference was identified as a putative biomarker. The correlation values with this particular gene group and the corresponding gene groups, across all samples, are shown in Table 3. The left side of Table 3 uses the FGM model values and the right side uses the actual expression values, both ranked from highest to lowest correlation.
Table 3: Ranked absolute values of the Pearson correlation for 14-gene FGM models with the 14-gene biomarker candidate model (left) and the corresponding correlations with actual expression values (right). HIV-1 positive subjects are marked with "+". (The model/actual values for the genes from the subject marked with * were used.)
FGM model values                  Actual expression values
Subject      Pearson              Subject      Pearson
G0036+*                           G0036+*
G0017+       0.98                 D97 2916-    0.97
H0011+       0.94                 G0017+       0.96
H0002+       0.91                 H0011+       0.94
G0010+       0.86                 G0010+       0.92
BTB 3455-    No model found       H0002+       0.89
BTB 3648-    No model found       BTB 3648-    0.88
BTB 3749-    No model found       BTB 3455-    0.72
D97 2916-    No model found       BTB 3749-    0.71
The actual marker genes and the model and actual expression values of the sample/subject that produced the greatest correlation are listed in Table 4.
Table 4: HIV-1 brain biomarker candidate. Model and actual gene expression values for subject G0036+ (produced highest correlation in Table 3) and description of genes in the group.
FGM (Model) Values     Actual Values
310.8                  180.7
126.3                  55.6
298.8                  158.4
264.5                  51.9
274.4                  174.7
585.9                  912J
233.4                  264.3
248                    245.6
478.6                  572.3
144.4                  55.2
363                    218.5
328.3                  312.4
457.5                  626.5
1074                   1593.2
Genes in the group:
Cluster Incl. U39067: Homo sapiens translation initiation factor eIF3 p36
Cluster Incl. AL050106: Homo sapiens mRNA; cDNA DKFZp586I1319
Cluster Incl. AF047181: Homo sapiens NADH-ubiquinone oxidoreductase
Cluster Incl. AF007872: Homo sapiens torsinB (DQ1)
Cluster Incl. AF007871: Homo sapiens torsinA
Cluster Incl. AB011116: Homo sapiens mRNA for KIAA0544
Cluster Incl. AF032456: Homo sapiens ubiquitin conjugating enzyme G2
Cluster Incl. D87454: Human mRNA for KIAA0265 gene
Cluster Incl. AF001383: Homo sapiens amphiphysin II mRNA
Cluster Incl. U69263: Human matrilin-2 precursor mRNA
Cluster Incl. D31889: Human mRNA for KIAA0072 gene
Cluster Incl. AL050265: Homo sapiens mRNA
Cluster Incl. AL038340: DKFZp566K192_sl
Cluster Incl. AL038340: DKFZp566K192_sl (duplicate description)
Example 3: Genetic Network and Biomarkers in Leukemia
Input data from the study produced by Golub et al. (Golub T. R., et al., Science, Vol. 286, pp. 531-536, 1999) are used in this example in order to further demonstrate the utility of the present invention. The data in the Golub study contained Affymetrix gene expression data for 7070 genes acquired from patients diagnosed with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The data was composed of a training set of data from 27 ALL patients and 11 AML patients, used to develop diagnostic approaches based on the Affymetrix data, and an independent set of 34 patients for testing.
Genetic Network in the Clinical Expression of Leukemia
In order to determine what kind of genetic network is involved in the clinical expression of leukemia, the more than 7,000 gene expression values in the Golub data were broken into groupings of 5, 7, and 10 genes based only on the order in which the genes were arranged on the Affymetrix chip. FGM was used to create point-models (also known as FGM models) of the gene expression patterns in these small groups and to look for correlations, or clustering, between the 5, 7, and 10 gene models in each of the 38 patients in the Golub training set. The number of ways to arrange 7 genes out of 7,000 is approximately 10^27. Unless there is coordinated behavior between a large number of these 7,000 genes, there would be almost no chance of finding correlations between (effectively) arbitrary 7-gene groupings, even when clustering a thousand of them. On the other hand, if there is a genetic network of the scale-free type described above, there should be a large number of genes whose behavior is correlated to only a few genes.
For the 7-gene grouping, our analysis found that there were significant model clusters in every patient. The largest cluster had an average size of approximately ten 7-gene models. Pearson correlations with absolute value > 0.95 between the models confirmed the similarities within these clusters. This provides statistical evidence that there are at least a few genes whose behavior is connected with well over 1,000 other genes. This also agrees with an earlier gene expression study based on time-based gene expression data. The clusters that contained 7-gene groups from only the patients with ALL were then scrutinized. Two 7-gene group models correlated to the largest number of corresponding models in ALL patients but with no AML patients. The two 7-gene groups are listed in FIG. 7 with their respective gene model values as well as the actual gene expression values.
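The order of magnitude quoted above for the number of ways to arrange 7 genes out of 7,000 can be checked with a short calculation. The sketch treats "arrange" as ordered selection (permutations) and also shows the unordered count for comparison. It requires Python 3.8 or later for math.perm and math.comb.

```python
import math

# Ordered selections of 7 genes out of 7,000: 7000 * 6999 * ... * 6994
arrangements = math.perm(7000, 7)

# Unordered selections, for comparison: arrangements / 7!
combinations = math.comb(7000, 7)

order_of_magnitude = math.log10(arrangements)   # just under 27
```

The ordered count is slightly below 10^27, which matches the approximate figure given in the text. The unordered count is smaller by a factor of 7! = 5,040.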
The first 7-gene group contains the following genes:
1. GATA2 GATA-binding protein 2
2. Alcohol dehydrogenase 6 gene
3. GB DEF = Protein-tyrosine phosphatase mRNA
4. Globin gene
5. Pre-mRNA splicing factor SF2, P32 subunit precursor
6. Major histocompatibility complex enhancer-binding protein
7. MSN Moesin
The second 7-gene group contains the following genes:
1. Onconeural ventral antigen-1 (Nova-1) mRNA
2. Ini1 mRNA
3. RORA RAR-related orphan receptor A
4. FUSE binding protein mRNA
5. Rar protein mRNA
6. Fetal ALZ-50-reactive clone 1 (FAC1) mRNA
7. MB-1 gene
These two 7-gene group models were used for a 7-gene diagnostic test. The two 7-gene group model values from two patients in the training set (above) were used to characterize ALL in the independent set. The test was an OR test: if the corresponding 7-gene models in an independent set patient had a Pearson correlation with either of these 7-gene model values such that the absolute value was > 0.95, the patient was classified as ALL. The results for the 7-gene grouping are as follows:
Overall Accuracy 0.853
ALL only 0.95
AML only 0.714
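The OR test described above can be sketched as follows. The function names are hypothetical, the reference patterns shown are toy values rather than the model values of FIG. 7, and the sketch assumes that a patient who fails both correlation checks is classified as AML.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def classify(patient_models, reference_patterns, cutoff=0.95):
    """OR test: ALL if either 7-gene model matches its reference pattern (|r| > cutoff)."""
    for model, reference in zip(patient_models, reference_patterns):
        if abs(pearson(model, reference)) > cutoff:
            return "ALL"
    return "AML"   # ASSUMED default when neither pattern matches

# Toy reference patterns standing in for the two diagnostic 7-gene models.
references = [[1, 2, 3, 4, 5, 6, 7], [7, 1, 5, 2, 6, 3, 4]]

patient_a = [[2, 4, 6, 8, 10, 12, 14], [1, 1, 2, 1, 1, 2, 1]]   # first model matches
patient_b = [[3, 1, 4, 1, 5, 9, 2], [1, 1, 2, 1, 1, 2, 1]]      # neither model matches
labels = [classify(patient_a, references), classify(patient_b, references)]
```

Because the absolute value of the correlation is used, a model that is strongly negatively correlated with a reference pattern also counts as a match.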
Pathways related to this result comprise the Ras-Independent pathway in NK cell-mediated cytotoxicity. The gene of special interest from this result is the MB-1 gene. In addition, it was found that the second 7-gene group above allows patients with ALL to be differentiated into those who have T-cell ALL and those who have B-cell ALL. The test using this 7-gene group model was 100% accurate in the test set in classifying B-Cell vs. T-Cell (see Table 7). The gene segments used are summarized in Table 8.
Table 7: Summary of using the second 7-gene group to predict B-cell and T-cell ALL
Gene-chip    Absolute Pearson correlation between    Predicted          Actual    Correct
(Patient)    patient gene segment model and          (>.95 = B-Cell)
             classifier model
39           0.9997                                  B-Cell             B-Cell    Yes
40           0.9509                                  B-Cell             B-Cell    Yes
41           0.9954                                  B-Cell             B-Cell    Yes
42           1                                       B-Cell             B-Cell    Yes
43           0.9974                                  B-Cell             B-Cell    Yes
44           0.9995                                  B-Cell             B-Cell    Yes
45           1                                       B-Cell             B-Cell    Yes
46           0.9995                                  B-Cell             B-Cell    Yes
47           0.9996                                  B-Cell             B-Cell    Yes
48           0.9999                                  B-Cell             B-Cell    Yes
49           0.9999                                  B-Cell             B-Cell    Yes
55           0.9792                                  B-Cell             B-Cell    Yes
56           1                                       B-Cell             B-Cell    Yes
59           0.9616                                  B-Cell             B-Cell    Yes
67           0.6753                                  T-Cell             T-Cell    Yes
68           1                                       B-Cell             B-Cell    Yes
69           0.9996                                  B-Cell             B-Cell    Yes
70           0.9998                                  B-Cell             B-Cell    Yes
71           1                                       B-Cell             B-Cell    Yes
72           0.9998                                  B-Cell             B-Cell    Yes
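The decision rule behind Table 7 is a single threshold on the absolute Pearson correlation. The sketch below applies that rule to the correlations listed in the table and compares the outcome with the listed actual classes. The variable names are illustrative.

```python
# Absolute Pearson correlations from Table 7, keyed by gene-chip (patient) number.
table7 = {
    39: 0.9997, 40: 0.9509, 41: 0.9954, 42: 1.0, 43: 0.9974,
    44: 0.9995, 45: 1.0, 46: 0.9995, 47: 0.9996, 48: 0.9999,
    49: 0.9999, 55: 0.9792, 56: 1.0, 59: 0.9616, 67: 0.6753,
    68: 1.0, 69: 0.9996, 70: 0.9998, 71: 1.0, 72: 0.9998,
}
actual_t_cell = {67}   # the only patient listed as T-Cell in Table 7

# Rule from the table header: correlation above 0.95 means B-Cell, otherwise T-Cell.
predicted = {p: ("B-Cell" if r > 0.95 else "T-Cell") for p, r in table7.items()}
actual = {p: ("T-Cell" if p in actual_t_cell else "B-Cell") for p in table7}

n_correct = sum(predicted[p] == actual[p] for p in table7)
accuracy = n_correct / len(table7)
```

All 20 listed patients are classified correctly under this rule, which agrees with the 100% figure reported for the test set.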
Table 8: Gene Segments Used to predict B-cell and T-cell ALL
Gene Segment Used                                   Gene              Model (Classifier)    Actual (Classifier)
                                                                      Values                Values
Onconeural ventral antigen-1 (Nova-1) mRNA          U04840_at         828.711548            93
Ini1 mRNA                                           U04847_at         758.938538            123
RORA RAR-related orphan receptor A                  U04898_at         237.028641            -60
FUSE binding protein mRNA                           U05040_at         1345.72998            891
Rar protein mRNA                                    U05227_at         958.517456            -38
Fetal Alz-50-reactive clone 1 (FAC1) mRNA           U05237_at         1616.58411            635
MB-1 gene                                           U05259_rna1_at    5244.02344            5314
Clusters were also found in the 5 and 10 gene grouping runs. These clusters were generally smaller but the analysis of these groups also gave indications of large-scale correlation between many genes. The five gene-grouping runs resulted in several 5-gene groups. FIG. 8 is a chart showing gene group models used for 5-gene diagnostic tests. Five different gene model value sets consisting of four 5-gene groups each (20 genes total) were used to create five different 5-gene diagnostic tests.
Test             Overall Accuracy    ALL only    AML only
5-gene test 1    0.824               0.8         0.857
5-gene test 2    0.735               0.8         0.643
5-gene test 3    0.824               0.8         0.857
5-gene test 4    0.765               0.8         0.714
5-gene test 5    0.735               0.75        0.714
Pathways related to this result comprise: Regulation of hematopoiesis by cytokines, IL-2 Receptor Beta Chain in T cell Activation, Tumor Suppressor Arf Inhibits Ribosomal Biogenesis, Neuropeptides VIP and PACAP inhibit the apoptosis of activated T cells, FAS signaling pathway (CD95), HIV-I Nef: negative effector of Fas and TNF, Fc Epsilon Receptor I Signaling in Mast Cell, p38 MAPK Signaling Pathway, and Induction of apoptosis through DR3 and DR4/5 Death Receptors.
FIG. 9 is a chart showing gene group models used for 10-gene diagnostic tests. Two different gene model value sets consisting of two 10-gene groups each (50 genes total) were used to create two different 10-gene diagnostic tests. The gene group models used are listed in FIG. 9.
Test              Overall Accuracy    ALL only    AML only
10-gene test 1    0.735               0.65        0.857
10-gene test 2    0.676               0.55        0.857
Pathways related to this result comprise Free Radical Induced Apoptosis, PDGF Signaling Pathway, Rac 1 cell motility signaling pathway, and Selective expression of chemokine receptors during T-cell polarization. Genes of special interest from this result are SOD1, Sm protein F, Sm protein G, and HOXA9.
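The overall accuracies reported above follow arithmetically from the per-class accuracies once the class sizes are known. The sketch below assumes the independent set of 34 patients comprises 20 ALL and 14 AML patients. That split is taken from the published Golub dataset and is not stated in this text. The function name is hypothetical.

```python
def overall_accuracy(all_rate, aml_rate, n_all=20, n_aml=14):
    """Combine per-class accuracies into an overall accuracy.

    n_all and n_aml are ASSUMED class sizes for the 34-patient independent set.
    Rounding recovers whole patient counts from the three-decimal rates.
    """
    correct = round(all_rate * n_all) + round(aml_rate * n_aml)
    return correct / (n_all + n_aml)

five_gene_test_1 = round(overall_accuracy(0.8, 0.857), 3)    # 28 of 34 correct
ten_gene_test_2 = round(overall_accuracy(0.55, 0.857), 3)    # 23 of 34 correct
```

Under the assumed split, the recomputed values agree with the tabulated overall accuracies for these tests.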
Transmission Pattern within the Network of ALL
In order to determine if a particular transmission pattern within this network (gene expression pattern) can be identified with acute lymphoblastic leukemia (ALL), point models (also known as FGM models) from all 7-gene groups for all 38 patients were clustered. Clusters were examined that contained only 7-gene groups from the patients with ALL. Two 7-gene group model patterns were identified, which correlated with the largest number of corresponding models in other ALL patients and with none of the AML patients. To test how accurately these two patterns classified ALL patients, this diagnostic/classification method was also tested on the Golub independent data. The method distinguished ALL patients from AML patients with ~85% accuracy (see the Results section). This both gives credence to the method as a diagnostic technique and lends significance to the gene models used. The probability of these two gene group model patterns producing an 85% result by chance is roughly 1 in 50,000. Similarly, tests were performed on the 5 and 10-gene groups. The diagnostic accuracy varied from 67.6 to 82.4%. Many pathways and genes were identified as being significant in the course of this test. Several of these appeared to mesh with current knowledge in the field (see Results section).
The test cited above identified a particular group of genes and a gene expression pattern within them that appears to identify ALL. This does not necessarily mean, however, that this group of genes is in the hypothetical ALL ring within a network of the kind illustrated in FIG. 2. To produce evidence of this type of large-scale transmission, a test was devised which compared all 7-gene models to all corresponding models between patients in the independent set and a randomly chosen ALL and AML patient from the training set. All model correlations were calculated and averaged for both the ALL and AML patients chosen. The diagnostic decision was based on which comparison had the higher average correlation. This test produced a diagnostic accuracy of 82.4%. More importantly, this result is a statistically significant indication of a gene expression pattern reflecting a clinical expression of ALL throughout the 7,000+ gene set. The same test was also performed with the 10-gene models and likewise produced a statistically significant result (see Results section).
The results of the 7-gene grouping all models to all models diagnostic test (based on average correlation with randomly chosen ALL and AML patient from the training set) are as follows:
Overall Accuracy 0.824
ALL only 0.85
AML only 0.786
The results of the 10-gene grouping all models to all models diagnostic test (based on average correlation with randomly chosen ALL and AML patient from the training set) are as follows:
Overall Accuracy 0.735
ALL only 0.7
AML only 0.786
The results for all 7-gene models to 7-gene group 1 model pattern diagnostic test (based on average correlation with randomly chosen ALL and AML patient from the training set) are as follows:
Overall Accuracy 0.765
ALL only 0.9
AML only 0.571
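The chance figure of roughly 1 in 50,000 quoted above for the ~85% result can be reproduced approximately with a one-sided binomial tail. The specification does not state how the figure was obtained, so the sketch assumes that each of the 34 independent patients is classified correctly with probability 0.5 under pure chance and that ~85% corresponds to 29 correct classifications. The function name is hypothetical.

```python
import math

def chance_probability(correct, total, p=0.5):
    """P(at least `correct` successes in `total` trials) under a binomial null (ASSUMED)."""
    return sum(math.comb(total, k) * p ** k * (1 - p) ** (total - k)
               for k in range(correct, total + 1))

p_85 = chance_probability(29, 34)   # 29 of 34 correct, i.e. 0.853 overall accuracy
one_in = 1 / p_85                   # on the order of 50,000
```

A different null model, for example one weighted by the class proportions, would give a somewhat different figure.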
Upstream and Downstream Pathways in ALL Genetic Network
It can be further determined whether this transmission pattern can be traced upstream in the network. Starting with the two specific 7-gene model patterns used to diagnose ALL, an attempt was made to find correlations between these patterns and all 7-gene models in both ALL and AML patients in the training set. The assumption was that finding this expression pattern in an area closer inside than the "ALL ring" in FIG. 4 would constitute finding an upstream gene grouping. In this area ALL and AML have yet to reach genes which will determine their specific clinical expression. There was one 7-gene grouping whose models correlated with one of the ALL diagnostic patterns in all patients, both ALL and AML. There were also two other 7-gene groups that met this condition in almost all patients in the training set. All three of the gene groups are listed under the heading "Most Common Upstream Gene Groups correlated to 7-gene Model Patterns Used in Diagnostic Test" in the Results section.
To strengthen the assumption that this pattern was being transmitted through a large section of the network, we performed the following test. We correlated the single 7-gene diagnostic pattern cited above against all the 7-gene models in each of the AML patients in the training set. The highest average correlation was found. The same correlation test was performed across all the independent patients. A patient was identified as ALL if the average correlation was greater than the highest average AML correlation from the training set. This test identified ALL with ~76% accuracy. The diagnostic score is somewhat low, but the probability of chance occurrence is roughly 1 in 1,000. This provides statistical evidence that not only can large-scale gene expression be seen in ALL patients, but a single pattern can also be seen as being transmitted through a large section of a genetic network involved in the clinical expression of ALL.
Most common upstream gene groups correlated to 7-gene model patterns which can be used in a diagnostic test are:
Group 1
GAA gene extracted from Human lysosomal alpha-glucosidase gene exon 1
AGA Aspartylglucosaminidase
2-19 gene (2-19 protein) extracted from H. sapiens G6PD gene for glucose-6-phosphate dehydrogenase
CYCLIC-AMP-DEPENDENT TRANSCRIPTION FACTOR ATF-1
Usf mRNA for late upstream transcription factor
PRTN3 Proteinase 3 (serine proteinase, neutrophil, Wegener granulomatosis autoantigen)
RPS3 Ribosomal protein S3
Group 2
XP-C repair complementing protein (p58/HHR23B)
KIAA0031 gene
Estrogen responsive finger protein
C3G protein
CDH11 Cadherin 11 (OB-cadherin)
60S RIBOSOMAL PROTEIN L23
SM22-ALPHA HOMOLOG
Group 3
CD1D CD1D antigen, d polypeptide
5,10-methenyltetrahydrofolate synthetase mRNA
PTPRD Protein tyrosine phosphatase, receptor type, delta polypeptide
GT197 partial ORF mRNA, 3' end of cds
The longest open reading frame predicts a protein of 202 amino acids, with fair Kozak consensus at the initial ATG codon; an in-frame TGA codon is seen at nucleotide 8; ORF; putative gene extracted from Homo sapiens GT198 mRNA, complete ORF
GT212 mRNA
RPL37 Ribosomal protein L37
Pathways related to upstream gene groups comprise: Oxidative reactions of the pentose phosphate pathway, TNF/Stress Related Signaling, fMLP induced chemokine gene expression in HMC-1 cells, Proepithelin Conversion to Epithelin and Wound Repair Control, Rac 1 cell motility signaling pathway, and Catabolic pathway for asparagine and aspartate.
FIG. 10 shows a preliminary diagram of downstream causality from two diagnostic 7-gene groups used in the 7-gene test. Pathways related to downstream causality groups comprise ALK in cardiac myocytes, WNT Signaling Pathway, BCR Signaling Pathway, Fc Epsilon Receptor I Signaling in Mast Cell, Neuropeptides VIP and PACAP inhibit the apoptosis of activated T cells, Regulation of hematopoiesis by cytokines, Cytokines and Inflammatory Response, Integrin Signaling Pathway, AKT Signaling Pathway, Regulation of transcriptional activity by PML, mTOR Signaling Pathway, and Regulation of eIF4e and p70 S6 Kinase. Genes of special interest from this result include FEZ1 and EIF4A.
Causal Picture of the Network
In order to determine if a transmission pattern can be used to create a causal picture of the network, a partial picture of causality going downstream from the 7-gene diagnostic groups was constructed using a combination of correlations with the actual diagnostic patterns and correlations with the actual 7-gene diagnostic group models for each patient. A 7-gene group was considered a candidate for a downstream link if the gene model did not correlate with the corresponding model in any of the ALL patients and its 7-gene model correlated with one of the two diagnostic patterns. Downstream causality was considered found when the last condition only occurred when there was a correlation between its 7-gene model and the diagnostic group 7-gene models. The assumption is that this 7-gene group's expression (as part of an ALL network) was apparently "switched on" by the diagnostic 7-gene group correlation upstream. The results of this preliminary causal analysis are in the Results section.
In summary, this example describes a method of pathway conjecture and diagnosis using fractal genomics modeling (FGM). The 7-gene group results were focused on, but many interesting pathway and gene inferences seem to come out of the 5 and 10 gene tests. Within the related pathways listed, there is a great deal of overlap between the pathways connected with the downstream links and the 5-gene groups. This is intriguing because in a scale-free network of the kind shown in FIG. 2, the genes with 5 links would tend to be both downstream of genes with 7 links and also more prevalent. This could provide a framework for building the interconnected downstream pathways actually represented in these groups. This would also lend credence to the idea that the 10-gene models tend to reflect pathways upstream of the 7-gene groups. Together these two notions could perhaps be used to map the biochemistry within the "ALL ring" in FIG. 4. This also might explain why the 5-gene and 10-gene results were less accurate, since they were dealing with pathways slightly removed from the "critical point" in ALL clinical expression. There could also be other biophysical reasons for this.
Statistical evidence was produced toward validation of the model of clinical expression shown in the genetic network in FIG. 4. In the process of arriving at this evidence, new tools and approaches have been identified for extracting a great deal of information about the structure and function of such a network. New diagnostic methods have also been identified. The diagnostic results, although statistically significant, were still somewhat low compared to other methods. This could well be due to problems with the Golub methodology which were accurately portrayed in a false diagnosis by FGM.
We will apply FGM to more up-to-date and accurate gene expression studies to further validate, improve, and extend the diagnostic approaches and pathway information of this invention. In the process, we will continue to translate the biophysics of gene expression models into the pathways and targets of interest to researchers in the medical field. Since FGM is data independent, we hope to apply these approaches to proteomic and even clinical data as well. While specific embodiments have been illustrated and described, numerous modifications come to mind without departing from the spirit of the invention and the scope of protection is only limited by the scope of the accompanying claims.
Claims
What is claimed is: 1. A method for modeling gene expression of a small group of genes in a genetic network of a subject comprising: (a) providing a dataset of gene expression values of the small group of genes from the subject; (b) providing a surface wherein each point on the surface can serve as a domain for an iterative algorithm; (c) selecting a point on the surface; (d) generating a comparison string from the selected point using the iterative algorithm; (e) scoring the comparison string against the gene expression values in the dataset; (f) determining if the score of the comparison string meets a pre-determined condition or property; and (g) marking the point if the score meets the pre-determined condition or property to generate a fractal genomics modeling (FGM) model of the target string on the surface.
2. The method of claim 1, further comprising zooming, wherein the steps (c) through (g) are repeated until the score cannot be improved.
3. The method of claim 1, wherein the steps (c) through (g) are repeated for a plurality of datasets from a plurality of small groups of genes to generate a plurality of FGM models on the surface.
4. The method of claim 1, wherein the subject is a subject diagnosed with a disease or a normal subject with respect to the diagnosed disease.
5. The method of claim 1, wherein the subject is a human subject.
6. The method of claim 4, wherein the disease is Down's Syndrome.
7. The method of claim 4, wherein the disease is Human Immunodeficiency Virus (HIV) infection.
8. The method of claim 4, wherein the disease is cancer.
9. The method of claim 8, wherein the cancer is leukemia.
10. The method of claim 9, wherein the leukemia is acute lymphoblastic leukemia (ALL).
11. The method of claim 9, wherein the leukemia is acute myeloid leukemia (AML).
12. The method of claim 1, wherein the gene expression is measured in a gene-chip comprising a microarray of the genes in the small group of genes.
13. The method of claim 1, wherein the small group of genes is part of a larger gene pool from the subject.
14. The method of claim 13, wherein the small group of genes is randomly selected from the larger gene pool.
15. The method of claim 13, wherein the larger gene pool has about 7,000 genes or more.
16. The method of claim 13, wherein the larger gene pool has about 12,000 genes or more.
17. The method of claim 13, wherein the larger gene pool consists of the entire genome of the subject.
18. The method of claim 1, wherein the number of genes in the small gene group is from 2 to 20.
19. The method of claim 1, wherein the number of genes in the small gene group is 5.
20. The method of claim 1, wherein the number of genes in the small gene group is 7.
21. The method of claim 1, wherein the number of genes in the small gene group is 10.
22. The method of claim 1, wherein the number of genes in the small gene group is 14.
23. The method of claim 3, wherein the plurality of datasets are derived from more than one subject.
24. The method of claim 23, wherein the subjects are selected from a group consisting of subjects diagnosed with a disease, normal subjects with respect to the diagnosed disease and a combination thereof.
25. The method of claim 1, wherein the surface is a complex plane.
26. The method of claim 1, wherein the surface is a multi-dimensional surface.
27. The method of claim 1, wherein the surface is in or around a Mandelbrot set.
28. The method of claim 1, wherein the surface is a Julia set.
29. The method of claim 1, wherein the gene expression value is an absolute value or a relative value relative to another small group of genes from the subject.
30. The method of claim 1, wherein the gene expression value is an overall expression value of the small group of genes.
31. The method of claim 1, wherein the scoring of the comparison string is based on its correlation with the gene expression value of the small group of genes.
32. The method of claim 31, wherein the correlation is a Pearson correlation.
33. The method of claim 32, wherein the comparison string is marked to serve as the FGM model for the gene expression value of the small group of genes if the absolute value of the Pearson correlation is greater than 0.95.
34. The method of claim 3 further comprising identifying a biomarker of a phenotype by: (a) identifying clusters containing FGM models of the small group of genes corresponding to the phenotype; (b) individually testing each of the small group of genes across all datasets to verify that the pre-determined condition or property between the small groups of genes is markedly different with regard to the phenotype; and (c) selecting the small group of genes that produces the most marked difference in the pre-determined condition or property as a biomarker for the particular phenotype.
35. The method of claim 34, wherein the FGM model of the small group of genes is used as the biomarker.
36. The method of claim 34, wherein the phenotype is a phenotype of a disease.
37. The biomarker of claim 36 is used to develop treatments, diagnoses, or prognoses of the disease.
38. A diagnostic test comprising the biomarker of claim 34.
39. A method for identifying a biomarker for a phenotype comprising: (a) providing a plurality of datasets of gene expression values wherein each dataset is from a small group of genes, and the plurality of datasets is from one or more subjects having the phenotype; (b) providing a surface wherein each point on the surface can serve as a domain for an iterative algorithm; (c) selecting a point on the surface; (d) generating a comparison string from the selected point using the iterative algorithm; (e) scoring the comparison string against the gene expression values in the dataset; (f) determining if the score of the comparison string meets a pre-determined Pearson correlation value; (g) marking the point if the score meets the pre-determined Pearson correlation value to generate a FGM model of the target string on the surface; (h) repeating steps (c) through (g) for a plurality of the datasets to generate FGM models for said plurality of datasets; (i) identifying clusters containing FGM models of the same small group of genes corresponding to the phenotype; (j) individually testing each of the small group of genes across all datasets to verify that the Pearson correlation between the small groups of genes is markedly different with regard to the phenotype; and (k) selecting the small group of genes that produces the most marked difference in the Pearson correlation as a biomarker for the particular phenotype.
40. The method of claim 39, wherein the plurality of datasets is from a combination of one or more subjects having the phenotype and one or more subjects not having the phenotype.
41. The method of claim 39, wherein the FGM model of the small group of genes is used as the biomarker.
42. The method of claim 39, wherein the phenotype is a phenotype of a disease.
43. The biomarker of claim 39 is used to develop treatments, diagnoses, or prognoses of the disease.
44. A diagnostic test comprising the biomarker of claim 39.
45. The method of claim 39, wherein the disease is Down's Syndrome.
46. The method of claim 39, wherein the disease is Human Immunodeficiency Virus (HIV) infection.
47. The method of claim 39, wherein the disease is cancer.
48. The method of claim 47, wherein the cancer is leukemia.
49. The method of claim 48, wherein the leukemia is acute lymphoblastic leukemia (ALL).
50. The method of claim 48, wherein the leukemia is acute myeloid leukemia (AML).
51. The method of claim 39, wherein the number of genes in the small group of genes is from 2 to 20.
52. The method of claim 39, wherein the number of genes in the small group of genes is 5.
53. The method of claim 39, wherein the number of genes in the small group of genes is 7.
54. The method of claim 39, wherein the number of genes in the small group of genes is 10.
55. The method of claim 39, wherein the number of genes in the small group of genes is 14.
56. The method of claim 39, wherein the network of genes consists of the entire genome of the subject.
57. The method of claim 39, wherein the pre-determined Pearson correlation value is an absolute value of the Pearson correlation greater than 0.95.
58. The method of claim 39, wherein the Pearson correlation is markedly different if the absolute value of the Pearson correlation is equal to or less than 0.95.
59. A biomarker for ALL comprising a small gene-group or its FGM model, the small gene-group is selected from a first group of genes, a second group of genes, and both the first group of genes and the second group of genes wherein the first group of genes is GATA2 GATA-binding protein 2, Alcohol dehydrogenase 6 gene, GB DEF = Protein-tyrosine phosphatase mRNA, Globin gene, Pre-mRNA splicing factor SF2, P32 subunit precursor, Major histocompatibility complex enhancer-binding protein, and MSN Moesin; and the second group of genes is Onconeural ventral antigen-1 (Nova-1) mRNA, Ini1 mRNA, RORA RAR-related orphan receptor A, FUSE binding protein mRNA, Rar protein mRNA, Fetal ALZ-50-reactive clone 1 (FAC1) mRNA, and MB-1 gene.
60. A biomarker for differentiating T-Cell ALL from B-Cell ALL comprising a small gene-group of 7 genes or its FGM model, the small gene-group consists of: Onconeural ventral antigen-1 (Nova-1) mRNA, Ini1 mRNA, RORA RAR-related orphan receptor A, FUSE binding protein mRNA, Rar protein mRNA, Fetal ALZ-50-reactive clone 1 (FAC1) mRNA, and MB-1 gene.
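By way of non-limiting illustration only, steps (c) through (g) of claim 1, combined with the Pearson scoring of claims 31 to 33, might be sketched in Python as follows. The claims do not specify how the comparison string is derived from a point, so this sketch assumes the surface is a window of the complex plane around the Mandelbrot set (claims 25 and 27) and that the comparison string is the sequence of magnitudes |z_k| under z -> z*z + c. The function names, the sampling window, the clamp value, and the sample expression values are all hypothetical and are not taken from the specification.

```python
import math
import random


def comparison_string(c, length):
    """Assumed generator: |z_k| for the first `length` iterates of z -> z*z + c."""
    z = 0j
    values = []
    for _ in range(length):
        z = z * z + c
        # Clamp diverging orbits so later iterates cannot overflow (arbitrary cap).
        if abs(z) > 1e6:
            z = complex(1e6, 0.0)
        values.append(abs(z))
    return values


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fgm_search(expression, threshold=0.95, n_points=20000, seed=0):
    """Randomly sample points and mark those whose string scores |r| > threshold."""
    rng = random.Random(seed)
    marked = []
    for _ in range(n_points):
        # Step (c): select a point in an assumed window around the Mandelbrot set.
        c = complex(rng.uniform(-2.0, 0.5), rng.uniform(-1.25, 1.25))
        # Step (d): generate the comparison string from the selected point.
        string = comparison_string(c, len(expression))
        # Step (e): score the string against the gene expression values.
        r = pearson(string, expression)
        # Steps (f) and (g): mark the point if the score meets the condition.
        if abs(r) > threshold:
            marked.append((c, r))
    return marked


if __name__ == "__main__":
    # Hypothetical expression values for a small group of 5 genes.
    expression = [120.0, 85.5, 430.2, 60.1, 210.7]
    models = fgm_search(expression)
    print(len(models), "points marked")
```

Under these assumptions, the zooming of claim 2 could be approximated by repeating `fgm_search` over a progressively smaller window centered on the best-scoring marked point until the score stops improving.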
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US48623303P | 2003-07-10 | 2003-07-10 | |
US60/486,233 | 2003-07-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005006179A1 (en) | 2005-01-20 |
Family
ID=34062117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2004/022157 WO2005006179A1 (en) | 2003-07-10 | 2004-07-10 | A method for identifying biomarkers using fractal genomics modeling |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2005006179A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5416848A (en) * | 1992-06-08 | 1995-05-16 | Chroma Graphics | Method and apparatus for manipulating colors or patterns using fractal or geometric methods |
US6389428B1 (en) * | 1998-05-04 | 2002-05-14 | Incyte Pharmaceuticals, Inc. | System and method for a precompiled database for biomolecular sequence information |
US6453246B1 (en) * | 1996-11-04 | 2002-09-17 | 3-Dimensional Pharmaceuticals, Inc. | System, method, and computer program product for representing proximity data in a multi-dimensional space |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7366719B2 (en) | Method for the manipulation, storage, modeling, visualization and quantification of datasets | |
Ravdin et al. | A demonstration that breast cancer recurrence can be predicted by neural network analysis | |
Marquardt | Advances in archaeological seriation | |
Demissie et al. | Bias due to missing exposure data using complete‐case analysis in the proportional hazards regression model | |
Pybus et al. | Testing macro–evolutionary models using incomplete molecular phylogenies | |
Cotton et al. | Rates and patterns of gene duplication and loss in the human genome | |
JP2020533679A (en) | Systems and methods for predicting relevance in the human population | |
KR20020075265A (en) | Method for providing clinical diagnostic services | |
Fryett et al. | Investigation of prediction accuracy and the impact of sample size, ancestry, and tissue in transcriptome‐wide association studies | |
US20050026199A1 (en) | Method for identifying biomarkers using Fractal Genomics Modeling | |
US20050079524A1 (en) | Method for identifying biomarkers using Fractal Genomics Modeling | |
Gruber et al. | Introduction to dartR | |
Clark et al. | Prognostic factors: rationale and methods of analysis and integration | |
US20050158736A1 (en) | Method for studying cellular chronomics and causal relationships of genes using fractal genomics modeling | |
Choi et al. | New variable selection strategy for analysis of high-dimensional DNA methylation data | |
WO2005006179A1 (en) | A method for identifying biomarkers using fractal genomics modeling | |
Xie et al. | A case study on choosing normalization methods and test statistics for two‐channel microarray data | |
JP2004030093A (en) | Gene expression data analysis method | |
Marinos et al. | A Survey of Survival Analysis Techniques. | |
WO2005029218A2 (en) | Method for studying cellular chronomics and casual relationships of genes using fractal genomics modeling | |
Lehmann et al. | High trait variability in optimal polygenic prediction strategy within multiple-ancestry cohorts | |
Gurven | How can we distinguish between mutational" hot spots" and" old sites" in human mtDNA samples? | |
AU2005218183B2 (en) | Estimation of clinical cut-offs | |
EP1239398A2 (en) | Method, system and computer program product for identifying conditional associations among structures in samples | |
CN115844878B (en) | A therapeutic drug and drug target for KRAS mutation-positive high-risk colon adenocarcinoma |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
122 | Ep: pct application non-entry in european phase |