WO2003031565A2 - Integrated system and method for analysis of genomic sequence data - Google Patents
Integrated system and method for analysis of genomic sequence data Download PDFInfo
- Publication number
- WO2003031565A2 WO2003031565A2 PCT/IL2002/000818 IL0200818W WO03031565A2 WO 2003031565 A2 WO2003031565 A2 WO 2003031565A2 IL 0200818 W IL0200818 W IL 0200818W WO 03031565 A2 WO03031565 A2 WO 03031565A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- genomic
- compressed
- genomic sequence
- uncompressed
- data
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 437
- 238000004458 analytical method Methods 0.000 title claims abstract description 60
- 238000007781 pre-processing Methods 0.000 claims abstract description 29
- 239000002773 nucleotide Substances 0.000 claims description 358
- 125000003729 nucleotide group Chemical group 0.000 claims description 330
- 108700026244 Open Reading Frames Proteins 0.000 claims description 209
- 102000004169 proteins and genes Human genes 0.000 claims description 179
- 108090000623 proteins and genes Proteins 0.000 claims description 179
- 238000012300 Sequence Analysis Methods 0.000 claims description 125
- 238000013500 data storage Methods 0.000 claims description 125
- 238000007906 compression Methods 0.000 claims description 82
- 230000006835 compression Effects 0.000 claims description 82
- UYTPUPDQBNUYGX-UHFFFAOYSA-N guanine Chemical compound O=C1NC(N)=NC2=C1N=CN2 UYTPUPDQBNUYGX-UHFFFAOYSA-N 0.000 claims description 73
- OPTASPLRGRRNAP-UHFFFAOYSA-N cytosine Chemical compound NC=1C=CNC(=O)N=1 OPTASPLRGRRNAP-UHFFFAOYSA-N 0.000 claims description 44
- RWQNBRDOKXIBIV-UHFFFAOYSA-N thymine Chemical compound CC1=CNC(=O)NC1=O RWQNBRDOKXIBIV-UHFFFAOYSA-N 0.000 claims description 44
- 238000004590 computer program Methods 0.000 claims description 43
- 229930024421 Adenine Natural products 0.000 claims description 38
- 229960000643 adenine Drugs 0.000 claims description 38
- GFFGJBXGBJISGV-UHFFFAOYSA-N Adenine Chemical compound NC1=NC=NC2=C1N=CN2 GFFGJBXGBJISGV-UHFFFAOYSA-N 0.000 claims description 37
- 239000002213 purine nucleotide Substances 0.000 claims description 29
- 150000003212 purines Chemical class 0.000 claims description 29
- 238000011144 upstream manufacturing Methods 0.000 claims description 25
- 239000002719 pyrimidine nucleotide Substances 0.000 claims description 24
- 150000003230 pyrimidines Chemical class 0.000 claims description 24
- 230000006870 function Effects 0.000 claims description 21
- 230000007717 exclusion Effects 0.000 claims description 18
- 229940104302 cytosine Drugs 0.000 claims description 15
- 239000003086 colorant Substances 0.000 claims description 14
- 229940113082 thymine Drugs 0.000 claims description 14
- 230000006837 decompression Effects 0.000 description 59
- 230000007246 mechanism Effects 0.000 description 39
- 108091026890 Coding region Proteins 0.000 description 26
- 238000010586 diagram Methods 0.000 description 21
- 108020004414 DNA Proteins 0.000 description 20
- 238000012217 deletion Methods 0.000 description 18
- 230000037430 deletion Effects 0.000 description 18
- 230000008569 process Effects 0.000 description 17
- 238000013519 translation Methods 0.000 description 14
- 108010042653 IgA receptor Proteins 0.000 description 10
- 102100034014 Prolyl 3-hydroxylase 3 Human genes 0.000 description 10
- 238000007792 addition Methods 0.000 description 10
- 238000013144 data compression Methods 0.000 description 10
- 108091035707 Consensus sequence Proteins 0.000 description 9
- 230000008827 biological function Effects 0.000 description 8
- 238000004883 computer application Methods 0.000 description 7
- 238000012545 processing Methods 0.000 description 7
- 230000000007 visual effect Effects 0.000 description 7
- FFKUHGONCHRHPE-UHFFFAOYSA-N 5-methyl-1h-pyrimidine-2,4-dione;7h-purin-6-amine Chemical compound CC1=CNC(=O)NC1=O.NC1=NC=NC2=C1NC=N2 FFKUHGONCHRHPE-UHFFFAOYSA-N 0.000 description 6
- 230000008901 benefit Effects 0.000 description 6
- 230000009471 action Effects 0.000 description 5
- 238000013459 approach Methods 0.000 description 5
- 230000000295 complement effect Effects 0.000 description 5
- 238000007405 data analysis Methods 0.000 description 5
- 238000012804 iterative process Methods 0.000 description 5
- 238000012163 sequencing technique Methods 0.000 description 5
- 102100037840 Dehydrogenase/reductase SDR family member 2, mitochondrial Human genes 0.000 description 4
- 101710204837 Envelope small membrane protein Proteins 0.000 description 4
- 101710160621 Fusion glycoprotein F0 Proteins 0.000 description 4
- 108091028043 Nucleic acid sequence Proteins 0.000 description 4
- 101800004937 Protein C Proteins 0.000 description 4
- 101710188053 Protein D Proteins 0.000 description 4
- 101710088839 Replication initiation protein Proteins 0.000 description 4
- 101710132893 Resolvase Proteins 0.000 description 4
- 102100036546 Salivary acidic proline-rich phosphoprotein 1/2 Human genes 0.000 description 4
- 101800001700 Saposin-D Proteins 0.000 description 4
- 229960000856 protein c Drugs 0.000 description 4
- NOIRDLRUNWIUMX-UHFFFAOYSA-N 2-amino-3,7-dihydropurin-6-one;6-amino-1h-pyrimidin-2-one Chemical compound NC=1C=CNC(=O)N=1.O=C1NC(N)=NC2=C1NC=N2 NOIRDLRUNWIUMX-UHFFFAOYSA-N 0.000 description 3
- 102000053602 DNA Human genes 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 3
- KDCGOANMDULRCW-UHFFFAOYSA-N 7H-purine Chemical compound N1=CNC2=NC=NC2=C1 KDCGOANMDULRCW-UHFFFAOYSA-N 0.000 description 2
- 101100241454 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.71 / DSM 1257 / FGSC 987) nuc-2 gene Proteins 0.000 description 2
- 235000019219 chocolate Nutrition 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000011331 genomic analysis Methods 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 230000005257 nucleotidylation Effects 0.000 description 2
- 210000000056 organ Anatomy 0.000 description 2
- 230000004044 response Effects 0.000 description 2
- 230000002441 reversible effect Effects 0.000 description 2
- 238000001228 spectrum Methods 0.000 description 2
- BZSALXKCVOJCJJ-IPEMHBBOSA-N (4s)-4-[[(2s)-2-acetamido-3-methylbutanoyl]amino]-5-[[(2s)-1-[[(2s)-1-[[(2s,3r)-1-[[(2s)-1-[[(2s)-1-[[2-[[(2s)-1-amino-1-oxo-3-phenylpropan-2-yl]amino]-2-oxoethyl]amino]-5-(diaminomethylideneamino)-1-oxopentan-2-yl]amino]-1-oxopropan-2-yl]amino]-3-hydroxy Chemical compound CC(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CCC(O)=O)C(=O)N[C@@H](CCCC)C(=O)N[C@@H](CCCC)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@H](C(N)=O)CC1=CC=CC=C1 BZSALXKCVOJCJJ-IPEMHBBOSA-N 0.000 description 1
- 108020004635 Complementary DNA Proteins 0.000 description 1
- 108700010674 N-acetylVal-Nle(7,8)- allatotropin (5-13) Proteins 0.000 description 1
- CZPWVGJYEJSRLH-UHFFFAOYSA-N Pyrimidine Chemical compound C1=CN=CN=C1 CZPWVGJYEJSRLH-UHFFFAOYSA-N 0.000 description 1
- 230000004071 biological effect Effects 0.000 description 1
- 150000001875 compounds Chemical class 0.000 description 1
- 238000010411 cooking Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 235000013305 food Nutrition 0.000 description 1
- 230000002401 inhibitory effect Effects 0.000 description 1
- PWPJGUXAGUPAHP-UHFFFAOYSA-N lufenuron Chemical compound C1=C(Cl)C(OC(F)(F)C(C(F)(F)F)F)=CC(Cl)=C1NC(=O)NC(=O)C1=C(F)C=CC=C1F PWPJGUXAGUPAHP-UHFFFAOYSA-N 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 238000011160 research Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- the present invention relates to database storage and retrieval in general and more particularly to storage and retrieval in computers which utilize multi-phase bit technology and to analysis and representation of genomic sequence data in general, and to pattern analysis of genomic data in particular.
- Genomic sequence data is typically represented as alphanumeric strings.
- the present invention seeks to provide a system and method for analysis of genomic sequence data, and particularly pattern analysis of genomic motifs appearing in genomic sequence data and their possible functional significance.
- the present invention comprises three sub-systems: a first sub-system and method for analysis of genomic data further described in co-pending U.S. Provisional Patent Application 60/329,114; a second sub-system and method for storage and retrieval of genomic data in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329, 1 12; and a third sub-system and method for genomic sequence similarity comparison in compressed data space also described in co-pending U.S. Provisional Patent Application 60/329,115.
- capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
- a method for analysis of genomic sequence data in compressed data space including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing.
- the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data
- the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith
- the compressing includes compressing each of the pluralitty of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments
- the storing includes storing the plurality of compressed short genomic segments
- the indexing includes indexing the plurality of compressed short genomic segments
- the analyzing includes: receiving a user query containing at least one logical condition relating to at least one of the following: one of the genomic region sequences, and one of the uncompressed short genomic segments, and retrieving results to the user query, the retrieving including at least one of the following
- a method for storage and retrieval of compressed genomic sequence data and similarity assessment of genomic sequence data in compressed data space including: receiving uncompressed genomic sequence data, compressing the uncompressed genomic sequence data into compressed genomic sequence data, storing the compressed genomic sequence data, indexing the compressed genomic sequence data, retrieving at least part of the compressed genomic sequence data representing uncompressed genomic sequence data similar to an uncompressed genomic target sequence, based at least in part on the indexing, and decompressing the at least part of the compressed genomic sequence data.
- the retrieving includes: receiving a target genomic sequence, a first plurality of compressed genomic sequences, representing respectively in compressed form a first plurality of genomic sequences, and at least one similarity criterion, and producing a second plurality of compressed genomic sequences, representing respectively in compressed form a second plurality of genomic sequences, the second plurality of genomic sequence being a subset of the first plurality of genomic sequences, each of the second plurality of genomic sequences being similar to the target genomic sequence, according to the at least one similarity criterion.
- a method for analysis of genomic sequence data utilizing genomic sequence similarity assessment in compressed data space including: obtaining genomic data, preprocessing genomic data into preprocessed genomic data, compressing at least part of the preprocessed genomic data, storing compressed preprocessed genomic data, indexing compressed preprocessed genomic data, and analyzing genomic data, based at least in part on the indexing, the analyzing also including assessing genomic sequence similarity, based at least in part on the indexing
- the obtaining includes obtaining uncompressed genomic sequence data, and protein coding region location data for each of a plurality of proteins known to be encoded by the genomic sequence data
- the preprocessing includes calculating and storing a plurality of genomic region sequences, based at least in part on the obtaining, and determining for each of the plurality of genomic region sequences, a plurality of uncompressed short genomic segments contained therewith
- the compressing includes compressing each of the plurality of uncompressed short genomic segments contained in each of the plurality of genomic region sequences into one of a plurality of compressed short genomic segments
- the storing includes storing the plurality of compressed short genomic segments
- the indexing includes indexing the plurality of compressed short genomic segments
- the analyzing includes: receiving a user query containing at least one logical condition relating to one of the plurality of uncompressed short genomic segments and at least one similarity criterion, extracting a subset of the plurality of uncompressed short genomic segments, each of the subset
- the plurality of genomic region sequences includes a plurality of protein coding regions.
- each of the plurality of protein coding regions is normalized.
- the plurality of genomic region sequences includes a plurality of regions adjacent to protein coding regions.
- the plurality of regions adjacent to protein coding regions includes a plurality of regions upstream to protein boding regions.
- the plurality of regions adjacent to protein coding regions includes a plurality of regions downstream to protein coding regions.
- each of the plurality of regions adjacent to protein coding regions is normalized according to coding direction of one of the plurality of protein coding regions adjacent thereto.
- the plurality of proteins known to be encoded by the genomic sequence data includes a majority of proteins known to be encoded by the genomic sequence data.
- the plurality of short genomic segments contained in each of the plurality of genomic region sequences includes a majority of short genomic segments of a given length contained in each of the plurality of genomic region sequences.
- genomic sequence data includes: a first genomic sequence data belonging to a first organism, and a second genomic sequence data belonging to a second organism different than the first organism.
- the method also includes storing for each one of the plurality of proteins at least one of the following protein properties: an organism of expression, a tissue of expression and a function.
- the at least one logical condition includes a degree of uniqueness of one of the plurality of short genomic sequences relative to at least one of the plurality of genomic sequence regions.
- the at least one logical condition includes a degree of commonality of one of the plurality of short genomic sequences relative to at least two of the plurality of genomic sequence regions.
- the method also includes: storing, based on user input, a plurality of criteria, determining and marking, each of the plurality of short genomic segments which complies with each one of the criteria, and the user query is based at least in part on at least one of the plurality of criteria.
- each of the plurality of criteria includes at least one of the at least one logical condition.
- the determining and storing also includes determining and storing a relationship between at least two of the plurality of short genomic segments, and the logical condition references the relationship. Still further in accordance with a preferred embodiment of the present invention the relationship also includes a relation between a location of a first one of the plurality of short genomic sequence relative to one of the plurality of genomic region sequences, and a second one of the plurality of short genomic sequence relative to the one of the plurality of genomic region sequences.
- the relationship also includes a similarity between a first one of the plurality of short genomic sequences and a second one of the plurality of short genomic sequences.
- the retrieving includs: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed genomic sequence data, based at least in part on the compressed query data.
- the retrieving does not require storing the uncompressed genomic sequence data.
- the retrieving does not require accessing the uncompressed genomic sequence data.
- the retrieving does not require retrieving the uncompressed genomic sequence data.
- the retrieving includes sorting the uncompressed genomic sequence data, based at least in part on the indexing.
- the sorting is alphabetical sorting.
- the uncompressed genomic sequence data includs a plurality of uncompressed strings
- the compressed genomic sequence data includs a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
- each of the plurality of uncompressed strings is an alphanumeric string representing a genomic sequence
- each alphanumeric string includs a plurality of characters
- each of the plurality of characters represents one of the following items: a nucleotide in the genomic sequence, and an unknown nucleotide in the genomic sequence
- each of the plurality of uncompressed strings includs a plurality of uncompressed characters
- each of the plurality of compressed strings includs a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
- each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
- the at least two of the plurality of uncompressed characters includs at least three of the plurality of uncompressed characters.
- the at least two of the plurality of uncompressed characters includs at least four of the pluratlity of uncompressed characters.
- At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
- the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
- the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database. Moreover in accordance with a preferred embodiment of the present invention the receiving, the, compressing, the storing, the indexing, the retrieving, and the decompressing, do not require a program external to the database.
- the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing do not require programming.
- each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
- the determining does not require comparing the first genomic sequence with the second genomic sequence.
- the determining does not include any of the following: decompressing the first compressed genomic sequence, and decompressing the second compressed genomic sequence.
- the method also includes decompressing each of the second plurality of compressed genomic sequences.
- the producing does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
- the producing does not require decompressing any of the first plurality of compressed genomic sequences.
- genomic data analyzer does not require comparing the first genomic sequence with the second genomic sequence.
- genomic compressed sequence similarity assessment system also includes a genomic decompressor operative to decompress each of the second plurality of compressed genomic sequence ' s.
- genomic data extractor does not require comparing the genomic sequence with any of the first plurality of genomic sequences.
- genomic data extractor does not require decompressing any of the first plurality of compressed genomic sequences.
- the present invention also seeks to provide an improved method for storage, sorting and retrieval of data in a database.
- the present invention seeks to provide the capability to store, index, and retrieve multiple alphanumeric strings, in compressed form, in a database and to assess string similarity of strings in their compressed form.
- These capabilities and preferably other capabilities are preferably provided using a computer software application or a computer database program.
- a method for storage and retrieval of compressed data and similarity assessment of data in compressed data space including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data representing uncompressed data similar to an uncompressed target data item, based at least in part on the indexing, and decompressing the at least part of the compressed data.
- the retrieving includes: receiving ' a target string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compcessed strings, representing respectively in compressed form a second plurality of strings, the second plurality of string being a subset of the first plurality of strings, each of the second plurality of strings being similar to the target string, according to the at least one similarity criterion.
- a method for storage and retrieval of compressed data including: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
- a method for comparing compressed strings including: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
- a method for assessing similarity of strings including: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
- a compressed data storage and retrieval system including' a data compressor operative to receive uncompressed data and to compress the uncompressed data into compressed data, a compressed data indexer operative to store the compressed data and to index the compressed data, and a data extractor employing the compressed data indexer, and operative to retrieve at least part of the compressed data and to decompress the at least part of the compressed data.
- a compressed string comparison system including: a compressed string evaluator operative to receive two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, and to compare the first compressed string. with the second compressed string, and a compressed string analyzer employing the compressed string evaluator, and operative to determine degree of similarity between the first string and the second string.
- a compressed string similarity assessment system including: a compressed string evaluator operative to receive a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and a compressed string extractor operative to produce a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of string being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
- a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving uncompressed data, compressing the uncompressed data into compressed data, storing the compressed data, indexing the compressed data, retrieving at least part of the compressed data, based at least in part on the indexing, and decompressing the at least part of the compressed data.
- a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving two compressed strings, a first compressed string representing in compressed form a first string, and a second compressed string representing in compressed form a second string, the second string . being different from the first string, comparing the first compressed string with the second compressed string, and determining degree of similarity between the first string and the second string, based at least in part on the comparing.
- a computer-readable medium including a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving the following items: a string, a first plurality of compressed strings, representing respectively in compressed form a first plurality of strings, and at least one similarity criterion, and producing a second plurality of compressed strings, representing respectively in compressed form a second plurality of strings, the second plurality of strings being a subset of the first plurality of strings, each of the second plurality of strings being similar to the string, according to the at least one similarity criterion.
- the retrieving includes: receiving a query, the query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
- the retrieving does not require storing the uncompressed data.
- the retrieving does not require accessing the uncompressed data.
- the retrieving does not require retrieving the uncompressed data.
- the retrieving includes sorting the uncompressed data, based at least in part on the indexing.
- the sorting is alphabetical sorting.
- the uncompressed data includes a plurality of uncompressed strings
- the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
- each of the plurality of uncompressed strings is an alphanumeric string, including a plurality of alphanumeric characters.
- each of the plurality of uncompressed strings includes a plurality of uncompressed characters
- each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
- each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
- the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
- the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
- At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
- the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
- the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing are performed internally by the database.
- the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing do not require a program external to the database. Additionally in accordance with a preferred embodiment of the present invention the receiving, the compressing, the storing, the indexing, the retrieving, and the decompressing, do not require programming.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
- the determining does not include comparing the first string with the second string- 11
- the determining does not include any of the following: decompressing the first compressed string, and decompressing the second compressed string.
- first string and the first string and the second string are alphanumeric strings.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
- the method also includes decompressing each of the second plurality of compressed strings.
- the producing does not require decompressing any of the first plurality of compressed strings.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
- the data extractor provides the following functionality: receiving a query including a query condition and uncompressed query data to which the query condition relates, compressing the uncompressed query data into compressed query data, and extracting the at least part of the compressed data, based at least in part on the compressed query data.
- the functionality of the data extractor does not require storing the uncompressed data.
- the functionality of the data extractor does not require accessing the uncompressed data.
- the functionality of the data extractor does not require retrieving the uncompressed data.
- the data extractor employing the compressed data indexer is operative to sort the uncompressed data.
- the data extractor employing the compressed data indexer is operative to alphabetically sort the uncompressed data.
- the uncompressed data includes a plurality of uncompressed strings
- the compressed data includes a plurality of compressed strings, each of the plurality of uncompressed strings being compressed into a single corresponding one of the plurality of compressed strings.
- each of the plurality of uncompressed strings is an alphanumeric string including a plurality of alphanumeric characters.
- each of the plurality of uncompressed strings includes a plurality of uncompressed characters, each of the plurality of compressed strings includes a plurality of compressed characters, at least two of the plurality of uncompressed characters being compressed into one of the plurality of compressed characters.
- each one of the plurality of uncompressed characters is compressed into one of the plurality of compressed characters.
- the at least two of the plurality of uncompressed characters includes at least three of the plurality of uncompressed characters.
- the at least two of the plurality of uncompressed characters includes at least four of the plurality of uncompressed characters.
- At least three of the plurality of uncompressed characters are compressed into each one of a majority of the plurality of compressed characters.
- the plurality of compressed strings is stored in a field, the field being part of a table and the table being part of a database.
- functionality of the data compressor, the compressed data indexer, and the data extractor is performed internally by a database.
- functionality of the data compressor, the compressed data indexer, and the data extractor does not require a program external to the database.
- functionality of the data compressor, the compressed data indexer, and the data extractor does not require programming.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits sto ⁇ ng one of more than two possible values.
- functionality of the compressed string analyzer does not require comparing the first string with the second string.
- functionality of the compressed string analyzer does not require any of the following: decompressing the first compressed string, and decompressing the second compressed string.
- first string and the first string and the second string are alphanumeric strings.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits sto ⁇ ng one of more than two possible values.
- the compressed string similarity assessment system also includes a compressed string decompressor operative to decompress each of the second plurality of compressed strings.
- functionality of the compressed string extractor does not require comparing the string with any of the first plurality of strings.
- functionality of the compressed string extractor does not require decompressing any of the first plurality of compressed strings.
- each of the plurality of compressed characters is stored in one byte of memory, the one byte of memory including a plurality of bits, each of the plurality of bits storing one of more than two possible values.
- the present invention also seeks to provide an improved method for presentation of genomic sequence data.
- the present invention seeks to increase the ease with which genomic motifs and their inverse- reversed sequences may be visually distinguished from each other.
- the present invention enhances the ease with which a viewer can visually distinguish purine nucleotides from pyrimidine nucleotides and can visually distinguish one set of complementary nucleotides, i e adenine-thymine, from another set of complementary nucleotides, i.e. guanine-cytosine.
- These and other enhanced visual distinctions are preferably provided by employing a novel type of genomic computer font. Different colors may also be applied to different nucleotides.
- a method for displaying genomic sequence data including receiving an alphanumeric string representing genomic sequence data, the alphanumeric string including a plurality of characters, each of the characters representing a nucleotide in the genomic sequence; and expressing the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute
- a method for graphically displaying genomic sequence information including: receiving a first alphanumeric string representing a first genomic sequence, and a second alphanumeric string representing a second genomic sequence, the second genomic sequence being a reversed-inversed genomic sequence of the first genomic sequence; and graphically displaying the first alphanumeric string and the second alphanumeric string, such that a graphical display of the second alphanumeric string is a horizontal and vertical mirror image of a graphical display of the first alphanumeric string.
- a genomic display system comprising: a receiving apparatus operative to receive an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and an expressing apparatus operative to express said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
- a system for graphically displaying genomic sequence information comprising: a genomic sequence expressor, receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and expressing said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string; and a display operative to receive an output from said genomic sequence expressor and to provide a visually sensible display of an expression of said graphical display of said first alphanumeric string and said graphical display of said second alphanumeric string.
- a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving an alphanumeric string representing genomic sequence data, said alphanumeric string comprising a plurality of characters, each of said characters representing a nucleotide in said genomic sequence; and expressing said alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, said second genomic attribute being different from said first genomic attribute.
- the first plurality of nucleotides are represented by at least one first representing attribute
- the second plurality of nucleotides are represented by at least one second representing attribute, the second representing attribute being different from the-first representing attribute
- a computer-readable medium comprising a computer program, the computer program being operative, when in operative association with a computer, to perform the following steps: receiving a first alphanumeric string representing a first genomic sequence and a second alphanumeric string representing a second genomic sequence, said second genomic sequence being a reversed-inversed genomic sequence of said first genomic sequence; and graphically displaying said first alphanumeric string and said second alphanumeric string, such that a graphical display of said second alphanumeric string is a horizontal and vertical mirror image of a graphical display of said first alphanumeric string.
- the representation comprises a human sensible representation.
- the at least one first representing attribute and the at least one second representing attribute are graphical attributes.
- the graphical attributes are shapes.
- the graphical attributes are positions.
- the positions are vertical positions.
- the graphical attributes are orientations.
- orientations are vertical orientations.
- the graphical attributes are colors.
- representation also includes representing each of four nucleotides: adenine, thymine, cytosine, and guanine, by a different color.
- the human sensible representation includes one of the following: a shape with a letter and a shape without a letter.
- the human sensible representation is produced using a computer font.
- the computer font is a TRUETYPE® font.
- the representation comprises a machine sensible representation.
- the at least one first representing attribute and the at least one second representing attribute are machine sensible attributes.
- the first plurality of nucleotides are purine nucleotides
- the second plurality of nucleotides are pyrimidine nucleotides.
- the first plurality of nucleotides consists of adenine and thymine nucleotides
- the second plurality of nucleotides consists of guanine and cytosine nucleotides.
- the representation also distinguishes a third plurality of nucleotides, sharing in common a third genomic attribute, from a fourth plurality of nucleotides, sharing in common a fourth genomic attribute, the fourth genomic attribute being different from the third genomic attribute.
- the third plurality of nucleotides are represented by at least one third representing attribute
- the fourth plurality of nucleotides are represented by at least one fourth representing attribute, the at least one third representing attribute being different from the at least one fourth representing attribute.
- the first plurality of nucleotides are purine nucleotides
- the second plurality of nucleotides are pyrimidine nucleotides
- the third plurality of nucleotides are adenine and thymine nucleotides
- the fourth plurality of nucleotides are guanine and cytosine nucleotides.
- the method also includes expressing the first alphanumeric string and the second alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute.
- the genomic sequence expressor is also operative to receive an alphanumeric string which represents genomic sequence data, the alphanumeric string including a plurality of characters, each of the plurality of characters representing a nucleotide in the genomic sequence, and to express the alphanumeric string using a representation which distinguishes a first plurality of nucleotides, sharing in common a first genomic attribute, from a second plurality of nucleotides, sharing in common a second genomic attribute, the second genomic attribute being different from the first genomic attribute, and the display is also operative to receive an output from the expressor and to display the genomic sequence using the representation.
- Fig. 1 is a simplified block diagram illustrating a computer application constaicted and operative in accordance with a preferred embodiment of the present invention
- Fig. 2 is a simplified block diagram illustrating a genomic data compression mechanism, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 3A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
- Fig. 3B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention.
- Fig. 4A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention
- Fig. 4B is a table illustrating preferred values assignable to 'nucleotide- representing' bits in accordance with a preferred embodiment of the present invention
- Fig. 4C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters
- Fig. 5 is a simplified flowchart illustrating operation of a genomic data compression engine constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 6 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention
- Fig. 7A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention.
- Fig. 7B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention.
- Fig. 8 is a simplified flowchart illustrating operation of a genomic data decompression engine constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 9A is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data into a compressed genomic string
- Fig. 9B is a simplified illustration of an example of compression of an uncompressed genomic string representing genomic sequence data, and containing an unknown nucleotide, into a compressed genomic string;
- Fig. 10 is a simplified block diagram illustrating shifted genomic sequences utilized by a compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 1 1 is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 12A is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence
- Fig. 12B is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence
- Fig. 12C is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence
- Fig. 13 is a simplified functional diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 14 is a simplified block diagram illustrating a genomic pattern analysis database constructed and operative in accordance with a preferred embodiment of the present invention.
- Fig. 15 is a flowchart diagram illustrating operation of a genomic preprocessing unit constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 16 is a flowchart diagram illustrating operation of a genomic query processing unit constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 17 is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention.
- Fig. 18 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention.
- Fig. 19 is a simplified block diagram illustrating a mechanism for data compression and decompression, which is a preferred implementation of a compression mechanism and a decompression mechanism constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 20A is a simplified illustration of a preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
- Fig. 20B is a simplified illustration of an alternative preferred implementation of a compressed byte bitmap used in accordance with a preferred embodiment of the present invention
- Fig. 21 A is a table illustrating preferred values assignable to 'header' bits in accordance with a preferred embodiment of the present invention
- Fig. 2 IB is a table illustrating preferred values assignable to character- representing bits in accordance with a preferred embodiment of the present invention
- Fig. 21C is a table illustrating preferred values assignable to bits when encoding one or more uncommon characters
- Fig. 22 is a simplified flowchart illustrating operation of a compression engine constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 23 is a simplified flowchart illustrating a methodology for generating translation tables used in accordance with a preferred embodiment of the present invention.
- Fig. 24A is a simplified illustration of a compression table employed in accordance with a preferred embodiment of the present invention.
- Fig. 24B is a simplified illustration of a decompression table employed in accordance with a preferred embodiment of the present invention.
- Fig. 25 is a simplified flowchart illustrating operation of a decompression engine constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 26A is a simplified illustration of an example of compression of an uncompressed string into a compressed string
- Fig. 26B is a simplified illustration of an example of compression of an uncompressed string, and containing a rare character, into a compressed string;
- Fig. 27 is a simplified block diagram illustrating shifted compressed strings utilized by a compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 28 is a simplified flowchart illustrating operation of the compressed string similarity search module constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 29A is a simplified illustration of an example of identifying a character string having one character replacement relative to a target string
- Fig. 29B is a simplified illustration of an example of identifying a character string having two character additions relative to a target string
- Fig. 29C is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string
- Fig. 30 is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention.
- Fig. 3 1 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 32 is a simplified flowchart illustrating preferred operation of a genomic graphic representation engine, constructed and operative in accordance with a preferred embodiment of the present invention
- Fig. 33 is a simplified illustration of an example demonstrating conversion of alphanumeric genomic representation into graphic genomic representation
- Fig 34 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation in comparing a genomic motif sequence with the inverse-reversed sequence of this motif;
- Fig 35 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation, in visually distinguishing adenine- thymine-rich sequences, from cytosine-guanine-rich sequences;
- Fig 36 is a simplified illustration of an example demonstrating an advantage of a graphic genomic representation, in visually distinguishing purine nucleotides from pyrimidine nucleotides.
- FIG. 1 is a simplified block diagram illustrating a computer application constructed and operative in accordance with a preferred embodiment of the present invention.
- Each of a plurality of genomic sequences 100 is compressed by a compression mechanism 102, collectively yielding a respective plurality of compressed genomic sequences 104, each typically represented as a compressed alphanumeric string.
- the compressed genomic sequences 104 are stored in a plurality of records in a table in a database 106, and a compressed genomic sequences index 108 is constructed, which indexes the compressed genomic sequences 104.
- a target genomic sequence 1 10, one or more similarity criteria 1 11 and a query condition 1 12 relating to the target genomic sequence 110 are provided by a user of the database 106, in order to find all genomic sequences 100 in the database 106 which comply with the provided query condition 112 as it is applied to genomic sequences similar to the target genomic sequence 110, to a degree specified by the similarity criteria 1 12.
- the target genomic sequence 110 is compressed by a compression mechanism 1 14, which may be similar to compression mechanism 102, into a compressed target genomic sequence 116.
- the compressed target genomic sequence 116 and the similarity criteria 1 1 1 are passed on as input to a compressed genomic sequence similarity search module 1 18.
- the compressed genomic sequence similarity search module in conjunction with the compressed genomic sequence index 108 is operative to query the database 106, retrieving a plurality of compressed genomic sequences which comply with the query condition 1 12, as applied to genomic sequences which are similar to the target genomic sequence 1 10, to a degree defined by similarity criteria 111, in compressed form. These results are designated compressed similar to target query results 120.
- the compressed genomic sequence similarity search module 118 is further described below with reference to Figs. 10, 11, 12A, 12B and 12C.
- Each of the compressed genomic sequences similar to target 120 is decompressed by a decompression mechanism 122, collectively yielding a respective plurality of genomic sequences similar to target 124.
- Preferred embodiments of the compression mechanisms 102 and 114 and of the decompression mechanism 122 which preferably reverses the process of these compression mechanisms, are further described below with reference to Fig. 2. It is appreciated that while the end result is retrieval of genomic sequences which are similar to the target genomic sequence 110, the actions of sequence similarity comparison and retrieval are preferably performed on compressed genomic sequences: comparing compressed target genomic sequence 116 with compressed genomic sequences 104, using the compressed genomic sequence index 108.
- An important aspect of the present invention is that it allows determining the level of similarity between genomic sequences, by comparing compressed genomic sequences to which these genomic sequences correspond.
- Fig. 2 is a simplified block diagram illustrating a mechanism for compression and decompression of genomic data. This mechanism is a preferred implementation of the compression mechanisms 102 and 114 and of the decompression mechanism 122 described hereinabove with reference to Fig. 1.
- An uncompressed genomic string 200 is given as an example of one of the genomic sequences 100 or of the target genomic sequence 110 described hereinabove with reference to Fig. 1.
- the uncompressed genomic string 200 represents genomic sequence data.
- genomic sequence data is typically represented as an alphanumeric string comprising only five letters: A, T, C, and G, representing the four nucleotides, which comprise the genome: Adenine, Thymine, Cytosine, and Guanine respectively, and the letter N or the minus sign, representing locations in the sequence in which the nucleotide is currently not known.
- the letter N typically appears in genomic sequence data much less frequently than do the other four letters. Further, when N's do appear in a genomic sequence they are frequently found contiguously rather than separately since it is frequently the case that a contiguous group of nucleotides in the sequence, rather than just one nucleotide, are unknown.
- the uncompressed genomic string 200 comprises a plurality of bytes, each storing a character - A, T, C, or G - representing a nucleotide.
- the uncompressed genomic string 200 shown in Fig. 2 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character representing a corresponding nucleotide: NUC 1, NUC 2 and NUC 3, respectively.
- the uncompressed genomic string 200 is compressed by a genomic data compression engine 202 into a compressed genomic string 204. Operation of a preferred embodiment of the genomic data compression engine 202 is further described below with reference to Fig. 5
- the genomic data compression engine 202 employs a compression table 206 in compression of uncompressed genomic string 200 into compressed genomic string 204
- the compression table 206 is preferably a translation table and holds a list of all possible or reasonable 3 -alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed
- a preferred embodiment of the compression table 206 is further described below with reference to Fig 7A
- the compressed genomic string 204 preferably comprises one or more compressed bytes 208
- the compressed genomic string 204 shown in Fig 2 comprises only one compressed byte 208, which represents in compressed form all three nucleotides which are represented in the uncompressed genomic string 200, NUC 1, NUC 2 and NUC 3, which nucleotides require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form.
- compressed genomic string 204 which comprises of a plurality of compressed bytes 208, may alternatively be stored as one or more integers.
- an Integer is typically defined as a data-type comprising 4 bytes. It is therefore possible to compress an uncompressed genomic string 200 comprising 12 nucleotides into 4 compressed bytes 208, and then to store all 4 resulting compressed bytes 208 as one Integer Longer genomic strings may be compressed into longer Integers, such as Biglnt data type in MS SQL-2000® which comprises 8 bytes, or as a plurality of Integers or Biglnts. For example, a genomic string comprising 48 nucleotides may be compressed into 2 Biglnts
- Tinylnt datatype which comprises one byte of memory
- a Tinylnt datatype which comprises one byte of memory
- genomic strings which are longer than 3 or 4 characters, i.e. are compressed into more than one Tinylnt
- the compressed byte 208 is further described below with reference to Figs 3, 4 A, 4B, and 4C
- the compressed genomic string 204 may be decompressed by a genomic data decompression engine 210 back to uncompressed genomic string 200, preferably by reversal of the methodology of the genomic data compression engine 202 Operation of the genomic data decompression engine 210 is further described below with refei ence to Fig 8
- Genomic data decompression engine 210 employs a decompression table 212 in decompressing compressed genomic string 204 into uncompressed genomic string 200
- the decompression table 212 is a translation table and holds a list of the bit- values for each possible or reasonable compressed byte 208 and for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed
- the decompression table 212 is further described below with reference to Fig 7B
- FIG. 3 A is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig 2
- the compressed byte 208 of Fig 2 preferably comprises 8 bits' BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII Preferably, these bits are divided into four groups, each containing 2 bits
- a HEADER typically contains BIT I and BIT II
- a BIT-PAIR I typically contains BIT III and BIT IV
- a BIT-PAIR II typically contains BIT V and BIT VI
- a BIT-PAIR III typically contains BIT VII and BIT VIII
- the HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent
- the compression of genomic data is such that the compressed byte may either store up to three nucleotides, or may store up to three unknown-nucleotides, i e 'N's, but preferably not a combination of nucleotides and N's
- the HEADER stores information which indicates how many nucleotides the compressed byte 208 represents, one, two or three, or alternatively if the entire compressed-byte represents one or more 'N's
- the values assignable to bits of the HEADER are further described below with reference to Fig 4A BIT-PAIR I, BIT-PAIR II and BIT-PAIR III each contain 2 bits which are capable, when taken together, of representing one of four possible nucleotides, A, T, C and G.
- values assigned to the two bits of BIT -PAIR I determine whether the compressed byte 208 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 208 represents, are further described below with reference to Fig. 4C.
- Fig. 3B is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 208 of Fig. 2.
- Alternative compressed byte 300 is an alternative byte bitmap which may be used for compression of genomic data, instead of the byte bitmap of compressed byte 208, depicted in Fig. 3A.
- Alternative compressed byte 300 comprises four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs in compressed-byte 208 of Fig. 3 A. Unlike compressed byte 208 of Fig. 3 A, in which BIT I and BIT II function as a HEADER, alternative compressed byte 300 does not comprise any such header. All 8 bits of alternative compressed byte 300 function as one of four bit-pairs, each of said bit-pairs representing a nucleotide. Alternative compressed byte 300 is therefore capable of representing 4 nucleotides in compressed form, as opposed to compressed byte 208 of Fig. 3A, which is capable of representing 3 nucleotides.
- alternative compressed byte 300 may be useful when compressing genomic sequences which do not include unknown nucleotides, and are of a fixed length. If the length of the uncompressed genomic string 200 is known, then it is possible to ignore the possible tailing zeros at the right end of the alternative compressed byte 300, which do not represent a nucleotide, but rather represent a blank.
- an uncompressed genomic string 200 which is known to be 7 nucleotides long, may be compressed into 2 alternative compressed bytes 300: the first containing in compressed form 4 nucleotides, and the second containing 3 nucleotides.
- BIT VII and BITVIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed genomic string is known to be 7 nucleotides long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
- Fig. 4A is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 208 shown in Fig. 3 A.
- Assigning the value '01 ' to the bits of the HEADER i.e. assigning '0' to BIT I and ' 1 ' to BIT II of compressed byte 208 shown in Fig. 3A, signifies that the compressed byte 208 represents only one nucleotide, as represented by the values in BIT III and BIT IV, both belonging to BIT-PAIR I of compressed byte 208.
- the remaining four bits of the compressed byte 208, BIT V, BIT VI, BIT VII and BIT VUI are to be ignored and do not represent any additional nucleotide.
- the remaining two bits of the compressed byte 208, BIT VII & BIT VIII, are to be ignored and do not represent any additional nucleotide.
- Fig 4B is a table illustrating the preferred values assignable to the nucleotide-representing bits: BIT III, BIT IV, BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 208 shown in Fig. 3 A.
- each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 208 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively.
- the values presented in Fig 4B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic nucleotides A, T, C or G
- bit-pair i e assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIII, signifies that that bit-pair, i e BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 208 of Fig 3 A, represents the nucleotide 'A'.
- bit-pair Assigning the value of '01' to any of the three bit-pairs representing one of the three nucleotides, i e. assigning '0' to BIT III & '1' to BIT IV, or assigning '0' to BIT V & ' L to BIT VI, or assigning '0' to BIT VII & ' 1 ' to BIT VIII, signifies that that bit-pair, i e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide 'C.
- bit-pair Assigning the value of ' 11 ' to any of the three bit-pairs representing one of the three nucleotides, i e assigning ' 1 ' to BIT III & ' 1' to BIT IV, or assigning T to BIT V & ' 1 ' to BIT VI, or assigning ' 1 ' to BIT VII & ' 1 ' to BIT VIII, signifies that that bit-pair, i e BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the nucleotide "T.
- Fig. 4C is a table illustrating preferred values ' assignable to BIT III and BIT IV of Fig. 3 A respectively, when encoding one or more .unknown nucleotides.
- a value '00' assigned to the bits of the HEADER of Fig. 3 A signifies that the compressed byte 208 represents one or more unknown nucleotides and does not represent any known nucleotides.
- BIT III and BIT IV may be used to signify if the entire byte represents one, two of three 'N's.
- Assigning the value ' 10' to BIT III & BIT IV i.e. assigning '1' to BIT III and '0' to BIT IV, signifies that the entire compressed byte 208 represents two unknown nucleotides.
- each compressed byte 208 it is possible to use each compressed byte 208 to encode more or less than three 'N's.
- BIT III through BIT VIII of Fig. 3 A it is possible to use BIT III through BIT VIII of Fig. 3 A to signify up to 64 N's represented by a compressed byte 208.
- Fig. 5 is a simplified flowchart illustrating operation of the genomic data compression engine 202 of Fig. 2, constructed and operative in accordance with a preferred embodiment of the present invention.
- the compression table 206 stores for each possible combination of up to three nucleotides, e.g. 'ATG', 'GGC, 'AT', bit-values which represents this combination in compressed form in one compressed byte 208, preferably according to the values suggested in Figs. 4A, 4B, and 4C.
- Figs. 7A and 7B present examples of preferred compression and decompression tables 206 and 212, respectively.
- an iterative process of compression of multiple strings takes place.
- An uncompressed genomic string such as uncompressed genomic string 200 of Fig. 2 is received.
- the uncompressed genomic string 200 is a genomic sequence represented by a string 'ATGAT'. This example is followed through the following steps of Fig. 5.
- the uncompressed genomic string 200 preferably is parsed into substrings, each having up to three nucleotides (3-nucleotide-substrings) by parsing the uncompressed genomic string 200 from left to right. It should be noted that one or more of the nucleotides in a 3-nucleotide-substring may in fact be unknown, i.e. an 'N'. In the given example the string 'ATGAT' is parsed from left to right, into 3-nucleotide- substrings, yielding 'ATG' and 'AT'.
- a recursive operation is initiated, which looks up each 3- nucleotide-substring in the compression table 206, and based on the contents of the compression table, assigns appropriate bit values to the bits in one or more compressed byte 208.
- the compressed bytes 208 are combined to yield a compressed genomic string 204.
- Fig. 6 is a simplified flowchart illustrating a preferred functionality for generating translation tables, including compression table 206 and decompression table 212 of Fig. 2.
- 3-nucleotide-substrings i.e. 1 -nucleotide 2-nucleotide and 3-nucleotide combinations of A, T, C, G and N.
- these combinations may include ATN, ATA and ATC.
- N's are encoded in compressed bytes 208 which do not also represent known nucleotides. Values '00' are assigned to the headers of such compressed bytes and all known nucleotides in the 3-nucleotide-substring are encoded in other bytes.
- the HEADER is assigned the value '01 ' and the single nucleotide is represented by BIT III & BIT IV of the compressed byte 208.
- the HEADER is assigned the value ' 10' and the two nucleotides are represented by BITS III - VI of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV and the second nucleotide represented by BIT V & BIT VI.
- the HEADER is assigned the value ' 1 1 ' and the three nucleotides are represented by BITS III - VIII of the compressed byte 208, with the first nucleotide represented by BIT III & BIT IV, the second nucleotide represented by BIT V & BIT VI and the third nucleotide represented by BIT VII & BIT VIII.
- Each 3-nucleotide-substring and its corresponding one or more compressed bytes 208 are stored in translation tables including compression table 206 and decompression table 212.
- N's unknown nucleotides
- N's unknown nucleotides
- the present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three- nucleotide combinations which contain only nucleotides and no N's, as well as those containing only N's and no nucleotides, are both compressed into a single byte. Most of the rare cases of 3-nucleotide mixtures of nucleotides and N's are compressed into two bytes. Only a minority of extremely rare combinations of nucleotides and N's require three bytes and therefore are in fact not compressed.
- FIG. 7A is a simplified illustration of a preferred implementation of compression table 206 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
- compression table 206 The goal of compression table 206 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit-values of the one or more compressed bytes 208 required to represent in compressed form every possible 1- nucleotide, 2-nucleotide and 3-nucleotide sub-string of uncompressed genomic string 200.
- the compression table 206 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 206 in terms of table comprising fields is meant for clarity and not meant to be limiting, and that the compression table 206 may equally be implemented as a 'CASE' or 'IF-THEN' programming code in a any suitable computer language, as is well known in the art.
- computer code can be written, which comprises a plurality of TF-THEN' or 'CASE' arguments, each one of the arguments providing bit-values of the one or more compressed bytes 208 representing in compressed form one 3-nucleotide-substring of uncompressed genomic string 200.
- Compression table 206 preferably comprises multiple records each containing 4 fields: uncompressed nucleotide-combination 700, compressed byte I 702, compressed byte II 704 and compressed byte III 706. For clarity, an example is given for the content which may be stored in each of these fields.
- the uncompressed nucleotide combination 700 is a field which stores all possible 3-nucleotide substrings, i.e. 1-nucleotide, 2-nucleotide and 3-nucleotide combinations, including combinations of nucleotides only and combinations which include N's.
- uncompressed nucleotide combination 700 stores a 3- nucleotide combination 'ATN'.
- Compressed byte I 702, compressed byte II 704 and compressed byte III 706 respectively are fields which store for each uncompressed nucleotide-combination 700 the bit-values for each of the one or more compressed bytes 208 required for encoding it.
- compressed byte I 702 stores ' 10001100', which represents the nucleotide-combination 'AT'
- compressed byte II 704 stores '00010000', which represents 'N ⁇
- Compressed byte III 706, in the given example stores null, since only two compressed bytes are required to represent the nucleotide combination 'ATN'.
- most 3-nucleotide substrings may be compressed into one compressed byte 208, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 206 comprises three compressed bytes fields 702-706, one compressed byte field, such as compressed byte I 702, is sufficient to translate a vast majority of 3-nucleotide combinations to be typically found in a genomic sequence.
- Fig. 7B is a simplified illustration of a preferred implementation of decompression table 212 of Fig. 2, employed in accordance with a preferred embodiment of the present invention.
- decompression table 212 is to provide a translation-table, also referred to as a 'lookup table', which provides the 1-nucleotide, 2-nucleotide or 3- nucleotide uncompressed genomic string 200 preferably corresponding to every possible compressed byte 208.
- decompression table 212 in terms of table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 212 may equally be implemented as a 'CASE' code in any computer language, as is well known in the art.
- the decompression table 212 preferably comprises multiple records each containing 2 fields: compressed byte 708 and decompressed nucleotide-combination 710.
- Compressed byte 708 is a field which preferably stores bit-values of every possible compressed byte 208.
- Decompressed nucleotide combination 710 is a field which stores for each compressed byte 708 the 1-nucleotide, 2-nucleotide or 3-nucleotide uncompressed genomic string 200 which it encodes.
- the field compressed byte 708 may contain the compressed byte 208 bit- value ' 10001100' and the respective field decompressed nucleotide combination 710 may contain the 2-nucleotide combination 'AT' which this bit value represents in compressed form.
- genomic data decompression engine 210 of Fig. 2 performs a reverse action of that of genomic data compression engine 202 of Fig. 2, which was further described hereinabove with reference to Fig. 5.
- a compressed genomic string 204 is received in order to be decompressed.
- Genomic data decompression engine 210 of Fig. 2 gets the compressed genomic string 204 of Fig. 2 to be decompressed.
- the process shown in Fig. 8 is explained with reference to an example wherein the compressed genomic string 204 comprises two compressed bytes 208 the bit-value of which are: ' 11001110' & ' 10001 100' respectively. This example is followed through the following steps of Fig.
- a recursive operation is initiated, which parses the received compressed genomic string 204 into the compressed byte 208 included in this compressed genomic string.
- Each compressed byte 208 is looked up in the decompression table 212, and based on the contents of the compression table, finds out the 3-nucleotide substring which this compressed byte represents.
- the 3-nucleotide substrings are combined to yield an uncompressed genomic string 200.
- the second compressed byte in the compressed genomic string has bit-values of ' 10001100', which when looked-up in the decompression table is found out to represent the nucleotide combination 'AT'.
- 'ATG' represented by compressed byte ' 1 1001 1 10'
- 'AT' represented by compressed byte ' 10001100'
- FIG. 9 A is a simplified illustration of an example of compressing an uncompressed genomic string 200 into a compressed genomic string 204, both shown in Fig. 2.
- Uncompressed genomic string 'ATGAT' 900 is an uncompressed genomic string 200 comprising the nucleotides: 'ATGAT'.
- nucleotide-triplet-1 'ATG' 902 nucleotide-triplet-1 'ATG' 902
- Compressed byte-1 906 encodes three nucleotides: 'ATG'. Therefore a value of ' 1 1 ' is assigned to the two bits of the HEADER of compressed byte-1 906, as indicated by reference numeral 908, signifying that this compressed byte 208 encodes 3 nucleotides.
- Value '00' is set to the two bits of BIT-PAIR I, of compressed byte-1 906, as indicated by reference numeral 910, signifying that the first nucleotide represented by this byte is 'A'.
- Value ' 11 ' is assigned to the two bits of BIT-PAIR II of compressed byte- 1 906, as indicated by reference numeral 912, signifying that the second nucleotide represented by this byte is 'T'.
- Value ' 10' is assigned to the two bits of BIT-PAIR III of compressed byte-1 906, as indicated by reference numeral 914, signifying that the third nucleotide represented by this byte is 'G'.
- Compressed byte-2 908 encodes two nucleotides: 'AT'. Therefore a value of ' 10' is assigned to the two bits of the HEADER of compressed byte-2 908, as indicated by reference numeral 916, signifying that this compressed byte 208 encodes 2 nucleotides, and that therefore the two bits of BIT-PAIR III are to be ignored. Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte- 2 908, as indicated by reference numeral 918, signifying that the first nucleotide represented by this byte is 'A'.
- Value ' 1 1 ' is assigned to the two bits of BIT-PAIR II of compressed byte-2 908, as indicated by reference numeral 920, signifying that the second nucleotide represented by this byte is 'T'.
- FIG. 9B is a simplified illustration of another example of compression of an uncompressed genomic string 200 of Fig. 2, containing an unknown nucleotide, into a compressed genomic string 204 of Fig. 2.
- Uncompressed genomic string 'ATNCG' 950 is an uncompressed genomic string 200 comprising the characters: 'ATNCG'.
- nucleotide-triplet- 1 'ATN' 952 nucleotide-triplet-2 'CG_' 954. It is appreciated that the nucleotide- triplet-2 'CG_' 954 actually contains not a triplet but only 2 nucleotides: C and G.
- nucleotide-triplet-1 'ATN' 952 contains an 'N' it is preferably represented by two compressed bytes 208 rather than one: the first, compressed byte-1 956, encodes 'AT', and the second, compressed byte-2 958, encodes 'N ⁇
- Compressed byte-1 956 encodes two nucleotides, 'AT', therefore ' 10' is assigned to the two bits of the HEADER of compressed byte-1 956, as indicated by reference numeral 960, signifying that this compressed byte 208 encodes 2 nucleotides.
- Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte- 1 956, as indicated by reference numeral 962, signifying that the first nucleotide represented by this byte is 'A'.
- Value ' 1 1 ' is assigned to the two bits of BIT-PAIR II of compressed byte- 1 956, as indicated by reference numeral 964, signifying that the second nucleotide represented by this byte is 'T'.
- Value '00' assigned to the bits of BIT-PAIR III of compressed byte-1 956, as indicated by reference numeral 966, is ignored, and does not represent an 'A' since the HEADER specified that this byte encodes only 2 nucleotides.
- Compressed byte-2 958 is dedicated to encoding 'N's, in this case only one 'N ⁇ which is derived from the nucleotide-triplet-1 'ATN' 952. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 958, as indicated by reference numeral 968, signifying that this compressed byte 208 is dedicated to encoding one or more N's.
- the value '01' is assigned to the two bits of BIT-PALR I of compressed byte-2 958, as indicated by reference numeral 970, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, The zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 972 & 974, are ignored.
- Compressed byte-3 990 encodes two nucleotides: 'CG'. Therefore a value of ' 10' is assigned to the two bits of the HEADER of compressed byte-3 990, as indicated by reference numeral 976, signifying that this compressed byte 208 encodes only 2 nucleotides.
- Value '01 ' is assigned to the two bits of BIT-PAIR I of compressed byte- 3 990, as indicated by reference numeral 978, signifying that the first nucleotide represented by this byte is 'C ⁇
- Value ' 10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 990, as indicated by reference numeral 980, signifying that the second nucleotide represented by this byte is 'G'.
- Fig. 10 is a simplified block diagram illustrating shifted genomic sequences utilized by the compressed genomic sequence similarity search module 1 18 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
- each compressed byte represents in compressed form 3 or 4 nucleotides. It is therefore very easy to compare entire 'triplets' of nucleotides, but if an addition or deletion of a single nucleotide occurs, then all triplets 'downstream' to the one modified will have seemed to have changed completely, whereas in fact, they have only been 'shifted' to the right or to the left by one location.
- the basic concept of the compressed genomic sequence similarity search module 1 18 of Fig. 1 is therefore to calculate all possible 'shifted' variations of the compressed genomic sequence 110 of Fig. 1, and to use them to search for compressed genomic sequences similar to target 120 of Fig. 1.
- target genomic sequence 1000 which is a compressed genomic sequence comprising of 12 nucleotides, NI through N12, which are represented in compressed form by 4 compressed bytes 208, BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
- shifted genomic sequences 1002 are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010.
- the first nucleotide in target genomic sequence 1000, NI has been removed in the 'minus one' shifted genomic sequence 1004, and therefore 'minus one' shifted genomic sequence 1004 begins with N2.
- the nucleotides compressed into each of the four compressed bytes 208 of 'minus one' shifted genomic sequence 1004 are therefore 'shifted to the left' by one location.
- sequence of nucleotides compressed into each of the four compressed bytes 208 of 'minus two' shifted genomic sequence 1006 is shifted to the left by two locations; that of 'plus one' shifted genomic sequence 1008 is shifted to the right by one location; and that of 'plus two' shifted genomic sequence 1010 is shifted to the right by two locations.
- Fig. 11 is a simplified flowchart illustrating operation of the compressed genomic sequence similarity search module 118 of Fig. 1 constructed and operative in accordance with a preferred embodiment of the present invention.
- Operation of the compressed genomic sequence similarity search module 1 18 begins by getting a target compressed genomic sequence 110 of Fig. 1.
- shifted compressed genomic sequences are generated: 'minus one' shifted genomic sequence 1004, 'minus two' shifted genomic sequence 1006, 'plus one' shifted genomic sequence 1008, 'plus two' shifted genomic sequence 1010, as described hereinabove with reference to Fig. 10.
- all compressed genomic sequences 104 having at least one compressed byte which matches that of the compressed target genomic sequence 1000 or of one of the four shifted genomic sequences 1004-1010, are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed genomic sequence 104 is compared to the first compressed byte of the compressed target genomic sequence and to the first compressed byte of each of the four compressed shifted genomic sequences. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these genomic sequences. All compressed -genomic sequences having at least one match, are considered potentially similar, and are passed on the next step.
- Compressed genomic sequences having less mismatching compressed bytes with the target or one of the shifted genomic sequences than a certain user defined 'threshold', are considered potentially very similar, and are passed on to the next step.
- mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison.
- the resulting compressed genomic sequences similar to target 120 of Fig. 1 are considered to represent in compressed form. genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target genomic sequence 1 10, and are delivered.
- Fig. 12A is a simplified illustration of an example of identifying a genomic sequence having one nucleotide replacement relative to a target genomic sequence.
- Fig. 12A shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), in which a nucleotide designated N13 shown in broken line format, in the compressed byte designated 1R-BYTEIL has replaced nucleotide N5 in that same spot in the original TARGET GENOMIC SEQUENCE.
- Fig. 12B is a simplified illustration of an example of identifying a genomic sequence having two nucleotide additions relative to a target genomic sequence.
- Fig. 12B shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (2 ADDITIONS), in which two nucleotides designated N13 and N 14 shown in broken line format, in compressed byte designated 2A-B YTEIL have been added to the genomic sequence relative to the original TARGET GENOMIC SEQUENCE, 'pushing' nucleotides N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following nucleotides by two positions.
- Fig. 12C is a simplified illustration of an example of identifying a genomic sequence having one nucleotide deletion relative to a target genomic sequence.
- Fig. 12C shows a genomic sequence designated SIMILAR TO TARGET GENOMIC SEQUENCE (1 DELETION), in which one nucleotide designated N5 of TARGET GENOMIC SEQUENCE has been deleted, shifting all nucleotides from N6 onwards one position to the left.
- the missing N5 in byte ID-BYTE II is represented by a small blank broken line box between N4 and N6.
- the two genomic sequences may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte ID-BYTE II.
- FIGs. 10 and 12A, 12B and 12C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. 'plus three' shifted sequence, etc.
- Fig. 13 is a simplified block diagram of a computer database application constructed and operative in accordance with a preferred embodiment of the present invention. It is appreciated that the computer database application may be implemented in any appropriate programmed computer system, for example, an appropriate personal computer, or a personal computer server, and may use any appropriate database system, for example MICROSOFT SQL SERVER 12000®.
- the embodiment of Fig. 13 comprises a mechanism for efficient pattern analysis of genomic sequence data.
- the general idea of the present invention is to view the task of genomic pattern analysis in a manner similar to an attempt to understand a book in a totally foreign and unknown language, but where some clues do exist as for the meaning of a few specific words, or the general significance of several chapters.
- the approach in such a case would be to break up the book into meaningful sections, such as chapters, and within each such chapter, to make a list of all the words appearing in that chapter. This then allows one to find correlations between words and other words, or between words and the chapters they are found in.
- genomic sequence data may be approached in much a similar manner.
- First the 'book' i.e. the DNA sequence, is divided into meaningful 'chapters' such as protein coding regions, and regions upstream and downstream of these regions.
- meaningful 'chapters' such as protein coding regions, and regions upstream and downstream of these regions.
- regions adjacent to protein coding regions are often involved in inhibiting or enhancing the production of these proteins.
- Additional 'sub- chapters' may be created as well, such as regions within a protein coding region, which is known or suspected to have a specific function.
- each of these 'chapters' i.e. genomic protein related regions
- genomic sequence data we do not know what the 'words' are
- the approach taken by the present invention is to parse each 'chapter' into 'words' of arbitrary length/s, such as all lengths between 10 and 30 characters. This approach generates a very long list of 'potential words', knowing that most of these are non-sense, and only a small fraction are genuine 'words'.
- genomic data begins by obtaining genomic data to be analyzed, and other definitions and preferences required for the genomic data analysis.
- Primary required data is raw genomic data 1100, including sequenced DNA data 1102 and protein location information 1104.
- Protein location information 1104 comprises relative offset of the protein coding regions of proteins known to be encoded by the sequenced DNA 1 102, as well as the orientation of these protein coding areas where available. Protein location 1104 is typically part of basic genomic annotation data which is made available as part of the genomic sequencing effort.
- genomic data 1106 may also contribute to the process of genomic pattern analysis, and may include various properties of known proteins, such as tissue-specific expression of proteins, the organism in which each protein is expressed (when analyzing genomes of multiple organisms), a protein specific biological function. This information may also include additional research-derived information about proteins encoded by the sequenced DNA, such as grouping specific proteins into groups of proteins which are of particular interest. Other genomic data 1 106 may also include information about specific sites or locations in a protein-coding region, such as various protein binding sites, or regions upstream of the coding area of a protein, which are of specific interest. The significance and use of such additional data is elaborated hereinbelow.
- User defined criteria 1108 may be entered, defining various parameters by which the genomic sequence data analysis is performed, as explained hereinbelow.
- the raw genomic data 1100, other genomic data 1106 and user defined criteria 1 108 are entered into a genomic pattern analysis engine 1110.
- the genomic pattern analysis engine 1110 is a computer based program, preferably built around a database program, such as MICROSOFT SQL-SERVER 12000®, and comprising a genomic pre-processing unit 1112, a genomic pattern analysis database 1 1 14 and a genomic query processing unit 1116.
- the genomic pre-processing unit 1112 is a computer program operative in conjunction with a database program, which receives the raw genomic data 1100 and other genomic data 1 106 entered to the genomic pattern analysis engine 1110, pre- processes it and stores the pre-processed genomic data to the genomic pattern analysis database. Operation of the genomic pre-processing unit 1112 is further described below with reference to Fig. 15.
- the genomic pattern analysis database 1114 is a database storing the preprocessed genomic data.
- the data structure of the genomic pattern analysis database 1 1 14 is designed so as to be conducive to pattern analysis of genomic sequence data, and to include the following major data elements:
- Proteins 11 18 is a list of proteins known to be encoded by the sequenced DNA 1 102;
- Protein related regions 1120 are regions in the sequenced DNA, which are related to each of the proteins 1118, such as protein coding regions, and regions upstream of protein coding regions;
- Short genomic segments in regions 1122 are all short genomic segments of a length defined by the user defined criteria 1108, found in each of the protein related regions; and ,
- SGSR-to-SGSR relationships 1124 document various relationships between two or more SGSRs, such as the distance between them.
- genomic pattern analysis database is further described below with reference to Fig. 14.
- the genomic pattern analysis database 1114 may be queried by the genomic query processing unit 1116, allowing a user 1126 to analyze the raw genomic data 1 100, by using the preprocessed data derived therefrom and stored in the genomic pattern analysis database 1114. Operation of the genomic query processing unit is further described below with reference to Fig. 16.
- genomic pattern analysis engine a basic concept of the genomic pattern analysis engine is to perform as much pre-processing and storing of useful intermediate results as possible before the actual process of pattern analysis, so as to be able afterwards to produce very fast results in response to relatively complicated pattern analysis queries.
- This approach allows performing complicated genomic' data analysis tasks, frequently carried out only by mainframe computers or super-computers, on relatively inexpensive, and easily scalable hardware, such as PC server computers. While this approach potentially requires very large databases, with up to billions of records in some cases, and may requires significant pre-processing time, it still offers a dramatically more cost- effective alternative than the traditional extremely expensive parallel processing alternatives.
- Fig. 14 is a block diagram illustrating a preferred embodiment of the genomic pattern analysis 1114 of Fig. 13.
- the genomic pattern analysis database preferably comprises of four major data elements: proteins 1118, protein related regions 1120, short genomic segments in region (SGSR) 1122 and SGSR-to-SGSR relationships 1124.
- proteins 1118 proteins 1118
- protein related regions 1120 protein related regions 1120
- short genomic segments in region (SGSR) 1122 SGSR-to-SGSR relationships 1124.
- each of these data elements is stored in a table in the database, related to the other tables, as described below.
- Proteins 1118 is a plurality of proteins known to be encoded by the raw genomic data 1 100. For each of these proteins, multiple properties relevant to the genomic pattern analysis may be stored. For example, an organism 1200 it belongs to, a biological or other function 1202 it is known to perform, and its expression 1204 in a specific organ or tissue.
- Each of the proteins 1 118 is related to one or more protein related regions 1 120. These include a protein coding region 1206, which must be obtained or calculated based on the protein location data 1104. In addition, regions adjacent to the protein may be calculated and stored: pre protein 1208 is a region upstream to the protein coding region 1206, and post protein 1210 is a region downstream to the protein coding region. The coding direction of the protein is required in order to calculate the protein-adjacent regions. Finally, other regions 1212 may also be defined, such as regions of special interest within a specific protein, e.g. regions within the protein coding region known or suspected, correlating to an amino-sequence which is responsible for specific biological activity in the final protein, to have a certain functional significance.
- regions may be selected, which are not related to a specific protein, in order to analyze genomic patterns within them as well.
- some of the definitions used to create the protein related regions 1 120 may be semi-arbitrary, and therefore may be defined by the user, as part of the user defined criteria 1108. For example, when analyzing the regions upstream of a protein coding region, a user defined criterion 1108 may define whether this region should extend all the way until the next protein upstream, or if it should be considered only a maximal fixed distance from the protein.
- Each of the protein related regions 1120 is related to a plurality of short genomic segments in region 1 122, which are a key element in the present invention.
- Short genomic segments in region 1 122 is a plurality of preferably all, or most, of the short genomic segments of a given length or range of lengths, as determined by the user defined criteria 1 108, which are found in each of the plurality of protein related regions 1 120.
- the number of protein related regions 1120 created and the number of lengths of short genomic segments desired may be billions or tens of billions.
- records may be preferably stored in partitioned tables, under a single view, using a database such as MICROSOFT SQL SERVER 12000®
- Location 1214 is the location, or offset, of the SGSR 1122 relative to the protein related region 1120 in which it is found.
- Uniqueness 1216 stores a link to a reference indicating the degree of uniqueness of this SGSR relative to the protein related region 1120 in which it is found, or relative to multiple protein related regions 1120.
- one SGSR 1122 may be unique in the protein related region 1120 it appears in, i.e. it appears in that region only once.
- Another SGSR 1 122 may appear in a specific protein coding region 1206, and may be unique relative to all protein coding regions 1206.
- Yet another SGSR 1122 may appear only 3 times in the pre protein regions 1208 of all proteins 1118 which have a similar expression 1204, such as proteins expressing in nerve cells - this may still be considered significant by a user 1126 of the system, and may be queried.
- commonality 1218 stores an indication, or a link to an indication, as to the commonality of a SGSR 1122 relative to two or more protein related regions 1 120. For example, it may be of interest to find out and mark all SGSRs 1 122 which appear in a pre protein region of proteins which have a similar function, as a starting point to attempting to assess which short genomic segments may be active as 'triggers' in controlling expression of these proteins.
- Each SGSR may be associated with one or more, possibly many, criteria flags 1220, each of which stores an indication or a link to an indication of any compound condition which this SGSR meets.
- a criteria flag 1220 may be used in order to 'mark' all SGSRs meeting a certain query, so that they can later be retrieved quickly and easily.
- a criteria flag 1220 may be created to indicate any SGSR which appears in the post protein region 1210 of all proteins 1118 having a first function 1202, and does not appear in the post protein region 1210 of any proteins 1118 having a second function 1202. Since each short-genomic-segment record may be associated with multiple criteria flags, preferably each criterion is a record in a criteria table (not shown in Fig. 14), which is linked to multiple criteria-flag records 1220, each of which is linked to one short-genomic-segment record 1122.
- the 'criteria flag' mechanism therefore allows extremely fast retrieval of all short-genomic-segments which comply with any combination of complex queries, which have been applied at the pre-processing phase.
- each short genomic segment in region 1122 may be associated, with one or more SGSR-to-SGSR relationships 1124.
- Each SGSR-to-SGSR relationship 1 124 stores information on the relation between two or more SGSRs.
- SGSR-to-SGSR relationships 1124 may document the proximity 1222 of two SGSRs from each other, i.e. the difference between their respective location 1214 values; or their nucleotide sequence similarity 1224, or any other 1226 parameter by which they may be compared.
- Fig. 15 is a flowchart illustrating operation of the genomic pre-processing unit 11 12 of Fig. 13 in accordance with a preferred embodiment of the present invention.
- the genomic pre-processing unit 1112 of Fig. 13 processes the raw genomic data 1 100, in several steps described below, and populates the genomic pattern analysis database 1 1 14.
- Preprocessing begins by acquiring the genomic data to be processed, including raw genomic data 1100, other genomic data 1106 and user defined criteria 1 108 of Fig. 13.
- Proteins known to be encoded by the raw genomic data 1100 are stored in the genomic pattern analysis database 1110, and are classified according to various attributes which are deemed relevant to the genomic data analysis process, such as the organism 1200 they belong to, their biological or other function 1202, and their organs or tissue specific expression 1204 of Fig. 14.
- the protein coding region 1206 of Fig. 14 is calculated based on the protein location data 1104 of Fig. 13.
- the protein coding regions are also normalized, i.e. if the direction of the protein is known to be right to left, it is reversed, so as to be read from left to right, and inverted, replacing every A with a T and every C with a G.
- some proteins are coded on the positive strand of the DNA, and are therefore 'read' from left to right from the sequenced DNA 1 102, whereas some are encoded on the negative strand, and therefore are 'read' from right to left, and appear 'inverted' in the sequenced DNA 1102, i.e. each 'A' should be replaced with a 'T', each 'C replaced with a 'G', and vice-versa.
- regions upstream and downstream of each protein may also be calculated, normalized and stored, as may also other regions 1212 of interest.
- Protein related regions 1120 are stored in the database, and are each linked to the protein related thereto.
- Each of the protein related regions 1120 is then parsed in order to find preferably all short genomic segments of a given length located in that region.
- the results are stored as short genomic segments in region 1122 of Fig. 14.
- the length/s of the short genomic segments to be analyzed is determined as part of the user defined criteria.
- various queries may be performed, in order to further determine properties of short genomic segments in region 1122, which are deemed material to the desired direction of genomic sequence data analysis.
- SGSRs' answering a queried criteria may be 'flagged' for future use, using the uniqueness 1216, commonality 1218 or other criteria flags 1220 of Fig. 14.
- SGSR-to-SGSR relationships 1124 between two or more SGSRs are determined, such as their proximity 1222, similarity 1224, or any other attribute 1226 by which they may be compared. ; .
- Fig. 16 is a block diagram illustrating operation of the genomic query processing unit 1116 of Fig. 13, constructed and operative in accordance with a preferred embodiment of the present invention.
- the genomic query processing unit 1116 allows a user 1126 of Fig. 13 to perform complex pattern analysis of genomic sequence data, based on the preprocessed data stored in the genomic pattern analysis database 1114.
- qualifying properties to be used in analysis of genomic data are obtained from the user 1126, as indicated by reference numeral 1400. These may include short segment properties 1402, short segment in,region properties 1404, protein properties 1406 and other pre-defined criteria 1408.
- genomic pattern analysis database 1114 is queried according to the qualifying properties obtained by the previous step, as indicated by reference numeral 1410.
- this step of querying the database according to qualifying properties may also comprise a short genomic segment similarity comparison mechanism, as indicated by reference numeral 1412 marked by broken line.
- a short genomic segment similarity comparison mechanism may enable querying the database for short genomic segments which are similar but not identical to the short segments specified by any of the qulifying properties indicated by reference numerals 1402-1408.
- Such a mechanism may be advantegous since, as is well known in the art, it is frequently the case that a genomic motif may appear in slight variations while still maintaining its biological functionality. Accordingly, various algorithms are known in the art for identifying genomic string similarity, and may therefore be used here.
- a 'translation-table' may be created, in which for each short genomic segment, all of the possible variants, e.g. of up to a small number of mistakes such as 2 mistakes, are listed.
- Such a mechanism may be very efficient especially for short segments, such as 3-7 nucleotides long, since the number of possible permutations is not very large.
- results which fit the qualifying properties 1402-1408 are retrieved from the database and are delivered, as indicated by reference numeral 1414. These include qualifying short genomic segments 1416, qualifying regions 1418 and qualifying proteins 1420.
- FIG. 17 is a simplified illustration of an example of genomic pattern analysis performed by a preferred embodiment of the present invention.
- Fig. 17 provides a genomic analysis example which may be conducive for a better understanding the usefulness and operation of the present invention.
- protein A, protein B, protein C, protein D, protein E and protein F are six proteins known to be coded by the raw genomic data 1100 entered into genomic pattern analysis engine 1110 of Fig. 13.
- proteins A and C are known to have a specific biological function, which the user 1126 considers desirable, and protein B is known not to have this specific desired biological function.
- the biological function of proteins D, E and F is unknown.
- the initial goal of the genomic pattern analysis is to find a genomic sequence pattern common to the coding regions of proteins A and C, and which is not found in the coding region of protein B. If such a genomic pattern is found, then the final goal is to find other proteins, the function of which is at present unknown, such as proteins D, E and F in the given example, which display a genomic pattern similar to that found in the initial step.
- the rationale is that the genomic pattern common to proteins known to have a desired function, may serve as a predictor for finding other proteins, the function of which is at present unknown, and which might be expected to perhaps have the desired function.
- the genomic preprocessing unit 1112 preprocesses the raw genomic data 1 100, and using the sequenced DNA 1102 and the protein location 1104 of these six proteins relative to the sequenced DNA 1102, and their coding direction (left-to-right or right-to-left), calculates and preferably normalizes the protein coding region 1206 of Fig. 14.
- Protein A coding region 1500, protein B coding region 1502, protein C coding region 1504, protein D coding region 1506, protein E coding region 1508 and protein F coding region 1510 are illustrated in simplified form in Fig. 17. Coding regions of proteins D, E and F, designated by reference numerals 1506, 1508 and 1510, the biological function of which is unknown, are shown in broken line format. Preferably all protein related regions for preferably all other proteins known to be encoded by raw genomic data 1100 are processed in a similar manner. For clarity of explanation, the given example now focuses on these six proteins alone.
- the genomic pre-processing unit further processes the protein coding regions of proteins A, B, C, D, E and F designated by reference numerals 1500-1510, in order to find and store preferably all short genomic segments, of a given length, e.g. 10 nucleotides long, in each of these protein coding regions.
- the length of SGSR to be used is a matter of user preference, and is determined by user defined criteria 1 108 of Fig. 13.
- Fig. 17 illustrates only six short genomic segments found in the protein coding regions of these proteins: SGSR-I, SGSR-II, SGSR-III, SGSR-IV, SGSR-V and SGSR- VI. It is appreciated that in reality there is a very large number, such as tens of thousands, of short genomic segments found in each protein related genomic region. This number depends upon the size of the region and the number of different SGSR lengths which the user 1126 decides to use.
- SGSR-I, SGSR-II and SGSR-III are found in protein A coding region 1500; SGSR-IV, SGSR-III, SGSR-II and SGSR-V are found in protein B coding region 1502; SGSR- VI, SGSR-I, SGSR-II and SGSR-III are found in protein C coding region 1504; none of the six SGSR are found in protein D coding region 1506; SGSR-III, SGSR-II and SGSR-V are found in protein E coding region 1508; and SGSR-I, SGSR-II and SGSR-III are found in protein F coding region 1510.
- the initial step of the genomic pattern analysis searches for a pattern common to protein A and C, and not to protein B.
- SGSR-I, SGSR-II and SGSR-III form a pattern which appears in protein A coding region 1500 and protein C coding region 1504, but not in protein B coding region 1502
- the commonality property of SGSR may be used to 'flag' all short genomic segments in regions 1 122, which are common to the protein coding regions 1206 of all proteins 1118, sharing the same desired function 1202 - protein A and B in the given example All SGSRs common to coding regions of all proteins which share the lack of that desired function, may be 'flagged accordingly as well, using a different commonality 1218 'flag' In the given example only protein B is shown as a representative of that group of proteins It is then easy to find all SGSRs which are common to A and C but not to B.
- a more elaborate pattern may be sought, for example by analyzing the location of the SGSRs which seem to be potentially significant relative to each other, or to the region in which they are found.
- SGSR-III and SGSR-II do appear as well in protein B coding region 1502, they do not appear in the same pattern, e.g. the location SGSR-III relative to SGSR-II is different in protein B coding region than it is in coding regions of proteins A and C This may easily be queried from the genomic pattern analysis, using the location property of SGSR, designated by reference numeral 1214 of Fig 14, which preferably stores the location, i e offset, of the SGSR relative to the region in which it is found
- the SGSR-to-SGSR relationships 1 124 of Fig 14 may be very useful
- multiple such relationship records may be formed, which document the proximity 1222 of Fig. 14 of each of the pairs of SGSRs which are suspected as being potentially significant: SGSR- I-to-SGSR-II, SGSR-II-to-SGSR-III and SGSR-I-to-SGSR-III. This provides an efficient means of finding all instances in which several SGSRs are not only found in proximity, but form a pattern relative to one another.
- the criteria flags 1220 of Fig. 14 may be used to flag all SGSRs which comply with a very complex query.
- a criteria flag may be formed to 'flag' all SGSRs which are: (a) common to a first group of proteins, (b) do not appear in a second group, (c) have a location property indicating they appear close together, and (d) appear as part of a certain SGSR-to-SGSR relationship with a certain proximity value.
- Protein coding regions of proteins D, E and F are thus examined.
- Protein F coding region 1504 shown marked in bold broken line, is the only one which displays a similar pattern of SGSRs to that observed in the coding regions of proteins A and C.
- Protein D coding region shown none of the three SGSRs suspected as significant, and protein E coding region shows two of the significant three, but not in the same pattern.
- genomic sequence analysis database 1114 may be beneficially utilized in performing complex genomic pattern analysis tasks, and of the usefulness of such analysis. It is further appreciated that genomic pattern analysis is often a highly complex task, often requiring a long, iterative, and somewhat creative process of trial-and-error.
- FIG. 18 is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention.
- Each of a plurality of strings 1800 is compressed by a compression mechanism 1802, collectively yielding a respective plurality of compressed strings 1804.
- the compressed strings 1804 are stored in a plurality of records in a table in a database 1806, and a compressed strings index 1808 is constructed, which indexes the compressed strings 1804.
- a target string 1810, one or more similarity criteria 1811, and a query condition 1812 relating to the target string 1810 are provided by a user of the database 1806, in order to find all strings 1800 in the database 1806 which comply with the provided query condition 1812, as it is applied to strings similar to the target string 1810, to a degree specified by the similarity criteria 1811.
- the target string 1810 is compressed by a compression mechanism 1814, which may be similar to compression mechanism 1802, .into a compressed target string 1816.
- the compressed target string 1816 and the similarity criteria 1811 are passed on as input to a compressed string similarity search module 1818.
- the compressed string similarity search module in conjunction with the compressed string index 1808 is operative to query the database 1806, retrieving a plurality of compressed strings which comply with the query condition 1812, as applied to strings which are similar to the target string 1810, to a degree defined by similarity criteria 1811, in compressed form. These results are designated compressed similar to target query results 1820.
- the compressed string similarity search module 1818 is further described below with reference to Figs. 27, 28, 29A, 29B and 29C-
- Each of the compressed similar to target query results 1820 is decompressed by a decompression mechanism 1822, collectively yielding a respective plurality of strings similar to target 1824.
- the actions of string similarity comparison and retrieval are preferably performed on compressed strings: comparing compressed target string 1816 with compressed strings 1804, using the compressed string index 1808.
- An important aspect of the present invention is that it allows determining the level of similarity between strings, by comparing compressed strings to which these strings correspond.
- Fig. 19 is a simplified block diagram illustrating a mechanism for compression and decompression of data. This mechanism is a preferred implementation of the compression mechanisms 1802 and 1814 and of the decompression mechanism 1822 described hereinabove with reference to Fig. 18.
- An uncompressed string 1900 is given as an example of one of the strings 1800 or of the target string 1810 described hereinabove with reference to Fig. 18.
- Genomic sequence data is typically represented as alphanumeric strings, each character representing one of four nucleotides, A, T, C, and G, and unknown nucleotides represented by N.
- the letter N typically appears in genomic sequence data much less frequently than do the other four letters. Therefore, in the context of this example, 'N' is an example of a 'rare character'.
- the uncompressed string 1900 comprises a plurality of bytes, each storing a character: A, T, C, or G.
- the uncompressed string 1900 shown in Fig. 19 comprises only three bytes: BYTE I, BYTE II and BYTE III, each storing a character: CHR 1, CHR 2 and CHR 3, respectively.
- the uncompressed string 1900 is compressed by a compression engine 1902 into a compressed string 1904. Operation of a preferred embodiment of the compression engine 1902 is further described below with reference to Fig. 22.
- the compression engine 1902 employs a compression table 1906 in compression of uncompressed string 1900 into compressed string 1904.
- the compression table 1906 is preferably a translation table and holds a list of all possible or reasonable 3 -alphanumeric-character combinations and, for each such combination, the byte or bytes into which it may be compressed.
- a preferred embodiment of the compression table 1906 is further described below with reference to Fig. 24A.
- the compressed string 1904 preferably comprises one or more compressed bytes 1908.
- the compressed string 1904 shown in Fig. 19 comprises only one compressed byte 1908, which represents in compressed form all three characters which are represented in the uncompressed string 1900, CHR 1, CHR 2 and CHR 3, which characters require three bytes of storage, BYTE I, BYTE II and BYTE III, in their original uncompressed form.
- compressed string 1904 which comprises a plurality of compressed bytes 1908, may alternatively be stored as one or more integers.
- an Integer is typically defined as a data-type comprising 4 bytes.
- an uncompressed string 1900 comprising 12 characters into 4 compressed bytes 1908, and then to store all 4 resulting compressed bytes 1908 as one Integer.
- Longer strings may be compressed into longer Integers, such as Biglnt data type in MS SQL-2000® which comprised of 8 bytes, or as a plurality of Integers or Biglnts.
- a string comprising 48 characters may be compressed into 2 Biglnts.
- Tinylnt data type which comprises one byte of memory, may be used to store compressed byte 1908.
- strings which are longer than 3 or 4 characters, i.e. are compressed into more than one Tinylnt
- Tinylnt type fields each representing in compressed form 3 or 4 uncompressed characters. It is then possible to create an indexed View which indexes all these fields together, as one.
- the compressed byte 1908 is further described below with reference to Figs. 20, 21A, 21B, and 21 C.
- the compressed string 1904 may be decompressed by a decompression engine 1910 back to uncompressed string 1900, preferably by reversal of the methodology of the compression engine 1902. Operation of the decompression engine 1910 is further described below with reference to Fig. 25.
- Decompression engine 1910 employs a decompression table 1912 in decompressing compressed string 1904 into uncompressed string 1900.
- the decompression table 1912 is a translation table and holds a list of the bit-values for each possible or reasonable compressed byte 1908 and for each such compressed byte, the alphanumeric string, containing up to three characters, into which it may be decompressed.
- the decompression table 1912 is further described below with reference to Fig. 24B.
- Fig. 20A is a simplified illustration of a preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19.
- the compressed byte 1908 of Fig. 19 preferably comprises 8 bits: BIT I, BIT II, BIT III, BIT IV, BIT V, BIT VI, BIT VII & BIT VIII.
- these bits are divided into four groups, each containing 2 bits.
- a HEADER typically contains BIT I and BIT II
- a BIT-PAIR I typically contains BIT III and BIT IV
- a BIT-PAIR II typically contains BIT V and BIT VI
- a BIT-PAIR III typically contains BIT VII and BIT VIII.
- the HEADER preferably stores information about what the other bits in the compressed byte, BIT III - BIT VIII, represent.
- the compression of data is such that the compressed byte may either store up to three characters, or may store up to three rare characters, e.g. 'N's in the genomic example, but preferably not a combination of characters and N's.
- the HEADER stores information which indicates how many characters the compressed byte 1908 represents, one, two or three, or alternatively if the entire compressed-byte represents one or more 'N's.
- the values assignable to bits of the HEADER are further described below with reference to Fig. 21A.
- BIT-PAIR I, BIT-PAIR II, and BIT-PAIR in each contain 2 bits which are capable, when taken together, of representing one of four possible characters, A, T, C, and G.
- the values assignable to the bits of each of the characters - BIT-PAIR I, BIT- PAIR II, and BIT-PAIR III - are further described below with reference to Fig. 21B.
- values assigned to the two bits of BIT-PAIR I determine whether the compressed byte 1908 represents one, two or three N's. Values assignable to bits of BIT-PAIR I, determining the number of N's that the compressed byte 1908 represents, are further described below with reference to Fig. 21C.
- Fig. 20B is a simplified illustration of an alternative preferred implementation of a byte bitmap preferably employed in generating the compressed byte 1908 of Fig. 19.
- Alternative compressed byte 2000 is an alternative byte bitmap which may be used for compression of data, instead of the byte bitmap of compressed byte 1908 depicted in Fig. 20 A.
- Alternative compressed byte 2000 comprises of four bit-pairs, BIT-PAIR I, BIT-PAIR II, BIT-PAIR III and BIT-PAIR IV, rather than only three bit-pairs in compressed-byte 1908 of Fig. 20 A.
- alternative compressed byte 2000 does not comprise of any such header. All 8 bits of alternative compressed byte 2000 function as one of four bit-pairs, each of said bit-pairs representing a character.
- Alternative compressed byte 2000 is therefore capable of representing 4 characters in compressed form, as opposed to compressed byte 1908 of Fig. 20A, which is capable of representing 3 characters.
- alternative compressed byte 2000 may be useful when compressing strings which do not include rare characters, and are of a fixed length. If the length of the uncompressed string 1900 is known, then it is possible to ignore the possible tailing zeros at the right end of the alternative compressed byte 2000, which do not represent a character, but rather represent a blank.
- an uncompressed string 1900 which is known to be 7 characters long, may be compressed into 2 alternative compressed byte 2000: the first containing in compressed form 4 characters, and the second containing 3 characters.
- this second alternative compressed byte 2000 BIT VII and BITVIII of BIT-PAIR IV contain zeros which are ignored because the uncompressed string is known to be 7 characters long, despite the absence of a 'header' which would explicitly instruct to ignore these bits.
- Fig. 21 A is a table illustrating preferred values assignable to BIT I and BIT II, both belonging to the HEADER of compressed byte 1908 shown in Fig. 20 A.
- Assigning the value '00' to the bits of the HEADER i.e. assigning '0' to BIT I and assigning '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the entire compressed byte 1908 represents only one or more rare characters, i.e. 'N ⁇ and does not represent any known characters. In the non genomic embodiment of the present invention, it is possible to specify a plurality of different rare characters, such as up to 64, which would be represented by an entire byte, when the value of the header is '00'. Assigning the value '01 ' to the bits of the HEADER, i.e.
- Assigning the value ' 10' to the bits of the HEADER i.e. assigning '1 ' to BIT I and '0' to BIT II of compressed byte 1908 shown in Fig. 20A, signifies that the compressed byte 1908 represents two characters; the first character being represented by the values in BIT III & BIT IV, both belonging to BIT-PAIR I, and the second character being represented by values in BIT V & BIT VI, both belonging to BIT-PAIR II of compressed byte 1908.
- the remaining two bits of the compressed byte 1908, BIT VII & BIT VIII are to be ignored and do not represent any additional character.
- Fig. 21B is a table illustrating the preferred values assignable to the character-representing bits: BIT III, BIT IN BIT V, BIT VI, BIT VII, & BIT VIII of compressed byte 1908 shown in Fig. 20A.
- each of BIT-PAIR I, BIT-PAIR II and BIT-PAIR III in compressed byte 1908 comprises a pair of bits: BIT III & BIT IV, BIT V & BIT VI, and BIT VII & BIT VIII respectively.
- the values presented in Fig. 21B are values which may be assigned to each of the above mentioned pairs of bits so as to allow each of these pairs of bits to represent one of the four possible genomic characters: A, T, C, or G.
- Assigning the value '00' to any of the three bit-pairs representing one of the three characters i.e. assigning '0' to BIT III and '0' to BIT IV, or assigning '0' to BIT V and '0' to BIT VI, or assigning '0' to BIT VII and '0' to BIT VIH, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, of the compressed byte 1908 of Fig. 20A, represents the character 'A'.
- Assigning the value of ' 10' to any of the three bit-pairs representing one of the three characters i.e. assigning ' 1 ' to BIT IH & '0' to BIT IV, or assigning ' 1' to BIT V & '0' to BIT VI, or assigning ' 1 ' to BIT VII & '0' to BIT VIII, signifies that that bit-pair, i.e. BIT-PAIR I, BIT-PAIR II or BIT-PAIR III, represents the character 'G'.
- Fig. 21C is a table illustrating preferred values assignable to BIT III and BIT IV of Fig. 20A respectively, when encoding one or more rare characters.
- a value '00' assigned to the bits of the HEADER of Fig. 20 A signifies that the compressed byte 1908 represents one or more rare characters and does not represent any known characters.
- BIT III and BIT IV may be used to signify if the entire byte represents one, two of three 'N's. Assigning the value '01 ' to BIT III & BIT IN i.e. assigning '0' to BIT III and ' 1 ' to BIT IV, signifies that the entire compressed byte 1908 represents one rare character.
- each compressed byte 1908 to encode more or less than three ' ⁇ 's
- BIT III through BIT VIII of Fig. 20A to signify up to 64 ⁇ 's represented by a compressed byte 1908.
- Fig. 22 is a simplified flowchart illustrating operation of the compression engine 1902 of Fig. 19, constructed and operative in accordance with a preferred embodiment of the present invention.
- compression table 1906 and decompression table 1912 are initially generated.
- the compression table 1906 stores for each possible combination of up to three characters, e.g. 'ATG', 'GGC, 'AT', bit-values which represents this combination in compressed form in one compressed byte 1908, preferably according to the values suggested in Figs 21 A, 21B, and 21C.
- Figs. 24A and 24B present examples of preferred compression and decompression tables 1906 and 1912 respectively.
- an iterative process of compression of multiple strings takes place.
- An uncompressed string such as uncompressed string 1900 of Fig. 19, is received.
- the uncompressed string 1900 is a string l epi esented by a string 'ATGAT'
- This example is followed through the following steps of Fig 22
- the uncompressed string 1900 preferably is parsed into sub-strings, each having up to three characters (3 -character-substrings) by parsing the uncompressed string 1900 from left to right It should be noted that one or more of the characters in a 3-character-subst ⁇ ng may in fact be unknown, i e an 'N' In the given example the stung 'ATGAT' is parsed from left to right, into 3 -character-substrings, yielding 'ATG' and 'AT'
- Fig 23 is a simplified flowchart illustrating a pi eferred functionality for generating translation tables, including compression table 1906 and decompression table 1912 of Fig 19
- N's are encoded in compressed bytes 1908 which do not also represent known characters Values '00' are assigned to the headers of such compressed bytes and all known characters in the 3 -character-substring are encoded in other bytes
- the HEADER is assigned the value ' 10' and the two characters are represented by BITS III - VI of the compressed byte 1908, with the first character represented by BIT III & BIT IV and the second character represented by BIT V & BIT VI.
- the HEADER is assigned the value ' 1 1 ' and the three characters are represented by BITS III - VIII of the compressed byte 1908, with the first character represented by BIT III & BIT IV, the second character represented by BIT V & BIT VI and the third character represented by BIT VII & BIT VIII.
- Each 3 -character-substring and its corresponding one or more compressed bytes 1908 are stored in translation tables including compression table 1906 and decompression table 1912.
- rare characters are typically very rare in typical strings, and that furthermore when 'N's appear in a string they tend to appear contiguously, signifying a 'gap' in the sequenced genome. Instances of isolated single or double N's are typically less frequent than instances of contiguous 'N's.
- the present invention utilizes this fact to achieve an optimized compression suited specifically for genomic sequence data: three-character combinations which contain only characters and no N's, as well as those containing only N's and no characters, are both compressed into a single byte. Most of the rare cases of 3-character mixtures of characters and N's are compressed into two bytes. Only a minority of extremely rare combinations of characters and N's require three bytes and therefore are in fact not compressed.
- FIG. 24A is a simplified illustration of a preferred implementation of compression table 1906 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
- compression table 1906 is to provide a translation-table, also referred to as a 'lookup table', which provides the bit- values of the one or more compressed bytes 1908 required to represent in compressed form every possible 1- character 2-character and 3-character sub-string of uncompressed string 1900.
- the compression table 1906 is described here logically, as a database table comprising fields into each of which multiple values are stored in respective multiple records. It is appreciated by those skilled in the art that the description of the compression table 1906 in terms of a table comprising fields is meant for clarity and not meant to be limiting and that the compression table 1906 may equally be implemented as a 'CASE', or 'IF-THEN' programming code in a any suitable computer language, as is well known in the art. For example, computer code can be written, which comprises a plurality of 'IF-THEN' or 'CASE' arguments, each one of the arguments providing bit-values of the one or more compressed bytes 1908 representing in compressed form one 3 -character-substring of uncompressed string 1900.
- Compression table 1906 preferably comprises multiple records each containing 4 fields: uncompressed character-combination 2400, compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406. For clarity, an example is given or the content which may be stored in each of these fields.
- the uncompressed character combination 2400 is a field which stores all possible 3-character substrings, i.e. 1-character, 2-character and 3-character combinations, including combinations of characters only and combinations which include N's.
- uncompressed character combination 2400 stores a 3-character combination 'ATN'.
- Compressed byte I 2402, compressed byte II 2404 and compressed byte III 2406, respectively,- are fields which store for each uncompressed character- combination 2400 the bit-values for each of the one or more compressed bytes 1908 required for encoding it.
- compressed byte I 2402 stores ' 10001 100', which represents the character-combination 'AT'
- compressed byte II 2404 stores '00010000', which represents 'N ⁇ Compressed byte III 2406, in the given example, stores null, since only two compressed bytes are required to represent the character combination 'ATN'.
- most 3-character substrings may be compressed into one compressed byte 1908, some rare combinations may be compressed into two compressed bytes, and some 20 very rare combinations may require 3 compressed bytes, and therefore may not be compressed. Therefore, notwithstanding that compression table 1906 comprises three compressed bytes fields 2402-2406, one compressed byte field, such as compressed byte-1 2402, is sufficient to translate a vast majority of 3-character combinations to be typically found in a string.
- Fig. 24B is a simplified illustration of a preferred implementation of decompression table 1912 of Fig. 19, employed in accordance with a preferred embodiment of the present invention.
- decompression table 1912 is to provide a translation-table, also referred to as a 'lookup table', which provides the 1-character, 2-character or 3- character uncompressed string 1900 preferably corresponding to every possible compressed byte 1908.
- decompression table 1912 in terms of table comprising fields is meant for clarity and not meant to be limiting, and that the decompression table 1912 may equally be implemented as a 'CASE' code in any computer language, as is well known in the art.
- the decompression table 1912 preferably comprises multiple records each containing 2 fields: compressed byte 2408 and decompressed character- combination 2410.
- Compressed byte 2408 is a field which preferably stores bit-values of every possible compressed byte 1908.
- Decompressed character combination 2410 is a field which stores for each compressed byte 2408 the 1-character, 2-character or 3-character uncompressed string 1900 which it encodes.
- the field compressed byte 2408 may contain the compressed byte 1908 bit-value ' 10001100' and the respective field decompressed character combination 2410 may contain the 2-character combination 'AT' which this bit value represents in compressed form.
- Fig. 25 is a simplified flowchart illustrating operation of decompression engine 1910 of Fig. 19 constructed and operative in accordance with a preferred embodiment of the present invention.
- decompression engine 1910 of Fig. 19 performs a reverse action of that of compression engine 1902 of Fig. 19, which was further described hereinabove with reference to Fig. 22.
- a compressed string 1904 is received in order to be decompressed.
- Decompression engine 1910 of Fig. 19 gets the compressed string 1904 of Fig. 19 to be decompressed.
- the process shown in Fig. 25 is explained with reference to an example wherein the compressed string 1904 comprises two compressed bytes 1908 the bit-value of which are: ' 11001 1 10' & ' 10001100' respectively. This example is followed through the following steps of Fig. 25.
- a recursive operation is initiated, which parses the received compressed string 1904 into the compressed byte 1908 included in this compressed string.
- Each compressed byte 1908 is looked up in the decompression table 1912, and based on the contents of the compression table, finds out the 3-character substring which this compressed byte represents.
- the 3-character substrings are combined to yield an uncompressed string 1900.
- the second compressed byte in the compressed string has bit-values of ' 10001100', which when looked-up in the decompression table is found out to represent the character combination 'AT'.
- 'ATG' represented by compressed byte ' 11001110'
- 'AT' represented by compressed byte ' 10001 100'
- FIG. 26A is a simplified illustration of an example of compressing an uncompressed string 1900 into a compressed string 1904, both shown in Fig. 19.
- Uncompressed string 'ATGAT' 2600 is an uncompressed string 1900 comprising the characters: 'ATGAT'.
- the results is two 'character-triplets': character-triplet- 1 'ATG' 2602 and character-triplet-2 'AT_' 2604. It is appreciated that the character- triplet-2 'AT_' 2604 actually contains only two characters: A and T.
- Compressed byte-1 2606 encodes three characters: 'ATG'. Therefore a value of ' 1 r is assigned to the two bits of the HEADER of compressed byte-1 2606, as indicated by reference numeral 2608, signifying that this compressed byte 1908 encodes 3 characters.
- Value '00' is set to the two bits of BIT-PATR I, of compressed byte-1 2606, as indicated by reference numeral 2610, signifying that the first character represented by this byte is 'A'.
- Value ' 1 1 ' is assigned to the two bits of BIT-PAIR II of compressed byte- 1 2606, as indicated by reference numeral 2612, signifying that the second character represented by this byte is 'T'.
- Value ' 10' is assigned to the two bits of BIT-PAIR III of compressed byte- 1 2606, as indicated by reference numeral 2614, signifying that the third character represented by this byte is 'G'.
- Compressed byte-2 2607 encodes two characters: 'AT'. Therefore a value of ' 10' is assigned to the two bits of the HEADER of compressed byte-2 2607, as indicated by reference numeral 2616, signifying that this compressed byte 1908 encodes 2 characters, and that therefore the two bits of BIT-PATR III are to be ignored.
- Value '00' is assigned to the two bits of BIT-PAIR I of compressed byte- 2 2607, as indicated by reference numeral 2618, signifying that the first character represented by this byte is 'A'.
- Value ' 1 1 ' is assigned to the two bits of BIT-PAIR II of compressed byte-2 2607, as indicated by reference numeral 2620, signifying that the second character represented by this byte is 'T'.
- Value '00' stored in the bits of BIT-PAIR III of compressed byte-2 2607, as indicated by reference numeral 2622 are ignored and do not represent an 'A', since the HEADER specified that this byte encodes only 2 characters.
- FIG. 26B is a simplified illustration of another example of compression of an uncompressed string 1900 of Fig. 19, containing a rare character, into a compressed string 1904 of Fig. 19.
- Uncompressed string 'ATNCG' 2650 is an uncompressed string 1900 comprising the characters: 'ATNCG'.
- character-triplet- 1 'ATN' 2652 contains an 'N' it is preferably represented by two compressed bytes 1908 rather than one: the first, compressed byte-1 2656, encodes 'AT', and the second, compressed byte-2 2658, encodes 'N'.
- Compressed byte-1 2656 encodes two characters, 'AT', therefore ' 10' is assigned to the two bits of the HEADER of compressed byte-1 2656, as indicated by reference numeral 2660, signifying that this compressed byte 1908 encodes 2 characters.
- Value '00' is assigned to the two bits of BIT -PAIR I of compressed byte- 1 2656, as indicated by reference numeral 2662, signifying that the first character represented by this byte is 'A'.
- Value ' 11 ' is assigned to the two bits of BIT-PAIR II of compressed byte- 1 2656, as indicated by reference numeral 2664, signifying that the second character represented by this byte is 'T'.
- Compressed byte-2 2658 is dedicated to encoding 'N's, in this case only one 'N ⁇ which is derived from the character-triplet- 1 'ATN' 2652. Therefore '00' is assigned to the two bits of the HEADER of compressed byte-2 2658, as indicated by reference numeral 2668, signifying that this compressed byte 1908 is dedicated to encoding one or more N's.
- the value '01 ' is assigned to the two bits of BIT-PAIR I of compressed byte-2 2658, as indicated by reference numeral 2670, signifying that this byte, which is dedicated to encoding N's, encodes only one N. Accordingly, the zeros in the two bits of BIT-PAIR II and the two bits of BIT-PAIR III, indicated by reference numerals 2672 & 2674 are ignored.
- Compressed byte-3 2690 encodes two characters: 'CG'. Therefore a value of ' 10' is assigned to the two bits of the HEADER of compressed byte-3 2690, as indicated by reference numeral 2676, signifying that this compressed byte 1908 encodes only 2 characters.
- Value '01 ' is assigned to the two bits of BIT-PAIR I of compressed byte- 3 2690, as indicated by reference numeral 2678, signifying that the first character represented by this byte is 'C.
- Value ' 10' is assigned to the two bits of BIT-PAIR II of compressed byte-3 2690, as indicated by reference numeral 2680, signifying that the second character represented by this byte is 'G'.
- Fig. 27 is a simplified block diagram illustrating shifted compressed strings utilized by the compressed string similarity search module 1818 of Fig. 18 constructed and operative in accordance with a preferred embodiment of the present invention.
- each compressed byte represents in compressed form 3 or 4 characters. It is therefore very easy to compare entire 'triplets' of characters, but if an addition or deletion of a single character occurs, then all triplets 'downstream' to the one modified will have seemed to have changed completely, whereas in fact, they have only been 'shifted' to the right or to the left by one location.
- the basic concept of the compressed character string similarity search module 1818 of Fig. 18 is therefore to calculate all possible 'shifted' variations of the compressed character string 1810 of Fig. 18, and to use them to search for compressed character strings similar to target 1820 of Fig. 18.
- target string 2700 which is a compressed character string comprising 12 characters, NI through N12, which are represented in compressed form by 4 compressed bytes 1908, BYTE 1, BYTE 2, BYTE 3 and BYTE 4.
- shifted strings 2702 are generated: 'minus one' shifted string 2704, 'minus two' shifted string 2706, 'plus one' shifted string 2708 and 'plus two' shifted string 2710.
- the first character in target string 2700, NI has been removed in the 'minus one' shifted string 2704, therefore 'minus one' shifted string 2704 begins with N2.
- the characters compressed into each of the four compressed bytes 1908 of 'minus one' shifted string 2704 are therefore 'shifted to the left' by one location.
- sequence of characters compressed into each of the four compressed bytes 1908 of 'minus two' shifted string 2706 is shifted to the left by two locations; that of 'plus one' shifted string 2708 is shifted to the right by one location; and that of 'plus two' shifted string 2710 is shifted to the right by two locations.
- Fig. 28 is a simplified flowchart illustrating operation of the compressed string similarity search module 1818 of Fig. 18 constaicted and operative in accordance with a preferred embodiment of the present invention.
- Operation of the compressed string similarity search module 1818 begins by getting a target compressed string 1810 of Fig. 18.
- all compressed strings 1804 having at least one compressed byte which matches that of the compressed target string 2700 or of one of the four shifted strings 2704-2710, are retrieved. It is important to note that a match is looked for only between bytes occupying the same location in the compressed genomic string: the first compressed byte in a compressed string 1804 is compared to the first compressed byte of the compressed target string 1816 and to the first compressed byte of each of the four compressed shifted strings. It is not compared to any other, e.g. second, third or fourth, compressed bytes within these strings. All compressed strings having at least one match, are considered potentially similar, and are passed on the next step.
- Compressed strings having less mismatching compressed bytes with the target or one of the shifted strings than a certain user defined 'threshold', are considered potentially very similar, and are passed on to the next step.
- mismatching compressed byte/s are further analyzed to determine the exact nature of the mistake, in order to further fine-tune the similarity comparison
- the resulting compressed strings similar to target 1816 of Fig. 18 are considered to represent in compressed form genomic sequences which are similar to the target genomic sequence represented in compressed form by the compressed target string 1816, and are delivered.
- Fig. 29A is a simplified illustration of an example of identifying a string having one character replacement relative to a target string.
- Fig. 29A shows a string designated SIMILAR TO TARGET STRING (1 REPLACEMENT), in which a character, designated N13 shown in broken line format, in the compressed byte designated 1R-BYTEII, has replaced character N5 in that same spot in the original TARGET STRING.
- FIG. 29A 3 of the 4 compressed bytes of SIMILAR TO TARGET GENOMIC SEQUENCE (1 REPLACEMENT), shown in bold line format, still match those in TARGET GENOMIC SEQUENCE, and so the two genomic sequences can be deduced as being similar, by comparison of their compressed format, without decompressing them.
- Fig. 29B is a simplified illustration of an example of identifying a string having two character additions relative to a target string.
- Fig. 29B shows a string designated SIMILAR TO TARGET STRING (2 ADDITIONS), in which two characters, designated N13 and N14, shown in broken line format, in compressed byte designated 2A-BYTEII, have been added to the string relative to the original TARGET STRING, 'pushing' characters N5 and N6 to the next compressed byte designated 2A-BYTE III, and shifting all the following characters by two positions.
- the two strings may therefore be deduced as being similar, differing by a two addition mistake in the mismatched compressed byte 2A-BYTE II.
- Fig. 29C is a simplified illustration of an example of identifying a character string having one character deletion relative to a target string.
- Fig. 29C shows a string designated SIMILAR TO TARGET STRING (1 DELETION), in which one character, designated N5, of TARGET STRING has been deleted, shifting all characters from N6 onwards one position to the left.
- N5 in byte ID-BYTE II is represented by a small blank broken line box between N4 and N6.
- the two strings may therefore be deduced as being similar, differing by a one deletion mistake in the mismatched compressed byte ID-BYTE II.
- Figs. 27 and 29A, 29B and 29C demonstrate detection of up to 2 addition or deletion mistakes, the same concept may be utilized to detect a wider spectrum of mistakes, by generating more 'shifted sequences' accordingly, e.g. 'plus three' shifted sequence etc.
- Fig. 30 is a simplified illustration of a triphase-bit compressed character used in accordance with a preferred embodiment of the present invention.
- Fig. 30 is a preferred implementation of the present invention in a tri-phase bit environment.
- a compressed triphase-bit compressed byte 3000 is a preferred embodiment of a compressed string 1804 of Fig. 18, when implemented in a triphase-bit computer environment.
- Triphase-bit compressed byte 3000 preferably comprises the following three elements: a HEADER comprising two bits: BIT I and BIT ⁇ , CHAR I comprising three bits BIT III, BIT IV & BIT V, and CHAR II comprising three bits BIT VI, BIT VII & BIT VIII.
- each of the two alphanumeric characters which may be represented by the triphase compressed character 3000 may be one of the following three common-character-sets: uppercase English letters, lowercase English letters, or numerals and seventeen other common symbols.
- the HEADER may indicate that the entire byte represents one rare character, one of seven hundred twenty nine options, as is explained below.
- the following values may be utilized by the two bits of the HEADER:
- the values '01 ' in the two bits of the HEADER signify that the entire byte represents only one character, which is a lowercase letter.
- the values '02' in the two bits of the HEADER signify that the entire byte represents only one character, which is an uppercase letter.
- the values ' 10' in the two bits of the HEADER - signify that the entire byte represents only one character, which is a numeral or other common symbols. Up to 17 common symbols may be represented in this way, in addition to the ten numerals, since CHAR I and CHAR II each contain 3 bits and may therefore represent 27 options.
- the values ' 11 ' in the two bits of the HEADER signify that the byte represents two characters, which are both lowercase letters.
- the values '12' in the two bits ; of the HEADER signify that the byte represents two characters, which are both uppercase letters.
- the values '20' in the two bits of the HEADER signify that the byte represents two characters, which are both numerals or common symbols.
- the values '21' in the two bits of the HEADER signify that the byte represents two characters, the first of which is a lowercase letter and the second of which is an uppercase letter
- the values '22' in the two bits of the HEADER signify that the byte represents two characters, the first of which is an uppercase letter and the second of which is a lowercase letter
- Fig. 30 depicts a configuration where 2 bits are utilized as a HEADER of the triphase-bit compressed character 3000, this is not necessary It is possible, for example, to encode two characters in the triphase-bit compressed character 3000, each comprising four, rather than three, bits, and therefore supporting 81 options, one of these options being a null.
- Fig 31 is a simplified block diagram illustrating a computer application constructed and operative in accordance with another preferred embodiment of the present invention. It is appreciated that the computer application may be implemented in any appropriately programmed computer system such as, for example, a suitable personal computer including an operating system having a suitable graphical user interface.
- Fig. 31 comprises a mechanism for conversion of a standard alphanumeric representation of genomic sequence information into a more informative and intuitive graphical representation, conducive to visual pattern analysis of genomic sequence information
- Sequenced DNA 3100 is biological information relating to a sequence of nucleotides - adenine, thymine, guanine, and cytosine - of a given DNA molecule, or genome. Determining the sequence of nucleotides of a genome is achieved by various 'wet-lab' sequencing methodologies and techniques, as is well known in the art.
- genomic alphanumeric representation 3110 is an alphanumeric string used both for computer storage of sequenced DNA data and for its presentation.
- the genomic alphanumeric representation 3110 comprises primarily four letters 'A' representing adenine, 'T' representing thymine, 'G' representing guanine and 'C representing cytosine.
- nucleotide adenine represented by 'A'
- nucleotide thymine represented by T: wherever the DNA sequence on one DNA strand contains adenine, the opposite strand at that exact location contains thymine, and vice-versa.
- nucleotide guanine represented by 'G' is a 'counterpart' of nucleotide cytosine represented by C. Wherever the DNA sequence on one DNA strand contains guanine, the opposite strand at that exact location contains cytosine, and vice-versa.
- adenine represented by 'A' and guanine represented by 'G 1 are purine nucleotides, whereas thymine represented by 'T' and cytosine represented by 'C are pyrimidine nucleotides.
- a genomic graphic representation engine 3120 receives the genomic alphanumeric representation 31 10 and converts it into a genomic graphical representation 3130.
- a preferred embodiment of a genomic graphic representation engine 3120 is further described below with reference to Fig. 32.
- the genomic graphic representation 3130 produced by the genomic graphic representation engine 3120, preferably represents each of the four letters - 'A', 'T ⁇ 'G' and 'C - in the original genomic alphanumeric representation 3110, using a plurality of graphic parameters.
- the alphanumeric representation 3110 is represented by a combination of a graphic shape, a vertical orientation, and a color specific to that letter, as is further described below.
- the genomic sequence data may also include unknown nucleotides, i.e. nucleotides in the genomic sequence which the sequencing process was unable to identify. Unknown nucleotides are typically represented by 'N' or '-'. These may be also be represented by the graphic representation 3130 by a designated shape, color, and letter, as per user preference.
- a preferred embodiment of the genomic graphic representation 3130 may include a genomic font with embedded letters 3140, in which a letter representing each nucleotide is embedded on a shape that represents it graphically.
- the letter 'A' is represented by an upward oriented half-oval with an embedded letter 'A', as illustrated by reference numeral 3141.
- the letter 'T' is represented by a downward oriented half- oval with an embedded letter 'T', as illustrated by reference numeral 3142.
- the letter 'G' is represented by an upward oriented half-square with an embedded letter 'G', as illustrated by reference numeral 3143.
- the letter 'C is represented by a downward , oriented half-square with an embedded letter 'C, as illustrated by reference numeral 3 144.
- the genomic graphic representation 3130 may also include a genomic font without letters 3150, in which only a shape is used to graphically represent each letter, without any letter embedded on the shape.
- a genomic font without letters 3150 in which only a shape is used to graphically represent each letter, without any letter embedded on the shape.
- an upward oriented half-oval 3151 represents 'A'
- a downward oriented half-oval 3152 represents 'T'
- an upward oriented half-square 3 153 represents 'G'
- a downward oriented half-square 3154 represents 'C.
- each of the four shapes without letters 3151, 3152, 3153 & 3 154, or alternatively each of the four shapes with embedded letters 3141, 3142, 3143 & 3144, representing the four letters 'A', 'T', 'G' and 'C, respectively, may be displayed in a different color, according to the user's preference.
- a preferred embodiment of the current invention displays the above mentioned shapes in red, blue, brown, and green, respectively.
- genomic graphic representation 3130 provides enhanced visual discrimination between adenine-thymine counterparts, and guanine-cytosine counterparts.
- 'A' and 'T' are represented by two vertical 'complementary' halves of one shape.
- 'A' and 'T' are represented by two halves of an oval 3151 and 3 152 respectively.
- 'G' and 'C are also represented by two vertical complementary halves of a different shape.
- 'G' and 'C are represented by two halves of a square 3153 and 3 154 respectively.
- 'AT-rich' DNA segments i.e. segments in which there is a higher incidence of adenine and thymine nucleotides
- 'CG-rich' DNA segments i.e. segments in which there is a higher incidence of cytosine and guanine nucleotides.
- AT-rich DNA segments in which the oval shapes are more dominant, can be discerned visually with enhanced ease from 'CG-rich' segments, in which square shapes are more dominant. Different shapes may be utilized other than the ones described here, e.g. a triangle may be used instead of an half-oval.
- Enhanced visual discernment of AT-rich from CG-rich DNA sequences is further described below with reference to Fig. 35.
- both purine nucleotides adenine ('A') and guanine ('G') are graphically represented by shapes that have an upward orientation: an upward oriented half-oval 3 15 1 and an upward oriented half-square 3153, respectively.
- both pyrimidine nucleotides thymine ('T') and cytosine ('C') are graphically represented by shapes that have a downward orientation: a downward oriented half-oval 3152 and a downward oriented half-square 3154, respectively.
- An example for the usefulness of the enhanced ease of visually distinguishing purine nucleotides from pyrimidine nucleotides is the enhanced ease of visually discerning the similarity between two genomic motifs: When comparing two genomic motifs, one ending with adenine ('A') while the other ends with guanine ('G'), both adenine and guanine being purine nucleotides, since both adenine and guanine are graphically represented by upward oriented shapes, the similarity between these two genomics motifs, is made more visually apparent. Visually distinguishing of purine nucleotides from pyrimidine nucleotides is further described below with reference to Fig. 35. It is yet further appreciated that the genomic graphic representation 3130 described above may also provide enhanced visual discrimination between the four different nucleotides, based on their different colors.
- Fig. 31 illustrates a human sensible, graphical representation of genomic sequence data in order to represent one or more genomic attributes of each of the four nucleotides
- another implementation of the present invention may use machine sensible representation in order to represent these attributes.
- Fig. 32 is a simplified flowchart illustrating preferred operation of the genomic graphic representation engine 3120 of Fig. 3 1 , constructed and operative in accordance with a preferred embodiment of the present invention.
- a genomic font is produced, preferably using conventional font- creation software, such as 'FONT CREATOR PROGRAM'.
- font- creation software such as 'FONT CREATOR PROGRAM'.
- preferred shapes are assigned to each of the four letters 'A', 'T', 'G' and 'C, such as the shapes indicated by reference numerals 3151-3154 of Fig. 31, respectively.
- the genomic font is employed: the first comprising shapes with embedded letters in shapes, as designated by reference, numeral 3140, and the other comprising shapes without embedded letters, as designated by reference numeral 3150, both of Fig. 31.
- each of the four letters, 'A', 'T', 'G' & 'C are preferably those designated by reference numerals 3141, 3142, 3143 & 3144 and by reference numerals 3151 , 3152, 3153 & 3154 respectively in Fig. 31.
- the process of generating a genomic font preferably is a one-time process, and hence is connected to the next step by a broken line. It is typically carried out once, before an iterative process of converting the representation of multiple genomic sequences, from a genomic alphanumeric representation 3110 into a genomic graphic representation 3130.
- genomic font Once a genomic font has been created, the process of graphically representing genomic sequence data may be very simple: an alphanumeric string representing genomic sequence data is received and a genomic font, generated by the previous step, is applied to this alphanumeric string.
- Different colors may be applied to different letters, typically by using standard 'search-and-replace' commands, as is known in the art.
- the colors applied to the letters 'A', 'T', 'G' & 'C are red, blue, brown & green respectively.
- other colors may be used, according to user preferences. It should be noted, that applying different colors to different letters is an optional step: the user may or may not want to view the different letters in different colors, or may want to view a group of letters, for example purine nucleotides or pyrimidine nucleotides, or A-T or C-G, or some other grouping in a certain color.
- Fig. 33 is a simplified illustration depicting an example of conversion of a typical alphanumeric genomic representation, of the type indicated by reference numeral 3110 of Fig. 31, into a typical graphic genomic representation, of the type indicated by reference numeral 3130 of Fig. 31.
- Reference numeral 3300 designates an example of a short genomic sequence, conventionally represented by an alphanumeric string ' ACTTTTGATAATTATTGTAACTGTAAAAGAT' .
- the short genomic sequence 3300 may be displayed using a genomic graphic representation as designated by reference numeral 3310, either employing genomic font with embedded letters 3140 of Fig. 31, as designated by reference numeral 3320 or employing genomic font without embedded letters 3150 of Fig. 31, as designated by reference numeral 3330.
- genomic alphanumeric representation 3300 it is easier to visually discern patterns in the genomic sequence when it is displayed as a genomic graphic representation 3310, than when displayed as genomic alphanumeric representation 3300.
- the segment 'ATAATTAT', 8 l l -15 th characters in the string from its left end, surrounded by a broken-line border may not immediately stand out as having any special significance.
- a visual pattern is apparent: the first four characters in this segment, 'ATAA', are a vertical and horizontal mirror image of the last four characters of the segment, 'TTAT'.
- a genomic sequence in which the second half of the sequence is a reversed-inversed sequence relative to the first half of the sequence, such as 'ATAATTAT' in the given example designated by reference numeral 3300 is known in the art as a 'hair-pin structure'.
- sequence 'ATAATTAT' is what: a genomic sequence in which 'Hair-pin' sequences are genomic patterns which may indeed be biologically significant.
- Fig. 34 is a simplified illustration of an example demonstrating an advantage of the graphic genomic representation 3130 of Fig. 3 1 , in comparing a genomic motif sequence with the inverse-reverse thereof.
- Genomic motifs are short genomic sequences, which may have a specific biological significance or action. Genomic motifs may be compared to 'words', insofar as a word is a combination of English letters and has a specific meaning, and a genomic motif is a combination of a genomic nucleotides, and may have a specific action.
- An example for a well known genomic motif is the genomic sequence 'GATAA.
- genomic sequence data is typically provided for a sequence of nucleotides on a positive strand of the DNA.
- some segments of biologically significant genomic data are actually 'coded' on the negative, i.e. opposite, strand of the DNA.
- To inverse-reverse means to read the sequence from right to left, and replace each A with a T, and each C with a G and vice-versa.
- 'TTATC is the inverse-reverse of the genomic motif 'GATAA'.
- genomic motif 'GATAA' may appear in the genomic sequence either as 'GATAA', or as its inverse-reverse 'TTATC.
- 'GATAA' may appear in the genomic sequence either as 'GATAA', or as its inverse-reverse 'TTATC.
- the genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning genomic motifs from their inversed-reversed sequences, inasmuch as the inversed-reversed sequence presents a horizontal and vertical mirror image of the original motif. This is due to the fact that complementary nucleotide pairs adenine- thymine and cytosine-guanine, are graphically represented by complementary vertical halves of the same shape, as described with reference to Fig. 31.
- Fig. 34 enables comparison between the genomic sequence 'GATAA' which is a well known genomic motif, and the genomic sequence 'TTATC which is the inverse-reverse of this genomic motif. It is seen that the graphical representation of the inversed-reversed genomic motif 'TTATC as designated by reference numeral 3440, presents a vertical and horizontal mirror image of the genomic motif 'GATAA' as designate by reference numeral 3450. This provides the user with enhanced ease of visually discerning the similarity of these motifs. The same is true for the graphic representation with embedded letters, as depicted by reference numerals 3420 and 3430 respectively.
- Fig 35 is a simplified illustration of an example demonstrating an advantage of graphic genomic representation 3130 of Fig. 3 1 , in visually distinguishing adenine-thymine-rich sequences, from cytosine-guanine- rich sequences.
- genomic sequence "CCCGCTCCAGG”, which is a GC-rich sequence
- genomic sequence "TTTATTATCTA” which is an AT- rich sequence.
- Reference numerals 3500 and 3510 respectively designate these sequences in standard alphanumeric form
- reference numerals 3520 & 3530 and 3540 & 3550, respectively, designate these sequences graphically, with embedded letters and without embedded letters respectively.
- genomic graphic representation 3130 of Fig. 31 provides the user with enhanced ease of visually discerning GC-rich sequences, depicted by reference numerals 3520 and 3540, in which the predominant shapes are squares, from AT-rich sequences, depicted by reference numerals 3530 and 3550, in which the predominant shapes are ovals. This may be particularly useful, since AT-rich sequences and GC-rich sequences may have different genomic significance, as is well known in the art.
- Fig. 36 is a simplified illustration of an example which demonstrates an advantage of graphic genomic representation 3130 of Fig. 31 , in visually distinguishing purine nucleotides from pyrimidine nucleotides.
- meaningful genomic motifs often appear in a genome with slight variations, while still maintaining their biological function and significance.
- a motif in which variations are known to happen is typically described in terms of a 'consensus-sequence' which is a description of the location and frequency of acceptable 'mistakes', notwithstanding which the biological function of the motif is maintained.
- a consensus-sequence may be compared to an English word, for which several slightly different spellings may be considered acceptable, e.g. Haematology and Hematology.
- the consensus sequence may be related to a biochemical type of nucleotides.
- the consensus-sequence definition for the well known genomic motif 'GATA box' is WGATAR, where W stands for adenine or thymine nucleotide, and R stands for a purine nucleotides: either adenine or guanine.
- W stands for adenine or thymine nucleotide
- R stands for a purine nucleotides: either adenine or guanine.
- the consensus-sequence in this example states that both 'AGATAA' and 'AGATAG' may have the same biological function, despite the difference in the last nucleotide, since both adenine and guanine are purine nucleotides.
- the present invention provides the user with enhanced ease of visually discerning purine nucleotides from pyrimidine nucleotides, thereby making it easier to visually identify genomic consensus-sequence motifs in which the consensus-sequence definition contains a purine or a pyrimidine.
- Fig. 36 provides an example of the two genomic sequences 'AGATAA' and 'AGATAG' mentioned above, both being variants of the same consensus-sequence motif WGATAR mentioned above.
- Reference numeral 3600 designates a genomic alphanumeric representation of an adenine ending GATA box, 'AGATAA', and reference numeral 3610 designates a genomic graphic representation of a guanine ending GATA box, 'AGATAG'.
- the purine nucleotide ending the GATA box, adenine in reference numeral 3600 and guanine in reference numeral 3610 is surrounded by a broken-line border.
- Reference numerals 3640 and 3650 designate genomic graphic representations of these two variants of the WGATAR motif: 'AGATAA' and 'AGATAG' respectively. It is appreciated that the genomic graphic representation 3130 of Fig. 31 makes it easier to visually identify the similarity between fhe$e two variants of the same 'GATA box' consensus sequence, since adenine and guanine, which are both purine nucleotides, are graphically represented by upward oriented shapes: upward oriented half-oval 151 and upward oriented half-square 3153 respectively. The same is clearly true of the graphic representation with embedded letters, as depicted by reference numerals 3620 and 3630 respectively.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Testing And Monitoring For Control Systems (AREA)
Abstract
Description
Claims
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP02800693A EP1442292A2 (en) | 2001-10-12 | 2002-10-10 | Integrated system and method for analysis of genomic sequence data |
| AU2002334367A AU2002334367A1 (en) | 2001-10-12 | 2002-10-10 | Integrated system and method for analysis of genomic sequence data |
Applications Claiming Priority (12)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US32911201P | 2001-10-12 | 2001-10-12 | |
| US32911101P | 2001-10-12 | 2001-10-12 | |
| US32911401P | 2001-10-12 | 2001-10-12 | |
| US32911501P | 2001-10-12 | 2001-10-12 | |
| US32911001P | 2001-10-12 | 2001-10-12 | |
| US60/329,111 | 2001-10-12 | ||
| US09/976,911 | 2001-10-12 | ||
| US60/329,115 | 2001-10-12 | ||
| US60/329,114 | 2001-10-12 | ||
| US09/976,911 US20060129330A1 (en) | 2001-10-12 | 2001-10-12 | System and method for graphically representing genomic sequence data |
| US60/329,110 | 2001-10-12 | ||
| US60/329,112 | 2001-10-12 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2003031565A2 true WO2003031565A2 (en) | 2003-04-17 |
| WO2003031565A3 WO2003031565A3 (en) | 2004-03-04 |
Family
ID=27559744
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IL2002/000818 WO2003031565A2 (en) | 2001-10-12 | 2002-10-10 | Integrated system and method for analysis of genomic sequence data |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP1442292A2 (en) |
| AU (1) | AU2002334367A1 (en) |
| WO (1) | WO2003031565A2 (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140214780A1 (en) * | 2011-07-06 | 2014-07-31 | Christoph Lange | Systems and methods for genetic data compression |
| US9355108B2 (en) | 2012-11-07 | 2016-05-31 | International Business Machines Corporation | Storing data files in a file system |
| CN113037292A (en) * | 2021-02-02 | 2021-06-25 | 深圳市汉云科技有限公司 | Data compression method and data decompression method of structured database and communication equipment |
| US20230214319A9 (en) * | 2012-06-01 | 2023-07-06 | European Molecular Biology Laboratory | High-Capacity Storage of Digital Information in DNA |
| US11823774B2 (en) | 2016-03-15 | 2023-11-21 | Genomics, PLC | Compression/decompression method and apparatus for genomic variant call data |
-
2002
- 2002-10-10 WO PCT/IL2002/000818 patent/WO2003031565A2/en not_active Application Discontinuation
- 2002-10-10 EP EP02800693A patent/EP1442292A2/en not_active Withdrawn
- 2002-10-10 AU AU2002334367A patent/AU2002334367A1/en not_active Abandoned
Non-Patent Citations (2)
| Title |
|---|
| BENSON ET AL.: 'GenBank' NUCLEIC ACIDS RESEARCH vol. 25, no. 1, 1997, pages 1 - 6, XP002967189 * |
| WILLIAMS ET AL.: 'Compression of nucleotide databases for fast searching' COMPUTER APPLICATIONS IN THE BIOSCIENCES vol. 13, no. 5, 1997, pages 549 - 554, XP002967190 * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140214780A1 (en) * | 2011-07-06 | 2014-07-31 | Christoph Lange | Systems and methods for genetic data compression |
| US9519650B2 (en) * | 2011-07-06 | 2016-12-13 | President And Fellows Of Harvard College | Systems and methods for genetic data compression |
| US20230214319A9 (en) * | 2012-06-01 | 2023-07-06 | European Molecular Biology Laboratory | High-Capacity Storage of Digital Information in DNA |
| US9355108B2 (en) | 2012-11-07 | 2016-05-31 | International Business Machines Corporation | Storing data files in a file system |
| US9922041B2 (en) | 2012-11-07 | 2018-03-20 | International Business Machines Corporation | Storing data files in a file system |
| US10409777B2 (en) | 2012-11-07 | 2019-09-10 | International Business Machines Corporation | Storing data in a file system |
| US11221992B2 (en) | 2012-11-07 | 2022-01-11 | International Business Machines Corporation | Storing data files in a file system |
| US11823774B2 (en) | 2016-03-15 | 2023-11-21 | Genomics, PLC | Compression/decompression method and apparatus for genomic variant call data |
| CN113037292A (en) * | 2021-02-02 | 2021-06-25 | 深圳市汉云科技有限公司 | Data compression method and data decompression method of structured database and communication equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| EP1442292A2 (en) | 2004-08-04 |
| AU2002334367A1 (en) | 2003-04-22 |
| WO2003031565A3 (en) | 2004-03-04 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Kumar et al. | Time-series bitmaps: a practical visualization tool for working with large time series databases | |
| Cox et al. | Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform | |
| US8595196B2 (en) | Computer product, information retrieving apparatus, and information retrieval method | |
| US20100131476A1 (en) | Computer product, information retrieval method, and information retrieval apparatus | |
| JP5309570B2 (en) | Information retrieval apparatus, information retrieval method, and control program | |
| US20020165707A1 (en) | Methods and apparatus for storing and processing natural language text data as a sequence of fixed length integers | |
| US20120072434A1 (en) | Information retrieval method, information retrieval apparatus, and computer product | |
| JP6343081B1 (en) | Recording medium recording code code classification search software | |
| KR20200018469A (en) | Computerized Methods for Data Compression and Analysis | |
| WO2003031565A2 (en) | Integrated system and method for analysis of genomic sequence data | |
| US20220171815A1 (en) | System and method for generating filters for k-mismatch search | |
| Deorowicz et al. | AGC: Compact representation of assembled genomes | |
| JPH0869476A (en) | Retrieval system | |
| Wandelt et al. | Column-wise compression of open relational data | |
| JP3923803B2 (en) | Input prediction processing program | |
| US10752958B2 (en) | Identification of microorganisms from genome sequencing data | |
| US8571809B2 (en) | Apparatus for calculating scores for chains of sequence alignments | |
| JP5347307B2 (en) | Information retrieval apparatus, information retrieval method, and control program | |
| CN108292307A (en) | With the quick operating prefix Burrow-Wheeler transformation to compressed data | |
| JPH1185794A (en) | Search term input device and recording medium recording search term input program | |
| KR100513266B1 (en) | Client/server based workbench system and method for expressed sequence tag analysis | |
| WO2008129459A2 (en) | A method for visualizing a dna sequence | |
| Ursing et al. | Exprot-a database for experimentally verified protein functions | |
| JP2005087069A (en) | Biological information lossless encoding device, search device, and three-dimensional information lossless encoding device | |
| US20050136457A1 (en) | Method for analyzing genome |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG US UZ VC VN YU ZA ZM |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| WWE | Wipo information: entry into national phase |
Ref document number: 2002800693 Country of ref document: EP |
|
| WWP | Wipo information: published in national office |
Ref document number: 2002800693 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: JP |
|
| WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |
|
| WWW | Wipo information: withdrawn in national office |
Ref document number: 2002800693 Country of ref document: EP |