
US20160048608A1 - Systems and methods for genetic analysis - Google Patents

Publication number
US20160048608A1
Authority
US
United States
Prior art keywords
node
mutation
file
graph database
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/826,595
Inventor
Alexander Frieden
Caleb J. Kennedy
Xavier S. Haurie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Labcorp Holdings Inc
Original Assignee
Good Start Genetics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Good Start Genetics Inc
Priority to US14/826,595
Publication of US20160048608A1
Assigned to GOOD START GENETICS, INC. reassignment GOOD START GENETICS, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KENNEDY, CALEB J., FRIEDEN, Alexander, HAURIE, XAVIER S.
Assigned to INN SA LLC reassignment INN SA LLC SECURITY INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: COMBIMATRIX CORPORATION, GOOD START GENETICS, INC., INVITAE CORPORATION
Assigned to COMBIMATRIX CORPORATION, INVITAE CORPORATION, GOOD START GENETICS, INC. reassignment COMBIMATRIX CORPORATION RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: INN SA LLC
Priority to US17/000,054 (US12386895B2)
Assigned to PERCEPTIVE CREDIT HOLDINGS III, LP reassignment PERCEPTIVE CREDIT HOLDINGS III, LP PATENT SECURITY AGREEMENT Assignors: GOOD START GENETICS, INC., INVITAE CORPORATION, SINGULAR BIO, INC., YOUSCRIPT, LLC
Assigned to INVITAE CORPORATION reassignment INVITAE CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: GOOD START GENETICS, INC.
Assigned to INVITAE CORPORATION reassignment INVITAE CORPORATION CORRECTIVE ASSIGNMENT TO CORRECT THE THE SCHEDULE A OF THE CONFIRMATORY ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 056756 FRAME: 0884. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: GOOD START GENETICS, INC.
Assigned to GOOD START GENETICS, INC., SINGULAR BIO, INC., INVITAE CORPORATION, YOUSCRIPT, LLC reassignment GOOD START GENETICS, INC. RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: PERCEPTIVE CREDIT HOLDINGS III, LP
Assigned to LABORATORY CORPORATION OF AMERICA HOLDINGS reassignment LABORATORY CORPORATION OF AMERICA HOLDINGS ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: INVITAE CORPORATION

Classifications

    • G06F17/30958
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9038 Presentation of query results
    • G06F17/30991
    • G06F19/28
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • the invention relates to medical genetics.
  • Before having children, a person may turn to genetic screening to find out if he or she is a carrier of a genetic condition. Genetic carrier screening can be done using next-generation sequencing (NGS), which produces millions of “base-calls” read from the person's genome. Typically, those base calls are then compared to a reference genome to determine their clinical significance. While all 3.2 billion base-pairs of the human genome are available for use as a reference (e.g., as hg18), knowing the clinical significance of features in the person's genome requires turning to medical literature or specialized databases of mutations. For example, the Online Mendelian Inheritance in Man (OMIM) database contains information on genetic disorders in over 12,000 human genes.
  • the invention provides systems and methods for genetic analysis in which entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and in which relationships between entities are also individually represented and stored.
  • Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any novel entity without disrupting the information already represented in the system.
  • the run time for queries need not be proportional to the amount of data in the tables. Instead, queries that start with a certain node can find the relevant related nodes in time proportional only to the number of nodes in the results that match the query.
  • novel entities and relationships can be inserted into the data system upon discovery with no disruption to the data or operation of the system.
  • novel mutations can be added or related to disease phenotypes or appropriate literature references as that new information is discovered and observed.
  • the time required for a query of—for example—relationships between a patient and disease-associated alleles in that patient's genome will be proportional to the number of results that are found for inclusion in a report for that patient.
  • When sequencing uncovers novel mutations or genotype/phenotype associations, those entities and relationships can be brought into the system and included in the reporting without requiring any changes or re-design to the underlying system architecture.
  • NGS results, patient information, and medical information can be stored in a graph database and analyzed using graph processing approaches and languages. This provides for very rapid querying and report generation, independent of the size of the underlying data store.
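  • The claim that query time depends on the results rather than on the size of the store can be illustrated with a minimal in-memory sketch. The `Node` class, the relationship names (`HAS_SAMPLE`, `HAS_VARIANT`), and the variant label are hypothetical and chosen only for illustration; they are not the schema of the described system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entity (patient, sample, variant, ...) holding direct references to its neighbors."""
    label: str
    props: dict
    out: list = field(default_factory=list)  # list of (relationship_type, Node)

    def relate(self, rel_type, other):
        self.out.append((rel_type, other))


def reachable(start, rel_path):
    """Follow a fixed sequence of relationship types from `start`.

    Returns the nodes at the end of the path and the number of nodes touched.
    """
    frontier, touched = [start], 1
    for rel_type in rel_path:
        nxt = [n for node in frontier for (r, n) in node.out if r == rel_type]
        touched += len(nxt)
        frontier = nxt
    return frontier, touched


def build_demo(n_other_patients):
    """Build one patient of interest plus unrelated patients that only add volume."""
    patient = Node("Patient", {"id": "P1"})
    sample = Node("Sample", {"id": "S1"})
    variant = Node("Variant", {"name": "hypothetical-variant-1"})
    patient.relate("HAS_SAMPLE", sample)
    sample.relate("HAS_VARIANT", variant)
    others = []
    for i in range(n_other_patients):
        p = Node("Patient", {"id": f"X{i}"})
        p.relate("HAS_SAMPLE", Node("Sample", {"id": f"XS{i}"}))
        others.append(p)
    return patient, others


path = ["HAS_SAMPLE", "HAS_VARIANT"]
small, _small_rest = build_demo(10)
large, _large_rest = build_demo(10_000)
small_hits, small_touched = reachable(small, path)
large_hits, large_touched = reachable(large, path)
```

  • In this sketch the traversal touches the same number of nodes whether the store holds ten or ten thousand unrelated patients, because it only follows references held by the starting node and its neighbors.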
  • the invention includes the insight that the clinical significance of mutations—or “variants”, e.g., as documented in NGS results such as Variant Call Format (VCF) files—can be shown by relating the mutation to a particular allele of a gene and showing where in the literature the variant is reported as pathogenic or benign while connecting this information back to a patient and lab sample for reporting purposes.
  • Sequencing by existing NGS technologies may provide abundant high-quality raw data in the form of sequence files such as FASTA, FASTQ, Sequence Alignment Map (SAM), Binary Alignment Map (BAM), or VCF files.
  • Systems and methods of the invention can be used to extract relevant data from those files into the described nodes to support the rapid querying and report generation useful for NGS carrier screening.
  • systems of the invention may include an Application Programming Interface (API) that takes as input VCF files and creates a network of nodes representing patients, samples, VCF files, VCF records, variants, alleles, and literature reports with relationships connecting adjacent pairs of those nodes according to their natural relationships.
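  • A loader of the kind described could be sketched as follows. This is not the system's API: the `Graph` class, the `load_vcf_records` function, the relationship names, and the example variant, allele, and citation are all hypothetical placeholders. The sketch assumes the VCF records have already been parsed into dictionaries.

```python
from collections import defaultdict


class Graph:
    """Tiny adjacency-list graph keyed by tuple identifiers (illustrative only)."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., other properties}
        self.edges = defaultdict(list)  # node_id -> [(relationship_type, node_id)]

    def add_node(self, node_id, label, **props):
        # setdefault keeps the first version of a node if it is added twice
        self.nodes.setdefault(node_id, {"label": label, **props})
        return node_id

    def relate(self, src, rel_type, dst):
        if (rel_type, dst) not in self.edges[src]:
            self.edges[src].append((rel_type, dst))


def load_vcf_records(graph, patient_id, sample_id, vcf_name, records, annotations):
    """Create the chain patient -> sample -> file -> record -> variant -> allele -> literature.

    records:     list of dicts with keys chrom, pos, ref, alt
    annotations: variant key -> (allele name, citation), assumed to come from curation
    """
    p = graph.add_node(("patient", patient_id), "Patient")
    s = graph.add_node(("sample", sample_id), "Sample")
    f = graph.add_node(("vcf", vcf_name), "VcfFile")
    graph.relate(p, "HAS_SAMPLE", s)
    graph.relate(s, "HAS_FILE", f)
    for i, rec in enumerate(records):
        key = f'{rec["chrom"]}:{rec["pos"]}{rec["ref"]}>{rec["alt"]}'
        r = graph.add_node(("record", vcf_name, i), "VcfRecord", **rec)
        v = graph.add_node(("variant", key), "Variant", key=key)
        graph.relate(f, "HAS_RECORD", r)
        graph.relate(r, "CALLS", v)
        if key in annotations:
            allele, citation = annotations[key]
            a = graph.add_node(("allele", allele), "Allele", name=allele)
            lit = graph.add_node(("lit", citation), "Literature", citation=citation)
            graph.relate(v, "PART_OF", a)
            graph.relate(a, "REPORTED_IN", lit)
    return p


g = Graph()
records = [{"chrom": "1", "pos": 1000, "ref": "A", "alt": "G"}]          # made-up record
annotations = {"1:1000A>G": ("EXAMPLE-ALLELE", "placeholder citation")}  # made-up curation
root = load_vcf_records(g, "P1", "S1", "sample1.vcf", records, annotations)
```

  • Because variant, allele, and literature nodes are keyed by their content, loading a second file that contains the same variant would reuse the existing nodes and only add new record nodes and relationships.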
  • the system supports a genomics analysis clinical pipeline even as it changes and can accommodate the loading in of external data.
  • the system can be implemented using a graph database and related software.
  • Systems of the invention support a variety of analyses and use cases. For example, with NGS-based carrier screening implemented using the described graph database structure for analysis and reporting, it becomes easy to query and report such phenomena as allele frequencies.
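  • As a hedged illustration of an allele-frequency query, the sketch below counts allele copies over patient-to-allele relationships and divides by the number of chromosomes, assuming diploid autosomal loci. The edge layout `(patient_id, allele_name, copies)` and all identifiers are hypothetical.

```python
def allele_frequency(carries, patients, allele):
    """Fraction of chromosomes carrying `allele` among diploid `patients`.

    carries:  iterable of (patient_id, allele_name, copies) edges, copies in {1, 2}
    patients: set of all genotyped patient ids, including non-carriers
    """
    if not patients:
        raise ValueError("no patients in the database")
    copies = sum(n for pid, name, n in carries if name == allele and pid in patients)
    return copies / (2 * len(patients))


patients = {"P1", "P2", "P3", "P4"}
carries = [("P1", "A1", 1), ("P2", "A1", 2), ("P3", "A2", 1)]
freq_a1 = allele_frequency(carries, patients, "A1")  # 3 copies over 8 chromosomes
```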
  • Curating variants includes identifying an individual variant in sequencing results, researching medical literature for information about the variant, classifying the variant (e.g., pathogenic, benign, somewhere in between), and accessioning that information into the database for use in subsequent reports on patient samples in which that variant is implicated.
  • variants can be connected to alleles, literature references, medical information, or combinations thereof. If changes are subsequently made (e.g., a missense mutation is re-classified as a nonsense mutation), other features of the system infrastructure are not disrupted.
  • the active curation of variants is accommodated and improves the system.
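  • The effect of re-classifying a curated variant can be sketched as below. The class names, the `reclassify` method, and the classification labels are assumptions made for illustration. The point is only that patients hold references to a shared variant node, so a change made once is seen by every later report.

```python
class VariantNode:
    def __init__(self, key, classification):
        self.key = key
        self.classification = classification
        self.history = []  # earlier classifications with the evidence that replaced them

    def reclassify(self, new_classification, evidence):
        self.history.append((self.classification, evidence))
        self.classification = new_classification


class PatientNode:
    def __init__(self, pid):
        self.pid = pid
        self.variants = []  # direct references to shared VariantNode objects


def report(patient):
    """Map each of the patient's variants to its current classification."""
    return {v.key: v.classification for v in patient.variants}


shared = VariantNode("hypothetical-variant-1", "uncertain significance")
p1, p2 = PatientNode("P1"), PatientNode("P2")
p1.variants.append(shared)
p2.variants.append(shared)

before = report(p1)
shared.reclassify("pathogenic", evidence="placeholder literature reference")
after_p1, after_p2 = report(p1), report(p2)
```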
  • the invention provides a method for analyzing mutations.
  • the method includes obtaining data representing a mutation in a genome of an individual and using a node in a graph database to store a description of the mutation.
  • the node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant.
  • the method includes querying the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
  • the data representing the mutation may be obtained by obtaining a sample that includes a nucleic acid from the individual and sequencing the nucleic acid to obtain a sequence read file that includes the data.
  • the sample may be represented in the graph database using a sample node and the sample node may be connected via a pointer to a read file node representing the sequence read file.
  • the graph database may include nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants as well as edges defining relationships between pairs of the nodes.
  • the data representing a mutation is obtained as part of a file such as a variant call file (VCF), a sequence alignment map (SAM) file, a binary alignment map (BAM) file, a FASTA file, or a FASTQ file.
  • the data representing a mutation comprises a description of the mutation as a variant of a reference human genome.
  • the description of the mutation may be provided as a VCF record in a VCF file.
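  • For orientation, a VCF data line carries eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The following is a minimal parsing sketch, not the parser of FIG. 4; the function names and the example line are illustrative, and genotype columns are ignored.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dict of its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # keys without "=" are flags
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def parse_vcf(text):
    """Parse all data lines, skipping blank lines and header lines that start with '#'."""
    return [parse_vcf_line(l) for l in text.splitlines() if l and not l.startswith("#")]


example = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tA\tG,T\t50\tPASS\tDP=12;DB\n"
)
vcf_records = parse_vcf(example)
```

  • Each parsed record holds what a variant node would need to describe a mutation relative to the reference: chromosome, position, reference allele, and alternate allele(s).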
  • the method may include obtaining sequencing data that represents a plurality of mutations in the genome of the individual—each of the plurality of mutations being represented as variant calls relative to a human genome reference. For each of the plurality of mutations, a corresponding variant node in the graph database is used to store a description of that mutation.
  • the system includes at least one computer comprising memory coupled to a processor.
  • the system has at least a portion of a graph database stored therein.
  • the system is operable to obtain data representing a mutation in a genome of an individual, use a variant node in the graph database to store a description of the mutation, and store—within the variant node—a pointer to an adjacent node that provides information about a clinical significance of the mutation.
  • the system may be used to query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
  • the data representing a mutation may be obtained as part of a file such as a VCF file.
  • the system may represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.
  • the data representing the mutation may be provided as a sequence read file that includes that data.
  • the system is operable to use the graph database to represent a biological sample from the individual with a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • the system may be operated to obtain sequencing data representing a plurality of mutations in the genome of the individual (e.g., as variant calls relative to a human genome reference) and use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation.
  • the system links the individual to an allele node based on the plurality of mutations.
  • the invention provides: a system for describing genetic information, the system comprising: at least one computer comprising memory coupled to a processor, the system having at least a portion of a graph database stored therein, wherein the system is operable to: obtain data representing a mutation in a genome of an individual; use a node in the graph database to store a description of the mutation; store, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
  • a pointer identifies a physical location in the memory at which the adjacent node is stored. Thus each node may be stored at a specific physical location in the memory.
  • Each such specific physical location is referenced by a pointer (which itself optionally may be stored within a node at a physical location that is referenced, in-turn, by another pointer).
  • each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored.
  • the pointer or native pointer is manipulatable as a memory address: it points to a physical location in the memory, and dereferencing the pointer accesses the intended data. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer.
  • a pointer's value is interpreted as a memory address, at a low-level or hardware level.
  • the speed and efficiency of the described low-level, or hardware level, memory referencing allows for incredibly rapid graph traversals, which means that data content can scale up unbounded but reporting actionable medical genetic information will not require amounts of time that scale up with the data content.
  • Use of hardware level references, or index-free adjacency, uncouples the time requirements for medical genetics reporting from data content volume.
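  • Fixed-size records and index-free adjacency can be sketched together. In the sketch below, byte offsets into a `bytearray` stand in for memory addresses, and the record layouts (`NODE`, `REL`) are assumptions made for illustration, not the layout of FIG. 8. Because every record has the same size, a record id converts to its location by multiplication, and each relationship record points directly at the next one for the same start node.

```python
import struct

# Node record: id of first relationship (-1 = none), opaque payload
# (for example, an id into a separate property store).
NODE = struct.Struct("<iI")
# Relationship record: start node id, end node id,
# id of the next relationship of the start node (-1 = none).
REL = struct.Struct("<iii")


class Store:
    def __init__(self):
        self.nodes = bytearray()
        self.rels = bytearray()

    def add_node(self, payload):
        node_id = len(self.nodes) // NODE.size
        self.nodes += NODE.pack(-1, payload)
        return node_id

    def add_rel(self, start, end):
        rel_id = len(self.rels) // REL.size
        first_rel, payload = NODE.unpack_from(self.nodes, start * NODE.size)
        # The new relationship becomes the head of the start node's chain.
        self.rels += REL.pack(start, end, first_rel)
        NODE.pack_into(self.nodes, start * NODE.size, rel_id, payload)
        return rel_id

    def neighbors(self, node_id):
        """Yield adjacent node ids by chasing record offsets; no index lookup is involved."""
        rel_id, _payload = NODE.unpack_from(self.nodes, node_id * NODE.size)
        while rel_id != -1:
            _start, end, rel_id = REL.unpack_from(self.rels, rel_id * REL.size)
            yield end


store = Store()
variant = store.add_node(payload=1)
lit_a = store.add_node(payload=2)
lit_b = store.add_node(payload=3)
store.add_rel(variant, lit_a)
store.add_rel(variant, lit_b)
variant_neighbors = list(store.neighbors(variant))  # most recently added first
```

  • The cost of `neighbors` depends only on how many relationships the starting node has, not on how many records the two stores contain.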
  • the system is operable to obtain the data representing the mutation by receiving at least one sequence read file that includes the data.
  • the system of the first embodiment is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • the data representing the mutation is obtained as part of a file.
  • the file may have a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ.
  • the system is operable to represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.
  • the system is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • the data representing the mutation comprises a description of the mutation as a variant of a reference human genome.
  • the description of the mutation may optionally be obtained from a VCF record in a VCF file.
  • the system of the third embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • the system is further operable to: obtain sequencing data representing a plurality of mutations in the genome of the individual, the plurality of mutations being represented as variant calls relative to a human genome reference; use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation; and link the individual to an allele node based on the plurality of mutations.
  • the graph database may include: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes.
  • the system of the fourth embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • the graph database comprises: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes.
  • the system may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • FIG. 1 illustrates an exemplary NGS workflow for carrier screening.
  • FIG. 2 gives a sample of an exemplary VCF file.
  • FIG. 3 diagrams a method for analyzing mutations.
  • FIG. 4 gives a flow chart for a VCF file parser.
  • FIG. 5 presents a model of data received from parsing a VCF file.
  • FIG. 6 shows an entity relationship diagram (ERD) of the data modeled by FIG. 5.
  • FIG. 7 diagrams a high-level architecture of a system of the invention.
  • FIG. 8 illustrates a structure for nodes and relationships on disk.
  • FIG. 9 illustrates the use of a variant node to store a description of a mutation.
  • FIG. 10 shows an allele node showing that an allele includes a certain mutation.
  • FIG. 11 shows a variant node connected to two different literature reference nodes.
  • FIG. 12 illustrates updating information about a mutation.
  • FIG. 13 presents an example database that may be queried for allele frequency.
  • FIG. 14 diagrams a system for performing methods of the invention.
  • the invention relates to using a graph database in genetic analyses to link mutation data to extrinsic data.
  • Entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any entity without disrupting the existing data.
  • Systems and methods of the invention may be used for obtaining data representing a mutation in an individual and using a variant node in a graph database to store a description of the mutation.
  • the variant node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant.
  • the graph database can be queried to provide a report of the clinical significance of the mutation.
  • systems and methods of the invention operate within the context of a carrier screening workflow and provide a querying and reporting tool for carrier screening.
  • FIG. 1 illustrates an exemplary NGS workflow for carrier screening.
  • the workflow combines automated, optimized molecular inversion probe target capture 109 with molecular barcoding to maximize the sample throughput of an NGS machine and employs assembly and alignment methods that allow accurate identification of both substitution and insertion/deletion lesions.
  • the workflow is applicable to, for example, genes in which loss-of-function mutations cause recessive Mendelian disorders often included as part of routine carrier screening.
  • a screening or analysis may begin with obtaining nucleic acid from a sample.
  • Nucleic acid in a sample can be any nucleic acid, including for example, genomic DNA in a tissue sample, cDNA amplified from a particular target in a laboratory sample, or mixed DNA from multiple organisms.
  • the sample includes homozygous DNA from a haploid or diploid organism.
  • a sample can include genomic DNA from a patient who is homozygous for a rare recessive allele.
  • the sample includes heterozygous genetic material from a diploid or polyploid organism with a somatic mutation such that two related nucleic acids are present in allele frequencies other than 50 or 100%, e.g., 20%, 5%, 1%, 0.1%, or any other allele frequency.
  • nucleic acid template molecules are isolated from a biological sample containing a variety of other components, such as proteins, lipids, and non-template nucleic acids.
  • Nucleic acid template molecules can be obtained from any cellular material, obtained from animal, plant, bacterium, fungus, or any other cellular organism. Biological samples for use in the present invention also include viral particles or preparations. Nucleic acid template molecules can be obtained directly from an organism or from a biological sample obtained from an organism, e.g., from blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, and tissue.
  • a tissue or body fluid specimen may be used, e.g., a human tissue or bodily fluid specimen.
  • Nucleic acid template molecules can also be isolated from cultured cells, such as a primary cell culture or cell line. The cells or tissues from which template nucleic acids are obtained can be infected with a virus or other intracellular pathogen.
  • a sample can also be total RNA extracted from a biological specimen, a cDNA library, viral, or genomic DNA.
  • a sample may also be isolated DNA from a non-cellular origin, e.g. amplified/isolated DNA from the freezer.
  • nucleic acid can be extracted, isolated, amplified, or analyzed by a variety of techniques such as those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (Fourth Edition), Cold Spring Harbor Laboratory Press, Woodbury, N.Y. 2,028 pages (2012); or as described in U.S. Pat. No. 7,957,913; U.S. Pat. No. 7,776,616; U.S. Pat. No. 5,234,809; U.S. Pub. 2010/0285578; and U.S. Pub. 2002/0190663.
  • Nucleic acid from a sample may optionally be fragmented or sheared to a desired length, using a variety of mechanical, chemical, and/or enzymatic methods.
  • DNA may be randomly sheared via sonication using, for example, an ultrasonicator sold by Covaris (Woburn, Mass.), brief exposure to a DNase, or using a mixture of one or more restriction enzymes, or a transposase or nicking enzyme.
  • RNA may be fragmented by brief exposure to an RNase, heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation.
  • nucleic acid is fragmented by sonication.
  • nucleic acid is fragmented by a hydroshear instrument.
  • individual nucleic acid template molecules can be from about 2 kb to about 40 kb.
  • nucleic acids are about 6 kb-10 kb fragments.
  • Nucleic acid molecules may be single-stranded, double-stranded, or double stranded with single-stranded regions (for example, stem- and loop-structures).
  • a biological sample may be lysed, homogenized, or fractionated in the presence of a detergent or surfactant as needed.
  • Suitable detergents may include an ionic detergent (e.g., sodium dodecyl sulfate or N-lauroylsarcosine) or a nonionic detergent (such as the polysorbate 80 sold under the trademark TWEEN by Uniqema Americas (Paterson, N.J.) or C14H22O(C2H4)n, known as TRITON X-100).
  • genomic DNA samples are input to a molecular inversion probe capture 109 reaction.
  • Molecular inversion probes may be designed to capture the coding regions as well as well-characterized noncoding regions of genes. Such probes may include 5′ and 3′ targeting arms (extension and ligation, respectively) of, for example, about a total of 40 nucleotides and being designed to flank 130-bp target regions. Each target is captured 109 by multiple probes that anneal to non-overlapping genomic intervals. PCR is performed 121 using primers containing patient-specific barcodes, yielding barcode libraries. Genomic DNA may be subjected to multiplex target capture using molecular inversion probes. Captured product may be subjected to PCR to attach molecular barcodes in a manner that allows sequencing from either end of the captured region.
  • PCR may be used as described or any other amplification reaction may be performed.
  • Amplification refers to production of additional copies of a nucleic acid sequence and is generally carried out using polymerase chain reaction (PCR) or other technologies known in the art.
  • the amplification reaction may be any amplification reaction known in the art that amplifies nucleic acid molecules such as PCR (e.g., nested PCR, PCR-single strand conformation polymorphism, ligase chain reaction, strand displacement amplification and restriction fragments length polymorphism, transcription based amplification system, rolling circle amplification, and hyper-branched rolling circle amplification, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RTPCR), restriction fragment length polymorphism PCR).
  • Amplification adapters may be attached to the fragmented nucleic acid.
  • Adapters may be commercially obtained, such as from Integrated DNA Technologies (Coralville, Iowa).
  • the adapter sequences are attached to the template nucleic acid molecule with an enzyme.
  • the enzyme may be a ligase or a polymerase.
  • the ligase may be any enzyme capable of ligating an oligonucleotide (RNA or DNA) to the template nucleic acid molecule.
  • Suitable ligases include T4 DNA ligase and T4 RNA ligase, available commercially from New England Biolabs (Ipswich, Mass.). Methods for using ligases are well known in the art.
  • the polymerase may be any enzyme capable of adding nucleotides to the 3′ and the 5′ terminus of template nucleic acid molecules.
  • Embodiments of the invention involve attaching the bar code sequences to the template nucleic acids, e.g., for barcode PCR 121.
  • a bar code is attached to each fragment.
  • a plurality of bar codes e.g., two bar codes, are attached to each fragment.
  • a bar code sequence generally includes certain features that make the sequence useful in sequencing reactions. For example, the bar code sequences are designed to have minimal or no homo-polymer regions, i.e., 2 or more of the same base in a row such as AA or CCC, within the bar code sequence. The bar code sequences are also designed so that they are at least one edit distance away from the base addition order when performing base-by-base sequencing, ensuring that the first and last base do not match the expected bases of the sequence.
  • the bar code sequences are designed such that each sequence is correlated to a particular portion of nucleic acid, allowing sequence reads to be correlated back to the portion from which they came. Methods of designing sets of bar code sequences are shown for example in U.S. Pat. No. 6,235,475, the contents of which are incorporated by reference herein in their entirety.
  • the bar code sequences range from about 5 nucleotides to about 15 nucleotides. In a particular embodiment, the bar code sequences range from about 4 nucleotides to about 7 nucleotides. Since the bar code sequence is sequenced along with the template nucleic acid, the oligonucleotide length should be of minimal length so as to permit the longest read from the template nucleic acid attached.
  • the bar code sequences are spaced from the template nucleic acid molecule by at least one base (minimizes homo-polymeric combinations).
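  • The homo-polymer constraint on bar code sequences can be checked mechanically. The sketch below enumerates candidate bar codes of a given length and rejects any with two or more identical bases in a row. The pairwise minimum Hamming distance is an extra constraint added here as an assumption (a common way to keep bar codes distinguishable); it is not taken from the text, and the function names are hypothetical.

```python
from itertools import product


def has_homopolymer(seq, run=2):
    """True if `seq` contains `run` or more identical bases in a row."""
    return any(seq[i:i + run] == seq[i] * run for i in range(len(seq) - run + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def design_barcodes(length, min_distance=1, limit=None):
    """Greedily pick homo-polymer-free bar codes that differ pairwise by >= min_distance."""
    chosen = []
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if has_homopolymer(seq):
            continue
        if all(hamming(seq, other) >= min_distance for other in chosen):
            chosen.append(seq)
            if limit is not None and len(chosen) == limit:
                break
    return chosen


all_candidates = design_barcodes(5)                    # every 5-mer without a repeated base
spaced = design_barcodes(5, min_distance=3, limit=8)   # a small, well-separated set
```

  • For length 5 there are 4 × 3^4 = 324 sequences with no repeated adjacent base, which is why short bar codes can still label many samples.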
  • the bar code sequences are attached to the template nucleic acid molecule, e.g., with an enzyme.
  • the enzyme may be a ligase or a polymerase, as discussed below. Attaching bar code sequences to nucleic acid templates is shown in U.S. Pub. 2008/0081330 and U.S. Pub. 2011/0301042, the contents of which are incorporated by reference herein in their entirety. Methods for designing sets of bar code sequences and other methods for attaching bar code sequences are shown in U.S. Pat. Nos.
  • nucleic acid can be sequenced 129.
  • Sequencing 129 may be by any method known in the art.
  • DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing. Sequencing of separated molecules has more recently been demonstrated by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes.
  • a sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, the contents of which are incorporated by reference herein in their entirety. 454 sequencing involves two steps.
  • DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended.
  • Oligonucleotide adaptors are then ligated to the ends of the fragments.
  • the adaptors serve as primers for amplification and sequencing of the fragments.
  • the fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag.
  • the fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead.
  • the beads are captured in wells (pico-liter sized).
  • Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed.
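  • As a minimal sketch of how a signal proportional to the number of incorporated nucleotides can be turned into bases, the following Python function decodes a hypothetical flowgram. The flow order "TACG", the normalized signal values, and the function name are assumptions made for illustration and are not taken from any instrument's software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalized light signals into a base sequence.

    Each signal corresponds to one nucleotide flow; its rounded value is
    taken as the number of identical bases incorporated during that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = int(round(signal))  # signal strength ~ number of incorporations
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical normalized signals for two cycles of the four flows.
read = decode_flowgram([1.02, 0.05, 2.1, 0.0, 0.1, 0.97, 0.0, 3.05])
```

  • A signal near 2 or 3 is read as a homopolymer run of that length, which is why signal calibration matters for long runs of the same base.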
  • In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library.
  • internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library.
  • clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components.
  • templates are denatured and beads are enriched to separate the beads with extended templates.
  • Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide.
  • the sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is removed and the process is then repeated.
  • ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pub. 2010/0304982; U.S. Pub. 2010/0301398; U.S. Pub. 2010/0300895; U.S. Pub. 2010/0300559; and U.S. Pub. 2009/0026082, the contents of each of which are incorporated by reference in their entirety.
  • Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5′ and 3′ ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 copies of single-stranded DNA molecules of the same template in each channel of the flow cell.
  • Primers, DNA polymerase and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3′ terminators and fluorophores from each incorporated base are removed and the incorporation, detection and identification steps are repeated. Sequencing according to this technology is described in U.S. Pat. No. 7,960,120; U.S. Pat. No. 7,835,871; U.S. Pat. No. 7,232,656; U.S. Pat. No. 7,598,035; U.S. Pat. No. 6,911,345; U.S. Pat. No.
  • SMRT single molecule, real-time
  • each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked.
  • a single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.
  • Nanopore sequencing (Soni & Meller, 2007, Progress toward ultrafast DNA sequence using solid-state nanopores, Clin Chem 53(11):1996-2001).
  • a nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore.
  • each nucleotide on the DNA molecule obstructs the nanopore to a different degree.
  • the change in the current passing through the nanopore as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.
  • a sequencing technique involves using a chemical-sensitive field effect transistor (chemFET) array to sequence DNA (for example, as described in U.S. Pub. 2009/0026082).
  • DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase.
  • Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET.
  • An array can have multiple chemFET sensors.
  • single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.
  • Another example of a sequencing technique involves using an electron microscope as described, for example, by Moudrianakis, E. N. and Beer M., in Base sequence determination in nucleic acids with the electron microscope, III. Chemistry and microscopy of guanine-labeled DNA, PNAS 53:564-71 (1965).
  • individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to measure sequences.
  • Sequencing according to embodiments of the invention generates a plurality of reads.
  • Reads according to the invention generally include sequences of nucleotide data less than about 5000 bases in length, or less than about 150 bases in length. In certain embodiments, reads are between about 80 and about 90 bases, e.g., about 85 bases in length. In some embodiments, methods of the invention are applied to very short reads, i.e., less than about 50 or about 30 bases in length.
  • Sequence read data can include the sequence data as well as meta information. Sequence read data can be stored in any suitable file format including, for example, VCF files, FASTA files or FASTQ files, as are known to those of skill in the art.
  • PCR product is pooled and sequenced (e.g., on an Illumina HiSeq 2000).
  • Raw .bcl files are converted to qseq files using bclConverter (Illumina).
  • FASTQ files are generated by “de-barcoding” genomic reads using the associated barcode reads; reads for which barcodes yield no exact match to an expected barcode, or contain one or more low-quality base calls, may be discarded. Reads may be stored in any suitable format such as, for example, FASTA or FASTQ format.
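  • The de-barcoding step described above can be sketched as follows. This is an illustrative outline only: the function name, the input layout, and the quality threshold of 20 are assumptions, and the quality check is assumed here to apply to the barcode read.

```python
def debarcode(read_pairs, expected_barcodes, min_quality=20):
    """Assign genomic reads to samples by exact barcode match.

    read_pairs: iterable of (genomic_read, barcode_seq, barcode_quals)
    expected_barcodes: dict mapping barcode sequence -> sample name
    min_quality: hypothetical threshold below which a base call is "low quality"
    """
    by_sample = {}
    for genomic_read, barcode, quals in read_pairs:
        if barcode not in expected_barcodes:
            continue  # no exact match to an expected barcode: discard
        if any(q < min_quality for q in quals):
            continue  # one or more low-quality base calls: discard
        sample = expected_barcodes[barcode]
        by_sample.setdefault(sample, []).append(genomic_read)
    return by_sample


# Made-up reads: two are kept, two are discarded.
pairs = [
    ("ACGT", "AAAA", [30, 30, 30, 30]),
    ("GGGG", "AAAT", [30, 30, 30, 30]),  # barcode not expected
    ("TTTT", "CCCC", [30, 10, 30, 30]),  # low-quality barcode base
    ("CCCC", "CCCC", [30, 30, 30, 30]),
]
assigned = debarcode(pairs, {"AAAA": "sample1", "CCCC": "sample2"})
```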
  • FASTA is originally a computer program for searching sequence databases and the name FASTA has come to also refer to a standard file format. See Pearson & Lipman, 1988, Improved tools for biological sequence comparison, PNAS 85:2444-2448.
  • a sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the “>” and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.
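  • The FASTA layout just described can be read with a few lines of Python. The sketch below is not a full-featured parser; the function name and the example records are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new ">" line ends the previous sequence and starts another.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line.strip())
    if identifier is not None:
        yield identifier, description, "".join(chunks)


records = list(parse_fasta([
    ">seq1 example read\n",
    "ACGT\n",
    "TTGA\n",
    ">seq2\n",
    "GGCC\n",
]))
```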
  • the FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. It is similar to the FASTA format but with quality scores following the sequence data. Both the sequence letter and quality score are encoded with a single ASCII character for brevity.
  • the FASTQ format is a de facto standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer. Cock et al., 2009, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res 38(6):1767-1771.
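  • A minimal sketch of reading FASTQ records and decoding the single-character quality scores follows. It assumes the Sanger variant (ASCII offset 33) and records that occupy exactly four lines each; the function name and the example record are hypothetical.

```python
def parse_fastq(lines, offset=33):
    """Yield (identifier, sequence, qualities) from four-line FASTQ records.

    Assumes the Sanger variant (ASCII offset 33) and unwrapped records.
    """
    stripped = [line.rstrip("\n") for line in lines]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one score for one base.
        yield header[1:], sequence, [ord(c) - offset for c in quality]


fastq_records = list(parse_fastq(["@r1\n", "ACGT\n", "+\n", "!5I#\n"]))
```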
  • meta information includes the description line and not the lines of sequence data.
  • the meta information includes the quality scores.
  • the sequence data begins after the description line and is present typically using some subset of IUPAC ambiguity codes optionally with “-”. In a preferred embodiment, the sequence data will use the A, T, C, G, and N characters, optionally including “-” or U as-needed (e.g., to represent gaps or uracil).
  • reads are preferably mapped 135 to a reference using assembly and alignment techniques known in the art or developed for use in the workflow.
  • Various strategies for the alignment and assembly of sequence reads are described in detail in U.S. Pat. No. 8,209,130, incorporated herein by reference.
  • Strategies may include (i) assembling reads into contigs and aligning the contigs to a reference; (ii) aligning individual reads to the reference; (iii) assembling reads into contigs, aligning the contigs to a reference, and aligning the individual reads to the contigs; or (iv) other strategies known in the art or yet to be developed.
  • Mapping 135 may employ assembly steps, alignment steps, or both.
  • Assembly can be implemented, for example, by the program ‘The Short Sequence Assembly by k-mer search and 3′ read Extension’ (SSAKE), from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (see, e.g., Warren et al., 2007, Assembling millions of short DNA sequences using SSAKE, Bioinformatics, 23:500-501).
  • SSAKE cycles through a table of reads and searches a prefix tree for the longest possible overlap between any two sequences.
  • SSAKE clusters reads into contigs.
  • Another read assembly program is Forge Genome Assembler, written by Darren Platt and Dirk Evers and available through the SourceForge web site maintained by Geeknet (Fairfax, Va.) (see, e.g., DiGuistini et al., 2009, De novo sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data, Genome Biology, 10:R94). Forge distributes its computational and memory consumption to multiple nodes, if available, and therefore has the potential to assemble large sets of reads. Forge was written in C++ using the parallel MPI library. Forge can handle mixtures of reads, e.g., Sanger, 454, and Illumina reads.
  • Assembly through multiple sequence alignment can be performed, for example, by the program Clustal Omega, (Sievers et al., 2011, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, Mol Syst Biol 7:539), ClustalW, or ClustalX (Larkin et al., 2007, Clustal W and Clustal X version 2.0, Bioinformatics, 23(21):2947-2948) available from University College Dublin (Dublin, Ireland).
  • Another exemplary read assembly program known in the art is Velvet, available through the web site of the European Bioinformatics Institute (Hinxton, UK) (Zerbino & Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18(5):821-829). Velvet implements an approach based on de Bruijn graphs, uses information from read pairs, and implements various error correction steps.
  • Read assembly can be performed with the programs from the package SOAP, available through the website of Beijing Genomics Institute (Beijing, CN) or BGI Americas Corporation (Cambridge, Mass.).
  • SOAPdenovo program implements a de Bruijn graph approach.
  • SOAP3/GPU aligns short reads to a reference sequence.
  • Another read assembly program is ABySS, from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (Simpson et al., 2009, ABySS: A parallel assembler for short read sequence data, Genome Res., 19(6):1117-23). ABySS uses the de Bruijn graph approach and runs in a parallel environment.
  • Read assembly can also be done by Roche's GS De Novo Assembler, known as gsAssembler or Newbler (NEW assemBLER), which is designed to assemble reads from the Roche 454 sequencer (described, e.g., in Kumar & Blaxter, 2010, Comparing de novo assemblers for 454 transcriptome data, Genomics 11:571 and Margulies 2005).
  • Newbler accepts 454 Flx Standard reads and 454 Titanium reads as well as single and paired-end reads and optionally Sanger reads. Newbler is run on Linux, in either 32 bit or 64 bit versions. Newbler can be accessed via a command-line or a Java-based GUI interface.
  • reads are aligned to hg18 on a per-sample basis using Burrows-Wheeler Aligner version 0.5.7 for short alignments, and genotype calls are made using Genome Analysis Toolkit. See McKenna et al., 2010, The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res 20(9):1297-1303. High-confidence genotype calls may be defined as having depth ≥50 and strand bias score ≤0. Clinical significance of variant calls is an important question in carrier screening and will be addressed below.
  • Other computer programs for assembling reads are known in the art. Such assembly programs can run on a single general-purpose computer, on a cluster or network of computers, or on specialized computing devices dedicated to sequence analysis.
  • de-barcoded fastq files are obtained as described above and partitioned by capture region (exon) using the target arm sequence as a unique key. Reads are assembled in parallel by exon using SSAKE version 3.7 with parameters “-m 30 -o 15”. The resulting contiguous sequences (contigs) can be aligned to hg18 (e.g., using BWA version 0.5.7 for long alignments with parameter “-r 1”). In some embodiments, short-read alignment is performed as described above, except that sample contigs (rather than hg18) are used as the input reference sequence. Software may be developed in Java to accurately transfer coordinate and variant data (gaps) from local sample space to global reference space for every BAM-formatted alignment. Genotyping and base-quality recalibration may be performed on the coordinate-translated BAM files using the GATK program.
  • any or all of the steps of the invention are automated.
  • a Perl script or shell script can be written to invoke any of the various programs discussed above (see, e.g., Tisdall, Mastering Perl for Bioinformatics, O'Reilly & Associates, Inc., Sebastopol, CA 2003; Michael, R., Mastering Unix Shell Scripting, Wiley Publishing, Inc., Indianapolis, Ind. 2003).
  • methods of the invention may be embodied wholly or partially in one or more dedicated programs, for example, each optionally written in a compiled language such as C++ then compiled and distributed as a binary.
  • Methods of the invention may be implemented wholly or in part as modules within, or by invoking functionality within, existing sequence analysis platforms.
  • methods of the invention include a number of steps that are all invoked automatically responsive to a single starting queue (e.g., one or a combination of triggering events sourced from human activity, another computer program, or a machine).
  • the invention provides methods in which any of the steps or any combination of the steps can occur automatically responsive to a queue.
  • Automatically generally means without intervening human input, influence, or interaction (i.e., responsive only to original or pre-queue human activity).
  • mapping 135 sequence reads to a reference may produce output such as a text file or an XML file containing sequence data such as a sequence of the nucleic acid aligned to a sequence of the reference genome.
  • mapping 135 reads to a reference produces results stored in SAM or BAM file 179 and such results may contain coordinates or a string describing one or more mutations in the subject nucleic acid relative to the reference genome.
  • Alignment strings known in the art include Simple UnGapped Alignment Report (SUGAR), Verbose Useful Labeled Gapped Alignment Report (VULGAR), and Compact Idiosyncratic Gapped Alignment Report (CIGAR) (Ning, Z., et al., Genome Research 11(10):1725-9 (2001)). These strings are implemented, for example, in the Exonerate sequence alignment software from the European Bioinformatics Institute (Hinxton, UK).
  • a sequence alignment is produced—such as, for example, a sequence alignment map (SAM) or binary alignment map (BAM) file—comprising a CIGAR string (the SAM format is described, e.g., in Li, et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009, 25(16):2078-9).
  • CIGAR displays or includes gapped alignments one-per-line.
  • CIGAR is a compressed pairwise alignment format reported as a CIGAR string.
  • a CIGAR string is useful for representing long (e.g. genomic) pairwise alignments.
  • a CIGAR string is used in SAM format to represent alignments of reads to a reference genome sequence.
  • the CIGAR string defines the sequence of matches/mismatches and deletions (or gaps). For example, the CIGAR string 2MD3M2D2M means that the alignment contains 2 matches, 1 deletion (the number 1 is omitted to save space), 3 matches, 2 deletions and 2 matches. In general, for carrier screening or other assays such as the NGS workflow depicted in FIG. 1, sequencing results will be used in genotyping 141.
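  • The CIGAR example above can be decoded mechanically. The following sketch treats a missing count as 1, as in the example; note that SAM-format CIGAR strings write every count explicitly, so this leniency is an assumption that follows the compact notation used here. The function name is hypothetical, and only letter operations are handled.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (count, operation) pairs.

    A missing count is read as 1, so "D" is treated like "1D".
    """
    if not re.fullmatch(r"(\d*[A-Z])+", cigar):
        raise ValueError(f"not a CIGAR string: {cigar!r}")
    pairs = re.findall(r"(\d*)([A-Z])", cigar)
    return [(int(count) if count else 1, op) for count, op in pairs]


ops = parse_cigar("2MD3M2D2M")
# Matches (M) and deletions (D) both advance along the reference.
reference_span = sum(count for count, op in ops if op in "MD")
```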
  • Output from mapping 135 may be stored in a SAM or BAM file 179 , in a variant call format (VCF) file 183 , or other format.
  • output is stored in a VCF file, although methods described herein are applicable to other file formats such as SAM or BAM files, as will be readily apparent to one of skill in the art.
  • FIG. 2 gives a sample of an exemplary VCF file 183 .
  • a typical VCF file 183 will include a header section and a data section.
  • the header contains an arbitrary number of meta-information lines, each starting with characters ‘##’, and a TAB delimited field definition line starting with a single ‘#’ character.
  • the field definition line names eight mandatory columns and the body section contains lines of data populating the columns defined by the field definition line.
  • the VCF format is described in Danecek et al., 2011, The variant call format and VCFtools, Bioinformatics 27(15):2156-2158.
  • the data contained in a VCF file 183 as shown for example in FIG. 2 represents the variants, or mutations, that are found in the nucleic acid that was obtained from the sample from the patient and sequenced.
  • mutation refers to a change in genetic information and has come to refer to the present genotype that results from a mutation.
  • mutations include different types of mutations such as substitutions, insertions or deletions (INDELs), translocations, inversions, chromosomal abnormalities, and others.
  • an absolute allele frequency is not determined (i.e., not every human on the planet is genotyped) but allele frequency refers to a calculated probable allele frequency based on sampling and known statistical methods and often an allele frequency is reported in terms of a certain population such as humans of a certain ethnicity.
  • Variant can be taken to be roughly synonymous to mutation but referring to a genotype being described in comparison or with reference to a reference genotype or genome.
  • variant describes a genotype feature in comparison to a reference such as the human genome (e.g., hg18 or hg19 which may be taken as a wild type).
  • An NGS workflow with genotyping 141 generates data representing one or more mutations in a genome of an individual that are generally reported as variants, or “variant calls”, in, for example, a VCF file 183.
  • a VCF file 183 includes data representing one or more mutations. Those data may be analyzed by methods of the invention to provide a report of the clinical significance of the mutations in the genome of the individual.
  • FIG. 3 diagrams a method 301 for analyzing mutations according to the invention.
  • One benefit of a method 301 is an ability to provide information about the clinical significance of mutations in a patient's genome from data such as that provided by sequencing, e.g., in FASTA/FASTQ files, SAM/BAM files, or VCF files.
  • Methods include obtaining 305 data representing a mutation in a genome of an individual by, for example, the sampling, sequencing, and mapping methods described above.
  • a variant node in a graph database is used 311 to store a description of the mutation.
  • a pointer is stored 317 in the variant node and the pointer points to an adjacent node that provides information about a clinical significance of the variant.
  • Method 301 includes querying 323 the graph database to obtain information reporting the clinical significance of the mutation in the genome of the individual.
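  • The steps of method 301 can be sketched in memory without any particular graph database product. In the Python outline below, the class name, the node labels "Variant" and "ClinicalSignificance", the property names, and the example values are all hypothetical; a production system would use a graph database rather than plain objects.

```python
class Node:
    """A graph node holding properties and direct references to neighbors."""

    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties
        self.adjacent = []  # pointers to adjacent nodes


def store_variant(description, significance_node):
    variant = Node("Variant", description=description)  # use a variant node (311)
    variant.adjacent.append(significance_node)          # store a pointer (317)
    return variant


def query_significance(variant):
    """Follow the stored pointers to report clinical significance (323)."""
    return [
        node.properties["significance"]
        for node in variant.adjacent
        if node.label == "ClinicalSignificance"
    ]


# Made-up example data.
significance = Node("ClinicalSignificance", significance="pathogenic")
variant = store_variant("chr1:g.100A>G", significance)
report = query_significance(variant)
```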
  • A VCF file containing mutation data is obtained 305.
  • the VCF file may be parsed to isolate its component pieces of information and to consider each piece of information for its own significance.
  • FIG. 4 gives a flow chart for a VCF parser.
  • the flow chart shown in FIG. 4 represents the conceptual steps that may go into parsing a VCF file and extracting component information. Since the various action blocks and loops are defined according to the format of the VCF file as standardized (e.g., in Danecek, 2011, Bioinformatics 27:2156), each character of information that is extracted is treated for what it is. Thus, using VCF file 183 from FIG. 2 for reference, the “A” that appears on line 16, character 7 (counting 1 tab as 1 character) is treated as a nucleotide in the reference and the “A” that appears in line 17, character 17 is simply part of the word “PASS” in the FILTER column.
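  • A minimal sketch of such a parser is shown below. It is not the parser of FIG. 4; it only illustrates how the header lines, the field definition line, and the tab-delimited records determine what each character means. The example record is made up and is not taken from FIG. 2.

```python
def parse_vcf(lines):
    """Return (meta_lines, column_names, records) from VCF text lines."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])             # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")    # field definition line
        elif line:
            # Each data line populates the columns named in the header.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tDP=60",  # hypothetical record
]
meta, columns, records = parse_vcf(example)
```

  • In this record, the "A" in the REF column is read as a reference nucleotide, while the "A" inside "PASS" is simply part of the FILTER value, because each is interpreted according to its column.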
  • line 16 is a single VCF record within a VCF file.
  • Each record from the VCF file represents something found by sequencing the nucleic acid from the sample from the patient.
  • the VCF run e.g., all the VCF files produced by the NGS sequencing
  • FIG. 5 presents a model of data received from parsing a VCF.
  • one run from the sequencing instruments can produce a plurality of VCF files.
  • Each VCF file typically contains a plurality of VCF records. Those records ultimately relate back to the samples from which they were derived, and the samples can each contain a plurality of alleles. However, this relationship just described can also be described using an entity relationship diagram, or ERD.
  • FIG. 6 shows an entity relationship diagram (ERD) 601 of the data modeled by FIG. 5 .
  • ERD 601 satisfies the definition of a graph as used in graph theory within mathematics and computer science.
  • Graph theory provides a well-known mathematical tool for representing systems.
  • Graph theory is the mathematical study of properties of formal mathematical structures called graphs.
  • a graph is a finite set of points, termed vertices or nodes, connected by links termed edges or arcs.
  • a graph thus generally defines a set of vertices and a set of pairs of vertices, which
  • the type of a particular graph largely depends upon the features of its components, namely the attributes of its vertices and edges. For example, when the set of pairs includes only distinct elements, the graph is called a simple graph; when one or more pairs are connected by multiple edges the graph is called a multi-graph; when one or more vertices are connected to themselves the graph is called a pseudo-graph; when the edges are assigned with directions the graph is called a directed graph or a digraph; and when the pairs of vertices are unordered the graph is called undirected. Additional illustrative background on graph theory may be found in U.S. Pat. No. 8,463,895 to Arora; U.S. Pat. No. 8,462,161 to Barber; U.S. Pat. No.
  • ERD 601 presents a graph, i.e., a collection of vertices and edges, or, described another way, a set of nodes and the relationships that connect them.
  • Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships.
  • This general-purpose, expressive structure allows graphs to model all kinds of phenomena such as NGS sequence files and their relationships to the source biological samples and genetic concepts like certain alleles.
  • There are various dominant graph data models such as the property graph, Resource Description Framework (RDF) triples, and hypergraphs.
  • a graph database used in the invention uses the property graph model.
  • a property graph has characteristics such as containing nodes and relationships (which are illustrated by ERD 601 in FIG. 6 ).
  • the nodes contain properties (key-value pairs). Relationships are named and directed, and have a start and end node; and relationships can also contain properties.
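  • The property graph characteristics listed above can be expressed as two small data structures. The sketch below is illustrative only; the relationship name "CONTAINS" and the property keys and values are assumptions, not names taken from ERD 601.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Relationship:
    name: str    # relationships are named
    start: Node  # and directed: they have a start node
    end: Node    # and an end node
    properties: dict = field(default_factory=dict)  # and may carry properties


# Hypothetical example: a sequencing run that contains one VCF file.
run = Node("VcfRun", {"run_id": "run-001"})
vcf = Node("VcfFile", {"path": "sample1.vcf"})
contains = Relationship("CONTAINS", start=run, end=vcf, properties={"index": 0})
```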
  • a graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model.
  • Graph databases according to the invention may be described or characterized according to the underlying storage, the processing engine, or both.
  • some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Some databases serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store and present graph database functionality on top of that.
  • graph databases use index-free adjacency, meaning that connected nodes physically “point” to each other in the database. More broadly, any database that from the user's perspective behaves like a graph database (i.e., exposes a graph data model through CRUD operations) qualifies as a graph database. In certain embodiments, however, the invention provides the significant performance advantages of index-free adjacency. Native graph processing may describe graph databases that use index-free adjacency.
  • a benefit of native graph storage is that it is engineered for performance and scalability.
  • a benefit of non-native graph storage is that it typically depends on a mature non-graph backend (such as MySQL) whose production characteristics are well understood by operations teams.
  • Native graph processing (index-free adjacency) benefits traversal performance.
  • graph databases provide arbitrarily sophisticated models that map closely to the problem domain (e.g., FIG. 5 ). The resulting models are simpler and at the same time more expressive than those produced using traditional relational databases and the other NOSQL stores.
  • Any suitable graph database can be used to implement the systems and methods described herein.
  • Exemplary graph databases may include Microsoft Infinite Graph, Titan, OrientDB, Neo4j, *dex, Franz Inc.'s AllegroGraph, and Hypergraphdb.
  • systems and methods of the invention employ a graph compute engine.
  • a graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets.
  • Graph compute engines are designed to do things like identify clusters in the data, or answer questions about how entities are connected, and particularly to trace across a series of linked ideas (e.g., SNP to allele to genetic condition to a literature reference providing a clinical significance of the allele containing the SNP).
  • a distributed graph compute engine may be structured as described in Malewicz, et al., 2010, Pregel: a system for large-scale graph processing, Proceedings ACM SIGMOD Int Conf Management Data 135-146. Also see Rodriguez and Neubauer, 2010, Constructions from Dots and Lines, Bulletin Am Soc Inf Sci Tech 36(6):35-41.
  • systems and methods of the invention store mutation descriptions using a graph database and analyze mutations in graph space.
  • a genetic analysis pipeline and methodology uses nodes as well as named and directed relationships, with both the nodes and relationships serving as containers for properties.
  • nodes and relationships are illustrated and index-free adjacency is discussed.
  • a database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node thus acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched.
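  • The following sketch illustrates why the cost of a traversal depends on the part of the graph that is searched rather than on the total size of the graph. Each node keeps direct references to its neighbors, so nodes that are not connected to the starting point are never touched. The node names and the helper functions are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.adjacent = []  # direct references to neighboring nodes


def reachable(start):
    """Return names of nodes reached by following direct references."""
    seen, order, stack = {start}, [], [start]
    while stack:
        node = stack.pop()
        order.append(node.name)
        for neighbor in node.adjacent:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return order


def build(extra_nodes):
    """Build a three-node chain plus unrelated nodes that add to graph size."""
    snp, allele, condition = Node("SNP"), Node("allele"), Node("condition")
    snp.adjacent.append(allele)
    allele.adjacent.append(condition)
    unrelated = [Node(f"other{i}") for i in range(extra_nodes)]
    return snp, unrelated


small_start, _small_rest = build(10)
large_start, _large_rest = build(100000)
```

  • Calling reachable on either starting node visits the same three nodes, regardless of how many unrelated nodes the graph holds.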
  • a non-native graph database engine uses (global) indexes to link nodes together. These indexes add a layer of indirection to each traversal, thereby incurring greater computational cost.
  • Proponents for native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals.
  • Index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
  • Index-free adjacency provides lower-cost “joins.” With index-free adjacency, bidirectional joins are effectively pre-computed and stored in the database as relationships. In contrast, when using indexes to fake connections between records, there is no actual relationship stored in the database. This becomes problematic for traversals in the “opposite” direction from the one for which the index was constructed. Such traversals require a brute-force search through the index, which is an O(n) operation, so joins like this are simply too costly to be of any practical use.
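  • The cost difference described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the storage engine of any particular graph database: the node names, the `EDGES` list, and both lookup functions are hypothetical. One function follows direct per-node references in either direction; the other consults a single global index sorted by source node, which supports a binary search in the forward direction but must be scanned in full in the reverse direction.

```python
import bisect

# Hypothetical toy graph: each edge is a (source, target) pair.
EDGES = [
    ("allele1", "variant1"),
    ("allele1", "variant2"),
    ("variant1", "litref1"),
]

# Index-free adjacency: every node keeps direct references to its neighbors
# in both directions, so a hop costs O(1) plus the neighbors actually visited.
outgoing = {}
incoming = {}
for src, dst in EDGES:
    outgoing.setdefault(src, []).append(dst)
    incoming.setdefault(dst, []).append(src)


def neighbors_adjacent(node, reverse=False):
    """Follow the node's own references; no global structure is consulted."""
    table = incoming if reverse else outgoing
    return list(table.get(node, []))


# Indexed approach: one global index, sorted by source node only.
INDEX = sorted(EDGES)


def neighbors_indexed(node, reverse=False):
    """Look neighbors up through the global index."""
    if not reverse:
        # Forward direction: binary search, O(log n) per hop.
        lo = bisect.bisect_left(INDEX, (node, ""))
        hi = bisect.bisect_right(INDEX, (node, "\uffff"))
        return [dst for _, dst in INDEX[lo:hi]]
    # Reverse direction: the index is not ordered by target, so scan it, O(n).
    return [src for src, dst in INDEX if dst == node]
```

  • Both functions return the same neighbors; only the work done per hop differs, which is what makes the indexed cost grow with the size of the whole edge set.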
  • Index-free adjacency provides surprising benefits in the context of reporting the clinical significance of NGS-based carrier screening results, because the concepts involved lend themselves naturally to representation using the pre-computed bidirectional joins that index-free adjacency offers.
  • FIG. 6 shows how relationships eliminate the need for index lookups.
  • A graph database can use relationships, not indexes, for fast traversals.
  • In a general-purpose graph database, relationships can be traversed in either direction (tail to head, or head to tail) extremely cheaply. Starting from a given VcfRun or a given allele, a graph processing engine can find the related other one of those two at a very low computational cost.
  • systems and methods of the invention use native graph storage. If index-free adjacency is the key to high-performance traversals, queries, and writes, then one key aspect of the design of a graph database is the way in which graphs are stored.
  • An efficient, native graph storage format supports extremely rapid traversals for arbitrary graph algorithms, an important reason for using graphs.
  • a graph database such as Neo4j stores graph data in a number of different store files.
  • Each store file may contain the data for a specific part of the graph (e.g., nodes, relationships, properties).
  • FIGS. 7-10 illustrate a node and relationship storage structure as implemented by a graph database of the invention.
  • FIG. 7 diagrams a high-level architecture 701 of systems of certain embodiments of the invention. From the bottom-up, systems may operate using files on disk 733 . Record files 739 provide a basic level of storage to support the file system cache 741 . The object cache 747 is kept at a high level for rapid access as discussed herein. Additionally, the disks 733 can store a transaction log 725 , which is written to by a transaction management module 721 .
  • a graph database such as Neo4j includes or provides a traversal API 755 , core API 705 , and a query language 713 such as Cypher.
  • FIG. 8 illustrates the structure of nodes 801 and relationships 809 on disk as may be deployed within a physical structure of systems of the invention.
  • the node store file stores node records. Every node created in the user-level graph ends up in the node store.
  • the node store is a fixed-size record store. While the precise values or traits may be varied as necessary or best-suited to the invention, in the illustrated embodiment, each node record 801 is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file. To illustrate, if a node has id 100, then it can be known that its record begins 900 bytes into the file.
  • The database can directly compute a record's location, at cost O(1), rather than performing a search, which would be cost O(log n).
  • the first byte of a node 801 record is the in-use flag. This tells the database whether the record is currently being used to store a node.
  • the next four bytes represent the ID of the first relationship connected to the node, and the last four bytes represent the ID of the first property for the node.
  • the node record is lightweight and contains just pointers to lists of relationships and properties.
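  • The nine-byte node record described above can be sketched with Python's standard `struct` module. The field order follows the text (a one-byte in-use flag, a four-byte first-relationship ID, a four-byte first-property ID); the big-endian byte order and the helper names `node_offset`, `write_node`, and `read_node` are assumptions made for illustration only.

```python
import struct

# Layout from the text: 1-byte in-use flag, 4-byte first-relationship ID,
# 4-byte first-property ID. Big-endian byte order is an assumption.
NODE_FORMAT = ">BII"
NODE_SIZE = struct.calcsize(NODE_FORMAT)  # nine bytes


def node_offset(node_id):
    """Fixed-size records let the offset be computed directly, at O(1)."""
    return node_id * NODE_SIZE


def write_node(store, node_id, in_use, first_rel, first_prop):
    """Write one node record into a bytearray acting as the node store file."""
    struct.pack_into(NODE_FORMAT, store, node_offset(node_id),
                     int(in_use), first_rel, first_prop)


def read_node(store, node_id):
    """Read one node record back as (in_use, first_rel_id, first_prop_id)."""
    in_use, first_rel, first_prop = struct.unpack_from(
        NODE_FORMAT, store, node_offset(node_id))
    return bool(in_use), first_rel, first_prop


# A store large enough for node IDs 0..100; node 100 begins 900 bytes in.
store = bytearray(NODE_SIZE * 101)
write_node(store, 100, True, 7, 42)
```

  • Reading node 100 here touches only bytes 900 through 908 of the store, with no search step, which is the property the fixed record size is meant to provide.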
  • each relationship record 809 is 33 bytes long.
  • Each relationship record 809 contains the IDs of the nodes at the start and end of the relationship, a pointer to the relationship type (which is stored in the relationship type store), and pointers for the next and previous relationship records for each of the start and end nodes. These last pointers are part of what is often called the relationship chain.
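  • A 33-byte relationship record of the kind described above might be laid out as follows. The text names the start and end node IDs, the relationship type pointer, and the next and previous pointers for each of the two nodes; the leading in-use flag and trailing first-property ID are assumptions added here so that the fields sum to 33 bytes, and the field names and byte order are likewise hypothetical.

```python
import struct
from collections import namedtuple

# One flag byte followed by eight 4-byte IDs: 1 + 8 * 4 = 33 bytes.
# The in_use and first_prop fields are assumptions; the rest follow the text.
REL_FORMAT = ">B8I"
REL_SIZE = struct.calcsize(REL_FORMAT)

RelRecord = namedtuple(
    "RelRecord",
    "in_use start_node end_node rel_type "
    "start_prev start_next end_prev end_next first_prop",
)


def pack_rel(record):
    """Serialize a RelRecord into its fixed-size on-disk form."""
    return struct.pack(REL_FORMAT, *record)


def unpack_rel(raw):
    """Deserialize 33 bytes back into a RelRecord."""
    return RelRecord(*struct.unpack(REL_FORMAT, raw))


example = RelRecord(in_use=1, start_node=3, end_node=8, rel_type=0,
                    start_prev=5, start_next=6, end_prev=9, end_next=10,
                    first_prop=2)
raw = pack_rel(example)
```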
  • the node and relationship stores are concerned only with the structure of the graph, not its property data. Both stores use fixed-sized records so that any individual record's location within a store file can be rapidly computed given its ID. The significance can hardly be overstated: the described structure improves the operation of the hardware itself.
  • Each of the node records contains a pointer to that node's first property and first relationship in a relationship chain.
  • To read a node's properties one may follow the singly linked list structure beginning with the pointer to the first property.
  • To find a relationship for a node one may follow that node's relationship pointer to its first relationship and then follow the doubly linked list of relationships for that particular node (that is, either the start node doubly linked list, or the end node doubly linked list) until the relationship of interest is found.
  • That relationship's properties can be read (if there are any) using the same singly linked list structure as is used for node properties, or the node records can be examined for the two nodes the relationship connects using its start node and end node IDs. These IDs, multiplied by the node record size, give the immediate offset of each node in the node store file.
  • systems and methods of the invention use doubly-linked lists in the relationship store.
  • a relationship record 809 can be thought of as “belonging” to two nodes—the start node and the end node of the relationship.
  • A relationship record therefore holds pointers (i.e., record IDs) for two doubly linked lists: one is the list of relationships visible from the start node; the other is the list of relationships visible from the end node. This provides rapid iteration through either list in either direction, and efficient insertion or deletion of relationships.
  • the first record in the relationship chain is located by computing its offset into the relationship store—that is, by multiplying its ID by the fixed relationship record size (e.g., 33 bytes). This gets to the right record in the relationship store. Then, from the relationship record, look in the second node field to find the ID of the second node. Multiply that ID by the node record size (e.g., nine bytes) to locate the correct node record in the store.
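  • The lookup just described (multiply an ID by the record size, read the record, follow its pointer) can be sketched as a walk along a relationship chain. This is a simplified illustration: the record formats repeat the nine-byte and 33-byte layouts discussed above with assumed field order, and `NO_ID` is a hypothetical sentinel meaning "no further record".

```python
import struct

NODE_FORMAT, REL_FORMAT = ">BII", ">B8I"
NODE_SIZE = struct.calcsize(NODE_FORMAT)  # 9
REL_SIZE = struct.calcsize(REL_FORMAT)    # 33
NO_ID = 0xFFFFFFFF  # hypothetical "null pointer" sentinel


def read_node(node_store, node_id):
    return struct.unpack_from(NODE_FORMAT, node_store, node_id * NODE_SIZE)


def read_rel(rel_store, rel_id):
    return struct.unpack_from(REL_FORMAT, rel_store, rel_id * REL_SIZE)


def neighbors(node_store, rel_store, node_id):
    """Follow a node's relationship chain and collect the nodes at the far end."""
    found = []
    _, rel_id, _ = read_node(node_store, node_id)
    while rel_id != NO_ID:
        # Assumed field order: in_use, start, end, type, start_prev,
        # start_next, end_prev, end_next, first_prop.
        _, start, end, _, _, start_next, _, end_next, _ = read_rel(rel_store, rel_id)
        if start == node_id:
            found.append(end)
            rel_id = start_next   # stay on the start node's chain
        else:
            found.append(start)
            rel_id = end_next     # stay on the end node's chain
    return found


# Three nodes and two relationships: 0 -> 1 (rel 0) and 0 -> 2 (rel 1).
node_store = bytearray(NODE_SIZE * 3)
rel_store = bytearray(REL_SIZE * 2)
for node_id, first_rel in [(0, 0), (1, 0), (2, 1)]:
    struct.pack_into(NODE_FORMAT, node_store, node_id * NODE_SIZE,
                     1, first_rel, NO_ID)
struct.pack_into(REL_FORMAT, rel_store, 0 * REL_SIZE,
                 1, 0, 1, 0, NO_ID, 1, NO_ID, NO_ID, NO_ID)
struct.pack_into(REL_FORMAT, rel_store, 1 * REL_SIZE,
                 1, 0, 2, 0, 0, NO_ID, NO_ID, NO_ID, NO_ID)
```

  • Each step of the walk is one offset computation and one fixed-size read, so the work is proportional to the number of relationships on the chain rather than to the size of either store.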
  • Systems include the property store files. These store the user's key-value pairs. Properties may be attached to both nodes and relationships. The property stores, therefore, are referenced from both node and relationship records. Records in the property store are physically stored in a file. As with the node and relationship stores, property records are of a fixed size. Each property record consists of four property blocks and the ID of the next property in the property chain. Properties are held as a singly linked list on disk as compared to the doubly linked list used in relationship chains. Each property occupies between one and four property blocks, so a property record can hold at most four properties. A property record holds the property type and a pointer to the property index file, which is where the property name is stored.
  • For each property's value, the record contains either a pointer into a dynamic store record or an inlined value.
  • the dynamic stores allow for storing large property values.
  • a graph database may optimize storage where it inlines some properties into the property store file directly. This happens when property data can be encoded to fit in one or more of a record's four property blocks. In practice this means that data like variant calls can be inlined in the property store file directly, rather than being pushed out to the dynamic stores. This results in reduced I/O operations and improved throughput, because only a single file access is required.
  • a graph database can also reference long values as property names (e.g., complete journal article titles and citations).
  • property names are indirectly referenced from the property store through the property index file.
  • the property index allows all properties with the same name to share a single record, and thus for repetitive graphs achieves considerable space and I/O savings.
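  • The space saving from a shared property index can be illustrated with a small name-interning table. The class and method names below are hypothetical, and the sketch keeps everything in memory rather than in store files; it only shows that repeated property names resolve to one shared index record.

```python
class PropertyIndex:
    """Toy property index: each distinct property name is stored once."""

    def __init__(self):
        self._ids = {}     # name -> index record ID
        self._names = []   # index record ID -> name

    def id_for(self, name):
        """Return the shared record ID for a name, creating it on first use."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, name_id):
        return self._names[name_id]

    def __len__(self):
        return len(self._names)


index = PropertyIndex()

# Property records hold only the small name ID plus the value.
records = [
    (index.id_for("description"), "m.593T>C"),
    (index.id_for("description"), "m.11778G>A"),
    (index.id_for("citation"), "Zhang et al., 2011"),
]
```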
  • Neo4j uses a two-tiered caching architecture to provide this functionality.
  • the file system cache 741 is a page-affined cache, meaning the cache divides each store into discrete regions, and then holds a fixed number of regions per store file. The actual amount of memory to be used to cache the pages for each store file can be fine-tuned, though in the absence of input from the user, Neo4j will use sensible default values based on the capacity of the underlying hardware. Pages are evicted from the cache based on a least-frequently-used (LFU) cache policy.
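  • A least-frequently-used eviction policy of the kind mentioned above can be sketched as follows. This is a minimal in-memory illustration, not Neo4j's cache implementation; the class name, the `load_page` callback, and the tie-breaking behavior are all assumptions.

```python
class LFUPageCache:
    """Toy page cache that evicts the least-frequently-used page when full."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page  # called on a miss to fetch the page
        self.pages = {}             # page number -> page contents
        self.hits = {}              # page number -> access count

    def get(self, page_no):
        if page_no not in self.pages:
            if len(self.pages) >= self.capacity:
                # Evict the page with the smallest access count.
                victim = min(self.hits, key=self.hits.get)
                del self.pages[victim]
                del self.hits[victim]
            self.pages[page_no] = self.load_page(page_no)
            self.hits[page_no] = 0
        self.hits[page_no] += 1
        return self.pages[page_no]


loads = []  # records which pages had to be fetched from "disk"


def load_page(page_no):
    loads.append(page_no)
    return f"page-{page_no}"


cache = LFUPageCache(capacity=2, load_page=load_page)
for page_no in [0, 0, 1, 2, 0]:
    cache.get(page_no)
```

  • In this run page 1 is evicted when page 2 arrives, because page 0 has been accessed more often; the final access to page 0 is then served from the cache without another load.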
  • the file system cache 741 is particularly beneficial when related parts of the graph are modified at the same time such that they occupy the same page. This is a common pattern for writes, where whole sub-graphs (such as a patient's NGS results and associated carrier screening report) are written to disk in a single operation, rather than discrete nodes and relationships.
  • a graph database may be manipulated through a query language, which can be either imperative or declarative.
  • One such language is the Cypher query language.
  • Cypher is a declarative graph query language for Neo4j that allows for expressive and efficient querying and updating of the graph store.
  • Cypher contains a variety of clauses, some of the most common of which include MATCH and WHERE. These clauses differ slightly from those in SQL.
  • MATCH is used for describing the structure of the pattern searched for, primarily based on relationships, and WHERE is used to add additional constraints to patterns.
  • Cypher additionally contains clauses for writing, updating, and deleting data.
  • CREATE and DELETE are used to create and delete nodes and relationships.
  • SET and REMOVE are used to set values to properties and add labels on nodes.
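  • To make the clauses above concrete, the following Python snippet holds a few Cypher statements as plain strings. Nothing here is executed against a database. The node labels (`Allele`, `Variant`), the property names (`description`, `classification`), and the parameter names are hypothetical choices for illustration; only the HAS_VARIANT relationship type comes from the description of FIG. 10.

```python
# Illustrative Cypher statements, keyed by the clause they demonstrate.
# Labels, property names, and parameters are assumptions, not a fixed schema.
QUERIES = {
    # MATCH describes the pattern; WHERE adds a constraint to it.
    "match_where": (
        "MATCH (a:Allele)-[:HAS_VARIANT]->(v:Variant) "
        "WHERE v.description = $description "
        "RETURN a"
    ),
    # CREATE adds a node (here with one property).
    "create": "CREATE (v:Variant {description: $description})",
    # SET assigns a property value on a matched node.
    "set": (
        "MATCH (v:Variant {description: $description}) "
        "SET v.classification = $classification"
    ),
    # REMOVE drops a property from a matched node.
    "remove": (
        "MATCH (v:Variant {description: $description}) "
        "REMOVE v.classification"
    ),
    # DELETE removes a matched relationship.
    "delete": (
        "MATCH (:Allele)-[r:HAS_VARIANT]->(v:Variant {description: $description}) "
        "DELETE r"
    ),
}

# Example parameter values that a driver would bind to the placeholders.
PARAMS = {"description": "m.593T>C", "classification": "pathogenic"}
```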
  • Systems and methods of the invention provide very rapid transactions, idiomatic queries, and an excellent ability to “scale up” with very large data sizes.
  • the topic of scale has become more important as data volumes have grown.
  • Graph databases don't suffer the same latency problems as traditional relational databases, where the more data that exists in tables—and in indexes—the longer the join operations.
  • With a graph database, most queries follow a pattern whereby an index is used simply to find a starting node (or nodes). The remainder of the traversal then uses a combination of pointer chasing and pattern matching to search the data store. What this means is that, unlike relational databases, performance does not depend on the total size of the dataset, but only on the data being queried.
  • FIG. 9 illustrates the use of a variant node 901 in a graph database to store a description of a mutation.
  • the first byte of the variant node 901 record is set to show that node 901 is in use.
  • the next four bytes of node 901 represent the ID of the first relationship connected to the node. Through the ID of that first relationship, node 901 thus includes a pointer to an adjacent node (adjacent by definition, since the relationship is identified by the four bytes in node 901 ).
  • the last four bytes of node 901 represent the ID of the first property for the node.
  • Property records in the property store are of a fixed size and each property record consists of four property blocks and the ID of the next property in the chain.
  • the property record holds the property type (here, “variant”) and a pointer to the property index file, which is where the property name is stored.
  • the record either points to a dynamic store or an inline record.
  • the parser operating via the logic mapped in FIG. 4 produces a record of a mutation (by parsing that record from the VCF file) and can store that mutation in the property index file.
  • the property index file for a variant node preferably includes a description of a mutation.
  • a description of a mutation may be provided according to a systematic nomenclature.
  • a variant can be described by a systematic comparison to a specified reference which is assumed to be unchanging and identified by a unique label such as a name or accession number.
  • The A of the ATG start codon is denoted nucleotide +1 and the nucleotide 5′ to +1 is -1 (there is no zero).
  • a systematic name can be used to describe a number of variant types including, for example, substitutions, deletions, insertions, and variable copy numbers.
  • a substitution name starts with a number followed by a “from to” markup.
  • 199A>G shows that at position 199 of the reference sequence, A is replaced by a G.
  • a deletion is shown by “del” after the number.
  • 223delT shows the deletion of T at nt 223.
  • 997-999del shows the deletion of three nucleotides (alternatively, this mutation can be denoted as 997-999delTTC).
  • The 3′ nt is arbitrarily assigned; e.g., a TG deletion is designated 1997-1998delTG or 1997-1998del (where 1997 is the first T before C). Insertions are shown by ins after an interval. Thus 200-201insT denotes that T was inserted between nts 200 and 201. Variable short repeats appear as 997(GT)N-N′. Here, 997 is the first nucleotide of the dinucleotide GT, which is repeated N to N′ times in the population.
  • Variants in introns can use the intron number with a positive number indicating a distance from the G of the invariant donor GU or a negative number indicating a distance from an invariant G of the acceptor site AG.
  • IVS3+1C>T shows a C to T substitution at nt+1 of intron 3.
  • cDNA nucleotide numbering may be used to show the location of the mutation, for example, in an intron.
  • c.1997+1C>T denotes the C to T substitution at nt+1 after nucleotide 1997 of the cDNA.
  • c.1997-2A>C shows the A to C substitution at nt-2 upstream of nucleotide 1997 of the cDNA.
  • the mutation can also be designated by the nt number of the reference sequence.
  • a patient's genome may vary by more than one mutation, or by a complex mutation that is describable by more than one character string or systematic name.
  • the invention further provides systems and methods for describing more than one variant using a systematic name.
  • two mutations in the same allele can be listed within brackets as follows: [1997G>T; 2001A>C].
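  • The systematic names described above are regular enough to parse mechanically. The sketch below handles only the simple substitution, deletion, and insertion forms and the bracketed two-mutation form given as examples in the text; intronic (IVS), cDNA-prefixed (c.), and repeat names are deliberately out of scope. The function names and the returned dictionary keys are hypothetical.

```python
import re

# Patterns for the simple forms shown in the text, e.g. 199A>G, 223delT,
# 997-999del, 200-201insT.
SUBSTITUTION = re.compile(r"^(\d+)([ACGT])>([ACGT])$")
DELETION = re.compile(r"^(\d+)(?:-(\d+))?del([ACGT]*)$")
INSERTION = re.compile(r"^(\d+)-(\d+)ins([ACGT]+)$")


def parse_variant(name):
    """Parse one systematic name into a small dictionary."""
    m = SUBSTITUTION.match(name)
    if m:
        return {"type": "substitution", "position": int(m.group(1)),
                "ref": m.group(2), "alt": m.group(3)}
    m = DELETION.match(name)
    if m:
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        return {"type": "deletion", "start": start, "end": end,
                "bases": m.group(3)}
    m = INSERTION.match(name)
    if m:
        return {"type": "insertion", "after": int(m.group(1)),
                "before": int(m.group(2)), "bases": m.group(3)}
    raise ValueError(f"unsupported variant name: {name!r}")


def parse_allele(text):
    """Parse a bracketed list such as [1997G>T; 2001A>C] into its variants."""
    inner = text.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [parse_variant(part.strip()) for part in inner.split(";")]
```

  • A parsed record of this kind is one possible shape for the description stored as a property of a variant node; names outside the supported forms raise `ValueError` rather than being guessed at.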
  • Systematic nomenclature is discussed in den Dunnen & Antonarakis, 2003, Mutation Nomenclature, Curr Prot Hum Genet 7.13.1-7.13.8 as well as in Antonarakis and the Nomenclature Working Group, 1998, Recommendations for a nomenclature system for human gene mutations, Human Mutation 11:1-3.
  • a mutation can be described in the property index file of a variant node.
  • node 901 can be instantiated or used as any type, with the type being stored in the property store.
  • FIG. 10 illustrates a simple example in which an allele node is used to show that an allele includes a certain mutation by representing the mutation using a variant node and representing a relationship between the allele node and the variant node with a “HAS_VARIANT” type relationship. This illustrates the simplicity of connecting alleles to variants using relationships. After the variant is created, literature references can be added to the variant.
  • FIG. 11 shows elements of a graph database in which a variant has been connected to two nodes, each for a literature reference. From this setup emerges one of the powerful applications of a graph database in processing results from NGS sequencing data. If variant changes are made, those variant changes can be tracked within systems of the invention without requiring upsetting the structure of the existing database.
  • a patient sample could be sequenced via NGS technologies and the sequencing results could include, in a VCF file, a description of a mutation in that patient's mitochondrial genome.
  • a variant node is used and a property of that node (e.g., in a property index file) is used to describe that mutation as m.593T>C.
  • A relationship is created to show that the mutation is described in a literature reference. The relationship is a pointer to a LitRef node and the LitRef node points to a property index file with information about the literature reference.
  • the property index file contains Zhang et al., 2011, Is mitochondrial tRNAphe variant m.593T>C a synergistically pathogenic mutation in Chinese LHON families with m.11778G>A?, PLoS ONE 6(10):e26511. Based on the synergistic pathogenesis alluded to by the literature reference, a geneticist or curator may deem it important to flag instances in which a patient has both m.593T>C and m.11778G>A in their genome.
  • This example illustrates the real power of a graph database and index-free adjacency. A query can be initiated that starts at the LitRef node just described and traverses to the variant node.
  • That query can traverse to the sample node for that patient and even to a node for the patient. That query can then—by its own terms—traverse from the patient or sample node examining for the presence of a second variant node representing m.11778G>A.
  • the query can be programmed to, in the absence of said second variant node, classify the mutation as benign.
  • The query can be programmed to, in the presence of said second variant node, classify the mutation as pathogenic. Intermediate labels or other categories can also be used. Since the query is traversing across a graph database, a comprehensive index-based look-up is not required as would be required in prior art RDBMSs.
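  • The traversal just described (LitRef node to variant node to sample node, then a check for a second variant) can be sketched with a small in-memory graph. This is an illustration under stated assumptions rather than the claimed implementation: the `Graph` class, the node identifiers, the `kind` and `name` properties, and the choice to share one variant node among samples are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Toy graph in which every relationship can be followed in both directions."""

    def __init__(self):
        self.props = {}
        self.adjacent = defaultdict(set)

    def add_node(self, node_id, **props):
        self.props[node_id] = props

    def relate(self, a, b):
        self.adjacent[a].add(b)
        self.adjacent[b].add(a)

    def neighbors(self, node_id, kind):
        return [n for n in self.adjacent[node_id]
                if self.props[n]["kind"] == kind]


def classify_from_litref(graph, litref_id, partner):
    """Start at a LitRef node, reach each sample carrying the cited variant,
    and label it by whether the partner variant is also present."""
    results = {}
    for variant in graph.neighbors(litref_id, "Variant"):
        for sample in graph.neighbors(variant, "Sample"):
            names = {graph.props[v]["name"]
                     for v in graph.neighbors(sample, "Variant")}
            results[sample] = "pathogenic" if partner in names else "benign"
    return results


g = Graph()
g.add_node("litref1", kind="LitRef", citation="Zhang et al., 2011")
g.add_node("v593", kind="Variant", name="m.593T>C")
g.add_node("v11778", kind="Variant", name="m.11778G>A")
g.add_node("sample1", kind="Sample")
g.add_node("sample2", kind="Sample")
g.relate("litref1", "v593")
g.relate("sample1", "v593")
g.relate("sample1", "v11778")
g.relate("sample2", "v593")

results = classify_from_litref(g, "litref1", partner="m.11778G>A")
```

  • Only the nodes reachable from the starting LitRef node are visited, so the work depends on the size of that neighborhood and not on how many unrelated samples or variants the graph holds.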
  • a graph database could represent protein interactions using the edges (aka pointers or relationships) to represent interactions between proteins and thus influxes of data would expand the graph “horizontally”.
  • the invention is unlike the protein interaction example in that the graph expands “vertically” outside of a set of natural phenomena. Since a sample can have a node, the graph can reach to laboratory management systems and receive from or provide information to, for example, sample chain of custody modules.
  • the graph can leap vertically to a genetic plane and represent human mutations that are being discovered.
  • The graph can reach vertically into a different category to represent medical literature, and can go on to be used in patient reports.
  • the power of this novel vertical structure is shown by the illustration of use of the invention for reporting carrier screening results.
  • FIG. 12 illustrates a graph database in which a variant has been connected to two nodes, each for a literature reference and in which updated information about the variant has been introduced in two changes.
  • node 17451 may represent a specific mutation such as a SNP (e.g., G at a certain position).
  • Node 17454 could be created when A is observed at that position.
  • Systems and methods of the invention support a plurality of different use cases and applications. For example, if a graph database is used in support of NGS carrier screening, one capability that will emerge is support for evaluating and reporting allele frequency.
  • the graph database can easily be queried for that.
  • FIG. 13 presents an example database that may be queried for allele frequency.
  • Another illustrative use case for application of a graph database, shown in FIGS. 10-12, is the curation of variants.
  • The curation of variants involves taking variants (i.e., genetic mutations) that have been picked up through a sequencing platform and then looking through the literature for references to evaluate how common the variant is and whether it is identified as pathogenic, benign, or somewhere in between. This can be supported and modeled by tracking three things: the connection of an allele to a variant; the variant and variant changes; and literature references per variant. To illustrate, a geneticist may review a patient's NGS sequencing results and observe the presence of a poly-T variant.
  • The geneticist may connect this variant to an allele of the cystic fibrosis transmembrane conductance regulator (CFTR) gene located on the long arm of chromosome 7 (e.g., as shown in FIG. 10 ).
  • the geneticist may further observe that this variant is described by a literature reference and connect the variant object to two different LitRef objects such as one for each of Rowntree and Harris, The phenotypic consequences of CFTR mutations, Ann Hum Gen 67:471-485 (2003) and Kreindler, Cystic fibrosis: exploiting its genetic basis in the hunt for new therapies, Pharmacol Ther 125(2):219-229 (2010) (e.g., according to the diagram of FIG. 11 ).
  • The geneticist may observe that the mutation (the poly-T) is a novel poly-T variant in the acceptor splice site of intron 8 of CFTR in cis with R117H (i.e., c.350G>A based on GenBank cDNA reference sequence NM_000492.3).
  • the geneticist may want to update the graph for this patient by connecting the poly-T mutation to a variant object for c.350G>A (e.g., as seen in FIG. 12 ).
  • The chain of updated variants may reveal that the patient has an allele with the T5 poly-T variant, which evidence suggests plays a role in pathogenic alternate splicing or exon skipping.
  • The geneticist may further consider the data and determine that, in fact, the patient's allele includes a T6 form of the poly-T variant and may update the variant nodes to so reflect.
  • When the T6 node is added, other content need not be modified.
  • the geneticist may add a LitRef node for Huang, et al., Comparative analysis of common CFTR polymorphisms poly-T, TG-repeats and M470V in a healthy Chinese population, World J Gastroenterol 14(12):1925-30 (2008).
  • A computer system or machines of the invention include one or more processors (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory and a static memory, which communicate with each other via a bus.
  • FIG. 14 diagrams a system 1500 suitable for performing methods of the invention.
  • system 1500 may include one or more of a server computer 1513 , a terminal 1567 , a sequencer 1501 , a sequencer computer 1533 , a computer 1549 , or any combination thereof. Each such computer device may communicate via network 1509 .
  • Sequencer 1501 may optionally include or be operably coupled to its own, e.g., dedicated, sequencer computer 1533 (including any input/output mechanisms (I/O), processor, and memory). Additionally or alternatively, sequencer 1501 may be operably coupled to a server 1513 or computer 1549 (e.g., laptop, desktop, or tablet) via network 1509 .
  • Computer 1549 includes one or more processor, memory, and I/O. Where methods of the invention employ a client/server architecture, any steps of methods of the invention may be performed using server 1513 , which includes one or more of processor, memory, and I/O, capable of obtaining data, instructions, etc., or providing results via an interface module or providing results as a file. Server 1513 may be engaged over network 1509 through computer 1549 or terminal 1567 , or server 1513 may be directly connected to terminal 1567 . Terminal 1567 is preferably a computer device. A computer according to the invention preferably includes one or more processor coupled to an I/O mechanism and memory.
  • A processor may be provided by one or more processors including, for example, one or more of a single core or multi-core processor (e.g., AMD Phenom II X2, Intel Core Duo, AMD Phenom II X4, Intel Core i5, Intel Core i7 Extreme Edition 980X, or Intel Xeon E7-2820).
  • An I/O mechanism may include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a signal generation device (e.g., a speaker), an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device (e.g., a network interface card (NIC), Wi-Fi card, cellular modem, data jack, Ethernet port, modem jack, HDMI port, mini-HDMI port, USB port), touchscreen (e.g., CRT, LCD, LED, AMOLED, Super AMOLED), pointing device, trackpad, light (e.g., LED), light/image projection device, or a combination thereof.
  • Memory refers to a non-transitory memory which is provided by one or more tangible devices which preferably include one or more machine-readable medium on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or functions described herein.
  • the software may also reside, completely or at least partially, within the main memory, processor, or both during execution thereof by a computer within system 1500 , the main memory and the processor also constituting machine-readable media.
  • the software may further be transmitted or received over a network via the network interface device.
  • the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention.
  • Memory may be, for example, one or more of a hard disk drive, solid state drive (SSD), an optical disc, flash memory, zip disk, tape drive, “cloud” storage location, or a combination thereof.
  • a device of the invention includes a tangible, non-transitory computer readable medium for memory.
  • Exemplary devices for use as memory include semiconductor memory devices, (e.g., EPROM, EEPROM, solid state drive (SSD), and flash memory devices e.g., SD, micro SD, SDXC, SDIO, SDHC cards); magnetic disks, (e.g., internal hard disks or removable disks); and optical disks (e.g., CD and DVD disks).
  • Components of system 1500 may be under the control of a carrier screening service provider and may be operated to obtain data representing a mutation in a genome of an individual, use a variant node in a graph database to store a description of the mutation (while storing, in the variant node, a pointer to an adjacent node that provides information about a clinical significance of the variant), and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
  • Functionality of server computer 1513 may be provided by an outside vendor such as Amazon Web Services or Amazon's EC2. In fact, the carrier screening entity who is analyzing the mutations from the sample may not and need not have actual knowledge of the physical location and type of computers that provide server computer(s) 1513 .
  • A sequencing instrument 1501 is employed (e.g., an Illumina HiSeq 2000), which itself includes a sequencer computer 1533.
  • The sample from the patient may be received from an outside source (e.g., from a phlebotomy facility down the hall) or may be sent by courier (e.g., in an Eppendorf tube).
  • the service provider will have access to and use a computer 1549 for coordinating methods of the invention.
  • sequencer 1501 is operated by an outside service provider in support of or on order of the carrier screening entity.
  • the carrier screening professional has access to or control over components of the system.

Abstract

The invention relates to using a graph database in genetic analyses to link mutation data to extrinsic data. Entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any entity without disrupting the existing data. Systems and methods of the invention may be used for obtaining data representing a mutation in an individual and using a node in a graph database to store a description of the mutation. The node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The graph database can be queried to provide a report of the clinical significance of the mutation.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application claims priority to, and the benefit of, U.S. Provisional Patent Application No. 62/037,861, filed Aug. 15, 2014, the contents of which are incorporated by reference.
  • TECHNICAL FIELD
  • The invention relates to medical genetics.
  • BACKGROUND
  • Before having children, a person may turn to genetic screening to find out if he or she is a carrier of a genetic condition. Genetic carrier screening can be done using next-generation sequencing (NGS), which produces millions of “base-calls” read from the person's genome. Typically, those base calls are then compared to a reference genome to determine their clinical significance. While all 3.2 billion base-pairs of the human genome are available for use as a reference (e.g., as hg18), knowing the clinical significance of features in the person's genome requires turning to medical literature or specialized databases of mutations. For example, the Online Mendelian Inheritance in Man (OMIM) database contains information on genetic disorders in over 12,000 human genes.
  • The volumes of data that must be stored, compared, and understood are a significant obstacle to realizing the full potential of NGS as a carrier screening tool. Generally, the time required for analysis and reporting is proportional to the amount of data in the databases. The structure of those databases requires exhaustive index table lookups for each comparison. Also, since database designs must be locked in prior to use, a clinician's use of the data system is limited to what the database designer foresaw as the likely qualities of the data. A clinician who discovers a new phenomenon, such as a novel combination of mutations associated with an unexpected disease, may be faced with a data system that does not even provide a means for entering or describing this information.
  • SUMMARY
  • The invention provides systems and methods for genetic analysis in which entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and in which relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any novel entity without disrupting the information already represented in the system. By forsaking the traditional database schema of indexed tables, the run time for queries need not be proportional to the amount of data in the tables. Instead, queries that start with a certain node can find the relevant related nodes in time proportional only to the number of nodes in the results that match the query. Moreover, novel entities and relationships can be inserted into the data system upon discovery with no disruption to the data or operation of the system. Thus, novel mutations can be added or related to disease phenotypes or appropriate literature references as that new information is discovered and observed. The time required for a query of—for example—relationships between a patient and disease-associated alleles in that patient's genome will be proportional to the number of results that are found for inclusion in a report for that patient. Where sequencing uncovers novel mutations or genotype/phenotype associations, those entities and relationships can be brought into the system and included in the reporting without requiring any changes or re-design to the underlying system architecture. In methods and systems of the invention, NGS results, patient information, and medical information can be stored in a graph database and analyzed using graph processing approaches and languages. This provides for very rapid querying and report generation, independent of the size of the underlying data store.
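  • As a minimal sketch of the traversal behavior described above (not the implementation of the invention), the following Python fragment models nodes that hold direct references to their relationships, so a query starting at a patient node touches only the nodes connected to it. The class names, the relationship types HAS_VARIANT and REPORTED_IN, and all sample values are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node that holds direct references to its relationships."""
    label: str                     # e.g. "Patient", "Variant", "Literature"
    properties: dict
    relationships: list = field(default_factory=list)


@dataclass
class Relationship:
    """A typed, directed edge stored as its own record."""
    kind: str
    start: Node
    end: Node


def connect(start, kind, end):
    """Create a relationship and store a reference to it in the start node."""
    rel = Relationship(kind, start, end)
    start.relationships.append(rel)
    return rel


def neighbors(node, kind):
    """Follow stored references; no global index or table scan is consulted."""
    return [r.end for r in node.relationships if r.kind == kind]


def clinical_report(patient):
    """Walk patient -> variant -> literature and collect classifications."""
    rows = []
    for variant in neighbors(patient, "HAS_VARIANT"):
        for paper in neighbors(variant, "REPORTED_IN"):
            rows.append((variant.properties["name"],
                         paper.properties["classification"]))
    return rows


# Illustrative data only (hypothetical identifiers and classification).
patient = Node("Patient", {"id": "P1"})
variant = Node("Variant", {"name": "variant-A"})
paper = Node("Literature", {"classification": "pathogenic"})
connect(patient, "HAS_VARIANT", variant)
connect(variant, "REPORTED_IN", paper)

# Unrelated data added elsewhere in the graph is never visited by the query.
for i in range(1000):
    connect(Node("Patient", {"id": f"other-{i}"}), "HAS_VARIANT",
            Node("Variant", {"name": f"v{i}"}))

report = clinical_report(patient)
print(report)
```

  • In this sketch the work done by clinical_report depends on the number of relationships attached to the starting patient node and its variants, not on the thousand unrelated patients added afterwards.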
  • Since report generation is rapid and not linked to the underlying volume of data, and since systems of the invention may easily accommodate the volumes of data associated with NGS sequencing and human genome based analyses, systems and methods of the invention may be employed for NGS-based carrier screening and provide meaningful results to patients.
  • Additionally, the invention includes the insight that the clinical significance of mutations, or “variants” (e.g., as documented in NGS results such as Variant Call Format (VCF) files), can be shown by relating the mutation to a particular allele of a gene and showing where in the literature the variant is reported as pathogenic or benign while connecting this information back to a patient and lab sample for reporting purposes. Sequencing by existing NGS technologies may provide abundant high-quality raw data in the form of sequence files such as FASTA, FASTQ, Sequence Alignment Map (SAM), Binary Alignment Map (BAM), or VCF files. Systems and methods of the invention can be used to extract relevant data from those files into the described nodes to support the rapid querying and report generation useful for NGS carrier screening. For example, systems of the invention may include an Application Programming Interface (API) that takes as input VCF files and creates a network of nodes representing patients, samples, VCF files, VCF records, variants, alleles, and literature reports with relationships connecting adjacent pairs of those nodes according to their natural relationships. The system supports a genomics analysis clinical pipeline even as it changes and can accommodate the loading in of external data. The system can be implemented using a graph database and related software. Systems of the invention support a variety of analyses and use cases. For example, with NGS-based carrier screening implemented using the described graph database structure for analysis and reporting, it becomes easy to query and report such phenomena as allele frequencies.
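  • The VCF-to-graph loading described above might be sketched as follows. This is an illustration under stated assumptions, not the API of the invention: the function names, the edge types (PROVIDED, PRODUCED, CONTAINS, CALLS), the dictionary-based graph, and the single made-up VCF record are all hypothetical, and multi-allelic ALT fields are not split.

```python
# A made-up, minimal VCF body (tab-separated); the record is illustrative only.
VCF_TEXT = """##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t12345\t.\tA\tG\t50\tPASS\t.
"""


def parse_vcf_records(text):
    """Yield one dict per data line; header lines start with '#'."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _vid, ref, alt = line.split("\t")[:5]
        yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


def add_node(graph, label, key, **props):
    """Create the node once; later calls with the same key reuse it."""
    node_id = (label, key)
    graph["nodes"].setdefault(node_id, props)
    return node_id


def add_edge(graph, start, kind, end):
    graph["edges"].append((start, kind, end))


def load_vcf(graph, patient_id, sample_id, file_name, text):
    """Build patient -> sample -> file -> record -> variant nodes and edges."""
    patient = add_node(graph, "Patient", patient_id)
    sample = add_node(graph, "Sample", sample_id)
    vcf_file = add_node(graph, "VcfFile", file_name)
    add_edge(graph, patient, "PROVIDED", sample)
    add_edge(graph, sample, "PRODUCED", vcf_file)
    for i, rec in enumerate(parse_vcf_records(text)):
        record = add_node(graph, "VcfRecord", f"{file_name}#{i}", **rec)
        # The same variant seen in different files maps to one shared node.
        variant_key = f"{rec['chrom']}:{rec['pos']}:{rec['ref']}>{rec['alt']}"
        variant = add_node(graph, "Variant", variant_key)
        add_edge(graph, vcf_file, "CONTAINS", record)
        add_edge(graph, record, "CALLS", variant)
    return graph


graph = {"nodes": {}, "edges": []}
load_vcf(graph, "patient-1", "sample-1", "a.vcf", VCF_TEXT)
load_vcf(graph, "patient-2", "sample-2", "b.vcf", VCF_TEXT)
labels = [label for (label, _key) in graph["nodes"]]
print(labels.count("Variant"), labels.count("VcfRecord"))
```

  • Because both hypothetical patients carry the same call, the sketch ends with two VCF record nodes pointing at a single shared variant node, which is the shape that makes a query such as allele frequency a matter of counting relationships on that node.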
  • Importantly, systems and methods of the invention support the curation of variants. Curating variants includes identifying an individual variant in sequencing results, researching medical literature for information about the variant, classifying the variant (e.g., pathogenic, benign, somewhere in between), and accessioning that information into the database for use in subsequent reports on patient samples in which that variant is implicated. Using the nodes and relationships provided by the invention, variants can be connected to alleles, literature references, medical information, or combinations thereof. If changes are subsequently made (e.g., a missense mutation is re-classified as a nonsense mutation), other features of the system infrastructure are not disrupted. Thus the active curation of variants is accommodated and improves the system.
  • In certain aspects, the invention provides a method for analyzing mutations. The method includes obtaining data representing a mutation in a genome of an individual and using a node in a graph database to store a description of the mutation. The node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The method includes querying the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
  • The data representing the mutation may be obtained by obtaining a sample that includes a nucleic acid from the individual and sequencing the nucleic acid to obtain a sequence read file that includes the data. The sample may be represented in the graph database using a sample node and the sample node may be connected via a pointer to a read file node representing the sequence read file. The graph database may include nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants as well as edges defining relationships between pairs of the nodes.
  • In some embodiments, the data representing a mutation is obtained as part of a file such as a variant call file (VCF), a sequence alignment map (SAM) file, a binary alignment map (BAM) file, a FASTA file, or a FASTQ file. The file may be represented in the graph database (e.g., using a file node) and a pointer to the file node may be stored in the mutation node.
  • In certain embodiments, the data representing a mutation comprises a description of the mutation as a variant of a reference human genome. The description of the mutation may be provided as a VCF record in a VCF file. The method may include obtaining sequencing data that represents a plurality of mutations in the genome of the individual—each of the plurality of mutations being represented as variant calls relative to a human genome reference. For each of the plurality of mutations, a corresponding variant node in the graph database is used to store a description of that mutation.
  • Aspects of the invention provide a system for describing genetic information. The system includes at least one computer comprising memory coupled to a processor. The system has at least a portion of a graph database stored therein. The system is operable to obtain data representing a mutation in a genome of an individual, use a variant node in the graph database to store a description of the mutation, and store—within the variant node—a pointer to an adjacent node that provides information about a clinical significance of the mutation. The system may be used to query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. As discussed above, the data representing a mutation may be obtained as part of a file such as a VCF file. The system may represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.
  • The data representing the mutation may be provided as a sequence read file that includes that data. In certain embodiments, the system is operable to use the graph database to represent a biological sample from the individual with a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • The system may be operated to obtain sequencing data representing a plurality of mutations in the genome of the individual (e.g., as variant calls relative to a human genome reference) and use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation. The system links the individual to an allele node based on the plurality of mutations.
  • In a preferred aspect, the invention provides: a system for describing genetic information, the system comprising: at least one computer comprising memory coupled to a processor, the system having at least a portion of a graph database stored therein, wherein the system is operable to: obtain data representing a mutation in a genome of an individual; use a node in the graph database to store a description of the mutation; store, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. Preferably a pointer identifies a physical location in the memory at which the adjacent node is stored. Thus each node may be stored at a specific physical location in the memory. Each such specific physical location is referenced by a pointer (which itself optionally may be stored within a node at a physical location that is referenced, in turn, by another pointer). Preferably, each pointer identifies a physical location in the memory subsystem at which the adjacent object is stored. In the preferred embodiments, the pointer or native pointer is manipulatable as a memory address in that it points to a physical location in the memory, and dereferencing the pointer accesses the intended data. That is, a pointer is a reference to a datum stored somewhere in memory; to obtain that datum is to dereference the pointer. The feature that separates pointers from other kinds of reference is that a pointer's value is interpreted as a memory address, at a low level or hardware level. The speed and efficiency of the described low-level, or hardware-level, memory referencing allows for very rapid graph traversals, which means that data content can scale up unbounded but reporting actionable medical genetic information will not require amounts of time that scale up with the data content. Use of hardware-level references, or index-free adjacency, uncouples the time requirements for medical genetics reporting from data content volume.
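  • To illustrate why fixed-size records and stored pointers make a lookup independent of store size, the following Python sketch packs nodes into a byte buffer and locates each one by multiplying its identifier by the record size. The record layout (an 8-byte identifier of the adjacent node followed by a 16-byte label) and the function names are assumptions chosen for the example; a Python offset into a buffer only stands in for the native memory pointer described above.

```python
import struct

# Hypothetical layout: id of the adjacent node (-1 if none), then a 16-byte label.
RECORD = struct.Struct("<q16s")


def write_nodes(rows):
    """Pack (next_id, label) pairs into consecutive fixed-size records."""
    buf = bytearray()
    for next_id, label in rows:
        buf += RECORD.pack(next_id, label.encode("ascii"))
    return bytes(buf)


def read_node(store, node_id):
    """Jump straight to the record: offset = id * record size."""
    next_id, raw = RECORD.unpack_from(store, node_id * RECORD.size)
    return next_id, raw.rstrip(b"\x00").decode("ascii")


def follow(store, node_id):
    """Dereference stored pointers until a node has no adjacent node."""
    labels = []
    while node_id != -1:
        node_id, label = read_node(store, node_id)
        labels.append(label)
    return labels


# Node 0 points to node 2; node 1 is unrelated and is never read.
store = write_nodes([(2, "variant"), (-1, "unrelated"), (-1, "literature")])
print(follow(store, 0))
```

  • Each hop costs one multiplication and one fixed-size read, so adding more records to the store does not change the cost of following an existing chain.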
  • In a first embodiment of the preferred aspect, the system is operable to obtain the data representing the mutation by receiving at least one sequence read file that includes the data. Preferably the system of the first embodiment is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • In a second embodiment of the preferred aspect, the data representing the mutation is obtained as part of a file. In the second embodiment, the file may have a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ. Preferably in the second embodiment the system is operable to represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node. Optionally, the system is further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • In a third embodiment of the preferred aspect, the data representing the mutation comprises a description of the mutation as a variant of a reference human genome. In the third embodiment, the description of the mutation may optionally be obtained from a VCF record in a VCF file. Additionally, the system of the third embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • In a fourth embodiment of the preferred aspect, the system is further operable to: obtain sequencing data representing a plurality of mutations in the genome of the individual, the plurality of mutations being represented as variant calls relative to a human genome reference; use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation; and link the individual to an allele node based on the plurality of mutations. In the fourth embodiment, the graph database may include: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes. The system of the fourth embodiment may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • In a fifth embodiment of the preferred aspect, the graph database comprises: nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and edges defining relationships between pairs of the nodes. In the fifth embodiment, the system may be further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 illustrates an exemplary NGS workflow for carrier screening.
  • FIG. 2 gives a sample of an exemplary VCF file.
  • FIG. 3 diagrams a method for analyzing mutations.
  • FIG. 4 gives a flow chart for a VCF file parser.
  • FIG. 5 presents a model of data received from parsing a VCF file.
  • FIG. 6 shows an entity relationship diagram (ERD) of the data modeled by FIG. 5.
  • FIG. 7 diagrams a high-level architecture of a system of the invention.
  • FIG. 8 illustrates a structure for nodes and relationships on disk.
  • FIG. 9 illustrates the use of a variant node to store a description of a mutation.
  • FIG. 10 shows an allele node showing that an allele includes a certain mutation.
  • FIG. 11 shows a variant node connected to two different literature reference nodes.
  • FIG. 12 illustrates updating information about a mutation.
  • FIG. 13 presents an example database that may be queried for allele frequency.
  • FIG. 14 diagrams a system for performing methods of the invention.
  • DETAILED DESCRIPTION
  • The invention relates to using a graph database in genetic analyses to link mutation data to extrinsic data. Entities such as mutations, patients, samples, alleles, and clinical information are individually represented and stored as nodes and relationships between entities are also individually represented and stored. Each node and relationship can be stored using a fixed-size record and nodes can be flexibly invoked to represent any entity without disrupting the existing data. Systems and methods of the invention may be used for obtaining data representing a mutation in an individual and using a variant node in a graph database to store a description of the mutation. The variant node has stored within it a pointer to an adjacent node that provides information about a clinical significance of the variant. The graph database can be queried to provide a report of the clinical significance of the mutation. In certain embodiments, systems and methods of the invention operate within the context of a carrier screening workflow and provide a querying and reporting tool for carrier screening.
  • FIG. 1 illustrates an exemplary NGS workflow for carrier screening. The workflow combines automated, optimized molecular inversion probe target capture 109 with molecular barcoding to maximize the sample throughput of an NGS machine and employs assembly and alignment methods that allow accurate identification of both substitution and insertion/deletion lesions. The workflow is applicable to, for example, genes in which loss-of-function mutations cause recessive Mendelian disorders often included as part of routine carrier screening. A screening or analysis may begin with obtaining nucleic acid from a sample.
  • Nucleic acid in a sample can be any nucleic acid, including for example, genomic DNA in a tissue sample, cDNA amplified from a particular target in a laboratory sample, or mixed DNA from multiple organisms. In some embodiments, the sample includes homozygous DNA from a haploid or diploid organism. For example, a sample can include genomic DNA from a patient who is homozygous for a rare recessive allele. In other embodiments, the sample includes heterozygous genetic material from a diploid or polyploid organism with a somatic mutation such that two related nucleic acids are present in allele frequencies other than 50 or 100%, e.g., 20%, 5%, 1%, 0.1%, or any other allele frequency.
  • In one embodiment, nucleic acid template molecules (e.g., DNA or RNA) are isolated from a biological sample containing a variety of other components, such as proteins, lipids, and non-template nucleic acids. Nucleic acid template molecules can be obtained from any cellular material, obtained from animal, plant, bacterium, fungus, or any other cellular organism. Biological samples for use in the present invention also include viral particles or preparations. Nucleic acid template molecules can be obtained directly from an organism or from a biological sample obtained from an organism, e.g., from blood, urine, cerebrospinal fluid, seminal fluid, saliva, sputum, stool, and tissue. Any tissue or body fluid specimen (e.g., a human tissue or bodily fluid specimen) may be used as a source for nucleic acid to use in the invention. Nucleic acid template molecules can also be isolated from cultured cells, such as a primary cell culture or cell line. The cells or tissues from which template nucleic acids are obtained can be infected with a virus or other intracellular pathogen. A sample can also be total RNA extracted from a biological specimen, a cDNA library, viral, or genomic DNA. A sample may also be isolated DNA from a non-cellular origin, e.g. amplified/isolated DNA from the freezer.
  • Generally, nucleic acid can be extracted, isolated, amplified, or analyzed by a variety of techniques such as those described by Green and Sambrook, Molecular Cloning: A Laboratory Manual (Fourth Edition), Cold Spring Harbor Laboratory Press, Woodbury, N.Y. 2,028 pages (2012); or as described in U.S. Pat. No. 7,957,913; U.S. Pat. No. 7,776,616; U.S. Pat. No. 5,234,809; U.S. Pub. 2010/0285578; and U.S. Pub. 2002/0190663.
  • Nucleic acid from a sample may optionally be fragmented or sheared to a desired length, using a variety of mechanical, chemical, and/or enzymatic methods. DNA may be randomly sheared via sonication using, for example, an ultrasonicator sold by Covaris (Woburn, Mass.), brief exposure to a DNase, or using a mixture of one or more restriction enzymes, or a transposase or nicking enzyme. RNA may be fragmented by brief exposure to an RNase, heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation. In one embodiment, nucleic acid is fragmented by sonication. In another embodiment, nucleic acid is fragmented by a hydroshear instrument. Generally, individual nucleic acid template molecules can be from about 2 kb to about 40 kb. In a particular embodiment, nucleic acids are about 6 kb-10 kb fragments. Nucleic acid molecules may be single-stranded, double-stranded, or double-stranded with single-stranded regions (for example, stem- and loop-structures).
  • A biological sample may be lysed, homogenized, or fractionated in the presence of a detergent or surfactant as needed. Suitable detergents may include an ionic detergent (e.g., sodium dodecyl sulfate or N-lauroylsarcosine) or a nonionic detergent (such as the polysorbate 80 sold under the trademark TWEEN by Uniqema Americas (Paterson, N.J.) or C14H22O(C2H4O)n, known as TRITON X-100).
  • In certain embodiments, genomic DNA samples are input to a molecular inversion probe capture 109 reaction. Molecular inversion probes may be designed to capture the coding regions as well as well-characterized noncoding regions of genes. Such probes may include 5′ and 3′ targeting arms (extension and ligation, respectively) of, for example, about a total of 40 nucleotides and being designed to flank 130-bp target regions. Each target is captured 109 by multiple probes that anneal to non-overlapping genomic intervals. PCR is performed 121 using primers containing patient-specific barcodes, yielding barcode libraries. Genomic DNA may be subjected to multiplex target capture using molecular inversion probes. Captured product may be subjected to PCR to attach molecular barcodes in a manner that allows sequencing from either end of the captured region.
  • PCR may be used as described or any other amplification reaction may be performed. Amplification refers to production of additional copies of a nucleic acid sequence and is generally carried out using polymerase chain reaction (PCR) or other technologies known in the art. The amplification reaction may be any amplification reaction known in the art that amplifies nucleic acid molecules such as PCR (e.g., nested PCR, PCR-single strand conformation polymorphism, ligase chain reaction, strand displacement amplification, restriction fragment length polymorphism, transcription based amplification system, rolling circle amplification, hyper-branched rolling circle amplification, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RTPCR), and restriction fragment length polymorphism PCR). See U.S. Pat. No. 5,242,794; U.S. Pat. No. 5,494,810; U.S. Pat. No. 4,988,617; U.S. Pat. No. 6,582,938; U.S. Pat. No. 4,683,195; and U.S. Pat. No. 4,683,202, hereby incorporated by reference. Primers for PCR, sequencing, and other methods can be prepared by cloning, direct chemical synthesis, and other methods known in the art. Primers can also be obtained from commercial sources such as Eurofins MWG Operon (Huntsville, Ala.) or Life Technologies (Carlsbad, Calif.).
  • Amplification adapters may be attached to the fragmented nucleic acid. Adapters may be commercially obtained, such as from Integrated DNA Technologies (Coralville, Iowa). In certain embodiments, the adapter sequences are attached to the template nucleic acid molecule with an enzyme. The enzyme may be a ligase or a polymerase. The ligase may be any enzyme capable of ligating an oligonucleotide (RNA or DNA) to the template nucleic acid molecule. Suitable ligases include T4 DNA ligase and T4 RNA ligase, available commercially from New England Biolabs (Ipswich, Mass.). Methods for using ligases are well known in the art. The polymerase may be any enzyme capable of adding nucleotides to the 3′ and the 5′ terminus of template nucleic acid molecules.
  • Embodiments of the invention involve attaching the bar code sequences to the template nucleic acids, e.g., for barcode PCR 121. In certain embodiments, a bar code is attached to each fragment. In other embodiments, a plurality of bar codes, e.g., two bar codes, are attached to each fragment. A bar code sequence generally includes certain features that make the sequence useful in sequencing reactions. For example, the bar code sequences are designed to have minimal or no homo-polymer regions, i.e., 2 or more of the same base in a row such as AA or CCC, within the bar code sequence. The bar code sequences are also designed so that they are at least one edit distance away from the base addition order when performing base-by-base sequencing, ensuring that the first and last base do not match the expected bases of the sequence.
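  • The homo-polymer constraint described above can be sketched as a simple filter over candidate sequences. The function names and the exhaustive enumeration are assumptions made for illustration; the sketch does not implement the edit-distance constraint relative to the base addition order, and it is not the design method of the references cited in this description.

```python
from itertools import product


def longest_run(seq):
    """Return the length of the longest run of one repeated base in seq."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


def candidate_barcodes(length):
    """Yield every sequence of the given length with no homo-polymer region,
    i.e., no base that appears 2 or more times in a row."""
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if longest_run(seq) < 2:
            yield seq


codes = list(candidate_barcodes(4))
print(len(codes), codes[:3])
```

  • For a length of 4 the filter keeps 4 × 3 × 3 × 3 = 108 of the 256 possible sequences, since each base after the first may be any of the three bases that differ from its predecessor.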
  • The bar code sequences are designed such that each sequence is correlated to a particular portion of nucleic acid, allowing sequence reads to be correlated back to the portion from which they came. Methods of designing sets of bar code sequences are shown for example in U.S. Pat. No. 6,235,475, the contents of which are incorporated by reference herein in their entirety. In certain embodiments, the bar code sequences range from about 5 nucleotides to about 15 nucleotides. In a particular embodiment, the bar code sequences range from about 4 nucleotides to about 7 nucleotides. Since the bar code sequence is sequenced along with the template nucleic acid, the oligonucleotide length should be of minimal length so as to permit the longest read from the template nucleic acid attached. Generally, the bar code sequences are spaced from the template nucleic acid molecule by at least one base (which minimizes homo-polymeric combinations). In certain embodiments, the bar code sequences are attached to the template nucleic acid molecule, e.g., with an enzyme. The enzyme may be a ligase or a polymerase, as discussed above. Attaching bar code sequences to nucleic acid templates is shown in U.S. Pub. 2008/0081330 and U.S. Pub. 2011/0301042, the contents of which are incorporated by reference herein in their entirety. Methods for designing sets of bar code sequences and other methods for attaching bar code sequences are shown in U.S. Pat. Nos. 7,544,473; 7,537,897; 7,393,665; 6,352,828; 6,172,218; 6,172,214; 6,150,516; 6,138,077; 5,863,722; 5,846,719; 5,695,934; and 5,604,097, each incorporated by reference.
  • After any processing steps (e.g., obtaining, isolating, fragmenting, amplification, or barcoding), nucleic acid can be sequenced 129.
  • Sequencing 129 may be by any method known in the art. DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing. Sequencing of separated molecules has more recently been demonstrated by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes.
  • A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems sold under the trademarks GS JUNIOR, GS FLX+ and 454 SEQUENCING by 454 Life Sciences, a Roche company (Branford, Conn.), and described by Margulies, M. et al., Genome sequencing in micro-fabricated high-density picotiter reactors, Nature, 437:376-380 (2005); U.S. Pat. No. 5,583,024; U.S. Pat. No. 5,674,713; and U.S. Pat. No. 5,700,673, the contents of which are incorporated by reference herein in their entirety. 454 sequencing involves two steps. In the first step of those systems, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed.
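  • Because the light signal is proportional to the number of nucleotides incorporated in each flow, a read can in principle be recovered by rounding each signal to a whole number of bases. The following Python sketch shows that idea only; the flow order "TACG", the normalized signal values, and the function name are assumptions for illustration, and real base callers apply noise and phasing corrections not modeled here.

```python
def call_bases(flow_order, signals):
    """Convert per-flow light signals into a base sequence.

    Each signal is assumed to be non-negative and already normalized so
    that 1.0 corresponds to a single incorporated nucleotide.
    """
    read = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]   # nucleotide flowed this cycle
        count = int(signal + 0.5)                # nearest whole number of bases
        read.append(base * count)
    return "".join(read)


# Made-up signals for two cycles of the assumed flow order T, A, C, G.
signals = [1.02, 0.05, 2.10, 0.98, 0.03, 1.04, 0.00, 2.95]
print(call_bases("TACG", signals))
```

  • A signal near 2 or 3 is read as a run of the same base, which is why long homo-polymer stretches, where adjacent signal levels are hard to tell apart, are a known source of error for this chemistry.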
  • Another example of a DNA sequencing technique that can be used is SOLiD technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is removed and the process is then repeated.
  • Another example of a DNA sequencing technique that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pub. 2010/0304982; U.S. Pub. 2010/0301398; U.S. Pub. 2010/0300895; U.S. Pub. 2010/0300559; and U.S. Pub. 2009/0026082, the contents of each of which are incorporated by reference in their entirety.
  • Another example of a sequencing 129 technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5′ and 3′ ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 copies of single-stranded DNA molecules of the same template in each channel of the flow cell. Primers, DNA polymerase and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3′ terminators and fluorophores from each incorporated base are removed and the incorporation, detection and identification steps are repeated. Sequencing according to this technology is described in U.S. Pat. No. 7,960,120; U.S. Pat. No. 7,835,871; U.S. Pat. No. 7,232,656; U.S. Pat. No. 7,598,035; U.S. Pat. No. 6,911,345; U.S. Pat. No. 6,833,246; U.S. Pat. No. 6,828,100; U.S. Pat. No. 6,306,597; U.S. Pat. No. 6,210,891; U.S. Pub. 2011/0009278; U.S. Pub. 2007/0114362; U.S. Pub. 2006/0292611; and U.S. Pub. 2006/0024681, each of which is incorporated by reference in its entirety.
  • Another example of a sequencing technology that can be used includes the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.). In SMRT, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.
  • Another example of a sequencing technique that can be used is nanopore sequencing (Soni & Meller, 2007, Progress toward ultrafast DNA sequence using solid-state nanopores, Clin Chem 53(11):1996-2001). A nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.
  • Another example of a sequencing technique that can be used involves using a chemical-sensitive field effect transistor (chemFET) array to sequence DNA (for example, as described in U.S. Pub. 2009/0026082). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.
  • Another example of a sequencing technique that can be used involves using an electron microscope as described, for example, by Moudrianakis, E. N. and Beer M., in Base sequence determination in nucleic acids with the electron microscope, III. Chemistry and microscopy of guanine-labeled DNA, PNAS 53:564-71 (1965). In one example of the technique, individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to measure sequences.
  • Sequencing according to embodiments of the invention generates a plurality of reads. Reads according to the invention generally include sequences of nucleotide data less than about 5000 bases in length, or less than about 150 bases in length. In certain embodiments, reads are between about 80 and about 90 bases, e.g., about 85 bases in length. In some embodiments, methods of the invention are applied to very short reads, i.e., less than about 50 or about 30 bases in length. Sequence read data can include the sequence data as well as meta information. Sequence read data can be stored in any suitable file format including, for example, VCF files, FASTA files or FASTQ files, as are known to those of skill in the art. In some embodiments, PCR product is pooled and sequenced (e.g., on an Illumina HiSeq 2000). Raw .bcl files are converted to qseq files using bclConverter (Illumina). FASTQ files are generated by “de-barcoding” genomic reads using the associated barcode reads; reads for which barcodes yield no exact match to an expected barcode, or contain one or more low-quality base calls, may be discarded. Reads may be stored in any suitable format such as, for example, FASTA or FASTQ format.
  • FASTA is originally a computer program for searching sequence databases and the name FASTA has come to also refer to a standard file format. See Pearson & Lipman, 1988, Improved tools for biological sequence comparison, PNAS 85:2444-2448. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column. The word following the “>” symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the “>” and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence.
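  • By way of illustration only, the FASTA layout described above can be read with a few lines of Python. The following is a minimal sketch, not a definitive implementation; the function names `parse_fasta` and `_split_header` and the example records are hypothetical and are not part of any program referenced herein.

```python
from typing import Iterable, Iterator, Tuple


def _split_header(header: str) -> Tuple[str, str]:
    """Split a description line (without '>') into identifier and description."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new '>' line ends the previous sequence and starts another.
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


# Hypothetical example input, made up for illustration.
example = [
    ">seq1 hypothetical example record",
    "ACGTACGT",
    "ACGT",
    ">seq2",
    "GGGTTTAA",
]
records = list(parse_fasta(example))
```

  • In this sketch the identifier and the description are both optional, consistent with the format description above, and the sequence lines of each record are concatenated into a single string.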
  • The FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. It is similar to the FASTA format but with quality scores following the sequence data. Both the sequence letter and quality score are encoded with a single ASCII character for brevity. The FASTQ format is a de facto standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer. Cock et al., 2009, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucleic Acids Res 38(6):1767-1771.
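  • As a hedged illustration of the single-character quality encoding, the following Python sketch reads four-line FASTQ records and decodes each quality character to an integer score. It assumes unwrapped records and an ASCII offset of 33 (the Sanger variant); other variants use a different offset. The names `FastqRecord` and `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from unwrapped four-line FASTQ text.

    offset=33 is an assumption (Sanger-style encoding).
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # One ASCII character encodes one quality score.
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical example read, made up for illustration.
reads = list(parse_fastq(["@read1", "ACGT", "+", "II5!"]))
```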
  • For FASTA and FASTQ files, meta information includes the description line and not the lines of sequence data. In some embodiments, for FASTQ files, the meta information includes the quality scores. For FASTA and FASTQ files, the sequence data begins after the description line and is present typically using some subset of IUPAC ambiguity codes optionally with “-”. In a preferred embodiment, the sequence data will use the A, T, C, G, and N characters, optionally including “-” or U as-needed (e.g., to represent gaps or uracil).
  • Following sequencing, reads are preferably mapped 135 to a reference using assembly and alignment techniques known in the art or developed for use in the workflow. Various strategies for the alignment and assembly of sequence reads, including the assembly of sequence reads into contigs, are described in detail in U.S. Pat. No. 8,209,130, incorporated herein by reference. Strategies may include (i) assembling reads into contigs and aligning the contigs to a reference; (ii) aligning individual reads to the reference; (iii) assembling reads into contigs, aligning the contigs to a reference, and aligning the individual reads to the contigs; or (iv) other strategies known in the art or later developed. Mapping 135, it can be seen, may employ assembly steps, alignment steps, or both. Assembly can be implemented, for example, by the program ‘The Short Sequence Assembly by k-mer search and 3′ read Extension’ (SSAKE), from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (see, e.g., Warren et al., 2007, Assembling millions of short DNA sequences using SSAKE, Bioinformatics, 23:500-501). SSAKE cycles through a table of reads and searches a prefix tree for the longest possible overlap between any two sequences. SSAKE clusters reads into contigs.
  • Another read assembly program is Forge Genome Assembler, written by Darren Platt and Dirk Evers and available through the SourceForge web site maintained by Geeknet (Fairfax, Va.) (see, e.g., DiGuistini et al., 2009, De novo sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data, Genome Biology, 10:R94). Forge distributes its computational and memory consumption to multiple nodes, if available, and has therefore the potential to assemble large sets of reads. Forge was written in C++ using the parallel MPI library. Forge can handle mixtures of reads, e.g., Sanger, 454, and Illumina reads.
  • Assembly through multiple sequence alignment can be performed, for example, by the program Clustal Omega, (Sievers et al., 2011, Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, Mol Syst Biol 7:539), ClustalW, or ClustalX (Larkin et al., 2007, Clustal W and Clustal X version 2.0, Bioinformatics, 23(21):2947-2948) available from University College Dublin (Dublin, Ireland).
  • Another exemplary read assembly program known in the art is Velvet, available through the web site of the European Bioinformatics Institute (Hinxton, UK) (Zerbino & Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Genome Research 18(5):821-829). Velvet implements an approach based on de Bruijn graphs, uses information from read pairs, and implements various error correction steps.
  • Read assembly can be performed with the programs from the package SOAP, available through the website of Beijing Genomics Institute (Beijing, CN) or BGI Americas Corporation (Cambridge, Mass.). For example, the SOAPdenovo program implements a de Bruijn graph approach. SOAP3/GPU aligns short reads to a reference sequence.
  • Another read assembly program is ABySS, from Canada's Michael Smith Genome Sciences Centre (Vancouver, B.C., CA) (Simpson et al., 2009, ABySS: A parallel assembler for short read sequence data, Genome Res., 19(6):1117-23). ABySS uses the de Bruijn graph approach and runs in a parallel environment.
  • Read assembly can also be done by Roche's GS De Novo Assembler, known as gsAssembler or Newbler (NEW assemBLER), which is designed to assemble reads from the Roche 454 sequencer (described, e.g., in Kumar & Blaxter, 2010, Comparing de novo assemblers for 454 transcriptome data, Genomics 11:571 and Margulies 2005). Newbler accepts 454 Flx Standard reads and 454 Titanium reads as well as single and paired-end reads and optionally Sanger reads. Newbler is run on Linux, in either 32 bit or 64 bit versions. Newbler can be accessed via a command-line or a Java-based GUI interface. Additional discussion of read assembly may be found in Li et al., 2009, The Sequence alignment/map (SAM) format and SAMtools, Bioinformatics 25:2078; Lin et al., 2008, ZOOM! Zillions Of Oligos Mapped, Bioinformatics 24:2431; Li & Durbin, 2009, Fast and accurate short read alignment with Burrows-Wheeler Transform, Bioinformatics 25:1754; and Li, 2011, Improving SNP discovery by base alignment quality, Bioinformatics 27:1157. Assembled sequence reads may preferably be aligned to a reference.
  • Methods for alignment are known in the art and may make use of a computer program that performs alignment, such as Burrows-Wheeler Aligner.
  • In certain embodiments, reads are aligned to hg18 on a per-sample basis using Burrows-Wheeler Aligner version 0.5.7 for short alignments, and genotype calls are made using Genome Analysis Toolkit. See McKenna et al., 2010, The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Res 20(9):1297-1303. High-confidence genotype calls may be defined as having depth ≧50 and strand bias score ≦0. Clinical significance of variant calls is an important question in carrier screening and will be addressed below. Other computer programs for assembling reads are known in the art. Such assembly programs can run on a single general-purpose computer, on a cluster or network of computers, or on specialized computing devices dedicated to sequence analysis.
  • In some embodiments, de-barcoded fastq files are obtained as described above and partitioned by capture region (exon) using the target arm sequence as a unique key. Reads are assembled in parallel by exon using SSAKE version 3.7 with parameters “-m 30 -o 15”. The resulting contiguous sequences (contigs) can be aligned to hg18 (e.g., using BWA version 0.5.7 for long alignments with parameter “-r 1”). In some embodiments, short-read alignment is performed as described above, except that sample contigs (rather than hg18) are used as the input reference sequence. Software may be developed in Java to accurately transfer coordinate and variant data (gaps) from local sample space to global reference space for every BAM-formatted alignment. Genotyping and base-quality recalibration may be performed on the coordinate-translated BAM files using the GATK program.
  • In some embodiments, any or all of the steps of the invention are automated. For example, a Perl script or shell script can be written to invoke any of the various programs discussed above (see, e.g., Tisdall, Mastering Perl for Bioinformatics, O'Reilly & Associates, Inc., Sebastopol, CA 2003; Michael, R., Mastering Unix Shell Scripting, Wiley Publishing, Inc., Indianapolis, Ind. 2003). Alternatively, methods of the invention may be embodied wholly or partially in one or more dedicated programs, for example, each optionally written in a compiled language such as C++ then compiled and distributed as a binary. Methods of the invention may be implemented wholly or in part as modules within, or by invoking functionality within, existing sequence analysis platforms. In certain embodiments, methods of the invention include a number of steps that are all invoked automatically responsive to a single starting cue (e.g., one or a combination of triggering events sourced from human activity, another computer program, or a machine). Thus, the invention provides methods in which any of the steps or any combination of the steps can occur automatically responsive to a cue. Automatically generally means without intervening human input, influence, or interaction (i.e., responsive only to original or pre-cue human activity).
  • Mapping 135 sequence reads to a reference, by whatever strategy, may produce output such as a text file or an XML file containing sequence data such as a sequence of the nucleic acid aligned to a sequence of the reference genome. In certain embodiments (e.g., see FIG. 1) mapping 135 reads to a reference produces results stored in SAM or BAM file 179 and such results may contain coordinates or a string describing one or more mutations in the subject nucleic acid relative to the reference genome. Alignment strings known in the art include Simple UnGapped Alignment Report (SUGAR), Verbose Useful Labeled Gapped Alignment Report (VULGAR), and Compact Idiosyncratic Gapped Alignment Report (CIGAR) (Ning, Z., et al., Genome Research 11(10):1725-9 (2001)). These strings are implemented, for example, in the Exonerate sequence alignment software from the European Bioinformatics Institute (Hinxton, UK).
  • In some embodiments, a sequence alignment is produced—such as, for example, a sequence alignment map (SAM) or binary alignment map (BAM) file—comprising a CIGAR string (the SAM format is described, e.g., in Li, et al., The Sequence Alignment/Map format and SAMtools, Bioinformatics, 2009, 25(16):2078-9). In some embodiments, CIGAR displays or includes gapped alignments one-per-line. CIGAR is a compressed pairwise alignment format reported as a CIGAR string. A CIGAR string is useful for representing long (e.g. genomic) pairwise alignments. A CIGAR string is used in SAM format to represent alignments of reads to a reference genome sequence.
  • A CIGAR string follows an established motif. Each character is preceded by a number, giving the base counts of the event. Characters used can include M, I, D, N, and S (M=match; I=insertion; D=deletion; N=gap; S=substitution). The CIGAR string defines the sequence of matches/mismatches and deletions (or gaps). For example, the CIGAR string 2MD3M2D2M will mean that the alignment contains 2 matches, 1 deletion (number 1 is omitted in order to save some space), 3 matches, 2 deletions and 2 matches. In general, for carrier screening or other assays such as the NGS workflow depicted in FIG. 1, sequencing results will be used in genotyping 141.
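  • For illustration, the motif described above can be decoded with a short Python routine. This is a sketch under stated assumptions: the function name `parse_cigar` is hypothetical, only the characters M, I, D, N, and S listed above are accepted, and an omitted count is read as 1, following the example above (the SAM specification itself requires an explicit count before every operation).

```python
import re
from typing import List, Tuple

# An optional run of digits followed by one operation character.
_CIGAR_TOKEN = re.compile(r"(\d*)([MIDNS])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Decode a CIGAR string into (count, operation) pairs.

    An omitted count is treated as 1, per the shorthand in the text above.
    """
    operations = []
    position = 0
    while position < len(cigar):
        match = _CIGAR_TOKEN.match(cigar, position)
        if match is None:
            raise ValueError(
                f"unexpected character at position {position} in {cigar!r}"
            )
        count = int(match.group(1)) if match.group(1) else 1
        operations.append((count, match.group(2)))
        position = match.end()
    return operations


decoded = parse_cigar("2MD3M2D2M")
```

  • Applied to the string 2MD3M2D2M, the sketch yields 2 matches, 1 deletion, 3 matches, 2 deletions, and 2 matches, in agreement with the reading given above.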
  • Output from mapping 135 may be stored in a SAM or BAM file 179, in a variant call format (VCF) file 183, or other format. In an illustrative embodiment, output is stored in a VCF file, although methods described herein are applicable to other file formats such as SAM or BAM files, as will be readily apparent to one of skill in the art.
  • FIG. 2 gives a sample of an exemplary VCF file 183. A typical VCF file 183 will include a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with characters ‘##’, and a TAB delimited field definition line starting with a single ‘#’ character. The field definition line names eight mandatory columns and the body section contains lines of data populating the columns defined by the field definition line. The VCF format is described in Danecek et al., 2011, The variant call format and VCFtools, Bioinformatics 27(15):2156-2158.
  • The data contained in a VCF file 183 as shown for example in FIG. 2 represents the variants, or mutations, that are found in the nucleic acid that was obtained from the sample from the patient and sequenced. In its original sense, mutation refers to a change in genetic information and has come to refer to the present genotype that results from a mutation. As is known in the art, mutations include different types of mutations such as substitutions, insertions or deletions (INDELs), translocations, inversions, chromosomal abnormalities, and others. By convention in some contexts where two or more versions of genetic information or alleles are known, the one thought to have the predominant frequency in the population is denoted the wild type and the other(s) are referred to as mutation(s). In general in some contexts an absolute allele frequency is not determined (i.e., not every human on the planet is genotyped) but allele frequency refers to a calculated probable allele frequency based on sampling and known statistical methods and often an allele frequency is reported in terms of a certain population such as humans of a certain ethnicity. Variant can be taken to be roughly synonymous to mutation but referring to a genotype being described in comparison or with reference to a reference genotype or genome. For example as used in bioinformatics variant describes a genotype feature in comparison to a reference such as the human genome (e.g., hg18 or hg19 which may be taken as a wild type). An NGS workflow and genotype 141 generates data representing one or more mutations in a genome of an individual that are generally reported as variants, or “variant calls”, in, for example, a VCF file 183.
  • With continuing reference to FIG. 2, a VCF file 183 includes data representing one or more mutations. Those data may be analyzed by methods of the invention to provide a report of the clinical significance of the mutations in the genome of the individual.
  • FIG. 3 diagrams a method 301 for analyzing mutations according to the invention. One benefit of a method 301 is an ability to provide information about the clinical significance of mutations in a patient's genome from data such as that provided by sequencing, e.g., in FASTA/FASTQ files, SAM/BAM files, or VCF files. Methods include obtaining 305 data representing a mutation in a genome of an individual by, for example, the sampling, sequencing, and mapping methods described above. A variant node in a graph database is used 311 to store a description of the mutation. A pointer is stored 317 in the variant node and the pointer points to an adjacent node that provides information about a clinical significance of the variant. Method 301 includes querying 323 the graph database to obtain information reporting the clinical significance of the mutation in the genome of the individual.
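  • The steps of method 301 may be sketched in Python as follows. This is a minimal in-memory illustration and not the claimed implementation: the class `Node`, the functions `store_variant`, `store_pointer`, and `query_significance`, the node labels, and the placeholder strings are all hypothetical, and the initial lookup of the variant node is a simple scan chosen for brevity.

```python
from typing import Dict, List


class Node:
    """A graph node holding properties and direct references to adjacent nodes."""

    def __init__(self, label: str, **properties: str) -> None:
        self.label = label
        self.properties: Dict[str, str] = dict(properties)
        self.adjacent: List["Node"] = []


def store_variant(graph: List[Node], description: str) -> Node:
    """Step 311: use a variant node to store a description of the mutation."""
    variant = Node("Variant", description=description)
    graph.append(variant)
    return variant


def store_pointer(graph: List[Node], variant: Node, significance: str) -> Node:
    """Step 317: store in the variant node a pointer to an adjacent node."""
    info = Node("ClinicalSignificance", text=significance)
    graph.append(info)
    variant.adjacent.append(info)
    return info


def query_significance(graph: List[Node], description: str) -> List[str]:
    """Step 323: report the clinical significance recorded for a mutation."""
    results: List[str] = []
    for node in graph:
        if node.label == "Variant" and node.properties["description"] == description:
            results.extend(
                adjacent.properties["text"]
                for adjacent in node.adjacent
                if adjacent.label == "ClinicalSignificance"
            )
    return results


# Placeholder data, made up for illustration (step 305 would supply real data).
graph: List[Node] = []
variant = store_variant(graph, "placeholder variant description")
store_pointer(graph, variant, "placeholder clinical significance")
report = query_significance(graph, "placeholder variant description")
```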
  • To illustrate operation of the invention, the following discusses obtaining mutation data in a VCF file, although one of skill in the art will readily see that the discussion is extensible to other formats. Using a workflow such as the NGS workflow illustrated in FIG. 1, a VCF file containing mutation data is obtained 305. The VCF file may be parsed to isolate its component pieces of information and to consider each piece of information for its own significance. There exist programs or application programming interfaces (APIs) for parsing VCF files 183 or a program may be written that parses data from the VCF file.
  • FIG. 4 gives a flow chart for a VCF parser. The flow chart shown in FIG. 4 represents the conceptual steps that may go into parsing a VCF file and extracting component information. Since the various action blocks and loops are defined according to the format of the VCF file as standardized (e.g., in Danecek, 2011, Bioinformatics 27:2156), each character of information that is extracted is treated for what it is. Thus, using VCF file 183 from FIG. 2 for reference, the “A” that appears on line 16, character 7 (counting 1 tab as 1 character) is treated as a nucleotide in the reference and the “A” that appears in line 17, character 17 is simply part of the word “PASS” in the FILTER column. It is further recognized that line 16 (and any subsequent line) is a single VCF record within a VCF file. Each record from the VCF file represents something found by sequencing the nucleic acid from the sample from the patient. Each patient, having numerous genes in their genome, has numerous alleles. Thus where carrier screening is performed for a patient, the VCF run (e.g., all the VCF files produced by the NGS sequencing) ultimately documents and shows the various alleles in the patient's genome that were probed for by the probes used.
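  • A parser of the kind outlined above may be sketched in Python as follows. The sketch is illustrative only and does not reproduce the flow chart of FIG. 4: the function name `parse_vcf` is hypothetical, the example lines are made up rather than taken from FIG. 2, and the column names are those of the public VCF specification.

```python
from typing import Dict, Iterable, List, Tuple


def parse_vcf(lines: Iterable[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split VCF text into meta-information lines and per-record dictionaries."""
    meta: List[str] = []
    columns: List[str] = []
    records: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line in the header section.
            meta.append(line[2:])
        elif line.startswith("#"):
            # TAB-delimited field definition line naming the columns.
            columns = line[1:].split("\t")
        else:
            if not columns:
                raise ValueError("data line found before field definition line")
            # Each data line is one VCF record; map each value to its column.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, records


# Hypothetical example content, made up for illustration.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tDP=60",
]
meta, records = parse_vcf(example_vcf)
```

  • Because each value is keyed by its column, the "A" in the REF column and the "A" inside "PASS" in the FILTER column are kept apart, as discussed above.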
  • FIG. 5 presents a model of data received from parsing a VCF. As just discussed, one run from the sequencing instruments can produce a plurality of VCF files. Each VCF file typically contains a plurality of VCF records. Those records ultimately relate back to the samples from which they were derived, and the samples can each contain a plurality of alleles. However, this relationship just described can also be described using an entity relationship diagram, or ERD.
  • FIG. 6 shows an entity relationship diagram (ERD) 601 of the data modeled by FIG. 5. An insight of the invention is that the ERD 601 satisfies the definition of a graph as used in graph theory within mathematics and computer science. Graph theory provides a well-known mathematical tool for representing systems. Graph theory is the mathematical study of properties of formal mathematical structures called graphs. In that context, a graph is a finite set of points, termed vertices or nodes, connected by links termed edges or arcs. A graph thus generally defines a set of vertices and a set of pairs of vertices, which are the edges of the graph. There are several types of graphs in graph theory. The type of a particular graph largely depends upon the features of its components, namely the attributes of its vertices and edges. For example, when the set of pairs includes only distinct elements, the graph is called a simple graph; when one or more pairs are connected by multiple edges the graph is called a multi-graph; when one or more vertices are connected to themselves the graph is called a pseudo-graph; when the edges are assigned with directions the graph is called a directed graph or a digraph; and when the pairs of vertices are unordered the graph is called undirected. Additional illustrative background on graph theory may be found in U.S. Pat. No. 8,463,895 to Arora; U.S. Pat. No. 8,462,161 to Barber; U.S. Pat. No. 7,523,117 to Zhang; U.S. Pat. No. 6,360,235 to Tilt; U.S. Pub. 2013/0222388 to McDonald; and U.S. Pub. 2007/0244675 to Shai, the contents of each of which are incorporated by reference.
  • It can be observed that ERD 601 presents a graph—a collection of vertices and edges—or another description would be a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships. This general-purpose, expressive structure allows graphs to model all kinds of phenomena such as NGS sequence files and their relationships to the source biological samples and genetic concepts like certain alleles. There are various dominant graph data models such as the property graph, Resource Description Framework (RDF) triples, and hypergraphs. In certain embodiments, a graph database used in the invention uses the property graph model.
  • A property graph has characteristics such as containing nodes and relationships (which are illustrated by ERD 601 in FIG. 6). The nodes contain properties (key-value pairs). Relationships are named and directed, and have a start and end node; and relationships can also contain properties. A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases according to the invention may be described or characterized according to the underlying storage, the processing engine, or both.
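  • The property graph characteristics listed above may be sketched, for illustration only, as a small in-memory Python structure with Create, Read, Update, and Delete methods. The class and method names, the relationship name DESCRIBES, and the property values are hypothetical; they are not the schema of ERD 601 or the interface of any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(eq=False)  # compare nodes by identity, not by property values
class Node:
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass(eq=False)
class Relationship:
    name: str    # relationships are named
    start: Node  # and directed: from a start node
    end: Node    # to an end node
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self.relationships: List[Relationship] = []

    def create_node(self, **properties: Any) -> Node:
        node = Node(dict(properties))
        self.nodes.append(node)
        return node

    def create_relationship(
        self, name: str, start: Node, end: Node, **properties: Any
    ) -> Relationship:
        relationship = Relationship(name, start, end, dict(properties))
        self.relationships.append(relationship)
        return relationship

    def read_relationships(self, name: str) -> List[Relationship]:
        return [r for r in self.relationships if r.name == name]

    def update_node(self, node: Node, **properties: Any) -> None:
        node.properties.update(properties)

    def delete_node(self, node: Node) -> None:
        # Deleting a node also removes relationships that start or end at it.
        self.nodes = [n for n in self.nodes if n is not node]
        self.relationships = [
            r for r in self.relationships
            if r.start is not node and r.end is not node
        ]


# Hypothetical example data, made up for illustration.
graph = PropertyGraph()
record = graph.create_node(kind="VcfRecord", ref="A", alt="G")
allele = graph.create_node(kind="Allele", name="placeholder_allele")
link = graph.create_relationship("DESCRIBES", record, allele, source="example")
```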
  • Regarding the underlying storage, some graph databases use native graph storage that is optimized and designed for storing and managing graphs. Some databases serialize the graph data into a relational database, an object-oriented database, or some other general-purpose data store and present graph database functionality on top of that.
  • Regarding the processing engine, some graph databases use index-free adjacency, meaning that connected nodes physically “point” to each other in the database. More broadly, any database that from the user's perspective behaves like a graph database (i.e., exposes a graph data model through CRUD operations) qualifies as a graph database. In certain embodiments, however, the invention provides the significant performance advantages of index-free adjacency. Native graph processing may describe graph databases that use index-free adjacency.
  • A benefit of native graph storage is that it is engineered for performance and scalability. A benefit of non-native graph storage is that it typically depends on a mature non-graph backend (such as MySQL) whose production characteristics are well understood by operations teams. Native graph processing (index-free adjacency) benefits traversal performance.
  • In the graph data model, relationships are included as entities that themselves are stored as objects, whereas other database management systems require connections between entities to be inferred using contrived properties such as foreign keys, or out-of-band processing like map-reduce. By assembling the simple abstractions of nodes and relationships into connected structures, graph databases provide arbitrarily sophisticated models that map closely to the problem domain (e.g., FIG. 5). The resulting models are simpler and at the same time more expressive than those produced using traditional relational databases and the other NOSQL stores.
  • Any suitable graph database can be used to implement the systems and methods described herein. Exemplary graph databases may include Microsoft Infinite Graph, Titan, OrientDB, Neo4j, *dex, Franz Inc.'s AllegroGraph, and Hypergraphdb. Preferably, systems and methods of the invention employ a graph compute engine.
  • A graph compute engine is a technology that enables global graph computational algorithms to be run against large datasets. Graph compute engines are designed to do things like identify clusters in the data, or answer questions about how entities are connected, and particularly to trace across a series of linked ideas (e.g., SNP to allele to genetic condition to a literature reference providing a clinical significance of the allele containing the SNP).
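  • Tracing across a series of linked ideas, as described above, may be illustrated by a breadth-first search over an adjacency mapping. The Python sketch below is a toy example under stated assumptions: the function name `trace`, the node naming scheme, and every node identifier are hypothetical placeholders, and a production graph compute engine would operate on far larger data.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical chain: SNP -> allele -> genetic condition -> literature reference.
edges: Dict[str, List[str]] = {
    "SNP:example_snp": ["Allele:example_allele"],
    "Allele:example_allele": ["Condition:example_condition"],
    "Condition:example_condition": ["Reference:example_citation"],
    "Reference:example_citation": [],
}


def trace(
    graph: Dict[str, List[str]], start: str, goal_prefix: str
) -> Optional[List[str]]:
    """Return the shortest path from start to a node whose name has goal_prefix."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node.startswith(goal_prefix):
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no node with the requested prefix is reachable


path = trace(edges, "SNP:example_snp", "Reference:")
```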
  • A variety of different types of graph compute engines exist. Most notably there are in-memory/single machine graph compute engines like Cassovary, and distributed graph compute engines like Pegasus or Giraph. A distributed graph compute engine may be structured as described in Malewicz, et al., 2010, Pregel: a system for large-scale graph processing, Proceedings ACM SIGMOD Int Conf Management Data 135-146. Also see Rodriguez and Neubauer, 2010, Constructions from Dots and Lines, Bulletin Am Soc Inf Sci Tech 36(6):35-41.
  • In preferred embodiments, systems and methods of the invention store mutation descriptions using a graph database and analyze mutations in graph space.
  • To achieve the benefits potentially offered by using a graph database, a genetic analysis pipeline and methodology according to the invention uses nodes as well as named and directed relationships, with both the nodes and relationships serving as containers for properties. With continuing reference to FIG. 6, nodes and relationships are illustrated and index-free adjacency is discussed.
  • A database engine that utilizes index-free adjacency is one in which each node maintains direct references to its adjacent nodes. Each node thus acts as a micro-index of other nearby nodes, which is much cheaper than using global indexes. It means that query times are independent of the total size of the graph, and are instead simply proportional to the amount of the graph searched.
  • A non-native graph database engine, in contrast, uses (global) indexes to link nodes together. These indexes add a layer of indirection to each traversal, thereby incurring greater computational cost. Proponents of native graph processing argue that index-free adjacency is crucial for fast, efficient graph traversals. To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
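  • The comparison above can be made concrete with a worked example. The figures below assume base-2 logarithms and unit cost per index probe or per pointer hop; both assumptions, and the values of n and m, are chosen only for illustration.

```latex
\begin{aligned}
C_{\text{indexed}}(m, n) &= O(m \log n) \\
C_{\text{index-free}}(m) &= O(m) \\
n = 10^{6},\ m = 5:\quad m \log_2 n &\approx 5 \times 20 = 100
  \quad\text{versus}\quad m = 5
\end{aligned}
```

  • Under these assumptions, a five-step traversal of a graph of one million nodes costs on the order of 100 index probes with global indexes, compared with 5 pointer hops with index-free adjacency, and the gap widens as n grows.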
  • Index-free adjacency provides lower-cost “joins.” With index-free adjacency, bidirectional joins are effectively pre-computed and stored in the database as relationships. In contrast, when using indexes to fake connections between records, there is no actual relationship stored in the database. This becomes problematic for traversals in the “opposite” direction from the one for which the index was constructed. Such traversals require a brute-force search through the index, which is an O(n) operation, so joins like this are simply too costly to be of any practical use. Index-free adjacency provides surprising benefits in the context of reporting clinical significance of the results of NGS-based carrier screening, in that the concepts involved naturally lend themselves to representation using the pre-computed bidirectional joins offered by index-free adjacency.
  • For at least these reasons, systems and methods of certain embodiments of the invention use index-free adjacency to ensure high-performance traversals. FIG. 6 shows how relationships eliminate the need for index lookups. A graph database can use relationships, not indexes, for fast traversals.
  • In a general-purpose graph database, relationships can be traversed in either direction (tail to head, or head to tail) extremely cheaply. Starting from a given VcfRun or a given allele, a graph processing engine can find the related other one of those two at a very low computational cost.
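  • Traversal in either direction may be illustrated with nodes that hold direct references to their neighbors at both ends of every relationship, so that no index is consulted when walking the graph. The Python sketch below is an assumption-laden toy: the class `Node`, the functions `relate` and `reachable`, and the relationship names PRODUCED, CONTAINS, DERIVED_FROM, and HAS are hypothetical, and the chain of entities only loosely follows the data model discussed above.

```python
from typing import List, Tuple


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        # Direct references to neighbors, kept for both directions.
        self.outgoing: List[Tuple[str, "Node"]] = []
        self.incoming: List[Tuple[str, "Node"]] = []


def relate(start: Node, name: str, end: Node) -> None:
    """Record one named, directed relationship at both of its ends."""
    start.outgoing.append((name, end))
    end.incoming.append((name, start))


def reachable(node: Node, direction: str) -> List[str]:
    """Names of nodes reached by following references 'out' or 'in'."""
    found: List[str] = []
    stack = [node]
    seen = {id(node)}
    while stack:
        current = stack.pop()
        references = current.outgoing if direction == "out" else current.incoming
        for _, neighbor in references:
            if id(neighbor) not in seen:
                seen.add(id(neighbor))
                found.append(neighbor.name)
                stack.append(neighbor)
    return found


# Hypothetical chain of entities and relationship names.
run, vcf_file, vcf_record = Node("VcfRun"), Node("VcfFile"), Node("VcfRecord")
sample, allele = Node("Sample"), Node("Allele")
relate(run, "PRODUCED", vcf_file)
relate(vcf_file, "CONTAINS", vcf_record)
relate(vcf_record, "DERIVED_FROM", sample)
relate(sample, "HAS", allele)

from_run = reachable(run, "out")      # tail to head
from_allele = reachable(allele, "in")  # head to tail
```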
  • In certain embodiments, systems and methods of the invention use native graph storage. If index-free adjacency is the key to high-performance traversals, queries, and writes, then one key aspect of the design of a graph database is the way in which graphs are stored. An efficient, native graph storage format supports extremely rapid traversals for arbitrary graph algorithms, an important reason for using graphs.
  • A graph database such as Neo4j stores graph data in a number of different store files. Each store file may contain the data for a specific part of the graph (e.g., nodes, relationships, properties). The division of storage responsibilities—particularly the separation of graph structure from property data—facilitates performant graph traversals, even though it means the user's view of their graph and the actual records on disk are structurally dissimilar. FIGS. 7-10 illustrates a node and relationship storage structure as implemented by a graph database of the invention.
  • FIG. 7 diagrams a high-level architecture 701 of systems of certain embodiments of the invention. From the bottom-up, systems may operate using files on disk 733. Record files 739 provide a basic level of storage to support the file system cache 741. The object cache 747 is kept at a high level for rapid access as discussed herein. Additionally, the disks 733 can store a transaction log 725, which is written to by a transaction management module 721. A graph database such as Neo4j includes or provides a traversal API 755, core API 705, and a query language 713 such as Cypher.
  • FIG. 8 illustrates the structure of nodes 801 and relationships 809 on disk as may be deployed within a physical structure of systems of the invention. The node store file stores node records. Every node created in the user-level graph ends up in the node store. Preferably, the node store is a fixed-size record store. While the precise values or traits may be varied as necessary or best-suited to the invention, in the illustrated embodiment, each node record 801 is nine bytes in length. Fixed-size records enable fast lookups for nodes in the store file. To illustrate, if a node has id 100, then it can be known that its record begins 900 bytes into the file. Based on this format, the database can directly compute a record's location, at cost O(1), rather than performing a search, which would be cost O(log n). It is noted that fixed-size record stores provide an improvement to a computer in the sense that information storage efficiently exploits the physical storage device for very fast retrieval and very fast look-ups. Thus, genetic queries according to methods and systems of the invention actually proceed faster at a hardware level than prior art approaches; the computer itself is sped up by the implementations described.
  • The first byte of a node 801 record is the in-use flag. This tells the database whether the record is currently being used to store a node. The next four bytes represent the ID of the first relationship connected to the node, and the last four bytes represent the ID of the first property for the node. The node record is lightweight and contains just pointers to lists of relationships and properties.
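  • The nine-byte node record and the offset arithmetic described above can be sketched with Python's struct module. This is a minimal illustration rather than the actual on-disk format of any product: the big-endian byte order, the in-memory bytearray standing in for the node store file, and the helper names are assumptions made for the example.

```python
import struct

# Assumed layout: 1-byte in-use flag, 4-byte first-relationship ID,
# 4-byte first-property ID. Big-endian (">") is an assumption; it also
# disables padding so the record is exactly nine bytes.
NODE_RECORD = struct.Struct(">BII")

def node_offset(node_id):
    """Compute a record's byte offset directly from its ID (no search needed)."""
    return node_id * NODE_RECORD.size

def write_node(store, node_id, in_use, first_rel_id, first_prop_id):
    """Write one fixed-size node record into the store at its computed offset."""
    NODE_RECORD.pack_into(store, node_offset(node_id),
                          1 if in_use else 0, first_rel_id, first_prop_id)

def read_node(store, node_id):
    """Read back (in_use, first_rel_id, first_prop_id) for a node ID."""
    in_use, first_rel_id, first_prop_id = NODE_RECORD.unpack_from(
        store, node_offset(node_id))
    return bool(in_use), first_rel_id, first_prop_id

# A bytearray stands in for the node store file (room for 200 records).
node_store = bytearray(NODE_RECORD.size * 200)
write_node(node_store, 100, True, first_rel_id=7, first_prop_id=42)
record = read_node(node_store, 100)
```

With this layout, the record for node 100 begins at byte 900, matching the example given for FIG. 8.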
  • Correspondingly, relationships are stored in a relationship store file. Like the node store, the relationship store consists of fixed-sized records; in this case each relationship record 809 is 33 bytes long. Each relationship record 809 contains the IDs of the nodes at the start and end of the relationship, a pointer to the relationship type (which is stored in the relationship type store), and pointers for the next and previous relationship records for each of the start and end nodes. These last pointers are part of what is often called the relationship chain.
  • The node and relationship stores are concerned only with the structure of the graph, not its property data. Both stores use fixed-sized records so that any individual record's location within a store file can be rapidly computed given its ID. The significance can hardly be overstated: the described structure improves the operation of the hardware itself.
  • Using the described structures, given the way that the various store files are stored on disk, graph processing operations are low-cost. Each of the node records contains a pointer to that node's first property and first relationship in a relationship chain. To read a node's properties, one may follow the singly linked list structure beginning with the pointer to the first property. To find a relationship for a node, one may follow that node's relationship pointer to its first relationship and then follow the doubly linked list of relationships for that particular node (that is, either the start node doubly linked list, or the end node doubly linked list) until the relationship of interest is found.
  • Having found the record for the relationship of interest, that relationship's properties can be read (if there are any) using the same singly linked list structure as is used for node properties, or the node records can be examined for the two nodes the relationship connects using its start node and end node IDs. These IDs, multiplied by the node record size, give the immediate offset of each node in the node store file.
  • In some embodiments, systems and methods of the invention use doubly-linked lists in the relationship store. It is noted that a relationship record 809 can be thought of as “belonging” to two nodes: the start node and the end node of the relationship. To avoid storing two relationship records and to make the relationship record belong to both the start node and the end node, there are pointers (aka record IDs) for two doubly linked lists: one is the list of relationships visible from the start node; the other is the list of relationships visible from the end node. This provides rapid iteration through that list in either direction, and efficient insertion or deletion of relationships.
  • Choosing to follow a different relationship involves iterating through a linked list of relationships until a candidate matching the correct type or having some matching property value is found. The found relationship gives a new ID. The new ID is multiplied by record size as a new pointer and the traversal continues. With fixed-sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as would be required if faking a graph in a non-graph native database). First, from a given node record, the first record in the relationship chain is located by computing its offset into the relationship store—that is, by multiplying its ID by the fixed relationship record size (e.g., 33 bytes). This gets to the right record in the relationship store. Then, from the relationship record, look in the second node field to find the ID of the second node. Multiply that ID by the node record size (e.g., nine bytes) to locate the correct node record in the store.
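  • The pointer-chasing traversal described above can be sketched as follows. The sketch is illustrative only. The text gives the record sizes (nine and 33 bytes) and the fields of a relationship record; the inclusion of an in-use byte and a first-property pointer to reach 33 bytes, the big-endian byte order, the 0xFFFFFFFF "no record" marker, the numeric relationship type codes, and all function names are assumptions for this example. Self-referencing relationships are not handled.

```python
import struct

NIL = 0xFFFFFFFF                # assumed marker for "no record"
NODE = struct.Struct(">BII")    # in-use, first relationship ID, first property ID
# in-use, start node, end node, type, start_prev, start_next,
# end_prev, end_next, first property (1 + 8 * 4 = 33 bytes)
REL = struct.Struct(">B8I")

node_store = bytearray()
rel_store = bytearray()

def add_node():
    """Append a node record; its ID is its position in the store."""
    node_id = len(node_store) // NODE.size
    node_store.extend(NODE.pack(1, NIL, NIL))
    return node_id

def read_node(node_id):
    return NODE.unpack_from(node_store, node_id * NODE.size)

def read_rel(rel_id):
    return list(REL.unpack_from(rel_store, rel_id * REL.size))

def write_rel(rel_id, fields):
    REL.pack_into(rel_store, rel_id * REL.size, *fields)

def add_rel(start, end, rel_type):
    """Append a relationship and push it onto the head of both nodes' chains."""
    rel_id = len(rel_store) // REL.size
    rel_store.extend(REL.pack(1, start, end, rel_type, NIL, NIL, NIL, NIL, NIL))
    new = read_rel(rel_id)
    for node_id, next_slot in ((start, 5), (end, 7)):
        in_use, old_head, first_prop = read_node(node_id)
        if old_head != NIL:
            old = read_rel(old_head)
            # Point the old head's "previous" pointer (for this node) at the new record.
            old[4 if old[1] == node_id else 6] = rel_id
            write_rel(old_head, old)
        new[next_slot] = old_head
        NODE.pack_into(node_store, node_id * NODE.size, in_use, rel_id, first_prop)
    write_rel(rel_id, new)
    return rel_id

def neighbors(node_id, rel_type):
    """Walk a node's relationship chain, yielding nodes connected by rel_type."""
    _, rel_id, _ = read_node(node_id)
    while rel_id != NIL:
        _, start, end, rtype, _, start_next, _, end_next, _ = read_rel(rel_id)
        if rtype == rel_type:
            yield end if start == node_id else start
        rel_id = start_next if start == node_id else end_next

HAS_VARIANT, DESCRIBED_IN = 0, 1   # hypothetical type codes
allele, variant, lit_a, lit_b = add_node(), add_node(), add_node(), add_node()
add_rel(allele, variant, HAS_VARIANT)
add_rel(variant, lit_a, DESCRIBED_IN)
add_rel(variant, lit_b, DESCRIBED_IN)
```

Calling neighbors(variant, HAS_VARIANT) walks the same chain from the end-node side and reaches the allele, showing that a relationship can be followed in either direction without any index lookup.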
  • In addition to the node and relationship stores, which contain the graph structure, systems include the property store files. These store the user's key-value pairs. Properties may be attached to both nodes and relationships. The property stores, therefore, are referenced from both node and relationship records. Records in the property store are physically stored in a file. As with the node and relationship stores, property records are of a fixed size. Each property record consists of four property blocks and the ID of the next property in the property chain. Properties are held as a singly linked list on disk as compared to the doubly linked list used in relationship chains. Each property occupies between one and four property blocks—a property record can, therefore, hold four properties. A property record holds the property type and a pointer to the property index file, which is where the property name is stored. For each property's value, the record contains either a pointer into a dynamic store record or an inlined value. The dynamic stores allow for storing large property values. A graph database may optimize storage where it inlines some properties into the property store file directly. This happens when property data can be encoded to fit in one or more of a record's four property blocks. In practice this means that data like variant calls can be inlined in the property store file directly, rather than being pushed out to the dynamic stores. This results in reduced I/O operations and improved throughput, because only a single file access is required.
  • In addition to in-lining certain compatible property values, a graph database can also reference long values as property names (e.g., complete journal article titles and citations). In such cases, property names are indirectly referenced from the property store through the property index file. The property index allows all properties with the same name to share a single record, and thus for repetitive graphs achieves considerable space and I/O savings.
  • To improve the performance characteristics of mechanical/electronic mass storage devices, many graph databases use in-memory caching to provide probabilistic low latency access to the graph. Neo4j uses a two-tiered caching architecture to provide this functionality.
  • The lowest tier in the Neo4j caching stack is the file system cache 741. The file system cache 741 is a page-affined cache, meaning the cache divides each store into discrete regions, and then holds a fixed number of regions per store file. The actual amount of memory to be used to cache the pages for each store file can be fine-tuned, though in the absence of input from the user, Neo4j will use sensible default values based on the capacity of the underlying hardware. Pages are evicted from the cache based on a least-frequently-used (LFU) cache policy.
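  • A least-frequently-used eviction policy of the kind mentioned above can be sketched as follows. This is a simplified illustration and not the caching code of Neo4j or any other product; the class name, the load_page callback, and the tie-breaking behavior are assumptions for the example.

```python
from collections import Counter

class LFUPageCache:
    """Holds a fixed number of pages and evicts the least frequently used one."""

    def __init__(self, capacity, load_page):
        self.capacity = capacity
        self.load_page = load_page   # hypothetical callback that reads a page from disk
        self.pages = {}              # page number -> page contents
        self.hits = Counter()        # page number -> access count

    def get(self, page_no):
        if page_no not in self.pages:
            if len(self.pages) >= self.capacity:
                # Evict the cached page with the fewest recorded accesses.
                victim = min(self.pages, key=lambda p: self.hits[p])
                del self.pages[victim]
                del self.hits[victim]
            self.pages[page_no] = self.load_page(page_no)
        self.hits[page_no] += 1
        return self.pages[page_no]

# Track which pages are actually read from "disk".
disk_reads = []

def fake_load(page_no):
    disk_reads.append(page_no)
    return f"page-{page_no}"

cache = LFUPageCache(capacity=2, load_page=fake_load)
cache.get(0)
cache.get(0)
cache.get(1)
cache.get(2)   # cache is full: page 1 (one access) is evicted, page 0 (two) stays
cache.get(0)   # served from memory, no new disk read
```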
  • The file system cache 741 is particularly beneficial when related parts of the graph are modified at the same time such that they occupy the same page. This is a common pattern for writes, where whole sub-graphs (such as a patient's NGS results and associated carrier screening report) are written to disk in a single operation, rather than discrete nodes and relationships.
  • A graph database may be manipulated through a query language, which can be either imperative or declarative. One such language is the Cypher query language. Cypher is a declarative graph query language for Neo4j that allows for expressive and efficient querying and updating of the graph store. Cypher contains a variety of clauses, some of the most common of which include MATCH and WHERE. These clauses are slightly different than in SQL. MATCH is used for describing the structure of the pattern searched for, primarily based on relationships, and WHERE is used to add additional constraints to patterns. Cypher additionally contains clauses for writing, updating, and deleting data. CREATE and DELETE are used to create and delete nodes and relationships. SET and REMOVE are used to set and remove property values and labels on nodes.
  • Systems and methods of the invention provide very rapid transactions, idiomatic queries, and an excellent ability to “scale up” with very large data sizes. The topic of scale has become more important as data volumes have grown. Graph databases don't suffer the same latency problems as traditional relational databases, where the more data that exists in tables—and in indexes—the longer the join operations. With a graph database, most queries follow a pattern whereby an index is used simply to find a starting node (or nodes). The remainder of the traversal then uses a combination of pointer chasing and pattern matching to search the data store. What this means is that, unlike relational databases, performance does not depend on the total size of the dataset, but only on the data being queried. This leads to performance times that are nearly constant (i.e., are related to the size of the result set), even as the size of the dataset grows. Throughput, speed, and scalability of graph databases make them suited to genetic analysis and reporting. Given the input/output-intensive nature of such sequencing, variant-calling, genotyping, and clinical reporting, a typical operation reads and writes a set of related data. In other words, the application performs multiple operations on a logical sub-graph within the overall dataset. With a graph database such multiple operations can be rolled up into larger, more cohesive operations. Further, with a graph-native store, executing each operation takes less computational effort than the equivalent relational operation. Graphs scale by doing less work for the same outcome.
  • FIG. 9 illustrates the use of a variant node 901 in a graph database to store a description of a mutation. The first byte of the variant node 901 record is set to show that node 901 is in use. The next four bytes of node 901 represent the ID of the first relationship connected to the node. Through the ID of that first relationship, node 901 thus includes a pointer to an adjacent node (adjacent by definition, since the relationship is identified by the four bytes in node 901). The last four bytes of node 901 represent the ID of the first property for the node.
  • To read the first property for node 901, one may follow the singly linked list structure to the appropriate property record in the property store. Property records in the property store are of a fixed size and each property record consists of four property blocks and the ID of the next property in the chain. The property record holds the property type (here, “variant”) and a pointer to the property index file, which is where the property name is stored. For each property's value, the record either points to a dynamic store or an inline record. Here, the parser operating via the logic mapped in FIG. 4 produces a record of a mutation (by parsing that record from the VCF file) and can store that mutation in the property index file. Thus the property index file for a variant node preferably includes a description of a mutation.
  • A description of a mutation may be provided according to a systematic nomenclature. For example, a variant can be described by a systematic comparison to a specified reference which is assumed to be unchanging and identified by a unique label such as a name or accession number. For a given gene, coding region, or open reading frame, the A of the ATG start codon is denoted nucleotide +1 and the nucleotide 5′ to +1 is −1 (there is no zero). A lowercase g, c, or m prefix, set off by a period, indicates genomic DNA, cDNA, or mitochondrial DNA, respectively.
  • A systematic name can be used to describe a number of variant types including, for example, substitutions, deletions, insertions, and variable copy numbers. A substitution name starts with a number followed by a “from to” markup. Thus, 199A>G shows that at position 199 of the reference sequence, A is replaced by a G. A deletion is shown by “del” after the number. Thus 223delT shows the deletion of T at nt 223 and 997-999del shows the deletion of three nucleotides (alternatively, this mutation can be denoted as 997-999delTTC). In short tandem repeats, the 3′ nt is arbitrarily assigned; e.g. a TG deletion is designated 1997-1998delTG or 1997-1998del (where 1997 is the first T before C). Insertions are shown by ins after an interval. Thus 200-201insT denotes that T was inserted between nts 200 and 201. Variable short repeats appear as 997(GT)N−N′. Here, 997 is the first nucleotide of the dinucleotide GT, which is repeated N to N′ times in the population.
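  • For illustration, the simpler name forms above (substitution, deletion, and insertion, with an optional g., c., or m. prefix) can be parsed with a regular expression. This is a hedged sketch and not a complete implementation of any nomenclature standard: the pattern, the function name, and the returned dictionary keys are assumptions for the example, and intronic names and variable short repeats are deliberately not handled.

```python
import re

# Assumed pattern covering only: 199A>G, 223delT, 997-999del, 997-999delTTC, 200-201insT,
# each optionally preceded by a g., c., or m. prefix.
VARIANT_PATTERN = re.compile(
    r"^(?:(?P<prefix>[gcm])\.)?"
    r"(?P<start>\d+)(?:-(?P<end>\d+))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"|del(?P<deleted>[ACGT]*)"
    r"|ins(?P<inserted>[ACGT]+))$"
)

def parse_variant(name):
    """Split a systematic variant name into its components."""
    match = VARIANT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unsupported variant name: {name!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if match["ref"]:
        kind, detail = "substitution", (match["ref"], match["alt"])
    elif match["inserted"]:
        kind, detail = "insertion", match["inserted"]
    else:
        # A deletion may or may not spell out the deleted bases.
        kind, detail = "deletion", match["deleted"] or None
    return {"prefix": match["prefix"], "start": start, "end": end,
            "kind": kind, "detail": detail}

substitution = parse_variant("199A>G")
deletion = parse_variant("997-999del")
insertion = parse_variant("200-201insT")
mitochondrial = parse_variant("m.593T>C")
```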
  • Variants in introns can use the intron number with a positive number indicating a distance from the G of the invariant donor GU or a negative number indicating a distance from an invariant G of the acceptor site AG. Thus, IVS3+1C>T shows a C to T substitution at nt+1 of intron 3. In any case, cDNA nucleotide numbering may be used to show the location of the mutation, for example, in an intron. Thus, c.1997+1C>T denotes the C to T substitution at nt+1 after nucleotide 1997 of the cDNA. Similarly, c.1997-2A>C shows the A to C substitution at nt-2 upstream of nucleotide 1997 of the cDNA. When the full length genomic sequence is known, the mutation can also be designated by the nt number of the reference sequence.
  • Relative to a reference, a patient's genome may vary by more than one mutation, or by a complex mutation that is describable by more than one character string or systematic name. The invention further provides systems and methods for describing more than one variant using a systematic name. For example, two mutations in the same allele can be listed within brackets as follows: [1997G>T; 2001A>C]. Systematic nomenclature is discussed in den Dunnen & Antonarakis, 2003, Mutation Nomenclature, Curr Prot Hum Genet 7.13.1-7.13.8 as well as in Antonarakis and the Nomenclature Working Group, 1998, Recommendations for a nomenclature system for human gene mutations, Human Mutation 11:1-3. By such means, a mutation can be described in the property index file of a variant node.
  • While described here with reference to FIG. 9 as a “variant node”, it will be appreciated that node 901 can be instantiated or used as any type, with the type being stored in the property store.
  • FIG. 10 illustrates a simple example in which an allele node is used to show that an allele includes a certain mutation by representing the mutation using a variant node and representing a relationship between the allele node and the variant node with a “HAS_VARIANT” type relationship. This illustrates the simplicity of connecting alleles to variants using relationships. After the variant is created, literature references can be added to the variant.
  • FIG. 11 shows elements of a graph database in which a variant has been connected to two nodes, each for a literature reference. From this setup emerges one of the powerful applications of a graph database in processing results from NGS sequencing data. If variant changes are made, those variant changes can be tracked within systems of the invention without requiring upsetting the structure of the existing database.
  • To illustrate the invention by an example, a patient sample could be sequenced via NGS technologies and the sequencing results could include, in a VCF file, a description of a mutation in that patient's mitochondrial genome. A variant node is used and a property of that node (e.g., in a property index file) is used to describe that mutation as m.593T>C. A relationship is created to show that the mutation is described in a literature reference. The relationship is a pointer to a LitRef node and the LitRef node points to a property index file with information about the literature reference. The property index file contains Zhang et al., 2011, Is mitochondrial tRNAphe variant m.593T>C a synergistically pathogenic mutation in Chinese LHON families with m.11778G>A?, PLoS ONE 6(10):e26511. Based on the synergistic pathogenesis alluded to by the literature reference, a geneticist or curator may deem it important to flag instances in which a patient has both m.593T>C and m.11778G>A in their genome. This example illustrates the real power of a graph database and index-free adjacency. A query can be initiated that starts at the LitRef node just described and traverses to the variant node. That query can traverse to the sample node for that patient and even to a node for the patient. That query can then, by its own terms, traverse from the patient or sample node examining for the presence of a second variant node representing m.11778G>A. The query can be programmed to, in the absence of said second variant node, classify the mutation as benign. The query can be programmed to, in the presence of said second variant node, classify the mutation as pathogenic. Intermediate labels or other categories can also be used. Since the query is traversing across a graph database, a comprehensive index-based look-up is not required as would be required in prior art RDBMSs.
  • It is important to note that the “graph” of the described graph databases follows the counter-intuitive path of connecting things of un-related categories. Although it is not the primary structure or purpose described herein, one may imagine embodiments in which a graph has a horizontal structure connecting entities that are essentially similar in nature so that the database maps a natural phenomenon. For example, a graph database could represent protein interactions using the edges (aka pointers or relationships) to represent interactions between proteins and thus influxes of data would expand the graph “horizontally”. However, the invention is unlike the protein interaction example in that the graph expands “vertically” outside of a set of natural phenomena. Since a sample can have a node, the graph can reach to laboratory management systems and receive from or provide information to, for example, sample chain of custody modules. With NGS results from that sample, the graph can leap vertically to a genetic plane and represent human mutations that are being discovered. For an NGS carrier screening application, the graph can reach vertically into a different category to represent medical literature, and can go on to be used in patient reports. The power of this novel vertical structure is shown by the illustration of use of the invention for reporting carrier screening results.
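  • The traversal in this example can be sketched with a small in-memory graph. The sketch is an illustration only and does not use a graph database engine: the Graph class, the relationship names HAS_SAMPLE, HAS_VARIANT, and DESCRIBED_IN as used here, the simplification of attaching variants directly to samples, and the patient and sample identifiers are all hypothetical.

```python
from collections import defaultdict

class Graph:
    """Toy graph that stores each relationship on both of its end points."""

    def __init__(self):
        self.outgoing = defaultdict(list)   # node -> [(rel_type, target)]
        self.incoming = defaultdict(list)   # node -> [(rel_type, source)]

    def relate(self, source, rel_type, target):
        self.outgoing[source].append((rel_type, target))
        self.incoming[target].append((rel_type, source))

    def forward(self, node, rel_type):
        return [t for r, t in self.outgoing[node] if r == rel_type]

    def backward(self, node, rel_type):
        return [s for r, s in self.incoming[node] if r == rel_type]

def classify(graph, litref, primary, partner):
    """Start at a LitRef node, walk back to patients, and label each one by
    whether the partner variant is also present in the same sample."""
    results = {}
    for variant in graph.backward(litref, "DESCRIBED_IN"):
        if variant != primary:
            continue
        for sample in graph.backward(variant, "HAS_VARIANT"):
            has_partner = partner in graph.forward(sample, "HAS_VARIANT")
            for patient in graph.backward(sample, "HAS_SAMPLE"):
                results[patient] = "pathogenic" if has_partner else "benign"
    return results

g = Graph()
v593, v11778 = ("Variant", "m.593T>C"), ("Variant", "m.11778G>A")
zhang = ("LitRef", "Zhang et al., 2011")
g.relate(v593, "DESCRIBED_IN", zhang)

# Two made-up patients: the first carries both variants, the second only m.593T>C.
g.relate(("Patient", "P1"), "HAS_SAMPLE", ("Sample", "S1"))
g.relate(("Sample", "S1"), "HAS_VARIANT", v593)
g.relate(("Sample", "S1"), "HAS_VARIANT", v11778)
g.relate(("Patient", "P2"), "HAS_SAMPLE", ("Sample", "S2"))
g.relate(("Sample", "S2"), "HAS_VARIANT", v593)

labels = classify(g, zhang, v593, v11778)
```

Each hop touches only the relationships stored on the node being visited, which mirrors the behavior the example attributes to index-free adjacency.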
  • FIG. 12 illustrates a graph database in which a variant has been connected to two nodes, each for a literature reference and in which updated information about the variant has been introduced in two changes. For example, node 17451 may represent a specific mutation such as a SNP (e.g., G at a certain position). Node 17454 could be created when A is observed at that position.
  • Systems and methods of the invention support a plurality of different use cases and applications. For example, if a graph database is used in support of NGS carrier screening, one capability that will emerge is support for evaluating and reporting allele frequency.
  • For example, where a practitioner wants to know, across all included research consenting data, what is the frequency of a certain allele, the graph database can easily be queried for that.
  • FIG. 13 presents an example database that may be queried for allele frequency.
  • For example, in Cypher, the following (pseudo) code produces the desired result.
  • MATCH (a:Allele)<--(sd:SampleData)-->(s:Sample)-->(p:Patient) RETURN a, count(DISTINCT p)
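  • For readers less familiar with graph query languages, the same count can be sketched in plain Python over an in-memory stand-in for the graph of FIG. 13. The record layout, identifiers, and function name below are hypothetical and serve only to show what the query computes: the number of distinct patients per allele.

```python
from collections import defaultdict

# Each SampleData record links one allele to one sample (made-up identifiers).
sample_data = [
    ("sd1", "alleleA", "s1"),
    ("sd2", "alleleA", "s2"),
    ("sd3", "alleleA", "s3"),
    ("sd4", "alleleB", "s3"),
]

# Each sample belongs to one patient; patient p1 contributed two samples.
sample_to_patient = {"s1": "p1", "s2": "p1", "s3": "p2"}

def patients_per_allele(sample_data, sample_to_patient):
    """Count distinct patients reachable from each allele via SampleData and Sample."""
    patients = defaultdict(set)
    for _sd_id, allele, sample in sample_data:
        patients[allele].add(sample_to_patient[sample])
    return {allele: len(found) for allele, found in patients.items()}

counts = patients_per_allele(sample_data, sample_to_patient)
```

Collecting patients in a set plays the role of the distinct count, so a patient with several samples is counted once per allele.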
  • Another illustrative use case for application of a graph database is the curation of variants, as was illustrated by FIGS. 10-12. The curation of variants involves taking variants (i.e. genetic mutations) that have been picked up through a sequencing platform and then looking through the literature for references to evaluate how common the variant is and whether it is identified as pathogenic, benign, or somewhere in-between. This can be supported and modeled by tracking three things: connecting allele to a variant; variant and variant changes; and literature references per variant. To illustrate, a geneticist may review a patient's NGS sequencing results and observe the presence of a poly-T variant. The geneticist may connect this variant to an allele of the cystic fibrosis transmembrane conductance regulator (CFTR) gene located on the long arm of chromosome 7 (e.g., as shown in FIG. 10). The geneticist may further observe that this variant is described by a literature reference and connect the variant object to two different LitRef objects such as one for each of Rowntree and Harris, The phenotypic consequences of CFTR mutations, Ann Hum Gen 67:471-485 (2003) and Kreindler, Cystic fibrosis: exploiting its genetic basis in the hunt for new therapies, Pharmacol Ther 125(2):219-229 (2010) (e.g., according to the diagram of FIG. 11). Moreover the geneticist may observe that the mutation (the poly-T) is a novel poly-T variant in the acceptor splice site of intron 8 of CFTR in cis with R117H (i.e., c.350G>A based on GenBank cDNA reference sequence NM_000492.3). In this instance, the geneticist may want to update the graph for this patient by connecting the poly-T mutation to a variant object for c.350G>A (e.g., as seen in FIG. 12). To further illustrate, the chain of updated variants may reveal that the patient has an allele with the T5 poly-T variant, which evidence suggests plays a role in pathogenic alternate splicing or exon skipping. 
Moreover, the geneticist may further consider the data and determine that, in fact, the patient's allele includes a T6 form of the poly-T variant and may update the variant nodes to so reflect. Here, with the addition of a T6 node, other content need not be modified. The geneticist may add a LitRef node for Huang, et al., Comparative analysis of common CFTR polymorphisms poly-T, TG-repeats and M470V in a healthy Chinese population, World J Gastroenterol 14(12):1925-30 (2008). Thus if the NGS screening gave results indicating a R117H with T6 variant, methods and systems of the invention can be used to relate this clinical results data to the existing infrastructure of medical information on one level and back to the patient via the sample (through the VCF files and instrument run) on another level. Since a graph database, preferably with index-free adjacency, is used for each node, those connections can be traversed to provide a report to the patient's attending physician, where the report shows the patient to be R117H T6 and gives the relevant literature with information about treatment and outcomes. Since a graph database is used, the traversals are very fast and traversal times do not increase with increasing volumes of database contents, as query times must in the context of prior art relational databases.
  • As one skilled in the art would recognize as necessary or best-suited for performance of the methods of the invention, a computer system or machines of the invention include one or more processors (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory and a static memory, which communicate with each other via a bus.
  • FIG. 14 diagrams a system 1500 suitable for performing methods of the invention. As shown in FIG. 14, system 1500 may include one or more of a server computer 1513, a terminal 1567, a sequencer 1501, a sequencer computer 1533, a computer 1549, or any combination thereof. Each such computer device may communicate via network 1509. Sequencer 1501 may optionally include or be operably coupled to its own, e.g., dedicated, sequencer computer 1533 (including any input/output mechanisms (I/O), processor, and memory). Additionally or alternatively, sequencer 1501 may be operably coupled to a server 1513 or computer 1549 (e.g., laptop, desktop, or tablet) via network 1509. Computer 1549 includes one or more processor, memory, and I/O. Where methods of the invention employ a client/server architecture, any steps of methods of the invention may be performed using server 1513, which includes one or more of processor, memory, and I/O, capable of obtaining data, instructions, etc., or providing results via an interface module or providing results as a file. Server 1513 may be engaged over network 1509 through computer 1549 or terminal 1567, or server 1513 may be directly connected to terminal 1567. Terminal 1567 is preferably a computer device. A computer according to the invention preferably includes one or more processor coupled to an I/O mechanism and memory.
  • A processor may be provided by one or more processors including, for example, one or more of a single core or multi-core processor (e.g., AMD Phenom II X2, Intel Core Duo, AMD Phenom II X4, Intel Core i5, Intel Core i7 Extreme Edition 980X, or Intel Xeon E7-2820).
  • An I/O mechanism may include a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device (e.g., a keyboard), a cursor control device (e.g., a mouse), a disk drive unit, a signal generation device (e.g., a speaker), an accelerometer, a microphone, a cellular radio frequency antenna, and a network interface device (e.g., a network interface card (NIC), Wi-Fi card, cellular modem, data jack, Ethernet port, modem jack, HDMI port, mini-HDMI port, USB port), touchscreen (e.g., CRT, LCD, LED, AMOLED, Super AMOLED), pointing device, trackpad, light (e.g., LED), light/image projection device, or a combination thereof.
  • Memory according to the invention refers to a non-transitory memory which is provided by one or more tangible devices which preferably include one or more machine-readable medium on which is stored one or more sets of instructions (e.g., software) embodying any one or more of the methodologies or functions described herein. The software may also reside, completely or at least partially, within the main memory, processor, or both during execution thereof by a computer within system 1500, the main memory and the processor also constituting machine-readable media. The software may further be transmitted or received over a network via the network interface device.
  • While the machine-readable medium can in an exemplary embodiment be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present invention. Memory may be, for example, one or more of a hard disk drive, solid state drive (SSD), an optical disc, flash memory, zip disk, tape drive, “cloud” storage location, or a combination thereof. In certain embodiments, a device of the invention includes a tangible, non-transitory computer readable medium for memory. Exemplary devices for use as memory include semiconductor memory devices, (e.g., EPROM, EEPROM, solid state drive (SSD), and flash memory devices e.g., SD, micro SD, SDXC, SDIO, SDHC cards); magnetic disks, (e.g., internal hard disks or removable disks); and optical disks (e.g., CD and DVD disks).
  • Components of system 1500 may be under the control of a carrier screening service provider and may be operated to obtain data representing a mutation in a genome of an individual, use a variant node in a graph database to store a description of the mutation (while storing, in the variant node, a pointer to an adjacent node that provides information about a clinical significance of the variant), and query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual. Functionality of server computer 1513 may be provided by an outside vendor such as Amazon Web Services or Amazon's EC2. In fact, the carrier screening entity who is analyzing the mutations from the sample may not and need not have actual knowledge of the physical location and type of computers that provide server computer(s) 1513. It is enough that the entity have access to and the ability to control at least a portion of each of one or more of server computer 1513. In some embodiments, a sequencing instrument 1501 is employed (e.g., an Illumina HiSeq 2000), which itself includes a sequencer computer 1533. The sample from the patient may be received from an outside source (e.g., from a phlebotomy facility down the hall) or may be sent by courier (e.g., in an Eppendorf tube). Generally, the service provider will have access to and use a computer 1549 for coordinating methods of the invention. It is important to note that any given computer is optional but typically at least one of the depicted computers (sequencer computer 1533, local computer 1549, or server computer 1513) will be used to perform steps of the methods of the invention. In some embodiments, sequencer 1501 is operated by an outside service provider in support of or on order of the carrier screening entity. Thus generally the carrier screening professional has access to or control over components of the system.
  • INCORPORATION BY REFERENCE
  • References and citations to other documents, such as patents, patent applications, patent publications, journals, books, papers, web contents, have been made throughout this disclosure. All such documents are hereby incorporated herein by reference in their entirety for all purposes.
  • EQUIVALENTS
  • Various modifications of the invention and many further embodiments thereof, in addition to those shown and described herein, will become apparent to those skilled in the art from the full contents of this document, including references to the scientific and patent literature cited herein. The subject matter herein contains important information, exemplification and guidance that can be adapted to the practice of this invention in its various embodiments and equivalents thereof.
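  • By way of illustration only, the workflow described above for system 1500 (storing a description of a mutation in a variant node, storing in that node a pointer to an adjacent node that carries clinical significance, and querying the graph to produce a report) might be sketched as follows. This is a minimal in-memory sketch and not the disclosed implementation: the `Node` class, the relationship names `HAS_VARIANT` and `HAS_SIGNIFICANCE`, the function `report_clinical_significance`, and all data values are assumptions introduced here, and a production system would presumably use an actual graph database rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node holding properties plus direct pointers to adjacent nodes."""
    label: str
    props: dict
    edges: dict = field(default_factory=dict)  # relationship name -> list of Node

    def link(self, relationship, other):
        """Store a pointer from this node to an adjacent node."""
        self.edges.setdefault(relationship, []).append(other)


def report_clinical_significance(individual):
    """Walk individual -> variant -> significance pointers and collect report rows."""
    rows = []
    for variant in individual.edges.get("HAS_VARIANT", []):
        for significance in variant.edges.get("HAS_SIGNIFICANCE", []):
            rows.append(
                (variant.props["description"], significance.props["classification"])
            )
    return rows


# Placeholder data, invented for illustration only (hypothetical names and values).
individual = Node("Individual", {"id": "individual-001"})
variant = Node("Variant", {"description": "placeholder variant c.100A>G"})
significance = Node("ClinicalSignificance", {"classification": "placeholder: pathogenic"})

# The variant node stores a pointer to the adjacent clinical-significance node.
variant.link("HAS_SIGNIFICANCE", significance)
individual.link("HAS_VARIANT", variant)

report = report_clinical_significance(individual)
print(report)
```

Because each node holds direct pointers to its neighbors, the report is produced by following edges from the individual outward rather than by joining tables.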

Claims (25)

What is claimed is:
1. A system for describing genetic information, the system comprising:
at least one computer comprising memory coupled to a processor, the system having at least a portion of a graph database stored therein, wherein the system is operable to:
obtain data representing a mutation in a genome of an individual;
use a node in the graph database to store a description of the mutation;
store, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and
query the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
2. The system of claim 1, wherein the system is operable to obtain the data representing the mutation by receiving at least one sequence read file that includes the data.
3. The system of claim 2, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
4. The system of claim 1, wherein the data representing the mutation is obtained as part of a file.
5. The system of claim 4, wherein the file has a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ.
6. The system of claim 4, operable to represent the file as a file node in the graph database and store, in the variant node, a pointer to the file node.
7. The system of claim 6, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
8. The system of claim 1, wherein the data representing the mutation comprises a description of the mutation as a variant of a reference human genome.
9. The system of claim 8, wherein the description of the mutation is obtained from a VCF record in a VCF file.
10. The system of claim 9, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
11. The system of claim 1, further operable to:
obtain sequencing data representing a plurality of mutations in the genome of the individual, the plurality of mutations being represented as variant calls relative to a human genome reference;
use, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation; and
link the individual to an allele node based on the plurality of mutations.
12. The system of claim 11, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.
13. The system of claim 12, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
14. The system of claim 1, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.
15. The system of claim 14, further operable to represent, in the graph database, a biological sample from the individual using a sample node and connect the sample node via a pointer to a read file node representing the sequence read file.
16. A method for analyzing mutations, the method comprising:
obtaining data representing a mutation in a genome of an individual;
using a node in a graph database to store a description of the mutation;
storing, in the node, a pointer to an adjacent node that provides information about a clinical significance of the mutation; and
querying the graph database to provide a report of the clinical significance of the mutation in the genome of the individual.
17. The method of claim 16, wherein obtaining the data representing the mutation comprises
obtaining a sample that includes a nucleic acid from the individual; and
sequencing the nucleic acid to obtain a sequence read file that includes the data.
18. The method of claim 17, further comprising representing the sample in the graph database using a sample node and connecting the sample node via a pointer to a read file node representing the sequence read file and metadata associated with the data.
19. The method of claim 16, wherein the data representing a mutation is obtained as part of a file.
20. The method of claim 19, wherein the file has a format selected from the group consisting of variant call format; sequence alignment map; binary alignment map; FASTA; and FASTQ.
21. The method of claim 19, further comprising representing the file as a file node in the graph database and storing in the mutation node a pointer to the file node.
22. The method of claim 16, wherein the data representing a mutation comprises a description of the mutation as a variant of a reference human genome.
23. The method of claim 22, wherein the description of the mutation is provided as a VCF record in a VCF file.
24. The method of claim 16, further comprising:
obtaining sequencing data representing a plurality of mutations in the genome of the individual, each of the plurality of mutations being represented as variant calls relative to a human genome reference; and
using, for each of the plurality of mutations, a corresponding variant node in the graph database to store a description of that mutation.
25. The method of claim 16, wherein the graph database comprises:
nodes representing people, nodes representing genomic variants relative to a reference, and nodes representing literature reports on medical relevance of the genomic variants; and
edges defining relationships between pairs of the nodes.
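
For illustration only, and not as part of the claims: obtaining a description of a mutation from a VCF record and storing it in a variant node that points to a file node could be sketched as below. The eight fixed, tab-delimited VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) come from the VCF format itself; the dictionary-based node layout, the function name `vcf_record_to_variant_node`, the file name, and the example record are assumptions introduced here.

```python
# Fixed columns of a VCF data line, in order, as defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_record_to_variant_node(line, file_node):
    """Turn one tab-delimited VCF data line into a variant node (a plain dict).

    The node layout is hypothetical; it keeps a pointer to the file node
    representing the file the record came from.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    return {
        "label": "Variant",
        "chrom": record["CHROM"],
        "pos": int(record["POS"]),       # 1-based position on the reference
        "ref": record["REF"],
        "alt": record["ALT"].split(","),  # ALT may list several alleles
        "file": file_node,                # pointer to the file node
    }


# Hypothetical file node and placeholder record, for illustration only.
file_node = {"label": "File", "format": "VCF", "name": "example.vcf"}
example_line = "1\t12345\t.\tA\tG\t50\tPASS\t.\n"

variant_node = vcf_record_to_variant_node(example_line, file_node)
print(variant_node["chrom"], variant_node["pos"], variant_node["ref"], variant_node["alt"])
```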
US14/826,595 2014-08-15 2015-08-14 Systems and methods for genetic analysis Abandoned US20160048608A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/826,595 US20160048608A1 (en) 2014-08-15 2015-08-14 Systems and methods for genetic analysis
US17/000,054 US12386895B2 (en) 2014-08-15 2020-08-21 Systems and methods for genetic analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462037861P 2014-08-15 2014-08-15
US14/826,595 US20160048608A1 (en) 2014-08-15 2015-08-14 Systems and methods for genetic analysis

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/000,054 Continuation US12386895B2 (en) 2014-08-15 2020-08-21 Systems and methods for genetic analysis

Publications (1)

Publication Number Publication Date
US20160048608A1 (en) 2016-02-18

Family

ID=55302346

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/826,595 Abandoned US20160048608A1 (en) 2014-08-15 2015-08-14 Systems and methods for genetic analysis
US17/000,054 Active 2039-05-04 US12386895B2 (en) 2014-08-15 2020-08-21 Systems and methods for genetic analysis

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/000,054 Active 2039-05-04 US12386895B2 (en) 2014-08-15 2020-08-21 Systems and methods for genetic analysis

Country Status (2)

Country Link
US (2) US20160048608A1 (en)
WO (1) WO2016025818A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170042495A1 (en) * 2014-04-24 2017-02-16 Hitachi, Ltd. Medical image information system, medical image information processing method, and program
US20180260520A1 (en) * 2017-03-08 2018-09-13 Björn Pollex Systems and methods for aligning sequences to graph reference constructs
EP3437001A1 (en) * 2016-03-29 2019-02-06 Regeneron Pharmaceuticals, Inc. Genetic variant-phenotype analysis system and methods of use
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US20210057090A1 (en) * 2019-08-20 2021-02-25 Life Technologies Corporation Methods for control of a sequencing device
CN112559513A (en) * 2019-09-10 2021-03-26 网易(杭州)网络有限公司 Link data access method, device, storage medium, processor and electronic device
CN114550833A (en) * 2022-02-15 2022-05-27 郑州大学 Gene analysis method and system based on big data
US11461318B2 (en) 2017-02-28 2022-10-04 Microsoft Technology Licensing, Llc Ontology-based graph query optimization
CN116246715A (en) * 2023-04-27 2023-06-09 倍科为(天津)生物技术有限公司 Multi-sample gene mutation data storage method, device, equipment and medium
WO2024081570A1 (en) * 2022-10-10 2024-04-18 Eli Lilly And Company Methods and systems for using causal networks to develop models for evaluating biological processes
US12071669B2 (en) 2016-02-12 2024-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for detection of abnormal karyotypes
US12386895B2 (en) 2014-08-15 2025-08-12 Laboratory Corporation Of America Holdings Systems and methods for genetic analysis
US12406413B2 (en) 2021-05-10 2025-09-02 Optum Services (Ireland) Limited Predictive data analysis using image representations of genomic data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR112020026029A2 (en) * 2018-06-19 2021-03-23 Ancestry. Com Dna, Llc filtering genetic networks to discover populations of interest
CN114566221A (en) * 2022-03-04 2022-05-31 上海交通大学医学院附属上海儿童医学中心 Automatic analysis and interpretation system for NGS data of genetic diseases
CN115050421A (en) * 2022-05-27 2022-09-13 杭州纽安津生物科技有限公司 Method for storing tumor neogenesis antigen and targeted drug information

Family Cites Families (329)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3781120A (en) 1970-09-14 1973-12-25 Technicon Instr Self-locating sample receptacle having integral identification label
SE410229B (en) 1976-05-11 1979-10-01 Kockums Chem AQUATIC COMPOSITION WITH REGULATED PH SHIFT WHEN FREEZING, AND SET FOR ITS PREPARATION
US5242794A (en) 1984-12-13 1993-09-07 Applied Biosystems, Inc. Detection of specific sequences in nucleic acids
US4683202A (en) 1985-03-28 1987-07-28 Cetus Corporation Process for amplifying nucleic acid sequences
US4683195A (en) 1986-01-30 1987-07-28 Cetus Corporation Process for amplifying, detecting, and/or-cloning nucleic acid sequences
US5583024A (en) 1985-12-02 1996-12-10 The Regents Of The University Of California Recombinant expression of Coleoptera luciferase
US4988617A (en) 1988-03-25 1991-01-29 California Institute Of Technology Method of detecting a nucleotide change in nucleic acids
US5234809A (en) 1989-03-23 1993-08-10 Akzo N.V. Process for isolating nucleic acid
US6379895B1 (en) 1989-06-07 2002-04-30 Affymetrix, Inc. Photolithographic and other means for manufacturing arrays
US5494810A (en) 1990-05-03 1996-02-27 Cornell Research Foundation, Inc. Thermostable ligase-mediated DNA amplifications system for the detection of genetic disease
US5060980A (en) 1990-05-30 1991-10-29 Xerox Corporation Form utilizing encoded indications for form field processing
CA2039652C (en) 1990-05-30 1996-12-24 Frank Zdybel, Jr. Hardcopy lossless data storage and communications for electronic document processing systems
US6716580B2 (en) 1990-06-11 2004-04-06 Somalogic, Inc. Method for the automated generation of nucleic acid ligands
US5253551A (en) 1990-07-12 1993-10-19 Bio-Pias, Inc. Centrifuge tube and centrifuge tube cap removing and installing tool and method
US5210015A (en) 1990-08-06 1993-05-11 Hoffman-La Roche Inc. Homogeneous assay system using the nuclease activity of a nucleic acid polymerase
US6197508B1 (en) 1990-09-12 2001-03-06 Affymetrix, Inc. Electrochemical denaturation and annealing of nucleic acid
US5994056A (en) 1991-05-02 1999-11-30 Roche Molecular Systems, Inc. Homogeneous methods for nucleic acid amplification and detection
US5348853A (en) 1991-12-16 1994-09-20 Biotronics Corporation Method for reducing non-specific priming in DNA amplification
US6033854A (en) 1991-12-16 2000-03-07 Biotronics Corporation Quantitative PCR using blocking oligonucleotides
US5567583A (en) 1991-12-16 1996-10-22 Biotronics Corporation Methods for reducing non-specific priming in DNA detection
US5869252A (en) 1992-03-31 1999-02-09 Abbott Laboratories Method of multiplex ligase chain reaction
US6100099A (en) 1994-09-06 2000-08-08 Abbott Laboratories Test strip having a diagonal array of capture spots
US5225165A (en) 1992-05-11 1993-07-06 Brandeis University Microcentrifuge tube with upwardly projecting lid extension
US5342328A (en) 1993-03-22 1994-08-30 Grossman Michael D Medical body fluid sampler device and method
CA2130517C (en) 1993-09-10 1999-10-05 Walter Fassbind Array of reaction containers for an apparatus for automatic performance of temperature cycles
JPH09507121A (en) 1993-10-26 1997-07-22 アフィマックス テクノロジーズ ナームロゼ ベノートスハップ Nucleic acid probe array on biological chip
US5459307A (en) 1993-11-30 1995-10-17 Xerox Corporation System for storage and retrieval of digitally encoded information on a medium
SE9400522D0 (en) 1994-02-16 1994-02-16 Ulf Landegren Method and reagent for detecting specific nucleotide sequences
US5888788A (en) 1994-05-18 1999-03-30 Union Nationale Des Groupements De Distillateurs D'alcool (Ungda) Use of ionophoretic polyether antibiotics for controlling bacterial growth in alcoholic fermentation
US5456887A (en) 1994-05-27 1995-10-10 Coulter Corporation Tube adapter
US7008771B1 (en) 1994-09-30 2006-03-07 Promega Corporation Multiplex amplification of short tandem repeat loci
US5604097A (en) 1994-10-13 1997-02-18 Spectragen, Inc. Methods for sorting polynucleotides using oligonucleotide tags
US5846719A (en) 1994-10-13 1998-12-08 Lynx Therapeutics, Inc. Oligonucleotide tags for sorting and identification
US5695934A (en) 1994-10-13 1997-12-09 Lynx Therapeutics, Inc. Massively parallel sequencing of sorted polynucleotides
US5776737A (en) 1994-12-22 1998-07-07 Visible Genetics Inc. Method and composition for internal identification of samples
US6235501B1 (en) 1995-02-14 2001-05-22 Bio101, Inc. Method for isolation DNA
US5866337A (en) 1995-03-24 1999-02-02 The Trustees Of Columbia University In The City Of New York Method to detect mutations in a nucleic acid using a hybridization-ligation procedure
US5750341A (en) 1995-04-17 1998-05-12 Lynx Therapeutics, Inc. DNA sequencing by parallel oligonucleotide extensions
US5701256A (en) 1995-05-31 1997-12-23 Cold Spring Harbor Laboratory Method and apparatus for biological sequence comparison
US5636400A (en) 1995-08-07 1997-06-10 Young; Keenan L. Automatic infant bottle cleaner
SE9601676D0 (en) 1996-04-30 1996-04-30 Ulf Landegren Improved probing of specific mucleic acids
US5830064A (en) 1996-06-21 1998-11-03 Pear, Inc. Apparatus and method for distinguishing events which collectively exceed chance expectations and thereby controlling an output
US5916202A (en) 1996-08-30 1999-06-29 Haswell; John N. Umbilical cord blood collection
US6361940B1 (en) 1996-09-24 2002-03-26 Qiagen Genomics, Inc. Compositions and methods for enhancing hybridization and priming specificity
GB9620209D0 (en) 1996-09-27 1996-11-13 Cemu Bioteknik Ab Method of sequencing DNA
ES2401121T3 (en) 1996-10-04 2013-04-17 Veri-Q Inc. Sample collection devices and methods that use markers and the use of such markers as controls in the validation of a sample, evaluation and / or laboratory certification
TW491892B (en) 1996-11-07 2002-06-21 Srl Inc Apparatus for detecting microorganism
EP0848060A3 (en) 1996-12-11 2000-02-02 Smithkline Beecham Corporation Novel human 11CB splice variant
GB9626815D0 (en) 1996-12-23 1997-02-12 Cemu Bioteknik Ab Method of sequencing DNA
EP2327797B1 (en) 1997-04-01 2015-11-25 Illumina Cambridge Limited Method of nucleic acid sequencing
CA2214461A1 (en) 1997-09-02 1999-03-02 Mcgill University Screening method for determining individuals at risk of developing diseases associated with different polymorphic forms of wildtype p53
CA2304017C (en) 1997-09-17 2007-05-01 Gentra Systems, Inc. Apparatuses and methods for isolating nucleic acid
US5869717A (en) 1997-09-17 1999-02-09 Uop Llc Process for inhibiting the polymerization of vinyl aromatics
US20010046673A1 (en) 1999-03-16 2001-11-29 Ljl Biosystems, Inc. Methods and apparatus for detecting nucleic acid polymorphisms
US5993611A (en) 1997-09-24 1999-11-30 Sarnoff Corporation Capacitive denaturation of nucleic acid
US6866996B1 (en) 1998-01-30 2005-03-15 Evolutionary Genomics, Llc Methods to identify polynucleotide and polypeptide sequences which may be associated with physiological and medical conditions
US6054276A (en) 1998-02-23 2000-04-25 Macevicz; Stephen C. DNA restriction site mapping
CA2263784A1 (en) 1998-03-23 1999-09-23 Megabios Corporation Dual-tagged proteins and their uses
WO1999049079A1 (en) 1998-03-25 1999-09-30 Ulf Landegren Rolling circle replication of padlock probes
US6946291B2 (en) 1998-04-24 2005-09-20 University Hospitals Of Cleveland Mixed cell diagnostic systems
US5971921A (en) 1998-06-11 1999-10-26 Advanced Monitoring Devices, Inc. Medical alarm system and methods
US6223128B1 (en) 1998-06-29 2001-04-24 Dnstar, Inc. DNA sequence assembly system
US6787308B2 (en) 1998-07-30 2004-09-07 Solexa Ltd. Arrayed biomolecules and their use in sequencing
US20020001800A1 (en) 1998-08-14 2002-01-03 Stanley N. Lapidus Diagnostic methods using serial testing of polymorphic loci
US6315767B1 (en) 1998-08-19 2001-11-13 Gambro, Inc. Cell storage maintenance and monitoring system
US6235502B1 (en) 1998-09-18 2001-05-22 Molecular Staging Inc. Methods for selectively isolating DNA using rolling circle amplification
US6703228B1 (en) 1998-09-25 2004-03-09 Massachusetts Institute Of Technology Methods and products related to genotyping and DNA analysis
AR021833A1 (en) 1998-09-30 2002-08-07 Applied Research Systems METHODS OF AMPLIFICATION AND SEQUENCING OF NUCLEIC ACID
US7034143B1 (en) 1998-10-13 2006-04-25 Brown University Research Foundation Systems and methods for sequencing by hybridization
US7071324B2 (en) 1998-10-13 2006-07-04 Brown University Research Foundation Systems and methods for sequencing by hybridization
US6948843B2 (en) 1998-10-28 2005-09-27 Covaris, Inc. Method and apparatus for acoustically controlling liquid solutions in microfluidic devices
AU1600000A (en) 1998-10-28 2000-05-15 Covaris, Inc. Apparatus and methods for controlling sonic treatment
US6927024B2 (en) 1998-11-30 2005-08-09 Genentech, Inc. PCR assay
EP1144684B1 (en) 1999-01-06 2009-08-19 Callida Genomics, Inc. Enhanced sequencing by hybridization using pools of probes
GB9901475D0 (en) 1999-01-22 1999-03-17 Pyrosequencing Ab A method of DNA sequencing
US6360235B1 (en) 1999-03-16 2002-03-19 Webcriteria, Inc. Objective measurement and graph theory modeling of web sites
US6335200B1 (en) 1999-06-09 2002-01-01 Tima Ab Device containing and method of making a composition having an elevated freezing point and change of color at selected temperatures
US7074586B1 (en) 1999-06-17 2006-07-11 Source Precision Medicine, Inc. Quantitative assay for low abundance molecules
US6818395B1 (en) 1999-06-28 2004-11-16 California Institute Of Technology Methods and apparatus for analyzing polynucleotide sequences
JP2001041918A (en) 1999-08-03 2001-02-16 Honda Motor Co Ltd Oil gas concentration detector
US6941317B1 (en) 1999-09-14 2005-09-06 Eragen Biosciences, Inc. Graphical user interface for display and analysis of biological sequence data
US7244559B2 (en) 1999-09-16 2007-07-17 454 Life Sciences Corporation Method of sequencing a nucleic acid
US6274320B1 (en) 1999-09-16 2001-08-14 Curagen Corporation Method of sequencing a nucleic acid
WO2001023610A2 (en) 1999-09-29 2001-04-05 Solexa Ltd. Polynucleotide sequencing
US7430477B2 (en) 1999-10-12 2008-09-30 Maxygen, Inc. Methods of populating data structures for use in evolutionary simulations
US6613516B1 (en) 1999-10-30 2003-09-02 Affymetrix, Inc. Preparation of nucleic acid samples
US6582938B1 (en) 2001-05-11 2003-06-24 Affymetrix, Inc. Amplification of nucleic acids
US20030166057A1 (en) 1999-12-17 2003-09-04 Hildebrand William H. Method and apparatus for the production of soluble MHC antigens and uses thereof
US20060276629A9 (en) 1999-12-17 2006-12-07 Hildebrand William H Purification and characterization of soluble human HLA proteins
US6775622B1 (en) 2000-01-31 2004-08-10 Zymogenetics, Inc. Method and system for detecting near identities in large DNA databases
US6714874B1 (en) 2000-03-15 2004-03-30 Applera Corporation Method and system for the assembly of a whole genome using a shot-gun data set
US20030208454A1 (en) 2000-03-16 2003-11-06 Rienhoff Hugh Y. Method and system for populating a database for further medical characterization
WO2001077383A2 (en) 2000-04-11 2001-10-18 Mats Bo Johan Nilsson Nucleic acid detection medium
US20020091666A1 (en) 2000-07-07 2002-07-11 Rice John Jeremy Method and system for modeling biological systems
US6913879B1 (en) 2000-07-10 2005-07-05 Telechem International Inc. Microarray method of genotyping multiple samples at multiple LOCI
US6448717B1 (en) 2000-07-17 2002-09-10 Micron Technology, Inc. Method and apparatuses for providing uniform electron beams from field emission displays
US6569920B1 (en) 2000-08-16 2003-05-27 Millennium Inorganic Chemicals, Inc. Titanium dioxide slurries having improved stability
US20020182609A1 (en) 2000-08-16 2002-12-05 Luminex Corporation Microsphere based oligonucleotide ligation assays, kits, and methods of use, including high-throughput genotyping
US6544895B1 (en) 2000-08-17 2003-04-08 Micron Technology, Inc. Methods for use of pulsed voltage in a plasma reactor
WO2002017207A2 (en) 2000-08-23 2002-02-28 Arexis Ab System and method of storing genetic information
US8501471B2 (en) 2000-10-18 2013-08-06 Sloan-Kettering Institute For Cancer Research Uses of monoclonal antibody 8H9
EP1366192B8 (en) 2000-10-24 2008-10-29 The Board of Trustees of the Leland Stanford Junior University Direct multiplex characterization of genomic dna
US7119188B2 (en) 2001-01-04 2006-10-10 Bristol-Myers Squibb Company N-carbobenzyloxy (N-CBZ)-deprotecting enzyme and uses therefor
WO2002072892A1 (en) 2001-03-12 2002-09-19 California Institute Of Technology Methods and apparatus for analyzing polynucleotide sequences by asynchronous base extension
US6557898B2 (en) 2001-03-13 2003-05-06 Bioanalytical Systems, Inc. Device, system and method for labeling three-dimensional objects
WO2002079502A1 (en) 2001-03-28 2002-10-10 The University Of Queensland A method for nucleic acid sequence analysis
US20020187483A1 (en) 2001-04-20 2002-12-12 Cerner Corporation Computer system for providing information about the risk of an atypical clinical event based upon genetic information
US7809509B2 (en) 2001-05-08 2010-10-05 Ip Genesis, Inc. Comparative mapping and assembly of nucleic acid sequences
WO2002093453A2 (en) 2001-05-12 2002-11-21 X-Mine, Inc. Web-based genetic research apparatus
GB2378245A (en) 2001-08-03 2003-02-05 Mats Nilsson Nucleic acid amplification method
US20040142325A1 (en) 2001-09-14 2004-07-22 Liat Mintz Methods and systems for annotating biomolecular sequences
US20030224384A1 (en) 2001-11-13 2003-12-04 Khalid Sayood Divide and conquer system and method of DNA sequence assembly
WO2003044177A2 (en) 2001-11-19 2003-05-30 Parallele Bioscience, Inc. Multiplex pcr
US7057026B2 (en) 2001-12-04 2006-06-06 Solexa Limited Labelled nucleotides
US20030175709A1 (en) 2001-12-20 2003-09-18 Murphy George L. Method and system for depleting rRNA populations
DK1321477T3 (en) 2001-12-22 2005-02-14 4Antibody Ag Process for the Production of Genetically Modified Vertebrate Precursor Lymphocytes and Their Use in the Production of Heterologous Binding Proteins
US7127436B2 (en) 2002-03-18 2006-10-24 Motorola, Inc. Gene expression programming algorithm
US20030203370A1 (en) 2002-04-30 2003-10-30 Zohar Yakhini Method and system for partitioning sets of sequence groups with respect to a set of subsequence groups, useful for designing polymorphism-based typing assays
AU2003282562A1 (en) * 2002-08-02 2004-02-25 David Atlan Method and system for finding mutations in dna sequences and interpreting their consequences
US6841384B2 (en) 2002-08-08 2005-01-11 Becton Dickinson Company Advanced roller bottle system for cell and tissue culturing
EP3363809B1 (en) 2002-08-23 2020-04-08 Illumina Cambridge Limited Modified nucleotides for polynucleotide sequencing
AU2003272498A1 (en) 2002-09-19 2004-04-08 Applera Corporation Fragmentation of dna
US7865534B2 (en) 2002-09-30 2011-01-04 Genstruct, Inc. System, method and apparatus for assembling and mining life science data
US20050003369A1 (en) 2002-10-10 2005-01-06 Affymetrix, Inc. Method for depleting specific nucleic acids from a mixture
CA2507189C (en) 2002-11-27 2018-06-12 Sequenom, Inc. Fragmentation-based methods and systems for sequence variation detection and discovery
JP4473878B2 (en) 2003-01-29 2010-06-02 454 コーポレーション Methods for amplifying and sequencing nucleic acids
US20060183132A1 (en) 2005-02-14 2006-08-17 Perlegen Sciences, Inc. Selection probe amplification
WO2004081183A2 (en) 2003-03-07 2004-09-23 Rubicon Genomics, Inc. In vitro dna immortalization and whole genome amplification using libraries generated from randomly fragmented dna
US7041481B2 (en) 2003-03-14 2006-05-09 The Regents Of The University Of California Chemical amplification based on fluid partitioning
WO2004083819A2 (en) 2003-03-17 2004-09-30 Trace Genetics, Inc Molecular forensic specimen marker
ITMI20030643A1 (en) 2003-04-01 2004-10-02 Copan Innovation Ltd BUFFER FOR THE COLLECTION OF BIOLOGICAL SAMPLES
EP3261006A1 (en) 2003-04-09 2017-12-27 Omicia Inc. Methods of selection, reporting and analysis of genetic markers using broad based genetic profiling applications
BRPI0410636A (en) 2003-05-23 2006-07-18 Cold Spring Harbor Lab virtual representation of nucleotide sequences
US9045796B2 (en) 2003-06-20 2015-06-02 Illumina, Inc. Methods and compositions for whole genome amplification and genotyping
SE0301951D0 (en) 2003-06-30 2003-06-30 Pyrosequencing Ab New method
US7108979B2 (en) 2003-09-03 2006-09-19 Agilent Technologies, Inc. Methods to detect cross-contamination between samples contacted with a multi-array substrate
US7537889B2 (en) 2003-09-30 2009-05-26 Life Genetics Lab, Llc. Assay for quantitation of human DNA using Alu elements
US7729865B2 (en) 2003-10-06 2010-06-01 Cerner Innovation, Inc. Computerized method and system for automated correlation of genetic test results
ES2533876T3 (en) 2003-10-29 2015-04-15 Bioarray Solutions Ltd Multiplexed nucleic acid analysis by double stranded DNA fragmentation
WO2005042781A2 (en) 2003-10-31 2005-05-12 Agencourt Personal Genomics Corporation Methods for producing a paired tag from a nucleic acid sequence and methods of use thereof
WO2005047477A2 (en) 2003-11-07 2005-05-26 University Of Massachusetts Interspersed repetitive element rnas as substrates, inhibitors and delivery vehicles for rnai
US7169560B2 (en) 2003-11-12 2007-01-30 Helicos Biosciences Corporation Short cycle methods for sequencing polynucleotides
EP1685410A1 (en) 2003-11-19 2006-08-02 EVOTEC Neurosciences GmbH Diagnostic and therapeutic use of the human sgpl1 gene and protein for neurodegenerative diseases
US20050209787A1 (en) 2003-12-12 2005-09-22 Waggener Thomas B Sequencing data analysis
US8529744B2 (en) 2004-02-02 2013-09-10 Boreal Genomics Corp. Enrichment of nucleic acid targets
US7910353B2 (en) 2004-02-13 2011-03-22 Signature Genomic Laboratories Methods and apparatuses for achieving precision genetic diagnoses
US20050191682A1 (en) 2004-02-17 2005-09-01 Affymetrix, Inc. Methods for fragmenting DNA
US20060195266A1 (en) 2005-02-25 2006-08-31 Yeatman Timothy J Methods for predicting cancer outcome and gene signatures for use therein
US20100216151A1 (en) 2004-02-27 2010-08-26 Helicos Biosciences Corporation Methods for detecting fetal nucleic acids and diagnosing fetal abnormalities
US20100216153A1 (en) 2004-02-27 2010-08-26 Helicos Biosciences Corporation Methods for detecting fetal nucleic acids and diagnosing fetal abnormalities
WO2005085477A1 (en) 2004-03-02 2005-09-15 Orion Genomics Llc Differential enzymatic fragmentation by whole genome amplification
US20080176209A1 (en) 2004-04-08 2008-07-24 Biomatrica, Inc. Integration of sample storage and sample management for life science
WO2005104401A2 (en) 2004-04-22 2005-11-03 Ramot At Tel Aviv University Ltd. Method and apparatus for optimizing multidimensional systems
SE0401270D0 (en) 2004-05-18 2004-05-18 Fredrik Dahl Method for amplifying specific nucleic acids in parallel
US7622281B2 (en) 2004-05-20 2009-11-24 The Board Of Trustees Of The Leland Stanford Junior University Methods and compositions for clonal amplification of nucleic acid
WO2006014869A1 (en) 2004-07-26 2006-02-09 Parallele Bioscience, Inc. Simultaneous analysis of multiple genomes
WO2006044017A2 (en) 2004-08-13 2006-04-27 Jaguar Bioscience Inc. Systems and methods for identifying diagnostic indicators
US8024128B2 (en) 2004-09-07 2011-09-20 Gene Security Network, Inc. System and method for improving clinical decisions by aggregating, validating and analysing genetic and phenotypic data
US20060078894A1 (en) 2004-10-12 2006-04-13 Winkler Matthew M Methods and compositions for analyzing nucleic acids
US20060133963A1 (en) 2004-12-16 2006-06-22 Israel Stein Adapter for attaching information to test tubes
US20060184489A1 (en) 2004-12-17 2006-08-17 General Electric Company Genetic knowledgebase creation for personalized analysis of medical conditions
EP2233581A1 (en) 2005-02-01 2010-09-29 AB Advanced Genetic Analysis Corporation Nucleic acid sequencing by performing successive cycles of duplex extension
US7393665B2 (en) 2005-02-10 2008-07-01 Population Genetics Technologies Ltd Methods and compositions for tagging and identifying polynucleotides
US7658346B2 (en) 2005-02-25 2010-02-09 Honeywell International Inc. Double ducted hovering air-vehicle
GB0504184D0 (en) 2005-03-01 2005-04-06 Lingvitae As Method
DK1877559T3 (en) 2005-04-18 2011-02-07 Mitomics Inc Mitochondrial mutations and rearrangements as a diagnostic tool for detection of sun exposure, prostate cancer and other cancers
US20060263789A1 (en) 2005-05-19 2006-11-23 Robert Kincaid Unique identifiers for indicating properties associated with entities to which they are attached, and methods for using
WO2007145612A1 (en) 2005-06-06 2007-12-21 454 Life Sciences Corporation Paired end sequencing
US20060292585A1 (en) 2005-06-24 2006-12-28 Affymetrix, Inc. Analysis of methylation using nucleic acid arrays
GB0514936D0 (en) 2005-07-20 2005-08-24 Solexa Ltd Preparation of templates for nucleic acid sequencing
US20070020640A1 (en) 2005-07-21 2007-01-25 Mccloskey Megan L Molecular encoding of nucleic acid templates for PCR and other forms of sequence analysis
US7838646B2 (en) 2005-08-18 2010-11-23 Quest Diagnostics Investments Incorporated Cystic fibrosis transmembrane conductance regulator gene mutations
US7666593B2 (en) 2005-08-26 2010-02-23 Helicos Biosciences Corporation Single molecule sequencing of captured nucleic acids
EP1931805A2 (en) 2005-09-30 2008-06-18 Perlegen Sciences, Inc. Methods and compositions for screening and treatment of disorders of blood glucose regulation
US20070092883A1 (en) 2005-10-26 2007-04-26 De Luwe Hoek Octrooien B.V. Methylation specific multiplex ligation-dependent probe amplification (MS-MLPA)
GB0522310D0 (en) 2005-11-01 2005-12-07 Solexa Ltd Methods of preparing libraries of template polynucleotides
US20120021930A1 (en) 2005-11-22 2012-01-26 Stichting Dienst Landbouwkundig Onderzoek Multiplex Nucleic Acid Detection
US7329860B2 (en) 2005-11-23 2008-02-12 Illumina, Inc. Confocal imaging methods and apparatus
EP3373175B1 (en) 2005-11-26 2025-05-28 Natera, Inc. System and method for cleaning noisy genetic data and using data to make predictions
US20100137163A1 (en) 2006-01-11 2010-06-03 Link Darren R Microfluidic Devices and Methods of Use in The Formation and Control of Nanoreactors
EP1987162A4 (en) 2006-01-23 2009-11-25 Population Genetics Technologi Nucleic acid analysis using sequence tokens
WO2007087312A2 (en) 2006-01-23 2007-08-02 Population Genetics Technologies Ltd. Molecular counting
WO2007092538A2 (en) 2006-02-07 2007-08-16 President And Fellows Of Harvard College Methods for making nucleotide probes for sequencing and synthesis
RU2466458C2 (en) 2006-03-10 2012-11-10 Конинклейке Филипс Электроникс, Н.В. Methods and systems for identifying dna patterns through spectral analysis
ES2546848T3 (en) 2006-03-10 2015-09-29 Epigenomics Ag A method to identify a biological sample for methylation analysis
US20100168067A1 (en) 2006-03-21 2010-07-01 Ucl Business Plc Biomarkers for bisphosphonate-responsive bone disorders
WO2007112289A2 (en) 2006-03-23 2007-10-04 The Regents Of The University Of California Method for identification and sequencing of proteins
EP2018622B1 (en) 2006-03-31 2018-04-25 Illumina, Inc. Systems for sequence by synthesis analysis
CA2648778A1 (en) 2006-04-10 2007-10-18 The Regents Of The University Of California Method for culturing cells on removable pallets for subsequent cell expansion and analysis
US7282337B1 (en) 2006-04-14 2007-10-16 Helicos Biosciences Corporation Methods for increasing accuracy of nucleic acid sequencing
US7702468B2 (en) 2006-05-03 2010-04-20 Population Diagnostics, Inc. Evaluating genetic disorders
EP2021113A2 (en) 2006-05-11 2009-02-11 Raindance Technologies, Inc. Microfluidic devices
US8178360B2 (en) 2006-05-18 2012-05-15 Illumina Cambridge Limited Dye compounds and the use of their labelled conjugates
WO2008111990A1 (en) 2006-06-14 2008-09-18 Cellpoint Diagnostics, Inc. Rare cell analysis using sample splitting and dna tags
US20080050739A1 (en) 2006-06-14 2008-02-28 Roland Stoughton Diagnosis of fetal abnormalities using polymorphisms including short tandem repeats
US20100035243A1 (en) 2006-07-10 2010-02-11 Nanosphere, Inc. Ultra-sensitive detection of analytes
US20080085836A1 (en) 2006-09-22 2008-04-10 Kearns William G Method for genetic testing of human embryos for chromosome abnormalities, segregating genetic disorders with or without a known mutation and mitochondrial disorders following in vitro fertilization (IVF), embryo culture and embryo biopsy
US7985716B2 (en) 2006-09-22 2011-07-26 Uchicago Argonne, Llc Nucleic acid sample purification and enrichment with a thermo-affinity microfluidic sub-circuit
US20080081330A1 (en) 2006-09-28 2008-04-03 Helicos Biosciences Corporation Method and devices for analyzing small RNA molecules
US7754429B2 (en) 2006-10-06 2010-07-13 Illumina Cambridge Limited Method for pair-wise sequencing a plurity of target polynucleotides
WO2008061193A2 (en) 2006-11-15 2008-05-22 Biospherex Llc Multitag sequencing and ecogenomics analysis
CN103642902B (en) 2006-11-30 2016-01-20 纳维哲尼克斯公司 Genetic analysis systems and method
US7948015B2 (en) 2006-12-14 2011-05-24 Life Technologies Corporation Methods and apparatus for measuring analytes using large scale FET arrays
US8349167B2 (en) 2006-12-14 2013-01-08 Life Technologies Corporation Methods and apparatus for detecting molecular interactions using FET arrays
US8262900B2 (en) 2006-12-14 2012-09-11 Life Technologies Corporation Methods and apparatus for measuring analytes using large scale FET arrays
US7855054B2 (en) 2007-01-16 2010-12-21 Somalogic, Inc. Multiplexed analyses of test samples
US7862999B2 (en) 2007-01-17 2011-01-04 Affymetrix, Inc. Multiplex targeted amplification using flap nuclease
CN101652780B (en) 2007-01-26 2012-10-03 伊鲁米那股份有限公司 Nucleic acid sequencing system and method
WO2008098014A2 (en) 2007-02-05 2008-08-14 Applied Biosystems, Llc System and methods for indel identification using short read sequencing
US20080269068A1 (en) 2007-02-06 2008-10-30 President And Fellows Of Harvard College Multiplex decoding of sequence tags in barcodes
KR100896987B1 (en) 2007-03-14 2009-05-14 한국과학기술연구원 Target protein detection method and detection kit using aptamer
US8533327B2 (en) 2007-04-04 2013-09-10 Zte Corporation System and method of providing services via a peer-to-peer-based next generation network
US7774962B1 (en) 2007-04-27 2010-08-17 David Ladd Removable and reusable tags for identifying bottles, cans, and the like
JP2008286754A (en) 2007-05-21 2008-11-27 Ids Co Ltd Test tube
US20080293589A1 (en) 2007-05-24 2008-11-27 Affymetrix, Inc. Multiplex locus specific amplification
US8283116B1 (en) 2007-06-22 2012-10-09 Ptc Therapeutics, Inc. Methods of screening for compounds for treating spinal muscular atrophy using SMN mRNA translation regulation
JP2009009624A (en) 2007-06-26 2009-01-15 Hitachi Global Storage Technologies Netherlands Bv Servo pattern forming method and magnetic disk apparatus
WO2009017678A2 (en) 2007-07-26 2009-02-05 Pacific Biosciences Of California, Inc. Molecular redundant sequencing
US20120252020A1 (en) 2007-08-17 2012-10-04 Predictive Biosciences, Inc. Screening Assay for Bladder Cancer
EP2201143B2 (en) 2007-09-21 2016-08-24 Katholieke Universiteit Leuven Tools and methods for genetic tests using next generation sequencing
EP2053132A1 (en) 2007-10-23 2009-04-29 Roche Diagnostics GmbH Enrichment and sequence analysis of geomic regions
US20090119313A1 (en) 2007-11-02 2009-05-07 Ioactive Inc. Determining structure of binary data using alignment algorithms
US8478544B2 (en) * 2007-11-21 2013-07-02 Cosmosid Inc. Direct identification and measurement of relative populations of microorganisms with direct DNA sequencing and probabilistic methods
US8463895B2 (en) 2007-11-29 2013-06-11 International Business Machines Corporation System and computer program product to predict edges in a non-cumulative graph
DE102007060291B4 (en) 2007-12-12 2011-04-28 Sartorius Stedim Biotech Gmbh Container arrangement with a container with flexible wall
US20090156412A1 (en) 2007-12-17 2009-06-18 Helicos Biosciences Corporation Surface-capture of target nucleic acids
US20090163366A1 (en) 2007-12-24 2009-06-25 Helicos Biosciences Corporation Two-primer sequencing for high-throughput expression analysis
US8003326B2 (en) 2008-01-02 2011-08-23 Children's Medical Center Corporation Method for diagnosing autism spectrum disorder
US20090181389A1 (en) 2008-01-11 2009-07-16 Signosis, Inc., A California Corporation Quantitative measurement of nucleic acid via ligation-based linear amplification
EP4450642A3 (en) 2008-01-17 2025-01-08 Sequenom, Inc. Single molecule nucleic acid sequence analysis processes and compositions
US9127306B2 (en) 2008-02-15 2015-09-08 Life Technologies Corporation Methods and apparatuses for nucleic acid shearing by sonication
US20090226975A1 (en) 2008-03-10 2009-09-10 Illumina, Inc. Constant cluster seeding
US9074244B2 (en) 2008-03-11 2015-07-07 Affymetrix, Inc. Array-based translocation and rearrangement assays
US8271206B2 (en) 2008-04-21 2012-09-18 Softgenetics Llc DNA sequence assembly methods of short reads
US20090298064A1 (en) 2008-05-29 2009-12-03 Serafim Batzoglou Genomic Sequencing
WO2009149243A1 (en) 2008-06-04 2009-12-10 G Patel A monitoring system based on etching of metals
AU2009274031A1 (en) 2008-07-23 2010-01-28 The Regents Of The University Of California Method of characterizing sequences from genetic material samples
US20100035252A1 (en) 2008-08-08 2010-02-11 Ion Torrent Systems Incorporated Methods for sequencing individual nucleic acids under tension
WO2010024894A1 (en) 2008-08-26 2010-03-04 23Andme, Inc. Processing data from genotyping chips
US20100063742A1 (en) 2008-09-10 2010-03-11 Hart Christopher E Multi-scale short read assembly
US8383345B2 (en) 2008-09-12 2013-02-26 University Of Washington Sequence tag directed subassembly of short sequencing reads into long sequencing reads
US8921046B2 (en) 2008-09-19 2014-12-30 Pacific Biosciences Of California, Inc. Nucleic acid sequence analysis
EP2562268B1 (en) 2008-09-20 2016-12-21 The Board of Trustees of The Leland Stanford Junior University Noninvasive diagnosis of fetal aneuploidy by sequencing
US20100137143A1 (en) 2008-10-22 2010-06-03 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes
US8546128B2 (en) 2008-10-22 2013-10-01 Life Technologies Corporation Fluidics system for sequential delivery of reagents
US20100301398A1 (en) 2009-05-29 2010-12-02 Ion Torrent Systems Incorporated Methods and apparatus for measuring analytes
WO2010056728A1 (en) 2008-11-11 2010-05-20 Helicos Biosciences Corporation Nucleic acid encoding for multiplex analysis
US8474228B2 (en) 2009-12-08 2013-07-02 Life Technologies Corporation Packaging systems and methods for transporting vials
US8462161B1 (en) 2009-01-20 2013-06-11 Kount Inc. System and method for fast component enumeration in graphs with implicit edges
ES2735514T3 (en) 2009-02-03 2019-12-19 Ande Corp Nucleic acid purification
EP2425023B1 (en) 2009-04-27 2015-12-23 Pacific Biosciences of California, Inc. Real-time sequencing methods and systems
DK2511843T3 (en) 2009-04-29 2017-03-27 Complete Genomics Inc Method and system for determining variations in a sample polynucleotide sequence in terms of a reference polynucleotide sequence
JP2012525147A (en) 2009-04-30 2012-10-22 グッド スタート ジェネティクス, インコーポレイテッド Methods and compositions for assessing genetic markers
US20120165202A1 (en) 2009-04-30 2012-06-28 Good Start Genetics, Inc. Methods and compositions for evaluating genetic markers
US8574835B2 (en) 2009-05-29 2013-11-05 Life Technologies Corporation Scaffolded nucleic acid polymer particles and methods of making and using
US8673627B2 (en) 2009-05-29 2014-03-18 Life Technologies Corporation Apparatus and methods for performing electrochemical reactions
US20110007639A1 (en) 2009-07-10 2011-01-13 Qualcomm Incorporated Methods and apparatus for detecting identifiers
CN101677716A (en) 2009-07-14 2010-03-24 吉诺工业有限公司 Temperature-adjustable beverage brewing device
EP2456885A2 (en) 2009-07-20 2012-05-30 Bar Harbor Biotechnology, Inc. Methods for assessing disease risk
EP2464738A4 (en) 2009-08-12 2013-05-01 Nugen Technologies Inc Methods, compositions, and kits for generating nucleic acid products substantially free of template nucleic acid
US20110053208A1 (en) 2009-08-31 2011-03-03 Streck, Inc. Biological sample identification system
SG179038A1 (en) 2009-09-08 2012-04-27 Lab Corp America Holdings Compositions and methods for diagnosing autism spectrum disorders
WO2011030838A1 (en) 2009-09-10 2011-03-17 富士フイルム株式会社 Method for analyzing nucleic acid mutation using array comparative genomic hybridization technique
US8975019B2 (en) 2009-10-19 2015-03-10 University Of Massachusetts Deducing exon connectivity by RNA-templated DNA ligation/sequencing
US20110098193A1 (en) 2009-10-22 2011-04-28 Kingsmore Stephen F Methods and Systems for Medical Sequencing Analysis
US20120245041A1 (en) 2009-11-04 2012-09-27 Sydney Brenner Base-by-base mutation screening
CN102597272A (en) 2009-11-12 2012-07-18 艾索特里克斯遗传实验室有限责任公司 Copy number analysis of genetic locus
WO2011066476A1 (en) 2009-11-25 2011-06-03 Quantalife, Inc. Methods and compositions for detecting genetic material
US20120270739A1 (en) 2010-01-19 2012-10-25 Verinata Health, Inc. Method for sample analysis of aneuploidies in maternal samples
US20120010085A1 (en) 2010-01-19 2012-01-12 Rava Richard P Methods for determining fraction of fetal nucleic acids in maternal samples
US20110257889A1 (en) 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
CA2793877A1 (en) 2010-03-22 2011-09-29 Esoterix Genetic Laboratories, Llc Mutations associated with cystic fibrosis
US20140206552A1 (en) 2010-05-18 2014-07-24 Natera, Inc. Methods for preimplantation genetic diagnosis by sequencing
EP2854058A3 (en) 2010-05-18 2015-10-28 Natera, Inc. Methods for non-invasive pre-natal ploidy calling
ES2862331T3 (en) 2010-06-18 2021-10-07 Myriad Genetics Inc Methods to predict the status of the BRCA1 and BRCA2 genes in a cancer cell
WO2012006291A2 (en) 2010-07-06 2012-01-12 Life Technologies Corporation Systems and methods to detect copy number variation
HRP20180593T1 (en) 2010-07-09 2018-07-27 Cergentis B.V. 3-D segmenting strategies of the genomic area of interest
EP2601609B1 (en) 2010-08-02 2017-05-17 Population Bio, Inc. Compositions and methods for discovery of causative mutations in genetic disorders
US20140342940A1 (en) 2011-01-25 2014-11-20 Ariosa Diagnostics, Inc. Detection of Target Nucleic Acids using Hybridization
EP2614161B1 (en) * 2010-09-09 2020-11-04 Fabric Genomics, Inc. Variant annotation, analysis and selection tool
RU2565550C2 (en) 2010-09-24 2015-10-20 Те Борд Оф Трастиз Оф Те Лилэнд Стэнфорд Джуниор Юниверсити Direct capture, amplification and sequencing of target dna using immobilised primers
US8715933B2 (en) 2010-09-27 2014-05-06 Nabsys, Inc. Assay methods using nicking endonucleases
US8430053B2 (en) 2010-09-30 2013-04-30 Temptime Corporation Color-changing emulsions for freeze indicators
CN114678128A (en) 2010-11-30 2022-06-28 香港中文大学 Detection of genetic or molecular aberrations associated with cancer
US9163281B2 (en) 2010-12-23 2015-10-20 Good Start Genetics, Inc. Methods for maintaining the integrity and identification of a nucleic acid template in a multiplex sequencing reaction
US11270781B2 (en) 2011-01-25 2022-03-08 Ariosa Diagnostics, Inc. Statistical analysis for non-invasive sex chromosome aneuploidy determination
CA2824387C (en) 2011-02-09 2019-09-24 Natera, Inc. Methods for non-invasive prenatal ploidy calling
JP2014506465A (en) 2011-02-09 2014-03-17 バイオ−ラド ラボラトリーズ,インコーポレイティド Nucleic acid analysis
US8782566B2 (en) 2011-02-22 2014-07-15 Cisco Technology, Inc. Using gestures to schedule and manage meetings
WO2012122548A2 (en) * 2011-03-09 2012-09-13 Lawrence Ganeshalingam Biological data networks and methods therefor
US20120252686A1 (en) 2011-03-31 2012-10-04 Good Start Genetics Methods for maintaining the integrity and identification of a nucleic acid template in a multiplex sequencing reaction
US8680066B2 (en) 2011-04-05 2014-03-25 The United States of America as represented by the Development of Veterans Affairs Methods for determining and inhibiting rheumatoid arthritis associated with the BRAF oncogene in a subject
US20140357497A1 (en) 2011-04-27 2014-12-04 Kun Zhang Designing padlock probes for targeted genomic sequencing
DK2716766T3 (en) 2011-05-31 2017-01-02 Berry Genomics Co Ltd Device for detecting copy number of fetal chromosomes or tumor cell chromosomes
EP2718466B1 (en) 2011-06-07 2018-08-08 Icahn School of Medicine at Mount Sinai Materials and method for identifying spinal muscular atrophy carriers
US9733221B2 (en) 2011-06-09 2017-08-15 Agilent Technologies, Inc. Injection needle cartridge with integrated sealing force generator
US20130178378A1 (en) 2011-06-09 2013-07-11 Andrew C. Hatch Multiplex digital pcr
JP5792711B2 (en) 2011-07-06 2015-10-14 株式会社Joled Display device
WO2013028920A2 (en) 2011-08-23 2013-02-28 Eagile, Inc. System for associating rfid tag with upc code, and validating associative encoding of same
US20140228226A1 (en) 2011-09-21 2014-08-14 Bgi Health Service Co., Ltd. Method and system for determining chromosome aneuploidy of single cell
JP5748104B2 (en) 2011-09-30 2015-07-15 日立工機株式会社 Driving machine
WO2013052913A2 (en) 2011-10-06 2013-04-11 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
EP2768983A4 (en) 2011-10-17 2015-06-03 Good Start Genetics Inc Analysis methods
CN104334196B (en) 2012-02-16 2018-04-10 Atyr 医药公司 For treating the Histidyl-tRNA-synthetase of autoimmune disease and inflammatory disease
AU2013200876A1 (en) 2012-02-24 2013-09-12 Callum David Mcdonald Method of graph processing
WO2013148496A1 (en) 2012-03-26 2013-10-03 The Johns Hopkins University Rapid aneuploidy detection
US8209130B1 (en) 2012-04-04 2012-06-26 Good Start Genetics, Inc. Sequence assembly
US8812422B2 (en) 2012-04-09 2014-08-19 Good Start Genetics, Inc. Variant database
US10227635B2 (en) 2012-04-16 2019-03-12 Molecular Loop Biosolutions, Llc Capture reactions
CA2874195C (en) 2012-05-21 2020-08-25 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
US9193992B2 (en) 2012-06-05 2015-11-24 Agilent Technologies, Inc. Method for determining ploidy of a cell
CN104619894B (en) 2012-06-18 2017-06-06 纽亘技术公司 Compositions and methods for negative selection of undesired nucleic acid sequences
US8676857B1 (en) 2012-08-23 2014-03-18 International Business Machines Corporation Context-based search for a data store related to a graph node
EP2891099A4 (en) 2012-08-28 2016-04-20 Broad Inst Inc Detection of variants in sequencing data and calibration
EP2917881A4 (en) 2012-11-07 2016-07-20 Good Start Genetics Inc Validation of genetic tests
US20140222349A1 (en) * 2013-01-16 2014-08-07 Assurerx Health, Inc. System and Methods for Pharmacogenomic Classification
US20140242581A1 (en) 2013-01-23 2014-08-28 Reproductive Genetics And Technology Solutions, Llc Compositions and methods for genetic analysis of embryos
JP2016513461A (en) 2013-03-12 2016-05-16 カウンシル,インコーポレーテッド Prenatal genetic analysis system and method
US9802196B2 (en) 2013-03-13 2017-10-31 Alphagem Bio Inc. Ergonomic numbered connector to hold tubes with improved cap
US20160034638A1 (en) 2013-03-14 2016-02-04 University Of Rochester System and Method for Detecting Population Variation from Nucleic Acid Sequencing Data
US8778609B1 (en) 2013-03-14 2014-07-15 Good Start Genetics, Inc. Methods for analyzing nucleic acids
US9418203B2 (en) * 2013-03-15 2016-08-16 Cypher Genomics, Inc. Systems and methods for genomic variant annotation
WO2014197377A2 (en) 2013-06-03 2014-12-11 Good Start Genetics, Inc. Methods and systems for storing sequence read data
US9579656B2 (en) 2013-06-11 2017-02-28 J. G. Finneran Associates, Inc. Rotation-limiting well plate assembly
US9116866B2 (en) 2013-08-21 2015-08-25 Seven Bridges Genomics Inc. Methods and systems for detecting sequence variants
WO2015089333A1 (en) 2013-12-11 2015-06-18 Accuragen, Inc. Compositions and methods for detecting rare sequence variants
WO2016025818A1 (en) 2014-08-15 2016-02-18 Good Start Genetics, Inc. Systems and methods for genetic analysis
JP2018512080A (en) 2015-01-15 2018-05-10 グッド スタート ジェネティクス, インコーポレイテッド Devices and systems for attaching barcodes to individual wells and containers
USD773070S1 (en) 2015-02-24 2016-11-29 Good Start Genetics, Inc. Device for barcoding individual wells and vessels

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10085707B2 (en) * 2014-04-24 2018-10-02 Hitachi, Ltd. Medical image information system, medical image information processing method, and program
US20170042495A1 (en) * 2014-04-24 2017-02-16 Hitachi, Ltd. Medical image information system, medical image information processing method, and program
US12386895B2 (en) 2014-08-15 2025-08-12 Laboratory Corporation Of America Holdings Systems and methods for genetic analysis
US11568957B2 (en) 2015-05-18 2023-01-31 Regeneron Pharmaceuticals Inc. Methods and systems for copy number variant detection
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US12071669B2 (en) 2016-02-12 2024-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for detection of abnormal karyotypes
EP3437001A1 (en) * 2016-03-29 2019-02-06 Regeneron Pharmaceuticals, Inc. Genetic variant-phenotype analysis system and methods of use
US11461318B2 (en) 2017-02-28 2022-10-04 Microsoft Technology Licensing, Llc Ontology-based graph query optimization
US20180260520A1 (en) * 2017-03-08 2018-09-13 Björn Pollex Systems and methods for aligning sequences to graph reference constructs
US20210057090A1 (en) * 2019-08-20 2021-02-25 Life Technologies Corporation Methods for control of a sequencing device
CN112559513A (en) * 2019-09-10 2021-03-26 网易(杭州)网络有限公司 Link data access method, device, storage medium, processor and electronic device
US12406413B2 (en) 2021-05-10 2025-09-02 Optum Services (Ireland) Limited Predictive data analysis using image representations of genomic data
CN114550833A (en) * 2022-02-15 2022-05-27 郑州大学 Gene analysis method and system based on big data
WO2024081570A1 (en) * 2022-10-10 2024-04-18 Eli Lilly And Company Methods and systems for using causal networks to develop models for evaluating biological processes
CN116246715A (en) * 2023-04-27 2023-06-09 倍科为(天津)生物技术有限公司 Multi-sample gene mutation data storage method, device, equipment and medium

Also Published As

Publication number Publication date
US20210089581A1 (en) 2021-03-25
US12386895B2 (en) 2025-08-12
WO2016025818A1 (en) 2016-02-18

Similar Documents

Publication Publication Date Title
US12386895B2 (en) Systems and methods for genetic analysis
US11149308B2 (en) Sequence assembly
US10706017B2 (en) Methods and systems for storing sequence read data
US20210057045A1 (en) Determining the Clinical Significance of Variant Sequences
US20200399719A1 (en) Systems and methods for analyzing viral nucleic acids
CN112885412B (en) Genome annotation method, apparatus, visualization platform and storage medium
CA3020669A1 (en) Systems and methods for biological data management
Chen et al. Recent advances in sequence assembly: principles and applications
Sirén et al. Personalized pangenome references
US20170076047A1 (en) Systems and methods for genetic testing
US9594777B1 (en) In-database single-nucleotide genetic variant analysis
Kaye Approaches to genome analysis through the application of graph theory
WO2018026576A1 (en) Genomic analysis of cord blood
EP4540825A1 (en) Systems and methods for sequencing and analysis of nucleic acid diversity
From Reference-free SNP detection: dealing with the data deluge

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOD START GENETICS, INC., MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FRIEDEN, ALEXANDER;KENNEDY, CALEB J.;HAURIE, XAVIER S.;SIGNING DATES FROM 20141105 TO 20141222;REEL/FRAME:039720/0658

AS Assignment

Owner name: INN SA LLC, NEW YORK

Free format text: SECURITY INTEREST;ASSIGNORS:INVITAE CORPORATION;GOOD START GENETICS, INC.;COMBIMATRIX CORPORATION;REEL/FRAME:047889/0836

Effective date: 20181106

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: COMBIMATRIX CORPORATION, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:INN SA LLC;REEL/FRAME:050454/0559

Effective date: 20190910

Owner name: GOOD START GENETICS, INC., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:INN SA LLC;REEL/FRAME:050454/0559

Effective date: 20190910

Owner name: INVITAE CORPORATION, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:INN SA LLC;REEL/FRAME:050454/0559

Effective date: 20190910

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

AS Assignment

Owner name: PERCEPTIVE CREDIT HOLDINGS III, LP, NEW YORK

Free format text: PATENT SECURITY AGREEMENT;ASSIGNORS:INVITAE CORPORATION;GOOD START GENETICS, INC.;SINGULAR BIO, INC.;AND OTHERS;REEL/FRAME:054234/0872

Effective date: 20201002

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: INVITAE CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GOOD START GENETICS, INC.;REEL/FRAME:056756/0884

Effective date: 20210615

AS Assignment

Owner name: INVITAE CORPORATION, CALIFORNIA

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE THE SCHEDULE A OF THE CONFIRMATORY ASSIGNMENT PREVIOUSLY RECORDED AT REEL: 056756 FRAME: 0884. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNOR:GOOD START GENETICS, INC.;REEL/FRAME:057772/0828

Effective date: 20210615

AS Assignment

Owner name: YOUSCRIPT, LLC, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:PERCEPTIVE CREDIT HOLDINGS III, LP;REEL/FRAME:063282/0538

Effective date: 20230228

Owner name: SINGULAR BIO, INC., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:PERCEPTIVE CREDIT HOLDINGS III, LP;REEL/FRAME:063282/0538

Effective date: 20230228

Owner name: GOOD START GENETICS, INC., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:PERCEPTIVE CREDIT HOLDINGS III, LP;REEL/FRAME:063282/0538

Effective date: 20230228

Owner name: INVITAE CORPORATION, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:PERCEPTIVE CREDIT HOLDINGS III, LP;REEL/FRAME:063282/0538

Effective date: 20230228

AS Assignment

Owner name: LABORATORY CORPORATION OF AMERICA HOLDINGS, NORTH CAROLINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INVITAE CORPORATION;REEL/FRAME:068822/0025

Effective date: 20240805