
WO1996033276A1 - NUCLEOTIDE SEQUENCE OF THE HAEMOPHILUS INFLUENZAE Rd GENOME, FRAGMENTS THEREOF, AND USES THEREOF - Google Patents


Info

Publication number
WO1996033276A1
WO1996033276A1 PCT/US1996/005320 US9605320W WO9633276A1 WO 1996033276 A1 WO1996033276 A1 WO 1996033276A1 US 9605320 W US9605320 W US 9605320W WO 9633276 A1 WO9633276 A1 WO 9633276A1
Authority
WO
WIPO (PCT)
Prior art keywords
fragments
genome
sequence
nucleotide sequence
seq
Prior art date
Application number
PCT/US1996/005320
Other languages
French (fr)
Inventor
Robert D. Fleischmann
Mark D. Adams
Owen White
Hamilton O. Smith
J. Craig Venter
Original Assignee
Human Genome Sciences, Inc.
Johns Hopkins University
Priority date
Filing date
Publication date
Priority claimed from US08/476,102 (US6355450B1)
Priority claimed from US08/487,429 (US6468765B1)
Application filed by Human Genome Sciences, Inc. and Johns Hopkins University
Priority to EP96912845A (EP0821737A4)
Priority to JP8531888A (JPH11501520A)
Priority to AU55523/96A (AU5552396A)
Publication of WO1996033276A1


Classifications

    • C: CHEMISTRY; METALLURGY
    • C07: ORGANIC CHEMISTRY
    • C07K: PEPTIDES
    • C07K14/00: Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof
    • C07K14/195: Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof from bacteria
    • C07K14/285: Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof from bacteria from Pasteurellaceae (F), e.g. Haemophilus influenza

Definitions

  • the present invention relates to the field of molecular biology.
  • the present invention discloses compositions comprising the nucleotide sequence of Haemophilus influenzae, fragments thereof and usage in industrial fermentation and pharmaceutical development.
  • the complete genome sequence from a free living cellular organism has never been determined.
  • the first mycobacterium sequence should be completed by 1996, while E. coli and S. cerevisiae are expected to be completed before 1998. These are being done by random and/or directed sequencing of overlapping cosmid clones. No one has attempted to determine sequences of the order of a megabase or more by a random shotgun approach.
  • H. influenzae is a small (approximately 0.4 x 1 micron) non-motile, non-spore forming, gram-negative bacterium whose only natural host is human. It is a resident of the upper respiratory mucosa of children and adults and causes otitis media and respiratory tract infections mostly in children. The most serious complication is meningitis, which produces neurological sequelae in up to 50% of affected children.
  • Six H. influenzae serotypes (a through f) have been identified based on immunologically distinct capsular polysaccharide antigens. A number of non-typeable strains are also known. Serotype b accounts for the majority of human disease.
  • the lipooligosaccharide (LOS) component of the outer membrane and the genes of its synthetic pathway are under intensive study (Weiser et al., J. Bacteriol. 172:3304-3309 (1990)). While a vaccine has been available since 1984, the study of outer membrane components is motivated to some extent by the need for improved vaccines. Recently, the catalase gene was characterized and sequenced as a possible virulence-related gene (Bishni et al., in press). Elucidation of the H. influenzae genome will enhance the understanding of how H. influenzae causes invasive disease and how best to combat infection.
  • H. influenzae possesses a highly efficient natural DNA transformation system which has been intensively studied in the non-encapsulated (R), serotype d strain (Kahn and Smith, J. Membrane Biology 81:89-103 (1984)). At least 16 transformation-specific genes have been identified and sequenced. Of these, four are regulatory (Redfield, J. Bacteriol. 173:5612-5618 (1991), and Chandler, Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992)), at least two are involved in recombination processes (Barouki and Smith, J. Bacteriol.
  • H. influenzae Rd transformation shows a number of interesting features including sequence-specific DNA uptake, rapid uptake of several double-stranded DNA molecules per competent cell into a membrane compartment called the transformasome, linear translocation of a single strand of the donor DNA into the cytoplasm, and synapsis and recombination of the strand with the chromosome by a single-strand displacement mechanism.
  • Rd transformation system is the most thoroughly studied of the gram-negative systems and distinct in a number of ways from the gram-positive systems.
  • H. influenzae Rd genome has been determined by pulsed-field agarose gel electrophoresis of restriction digests to be approximately 1.9 Mb, making its genome approximately 40% the size of E. coli (Lee and Smith, J. Bacteriol. 170:4402-4405 (1988)).
  • the restriction map of H. influenzae is circular (Lee et al., J. Bacteriol. 171:3016-3024 (1989), and Redfield and Lee, "Haemophilus influenzae Rd", pp. 2110-2112, In O'Brien, S.J. (ed), Genetic Maps: Locus Maps of Complex Genomes, Cold Spring Harbor Press, New York).
  • Various genes have been mapped to restriction fragments by Southern hybridization probing of restriction digest DNA bands. This map will be valuable in verification of the assembly of a complete genome sequence from randomly sequenced fragments.
  • GenBank currently contains about 100 kb of non-redundant H. influenzae DNA sequences. About half are from serotype b and half from Rd
  • the present invention is based on the sequencing of the Haemophilus influenzae Rd genome.
  • the primary nucleotide sequence which was generated is provided in SEQ ID NO:1.
  • the present invention provides the generated nucleotide sequence of the Haemophilus influenzae Rd genome or a representative fragment thereof, in a form which can be readily used, analyzed, and interpreted by a skilled artisan.
  • present invention is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence depicted in SEQ ID NO:1.
  • the present invention further provides nucleotide sequences which are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1.
  • the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence which is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 may be provided in a variety of mediums to facilitate its use.
  • the sequences of the present invention are recorded on computer readable media.
  • Such media includes, but is not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
  • the present invention further provides systems, particularly computer-based systems which contain the sequence information herein described stored in a data storage means. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.
  • fragments of the Haemophilus influenzae Rd genome of the present invention include, but are not limited to, fragments which encode peptides, hereinafter open reading frames (ORFs), fragments which modulate the expression of an operably linked ORF, hereinafter expression modulating fragments (EMFs), fragments which mediate the uptake of a linked DNA fragment into a cell, hereinafter uptake modulating fragments (UMFs), and fragments which can be used to diagnose the presence of Haemophilus influenzae Rd in a sample, hereinafter diagnostic fragments (DFs).
  • Each of the ORF fragments of the Haemophilus influenzae Rd genome disclosed in Tables 1(a) and 2, and the EMF found 5' to the ORF, can be used in numerous ways as polynucleotide reagents.
  • the sequences can be used as diagnostic probes or diagnostic amplification primers for the presence of a specific microbe in a sample, for the production of commercially important pharmaceutical agents, and to selectively control gene expression.
  • the present invention further includes recombinant constructs comprising one or more fragments of the Haemophilus influenzae Rd genome of the present invention.
  • the recombinant constructs of the present invention comprise vectors, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd has been inserted.
  • the present invention further provides host cells containing any one of the isolated fragments of the Haemophilus influenzae Rd genome of the present invention.
  • the host cells can be a higher eukaryotic host such as a mammalian cell, a lower eukaryotic cell such as a yeast cell, or can be a procaryotic cell such as a bacterial cell.
  • the present invention is further directed to isolated proteins encoded by the ORFs of the present invention.
  • a variety of methodologies known in the art can be utilized to obtain any one of the proteins of the present invention.
  • the amino acid sequence can be synthesized using commercially available peptide synthesizers.
  • the protein is purified from bacterial cells which naturally produce the protein.
  • the proteins of the present invention can alternatively be purified from cells which have been altered to express the desired protein.
  • the invention further provides methods of obtaining homologs of the fragments of the Haemophilus influenzae Rd genome of the present invention and homologs of the proteins encoded by the ORFs of the present invention.
  • using the nucleotide and amino acid sequences disclosed herein as a probe or as primers, and techniques such as PCR cloning and colony/plaque hybridization, one skilled in the art can obtain homologs.
  • the invention further provides antibodies which selectively bind one of the proteins of the present invention. Such antibodies include both monoclonal and polyclonal antibodies.
  • the invention further provides hybridomas which produce the above-described antibodies.
  • a hybridoma is an immortalized cell line which is capable of secreting a specific monoclonal antibody.
  • the present invention further provides methods of identifying test samples derived from cells which express one of the ORFs of the present invention, or homolog thereof. Such methods comprise incubating a test sample with one or more of the antibodies of the present invention, or one or more of the DFs of the present invention, under conditions which allow a skilled artisan to determine if the sample contains the ORF or product produced therefrom.
  • kits which contain the necessary reagents to carry out the above-described assays.
  • the invention provides a compartmentalized kit to receive, in close confinement, one or more containers which comprises: (a) a first container comprising one of the antibodies, or one of the DFs of the present invention; and (b) one or more other containers comprising one or more of the following: wash reagents, reagents capable of detecting presence of bound antibodies or hybridized DFs.
  • the present invention further provides methods of obtaining and identifying agents capable of binding to a protein encoded by one of the ORFs of the present invention.
  • agents include antibodies (described above), peptides, carbohydrates, pharmaceutical agents and the like.
  • Such methods comprise the steps of:
  • The complete genomic sequence of H. influenzae will be of great value to all laboratories working with this organism and for a variety of commercial purposes. Many fragments of the Haemophilus influenzae Rd genome will be immediately identified by similarity searches against GenBank or protein databases and will be of immediate value to Haemophilus researchers and of immediate commercial value for the production of proteins or to control gene expression. A specific example concerns PHA synthase. It has been reported that polyhydroxybutyrate is present in the membranes of H. influenzae Rd and that the amount correlates with the level of competence for transformation.
  • the PHA synthase that synthesizes this polymer has been identified and sequenced in a number of bacteria, none of which are evolutionarily close to H. influenzae.
  • This gene has yet to be isolated from H. influenzae by use of hybridization probes or PCR techniques.
  • the genomic sequence of the present invention allows the identification of the gene by utilizing search means described below.
  • sequenced genomes will provide the models for developing tools for the analysis of chromosome structure and function, including the ability to identify genes within large segments of genomic DNA, the structure, position, and spacing of regulatory elements, the identification of genes with potential industrial applications, and the ability to do comparative genomic and molecular phylogeny.
  • Figure 1 restriction map of the Haemophilus influenzae Rd genome.
  • Figure 2 Block diagram of a computer system 102 that can be used to implement the computer-based systems of the present invention.
  • Figure 3 A comparison of experimental coverage of up to approximately 4000 random sequence fragments assembled with AutoAssembler (squares) as compared to Lander-Waterman prediction for a 2.5 Mb genome (triangles) and a 1.6 Mb genome (circles) with a 460 bp average sequence length and a 25 bp overlap.
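The Lander-Waterman comparison in Figure 3 can be illustrated with a minimal sketch. It uses the standard Lander-Waterman expressions (coverage c = NL/G, expected fraction of the genome covered 1 - e^(-c), expected number of contigs N·e^(-c·σ) with σ = 1 - T/L) and the 460 bp read length and 25 bp overlap quoted in the caption. The function name and return layout are illustrative assumptions, and the source does not state which of these quantities Figure 3 actually plots.

```python
import math


def lander_waterman(n_fragments, genome_size, read_len=460, min_overlap=25):
    """Lander-Waterman expectations for a random shotgun project.

    Returns (coverage, fraction_covered, expected_contigs).
    read_len and min_overlap default to the values quoted for Figure 3.
    """
    # c: average number of times each base is sequenced
    coverage = n_fragments * read_len / genome_size
    # sigma: fraction of a read that does not need to overlap its neighbour
    sigma = 1.0 - min_overlap / read_len
    # Poisson probability that a given base is hit by at least one read
    fraction_covered = 1.0 - math.exp(-coverage)
    # expected number of contigs (islands) after assembling n_fragments reads
    expected_contigs = n_fragments * math.exp(-coverage * sigma)
    return coverage, fraction_covered, expected_contigs


if __name__ == "__main__":
    # The two genome sizes compared in Figure 3 (2.5 Mb and 1.6 Mb)
    for size in (2_500_000, 1_600_000):
        c, covered, contigs = lander_waterman(4000, size)
        print(f"{size} bp: c={c:.2f}, covered={covered:.1%}, contigs~{contigs:.0f}")
```

For a fixed number of fragments, the smaller genome size gives higher coverage and fewer expected contigs, which is the kind of separation between the two predicted curves that the figure caption describes.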
  • Figure 4 Data flow and computer programs used to manage, assemble, edit, and annotate the H. influenzae genome. Both Macintosh and Unix platforms are used to handle the AB 373 sequence data files (Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Washington
  • Factura is a Macintosh program designed for automatic vector sequence removal and end trimming of sequence files.
  • the program esp runs on a Macintosh platform and parses the feature data extracted from the sequence files by Factura to the Unix based H. influenzae relational database. Assembly is accomplished by retrieving a specific set of sequence files and their associated features using stp, an X-windows graphical interface and control program which can retrieve sequences from the H. influenzae database using user-defined or standard SQL queries.
  • the sequence files were assembled using TIGR Assembler, an assembly engine designed at TIGR for rapid and accurate assembly of thousands of sequence fragments.
  • TIGR Editor is a graphical interface which can parse the aligned sequence files from TIGR Assembler output and display the alignment and associated electropherograms for contig editing. Identification of putative coding regions was performed with Genemark (Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)), a Markov and Bayes modeled program for predicting gene locations, and trained on a H. influenzae sequence data set. Peptide searches were performed against the three reading frames of each Genemark predicted coding region using blaze (Brutlag et al., Computers Chem. 17:203 (1993)) run on a Maspar MP-2 massively parallel computer with 4096 microprocessors.
  • Results from each frame were combined into a single output file by mblzt.
  • Optimal protein alignments were obtained using the program praze which extends alignments across potential frameshifts.
  • the output was inspected using a custom graphic viewing program, gbyob, that interacts directly with the H. influenzae database.
  • the alignments were further used to identify potential frameshift errors and were targeted for additional editing.
  • Figure 5 A circular representation of the H. influenzae Rd chromosome illustrating the location of each predicted coding region containing a database match as well as selected global features of the genome.
  • Outer perimeter The location of the unique NotI restriction site (designated as nucleotide 1), the RsrII sites, and the SmaI sites.
  • Outer concentric circle The location of each identified coding region for which a gene identification was made. Each coding region location is coded as to role according to the color code in Fig. 6.
  • Second concentric circle Regions of high G/C content (> 42%, red; > 40%, blue) and high A/T content (> 66%, black; > 64%, green). High G/C content regions are specifically associated with the 6 ribosomal operons and the mu-like prophage.
  • Third concentric circle Coverage by lambda clones (blue). Over 300 lambda clones were sequenced from each end to confirm the overall structure of the genome and identify the 6 ribosomal operons.
  • Fourth concentric circle The locations of the 6 ribosomal operons (green), the tRNAs (black) and the cryptic mu-like prophage (blue).
  • Fifth concentric circle Simple tandem repeats. The locations of the following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA, TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA.
  • The putative origin of replication is illustrated by the outward pointing arrows (green) originating near base 603,000. Two potential termination sequences are shown near the opposite midpoint of the circle (red).
  • Figure 7 A comparison of the region of the H. influenzae chromosome containing the 8 genes of the fimbrial gene cluster present in H. influenzae type b and the same region in H. influenzae Rd. The region is flanked by the pepN and purE genes in both organisms. However, in the non-infectious Rd strain the 8 genes of the fimbrial gene cluster have been excised. A 172 bp spacer region is located in this region in the Rd strain and continues to be flanked by the pepN and purE genes.
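Two of the global features listed for Figure 5, regions of unusual base composition and simple tandem repeats, can be located with a short scan. The following is a minimal sketch: the percentage thresholds are those given in the figure legend, while the window size, the function names, and the minimum copy number for a repeat are illustrative assumptions that the source does not specify.

```python
import re


def gc_windows(seq, window=1000, step=1000):
    """Return (start, percent_gc) for each full window; start is 0-based."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / window))
    return results


def classify_window(percent_gc):
    """Label a window using the thresholds quoted in the Figure 5 legend."""
    percent_at = 100.0 - percent_gc  # assumes the window holds only A, C, G, T
    if percent_gc > 42:
        return "high G/C (> 42%)"
    if percent_gc > 40:
        return "high G/C (> 40%)"
    if percent_at > 66:
        return "high A/T (> 66%)"
    if percent_at > 64:
        return "high A/T (> 64%)"
    return "unremarkable"


def find_tandem_repeats(seq, unit, min_copies=3):
    """Return (start, copies) for each run of at least min_copies of unit."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit.upper()), min_copies))
    return [(m.start(), (m.end() - m.start()) // len(unit))
            for m in pattern.finditer(seq.upper())]
```

A call such as `find_tandem_repeats(genome, "GTCT")` would report runs of one of the repeat units named in the legend; the other listed units can be passed in the same way.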
  • Figure 8 Hydrophobicity analysis of five predicted channel-proteins.
  • the predicted coding region sequences were analyzed by the Kyte-
  • the present invention is based on the sequencing of the Haemophilus influenzae Rd genome.
  • The primary nucleotide sequence which was generated is provided in SEQ ID NO:1.
  • the "primary sequence" refers to the nucleotide sequence represented by the IUPAC nomenclature system.
  • The sequence provided in SEQ ID NO:1 is oriented relative to a unique Not I restriction endonuclease site found in the Haemophilus influenzae Rd genome. A skilled artisan will readily recognize that this start/stop point was chosen for convenience and does not reflect a structural significance.
  • the present invention provides the nucleotide sequence of SEQ ID NO:1, or a representative fragment thereof, in a form which can be readily used, analyzed, and interpreted by a skilled artisan.
  • the sequence is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence provided in SEQ ID NO:1.
  • a "representative fragment of the nucleotide sequence depicted in SEQ ID NO:1" refers to any portion of SEQ ID NO:1 which is not presently represented within a publicly available database.
  • Preferred representative fragments of the present invention are Haemophilus influenzae open reading frames, expression modulating fragments, uptake modulating fragments, and fragments which can be used to diagnose the presence of Haemophilus influenzae Rd in a sample.
  • a non-limiting identification of such preferred representative fragments is provided in Tables 1(a) and 2.
  • the nucleotide sequence information provided in SEQ ID NO:1 was obtained by sequencing the Haemophilus influenzae Rd genome using a megabase shotgun sequencing method. Using three parameters of accuracy discussed in the Examples below, the present inventors have calculated that the sequence in SEQ ID NO:1 has a maximum accuracy of 99.98%. Thus, the nucleotide sequence provided in SEQ ID NO:1 is a highly accurate, although not necessarily a 100% perfect, representation of the nucleotide sequence of the Haemophilus influenzae Rd genome.
  • Nucleotide sequence editing software is publicly available.
  • Applied Biosystems' (AB) AutoAssembler™ can be used as an aid during visual inspection of nucleotide sequences. Even if all of the very rare sequencing errors in SEQ ID NO:1 were corrected, the resulting nucleotide sequence would still be at least 99.9% identical to the nucleotide sequence in SEQ ID NO:1.
  • nucleotide sequences of the genomes from different strains of Haemophilus influenzae differ slightly. However, the nucleotide sequence of the genomes of all Haemophilus influenzae strains will be at least 99.9% identical to the nucleotide sequence provided in SEQ ID NO:1.
  • the present invention further provides nucleotide sequences which are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 in a form which can be readily used, analyzed and interpreted by the skilled artisan.
  • Methods for determining whether a nucleotide sequence is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 are routine and readily available to the skilled artisan.
  • the well known FASTA algorithm (Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85:2444 (1988))
  • the nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 may be "provided" in a variety of mediums to facilitate use thereof.
  • provided refers to a manufacture, other than an isolated nucleic acid molecule, which contains a nucleotide sequence of the present invention, i.e., the nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1.
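The 99.9% identity test can be illustrated once two sequences have been aligned. The sketch below is not the FASTA algorithm itself: it assumes an alignment has already been produced by such a program and supplied as two equal-length strings with "-" marking gaps, and it counts a gap opposite a residue as a mismatch. The function names and the gap convention are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the aligned columns of two gapped sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Drop columns where both sequences carry a gap; they hold no information
    columns = [(x, y) for x, y in zip(aligned_a.upper(), aligned_b.upper())
               if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("alignment contains no residues")
    # A column matches only when both positions hold the same residue
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a, aligned_b, threshold=99.9):
    """True if the aligned pair is at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Other conventions for the denominator exist (for example, the length of the shorter sequence), so the result for a heavily gapped alignment depends on which convention is chosen.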
  • Such a manufacture provides the Haemophilus influenzae Rd genome or a subset thereof (e.g., a Haemophilus Influenzae Rd open reading frame
  • a nucleotide sequence of the present invention can be recorded on computer readable media.
  • computer readable media refers to any medium which can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
  • recorded refers to a process for storing information on computer readable medium.
  • a skilled artisan can readily adopt any of the presently known methods for recording information on computer readable medium to generate manufactures comprising the nucleotide sequence information of the present invention.
  • a variety of data storage structures are available to a skilled artisan for creating a computer readable medium having recorded thereon a nucleotide sequence of the present invention.
  • the choice of the data storage structure will generally be based on the means chosen to access the stored information.
  • sequence information of the present invention can be represented in a word processing text file, formatted in commercially-available software such as WordPerfect and Microsoft Word, or represented in the form of an ASCII file, stored in a database application, such as DB2, Sybase, Oracle, or the like.
  • a skilled artisan can readily adapt any number of data processor structuring formats (e.g. text file or database) in order to obtain computer readable medium having recorded thereon the nucleotide sequence information of the present invention.
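The two storage structures named above, an ASCII text file and a database application, can be sketched as follows. This is an illustration under stated assumptions only: SQLite (from the Python standard library) stands in for a database application such as DB2, Sybase or Oracle, the ">" header line follows the common FASTA convention rather than anything specified in the source, and the table and function names are hypothetical.

```python
import sqlite3


def write_ascii(path, name, sequence, width=60):
    """Record a sequence in a plain ASCII file, one header line then residues."""
    with open(path, "w", encoding="ascii") as fh:
        fh.write(f">{name}\n")  # FASTA-style header (an assumed convention)
        for i in range(0, len(sequence), width):
            fh.write(sequence[i:i + width] + "\n")


def read_ascii(path):
    """Read back (name, sequence) from a file written by write_ascii."""
    with open(path, encoding="ascii") as fh:
        header = fh.readline().rstrip("\n")
        sequence = "".join(line.strip() for line in fh)
    return header[1:], sequence


def store_in_db(conn, name, sequence):
    """Record a sequence in a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(name TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.execute("INSERT OR REPLACE INTO sequences VALUES (?, ?)", (name, sequence))
    conn.commit()


def fetch_from_db(conn, name):
    """Return the stored residues for name, or None if it is absent."""
    row = conn.execute(
        "SELECT residues FROM sequences WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

The flat file suits sequential reading of the whole sequence, while the table suits retrieval by name through a query, which reflects the point that the choice of structure follows from how the stored information will be accessed.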
  • By providing the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 in computer readable form, a skilled artisan can routinely access the sequence information for a variety of purposes.
  • Computer software is publicly available which allows a skilled artisan to access sequence information provided in a computer readable medium.
  • the examples which follow demonstrate how software which implements the BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al, Comp.
  • Such ORFs are protein encoding fragments within the Haemophilus influenzae Rd genome and are useful in producing commercially important proteins such as enzymes used in fermentation reactions and in the production of commercially useful metabolites.
  • The present invention further provides systems, particularly computer-based systems, which contain the sequence information described herein. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.
  • a computer-based system refers to the hardware means, software means, and data storage means used to analyze the nucleotide sequence information of the present invention.
  • the minimum hardware means of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means.
  • the computer-based systems of the present invention comprise a data storage means having stored therein a nucleotide sequence of the present invention and the necessary hardware means and software means for supporting and implementing a search means.
  • data storage means refers to memory which can store nucleotide sequence information of the present invention, or a memory access means which can access manufactures having recorded thereon the nucleotide sequence information of the present invention.
  • search means refers to one or more programs which are implemented on the computer-based system to compare a target sequence or target structural motif with the sequence information stored within the data storage means. Search means are used to identify fragments or regions of the Haemophilus influenzae Rd genome which match a particular target sequence or target motif.
  • a variety of known algorithms are disclosed publicly, and a variety of commercially available software for conducting search means is available and can be used in the computer-based systems of the present invention. Examples of such software include, but are not limited to, MacPattern (EMBL), BLASTN and BLASTX (NCBI).
  • a "target sequence" can be any DNA or amino acid sequence of six or more nucleotides or two or more amino acids.
  • the most preferred sequence length of a target sequence is from about 10 to 100 amino acids or from about 30 to 300 nucleotide residues.
  • searches for commercially important fragments of the Haemophilus influenzae Rd genome such as sequence fragments involved in gene expression and protein processing, may be of shorter length.
  • a target structural motif refers to any rationally selected sequence or combination of sequences in which the sequence(s) are chosen based on a three-dimensional configuration which is formed upon the folding of the target motif.
  • target motifs include, but are not limited to, enzymic active sites and signal sequences.
  • Nucleic acid target motifs include, but are not limited to, promoter sequences, hairpin structures and inducible expression elements (protein binding sequences).
  • a variety of structural formats for the input and output means can be used to input and output the information in the computer-based systems of the present invention.
  • a preferred format for an output means ranks fragments of the Haemophilus influenzae Rd genome possessing varying degrees of homology to the target sequence or target motif. Such presentation provides a skilled artisan with a ranking of sequences which contain various amounts of the target sequence or target motif and identifies the degree of homology contained in the identified fragment.
  • a variety of comparing means can be used to compare a target sequence or target motif with the data storage means to identify sequence fragments of the Haemophilus influenzae Rd genome.
  • software which implements the BLAST and BLAZE algorithms was used to identify open reading frames within the Haemophilus influenzae Rd genome.
  • a skilled artisan can readily recognize that any one of the publicly available homology search programs can be used as the search means for the computer-based systems of the present invention.
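The combination of a search means with a ranked output format can be sketched with a deliberately simple comparison. The code below is not BLAST, BLAZE or MacPattern: it slides the target along the stored sequence without gaps, scores each window by percent identity, and returns the best windows in rank order. The function name, the ungapped scoring and the tie-breaking rule are all illustrative assumptions.

```python
def rank_matches(genome, target, top=5, min_identity=0.0):
    """Rank ungapped windows of genome by percent identity to target.

    Returns a list of (start, percent_identity), best first; start is 0-based.
    """
    genome, target = genome.upper(), target.upper()
    n = len(target)
    if n == 0 or n > len(genome):
        return []
    hits = []
    for start in range(len(genome) - n + 1):
        window = genome[start:start + n]
        matches = sum(1 for g, t in zip(window, target) if g == t)
        identity = 100.0 * matches / n
        if identity >= min_identity:
            hits.append((identity, start))
    # Highest identity first; ties are broken by the leftmost position
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [(start, identity) for identity, start in hits[:top]]
```

This brute-force scan costs time proportional to the product of the two lengths, so it is only practical for short targets; the heuristic programs named in the text exist precisely to avoid that cost on a genome-sized sequence.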
  • Figure 2 provides a block diagram of a computer system 102 that can be used to implement the present invention.
  • the computer system 102 includes a processor 106 connected to a bus 104.
  • main memory 108 preferably implemented as random access memory, RAM
  • secondary storage devices 110 such as a hard drive 112 and a removable medium storage device 114.
  • the removable medium storage device 114 may represent, for example, a floppy disk drive, a CD-ROM drive, a magnetic tape drive, etc.
  • a removable storage medium 116 (such as a floppy disk, a compact disk, a magnetic tape, etc.) containing control logic and/or data recorded therein may be inserted into the removable medium storage device 114.
  • the computer system 102 includes appropriate software for reading the control logic and/or the data from the removable medium storage device 114 once inserted in the removable medium storage device 114.
  • a nucleotide sequence of the present invention may be stored in a well known manner in the main memory 108, any of the secondary storage devices 110, and/or a removable storage medium 116.
  • Software for accessing and processing the genomic sequence (such as search tools, comparing tools, etc.) resides in main memory 108 during execution.
  • Another embodiment of the present invention is directed to isolated fragments of the Haemophilus influenzae Rd genome.
  • Haemophilus influenzae Rd genome of the present invention include, but are not limited to fragments which encode peptides, hereinafter open reading frames (ORFs), fragments which modulate the expression of an operably linked ORF, hereinafter expression modulating fragments (EMFs), fragments which mediate the uptake of a linked DNA fragment into a cell, hereinafter uptake modulating fragments (UMFs), and fragments which can be used to diagnose the presence ofHaemophilus influenzae Rd in a sample, hereinafter diagnostic fragments (DFs).
  • ORFs open reading frames
  • EMFs expression modulating fragments
  • UMFs uptake modulating fragments
  • DFs diagnostic fragments
  • an "isolated nucleic acid molecule" or an "isolated fragment of the Haemophilus influenzae Rd genome" refers to a nucleic acid molecule possessing a specific nucleotide sequence which has been subjected to purification means to reduce, from the composition, the number of compounds which are normally associated with the composition.
  • purification means can be used to generate the isolated fragments of the present invention. These include, but are not limited to, methods which separate constituents of a solution based on charge, solubility, or size.
  • Haemophilus influenzae Rd DNA can be mechanically sheared to produce fragments of 15-20 kb in length. These fragments can then be used to generate a Haemophilus influenzae Rd library by inserting them into lambda clones as described in the Examples below. Primers flanking, for example, an ORF provided in Table 1(a) can then be generated using nucleotide sequence information provided in SEQ ID NO:1. PCR cloning can then be used to isolate the ORF from the lambda DNA library. PCR cloning is well known in the art. Thus, given the availability of SEQ ID NO:1, Table 1(a) and Table 2, it would be routine to isolate any ORF or other nucleic acid fragment of the present invention.
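The primer-selection step described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used in the Examples: the function name `flanking_primers`, the fixed primer length, and the treatment of the genome as a linear string are all assumptions, and no melting-temperature or specificity checks are made.

```python
# Hypothetical sketch: choose primers that flank an ORF, given its coordinates.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def flanking_primers(genome: str, start: int, stop: int, primer_len: int = 20):
    """Return (forward, reverse) primers immediately flanking an ORF.

    start and stop are 1-based inclusive coordinates; they may be given in
    either order, since an ORF can lie on either strand. The genome is
    treated as linear (an assumption made for simplicity).
    """
    lo, hi = sorted((start, stop))
    left_end = lo - 1          # 0-based exclusive end of the left flank
    right_start = hi           # 0-based start of the right flank
    if left_end - primer_len < 0 or right_start + primer_len > len(genome):
        raise ValueError("not enough flanking sequence for primers")
    forward = genome[left_end - primer_len:left_end]
    # The reverse primer anneals to the top strand, so it is the reverse
    # complement of the sequence just downstream of the ORF.
    reverse = reverse_complement(genome[right_start:right_start + primer_len])
    return forward, reverse
```

For example, `flanking_primers(sequence, 9, 17, primer_len=4)` returns the four bases upstream of position 9 and the reverse complement of the four bases downstream of position 17.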
  • the isolated nucleic acid molecules of the present invention include, but are not limited to single stranded and double stranded DNA, and single stranded RNA.
  • an "open reading frame," ORF, means a series of triplets coding for amino acids without any termination codons and is a sequence translatable into protein.
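The definition above (a run of triplets with no termination codon) lends itself to a simple scan. The sketch below is illustrative and is not the ORF-calling method of the Examples; it assumes the standard stop codons TAA, TAG and TGA, looks only at the three frames of the given strand, and uses a hypothetical `min_codons` cutoff.

```python
# Hypothetical sketch: report stop-free stretches of codons in three frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_stretches(seq: str, min_codons: int = 3):
    """Return (frame, start, end) tuples, 0-based and half-open.

    Each tuple marks a maximal run of in-frame codons containing no stop
    codon; the terminating stop codon itself is excluded from the run.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    found.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # Close a run that reaches the end of the sequence.
        if (pos - run_start) // 3 >= min_codons:
            found.append((frame, run_start, pos))
    return found
```

Running the same function on the reverse complement of the sequence would cover the other three frames.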
  • Tables 1(a), 1(b) and 2 identify ORFs in the Haemophilus influenzae Rd genome.
  • Table 1(a) indicates the location of ORFs within the Haemophilus influenzae genome which encode the recited protein based on homology matching with protein sequences from the organism appearing in parentheticals (see the fourth column of Table 1(a)).
  • the first column of Table 1(a) provides the "GeneID" of a particular ORF. This information is useful for two reasons. First, the complete map of the Haemophilus influenzae Rd genome provided in Figures 6(A)-6(D) refers to the ORFs according to their GeneID numbers. Second, Table 1(b) uses the GeneID numbers to indicate which ORFs were provided previously in a public database.
  • the second and third columns in Table 1(a) indicate an ORF's position in the nucleotide sequence provided in SEQ ID NO:1.
  • ORFs may be oriented in opposite directions in the Haemophilus influenzae genome. This is reflected in columns 2 and 3.
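As a sketch of how the two coordinate columns could be used, the snippet below extracts an ORF from a genome string. It assumes, as a convention not stated explicitly in the text, that a column 2 value larger than the column 3 value marks an ORF on the complementary strand; the function name `orf_sequence` is hypothetical.

```python
# Hypothetical sketch: recover an ORF sequence from its two coordinates.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_sequence(genome: str, col2: int, col3: int):
    """Return (strand, sequence) for an ORF given 1-based inclusive coordinates.

    Assumption: col2 <= col3 means the ORF reads along the given strand ("+");
    col2 > col3 means it reads along the complementary strand ("-").
    """
    if col2 <= col3:
        return "+", genome[col2 - 1:col3]
    return "-", reverse_complement(genome[col3 - 1:col2])
```

With this convention, the same coding sequence is returned whichever strand carries the ORF, read 5' to 3' in the direction of translation.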
  • the fifth column of Table 1(a) indicates the percent identity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column.
  • the sixth column of Table 1(a) indicates the percent similarity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column.
  • the concepts of percent identity and percent similarity of two polypeptide sequences are well understood in the art. For example, two polypeptides 10 amino acids in length which differ at three amino acid positions (e.g., at positions 1, 3 and 5) are said to have a percent identity of 70%. However, the same two polypeptides would be deemed to have a percent similarity of 80% if, for example at position 5, the amino acid moieties, although not identical, were "similar" (i.e., possessed similar biochemical characteristics).
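The worked example above (70% identity, 80% similarity) can be reproduced with a short calculation. The grouping of "similar" residues below is an illustrative assumption, not the scoring scheme used to build Table 1(a), and the sequences are taken to be already aligned without gaps.

```python
# Hypothetical sketch: percent identity and similarity of two aligned peptides.
# The similarity groups are an illustrative assumption.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def percent_identity_similarity(a: str, b: str):
    """Return (percent identity, percent similarity) for equal-length peptides."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    identical = sum(x == y for x, y in zip(a, b))
    # A position counts as similar (but not identical) when both residues
    # fall in the same biochemical group.
    similar = sum(
        x != y and any(x in g and y in g for g in SIMILAR_GROUPS)
        for x, y in zip(a, b)
    )
    n = len(a)
    return 100 * identical / n, 100 * (identical + similar) / n


# Two 10-residue peptides differing at positions 1, 3 and 5; the residues at
# position 5 (Y and F) are both aromatic and so count as similar.
print(percent_identity_similarity("MKTAYIAKQR", "GKPAFIAKQR"))  # (70.0, 80.0)
```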
  • the seventh column in Table 1(a) indicates the length of the amino acid homology match.
  • Table 2 provides ORFs of the Haemophilus influenzae Rd genome which encode polypeptide sequences which did not elicit a "homology match" with a known protein sequence from another organism. Further details concerning the algorithms and criteria used for homology searches are provided in the Examples below.
  • ORFs in the Haemophilus influenzae Rd genome other than those listed in Tables 1(a), 1(b) and 2, such as ORFs which are overlapping or encoded by the opposite strand of an identified ORF in addition to those ascertainable using the computer-based systems of the present invention.
  • an "expression modulating fragment," EMF, means a series of nucleotide molecules which modulates the expression of an operably linked ORF or EMF.
  • a sequence is said to "modulate the expression of an operably linked sequence" when the expression of the sequence is altered by the presence of the EMF.
  • EMFs include, but are not limited to, promoters, and promoter modulating sequences (inducible elements).
  • One class of EMFs are fragments which induce the expression of an operably linked ORF in response to a specific regulatory factor or physiological event.
  • a review of known EMFs from Haemophilus is provided by Tomb et al., Gene 104:1-10 (1991) and Chandler, M. S., Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992).
  • EMF sequences can be identified within the Haemophilus influenzae Rd genome by their proximity to the ORFs provided in Tables 1(a), 1(b) and 2.
  • an "intergenic segment" refers to the fragments of the Haemophilus genome which are between two ORF(s) herein described.
  • EMFs can be identified using known EMFs as a target sequence or target motif in the computer-based systems of the present invention.
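A minimal sketch of the two ideas above: derive intergenic segments from ORF coordinates, then scan a segment for a target motif. The function names are hypothetical, the genome is treated as linear, and the motif in the usage note is only a placeholder for whatever known EMF would be used as the target.

```python
# Hypothetical sketch: intergenic segments and a motif scan within them.
import re


def intergenic_segments(orfs):
    """Return 1-based inclusive (start, end) gaps between consecutive ORFs.

    orfs is a list of coordinate pairs in either order (either strand).
    Overlapping ORFs are handled by tracking the furthest covered position.
    """
    spans = sorted(tuple(sorted(pair)) for pair in orfs)
    gaps = []
    covered_to = None
    for lo, hi in spans:
        if covered_to is not None and lo > covered_to + 1:
            gaps.append((covered_to + 1, lo - 1))
        covered_to = hi if covered_to is None else max(covered_to, hi)
    return gaps


def motif_hits(genome: str, segment, motif_regex: str):
    """Return 1-based genome positions where the motif starts within a segment."""
    start, end = segment
    region = genome[start - 1:end]
    return [start + m.start() for m in re.finditer(motif_regex, region)]
```

For instance, `motif_hits(genome, (7, 12), "TATAAT")` lists where that six-base pattern begins within positions 7 to 12; only the given strand is searched, and overlapping matches are not reported.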
  • An EMF trap vector contains a cloning site 5' to a marker sequence.
  • a marker sequence encodes an identifiable phenotype, such as antibiotic resistance or a complementing nutrition auxotrophic factor, which can be identified or assayed when the EMF trap vector is placed within an appropriate host under appropriate conditions.
  • an EMF will modulate the expression of an operably linked marker sequence.
  • a sequence which is suspected as being an EMF is cloned in all three reading frames in one or more restriction sites upstream from the marker sequence in the EMF trap vector.
  • the vector is then transformed into an appropriate host using known procedures and the phenotype of the transformed host is examined under appropriate conditions.
  • an EMF will modulate the expression of an operably linked marker sequence.
  • an "uptake modulating fragment," UMF, means a series of nucleotide molecules which mediate the uptake of a linked DNA fragment into a cell.
  • UMFs can be readily identified using known UMFs as a target sequence or target motif with the computer-based systems described above. The presence and activity of a UMF can be confirmed by attaching the suspected UMF to a marker sequence. The resulting nucleic acid molecule is then incubated with an appropriate host under appropriate conditions and the uptake of the marker sequence is determined. As described above, a UMF will increase the frequency of uptake of a linked marker sequence.
  • a review of DNA uptake in Haemophilus is provided by Goodgall, S.H., et al., J. Bact. 172:5924-5928 (1990).
  • a "diagnostic fragment," DF, means a series of nucleotide molecules which selectively hybridize to Haemophilus influenzae sequences. DFs can be readily identified by identifying unique sequences within the Haemophilus influenzae Rd genome, or by generating and testing probes or amplification primers consisting of the DF sequence in an appropriate diagnostic format which determines amplification or hybridization selectivity.
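One way to picture "identifying unique sequences" is to list short subsequences (k-mers) of the target genome that do not occur, on either strand, in a set of background sequences. The sketch below is an assumption-laden illustration: exact k-mer matching stands in for hybridization selectivity, and the function names and the choice of k are hypothetical.

```python
# Hypothetical sketch: candidate diagnostic k-mers absent from background DNA.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_diagnostic_kmers(target: str, backgrounds, k: int):
    """Return sorted k-mers of target found on neither strand of any background."""
    background = set()
    for seq in backgrounds:
        background |= kmers(seq, k) | kmers(reverse_complement(seq), k)
    return sorted(kmers(target, k) - background)
```

Candidates found this way would still need to be tested as probes or primers in a diagnostic format, as the text describes.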
  • sequences falling within the scope of the present invention are not limited to the specific sequences herein described, but also include allelic and species variations thereof. Allelic and species variations can be routinely determined by comparing the sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 with a sequence from another isolate of the same species. Furthermore, to accommodate codon variability, the invention includes nucleic acid molecules coding for the same amino acid sequences as do the specific ORFs disclosed herein. In other words, in the coding region of an ORF, substitution of one codon for another which encodes the same amino acid is expressly contemplated.
  • Any specific sequence disclosed herein can be readily screened for errors by resequencing a particular fragment, such as an ORF, in both directions (i.e., sequence both strands).
  • error screening can be performed by sequencing corresponding polynucleotides of Haemophilus influenzae origin isolated by using part or all of the fragments in question as a probe or primer.
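The both-strand check can be illustrated by comparing a forward read with the reverse complement of the read from the opposite strand. This sketch assumes the two reads cover exactly the same interval and differ only by substitutions; real reads would first need alignment. The function name is hypothetical.

```python
# Hypothetical sketch: flag positions where reads from the two strands disagree.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strand_discrepancies(forward_read: str, reverse_read: str):
    """Return (position, forward base, reverse-strand base) for each mismatch.

    reverse_read is the read of the opposite strand, written 5' to 3'.
    Positions are 1-based along the forward read.
    """
    rc = reverse_complement(reverse_read)
    if len(rc) != len(forward_read):
        raise ValueError("reads must cover the same interval")
    return [
        (i + 1, f, r)
        for i, (f, r) in enumerate(zip(forward_read, rc))
        if f != r
    ]
```

An empty list means the two strands agree at every position; any reported position is a candidate sequencing error to resolve by further reads.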
  • Each of the ORFs of the Haemophilus influenzae Rd genome disclosed in Tables 1(a), 1(b) and 2, and the EMF found 5' to the ORF can be used in numerous ways as polynucleotide reagents.
  • sequences can be used as diagnostic probes or diagnostic amplification primers to detect the presence of a specific microbe, such as Haemophilus influenzae Rd, in a sample. This is especially the case with the fragments or ORFs of Table 2, which will be highly selective for Haemophilus influenzae.
  • fragments of the present invention can be used to control gene expression through triple helix formation or antisense DNA or RNA, both of which methods are based on the binding of a polynucleotide sequence to DNA or RNA.
  • Polynucleotides suitable for use in these methods are usually 20 to 40 bases in length and are designed to be complementary to a region of the gene involved in transcription (triple helix - see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 247:456 (1988); and Dervan et al., Science 257:1360 (1991)) or to the mRNA itself (antisense - Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, FL (1988)).
  • Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks translation of an mRNA molecule into polypeptide. Both techniques have been demonstrated to be effective in model systems. Information contained in the sequences of the present invention is necessary for the design of an antisense or triple helix oligonucleotide.
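As a sketch of how sequence information feeds antisense design, the snippet below returns the DNA oligonucleotide complementary to a chosen stretch of an mRNA, limited to the 20 to 40 base range mentioned above. The function name and the choice of a DNA (rather than RNA) oligonucleotide are assumptions, and no check of secondary structure or specificity is made.

```python
# Hypothetical sketch: antisense DNA oligo complementary to an mRNA region.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_oligo(mrna: str, start: int, length: int = 20) -> str:
    """Return the antisense DNA oligo (5' to 3') for an mRNA region.

    start is 1-based; the mRNA may be written with U or T.
    """
    if not 20 <= length <= 40:
        raise ValueError("length should be between 20 and 40 bases")
    target = mrna[start - 1:start - 1 + length].upper().replace("U", "T")
    if len(target) < length:
        raise ValueError("region runs past the end of the mRNA")
    # The antisense strand is the reverse complement of the target region.
    return target.translate(COMPLEMENT)[::-1]
```

The returned oligo, read 5' to 3', pairs with the target region of the mRNA read 5' to 3' in antiparallel orientation.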
  • the present invention further provides recombinant constructs comprising one or more fragments of the Haemophilus influenzae Rd genome of the present invention.
  • the recombinant constructs of the present invention comprise a vector, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted, in a forward or reverse orientation.
  • the vector may further comprise regulatory sequences, including for example, a promoter, operably linked to the ORF.
  • the vector may further comprise a marker sequence or heterologous ORF operably linked to the EMF or UMF.
  • Bacterial: pBs, phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia).
  • Promoter regions can be selected from any desired gene using CAT (chloramphenicol acetyltransferase) vectors or other vectors with selectable markers.
  • Two appropriate vectors are pKK232-8 and pCM7.
  • Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda PR, and trc.
  • Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art.
  • the present invention further provides host cells containing any one of the isolated fragments of the Haemophilus influenzae Rd genome of the present invention, wherein the fragment has been introduced into the host cell using known transformation methods.
  • the host cell can be a higher eukaryotic host cell, such as a mammalian cell, a lower eukaryotic host cell, such as a yeast cell, or the host cell can be a procaryotic cell, such as a bacterial cell.
  • Introduction of the recombinant construct into the host cell can be effected by calcium phosphate transfection, DEAE-dextran mediated transfection, or electroporation (Davis, L. et al., Basic Methods in Molecular Biology (1986)).
  • the host cells containing one of the fragments of the Haemophilus influenzae Rd genome of the present invention can be used in conventional manners to produce the gene product encoded by the isolated fragment (in the case of an ORF) or can be used to produce a heterologous protein under the control of the EMF.
  • the present invention further provides isolated polypeptides encoded by the nucleic acid fragments of the present invention or by degenerate variants of the nucleic acid fragments of the present invention.
  • by "degenerate variant" is intended nucleotide fragments which differ from a nucleic acid fragment of the present invention (e.g., an ORF) by nucleotide sequence but, due to the degeneracy of the Genetic Code, encode an identical polypeptide sequence.
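The definition of a degenerate variant can be checked mechanically: two coding sequences that differ in nucleotides but translate to the same polypeptide. The sketch below uses the standard genetic code; the function names are hypothetical and the inputs are assumed to be in-frame coding sequences.

```python
# Hypothetical sketch: test whether one coding sequence is a degenerate
# variant of another, using the standard genetic code.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order map onto the amino acid string above.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(orf: str) -> str:
    """Translate an in-frame DNA sequence; '*' marks a termination codon."""
    orf = orf.upper()
    if len(orf) % 3:
        raise ValueError("length must be a multiple of three")
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))


def is_degenerate_variant(a: str, b: str) -> bool:
    """True when a and b differ in sequence but encode the same polypeptide."""
    return a.upper() != b.upper() and translate(a) == translate(b)


# GCT and GCC both encode alanine; AAA and AAG both encode lysine.
print(is_degenerate_variant("ATGGCTAAATAA", "ATGGCCAAGTAA"))  # True
```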
  • Preferred nucleic acid fragments of the present invention are the ORFs depicted in Table 1(a) which encode proteins.
  • the amino acid sequence can be synthesized using commercially available peptide synthesizers. This is particularly useful in producing small peptides and fragments of larger polypeptides. Fragments are useful, for example, in generating antibodies against the native polypeptide.
  • the polypeptide or protein is purified from bacterial cells which naturally produce the polypeptide or protein.
  • One skilled in the art can readily follow known methods for isolating polypeptides and proteins in order to obtain one of the isolated polypeptides or proteins of the present invention. These include, but are not limited to, immunochromatography, HPLC, size-exclusion chromatography, ion-exchange chromatography, and immuno-affinity chromatography.
  • The polypeptides and proteins of the present invention can alternatively be purified from cells which have been altered to express the desired polypeptide or protein.
  • a cell is said to be altered to express a desired polypeptide or protein when the cell, through genetic manipulation, is made to produce a polypeptide or protein which it normally does not produce or which the cell normally produces at a lower level.
  • One skilled in the art can readily adapt procedures for introducing and expressing either recombinant or synthetic sequences into eukaryotic or prokaryotic cells in order to generate a cell which produces one of the polypeptides or proteins of the present invention.
  • Any host/vector system can be used to express one or more of the ORFs of the present invention. These include, but are not limited to, eukaryotic hosts such as HeLa cells, CV-1 cells, COS cells, and Sf9 cells, as well as prokaryotic hosts such as E. coli and B. subtilis.
  • the most preferred cells are those which do not normally express the particular polypeptide or protein or which express the polypeptide or protein at a low natural level.
  • Recombinant means that a polypeptide or protein is derived from recombinant (e.g., microbial or mammalian) expression systems.
  • Microbial refers to recombinant polypeptides or proteins made in bacterial or fungal (e.g., yeast) expression systems.
  • recombinant microbial defines a polypeptide or protein essentially free of native endogenous substances and unaccompanied by associated native glycosylation. Polypeptides or proteins expressed in most bacterial cultures, e.g., E. coli, will be free of glycosylation modifications; polypeptides or proteins expressed in yeast will have a glycosylation pattern different from that expressed in mammalian cells.
  • Nucleotide sequence refers to a heteropolymer of deoxyribonucleotides.
  • DNA segments encoding the polypeptides and proteins provided by this invention are assembled from fragments of the Haemophilus influenzae Rd genome and short oligonucleotide linkers, or from a series of oligonucleotides, to provide a synthetic gene which is capable of being expressed in a recombinant transcriptional unit comprising regulatory elements derived from a microbial or viral operon.
  • Recombinant expression vehicle or vector refers to a plasmid or phage or virus or vector, for expressing a polypeptide from a DNA (RNA) sequence.
  • the expression vehicle can comprise a transcriptional unit comprising an assembly of (1) a genetic element or elements having a regulatory role in gene expression, for example, promoters or enhancers, (2) a structural or coding sequence which is transcribed into mRNA and translated into protein, and (3) appropriate transcription initiation and termination sequences.
  • Structural units intended for use in yeast or eukaryotic expression systems preferably include a leader sequence enabling extracellular secretion of translated protein by a host cell.
  • when recombinant protein is expressed without a leader or transport sequence, it may include an N-terminal methionine residue. This residue may or may not be subsequently cleaved from the expressed recombinant protein to provide a final product.
  • Recombinant expression system means host cells which have stably integrated a recombinant transcriptional unit into chromosomal DNA or carry the recombinant transcriptional unit extrachromosomally.
  • the cells can be prokaryotic or eukaryotic.
  • Recombinant expression systems as defined herein will express heterologous polypeptides or proteins upon induction of the regulatory elements linked to the DNA segment or synthetic gene to be expressed.
  • Mature proteins can be expressed in mammalian cells, yeast, bacteria, or other cells under the control of appropriate promoters. Cell-free translation systems can also be employed to produce such proteins using RNAs derived from the DNA constructs of the present invention.
  • Appropriate cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook, et al., in Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor, New York (1989), the disclosure of which is hereby incorporated by reference.
  • recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and S. cerevisiae TRP1 gene, and a promoter derived from a highly-expressed gene to direct transcription of a downstream structural sequence.
  • promoters can be derived from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase
  • the heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein into the periplasmic space or extracellular medium.
  • the heterologous sequence can encode a fusion protein including an N-terminal identification peptide imparting desired characteristics, e.g., stabilization or simplified purification of expressed recombinant product.
  • Useful expression vectors for bacterial use are constructed by inserting a structural DNA sequence encoding a desired protein together with suitable translation initiation and termination signals in operable reading phase with a functional promoter.
  • the vector will comprise one or more phenotypic selectable markers and an origin of replication to ensure maintenance of the vector and to, if desirable, provide amplification within the host.
  • Suitable prokaryotic hosts for transformation include E. coli, Bacillus subtilis,
  • useful expression vectors for bacterial use can comprise a selectable marker and bacterial origin of replication derived from commercially available plasmids comprising genetic elements of the well known cloning vector pBR322 (ATCC 37017).
  • Such commercial vectors include, for example, pKK223-3 (Pharmacia Fine Chemicals, Uppsala, Sweden) and GEM 1 (Promega Biotec, Madison, WI, USA). These pBR322 "backbone" sections are combined with an appropriate promoter and the structural sequence to be expressed.
  • the selected promoter is derepressed by appropriate means (e.g., temperature shift or chemical induction) and cells are cultured for an additional period.
  • Cells are typically harvested by centrifugation, disrupted by physical or chemical means, and the resulting crude extract retained for further purification.
  • mammalian cell culture systems can also be employed to express recombinant protein.
  • mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts, described by Gluzman, Cell 23:175 (1981), and other cell lines capable of expressing a compatible vector, for example, the C127, 3T3, CHO, HeLa and BHK cell lines.
  • Mammalian expression vectors will comprise an origin of replication, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation site, splice donor and acceptor sites, transcriptional termination sequences, and 5' flanking nontranscribed sequences.
  • DNA sequences derived from the SV40 viral genome, for example, SV40 origin, early promoter, enhancer, splice, and polyadenylation sites, may be used to provide the required nontranscribed genetic elements.
  • Recombinant polypeptides and proteins produced in bacterial culture are usually isolated by initial extraction from cell pellets, followed by one or more salting-out, aqueous ion exchange or size exclusion chromatography steps. Protein refolding steps can be used, as necessary, in completing configuration of the mature protein. Finally, high performance liquid chromatography (HPLC) can be employed for final purification steps. Microbial cells employed in expression of proteins can be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents.
  • the present invention further includes isolated polypeptides, proteins and nucleic acid molecules which are substantially equivalent to those herein described.
  • substantially equivalent can refer both to nucleic acid and amino acid sequences, for example a mutant sequence, that varies from a reference sequence by one or more substitutions, deletions, or additions, the net effect of which does not result in an adverse functional dissimilarity between reference and subject sequences.
  • sequences having equivalent biological activity, and equivalent expression characteristics are considered substantially equivalent.
  • truncation of the mature sequence should be disregarded.
  • the invention further provides methods of obtaining homologs, from other strains of Haemophilus influenzae, of the fragments of the Haemophilus influenzae Rd genome of the present invention and homologs of the proteins encoded by the ORFs of the present invention.
  • a sequence or protein of Haemophilus influenzae is defined as a homolog of a fragment of the Haemophilus influenzae Rd genome or a protein encoded by one of the ORFs of the present invention, if it shares significant homology to one of the fragments of the Haemophilus influenzae Rd genome of the present invention or a protein encoded by one of the ORFs of the present invention.
  • by using the sequence disclosed herein as a probe or as primers, and techniques such as PCR cloning and colony/plaque hybridization, one skilled in the art can obtain homologs.
  • nucleic acid molecules or proteins are said to "share significant homology" if the two contain regions which possess greater than 85% sequence (amino acid or nucleic acid) homology.
  • Region specific primers or probes derived from the nucleotide sequence provided in SEQ ID NO:1 or from a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 can be used to prime DNA synthesis and PCR amplification, as well as to identify colonies containing cloned DNA encoding a homolog using known methods (Innis et al., PCR Protocols, Academic Press, San Diego, CA (1990)).
  • sequences which are greater than 75% homologous to the primer will be amplified.
  • sequences which are greater than 40-50% homologous to the primer will also be amplified.
  • When using DNA probes derived from SEQ ID NO:1 or from a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 for colony/plaque hybridization, one skilled in the art will recognize that by employing high stringency conditions (e.g., hybridizing at 50-65°C in 5X SSPC and 50% formamide, and washing at 50-65°C in 0.5X SSPC), sequences having regions which are greater than 90% homologous to the probe can be obtained, and that by employing lower stringency conditions (e.g., hybridizing at 35-37°C in 5X SSPC and 40-45% formamide, and washing at 42°C in SSPC), sequences having regions which are greater than 35-45% homologous to the probe will be obtained.
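The homology thresholds quoted above can be pictured with a simple in silico analogue: slide the probe along a target sequence and report windows whose ungapped percent identity exceeds a cutoff. This is only a rough sketch; it does not model hybridization chemistry, temperature, or salt, and the function name and exact-position matching are assumptions.

```python
# Hypothetical sketch: windows of a target exceeding a percent-identity cutoff
# relative to a probe (ungapped comparison, given strand only).
def homologous_windows(probe: str, target: str, min_percent: float):
    """Return (1-based start, percent identity) for windows above min_percent."""
    k = len(probe)
    hits = []
    for i in range(len(target) - k + 1):
        window = target[i:i + k]
        matches = sum(p == t for p, t in zip(probe, window))
        pct = 100.0 * matches / k
        if pct > min_percent:
            hits.append((i + 1, pct))
    return hits
```

With a 10-base probe and one mismatch in the best window, a 75% cutoff reports that window at 90% identity, while a cutoff of 90% reports nothing because the comparison is strictly "greater than".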
  • Any organism can be used as the source for homologs of the present invention so long as the organism naturally expresses such a protein or contains genes encoding the same.
  • the most preferred organisms for isolating homologs are bacteria which are closely related to Haemophilus influenzae Rd.
  • Uses for the Compositions of the Invention
  • Each ORF provided in Table 1(a) was assigned to one of 102 biological role categories (adapted from Riley, M., Microbiology Reviews 57(4):S62 (1993)). This allows the skilled artisan to determine a use for each identified coding sequence. Table 1(a) further provides an identification of the type of polypeptide which is encoded by each ORF. As a result, one skilled in the art can use the polypeptides of the present invention for commercial, therapeutic and industrial purposes consistent with the type of putative identification of the polypeptide.
  • the identifications of the Haemophilus influenzae ORFs permit one skilled in the art to use the Haemophilus influenzae ORFs in a manner similar to the known type of sequences for which the identification is made; for example, to ferment a particular sugar source or to produce a particular metabolite.
  • for a review of enzymes used within the commercial industry, see Biochemical Engineering and Biotechnology Handbook, 2nd ed., Macmillan Publ. Ltd., NY (1991) and Biocatalysts in Organic Syntheses, ed. J. Tramper et al., Elsevier Science Publishers, Amsterdam, The Netherlands (1985).
  • Open reading frames encoding proteins that mediate the catalytic reactions of intermediary and macromolecular metabolism, the biosynthesis of small molecules, cellular processes and other functions include those encoding enzymes involved in the degradation of the intermediary products of metabolism, enzymes involved in central intermediary metabolism, enzymes involved in respiration, both aerobic and anaerobic, enzymes involved in fermentation, enzymes involved in ATP proton motive force conversion, enzymes involved in broad regulatory function, enzymes involved in amino acid synthesis, enzymes involved in nucleotide synthesis, and enzymes involved in cofactor and vitamin synthesis. These can be used for industrial biosynthesis.
  • the various metabolic pathways present in Haemophilus can be identified based on absolute nutritional requirements as well as by examining the various enzymes identified in Table 1(a).
  • a number of the proteins encoded by the identified ORFs in Table 1(a) are particularly involved in the degradation of intermediary metabolites as well as non-macromolecular metabolism.
  • Some of the enzymes identified include amylases, glucose oxidases, and catalase.
  • Proteolytic enzymes are another class of commercially important enzymes. Proteolytic enzymes find use in a number of industrial processes including the processing of flax and other vegetable fibers, in the extraction, clarification and depectinization of fruit juices, in the extraction of vegetable oil and in the maceration of fruits and vegetables to give unicellular fruits.
  • the metabolism of glucose, galactose, fructose and xylose is an important part of the primary metabolism of Haemophilus. Enzymes involved in the degradation of these sugars can be used in industrial fermentation. Some of the important sugar transforming enzymes, from a commercial viewpoint, include sugar isomerases such as glucose isomerase. Other metabolic enzymes have found commercial use, such as glucose oxidases, which produce ketogulonic acid (KGA). KGA is an intermediate in the commercial production of ascorbic acid using Reichstein's procedure (see Krueger et al., Biotechnology 6(A), Rhine, H.J. et al., eds., Verlag Press, Weinheim, Germany (1984)).
  • Glucose oxidase (GOD) is commercially available and has been used in purified form as well as in an immobilized form for the deoxygenation of beer. See Hartmeir et al., Biotechnology Letters 7:21 (1979). The most important application of GOD is the industrial scale fermentation of gluconic acid. Market for gluconic acids which are used in the detergent, textile, leather, photographic, pharmaceutical, food, feed and concrete industry (see
  • the main sweetener used in the world today is sugar which comes from sugar beets and sugar cane.
  • the glucose isomerase process shows the largest expansion in the market today. Initially, soluble enzymes were used and later immobilized enzymes were developed (Krueger et al, Biotechnology, The Textbook of Industrial Microbiology,
  • Proteinases such as alkaline serine proteinases, are used as detergent additives and thus represent one of the largest volumes of microbial enzymes used in the industrial sector. Because of their industrial importance, there is a large body of published and unpublished information regarding the use of these enzymes in industrial processes. (See Faultman et al., Acid Proteases Structure Function and Biology, Tang, J., ed., Plenum Press, New York
  • Lipases: A major use of lipases is in the fat and oil industry for the production of neutral glycerides using lipase catalyzed inter-esterification of readily available triglycerides. Applications of lipases include the use as a detergent additive to facilitate the removal of fats from fabrics in the course of the washing procedures.
  • Amino transferases, enzymes involved in the biosynthesis and metabolism of amino acids, are useful in the catalytic production of amino acids.
  • one advantage of using microbial based enzyme systems is that the amino transferase enzymes catalyze the stereo-selective synthesis of only L-amino acids and generally possess uniformly high catalytic rates.
  • a description of the use of amino transferases for amino acid production is provided by Roselle-David, Methods of Enzymology 756:479 (1987).
  • Another category of useful proteins encoded by the ORFs of the present invention includes enzymes involved in nucleic acid synthesis, repair, and recombination.
  • a variety of commercially important enzymes have previously been isolated from members of Haemophilus sp. These include the Hinc II, Hind III, and HinfI restriction endonucleases.
  • Table 1(a) identifies a wide array of enzymes, such as restriction enzymes, ligases, gyrases and methylases, which have immediate use in the biotechnology industry.
  • the proteins of the present invention, as well as homologs thereof, can be used in a variety of procedures and methods known in the art which are currently applied to other proteins.
  • the proteins of the present invention can further be used to generate an antibody which selectively binds the protein.
  • Such antibodies can be either monoclonal or polyclonal antibodies, as well as fragments of these antibodies, and humanized forms.
  • the invention further provides antibodies which selectively bind to one of the proteins of the present invention and hybridomas which produce these antibodies.
  • a hybridoma is an immortalized cell line which is capable of secreting a specific monoclonal antibody.
  • Any animal which is known to produce antibodies can be immunized with the pseudogene polypeptide.
  • Methods for immunization are well known in the art. Such methods include subcutaneous or intraperitoneal injection of the polypeptide.
  • One skilled in the art will recognize that the amount of the protein encoded by the ORF of the present invention used for immunization will vary based on the animal which is immunized, the antigenicity of the peptide and the site of injection.
  • the protein which is used as an immunogen may be modified or administered in an adjuvant in order to increase the protein's antigenicity.
  • Methods of increasing the antigenicity of a protein include, but are not limited to, coupling the antigen with a heterologous protein (such as globulin or β-galactosidase) or through the inclusion of an adjuvant during immunization.
  • Spleen cells from the immunized animals are removed, fused with myeloma cells, such as SP2/0-Ag14 myeloma cells, and allowed to become monoclonal antibody producing hybridoma cells.
  • Any one of a number of methods well known in the art can be used to identify the hybridoma cell which produces an antibody with the desired characteristics. These include screening the hybridomas with an ELISA assay, western blot analysis, or radioimmunoassay (Lutz et al., Exp. Cell Res. 775:109-124 (1988)).
  • Hybridomas secreting the desired antibodies are cloned and the class and subclass is determined using procedures known in the art (Campbell,
  • Antibody-containing antisera is isolated from the immunized animal and is screened for the presence of antibodies with the desired specificity using one of the above-described procedures.
  • The present invention further provides the above-described antibodies in detectably labelled form.
  • Antibodies can be detectably labelled through the use of radioisotopes, affinity labels (such as biotin, avidin, etc.), enzymatic labels (such as horseradish peroxidase, alkaline phosphatase, etc.), fluorescent labels (such as FITC or rhodamine, etc.), paramagnetic atoms, etc. Procedures for accomplishing such labelling are well-known in the art, for example see (Sternberger, L.A. et al., J. Histochem. Cytochem. 18:315 (1970); Bayer, E.A. et al., Meth. Enzym. 62:308 (1979); Engval, E.
  • The labeled antibodies of the present invention can be used for in vitro, in vivo, and in situ assays to identify cells or tissues in which a fragment of the Haemophilus influenzae Rd genome is expressed.
  • The present invention further provides the above-described antibodies immobilized on a solid support.
  • Solid supports include plastics such as polycarbonate, complex carbohydrates such as agarose and sepharose, acrylic resins such as polyacrylamide, and latex beads. Techniques for coupling antibodies to such solid supports are well known in the art (Weir, D.M. et al., "Handbook of Experimental Immunology" 4th Ed., Blackwell Scientific Publications, Oxford, England, Chapter 10 (1986);
  • The immobilized antibodies of the present invention can be used for in vitro, in vivo, and in situ assays as well as for immunoaffinity purification of the proteins of the present invention.
5. Diagnostic Assays and Kits
  • The present invention further provides methods to identify the expression of one of the ORFs of the present invention, or homolog thereof, in a test sample, using one of the DFs or antibodies of the present invention.
  • Such methods comprise incubating a test sample with one or more of the antibodies or one or more of the DFs of the present invention and assaying for binding of the DFs or antibodies to components within the test sample.
  • Incubation conditions depend on the format employed in the assay, the detection methods employed, and the type and nature of the DF or antibody used in the assay.
  • One skilled in the art will recognize that any one of the commonly available hybridization, amplification or immunological assay formats can readily be adapted to employ the DFs or antibodies of the present invention. Examples of such assays can be found in Chard, T., An Introduction to Radioimmunoassay and Related Techniques, Elsevier Science Publishers, Amsterdam, The Netherlands (1986); Bullock, G.R.
  • Test samples of the present invention include cells, protein or membrane extracts of cells, or biological fluids such as sputum, blood, serum, plasma, or urine.
  • The test sample used in the above-described method will vary based on the assay format, nature of the detection method and the tissues, cells or extracts used as the sample to be assayed. Methods for preparing protein extracts or membrane extracts of cells are well known in the art and can readily be adapted in order to obtain a sample which is compatible with the system utilized.
  • kits which contain the necessary reagents to carry out the assays of the present invention.
  • The invention provides a compartmentalized kit to receive, in close confinement, one or more containers which comprises: (a) a first container comprising one of the DFs or antibodies of the present invention; and (b) one or more other containers comprising one or more of the following: wash reagents, reagents capable of detecting presence of a bound DF or antibody.
  • a compartmentalized kit includes any kit in which reagents are contained in separate containers.
  • Such containers include small glass containers, plastic containers or strips of plastic or paper.
  • Such containers allow one to efficiently transfer reagents from one compartment to another compartment such that the samples and reagents are not cross-contaminated, and the agents or solutions of each container can be added in a quantitative fashion from one compartment to another.
  • Such containers will include a container which will accept the test sample, a container which contains the antibodies used in the assay, containers which contain wash reagents (such as phosphate buffered saline, Tris-buffers, etc.), and containers which contain the reagents used to detect the bound antibody or DF.
  • Types of detection reagents include labelled nucleic acid probes, labelled secondary antibodies, or in the alternative, if the primary antibody is labelled, the enzymatic, or antibody binding reagents which are capable of reacting with the labelled antibody.
  • The present invention further provides methods of obtaining and identifying agents which bind to a protein encoded by one of the ORFs of the present invention or to one of the fragments and the Haemophilus genome herein described.
  • said method comprises the steps of:
  • the agents screened in the above assay can be, but are not limited to, peptides, carbohydrates, vitamin derivatives, or other pharmaceutical agents.
  • the agents can be selected and screened at random or rationally selected or designed using protein modeling techniques.
  • agents such as peptides, carbohydrates, pharmaceutical agents and the like are selected at random and are assayed for their ability to bind to the protein encoded by the ORF of the present invention.
  • agents may be rationally selected or designed.
  • an agent is said to be "rationally selected or designed" when the agent is chosen based on the configuration of the particular protein.
  • One skilled in the art can readily adapt currently available procedures to generate peptides, pharmaceutical agents and the like capable of binding to a specific peptide sequence in order to generate rationally designed antipeptide peptides, for example see Hurby et al., "Application of Synthetic Peptides: Antisense Peptides," In Synthetic Peptides, A User's Guide, W.H. Freeman, NY (1992), pp. 289-307, and Kaspczak et al., Biochemistry 28:9230-8 (1989), or pharmaceutical agents, or the like.
  • One class of agents of the present invention can be used to control gene expression through binding to one of the ORFs or EMFs of the present invention. As described above, such agents can be randomly screened or rationally designed/selected. Targeting the ORF or EMF allows a skilled artisan to design sequence specific or element specific agents, modulating the expression of either a single ORF or multiple ORFs which rely on the same EMF for expression control.
  • DNA binding agents are agents which contain base residues which hybridize or form a triple helix formation by binding to DNA or RNA.
  • Such agents can be based on the classic phosphodiester, ribonucleic acid backbone, or can be a variety of sulfhydryl or polymeric derivatives which have base attachment capacity.
  • Agents suitable for use in these methods usually contain 20 to 40 bases and are designed to be complementary to a region of the gene involved in transcription (triple helix - see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456 (1988); and Dervan et al., Science 251:1360 (1991)) or to the mRNA itself (antisense - Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, FL (1988)).
  • Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks translation of an mRNA molecule into polypeptide. Both techniques have been demonstrated to be effective in model systems. Information contained in the sequences of the present invention is necessary for the design of an antisense or triple helix oligonucleotide and other DNA binding agents.
  • Agents which bind to a protein encoded by one of the ORFs of the present invention can be used as a diagnostic agent, in the control of bacterial infection by modulating the activity of the protein encoded by the ORF.
  • Agents which bind to a protein encoded by one of the ORFs of the present invention can be formulated using known techniques to generate a pharmaceutical composition for use in controlling Haemophilus growth and infection.
  • The present invention further provides pharmaceutical agents which can be used to modulate the growth of Haemophilus influenzae, or another related organism, in vivo or in vitro.
  • A "pharmaceutical agent" is defined as a composition of matter which can be formulated using known techniques to provide a pharmaceutical composition.
  • Pharmaceutical agents of the present invention refer to the pharmaceutical agents which are derived from the proteins encoded by the ORFs of the present invention or are agents which are identified using the herein described assays.
  • A pharmaceutical agent is said to "modulate the growth of Haemophilus sp., or a related organism, in vivo or in vitro," when the agent reduces the rate of growth, rate of division, or viability of the organism in question.
  • The pharmaceutical agents of the present invention can modulate the growth of an organism in many fashions, although an understanding of the underlying mechanism of action is not needed to practice the use of the pharmaceutical agents of the present invention. Some agents will modulate the growth by binding to an important protein thus blocking the biological activity of the protein, while other agents may bind to a component of the outer surface of the organism, blocking attachment or rendering the organism more prone to attack by the body's natural immune system.
  • The agent may comprise a protein encoded by one of the ORFs of the present invention and serve as a vaccine.
  • the development and use of a vaccine based on outer membrane components, such as the LPS, are well known in the art.
  • A "related organism" is a broad term which refers to any organism whose growth can be modulated by one of the pharmaceutical agents of the present invention. In general, such an organism will contain a homolog of the protein which is the target of the pharmaceutical agent or the protein used as a vaccine. As such, related organisms do not need to be bacterial but may be fungal or viral pathogens.
  • the pharmaceutical agents and compositions of the present invention may be administered in a convenient manner such as by the oral, topical, intravenous, intraperitoneal, intramuscular, subcutaneous, intranasal or intradermal routes.
  • The pharmaceutical compositions are administered in an amount which is effective for treating and/or prophylaxis of the specific indication. In general, they are administered in an amount of at least about 10 μg/kg body weight and in most cases they will be administered in an amount not in excess of about 8 mg/kg body weight per day. In most cases, the dosage is from about 10 μg/kg to about 1 mg/kg body weight daily, taking into account the routes of administration, symptoms, etc.
  • the agents of the present invention can be used in native form or can be modified to form a chemical derivative.
  • A molecule is said to be a "chemical derivative" of another molecule when it contains additional chemical moieties not normally a part of the molecule. Such moieties may improve the molecule's solubility, absorption, biological half life, etc. The moieties may alternatively decrease the toxicity of the molecule, eliminate or attenuate any undesirable side effect of the molecule, etc. Moieties capable of mediating such effects are disclosed in Remington's Pharmaceutical Sciences (1980).
  • A change in the immunological character of the functional derivative is measured by a competitive type immunoassay. Changes in immunomodulation activity are measured by the appropriate assay. Modifications of such protein properties as redox or thermal stability, biological half-life, hydrophobicity, susceptibility to proteolytic degradation or the tendency to aggregate with carriers or into multimers are assayed by methods well known to the ordinarily skilled artisan.
  • The therapeutic effects of the agents of the present invention may be obtained by providing the agent to a patient by any suitable means (i.e., inhalation, intravenously, intramuscularly, subcutaneously, enterally, or parenterally). It is preferred to administer the agent of the present invention so as to achieve an effective concentration within the blood or tissue in which the growth of the organism is to be controlled.
  • The preferred method is to administer the agent by injection.
  • The administration may be by continuous infusion, or by single or multiple injections.
  • The dosage of the administered agent will vary depending upon such factors as the patient's age, weight, height, sex, general medical condition, previous medical history, etc. In general, it is desirable to provide the recipient with a dosage of agent which is in the range of from about 1 pg/kg to 10 mg/kg (body weight of patient), although a lower or higher dosage may be administered.
  • the therapeutically effective dose can be lowered by using combinations of the agents of the present invention or another agent.
  • compositions of the present invention can be administered concurrently with, prior to, or following the administration of the other agent.
  • the agents of the present invention are intended to be provided to recipient subjects in an amount sufficient to decrease the rate of growth (as defined above) of the target organism.
  • The administration of the agent(s) of the invention may be for either a "prophylactic" or "therapeutic" purpose.
  • The agent(s) are provided in advance of any symptoms indicative of the organism's growth.
  • the prophylactic administration of the agent(s) serves to prevent, attenuate, or decrease the rate of onset of any subsequent infection.
  • the agent(s) are provided at (or shortly after) the onset of an indication of infection.
  • The therapeutic administration of the compound(s) serves to attenuate the pathological symptoms of the infection and to increase the rate of recovery.
  • The agents of the present invention are administered to the mammal in a pharmaceutically acceptable form and in a therapeutically effective concentration.
  • a composition is said to be "pharmacologically acceptable” if its administration can be tolerated by a recipient patient.
  • Such an agent is said to be administered in a "therapeutically effective amount” if the amount administered is physiologically significant.
  • An agent is physiologically significant if its presence results in a detectable change in the physiology of a recipient patient.
  • The agents of the present invention can be formulated according to known methods to prepare pharmaceutically useful compositions, whereby these materials, or their functional derivatives, are combined in admixture with a pharmaceutically acceptable carrier vehicle.
  • Suitable vehicles and their formulation, inclusive of other human proteins, e.g., human serum albumin, are described, for example, in Remington's Pharmaceutical Sciences (16th ed., Osol, A., Ed., Mack, Easton PA (1980)).
  • To form a pharmaceutically acceptable composition suitable for effective administration, such compositions will contain an effective amount of one or more of the agents of the present invention, together with a suitable amount of carrier vehicle.
  • Controlled release preparations may be achieved through the use of polymers to complex or absorb one or more of the agents of the present invention.
  • The controlled delivery may be exercised by selecting appropriate macromolecules (for example polyesters, polyamino acids, polyvinylpyrrolidone, ethylenevinylacetate, methylcellulose, carboxymethylcellulose, or protamine sulfate) and the concentration of macromolecules as well as the methods of incorporation in order to control release.
  • Agents of the present invention are incorporated into particles of a polymeric material such as polyesters, polyamino acids, hydrogels, poly(lactic acid) or ethylene vinylacetate copolymers.
  • Agents of the present invention are incorporated into polymeric particles, for example, by coacervation techniques or by interfacial polymerization, for example, hydroxymethylcellulose or gelatine microcapsules and poly(methylmethacrylate) microcapsules, respectively, or in colloidal drug delivery systems, for example, liposomes, albumin microspheres, microemulsions, nanoparticles, and nanocapsules or in macroemulsions.
  • The invention further provides a pharmaceutical pack or kit comprising one or more containers filled with one or more of the ingredients of the pharmaceutical compositions of the invention.
  • Associated with such containers can be a notice in the form prescribed by a governmental agency regulating the manufacture, use or sale of pharmaceuticals or biological products, which notice reflects approval by the agency of manufacture, use or sale for human administration.
  • The agents of the present invention may be employed in conjunction with other therapeutic compounds.
6. Shot-Gun Approach to Megabase DNA Sequencing
  • The present invention further provides the first demonstration that a sequence of greater than one megabase can be sequenced using a random shotgun approach. This procedure, described in detail in the examples that follow, has eliminated the up-front cost of isolating and ordering overlapping or contiguous subclones prior to the start of the sequencing protocols.
  • The overall strategy for a shotgun approach to whole genome sequencing is outlined in Table 3.
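  • The total gap length is $Le^{-m}$, where $L$ is the genome length and $m$ is the fold sequence coverage.
  • The average gap size is $L/n$, where $n$ is the number of random sequence fragments.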
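The two gap expressions above can be evaluated numerically. The following is a minimal sketch under stated assumptions, not part of the disclosed method: it takes the coverage as m = n·w/L (the usual Lander-Waterman form, with w the mean fragment length), and the genome length of 1.83 Mb and mean read length of 460 bp in the demonstration loop are illustrative values that do not come from this passage.

```python
import math

def shotgun_stats(genome_len, n_fragments, mean_read_len):
    """Return (m, p_unsequenced, total_gap_len, avg_gap_size).

    Assumes coverage m = n * w / L; this definition is an assumption
    of the sketch, not a statement taken from the text above.
    """
    m = n_fragments * mean_read_len / genome_len
    p_unsequenced = math.exp(-m)               # chance a given base is missed
    total_gap_len = genome_len * p_unsequenced  # L * e^(-m)
    avg_gap_size = genome_len / n_fragments     # L / n
    return m, p_unsequenced, total_gap_len, avg_gap_size

if __name__ == "__main__":
    L = 1_830_000   # assumed genome length in bp (illustrative)
    w = 460         # assumed mean edited read length in bp (illustrative)
    for n in (5_000, 10_000, 20_000):
        m, p0, gap, avg = shotgun_stats(L, n, w)
        print(f"n={n:>6}  m={m:5.2f}  unsequenced={p0:.4%}  "
              f"total gap={gap:10.0f} bp  avg gap={avg:6.1f} bp")
```

The loop shows how quickly the expected unsequenced length falls as more random fragments are added, which is the rationale for deciding when to stop random sequencing and switch to directed gap closure.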
  • H. influenzae Rd KW20 DNA was prepared by phenol extraction. A mixture (3.3 ml) containing 600 μg DNA, 300 mM sodium acetate, 10 mM Tris-HCl, 1 mM Na-EDTA, 30% glycerol was sonicated (Branson Model 450 Sonicator) at the lowest energy setting for 1 min. at 0° using a 3 mm probe. The DNA was ethanol precipitated and redissolved in 500 μl TE buffer. To create blunt-ends, a 100 μl aliquot was digested for 10 min at 30° in 200 μl BAL31 buffer with 5 units BAL31 nuclease (New England BioLabs).
  • The DNA was phenol-extracted, ethanol-precipitated, redissolved in 100 μl TE buffer, electrophoresed on a 1.0% low melting agarose gel, and the 1.6-2.0 kb size fraction was excised, phenol-extracted, and redissolved in 20 μl TE buffer.
  • a two-step ligation procedure was used to produce a plasmid library with 97% insert of which >99% were single inserts.
  • The first ligation mixture (50 μl) contained 2 μg of DNA fragments, 2 μg SmaI/BAP pUC18 DNA (Pharmacia), and 10 units T4 ligase (GIBCO/BRL), and incubation was at 14° for 4 hr.
  • The DNA was dissolved in 20 μl TE buffer and electrophoresed on a 1.0% low melting agarose gel.
  • A ladder of ethidium bromide-stained linear bands, identified by size as insert (i), vector (v), v+i, v+2i, v+3i, ... was visualized by 360 nm UV light, and the v+i DNA was excised and recovered in 20 μl TE.
  • the v+i DNA was blunt-ended by T4 polymerase treatment for 5 min.
  • the 5 ml bottom layer is supplemented with 0.4 ml ampicillin (50 mg/ml)/100 ml SOB agar.
  • The 15 ml top layer of SOB agar is supplemented with 1 ml X-Gal (2%), 1 ml MgCl2 (1 M), and 1 ml MgSO4/100 ml SOB agar.
  • The 15 ml top layer was poured just prior to plating. Our titer was approximately 100 colonies/10 μl aliquot of transformation.
  • High quality double stranded DNA plasmid templates (19,687) were prepared using a "boiling bead" method developed in collaboration with Advanced Genetic Technology Corp. (Gaithersburg, MD) (Adams et al.,
  • Plasmid preparation was performed in a 96-well format for all stages of DNA preparation from bacterial growth through final DNA purification. Template concentration was determined using Hoechst Dye and a Millipore Cytofluor. DNA concentrations were not adjusted, but low-yielding templates were identified where possible and not sequenced. Templates were also prepared from two H. influenzae lambda genomic libraries. An amplified library was constructed in vector Lambda GEM-12 (Promega) and an unamplified library was constructed in Lambda DASH II (Stratagene).
  • In particular, for the unamplified lambda library, H. influenzae Rd KW20 DNA (> 100 kb) was partially digested in a reaction mixture (200 μl) containing 50 μg DNA, 1X Sau3AI buffer, 20 units Sau3AI for 6 min. at 23°.
  • The digested DNA was phenol-extracted and electrophoresed on a 0.5% low melting agarose gel at 2V/cm for 7 hours. Fragments from 15 to 25 kb were excised and recovered in a final volume of 6 μl.
  • One μl of fragments was used with 1 μl of DASH II vector (Stratagene) in the recommended ligation reaction.
  • One μl of the ligation mixture was used per packaging reaction following the recommended protocol with the Gigapack II XL Packaging Extract (Stratagene, #227711). Phage were plated directly without amplification from the packaging mixture
  • The amplified library was prepared essentially as above except the lambda GEM-12 vector was used. After packaging, about 3.5x10^4 pfu were plated on the restrictive NM539 host. The lysate was harvested in 2 ml of SM buffer and stored frozen in 7% dimethylsulfoxide. The phage titer was approximately 1x10^9 pfu/ml.
  • Liquid lysates (10 ml) were prepared from randomly selected plaques and template was prepared on an anion-exchange resin (Qiagen). Sequencing reactions were carried out on plasmid templates using the AB Catalyst LabStation with Applied Biosystems PRISM Ready Reaction Dye Primer
  • Random reverse sequencing reactions were done based on successful forward sequencing reactions.
  • Some M13RP1 sequences were obtained in a semi-directed fashion: M13-21 sequences pointing outward at the ends of contigs were chosen for M13RP1 sequencing in an effort to specifically order contigs.
  • The semi-directed strategy was effective, and clone-based ordering formed an integral part of assembly and gap closure (see below).
  • the sequencing consisted of using eight ABI Catalyst robots and fourteen AB 373 Automated DNA Sequencers.
  • the Catalyst robot is a publicly available sophisticated pipetting and temperature control robot which has been developed specifically for DNA sequencing reactions.
  • The Catalyst combines pre-aliquoted templates and reaction mixes consisting of deoxy- and dideoxynucleotides, the Taq thermostable DNA polymerase, fluorescently-labelled sequencing primers, and reaction buffer. Reaction mixes and templates were combined in the wells of an aluminum 96-well thermocycling plate. Thirty consecutive cycles of linear amplification (e.g., one primer synthesis) steps were performed including denaturation, annealing of primer and template, and extension of DNA synthesis. A heated lid with rubber gaskets on the thermocycling plate prevented evaporation without the need for an oil overlay.
  • The shotgun sequencing involves use of four dye-labelled sequencing primers, one for each of the four terminator nucleotides. Each dye-primer is labelled with a different fluorescent dye, permitting the four individual reactions to be combined into one lane of the 373 DNA Sequencer for electrophoresis, detection, and base-calling.
  • AB currently supplies pre-mixed reaction mixes in bulk packages containing all the necessary non-template reagents for sequencing. Sequencing can be done with both plasmid and PCR-generated templates with both dye-primers and dye-terminators with approximately equal fidelity, although plasmid templates generally give longer usable sequences.
  • The AB 373 performs automatic lane tracking and base-calling. The lane-tracking was confirmed visually. Each sequence electropherogram (or fluorescence lane trace) was inspected visually and assessed for quality. Trailing sequences of low quality were removed and the sequence itself was loaded via software to a Sybase database (archived daily to an 8 mm tape). Leading vector polylinker sequence was removed automatically by a software program. Average edited lengths of sequences from the standard ABI 373 were around 400 bp and depended mostly on the quality of the template used for the sequencing reaction. All of the ABI 373 Sequencers were converted to Stretch Liners, which provided a longer electrophoresis path prior to fluorescence detection, thus increasing the average number of usable bases to 500-600 bp.
  • An assembly engine (TIGR Assembler) was developed for the rapid and accurate assembly of thousands of sequence fragments.
  • The AB AutoAssembler™ was modified (and named TIGR Editor) to provide a graphical interface to the electropherogram for the purpose of editing data associated with the aligned sequence file output of TIGR Assembler.
  • The TIGR Assembler simultaneously clusters and assembles fragments of the genome. In order to obtain the speed necessary to assemble more than 10^4 fragments, the algorithm builds a hash table of 10 bp oligonucleotide subsequences to generate a list of potential sequence fragment overlaps. The number of potential overlaps for each fragment determines which fragments are likely to fall into repetitive elements. Beginning with a single seed sequence fragment, TIGR Assembler extends the current contig by attempting to add the best matching fragment based on oligonucleotide content.
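The hash-table step described above can be illustrated with a small sketch. This is not the TIGR Assembler code; it is a simplified illustration that assumes fragments are plain strings on a single strand (reverse complements are ignored), and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_kmer_index(fragments, k=10):
    """Map each k-mer to the set of fragment ids that contain it."""
    index = defaultdict(set)
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(frag_id)
    return index

def potential_overlaps(fragments, k=10):
    """Count distinct shared k-mers for every fragment pair sharing at least one."""
    shared = defaultdict(int)
    for ids in build_kmer_index(fragments, k).values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return dict(shared)

def overlap_counts(fragments, k=10):
    """Number of candidate partners per fragment; unusually high counts hint at repeats."""
    counts = defaultdict(int)
    for a, b in potential_overlaps(fragments, k):
        counts[a] += 1
        counts[b] += 1
    return dict(counts)
```

Only pairs that share at least one 10-mer are ever passed on to a full alignment, which avoids comparing every fragment against every other fragment.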
  • the current contig and candidate fragment are aligned using a modified version of the Smith-Waterman algorithm (Waterman, M.S., Methods in Enzymology 164:765 (1988)) which provides for optimal gapped alignments.
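For orientation, a textbook Smith-Waterman local alignment is sketched below. The text above refers to a modified version whose details are not given here, so this sketch shows only the standard algorithm with a linear gap penalty; the scoring values are arbitrary assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The quadratic cost of filling the matrix is the reason an assembler would run this step only on candidate pairs already selected by a cheaper filter.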
  • The current contig is extended by the fragment only if strict criteria for the quality of the match are met.
  • The match criteria include the minimum length of overlap, the maximum length of an unmatched end, and the minimum percentage match. These criteria are automatically lowered by the algorithm in regions of minimal coverage and raised in regions with a possible repetitive element. The number of potential overlaps for each fragment determines which fragments are likely to fall into repetitive elements.
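The three match criteria and their adjustment can be expressed as a small data structure. The sketch below is illustrative only: the threshold values and the size of each adjustment are hypothetical, since the text does not state the numbers actually used.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MatchCriteria:
    # All default values are hypothetical placeholders.
    min_overlap: int = 40             # minimum overlap length, bp
    max_unmatched_end: int = 15       # maximum unmatched end length, bp
    min_percent_match: float = 97.5   # minimum percentage identity

def accepts(c, overlap_len, unmatched_end, percent_match):
    """True only if a candidate overlap meets all three criteria."""
    return (overlap_len >= c.min_overlap
            and unmatched_end <= c.max_unmatched_end
            and percent_match >= c.min_percent_match)

def adjust(c, low_coverage=False, possible_repeat=False):
    """Raise criteria near possible repeats; lower them where coverage is minimal."""
    if possible_repeat:
        return replace(c, min_overlap=c.min_overlap + 20,
                       min_percent_match=min(100.0, c.min_percent_match + 1.5))
    if low_coverage:
        return replace(c, min_overlap=max(1, c.min_overlap - 15),
                       min_percent_match=c.min_percent_match - 2.5)
    return c
```

In this sketch a possible repeat takes precedence over low coverage, a design choice made here to err toward rejecting a false join.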
  • TIGR Assembler is designed to take advantage of clone size information coupled with sequencing from both ends of each template. It enforces the constraint that sequence fragments from two ends of the same template point toward one another in the contig and are located within a certain range of base pairs (definable for each clone based on the known clone size range for a given library). Assembly of 24,304 sequence fragments of H. influenzae required 30 hours of CPU time using one processor on a SPARCenter 2000 with 512 Mb of RAM. This process resulted in approximately 210 contigs.
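The clone-end constraint described above amounts to two checks: the two reads face each other, and their outer ends lie within the library's clone size range. A minimal sketch follows; the `Placement` type and function name are hypothetical, and the 1.6-2.0 kb range in the usage note is taken from the plasmid size fraction reported earlier in the text.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    start: int     # leftmost contig coordinate covered by the read
    end: int       # rightmost contig coordinate covered by the read
    forward: bool  # True if the read runs left-to-right along the contig

def mates_consistent(a, b, min_insert, max_insert):
    """Check that two end reads of one template face each other at a plausible distance."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # The left read must point right and the right read must point left.
    if not (left.forward and not right.forward):
        return False
    span = right.end - left.start  # implied clone length in the contig
    return min_insert <= span <= max_insert
```

For the plasmid library, a call such as `mates_consistent(fwd, rev, 1600, 2000)` would accept a pair spanning 2,000 bp and reject a pair in which both reads point the same way.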
  • The relative positions of the 140 contigs were unknown.
  • The contigs were ordered by asmalign. Asmalign uses a number of relationships to identify and align contigs that are adjacent to each other.
  • the 140 contigs were placed into 42 groups totaling 42 physical gaps (no template DNA for the region) and 98 sequence gaps (template available for gap closure).
  • The labeled oligonucleotides were purified using Sephadex G-25 superfine (Pharmacia) and 10^7 cpm of each was used in a Southern hybridization analysis of H. influenzae Rd chromosomal DNA digested with one frequent cutter (AseI) and five less frequent cutters (BglII, EcoRI, PstI, XbaI, and PvuII).
  • The DNA from each digest was fractionated on a 0.7% agarose gel and transferred to Nytran Plus nylon membranes (Schleicher & Schuell). Hybridization was carried out for 16 hours at 40°. To remove non-specific signals, each blot was sequentially washed at room temperature with increasingly stringent conditions up to 0.1X SSC + 0.5% SDS. Blots were exposed to a PhosphorImager cassette (Molecular Dynamics) for several hours and hybridization patterns were visually compared.
  • The two lambda libraries constructed from H. influenzae genomic DNA were probed with oligonucleotides designed from the ends of contig groups (Kirkness et al., Genomics 10:985 (1991)).
  • the positive plaques were then used to prepare templates and the sequence was determined from each end of the lambda clone insert. These sequence fragments were searched using grasta against a database of all contigs. Two contigs that matched the sequence from the opposite ends of the same lambda clone were ordered.
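The ordering rule described above (two contigs are linked when opposite ends of one lambda clone match them) can be sketched as a simple grouping step. The input format and function name below are hypothetical; the search itself (grasta in the text) is assumed to have been run already and to have reported one best contig per clone end.

```python
from collections import defaultdict

def link_contigs(end_hits):
    """Group clone-end hits and report contig pairs bridged by a single clone.

    end_hits: iterable of (clone_id, end_label, contig_id) tuples, where
    end_label distinguishes the two ends of the clone insert.
    Returns {(contig_a, contig_b): [clone_id, ...]} with each pair sorted.
    """
    by_clone = defaultdict(dict)
    for clone_id, end_label, contig_id in end_hits:
        by_clone[clone_id][end_label] = contig_id
    links = defaultdict(list)
    for clone_id, ends in by_clone.items():
        if len(ends) == 2:                 # both ends were sequenced and matched
            c1, c2 = ends.values()
            if c1 != c2:                   # ends fall in two different contigs
                links[tuple(sorted((c1, c2)))].append(clone_id)
    return dict(links)
```

Each clone listed for a contig pair is also a candidate template for closing the gap between those two contigs.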
  • The lambda clone then provided the template for closure of the sequence gap between the adjacent contigs. The lambda clones were especially valuable for solving repeat structures.
  • Standard PCR was performed in the following manner. Each reaction contained a 37 μl cocktail: 16.5 μl H2O, 3 μl 25 mM MgCl2, 8 μl of a dNTP mix (1.25 mM each dNTP), 4.5 μl 10X PCR core buffer II (Perkin Elmer), 25 ng H. influenzae Rd KW20 genomic DNA. The appropriate two primers (4 μl, 3.2 pmole/μl) were added to each reaction. A hot start was performed at 95° for 5 min followed by a 75° hold.
  • Long range PCR (XL PCR) was performed as follows: Each reaction contained a 35.2 μl cocktail: 12.0 μl H2O, 2.2 μl 25 mM Mg(OAc)2, 4 μl of a dNTP mix (200 μM final concentration), 12.0 μl 3.3X PCR buffer, 25 ng H. influenzae Rd KW20 genomic DNA. The appropriate two primers (5 μl, 3.2 pmoles/μl) were added to each reaction. A hot start was performed at 94° for 1 minute. rTth polymerase, 2.0 μl (4 U/reaction) in 2.8 μl 3.3X PCR buffer II was added to each reaction.
  • the PCR profile was 18 cycles of 94°/15 sec., denature; 62°/8 min., anneal and extend followed by 12 cycles 94°/15 sec., denature; 62°/8 min. (increase 15 sec./cycle), anneal and extend; 72°/10 min., final extension. All reactions were performed in a 96 well format on a Perkin Elmer GeneAmp PCR System 9600.
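As a rough planning aid, the programmed hold time of the profile above can be added up. This is an illustrative calculation only: it assumes the 15 sec./cycle increase applies from the first of the 12 later cycles, and it ignores the hot start and all temperature ramp times.

```python
def xl_pcr_hold_time_s(first_cycles=18, second_cycles=12, denature_s=15,
                       anneal_extend_s=8 * 60, increment_s=15,
                       final_extension_s=10 * 60):
    """Sum the programmed hold times (seconds) of the two-stage XL PCR profile."""
    first = first_cycles * (denature_s + anneal_extend_s)
    # Assumption: cycle k of the second stage is extended by k * increment_s.
    second = sum(denature_s + anneal_extend_s + increment_s * k
                 for k in range(1, second_cycles + 1))
    return first + second + final_extension_s

total = xl_pcr_hold_time_s()
print(f"{total} s = {total // 3600} h {total % 3600 // 60} min")
```

Under these assumptions the holds alone come to a little over four and a half hours per run.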
  • Sequence information from the ends of 15-20 kb clones is particularly suitable for gap closure, solving repeat structures, and providing general confirmation of the overall genome assembly.
  • Approximately 100 random plaques were picked from the amplified lambda library, templates prepared, and sequence information obtained from each end. These sequences were searched (grasta) against the contigs and linked in the database to their appropriate contig, thus providing a scaffolding of lambda clones contributing additional support to the accuracy of the genome assembly (Fig. 5).
  • The lambda clones provided closure for 23 physical gaps.
  • Lambda clones were also useful for solving repeat structures. Repeat structures identified in the genome were small enough to be spanned by a single clone from the random insert library, except for the six ribosomal RNA operons and one repeat (2 copies) which was 5,340 bp in length.
  • Oligonucleotideprobes were designed from the unique flanks at the beginning ofeach repeat and hybridized to the lambda libraries. Positive plaques were identified for each flank and the sequence fragments from the ends of each clone were used to correctly orient the repeats within the genome.
  • rRNA ribosomal RNA
  • This region contains the ribosomal promoter and appeared to be non-clonable in the high copy number pUC18 plasmid. However, unique sequences could be identified at the right (5S) ends. Oligonucleotide primers were designed from these six flanking regions and used to probe the two lambda libraries. For each of the six rRNA operons at least one positive plaque was identified which completely spanned the rRNA operon and contained unique flanking sequence at the 16S and 5S ends. These plaques provided the templates for obtaining the unique sequence for each of the six rRNA operons.
  • Restriction fragments from the sequence-derived map matched those from the physical map in size and relative order (Figure 5).
  • Each contig was edited visually by reassembling overlapping 10 kb sections of contigs using the AB AutoAssembler™ and the Fast Data Finder™ hardware.
  • AutoAssembler™ provides a graphical interface to electropherogram data for editing.
  • The electropherogram data was used to assign the most likely base at each position. Where a discrepancy could not be resolved or a clear assignment made, the automatic base calls were left unchanged.
  • Individual sequence changes were written to the electropherogram files and a replication protocol (crash) was used to maintain the synchrony of sequence data between the H. influenzae database and the electropherogram files.
  • a replication protocol was used to maintain the synchrony of sequence data between the H. influenzae database and the electropherogram files.
  • The rRNA and other repeat regions precluded complete assembly of the circular genome with TIGR Assembler. Final assembly of the genome was accomplished using combasm, which splices together contigs based on short overlaps.
  • The H. influenzae Rd genome is a circular chromosome of 1,830,121 bp.
  • The G/C content of the genome was examined with several window lengths to look for global structural features.
  • The G/C content is relatively even except for 7 large G/C-rich regions and several A/T-rich regions (Fig. 5).
  • The G/C-rich regions correspond to six rRNA operons and the location of a cryptic mu-like prophage.
  • Genes for several proteins with similarity to proteins encoded by bacteriophage mu are located at approximately position 1.56-1.59 Mbp of the genome. This area of the genome has a markedly higher G/C content than average for H. influenzae (~50% G/C compared to ~38% for the rest of the genome). No significance has yet been ascertained for the source or importance of the A/T-rich regions.
  • The minimal origin of replication (oriC) in E. coli is a 245 bp region defined by three copies of a thirteen base pair repeat containing a GATC core sequence at one end and four copies of a nine base pair repeat containing a TTAT core sequence at the other end.
  • The GATC sites are methylation targets and control replication while the TTAT sites provide the binding sites for DnaA, the first step in the replication process (Genes V, B. Lewin Ed. (Oxford University Press, New York, 1994), chap. 18-19).
  • An approximately 281 bp sequence (602,483 - 602,764) whose limits are defined by these same core sequences appears to define the origin of replication in H. influenzae Rd.
  • Termination of E. coli replication is marked by two 23 bp termination sequences located ~100 kb on either side of the midway point at which the two replication forks meet. Two potential termination sequences sharing a 10 bp core sequence with the E. coli termination sequence were identified in H. influenzae at coordinates 1,375,949-1,375,958 and
  • Each rRNA operon contains three rRNA subunits and a variable spacer region in the order: 16S subunit - spacer region - 23S subunit - 5S subunit.
  • The subunit lengths are 1539 bp, 2653 bp, and 116 bp, respectively.
  • The G/C content of the three ribosomal subunits (50%) is higher than the genome as a whole.
  • The G/C content of the spacer region (38%) is consistent with the remainder of the genome.
  • The nucleotide sequence of the three rRNA subunits is 100% identical in all six ribosomal operons.
  • The rRNA operons can be grouped into two classes based on the spacer region between the 16S and 23S sequences.
  • The shorter of the two spacer regions is 478 bp in length (rrnB, rrnE, and rrnF) and contains the gene for tRNA Glu.
  • The longer spacer is 723 bp in length (rrnA, rrnC, and rrnD) and contains the genes for tRNA Ile and tRNA Ala.
  • The two sets of spacer regions are also 100% identical across each group of three operons.
  • tRNA genes are also present at the 16S and 5S ends of two of the rRNA operons.
  • The genes for tRNA Arg, tRNA His, and tRNA Pro are located at the 16S end of rrnE while the genes for tRNA Trp and tRNA Asp are located at the 5S end of rrnA.
  • The predicted coding regions of the H. influenzae genome were initially defined by evaluating their coding potential with the program GeneMark (Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)) using codon frequency matrices derived from 122 H. influenzae coding sequences in GenBank.
  • The predicted coding region sequences (plus 300 bp of flanking sequence) were used in searches against a database of non-redundant bacterial proteins (NRBP) created specifically for the annotation. Redundancy was removed from NRBP at two stages. All DNA coding sequences were extracted from GenBank (release 85), and sequences from the same species were searched against each other. Sequences having >97% similarity over regions >100 nucleotides were combined.
  • NRBP is composed of 21,445 sequences extracted from 23,751 GenBank sequences and 11,183 Swiss-Prot sequences from 1,099 different species.
  • H. influenzae gene was assigned to one of 102 biological role categories adapted from Riley (Riley, M., Microbiology Reviews 57(4):862 (1993)). Assignments were made by linking the protein sequence of the predicted coding regions with the Swiss-Prot sequences in the Riley database. Of the 1,749 predicted coding regions, 724 have no role assignment. Of these, no database match was found for 384, while 340 matched "hypothetical proteins" in the database. Role assignments were made for 1,025 of the predicted coding regions. A compilation of all the predicted coding regions, their unique identifiers, a three letter gene identifier, percent identity, percent similarity, and amino acid match length is presented in
  • Table 1(a). An annotated complete genome map of H. influenzae Rd is presented in Figures 6(A)-(D). The map places each predicted coding region on the H. influenzae chromosome, indicates its direction of transcription and color codes its role assignment. Role assignments are also represented in Figure 5.
  • H. influenzae requires for survival as a free-living organism, the nutritional requirements for its growth in the laboratory, and the characteristics which make it unique from other organisms specifically as it relates to its pathogenicity and virulence.
  • The genome would be expected to have complete complements of certain classes of genes known to be essential for life. For example, there is a one-to-one correspondence of published E. coli ribosomal protein sequences to potential homologs in the H. influenzae database.
  • Table 1(a) an aminoacyl tRNA-synthetase is present in the genome for each amino acid.
  • The location of tRNA genes was mapped onto the genome. There are 54 identified tRNA genes, including representatives of all 20 amino acids.
  • In order to survive as a free-living organism, H. influenzae must produce energy in the form of ATP via fermentation and/or electron transport. As a facultative anaerobe, H. influenzae Rd is known to ferment glucose, fructose, galactose, ribose, xylose and fucose (Dorocicz et al., J. Bacteriol. 175:7142 (1993)). The genes identified in Table 1(a) indicate that transport systems are available for the uptake of these sugars via the phosphoenolpyruvate-phosphotransferase system (PTS), and via non-PTS mechanisms. Genes that specify the common phosphate-carriers Enzyme I and
  • HPr (ptsI and ptsH) of the PTS system were identified as well as the glucose specific crr gene.
  • The ptsH, ptsI, and crr genes constitute the pts operon.
  • A complete PTS system for fructose was identified.
  • Genes encoding the complete glycolytic pathway and for the production of fermentative end products were identified. Growth utilizing anaerobic respiratory mechanisms was indicated by the identification of genes encoding functional electron transport systems using inorganic electron acceptors such as nitrates, nitrites, and dimethylsulfoxide.
  • Glutamate can be directed into the TCA cycle via conversion to alpha-ketoglutarate by glutamate dehydrogenase.
  • Glutamate presumably serves as the source of carbon for biosynthesis of amino acids using precursors which branch from the TCA cycle.
  • Functional electron transport systems are available for the production of ATP using oxygen as a terminal electron acceptor.
  • H. influenzae Rd possesses a highly efficient natural DNA transformation system (Kahn and Smith, J. Membrane Biol. 81:89 (1984)).
  • CRE palindromic competence regulatory element
  • The regulator protein is generally a transcription factor which, when activated by the sensor, turns on or off expression of a specific set of genes (for review, see Albright et al., Ann. Rev. Genet. 23:311
  • H. influenzae are adjacent to one another and presumably form an operon.
  • The nar and arc genes are not located adjacent to one another.
  • The non-pathogenic H. influenzae Rd strain varies significantly from the pathogenic serotype b strains. Many of the differences between these two strains appear in factors affecting infectivity. For example, the eight genes which make up the fimbrial gene cluster (van Ham et al., Mol. Microbiol. 13:673 (1994)) involved in adhesion of bacteria to host cells are now shown to be absent in the Rd strain.
  • The pepN and purE genes which flank the fimbrial cluster in H are now shown to be absent in the Rd strain.
  • E. coli proteins are not in H. influenzae by taking advantage of a non-redundant set of protein coding genes from E. coli, namely the University of Wisconsin Genome Project contigs in GenBank: 1,216 predicted protein sequences from GenBank accessions D10483, L10328, U00006, U00039, U14003, and U18997 (Yura et al., Nucleic Acids Research 20:3305 (1992); Burland et al., Genomics 16:551 (1993)).
  • Proteins are annotated as hypothetical based on a lack of matches with any other known protein (Yura et al., Nucleic Acids Research 20:3305 (1992); Burland et al., Genomics 16:551 (1993)). At least two potential explanations can be offered for the overrepresentation of hypothetical proteins among those without matches: some of the hypothetical proteins are not, in fact, translated (at least in the annotated frame), or these are E. coli-specific proteins that are unlikely to be found in any species except those most closely related to E. coli, for example Salmonella typhimurium.
  • A total of 384 predicted coding regions did not display significant similarity with a six-frame translation of GenBank release 87. These unidentified coding regions were compared to one another with fasta. Several novel gene families were identified. For example, two predicted coding regions without database matches (HI0591, HI0852) share 75% identity over almost their entire lengths (139 and 143 amino acid residues respectively).
  • The lipooligosaccharide component of the outer membrane and the genes of its synthetic pathway are under intensive study (Weiser et al., J. Bacteriol. 172:3304 (1990)). While a vaccine is available, the study of outer membrane components is motivated to some extent by the need for improved vaccines.
  • GSDB Sequence DataBase
  • Substantially pure protein or polypeptide is isolated from the transfected or transformed cells using any one of the methods known in the art.
  • The protein can also be produced in a recombinant prokaryotic expression system, such as E. coli, or can be chemically synthesized. Concentration of protein in the final preparation is adjusted, for example, by concentration on an Amicon filter device, to the level of a few micrograms/ml.
  • Monoclonal or polyclonal antibody to the protein can then be prepared as follows:
  • Monoclonal antibody to epitopes of any of the peptides identified and isolated as described can be prepared from murine hybridomas according to the classical method of Kohler, G. and Milstein, C., Nature 256:495 (1975) or modifications of the methods thereof. Briefly, a mouse is repetitively inoculated with a few micrograms of the selected protein over a period of a few weeks. The mouse is then sacrificed, and the antibody producing cells of the spleen isolated. The spleen cells are fused by means of polyethylene glycol with mouse myeloma cells, and the excess unfused cells destroyed by growth of the system on selective media comprising aminopterin (HAT media).
  • HAT media aminopterin
  • The successfully fused cells are diluted and aliquots of the dilution placed in wells of a microtiter plate where growth of the culture is continued.
  • Antibody-producing clones are identified by detection of antibody in the supernatant fluid of the wells by immunoassay procedures, such as ELISA, as originally described by Engvall, E., Meth. Enzymol. 70:419 (1980), and modified methods thereof. Selected positive clones can be expanded and their monoclonal antibody product harvested for use. Detailed procedures for monoclonal antibody production are described in Davis, L. et al. Basic Methods in Molecular Biology Elsevier, New York. Section 21-2 (1989).
  • Polyclonal antiserum containing antibodies to heterogeneous epitopes of a single protein can be prepared by immunizing suitable animals with the expressed protein described above, which can be unmodified or modified to enhance immunogenicity. Effective polyclonal antibody production is affected by many factors related both to the antigen and the host species. For example, small molecules tend to be less immunogenic than others and may require the use of carriers and adjuvant. Also, host animals vary in response to site of inoculations and dose, with both inadequate and excessive doses of antigen resulting in low titer antisera. Small doses (ng level) of antigen administered at multiple intradermal sites appear to be most reliable. An effective immunization protocol for rabbits can be found in Vaitukaitis, J.
  • Booster injections can be given at regular intervals, and antiserum harvested when antibody titer thereof, as determined semi-quantitatively, for example, by double immunodiffusion in agar against known concentrations of the antigen, begins to fall. See, for example, Ouchterlony, O. et al., Chap. 19 in: Handbook of Experimental Immunology, Wier, D., ed, Blackwell
  • Plateau concentration of antibody is usually in the range of 0.1 to 0.2 mg/ml of serum (about 12 µM).
  • Affinity of the antisera for the antigen is determined by preparing competitive binding curves, as described, for example, by Fisher, D., Chap. 42 in: Manual of Clinical Immunology, second edition, Rose and Friedman, eds., Amer. Soc. For Microbiology, Washington,
  • Antibody preparations prepared according to either protocol are useful in quantitative immunoassays which determine concentrations of antigen-bearing substances in biological samples; they are also used semi-quantitatively or qualitatively to identify the presence of antigen in a biological sample.
  • PCR primers are preferably at least 15 bases, and more preferably at least 18 bases in length. When selecting a primer sequence, it is preferred that the primer pairs have approximately the same G/C ratio, so that melting temperatures are approximately the same.
  • The PCR primers and amplified DNA of this Example find use in the Examples that follow.
  • A fragment of the Haemophilus influenzae Rd genome provided in Tables 1(a) or 2 is introduced into an expression vector using conventional technology.
  • Techniques to transfer cloned sequences into expression vectors that direct protein translation in mammalian, yeast, insect or bacterial expression systems are well known in the art.
  • Commercially available vectors and expression systems can be obtained from a variety of suppliers including Stratagene (La Jolla, California), Promega (Madison, Wisconsin), and
  • Codon context and codon pairing of the sequence may be optimized for the particular expression organism, as explained by Hatfield et al., U.S. Patent No. 5,082,767, incorporated herein by this reference.
  • The following is provided as one exemplary method to generate polypeptide(s) from cloned ORFs of the Haemophilus genome fragment. Since the ORF lacks a poly A sequence because of the bacterial origin of the ORF, this sequence can be added to the construct by, for example, splicing out the poly A sequence from pSG5 (Stratagene) using BglI and SalI restriction endonuclease enzymes and incorporating it into the mammalian expression vector pXT1 (Stratagene) for use in eukaryotic expression systems.
  • pXT1 contains the LTRs and a portion of the gag gene from Moloney Murine Leukemia Virus. The position of the LTRs in the construct allows efficient stable transfection.
  • The vector includes the Herpes Simplex thymidine kinase promoter and the selectable neomycin gene.
  • The Haemophilus DNA is obtained by PCR from the bacterial vector using oligonucleotide primers complementary to the Haemophilus DNA and containing restriction endonuclease sequences for PstI incorporated into the 5' primer and BglII at the 5' end of the corresponding Haemophilus DNA 3' primer, taking care to ensure that the Haemophilus DNA is positioned such that it is followed by the poly A sequence.
  • The purified fragment obtained from the resulting PCR reaction is digested with PstI, blunt ended with an exonuclease, digested with BglII, purified and ligated to pXT1, now containing a poly A sequence and digested with BglII.
  • The ligated product is transfected into mouse NIH 3T3 cells using Lipofectin (Life Technologies, Inc., Grand Island, New York) under conditions outlined in the product specification. Positive transfectants are selected after growing the transfected cells in 600 µg/ml G418 (Sigma, St. Louis, Missouri). The protein is preferably released into the supernatant.
  • The protein may additionally be retained within the cell or expression may be restricted to the cell surface.
  • The Haemophilus DNA sequence is additionally incorporated into eukaryotic expression vectors and expressed as a chimeric with, for example, β-globin.
  • Antibody to β-globin is used to purify the chimeric.
  • Corresponding protease cleavage sites engineered between the β-globin gene and the Haemophilus DNA are then used to separate the two polypeptide fragments from one another after translation.
  • One useful expression vector for generating β-globin chimerics is pSG5 (Stratagene). This vector encodes rabbit β-globin.
  • Intron II of the rabbit β-globin gene facilitates splicing of the expressed transcript, and the polyadenylation signal incorporated into the construct increases the level of expression.
  • Polypeptide may additionally be produced from either construct using in vitro translation systems such as In vitro Express™ Translation Kit (Stratagene).
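
As an illustration of the G/C-content analysis noted in the list above (the genome was examined with several window lengths, and G/C-rich regions of roughly 50% G/C stand out against roughly 38% for the rest of the genome), a windowed scan can be sketched as follows. This is a minimal sketch and not the procedure actually used; the function names, the window and step sizes, and the 0.45 threshold are assumptions chosen for illustration, and the sequence is treated as linear rather than circular.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (case-insensitive); 0.0 for empty input."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (start, G/C fraction) for every full window along a linear sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def gc_rich_windows(seq: str, window: int, step: int,
                    threshold: float = 0.45) -> List[int]:
    """Start positions of windows whose G/C fraction meets the threshold.

    The 0.45 default is an assumed cut-off lying between the ~38% background
    and the ~50% G/C-rich regions mentioned in the text.
    """
    return [start for start, gc in windowed_gc(seq, window, step) if gc >= threshold]


if __name__ == "__main__":
    # Toy sequence: an A/T-only stretch, a G/C-only stretch, another A/T-only stretch.
    toy = "AT" * 10 + "GC" * 10 + "AT" * 10
    print(windowed_gc(toy, window=20, step=20))
    print(gc_rich_windows(toy, window=20, step=20))
```

Repeating the scan with several window lengths, as the text describes, separates broad features such as the rRNA operons from local fluctuations.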

Landscapes

  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Gastroenterology & Hepatology (AREA)
  • Communicable Diseases (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Peptides Or Proteins (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)
  • Preparation Of Compounds By Using Micro-Organisms (AREA)

Abstract

The present invention provides the sequencing of the entire genome of Haemophilus influenzae Rd, SEQ ID NO:1. The present invention further provides the sequence information stored on computer readable media, and computer-based systems and methods which facilitate its use. In addition to the entire genomic sequence, the present invention identifies over 1700 protein encoding fragments of the genome and identifies, by position relative to a unique Not I restriction endonuclease site, any regulatory elements which modulate the expression of the protein encoding fragments of the Haemophilus genome.

Description

Nucleotide Sequence of the Haemophilus influenzae Rd Genome, Fragments Thereof, and Uses Thereof
Part of the work performed during development of this invention utilized U.S. Government funds. The government may have certain rights in this invention. NIH-5R01GM48251
Field of the Invention
The present invention relates to the field of molecular biology. The present invention discloses compositions comprising the nucleotide sequence of Haemophilus influenzae, fragments thereof and usage in industrial fermentation and pharmaceutical development.
Background of the Invention
The complete genome sequence from a free living cellular organism has never been determined. The first mycobacterium sequence should be completed by 1996, while E. coli and S. cerevisiae are expected to be completed before 1998. These are being done by random and/or directed sequencing of overlapping cosmid clones. No one has attempted to determine sequences of the order of a megabase or more by a random shotgun approach.
H. influenzae is a small (approximately 0.4 x 1 micron) non-motile, non-spore forming, gram-negative bacterium whose only natural host is human. It is a resident of the upper respiratory mucosa of children and adults and causes otitis media and respiratory tract infections mostly in children. The most serious complication is meningitis, which produces neurological sequelae in up to 50% of affected children. Six H. influenzae serotypes (a through f) have been identified based on immunologically distinct capsular polysaccharide antigens. A number of non-typeable strains are also known. Serotype b accounts for the majority of human disease.
Interest in the medically important aspects of H. influenzae biology has focused particularly on those genes which determine virulence characteristics of the organism. A number of the genes responsible for the capsular polysaccharide have been mapped and sequenced (Kroll et al., Mol. Microbiol. 5(6):1549-1560 (1991)). Several outer membrane protein (OMP) genes have been identified and sequenced (Langford et al., J. Gen. Microbiol. 138:155-159 (1992)). The lipooligosaccharide (LOS) component of the outer membrane and the genes of its synthetic pathway are under intensive study (Weiser et al., J. Bacteriol. 172:3304-3309 (1990)). While a vaccine has been available since 1984, the study of outer membrane components is motivated to some extent by the need for improved vaccines. Recently, the catalase gene was characterized and sequenced as a possible virulence-related gene (Bishai et al., in press). Elucidation of the H. influenzae genome will enhance the understanding of how H. influenzae causes invasive disease and how best to combat infection.
H. influenzae possesses a highly efficient natural DNA transformation system which has been intensively studied in the non-encapsulated (R), serotype d strain (Kahn and Smith, J. Membrane Biology 81:89-103 (1984)). At least 16 transformation-specific genes have been identified and sequenced. Of these, four are regulatory (Redfield, J. Bacteriol. 173:5612-5618 (1991), and Chandler, Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992)), at least two are involved in recombination processes (Barouki and Smith, J. Bacteriol. 163(2):629-634 (1985)), and at least seven are targeted to the membranes and periplasmic space (Tomb et al., Gene 104:1-10 (1991), and Tomb, Proc. Natl. Acad. Sci. USA 89:10252-10256 (1992)), where they appear to function as structural components or in the assembly of the DNA transport machinery. H. influenzae Rd transformation shows a number of interesting features including sequence-specific DNA uptake, rapid uptake of several double-stranded DNA molecules per competent cell into a membrane compartment called the transformasome, linear translocation of a single strand of the donor DNA into the cytoplasm, and synapsis and recombination of the strand with the chromosome by a single-strand displacement mechanism. The H. influenzae Rd transformation system is the most thoroughly studied of the gram-negative systems and distinct in a number of ways from the gram-positive systems.
The size of the H. influenzae Rd genome has been determined by pulsed-field agarose gel electrophoresis of restriction digests to be approximately 1.9 Mb, making its genome approximately 40% the size of E. coli (Lee and Smith, J. Bacteriol. 170:4402-4405 (1988)). The restriction map of H. influenzae is circular (Lee et al., J. Bacteriol. 171:3016-3024 (1989), and Redfield and Lee, "Haemophilus influenzae Rd", pp. 2110-2112, In O'Brien, S.J. (ed), Genetic Maps: Locus Maps of Complex Genomes, Cold Spring Harbor Press, New York). Various genes have been mapped to restriction fragments by Southern hybridization probing of restriction digest DNA bands. This map will be valuable in verification of the assembly of a complete genome sequence from randomly sequenced fragments. GenBank currently contains about 100 kb of non-redundant H. influenzae DNA sequences. About half are from serotype b and half from Rd.
Summary of the Invention
The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence which was generated is provided in SEQ ID NO:1.
The present invention provides the generated nucleotide sequence of the Haemophilus influenzae Rd genome, or a representative fragment thereof, in a form which can be readily used, analyzed, and interpreted by a skilled artisan. In one embodiment, the present invention is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence depicted in SEQ ID NO:1.
The present invention further provides nucleotide sequences which are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1.
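
One way to read the 99.9% identity criterion is as a per-position comparison of two already aligned sequences of equal length. The following is a minimal sketch under that assumption; the function names are hypothetical, the sketch does not perform the alignment itself, and the threshold test uses integer arithmetic to avoid floating-point rounding at the boundary.

```python
def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length aligned sequences agree.

    Gap characters ('-') never count as matches.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-")


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned positions that are identical."""
    if not a:
        raise ValueError("sequences must not be empty")
    return 100.0 * count_matches(a, b) / len(a)


def is_at_least_99_9_percent_identical(a: str, b: str) -> bool:
    """True when matches / length >= 0.999, evaluated with integers."""
    return count_matches(a, b) * 1000 >= 999 * len(a)


def max_mismatches_at_99_9(length: int) -> int:
    """Largest number of mismatches still compatible with 99.9% identity."""
    return length // 1000


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(max_mismatches_at_99_9(1_830_121))
```

For the 1,830,121 bp genome length stated elsewhere in this document, this reading of the criterion allows at most 1,830 mismatched positions.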
The nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence which is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 may be provided in a variety of media to facilitate its use. In one application of this embodiment, the sequences of the present invention are recorded on computer readable media. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.
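
A common plain-text layout for recording a sequence on such media is the FASTA format (a header line beginning with ">" followed by lines of residues). The text above does not specify a file format, so the following is only an illustrative sketch; the record name, the 60-character line width, and the function names are assumptions.

```python
from typing import Dict


def format_fasta(name: str, seq: str, width: int = 60) -> str:
    """Render one sequence as a FASTA record with fixed-width sequence lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}.

    The record name is the first whitespace-delimited token of the header.
    """
    parts: Dict[str, list] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


if __name__ == "__main__":
    # "toy_record" is a made-up name; the sequence is not taken from SEQ ID NO:1.
    record = format_fasta("toy_record", "ACGT" * 20, width=30)
    print(record, end="")
    print(parse_fasta(record))
```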
The present invention further provides systems, particularly computer-based systems which contain the sequence information herein described stored in a data storage means. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.
Another embodiment of the present invention is directed to isolated fragments of the Haemophilus influenzae Rd genome. The fragments of the Haemophilus influenzae Rd genome of the present invention include, but are not limited to, fragments which encode peptides, hereinafter open reading frames (ORFs), fragments which modulate the expression of an operably linked ORF, hereinafter expression modulating fragments (EMFs), fragments which mediate the uptake of a linked DNA fragment into a cell, hereinafter uptake modulating fragments (UMFs), and fragments which can be used to diagnose the presence of Haemophilus influenzae Rd in a sample, hereinafter diagnostic fragments (DFs).
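
To make the notion of an open reading frame concrete, the sketch below scans both strands and all three frames for a start codon followed in frame by a stop codon. It is a deliberately naive illustration, not the coding-region prediction method used for the invention; the function names, the minimum length, and the restriction to ATG start codons are hypothetical choices.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, int, int]]:
    """Naive ORF scan over both strands and three frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based on the
    strand that was scanned; end is exclusive and includes the stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence with one short ORF (ATG AAA CCC GGG TAA) on the forward strand.
    print(find_orfs("CCATGAAACCCGGGTAACC"))
```

A real annotation pipeline would also weigh coding potential, as a plain start-to-stop scan reports many spurious frames in a genome-sized sequence.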
Each of the ORF fragments of the Haemophilus influenzae Rd genome disclosed in Tables 1(a) and 2, and the EMF found 5' to the ORF, can be used in numerous ways as polynucleotide reagents. The sequences can be used as diagnostic probes or diagnostic amplification primers for the presence of a specific microbe in a sample, for the production of commercially important pharmaceutical agents, and to selectively control gene expression.
The present invention further includes recombinant constructs comprising one or more fragments of the Haemophilus influenzae Rd genome of the present invention. The recombinant constructs of the present invention comprise vectors, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted.
The present invention further provides host cells containing any one of the isolated fragments of the Haemophilus influenzae Rd genome of the present invention. The host cells can be a higher eukaryotic host such as a mammalian cell, a lower eukaryotic cell such as a yeast cell, or can be a prokaryotic cell such as a bacterial cell.
The present invention is further directed to isolated proteins encoded by the ORFs of the present invention. A variety of methodologies known in the art can be utilized to obtain any one of the proteins of the present invention. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. In an alternative method, the protein is purified from bacterial cells which naturally produce the protein.
Lastly, the proteins of the present invention can alternatively be purified from cells which have been altered to express the desired protein.
The invention further provides methods of obtaining homologs of the fragments of the Haemophilus influenzae Rd genome of the present invention and homologs of the proteins encoded by the ORFs of the present invention.
Specifically, by using the nucleotide and amino acid sequences disclosed herein as a probe or as primers, and techniques such as PCR cloning and colony/plaque hybridization, one skilled in the art can obtain homologs.
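
When choosing primers from the disclosed sequences for PCR cloning, a quick screen of length and melting temperature can be sketched as follows. The sketch uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is only a rough approximation for short oligonucleotides; the 15-base minimum follows the preference stated in this document, while the function names and the 4 °C tolerance are assumptions.

```python
def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees C by the Wallace rule."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def primers_compatible(forward: str, reverse: str,
                       min_len: int = 15, max_tm_diff: int = 4) -> bool:
    """Screen a primer pair for minimum length and similar estimated Tm.

    max_tm_diff is an assumed tolerance, not a value taken from the text.
    """
    if len(forward) < min_len or len(reverse) < min_len:
        return False
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff


if __name__ == "__main__":
    # Made-up 16-base primers with the same base composition.
    fwd = "ATGCATGCATGCATGC"
    rev = "GCATGCATGCATGCAT"
    print(wallace_tm(fwd), wallace_tm(rev), primers_compatible(fwd, rev))
```

Matching the estimated melting temperatures is a simple proxy for the document's preference that the two primers have approximately the same G/C ratio.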
The invention further provides antibodies which selectively bind one of the proteins of the present invention. Such antibodies include both monoclonal and polyclonal antibodies. The invention further provides hybridomas which produce the above-described antibodies. A hybridoma is an immortalized cell line which is capable of secreting a specific monoclonal antibody.
The present invention further provides methods of identifying test samples derived from cells which express one of the ORFs of the present invention, or homolog thereof. Such methods comprise incubating a test sample with one or more of the antibodies of the present invention, or one or more of the DFs of the present invention, under conditions which allow a skilled artisan to determine if the sample contains the ORF or product produced therefrom.
In another embodiment of the present invention, kits are provided which contain the necessary reagents to carry out the above-described assays.
Specifically, the invention provides a compartmentalized kit to receive, in close confinement, one or more containers which comprises: (a) a first container comprising one of the antibodies, or one of the DFs of the present invention; and (b) one or more other containers comprising one or more of the following: wash reagents, reagents capable of detecting presence of bound antibodies or hybridized DFs.
Using the isolated proteins of the present invention, the present invention further provides methods of obtaining and identifying agents capable of binding to a protein encoded by one of the ORFs of the present invention. Specifically, such agents include antibodies (described above), peptides, carbohydrates, pharmaceutical agents and the like. Such methods comprise the steps of:
(a) contacting an agent with an isolated protein encoded by one of the ORFs of the present invention; and
(b) determining whether the agent binds to said protein.
The complete genomic sequence of H. influenzae will be of great value to all laboratories working with this organism and for a variety of commercial purposes. Many fragments of the Haemophilus influenzae Rd genome will be immediately identified by similarity searches against GenBank or protein databases and will be of immediate value to Haemophilus researchers and of immediate commercial value for the production of proteins or to control gene expression. A specific example concerns PHA synthase. It has been reported that polyhydroxybutyrate is present in the membranes of H. influenzae Rd and that the amount correlates with the level of competence for transformation.
The PHA synthase that synthesizes this polymer has been identified and sequenced in a number of bacteria, none of which are evolutionarily close to H. influenzae. This gene has yet to be isolated from H. influenzae by use of hybridization probes or PCR techniques. However, the genomic sequence of the present invention allows the identification of the gene by utilizing search means described below.
Developing the methodology and technology for elucidating the entire genomic sequence of bacterial and other small genomes has greatly enhanced, and will continue to enhance, the ability to analyze and understand chromosomal organization. In particular, sequenced genomes will provide the models for developing tools for the analysis of chromosome structure and function, including the ability to identify genes within large segments of genomic DNA, the structure, position, and spacing of regulatory elements, the identification of genes with potential industrial applications, and the ability to perform comparative genomics and molecular phylogeny.
Description of the Figures
Figure 1 - Restriction map of the Haemophilus influenzae Rd genome.
Figure 2 - Block diagram of a computer system 102 that can be used to implement the computer-based systems of the present invention.
Figure 3 - A comparison of experimental coverage of up to approximately 4000 random sequence fragments assembled with AutoAssembler (squares) as compared to Lander-Waterman prediction for a 2.5 Mb genome (triangles) and a 1.6 Mb genome (circles) with a 460 bp average sequence length and a 25 bp overlap.
Figure 4 - Data flow and computer programs used to manage, assemble, edit, and annotate the H. influenzae genome. Both Macintosh and Unix platforms are used to handle the AB 373 sequence data files (Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Washington D.C., 585 (1993)). Factura (AB) is a Macintosh program designed for automatic vector sequence removal and end trimming of sequence files. The program esp runs on a Macintosh platform and parses the feature data extracted from the sequence files by Factura to the Unix based H. influenzae relational database. Assembly is accomplished by retrieving a specific set of sequence files and their associated features using stp, an X-windows graphical interface and control program which can retrieve sequences from the H. influenzae database using user-defined or standard SQL queries. The sequence files were assembled using TIGR Assembler, an assembly engine designed at TIGR for rapid and accurate assembly of thousands of sequence fragments. TIGR Editor is a graphical interface which can parse the aligned sequence files from TIGR Assembler output and display the alignment and associated electropherograms for contig editing. Identification of putative coding regions was performed with Genemark (Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)), a Markov and Bayes modeled program for predicting gene locations, and trained on an H. influenzae sequence data set. Peptide searches were performed against the three reading frames of each Genemark predicted coding region using blaze (Brutlag et al., Computers Chem. 17:203 (1993)) run on a Maspar MP-2 massively parallel computer with 4096 microprocessors. Results from each frame were combined into a single output file by mblzt. Optimal protein alignments were obtained using the program praze which extends alignments across potential frameshifts. The output was inspected using a custom graphic viewing program, gbyob, that interacts directly with the H. influenzae database. The alignments were further used to identify potential frameshift errors and were targeted for additional editing.
Figure 5 - A circular representation of the H. influenzae Rd chromosome illustrating the location of each predicted coding region containing a database match as well as selected global features of the genome. Outer perimeter: The location of the unique NotI restriction site (designated as nucleotide 1), the RsrII sites, and the SmaI sites. Outer concentric circle: The location of each identified coding region for which a gene identification was made. Each coding region location is coded as to role according to the color code in Fig. 6. Second concentric circle: Regions of high G/C content (> 42%, red; > 40%, blue) and high A/T content (> 66%, black; > 64%, green). High G/C content regions are specifically associated with the 6 ribosomal operons and the mu-like prophage. Third concentric circle: Coverage by lambda clones (blue). Over 300 lambda clones were sequenced from each end to confirm the overall structure of the genome and identify the 6 ribosomal operons. Fourth concentric circle: The locations of the 6 ribosomal operons (green), the tRNAs (black) and the cryptic mu-like prophage (blue). Fifth concentric circle: Simple tandem repeats. The locations of the following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA, TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA. The putative origin of replication is illustrated by the outward pointing arrows (green) originating near base 603,000. Two potential termination sequences are shown near the opposite midpoint of the circle (red).
Figures 6(A)-6(D) - Complete map of the H. influenzae Rd genome. Predicted coding regions are shown on each strand. rRNA and tRNA genes are shown as lines and triangles, respectively. Genes are color-coded by role category as described in the legend. GeneID numbers correspond to those in Tables 1(a), 1(b) and 2. Where possible, three-letter designations are also provided.
Figure 7 - A comparison of the region of the H. influenzae chromosome containing the 8 genes of the fimbrial gene cluster present in H. influenzae type b and the same region in H. influenzae Rd. The region is flanked by the pepN and purE genes in both organisms. However, in the non-infectious Rd strain the 8 genes of the fimbrial gene cluster have been excised. A 172 bp spacer region is located in this region in the Rd strain and continues to be flanked by the pepN and purE genes.
Figure 8 - Hydrophobicity analysis of five predicted channel-proteins. The amino acid sequences of five predicted coding regions that do not display homology with known peptide sequences (GenBank release 87) each exhibit multiple hydrophobic domains that are characteristic of channel-forming proteins. The predicted coding region sequences were analyzed by the Kyte-Doolittle algorithm (Kyte and Doolittle, J. Mol. Biol. 157:105 (1982)) (with a range of 11 residues) using the GeneWorks software package (Intelligenetics).
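By way of non-limiting illustration, a hydropathy analysis of the kind described for Figure 8 can be sketched as a sliding-window average over the Kyte-Doolittle scale. The sketch below is not the GeneWorks implementation; the function name hydropathy_profile is hypothetical, and the window of 11 residues is taken from the figure description.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=11):
    """Return the mean hydropathy of each full window along the protein.

    Hypothetical helper: one value per window start position; an empty
    list is returned when the protein is shorter than the window.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    if len(scores) < window:
        return []
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Stretches of consistently high window averages would be candidates for the hydrophobic domains referred to above; the cut-off used to call a domain is left to the skilled artisan.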
Detailed Description of the Preferred Embodiments
The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence which was generated is provided in SEQ ID NO:1. As used herein, the "primary sequence" refers to the nucleotide sequence represented by the IUPAC nomenclature system.
The sequence provided in SEQ ID NO:1 is oriented relative to a unique Not I restriction endonuclease site found in the Haemophilus influenzae Rd genome. A skilled artisan will readily recognize that this start/stop point was chosen for convenience and does not reflect a structural significance.
The present invention provides the nucleotide sequence of SEQ ID NO:1, or a representative fragment thereof, in a form which can be readily used, analyzed, and interpreted by a skilled artisan. In one embodiment, the sequence is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence provided in SEQ ID NO:1. As used herein, a "representative fragment of the nucleotide sequence depicted in SEQ ID NO:1" refers to any portion of SEQ ID NO:1 which is not presently represented within a publicly available database. Preferred representative fragments of the present invention are Haemophilus influenzae open reading frames, expression modulating fragments, uptake modulating fragments, and fragments which can be used to diagnose the presence of Haemophilus influenzae Rd in a sample. A non-limiting identification of such preferred representative fragments is provided in Tables 1(a) and 2.
The nucleotide sequence information provided in SEQ ID NO:1 was obtained by sequencing the Haemophilus influenzae Rd genome using a megabase shotgun sequencing method. Using three parameters of accuracy discussed in the Examples below, the present inventors have calculated that the sequence in SEQ ID NO:1 has a maximum accuracy of 99.98%. Thus, the nucleotide sequence provided in SEQ ID NO:1 is a highly accurate, although not necessarily a 100% perfect, representation of the nucleotide sequence of the Haemophilus influenzae Rd genome.
As discussed in detail below, using the information provided in SEQ ID NO:1 and in Tables 1(a) and 2 together with routine cloning and sequencing methods, one of ordinary skill in the art will be able to clone and sequence all "representative fragments" of interest including open reading frames (ORFs) encoding a large variety of Haemophilus influenzae proteins. In very rare instances, this may reveal a nucleotide sequence error present in the nucleotide sequence disclosed in SEQ ID NO:1. Thus, once the present invention is made available (i.e., once the information in SEQ ID NO:1 and Tables 1(a) and 2 has been made available), resolving a rare sequencing error in SEQ ID NO:1 will be well within the skill of the art. Nucleotide sequence editing software is publicly available. For example, Applied Biosystems' (AB) AutoAssembler™ can be used as an aid during visual inspection of nucleotide sequences. Even if all of the very rare sequencing errors in SEQ ID NO:1 were corrected, the resulting nucleotide sequence would still be at least 99.9% identical to the nucleotide sequence in SEQ ID NO:1.
The nucleotide sequences of the genomes from different strains of Haemophilus influenzae differ slightly. However, the nucleotide sequence of the genomes of all Haemophilus influenzae strains will be at least 99.9% identical to the nucleotide sequence provided in SEQ ID NO:1.
Thus, the present invention further provides nucleotide sequences which are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 in a form which can be readily used, analyzed and interpreted by the skilled artisan. Methods for determining whether a nucleotide sequence is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 are routine and readily available to the skilled artisan. For example, the well known FASTA algorithm (Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85:2444 (1988)) can be used to generate the percent identity of nucleotide sequences.
Computer Related Embodiments
The nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 may be "provided" in a variety of mediums to facilitate use thereof. As used herein, provided refers to a manufacture, other than an isolated nucleic acid molecule, which contains a nucleotide sequence of the present invention, i.e., the nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1. Such a manufacture provides the Haemophilus influenzae Rd genome or a subset thereof (e.g., a Haemophilus influenzae Rd open reading frame (ORF)) in a form which allows a skilled artisan to examine the manufacture using means not directly applicable to examining the Haemophilus influenzae Rd genome or a subset thereof as it exists in nature or in purified form.
In one application of this embodiment, a nucleotide sequence of the present invention can be recorded on computer readable media. As used herein, "computer readable media" refers to any medium which can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. A skilled artisan can readily appreciate how any of the presently known computer readable mediums can be used to create a manufacture comprising computer readable medium having recorded thereon a nucleotide sequence of the present invention.
As used herein, "recorded" refers to a process for storing information on computer readable medium. A skilled artisan can readily adopt any of the presently known methods for recording information on computer readable medium to generate manufactures comprising the nucleotide sequence information of the present invention.
A variety of data storage structures are available to a skilled artisan for creating a computer readable medium having recorded thereon a nucleotide sequence of the present invention. The choice of the data storage structure will generally be based on the means chosen to access the stored information.
In addition, a variety of data processor programs and formats can be used to store the nucleotide sequence information of the present invention on computer readable medium. The sequence information can be represented in a word processing text file, formatted in commercially-available software such as WordPerfect and Microsoft Word, or represented in the form of an ASCII file, stored in a database application, such as DB2, Sybase, Oracle, or the like. A skilled artisan can readily adapt any number of data processor structuring formats (e.g. text file or database) in order to obtain computer readable medium having recorded thereon the nucleotide sequence information of the present invention.
By providing the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 in computer readable form, a skilled artisan can routinely access the sequence information for a variety of purposes. Computer software is publicly available which allows a skilled artisan to access sequence information provided in a computer readable medium. The examples which follow demonstrate how software which implements the BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. Chem. 17:203-207 (1993)) search algorithms on a Sybase system was used to identify open reading frames (ORFs) within the Haemophilus influenzae Rd genome which contain homology to ORFs or proteins from other organisms. Such ORFs are protein encoding fragments within the Haemophilus influenzae Rd genome and are useful in producing commercially important proteins such as enzymes used in fermentation reactions and in the production of commercially useful metabolites.
The present invention further provides systems, particularly computer-based systems, which contain the sequence information described herein. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.
As used herein, "a computer-based system" refers to the hardware means, software means, and data storage means used to analyze the nucleotide sequence information of the present invention. The minimum hardware means of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means. A skilled artisan can readily appreciate that any one of the currently available computer-based systems is suitable for use in the present invention.
As stated above, the computer-based systems of the present invention comprise a data storage means having stored therein a nucleotide sequence of the present invention and the necessary hardware means and software means for supporting and implementing a search means. As used herein, "data storage means" refers to memory which can store nucleotide sequence information of the present invention, or a memory access means which can access manufactures having recorded thereon the nucleotide sequence information of the present invention.
As used herein, "search means" refers to one or more programs which are implemented on the computer-based system to compare a target sequence or target structural motif with the sequence information stored within the data storage means. Search means are used to identify fragments or regions of the Haemophilus influenzae Rd genome which match a particular target sequence or target motif. A variety of known algorithms are disclosed publicly, and a variety of commercially available software packages for conducting search means exist and can be used in the computer-based systems of the present invention. Examples of such software include, but are not limited to, MacPattern (EMBL), BLASTN and BLASTX (NCBI). A skilled artisan can readily recognize that any one of the available algorithms or implementing software packages for conducting homology searches can be adapted for use in the present computer-based systems.
As used herein, a "target sequence" can be any DNA or amino acid sequence of six or more nucleotides or two or more amino acids. A skilled artisan can readily recognize that the longer a target sequence is, the less likely a target sequence will be present as a random occurrence in the database. The most preferred sequence length of a target sequence is from about 10 to 100 amino acids or from about 30 to 300 nucleotide residues. However, it is well recognized that searches for commercially important fragments of the Haemophilus influenzae Rd genome, such as sequence fragments involved in gene expression and protein processing, may be of shorter length.
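As a minimal, non-limiting sketch of a search means, the following compares a nucleotide target sequence against a stored genome string on both strands by exact matching. It is a deliberately simplified stand-in for the homology search programs named above, not an implementation of any of them; the names find_target and reverse_complement are hypothetical, and the six-nucleotide minimum is taken from the definition of a target sequence.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_target(genome, target):
    """Return sorted (position, strand) hits for an exact target match.

    Positions are 1-based coordinates of the leftmost matching base on
    the stored (forward) strand; strand "-" means the target matched
    the reverse complement of the stored sequence.
    """
    genome = genome.upper()
    target = target.upper()
    if len(target) < 6:
        raise ValueError("target sequence must be six or more nucleotides")
    queries = [("+", target)]
    rc = reverse_complement(target)
    if rc != target:  # avoid reporting a palindromic target twice
        queries.append(("-", rc))
    hits = []
    for strand, query in queries:
        start = genome.find(query)
        while start != -1:
            hits.append((start + 1, strand))
            start = genome.find(query, start + 1)  # allow overlapping hits
    return sorted(hits)
```

For example, searching the toy string "TTCTGGCTAAAGCCAGTT" for the repeat unit CTGGCT would report one hit on each strand. A practical search means would score inexact matches rather than require identity.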
As used herein, "a target structural motif," or "target motif," refers to any rationally selected sequence or combination of sequences in which the sequence(s) are chosen based on a three-dimensional configuration which is formed upon the folding of the target motif. There are a variety of target motifs known in the art. Protein target motifs include, but are not limited to, enzymic active sites and signal sequences. Nucleic acid target motifs include, but are not limited to, promoter sequences, hairpin structures and inducible expression elements (protein binding sequences).
A variety of structural formats for the input and output means can be used to input and output the information in the computer-based systems of the present invention. A preferred format for an output means ranks fragments of the Haemophilus influenzae Rd genome possessing varying degrees of homology to the target sequence or target motif. Such presentation provides a skilled artisan with a ranking of sequences which contain various amounts of the target sequence or target motif and identifies the degree of homology contained in the identified fragment.
A variety of comparing means can be used to compare a target sequence or target motif with the data storage means to identify sequence fragments of the Haemophilus influenzae Rd genome. In the present examples, software implementing the BLAST and BLAZE algorithms (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) was used to identify open reading frames within the Haemophilus influenzae Rd genome. A skilled artisan can readily recognize that any one of the publicly available homology search programs can be used as the search means for the computer-based systems of the present invention.
One application of this embodiment is provided in Figure 2. Figure 2 provides a block diagram of a computer system 102 that can be used to implement the present invention. The computer system 102 includes a processor 106 connected to a bus 104. Also connected to the bus 104 are a main memory 108 (preferably implemented as random access memory, RAM) and a variety of secondary storage devices 110, such as a hard drive 112 and a removable medium storage device 114. The removable medium storage device 114 may represent, for example, a floppy disk drive, a CD-ROM drive, a magnetic tape drive, etc. A removable storage medium 116 (such as a floppy disk, a compact disk, a magnetic tape, etc.) containing control logic and/or data recorded therein may be inserted into the removable medium storage device 114. The computer system 102 includes appropriate software for reading the control logic and/or the data from the removable medium storage device 114 once inserted in the removable medium storage device 114.
A nucleotide sequence of the present invention may be stored in a well known manner in the main memory 108, any of the secondary storage devices 110, and/or a removable storage medium 116. Software for accessing and processing the genomic sequence (such as search tools, comparing tools, etc.) resides in main memory 108 during execution.
Biochemical Embodiments
Another embodiment of the present invention is directed to isolated fragments of the Haemophilus influenzae Rd genome. The fragments of the Haemophilus influenzae Rd genome of the present invention include, but are not limited to, fragments which encode peptides, hereinafter open reading frames (ORFs), fragments which modulate the expression of an operably linked ORF, hereinafter expression modulating fragments (EMFs), fragments which mediate the uptake of a linked DNA fragment into a cell, hereinafter uptake modulating fragments (UMFs), and fragments which can be used to diagnose the presence of Haemophilus influenzae Rd in a sample, hereinafter diagnostic fragments (DFs).
As used herein, an "isolated nucleic acid molecule" or an "isolated fragment of the Haemophilus influenzae Rd genome" refers to a nucleic acid molecule possessing a specific nucleotide sequence which has been subjected to purification means to reduce, from the composition, the number of compounds which are normally associated with the composition. A variety of purification means can be used to generate the isolated fragments of the present invention. These include, but are not limited to, methods which separate constituents of a solution based on charge, solubility, or size.
In one embodiment, Haemophilus influenzae Rd DNA can be mechanically sheared to produce fragments of 15-20 kb in length. These fragments can then be used to generate a Haemophilus influenzae Rd library by inserting them into lambda clones as described in the Examples below. Primers flanking, for example, an ORF provided in Table 1(a) can then be generated using nucleotide sequence information provided in SEQ ID NO:1. PCR cloning can then be used to isolate the ORF from the lambda DNA library. PCR cloning is well known in the art. Thus, given the availability of SEQ ID NO:1, Table 1(a) and Table 2, it would be routine to isolate any ORF or other nucleic acid fragment of the present invention.
The isolated nucleic acid molecules of the present invention include, but are not limited to single stranded and double stranded DNA, and single stranded RNA.
As used herein, an "open reading frame," ORF, means a series of triplets coding for amino acids without any termination codons and is a sequence translatable into protein. Tables 1(a), 1(b) and 2 identify ORFs in the Haemophilus influenzae Rd genome. In particular, Table 1(a) indicates the location of ORFs within the Haemophilus influenzae genome which encode the recited protein based on homology matching with protein sequences from the organism appearing in parentheticals (see the fourth column of Table 1(a)).
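The definition of an ORF as a series of triplets without termination codons can be illustrated with the following sketch, which scans the three forward reading frames of a sequence for stop-free runs of codons. This is not the Genemark method used to generate the Tables; the function name stop_free_runs and the default minimum of 30 codons are assumptions made for illustration only, and the opposite strand is not scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_free_runs(seq, min_codons=30):
    """Return (start, end, frame) for stop-free codon runs on one strand.

    start and end are 1-based inclusive coordinates, frame is 1, 2 or 3.
    Only runs of at least min_codons complete triplets are reported.
    """
    seq = seq.upper()
    runs = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    runs.append((run_start + 1, pos, frame + 1))
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        # Close the run that reaches the end of the sequence, if any.
        if (pos - run_start) // 3 >= min_codons:
            runs.append((run_start + 1, pos, frame + 1))
    return runs
```

To cover the opposite strand under the same assumptions, the function could be applied to the reverse complement of the sequence and the coordinates mapped back.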
The first column of Table 1(a) provides the "GeneID" of a particular ORF. This information is useful for two reasons. First, the complete map of the Haemophilus influenzae Rd genome provided in Figures 6(A)-6(D) refers to the ORFs according to their GeneID numbers. Second, Table 1(b) uses the GeneID numbers to indicate which ORFs were provided previously in a public database.
The second and third columns in Table 1(a) indicate an ORF's position in the nucleotide sequence provided in SEQ ID NO:1. One of ordinary skill will recognize that ORFs may be oriented in opposite directions in the Haemophilus influenzae genome. This is reflected in columns 2 and 3.
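The following sketch shows one way position columns of this kind could be used to retrieve an ORF from a stored genome string. It assumes, purely for illustration, that the coordinates are 1-based and inclusive and that a first coordinate larger than the second denotes an ORF on the opposite strand; this convention is an assumption and should be checked against the Tables. The name extract_orf is hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_orf(genome, first, second):
    """Return the ORF sequence read 5' to 3' from two table coordinates.

    Assumed convention (not confirmed by the text): coordinates are
    1-based inclusive, and first > second marks the opposite strand.
    """
    if first <= second:
        return genome[first - 1:second].upper()
    return reverse_complement(genome[second - 1:first])
```

Under this assumed convention, extract_orf(genome, 10, 8) reads positions 8 through 10 of the stored strand and returns their reverse complement.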
The fifth column of Table 1(a) indicates the percent identity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column. The sixth column of Table 1(a) indicates the percent similarity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column. The concepts of percent identity and percent similarity of two polypeptide sequences are well understood in the art. For example, two polypeptides 10 amino acids in length which differ at three amino acid positions (e.g., at positions 1, 3 and 5) are said to have a percent identity of 70%. However, the same two polypeptides would be deemed to have a percent similarity of 80% if, for example at position 5, the amino acid moieties, although not identical, were "similar" (i.e., possessed similar biochemical characteristics).
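The arithmetic of this example can be reproduced with a short sketch for two aligned polypeptides of equal length. The grouping of "similar" amino acids below is an illustrative assumption, not the scoring scheme used to generate Table 1(a), and the names SIMILAR_GROUPS and percent_identity_and_similarity are hypothetical.

```python
# Illustrative groups of biochemically similar residues (an assumption).
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FYW"),   # aromatic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("ST"),    # hydroxyl
    set("NQ"),    # amide
]


def is_similar(a, b):
    """True if two residues are identical or share an illustrative group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def percent_identity_and_similarity(p, q):
    """Return (percent identity, percent similarity) for aligned peptides."""
    if len(p) != len(q) or not p:
        raise ValueError("peptides must be non-empty and of equal length")
    identical = sum(a == b for a, b in zip(p, q))
    similar = sum(is_similar(a, b) for a, b in zip(p, q))
    return 100.0 * identical / len(p), 100.0 * similar / len(p)
```

With the peptides MKTLLAGSWP and AKGLIAGSWP, which differ at positions 1, 3 and 5 and carry the similar pair L/I at position 5, the sketch gives 70% identity and 80% similarity, matching the example above.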
The seventh column in Table 1(a) indicates the length of the amino acid homology match.
Table 2 provides ORFs of the Haemophilus influenzae Rd genome which encode polypeptide sequences which did not elicit a "homology match" with a known protein sequence from another organism. Further details concerning the algorithms and criteria used for homology searches are provided in the Examples below.
A skilled artisan can readily identify ORFs in the Haemophilus influenzae Rd genome other than those listed in Tables 1(a), 1(b) and 2, such as ORFs which are overlapping or encoded by the opposite strand of an identified ORF in addition to those ascertainable using the computer-based systems of the present invention.
As used herein, an "expression modulating fragment," EMF, means a series of nucleotide molecules which modulates the expression of an operably linked ORF or EMF.
As used herein, a sequence is said to "modulate the expression of an operably linked sequence" when the expression of the sequence is altered by the presence of the EMF. EMFs include, but are not limited to, promoters, and promoter modulating sequences (inducible elements). One class of EMFs are fragments which induce the expression of an operably linked ORF in response to a specific regulatory factor or physiological event. Known EMFs from Haemophilus are reviewed by Tomb et al., Gene 104:1-10 (1991) and Chandler, M. S., Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992).
EMF sequences can be identified within the Haemophilus influenzae Rd genome by their proximity to the ORFs provided in Tables 1(a), 1(b) and 2. An intergenic segment, or a fragment of the intergenic segment, from about 10 to 200 nucleotides in length, taken 5' from any one of the ORFs of Tables 1(a), 1(b), or 2 will modulate the expression of an operably linked 3' ORF in a fashion similar to that found with the naturally linked ORF sequence. As used herein, an "intergenic segment" refers to the fragments of the Haemophilus genome which are between two ORF(s) herein described.
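A candidate EMF of this kind can be cut from a stored genome string as sketched below. The sketch assumes a forward-strand ORF, 1-based inclusive coordinates, and a known end position for the preceding ORF; the function name upstream_segment and its parameters are hypothetical, and the limits of 10 and 200 nucleotides are taken from the passage above.

```python
def upstream_segment(genome, orf_start, prev_orf_end, min_len=10, max_len=200):
    """Return the intergenic segment immediately 5' of a forward-strand ORF.

    The segment ends at the base before orf_start and extends upstream
    for at most max_len bases without entering the preceding ORF.
    None is returned when fewer than min_len bases are available.
    """
    seg_end = orf_start - 1
    seg_start = max(prev_orf_end + 1, orf_start - max_len, 1)
    if seg_end - seg_start + 1 < min_len:
        return None
    return genome[seg_start - 1:seg_end]
```

For an ORF on the opposite strand, the same idea would apply to the bases following the ORF on the stored strand, reverse complemented. Whether a given segment actually modulates expression remains to be confirmed experimentally, for example with the EMF trap vector described below.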
Alternatively, EMFs can be identified using known EMFs as a target sequence or target motif in the computer-based systems of the present invention.
The presence and activity of an EMF can be confirmed using an EMF trap vector. An EMF trap vector contains a cloning site 5' to a marker sequence. A marker sequence encodes an identifiable phenotype, such as antibiotic resistance or a complementing nutrition auxotrophic factor, which can be identified or assayed when the EMF trap vector is placed within an appropriate host under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence. A more detailed discussion of various marker sequences is provided below.
A sequence which is suspected of being an EMF is cloned in all three reading frames in one or more restriction sites upstream from the marker sequence in the EMF trap vector. The vector is then transformed into an appropriate host using known procedures and the phenotype of the transformed host is examined under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence.
As used herein, an "uptake modulating fragment," UMF, means a series of nucleotide molecules which mediate the uptake of a linked DNA fragment into a cell. UMFs can be readily identified using known UMFs as a target sequence or target motif with the computer-based systems described above. The presence and activity of a UMF can be confirmed by attaching the suspected UMF to a marker sequence. The resulting nucleic acid molecule is then incubated with an appropriate host under appropriate conditions and the uptake of the marker sequence is determined. As described above, a UMF will increase the frequency of uptake of a linked marker sequence. A review of DNA uptake in Haemophilus is provided by Goodgal, S.H., et al., J. Bact. 172:5924-5928 (1990).
As used herein, a "diagnostic fragment," DF, means a series of nucleotide molecules which selectively hybridize to Haemophilus influenzae sequences. DFs can be readily identified by identifying unique sequences within the Haemophilus influenzae Rd genome, or by generating and testing probes or amplification primers consisting of the DF sequence in an appropriate diagnostic format which determines amplification or hybridization selectivity.
The sequences falling within the scope of the present invention are not limited to the specific sequences herein described, but also include allelic and species variations thereof. Allelic and species variations can be routinely determined by comparing the sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 with a sequence from another isolate of the same species. Furthermore, to accommodate codon variability, the invention includes nucleic acid molecules coding for the same amino acid sequences as do the specific ORFs disclosed herein. In other words, in the coding region of an ORF, substitution of one codon for another which encodes the same amino acid is expressly contemplated.
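Codon variability of this kind can be checked computationally by translating two coding sequences and comparing the resulting amino acid sequences. The sketch below uses the standard genetic code; the helper names translate and encode_same_protein are hypothetical and are offered only as an illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G;
# "*" marks a termination codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate complete triplets of a DNA coding sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, usable, 3))


def encode_same_protein(orf_a, orf_b):
    """True if two coding sequences translate to the same amino acids."""
    return translate(orf_a) == translate(orf_b)
```

For instance, ATGGCTAAA and ATGGCCAAG differ at two nucleotide positions yet both translate to Met-Ala-Lys, so they would be treated as coding for the same amino acid sequence.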
Any specific sequence disclosed herein can be readily screened for errors by resequencing a particular fragment, such as an ORF, in both directions (i.e., sequence both strands). Alternatively, error screening can be performed by sequencing corresponding polynucleotides of Haemophilus influenzae origin isolated by using part or all of the fragments in question as a probe or primer. Each of the ORFs of the Haemophilus influenzae Rd genome disclosed in Tables 1(a), 1(b) and 2, and the EMF found 5' to the ORF, can be used in numerous ways as polynucleotide reagents. The sequences can be used as diagnostic probes or diagnostic amplification primers to detect the presence of a specific microbe, such as Haemophilus influenzae Rd, in a sample. This is especially the case with the fragments or ORFs of Table 2, which will be highly selective for Haemophilus influenzae.
In addition, the fragments of the present invention, as broadly described, can be used to control gene expression through triple helix formation or antisense DNA or RNA, both of which methods are based on the binding of a polynucleotide sequence to DNA or RNA. Polynucleotides suitable for use in these methods are usually 20 to 40 bases in length and are designed to be complementary to a region of the gene involved in transcription (triple helix - see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456 (1988); and Dervan et al., Science 251:1360 (1991)) or to the mRNA itself (antisense - Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, FL (1988)).
Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks translation of an mRNA molecule into polypeptide. Both techniques have been demonstrated to be effective in model systems. Information contained in the sequences of the present invention is necessary for the design of an antisense or triple helix oligonucleotide.
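As a minimal sketch of how sequence information feeds antisense design, the following derives an oligonucleotide complementary to a chosen window of a target sequence, restricted to the 20 to 40 base range stated above. The function names and the use of a DNA alphabet for the target are assumptions of this sketch; it does not address choice of target region, secondary structure, or triple helix design rules.

```python
# Complement table for the DNA alphabet (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def antisense_oligo(target, start, length=20):
    """Return an oligo complementary to target[start:start+length].

    Hypothetical helper: enforces the 20-40 base length range from the text.
    """
    if not 20 <= length <= 40:
        raise ValueError("length must be between 20 and 40 bases")
    region = target[start:start + length]
    if len(region) < length:
        raise ValueError("window extends past the end of the target")
    return reverse_complement(region)
```

The returned oligonucleotide is written 5' to 3' and pairs with the selected window of the target strand.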
The present invention further provides recombinant constructs comprising one or more fragments of the Haemophilus influenzae Rd genome of the present invention. The recombinant constructs of the present invention comprise a vector, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted, in a forward or reverse orientation. In the case of a vector comprising one of the ORFs of the present invention, the vector may further comprise regulatory sequences, including, for example, a promoter, operably linked to the ORF. For vectors comprising the EMFs and UMFs of the present invention, the vector may further comprise a marker sequence or heterologous ORF operably linked to the EMF or UMF. Large numbers of suitable vectors and promoters are known to those of skill in the art and are commercially available for generating the recombinant constructs of the present invention. The following vectors are provided by way of example. Bacterial: pBs, phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia). Eukaryotic: pWLneo, pSV2cat, pOG44, pXT1, pSG (Stratagene); pSVK3, pBPV, pMSG, pSVL (Pharmacia).
Promoter regions can be selected from any desired gene using CAT (chloramphenicol acetyltransferase) vectors or other vectors with selectable markers. Two appropriate vectors are pKK232-8 and pCM7. Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda PR, and trc. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-I. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art.
The present invention further provides host cells containing any one of the isolated fragments of the Haemophilus influenzae Rd genome of the present invention, wherein the fragment has been introduced into the host cell using known transformation methods. The host cell can be a higher eukaryotic host cell, such as a mammalian cell, a lower eukaryotic host cell, such as a yeast cell, or the host cell can be a prokaryotic cell, such as a bacterial cell. Introduction of the recombinant construct into the host cell can be effected by calcium phosphate transfection, DEAE-dextran mediated transfection, or electroporation (Davis, L. et al., Basic Methods in Molecular Biology (1986)).
The host cells containing one of the fragments of the Haemophilus influenzae Rd genome of the present invention can be used in conventional manners to produce the gene product encoded by the isolated fragment (in the case of an ORF) or can be used to produce a heterologous protein under the control of the EMF.
The present invention further provides isolated polypeptides encoded by the nucleic acid fragments of the present invention or by degenerate variants of the nucleic acid fragments of the present invention. By "degenerate variant" is intended nucleotide fragments which differ from a nucleic acid fragment of the present invention (e.g., an ORF) by nucleotide sequence but, due to the degeneracy of the Genetic Code, encode an identical polypeptide sequence. Preferred nucleic acid fragments of the present invention are the ORFs depicted in Table 1(a) which encode proteins.
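The notion of a degenerate variant can be illustrated with a short sketch that translates two coding sequences and checks that they differ at the nucleotide level yet encode the same polypeptide. The function names are hypothetical, and the sketch assumes in-frame DNA coding sequences and the standard codon assignments; it does not model alternative start codons.

```python
from itertools import product

# Standard codon assignments, listed in T, C, A, G order for each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(_BASES, repeat=3)), _AMINO))


def translate(cds):
    """Translate an in-frame DNA coding sequence; '*' marks a stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def is_degenerate_variant(reference, candidate):
    """True if the sequences differ but encode an identical polypeptide."""
    same_dna = reference.upper() == candidate.upper()
    return not same_dna and translate(reference) == translate(candidate)
```

Here `ATGGCTAAA` and `ATGGCCAAG` both translate to Met-Ala-Lys, so each is a degenerate variant of the other in the sense defined above.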
A variety of methodologies known in the art can be utilized to obtain any one of the isolated polypeptides or proteins of the present invention. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. This is particularly useful in producing small peptides and fragments of larger polypeptides. Fragments are useful, for example, in generating antibodies against the native polypeptide. In an alternative method, the polypeptide or protein is purified from bacterial cells which naturally produce the polypeptide or protein. One skilled in the art can readily follow known methods for isolating polypeptides and proteins in order to obtain one of the isolated polypeptides or proteins of the present invention. These include, but are not limited to, immunochromatography, HPLC, size-exclusion chromatography, ion-exchange chromatography, and immuno-affinity chromatography.
The polypeptides and proteins of the present invention can alternatively be purified from cells which have been altered to express the desired polypeptide or protein. As used herein, a cell is said to be altered to express a desired polypeptide or protein when the cell, through genetic manipulation, is made to produce a polypeptide or protein which it normally does not produce or which the cell normally produces at a lower level. One skilled in the art can readily adapt procedures for introducing and expressing either recombinant or synthetic sequences into eukaryotic or prokaryotic cells in order to generate a cell which produces one of the polypeptides or proteins of the present invention.
Any host/vector system can be used to express one or more of the ORFs of the present invention. These include, but are not limited to, eukaryotic hosts such as HeLa cells, CV-1 cells, COS cells, and Sf9 cells, as well as prokaryotic hosts such as E. coli and B. subtilis. The most preferred cells are those which do not normally express the particular polypeptide or protein or which express the polypeptide or protein at a low natural level.
"Recombinant," as used herein, means that a polypeptide or protein is derived from recombinant (e.g., microbial or mammalian) expression systems. "Microbial" refers to recombinant polypeptides or proteins made in bacterial or fungal (e.g., yeast) expression systems. As a product, "recombinant microbial" defines a polypeptide or protein essentially free of native endogenous substances and unaccompanied by associated native glycosylation. Polypeptides or proteins expressed in most bacterial cultures, e.g., E. coli, will be free of glycosylation modifications; polypeptides or proteins expressed in yeast will have a glycosylation pattern different from that expressed in mammalian cells.
"Nucleotide sequence" refers to a heteropolymer of deoxyribonucleotides. Generally, DNA segments encoding the polypeptides and proteins provided by this invention are assembled from fragments of the Haemophilus influenzae Rd genome and short oligonucleotide linkers, or from a series of oligonucleotides, to provide a synthetic gene which is capable of being expressed in a recombinant transcriptional unit comprising regulatory elements derived from a microbial or viral operon.
"Recombinant expression vehicle or vector" refers to a plasmid or phage or virus or vector, for expressing a polypeptide from a DNA (RNA) sequence. The expression vehicle can comprise a transcriptional unit comprising an assembly of (1) a genetic element or elements having a regulatory role in gene expression, for example, promoters or enhancers, (2) a structural or coding sequence which is transcribed into mRNA and translated into protein, and (3) appropriate transcription initiation and termination sequences. Structural units intended for use in yeast or eukaryotic expression systems preferably include a leader sequence enabling extracellular secretion of translated protein by a host cell. Alternatively, where recombinant protein is expressed without a leader or transport sequence, it may include an N-terminal methionine residue. This residue may or may not be subsequently cleaved from the expressed recombinant protein to provide a final product.
"Recombinant expression system" means host cells which have stably integrated a recombinant transcriptional unit into chromosomal DNA or carry the recombinant transcriptional unit extrachromosomally. The cells can be prokaryotic or eukaryotic. Recombinant expression systems as defined herein will express heterologous polypeptides or proteins upon induction of the regulatory elements linked to the DNA segment or synthetic gene to be expressed.
Mature proteins can be expressed in mammalian cells, yeast, bacteria, or other cells under the control of appropriate promoters. Cell-free translation systems can also be employed to produce such proteins using RNAs derived from the DNA constructs of the present invention. Appropriate cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook, et al., in Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor, New York (1989), the disclosure of which is hereby incorporated by reference.
Generally, recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and S. cerevisiae TRP1 gene, and a promoter derived from a highly-expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK), α-factor, acid phosphatase, or heat shock proteins, among others. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein into the periplasmic space or extracellular medium. Optionally, the heterologous sequence can encode a fusion protein including an N-terminal identification peptide imparting desired characteristics, e.g., stabilization or simplified purification of expressed recombinant product.
Useful expression vectors for bacterial use are constructed by inserting a structural DNA sequence encoding a desired protein together with suitable translation initiation and termination signals in operable reading phase with a functional promoter. The vector will comprise one or more phenotypic selectable markers and an origin of replication to ensure maintenance of the vector and to, if desirable, provide amplification within the host. Suitable prokaryotic hosts for transformation include E. coli, Bacillus subtilis, Salmonella typhimurium and various species within the genera Pseudomonas, Streptomyces, and Staphylococcus, although others may also be employed as a matter of choice.
As a representative but nonlimiting example, useful expression vectors for bacterial use can comprise a selectable marker and bacterial origin of replication derived from commercially available plasmids comprising genetic elements of the well known cloning vector pBR322 (ATCC 37017). Such commercial vectors include, for example, pKK223-3 (Pharmacia Fine Chemicals, Uppsala, Sweden) and GEM 1 (Promega Biotec, Madison, WI, USA). These pBR322 "backbone" sections are combined with an appropriate promoter and the structural sequence to be expressed.
Following transformation of a suitable host strain and growth of the host strain to an appropriate cell density, the selected promoter is derepressed by appropriate means (e.g., temperature shift or chemical induction) and cells are cultured for an additional period. Cells are typically harvested by centrifugation, disrupted by physical or chemical means, and the resulting crude extract retained for further purification.
Various mammalian cell culture systems can also be employed to express recombinant protein. Examples of mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts, described by Gluzman, Cell 23:175 (1981), and other cell lines capable of expressing a compatible vector, for example, the C127, 3T3, CHO, HeLa and BHK cell lines. Mammalian expression vectors will comprise an origin of replication, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation site, splice donor and acceptor sites, transcriptional termination sequences, and 5' flanking nontranscribed sequences. DNA sequences derived from the SV40 viral genome, for example, SV40 origin, early promoter, enhancer, splice, and polyadenylation sites may be used to provide the required nontranscribed genetic elements.
Recombinant polypeptides and proteins produced in bacterial culture are usually isolated by initial extraction from cell pellets, followed by one or more salting-out, aqueous ion exchange or size exclusion chromatography steps. Protein refolding steps can be used, as necessary, in completing configuration of the mature protein. Finally, high performance liquid chromatography (HPLC) can be employed for final purification steps. Microbial cells employed in expression of proteins can be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents.
The present invention further includes isolated polypeptides, proteins and nucleic acid molecules which are substantially equivalent to those herein described. As used herein, substantially equivalent can refer both to nucleic acid and amino acid sequences, for example a mutant sequence, that varies from a reference sequence by one or more substitutions, deletions, or additions, the net effect of which does not result in an adverse functional dissimilarity between reference and subject sequences. For purposes of the present invention, sequences having equivalent biological activity and equivalent expression characteristics are considered substantially equivalent. For purposes of determining equivalence, truncation of the mature sequence should be disregarded.
The invention further provides methods of obtaining, from other strains of Haemophilus influenzae, homologs of the fragments of the Haemophilus influenzae Rd genome of the present invention and homologs of the proteins encoded by the ORFs of the present invention. As used herein, a sequence or protein of Haemophilus influenzae is defined as a homolog of a fragment of the Haemophilus influenzae Rd genome or a protein encoded by one of the ORFs of the present invention, if it shares significant homology to one of the fragments of the Haemophilus influenzae Rd genome of the present invention or a protein encoded by one of the ORFs of the present invention. Specifically, by using the sequence disclosed herein as a probe or as primers, and techniques such as PCR cloning and colony/plaque hybridization, one skilled in the art can obtain homologs.
As used herein, two nucleic acid molecules or proteins are said to "share significant homology" if the two contain regions which possess greater than 85% sequence (amino acid or nucleic acid) homology.
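As a rough sketch of the 85% criterion, the following computes percent identity over two regions that have already been aligned to equal length and compares the result to the threshold. Treating "homology" as position-by-position identity, counting gap characters ("-") as mismatches, and the function names themselves are assumptions made for illustration; a real comparison would first require an alignment step.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which a and b carry the same residue.

    Assumes a and b are pre-aligned and of equal length; '-' never matches.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y and x != "-")
    return 100.0 * matches / len(a)


def shares_significant_homology(a, b, threshold=85.0):
    """True if the aligned regions exceed the threshold (greater than 85%)."""
    return percent_identity(a, b) > threshold
```

Note that the comparison is strict: a region that is exactly 85% identical does not satisfy "greater than 85%".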
Region specific primers or probes derived from the nucleotide sequence provided in SEQ ID NO:1 or from a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 can be used to prime DNA synthesis and PCR amplification, as well as to identify colonies containing cloned DNA encoding a homolog using known methods (Innis et al., PCR Protocols, Academic Press, San Diego, CA (1990)).
When using primers derived from SEQ ID NO:1 or from a nucleotide sequence at least 99.9% identical to SEQ ID NO:1, one skilled in the art will recognize that by employing high stringency conditions (e.g., annealing at 50-60°C) only sequences which are greater than 75% homologous to the primer will be amplified. By employing lower stringency conditions (e.g., annealing at 35-37°C), sequences which are greater than 40-50% homologous to the primer will also be amplified.
When using DNA probes derived from SEQ ID NO:1 or from a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 for colony/plaque hybridization, one skilled in the art will recognize that by employing high stringency conditions (e.g., hybridizing at 50-65°C in 5X SSPC and 50% formamide, and washing at 50-65°C in 0.5X SSPC), sequences having regions which are greater than 90% homologous to the probe can be obtained, and that by employing lower stringency conditions (e.g., hybridizing at 35-37°C in 5X SSPC and 40-45% formamide, and washing at 42°C in SSPC), sequences having regions which are greater than 35-45% homologous to the probe will be obtained.
Any organism can be used as the source for homologs of the present invention so long as the organism naturally expresses such a protein or contains genes encoding the same. The most preferred organisms for isolating homologs are bacteria which are closely related to Haemophilus influenzae Rd.
Uses for the Compositions of the Invention
Each ORF provided in Table 1(a) was assigned to one of 102 biological role categories adapted from Riley, M., Microbiology Reviews 57(4):862 (1993). This allows the skilled artisan to determine a use for each identified coding sequence. Table 1(a) further provides an identification of the type of polypeptide which is encoded by each ORF. As a result, one skilled in the art can use the polypeptides of the present invention for commercial, therapeutic and industrial purposes consistent with the type of putative identification of the polypeptide.
Such identifications permit one skilled in the art to use the Haemophilus influenzae ORFs in a manner similar to the known type of sequences for which the identification is made; for example, to ferment a particular sugar source or to produce a particular metabolite. (For a review of enzymes used within the commercial industry, see Biochemical Engineering and Biotechnology Handbook, 2nd ed., Macmillan Publ. Ltd., NY (1991) and Biocatalysts in Organic Syntheses, ed. J. Tramper et al., Elsevier Science Publishers, Amsterdam, The Netherlands (1985)).
1. Biosynthetic Enzymes
Open reading frames encoding proteins that mediate the catalytic reactions of intermediary and macromolecular metabolism, the biosynthesis of small molecules, cellular processes and other functions can be used for industrial biosynthesis. These include enzymes involved in the degradation of the intermediary products of metabolism, enzymes involved in central intermediary metabolism, enzymes involved in respiration, both aerobic and anaerobic, enzymes involved in fermentation, enzymes involved in ATP proton motive force conversion, enzymes involved in broad regulatory function, enzymes involved in amino acid synthesis, enzymes involved in nucleotide synthesis, and enzymes involved in cofactor and vitamin synthesis. The various metabolic pathways present in Haemophilus can be identified based on absolute nutritional requirements as well as by examining the various enzymes identified in Table 1(a).
Within the category of intermediary metabolism, a number of the proteins encoded by the identified ORFs in Table 1(a) are particularly involved in the degradation of intermediary metabolites as well as non-macromolecular metabolism. Some of the enzymes identified include amylases, glucose oxidases, and catalase. Proteolytic enzymes are another class of commercially important enzymes. Proteolytic enzymes find use in a number of industrial processes including the processing of flax and other vegetable fibers, in the extraction, clarification and depectinization of fruit juices, in the extraction of vegetable oil and in the maceration of fruits and vegetables to give unicellular fruits. A detailed review of the proteolytic enzymes used in the food industry is provided by Rombouts et al., Symbiosis 21:79 (1986) and Voragen et al. in Biocatalyst in Agricultural Biotechnology, edited by J.R. Whitaker et al., American Chemical Society Symposium Series 389:93 (1989).
The metabolism of glucose, galactose, fructose and xylose is an important part of the primary metabolism of Haemophilus. Enzymes involved in the degradation of these sugars can be used in industrial fermentation. Some of the important sugar transforming enzymes, from a commercial viewpoint, include sugar isomerases such as glucose isomerase. Other metabolic enzymes have found commercial use, such as glucose oxidases, which produce ketogulonic acid (KGA). KGA is an intermediate in the commercial production of ascorbic acid using Reichstein's procedure (see Krueger et al., Biotechnology 6(A), Rhine, H.J. et al., eds., Verlag Press, Weinheim, Germany (1984)).
Glucose oxidase (GOD) is commercially available and has been used in purified form as well as in an immobilized form for the deoxygenation of beer. See Hartmeir et al., Biotechnology Letters 7:21 (1979). The most important application of GOD is the industrial scale fermentation of gluconic acid. There is a market for gluconic acids, which are used in the detergent, textile, leather, photographic, pharmaceutical, food, feed and concrete industries (see Bigelis in Gene Manipulations in Fungi, Bennett, J.W. et al., eds., Academic Press, New York (1985), p.357). In addition to industrial applications, GOD has found applications in medicine for quantitative determination of glucose in body fluids and, recently, in biotechnology for analyzing syrups from starch and cellulose hydrolysates. See Owusu et al., Biochim. Biophys. Acta 872:83 (1986).
The main sweetener used in the world today is sugar which comes from sugar beets and sugar cane. In the field of industrial enzymes, the glucose isomerase process shows the largest expansion in the market today. Initially, soluble enzymes were used and later immobilized enzymes were developed (Krueger et al., Biotechnology, The Textbook of Industrial Microbiology, Sinauer Associates, Incorporated, Sunderland, Massachusetts (1990)). Today, the use of glucose-produced high fructose syrups is by far the largest industrial business using immobilized enzymes. A review of the industrial use of these enzymes is provided by Jorgensen, Starch 40:307 (1988).
Proteinases, such as alkaline serine proteinases, are used as detergent additives and thus represent one of the largest volumes of microbial enzymes used in the industrial sector. Because of their industrial importance, there is a large body of published and unpublished information regarding the use of these enzymes in industrial processes. (See Faultman et al., Acid Proteases Structure Function and Biology, Tang, J., ed., Plenum Press, New York (1977) and Godfrey et al., Industrial Enzymes, MacMillan Publishers, Surrey, UK (1983) and Hepner et al., Report Industrial Enzymes by 1990, Hel Hepner & Associates, London (1986)).
Another class of commercially usable proteins of the present invention are the microbial lipases identified in Table 1 (see Macrae et al., Philosophical Transactions of the Royal Society of London 310:227 (1985) and Poserke, Journal of the American Oil Chemists' Society 61:1758 (1984)). A major use of lipases is in the fat and oil industry for the production of neutral glycerides using lipase catalyzed inter-esterification of readily available triglycerides. Applications of lipases include use as a detergent additive to facilitate the removal of fats from fabrics in the course of the washing procedures.
The use of enzymes, and in particular microbial enzymes, as catalysts for key steps in the synthesis of complex organic molecules is gaining popularity at a great rate. One area of great interest is the preparation of chiral intermediates. Preparation of chiral intermediates is of interest to a wide range of synthetic chemists, particularly those scientists involved with the preparation of new pharmaceuticals, agrochemicals, fragrances and flavors. (See Davies et al., Recent Advances in the Generation of Chiral Intermediates Using Enzymes, CRC Press, Boca Raton, Florida (1990)). The following reactions catalyzed by enzymes are of interest to organic chemists: hydrolysis of carboxylic acid esters, phosphate esters, amides and nitriles, esterification reactions, trans-esterification reactions, synthesis of amides, reduction of alkanones and oxoalkanoates, oxidation of alcohols to carbonyl compounds, oxidation of sulfides to sulfoxides, and carbon bond forming reactions such as the aldol reaction.
When considering the use of an enzyme encoded by one of the ORFs of the present invention for biotransformation and organic synthesis, it is sometimes necessary to consider the respective advantages and disadvantages of using a microorganism as opposed to an isolated enzyme. The pros and cons of using a whole cell system on the one hand or an isolated partially purified enzyme on the other hand have been described in detail by Bud et al., Chemistry in Britain (1987), p.127.
Amino transferases, enzymes involved in the biosynthesis and metabolism of amino acids, are useful in the catalytic production of amino acids. The advantage of using microbial based enzyme systems is that the amino transferase enzymes catalyze the stereo-selective synthesis of only L-amino acids and generally possess uniformly high catalytic rates. A description of the use of amino transferases for amino acid production is provided by Roselle-David, Methods in Enzymology 136:479 (1987).
Another category of useful proteins encoded by the ORFs of the present invention includes enzymes involved in nucleic acid synthesis, repair, and recombination. A variety of commercially important enzymes have previously been isolated from members of Haemophilus sp. These include the Hinc II, Hind III, and Hinf I restriction endonucleases. Table 1(a) identifies a wide array of enzymes, such as restriction enzymes, ligases, gyrases and methylases, which have immediate use in the biotechnology industry.
2. Generation of Antibodies
As described here, the proteins of the present invention, as well as homologs thereof, can be used in a variety of procedures and methods known in the art which are currently applied to other proteins. The proteins of the present invention can further be used to generate an antibody which selectively binds the protein. Such antibodies can be either monoclonal or polyclonal antibodies, as well as fragments of these antibodies, and humanized forms.
The invention further provides antibodies which selectively bind to one of the proteins of the present invention and hybridomas which produce these antibodies. A hybridoma is an immortalized cell line which is capable of secreting a specific monoclonal antibody.
In general, techniques for preparing polyclonal and monoclonal antibodies as well as hybridomas capable of producing the desired antibody are well known in the art (Campbell, A.M., Monoclonal Antibody Technology: Laboratory Techniques in Biochemistry and Molecular Biology, Elsevier Science Publishers, Amsterdam, The Netherlands (1984); St. Groth et al., J. Immunol. Methods 35:1-21 (1980); Kohler and Milstein, Nature 256:495-497 (1975)), and include the trioma technique and the human B-cell hybridoma technique (Kozbor et al., Immunology Today 4:72 (1983); Cole et al., in Monoclonal Antibodies and Cancer Therapy, Alan R. Liss, Inc. (1985), pp.77-96).
Any animal (mouse, rabbit, etc.) which is known to produce antibodies can be immunized with the pseudogene polypeptide. Methods for immunization are well known in the art. Such methods include subcutaneous or intraperitoneal injection of the polypeptide. One skilled in the art will recognize that the amount of the protein encoded by the ORF of the present invention used for immunization will vary based on the animal which is immunized, the antigenicity of the peptide and the site of injection.
The protein which is used as an immunogen may be modified or administered in an adjuvant in order to increase the protein's antigenicity. Methods of increasing the antigenicity of a protein are well known in the art and include, but are not limited to, coupling the antigen with a heterologous protein (such as globulin or β-galactosidase) or including an adjuvant during immunization.
For monoclonal antibodies, spleen cells from the immunized animals are removed, fused with myeloma cells, such as SP2/0-Ag14 myeloma cells, and allowed to become monoclonal antibody producing hybridoma cells.
Any one of a number of methods well known in the art can be used to identify the hybridoma cell which produces an antibody with the desired characteristics. These include screening the hybridomas with an ELISA assay, western blot analysis, or radioimmunoassay (Lutz et al., Exp. Cell Res. 175:109-124 (1988)).
Hybridomas secreting the desired antibodies are cloned and the class and subclass are determined using procedures known in the art (Campbell, A.M., Monoclonal Antibody Technology: Laboratory Techniques in Biochemistry and Molecular Biology, Elsevier Science Publishers, Amsterdam, The Netherlands (1984)).
Techniques described for the production of single chain antibodies (U.S. Patent 4,946,778) can be adapted to produce single chain antibodies to proteins of the present invention.
For polyclonal antibodies, antibody containing antisera are isolated from the immunized animal and are screened for the presence of antibodies with the desired specificity using one of the above-described procedures.
The present invention further provides the above-described antibodies in detectably labelled form. Antibodies can be detectably labelled through the use of radioisotopes, affinity labels (such as biotin, avidin, etc.), enzymatic labels (such as horseradish peroxidase, alkaline phosphatase, etc.), fluorescent labels (such as FITC or rhodamine, etc.), paramagnetic atoms, etc. Procedures for accomplishing such labelling are well known in the art; for example, see Sternberger, L.A. et al., J. Histochem. Cytochem. 18:315 (1970); Bayer, E.A. et al., Meth. Enzym. 62:308 (1979); Engval, E. et al., Immunol. 109:129 (1972); Goding, J.W., J. Immunol. Meth. 13:215 (1976). The labelled antibodies of the present invention can be used for in vitro, in vivo, and in situ assays to identify cells or tissues in which a fragment of the Haemophilus influenzae Rd genome is expressed.
The present invention further provides the above-described antibodies immobilized on a solid support. Examples of such solid supports include plastics such as polycarbonate, complex carbohydrates such as agarose and sepharose, acrylic resins such as polyacrylamide, and latex beads. Techniques for coupling antibodies to such solid supports are well known in the art (Weir, D.M. et al., "Handbook of Experimental Immunology" 4th Ed., Blackwell Scientific Publications, Oxford, England, Chapter 10 (1986); Jacoby, W.D. et al., Meth. Enzym. 34, Academic Press, N.Y. (1974)). The immobilized antibodies of the present invention can be used for in vitro, in vivo, and in situ assays as well as for immunoaffinity purification of the proteins of the present invention.
3. Diagnostic Assays and Kits
The present invention further provides methods to identify the expression of one of the ORFs of the present invention, or homolog thereof, in a test sample, using one of the DFs or antibodies of the present invention.
In detail, such methods comprise incubating a test sample with one or more of the antibodies or one or more of the DFs of the present invention and assaying for binding of the DFs or antibodies to components within the test sample.
Conditions for incubating a DF or antibody with a test sample vary. Incubation conditions depend on the format employed in the assay, the detection methods employed, and the type and nature of the DF or antibody used in the assay. One skilled in the art will recognize that any one of the commonly available hybridization, amplification or immunological assay formats can readily be adapted to employ the DFs or antibodies of the present invention. Examples of such assays can be found in Chard, T., An Introduction to Radioimmunoassay and Related Techniques, Elsevier Science Publishers, Amsterdam, The Netherlands (1986); Bullock, G.R. et al., Techniques in Immunocytochemistry, Academic Press, Orlando, FL, Vol.1 (1982), Vol.2 (1983), Vol.3 (1985); Tijssen, P., Practice and Theory of Enzyme Immunoassays: Laboratory Techniques in Biochemistry and Molecular Biology, Elsevier Science Publishers, Amsterdam, The Netherlands (1985).
The test samples of the present invention include cells, protein or membraneextracts ofcells, orbiological fluids such as sputum, blood, serum, plasma, or urine. The test sample used in the above-described method will vary based on theassay format, nature of thedetection method and the tissues, cells or extracts used as the sample to be assayed. Methods for preparing protein extracts or membrane extracts of cells are well known in the art and can be readily be adapted in order to obtain a sample which is compatible with the system utilized.
In another embodiment of the present invention, kits are provided which contain the necessary reagents to carry out the assays of the present invention.
Specifically, the invention provides a compartmentalized kit to receive, in close confinement, one or more containers which comprises: (a) a first container comprising one of the DFs or antibodies of the present invention; and (b) one or more other containers comprising one or more of the following: wash reagents, reagents capable of detecting presence of a bound DF or antibody.
In detail, a compartmentalized kit includes any kit in which reagents are contained in separate containers. Such containers include small glass containers, plastic containers or strips of plastic or paper. Such containers allow one to efficiently transfer reagents from one compartment to another compartment such that the samples and reagents are not cross-contaminated, and the agents or solutions of each container can be added in a quantitative fashion from one compartment to another. Such containers will include a container which will accept the test sample, a container which contains the antibodies used in the assay, containers which contain wash reagents (such as phosphate buffered saline, Tris-buffers, etc.), and containers which contain the reagents used to detect the bound antibody or DF.
Types of detection reagents include labelled nucleic acid probes, labelled secondary antibodies, or in the alternative, if the primary antibody is labelled, the enzymatic, or antibody binding reagents which are capable of reacting with the labelled antibody. One skilled in the art will readily recognize that the disclosed DFs and antibodies of the present invention can be readily incorporated into one of the established kit formats which are well known in the art.
4. Screening Assay for Binding Agents
Using the isolated proteins of the present invention, the present invention further provides methods of obtaining and identifying agents which bind to a protein encoded by one of the ORFs of the present invention or to one of the fragments and the Haemophilus genome herein described.
In detail, said method comprises the steps of:
(a) contacting an agent with an isolated protein encoded by one of the ORFs of the present invention, or an isolated fragment of the Haemophilus genome; and
(b) determining whether the agent binds to said protein or said fragment.
The agents screened in the above assay can be, but are not limited to, peptides, carbohydrates, vitamin derivatives, or other pharmaceutical agents. The agents can be selected and screened at random or rationally selected or designed using protein modeling techniques.
For random screening, agents such as peptides, carbohydrates, pharmaceutical agents and the like are selected at random and are assayed for their ability to bind to the protein encoded by the ORF of the present invention. Alternatively, agents may be rationally selected or designed. As used herein, an agent is said to be "rationally selected or designed" when the agent is chosen based on the configuration of the particular protein. For example, one skilled in the art can readily adapt currently available procedures to generate peptides, pharmaceutical agents and the like capable of binding to a specific peptide sequence in order to generate rationally designed antipeptide peptides, for example see Hurby et al., "Application of Synthetic Peptides: Antisense Peptides," In Synthetic Peptides, A User's Guide, W.H. Freeman, NY (1992), pp. 289-307, and Kaspczak et al., Biochemistry 28:9230-8 (1989), or pharmaceutical agents, or the like.
In addition to the foregoing, one class of agents of the present invention, as broadly described, can be used to control gene expression through binding to one of the ORFs or EMFs of the present invention. As described above, such agents can be randomly screened or rationally designed/selected. Targeting the ORF or EMF allows a skilled artisan to design sequence specific or element specific agents, modulating the expression of either a single ORF or multiple ORFs which rely on the same EMF for expression control.
One class of DNA binding agents are agents which contain base residues which hybridize or form a triple helix formation by binding to DNA or RNA. Such agents can be based on the classic phosphodiester, ribonucleic acid backbone, or can be a variety of sulfhydryl or polymeric derivatives which have base attachment capacity.
Agents suitable for use in these methods usually contain 20 to 40 bases and are designed to be complementary to a region of the gene involved in transcription (triple helix - see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456 (1988); and Dervan et al., Science 251:1360 (1991)) or to the mRNA itself (antisense - Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, FL (1988)). Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks translation of an mRNA molecule into polypeptide. Both techniques have been demonstrated to be effective in model systems. Information contained in the sequences of the present invention is necessary for the design of an antisense or triple helix oligonucleotide and other DNA binding agents.
Agents which bind to a protein encoded by one of the ORFs of the present invention can be used as a diagnostic agent, in the control of bacterial infection by modulating the activity of the protein encoded by the ORF. Agents which bind to a protein encoded by one of the ORFs of the present invention can be formulated using known techniques to generate a pharmaceutical composition for use in controlling Haemophilus growth and infection.
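The antisense case described above (an agent of 20 to 40 bases complementary to a region of the mRNA) can be illustrated with a short Python sketch. This is an illustration under stated assumptions and not a design method taken from the specification: the function name `antisense_oligo`, the choice of a DNA oligonucleotide as the agent, and the example sequence are all hypothetical.

```python
# Watson-Crick complement of each mRNA (or DNA) base, written as DNA.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def antisense_oligo(mrna: str, start: int, length: int = 20) -> str:
    """Return a DNA oligo, written 5'->3', complementary to mrna[start:start+length].

    The 20 to 40 base limit mirrors the range quoted in the text; the rest
    of the interface is an assumption made for this sketch.
    """
    if not 20 <= length <= 40:
        raise ValueError("length outside the 20 to 40 base range given in the text")
    target = mrna.upper()[start:start + length]
    if len(target) < length:
        raise ValueError("target window runs past the end of the sequence")
    # Reverse the target so the complement reads 5'->3'.
    return "".join(COMPLEMENT[base] for base in reversed(target))


if __name__ == "__main__":
    # Hypothetical mRNA fragment, used only to exercise the function.
    example_mrna = "AUG" + "C" * 17 + "GGG"
    print(antisense_oligo(example_mrna, start=0, length=20))
```

A real design would also weigh target accessibility, melting temperature and specificity against the rest of the genome, none of which is modelled here.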
5. Vaccine and Pharmaceutical Composition
The present invention further provides pharmaceutical agents which can be used to modulate the growth of Haemophilus influenzae, or another related organism, in vivo or in vitro. As used herein, a "pharmaceutical agent" is defined as a composition of matter which can be formulated using known techniques to provide a pharmaceutical composition. As used herein, the "pharmaceutical agents of the present invention" refers to the pharmaceutical agents which are derived from the proteins encoded by the ORFs of the present invention or are agents which are identified using the herein described assays.
As used herein, a pharmaceutical agent is said to "modulate the growth of Haemophilus sp., or a related organism, in vivo or in vitro," when the agent reduces the rate of growth, rate of division, or viability of the organism in question. The pharmaceutical agents of the present invention can modulate the growth of an organism in many fashions, although an understanding of the underlying mechanism of action is not needed to practice the use of the pharmaceutical agents of the present invention. Some agents will modulate the growth by binding to an important protein thus blocking the biological activity of the protein, while other agents may bind to a component of the outer surface of the organism blocking attachment or rendering the organism more susceptible to the action of the body's natural immune system. Alternatively, the agent may comprise a protein encoded by one of the ORFs of the present invention and serve as a vaccine. The development and use of a vaccine based on outer membrane components, such as the LPS, are well known in the art.
As used herein, a "related organism" is a broad term which refers to any organism whose growth can be modulated by one of the pharmaceutical agents of the present invention. In general, such an organism will contain a homolog of the protein which is the target of the pharmaceutical agent or the protein used as a vaccine. As such, related organisms do not need to be bacterial but may be fungal or viral pathogens.
The pharmaceutical agents and compositions of the present invention may be administered in a convenient manner such as by the oral, topical, intravenous, intraperitoneal, intramuscular, subcutaneous, intranasal or intradermal routes. The pharmaceutical compositions are administered in an amount which is effective for treating and/or prophylaxis of the specific indication. In general, they are administered in an amount of at least about 10 μg/kg body weight and in most cases they will be administered in an amount not in excess of about 8 mg/kg body weight per day. In most cases, the dosage is from about 10 μg/kg to about 1 mg/kg body weight daily, taking into account the routes of administration, symptoms, etc.
The agents of the present invention can be used in native form or can be modified to form a chemical derivative. As used herein, a molecule is said to be a "chemical derivative" of another molecule when it contains additional chemical moieties not normally a part of the molecule. Such moieties may improve the molecule's solubility, absorption, biological half life, etc. The moieties may alternatively decrease the toxicity of the molecule, eliminate or attenuate any undesirable side effect of the molecule, etc. Moieties capable of mediating such effects are disclosed in Remington's Pharmaceutical Sciences (1980).
For example, a change in the immunological character of the functional derivative, such as affinity for a given antibody, is measured by a competitive type immunoassay. Changes in immunomodulation activity are measured by the appropriate assay. Modifications of such protein properties as redox or thermal stability, biological half-life, hydrophobicity, susceptibility to proteolytic degradation or the tendency to aggregate with carriers or into multimers are assayed by methods well known to the ordinarily skilled artisan.
The therapeutic effects of the agents of the present invention may be obtained by providing the agent to a patient by any suitable means (i.e., inhalation, intravenously, intramuscularly, subcutaneously, enterally, or parenterally). It is preferred to administer the agent of the present invention so as to achieve an effective concentration within the blood or tissue in which the growth of the organism is to be controlled.
To achieve an effective blood concentration, the preferred method is to administer the agent by injection. The administration may be by continuous infusion, or by single or multiple injections.
In providing a patient with one of the agents of the present invention, the dosage of the administered agent will vary depending upon such factors as the patient's age, weight, height, sex, general medical condition, previous medical history, etc. In general, it is desirable to provide the recipient with a dosage of agent which is in the range of from about 1 pg/kg to 10 mg/kg (body weight of patient), although a lower or higher dosage may be administered. The therapeutically effective dose can be lowered by using combinations of the agents of the present invention or another agent.
As used herein, two or more compounds or agents are said to be administered "in combination" with each other when either (1) the physiological effects of each compound, or (2) the serum concentrations of each compound can be measured at the same time. The composition of the present invention can be administered concurrently with, prior to, or following the administration of the other agent.
The agents of the present invention are intended to be provided to recipient subjects in an amount sufficient to decrease the rate of growth (as defined above) of the target organism.
The administration of the agent(s) of the invention may be for either a "prophylactic" or "therapeutic" purpose. When provided prophylactically, the agent(s) are provided in advance of any symptoms indicative of the organism's growth. The prophylactic administration of the agent(s) serves to prevent, attenuate, or decrease the rate of onset of any subsequent infection. When provided therapeutically, the agent(s) are provided at (or shortly after) the onset of an indication of infection. The therapeutic administration of the compound(s) serves to attenuate the pathological symptoms of the infection and to increase the rate of recovery.
The agents of the present invention are administered to the mammal in a pharmaceutically acceptable form and in a therapeutically effective concentration. A composition is said to be "pharmacologically acceptable" if its administration can be tolerated by a recipient patient. Such an agent is said to be administered in a "therapeutically effective amount" if the amount administered is physiologically significant. An agent is physiologically significant if its presence results in a detectable change in the physiology of a recipient patient.
The agents of the present invention can be formulated according to known methods to prepare pharmaceutically useful compositions, whereby these materials, or their functional derivatives, are combined in admixture with a pharmaceutically acceptable carrier vehicle. Suitable vehicles and their formulation, inclusive of other human proteins, e.g., human serum albumin, are described, for example, in Remington's Pharmaceutical Sciences (16th ed., Osol, A., Ed., Mack, Easton PA (1980)). In order to form a pharmaceutically acceptable composition suitable for effective administration, such compositions will contain an effective amount of one or more of the agents of the present invention, together with a suitable amount of carrier vehicle.
Additional pharmaceutical methods may be employed to control the duration of action. Controlled release preparations may be achieved through the use of polymers to complex or absorb one or more of the agents of the present invention. The controlled delivery may be exercised by selecting appropriate macromolecules (for example polyesters, polyamino acids, polyvinylpyrrolidone, ethylenevinylacetate, methylcellulose, carboxymethylcellulose, or protamine sulfate) and the concentration of macromolecules as well as the methods of incorporation in order to control release. Another possible method to control the duration of action by controlled release preparations is to incorporate agents of the present invention into particles of a polymeric material such as polyesters, polyamino acids, hydrogels, poly(lactic acid) or ethylene vinylacetate copolymers. Alternatively, instead of incorporating these agents into polymeric particles, it is possible to entrap these materials in microcapsules prepared, for example, by coacervation techniques or by interfacial polymerization, for example, hydroxymethylcellulose or gelatine microcapsules and poly(methylmethacrylate) microcapsules, respectively, or in colloidal drug delivery systems, for example, liposomes, albumin microspheres, microemulsions, nanoparticles, and nanocapsules or in macroemulsions. Such techniques are disclosed in Remington's Pharmaceutical Sciences (1980).
The invention further provides a pharmaceutical pack or kit comprising one or more containers filled with one or more of the ingredients of the pharmaceutical compositions of the invention. Associated with such container(s) can be a notice in the form prescribed by a governmental agency regulating the manufacture, use or sale of pharmaceuticals or biological products, which notice reflects approval by the agency of manufacture, use or sale for human administration. In addition, the agents of the present invention may be employed in conjunction with other therapeutic compounds.
6. Shot-Gun Approach to Megabase DNA Sequencing
The present invention further provides the first demonstration that a sequence of greater than one megabase can be sequenced using a random shotgun approach. This procedure, described in detail in the examples that follow, has eliminated the up front cost of isolating and ordering overlapping or contiguous subclones prior to the start of the sequencing protocols.
Certain aspects of the present invention are described in greater detail in the non-limiting Examples that follow.
Examples
Experimental Design and Methods
1. Shotgun Sequencing Strategy
The overall strategy for a shotgun approach to whole genome sequencing is outlined in Table 3. The theory of shotgun sequencing follows from the Lander and Waterman (Lander and Waterman, Genomics 2:231 (1988)) application of the equation for the Poisson distribution p_x = m^x e^{-m}/x!, where x is the number of occurrences of an event, m is the mean number of occurrences, and p_x is the probability of x occurrences, so that p_0 is the probability that any given base is not sequenced after a certain amount of random sequence has been generated. If L is the genome length, n is the number of clone insert ends sequenced, and w is the sequencing read length, then m = nw/L, and the probability that no clone originates at any of the w bases preceding a given base, i.e., the probability that the base is not sequenced, is p_0 = e^{-m}. Using the fold coverage as the unit for m, one sees that after 1.8 Mb of sequence has been randomly generated, m = 1, representing 1X coverage. In this case, p_0 = e^{-1} = 0.37, thus approximately 37% is unsequenced. For example, 5X coverage (approximately 9500 clones sequenced from both insert ends and an average sequence read length of 460 bp) yields p_0 = e^{-5} = 0.0067, or 0.67% unsequenced. The total gap length is L e^{-m}, and the average gap size is L/n.
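The coverage arithmetic above can be checked with a short calculation. The following Python sketch is illustrative only; the function names are assumptions introduced here, and the expected number of gaps, n e^{-m}, is not stated explicitly in the text but follows from dividing the total gap length L e^{-m} by the average gap size L/n.

```python
import math


def coverage(n_reads: float, read_len: float, genome_len: float) -> float:
    """Fold coverage m = n * w / L."""
    return n_reads * read_len / genome_len


def unsequenced_fraction(m: float) -> float:
    """Poisson probability p_0 = e^{-m} that a given base is not sequenced."""
    return math.exp(-m)


def total_gap_length(genome_len: float, m: float) -> float:
    """Expected number of unsequenced bases, L * e^{-m}."""
    return genome_len * math.exp(-m)


def average_gap_size(genome_len: float, n_reads: float) -> float:
    """Average gap size L / n, as given in the text."""
    return genome_len / n_reads


def expected_gaps(n_reads: float, m: float) -> float:
    """Total gap length divided by average gap size, i.e. n * e^{-m}."""
    return n_reads * math.exp(-m)


if __name__ == "__main__":
    L = 1.9e6          # genome length in bp
    n = 2 * 9500       # 9500 clones sequenced from both insert ends
    m = 5.0            # nominal 5X coverage, as in the text
    print(f"p_0 at 1X: {unsequenced_fraction(1.0):.2f}")
    print(f"p_0 at 5X: {unsequenced_fraction(m):.4f}")
    print(f"gaps at 5X: {expected_gaps(n, m):.0f}")
    print(f"average gap: {average_gap_size(L, n):.0f} bp")
```

With m = 5, n = 19,000 reads and L = 1.9 Mb, `expected_gaps` gives approximately 128 and `average_gap_size` gives 100 bp.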
5X coverage would leave about 128 gaps averaging about 100 bp in size. The treatment is essentially that of Lander and Waterman, Genomics 2:231 (1988). Table 4 illustrates the coverage for a 1.9 Mb genome with an average fragment size of 460 bp.
2. Random Library Construction
In order to approximate the random model described above during actual sequencing, a nearly ideal library of cloned genomic fragments is required. The following library construction procedure was developed to achieve this.
H. influenzae Rd KW20 DNA was prepared by phenol extraction. A mixture (3.3 ml) containing 600 μg DNA, 300 mM sodium acetate, 10 mM Tris-HCl, 1 mM Na-EDTA, 30% glycerol was sonicated (Branson Model 450 Sonicator) at the lowest energy setting for 1 min. at 0° using a 3 mm probe. The DNA was ethanol precipitated and redissolved in 500 μl TE buffer. To create blunt-ends, a 100 μl aliquot was digested for 10 min at 30° in 200 μl BAL31 buffer with 5 units BAL31 nuclease (New England BioLabs). The DNA was phenol-extracted, ethanol-precipitated, redissolved in 100 μl TE buffer, electrophoresed on a 1.0% low melting agarose gel, and the 1.6-2.0 kb size fraction was excised, phenol-extracted, and redissolved in 20 μl TE buffer. A two-step ligation procedure was used to produce a plasmid library with 97% inserts, of which >99% were single inserts. The first ligation mixture (50 μl) contained 2 μg of DNA fragments, 2 μg SmaI/BAP pUC18 DNA (Pharmacia), and 10 units T4 ligase (GIBCO/BRL), and incubation was at 14° for 4 hr. After phenol extraction and ethanol precipitation, the DNA was dissolved in 20 μl TE buffer and electrophoresed on a 1.0% low melting agarose gel. A ladder of ethidium bromide-stained linear bands, identified by size as insert (i), vector (v), v+i, v+2i, v+3i, ... was visualized by 360 nm UV light, and the v+i DNA was excised and recovered in 20 μl TE. The v+i DNA was blunt-ended by T4 polymerase treatment for 5 min. at 37° in a reaction mixture (50 μl) containing the v+i linears, 500 μM each of the 4 dNTP's, and 9 units of T4 polymerase (New England BioLabs) under recommended buffer conditions. After phenol extraction and ethanol precipitation the repaired v+i linears were dissolved in 20 μl TE. The final ligation to produce circles was carried out in a 50 μl reaction containing 5 μl of v+i linears and 5 units of T4 ligase at 14° overnight. After 10 min. at 70° the reaction mixture was stored at -20°. This two-stage procedure resulted in a molecularly random collection of single-insert plasmid recombinants with minimal contamination from double-insert chimeras (<1%) or free vector (<3%). Since deviation from randomness is most likely to occur during cloning, E. coli host cells deficient in all recombination and restriction functions (A. Greener, Strategies 3 (1):5 (1990)) were used to prevent rearrangements, deletions, and loss of clones by restriction. Transformed cells were plated directly on antibiotic diffusion plates to avoid the usual broth recovery phase which allows multiplication and selection of the most rapidly growing cells. Plating occurred as follows:
A 100 μl aliquot of Epicurian Coli SURE II Supercompetent Cells (Stratagene 200152) was thawed on ice and transferred to a chilled Falcon 2059 tube on ice. A 1.7 μl aliquot of 1.42 M β-mercaptoethanol was added to the aliquot of cells to a final concentration of 25 mM. Cells were incubated on ice for 10 min. A 1 μl aliquot of the final ligation was added to the cells and incubated on ice for 30 min. The cells were heat pulsed for 30 sec. at 42° and placed back on ice for 2 min. The outgrowth period in liquid culture was eliminated from this protocol in order to minimize the preferential growth of any given transformed cell. Instead the transformation was plated directly on a nutrient rich SOB plate containing a 5 ml bottom layer of SOB agar (1.5% SOB agar: 20 g tryptone, 5 g yeast extract, 0.5 g NaCl, 1.5% Difco Agar/L). The 5 ml bottom layer is supplemented with 0.4 ml ampicillin (50 mg/ml)/100 ml SOB agar. The 15 ml top layer of SOB agar is supplemented with 1 ml X-Gal (2%), 1 ml MgCl2 (1 M), and 1 ml MgSO4/100 ml SOB agar. The 15 ml top layer was poured just prior to plating. Our titer was approximately 100 colonies/10 μl aliquot of transformation.
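The reagent additions in the plating protocol above are simple dilutions, and the stated concentrations can be cross-checked with the usual relation C_final = C_stock x V_stock / V_total. The Python sketch below is an illustrative check only; the helper name `final_concentration` is an assumption, and it treats volumes as additive.

```python
def final_concentration(stock_conc: float, stock_vol: float, base_vol: float) -> float:
    """Concentration after adding stock_vol of stock to base_vol of diluent.

    Both volumes must be in the same unit; the result is in the unit of
    stock_conc. Volumes are assumed to be additive.
    """
    return stock_conc * stock_vol / (stock_vol + base_vol)


if __name__ == "__main__":
    # 1.7 ul of 1.42 M beta-mercaptoethanol added to 100 ul of cells.
    bme_mM = 1000 * final_concentration(1.42, 1.7, 100.0)
    print(f"beta-mercaptoethanol: {bme_mM:.1f} mM")

    # 0.4 ml of 50 mg/ml ampicillin per 100 ml of SOB agar.
    amp_ug_per_ml = 1000 * final_concentration(50.0, 0.4, 100.0)
    print(f"ampicillin: {amp_ug_per_ml:.0f} ug/ml")
```

The β-mercaptoethanol addition works out to roughly 24 mM, in line with the nominal 25 mM given in the protocol, and the ampicillin supplement corresponds to roughly 200 μg/ml in the bottom layer.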
All colonies were picked for template preparation regardless of size. Only clones lost due to "poison" DNA or deleterious gene products would be deleted from the library, resulting in a slight increase in gap number over that expected. In order to evaluate the quality of the H. influenzae library, sequence data were obtained from approximately 4000 templates using the M13-21 primer. The random sequence fragments were assembled using the AutoAssembler™ software (Applied Biosystems division of Perkin-Elmer (AB)) after obtaining 1300, 1800, 2500, 3200, and 3800 sequence fragments, and the number of unique assembled base pairs was determined. Based on the equations described above, an ideal plot of the number of base pairs remaining to be sequenced as a function of the number of sequenced fragments obtained with an average read length of 460 bp for a 2.5 x 10^6 and a 1.9 x 10^6 bp genome was determined (Figure 3). The progression of assembly was plotted using the actual data obtained from the assembly of up to 3800 sequence fragments and compared with the data provided in the ideal plot (Figure 3). Figure 3 illustrates that there was essentially no deviation of the actual assembly data from the ideal plot, indicating that we had constructed close to an ideal random library with minimal contamination from double insert chimeras and free vector.
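The ideal plot described above follows directly from the Poisson model: after n fragments of average length w, the expected number of base pairs still unsequenced is L e^{-nw/L}. The Python sketch below computes that curve at the five assembly checkpoints for both genome sizes. The function names and the `relative_deviation` helper are assumptions made for illustration, and no observed assembly data are reproduced here because the text reports them only graphically in Figure 3.

```python
import math

# Numbers of sequence fragments at which assemblies were performed.
CHECKPOINTS = (1300, 1800, 2500, 3200, 3800)


def ideal_unsequenced_bp(n_fragments: float, read_len: float, genome_len: float) -> float:
    """Expected base pairs remaining to be sequenced: L * exp(-n * w / L)."""
    return genome_len * math.exp(-n_fragments * read_len / genome_len)


def ideal_curve(genome_len: float, read_len: float = 460, checkpoints=CHECKPOINTS):
    """Ideal (n, remaining bp) pairs for a genome of the given length."""
    return [(n, ideal_unsequenced_bp(n, read_len, genome_len)) for n in checkpoints]


def relative_deviation(unique_assembled_bp: float, n_fragments: float,
                       read_len: float, genome_len: float) -> float:
    """Fractional difference between observed and ideal remaining base pairs."""
    observed_remaining = genome_len - unique_assembled_bp
    ideal = ideal_unsequenced_bp(n_fragments, read_len, genome_len)
    return (observed_remaining - ideal) / ideal


if __name__ == "__main__":
    for genome_len in (1.9e6, 2.5e6):
        print(f"genome of {genome_len:.1e} bp")
        for n, remaining in ideal_curve(genome_len):
            print(f"  {n:5d} fragments -> {remaining:12.0f} bp remaining")
```

A value of `relative_deviation` close to zero at each checkpoint would correspond to the agreement between actual and ideal curves that the text reports.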
3. Random DNA Sequencing
High quality double stranded DNA plasmid templates (19,687) were prepared using a "boiling bead" method developed in collaboration with Advanced Genetic Technology Corp. (Gaithersburg, MD) (Adams et al., Science 252:1651 (1991); Adams et al., Nature 355:632 (1992)). Plasmid preparation was performed in a 96-well format for all stages of DNA preparation from bacterial growth through final DNA purification. Template concentration was determined using Hoechst Dye and a Millipore Cytofluor. DNA concentrations were not adjusted, but low-yielding templates were identified where possible and not sequenced. Templates were also prepared from two H. influenzae lambda genomic libraries. An amplified library was constructed in vector Lambda GEM-12 (Promega) and an unamplified library was constructed in Lambda DASH II (Stratagene). In particular, for the unamplified lambda library, H. influenzae Rd KW20 DNA (>100 kb) was partially digested in a reaction mixture (200 μl) containing 50 μg DNA, 1X Sau3AI buffer, 20 units Sau3AI for 6 min. at 23°. The digested DNA was phenol-extracted and electrophoresed on a 0.5% low melting agarose gel at 2 V/cm for 7 hours. Fragments from 15 to 25 kb were excised and recovered in a final volume of 6 μl. One μl of fragments was used with 1 μl of DASH II vector (Stratagene) in the recommended ligation reaction. One μl of the ligation mixture was used per packaging reaction following the recommended protocol with the Gigapack II XL Packaging Extract (Stratagene, #227711). Phage were plated directly without amplification from the packaging mixture (after dilution with 500 μl of recommended SM buffer and chloroform treatment). Yield was about 2.5 x 10^3 pfu/μl. The amplified library was prepared essentially as above except the lambda GEM-12 vector was used. After packaging, about 3.5 x 10^4 pfu were plated on the restrictive NM539 host. The lysate was harvested in 2 ml of SM buffer and stored frozen in 7% dimethylsulfoxide. The phage titer was approximately 1 x 10^9 pfu/ml.
Liquid lysates (10 ml) were prepared from randomly selected plaques and template was prepared on an anion-exchange resin (Qiagen). Sequencing reactions were carried out on plasmid templates using the AB Catalyst LabStation with Applied Biosystems PRISM Ready Reaction Dye Primer Cycle Sequencing Kits for the M13 forward (M13-21) and the M13 reverse (M13RP1) primers (Adams et al., Nature 368:474 (1994)). Dye terminator sequencing reactions were carried out on the lambda templates on a Perkin-Elmer 9600 Thermocycler using the Applied Biosystems Ready Reaction Dye Terminator Cycle Sequencing kits. T7 and SP6 primers were used to sequence the ends of the inserts from the Lambda GEM-12 library and T7 and T3 primers were used to sequence the ends of the inserts from the Lambda DASH II library. Sequencing reactions (28,643) were performed by eight individuals using an average of fourteen AB 373 DNA Sequencers per day over a 3 month period. All sequencing reactions were analyzed using the Stretch modification of the AB 373, primarily using a 34 cm well-to-read distance. The overall sequencing success rate was 84% for M13-21 sequences, 83% for M13RP1 sequences and 65% for dye-terminator reactions. The average usable read length was 485 bp for M13-21 sequences, 444 bp for M13RP1 sequences, and 375 bp for dye-terminator reactions. Table 5 summarizes the high-throughput sequencing phase of the invention.
Richards et al. (Richards et al., Automated DNA Sequencing and Analysis, M.D. Adams, C. Fields, J.C. Venter, Eds. (Academic Press, London, 1994), Chap. 28.) described the value of using sequence from both ends of sequencing templates to facilitate ordering of contigs in shotgun assembly projects of lambda and cosmid clones. We balanced the desirability of both-end sequencing (including the reduced cost of lower total number of templates) against shorter read-lengths for sequencing reactions performed with the M13RP1 (reverse) primer compared to the M13-21 (forward) primer. Approximately one-half of the templates were sequenced from both ends. In total, 9,297 M13RP1 sequencing reactions were done. Random reverse sequencing reactions were done based on successful forward sequencing reactions. Some M13RP1 sequences were obtained in a semi-directed fashion: M13-21 sequences pointing outward at the ends of contigs were chosen for M13RP1 sequencing in an effort to specifically order contigs. The semi-directed strategy was effective, and clone-based ordering formed an integral part of assembly and gap closure (see below).
4. Protocol for Automated Cycle Sequencing
The sequencing consisted of using eight ABI Catalyst robots and fourteen AB 373 Automated DNA Sequencers. The Catalyst robot is a publicly available sophisticated pipetting and temperature control robot which has been developed specifically for DNA sequencing reactions. The Catalyst combines pre-aliquoted templates and reaction mixes consisting of deoxy- and dideoxynucleotides, the Taq thermostable DNA polymerase, fluorescently-labelled sequencing primers, and reaction buffer. Reaction mixes and templates were combined in the wells of an aluminum 96-well thermocycling plate. Thirty consecutive cycles of linear amplification (e.g., one primer synthesis) steps were performed including denaturation, annealing of primer and template, and extension of DNA synthesis. A heated lid with rubber gaskets on the thermocycling plate prevented evaporation without the need for an oil overlay.
Two sequencing protocols were used: dye-labelled primers and dye-labelled dideoxy chain terminators. The shotgun sequencing involves use of four dye-labelled sequencing primers, one for each of the four terminator nucleotides. Each dye-primer is labelled with a different fluorescent dye, permitting the four individual reactions to be combined into one lane of the 373 DNA Sequencer for electrophoresis, detection, and base-calling. AB currently supplies pre-mixed reaction mixes in bulk packages containing all the necessary non-template reagents for sequencing. Sequencing can be done with both plasmid and PCR-generated templates with both dye-primers and dye-terminators with approximately equal fidelity, although plasmid templates generally give longer usable sequences.
Thirty-two reactions were loaded per 373 Sequencer each day, for a total of 960 samples. Electrophoresis was run overnight following the manufacturer's protocols, and the data was collected for twelve hours.
Following electrophoresis and fluorescence detection, the AB 373 performs automatic lane tracking and base-calling. The lane-tracking was confirmed visually. Each sequence electropherogram (or fluorescence lane trace) was inspected visually and assessed for quality. Trailing sequences of low quality were removed and the sequence itself was loaded via software to a Sybase database (archived daily to an 8 mm tape). Leading vector polylinker sequence was removed automatically by a software program. Average edited lengths of sequences from the standard ABI 373 were around 400 bp and depended mostly on the quality of the template used for the sequencing reaction. All of the ABI 373 Sequencers were converted to Stretch Liners, which provided a longer electrophoresis path prior to fluorescence detection, thus increasing the average number of usable bases to 500-600 bp.
Informatics
1. Data Management
A number of information management systems (LIMS) for a large-scale sequencing lab have been developed (Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Washington D.C., 585 (1993)). The system used to collect and assemble the sequence data was developed using the Sybase relational data management system and was designed to automate data flow wherever possible and to reduce user error. The database stores and correlates all information collected during the entire operation from template preparation to final analysis of the genome. Because the raw output of the AB 373 Sequencers was based on a Macintosh platform and the data management system chosen was based on a Unix platform, it was necessary to design and implement a variety of multi-user, client server applications which allow the raw data as well as analysis results to flow seamlessly into the database with a minimum of user effort. A description of the software programs used for large sequence assembly and management is provided in Figure 4.
2. Assembly
An assembly engine (TIGR Assembler) was developed for the rapid and accurate assembly of thousands of sequence fragments. The AB AutoAssembler™ was modified (and named TIGR Editor) to provide a graphical interface to the electropherogram for the purpose of editing data associated with the aligned sequence file output of TIGR Assembler. TIGR Editor maintains synchrony between the electropherogram files on the Macintosh platform and the sequence data in the H. influenzae database on the Unix platform.
The TIGR assembler simultaneously clusters and assembles fragments of the genome. In order to obtain the speed necessary to assemble more than 10^4 fragments, the algorithm builds a hash table of 10 bp oligonucleotide subsequences to generate a list of potential sequence fragment overlaps. The number of potential overlaps for each fragment determines which fragments are likely to fall into repetitive elements. Beginning with a single seed sequence fragment, TIGR Assembler extends the current contig by attempting to add the best matching fragment based on oligonucleotide content. The current contig and candidate fragment are aligned using a modified version of the Smith-Waterman algorithm (Waterman, M.S., Methods in Enzymology 164:765 (1988)) which provides for optimal gapped alignments. The current contig is extended by the fragment only if strict criteria for the quality of the match are met. The match criteria include the minimum length of overlap, the maximum length of an unmatched end, and the minimum percentage match. These criteria are automatically lowered by the algorithm in regions of minimal coverage and raised in regions with a possible repetitive element. Fragments representing the boundaries of repetitive elements and potentially chimeric fragments are often rejected based on partial mismatches at the ends of alignments and excluded from the current contig. TIGR Assembler is designed to take advantage of clone size information coupled with sequencing from both ends of each template. It enforces the constraint that sequence fragments from two ends of the same template point toward one another in the contig and are located within a certain range of base pairs (definable for each clone based on the known clone size range for a given library). Assembly of 24,304 sequence fragments of H. influenzae required 30 hours of CPU time using one processor on a SPARCenter 2000 with 512 Mb of RAM. This process resulted in approximately 210 contigs. Because of the high stringency of the TIGR Assembler, all contigs were searched against each other using grasta, a modified fasta (Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85:2444 (1988)). In this way, additional overlaps were detected which enabled compression of the data set into 140 contigs. The location of each fragment in the contigs and extensive information about the consensus sequence itself were loaded into the H. influenzae relational database.
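The hash-table step described above can be illustrated with a minimal Python sketch. It is not the TIGR Assembler code: the function names, the toy reads, and the choice to ignore reverse-complement matches are all assumptions made for illustration. It indexes every 10 bp subsequence, counts shared subsequences per fragment pair as candidate overlaps, and reports how many candidate partners each fragment has (a high count would hint at a repetitive element).

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(fragments, k=10):
    """Map each pair of fragment ids to the number of distinct k-mers they share."""
    index = defaultdict(set)  # k-mer -> ids of the fragments that contain it
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(frag_id)
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return dict(shared)

def partner_counts(shared):
    """Number of candidate overlap partners per fragment."""
    counts = defaultdict(int)
    for a, b in shared:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)

# Hypothetical toy reads: r1 and r2 overlap by 14 bases, r3 is unrelated.
reads = {
    "r1": "ACGTTGCATGCCAGTACGGATCCA",
    "r2": "CCAGTACGGATCCATTGACCGTAA",
    "r3": "TTTTTTTTTTGGGGGGGGGGCCCC",
}
pairs = candidate_overlaps(reads)
partners = partner_counts(pairs)
```

In a real assembler each candidate pair would then be passed to a gapped alignment and accepted only if the match criteria listed above are met.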
3. Ordering Assembled Contigs
After assembly the relative positions of the 140 contigs were unknown. The contigs were ordered by asmalign. Asmalign uses a number of relationships to identify and align contigs that are adjacent to each other.
Using this algorithm, the 140 contigs were placed into 42 groups totaling 42 physical gaps (no template DNA for the region) and 98 sequence gaps (template available for gap closure).
Ordering Contigs Separated by Physical Gaps and Achieving Closure
Four integrated strategies were developed to order contigs separated by physical gaps. Oligonucleotide primers were designed and synthesized from the end of each contig group. These primers were then available for use in one or more of the strategies outlined below:
1. Southern analysis was done to develop a unique "fingerprint" for a subset of 72 of the above oligonucleotides. This procedure was based upon the supposition that labeled oligonucleotides homologous to the ends of adjacent contigs should hybridize to common DNA restriction fragments, and thus share a similar or identical hybridization pattern or "fingerprint". Oligonucleotides were labeled using 50 pmoles of each 20 mer and 250 mCi of [γ-32P]ATP and T4 polynucleotide kinase. The labeled oligonucleotides were purified using Sephadex G-25 superfine (Pharmacia) and 10^7 cpm of each was used in a Southern hybridization analysis of H. influenzae Rd chromosomal DNA digested with one frequent cutter (AseI) and five less frequent cutters (BglII, EcoRI, PstI, XbaI, and PvuII). The DNA from each digest was fractionated on a 0.7% agarose gel and transferred to Nytran Plus nylon membranes (Schleicher & Schuell). Hybridization was carried out for 16 hours at 40°. To remove non-specific signals, each blot was sequentially washed at room temperature with increasingly stringent conditions up to 0.1X SSC + 0.5% SDS. Blots were exposed to a PhosphorImager cassette (Molecular Dynamics) for several hours and hybridization patterns were visually compared.
Adjacent contigs identified in this manner were targeted for specific PCR reactions.
2. Peptide links were made by searching each contig end using blastx (Altschul et al., J. Mol. Biol. 215:403 (1990)) against a peptide database. If the ends of two contigs matched the same database sequence in an appropriate manner, then the two contigs were tentatively considered to be adjacent to each other.
3. The two lambda libraries constructed from H. influenzae genomic DNA were probed with oligonucleotides designed from the ends of contig groups (Kirkness et al., Genomics 10:985 (1991)). The positive plaques were then used to prepare templates and the sequence was determined from each end of the lambda clone insert. These sequence fragments were searched using grasta against a database of all contigs. Two contigs that matched the sequence from the opposite ends of the same lambda clone were ordered. The lambda clone then provided the template for closure of the sequence gap between the adjacent contigs. The lambda clones were especially valuable for solving repeat structures.
4. To confirm the order of contigs found by the other approaches and establish the order of non-ordered contigs, standard and long range (XL) PCR reactions were performed as follows.
Standard PCR was performed in the following manner. Each reaction contained a 37 μl cocktail; 16.5 μl H2O, 3 μl 25 mM MgCl2, 8 μl of a dNTP mix (1.25 mM each dNTP), 4.5 μl 10X PCR core buffer II (Perkin Elmer), 25 ng H. influenzae Rd KW20 genomic DNA. The appropriate two primers (4 μl, 3.2 pmole/μl) were added to each reaction. A hot start was performed at 95° for 5 min followed by a 75° hold. During the hold Amplitaq DNA polymerase (Perkin Elmer) 0.3 μl in 4.3 μl H2O, 0.5 μl 10X PCR core buffer II, was added to each reaction. The PCR profile was 25 cycles of 94°/45 sec., denature; 55°/1 min., anneal; 72°/3 min., extension. All reactions were performed in a 96 well format on a Perkin Elmer GeneAmp PCR System 9600.
Long range PCR (XL PCR) was performed as follows: Each reaction contained a 35.2 μl cocktail; 12.0 μl H2O, 2.2 μl 25 mM Mg(OAc)2, 4 μl of a dNTP mix (200 μM final concentration), 12.0 μl 3.3X PCR buffer, 25 ng H. influenzae Rd KW20 genomic DNA. The appropriate two primers (5 μl, 3.2 pmoles/μl) were added to each reaction. A hot start was performed at 94° for 1 minute. rTth polymerase, 2.0 μl (4 U/reaction) in 2.8 μl 3.3X PCR buffer II was added to each reaction. The PCR profile was 18 cycles of 94°/15 sec., denature; 62°/8 min., anneal and extend followed by 12 cycles 94°/15 sec., denature; 62°/8 min. (increase 15 sec./cycle), anneal and extend; 72°/10 min., final extension. All reactions were performed in a 96 well format on a Perkin Elmer GeneAmp PCR System 9600.
Although a PCR reaction was performed for essentially every combination of physical gap ends, techniques such as Southern fingerprinting, database matching, and the probing of large insert clones were particularly valuable in ordering contigs adjacent to each other and reducing the number of combinatorial PCR reactions necessary to achieve complete gap closure.
Employing these strategies to an even greater extent in future genome projects will increase the overall efficiency of complete genome closure. The number of physical gaps ordered and closed by each of these techniques is summarized in Table 5.
Sequence information from the ends of 15-20 kb clones is particularly suitable for gap closure, solving repeat structures, and providing general confirmation of the overall genome assembly. We were also concerned that some fragments of the H. influenzae genome would be non-clonable in a high copy plasmid in E. coli. We reasoned that lytic lambda clones would provide the DNA for these segments. Approximately 100 random plaques were picked from the amplified lambda library, templates prepared, and sequence information obtained from each end. These sequences were searched (grasta) against the contigs and linked in the database to their appropriate contig, thus providing a scaffolding of lambda clones contributing additional support to the accuracy of the genome assembly (Fig 5). In addition to confirmation of the contig structure, the lambda clones provided closure for 23 physical gaps.
Approximately 78% of the genome is covered by lambda clones.
Lambda clones were also useful for solving repeat structures. Repeat structures identified in the genome were small enough to be spanned by a single clone from the random insert library, except for the six ribosomal RNA operons and one repeat (2 copies) which was 5,340 bp in length.
Oligonucleotide probes were designed from the unique flanks at the beginning of each repeat and hybridized to the lambda libraries. Positive plaques were identified for each flank and the sequence fragments from the ends of each clone were used to correctly orient the repeats within the genome.
The ability to distinguish and assemble the six ribosomal RNA (rRNA) operons of H. influenzae (16S subunit-23S subunit-5S subunit) was a test of our overall strategy to sequence and assemble a complex genome which might contain a significant number of repeat regions. The high degree of sequence similarity and the length of the six operons caused the assembly process to cluster all the underlying sequences into a few indistinguishable contigs. To determine the correct placement of the operons in the sequence, a pair of unique flanking sequences was required for each. No unique flanking sequences could be found at the left (16S rRNA) ends. This region contains the ribosomal promoter and appeared to be non-clonable in the high copy number pUC18 plasmid. However, unique sequences could be identified at the right (5S) ends. Oligonucleotide primers were designed from these six flanking regions and used to probe the two lambda libraries. For each of the six rRNA operons at least one positive plaque was identified which completely spanned the rRNA operon and contained unique flanking sequence at the 16S and 5S ends. These plaques provided the templates for obtaining the unique sequence for each of the six rRNA operons.
An additional confirmation of the global structure of the assembled circular genome was obtained by comparing a computer generated restriction map based on the assembled sequence for the enzymes ApaI, SmaI, and RsrII with the predicted physical map of Redfield and Lee (Genetic Maps: locus maps of complex genomes, S.J. O'Brien, Ed. Cold Spring Harbor Laboratory Press, New York, N.Y., 1990, 2110.). The restriction fragments from the sequence-derived map matched those from the physical map in size and relative order (Figure 5).
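A computer generated restriction map of a circular sequence can be sketched as follows. This is an illustrative Python example, not the program used in this work: the function name and the toy sequence are hypothetical. The recognition sites and cut positions used are the standard ones for these enzymes (ApaI GGGCC^C, SmaI CCC^GGG, RsrII CG^GWCCG with W = A or T); because all three sites are palindromic, searching one strand is sufficient.

```python
import re

# enzyme -> (recognition regex, site length, cut offset within the site, top strand)
ENZYMES = {
    "ApaI": ("GGGCCC", 6, 5),       # GGGCC^C
    "SmaI": ("CCCGGG", 6, 3),       # CCC^GGG
    "RsrII": ("CGG[AT]CCG", 7, 2),  # CG^GWCCG, W = A or T
}

def circular_fragments(genome, enzyme):
    """Fragment lengths, in map order, for a single-enzyme digest of a circular sequence."""
    pattern, site_len, offset = ENZYMES[enzyme]
    genome = genome.upper()
    n = len(genome)
    # Append the first bases so that sites spanning the origin are also found.
    extended = genome + genome[:site_len - 1]
    cuts = sorted({(m.start() + offset) % n
                   for m in re.finditer("(?=" + pattern + ")", extended)})
    if len(cuts) < 2:
        return [n]  # zero or one cut leaves a single molecule of full length
    lengths = [b - a for a, b in zip(cuts, cuts[1:])]
    lengths.append(n - cuts[-1] + cuts[0])  # fragment that wraps past the origin
    return lengths

# Hypothetical 30 bp circle with two ApaI sites and one SmaI site.
toy = "AA" + "GGGCCC" + "AAAA" + "CCCGGG" + "AAAA" + "GGGCCC" + "AA"
apa_fragments = circular_fragments(toy, "ApaI")
```

The resulting fragment sizes and their order are what would be compared against an experimentally derived physical map.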
Editing
Simultaneous with the final gap filling process, each contig was edited visually by reassembling overlapping 10 kb sections of contigs using the AB AutoAssembler™ and the Fast Data Finder™ hardware. AutoAssembler™ provides a graphical interface to electropherogram data for editing. The electropherogram data was used to assign the most likely base at each position. Where a discrepancy could not be resolved or a clear assignment made, the automatic base calls were left unchanged. Individual sequence changes were written to the electropherogram files and a replication protocol (crash) was used to maintain the synchrony of sequence data between the H. influenzae database and the electropherogram files. Following editing, contigs were reassembled with TIGR Assembler prior to annotation.
Potential frameshifts identified in the course of annotating the genome were saved as reports in the database. These reports include the coordinates in a contig which the alignment software (praze) predicts to be the most likely location of a missing or inserted base and a representation of the sequence alignment containing the frameshift. Apparent frameshifts were used to indicate areas of the sequence which may require further editing. Frameshifts were not corrected in cases where clear electropherogram data disagreed with a frameshift. Frameshift editing was performed with TIGR Editor.
The rRNA and other repeat regions precluded complete assembly of the circular genome with TIGR Assembler. Final assembly of the genome was accomplished using combasm which splices together contigs based on short overlaps.
Accuracy of the Genome Sequence
The accuracy of the H. influenzae genome sequence is difficult to quantitate because there is very little previously determined H. influenzae sequence and most of these sequences are from other strains. There are, however, three parameters of accuracy that can be applied to the data. First, the number of apparent frameshifts in predicted H. influenzae genes, based on database similarities, is 148. Some of these apparent frameshifts may be in the database sequences rather than in ours, particularly considering that 49 of the apparent frameshifts are based on matches to hypothetical proteins from other organisms. Second, there are 188 bases in the genome that remain as N ambiguities (1/9,735 bp). Combining these two types of "known" errors, we can calculate a maximum sequence accuracy of 99.98%. The average coverage is 6.5X and less than 1% of the genome is single-fold coverage.
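The accuracy figures quoted above can be reproduced with a short calculation. The sketch below is an illustrative check in Python; it assumes the genome length of 1,830,121 bp reported elsewhere in this document and treats each apparent frameshift and each N ambiguity as a single erroneous base. The variable names are hypothetical.

```python
GENOME_BP = 1_830_121   # genome length reported in this document
FRAMESHIFTS = 148       # apparent frameshifts in predicted genes
N_AMBIGUITIES = 188     # bases remaining as N

# One ambiguity per this many base pairs
bp_per_ambiguity = GENOME_BP / N_AMBIGUITIES

# Treat every frameshift and ambiguity as one erroneous base
known_errors = FRAMESHIFTS + N_AMBIGUITIES
max_accuracy_pct = 100 * (1 - known_errors / GENOME_BP)

print(round(bp_per_ambiguity))       # about 9,735 bp per ambiguity
print(round(max_accuracy_pct, 2))    # 99.98
```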
Identifying Genes
An attempt was made to predict all of the coding regions of the H. influenzae Rd genome and identify genes, tRNAs and rRNAs, as well as other features of the DNA sequence (e.g., repeats, regulatory sites, replication origin sites, nucleotide composition). A description of some of the readily apparent sequence features is provided below. The H. influenzae Rd genome is a circular chromosome of 1,830,121 bp. The overall G/C nucleotide content is approximately 38% (A = 31%, C = 19%, G = 19%, T = 31%, IUB = 0.035%). The G/C content of the genome was examined with several window lengths to look for global structural features. With a window of 5,000 bp, the G/C content is relatively even except for 7 large G/C-rich regions and several A/T-rich regions (Fig. 5). The G/C rich regions correspond to six rRNA operons and the location of a cryptic mu-like prophage. Genes for several proteins with similarity to proteins encoded by bacteriophage mu are located at approximately position 1.56-1.59 Mbp of the genome. This area of the genome has a markedly higher G/C content than average for H. influenzae (~50% G/C compared to ~38% for the rest of the genome). No significance has yet been ascertained for the source or importance of the A/T rich regions.
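A windowed G/C scan of the kind described above can be sketched in a few lines of Python. This is an illustrative example only: the function name, the non-overlapping default step, and the toy sequence are assumptions, and for simplicity the sketch does not wrap windows around the origin of the circular chromosome.

```python
def gc_windows(seq, window=5000, step=None):
    """Return (start, G/C fraction) for successive windows along a sequence."""
    step = step or window  # default: non-overlapping windows
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        result.append((start, gc / window))
    return result

# Hypothetical toy sequence: an A/T-only half followed by a G/C-only half.
toy = "AT" * 10 + "GC" * 10
profile = gc_windows(toy, window=10)
```

Windows whose G/C fraction lies well above or below the genome average are the candidates for features such as rRNA operons or prophage regions.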
The minimal origin of replication (oriC) in E. coli is a 245 bp region defined by three copies of a thirteen base pair repeat containing a GATC core sequence at one end and four copies of a nine base pair repeat containing a TTAT core sequence at the other end. The GATC sites are methylation targets and control replication while the TTAT sites provide the binding sites for DnaA, the first step in the replication process (Genes V, B. Lewin Ed. (Oxford University Press, New York, 1994), chap. 18-19). An approximately 281 bp sequence (602,483 - 602,764) whose limits are defined by these same core sequences appears to define the origin of replication in H. influenzae Rd. These coordinates lie between sets of ribosomal operons rrnF, rrnE, rrnD and rrnA, rrnB, rrnC. These two groups of ribosomal operons are transcribed in opposite directions and the placement of the origin is consistent with their polarity for transcription. Termination of E. coli replication is marked by two 23 bp termination sequences located ~100 kb on either side of the midway point at which the two replication forks meet. Two potential termination sequences sharing a 10 bp core sequence with the E. coli termination sequence were identified in H. influenzae at coordinates 1,375,949-1,375,958 and 1,558,759-1,558,768. These two sets of coordinates are offset approximately 100 kb from the point 180° opposite of the proposed origin of H. influenzae replication.
Six rRNA operons were identified. Each rRNA operon contains three rRNA subunits and a variable spacer region in the order: 16S subunit - spacer region - 23S subunit - 5S subunit. The subunit lengths are 1539 bp, 2653 bp, and 116 bp, respectively. The G/C content of the three ribosomal subunits (50%) is higher than the genome as a whole. The G/C content of the spacer region (38%) is consistent with the remainder of the genome. The nucleotide sequence of the three rRNA subunits is 100% identical in all six ribosomal operons. The rRNA operons can be grouped into two classes based on the spacer region between the 16S and 23S sequences. The shorter of the two spacer regions is 478 bp in length (rrnB, rrnE, and rrnF) and contains the gene for tRNA Glu. The longer spacer is 723 bp in length (rrnA, rrnC, and rrnD) and contains the genes for tRNA Ile and tRNA Ala. The two sets of spacer regions are also 100% identical across each group of three operons. tRNA genes are also present at the 16S and 5S ends of two of the rRNA operons. The genes for tRNA Arg, tRNA His, and tRNA Pro are located at the 16S end of rrnE while the genes for tRNA Trp and tRNA Asp are located at the 5S end of rrnA.
The predicted coding regions of the H. influenzae genome were initially defined by evaluating their coding potential with the program Genemark (Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)) using codon frequency matrices derived from 122 H. influenzae coding sequences in GenBank. The predicted coding region sequences (plus 300 bp of flanking sequence) were used in searches against a database of non-redundant bacterial proteins (NRBP) created specifically for the annotation. Redundancy was removed from NRBP at two stages. All DNA coding sequences were extracted from GenBank (release 85), and sequences from the same species were searched against each other. Sequences having >97% similarity over regions >100 nucleotides were combined. In addition, the sequences were translated and used in protein comparisons with all sequences in Swiss-Prot (release 30). Sequences belonging to the same species and having >98% similarity over 33 amino acids were combined. NRBP is composed of 21,445 sequences extracted from 23,751 GenBank sequences and 11,183 Swiss-Prot sequences from 1,099 different species.
A total of 1,749 predicted coding regions were identified. Searches of the H. influenzae predicted coding regions were performed using an algorithm that translates the query DNA sequence in the three plus-strand reading frames for searching against NRBP, identifies the protein sequences that match the query, and aligns the protein-protein matches using praze, a modified Smith-Waterman (Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85:2444 (1988)) algorithm. In cases where insertions or deletions in the DNA sequence produced a frameshift error, the alignment algorithm started with protein regions of maximum similarity and extended the alignment to the same database match in alternative frames using the 300 bp flanking region. Regions known to contain frameshift errors were saved in the database and evaluated for possible correction. Unidentified predicted coding regions and the remaining intergenic sequences were searched against a dataset of all available peptide sequences from Swiss-Prot, PIR, and GenBank. Identification of operon structures will be facilitated by experimental determination of transcription promoter and termination sites.
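Translation of a query DNA sequence in the three plus-strand reading frames, as described above, can be sketched as follows. This Python example is illustrative only and is not the search software used in this work; the function names are hypothetical. The codon table is built from the standard genetic code, whose amino acid assignments also apply to bacteria; codons containing ambiguous bases are translated as X.

```python
# Standard genetic code laid out in TCAG order for the three codon positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string codon by codon; '*' marks stops, 'X' unknown codons."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def three_frame_translations(dna):
    """Translations of the plus strand in reading frames 0, 1, and 2."""
    return [translate(dna[frame:]) for frame in range(3)]

frames = three_frame_translations("ATGGCCTAA")
```

Each of the three translations would then be searched against the protein database; a frameshift in the DNA shows up as a database match that continues in a different frame.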
Each putatively identified H. influenzae gene was assigned to one of 102 biological role categories adapted from Riley (Riley, M., Microbiology Reviews 57(4):862 (1993)). Assignments were made by linking the protein sequence of the predicted coding regions with the Swiss-Prot sequences in the Riley database. Of the 1,749 predicted coding regions, 724 have no role assignment. Of these, no database match was found for 384, while 340 matched "hypothetical proteins" in the database. Role assignments were made for 1,025 of the predicted coding regions. A compilation of all the predicted coding regions, their unique identifiers, a three letter gene identifier, percent identity, percent similarity, and amino acid match length are presented in Table 1(a). An annotated complete genome map of H. influenzae Rd is presented in Figures 6(A)-(D). The map places each predicted coding region on the H. influenzae chromosome, indicates its direction of transcription and color codes its role assignment. Role assignments are also represented in Figure 5.
A survey of the genes and their chromosomal organization in H. influenzae Rd makes possible a description of the metabolic processes H. influenzae requires for survival as a free-living organism, the nutritional requirements for its growth in the laboratory, and the characteristics which make it unique from other organisms, specifically as they relate to its pathogenicity and virulence. The genome would be expected to have complete complements of certain classes of genes known to be essential for life. For example, there is a one-to-one correspondence of published E. coli ribosomal protein sequences to potential homologs in the H. influenzae database. Likewise, as shown in Table 1(a), an aminoacyl tRNA-synthetase is present in the genome for each amino acid. Finally, the location of tRNA genes was mapped onto the genome. There are 54 identified tRNA genes, including representatives of all 20 amino acids.
In order to survive as a free-living organism, H. influenzae must produce energy in the form of ATP via fermentation and/or electron transport. As a facultative anaerobe, H. influenzae Rd is known to ferment glucose, fructose, galactose, ribose, xylose and fucose (Dorocicz et al., J. Bacteriol. 175:7142 (1993)). The genes identified in Table 1(a) indicate that transport systems are available for the uptake of these sugars via the phosphoenolpyruvate-phosphotransferase system (PTS), and via non-PTS mechanisms. Genes that specify the common phosphate-carriers Enzyme I and HPr (ptsI and ptsH) of the PTS system were identified as well as the glucose specific crr gene. The ptsH, ptsI, and crr genes constitute the pts operon. We have not, however, identified the gene encoding membrane-bound glucose specific Enzyme II. The latter enzyme is required for transport of glucose by the PTS system. A complete PTS system for fructose was identified. Genes encoding the complete glycolytic pathway and for the production of fermentative end products were identified. Evidence for growth utilizing anaerobic respiratory mechanisms was found by identifying genes encoding functional electron transport systems using inorganic electron acceptors such as nitrates, nitrites, and dimethylsulfoxide. Genes encoding three enzymes of the tricarboxylic acid (TCA) cycle appear to be absent from the genome. Citrate synthase, isocitrate dehydrogenase, and aconitase were not found by searching the predicted coding regions or by using the E. coli enzymes as peptide queries against the entire genome in translation. This provides an explanation for the very high level of glutamate (1 g/L) which is required in defined culture media (Klein and Luginbuhl, J. Gen. Microbiol. 113:409 (1979)). Glutamate can be directed into the TCA cycle via conversion to alpha-ketoglutarate by glutamate dehydrogenase. In the absence of a complete TCA cycle, glutamate presumably serves as the source of carbon for biosynthesis of amino acids using precursors which branch from the TCA cycle. Functional electron transport systems are available for the production of ATP using oxygen as a terminal electron acceptor.
Previously unanswered questions regarding pathogenicity and virulence can be addressed by examining certain classes of genes such as adhesins and the lipooligosaccharide biogenesis genes. Moxon and co-workers (Weiser et al., Cell 59:657 (1989)) have obtained evidence that a number of these virulence-related genes contain tandem tetramer repeats which undergo frequent addition and deletion of one or more repeat units during replication such that the reading frame of the gene is changed and its expression thereby altered. It is now possible, using the complete genome sequence, to locate all such tandem repeat tracts (Figure 5) and to begin to determine their roles in phase variation of such potential virulence genes.
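Locating tandem tetramer repeat tracts in a sequence can be sketched with a regular expression. The Python example below is illustrative only: the function name, the minimum of four repeat units, and the toy sequence are assumptions, and units that are really mono- or dinucleotide repeats (such as AAAA or ATAT) are skipped so that only true tetramer units are reported.

```python
import re

def tetramer_tracts(seq, min_units=4):
    """Return (start, unit, copies) for each tandem run of a 4-base unit."""
    pattern = re.compile(r"([ACGT]{4})\1{%d,}" % (min_units - 1))
    tracts = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[:2] == unit[2:]:
            continue  # homopolymer or dinucleotide repeat, not a tetramer unit
        tracts.append((match.start(), unit, len(match.group(0)) // 4))
    return tracts

# Hypothetical toy sequence carrying six tandem copies of one tetramer.
toy = "GGTACC" + "CAAT" * 6 + "GGATCC"
tracts = tetramer_tracts(toy)
```

Because adding or deleting one 4-base unit shifts the reading frame, the copy number of each tract within a coding region indicates whether the downstream gene is in frame.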
H. influenzae Rd possesses a highly efficient natural DNA transformation system (Kahn and Smith, J. Membrane Biol. 138:155 (1984)). A unique DNA uptake sequence site, 5' AAGTGCGGT, present in multiple copies in the genome, has been shown to be necessary for efficient DNA uptake. It is now possible to locate all of these sites and completely describe their distribution with respect to genic and intergenic regions. Fifteen genes involved in transformation have already been described and sequenced (Redfield, R., J. Bacteriol. 175:5612 (1991); Chandler, M., Proc. Natl. Acad. Sci. U.S.A. 89:1616 (1992); Barouki and Smith, J. Bacteriol. 163(2):629 (1985); Tomb et al., Gene 104:1 (1991); Tomb, J., Proc. Natl. Acad. Sci. U.S.A. 89:10252 (1992)). Six of the genes, comA to comF, comprise an operon which is under positive control by a 22-bp palindromic competence regulatory element (CRE) about one helix turn upstream of the promoter. The rec-2 transformation gene is also controlled by this element. It is now possible to locate additional copies of CRE in the genome and discover potential transformation genes under CRE control. In addition, it may now be possible to discover other global regulatory elements with an ease not previously possible.
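Locating every copy of the uptake sequence site on both strands can be sketched as follows. This Python example is illustrative only; the function names and the toy sequence are hypothetical. A site on the minus strand appears on the plus strand as the reverse complement of 5' AAGTGCGGT, so both motifs are searched and each hit is reported with its strand.

```python
USS = "AAGTGCGGT"  # DNA uptake sequence site given in the text
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(genome, site=USS):
    """Return sorted (start, strand) tuples for every occurrence of the site."""
    genome = genome.upper()
    hits = []
    for strand, motif in (("+", site), ("-", reverse_complement(site))):
        start = genome.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(motif, start + 1)  # allow overlapping hits
    return sorted(hits)

# Hypothetical toy sequence with one site on each strand.
toy = "CC" + USS + "TT" + reverse_complement(USS) + "GG"
sites = find_sites(toy)
```

Comparing the hit coordinates with the coordinates of the predicted coding regions would give the distribution of sites between genic and intergenic regions.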
One well-described gene regulatory system in bacteria is the "two-component" system composed of a sensor molecule that detects some sort of environmental signal and a regulator molecule that is phosphorylated by the activated form of the sensor. The regulator protein is generally a transcription factor which, when activated by the sensor, turns on or off expression of a specific set of genes (for review, see Albright et al., Ann. Rev. Genet. 23:311 (1989); Parkinson and Kofoid, Ann. Rev. Genet. 26:71 (1992)). It has been estimated that E. coli harbors 40 sensor-regulator pairs (Albright et al., Ann. Rev. Genet. 23:311 (1989); Parkinson and Kofoid, Ann. Rev. Genet. 26:71 (1992)). The H. influenzae genome was searched with representative proteins from each family of sensor and regulator proteins using tblastn and tfasta.
Four sensor and five regulator proteins were identified with similarity to proteins from other species (Table 6). There appears to be a corresponding sensor for each regulator protein except CpxR. Searches with the CpxA protein from E. coli identified three of the four sensors listed in Table 6, but no additional significant matches were found. It is possible that the level of sequence similarity is low enough to be undetectable with tfasta. No representatives of the NtrC-class of regulators were found. This class of proteins interacts directly with the sigma-54 subunit of RNA polymerase, which is not present in H. influenzae. All of the regulator proteins fall into the OmpR subclass (Albright et al., Ann. Rev. Genet. 23:311 (1989); Parkinson and Kofoid, Ann. Rev. Genet. 26:71 (1992)). The phoBR and basRS genes of H. influenzae are adjacent to one another and presumably form an operon. The nar and arc genes are not located adjacent to one another.
Some of the most interesting questions that can be answered by a complete genome sequence relate to what genes or pathways are absent. The non-pathogenic H. influenzae Rd strain varies significantly from the pathogenic serotype b strains. Many of the differences between these two strains appear in factors affecting infectivity. For example, the eight genes which make up the fimbrial gene cluster (van Ham et al., Mol. Microbiol. 13:673 (1994)) involved in adhesion of bacteria to host cells are now shown to be absent in the Rd strain. The pepN and purE genes which flank the fimbrial cluster in H. influenzae type b strains are adjacent to one another in the Rd strain (Fig. 7), suggesting that the entire fimbrial cluster was excised. On a broader level, we determined which E. coli proteins are not in H. influenzae by taking advantage of a non-redundant set of protein coding genes from E. coli, namely the University of Wisconsin Genome Project contigs in GenBank: 1,216 predicted protein sequences from GenBank accessions D10483, L10328, U00006, U00039, U14003, and U18997 (Yura et al., Nucleic Acids Research 20:3305 (1992); Burland et al., Genomics 16:551 (1993)). The minimum threshold for matches was set so that even weak matches would be scored as positive, thereby giving a minimal estimate of the E. coli genes not present in H. influenzae. tblastn was used to search each of the E. coli proteins against the complete genome. All blast scores > 100 were considered matches. Altogether 627 E. coli proteins matched at least one region of the H. influenzae genome and 589 proteins did not. The 589 non-matching proteins were examined and found to contain a disproportionate number of hypothetical proteins from E. coli. Sixty-eight percent of the identified E. coli proteins were matched by an H. influenzae sequence whereas only 38% of the hypothetical proteins were matched.
Proteins are annotated as hypothetical based on a lack of matches with any other known protein (Yura et al., Nucleic Acids Research 20:3305 (1992); Burland et al., Genomics 16:551 (1993)). At least two potential explanations can be offered for the over-representation of hypothetical proteins among those without matches: some of the hypothetical proteins are not, in fact, translated (at least in the annotated frame), or these are E. coli-specific proteins that are unlikely to be found in any species except those most closely related to E. coli, for example Salmonella typhimurium.
A total of 384 predicted coding regions did not display significant similarity with a six-frame translation of GenBank release 87. These unidentified coding regions were compared to one another with fasta. Several novel gene families were identified. For example, two predicted coding regions without database matches (HI0591, HI0852) share 75% identity over almost their entire lengths (139 and 143 amino acid residues respectively). Their similarity to each other but failure to match any protein available in the current databases suggests that they could represent a novel cellular function.
Other types of analyses can be applied to the unidentified coding regions, including hydropathy analysis, which indicates the patterns of potential membrane-spanning domains that are often conserved between members of receptor and transporter gene families, even in the absence of significant amino acid identity. Five examples of unidentified predicted coding regions that display potential transmembrane domains with a periodic pattern that is characteristic of membrane-bound channel proteins are shown in Figure 8. Such information can be used to focus on specific aspects of cellular function that are affected by targeted deletion or mutation of these genes.
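A hydropathy analysis of the kind mentioned above can be sketched as a sliding-window average over a per-residue hydrophobicity scale. The text does not state which scale or window was used; the Python example below assumes the widely used Kyte-Doolittle scale and a 19-residue window, and the function name and toy sequence are hypothetical. Sustained stretches of high average hydropathy are the usual indication of a potential membrane-spanning segment.

```python
# Kyte-Doolittle hydropathy values (assumed scale; not named in the text)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of each window of residues along a protein sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical toy protein: a charged stretch followed by a hydrophobic stretch.
toy = "D" * 19 + "L" * 19
profile = hydropathy_profile(toy)
```

Plotting such a profile for each unidentified coding region, and comparing the spacing of the peaks, is one way to look for the periodic pattern characteristic of channel proteins.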
Interest in the medically important aspects of H. influenzae biology has focused particularly on those genes which determine virulence characteristics of the organism. Recently, the catalase gene was characterized and sequenced as a possible virulence-related gene (Bishai et al., J. Bacteriol. 176:2914 (1994)). A number of the genes responsible for the capsular polysaccharide have been mapped and sequenced (Kroll et al., Mol. Microbiol. 5(6):1549 (1991)). Several outer membrane protein genes have been identified and sequenced (Langford et al., J. Gen. Microbiol. 138:155 (1992)). The lipooligosaccharide component of the outer membrane and the genes of its synthetic pathway are under intensive study (Weiser et al., J. Bacteriol. 175:3304 (1990)). While a vaccine is available, the study of outer membrane components is motivated to some extent by the need for improved vaccines.
Data Availability
The H. influenzae genome sequence has been deposited in the Genome Sequence DataBase (GSDB) with the accession number L42023. The nucleotide sequence and peptide translation of each predicted coding region with identified start and stop codons have also been accessioned by GSDB.
Production of an Antibody to a Haemophilus influenzae Protein
Substantially pure protein or polypeptide is isolated from the transfected or transformed cells using any one of the methods known in the art. The protein can also be produced in a recombinant prokaryotic expression system, such as E. coli, or can be chemically synthesized. Concentration of protein in the final preparation is adjusted, for example, by concentration on an Amicon filter device, to the level of a few micrograms/ml. Monoclonal or polyclonal antibody to the protein can then be prepared as follows:
Monoclonal Antibody Production by Hybridoma Fusion
Monoclonal antibody to epitopes of any of the peptides identified and isolated as described can be prepared from murine hybridomas according to the classical method of Kohler, G. and Milstein, C., Nature 256:495 (1975) or modifications of the methods thereof. Briefly, a mouse is repetitively inoculated with a few micrograms of the selected protein over a period of a few weeks. The mouse is then sacrificed, and the antibody-producing cells of the spleen isolated. The spleen cells are fused by means of polyethylene glycol with mouse myeloma cells, and the excess unfused cells destroyed by growth of the system on selective media comprising aminopterin (HAT media). The successfully fused cells are diluted and aliquots of the dilution placed in wells of a microtiter plate where growth of the culture is continued. Antibody-producing clones are identified by detection of antibody in the supernatant fluid of the wells by immunoassay procedures, such as ELISA, as originally described by Engvall, E., Meth. Enzymol. 70:419 (1980), and modified methods thereof. Selected positive clones can be expanded and their monoclonal antibody product harvested for use. Detailed procedures for monoclonal antibody production are described in Davis, L. et al., Basic Methods in Molecular Biology, Elsevier, New York, Section 21-2 (1989).
Polyclonal Antibody Production by Immunization
Polyclonal antiserum containing antibodies to heterogeneous epitopes of a single protein can be prepared by immunizing suitable animals with the expressed protein described above, which can be unmodified or modified to enhance immunogenicity. Effective polyclonal antibody production is affected by many factors related both to the antigen and the host species. For example, small molecules tend to be less immunogenic than others and may require the use of carriers and adjuvant. Also, host animals vary in response to site of inoculations and dose, with either inadequate or excessive doses of antigen resulting in low titer antisera. Small doses (ng level) of antigen administered at multiple intradermal sites appear to be most reliable. An effective immunization protocol for rabbits can be found in Vaitukaitis, J. et al., J. Clin. Endocrinol. Metab. 33:988-991 (1971). Booster injections can be given at regular intervals, and antiserum harvested when antibody titer thereof, as determined semi-quantitatively, for example, by double immunodiffusion in agar against known concentrations of the antigen, begins to fall. See, for example, Ouchterlony, O. et al., Chap. 19 in: Handbook of Experimental Immunology, Wier, D., ed., Blackwell (1973). Plateau concentration of antibody is usually in the range of 0.1 to 0.2 mg/ml of serum (about 12 μM). Affinity of the antisera for the antigen is determined by preparing competitive binding curves, as described, for example, by Fisher, D., Chap. 42 in: Manual of Clinical Immunology, second edition, Rose and Friedman, eds., Amer. Soc. for Microbiology, Washington, D.C. (1980).
Antibody preparations prepared according to either protocol are useful in quantitative immunoassays which determine concentrations of antigen-bearing substances in biological samples; they are also used semi-quantitatively or qualitatively to identify the presence of antigen in a biological sample.
Preparation of PCR Primers and Amplification of DNA
Various fragments of the Haemophilus influenzae Rd genome, such as those disclosed in Tables 1(a) and 2, can be used, in accordance with the present invention, to prepare PCR primers for a variety of uses. The PCR primers are preferably at least 15 bases, and more preferably at least 18 bases in length. When selecting a primer sequence, it is preferred that the primer pairs have approximately the same G/C ratio, so that melting temperatures are approximately the same. The PCR primers and amplified DNA of this Example find use in the Examples that follow.
Gene expression from DNA Sequences Corresponding to ORFs
A fragment of the Haemophilus influenzae Rd genome provided in Tables 1(a) or 2 is introduced into an expression vector using conventional technology. (Techniques to transfer cloned sequences into expression vectors that direct protein translation in mammalian, yeast, insect or bacterial expression systems are well known in the art.) Vectors and expression systems are commercially available from a variety of suppliers including Stratagene (La Jolla, California), Promega (Madison, Wisconsin), and Invitrogen (San Diego, California). If desired, to enhance expression and facilitate proper protein folding, the codon context and codon pairing of the sequence may be optimized for the particular expression organism, as explained by Hatfield et al., U.S. Patent No. 5,082,767, incorporated herein by this reference.
The following is provided as one exemplary method to generate polypeptide(s) from cloned ORFs of the Haemophilus genome fragment. Since the ORF lacks a poly A sequence because of the bacterial origin of the ORF, this sequence can be added to the construct by, for example, splicing out the poly A sequence from pSG5 (Stratagene) using BglI and SalI restriction endonuclease enzymes and incorporating it into the mammalian expression vector pXT1 (Stratagene) for use in eukaryotic expression systems. pXT1 contains the LTRs and a portion of the gag gene from Moloney Murine Leukemia Virus. The position of the LTRs in the construct allows efficient stable transfection. The vector includes the Herpes Simplex thymidine kinase promoter and the selectable neomycin gene. The Haemophilus DNA is obtained by PCR from the bacterial vector using oligonucleotide primers complementary to the Haemophilus DNA and containing restriction endonuclease sequences for PstI incorporated into the 5' primer and BglII at the 5' end of the corresponding Haemophilus DNA 3' primer, taking care to ensure that the Haemophilus DNA is positioned such that it is followed by the poly A sequence. The purified fragment obtained from the resulting PCR reaction is digested with PstI, blunt ended with an exonuclease, digested with BglII, purified and ligated to pXT1, now containing a poly A sequence and digested with BglII. The ligated product is transfected into mouse NIH 3T3 cells using Lipofectin (Life Technologies, Inc., Grand Island, New York) under conditions outlined in the product specification. Positive transfectants are selected after growing the transfected cells in 600 μg/ml G418 (Sigma, St. Louis, Missouri). The protein is preferably released into the supernatant. However, if the protein has membrane binding domains, the protein may additionally be retained within the cell or expression may be restricted to the cell surface.
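The primer construction described above (a PstI site on the 5' primer and a BglII site at the 5' end of the 3' primer) can be sketched, purely by way of example, as follows. The 18-base annealing length follows the preference for primers of at least 15, and more preferably at least 18, bases; the rule-of-thumb melting temperature of 2 °C per A/T and 4 °C per G/C is an assumption introduced here for comparing the two primers and is not taken from the text. The function names and the short test sequence are hypothetical.

```python
PSTI_SITE = "CTGCAG"   # PstI recognition sequence
BGLII_SITE = "AGATCT"  # BglII recognition sequence
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def gc_fraction(dna):
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def rule_of_thumb_tm(dna):
    """Approximate melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    dna = dna.upper()
    at = dna.count("A") + dna.count("T")
    gc = dna.count("G") + dna.count("C")
    return 2 * at + 4 * gc


def design_primers(orf, anneal_len=18):
    """Build a PstI-tagged 5' primer and a BglII-tagged 3' primer for an ORF."""
    orf = orf.upper()
    if set(orf) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")
    if anneal_len < 15:
        raise ValueError("annealing region should be at least 15 bases")
    if len(orf) < anneal_len:
        raise ValueError("sequence is shorter than the annealing region")
    forward_anneal = orf[:anneal_len]                       # matches the start of the ORF
    reverse_anneal = reverse_complement(orf[-anneal_len:])  # anneals to the end of the ORF
    return {
        "forward": PSTI_SITE + forward_anneal,
        "reverse": BGLII_SITE + reverse_anneal,
        "forward_gc": gc_fraction(forward_anneal),
        "reverse_gc": gc_fraction(reverse_anneal),
        "forward_tm": rule_of_thumb_tm(forward_anneal),
        "reverse_tm": rule_of_thumb_tm(reverse_anneal),
    }


# Hypothetical 36-base test sequence, not taken from the Haemophilus genome.
primers = design_primers("ATGAAACGTTTAGGCCTGATCGCACTGGCAGGTTAA")
print(primers["forward"], primers["reverse"])
```

The G/C fraction and approximate melting temperature are computed on the annealing portion only, so that the two primers of a pair can be compared and the annealing length adjusted if they differ substantially. In practice a few additional bases are often placed 5' of each restriction site to aid digestion; they are omitted from this sketch.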
Since it may be necessary to purify and locate the transfected product, synthetic 15-mer peptides synthesized from the predicted Haemophilus DNA sequence are injected into mice to generate antibody to the polypeptide encoded by the Haemophilus DNA.
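As one illustrative sketch, the candidate 15-mer peptides can be enumerated from a predicted coding sequence as shown below. The sketch assumes the standard genetic code, reads the sequence in frame from its first base, and stops at the first stop codon; it does not rank peptides for antigenicity and does not treat alternative initiation codons specially. The function names and the short test sequence are hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: _AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


def translate(orf):
    """Translate an in-frame DNA sequence up to, but not including, the first stop codon."""
    orf = orf.upper()
    residues = []
    for pos in range(0, len(orf) - 2, 3):  # any trailing partial codon is ignored
        amino = CODON_TABLE[orf[pos:pos + 3]]  # KeyError on bases other than A, C, G, T
        if amino == "*":
            break
        residues.append(amino)
    return "".join(residues)


def peptide_windows(protein, length=15):
    """All overlapping peptides of the given length; empty if the protein is shorter."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Hypothetical test sequence, not taken from the Haemophilus genome.
protein = translate("ATGAAACGTTTAGGCCTGATCGCACTGGCAGGTTAA")
print(protein, peptide_windows(protein, length=5))
```

Which of the enumerated peptides is synthesized for immunization is a separate choice that this sketch leaves to the practitioner.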
If antibody production is not possible, the Haemophilus DNA sequence is additionally incorporated into eukaryotic expression vectors and expressed as a chimeric with, for example, β-globin. Antibody to β-globin is used to purify the chimeric. Corresponding protease cleavage sites engineered between the β-globin gene and the Haemophilus DNA are then used to separate the two polypeptide fragments from one another after translation. One useful expression vector for generating β-globin chimerics is pSG5 (Stratagene). This vector encodes rabbit β-globin. Intron II of the rabbit β-globin gene facilitates splicing of the expressed transcript, and the polyadenylation signal incorporated into the construct increases the level of expression. These techniques as described are well known to those skilled in the art of molecular biology. Standard methods are published in methods texts such as Davis et al. and many of the methods are available from the technical assistance representatives from Stratagene, Life Technologies, Inc., or Promega.
Polypeptide may additionally be produced from either construct using in vitro translation systems such as In vitro Express™ Translation Kit (Stratagene).
While the present invention has been described in some detail for purposes of clarity and understanding, one skilled in the art will appreciate that various changes in form and detail can be made without departing from the true scope of the invention.
All patents, patent applications and publications referred to above are hereby incorporated by reference.
Figure imgf001150_0001
Figure imgf001151_0001
Figure imgf001152_0001
Figure imgf001153_0001
Figure imgf001154_0001
Figure imgf001155_0001
Figure imgf001156_0001
Figure imgf001157_0001
Figure imgf001158_0001
Figure imgf001159_0001
Figure imgf001160_0001
Figure imgf001161_0001
Figure imgf001162_0001
Figure imgf001163_0001
Figure imgf001164_0001
Figure imgf001165_0001
Figure imgf001166_0001
Figure imgf001167_0001
Figure imgf001168_0001
Figure imgf001169_0001
Figure imgf001170_0001
Figure imgf001171_0001
Figure imgf001172_0001
Figure imgf001173_0001
Figure imgf001174_0001
Figure imgf001175_0001
Figure imgf001176_0001
Figure imgf001177_0001
Figure imgf001178_0001
Figure imgf001179_0001
Figure imgf001180_0001
Figure imgf001181_0001
Figure imgf001182_0001
Figure imgf001183_0001
Figure imgf001184_0001
Figure imgf001185_0001
Figure imgf001186_0001
Figure imgf001187_0001
Figure imgf001188_0001
Figure imgf001189_0001
Figure imgf001190_0001
Figure imgf001191_0001
Figure imgf001192_0001
Figure imgf001193_0001
Figure imgf001194_0001
Figure imgf001195_0001
Figure imgf001196_0001
Figure imgf001197_0001
Figure imgf001198_0001
Figure imgf001199_0001
Figure imgf001200_0001
Figure imgf001201_0001
Figure imgf001202_0001
Figure imgf001203_0001
Figure imgf001204_0001
Figure imgf001205_0001
Figure imgf001206_0001
Figure imgf001207_0001
Figure imgf001208_0001
Figure imgf001209_0001
Figure imgf001210_0001
Figure imgf001211_0001
Figure imgf001212_0001
Figure imgf001213_0001
Figure imgf001214_0001
Figure imgf001215_0001
Figure imgf001216_0001
Figure imgf001217_0001
Figure imgf001218_0001
Figure imgf001219_0001
Figure imgf001220_0001
Figure imgf001221_0001
Figure imgf001222_0001
Figure imgf001223_0001
Figure imgf001224_0001
Figure imgf001225_0001
Figure imgf001226_0001
Figure imgf001227_0001
Figure imgf001228_0001
Figure imgf001229_0001
Figure imgf001230_0001
Figure imgf001231_0001
Figure imgf001232_0001
Figure imgf001233_0001
Figure imgf001234_0001

Claims

What Is Claimed Is:
1. Computer readable medium having recorded thereon the nucleotide sequence depicted in SEQ ID NO:1, a representative fragment thereof or a nucleotide sequence at least 99.9% identical to the nucleotide sequence depicted in SEQ ID NO: 1.
2. Computer readable medium having recorded thereon any one of the fragments of SEQ ID NO:1 depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b.
3. The computer readable medium of claim 1, wherein said medium is selected from the group consisting of a floppy disc, a hard disc, random access memory (RAM), read only memory (ROM), and CD-ROM.
4. The computer readable medium of claim 3, wherein said medium is selected from the group consisting of a floppy disc, a hard disc, random access memory (RAM), read only memory (ROM), and CD-ROM.
5. A computer-based system for identifying fragments of the Haemophilus genome of commercial importance comprising the following elements:
a) a data storage means comprising the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1;
b) search means for comparing a target sequence to the nucleotide sequence of the data storage means of step (a) to identify homologous sequence(s), and
c) retrieval means for obtaining said homologous sequence(s) of step (b).
6. A method for identifying commercially important nucleic acid fragments of the Haemophilus genome comprising the step of comparing a database comprising the nucleotide sequence depicted in SEQ ID NO: 1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 with a target sequence to obtain a nucleic acid molecule comprised of a complementary nucleotide sequence to said target sequence, wherein said target sequence is not randomly selected.
7. A method for identifying an expression modulating fragment of Haemophilus genome comprising the step of comparing a database comprising the nucleotide sequence depicted in SEQ ID NO: 1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 with a target sequence to obtain a nucleic acid molecule comprised of a complementary nucleotide sequence to said target sequence, wherein said target sequence comprises sequences known to regulate gene expression.
8. An isolated protein-encoding nucleic acid fragment of the Haemophilus influenzae Rd genome, wherein said fragment consists of the nucleotide sequence of any one of the fragments of SEQ ID NO:1 depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO:1 depicted in Table 1b.
9. A vector comprising any one of the fragments of the Haemophilus influenzae Rd genome depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b.
10. An isolated fragment of the Haemophilus influenzae Rd genome, wherein said fragment modulates the expression of an operably linked open reading frame, wherein said fragment consists of the nucleotide sequence from about 10 to 200 bases in length which is 5' to any one of the open reading frames depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b.
11. A vector comprising any one of the fragments of the Haemophilus influenzae Rd genome of claim 8.
12. An organism which has been altered to contain any one of the fragments of the Haemophilus genome of claim 8.
13. An organism which has been altered to contain any one of the fragments of the Haemophilus genome of claim 10.
14. A method for regulating the expression of a nucleic acid molecule comprising the step of covalently attaching 5' to said nucleic acid molecule a nucleic acid molecule consisting of the nucleotide sequence from about 10 to 100 bases 5' to any one of the fragments of the Haemophilus genome depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b.
15. An isolated nucleic acid molecule encoding a homolog of any one of the fragments of the Haemophilus genome depicted in Table 1a, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b, wherein said nucleic acid molecule is produced by the steps of:
a) screening a genomic DNA library using any one of the fragments of the Haemophilus genome depicted in Table 1a as a target sequence;
b) identifying members of said library which contain sequences which hybridize to said target sequence;
c) isolating the nucleic acid molecules from said members identified in step (b).
16. An isolated DNA molecule encoding a homolog of any one of the fragments of the Haemophilus genome depicted in Table 1a, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b, wherein said nucleic acid molecule is produced by the steps of:
a) isolating mRNA, DNA, or cDNA produced from an organism;
b) amplifying nucleic acid molecules whose nucleotide sequence is homologous to amplification primers derived from said fragment of said Haemophilus genome to prime said amplification;
c) isolating said amplified sequences produced in step (b).
17. An isolated polypeptide encoded by any one of the fragments of the Haemophilus influenzae Rd genome depicted in Table 1a or by a degenerate variant of said fragment, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b.
18. An isolated polynucleotide molecule encoding any one of the polypeptides of claim 17.
19. An antibody which selectively binds to any one of the polypeptides of claim 17.
20. A method for producing a polypeptide in a host cell comprising the steps of:
a) incubating a host containing a heterologous nucleic acid molecule whose nucleotide sequence consists of any one of the fragments of the Haemophilus influenzae Rd genome depicted in Table 1a or a degenerate variant thereof, excluding the fragments of SEQ ID NO: 1 depicted in Table 1b, under conditions where said heterologous nucleic acid molecule is expressed to produce said protein, and
b) isolating said protein.
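
Illustrative note on claim 5: the three recited elements (a data storage means, a search means, and a retrieval means) can be pictured with the short Python sketch below. It is not taken from the specification. The class and function names, the toy sequence, and the use of exact matching on both strands as a stand-in for a homology search are all assumptions made for illustration; a practical system would likely rely on an alignment-based search program.

```python
# Illustrative sketch only: toy data and hypothetical names throughout.
# Exact substring matching stands in for a real homology search.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


class SequenceStore:
    """Data storage means: holds named nucleotide sequences in memory."""

    def __init__(self):
        self._records = {}

    def add(self, name, sequence):
        self._records[name] = sequence.upper()

    def search(self, target):
        """Search means: exact occurrences of target on either strand.

        Returns (name, start, strand) tuples; start is 0-based on the
        stored strand.
        """
        target = target.upper()
        queries = [("+", target)]
        rc = reverse_complement(target)
        if rc != target:  # avoid reporting palindromic targets twice
            queries.append(("-", rc))
        hits = []
        for name, seq in self._records.items():
            for strand, query in queries:
                start = seq.find(query)
                while start != -1:
                    hits.append((name, start, strand))
                    start = seq.find(query, start + 1)
        return hits

    def retrieve(self, hit, length):
        """Retrieval means: return the stored bases covered by a hit."""
        name, start, _strand = hit
        return self._records[name][start:start + length]


store = SequenceStore()
store.add("toy_fragment", "ATGGCGTTAGGACCTTAA")  # made-up sequence, not SEQ ID NO:1
hits = store.search("TTAGGA")
for hit in hits:
    print(hit, store.retrieve(hit, 6))
```

Searching with the reverse complement of the target reports the same location on the "-" strand, which loosely mirrors the "complementary nucleotide sequence" language of claims 6 and 7.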
PCT/US1996/005320 1995-04-21 1996-04-22 NUCLEOTIDE SEQUENCE OF THE HAEMOPHILUS INFLUENZAE Rd GENOME, FRAGMENTS THEREOF, AND USES THEREOF WO1996033276A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP96912845A EP0821737A4 (en) 1995-04-21 1996-04-22 NUCLEOTIDE SEQUENCE OF THE HAEMOPHILUS INFLUENZAE Rd GENOME, FRAGMENTS THEREOF, AND USES THEREOF
JP8531888A JPH11501520A (en) 1995-04-21 1996-04-22 Nucleotide sequence of Haemophilus influenzae Rd genome, fragments thereof and uses thereof
AU55523/96A AU5552396A (en) 1995-04-21 1996-04-22 Nucleotide sequence of the haemophilus influenzae rd genome, fragments thereof, and uses thereof

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US42678795A 1995-04-21 1995-04-21
US08/426,787 1995-04-21
US08/476,102 US6355450B1 (en) 1995-04-21 1995-06-07 Computer readable genomic sequence of Haemophilus influenzae Rd, fragments thereof, and uses thereof
US08/487,429 1995-06-07
US08/487,429 US6468765B1 (en) 1995-04-21 1995-06-07 Selected Haemophilus influenzae Rd polynucleotides and polypeptides
US08/476,102 1995-06-07

Publications (1)

Publication Number Publication Date
WO1996033276A1 true WO1996033276A1 (en) 1996-10-24

Family

ID=27411524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/005320 WO1996033276A1 (en) 1995-04-21 1996-04-22 NUCLEOTIDE SEQUENCE OF THE HAEMOPHILUS INFLUENZAE Rd GENOME, FRAGMENTS THEREOF, AND USES THEREOF

Country Status (5)

Country Link
EP (1) EP0821737A4 (en)
JP (1) JPH11501520A (en)
AU (1) AU5552396A (en)
CA (1) CA2218741A1 (en)
WO (1) WO1996033276A1 (en)

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998018931A3 (en) * 1996-10-31 1998-08-20 Human Genome Sciences Inc Streptococcus pneumoniae polynucleotides and sequences
WO1998050555A3 (en) * 1997-05-06 1999-01-14 Human Genome Sciences Inc Enterococcus faecalis polynucleotides and polypeptides
WO1999040189A3 (en) * 1998-02-09 1999-10-14 Genset Sa Cdnas encoding secreted proteins
EP0894006A4 (en) * 1996-04-18 1999-11-24 Smithkline Beecham Corp POLYNUCLEOTIDES AND POLYPEPTIDES OF THE THREONYL tRNA SYNTHETASE FAMILY
WO1999055873A3 (en) * 1998-04-24 2000-03-09 Smithkline Beecham Biolog Basb006 polynucleotide(s) and polypeptides from neisseria meningitis
WO1999054473A3 (en) * 1998-04-22 2000-03-23 Glaxo Group Ltd Bacterial yjeq polypeptide family
WO1999054470A3 (en) * 1998-04-22 2000-03-30 Glaxo Group Ltd Bacterial ygjd polypeptide family
WO1999054474A3 (en) * 1998-04-22 2000-05-04 Glaxo Group Ltd Bacterial yiha polypeptide family
WO2000011141A3 (en) * 1998-08-24 2000-06-02 Mount Sinai Hospital Corp Trna binding domain
WO1999054462A3 (en) * 1998-04-22 2000-06-22 Glaxo Group Ltd Bacterial ycfb polypeptide family
EP0896061A3 (en) * 1997-08-08 2000-07-26 Smithkline Beecham Corporation RpoA gene from Staphylococcus aureus
WO2000047737A1 (en) * 1999-02-09 2000-08-17 Smithkline Beecham Biologicals S.A. Haemophilus influenzae rd outer membrane sequences used as vaccine
WO1999057280A3 (en) * 1998-05-01 2000-08-24 Chiron Corp Neisseria meningitidis antigens and compositions
WO2000050599A1 (en) * 1999-02-24 2000-08-31 Smithkline Beecham Biologicals S.A. Haemophilus antigen
WO2001000837A1 (en) * 1999-06-25 2001-01-04 Smithkline Beecham Biologicals S.A. Basb111 polypeptide and polynucleotide from moraxella catharrhalis
WO1999047553A3 (en) * 1998-03-18 2001-02-22 Glaxo Group Ltd Bacterial yeal family members as targets for antimicrobial drug design
EP1136557A1 (en) * 2000-03-21 2001-09-26 De Staat Der Nederlanden Vertegenwoordigd Door De Minister Van Welzijn, Volksgezondheid En Cultuur Proteins and polypeptides from Haemophilus influenzae involved in paracytosis, nucleotide sequences encoding them and uses thereof
WO2001011033A3 (en) * 1999-08-04 2002-01-17 Abbott Lab Identification of genes essential for the survival of haemophilus influenzae through genome scanning by transposition mutagenesis
WO2002032946A3 (en) * 2000-10-17 2002-10-31 Smithkline Beecham Biolog Basb207 polypeptides and polynucleotides from nontypeable haemophilus influenzae
WO2002030967A3 (en) * 2000-10-13 2002-11-07 Smithkline Beecham Biolog Novel compounds
WO2002030971A3 (en) * 2000-10-13 2002-11-07 Smithkline Beecham Biolog Basb205 polypeptides and polynucleotides coding therefor
WO2002066503A3 (en) * 2001-02-16 2002-11-07 Smithkline Beecham Biolog H. influenzae antigen basb213
WO2002046215A3 (en) * 2000-12-08 2002-11-14 Smithkline Beecham Biolog Basb221 polypeptides and polynucleotides encoding basb221 polypeptides
WO2002034772A3 (en) * 2000-10-24 2003-01-30 Smithkline Beecham Biolog Novel compounds
WO2002018601A3 (en) * 2000-08-25 2003-02-06 Abbott Lab Essential bacteria genes and genome scanning in haemophilus influenzae for the identification of 'essential genes'
WO2002028889A3 (en) * 2000-10-02 2003-04-10 Shire Biochem Inc Haemophilus influenzae antigens and corresponding dna fragments
WO2002077020A3 (en) * 2001-03-22 2003-06-19 Isis Innovation Virulence genes in h. influenzae
WO2002088361A3 (en) * 2001-04-30 2003-11-27 Glaxosmithkline Biolog Sa Haemophilus influenzae antigens
RU2227043C2 (en) * 1998-05-01 2004-04-20 Чирон Корпорейшн Neisseria meningitidis antigens and compositions
US7033799B2 (en) * 1994-08-25 2006-04-25 Washington University Haemophilus adherence and penetration proteins
US7385034B2 (en) 1998-12-22 2008-06-10 Serono Genetics Institute S.A. Complementary DNAs encoding proteins with signal peptides
US7875421B2 (en) 2002-04-09 2011-01-25 Nestec S.A. La1—the genome of a lactobacillus strain
AU2007202270B2 (en) * 2000-10-02 2011-09-01 Id Biomedical Corporation Of Quebec Haemophilus influenzae antigens and corresponding DNA fragments
US8691555B2 (en) 2006-09-28 2014-04-08 Dsm Ip Assests B.V. Production of carotenoids in oleaginous yeast and fungi
US20150218231A1 (en) * 2005-06-16 2015-08-06 Nationwide Children's Hospital, Inc. Genes of an otitis media isolate of nontypeable heamophilus influenzae
US9909130B2 (en) 2005-03-18 2018-03-06 Dsm Ip Assets B.V. Production of carotenoids in oleaginous yeast and fungi

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
INFECTION AND IMMUNITY, October 1994, Volume 62, Number 10, SANDERS et al., "Identification of a Locus Involved in the Utilization of Iron by Haemophilus Influenzae", pages 4515-4525. *
JOURNAL OF BACTERIOLOGY, May 1995, Volume 177, Number 10, COPE et al., "A Gene Cluster Involved in the Utilization of Both Free Heme and Heme: Hemopexin by Haemophilus Influenzae Type B", pages 2644-2653. *
NUCLEOTIDE DATABASE ON ENTREZ RELEASE 15.0, PUBLISHED ON CD-ROM BY NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION, NATIONAL LIBRARY OF MEDICINE, NATIONAL INSTITUTES OF HEALTH, BETHESDA, MD, USA, WEISER et al., "Identification and Characterization of oapA, a Cell-Envelope Protein of Haemophilus Influenzae Contributing to *
NUCLEOTIDE DATABASE ON ENTREZ RELEASE 15.0, PUBLISHED ON CD-ROM BY THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION, NATIONAL LIBRARY OF MEDICINE, NATIONAL INSTITUTES OF HEALTH, BETHESDA, MD, USA, COPE et al., "A Gene Cluster Involved in the Utilization of Both Free Heme and Heme: Hemopexin by Haemophilus Influenzae Type B", 15 February *
SCIENCE, 28 July 1995, Volume 269, FLEISCHMANN et al., "Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd", pages 496-512. *
See also references of EP0821737A4 *
WATSON J.D. et al., "Recombinant DNA in Medicine and Industry", In: RECOMBINANT DNA, Second Edition, NEW YORK: SCIENTIFIC AMERICAN BOOKS, W.H. FREEMAN AND COMPANY, 1992, pages 453-470. *

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7033799B2 (en) * 1994-08-25 2006-04-25 Washington University Haemophilus adherence and penetration proteins
EP0894006A4 (en) * 1996-04-18 1999-11-24 Smithkline Beecham Corp POLYNUCLEOTIDES AND POLYPEPTIDES OF THE THREONYL tRNA SYNTHETASE FAMILY
US7141418B2 (en) 1996-10-31 2006-11-28 Human Genome Sciences, Inc. Streptococcus pneumoniae polynucleotides and sequences
US6420135B1 (en) 1996-10-31 2002-07-16 Human Genome Sciences, Inc. Streptococcus pneumoniae polynucleotides and sequences
US8168205B2 (en) 1996-10-31 2012-05-01 Human Genome Sciences, Inc. Streptococcus pneumoniae polypeptides
WO1998018931A3 (en) * 1996-10-31 1998-08-20 Human Genome Sciences Inc Streptococcus pneumoniae polynucleotides and sequences
US7056510B1 (en) 1996-10-31 2006-06-06 Human Genome Sciences, Inc. Streptococcus pneumoniae SP036 polynucleotides, polypeptides, antigens and vaccines
WO1998050555A3 (en) * 1997-05-06 1999-01-14 Human Genome Sciences Inc Enterococcus faecalis polynucleotides and polypeptides
US6448043B1 (en) 1997-05-06 2002-09-10 Human Genome Sciences, Inc. Enterococcus faecalis EF040 and uses therefor
EP0896061A3 (en) * 1997-08-08 2000-07-26 Smithkline Beecham Corporation RpoA gene from Staphylococcus aureus
WO1999040189A3 (en) * 1998-02-09 1999-10-14 Genset Sa Cdnas encoding secreted proteins
US6936692B2 (en) 1998-02-09 2005-08-30 Genset, S.A. Complementary DNAs
WO1999047553A3 (en) * 1998-03-18 2001-02-22 Glaxo Group Ltd Bacterial yeal family members as targets for antimicrobial drug design
WO1999054462A3 (en) * 1998-04-22 2000-06-22 Glaxo Group Ltd Bacterial ycfb polypeptide family
WO1999054474A3 (en) * 1998-04-22 2000-05-04 Glaxo Group Ltd Bacterial yiha polypeptide family
WO1999054470A3 (en) * 1998-04-22 2000-03-30 Glaxo Group Ltd Bacterial ygjd polypeptide family
WO1999054473A3 (en) * 1998-04-22 2000-03-23 Glaxo Group Ltd Bacterial yjeq polypeptide family
US6696062B1 (en) 1998-04-24 2004-02-24 Smithkline Beecham Biologicals S.A. BASB006 polypeptides from Neisseria meningitidis and immunogenic compositions thereof
US7419824B2 (en) 1998-04-24 2008-09-02 Glaxosmithkline Biologicals, S.A,. BASB006 polypeptides from Neisseria meningitidis and immunogenic compositions thereof
WO1999055873A3 (en) * 1998-04-24 2000-03-09 Smithkline Beecham Biolog Basb006 polynucleotide(s) and polypeptides from neisseria meningitis
RU2227043C2 (en) * 1998-05-01 2004-04-20 Чирон Корпорейшн Neisseria meningitidis antigens and compositions
US9266929B2 (en) 1998-05-01 2016-02-23 Glaxosmithkline Biologicals Sa Neisseria meningitidis antigens and compositions
US9249198B2 (en) 1998-05-01 2016-02-02 Glaxosmithkline Biologicals Sa Neisseria meningitidis antigens and compositions
US9249196B2 (en) 1998-05-01 2016-02-02 Glaxosmithkline Biologicals Sa Neisseria meningitidis antigens and compositions
US9139621B2 (en) 1998-05-01 2015-09-22 Glaxosmithkline Biologicals Sa Neisseria meningitidis antigens and compositions
US8524251B2 (en) 1998-05-01 2013-09-03 J. Craig Venter Institute, Inc. Neisseria meningitidis antigens and compositions
US7576176B1 (en) 1998-05-01 2009-08-18 Novartis Vaccines And Diagnostics, Inc. Neisseria meningitidis antigens and compositions
WO1999057280A3 (en) * 1998-05-01 2000-08-24 Chiron Corp Neisseria meningitidis antigens and compositions
US7988979B2 (en) 1998-05-01 2011-08-02 J. Craig Venter Institute, Inc. Neisseria meningitidis antigens and compositions
WO2000011141A3 (en) * 1998-08-24 2000-06-02 Mount Sinai Hospital Corp Trna binding domain
US7385034B2 (en) 1998-12-22 2008-06-10 Serono Genetics Institute S.A. Complementary DNAs encoding proteins with signal peptides
WO2000047737A1 (en) * 1999-02-09 2000-08-17 Smithkline Beecham Biologicals S.A. Haemophilus influenzae rd outer membrane sequences used as vaccine
WO2000050599A1 (en) * 1999-02-24 2000-08-31 Smithkline Beecham Biologicals S.A. Haemophilus antigen
CN100352924C (en) * 1999-06-25 2007-12-05 史密丝克莱恩比彻姆生物有限公司 Moraxella catarrhalis BASB111 polypeptides and polynucleotides
WO2001000837A1 (en) * 1999-06-25 2001-01-04 Smithkline Beecham Biologicals S.A. Basb111 polypeptide and polynucleotide from moraxella catharrhalis
WO2001011033A3 (en) * 1999-08-04 2002-01-17 Abbott Lab Identification of genes essential for the survival of haemophilus influenzae through genome scanning by transposition mutagenesis
WO2001070992A1 (en) * 2000-03-21 2001-09-27 De Staat Der Nederlanden Vertegenwoordigd Door De Minister Van Welzijn, Volksgezondheid En Cultuur Proteins and polypeptides from haemophilus influenza involved in paracytosis, nucleotide sequences encoding them and uses thereof
EP1136557A1 (en) * 2000-03-21 2001-09-26 De Staat Der Nederlanden Vertegenwoordigd Door De Minister Van Welzijn, Volksgezondheid En Cultuur Proteins and polypeptides from Haemophilus influenzae involved in paracytosis, nucleotide sequences encoding them and uses thereof
WO2002018601A3 (en) * 2000-08-25 2003-02-06 Abbott Lab Essential bacteria genes and genome scanning in haemophilus influenzae for the identification of 'essential genes'
EP2280071A3 (en) * 2000-10-02 2011-04-27 ID Biomedical Corporation Haemophilus influenzae antigens and corresponding DNA fragments
EP2088197A3 (en) * 2000-10-02 2011-12-21 ID Biomedical Corporation Haemophilus influenzae antigens and corresponding DNA fragments
AU2007202270B8 (en) * 2000-10-02 2011-10-13 Id Biomedical Corporation Of Quebec Haemophilus influenzae antigens and corresponding DNA fragments
AU2007202270A8 (en) * 2000-10-02 2011-09-08 Id Biomedical Corporation Of Quebec Haemophilus influenzae antigens and corresponding DNA fragments
AU2007202270B2 (en) * 2000-10-02 2011-09-01 Id Biomedical Corporation Of Quebec Haemophilus influenzae antigens and corresponding DNA fragments
WO2002028889A3 (en) * 2000-10-02 2003-04-10 Shire Biochem Inc Haemophilus influenzae antigens and corresponding dna fragments
EP2270171A3 (en) * 2000-10-02 2011-04-27 ID Biomedical Corporation Haemophilus influenzae antigens and corresponding DNA fragments
WO2002030971A3 (en) * 2000-10-13 2002-11-07 Smithkline Beecham Biolog Basb205 polypeptides and polynucleotides coding therefor
US7838003B2 (en) 2000-10-13 2010-11-23 Glaxosmithkline Biologicals S.A. BASB205 polypeptides and polynucleotides from Haemophilus influenzae
WO2002030967A3 (en) * 2000-10-13 2002-11-07 Smithkline Beecham Biolog Novel compounds
EP1847609A1 (en) * 2000-10-13 2007-10-24 GlaxoSmithKline Biologicals S.A. BASB205 polypeptides and polynucleotides coding therefor
WO2002032946A3 (en) * 2000-10-17 2002-10-31 Smithkline Beecham Biolog Basb207 polypeptides and polynucleotides from nontypeable haemophilus influenzae
WO2002034772A3 (en) * 2000-10-24 2003-01-30 Smithkline Beecham Biolog Novel compounds
WO2002046215A3 (en) * 2000-12-08 2002-11-14 Smithkline Beecham Biolog Basb221 polypeptides and polynucleotides encoding basb221 polypeptides
WO2002066503A3 (en) * 2001-02-16 2002-11-07 Smithkline Beecham Biolog H. influenzae antigen basb213
WO2002077020A3 (en) * 2001-03-22 2003-06-19 Isis Innovation Virulence genes in h. influenzae
WO2002088361A3 (en) * 2001-04-30 2003-11-27 Glaxosmithkline Biolog Sa Haemophilus influenzae antigens
US7875421B2 (en) 2002-04-09 2011-01-25 Nestec S.A. La1—the genome of a lactobacillus strain
US9909130B2 (en) 2005-03-18 2018-03-06 Dsm Ip Assets B.V. Production of carotenoids in oleaginous yeast and fungi
US20150218231A1 (en) * 2005-06-16 2015-08-06 Nationwide Children's Hospital, Inc. Genes of an otitis media isolate of nontypeable heamophilus influenzae
US8691555B2 (en) 2006-09-28 2014-04-08 Dsm Ip Assests B.V. Production of carotenoids in oleaginous yeast and fungi
US9297031B2 (en) 2006-09-28 2016-03-29 Dsm Ip Assets B.V. Production of carotenoids in oleaginous yeast and fungi

Also Published As

Publication number Publication date
AU5552396A (en) 1996-11-07
EP0821737A4 (en) 2005-01-19
CA2218741A1 (en) 1996-10-24
EP0821737A1 (en) 1998-02-04
JPH11501520A (en) 1999-02-09

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TR TT UA UG UZ VN AM AZ BY KG KZ MD RU TJ TM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase

Ref document number: 2218741

Country of ref document: CA

Kind code of ref document: A

Ref document number: 1996 531888

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 1996912845

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1996912845

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 1996912845

Country of ref document: EP

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)