WO2023036849A1 - Identifying and predicting future coronavirus variants - Google Patents

Identifying and predicting future coronavirus variants

Info

Publication number
WO2023036849A1
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
coronavirus
ace
binding
sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2022/074922
Other languages
French (fr)
Inventor
Beichen GAO
Cédric Weber
Joseph TAFT
Roy EHLING
Sai T. Reddy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eidgenoessische Technische Hochschule Zurich ETHZ
Original Assignee
Eidgenoessische Technische Hochschule Zurich ETHZ
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eidgenoessische Technische Hochschule Zurich ETHZ
Publication of WO2023036849A1
Legal status: Ceased

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis

Definitions

  • coronaviruses and coronavirus variants threaten to result in pandemics such as the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) that emerged in 2019 causing pandemic coronavirus disease 2019 (COVID-19), leading to millions of fatalities worldwide.
  • SARS-CoV-2 severe acute respiratory syndrome coronavirus-2
  • COVID-19 coronavirus disease 2019
  • VOC coronavirus variants of concern
  • RBD receptor-binding domain
  • systems and methods for prediction of coronavirus variants may predict coronavirus variants that bind to the ACE-2 receptor, but do not bind to one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
  • the method presented herein goes far beyond simply assaying spike variants.
  • the number of possible spike variants is almost unimaginably high (billions of billions), and cannot possibly be tested by in vitro or in vivo methods.
  • the method of the invention generates a subset of these variants, and uses the resulting data to train machine learning models which can then make predictions across the entire set.
  • the key concept is to train accurate machine learning models that can be then deployed for prediction and surveillance of emerging variants through the course of a pandemic.
  • the present technology provides a method that includes providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
  • the present technology provides a system that includes one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; train a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determine, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
  • the first data set may include a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins (e.g., a single-domain nanobody) that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
  • the first affinity binding score indicates a likelihood that the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
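The four pluralities described above amount to a two-by-two labeling of each variant by ACE-2 binding and antibody binding. The following is a minimal sketch of that labeling, not the applicants' implementation; the names `BindingClass`, `classify`, and `is_escape_variant` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class BindingClass(Enum):
    """Hypothetical labels for the four pluralities of sequences."""
    ACE2_POS_AB_POS = "binds ACE-2, bound by antibody"        # first plurality
    ACE2_NEG_AB_POS = "no ACE-2 binding, bound by antibody"   # second plurality
    ACE2_POS_AB_NEG = "binds ACE-2, not bound by antibody"    # third plurality
    ACE2_NEG_AB_NEG = "no ACE-2 binding, not bound by antibody"  # fourth plurality


def classify(binds_ace2: bool, binds_antibody: bool) -> BindingClass:
    """Map two binary binding outcomes onto one of the four classes."""
    if binds_ace2:
        if binds_antibody:
            return BindingClass.ACE2_POS_AB_POS
        return BindingClass.ACE2_POS_AB_NEG
    if binds_antibody:
        return BindingClass.ACE2_NEG_AB_POS
    return BindingClass.ACE2_NEG_AB_NEG


def is_escape_variant(binds_ace2: bool, binds_antibody: bool) -> bool:
    """True for the class the score targets: ACE-2 binding with antibody escape."""
    return classify(binds_ace2, binds_antibody) is BindingClass.ACE2_POS_AB_NEG
```

Under this labeling, only the third plurality corresponds to the variants of interest, that is, those that retain receptor binding while escaping the antibody or binding protein.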
  • the coronavirus may be a betacoronavirus including, but not limited to, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) virus or a variant thereof.
  • the present technology also provides a coronavirus variant or portion thereof having an amino acid sequence, wherein the amino acid sequence is generated by the methods or systems disclosed herein.
  • the coronavirus variant or portion thereof may bind to the ACE-2 receptor.
  • the coronavirus variant or portion thereof may not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
  • the present technology provides one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof.
  • the one or more antibodies and/or the one or more binding proteins prevent the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
  • the present technology also provides information on the design of a vaccine that induces production of the antibody and/or the binding protein.
  • FIG. 1 illustrates an exemplary combinatorial library design.
  • FIG. 2 illustrates flow cytometry plots of gating schemes for selection of different RBD libraries after labeling with soluble human ACE-2 (y-axis) and antibody binding to RBD expression tag (anti-Flag-PE, x-axis), according to the examples.
  • FIGS. 3A-3C illustrate amino acid frequencies of the ACE-2 binding (ACE2+) and non-binding (ACE2-) of libraries 2C, 2CE, and 2T, respectively, according to the examples.
  • FIG. 4 illustrates one exemplary flow cytometry plot of a library of approximately 10^7 cells expressing ACE-2 binding RBD variants that were sorted for binding to and escape from the monoclonal antibodies LY-CoV16, LY-CoV555, REGN10933, and REGN10987, according to the examples.
  • FIGS. 5A-5D illustrate amino acid frequencies of the monoclonal antibody binding and escape with LY-CoV16, LY-CoV555, REGN10933, and REGN10987, respectively, according to the examples.
  • FIG. 6 illustrates an example system for identifying and predicting coronavirus variants, in accordance with implementations.
  • FIG. 7 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
  • FIG. 8 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
  • FIG. 9 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
  • FIG. 10 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
  • FIG. 11 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
  • FIG. 12 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
  • the input sequences are variants that are part of the theoretical mutagenesis sequence space of the spike protein library.
  • generation of the input sequences is informed by a deep mutational scan of the sequence to select the mutation sites with highest impact on receptor binding and antibody binding.
  • antibody and its plural “antibodies” where used in the present specification encompasses classical immunoglobulin antibodies and antibody-like proteins that are capable of specifically binding to a coronavirus spike protein.
  • specific binding in the context of the present invention refers to a property of antibodies and antibody-like proteins, which bind to their target with a certain affinity and target specificity.
  • the affinity of such a ligand is indicated by the dissociation constant of the ligand.
  • a specifically reactive ligand has a dissociation constant of ≤ 10^-7 mol/L (particularly ≤ 10^-9 mol/L) when binding to its target, but a dissociation constant at least three orders of magnitude higher in its interaction with a molecule having a globally similar chemical composition as the target, but a different three-dimensional structure.
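The specificity criterion above can be read as two numeric conditions: an upper bound on the dissociation constant for the target, and a ratio of at least three orders of magnitude (a factor of 1000) between off-target and on-target dissociation constants. A minimal sketch follows; the function name and parameters are hypothetical, and the default thresholds simply restate the values given in the text.

```python
def is_specific_binder(kd_target: float,
                       kd_off_target: float,
                       max_kd: float = 1e-7,
                       min_fold: float = 1e3) -> bool:
    """Check the two conditions for specific binding (Kd values in mol/L).

    kd_target     : dissociation constant for the intended target
    kd_off_target : dissociation constant for a chemically similar molecule
                    with a different three-dimensional structure
    max_kd        : upper bound on kd_target (1e-7 mol/L; 1e-9 for the
                    stricter 'particularly' case)
    min_fold      : required ratio kd_off_target / kd_target (three orders
                    of magnitude = 1e3)
    """
    return kd_target <= max_kd and kd_off_target >= kd_target * min_fold
```

For example, a ligand with a Kd of 1e-9 mol/L for its target and 1e-5 mol/L for a decoy satisfies both conditions, whereas one with 1e-8 mol/L and 1e-6 mol/L fails the second because the ratio is only 100.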
  • antibody refers to whole antibodies including but not limited to immunoglobulin type G (IgG), type A (IgA), type D (IgD), type E (IgE) or type M (IgM), any antigen-binding fragment or single chains thereof and related or derived constructs.
  • a whole antibody is a glycoprotein comprising at least two heavy (H) chains and two light (L) chains inter-connected by disulfide bonds.
  • Each heavy chain is comprised of a heavy chain variable region (VH) and a heavy chain constant region (CH).
  • VH heavy chain variable region
  • CH heavy chain constant region
  • the heavy chain constant region of IgG is comprised of three domains, CH1, CH2 and CH3.
  • Each light chain is comprised of a light chain variable region (abbreviated herein as VL) and a light chain constant region (CL).
  • the light chain constant region is comprised of one domain, CL.
  • the variable regions of the heavy and light chains contain a binding domain that interacts with an antigen.
  • the constant regions of the antibodies may mediate the binding of the immunoglobulin to host tissues or factors, including various cells of the immune system (e.g., effector cells) and the first component of the classical complement system.
  • the term encompasses a so-called nanobody or single domain antibody, an antibody fragment consisting of a single monomeric variable antibody domain.
  • antibody-like molecule in the context of the present specification refers to a molecule capable of specific binding to another molecule or target with high affinity / a Kd ≤ 10^-7 mol/L (particularly ≤ 10^-9 mol/L).
  • An antibody-like molecule binds to its target similarly to the specific binding of an antibody.
  • antibody-like molecule encompasses a repeat protein, such as a designed ankyrin repeat protein (Molecular Partners, Zurich), an engineered antibody mimetic protein exhibiting highly specific and high-affinity target protein binding (see US2012142611, US2016250341, US2016075767 and US2015368302).
  • antibody-like molecule further encompasses, but is not limited to, a polypeptide derived from armadillo repeat proteins, a polypeptide derived from leucine-rich repeat proteins and a polypeptide derived from tetratricopeptide repeat proteins.
  • the term antibody-like molecule further encompasses a specifically binding polypeptide derived from a protein A domain, a fibronectin domain FN3, a consensus fibronectin domain, a lipocalin (see Skerra, Biochim. Biophys. Acta 2000, 1482(1-2):337-50), a polypeptide derived from a Zinc finger protein (see Krun et al.
  • a Src homology domain 2 (SH2) or Src homology domain 3 (SH3),
  • a PDZ domain, a gamma-crystallin,
  • ubiquitin, a cysteine knot polypeptide or a knottin, cystatin, Sac7d,
  • a triple helix coiled coil (also known as alphabodies),
  • a Kunitz domain or a Kunitz-type protease inhibitor, and
  • a carbohydrate binding module 32-2.
  • the term antibody-like molecule further encompasses a humanized camelid antibody.
  • protein A domains derived polypeptide refers to a molecule that is a derivative of protein A and is capable of specifically binding the Fc region and the Fab region of immunoglobulins.
  • armadillo repeat protein refers to a polypeptide comprising at least one armadillo repeat, wherein an armadillo repeat is characterized by a pair of alpha helices that form a hairpin structure.
  • humanized camelid antibody in this context refers to an antibody consisting of only the heavy chain or the variable domain of the heavy chain (VHH domain) whose amino acid sequence has been modified to increase its similarity to antibodies naturally produced in humans and which thus shows a reduced immunogenicity when administered to a human being.
  • VHH domain variable domain of the heavy chain
  • a general strategy to humanize camelid antibodies is shown in Vincke et al. “General strategy to humanize a camelid single-domain antibody and identification of a universal humanized nanobody scaffold”, J Biol Chem. 2009 Jan 30;284(5):3273-3284, and US2011165621 A1.
  • antibody refers to classical immunoglobulins, particularly to IgG antibodies.
  • the coronavirus variant is likely to emerge through in vivo viral replication in a mammal.
  • the antibodies employed to inform the generation of the four groups can be selected from the therapeutic antibodies developed by pharmaceutical companies which were widely used during the acute phase of the pandemic. They include, but are not limited to, the group consisting of Casirivimab, Imdevimab, Etesevimab, and Bamlanivimab. All four of these were granted FDA emergency use authorizations and were shown to be effective on early SARS-CoV-2 variants.
  • the data generated by the inventors show independently that these therapeutic antibodies were susceptible to escape mutations in the spike, leading to decreased or lacking efficacy against variants which emerged after late 2021.
  • the method outlined in this specification is able to assess the efficacy of each antibody against billions of possible spike variants.
  • In any embodiment, the first plurality of variant sequences may include at least 2, at least 5, at least 10, at least 20, at least 50, at least 75, at least 100, at least 200, at least 300, or at least 500 unique variant sequences.
  • In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^10, no more than 1 x 10^9, no more than 1 x 10^8, no more than 1 x 10^7, no more than 1 x 10^6, no more than 1 x 10^5, or no more than 500 unique variant sequences.
  • In any embodiment, each of the first, second, third, and fourth pluralities of amino acid sequences may include 1 to 1 x 10^10, 10 to 1 x 10^9, 50 to 1 x 10^8, 100 to 1 x 10^7, or 350 to 1 x 10^6 unique amino acid sequences.
  • the first data set may include at least 100, at least 500, at least 1000, at least 10,000, at least 50,000, at least 100,000, or at least 200,000 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^10, no more than 1 x 10^9, no more than 1 x 10^8, no more than 1 x 10^7, or no more than 1 x 10^6 unique amino acid sequences.
  • the one or more antibodies and/or the one or more binding proteins are in human serum.
  • the one or more antibodies may include one or more monoclonal antibodies.
  • the method and/or system may further include generating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise the first plurality of amino acid sequences, the second plurality of amino acid sequences, the third plurality of amino acid sequences, and the fourth plurality of amino acid sequences.
  • the combinatorial RBD library is physically screened for ACE-2 and antibody (AB) binding and non-binding; the resulting variants are collected by FACS and then deep sequenced. This yields an experimental dataset of RBD sequence variants annotated with their ACE-2/AB binding properties. This experimental dataset is the training data used for machine learning (ML).
  • the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences may be based on deep mutational scanning.
  • the design of the combinatorial and tiling library can be guided by deep mutational scanning data, or be generated via or based on deep mutational scanning data.
  • the deep mutational scanning may include generating a first library of variant sequences of the viral spike protein wherein each variant sequence is modified at a single amino acid position relative to the input amino acid sequence.
  • the first library may include variant sequences representing each amino acid position of the input amino acid sequence.
  • the first library may include variant sequences representing all 20 standard amino acids at each position of the input amino acid sequence.
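The single-site library used for deep mutational scanning can be enumerated directly from the input sequence. The sketch below is illustrative only: `single_site_library` is a hypothetical helper, and the toy input in the usage note is not a real spike fragment. It emits the 19 non-wild-type substitutions at each position; together with the unmodified input sequence, this covers all 20 standard amino acids at every position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_site_library(wild_type: str) -> list[str]:
    """Return every variant differing from wild_type at exactly one position."""
    variants = []
    for pos, wt_residue in enumerate(wild_type):
        for residue in AMINO_ACIDS:
            if residue == wt_residue:
                continue  # the wild-type residue is represented by the input itself
            variants.append(wild_type[:pos] + residue + wild_type[pos + 1:])
    return variants
```

For an input of length n the library contains 19 x n variants, so a toy five-residue input such as `"ACDEF"` yields 95 sequences.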
  • the combinatorial library of amino acid sequences and the tiling library of amino acid sequences may be recombinantly expressed as individual amino acid sequences on a yeast cell surface.
  • the combinatorial library of amino acid sequences may include a plurality of amino acid sequences, wherein the plurality of amino acid sequences includes at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis performed at one or more amino acid sites of the viral spike protein.
  • the tiling library of amino acid sequences may include a plurality of amino acid sequences, wherein each amino acid sequence may include at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis performed at one or more amino acid sites of the viral spike protein.
  • the amino acid site saturation mutagenesis may include a receptor-binding domain (RBD) of the viral spike protein.
  • the RBD may include amino acid sites 350 to 550, including, but not limited to amino acid sites 400 to 515, amino acid sites 417 to 505, amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452.
  • the amino acid site saturation mutagenesis may be in the RBD and at one or more amino acids at sites 350 to 550, including, but not limited to amino acid sites 400 to 515, amino acid sites 417 to 505, amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452.
  • the amino acid site saturation mutagenesis may be in the RBD and at one or more amino acids at sites 350 to 550, including, but not limited to amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452.
  • At least 5 amino acid sites may have undergone saturation mutagenesis. In any embodiment, at least 10 amino acid sites may have undergone saturation mutagenesis. In any embodiment, at least 20 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 5 to 50 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 10 to 45 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 20 to 40 amino acid sites may have undergone saturation mutagenesis.
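Combining saturation mutagenesis at several sites makes the library size grow multiplicatively with the number of sites, which is why only a subset can be screened experimentally. The following sketch enumerates a combinatorial library for a toy sequence; `combinatorial_library`, `library_size`, and the 0-based `site_options` mapping are hypothetical names introduced for illustration, and each site's residue string is assumed to contain no repeats.

```python
from itertools import product
from math import prod

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def library_size(site_options: dict[int, str]) -> int:
    """Number of distinct variants: product of allowed residues per site."""
    return prod(len(options) for options in site_options.values())


def combinatorial_library(wild_type: str, site_options: dict[int, str]):
    """Yield every combination of the allowed residues at the chosen sites.

    site_options maps a 0-based position to a string of allowed residues;
    all other positions keep the wild-type residue.
    """
    sites = sorted(site_options)
    for combo in product(*(site_options[s] for s in sites)):
        seq = list(wild_type)
        for site, residue in zip(sites, combo):
            seq[site] = residue
        yield "".join(seq)
```

With full saturation (20 residues) at 5 sites, `library_size` returns 20^5 = 3,200,000; every additional saturated site multiplies that figure by 20.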
  • the amino acid site saturation mutagenesis may correspond to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to 1-20 amino acids including 1-5, 1-10, 1-15, 2-10, 2-15, 2-20, 3-10, 3-15, 3-20, 5-10, 5-15, 5-20, 7-10, 7-15, or 7-20 amino acids.
  • the amino acid may differ from the amino acid at that position in the wild type sequence such that the amino acid site saturation mutagenesis may correspond to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 19 amino acids (e.g., 1-5, 1-10, 1-15, 2-10, 2-15, 2-19, 3-10, 3-15, 3-19, 5-10, 5-15, 5-19, 7-10, 7-15, or 7-19 amino acids).
  • the amino acid site saturation mutagenesis may be present at one or more amino acids.
  • the amino acid site saturation mutagenesis may correspond to one or more amino acid mutations that differ at the same position(s) in the wild type sequence.
  • the amino acid site saturation mutagenesis may be present at two or more amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to two or more amino acid mutations that differ at the positions in the wild type sequence. In any embodiment, the amino acid site saturation mutagenesis may be present at three or more amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to three or more amino acid mutations that differ at those positions in the wild type sequence. In some embodiments, the amino acid site saturation mutagenesis may be provided by a degenerate codon.
  • the amino acid site saturation mutagenesis may include at least two degenerate codons, including, but not limited to at least three degenerate codons, at least four degenerate codons, at least five degenerate codons, or at least six degenerate codons.
  • the degenerate codon may include any known degenerate codon.
  • the degenerate codon may include any codon made up of a combination of degenerate nucleotides A, C, T, G, W, S, M, K, R, Y, B, D, H, V, N (e.g., NNK, NNS, NDT, DBK, GGN, TWY, WR, YWY, etc.), or a combination of two or more thereof.
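Which amino acids a degenerate codon can encode follows from expanding each IUPAC ambiguity letter into its possible bases and translating the resulting codons with the standard genetic code. The sketch below does that; `encoded_residues` is a hypothetical helper name, and the standard (NCBI table 1) genetic code is assumed.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: RESIDUES[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def encoded_residues(degenerate_codon: str) -> set[str]:
    """Return the amino acids ('*' = stop) encoded by a 3-letter degenerate codon."""
    if len(degenerate_codon) != 3:
        raise ValueError("a codon must have exactly three letters")
    choices = (IUPAC[letter] for letter in degenerate_codon.upper())
    return {CODON_TABLE["".join(codon)] for codon in product(*choices)}
```

Applied to the examples in the text, NNK covers all 20 amino acids plus one stop codon (TAG), NDT covers 12 amino acids with no stop codon, and GGN encodes only glycine.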
  • the method and/or system may further include training the first machine learning model with the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences to predict the affinity binding scores for proposed amino acid sequences.
  • the first machine learning model may include, but is not limited to, at least one of a random forest or a recurrent neural network.
  • the first machine learning model can include any supervised machine learning model that is configured to perform classification on encoded biological sequence data.
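A supervised classifier needs the sequences in numeric form. The text does not specify the encoding, so the sketch below assumes one-hot encoding, a common choice for fixed-length protein variants; the helper name `one_hot` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
_INDEX = {residue: i for i, residue in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str) -> list[int]:
    """Flattened one-hot encoding: 20 slots per position, exactly one set to 1."""
    vector = [0] * (len(sequence) * len(AMINO_ACIDS))
    for pos, residue in enumerate(sequence):
        if residue not in _INDEX:
            raise ValueError(f"non-standard residue {residue!r} at position {pos}")
        vector[pos * len(AMINO_ACIDS) + _INDEX[residue]] = 1
    return vector
```

The resulting vectors, paired with binding or non-binding labels, are the kind of input a random forest could be trained on; a recurrent network would more typically consume the unflattened per-position form.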
  • the method and/or system may further include generating a second data set that includes a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants.
  • the method may further include updating the first machine learning model with the second data set; and determining, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second-generation coronavirus variant.
  • the method and/or system may further include generating, in silico, a plurality of proposed amino acid sequences of coronavirus variants; and inputting the plurality of proposed amino acid sequences into the first machine learning model to identify a corresponding plurality of affinity binding scores.
  • the method and/or system may further include validating that the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the validation.
  • the method and/or system may further include invalidating that the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the invalidation.
  • the method and/or system may further include pre-processing the data set prior to training the first machine learning model, wherein pre-processing comprises at least one of: removing sequencing errors by filtering for sequences complying with initial amino acid site saturation mutagenesis; filtering the data set based on a number of read counts greater than a threshold and a distance from a wild type less than or equal to a second threshold; or removing duplicate sequences.
  • the method and/or system may further include validating the first machine learning model using a second data set comprising a second plurality of variant sequences, each of the second plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein, wherein the second data set is different from the first data set and has not been used to train the first machine learning model.
  • the method and/or system may further include determining a score based on predicted affinity binding score and any combination of one or more of nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant.
  • the method and/or system may further include determining a likelihood of variant emergence score for the coronavirus variant based on the proposed amino acid sequence based on: nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant.
  • the method and/or system may further include determining the likelihood of variant emergence score for the coronavirus variant responsive to the first affinity binding score being greater than or equal to a threshold.
  • the method may further include selecting, responsive to the likelihood of variant emergence score greater than or equal to a second threshold, the coronavirus variant for experimental validation.
  • the method and/or system may further include balancing the data set to include an equal number of positive sequences and negative sequences per amino acid edit distance from a reference sequence (e.g., SARS-CoV-2 Wuhan-1).
  • the method and/or system may further include generating the one or more antibodies and/or the one or more binding proteins that bind the coronavirus variant.
  • the method and/or system may further include generating a vaccine that induces production of one or more antibodies and/or one or more binding proteins, wherein the one or more antibodies and/or the one or more binding proteins bind the coronavirus variant.
  • the processor-executable instructions may generate a second data set comprising a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants.
  • the processor-executable instructions may update the first machine learning model with the second data set, and determine, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second-generation coronavirus variant.
  • the present technology also provides a coronavirus variant or portion thereof having an amino acid sequence, wherein the amino acid sequence is generated by the methods or systems disclosed herein.
  • the coronavirus variant or portion thereof may bind to the ACE-2 receptor.
  • the coronavirus variant or portion thereof may not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
  • the present technology provides one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof.
  • the one or more antibodies and/or the one or more binding proteins prevent the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
  • a cell may include the one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof.
  • the cell may be a mammalian cell, a bacterial cell, a yeast cell, an insect cell, or a eukaryotic cell.
  • the present technology also provides a vaccine that induces production of the one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof.
  • the ACE-2 receptor may be a mammalian ACE-2 receptor (e.g., a human ACE-2 receptor).
  • Reference to item N includes the alternative reference to item N a).
  • Item 1 A method, comprising: providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising data reflecting the ability of a first plurality of variant sequences of the input amino acid sequence to bind to: an ACE-2 receptor and to an antibody to the spike protein, wherein the antibody inhibits the coronavirus from binding to the ACE-2 receptor, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein relative to the input sequence; wherein the first data set comprises: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to said antibody; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to said antibody; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to said antibody; and/or a fourth plurality of amino acid sequences that do not
  • Item 1 a A method, comprising: providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises:
  • Item 2 The method of item 1, wherein the coronavirus is a betacoronavirus.
  • betacoronavirus is severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) virus or a variant thereof.
  • any one of items 1-3 further comprising: generating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise the first plurality of amino acid sequences, the second plurality of amino acid sequences, the third plurality of amino acid sequences, and the fourth plurality of amino acid sequences; and training the first machine learning model with the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences to predict the affinity binding scores for proposed amino acid sequences.
  • receptor-binding domain (RBD)
  • the combinatorial library of amino acid sequences comprises a plurality of amino acid sequences comprising at least a portion of the viral spike protein of the coronavirus with an amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
  • the tiling library of amino acid sequences comprises a plurality of amino acid sequences, wherein each amino acid sequence comprises at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
  • amino acid site saturation mutagenesis comprises a receptor-binding domain (RBD) of the viral spike protein.
  • amino acid site saturation mutagenesis comprises 1 to 19 amino acids, wherein the excluded amino acid is a wild type amino acid.
  • degenerate codon comprises any codon made up of a combination of degenerate nucleotides A, C, T, G, W, S, M, K, R, Y, B, D, H, V, N, or a combination of two or more thereof.
  • any one of items 1-35 further comprising generating a vaccine that induces production of one or more antibodies and/or one or more binding proteins, wherein the one or more antibodies and/or the one or more binding proteins bind the coronavirus variant.
  • any one of items 1-38 further comprising: generating a second data set comprising a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants; updating the first machine learning model with the second data set; and determining, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second generation coronavirus variant.
  • any one of items 1-40 further comprising: balancing the data set to include an equal number of positive (binding) sequences and negative (non-binding) sequences per amino acid edit distance from a reference sequence (e.g., SARS-CoV-2 Wuhan-1).
  • any one of items 1-43 further comprising: experimentally invalidating if the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the invalidation.
  • any one of items 1-44 further comprising: pre-processing the data set prior to training the first machine learning model, wherein pre-processing comprises at least one of: removing sequencing errors by filtering for sequences complying with initial amino acid site saturation mutagenesis; filtering the data set based on a number of read counts greater than a threshold and a distance from a wild type less than or equal to a second threshold; or removing duplicate sequences.
  • any one of items 1-45 further comprising: validating the first machine learning model using a second data set comprising a second plurality of variant sequences, each of the second plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein, wherein the second data set is different from the first data set and has not been used to train the first machine learning model.
  • any one of items 3-38 further comprising: determining a likelihood of variant emergence score for the coronavirus variant based on the proposed amino acid sequence, the predicted affinity binding score and one or more of: nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant.
  • the method of item 47 further comprising: determining the likelihood of variant emergence score for the coronavirus variant responsive to the first affinity binding score being greater than or equal to a threshold.
  • a system comprising one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that
  • coronavirus variant or portion thereof of item 52 wherein the coronavirus variant or portion thereof binds to the ACE-2 receptor and/or the coronavirus variant or portion thereof does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the protein or peptide from binding to the ACE-2 receptor.
  • a cell comprising a protein or peptide according to any one of items 52-55.
  • the cell according to item 59 wherein the cell is a mammalian cell, a bacterial cell, a yeast cell, an insect cell, or a eukaryotic cell.
  • mutagenesis libraries of the RBD of SARS-CoV-2 were designed.
  • combinatorial libraries were designed at amino acid positions 484-505 (Class 2) (FIG. 1), amino acid positions 453-478 (Class 1C), and amino acid positions 440-452 (Class 3C).
  • the log enrichment from the DMS data was transformed into amino acid frequencies.
  • the negatively enriched amino acids were discarded (if below a threshold) and the positively enriched were given frequencies proportional to their enrichment above 0 in the DMS dataset. Based on these frequencies each position was assigned an amino acid site saturation mutagenesis.
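  • The transformation from log enrichment to per-position amino acid frequencies described above can be sketched as follows. This is a minimal illustration that assumes a discard threshold of 0 and simple proportional normalization; the function name `enrichment_to_frequencies` and the toy scores are hypothetical and do not come from the DMS dataset.

```python
def enrichment_to_frequencies(log_enrichment):
    """Convert per-amino-acid log enrichment at one position into frequencies.

    Assumption for this sketch: amino acids with enrichment at or below 0 are
    discarded, and the remaining ones receive frequencies proportional to
    their enrichment above 0.
    """
    kept = {aa: score for aa, score in log_enrichment.items() if score > 0}
    total = sum(kept.values())
    if total == 0:
        return {}  # no positively enriched amino acid at this position
    return {aa: score / total for aa, score in kept.items()}


# Toy scores for a single position (illustrative values only).
position_scores = {"E": 2.0, "K": 1.0, "Q": 1.0, "P": -3.0}
print(enrichment_to_frequencies(position_scores))
```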
  • tiling libraries (Class 2T, Class 1T, Class 3T) were constructed to enable recovery of variants with a lower distance to the wild-type RBD (i.e., the sequence from the original Wuhan SARS-CoV-2 virus).
  • the tiling libraries were designed by introducing three NNK codons at every position in the sequence stretch of interest (Class 2T: 484-505, Class 1T: 453-478, Class 3T: 440-452) (excluding positions that were fixed to a single amino acid in the combinatorial design, e.g., Class 2C positions: 486, 487, 489, 495, 496, 497, 500).
  • Every possible combination of three positions was designed, resulting in a total diversity of the Class 2T library of 1,533,035 unique sequences.
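  • The size of a tiling library can be estimated with a short counting sketch. Because each NNK codon can also encode the wild-type residue, designs for different position triples overlap, so the count of unique amino acid sequences is taken here as the number of sequences with at most three substitutions. The sketch assumes 19 alternative amino acids per position and ignores stop codons; the function name `tiling_diversity` is hypothetical. The number of variable positions is not stated in the text; 12 is used below only because it reproduces the stated total under these assumptions.

```python
from math import comb


def tiling_diversity(n_positions, max_mutations=3, alternatives=19):
    """Count unique sequences carrying at most `max_mutations` substitutions.

    Assumptions for this sketch: every variable position can take any of
    `alternatives` non-wild-type amino acids, and stop codons are ignored.
    """
    return sum(
        comb(n_positions, k) * alternatives ** k
        for k in range(max_mutations + 1)
    )


# 12 variable positions is an assumption, not a figure given in the text.
print(tiling_diversity(12))  # 1533035
```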
  • yeast display libraries were constructed by in vivo homologous recombination. Synthetic single-stranded oligonucleotides were designed to encode the desired library diversity. The oligonucleotides were amplified by PCR to produce double-stranded DNA with 30 bp of 5’ and 3’ homology to the yeast display vector pYD1. Yeast libraries were prepared using 1 µg each of plasmid and insert DNA per 300 µl of electrocompetent EBY100 cells.
  • Binding (ACE2+/FLAG+) and nonbinding (ACE2-/FLAG+) cells were sorted by FACS (BD FACSAria Fusion or Sony MA800 cytometer). Double-positive (i.e., binding) gates were drawn using wild-type RBD as a guide, and negative gates were drawn with anti-FLAG tag stained cells (FIG. 2 and Table 1). The collected cells were cultured in SD-UT medium for one to two days at 30 °C followed by repeating induction and sorting until the desired populations were pure.
  • the RBD libraries pre-sorted for ACE2-binding were cultured and induced, as described above.
  • the induced cells were washed once with DPBS wash buffer, followed by incubation with 100 nM monoclonal antibody, or antibody mixtures. In the case of antibody mixtures, 100 nM of each antibody was used.
  • the cells were resuspended in 5 ng/µl anti-human IgG-AlexaFluor647 and incubated for 30 minutes at 4 °C. After an additional wash, the cells were resuspended in 1 ng/µl anti-FLAG-PE and incubated for 30 minutes at 4 °C.
  • FIGS. 3A-C provide the resulting amino acid frequencies of the ACE-2 binding (ACE2+) and non-binding (ACE2-) populations of libraries 2C (FIG. 3A), 2CE (FIG. 3B), and 2T (FIG. 3C).
  • Table 1 Statistics for RBD Library Sorting by FACS for binding and non-binding to ACE-2.
  • Plasmid DNA encoding the RBD variants was isolated and the mutagenized regions of the RBD were amplified using custom oligonucleotides. Illumina Nextera barcode sequences were added in a second PCR amplification step, allowing for multiplexed high-throughput sequencing runs. The populations were pooled at the desired ratios and sequenced using Illumina 2 x 250 PE protocols.
  • the identified sequences were paired, quality trimmed, and assembled using Geneious and BBDuk, with a quality threshold of qphred 25.
  • Mutagenized regions of interest were extracted using custom Python scripts, followed by translation to amino acid sequences.
  • the sequences obtained from the 2C, 2CE, and 2T libraries were pre-processed separately before being combined into the final training set used for model training and evaluation (Table 2). To remove sequencing errors, all libraries were filtered for sequences complying with the initial amino acid site saturation mutagenesis scheme. Library 2CE was filtered for only those sequences retaining wild-type residues in positions 417/439, to focus on the 484-505 region. Next, the Class 2T library was filtered using a threshold of read counts > 4 and restricted to sequences with edit distance less than or equal to 3.
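  • The read-count and edit-distance filter applied to the Class 2T library can be sketched as follows. The thresholds (read counts > 4, edit distance less than or equal to 3) come from the text; reading "edit distance" as Levenshtein distance, the function names, and the toy sequences are assumptions made for illustration and are not the real RBD data.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def filter_tiling_reads(read_counts, wild_type, min_count=4, max_distance=3):
    """Keep sequences with read count > min_count and distance <= max_distance."""
    return {
        seq: count
        for seq, count in read_counts.items()
        if count > min_count and levenshtein(seq, wild_type) <= max_distance
    }


# Toy wild-type stretch and read counts (hypothetical values).
wild_type = "ACDEFG"
reads = {"ACDEFG": 10, "ACDEFA": 5, "ACDEFC": 4, "GGGGFG": 9, "ACDKKK": 7}
print(filter_tiling_reads(reads, wild_type))
```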
  • Table 2 Sequencing Statistics for RBD Library Sorting following Illumina deep sequencing. After filtering, all 3 libraries (Class 2, Class 2E, and Class 2T) were assigned a label of 1 (binding) or 0 (non-binding), prior to being combined to create a full dataset. Duplicate sequences in the full dataset as well as sequences found to be occurring across labels were removed. The remaining data was balanced such that equal numbers of positive and negative class sequences were present for each edit distance from wild-type. Class balancing was performed through random subsampling from the majority class at each edit distance equal to the counts from the minority class.
  • FIG. 6 illustrates a block diagram of an example system 600 to identify and predict coronavirus variants, in accordance with implementations.
  • the system 600 can include at least one data processing system 602.
  • the data processing system 602 can include one or more processors and memory.
  • the one or more processors can execute processor-executable instructions to perform the functions described herein.
  • the memory can store processor-executable instructions, generated data, and collected data.
  • the data processing system 602 can include at least one library preprocessor 604.
  • the data processing system 602 can include at least one dataset balancer 606.
  • the data processing system 602 can include at least one model generator 608.
  • the data processing system 602 can include at least one sequence classifier 610.
  • the data processing system 602 can include at least one variant predictor 612.
  • the data processing system 602 can include at least one data repository 616, which can store one or more datasets 618, training data 620, or models 622.
  • the data processing system 602 can include at least one logic device, such as the processors.
  • the data processing system 602 can include at least one memory element, which can store data and processor-executable instructions.
  • the data processing system 602 can include a plurality of computing resources or servers located in at least one data center.
  • the data processing system 602 can include multiple, logically-grouped servers and facilitate distributed computing techniques.
  • the logical group of servers may be referred to as a data center, server farm, or a machine farm.
  • the servers can also be geographically dispersed.
  • the data processing system 602 can be any computing device.
  • the data processing system 602 can be or can include one or more laptops, desktops, tablets, smartphones, portable computers, or any combination thereof.
  • the data processing system 602 can include one or more processors.
  • the processor can provide information processing capabilities to the data processing system 602.
  • the processor can include one or more of digital processors, analog processors, digital circuits to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information.
  • Each processor can include a plurality of processing units or processing cores.
  • the processor can be electrically coupled with the memory and can execute one or more components of the data processing system 602, including, for example, the library pre-processor 604, dataset balancer 606, model generator 608, sequence classifier 610, or variant predictor 612.
  • the data processing system 602 can include one or more microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or combinations thereof.
  • the processor can be an analog processor and can include one or more resistive networks.
  • the resistive network can include a plurality of inputs and a plurality of outputs. Each of the plurality of inputs and each of the plurality of outputs can be coupled with nanowires.
  • the nanowires of the inputs can be coupled with the nanowires of the outputs via memory elements.
  • the memory elements can include ReRAM, memristors, or PCM.
  • the processor as an analog processor, can use analog signals to perform matrix-vector multiplication.
  • the data processing system 602 can include one or more memories.
  • the memory can be or can include a memory element.
  • the memory can store machine instructions that, when executed by the processor of the data processing system 602 can cause the processor to perform one or more of the operations described herein.
  • the memory can include but is not limited to, electronic, optical, magnetic, or any other storage devices capable of providing the processor with instructions.
  • the memory can include a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, EEPROM, EPROM, flash memory, optical media, or any other suitable memory from which the processor can read instructions.
  • the instructions can include code from any suitable computer programming language such as, but not limited to, C, C++, C#, Java, JavaScript, Perl, HTML, XML, Python, and Visual Basic.
  • the data processing system 602 can communicate with one or more client computing devices 630 or remote data sources 632 via a network 601.
  • the network 601 can include any type of communication network, including, for example, the Internet, public network, or private network.
  • the network 601 can include one or more wired or wireless connections.
  • the client computing device 630 can include a laptop computer, desktop computer, tablet computer, or mobile computing device, for example.
  • Remote data sources 632 can include any data source accessible to the data processing system 602 via the network 601, for example. Remote data sources 632 can provide datasets 618, training data 620 or models 622, or data or information used by the data processing system 602 to establish, update or maintain one or more of the dataset 618, training data 620 or model 622.
  • the data processing system 602 can include a library pre-processor 604 designed, configured and operational to pre-process high-throughput sequencing data of RBD libraries.
  • the pre-processor 604 can retrieve, from the data repository 616, a data set 618 containing, for example, a combinatorial mutagenesis library.
  • the data set 618 can include, for example, a combinatorial mutagenesis library of the SARS-CoV-2 RBD.
  • the combinatorial mutagenesis library can include data covering the protein sequence space of possible variants for binding to ACE2.
  • the dataset 618 can be generated or based on deep mutational scanning (DMS) datasets.
  • the dataset 618 can be designed or constructed to focus on one or more amino acid positions.
  • the dataset 618 can be constructed to focus on amino acid positions of 484-505 (Class 2).
  • the dataset 618 can include one or more additional combinatorial libraries, such as combinatorial libraries for positions 453-478 (Class 1C) and 440-452 (Class 3C).
  • the data processing system 602 can transform the log enrichment from the DMS data into amino acid frequencies.
  • the library pre-processor 604 can discard negatively enriched amino acids if they are below a threshold.
  • the threshold can be a fixed threshold, or dynamic threshold.
  • the threshold can be stored in data repository 616.
  • the threshold can be based on a relative value, such as a percentage.
  • the threshold can be, for example, 0.
  • the pre-processor 604 can identify positively enriched amino acids based on the threshold. If the amino acids are above the threshold, the pre-processor 604 can assign them frequencies proportional to their enrichment above 0. Based on these frequencies, the pre-processor 604 can assign each position an amino acid site saturation mutant. The pre-processor 604 can make this assignment by determining the amino acid site saturation mutant that most closely maps to the observed frequencies (based on a positional mean squared error), and then selecting the codon that best represents the diversity of viable mutations in this position. For Class 2, this can result in an RBD combinatorial library design diversity of 14,978,815,488 sequences (Class 2C Library), for example. In addition to this library, a second library with the same design can be produced, in which NNK codons are included in positions 417 and 439 of the RBD (Class 2CE Library), yielding a diversity of 5.99 × 10^12.
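  • The mapping from target frequencies to a degenerate codon by positional mean squared error can be sketched as follows. This is an illustrative brute-force search over all three-letter IUPAC codons, assuming the standard genetic code and equal weighting of the codons within a degenerate codon; the function names are hypothetical, and the search is not presented as the procedure actually used by the pre-processor 604.

```python
from itertools import product

# Standard IUPAC nucleotide codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code with bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"


def codon_distribution(degenerate):
    """Amino acid frequencies encoded by a degenerate codon (equal codon weights)."""
    choices = [IUPAC[n] for n in degenerate]
    encoded = [CODON_TABLE["".join(c)] for c in product(*choices)]
    return {aa: encoded.count(aa) / len(encoded) for aa in ALPHABET}


def best_degenerate_codon(target):
    """Return the degenerate codon whose distribution has the lowest MSE to `target`.

    `target` maps amino acids to desired frequencies; missing entries count as 0.
    Ties are resolved by enumeration order (an arbitrary choice in this sketch).
    """
    def mse(codon):
        dist = codon_distribution(codon)
        return sum(
            (dist[aa] - target.get(aa, 0.0)) ** 2 for aa in ALPHABET
        ) / len(ALPHABET)

    candidates = ("".join(c) for c in product(IUPAC, repeat=3))
    return min(candidates, key=mse)


# Toy target for one position: equal parts phenylalanine and tyrosine.
codon = best_degenerate_codon({"F": 0.5, "Y": 0.5})
print(codon, codon_distribution(codon)["F"], codon_distribution(codon)["Y"])
```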
  • the pre-processor 604 can construct tiling libraries (Class 2T, Class 1T, Class 3T) to facilitate recovery of variants with a lower distance to the wild-type RBD. Since the probability of identifying low distance variants from the combinatorial library decreases as the diversity of the combinatorial library increases, the pre-processor 604 can generate the tiling library to improve the recovery of variants with the lower distance to wild-type RBD.
  • the data processing system 602 can design tiling libraries by, for example, introducing three NNK codons at every position in the sequence stretch of interest (Class 2T: 484-505, Class 1T: 453-478, Class 3T: 440-452) (excluding positions that were fixed to a single amino acid in the combinatorial design, e.g., Class 2C positions: 486, 487, 489, 495, 496, 497, 500).
  • One or more possible combinations of three positions can be designed, resulting in a total diversity of the Class 2T library of 1,533,035 sequences, for example.
  • the pre-processor 604 can process high-throughput sequencing data of RBD libraries. To do so, the pre-processor 604 can pair the sequencing reads, perform quality trimming, and assemble the reads using Geneious and BBDuk, with a quality threshold of qphred 25, for example.
  • the data processing system 602 can extract mutagenized regions of interest using custom Python scripts, followed by translation to amino acid sequences.
  • the data processing system 602 can separately pre-process sequences obtained from each of the three libraries (2C, 2CE, and 2T) before combining them into the final training set used for model training and evaluation (Table 2).
  • the pre-processor 604 can store the training set in training data 620 in data repository 616.
  • the pre-processor 604 can filter the libraries for sequences complying with the initial amino acid site saturation mutagenesis scheme. For example, the pre-processor can filter the Class 2CE Library for only those sequences retaining WT residues in positions 417/439, to focus on the 484-505 region.
  • the data processing system 602 can filter the Class 2T Library using a threshold of read counts > 4, for example, and restrict the library to sequences with edit distance less than or equal to 3, for example.
  • the data processing system 602 can generate a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise a first plurality of amino acid sequences, a second plurality of amino acid sequences, a third plurality of amino acid sequences, and a fourth plurality of amino acid sequences.
  • the combinatorial library of amino acid sequences can include the plurality of amino acid sequences including at least a portion of the viral spike protein of the coronavirus with an amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
  • the tiling library of amino acid sequences can include a plurality of amino acid sequences, wherein each amino acid sequence includes at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
  • the data processing system 602 can include a dataset balancer 606 designed, configured and operational to balance datasets to include an equal number of positive sequences and negative sequences per amino acid edit distance from a reference sequence.
  • the reference sequence can include, for example, SARS-CoV-2 Wuhan-1.
  • the dataset balancer 606 can balance the dataset to generate the final training data 620 used by the model generator 608.
  • the data processing system 602 can assign all 3 libraries (Class 2, Class 2E, and Class 2T) a label of 1 (binding) or 0 (non-binding), prior to being combined to create a full dataset.
  • the data processing system 602 can remove duplicate sequences in the full dataset as well as sequences found to be occurring across labels.
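  • The labelling and de-duplication steps above can be sketched as follows: binding sequences receive label 1, non-binding sequences receive label 0, duplicates collapse, and any sequence observed under both labels is dropped. The function name `build_labelled_dataset` and the toy sequences are hypothetical.

```python
def build_labelled_dataset(binding, non_binding):
    """Combine sorted populations into a {sequence: label} mapping.

    Label 1 marks binding and label 0 marks non-binding sequences. Duplicates
    are collapsed and sequences occurring under both labels are removed.
    """
    positives = set(binding)
    negatives = set(non_binding)
    ambiguous = positives & negatives  # sequences found across labels
    dataset = {seq: 1 for seq in positives - ambiguous}
    dataset.update({seq: 0 for seq in negatives - ambiguous})
    return dataset


# Toy populations (illustrative sequences only).
print(build_labelled_dataset(["AAA", "AAC", "AAA", "AAG"], ["AAG", "CCC"]))
```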
  • the dataset balancer 606 can balance the remaining data such that equal numbers of positive and negative class sequences are present for each edit distance from wild-type.
  • the dataset balancer 606 can balance positive and negative classes in order to remove model bias at low and high Levenshtein edit distances (LD) from wild-type RBD, thereby increasing the overall model performance.
  • the dataset balancer 606 can perform class balancing through random subsampling from the majority class at each edit distance equal to the counts from the minority class.
  • Random subsampling can refer to or include a Monte Carlo cross-validation, multiple holdout, or repeated evaluation set.
  • the random subsampling can be based on randomly splitting the data into subsets. The sizes of the subsets can be pre-defined or determined based on an amount of data in the data set.
  • the random partitioning of the data can be repeated one or more times.
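  • Class balancing per edit distance can be sketched as follows. Random subsampling is read here as plain downsampling of the majority class to the size of the minority class at each distance, which is one interpretation of the text. For brevity the sketch uses Hamming distance on equal-length sequences as a stand-in for the edit distance; the function names, the seed, and the toy data are assumptions.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def balance_by_distance(dataset, wild_type, seed=0):
    """Downsample the majority class at each distance from `wild_type`.

    `dataset` maps sequence -> label (1 binding, 0 non-binding). At every
    distance, both classes are reduced to the size of the smaller class.
    """
    rng = random.Random(seed)
    groups = defaultdict(lambda: {0: [], 1: []})
    for seq, label in dataset.items():
        groups[hamming(seq, wild_type)][label].append(seq)

    balanced = {}
    for dist in sorted(groups):
        n = min(len(groups[dist][0]), len(groups[dist][1]))
        for label in (0, 1):
            # Sorting first keeps the draw reproducible for a given seed.
            for seq in rng.sample(sorted(groups[dist][label]), n):
                balanced[seq] = label
    return balanced


# Toy data: three binders and one non-binder at distance 1, and so on.
toy = {"AAC": 1, "AAG": 1, "AAT": 1, "CAA": 0,
       "CCA": 1, "CAC": 0, "GGA": 0, "CCC": 1}
print(balance_by_distance(toy, "AAA"))
```

  • A distance at which only one class is observed contributes no sequences under this scheme, since the minority count is zero.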
  • the data processing system 602 can include a model generator 608 designed, configured and operational to use machine learning to generate or train a model 622 using training data 620.
  • the training data 620 can be established or provided by one or more of the library pre-processor 604 or dataset balancer 606.
  • the model generator 608 can include or be configured with one or more machine learning techniques.
  • the model generator 608 can be configured with or use any supervised machine learning model that is configured to perform classification on encoded biological sequence data.
  • the model generator 608 can be configured with machine learning classifier models built in Python, for example.
  • the model generator 608 can be configured with or use machine learning techniques that include, for example, random forest, neural networks, recurrent neural networks, or other machine learning techniques.
  • the model generator 608 can build a random forest model, or other benchmarking models, using an 80/20 train-test data split (random split).
  • the model generator 608 can be configured with long short-term memory (LSTM) recurrent neural network deep learning models. RBD sequences can be one-hot encoded and flattened into a 1-dimensional vector before input into the models.
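  • One-hot encoding followed by flattening can be sketched as below. The 20-letter amino acid alphabet, its ordering and the helper names `one_hot` and `flatten` are assumptions made for illustration; the source does not specify the encoding alphabet.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Return one row per residue with a single 1 at that residue's index."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


def flatten(matrix):
    """Concatenate the rows of a 2-dimensional list into a 1-dimensional list."""
    return [bit for row in matrix for bit in row]


encoded = one_hot("KNE")   # 3 rows x 20 columns
vector = flatten(encoded)  # 60 values, exactly three of them set to 1
```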
  • the data processing system 602 can perform hyperparameter optimization on all models by Random Search, with 5-fold cross validation, and scored on recall.
  • the model generator 608 can use best performing parameters based on the recall scores to train the final models.
  • the model generator 608 can evaluate the models on the basis of accuracy, F1 (e.g., a measure of a test's accuracy determined from the precision and recall of the test, where precision is the number of true positive results divided by the number of all results identified as positive, and recall is the number of true positive results divided by the number of all samples that should have been identified as positive), and Matthews correlation coefficient (“MCC”) values using the entire withheld test set. Additionally, the model generator 608 can evaluate the accuracy of the model using subsets of the unseen libraries separated by Levenshtein edit distance (LD) from wild-type RBD to determine prediction bias based on LD.
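  • The evaluation metrics named above can be computed from the confusion-matrix counts, as in the following sketch. The function name `scores` and the toy label vectors are illustrative assumptions; the formulas are the standard definitions of accuracy, precision, recall, F1 and MCC.

```python
import math


def scores(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Toy test set: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
result = scores(y_true, y_pred)
```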
  • Table 3 Illustrative example of Training and hyperparameter values for Random Forest (RF) and Recurrent neural network (RNN) models.
  • Table 3 illustrates various performance metrics and parameters generated or used by the model generator 608 to generate and tune the model 622 using the training data 620.
  • Table 3 illustrates the parameters for different machine learning techniques, such as random forest and recurrent neural network.
  • Table 4 Illustrative example of Training and testing parameters/scores of RF and RNN models using datasets with different Levenshtein edit distance (LD) values.
  • the model generator 608 can use various training and testing parameters or scores based on the random forest or recurrent neural network models.
  • the model generator 608 can use datasets with different LD values.
  • the data processing system 602 can include a sequence classifier 610 designed, configured and operational to perform in silico screening of the combinatorial space to classify combinatorial sequence variants as one of the four populations: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and
  • the sequence classifier 610 can use one or more models 622 generated by the model generator 608 in order to classify various sequences into one of the four populations, such as: 1) ACE-2+ are functional RBD variants that are able to infect human cells; 2) ACE-2- are non-functional RBD variants unable to infect human cells; 3) ACE-2+Ab+ are functional RBD variants that can be recognized (neutralized) by antibodies; or 4) ACE-2+Ab- are functional RBD variants that are not recognized (not neutralized) by antibodies, i.e., antibody escape mutants.
  • the sequence classifier 610 can determine, via the machine learning model 622, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
  • the first affinity binding score can indicate a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor.
  • the sequence classifier 610 can provide the classification to the variant predictor 612.
  • the data processing system 602 can evaluate the in silico generated sequences for probability of ACE2-binding by an ensemble of ML models (RF, RNN). For models that output a probability score rather than a binary class assignment, probabilities > 0.5 can be assigned as “positive” and probabilities ≤ 0.5 as “negative” binders. Sequences can be assigned binding or non-binding labels if the models show agreement, as illustrated in FIGS. 3A-3C.
  • the data processing system 602 can select sequences with highest probabilities of positive and negative binding for gene synthesis, cloning and experimental validation, as illustrated in FIG. 4 and FIGS. 5A-5D. This evaluation can be performed for the prediction of escape from a mixture of monoclonal neutralizing antibodies. Finally, each predicted variant can be required to have a probability of ACE2 binding above a threshold (e.g., p > 0.5) and a complete evolutionary path (i.e., all intermediate variants from wild-type are also predicted to bind ACE2).
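  • The requirement of a complete evolutionary path can be sketched as a search over orderings of the substitutions that separate a variant from wild-type. The interpretation used here is an assumption: a path is accepted if at least one ordering of single amino acid substitutions exists in which every intermediate, and the variant itself, is predicted to bind. The function `has_binding_path` and the callable `predicts_binding` (standing in for a trained model thresholded at p > 0.5) are hypothetical names.

```python
def has_binding_path(wild_type, variant, predicts_binding):
    """Return True if the variant is reachable from wild-type one
    substitution at a time through predicted binders only."""
    if len(wild_type) != len(variant):
        raise ValueError("this sketch handles substitutions only")
    diffs = [i for i, (a, b) in enumerate(zip(wild_type, variant)) if a != b]
    target = frozenset(diffs)

    def build(applied):
        # Sequence carrying the variant residue at every applied position.
        return "".join(
            variant[i] if i in applied else wild_type[i]
            for i in range(len(wild_type))
        )

    frontier = [frozenset()]  # start at wild-type (no substitutions applied)
    seen = {frozenset()}
    while frontier:
        applied = frontier.pop()
        if applied == target:
            return True
        for i in diffs:
            if i in applied:
                continue
            nxt = applied | {i}
            if nxt in seen:
                continue
            seen.add(nxt)
            if predicts_binding(build(nxt)):
                frontier.append(nxt)
    return False


# Toy predictor: a fixed set of sequences treated as predicted binders.
binders = {"BAA", "BBA", "BBB"}
path_exists = has_binding_path("AAA", "BBB", lambda s: s in binders)
no_path = has_binding_path("AAA", "BBB", lambda s: s == "BBB")
```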
  • the data processing system 602 can include a variant predictor 612 designed, configured and operational to predict or identify variants of coronavirus.
  • the variant predictor 612 can use the classifications generated by the sequence classifier 610 to predict or identify a variant that is likely to bind to the ACE-2 receptor but unlikely to bind to an antibody or protein that inhibits coronavirus from binding to the ACE-2 receptor.
  • the variant predictor 612 can further determine the likelihood that the variant may emerge in the wild. For example, the variant predictor 612 can generate a likelihood of variant emergence (LoVE) score.
  • the data processing system 602 can couple in silico sequence generation and classification with ML models 622 and phylogenetic analysis of existing variants to determine a score for the likelihood of variant emergence (LoVE).
  • the variant predictor 612 can determine the LoVE score as a weighted average of multiple dimensions:
  • the data processing system 602 can generate variant RBD sequences in silico using custom Python scripts, for selected Levenshtein edit distances (LD) away from wild-type RBD.
  • the LD can be defined on both the nucleotide and amino acid level, such that each generated nucleotide sequence was categorized by an LD pair (distance_nt, distance_aa).
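  • The LD pair (distance_nt, distance_aa) can be illustrated with the sketch below, which computes the Levenshtein distance between two coding sequences and between their translations. The helper names `levenshtein`, `translate` and `ld_pair` are assumptions; the codon table is the standard genetic code, and the two-codon example sequences are toy data rather than RBD sequences.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Standard genetic code, codons ordered by base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def ld_pair(wild_type_nt, variant_nt):
    """Return (distance_nt, distance_aa) for a variant coding sequence."""
    return (levenshtein(wild_type_nt, variant_nt),
            levenshtein(translate(wild_type_nt), translate(variant_nt)))


synonymous = ld_pair("AAGGAG", "AAAGAG")  # AAG -> AAA, both encode K
double = ld_pair("AAGGAG", "AATAAG")      # K -> N and E -> K
```

  • A synonymous change gives a nucleotide distance of 1 with an amino acid distance of 0, which is why the two levels are tracked separately.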
  • the data processing system 602 can determine the likelihood of variant emergence through mutation based on a BLOSUM-like matrix modified to take into account the mutations observed in naturally occurring variants as recorded in the GISAID database.
  • the variant predictor 612 can use a weighted computational model that incorporates metrics associated with evolutionary and phylogenetic information along with DML-predictions of ACE2-binding and neutralizing antibody escape. Starting with the sequence of the original Wuhan virus (wild-type) RBD, the variant predictor 612 can determine a high LoVE score for variants possessing the same combinatorial mutations present in gamma and beta VOC.
  • the data processing system 602 can also facilitate the performance of antagonistic coevolution to predict future variants of SARS-CoV-2.
  • RBD VOCs e.g., beta and gamma
  • DML DML to identify future RBD escape variants. This can continue for several cycles to eventually identify possible evolutionary trajectories of SARS-CoV-2 and prediction of future VOC.
  • the data processing system 602 can identify RBD variants capable of binding ACE2 that were highly homologous to other coronaviruses, representing mutations that can result in zoonotic spillover into the human population.
  • FIG. 7 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
  • the method 700 can be performed by one or more system or component depicted in FIG. 6, including, for example, a data processing system.
  • the data processing system can identify a combinatorial mutagenesis library of RBD.
  • the library can be designed, constructed or operational to focus on key positions of RBM, such as 417, 439, and 484-505.
  • the combinatorial mutagenesis library of the RBD/RBM of SARS-CoV-2 can be expressed on the surface of yeast.
  • the library design can be based on amino acid enrichment preferences of ACE-2 expression.
  • the estimated theoretical combinatorial diversity of this library can be 1 x 10^13.
  • the data processing system can clone the RBD library, which can be expressed in yeast, likely yielding only a physical library size of approximately 1 x 10^7 to 1 x 10^8. This can provide a significant reduction relative to the theoretical diversity.
  • the data processing system can screen four key populations.
  • the four key populations can be: 1) ACE-2+ are functional RBD variants that are able to infect human cells; 2) ACE-2- are non-functional RBD variants unable to infect human cells; 3) ACE-2+Ab+ are functional RBD variants that can be recognized (neutralized) by antibodies; and 4) ACE-2+Ab- are functional RBD variants that are not recognized (not neutralized) by antibodies, i.e., antibody escape mutants.
  • the data processing system can screen the yeast display combinatorial RBD library for binding to ACE-2 using flow cytometry. Cells that maintain or increase binding to ACE-2 can be selected and their encoding RBD genes can be deep sequenced at 706. ACE-2 non-binding yeast cells can also be sorted and deep sequenced at 706.
  • the ACE-2+ RBD fraction of the library can be screened for binding to either single monoclonal antibodies (e.g., REGN 10933, LYCOV16) or panels of monoclonal antibodies or polyclonal antibodies derived from serum of individuals vaccinated or previously exposed to SARS-CoV-2.
  • the data processing system can sort RBD variants in this library that do not bind (or show lower binding) to antibodies, and, at 706, perform deep sequencing on them. RBD variants that maintain binding to antibodies are also selected and deep sequenced at 706. Thus, at 706, deep sequencing data sets of the RBD combinatorial libraries will result in the four main populations of ACE-2+, ACE-2-, ACE-2+Ab+ and ACE-2+Ab-.
  • the data processing system can train and test deep neural networks using the deep sequencing data generated at 706.
  • the deep neural network can be trained to classify sequences into one of the four populations: ACE-2+, ACE-2-, ACE-2+Ab+ and ACE-2+Ab-.
  • the data processing system can perform sequence encoding and neural network construction to result in neural network classifiers that can predict, based on protein sequence, whether an RBD belongs to one of the four populations described above.
  • the data processing system can deploy deep learning models on the theoretical sequence space not interrogated by the physical yeast display library. Since the physical library may be at maximum 1 x 10^8 and the theoretical diversity is 1 x 10^13, only 0.001% of the theoretical combinatorial diversity may have been screened from yeast display.
  • the other 99.999% can be screened in silico (e.g., by the data processing system) by using deep or machine learning models to predict the probability each sequence belongs to one of the four populations.
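  • The screened and unscreened shares follow directly from the two library sizes quoted above; the short calculation below makes the arithmetic explicit. The variable names are illustrative only.

```python
physical_library_max = 10**8     # maximum physical yeast display library size
theoretical_diversity = 10**13   # estimated theoretical combinatorial diversity

# Share of the theoretical space covered by the physical library, in percent.
screened_percent = 100 * physical_library_max / theoretical_diversity
unscreened_percent = 100 - screened_percent

print(f"{screened_percent:.3f}% screened by yeast display")
print(f"{unscreened_percent:.3f}% left for in silico screening")
```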
  • the data processing system can predict future variants of SARS-CoV-2. Due to the large amount of diversity that may be present in each of the four populations, especially the sequences present in the ACE-2+Ab- population which are of most interest as they represent potential RBD variants that escape antibody recognition and neutralization, the data processing system can prioritize sequences. The data processing system can prioritize which sequences are likely to evolve naturally from the current SARS-CoV-2 RBD variant circulating in the global population. To do so, the data processing system can be configured with or execute an evolutionary likelihood model such as maximum likelihood, edit distance, or others. The data processing system can identify RBD sequences that score as the most likely to evolve in these models, and prioritize these sequences or otherwise indicate that these sequences are most likely to be a future variant of concern.
  • the data processing system can provide the future variant of concern for engineering antibodies to neutralize the variants of concern.
  • the data processing system can facilitate or otherwise allow for the generation of antibodies that neutralize the variants of concern using scaffold antibodies that can recognize the previous version of SARS-CoV-2.
  • the data processing system can use the new neutralizing antibodies to repeat the yeast display screening step 702 with the ACE-2+ RBD combinatorial library to identify RBD variants that do not bind them.
  • the process of deep or machine learning 708 and in silico screening 710 can then be iterated in another cycle to identify the next round of SARS-CoV-2 evolution and prediction of variants of concern. This process can continue with new engineered antibodies, thus advancing many rounds of evolution and prediction of variants of concern.
  • FIG. 8 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
  • the method 800 can be performed by one or more system or component depicted in FIG. 6, including, for example, a data processing system.
  • the data processing system can pre-process a library.
  • the data processing system can pre-process one or more libraries.
  • the data processing system can pre-process a 2C library, 2CE library and 2T library illustrated in FIG. 2.
  • the class 2C library can include 5 x 10^7 cells screened, of which 1.6 x 10^6 are ACE2-binding and 6.4 x 10^6 are non-binding; the class 2CE library can include 5 x 10^7 cells screened, of which 5.9 x 10^5 are ACE2-binding and 6.4 x 10^6 are non-binding; and the class 2T library can include 1.5 x 10^7 cells screened, of which 8.5 x 10^5 are ACE2-binding and 1.5 x 10^6 are non-binding.
  • the data processing system can pre-process the library to generate a training data set.
  • the data processing system can pre-process or filter the data using or based on one or more thresholds, distances, or labels, or by restricting the data to expected mutated positions only.
  • Pre-processing can include, for example, quality trimming using a quality threshold.
  • Pre-processing can include filtering the sequences.
  • the data processing system can filter or trim the data based on a number of occurrences of the sequence in the data, or based on expected position of variant.
  • the data processing system can pre-process the data to remove sequence errors, such as by filtering for sequences complying with an amino acid site saturation mutagenesis scheme.
  • the data processing system can filter the different classes using different techniques. For example, the data processing system can filter the class 2T library using a threshold of read counts > 4 and restricted to sequences with edit distance less than or equal to 3.
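  • The class 2T filter described above (read counts greater than 4, edit distance of at most 3) can be sketched as follows. The function name `filter_library`, the input layout (a mapping from sequence to read count) and the five-residue toy sequences are assumptions for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def filter_library(read_counts, wild_type, read_threshold=4, max_distance=3):
    """Keep sequences with more than `read_threshold` reads and an edit
    distance from wild-type of at most `max_distance`."""
    return {
        seq: count
        for seq, count in read_counts.items()
        if count > read_threshold
        and levenshtein(seq, wild_type) <= max_distance
    }


wild_type = "KNEFY"  # toy stand-in for the mutated RBD positions
read_counts = {
    "KNEFY": 10,  # wild-type, kept
    "KNQFY": 4,   # too few reads, dropped
    "RNQFY": 5,   # distance 2, kept
    "RAQWY": 30,  # distance 4, dropped
    "RAQWH": 20,  # distance 5, dropped
}
kept = filter_library(read_counts, wild_type)
```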
  • the data processing system upon filtering the one or more libraries or classes, can assign a label of binding or non-binding to the respective sequences.
  • the data processing system can balance the datasets such that an equal number of positive and negative class sequences are present for each edit distance from wild-type.
  • the data processing system can balance positive and negative classes to remove or reduce model bias at low and high Levenshtein edit distances (LD) from wild-type RBD, and increase overall model performance.
  • the data processing system can perform class balancing through random subsampling from the majority class at each edit distance equal to the counts from the minority class.
  • the raw sequence count of library 2C can be 2,904,729, and can be reduced to 637,565 after pre-processing at step 802, which can be further reduced to 435,488 after balancing at step 804.
  • the raw sequence count of library 2CE can be 4,026,982, and can be reduced to 18,796 after pre-processing at step 802, which can be further reduced to 12,196 after balancing at step 804.
  • the raw sequence count of library 2T can be 3,410,650, and can be reduced to 8,311 after pre-processing at step 802, which can be further reduced to 2,036 after balancing at step 804.
  • the data processing system can perform machine learning and deep learning model training using random forest, recurrent neural networks, long short-term memory networks, or other machine learning techniques.
  • the data processing system can perform one-hot encoding (e.g., a group of bits among which the possible or allowed combination of values are only those with a single high bit and all others are low bits) and flattening (e.g., converting the data into a 1-dimensional array) of the RBD sequences into a 1D vector, which can then be input into the models.
  • the data processing system can perform optimization, such as hyperparameter optimization on the models to identify the best parameters to train the models.
  • the data processing system can perform in silico phylogenetic RBD mutant generation.
  • the data processing system can receive a subset of an unseen library separated by a Levenshtein edit distance (“LD”) from the wild-type RBD by a certain amount to determine the prediction bias based on the LD.
  • the edit distance can be, for example, 3, 5, and 7.
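  • Prediction bias with respect to LD can be checked by computing accuracy separately for each edit distance, as in the following sketch. The record layout `(edit_distance, true_label, predicted_label)` and the function name `accuracy_by_distance` are assumptions, and the labels are toy values rather than results from the described models.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """Return {edit_distance: accuracy} for (distance, truth, prediction) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dist, truth, pred in records:
        totals[dist] += 1
        hits[dist] += int(truth == pred)
    return {dist: hits[dist] / totals[dist] for dist in sorted(totals)}


# Toy held-out predictions at edit distances 3, 5 and 7.
records = [(3, 1, 1), (3, 0, 0), (5, 1, 0), (5, 1, 1), (7, 0, 1)]
per_distance = accuracy_by_distance(records)
```

  • A marked drop in accuracy at particular distances would indicate the kind of LD-dependent bias that class balancing is intended to reduce.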
  • the data processing system can perform ensemble prediction and ranking of the in silico sequences.
  • the data processing system can select one or more sequences based on the ranking, such as the highest ranking sequences.
  • the data processing system can select 46 sequences that can include or be associated with variants of concern based on the trained models.
  • the selected sequences can correspond to those that are predicted to bind to the ACE-2 receptor and not bind to the one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor.
  • the data processing system can synthesize the selected variant of concern or mutant.
  • the data processing system can provide an indication or output the selected mutant to allow the mutant to be synthesized.
  • the data processing system can provide the sequence for the selected mutant.
  • the mutant can be selected from the highest ranking sequences from step 810, such as one of the 46 selected sequences.
  • the sequence can be experimentally validated for ACE2 binding and mAb escape (e.g., not bind to one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor).
  • FIG. 9 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
  • One or more system or component depicted in FIG. 6 can generate the accuracy score graphs 900, including, for example, the data processing system.
  • the accuracy scores can be for the models trained with different data sets, such as a full data set, an external third-party data set, or a data set based on LD 1-3.
  • the y-axis indicates the score, and the x-axis can indicate the distance from the wild type.
  • the accuracy score across the full range of distances from 1 to 10 is best for the models that are trained on the full data set (e.g., graph C).
  • the second-best accuracy scores across the range of LD 1-3 distances are those of model B, which is trained on the LD 1-3 data set.
  • FIG. 10 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
  • One or more system or component depicted in FIG. 6 can generate the accuracy score graph 1000, including, for example, the data processing system.
  • the graph 1000 illustrates the difference in accuracy scores between an optimized and an unoptimized random forest model. As indicated in graph 1000, the accuracy scores are higher across the LD distances 1-10 for the optimized RF model, thereby indicating improved ML performance and prediction of variants of concern in a more accurate and efficient manner using the data processing system of this technical solution.
  • FIG. 11 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
  • the process 1100 can be performed by one or more system or component depicted in FIG. 6, including, for example, a data processing system.
  • the data processing system can identify sequences or RBM mutants.
  • the data processing system can identify the distance of the RBM mutant from a predetermined or default sequence, such as the SARS-CoV-2 RBD.
  • the distance can refer to an LD.
  • the bar graph or histogram depicted at 1102 can graphically represent the number of sequences located at each distance.
  • the data processing system can select one or more sequences for further processing, or process each sequence.
  • the data processing system can, to reduce or manage computing resource consumption, select sequences based on their distance from the SARS-CoV-2 in an effort to predict binding properties for the sequences that are most similar to the SARS-CoV-2.
  • the data processing system can select sequences that have a greater LD in an effort to predict whether a sequence that is different from the SARS-CoV-2 is likely to be a variant of concern.
  • the data processing system can input the sequence into the prediction flow.
  • the data processing system can select a sequence and then perform one-hot encoding at 1106.
  • the data processing system can group the bits such that the possible or allowed combination of values are only those with a single high bit and all others are low bits.
  • the data processing system can generate a multidimensional array with bit values.
  • the data processing system can then utilize one or more machine learning techniques or models to output a prediction.
  • the data processing system can use a random forest decision tree or one or more LSTM-RNN neural network layers.
  • the data processing system can use multiple machine learning techniques.
  • the data processing system can proceed to 1108 to flatten the one-hot encoding array for input into a random forest decision tree.
  • the data processing system can flatten the one-hot encoding by converting the one-hot encoded multi-dimensional array of the RBD sequence into a 1-dimensional array (a 1D vector).
  • the data processing system can input the 1D vector into a decision tree at 1110.
  • the random forest decision tree can process the input 1D vector to provide an output prediction 1112 based on majority voting of the decision tree.
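  • Majority voting across the trees of a random forest can be sketched as below. The three "trees" are hypothetical single-feature rules chosen only to make the example runnable; a trained forest would learn its own splits. Reporting the fraction of positive votes as a probability is likewise an assumption of this sketch.

```python
def forest_predict(trees, vector):
    """Return (label, positive_vote_fraction) from a list of tree callables.

    Each tree maps a flattened one-hot vector to a vote of 0 or 1.
    The label is 1 only if a strict majority of trees vote 1.
    """
    votes = [tree(vector) for tree in trees]
    positive = sum(votes)
    label = 1 if 2 * positive > len(votes) else 0
    return label, positive / len(votes)


# Hypothetical decision stumps, each looking at one bit of the input vector.
trees = [
    lambda v: v[0],      # votes 1 if bit 0 is set
    lambda v: v[1],      # votes 1 if bit 1 is set
    lambda v: 1 - v[2],  # votes 1 if bit 2 is not set
]
label, vote_fraction = forest_predict(trees, [1, 0, 0])
```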
  • the output prediction 1112 can indicate a likelihood or probability that the sequence binds to ACE2 or is an mAb escape (e.g., does not bind to one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor).
  • the output prediction using random forest can be depicted in table 1120 for each sequence.
  • the data processing system can predict whether the sequence binds to ACE2 or is an mAb escape using a neural network.
  • the data processing system can input the one-hot encoding into one or more LSTM-RNN layers at 1114.
  • the data processing system when using a neural network, may not flatten the one-hot encoding first.
  • the data processing system can input the one-hot encoding array into the neural network at 1114.
  • the data processing system can use a neural network with one or more layers or one or more types of layers, such as one or more LSTM-RNN layers 1114, and one or more dense layers 1116.
  • the output of the one or more LSTM-RNN layers 1114 can be input into the dense layer 1116.
  • the dense layer 1116 can provide an output prediction 1118.
  • the output prediction 1118 can include a likelihood that a particular sequence binds to ACE2 or is an mAb escape.
  • the data processing system can identify the output predictions from the one or more machine learning techniques (e.g., output predictions 1112 and 1118). The data processing system can use the one or more predictions to generate a label for the sequence.
  • Table 1120 illustrates the various output predictions for each sequence using the one or more machine learning techniques, and application of a label based on the output predictions.
  • the data processing system can establish the table 1120, store the table 1120 in memory, or provide table 1120 for display via a graphical user interface, for example.
  • the table 1120 can include sequences 1122.
  • the sequences 1122 can correspond to the input sequence at step 1104.
  • the table 1120 can include an output prediction 1124 that indicates the likelihood or probability that the sequence 1122 binds to ACE2 as predicted using a neural network, such as the neural networks 1114 and 1116.
  • the column 1126 can indicate the probability that the sequence 1122 binds to ACE2 as predicted using a random forest decision tree 1110, for example.
  • the column 1128 can indicate the ACE2 label to assign to the sequence.
  • the data processing system can determine the label to apply based on a combination of the decision tree output prediction 1112, as indicated in column 1126, and the neural network output prediction 1118 as indicated in column 1124.
  • the data processing system can apply a positive ACE2 label to indicate that the sequence does bind, or is likely to bind, to ACE2 if both the ACE2 RNN prediction 1124 and ACE2 RF prediction 1126 are greater than 0.5, which indicates that both machine learning techniques predict that the sequence is more likely than not to bind to ACE2. If, however, one of the machine learning techniques has a probability greater than 0.5, but the other machine learning technique has a probability less than 0.5, then the data processing system can determine to apply a negative ACE2 label to the sequence.
  • if both the ACE2 RNN 1124 and the ACE2 RF 1126 are less than 0.5, then the data processing system can apply a negative ACE2 label to indicate that the sequence is unlikely to bind to ACE2.
  • the ACE2 label for a given sequence can be positive for binding if both models 1124 and 1126 predict a P > 0.5 and negative for binding if one or both models 1124 and 1126 have a P ≤ 0.5.
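  • The consensus labeling rule described above can be sketched as a small function: the label is positive only when both model probabilities exceed the threshold. The function name `consensus_label`, the toy sequence identifiers and the probability values are illustrative assumptions; the same rule could be applied to the mAb columns.

```python
def consensus_label(p_rnn, p_rf, threshold=0.5):
    """Return 1 only if both model probabilities exceed the threshold."""
    return 1 if (p_rnn > threshold and p_rf > threshold) else 0


# Toy rows shaped like table 1120: (sequence id, RNN probability, RF probability).
rows = [
    ("seq_a", 0.90, 0.80),  # both models agree on binding
    ("seq_b", 0.90, 0.40),  # models disagree
    ("seq_c", 0.20, 0.10),  # both models agree on non-binding
]
labels = {name: consensus_label(p_rnn, p_rf) for name, p_rnn, p_rf in rows}
```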
  • the data processing system can similarly use the output predictions 1112 and 1118 to predict mAb escape.
  • column 1130 indicates the probability of mAb escape using a neural network, such as the neural networks 1114 and 1116 and output prediction 1118.
  • Column 1132 indicates the probability of mAb escape using a decision tree, such as the random forest decision tree of 1108, 1110 and output prediction 1112.
  • the data processing system can apply a mAb label 1134 based on both the mAb RNN 1130 and the mAb RF 1132.
  • the data processing system can apply a positive mAb label 1134 if both the mAb RNN 1130 and the mAb RF 1132 for the sequence are greater than 0.5.
  • if only one of the mAb RNN 1130 and the mAb RF 1132 is greater than 0.5, the data processing system can apply a negative mAb label 1134. If both the mAb RNN 1130 and mAb RF 1132 are less than 0.5, then the data processing system can apply a negative mAb label 1134.
  • FIG. 12 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
  • the process 1200 can be performed by one or more system or component depicted in FIG. 6, including, for example, a data processing system.
  • the data processing system can generate RBM mutants in silico.
  • Generating RBM mutants in silico can include using one or more servers or computing devices having one or more processors coupled to memory that can execute program instructions or software programs that are configured to generate RBM mutants.
  • the data processing system can predict or determine the likelihood of whether the sequences generated at 1202 bind to ACE-2.
  • the data processing system can use one or more machine learning techniques, such as decision trees, random forest decision trees, neural networks, or recurrent neural networks to output a prediction or probability that a sequence generated at 1202 binds to ACE-2.
  • the data processing system can predict or determine, the probability or likelihood that the sequence escapes mAb using one or more machine learning techniques or models, such as decision trees or neural networks.
  • the data processing system can select one or more sequences that form a diverse lineage.
  • the data processing system can select 46 sequences that form a diverse lineage.
  • the data processing system can choose sequences corresponding to edit distances to wild-type of 3, 5, and 7.
  • the data processing system can choose sequences that represent lineages across distances.
  • lineages chosen can include sequences with observed mutations (484K/501Y, 484Q/501Y, 501Y) as well as lineages without such preconditions.
  • the data processing system can select sequences for distance 7 sequences to have diversity with respect to their prediction of antibody escape.
  • the data processing system can provide the selected sequences for synthesis. Synthesis can refer to or include creating the sequence in a laboratory by stringing together the corresponding nucleic acids to form the sequence. For example, the data processing system can select sequences with a positive ACE-2 label (e.g., as illustrated in column 1128 of table 1120 in FIG. 11) and a positive mAb label (e.g., as illustrated in column 1134 of table 1120 in FIG. 11).
  • the data processing system can facilitate validating or receive an indication of validation of the ACE-2 binding and mAb escape prediction. For example, an experiment can be conducted in a laboratory with the synthesized sequences to confirm or validate the predictions output by the data processing system, as depicted in FIG. 4.
  • mAbs monoclonal antibodies
  • LC light chain
  • VH variable heavy chain
  • HC heavy chain
  • Genes for SARS-CoV-2 RBDs were designed spanning from amino acid site Arg 319 to Lys 537, followed by a 6x poly-His tag and an AviTag for site-specific biotinylation.
  • a subset of RBD genes were cloned into pTWIST transient expression vectors by NotI and BamHI cloning.
  • suspension-adapted HEK293 cells were used for transient expression of both RBD and mAb proteins.
  • Cells were transfected using 1 µg of plasmid DNA per mL of expression culture (typically 25 mL) at a 1:1 ratio of LC to heavy chain (HC) vectors.
  • Supernatant was harvested at day 5-6 post transfection by spinning down the culture and sterile filtration.
  • RBD proteins were purified through HisTrap columns using a custom-built/3D-printed column loading, wash and elution pump system.
  • equilibration buffer containing 50 mM NaH2PO4, 300 mM NaCl at pH 8
  • wash buffer containing 50 mM NaH2PO4, 300 mM NaCl and 20 mM imidazole
  • elution buffer containing 50 mM NaH2PO4, 300 mM NaCl and 250 mM imidazole.
  • mAb proteins were purified using Protein G columns and processed using Protein G equilibration, wash, and Elution buffer, as well as neutralized with Tris-Buffer at a 3:1 ratio.
  • RBDs and mAb proteins were dialyzed against either PBS or phosphate buffer using 10 kDa SnakeSkin(TM) dialysis tubing and concentrated using either 30 kDa or 10 kDa Amicon Ultra centrifugal filter units.
  • Avi-tagged RBDs were biotinylated using a BirA biotin ligase kit.
  • Previously identified neutralizing mAbs e.g., clinically approved therapeutic antibodies
  • REGN10933 6XDG.B/C
  • REGN10987 6XDG.D/E
  • Ly-CoV-555 7KMG
  • Ly-CoV-16 7C01
  • sequences were truncated between the CDRH2 and CDRH3 and replaced by a stuffer containing Type 2S restriction sites (PaqCI).
  • the annealed oligos were mixed with the truncated mAb backbones to perform golden gate assembly by resuspending oligos at 10 µM.
  • the forward and reverse oligos were premixed and further diluted, resulting in a 1 µM working stock.
  • duplex oligos were imaged on a 2.5% agarose gel in order to check the purity, followed by a PCR cleanup.
  • Each purified DNA sample was then mixed with the corresponding truncated mAb backbone at a 2:1 molar ratio as well as PaqCI, T4 DNA ligase, T4 DNA Ligase buffer and PaqCI activator (conducted at 37°C, 10 minutes; 16°C, 1 minute → 37°C, 1 minute → 16°C, 1 minute x 60 cycles → 37°C, 5 min → 60°C, 5 min).
  • Assembled plasmids were then purified and concentrated by DNA cleanup, followed by dialysis using MCE membranes and electroporation into 50 µL (Dual-DMS) or 350 µL (isDMS) of freshly prepared Top10/DH5alpha competent cells.
  • HDR homology-directed repair
  • FIGS. 5A-5D provide the amino acid frequencies of the resulting ACE2+ and the mAb binding and escape with LY-CoV16 (FIG. 5A), LY-CoV555 (FIG. 5B), REGN10933 (FIG. 5C), and REGN10987 (FIG. 5D).
  • Genomic DNA was then isolated from hybridoma cells expressing antibodies binding to RBD using a PureLink gDNA isolation kit and amplified over two consecutive PCRs to add Illumina and TruSeq adapters (PCR1 was conducted with Q5, with the reaction composition as per the manufacturer’s instructions and an annealing temperature of 62.5°C; 6.25 µL gDNA template was used per 50 µL Q5 reaction mix, corresponding to a maximum diversity of 625,000). The number of reactions per sample was scaled to ten times the diversity observed post FACS to ensure sufficient oversampling.
  • the primers used for the PCR were adapted from Taft JM, Weber CR, Gao B, et al., (2022) Deep mutational learning predicts ACE2 binding and antibody escape to combinatorial mutations in the SARS-CoV2 receptor binding domain. Cell, 31 August 2022. They are chimeric sequences, one half of which targets the region of interest (in our case the RBD sequence), and the other half attaches adaptor sequences used for Illumina sequencing platforms.
  • PCR2 was performed and the resulting PCR fragments were gel excised and assayed on a Fragment Analyzer to check for quality. Pooled DNA was then run on an Illumina MiSeq (2 x 250 bp PE).
  • the terms “about” and “substantially” will be understood by persons of ordinary skill in the art and will vary to some extent depending upon the context in which they are used. If there are uses of a term which are not clear to persons of ordinary skill in the art given the context in which it is used, “about” will mean up to plus or minus 10% of the particular term.
  • degenerate codon refers to the set of nucleotides in an oligonucleotide sequence that code for at least one amino acid.
  • the degenerate codons used herein may code for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids.
  • amino acid refers to the twenty standard amino acids.
  • references to implementations or elements or acts of the systems and methods herein referred to in the singular may also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein may also embrace implementations including only a single element.
  • References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations.
  • References to any act or element being based on any information, act or element may include implementations where the act or element is based at least in part on any information, act, or element.
  • any implementation disclosed herein may be combined with any other implementation or embodiment, and references to “an implementation,” “some implementations,” “one implementation” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the implementation may be included in at least one implementation or embodiment. Such terms as used herein are not necessarily all referring to the same implementation. Any implementation may be combined with any other implementation, inclusively or exclusively, in any manner consistent with the aspects and implementations disclosed herein.
  • references to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms. For example, a reference to “at least one of ‘A’ and ‘B’” can include only ’A’, only ’B’, as well as both ’A’ and ’B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The present disclosure describes systems and methods for predicting coronavirus variants. In particular, the systems and methods may predict coronavirus variants that bind to the ACE-2 receptor, but do not bind to one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.

Description

IDENTIFYING AND PREDICTING FUTURE CORONAVIRUS VARIANTS
This application claims the benefit of priority of US provisional application 63/241,385, filed 7 September 2021, which is incorporated herein by reference.
BACKGROUND OF THE DISCLOSURE
The evolution of coronaviruses and coronavirus variants threatens to result in pandemics such as that caused by the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which emerged in 2019 and caused pandemic coronavirus disease 2019 (COVID-19), leading to millions of fatalities worldwide. Selection and emergence of coronavirus variants of concern (VOC) are driven in part by mutations within the viral spike protein and in particular, the receptor-binding domain (RBD), which binds to the human ACE-2 receptor and is a primary target site for neutralizing antibodies.
SUMMARY OF THE DISCLOSURE
Provided herein are systems and methods for prediction of coronavirus variants. In particular, the systems and methods may predict coronavirus variants that bind to the ACE-2 receptor, but do not bind to one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
A very brief summary of the work underlying the present invention is represented in the following points:
Millions of combinatorial SARS-CoV-2 RBD variants were screened by yeast surface display;
The results of these screens were used to train and test machine learning models to predict ACE2 binding and antibody escape;
This method was verified to enable identification of combinatorial mutations that can drive escape from multiple antibodies, and to enable assessment of the robustness of multiple antibodies to billions of prospective RBD variants.
The method presented herein goes far beyond simply assaying spike variants. The number of possible spike variants is almost unimaginably high (billions of billions), and cannot possibly be tested by in vitro or in vivo methods. The method of the invention generates a subset of these variants, and uses the resulting data to train machine learning models which can then make predictions across the entire set. Thus, the key concept is to train accurate machine learning models that can then be deployed for prediction and surveillance of emerging variants through the course of a pandemic. To the inventors’ knowledge, the effects of single mutations on the spike protein have been studied through deep mutational scanning, and the effects of combinatorial mutations in observed variants of interest or concern (VoI/VoC) have been studied, but the effects of all potential combinatorial mutations in the sequence space have yet to be studied by any other group.
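To give a sense of how quickly the sequence space grows, the following sketch counts the variants that carry exactly k substitutions. It assumes a window of 219 residues (Arg 319 to Lys 537, the RBD construct boundaries described in this specification) and 19 alternative residues per mutated site; the function name is illustrative and not part of the disclosed method.

```python
from math import comb

# Assumption: RBD window from Arg 319 to Lys 537, inclusive.
RBD_LENGTH = 537 - 319 + 1


def variants_at_distance(length, k, alternatives=19):
    """Count sequences that differ from the reference at exactly k sites.

    Choose which k sites are mutated, then one of `alternatives`
    replacement residues at each of those sites.
    """
    return comb(length, k) * alternatives ** k


for k in (1, 2, 7):
    print(f"exactly {k} substitutions: {variants_at_distance(RBD_LENGTH, k):.3e}")
```

The count rises by orders of magnitude with each additional substitution, which is why only a sampled subset can be assayed and a trained model is used for the remainder.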
In one aspect, the present technology provides a method that includes providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
In another aspect, the present technology provides a system that includes one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; train a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determine, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
In any embodiment, the first data set may include a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins (e.g., single-domain nanobody) that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
In any embodiment, the first affinity binding score indicates a likelihood that the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
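The train-then-score step described in this aspect can be sketched as follows. This is a minimal illustration only: a plain logistic regression on one-hot encoded sequences stands in for the machine learning model, which is not specified here, and the two-residue sequences and their labels are invented toy values, not experimental data.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector, 20 entries per position."""
    return [1.0 if aa == ref else 0.0 for aa in seq for ref in AMINO_ACIDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic regression by stochastic gradient descent."""
    n_features = len(one_hot(examples[0][0]))
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for seq, label in examples:
            x = one_hot(seq)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def score(model, seq):
    """Return the predicted probability (binding score) for a sequence."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, one_hot(seq))) + bias)


# Hypothetical toy data: two-residue sequences with invented labels (1 = binds).
toy_data = [("EN", 0), ("QN", 0), ("KN", 1), ("EY", 1), ("KY", 1), ("QY", 1)]
model = train(toy_data)
print(round(score(model, "KY"), 3), round(score(model, "EN"), 3))
```

A real application would train on the experimentally labeled variant sequences and would typically use a held-out test set to estimate accuracy before scoring proposed sequences.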
In any embodiment, the coronavirus may be a betacoronavirus including, but not limited to, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) virus or a variant thereof.
The present technology also provides a coronavirus variant or portion thereof having an amino acid sequence, wherein the amino acid sequence is generated by the methods or systems disclosed herein. In any embodiment, the coronavirus variant or portion thereof may bind to the ACE-2 receptor. In any embodiment, the coronavirus variant or portion thereof may not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
In another aspect, the present technology provides one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof. In any embodiment, the one or more antibodies and/or the one or more binding proteins prevent the coronavirus variant or portion thereof from binding to the ACE-2 receptor. The present technology also provides information on the design of a vaccine that induces production of the antibody and/or the binding protein.
The foregoing general description and the following description of the drawings and detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed. Other objects, advantages, and novel features will be readily apparent to those skilled in the art from the following brief description of the drawings and detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:
FIG. 1 illustrates an exemplary combinatorial library design.
FIG. 2 illustrates flow cytometry plots of gating schemes for selection of different RBD libraries after labeling with soluble human ACE-2 (y-axis) and antibody binding to RBD expression tag (anti-Flag-PE, x-axis), according to the examples.
FIGS. 3A-3C illustrate amino acid frequencies of the ACE-2 binding (ACE2+) and non-binding (ACE2-) of libraries 2C, 2CE, and 2T, respectively, according to the examples.
FIG. 4 illustrates one exemplary flow cytometry plot of a library of approximately 10^7 cells that were ACE-2 binding RBD and sorted for binding and escape from the monoclonal antibodies LY-CoV16, LY-CoV555, REGN10933, and REGN10987, according to the examples.
FIGS. 5A-5D illustrate amino acid frequencies of the monoclonal antibody binding and escape with LY-CoV16, LY-CoV555, REGN10933, and REGN10987, respectively, according to the examples.
FIG. 6 illustrates an example system for identifying and predicting coronavirus variants, in accordance with implementations.
FIG. 7 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
FIG. 8 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations.
FIG. 9 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
FIG. 10 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations.
FIG. 11 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
FIG. 12 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations.
DETAILED DESCRIPTION
The various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the described concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.
Comprehensive mapping of single-position amino acid substitutions in the RBD has revealed key mutations that enhance binding to ACE-2 and provide an escape from neutralizing antibodies. However, several coronavirus variants have emerged that possess multiple RBD mutations with even greater binding to ACE-2 and greater escape from neutralizing antibodies, such as beta (B.1.351, originating in South Africa), gamma (P.1, originating in Brazil), and delta (B.1.617.2, originating in India) variants that evolved from the original Wuhan SARS-CoV-2. The present technology provides a method for predicting coronavirus variants and use of these variants to produce therapeutic antibodies and vaccines.
In one aspect, the present technology provides a method that includes providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
The input sequences are variants that are part of the theoretical mutagenesis sequence space of the spike protein library. In certain embodiments, generation of the input sequences is informed by a deep mutational scan of the sequence to select the mutation sites with highest impact on receptor binding and antibody binding.
In another aspect, the present technology provides a system that includes one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; train a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; and determine, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant.
In any embodiment, the first data set may include a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins (e.g., single-domain nanobody) that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor.
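The four pluralities described above amount to a two-by-two partition over ACE-2 binding and antibody binding. The sketch below illustrates that partition; the record layout, the group names, and the variant identifiers are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict

# Map (binds ACE-2, binds antibody) to the plurality named in the text.
PLURALITY = {
    (True, True): "first",     # binds ACE-2, bound by antibody
    (False, True): "second",   # no ACE-2 binding, bound by antibody
    (True, False): "third",    # binds ACE-2, escapes antibody
    (False, False): "fourth",  # binds neither
}


def partition(records):
    """Group (sequence, binds_ace2, binds_antibody) records by plurality."""
    groups = defaultdict(list)
    for sequence, binds_ace2, binds_antibody in records:
        groups[PLURALITY[(binds_ace2, binds_antibody)]].append(sequence)
    return dict(groups)


# Hypothetical placeholder records, not experimental results.
records = [
    ("variant_A", True, True),
    ("variant_B", True, False),
    ("variant_C", False, False),
    ("variant_D", True, False),
]
groups = partition(records)
print(groups["third"])
```

The third plurality is the one of most interest for surveillance, since it holds variants that retain receptor binding while escaping the antibody.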
The terms “antibody” and its plural “antibodies” where used in the present specification encompass classical immunoglobulin antibodies and antibody-like proteins that are capable of specifically binding to a coronavirus spike protein.
The term specific binding in the context of the present invention refers to a property of antibodies and antibody-like proteins, which bind to their target with a certain affinity and target specificity. The affinity of such a ligand is indicated by the dissociation constant of the ligand. A specifically reactive ligand has a dissociation constant of < 10^-7 mol/L (particularly < 10^-9 mol/L) when binding to its target, but a dissociation constant at least three orders of magnitude higher in its interaction with a molecule having a globally similar chemical composition as the target, but a different three-dimensional structure.
In the context of the present specification, the term antibody refers to whole antibodies including but not limited to immunoglobulin type G (IgG), type A (IgA), type D (IgD), type E (IgE) or type M (IgM), any antigen-binding fragment or single chains thereof and related or derived constructs. A whole antibody is a glycoprotein comprising at least two heavy (H) chains and two light (L) chains inter-connected by disulfide bonds. Each heavy chain is comprised of a heavy chain variable region (VH) and a heavy chain constant region (CH). The heavy chain constant region of IgG is comprised of three domains, CH1, CH2 and CH3. Each light chain is comprised of a light chain variable region (abbreviated herein as VL) and a light chain constant region (CL). The light chain constant region is comprised of one domain, CL. The variable regions of the heavy and light chains contain a binding domain that interacts with an antigen. The constant regions of the antibodies may mediate the binding of the immunoglobulin to host tissues or factors, including various cells of the immune system (e.g., effector cells) and the first component of the classical complement system. Similarly, the term encompasses a so-called nanobody or single domain antibody, an antibody fragment consisting of a single monomeric variable antibody domain.
The term antibody-like molecule in the context of the present specification refers to a molecule capable of specific binding to another molecule or target with high affinity, i.e., a Kd < 10^-7 mol/L (particularly < 10^-9 mol/L). An antibody-like molecule binds to its target similarly to the specific binding of an antibody. The term antibody-like molecule encompasses a repeat protein, such as a designed ankyrin repeat protein (Molecular Partners, Zurich), an engineered antibody mimetic protein exhibiting highly specific and high-affinity target protein binding (see US2012142611, US2016250341, US2016075767 and US2015368302). The term antibody-like molecule further encompasses, but is not limited to, a polypeptide derived from armadillo repeat proteins, a polypeptide derived from leucine-rich repeat proteins and a polypeptide derived from tetratricopeptide repeat proteins. The term antibody-like molecule further encompasses a specifically binding polypeptide derived from a protein A domain, a fibronectin domain FN3, a consensus fibronectin domain, a lipocalin (see Skerra, Biochim. Biophys. Acta 2000, 1482(1-2):337-50), a polypeptide derived from a Zinc finger protein (see Kwan et al. Structure 2003, 11(7):803-813), a Src homology domain 2 (SH2) or Src homology domain 3 (SH3), a PDZ domain, a gamma-crystallin, ubiquitin, a cysteine knot polypeptide or a knottin, cystatin, Sac7d, a triple helix coiled coil (also known as alphabodies), a Kunitz domain or a Kunitz-type protease inhibitor and a carbohydrate binding module 32-2. The term antibody-like molecule further encompasses a humanized camelid antibody.
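The two-part criterion for specific binding given above can be written as a simple check. This is an illustrative sketch; the function name, its parameters, and the example dissociation constants are assumptions made for this example.

```python
def is_specific(kd_target, kd_off_target, max_kd=1e-7, min_fold=1e3):
    """Apply the specific-binding definition to two dissociation constants (mol/L).

    A ligand qualifies when its Kd for the target is below `max_kd` and its
    Kd for a globally similar off-target molecule is at least `min_fold`
    (three orders of magnitude) higher.
    """
    return kd_target < max_kd and kd_off_target >= kd_target * min_fold


# Hypothetical values: 1 nM on target, 10 uM off target.
print(is_specific(1e-9, 1e-5))
# Tight binder whose off-target Kd is only 100-fold higher.
print(is_specific(1e-9, 1e-7))
```

Passing `max_kd=1e-9` applies the stricter threshold that the text marks as particularly preferred.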
The term protein A domains derived polypeptide refers to a molecule that is a derivative of protein A and is capable of specifically binding the Fc region and the Fab region of immunoglobulins.
The term armadillo repeat protein refers to a polypeptide comprising at least one armadillo repeat, wherein an armadillo repeat is characterized by a pair of alpha helices that form a hairpin structure.
The term humanized camelid antibody in this context refers to an antibody consisting of only the heavy chain or the variable domain of the heavy chain (VHH domain) and whose amino acid sequence has been modified to increase its similarity to antibodies naturally produced in humans and which thus shows a reduced immunogenicity when administered to a human being. A general strategy to humanize camelid antibodies is shown in Vincke et al. “General strategy to humanize a camelid single-domain antibody and identification of a universal humanized nanobody scaffold”, J Biol Chem. 2009 Jan 30;284(5):3273-3284, and US2011165621 A1.
In certain embodiments, the term “antibody” or “antibodies” refers to classical immunoglobulins, particularly to IgG antibodies.
In any embodiment, the first affinity binding score indicates a likelihood that the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibit the coronavirus from binding to the ACE-2 receptor. In any embodiment, the coronavirus variant is likely to emerge through in vivo viral replication in a mammal.
In certain embodiments, the antibodies employed to inform the generation of the four groups, in other words to generate the data relating to binding or non-binding of the variant sequence to an antibody, can be selected from the therapeutic antibodies developed by pharmaceutical companies which were widely used during the acute phase of the pandemic. They include, but are not limited to, the group consisting of Casirivimab, Imdevimab, Etesevimab, and Bamlanivimab. All four of these were granted FDA emergency use authorizations and were shown to be effective on early SARS-CoV-2 variants. The data generated by the inventors shows independently that these therapeutic antibodies were susceptible to escape mutations in the spike, leading to decreased or lacking efficacy against variants which emerged after late 2021. The method outlined in this specification is able to assess the efficacy of each antibody against billions of possible spike variants.
In any embodiment, the first plurality of variant sequences may include at least 2 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 5 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 10 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 20 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 50 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 75 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 100 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 200 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 300 unique variant sequences. In any embodiment, the first plurality of variant sequences may include at least 500 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^10 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^9 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^8 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^7 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^6 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 1 x 10^5 unique variant sequences. In any embodiment, the first plurality of variant sequences may include no more than 500 unique variant sequences.
In any embodiment, the first plurality of amino acid sequences may include 1 to 1 x 10^10 unique amino acid sequences. In any embodiment, the first plurality of amino acid sequences may include 10 to 1 x 10^9 unique amino acid sequences. In any embodiment, the first plurality of amino acid sequences may include 50 to 1 x 10^8 unique amino acid sequences. In any embodiment, the first plurality of amino acid sequences may include 100 to 1 x 10^7 unique amino acid sequences. In any embodiment, the first plurality of amino acid sequences may include 350 to 1 x 10^6 unique amino acid sequences. In any embodiment, the second plurality of amino acid sequences may include 1 to 1 x 10^10 unique amino acid sequences. In any embodiment, the second plurality of amino acid sequences may include 10 to 1 x 10^9 unique amino acid sequences. In any embodiment, the second plurality of amino acid sequences may include 50 to 1 x 10^8 unique amino acid sequences. In any embodiment, the second plurality of amino acid sequences may include 100 to 1 x 10^7 unique amino acid sequences. In any embodiment, the second plurality of amino acid sequences may include 350 to 1 x 10^6 unique amino acid sequences. In any embodiment, the third plurality of amino acid sequences may include 1 to 1 x 10^10 unique amino acid sequences. In any embodiment, the third plurality of amino acid sequences may include 10 to 1 x 10^9 unique amino acid sequences. In any embodiment, the third plurality of amino acid sequences may include 50 to 1 x 10^8 unique amino acid sequences. In any embodiment, the third plurality of amino acid sequences may include 100 to 1 x 10^7 unique amino acid sequences. In any embodiment, the third plurality of amino acid sequences may include 350 to 1 x 10^6 unique amino acid sequences. In any embodiment, the fourth plurality of amino acid sequences may include 1 to 1 x 10^10 unique amino acid sequences. In any embodiment, the fourth plurality of amino acid sequences may include 10 to 1 x 10^9 unique amino acid sequences. In any embodiment, the fourth plurality of amino acid sequences may include 50 to 1 x 10^8 unique amino acid sequences. In any embodiment, the fourth plurality of amino acid sequences may include 100 to 1 x 10^7 unique amino acid sequences. In any embodiment, the fourth plurality of amino acid sequences may include 350 to 1 x 10^6 unique amino acid sequences.
In any embodiment, the coronavirus may be a betacoronavirus including, but not limited to, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) virus or a variant thereof.
In any embodiment, the first data set may include at least 100 unique amino acid sequences. In any embodiment, the first data set may include at least 500 unique amino acid sequences, at least 1000 unique amino acid sequences, at least 10,000 unique amino acid sequences, at least 50,000 unique amino acid sequences, at least 100,000 unique amino acid sequences, or at least 200,000 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^10 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^9 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^8 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^7 unique amino acid sequences. In any embodiment, the first data set may include no more than 1 x 10^6 unique amino acid sequences.
In any embodiment, the one or more antibodies and/or the one or more binding proteins is in human serum. In any embodiment, the one or more antibodies may include one or more monoclonal antibodies.
In any embodiment, the method and/or system may further include generating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise the first plurality of amino acid sequences, the second plurality of amino acid sequences, the third plurality of amino acid sequences, and the fourth plurality of amino acid sequences.
The combinatorial RBD library is physically screened for binding and non-binding to ACE-2 and to antibodies (AB); the binding and non-binding variants are collected by FACS and then deep sequenced. This yields an experimental dataset of RBD sequence variants annotated with their ACE-2 and AB binding properties. This experimental dataset is the training data used for machine learning (ML).
In any embodiment, the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences may be based on deep mutational scanning. For example, the design of the combinatorial and tiling library can be guided by deep mutational scanning data, or be generated via or based on deep mutational scanning data. In any embodiment, the deep mutational scanning may include generating a first library of variant sequences of the viral spike protein wherein each variant sequence is modified at a single amino acid position relative to the input amino acid sequence. In any embodiment, the first library may include variant sequences representing each amino acid position of the input amino acid sequence. In any embodiment, the first library may include variant sequences representing all 20 standard amino acids at each position of the input amino acid sequence. In any embodiment, the combinatorial library of amino acid sequences and the tiling library of amino acid sequences may be recombinantly expressed as individual amino acid sequences on a yeast cell surface.
A combinatorial library can be designed in many ways, but the design must ensure that experimental screening of the library yields sufficient data on all four groups. If, for example, one were to make the highest diversity combinatorial library possible (e.g., NNK across all positions), screening this library would produce a dataset populated almost exclusively with variants that do not bind to ACE2 or ABs, because too much diversity destroys the ability of the protein to maintain binding. The combinatorial library design is therefore a critical step before the experimental screening can begin.
In any embodiment, the combinatorial library of amino acid sequences may include a plurality of amino acid sequences, wherein the plurality of amino acid sequences includes at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis performed at one or more amino acid sites of the viral spike protein.
In any embodiment, the tiling library of amino acid sequences may include a plurality of amino acid sequences, wherein each amino acid sequence may include at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis performed at one or more amino acid sites of the viral spike protein.
In any embodiment, the amino acid site saturation mutagenesis may include a receptor-binding domain (RBD) of the viral spike protein. In any embodiment, the RBD may include amino acid sites 350 to 550, including, but not limited to amino acid sites 400 to 515, amino acid sites 417 to 505, amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452. In any embodiment, the amino acid site saturation mutagenesis may be in the RBD and one or more amino acid at sites 350 to 550, including, but not limited to amino acid sites 400 to 515, amino acid sites 417 to 505, amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452. In any embodiment, the amino acid site saturation mutagenesis may be in the RBD and one or more amino acid at sites 350 to 550, including, but not limited to amino acid sites 453 to 478, amino acid sites 484 to 505, and/or amino acid sites 439 to 452.
In any embodiment, at least 5 amino acid sites may have undergone saturation mutagenesis. In any embodiment, at least 10 amino acid sites may have undergone saturation mutagenesis. In any embodiment, at least 20 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 5 to 50 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 10 to 45 amino acid sites may have undergone saturation mutagenesis. In any embodiment, 20 to 40 amino acid sites may have undergone saturation mutagenesis.
In any embodiment, the amino acid site saturation mutagenesis may correspond to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to 1-20 amino acids including 1-5, 1-10, 1-15, 2-10, 2-15, 2-20, 3-10, 3-15, 3-20, 5-10, 5-15, 5-20, 7-10, 7-15, or 7-20 amino acids. In some embodiments, the amino acid may differ from the amino acid at that position in the wild type sequence such that the amino acid site saturation mutagenesis may correspond to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, or 19 amino acids (e.g., 1-5, 1-10, 1-15, 2-10, 2-15, 2-19, 3-10, 3-15, 3-19, 5-10, 5-15, 5-19, 7-10, 7-15, or 7-19 amino acids). In any embodiment, the amino acid site saturation mutagenesis may be present at one or more amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to one or more amino acid mutations that differ at the same position(s) in the wild type sequence. In any embodiment, the amino acid site saturation mutagenesis may be present at two or more amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to two or more amino acid mutations that differ at the positions in the wild type sequence. In any embodiment, the amino acid site saturation mutagenesis may be present at three or more amino acids. In any embodiment, the amino acid site saturation mutagenesis may correspond to three or more amino acid mutations that differ at that position in the wild type sequence. In some embodiments, the amino acid site saturation mutagenesis may be provided by a degenerate codon. In some embodiments, the amino acid site saturation mutagenesis may include at least two degenerate codons, including, but not limited to at least three degenerate codons, at least four degenerate codons, at least five degenerate codons, or at least six degenerate codons.
In some embodiments, the degenerate codon may include any known degenerate codon. In some embodiments, the degenerate codon may include any codon made up of a combination of degenerate nucleotides A, C, T, G, W, S, M, K, R, Y, B, D, H, V, N (e.g., NNK, NNS, NDT, DBK, GGN, TWY, WR, YWY, etc.), or a combination of two or more thereof.
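By way of illustration only, the relationship between a degenerate codon and the amino acids it can encode may be sketched as follows. The sketch assumes the standard IUPAC nucleotide ambiguity codes and the standard genetic code; the function names `expand_codon` and `encoded_amino_acids` are hypothetical and do not form part of the disclosed method.

```python
from itertools import product

# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code with codons enumerated in TCAG order ('*' = stop).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def expand_codon(degenerate):
    """Return every concrete codon covered by a three-letter degenerate codon."""
    if len(degenerate) != 3:
        raise ValueError("a codon must have exactly three nucleotides")
    return ["".join(c) for c in product(*(IUPAC[n] for n in degenerate.upper()))]


def encoded_amino_acids(degenerate):
    """Return the set of residues ('*' for stop) a degenerate codon can encode."""
    return {CODON_TABLE[c] for c in expand_codon(degenerate)}
```

Under these assumptions, NNK expands to 32 codons covering all 20 standard amino acids plus one stop codon (TAG), whereas NDT expands to 12 codons covering 12 amino acids with no stop codon.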
In any embodiment, the method and/or system may further include training the first machine learning model with the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences to predict the affinity binding scores for proposed amino acid sequences. In any embodiment, the first machine learning model may include, but is not limited to, at least one of a random forest or a recurrent neural network. In embodiments, the first machine learning model can include any supervised machine learning model that is configured to perform classification on encoded biological sequence data.
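The encoding of biological sequence data is not specified above. As a non-limiting sketch, one common choice is one-hot encoding, in which each residue is represented by a 20-element indicator vector; the use of one-hot encoding and the name `one_hot` are assumptions made here for illustration only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Flatten a sequence into a 0/1 vector of length 20 * len(sequence)."""
    vector = [0] * (len(sequence) * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence):
        # Raises KeyError for anything outside the 20 standard amino acids.
        vector[position * len(AMINO_ACIDS) + _INDEX[residue]] = 1
    return vector
```

Vectors of this form can be supplied to a supervised classifier such as a random forest; a recurrent neural network would typically consume the per-position vectors as an ordered series rather than as one flattened vector.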
In any embodiment, the method and/or system may further include generating a second data set that includes a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants. In any embodiment, the method may further include updating the first machine learning model with the second data set; and determining, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second-generation coronavirus variant.
In any embodiment, the method and/or system may further include generating, in silico, a plurality of proposed amino acid sequences of coronavirus variants; and inputting the plurality of proposed amino acid sequences into the first machine learning model to identify a corresponding plurality of affinity binding scores.
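How the proposed sequences are generated in silico is not limited to any particular procedure. A minimal sketch, assuming exhaustive enumeration of substitution variants at a chosen set of positions, is given below; the function name `propose_variants` and its parameters are hypothetical.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def propose_variants(reference, positions, max_mutations):
    """Yield every variant of `reference` carrying 1..max_mutations
    substitutions at the given 0-based positions."""
    for k in range(1, max_mutations + 1):
        for sites in combinations(positions, k):
            # At each chosen site, allow every residue except the reference one.
            choices = [
                [aa for aa in AMINO_ACIDS if aa != reference[i]] for i in sites
            ]
            for residues in product(*choices):
                variant = list(reference)
                for i, aa in zip(sites, residues):
                    variant[i] = aa
                yield "".join(variant)
```

Each yielded sequence can then be encoded and passed to the trained model to obtain its affinity binding score. The number of variants grows combinatorially with the number of positions and mutations, so in practice the enumeration would be restricted or sampled.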
In any embodiment, the method and/or system may further include validating the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the validation.
In any embodiment, the method and/or system may further include invalidating the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the invalidation. In any embodiment, the method and/or system may further include pre-processing the data set prior to training the first machine learning model, wherein pre-processing comprises at least one of: removing sequencing errors by filtering for sequences complying with initial amino acid site saturation mutagenesis; filtering the data set based on a number of read counts greater than a threshold and a distance from a wild type less than or equal to a second threshold; or removing duplicate sequences.
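The pre-processing steps above may be sketched as follows. This is an illustrative sketch only: it applies all three steps together, although the text requires only at least one of them; it assumes substitution-only variants of fixed length, so that the distance from the wild type reduces to a Hamming distance; and the names `preprocess`, `allowed`, `min_reads` and `max_distance` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def preprocess(reads, wild_type, allowed, min_reads, max_distance):
    """Return {sequence: read_count} for the sequences that pass all filters.

    reads:   iterable of amino acid sequences, one per sequencing read
    allowed: list of sets; allowed[i] holds the residues the library design
             permits at position i
    """
    counts = Counter(reads)  # collapsing reads also removes duplicates
    kept = {}
    for seq, n in counts.items():
        if len(seq) != len(wild_type):
            continue  # wrong length: cannot comply with the library design
        if any(res not in ok for res, ok in zip(seq, allowed)):
            continue  # residue outside the designed mutagenesis scheme
        if n <= min_reads:
            continue  # read count must be greater than the threshold
        if hamming(seq, wild_type) > max_distance:
            continue  # too far from the wild type
        kept[seq] = n
    return kept
```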
In any embodiment, the method and/or system may further include validating the first machine learning model using a second data set comprising a second plurality of variant sequences, each of the second plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein, wherein the second data set is different from the first data set and has not been used to train the first machine learning model.
In any embodiment, the method and/or system may further include determining a score based on predicted affinity binding score and any combination of one or more of nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant. In any embodiment, the method and/or system may further include determining a likelihood of variant emergence score for the coronavirus variant based on the proposed amino acid sequence based on: nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant.
In any embodiment, the method and/or system may further include determining the likelihood of variant emergence score for the coronavirus variant responsive to the first affinity binding score being greater than or equal to a threshold.
In any embodiment, the method may further include selecting, responsive to the likelihood of variant emergence score greater than or equal to a second threshold, the coronavirus variant for experimental validation.
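The two-threshold selection described in the preceding paragraphs may be sketched as below. The scoring functions themselves are deliberately left as caller-supplied callables, since no particular formula for the likelihood of variant emergence score is assumed here; the name `select_for_validation` and its parameters are hypothetical.

```python
def select_for_validation(candidates, binding_score, emergence_score,
                          binding_threshold, emergence_threshold):
    """Return (sequence, binding, emergence) tuples for selected candidates.

    binding_score:   callable mapping a sequence to its affinity binding score
    emergence_score: callable mapping (sequence, binding score) to a
                     likelihood of variant emergence score
    """
    selected = []
    for seq in candidates:
        b = binding_score(seq)
        if b < binding_threshold:
            continue  # emergence score is only determined above the first threshold
        e = emergence_score(seq, b)
        if e >= emergence_threshold:
            selected.append((seq, b, e))
    return selected
```

Gating on the affinity binding score first means the emergence score, which may be more costly to compute, is only evaluated for candidates the model already predicts to bind ACE-2 and escape the antibody.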
In any embodiment, the method and/or system may further include balancing the data set to include an equal number of positive sequences and negative sequences per amino acid edit distance from a reference sequence (e.g., SARS-CoV-2 Wuhan-1).
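One way to realise such balancing, given here as a sketch rather than as the disclosed procedure, is to group sequences by their distance from the reference and randomly downsample the larger class within each group. Downsampling, the use of Hamming distance for equal-length sequences, and the name `balance_by_distance` are assumptions.

```python
import random
from collections import defaultdict


def balance_by_distance(labelled, reference, seed=0):
    """Balance positives and negatives within each edit-distance group.

    labelled: list of (sequence, label) pairs, label 1 = positive (binding),
              label 0 = negative (non-binding)
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    buckets = defaultdict(lambda: {0: [], 1: []})
    for seq, label in labelled:
        distance = sum(a != b for a, b in zip(seq, reference))
        buckets[distance][label].append(seq)

    balanced = []
    for distance in sorted(buckets):
        positives, negatives = buckets[distance][1], buckets[distance][0]
        n = min(len(positives), len(negatives))  # size of the smaller class
        balanced += [(s, 1) for s in rng.sample(positives, n)]
        balanced += [(s, 0) for s in rng.sample(negatives, n)]
    return balanced
```

Distance groups that contain only one class contribute no sequences under this scheme, which is a consequence of the downsampling assumption.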
In any embodiment, the method and/or system may further include generating the one or more antibodies and/or the one or more binding proteins that bind the coronavirus variant.
In any embodiment, the method and/or system may further include generating a vaccine that induces production of one or more antibodies and/or one or more binding proteins, wherein the one or more antibodies and/or the one or more binding proteins bind the coronavirus variant. In any embodiment, the processor-executable instructions may generate a second data set comprising a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants. In any embodiment, the first machine learning model may be updated with the second data set; and determine, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second generation coronavirus variant.
The present technology also provides a coronavirus variant or portion thereof having an amino acid sequence, wherein the amino acid sequence is generated by the methods or systems disclosed herein. In any embodiment, the coronavirus variant or portion thereof may bind to the ACE-2 receptor. In any embodiment, the coronavirus variant or portion thereof may not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the coronavirus variant or portion thereof from binding to the ACE-2 receptor.
In another aspect, the present technology provides one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof. In any embodiment, the one or more antibodies and/or the one or more binding proteins prevent the coronavirus variant or portion thereof from binding to the ACE-2 receptor. In any embodiment, a cell may include the one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof. In any embodiment, the cell may be a mammalian cell, a bacterial cell, a yeast cell, an insect cell, or a eukaryotic cell.
The present technology also provides a vaccine that induces production of the one or more antibodies and/or one or more binding proteins that bind to the coronavirus variant or portion thereof.
In any embodiment, the ACE-2 receptor may be a mammalian ACE-2 receptor (e.g., a human ACE-2 receptor).
The invention further encompasses the following items. Reference to item N includes the alternative reference to item N a).
Item 1: A method, comprising: providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising data reflecting the ability of a first plurality of variant sequences of the input amino acid sequence to bind to: an ACE-2 receptor and to an antibody to the spike protein, wherein the antibody inhibits the coronavirus from binding to the ACE-2 receptor, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein relative to the input sequence; wherein the first data set comprises: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to said antibody; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to said antibody; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to said antibody; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to said antibody; training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant; wherein the first affinity binding score indicates a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to said antibody.
Item 1 a): A method, comprising: providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises:
- a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or to one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or
- a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or to the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or
- a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or to the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or
- a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or to the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor;
training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant; wherein the first affinity binding score indicates a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor.
Item 2. The method of item 1, wherein the coronavirus is a betacoronavirus.
3. The method of item 2, wherein the betacoronavirus is severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) virus or a variant thereof.
4. The method of any one of items 1-3, further comprising: generating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise the first plurality of amino acid sequences, the second plurality of amino acid sequences, the third plurality of amino acid sequences, and the fourth plurality of amino acid sequences; and training the first machine learning model with the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences to predict the affinity binding scores for proposed amino acid sequences.
Item 4 a): The method according to any one of items 1 to 3, wherein said first plurality of variant sequences is generated by creating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprises variants of the input amino acid sequence generated by amino acid site saturation mutagenesis at one or more amino acid sites in a receptor-binding domain (RBD) of the viral spike protein.
5. The method of item 4, wherein the combinatorial library of amino acid sequences comprises a plurality of amino acid sequences comprising at least a portion of the viral spike protein of the coronavirus with an amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
6. The method of any one of items 4-6, wherein the tiling library of amino acid sequences comprises a plurality of amino acid sequences, wherein each amino acid sequence comprises at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
7. The method of item 5 or item 6, wherein the amino acid site saturation mutagenesis comprises a receptor-binding domain (RBD) of the viral spike protein.
7 a) The method of item 5 or item 6, wherein the amino acid site saturation mutagenesis is effected on / conducted to vary a position of / a receptor-binding domain (RBD) of the viral spike protein.
8. The method of any one of items 5-7, wherein the amino acid site saturation mutagenesis comprises 1 to 20 amino acids.
9. The method of any one of items 5-8, wherein the amino acid site saturation mutagenesis comprises 1 to 19 amino acids, wherein the excluded amino acid is a wild type amino acid.
10. The method of any one of items 5-9, wherein the amino acid site saturation mutagenesis is present at one or more amino acids.
11. The method of any one of items 5-10, wherein the amino acid site saturation mutagenesis comprises a degenerate codon.
12. The method of any one of items 5-11, wherein the degenerate codon comprises any codon made up of a combination of degenerate nucleotides A, C, T, G, W, S, M, K, R, Y, B, D, H, V, N, or a combination of two or more thereof.
13. The method of any one of items 7-12, wherein the amino acid site saturation mutagenesis is in the RBD and one or more amino acid at sites 350 to 550.
14. The method of item 13, wherein at least 5 amino acid sites have the amino acid site saturation mutagenesis.
15. The method of item 13 or item 14, wherein 5 to 50 amino acid sites have the amino acid site saturation mutagenesis.
16. The method of any one of items 13-15, wherein at least 10 amino acid sites have the amino acid site saturation mutagenesis.
17. The method of any one of items 13-16, wherein 20 to 40 amino acid sites have the amino acid site saturation mutagenesis.
18. The method of any one of items 13-17, wherein the RBD comprises amino acid sites 350 to 550.
19. The method of any one of items 13-18, wherein the RBD comprises amino acid sites 400 to 515.
20. The method of any one of items 13-19, wherein the RBD comprises amino acid sites 417 to 505.
21. The method of any one of items 1-20, wherein the first data set comprises at least 100 unique amino acid sequences.
22. The method of any one of items 1-21, wherein the first data set comprises at least 1000 unique amino acid sequences.
23. The method of any one of items 1-22, wherein the first data set comprises at least 10,000 unique amino acid sequences.
24. The method of any one of items 1-23, wherein the first data set comprises at least 50,000 unique amino acid sequences.
25. The method of any one of items 1-24, wherein the first data set comprises at least 100,000 unique amino acid sequences.
26. The method of any one of items 1-25, wherein the first data set comprises at least 200,000 unique amino acid sequences.
27. The method of any one of items 1-26, wherein the one or more antibodies and/or the one or more binding proteins is in human serum.
28. The method of any one of items 1-28, wherein the one or more antibodies comprises one or more monoclonal antibodies.
29. The method of any one of items 4 to 20, wherein the combinatorial library of amino acid sequences and/or the tiling library of amino acid sequences is generated based on or via deep mutational scanning.
30. The method of item 29, wherein the deep mutational scanning comprises generating a first library of variant sequences of the viral spike protein wherein each variant sequence is modified at a single amino acid position relative to the input amino acid sequence.
31. The method of item 30, wherein the first library comprises variant sequences representing each amino acid position of the input amino acid sequence.
32. The method of item 30 or 31, wherein the first library comprises variant sequences representing all 20 standard amino acids at each position of the input amino acid sequence.
33. The method of any one of items 4-32, wherein the combinatorial library of amino acid sequences and the tiling library of amino acid sequences are recombinantly expressed as individual amino acid sequences on a yeast cell surface.
34. The method of any one of items 1-33, wherein the coronavirus variant is likely to emerge through in vivo viral replication in a mammal.
35. The method of any one of items 1-34 further comprising generating the one or more antibodies and/or the one or more binding proteins that bind the coronavirus variant.
36. The method of any one of items 1-35 further comprising generating a vaccine that induces production of one or more antibodies and/or one or more binding proteins, wherein the one or more antibodies and/or the one or more binding proteins bind the coronavirus variant.
37. The method of any one of items 1-36, wherein the ACE-2 receptor is a mammalian ACE-2 receptor.
38. The method of item 37, wherein the mammalian ACE-2 receptor is a human ACE-2 receptor.
39. The method of any one of items 1-38 further comprising: generating a second data set comprising a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants; updating the first machine learning model with the second data set; and determining, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second generation coronavirus variant.
40. The method of any one of items 1-39, wherein the first machine learning model comprises at least one of a random forest or a recurrent neural network.
41. The method of any one of items 1-40 further comprising: balancing the data set to include an equal number of positive (binding) sequences and negative (non-binding) sequences per amino acid edit distance from a reference sequence (e.g., SARS-CoV-2 Wuhan-1).
42. The method of any one of items 1-41 further comprising: generating, in silico, a plurality of proposed amino acid sequences of coronavirus variants; and inputting the plurality of proposed amino acid sequences into the first machine learning model to identify a corresponding plurality of affinity binding scores.
43. The method of any one of items 1-42 further comprising: experimentally validating whether the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the validation.
44. The method of any one of items 1-43 further comprising: experimentally invalidating whether the proposed amino acid sequence binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and updating the first machine learning model based on the invalidation.
45. The method of any one of items 1-44 further comprising: pre-processing the data set prior to training the first machine learning model, wherein pre-processing comprises at least one of: removing sequencing errors by filtering for sequences complying with initial amino acid site saturation mutagenesis; filtering the data set based on a number of read counts greater than a threshold and a distance from a wild type less than or equal to a second threshold; or removing duplicate sequences.
46. The method of any one of items 1-45 further comprising: validating the first machine learning model using a second data set comprising a second plurality of variant sequences, each of the second plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein, wherein the second data set is different from the first data set and has not been used to train the first machine learning model.
47. The method of any one of items 3-38, further comprising: determining a likelihood of variant emergence score for the coronavirus variant based on the proposed amino acid sequence, the predicted affinity binding score and one or more of: nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or likelihood of nucleotide mutations based on one or more data sets; and/or the first affinity binding score; and/or an existence of a viable evolutionary path to the coronavirus variant.
48. The method of item 47 further comprising: determining the likelihood of variant emergence score for the coronavirus variant responsive to the first affinity binding score being greater than or equal to a threshold.
49. The method of item 48 further comprising: selecting, responsive to the likelihood of variant emergence score greater than or equal to a second threshold, the coronavirus variant for experimental validation.
50. A system comprising one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or the one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; train a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; determine, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant; wherein the first affinity binding score indicates a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor.
51. The system of item 50, wherein the one or more processors further execute the processor-executable instructions to: generate a second data set comprising a second plurality of sequences, each of the second plurality of sequences comprising one of the identified coronavirus variants; update the first machine learning model with the second data set; and determine, via the first machine learning model updated with the second data set, a second affinity binding score for a second proposed amino acid sequence to identify a second generation coronavirus variant.
52. A coronavirus variant or portion thereof having an amino acid sequence, wherein the amino acid sequence is generated by the method of any one of items 1 to 49, or the system of item 50 or item 51.
53. The coronavirus variant or portion thereof of item 52, wherein the coronavirus variant or portion thereof binds to the ACE-2 receptor and/or the coronavirus variant or portion thereof does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit the protein or peptide from binding to the ACE-2 receptor.
54. The coronavirus variant or portion thereof of item 53, wherein the ACE-2 receptor is a mammalian ACE-2 receptor.
55. The coronavirus variant or portion thereof of item 54, wherein the mammalian ACE-2 receptor is a human ACE-2 receptor.
56. An antibody and/or a binding protein that binds to the coronavirus variant or portion thereof according to any one of items 52-55.
57. The antibody and/or the binding protein of item 56, wherein the antibody and/or the binding protein prevents the coronavirus variant or portion thereof according to any one of items 52-55 from binding to the ACE-2 receptor.
58. A vaccine that induces production of the antibody and/or the binding protein according to item 56 or item 57.
59. A cell comprising a protein or peptide according to any one of items 52-55.
60. The cell according to item 59, wherein the cell is a mammalian cell, a bacterial cell, a yeast cell, an insect cell, or a eukaryotic cell.
EXAMPLES
Example 1 : Discovery of coronavirus variants
Using deep mutational scanning (DMS) datasets, mutagenesis libraries of the RBD of SARS-CoV-2 were designed. First, combinatorial libraries were designed at amino acid positions 484-505 (Class 2) (FIG. 1), amino acid positions 453-478 (Class 1C), and amino acid positions 440-452 (Class 3C). To arrive at the combinatorial design, the log enrichment from the DMS data was transformed into amino acid frequencies. The negatively enriched amino acids were discarded (if below a threshold) and the positively enriched amino acids were given frequencies proportional to their enrichment above 0 in the DMS dataset. Based on these frequencies, each position was assigned an amino acid site saturation mutagenesis. The assignment was made by first determining the amino acid site saturation mutagenesis that most closely maps to the observed frequencies (based on a positional mean squared error), and then selecting the codon that best represents the diversity of viable mutations in the position. For Class 2, this resulted in an RBD combinatorial library design with a theoretical diversity of 14,978,815,488 unique sequences (Class 2C Library). In addition to this library, a second library with the same design was produced except NNK codons were included at positions 417 and 439 of the RBD, yielding a theoretical diversity of 5.99 x 10^12 unique sequences (Class 2CE Library).
In addition to the combinatorial libraries, tiling libraries (Class 2T, Class 1T, Class 3T) were constructed to enable recovery of variants with a lower distance to the wild-type RBD (i.e., the sequence from the original Wuhan SARS-CoV-2 virus). The tiling libraries were designed by introducing three NNK codons at every position in the sequence stretch of interest (Class 2T: 484-505, Class 1T: 453-478, Class 3T: 440-452) (excluding positions that were fixed to a single amino acid in the combinatorial design, e.g., Class 2C positions: 486, 487, 489, 495, 496, 497, 500). Every possible combination of three positions was designed, resulting in a total diversity of the Class 2T library of 1,533,035 unique sequences. The number of total sequences of a tiling library (i.e., the number of variants of up to a maximum edit distance k away from the wild-type sequence) was determined by the length of sequence (n, here = 14 non-fixed positions), the number of NNKs (or max edit distance) introduced (k, here = 3) and the size of the alphabet (a, here = 20):
$$N_{\le k} = \sum_{i=0}^{k} \binom{n}{i} (a-1)^{i}$$
Similarly, the number of sequences for a given edit distance k was given by:
$$N_{k} = \binom{n}{k} (a-1)^{k}$$
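The library-size arithmetic can be checked with a short script. This is a minimal sketch that assumes the count takes the standard form for substitution-only variants: choose which i of the n positions are mutated, then choose one of the (a - 1) non-wild-type residues at each. The function names are illustrative and do not come from the source.

```python
from math import comb


def sequences_at_distance(n, k, a=20):
    """Variants at exactly edit distance k: pick k of n positions,
    then one of (a - 1) non-wild-type residues at each (assumed form)."""
    return comb(n, k) * (a - 1) ** k


def sequences_up_to_distance(n, k, a=20):
    """Variants at edit distance 0..k, including the wild-type itself."""
    return sum(sequences_at_distance(n, i, a) for i in range(k + 1))


if __name__ == "__main__":
    for n in (12, 14):
        print(n, sequences_up_to_distance(n, 3))
```

Under this assumed form, with a = 20 and k = 3, the quoted diversity of 1,533,035 is obtained for n = 12 variable positions, whereas n = 14 gives 2,529,794.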
Next, yeast display libraries were constructed by in vivo homologous recombination. Synthetic single-stranded oligonucleotides were designed to encode the desired library diversity. The oligonucleotides were amplified by PCR to produce double-stranded DNA with 30 bp of 5’ and 3’ homology to the yeast display vector pYD1. Yeast libraries were prepared using 1 µg each of plasmid and insert DNA per 300 µl of electrocompetent EBY100 cells.
Surface expression of the library sequences was induced by growth of the yeast cells in SG-UT medium at 23 °C for 16-40 hours. Approximately 10^8 library cells were washed once with 1 mL wash buffer (DPBS + 0.5% BSA + 0.1% Tween + 2 mM EDTA) by centrifugation at 8000 x g for 30 s. The washed cells were stained with 50 nM biotinylated human ACE-2 for 30 minutes at 4 °C, followed by an additional wash. The cells were then stained with 2.5 ng/µl streptavidin-AlexaFluor 647 and 1 ng/µl anti-FLAG-PE for 30 minutes at 4 °C. Next, the cells were pelleted by centrifugation at 8000 x g for 30 s and kept on ice until sorting. Binding (ACE2+/FLAG+) and nonbinding (ACE2-/FLAG+) cells were sorted by FACS (BD FACSAria Fusion or Sony MA800 cytometer). Double-positive (i.e., binding) gates were drawn using wild-type RBD as a guide, and negative gates were drawn with anti-FLAG tag stained cells (FIG. 2 and Table 1). The collected cells were cultured in SD-UT medium for one to two days at 30 °C followed by repeating induction and sorting until the desired populations were pure.
The RBD libraries pre-sorted for ACE2-binding were cultured and induced, as described above. The induced cells were washed once with DPBS wash buffer, followed by incubation with 100 nM monoclonal antibody, or antibody mixtures. In the case of antibody mixtures, 100 nM of each antibody was used. Following an additional wash, the cells were resuspended in 5 ng/µl anti-human IgG-AlexaFluor 647 and incubated for 30 minutes at 4 °C. After an additional wash, the cells were resuspended in 1 ng/µl anti-FLAG-PE and incubated for 30 minutes at 4 °C. The cells were then pelleted by centrifugation at 8000 x g for 30 s and kept on ice until sorting. Cells expressing RBD that maintained antibody binding (IgG+/FLAG+) or showed complete loss of antibody binding (escape variants) (IgG-/FLAG+) were sorted by FACS, collected, and cultured in SD-UT medium for 16-40 hours at 30 °C. Induction and sorting were repeated for multiple rounds until the desired populations of RBD variants showed purity for binding and non-binding to antibodies (Table 1). FIGS. 3A-C provide the resulting amino acid frequencies of the ACE-2 binding (ACE2+) and non-binding (ACE2-) populations of libraries 2C (FIG. 3A), 2CE (FIG. 3B), and 2T (FIG. 3C).
Table 1 : Statistics for RBD Library Sorting by FACS for binding and non-binding to ACE-2.
Figure imgf000025_0001
Plasmid DNA encoding the RBD variants was isolated and the mutagenized regions of the RBD were amplified using custom oligonucleotides. Illumina Nextera barcode sequences were added in a second PCR amplification step, allowing for multiplexed high-throughput sequencing runs. The populations were pooled at the desired ratios and sequenced using Illumina 2 x 250 PE protocols.
The identified sequences were paired, quality trimmed, and assembled using Geneious and BBDuk, with a quality threshold of qphred 25. Mutagenized regions of interest were extracted using custom Python scripts, followed by translation to amino acid sequences. The sequences obtained from the 2C, 2CE, and 2T libraries were pre-processed separately before being combined into the final training set used for model training and evaluation (Table 2). To remove sequencing errors, all libraries were filtered for sequences complying with the initial amino acid site saturation mutagenesis scheme. Library 2CE was filtered for only those sequences retaining wild-type residues in positions 417/439, to focus on the 484-505 region. Next, the Class 2T library was filtered using a threshold of read counts > 4 and restricted to sequences with edit distance less than or equal to 3.
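The filtering steps above can be sketched as follows. This is an illustration under assumptions, not the custom scripts referred to in the text: sequences are assumed to be equal-length amino acid strings (so the edit distance reduces to a count of substituted positions), `allowed` is a hypothetical per-position set of residues permitted by the library design, and the function and parameter names are invented for the example. In the text, the read-count and edit-distance cut-offs are applied to the Class 2T library.

```python
def substitutions(seq, wild_type):
    """Number of positions at which seq differs from wild_type."""
    return sum(x != y for x, y in zip(seq, wild_type))


def filter_library(read_counts, wild_type, allowed=None,
                   min_count_exclusive=None, max_distance=None):
    """Keep sequences that match the design and pass the optional cut-offs.

    read_counts: dict mapping amino acid sequence -> number of reads.
    allowed: optional list of residue sets, one per position (hypothetical).
    """
    kept = {}
    for seq, count in read_counts.items():
        if len(seq) != len(wild_type):
            continue  # wrong length, treated here as a sequencing error
        if allowed is not None and any(
                res not in allowed[i] for i, res in enumerate(seq)):
            continue  # residue not part of the mutagenesis scheme
        if min_count_exclusive is not None and count <= min_count_exclusive:
            continue  # e.g. read counts > 4 are required
        if max_distance is not None and substitutions(seq, wild_type) > max_distance:
            continue  # e.g. edit distance <= 3 is required
        kept[seq] = count
    return kept
```

For a tiling library, a call such as `filter_library(counts, wild_type, min_count_exclusive=4, max_distance=3)` would mirror the thresholds quoted above.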
Table 2: Sequencing Statistics for RBD Library Sorting following Illumina deep sequencing.
Figure imgf000026_0001
Figure imgf000026_0002
After filtering, the sequences in all 3 libraries (Class 2C, Class 2CE, and Class 2T) were assigned a label of 1 (binding) or 0 (non-binding), prior to being combined to create a full dataset. Duplicate sequences in the full dataset as well as sequences found to be occurring across labels were removed. The remaining data was balanced such that equal numbers of positive and negative class sequences were present for each edit distance from wild-type. Class balancing was performed through random subsampling from the majority class at each edit distance equal to the counts from the minority class.
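The de-duplication and per-distance balancing can be sketched as below. The sketch assumes equal-length amino acid sequences (distance is counted as substituted positions) and uses a seeded random generator for reproducibility; the function name and the seed are illustrative choices, not details from the source.

```python
import random
from collections import defaultdict


def balance_by_distance(positives, negatives, wild_type, seed=0):
    """Return (sequence, label) pairs with equal class counts per distance."""
    pos, neg = set(positives), set(negatives)   # sets drop duplicates
    ambiguous = pos & neg                       # sequences seen under both labels
    pos -= ambiguous
    neg -= ambiguous

    def distance(seq):
        return sum(x != y for x, y in zip(seq, wild_type))

    groups = defaultdict(lambda: ([], []))      # distance -> (positives, negatives)
    for seq in sorted(pos):
        groups[distance(seq)][0].append(seq)
    for seq in sorted(neg):
        groups[distance(seq)][1].append(seq)

    rng = random.Random(seed)
    balanced = []
    for d in sorted(groups):
        p, n = groups[d]
        m = min(len(p), len(n))                 # size of the minority class
        balanced += [(s, 1) for s in rng.sample(p, m)]
        balanced += [(s, 0) for s in rng.sample(n, m)]
    return balanced
```

Subsampling the majority class down to the minority count discards data, which is the trade-off accepted here for removing the class imbalance at each distance.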
FIG. 6 illustrates a block diagram of an example system 600 to identify and predict coronavirus variants, in accordance with implementations. The system 600 can include at least one data processing system 602. The data processing system 602 can include one or more processors and memory. The one or more processors can execute processor-executable instructions to perform the functions described herein. The memory can store processor-executable instructions, generated data, and collected data. The data processing system 602 can include at least one library pre-processor 604. The data processing system 602 can include at least one dataset balancer 606. The data processing system 602 can include at least one model generator 608. The data processing system 602 can include at least one sequence classifier 610. The data processing system 602 can include at least one variant predictor 612. The data processing system 602 can include at least one data repository 616, which can store one or more datasets 618, training data 620, or models 622.
The data processing system 602 can include at least one logic device, such as the processors. The data processing system 602 can include at least one memory element, which can store data and processor-executable instructions. The data processing system 602 can include a plurality of computing resources or servers located in at least one data center. The data processing system 602 can include multiple, logically-grouped servers and facilitate distributed computing techniques. The logical group of servers may be referred to as a data center, server farm, or a machine farm. The servers can also be geographically dispersed. The data processing system 602 can be any computing device. For example, the data processing system 602 can be or can include one or more laptops, desktops, tablets, smartphones, portable computers, or any combination thereof.
The data processing system 602 can include one or more processors. The processor can provide information processing capabilities to the data processing system 602. The processor can include one or more of digital processors, analog processors, digital circuits to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information. Each processor can include a plurality of processing units or processing cores. The processor can be electrically coupled with the memory and can execute one or more components of the data processing system 602, including, for example, the library pre-processor 604, dataset balancer 606, model generator 608, sequence classifier 610, or variant predictor 612. The data processing system 602 can include one or more microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), or combinations thereof. The processor can be an analog processor and can include one or more resistive networks. The resistive network can include a plurality of inputs and a plurality of outputs. Each of the plurality of inputs and each of the plurality of outputs can be coupled with nanowires. The nanowires of the inputs can be coupled with the nanowires of the outputs via memory elements. The memory elements can include ReRAM, memristors, or PCM. The processor, as an analog processor, can use analog signals to perform matrix-vector multiplication.
The data processing system 602 can include one or more memories. The memory can be or can include a memory element. The memory can store machine instructions that, when executed by the processor of the data processing system 602, can cause the processor to perform one or more of the operations described herein. The memory can include, but is not limited to, electronic, optical, magnetic, or any other storage devices capable of providing the processor with instructions. The memory can include a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, EEPROM, EPROM, flash memory, optical media, or any other suitable memory from which the processor can read instructions. The instructions can include code from any suitable computer programming language such as, but not limited to, C, C++, C#, Java, JavaScript, Perl, HTML, XML, Python, and Visual Basic.
The data processing system 602 can communicate with one or more client computing devices 630 or remote data sources 632 via a network 601. The network 601 can include any type of communication network, including, for example, the Internet, public network, or private network. The network 601 can include one or more wired or wireless connections. The client computing device 630 can include a laptop computer, desktop computer, tablet computer, or mobile computing device, for example. Remote data sources 632 can include any data source accessible to the data processing system 602 via the network 601, for example. Remote data sources 632 can provide datasets 618, training data 620 or models 622, or data or information used by the data processing system 602 to establish, update or maintain one or more of the dataset 618, training data 620 or model 622.
The data processing system 602 can include a library pre-processor 604 designed, configured and operational to pre-process high-throughput sequencing data of RBD libraries. The pre-processor 604 can retrieve, from the data repository 616, a data set 618 containing, for example, a combinatorial mutagenesis library. The data set 618 can include, for example, a combinatorial mutagenesis library of the SARS-CoV-2 RBD. The combinatorial mutagenesis library can include data covering the protein sequence space of possible variants for binding to ACE2. The dataset 618 can be generated or based on deep mutational scanning (DMS) datasets.
The dataset 618 can be designed or constructed to focus on one or more amino acid positions. For example, the dataset 618 can be constructed to focus on amino acid positions of 484-505 (Class 2). The dataset 618 can include one or more additional combinatorial libraries, such as combinatorial libraries for positions 453-478 (Class 1C) and 440-452 (Class 3C). To arrive at the combinatorial design, the data processing system 602 can transform the log enrichment from the DMS data into amino acid frequencies. The library pre-processor 604 can discard negatively enriched amino acids if they are below a threshold. The threshold can be a fixed threshold, or dynamic threshold. The threshold can be stored in data repository 616. The threshold can be based on a relative value, such as a percentage. The threshold can be, for example, 0.
The pre-processor 604 can identify positively enriched amino acids based on the threshold. If the amino acids are above the threshold, the pre-processor 604 can assign them frequencies proportional to their enrichment above 0. Based on these frequencies, the pre-processor 604 can assign each position an amino acid site saturation mutant. The pre-processor 604 can make this assignment by determining the amino acid site saturation mutant that most closely maps to the observed frequencies (based on a positional mean squared error), and then selecting the codon that best represents the diversity of viable mutations in this position. For Class 2, this can result in an RBD combinatorial library design diversity of 14,978,815,488 sequences (Class 2C Library), for example. In addition to this library, a second library with the same design can be produced, in which NNK codons are included in positions 417 and 439 of the RBD (Class 2CE Library), yielding a diversity of 5.99 x 10^12.
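One way to picture the frequency-matching step is as a search over degenerate codons. The sketch below is an interpretation, not the design procedure of the source: it assumes that the "amino acid site saturation" scheme at a position can be represented by a single IUPAC degenerate codon, that unlisted amino acids and the stop codon have a target frequency of 0, and that the match is scored by mean squared error over the 20 amino acids plus stop. All function names are hypothetical, and the subsequent codon-selection step is not modelled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): AMINO[i]
               for i, c in enumerate(product(BASES, repeat=3))}
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}


def enrichment_to_frequencies(log_enrichment, threshold=0.0):
    """Drop residues at or below the threshold (and at or below 0);
    weight the rest in proportion to their enrichment above 0."""
    kept = {aa: e for aa, e in log_enrichment.items()
            if e > max(threshold, 0.0)}
    total = sum(kept.values())
    return {aa: e / total for aa, e in kept.items()}


def codon_distribution(degenerate_codon):
    """Amino acid frequencies encoded by one degenerate codon."""
    counts = {}
    for codon in product(*(IUPAC[s] for s in degenerate_codon)):
        aa = CODON_TABLE["".join(codon)]
        counts[aa] = counts.get(aa, 0) + 1
    n = sum(counts.values())
    return {aa: c / n for aa, c in counts.items()}


def closest_degenerate_codon(target):
    """Degenerate codon whose distribution has the lowest MSE to target."""
    symbols = sorted(set(AMINO))  # 20 amino acids plus '*'

    def mse(codon):
        dist = codon_distribution(codon)
        return sum((dist.get(s, 0.0) - target.get(s, 0.0)) ** 2
                   for s in symbols) / len(symbols)

    return min(("".join(c) for c in product(IUPAC, repeat=3)), key=mse)
```

For instance, `codon_distribution("NNK")` covers all 20 amino acids with a single stop codon among its 32 codons, which is why NNK is a common choice for full saturation at a position.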
In addition to the combinatorial libraries, the pre-processor 604 can construct tiling libraries (Class 2T, Class 1T, Class 3T) to facilitate recovery of variants with a lower distance to the wild-type RBD. Since the probability of identifying low distance variants from the combinatorial library decreases as the diversity of the combinatorial library increases, the pre-processor 604 can generate the tiling library to improve the recovery of variants with the lower distance to wild-type RBD. The data processing system 602 can design tiling libraries by, for example, introducing three NNK codons at every position in the sequence stretch of interest (Class 2T: 484-505, Class 1T: 453-478, Class 3T: 440-452) (excluding positions that were fixed to a single amino acid in the combinatorial design, e.g., Class 2C positions: 486, 487, 489, 495, 496, 497, 500). One or more possible combinations of three positions can be designed, resulting in a total diversity of the Class 2T library of 1,533,035 sequences, for example. The number of total sequences of a tiling library, e.g., the number of variants of up to a maximum edit distance k away from the wild-type sequence, can be determined by the length of sequence (n, here = 14 non-fixed positions), the number of NNKs (or max edit distance) introduced (k, here = 3) and the size of the alphabet (a, here = 20):
$$N_{\le k} = \sum_{i=0}^{k} \binom{n}{i} (a-1)^{i}$$
Similarly, the number of sequences for a given edit distance k is given by:
$$N_{k} = \binom{n}{k} (a-1)^{k}$$
The pre-processor 604 can pre-process high-throughput sequencing data of RBD libraries. To do so, the pre-processor 604 can pair the read sequences, perform quality trimming, and assemble the reads using Geneious and BBDuk, with a quality threshold of qphred 25, for example. The data processing system 602 can extract mutagenized regions of interest using custom Python scripts, followed by translation to amino acid sequences. The data processing system 602 can separately pre-process sequences obtained from each of the three libraries (2C, 2CE, and 2T) before combining them into the final training set used for model training and evaluation (Table 2). The pre-processor 604 can store the training set in training data 620 in data repository 616.
To remove sequencing errors, the pre-processor 604 can filter the libraries for sequences complying with the initial amino acid site saturation mutagenesis scheme. For example, the pre-processor 604 can filter the Class 2CE Library for only those sequences retaining WT residues in positions 417/439, to focus on the 484-505 region. The data processing system 602 can filter the Class 2T Library using a threshold of read counts > 4, for example, and restrict the library to sequences with an edit distance less than or equal to 3, for example.
Thus, the data processing system 602 can generate a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprise a first plurality of amino acid sequences, a second plurality of amino acid sequences, a third plurality of amino acid sequences, and a fourth plurality of amino acid sequences. The combinatorial library of amino acid sequences can include the plurality of amino acid sequences including at least a portion of the viral spike protein of the coronavirus with an amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein. The tiling library of amino acid sequences can include a plurality of amino acid sequences, wherein each amino acid sequence includes at least a portion of the viral spike protein of the coronavirus with amino acid site saturation mutagenesis at one or more amino acid sites of the viral spike protein.
The data processing system 602 can include a dataset balancer 606 designed, configured and operational to balance datasets to include an equal number of positive sequences and negative sequences per amino acid edit distance from a reference sequence. The reference sequence can include, for example, SARS-CoV-2 Wuhan-1. The dataset balancer 606 can balance the dataset to generate the final training data 620 used by the model generator 608.
For example, after filtering, the data processing system 602 can assign the sequences in all 3 libraries (Class 2C, Class 2CE, and Class 2T) a label of 1 (binding) or 0 (non-binding), prior to combining them to create a full dataset. The data processing system 602 can remove duplicate sequences in the full dataset as well as sequences found to be occurring across labels. The dataset balancer 606 can balance the remaining data such that equal numbers of positive and negative class sequences are present for each edit distance from wild-type. The dataset balancer 606 can balance positive and negative classes in order to remove model bias at low and high Levenshtein edit distances (LD) from wild-type RBD, thereby increasing the overall model performance. The dataset balancer 606 can perform class balancing through random subsampling from the majority class at each edit distance equal to the counts from the minority class. Random subsampling can refer to or include a Monte Carlo cross-validation, multiple holdout, or repeated evaluation set. The random subsampling can be based on randomly splitting the data into subsets. The sizes of the subsets can be pre-defined or determined based on an amount of data in the data set. The random partitioning of the data can be repeated one or more times.
The data processing system 602 can include a model generator 608 designed, configured and operational to use machine learning to generate or train a model 622 using training data 620. The training data 620 can be established or provided by one or more of the library pre-processor 604 or dataset balancer 606. The model generator 608 can include or be configured with one or more machine learning techniques. The model generator 608 can be configured with or use any supervised machine learning model that is configured to perform classification on encoded biological sequence data. The model generator 608 can be configured with machine learning classifier models built in Python, for example. The model generator 608 can be configured with or use machine learning techniques that include, for example, random forest, neural networks, recurrent neural networks, or other machine learning techniques.
For example, the model generator 608 can build a random forest model, or other benchmarking models, using an 80/20 train-test data split (random split). The model generator 608 can be configured with long short-term memory (LSTM) recurrent neural network deep learning models. RBD sequences can be one-hot encoded and flattened into a 1-dimensional vector before input into the models. The data processing system 602 can perform hyperparameter optimization on all models by Random Search, with 5-fold cross validation, and scored on recall. The model generator 608 can use the best performing parameters based on the recall scores to train the final models. After training, the model generator 608 can evaluate the models on the basis of Accuracy, F1 (e.g., a measure of a test’s accuracy determined from the precision and recall of the test, where precision is the number of true positive results divided by the number of all positive results; and recall is the number of true positive results divided by the number of all samples that should have been identified as positive), and Matthews correlation coefficient (“MCC”) values using the entire withheld test set. Additionally, the model generator 608 can evaluate the accuracy of the model using subsets of the unseen libraries separated by Levenshtein edit distance (LD) from wild-type RBD to determine prediction bias based on LD.
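The encoding and the evaluation metrics named above can be illustrated without any machine learning library. The following is a minimal sketch: the alphabet ordering, the function names, and the convention of returning 0.0 when a metric's denominator is zero are assumptions made for the example, not details from the source.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def one_hot_flat(sequence):
    """One-hot encode each residue and flatten into a 1-dimensional list."""
    vector = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1
        vector.extend(row)
    return vector


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}
```

A sequence of length L becomes a vector of length 20 x L, which is the flattened input form described for the models.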
Figure imgf000032_0001
Table 3: Illustrative example of Training and hyperparameter values for Random Forest (RF) and Recurrent neural network (RNN) models.
Table 3 illustrates various performance metrics and parameters generated or used by the model generator 608 to generate and tune the model 622 using the training data 620. For example, Table 3 illustrates the parameters for different machine learning techniques, such as random forest and recurrent neural network.
Figure imgf000032_0002
Figure imgf000033_0001
Table 4: Illustrative example of Training and testing parameters/scores of RF and RNN models using datasets with different Levenshtein edit distance (LD) values.
As illustrated in Table 4, the model generator 608 can use various training and testing parameters or scores based on the random forest or recurrent neural network models. The model generator 608 can use datasets with different LD values.
The data processing system 602 can include a sequence classifier 610 designed, configured and operational to perform in silico screening of the combinatorial space to classify combinatorial sequence variants as one of the four populations: a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor; and/or a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies and/or one or more binding proteins that inhibits the coronavirus from binding to the ACE-2 receptor.
To do so, the sequence classifier 610 can use one or more models 622 generated by the model generator 608 in order to classify various sequences into one of the four populations, such as: 1) ACE-2+ are functional RBD variants that are able to infect human cells; 2) ACE-2- are non-functional RBD variants unable to infect human cells; 3) ACE-2+Ab+ are functional RBD variants that can be recognized (neutralized) by antibodies; or 4) ACE-2+Ab- are functional RBD variants that are not recognized (not neutralized) by antibodies, i.e., antibody escape mutants. Thus, the sequence classifier 610 can determine, via the machine learning model 622, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant. The first affinity binding score can indicate a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or one or more binding proteins that inhibits coronavirus from binding to the ACE-2 receptor. Upon classification of the sequences using the model 622, the sequence classifier 610 can provide the classification to the variant predictor 612.
The data processing system 602 can evaluate the in silico generated sequences for probability of ACE2-binding by an ensemble of ML models (RF, RNN). For models that output a probability score rather than a binary class assignment, probabilities > 0.5 were assigned as “positive” and probabilities ≤ 0.5 as “negative” binders. Sequences were assigned binding or non-binding labels if the models showed agreement, as illustrated in FIGS. 3A-3C.
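The agreement rule can be written as a small helper. This sketch assumes each model contributes one probability per sequence and that a sequence without unanimous agreement is left unlabeled (returned as `None`); the function name and the treatment of disagreement are illustrative assumptions.

```python
def ensemble_label(probabilities, cutoff=0.5):
    """Return 1 if every model scores above the cutoff, 0 if none does,
    and None when the models disagree (or no scores are given)."""
    votes = {p > cutoff for p in probabilities}
    if len(votes) != 1:
        return None
    return 1 if votes.pop() else 0
```

For example, scores of 0.9 (RF) and 0.7 (RNN) yield a binding label, whereas 0.9 and 0.4 leave the sequence unlabeled.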
The data processing system 602 can select sequences with highest probabilities of positive and negative binding for gene synthesis, cloning and experimental validation, as illustrated in FIG. 4 and FIGS. 5A-5D. This evaluation can be performed for the prediction of escape from a mixture of monoclonal neutralizing antibodies. Finally, each variant predicted had to have a high probability above a threshold (e.g., p > 0.5) to be ACE2 binding and have the existence of a complete evolutionary path (all intermediate variants from wild-type required predicted ACE2 binding).
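The "complete evolutionary path" requirement can be illustrated with a breadth-first search over intermediate variants. The sketch makes several assumptions that are not stated in the source: mutations are single amino acid substitutions, a path only introduces residues present in the target variant (shortest paths), the wild-type itself is taken to bind, and `is_binder` is a hypothetical callable standing in for the model prediction.

```python
def has_viable_path(wild_type, variant, is_binder):
    """True if the variant can be reached from wild_type one substitution
    at a time such that every intermediate (and the variant) is a binder."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have equal length")
    diff = [i for i, (w, v) in enumerate(zip(wild_type, variant)) if w != v]
    frontier = {frozenset()}                 # sets of positions already mutated
    for _ in range(len(diff)):
        reachable = set()
        for applied in frontier:
            for pos in diff:
                if pos in applied:
                    continue
                candidate = applied | {pos}
                if candidate in reachable:
                    continue
                seq = "".join(variant[i] if i in candidate else wild_type[i]
                              for i in range(len(wild_type)))
                if is_binder(seq):
                    reachable.add(candidate)
        if not reachable:
            return False                     # every next step loses binding
        frontier = reachable
    return True
```

With `is_binder = lambda s: s in {"AAA", "BAA", "BCA"}`, the variant "BCA" is reachable from "AAA" via "BAA"; if "BAA" is removed from the set, no order of the two substitutions keeps binding and the function returns `False`.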
The data processing system 602 can include a variant predictor 612 designed, configured and operational to predict or identify variants of coronavirus. The variant predictor 612 can use the classifications generated by the sequence classifier 610 to predict or identify a variant that is likely to bind to the ACE-2 receptor but unlikely to bind to an antibody or protein that inhibits coronavirus from binding to the ACE-2 receptor. The variant predictor 612 can further determine the likelihood that the variant may emerge in the wild. For example, the variant predictor 612 can generate a likelihood of variant emergence (LoVE) score. To identify future variants of concern (VOC), the data processing system 602 can couple in silico sequence generation and classification with ML models 622 and phylogenetic analysis of existing variants to determine a score for the likelihood of variant emergence (LoVE). The variant predictor 612 can determine the LoVE score as a weighted average of multiple dimensions:
(i) The nucleotide and amino acid edit distance to the wild-type RBD
(ii) The likelihood of nucleotide mutations based on GISAID datasets
(iii) Probability of binding ACE2 as predicted by an ensemble of ML and DL models
(iv) Probability of escaping multiple neutralizing antibodies as predicted by an ensemble of ML and DL models
(v) The existence of a viable evolutionary path to each sequence.
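The weighted average over dimensions (i) to (v) can be sketched as follows. The source does not give the weights or the scaling of each dimension, so everything specific here is hypothetical: each dimension is assumed to be pre-scaled to the range 0 to 1 with higher values meaning emergence is more likely, and the dimension names and equal weights are placeholders.

```python
def love_score(components, weights):
    """Weighted average of per-dimension scores (each assumed in [0, 1])."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * components[name] for name in components)
    return weighted / total_weight


# Hypothetical inputs for one candidate variant.
example_components = {
    "edit_distance": 0.8,        # (i) closeness to wild-type RBD
    "mutation_likelihood": 0.6,  # (ii) likelihood of the nucleotide mutations
    "ace2_binding": 0.9,         # (iii) predicted probability of ACE2 binding
    "antibody_escape": 0.7,      # (iv) predicted probability of escape
    "viable_path": 1.0,          # (v) 1.0 if a viable path exists, else 0.0
}
example_weights = {name: 1.0 for name in example_components}
example_score = love_score(example_components, example_weights)
```

Changing the weights shifts the ranking between, for example, variants that are close to wild-type and variants with a higher predicted escape probability.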
For example, the data processing system 602 can generate variant RBD sequences in silico using custom Python scripts, for selected Levenshtein edit distances (LD) away from wild-type RBD. The LD can be defined on both the nucleotide and amino acid level, such that each generated nucleotide sequence was categorized by an LD pair (distance_nt, distance_aa). The data processing system 602 can determine the likelihood of variant emergence through mutation based on a BLOSUM-like matrix modified to take into account the mutations observed in naturally occurring variants as recorded in the GISAID database.
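The LD pair can be computed with a standard dynamic-programming Levenshtein distance. This is a minimal sketch; the function names are illustrative, and the amino acid sequences are passed in directly rather than translated inside the function.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between two strings (row-by-row DP)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def ld_pair(nt_variant, nt_wild_type, aa_variant, aa_wild_type):
    """(distance_nt, distance_aa) for one generated sequence."""
    return (levenshtein(nt_variant, nt_wild_type),
            levenshtein(aa_variant, aa_wild_type))
```

As a toy example, `ld_pair("GAAAAG", "GAGAAC", "EK", "EN")` returns `(2, 1)`: one of the two nucleotide changes is synonymous (GAG to GAA, both glutamate), so the distances differ at the two levels.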
In order to prioritize RBD variants with the potential of becoming VOC, the variant predictor 612 can use a weighted computational model that incorporates metrics associated with evolutionary and phylogenetic information along with DML-predictions of ACE2-binding and neutralizing antibody escape. Starting with the sequence of the original Wuhan virus (wild-type) RBD, the variant predictor 612 can determine a high LoVE score for variants possessing the same combinatorial mutations present in gamma and beta VOC. The data processing system 602 can also facilitate the performance of antagonistic coevolution to predict future variants of SARS-CoV-2. This can include, for example, engineering neutralizing antibodies that bind to RBD VOCs (e.g., beta and gamma) and then performing DML to identify future RBD escape variants. This can continue for several cycles to eventually identify possible evolutionary trajectories of SARS-CoV-2 and prediction of future VOC. The data processing system 602 can identify RBD variants capable of binding ACE2 that were highly homologous to other coronaviruses, representing mutations that can result in zoonotic spillover into the human population.
FIG. 7 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations. The method 700 can be performed by one or more systems or components depicted in FIG. 6, including, for example, a data processing system. At 702, the data processing system can identify a combinatorial mutagenesis library of RBD. The library can be designed, constructed or operational to focus on key positions of RBM, such as 417, 439, and 484-505. The combinatorial mutagenesis library of the RBD/RBM of SARS-CoV-2 can be expressed on the surface of yeast. The library design can be based on amino acid enrichment preferences of ACE-2 expression. If the primary sites selected for mutagenesis are 417, 439 and 484-505, the estimated theoretical combinatorial diversity of this library can be 1 x 10^13. The data processing system can clone the RBD library, which can be expressed in yeast, likely yielding a physical library size of only approximately 1 x 10^7 to 1 x 10^8. This can provide a significant reduction.
At 704, the data processing system can screen four key populations. The four key populations can be: 1) ACE-2+ are functional RBD variants that are able to infect human cells; 2) ACE-2- are non-functional RBD variants unable to infect human cells; 3) ACE-2+Ab+ are functional RBD variants that can be recognized (neutralized) by antibodies; and 4) ACE-2+Ab- are functional RBD variants that are not recognized (not neutralized) by antibodies, i.e., antibody escape mutants.
The data processing system can screen the yeast display combinatorial RBD library for binding to ACE-2 using flow cytometry. Cells that maintain or increase binding to ACE-2 can be selected and their encoding RBD genes can be deep sequenced at 706. ACE-2 non-binding yeast cells can also be sorted and deep sequenced at 706. The ACE-2+ RBD fraction of the library can be screened for binding to single monoclonal antibodies (e.g., REGN10933, LY-CoV16), panels of monoclonal antibodies, or polyclonal antibodies derived from serum of individuals vaccinated against or previously exposed to SARS-CoV-2. The data processing system can sort RBD variants in this library that do not bind (or show lower binding) to antibodies and, at 706, perform deep sequencing on them. RBD variants that maintain binding to antibodies can also be selected and deep sequenced at 706. Thus, at 706, deep sequencing of the RBD combinatorial libraries can yield data sets for the four main populations of ACE-2+, ACE-2-, ACE-2+Ab+ and ACE-2+Ab-.
At 708, the data processing system can train and test deep neural networks using the deep sequencing data generated at 706. The deep neural network can be trained to classify sequences into one of the four populations: ACE-2+, ACE-2-, ACE-2+Ab+ and ACE-2+Ab-. The data processing system can perform sequence encoding and neural network construction to produce neural network classifiers that can predict, based on protein sequence, whether an RBD belongs to one of the four populations described above. Following training, the data processing system can deploy deep learning models on the theoretical sequence space not interrogated by the physical yeast display library. Since the physical library may be at maximum 1 x 10^8 and the theoretical diversity is 1 x 10^13, only 0.001% of the theoretical combinatorial diversity may have been screened by yeast display. At 710, the other 99.999% can be screened in silico (e.g., by the data processing system) by using deep or machine learning models to predict the probability that each sequence belongs to one of the four populations.
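As a non-limiting illustration, the fraction of the theoretical sequence space covered by the physical library follows directly from the two library sizes stated above. The short Python check below uses only those two figures; the variable names are illustrative and do not come from the described system.

```python
# Figures taken from the description: upper bound on the physical yeast
# display library and the estimated theoretical combinatorial diversity.
physical_library_size = 10**8
theoretical_diversity = 10**13

# Share of the theoretical space that can be screened experimentally.
screened_percent = physical_library_size / theoretical_diversity * 100

# Share that remains for in silico screening by the trained models.
unscreened_percent = 100 - screened_percent

print(f"screened by yeast display: {screened_percent:.3f}%")
print(f"left for in silico screening: {unscreened_percent:.3f}%")
```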
At 712, the data processing system can predict future variants of SARS-CoV-2. Due to the large amount of diversity that may be present in each of the four populations, especially in the ACE-2+Ab- population, whose sequences are of most interest because they represent potential RBD variants that escape antibody recognition and neutralization, the data processing system can prioritize sequences. The data processing system can prioritize the sequences that are likely to evolve naturally from the SARS-CoV-2 RBD variant currently circulating in the global population. To do so, the data processing system can be configured with or execute an evolutionary likelihood model such as maximum likelihood, edit distance, or others. The data processing system can identify RBD sequences that score as the most likely to evolve in these models, and prioritize these sequences or otherwise indicate that these sequences are most likely to be a future variant of concern.
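As a non-limiting illustration of the edit distance option named above, the following Python sketch ranks candidate sequences by their Levenshtein distance from a circulating sequence, treating fewer edits as more likely to arise. The function names and the seven-residue strings are hypothetical placeholders, not sequences or code from the described system, and a real evolutionary likelihood model may weigh mutations quite differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def prioritize(candidates, circulating):
    """Sort candidates by distance to the circulating sequence (ties by name)."""
    return sorted(candidates, key=lambda seq: (levenshtein(seq, circulating), seq))


# Hypothetical toy strings standing in for RBD segments.
circulating = "KVEGFNQ"
candidates = ["NVKGFYQ", "KVKGFYQ", "KVKGFNQ"]
ranked = prioritize(candidates, circulating)
print(ranked)
```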
At 714, the data processing system can provide the future variant of concern for engineering antibodies to neutralize the variants of concern. The data processing system can facilitate or otherwise allow for the generation of antibodies that neutralize the variants of concern using scaffold antibodies that can recognize the previous version of SARS-CoV-2. The data processing system can use the new neutralizing antibodies to repeat the yeast display screening step 702 with the ACE-2+ RBD combinatorial library to identify RBD variants that do not bind them. The process of deep or machine learning 708 and in silico screening 710 can then be iterated in another cycle to identify the next round of SARS-CoV-2 evolution and prediction of variants of concern. This process can continue with new engineered antibodies, thus advancing many rounds of evolution and prediction of variants of concern.
FIG. 8 illustrates an example method for identifying and predicting coronavirus variants, in accordance with implementations. The method 800 can be performed by one or more systems or components depicted in FIG. 6, including, for example, a data processing system. At 802, the data processing system can pre-process one or more libraries. For example, the data processing system can pre-process a 2C library, a 2CE library and a 2T library illustrated in FIG. 2. In an illustrative example, the class 2C library can include 5 x 10^7 cells screened, of which 1.6 x 10^6 are ACE2-binding and 6.4 x 10^6 are non-binding; the class 2CE library can include 5 x 10^7 cells screened, of which 5.9 x 10^5 are ACE2-binding and 6.4 x 10^6 are non-binding; and the class 2T library can include 1.5 x 10^7 cells screened, of which 8.5 x 10^5 are ACE2-binding and 1.5 x 10^6 are non-binding.
The data processing system, at step 802, can pre-process the library to generate a training data set. The data processing system can pre-process or filter the data using or based on one or more thresholds, distances, or labels, or can restrict the data to expected mutated positions only. Pre-processing can include, for example, quality trimming using a quality threshold. Pre-processing can include filtering the sequences. For example, the data processing system can filter or trim the data based on the number of occurrences of a sequence in the data, or based on the expected position of a variant. The data processing system can pre-process the data to remove sequence errors, such as by filtering for sequences complying with an amino acid site saturation mutagenesis scheme. The data processing system can filter the different classes using different techniques. For example, the data processing system can filter the class 2T library using a threshold of read counts > 4 and restrict it to sequences with an edit distance less than or equal to 3.
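As a non-limiting illustration of the class 2T filter described above, the Python sketch below keeps only records with a read count greater than 4 and an edit distance from wild type of at most 3. The `Record` type, its field names and the toy records are assumptions made for this example; the edit distance is assumed to have been computed beforehand.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical container for one deduplicated sequencing result."""
    sequence: str
    read_count: int
    edit_distance: int  # distance from wild-type RBD, assumed precomputed


def filter_records(records, min_reads_exclusive=4, max_edit_distance=3):
    """Keep records with read_count > 4 and edit_distance <= 3 (defaults)."""
    return [
        r for r in records
        if r.read_count > min_reads_exclusive
        and r.edit_distance <= max_edit_distance
    ]


# Toy records; the sequences are placeholders, not real RBD variants.
records = [
    Record("KVKGFNQ", read_count=5, edit_distance=1),   # kept
    Record("KVKGFYQ", read_count=4, edit_distance=2),   # too few reads
    Record("NAKGHYW", read_count=9, edit_distance=5),   # too far from wild type
]
kept = filter_records(records)
print([r.sequence for r in kept])
```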
The data processing system, upon filtering the one or more libraries or classes, can assign a label of binding or non-binding to the respective sequences. At 804, the data processing system can balance the datasets such that an equal number of positive and negative class sequences are present for each edit distance from wild-type. The data processing system can balance positive and negative classes to remove or reduce model bias at low and high Levenshtein edit distances (LD) from wild-type RBD, and increase overall model performance. The data processing system can perform class balancing through random subsampling from the majority class at each edit distance equal to the counts from the minority class.
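As a non-limiting illustration of the balancing step described above, the Python sketch below groups labeled sequences by edit distance and randomly subsamples the majority class at each distance down to the size of the minority class. The record layout (sequence, label, edit distance), the function name and the fixed seed are assumptions made for this example.

```python
import random
from collections import defaultdict


def balance_by_distance(records, seed=0):
    """Balance binding (1) and non-binding (0) classes at each edit distance.

    records: iterable of (sequence, label, edit_distance) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    groups = defaultdict(lambda: {0: [], 1: []})
    for sequence, label, distance in records:
        groups[distance][label].append(sequence)

    balanced = []
    for distance in sorted(groups):
        negatives, positives = groups[distance][0], groups[distance][1]
        n = min(len(negatives), len(positives))  # minority class size
        for label, sequences in ((0, negatives), (1, positives)):
            for sequence in rng.sample(sequences, n):
                balanced.append((sequence, label, distance))
    return balanced


# Toy data: 3 binders and 1 non-binder at distance 1; 2 and 5 at distance 2.
toy = (
    [(f"pos1_{i}", 1, 1) for i in range(3)] + [("neg1_0", 0, 1)]
    + [(f"pos2_{i}", 1, 2) for i in range(2)]
    + [(f"neg2_{i}", 0, 2) for i in range(5)]
)
balanced = balance_by_distance(toy)
print(len(balanced))
```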
For example, the raw sequence count of library 2C can be 2,904,729, and can be reduced to 637,565 after pre-processing at step 802, which can be further reduced to 435,488 after balancing at step 804. In another example, the raw sequence count of library 2CE can be 4,026,982, and can be reduced to 18,796 after pre-processing at step 802, which can be further reduced to 12,196 after balancing at step 804. In another example, the raw sequence count of library 2T can be 3,410,650, and can be reduced to 8,311 after pre-processing at step 802, which can be further reduced to 2,036 after balancing at step 804.
At step 806, the data processing system can perform machine learning and deep learning model training using random forest, recurrent neural networks, long short-term memory networks, or other machine learning techniques. To do so, the data processing system can perform one-hot encoding (e.g., a group of bits among which the possible or allowed combinations of values are only those with a single high bit and all others low) and flattening (e.g., converting the data into a 1-dimensional array) of the RBD sequences into a 1D vector, which can then be input into the models.
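As a non-limiting illustration of the encoding described above, the Python sketch below one-hot encodes an amino acid sequence over the twenty standard amino acids and flattens the result into a 1D vector. The alphabet ordering and function names are assumptions made for this example; any fixed ordering would serve equally well.

```python
# Alphabet order is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return one row per residue; each row has a single high bit."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix


def flatten(matrix):
    """Concatenate the rows into one 1D vector (length = residues x 20)."""
    return [bit for row in matrix for bit in row]


encoded = one_hot("KN")      # 2 x 20 array
vector = flatten(encoded)    # 40-element vector
print(len(encoded), len(encoded[0]), len(vector))
```

The 2D array can be passed to sequence models such as recurrent layers, while the flattened vector suits models that expect a fixed-length feature list, such as a random forest.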
The data processing system can perform optimization, such as hyperparameter optimization, on the models to identify the best parameters with which to train the models. At step 808, the data processing system can perform in silico phylogenetic RBD mutant generation. For example, the data processing system can receive a subset of an unseen library whose sequences are separated from the wild-type RBD by a given Levenshtein edit distance (“LD”), in order to determine the prediction bias as a function of LD. The edit distance can be, for example, 3, 5, and 7.
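As a non-limiting illustration of generating mutants in silico at chosen distances, the Python sketch below substitutes a fixed number of randomly chosen positions in a starting sequence. This is a simplified stand-in for the phylogenetic generation referred to above: it controls the number of substituted positions (the Hamming distance), which is an upper bound on the Levenshtein distance for equal-length sequences. The function names and the twelve-residue starting string are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_mutant(wild_type, n_substitutions, rng):
    """Substitute exactly n_substitutions distinct positions with new residues."""
    positions = rng.sample(range(len(wild_type)), n_substitutions)
    residues = list(wild_type)
    for position in positions:
        # Exclude the original residue so every chosen site really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != wild_type[position]]
        residues[position] = rng.choice(choices)
    return "".join(residues)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
wild_type = "ACDEFGHIKLMN"  # placeholder, not a real RBD segment
mutants = {k: random_mutant(wild_type, k, rng) for k in (3, 5, 7)}
for k, mutant in mutants.items():
    print(k, mutant, hamming(wild_type, mutant))
```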
At 810, the data processing system can perform ensemble prediction and ranking of the in silico sequences. The data processing system can select one or more sequences based on the ranking, such as the highest ranking sequences. For example, the data processing system can select 46 sequences that can include or be associated with variants of concern based on the trained models. The selected sequences can correspond to those that are predicted to bind to the ACE-2 receptor and not bind to the one or more antibodies and/or one or more binding proteins that inhibit coronavirus from binding to the ACE-2 receptor.
At 812, the data processing system can synthesize the selected variant of concern or mutant. The data processing system can provide an indication or output the selected mutant to allow the mutant to be synthesized. For example, the data processing system can provide the sequence for the selected mutant. The mutant can be selected from the highest ranking sequences from step 810, such as one of the 46 selected sequences. At 814, the sequence can be experimentally validated for ACE2 binding and mAb escape (e.g., not binding to one or more antibodies and/or one or more binding proteins that inhibit coronavirus from binding to the ACE-2 receptor).
FIG. 9 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations. One or more systems or components depicted in FIG. 6 can generate the accuracy score graphs 900, including, for example, the data processing system. The accuracy scores can be for the models trained with different data sets, such as a full data set, an external third-party data set, or a data set based on LD 1-3. The y-axis indicates the score, and the x-axis can indicate the distance from the wild type. As indicated in FIG. 9, the accuracy score across the full range of distances from 1 to 10 is best for the models that are trained on the full data set (e.g., graph C). Across the LD 1-3 range, the second-best accuracy scores are those of model B, which is trained on the LD 1-3 data set.
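As a non-limiting illustration of how accuracy can be broken down by distance from wild type, as plotted in FIG. 9, the Python sketch below computes one accuracy value per edit distance from labeled predictions. The record layout (edit distance, true label, predicted label), the function name and the toy values are assumptions made for this example and do not reproduce any reported result.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """Return {edit_distance: accuracy} for (distance, truth, prediction) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for distance, truth, prediction in records:
        total[distance] += 1
        correct[distance] += int(truth == prediction)
    return {d: correct[d] / total[d] for d in sorted(total)}


# Toy held-out predictions: three sequences at LD 1, one at LD 2.
toy_records = [(1, 1, 1), (1, 0, 0), (1, 1, 0), (2, 1, 1)]
scores = accuracy_by_distance(toy_records)
print(scores)
```

Comparing such per-distance scores between models trained on different data sets can show whether a model degrades at distances it rarely saw during training.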
FIG. 10 illustrates accuracy scores for models trained to identify and predict coronavirus variants, in accordance with implementations. One or more systems or components depicted in FIG. 6 can generate the accuracy score graph 1000, including, for example, the data processing system. The graph 1000 illustrates the difference in accuracy scores between an optimized random forest (RF) model and an unoptimized random forest model. As indicated in graph 1000, the accuracy scores are higher across LD 1-10 for the optimized RF model, thereby indicating improved ML performance and prediction of variants of concern in a more accurate and efficient manner using the data processing system of this technical solution.
FIG. 11 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations. The process 1100 can be performed by one or more systems or components depicted in FIG. 6, including, for example, a data processing system. At 1102, the data processing system can identify sequences or RBM mutants. The data processing system can identify the distance of the RBM mutant from a predetermined or default sequence, such as the SARS-CoV-2 RBD. The distance can refer to an LD. The bar graph or histogram depicted at 1102 can graphically represent the number of sequences located at each distance. The data processing system can select one or more sequences for further processing, or process each sequence. The data processing system can, to reduce or manage computing resource consumption, select sequences based on their distance from the SARS-CoV-2 RBD in an effort to predict binding properties for the sequences that are most similar to the SARS-CoV-2 RBD. In another example, the data processing system can select sequences that have a greater LD in an effort to predict whether a sequence that is different from the SARS-CoV-2 RBD is likely to be a variant of concern.
At 1104, the data processing system can input the sequence into the prediction flow. The data processing system can select a sequence and then perform one-hot encoding at 1106. For example, the data processing system can group the bits such that the possible or allowed combination of values are only those with a single high bit and all others are low bits.
When generating the one-hot encoding at 1106, the data processing system can generate a multidimensional array with bit values. The data processing system can then utilize one or more machine learning techniques or models to output a prediction. For example, the data processing system can use a random forest decision tree or LSTM-RNN neural network layers. The data processing system can use multiple machine learning techniques.
For example, the data processing system can proceed to 1108 to flatten the one-hot encoding array for input into a random forest decision tree. At 1108, the data processing system can flatten the one-hot encoding by converting the one-hot encoded multi-dimensional array of an RBD sequence into a 1-dimensional array (a 1D vector). The data processing system can input the 1D vector into a decision tree at 1110. The random forest decision tree can process the input 1D vector to provide an output prediction 1112 based on majority voting of the decision trees. The output prediction 1112 can indicate a likelihood or probability that the sequence binds to ACE2 or is an mAb escape (e.g., does not bind to one or more antibodies and/or one or more binding proteins that inhibit coronavirus from binding to the ACE-2 receptor). The output prediction using random forest can be depicted in table 1120 for each sequence.
The data processing system can predict whether the sequence binds to ACE2 or is an mAb escape using a neural network. For example, the data processing system can input the one-hot encoding into one or more LSTM-RNN layers at 1114. The data processing system, when using a neural network, may not flatten the one-hot encoding first; instead, the data processing system can input the one-hot encoding array directly into the neural network at 1114. The data processing system can use a neural network with one or more layers or one or more types of layers, such as one or more LSTM-RNN layers 1114, and one or more dense layers 1116. The output of the one or more LSTM-RNN layers 1114 can be input into the dense layer 1116. The dense layer 1116 can provide an output prediction 1118. The output prediction 1118 can include a likelihood that a particular sequence binds to ACE2 and is an mAb escape.
The data processing system can identify the output predictions from the one or more machine learning techniques (e.g., 1112 and 1118). The data processing system can use the one or more predictions to generate a label for the sequence. Table 1120 illustrates the various output predictions for each sequence using the one or more machine learning techniques, and application of a label based on the output predictions. The data processing system can establish the table 1120, store the table 1120 in memory, or provide table 1120 for display via a graphical user interface, for example.
The table 1120 can include sequences 1122. The sequences 1122 can correspond to the input sequence at step 1104. The table 1120 can include an output prediction 1124 that indicates the likelihood or probability that the sequence 1122 binds to ACE2 as predicted using a neural network, such as the neural networks 1114 and 1116. The column 1126 can indicate the probability that the sequence 1122 binds to ACE2 as predicted using a random forest decision tree 1110, for example. The column 1128 can indicate the ACE2 label to assign to the sequence. The data processing system can determine the label to apply based on a combination of the decision tree output prediction 1112, as indicated in column 1126, and the neural network output prediction 1118 as indicated in column 1124. For example, the data processing system can apply a positive ACE2 label to indicate that the sequence does bind, or is likely to bind, to ACE2 if both the ACE2 RNN prediction 1124 and ACE2 RF prediction 1126 are greater than 0.5, which indicates that both machine learning techniques predict that the sequence is more likely than not to bind to ACE2. If, however, one of the machine learning techniques has a probability greater than 0.5, but the other machine learning technique has a probability less than 0.5, then the data processing system can determine to apply a negative ACE2 label to the sequence. If both the ACE2 RNN 1124 and the ACE2 RF 1126 are less than 0.5, then the data processing system can apply a negative ACE2 label to indicate that the sequence is unlikely to bind to ACE2. Thus, the ACE2 label for a given sequence can be positive for binding if both models 1124 and 1126 predict a P >0.5 and negative for binding if one or both models 1124 and 1126 have a P < 0.5.
The data processing system can similarly use the output predictions 1112 and 1118 to predict mAb escape. For example, column 1130 indicates the mAb probability predicted using a neural network, such as the neural networks 1114 and 1116 and output prediction 1118. Column 1132 indicates the mAb probability predicted using a decision tree, such as the random forest decision tree of 1108, 1110 and output prediction 1112. The data processing system can apply a mAb label 1134 based on both the mAb RNN 1130 and the mAb RF 1132. For example, the data processing system can apply a positive mAb label 1134 if both the mAb RNN 1130 and the mAb RF 1132 for the sequence are greater than 0.5. If one of the mAb RNN 1130 or mAb RF 1132 is less than 0.5 and the other is greater than 0.5, the data processing system can apply a negative mAb label 1134. If both the mAb RNN 1130 and mAb RF 1132 are less than 0.5, then the data processing system can apply a negative mAb label 1134.
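As a non-limiting illustration of the labeling rule described for table 1120, the Python sketch below assigns a positive label only when both the RNN probability and the RF probability exceed 0.5, and a negative label otherwise. The function names, dictionary keys and toy probabilities are assumptions made for this example; a probability of exactly 0.5, which the description leaves open, is treated as negative here.

```python
def consensus_label(p_rnn, p_rf, threshold=0.5):
    """Positive only if both model probabilities exceed the threshold."""
    return p_rnn > threshold and p_rf > threshold


def label_row(sequence, ace2_rnn, ace2_rf, mab_rnn, mab_rf):
    """Build one row resembling table 1120 (keys are illustrative)."""
    return {
        "sequence": sequence,
        "ace2_label": consensus_label(ace2_rnn, ace2_rf),
        "mab_label": consensus_label(mab_rnn, mab_rf),
    }


# Toy probabilities for three placeholder sequences.
rows = [
    label_row("seq_a", 0.91, 0.84, 0.77, 0.69),  # both labels positive
    label_row("seq_b", 0.91, 0.22, 0.77, 0.69),  # models disagree on ACE2
    label_row("seq_c", 0.10, 0.20, 0.30, 0.40),  # both labels negative
]
for row in rows:
    print(row)
```

Requiring agreement between the two models trades some sensitivity for fewer false positives, which may be preferable when only a few dozen sequences can be synthesized and tested.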
FIG. 12 illustrates an example process for identifying and predicting coronavirus variants, in accordance with implementations. The process 1200 can be performed by one or more system or component depicted in FIG. 6, including, for example, a data processing system. At 1202, the data processing system can generate RBM mutants in silico. Generating RBM mutants in silico can include using one or more servers or computing devices having one or more processors coupled to memory that can execute program instructions or software programs that are configured to generate RBM mutants.
At 1204, the data processing system can predict or determine the likelihood of whether the sequences generated at 1202 bind to ACE-2. The data processing system can use one or more machine learning techniques, such as decision trees, random forest decision trees, neural networks, or recurrent neural networks, to output a prediction or probability that a sequence generated at 1202 binds to ACE-2. At 1206, the data processing system can predict or determine the probability or likelihood that the sequence escapes mAbs using one or more machine learning techniques or models, such as decision trees or neural networks.
At 1208, the data processing system can select one or more sequences that form a diverse lineage. For example, the data processing system can select 46 sequences that form a diverse lineage. For example, the data processing system can choose sequences corresponding to edit distances to wild-type of 3, 5, and 7. The data processing system can choose sequences that represent lineages across distances. In an example, lineages chosen can include sequences with observed mutations (484K/501Y, 484Q/501Y, 501Y) as well as lineages without such preconditions. The data processing system can select the distance 7 sequences so that they are diverse with respect to their predicted antibody escape.
At 1210, the data processing system can provide the selected sequences for synthesis. Synthesis can refer to or include creating the sequence in a laboratory by stringing together the corresponding nucleic acids to form the sequence. For example, the data processing system can select sequences with a positive ACE-2 label (e.g., as illustrated in column 1128 of table 1120 in FIG. 11) and a positive mAb label (e.g., as illustrated in column 1134 of table 1120 in FIG. 11).
At 1212, the data processing system can facilitate validating or receive an indication of validation of the ACE-2 binding and mAb escape prediction. For example, an experiment can be conducted in a laboratory with the synthesized sequences to confirm or validate the predictions output by the data processing system, as depicted in FIG. 4.
Example 2: Expression of soluble RBD and monoclonal antibodies
Gene fragments for monoclonal antibodies (mAbs) were synthesized as light chain (LC, variable light-constant light) or variable heavy chain (VH) and cloned into pTWIST transient expression vectors by restriction cloning using NotI and BamHI for LC, and NotI and Esp3I for heavy chain (HC) vectors. Genes for SARS-CoV-2 RBDs were designed spanning from amino acid site Arg 319 to Lys 537, followed by a 6x poly-His tag and an AviTag for site-specific biotinylation. A subset of RBD genes were cloned into pTWIST transient expression vectors by NotI and BamHI cloning. After sequence verification and large-scale plasmid DNA preparation, suspension-adapted HEK293 cells were used for transient expression of both RBD and mAb proteins. Cells were transfected using 1 µg of plasmid DNA per mL of expression culture (typically 25 mL) at a 1:1 ratio of LC to HC vectors. Supernatant was harvested at day 5-6 post transfection by spinning down the culture and sterile filtration. RBD proteins were purified through HisTrap columns using a custom-built, 3D-printed column loading, wash and elution pump system. For His-tag purification, the following buffers were used: equilibration buffer containing 50 mM NaH2PO4 and 300 mM NaCl at pH 8; wash buffer containing 50 mM NaH2PO4, 300 mM NaCl and 20 mM imidazole; and elution buffer containing 50 mM NaH2PO4, 300 mM NaCl and 250 mM imidazole. mAb proteins were purified using Protein G columns and processed using Protein G equilibration, wash, and elution buffers, and were neutralized with Tris buffer at a 3:1 ratio. Purified RBD and mAb proteins were dialyzed against either PBS or phosphate buffer using 10 kDa SnakeSkin™ dialysis tubing and concentrated using either 30 kDa or 10 kDa Amicon Ultra centrifugal filter units. Avi-tagged RBDs were biotinylated using a BirA biotin ligase kit.
Example 3: Engineering monoclonal antibodies for binding to RBD variants
Previously identified neutralizing mAbs (e.g., clinically approved therapeutic antibodies) including REGN10933 (6XDG.B/C), REGN10987 (6XDG.D/E), Ly-CoV-555 (7KMG), and Ly-CoV-16 (7C01) were back-translated and genes were designed in silico as full-length single ORFs. In order to facilitate library generation, the sequences were truncated between the CDRH2 and CDRH3 and the removed region was replaced by a stuffer containing Type 2S restriction sites (PaqCI). Libraries of antibody genes were then generated by annealing two oligos containing either a single NNK/NNN codon within each CDRH2 and CDRH3, or combinatorial degeneracy based on existing similar mAbs (LD 3-7, depending on the mAb) in a curated human mAb database. The resulting theoretical diversity of such library designs ranged from < 5 x 10^4 for the NNK/NNN approach (Dual-DMS), up to < 3 x 10^9 variants for the combinatorial approach (hereafter called in silico DMS or isDMS). The annealed oligos were mixed with the truncated mAb backbones to perform Golden Gate assembly. Oligos were resuspended at 10 µM; the forward and reverse oligos were premixed and further diluted, resulting in a 1 µM working stock. After 3 PCR cycles, duplex oligos were imaged on a 2.5% agarose gel in order to check the purity, followed by a PCR cleanup. Each purified DNA sample was then mixed with the corresponding truncated mAb backbone at a 2:1 molar ratio as well as PaqCI, T4 DNA ligase, T4 DNA ligase buffer and PaqCI activator (conducted at 37°C, 10 minutes, 16°C, 1 minute → 37°C, 1 minute → 16°C, 1 minute x 60 cycles → 37°C, 5 min → 60°C, 5 min). Assembled plasmids were then purified and concentrated by DNA cleanup, followed by dialysis using MCE membranes and electroporation into 50 µL (Dual-DMS) or 350 µL (isDMS) of freshly prepared Top10/DH5alpha competent cells.
These assembled plasmids represented homology-directed repair (HDR) templates, which underwent large-scale DNA preparation from the resulting transformants, 20-100 x 10^6 (depending on the desired library size). For integration of antibody libraries, the cell line Cas9+ PnP hybridoma was transfected with assembled HDR templates with guide RNA (mRuby targeting gRNA-J). Hybridomas displaying productive and stable mAbs indicative of successful Cas9-mediated HDR were sorted and isolated by FACS. Double-positive (i.e., binding) gates were drawn using wild-type RBD as a guide, and negative gates were drawn with anti-FLAG tag stained cells. Next, cells were labeled and sorted for binding to the SARS-CoV-2 RBD variants identified in Example 1. FIGS. 5A-5D provide the amino acid frequencies of the resulting ACE2+ and the mAb binding and escape with LY-CoV16 (FIG. 5A), LY-CoV555 (FIG. 5B), REGN10933 (FIG. 5C), and REGN10987 (FIG. 5D).
Genomic DNA was then isolated from hybridoma cells expressing antibodies binding to RBD using a PureLink gDNA isolation kit and amplified over two consecutive PCRs to add Illumina and TruSeq adapters (conducted by PCR1 - Q5, composition as per manufacturer’s instructions, using an annealing temperature of 62.5°C. 6.25 µL gDNA template was used per 50 µL Q5 reaction mix) corresponding to a maximum diversity of 625,000. The number of reactions per sample was scaled according to the observed diversity post FACS x 10 to ensure sufficient oversampling. The primers used for the PCR were adapted from Taft JM, Weber CR, Gao B, et al., (2022) Deep mutational learning predicts ACE2 binding and antibody escape to combinatorial mutations in the SARS-CoV-2 receptor binding domain. Cell, 31 August 2022. They are chimeric sequences, one half of which targets the region of interest (in our case the RBD sequence), and the other half attaches adaptor sequences used for Illumina sequencing platforms. PCR2 was performed and the resulting PCR fragments were gel excised and assayed on a Fragment Analyzer to check for quality. Pooled DNA was then run on an Illumina MiSeq (2 x 250 bp PE). Sequencing reads were pre-processed (paired, merged, and trimmed) using BBDuk, with a quality threshold of qphred = 15. Processed reads were aligned and analyzed for CDR2-CDR3 identity using custom Python scripts. The resulting mutagenesis landscape was then used to determine optimized mutagenesis schemes.
Where technical features in the drawings, detailed description, item or any claim are followed by reference signs, the reference signs have been included to increase the intelligibility of the drawings, detailed description, and claims. Accordingly, neither the reference signs nor their absence has any limiting effect on the scope of any claim elements.
The systems and methods described herein may be embodied in other specific forms without departing from the characteristics thereof. The foregoing implementations are illustrative rather than limiting of the described systems and methods. Scope of the systems and methods described herein is thus indicated by the appended claims, rather than the foregoing description, and changes that come within the meaning and range of equivalency of the claims are embraced therein.
Having now described some illustrative implementations, it is apparent that the foregoing is illustrative and not limiting, having been presented by way of example. In particular, although many of the examples presented herein involve specific combinations of method acts or system elements, those acts and those elements may be combined in other ways to accomplish the same objectives. Acts, elements, and features discussed in connection with one implementation are not intended to be excluded from a similar role in other implementations or embodiments.
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” “characterized by,” “characterized in that,” and variations thereof herein, is meant to encompass the items listed thereafter, equivalents thereof, and additional items, as well as alternate implementations consisting of the items listed thereafter exclusively. In one implementation, the systems and methods described herein consist of one, each combination of more than one, or all of the described elements, acts, or components.
As used herein, the terms “about” and “substantially” will be understood by persons of ordinary skill in the art and will vary to some extent depending upon the context in which they are used. If there are uses of the term which are not clear to persons of ordinary skill in the art given the context in which it is used, “about” will mean up to plus or minus 10% of the particular term.
As used herein, the term “degenerate codon” refers to the set of nucleotides in an oligonucleotide sequence that code for at least one amino acid. In any embodiment, the degenerate codons used herein may code for 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids. Unless otherwise specified, the term “amino acid” refers to the twenty standard amino acids.
Any references to implementations or elements or acts of the systems and methods herein referred to in the singular may also embrace implementations including a plurality of these elements, and any references in plural to any implementation or element or act herein may also embrace implementations including only a single element. References in the singular or plural form are not intended to limit the presently disclosed systems or methods, their components, acts, or elements to single or plural configurations. References to any act or element being based on any information, act or element may include implementations where the act or element is based at least in part on any information, act, or element.
Any implementation disclosed herein may be combined with any other implementation or embodiment, and references to “an implementation,” “some implementations,” “one implementation” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the implementation may be included in at least one implementation or embodiment. Such terms as used herein are not necessarily all referring to the same implementation. Any implementation may be combined with any other implementation, inclusively or exclusively, in any manner consistent with the aspects and implementations disclosed herein.
The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.”
References to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms. For example, a reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. All publications, patent applications, patents and other references mentioned herein are incorporated by reference in their entirety.

Claims

WHAT IS CLAIMED:
1. A method, comprising: providing an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; generating a first data set comprising a first plurality of variant sequences of the coronavirus, each of the first plurality of variant sequences of the coronavirus comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises:
- a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor;
training a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; determining, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant; wherein the first affinity binding score indicates a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies that inhibit coronavirus from binding to the ACE-2 receptor.
2. The method according to claim 1, wherein said first plurality of variant sequences is generated by creating a combinatorial library of amino acid sequences and/or a tiling library of amino acid sequences that comprises variants of the input amino acid sequence generated by amino acid site saturation mutagenesis at one or more amino acid sites in a receptor-binding domain (RBD) of the viral spike protein.
3. The method of claim 2, wherein the amino acid site saturation mutagenesis is effected on 1 to 20 amino acid positions.
4. The method of any one of claims 2-3, wherein the coronavirus is SARS-CoV-2 and the amino acid site saturation mutagenesis is in the RBD and one or more amino acids at sites 350 to 550 of the SARS-CoV-2 spike protein.
5. The method of claim 4, wherein at least 5 amino acid sites of the input amino acid sequence are varied by amino acid site saturation mutagenesis, particularly wherein at least 10 amino acid sites are varied, more particularly wherein 20 to 40 sites are varied.
6. The method of any one of claims 4-5, wherein the amino acid site saturation mutagenesis is in the RBD at amino acid sites 400 to 515, particularly at amino acid sites 417 to 505.
7. The method of any one of claims 1-6, wherein the first data set comprises at least 10,000 unique amino acid sequences; particularly wherein the first data set comprises at least 50,000 unique amino acid sequences; more particularly wherein the first data set comprises at least 200,000 unique amino acid sequences.
8. The method of any one of claims 1-7, wherein the one or more antibodies a) are obtained from human serum, or b) comprise one or more monoclonal antibodies.
9. The method of any one of claims 4-32, wherein the combinatorial library of amino acid sequences and the tiling library of amino acid sequences are recombinantly expressed as individual amino acid sequences on a yeast cell surface.
10. The method of any one of claims 1-9, further comprising generating an antibody and/or antibody-like molecule that binds the coronavirus variant.
11. The method of any one of claims 1-10, further comprising generating a vaccine that induces production of antibodies that bind the coronavirus variant.
12. The method of any one of claims 1-11, wherein the first machine learning model comprises at least one of a random forest or a recurrent neural network.
13. The method of any one of claims 1-40 further comprising: balancing the data set to include an equal number of positive binding sequences and negative non-binding sequences per amino acid edit distance from a reference sequence (e.g., SARS-CoV-2 Wuhan-1).
14. The method of any one of claims 1-13, further comprising: determining a likelihood of variant emergence score for the coronavirus variant based on the proposed amino acid sequence, the predicted affinity binding score and one or more of:
- nucleotide and amino acid edit distance of the coronavirus variant to a reference coronavirus sequence; and/or
- likelihood of nucleotide mutations based on one or more data sets; and/or
- the first affinity binding score; and/or
- an existence of a viable evolutionary path to the coronavirus variant;
particularly wherein the method further comprises: determining the likelihood of variant emergence score for the coronavirus variant responsive to the first affinity binding score being greater than or equal to a threshold.
15. A system comprising one or more processors and a memory storing processor-executable instructions, the one or more processors execute the processor-executable instructions to: receive an input amino acid sequence that represents at least a portion of a viral spike protein of a coronavirus; receive a first data set comprising a first plurality of variant sequences, each of the first plurality of variant sequences comprising one or more amino acid site mutations of the viral spike protein; wherein the first data set comprises:
- a first plurality of amino acid sequences that bind to an ACE-2 receptor and bind to one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a second plurality of amino acid sequences that do not bind to the ACE-2 receptor and bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a third plurality of amino acid sequences that bind to the ACE-2 receptor and do not bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor; and/or
- a fourth plurality of amino acid sequences that do not bind to the ACE-2 receptor and do not bind to the one or more antibodies that inhibit the coronavirus from binding to the ACE-2 receptor;
train a first machine learning model with the first data set to predict affinity binding scores for proposed amino acid sequences; determine, via the first machine learning model trained with the first data set, a first affinity binding score for a proposed amino acid sequence to identify a coronavirus variant; wherein the first affinity binding score indicates a likelihood the coronavirus variant binds to the ACE-2 receptor and does not bind to the one or more antibodies and/or the one or more binding proteins that inhibit coronavirus from binding to the ACE-2 receptor.
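
By way of illustration only, two of the data-preparation steps recited above can be sketched in Python: enumerating single-site substitutions of an input sequence (an in silico analogue of the amino acid site saturation mutagenesis of claim 2) and balancing binders against non-binders per amino acid edit distance from a reference (claim 13). This is a minimal sketch and not the applicants' implementation. The function names, the substitution-only edit distance, the downsampling strategy and the fixed random seed are all assumptions made for the example, and a combinatorial library that varies several sites at once is not covered.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def saturation_variants(reference, sites):
    """Yield every single-site substitution of `reference` at the given 0-based sites."""
    for site in sites:
        for aa in AMINO_ACIDS:
            if aa != reference[site]:
                yield reference[:site] + aa + reference[site + 1:]


def edit_distance(seq, reference):
    """Count differing positions (assumes equal length, substitutions only)."""
    if len(seq) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq, reference))


def balance_by_distance(labelled, reference, seed=0):
    """Downsample so each edit-distance bin has equal binders and non-binders.

    `labelled` is an iterable of (sequence, is_binder) pairs. Downsampling the
    larger class is one possible balancing strategy, chosen here for brevity.
    """
    rng = random.Random(seed)
    bins = defaultdict(lambda: {True: [], False: []})
    for seq, is_binder in labelled:
        bins[edit_distance(seq, reference)][bool(is_binder)].append(seq)
    balanced = []
    for distance in sorted(bins):
        pos, neg = bins[distance][True], bins[distance][False]
        n = min(len(pos), len(neg))  # size of the smaller class in this bin
        balanced += [(s, True) for s in rng.sample(pos, n)]
        balanced += [(s, False) for s in rng.sample(neg, n)]
    return balanced
```

For a hypothetical four-residue reference such as "NKGY", saturation_variants("NKGY", [0, 2]) yields 2 x 19 = 38 sequences, each at edit distance 1 from the reference. The balanced list returned by balance_by_distance could then be encoded and passed to a classifier such as the random forest or recurrent neural network of claim 12, which is not shown here.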
PCT/EP2022/074922 2021-09-07 2022-09-07 Identifying and predicting future coronavirus variants Ceased WO2023036849A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163241385P 2021-09-07 2021-09-07
US63/241,385 2021-09-07

Publications (1)

Publication Number Publication Date
WO2023036849A1 true WO2023036849A1 (en) 2023-03-16

Family

ID=83438593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/074922 Ceased WO2023036849A1 (en) 2021-09-07 2022-09-07 Identifying and predicting future coronavirus variants

Country Status (1)

Country Link
WO (1) WO2023036849A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110165621A1 (en) 2010-01-04 2011-07-07 arGEN-X BV Humanized antibodies
US20120142611A1 (en) 2000-09-08 2012-06-07 Universitat Zurich Repeat protein from collection of repeat proteins comprising repeat modules
US20150368302A1 (en) 2012-06-28 2015-12-24 Molecular Partners Ag Designed ankyrin repeat proteins binding to platelet-derived growth factor
US20160075767A1 (en) 2010-11-26 2016-03-17 Molecular Partners Ag Capping modules for designed ankyrin repeat proteins
WO2020208555A1 (en) * 2019-04-09 2020-10-15 Eth Zurich Systems and methods to classify antibodies

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120142611A1 (en) 2000-09-08 2012-06-07 Universitat Zurich Repeat protein from collection of repeat proteins comprising repeat modules
US20110165621A1 (en) 2010-01-04 2011-07-07 arGEN-X BV Humanized antibodies
US20160075767A1 (en) 2010-11-26 2016-03-17 Molecular Partners Ag Capping modules for designed ankyrin repeat proteins
US20160250341A1 (en) 2010-11-26 2016-09-01 Molecular Partners Ag Designed repeat proteins binding to serum albumin
US20150368302A1 (en) 2012-06-28 2015-12-24 Molecular Partners Ag Designed ankyrin repeat proteins binding to platelet-derived growth factor
WO2020208555A1 (en) * 2019-04-09 2020-10-15 Eth Zurich Systems and methods to classify antibodies

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
DEHURY BUDHESWAR ET AL: "Effect of mutation on structure, function and dynamics of receptor binding domain of human SARS-CoV-2 with host cell receptor ACE2: a molecular dynamics simulations study", JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 7 August 2020 (2020-08-07), England, pages 7231 - 7245, XP055858872, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7484587/pdf/TBSD_0_1802348.pdf> [retrieved on 20211108], DOI: 10.1080/07391102.2020.1802348 *
HIE BRIAN ET AL: "Learning the language of viral evolution and escape", SCIENCE, vol. 371, no. 6526, 15 January 2021 (2021-01-15), US, pages 284 - 288, XP055886358, ISSN: 0036-8075, DOI: 10.1126/science.abd7331 *
JIAHUI CHEN ET AL: "Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 March 2021 (2021-03-09), XP081898174 *
KWAN ET AL., STRUCTURE, vol. 11, no. 7, 2003, pages 803 - 813
SKERRA, BIOCHIM. BIOPHYS. ACTA, vol. 1482, no. 1-2, 2000, pages 337 - 50
TAFT JM, WEBER CR, GAO B ET AL.: "Deep mutational learning predicts ACE2 binding and antibody escape to combinatorial mutations in the SARS-CoV2 receptor binding domain", CELL, 31 August 2022 (2022-08-31)
TENG SHAOLEI ET AL: "Systemic effects of missense mutations on SARS-CoV-2 spike glycoprotein stability and receptor-binding affinity", BRIEFINGS IN BIOINFORMATICS, 22 March 2021 (2021-03-22), GB, pages 1239 - 1253, XP055943995, ISSN: 1467-5463, Retrieved from the Internet <URL:https://watermark.silverchair.com/bbaa233.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAtQwggLQBgkqhkiG9w0BBwagggLBMIICvQIBADCCArYGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMr5nB5YGq4E7PtxcsAgEQgIICh4aEsSmf-kU1gFsaBzJx2zSR3Fs0D1MZ5VA4keqYWn6FHjI-TIEylKxjnZrmwu9gb8eT3khpSxPMoTkgIFG7_DiTSXu4> DOI: 10.1093/bib/bbaa233 *
VINCKE ET AL.: "General strategy to humanize a camelid single-domain antibody and identification of a universal humanized nanobody scaffold", J BIOL CHEM., vol. 284, no. 5, 30 January 2009 (2009-01-30), pages 3273 - 3284, XP055107615, DOI: 10.1074/jbc.M806889200

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22776939

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22776939

Country of ref document: EP

Kind code of ref document: A1