
WO2003038728A2 - A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins - Google Patents

Info

Publication number
WO2003038728A2
Authority
WO
WIPO (PCT)
Prior art keywords
peak
peaks
protein
probability
database
Prior art date
Application number
PCT/IB2002/004839
Other languages
French (fr)
Other versions
WO2003038728A3 (en
Inventor
Jari HÄKKINEN
Thorsteinn RÖGNVALDSSON
Jim Samuelsson
Original Assignee
Biobridge Computing Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Biobridge Computing Ab
Priority to AU2002347462A
Publication of WO2003038728A2
Publication of WO2003038728A3

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins

Definitions

  • the present invention relates to a computer system and a method for selecting one or more candidate proteins from a plurality of proteins stored in a database.
  • the known methods are based on semi-computerised comparisons between numerical representations of peptide peaks of known proteins and peptide peaks of the unknown protein.
  • an object of the present invention is to provide a method and a computer system for selecting at least one candidate protein from a database, in such a manner that the human factor is minimised.
  • a method for providing a measure applicable in selecting proteins in a database storing a numerical representation of a theoretical mass spectrum for a plurality of proteins. Said method preferably comprises, based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass/electric charge and intensity: extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s); and determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
  • the set of peak matches is preferably all those peaks that satisfy the condition that the distance between the two peak masses is less than a certain value. In many preferred embodiments this value is user defined, but it may, of course, be defined based on the actual data.
  • the method preferably comprises the step of determining a score value relating to the probability for the set of peak matches to occur, said score value being preferably determined as the negative logarithm of the probability for the set of peak matches to occur.
  • the method preferably always computes a score value for every protein stored in the database.
  • the present invention may preferably further comprise the steps of determining the probability for getting a score value equal to or above a predefined score, wherein
  • the probability for getting a score value equal to or above a predefined score is the probability to reach randomly a score value at least as large as the score value in question and/or the probability for getting a score value equal to or above a predefined score is the probability to reach from the database a score value at least as large as the score value in question.
  • the assessment may lead to the conclusion that none of the proteins represented in the data base is a likely candidate as a comparison of these two probabilities normally will show that top-scoring candidates from the data base search have score values much higher than what can be expected from just random matching. This last situation is especially valuable if the unknown protein is not in the database.
  • the probability for the set of peak matches to occur preferably reflects the probability to have a predetermined number (r) of matches. Furthermore, the probability for the set of peak matches to occur may preferably be determined so that it rewards many matches and at the same time takes into account the propensity of large proteins to have many matches.
  • the candidate protein(s) is preferably represented by a first set of peptide masses being a theoretical spectrum wherein each peak has the same intensity as all other peaks.
  • the extracting of noise-free mono-isotopic peptide peaks by the method according to the present invention may preferably comprise: determining the intensity level of the spectrum where the signal-to-noise ratio is unity, such as substantially unity, preferably by use of equation 1 disclosed herein; determining the intensity level of the spectrum where the signal-to-noise ratio is zero, such as substantially zero, preferably by use of equation 2 disclosed herein; and locating peak candidates by locating all parts, such as substantially all parts, of the spectrum over noise and extracting, preferably by copying, the ranges to form peak candidates.
  • the extracting of noise-free mono-isotopic peptide peaks may preferably further comprise: determining the peak entities mass/electric charge, intensity, width and signal-to-noise ratio for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
  • the extracting of noise-free mono-isotopic peptide peaks may preferably comprise or further comprise: determining a baseline of the spectrum and smoothing the baseline; determining a noise level and smoothing the noise level; and locating peak candidates by locating all parts, such as substantially all parts, of the spectrum over noise and extracting, preferably by copying, the ranges to form peak candidates.
  • the extracting of noise-free mono-isotopic peptide peaks may further comprise: fitting function parameters for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
  • the method according to the present invention allows peak extraction of peptide peaks representing peptides having any value of electrical charge.
  • the method according to the present invention may preferably further comprise the step of filtering the list of selected peaks. Additionally, the filtering may comprise discarding one or more peaks, preferably based on input from a user of the method; said input may name one or more specific peaks to be discarded.
  • the list is preferably filtered based on one or more of the strategies: echoes to consider, intensity cut, low and mass, aldi, peaks to exclude, peaks to keep, width cut.
  • a peak match is defined as a situation where the distance between the two peak masses is less than a predefined match value according to equation 6 disclosed herein and the predefined match value may preferably be user defined.
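  • The matching criterion above can be sketched as follows. This is a minimal illustration only: equation 6 is not reproduced in this text, so the sketch assumes the simplest reading (absolute mass difference below the match value), and the function name `match_peaks` and the example masses are hypothetical.

```python
from bisect import bisect_left


def match_peaks(selected, theoretical, match_value):
    """Return (experimental, theoretical) mass pairs closer than match_value."""
    theo = sorted(theoretical)
    matches = []
    for mass in selected:
        # First theoretical mass that could lie inside the match window.
        i = bisect_left(theo, mass - match_value)
        while i < len(theo) and theo[i] < mass + match_value:
            if abs(theo[i] - mass) < match_value:
                matches.append((mass, theo[i]))
            i += 1
    return matches


# Hypothetical masses in Da; the match value 0.05 Da stands in for a user choice.
matches = match_peaks([842.51, 1045.56, 2211.10],
                      [842.50, 1045.80, 2211.12], 0.05)
```

  • Sorting the theoretical masses once and bisecting keeps the comparison cheap when the same list of selected peaks is matched against every protein in the database.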
  • the method according to the present invention may preferably further comprise the step of determining a score value relating to the probability for the set of peak matches to occur, said score value being determined as the negative logarithm of the probability for the set of peak matches to occur.
  • a score value is determined for all proteins in the database.
  • the method may preferably further comprise the step of determining the probability for getting a score value equal to or above a predefined score.
  • the step of determining the probability may preferably be based on the list of peptide peaks, the predefined match value and the probability density for the list of peptide peaks.
  • the probability for getting a score value equal to or above a predefined score is the probability to reach randomly a score value at least as large as the score value in question. Additionally or in combination thereto, the probability for getting a score value equal to or above a predefined score is preferably the probability to reach from the database a score value at least as large as the score value in question.
  • the probability for getting a score value equal to or above a predefined score is calculated by equation 24 disclosed herein.
  • the database is storing a list of peptide masses and the corresponding parent proteins.
  • the database results from a digestion of proteins, and the digestion may preferably have been performed by the method according to the second aspect of the present invention.
  • the present invention preferably relates to a method for in silico digesting proteins, comprising: establishing a plurality of protein sequences; checking, for each amino acid in the sequences, whether the amino acid acquires a post-translational modification, and if so modifying the amino acid, and whether the current position coincides with a pre-specified cleavage site or is the right-most amino acid, and if so modifying the amino acid accordingly; and computing and registering the masses for all possible combinations of minimal peptide masses.
  • the present invention preferably relates to a computer system for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins
  • said computer system comprises preferably means for - extracting noise-free mono-isotopic peptide peaks from a numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks, the extracting being based on a numerical representation of the mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining, such as computing means, the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
  • the computer system according to the present invention comprises preferably means for performing some or all of the steps of the method according to the first aspect of the present invention.
  • Such means preferably comprise one or more computer processors, memory, disk storage, one or more data connections and/or the like.
  • a graphical user interface is provided in a fourth aspect of the present invention.
  • This graphical user interface is preferably particularly useful for guiding a user through a protein identification process and for presenting the result of the identification process, as the interface preferably comprises a number of module fields, each representing one or more modules adapted to perform one or more of the steps of the method according to the first aspect of the present invention and/or the second aspect of the present invention.
  • these fields are preferably graphically arranged and graphically linked so as to reflect a predefined executing order of the modules, the executing order being preferably the calculation flow governed by the underlying algorithms.
  • the graphical user interface is preferably adapted to initiating executing of a module in response to input to the computer system.
  • the graphical user interface is preferably also adapted to change the appearance of a field corresponding to the module being executed.
  • the input is preferably provided by a pointing device, such as a computer mouse, and an associated button press. Additionally or in combination thereto, one or more of these fields change appearance during execution of their corresponding module(s).
  • the graphical user interface according to the present invention may preferably further comprise one or more result fields that display the results and appear after results have been generated and/or while results are being generated by one or more of said modules.
  • the graphical user interface typically and preferably further comprises input fields, preferably appearing when a user input is required. Furthermore, the graphical user interface may preferably further comprise one or more dialog windows through which user input may be entered and/or edited. In connection thereto or in general, one or more of the dialog windows may preferably allow a user to edit values stored in a database, and the one or more dialog windows are preferably accessible via button(s) appearing on the interface.
  • the graphical user interface further comprises a tool bar via which actions can be executed by pushing buttons appearing on said tool bar, said tool bar preferably further comprising curtains, wherein each curtain represents a category of action and wherein each curtain comprises buttons for actions belonging to a particular category.
  • the graphical user interface may preferably further comprise a set of windows that communicate the results of the protein identification process.
  • Fig. 1 shows a raw experimental spectrum.
  • Fig. 2 shows the algorithmic flow of the present invention (a: automatic input; u: user input).
  • Fig. 3 shows an example of a cluster that might be mistaken for a single, very broad, peak.
  • Fig. 4 shows a peak cluster
  • Fig. 5 shows the isotope abundancy distributions for the four lowest-lying isotopes $\mathcal{I}(0, m)$, $\mathcal{I}(1, m)$, $\mathcal{I}(2, m)$ and $\mathcal{I}(3, m)$ (horizontal axis: m; vertical axis: isotope abundancy).
  • Fig. 6 shows a simple illustration of the chemistry of post-translational modifications (ptm:s) and missed cleavages, and their potential effect on the peaks of a mass spectrum: a. no ptm:s, no missed cleavage; b. ptm:s, both fixed and variable; c. missed cleavage.
  • ptm:s: post-translational modifications
  • Fig. 7 shows an example of the distribution of the number of peptides per parent protein. On the y-axis are shown the values of the frequency distribution for the number of peptides per parent protein. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.
  • Fig. 8 shows an example of the distribution of the peptide masses. On the y-axis are shown the values of the peptide mass frequency distribution. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.
  • Fig. 9 shows an example of a clear case of protein identification. On the y-axis are shown the values of the two functions, subscripted "random" and "database", defined in eq. 25 and eq. 26 respectively.
  • the top-scoring candidates from the database search have score values much higher than what can be expected from just random matching.
  • Fig. 10 shows an example of a case with no clear protein identification.
  • the top-scoring candidates from the database search have score values not higher than what can be expected from just random matching. Further experimental measures have to be taken.
  • Fig. 11 shows the workspace as it appears on startup.
  • the arrows between the analysis boxes illustrate the flow of data in the analysis, the boxes represent key steps in the analysis, as described in the sections "Algorithms". Graphs of the data and results will occupy the empty space on the left-hand side, once the analysis is started.
  • Fig. 12 shows the File menu.
  • Fig. 13 shows the Edit menu.
  • Fig. 14 shows the Actions menu.
  • Fig. 15 shows the dialogue window that comes up when the MSFiles box is activated. By clicking on the relevant fields the user chooses which spectrum files are to be analyzed. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step and move directly to the peak filtering step.
  • Fig. 16 shows the options menu for the peak filtering step, corresponding to the box Pepfil.
  • Fig. 17 shows the options menu related to the database digestion as well as the scoring and the validation steps; corresponding to the box Matcher.
  • Fig. 18 shows the window in which a user can specify his/her own post-translational modification.
  • Fig. 19 shows the mass spectrum during different stages in the analysis.
  • the top graph is the unprocessed spectrum.
  • the middle graph shows the extracted mono-isotopic peaks.
  • the bottom graph shows the mono-isotopic peaks after the filtering step. Zooming in any of the three graphs will automatically zoom the two other graphs as well.
  • the table at the bottom left of the workspace shows the top-scoring proteins from the database.
  • Fig. 20 shows a table presenting the top-scoring proteins of the database.
  • Fig. 21 shows a table presenting the peaks that were extracted and met the peak criteria in the filtering step. This interactive window enables a user to manually add or remove peaks that he/she thinks can affect the database search.
  • Fig. 22 shows detailed information from the database search about a particular database protein.
  • Fig. 23 shows the search result using the default parameters of the present invention. No likely protein candidate is found.
  • Fig. 24 shows the search result when the parameter values have been chosen judiciously, taking into account known information about the experimental spectrum, assumed post-translational modifications etc, leading to one very likely protein candidate, marked by a green flag.
  • Fig. 25 shows the workspace after a batch run. The result for each individual spectrum can be studied in the same way as after processing a single spectrum.
  • the present invention takes as input a spectrum from a mass spectrometer, a protein database, and a set of user options.
  • the output is a table of those proteins (from the database) that are the most likely candidates for having generated the mass spectrum.
  • h is a header row, which may or may not be empty. If it is not empty it may contain information about the experimental setup that generated the spectrum. If it is empty, it may also be left out.
  • the first column should contain values of the entity mass/electric charge (m/z), measured in units of Dalton (Da).
  • the second column is a field that separates the first and the third column. It can, for example, consist of a comma or white space.
  • the third column should contain values of the intensity measured at the mass spectrometer.
  • N is the number of datapoints in the spectrum.
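  • A reader for the spectrum format described above might look like the sketch below. It is an assumption-laden illustration: the function name `parse_spectrum` is hypothetical, a header row is recognised only by failing to parse as two numbers, and the example values are invented.

```python
import re


def parse_spectrum(text):
    """Parse rows of 'm/z <separator> intensity'; skip an optional header row."""
    points = []
    for line in text.splitlines():
        # The separator field may be a comma or white space.
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) != 2:
            continue  # blank line or free-text header
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # two-token header row that is not numeric
    return points


# Hypothetical file content: one header row followed by N = 2 datapoints.
example = "instrument: hypothetical MALDI-TOF\n842.51, 1200\n842.53 1350\n"
spectrum = parse_spectrum(example)
```

  • The length of the returned list corresponds to N, the number of datapoints in the spectrum.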
  • a protein database consists of a table of proteins. Depending on how well annotated the database is, the amount of information available varies. A minimum requirement for the present invention to work, is that for each database entry there is an identification tag (number or name) of the protein, and that its amino acid sequence is presented, where the amino acids are represented by their one-letter code. Additional but not necessary information is, for example, information about species, protein weight etc. Examples of protein databases that can be used are SWISSPROT [1], and NCBI non-redundant peptide sequence database.
  • the output is a table presenting the top candidates, that is, those proteins in the database whose theoretical spectra show the strongest resemblance to the unknown experimental spectrum.
  • the measure of resemblance is a score value, defined and described in the section "The Algorithms" below.
  • the proteins in the table are presented in descending order with respect to resemblance, hence having the most likely candidate on the top row, the second most likely candidate on the second row etc.
  • For each protein top candidate in the table its score value and other parameters related to the search statistics, as well as its particular amino acid sequence, are presented.
  • Given the contents of the output table a graphical illustration of the output can also be given. This is further described in the section "The Graphical User Interface" below.
  • the peak extraction is divided into a set of five sub-processes.
  • a user can influence which peaks are extracted through the choice of signal-to-noise ratio s/n.
  • the user's choice is denoted by ⁇
  • Peak extraction 1: Separating signal from noise
  • the separation of signal from noise is done in a series of steps.
  • (equation 2, giving the intensity level of the spectrum where the signal-to-noise ratio is zero)
  • Peak extraction 2: Classification of the datapoints and peak determination
  • a peak $\mathcal{P}$ is built up by datapoints $d_n = (x_n, y_n)$ according to the following criteria:
  • In Fig. 3 is shown a portion of a raw experimental spectrum.
  • the non-constant line marked 1 (solid) connects experimental datapoints.
  • Crosses marked 3 are those datapoints that belong to the set $\mathcal{D}_{\text{signal}}$.
  • the dots on line 1 between lines 2 and 4 are those datapoints that belong to the set $\mathcal{D}_{\text{support}}$.
  • This one peak will have an x range, $r_x(\mathcal{P})$, covering approximately 6 Da. Since peaks representing different isotopes of the same singly charged peptide should be separated by a distance of approximately 1 Da, a peak cannot have an x range of much more than 1 Da. Therefore a fourth criterion is added to the three above:
  • the present invention employs a procedure that systematically checks the value of $r_x(\mathcal{P})$:
  • Peak extraction 3: Computing peak properties
  • the intensity of a peak $\mathcal{P}$ is in the present invention taken as the maximum intensity value of those datapoints that build up the peak.
  • the center of mass and width of a peak are determined by a centroid calculation. Furthermore, the signal-to-noise ratio of the peak is determined through its maximum intensity value, using the definition of signal-to-noise ratio for an individual datapoint as described in step 8 in the passage "Separating signal from noise" above.
  • the vertical impulses marked 6 indicate the central x values ($x_c(\mathcal{P})$) and maximum y values ($y(\mathcal{P})$) for the peaks that have been extracted.
  • Peak extraction 4: Determining peak clusters
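  • The peak properties just described can be illustrated with a short sketch. The exact centroid and width formulas are not given in this text, so the code assumes an intensity-weighted mean for the center and an intensity-weighted standard deviation for the width; `peak_properties` and the datapoints are hypothetical.

```python
import math


def peak_properties(points):
    """points: list of (x, y) datapoints that build up one peak."""
    total = sum(y for _, y in points)
    # Intensity-weighted centroid of the m/z values.
    center = sum(x * y for x, y in points) / total
    variance = sum(y * (x - center) ** 2 for x, y in points) / total
    return {
        "intensity": max(y for _, y in points),  # maximum datapoint intensity
        "center": center,
        "width": math.sqrt(variance),            # assumed width measure
    }


props = peak_properties([(999.9, 10.0), (1000.0, 40.0), (1000.1, 10.0)])
```

  • For this symmetric example the centroid coincides with the datapoint of maximum intensity; for a skewed peak the two would differ.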
  • the next step is to partition the peaks into peak clusters.
  • Two peaks, $\mathcal{P}$ and $\mathcal{P}'$, are defined to be neighbours in, and members of, the same peak cluster if the distance r between them satisfies $1 - T \le r \le 1 + T$, where $T \approx 0.1$ Da.
  • the reason for having the value of r in that particular range is the fact that consecutive isotopes that belong to the same singly charged chemical compound should be separated by the mass of the neutron, $\approx 1$ Da.
  • An additional criterion is imposed on a cluster: At least one peak in a cluster needs to have at least one datapoint in $\mathcal{D}_{\text{signal}}$. Such a peak is said to be in $\mathcal{D}_{\text{signal}}$.
  • the p:th cluster can, in turn, be defined as a set of peaks:
  • $\mathcal{P}_p(r)$ is peak number r in the p:th cluster $C_p$, and r(p) (> 1) is the number of peaks in that cluster.
  • Peak extraction 5: Finding mono-isotopic peaks in a cluster
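  • The clustering rule can be sketched as below, for singly charged peptides only. The representation of a peak as a `(center_mass, in_signal)` tuple, the function name `cluster_peaks`, and the chaining strategy (compare each peak with the last member of each existing cluster) are assumptions made for illustration.

```python
def cluster_peaks(peaks, tol=0.1):
    """peaks: list of (center_mass, in_signal) tuples.

    Chains peaks whose spacing to the last member of a cluster is
    1 +/- tol Da, then drops clusters without any peak in the signal set.
    """
    clusters = []
    for peak in sorted(peaks):
        for cluster in clusters:
            spacing = peak[0] - cluster[-1][0]
            if 1 - tol <= spacing <= 1 + tol:
                cluster.append(peak)
                break
        else:
            clusters.append([peak])  # no neighbour found: start a new cluster
    # Additional criterion: at least one peak must be in the signal set.
    return [c for c in clusters if any(in_signal for _, in_signal in c)]


# Hypothetical peaks: an isotope series near 1000 Da and two isolated peaks.
clusters = cluster_peaks([(1000.5, True), (1001.5, False), (1002.52, False),
                          (1200.0, False), (1300.0, True)])
```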
  • the present invention proceeds now to select the mono-isotopic peaks in a peak cluster.
  • Consider the p:th cluster $C_p$ defined above, an example of which is shown in figure 4.
  • the present invention computes the peptide abundancies that build up the cluster.
  • In Fig. 5 are shown the distributions $\mathcal{I}(i, m)$ for the four lowest-lying isotopes. Due to statistical variations there is a width in each distribution, indicated by error bars.
  • the values of the isotopic distributions are in the present invention kept in a table.
  • the solution a can in general contain components $a_r < 0$, a non-realistic solution in the present context.
  • the way to approach the problem will instead be to find a solution a such that it is the best possible with the constraint that $\forall r:\ a_r \ge 0$.
  • By “best possible” solution is meant the following:
  • $d(a, p, r) = y(p, r) - \left(a_1\,\mathcal{I}(r-1, m) + \dots + a_r\,\mathcal{I}(0, m)\right)$
  • the present invention employs the well-known technique of quadratic programming to solve this constrained minimization problem. So, for each cluster $C_p$ there will be mono-isotopic peptides in and only in those positions r where $a_r > 0$.
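  • The constrained minimization can be illustrated with the sketch below. It assumes that the quantity minimised is the sum over r of the squared residuals d(a, p, r), holds the mass m fixed so that the isotope distribution is a short list, and uses projected gradient descent as a simple stand-in for a quadratic programming solver. The function name, the isotope values and the intensities are all hypothetical, and indices are 0-based rather than 1-based.

```python
def deconvolve_cluster(y, iso, iterations=5000):
    """Minimise sum_r (y[r] - sum_k a[k] * iso[r - k])**2 subject to a[k] >= 0."""
    n = len(y)

    def model(a, r):
        # Predicted intensity of peak r: peptide k contributes via isotope r - k.
        return sum(a[k] * iso[r - k] for k in range(r + 1) if r - k < len(iso))

    # Safe step size: 1 / (upper bound on the largest eigenvalue of A^T A).
    step = 1.0 / sum(iso) ** 2
    a = [0.0] * n
    for _ in range(iterations):
        resid = [model(a, r) - y[r] for r in range(n)]
        # Gradient step followed by projection onto the constraint a[k] >= 0.
        a = [max(0.0, a[k] - step * sum(resid[r] * iso[r - k]
                                         for r in range(k, n)
                                         if r - k < len(iso)))
             for k in range(n)]
    return a


iso = [0.5, 0.3, 0.15, 0.05]    # hypothetical isotope abundancies at one mass
y = [5.0, 3.0, 1.5, 2.5, 1.2]   # hypothetical peak intensities in one cluster
abundances = deconvolve_cluster(y, iso)
```

  • In this constructed example the intensities were generated from two peptides, one starting at the first peak and one at the fourth, so the positive components of the solution mark those two mono-isotopic positions.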
  • a peak in a spectrum that may contain peptides with higher charges can represent peptides of different charges. This means that the systems of equations that describe the peak intensities in terms of isotopic distributions $\mathcal{I}(i, m)$ and peptide abundancies $a_r$ may get contributions from more than one peak cluster. In this general case it is therefore necessary to introduce a procedure that creates a set of disjoint systems of equations, where a disjoint system may or may not get contributions from more than one peak cluster, and a peak cluster contributes to one and only one such disjoint system.
  • the baseline of the spectrum is calculated. This is done by finding, over the given raw data, the smallest intensity within a running window (user option: -bcr [baseline constant range]). This calculation is performed by analysis of the measured intensity values by means of histograms, and the smallest-valued bin in the histogram is chosen as the baseline value. The baseline is smoothed after the analysis to make sure that the curve is continuous.
  • the last fit is to try fitting more than one Gaussian to the peak candidate. This is done to resolve a possible situation of superposition of peaks. If this is successful, that is, more than one peak was fitted, then these peak parameters are used, and the candidate results in more than one peak.
  • the single peptide clusters are analysed; a mono-isotopic mass, a charge and a quality measure are calculated for each cluster.
  • the user option -mir [min_isotopic_reduction] is used in the quality analysis of whether the single cluster is a superposition of more than one peptide differing by an integer times the mass of a neutron; -mir sets the acceptable deviation from the cluster model used.
  • User option -mi [mono_isotopes] defines whether to report all extracted peaks or mono-isotopes only.
  • User option -of [overflow] sets the intensity measurement limit of the mass spectrometer. This parameter is needed since the measured isotopic clusters become corrupt if the measurements are overflowed.
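  • A simplified version of the baseline step is sketched below. It replaces the histogram analysis with a plain running-window minimum and uses a moving average for the smoothing, both of which are assumptions; `baseline` and `half_window` are hypothetical names, with `half_window` playing the role of the -bcr option.

```python
def baseline(intensities, half_window=2):
    """Smallest intensity inside a running window, then a moving-average smooth."""
    n = len(intensities)

    def window(i):
        # Indices of the running window centred on datapoint i.
        return range(max(0, i - half_window), min(n, i + half_window + 1))

    minima = [min(intensities[j] for j in window(i)) for i in range(n)]
    return [sum(minima[j] for j in window(i)) / len(window(i)) for i in range(n)]


# Hypothetical intensities: a flat background near 5 with one peak on top.
base = baseline([5, 6, 5, 40, 90, 35, 6, 5, 7])
```

  • The window has to be wider than a peak, otherwise the running minimum follows the peak instead of the background beneath it.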
  • the user options -im and -ic [signal-to-noise or absolute intensity] set the model to use and the cut-off value.
  • the peak extraction step described above has resulted in a set of selected peaks.
  • a user of the present invention may want to discard some peaks. Those unwanted peaks may or may not be part of the selected peaks; if they are, they will be filtered out; that is removed, in this step.
  • the invention supports the following filters and any combination of them:
  • echoes to consider: In a mass spectrum, so-called peak echoes sometimes occur. By this is meant that the experimental mass spectrum sometimes contains peaks that are false doublets of true peaks. These false doublets often appear at certain well-established distances from the true peaks. This filter handles that problem. Assume the user chooses a value d, measured in Dalton.
  • peaks to exclude: In a mass spectrum some particular m/z values often represent calibrants, parts of the digestion enzyme, or known contaminants. These m/z values can be specified by the user, and if these appear in the set of selected peaks, within a tolerance window, the peaks are removed.
  • peaks to keep: Suppose the number of selected peaks is S and the user only wants U peaks to go into the spectrum matching. In this filter those U peaks with the highest signal-to-noise are kept, and the other S-U are removed.
  • width cut: A threshold for the width of peaks. Peaks with a width above the width cut value, specified by the user, are removed.
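  • Three of the filters above (peaks to exclude, width cut, peaks to keep) can be combined as in the following sketch. The dictionary keys, the function name `filter_peaks`, the order in which the filters are applied and the example values are assumptions; the text does not specify them.

```python
def filter_peaks(peaks, exclude=(), tolerance=0.1, keep=None, width_cut=None):
    """peaks: list of dicts with keys 'mz', 'width' and 'sn' (signal-to-noise)."""
    # peaks to exclude: drop peaks within the tolerance window of a listed m/z.
    kept = [p for p in peaks
            if not any(abs(p["mz"] - x) <= tolerance for x in exclude)]
    # width cut: drop peaks wider than the threshold.
    if width_cut is not None:
        kept = [p for p in kept if p["width"] <= width_cut]
    # peaks to keep: retain the U peaks with the highest signal-to-noise.
    if keep is not None:
        kept = sorted(kept, key=lambda p: p["sn"], reverse=True)[:keep]
    return sorted(kept, key=lambda p: p["mz"])


# Hypothetical selected peaks.
selected = [
    {"mz": 842.51, "width": 0.20, "sn": 30},
    {"mz": 1045.56, "width": 0.90, "sn": 50},
    {"mz": 1479.80, "width": 0.25, "sn": 12},
    {"mz": 2211.10, "width": 0.30, "sn": 8},
]
filtered = filter_peaks(selected, exclude=[842.5], width_cut=0.5, keep=1)
```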
  • each row contains information about an extracted peak that has survived the filter step.
  • a row consists of values for the parameters of peak properties such as m/z, intensity, peak width, signal-to-noise ratio and peptide abundancy.
  • Digestion in silico: The purpose of digestion of proteins in a protein database, so-called digestion in silico, is to mimic the enzymatic digestion of the real unknown protein that takes place in the laboratory, and hence to compute theoretical spectra, one theoretical spectrum for each database protein. Having that, the present invention can compare the experimental spectrum with each theoretical spectrum.
  • digestion has been carried out by a site-specific enzyme.
  • the enzyme cleaves, that is cuts, the protein only at certain cleavage sites.
  • Among digestion enzymes there is trypsin, which cleaves only on the C-terminal side of the amino acids arginine and lysine, unless there is a neighbouring proline amino acid on the C-terminal side of the arginine or lysine. If that is the case no cleavage will be done by the trypsin enzyme at that particular site.
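  • The trypsin rule stated above translates directly into code. The sketch below is illustrative only; the function name `tryptic_peptides` and the example sequence are hypothetical, and modifications and missed cleavages are not handled here.

```python
def tryptic_peptides(sequence):
    """Cut on the C-terminal side of R or K unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        last = i == len(sequence) - 1
        # The right-most amino acid always closes the final peptide.
        if last or (residue in "RK" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


# One-letter codes; the R at position 4 is followed by P and is not cleaved.
peptides = tryptic_peptides("MAKRPSGKLLR")
```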
  • Post-translational modifications: It may happen that a protein gets modified by more or less complicated chemical compounds that cannot be predicted by only studying the nucleotide base-pair sequence of its corresponding gene.
  • One example of a post-translational modification is methionine oxidation, where the amino acid methionine acquires an extra oxygen atom.
  • a peptide containing methionine therefore gets its mass shifted upwards by the mass of an oxygen atom.
  • Other examples of post-translational modifications can be found in [4]. Post-translational modifications can be divided into variable and fixed modifications.
  • a variable modification is such that it may or may not occur; in the example above it would mean that some of the methionine amino acids acquire an extra oxygen atom while others do not.
  • a fixed modification on the other hand, always occurs; in the example above it would mean that every methionine amino acid in each peptide acquires an extra oxygen atom.
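  • The difference between a fixed and a variable modification can be made concrete with methionine oxidation. The sketch below is an illustration under assumptions: the residue table is a small subset of standard mono-isotopic residue masses, the peptide "MAMK" is invented, and `peptide_masses` is a hypothetical name.

```python
RESIDUE_MASS = {"M": 131.04049, "A": 71.03711, "K": 128.09496}  # Da, subset only
WATER = 18.01056    # added once per peptide
OXYGEN = 15.99491   # mass shift of one methionine oxidation


def peptide_masses(peptide, variable=True):
    """Masses of a peptide under methionine oxidation.

    variable=True: any number 0..n of the n methionines may be oxidised,
    giving n + 1 distinct masses.
    variable=False (fixed): every methionine is oxidised, giving one mass.
    """
    base = sum(RESIDUE_MASS[r] for r in peptide) + WATER
    n = peptide.count("M")
    counts = range(n + 1) if variable else [n]
    return [base + k * OXYGEN for k in counts]


variable_masses = peptide_masses("MAMK", variable=True)
fixed_masses = peptide_masses("MAMK", variable=False)
```

  • A variable modification therefore multiplies the number of theoretical peaks per peptide, while a fixed modification only shifts them.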
  • Post-translational modifications are not really part of the protein digestion process, but need to be included when computing the theoretical spectra. It is therefore natural to incorporate this circumstance at the digestion stage.
  • 2: Missed cleavages: It may also happen that an enzyme does not cut at a site on the protein where it is allowed to cut.
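  • Missed cleavages can be modelled by joining consecutive minimal peptides. The following sketch assumes that a peptide with k missed cleavages is the concatenation of k + 1 neighbouring minimal peptides; `with_missed_cleavages` and `max_missed` are hypothetical names.

```python
def with_missed_cleavages(minimal, max_missed=1):
    """Join up to max_missed + 1 consecutive minimal peptides."""
    out = []
    for i in range(len(minimal)):
        # j is the index of the last minimal peptide included in the join.
        for j in range(i, min(len(minimal), i + max_missed + 1)):
            out.append("".join(minimal[i:j + 1]))
    return out


# Minimal peptides of a hypothetical protein, in sequence order.
all_peptides = with_missed_cleavages(["MAK", "RPSGK", "LLR"], max_missed=1)
```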
  • the general procedure of digestion in silico is the following: In a protein database the present invention will process entry j, where $j = 1, \dots, N_{\text{database}}$ and $N_{\text{database}}$ is the number of proteins registered in the database. Entry j contains a unique identification tag T[j], and an amino acid sequence
  • $A(j) = a_1(j)\,a_2(j) \dots a_k(j) \dots a_{K(j)}(j)$
  • $a_k(j)$ is the k:th amino acid residue, counted from the N-terminal side of the protein j.
  • $a_k(j)$ can be any of the 20 amino acids found in proteins, and is represented by a one-letter code. (For a reference of their code-letters, chemical composition and mass, see [3].)
  • K(j) is the number of amino acids in protein j.
  • $Y$ = trypsin, $M[Y] = 2$
  • $S[Y]_1$ = (the C-terminal side of arginine, unless there is a neighbouring proline on the C-terminal side of the arginine)
  • $S[Y]_2$ = (the C-terminal side of lysine, unless there is a neighbouring proline on the C-terminal side of the lysine).
  • $M = \{FM_1, \ldots, FM_f, \ldots, FM_F;\ VM_1, \ldots, VM_v, \ldots, VM_V\}$
  • FM and VM denote fixed and variable modifications respectively.
  • To every fixed and variable modification is assigned a mass, $m(FM_f)$ and $m(VM_v)$ respectively.
  • Counters for all members of the set $M$ are set to $n(FM_f) = 0$ and $n(VM_v) = 0$.
  • The invention reads off, from left to right, that is, from the N-terminal to the C-terminal side, the protein sequence $A(j)$.
  • IV. For each amino acid that is read off in the sequence, the invention checks whether a. the current amino acid shall acquire a post-translational modification specified by $M$.
  • ii. Read off the next amino acid; that is, go to step III.
  • $m(p[j, c, x(c)]) \leftarrow m(p[j, c, x(c)]) + \text{mass}(\text{current amino acid residue})$
  • ($\prod_{v=1}^{V} [n(VM_v) + 1]$ such combinations) $m(p[j, c, x(c)]) \leftarrow m(p[j, c, x(c)]) + \sum_{v=1}^{V} n'_v \, m(VM_v)$ for each value $n'_v$; $0 \le n'_v \le n(VM_v)$.
  • The integer $x(c)$ runs between 1 and $X(c)$.
  • V. Take into account missed cleavages: Having performed in silico digestion at every allowed cleavage site at protein $T[j]$, there is a set of minimal peptides $p[j, c', x(c')]$, where $1 \le c' \le c$ and $1 \le x(c') \le X(c')$.
  • The invention now computes and registers the masses for all possible combinations of minimal peptides with the restriction that for each member of the combination, $p[j, c', x(c')]$ say, there has to be at least one other member, $p[j, c'', x(c'')]$, such that
  • The frequency distribution of peptide masses for peptides whose parent proteins have given rise to $N$ peptides.
  • Distributions 1 and 2 are shown in figures 7 and 8.
  • Distribution 2 is the sum of distributions 3 for all different values of $N$, keeping the value of the peptide mass fixed. In the present invention these three distributions are kept in memory and will be used in the spectrum matching and validation algorithms described below.
  • the present invention computes the probability for the set of peak matches to occur.
  • the score is then taken as the negative logarithm of that probability, so that a high score value reflects an unlikely event, and hence a high degree of spectrum resemblance; that is, a good match.
  • There are, of course, different ways to do this basically meaning different ways to define the set of peak matches and different ways of computing probabilities.
  • the aim in the present invention is, quite naturally, to reward many matches, and at the same time take into account the propensity for large database proteins to have many matches.
  • $z_i(j)$ is the mass of the i:th peak in the theoretical spectrum of protein $T[j]$, and $N(j)$ is the number of peptides that resulted from in silico digestion of protein $T[j]$.
  • the present invention calculates the probability for the set of matches between the peak list x and the theoretical spectrum z(j) as described by M[j].
  • Eq. 8 is taken as the general definition of a score value in the context of the present invention. There are now different ways to specialize that general expression. Here two examples of such specializations will be given.
  • is taken to be the probability that a peptide whose parent protein has given rise to N(j) peptides will find a match with one of the L peaks in x. K is then the probability for no match.
  • [Equation 33: expression illegible in the source] in order to contain $n_{random}(\sigma_c)$ random proteins that reach a score value of at least $\sigma_c$.
  • $n_{database}(\sigma_c)$ is the number of real database proteins that reached a score value of at least $\sigma_c$.
  • $N_{database}$ is the size of the real protein database.
  • $p = p(\sigma_c)$ is defined as the random probability of getting at least one protein with a score value of at least $\sigma_c$, given the size of the real database, $N_{database}$. This implies
  • the parameters score value, quality measure and p-value are in the present invention calculated and reported for every database protein.
  • GUI Graphical User Interface
  • The GUI is designed so that it is platform independent. By this is meant that the computer code is written such that the GUI can run on any computer irrespective of the computer's operating system.
  • The invention is designed so that the Main Application can run independently of the GUI.
  • The invention can, through the GUI, be run in a stepwise manner. This means that each algorithmic step, described above in the section "The Algorithms", can be executed such that the following step in the algorithmic flow will not be executed before the user chooses to do so.
  • GUI workspace As illustrated, the workspace is divided into
  • the first two and the fourth workspace areas are interactive with the user. By this is meant that when a user clicks on one of the items in those areas, the user can either select input for the algorithms or start a process.
  • the menu bar at the top of the workspace, see figure 11, consists of five menus with the following features:
  • Edit: The Edit menu contains one item, Boxes, that gives the user access to all the option menus for the analysis boxes in the workspace. It is shown in figure 13.
  • View: The View menu is used to specify the items a user wants to see on the desktop. It has only one option, Hide icon toolbar, which specifies whether the icons at the top of the workspace should be shown or not.
  • The Preferences menu is divided into
  • Actions: The Actions menu, see figure 14, has the following items:
  • Run step: Runs only the highlighted analysis step.
  • By highlighted is meant that the title bar of the corresponding analysis box is coloured.
  • Run batch: Runs the entire analysis on every spectrum selected by the user.
  • Run spectrum: Runs the entire analysis on a single spectrum selected by the user.
  • Halt process: Halts a batch run.
  • The icon bar, see figure 11, consists of a set of often-used icons that correspond to features and functions controlled in the menu bar, described above.
  • the status bar is located at the bottom of the workspace, see figure 11. It contains updated information about what is being currently processed by the present invention.
  • MSFiles: Selects those files containing the raw spectra to be analysed. Input from: the user. Output to: Pepex/the screen.
  • Pepex: Extracts mono-isotopic peaks from the raw spectrum. Input from: MSFiles/the user. Output to: Pepfil/the screen/file.
  • Pepfil: Filters peaks from the peak list created by Pepex. Possibility for recalibration of a spectrum. Input from: Pepex/the user. Output to: Matcher/the screen/file.
  • The MSFiles box: When clicking on the box MSFiles, or on the MSFiles field in the Edit menu, a window appears, see figure 15. There the user can indicate the file (containing data for a raw experimental spectrum) or set of files to be analysed. By clicking on "Add Files" a browser will appear and a user can choose to run one spectrum or many spectra for a batch job. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step, and move directly to the peak filtering step.
  • the left side of the window contains a list of the filters that are supported in the present invention.
  • the right side contains the filters that are currently active. By writing in the Parameter fields, the user can choose values that correspond to the active filters.
  • The filters that are supported, which were described above in the section "The Algorithms", are the following:
  • buttons that control which filters are to be used:
  • Deactivating a filter is done by marking a filter on the right side, followed by clicking on the ⁇ button.
  • the floppy disk button enables the user to save the current filter setup to a file.
  • This window contains a set of option fields.
  • Validation database size: This is the number of random proteins that are used when calculating the function $F_{random}$ defined above in the section "The Algorithms".
  • the GUI output from a run of the present invention can be divided into
  • Figure 19 shows the GUI workspace after a run of the present invention.
  • three spectrum graphs are presented. These graphs are related to, from top to bottom, the experimental raw spectrum, the extracted peptide peaks, and the peptide peaks left after filtering.
  • a user can visually study the extracted peaks and relate them to the experimental spectrum.
  • a zooming function enables detailed study over the whole m/z-range.
  • the Results window
  • In the Results window a user gets information about the top-scoring protein candidates as well as about the peaks that were used in the database search.
  • Figure 20 illustrates a list of the top-scoring protein candidates.
  • the proteins are listed in descending order of spectrum resemblance in comparison with the filtered peak list extracted from the experimental spectrum.
  • the list contains a number of rows (as many as chosen by the user in the Matcher box) where each row holds search results for one database protein. In the present illustration there are five columns for each row:
  • quality flag: The flag is a small rectangle whose colour depends on the quality measure of the fifth column and on chosen cut-off values. In the present illustration it is implemented such that a quality value above 7 gives a green flag, a quality value between 3 and 7 gives a yellow flag, and a quality value below 3 gives a red flag. The statistical significance of the quality measure was discussed in the "Algorithms" section above.
  • protein id: In this column is reported the name and database identification tag of the database protein. Denoted "Protein id" in the present illustration.
  • This window contains detailed information about that particular protein, and is described below.
  • Peaks
  • Figure 21 illustrates a list of the peaks that were used in the database search.
  • the upper field contains only the m/z values for the peaks.
  • By clicking on a "Copy" button a user can copy and then paste the set of m/z values into any desired application.
  • rows where more detailed information about every selected peak is given. In the present illustration there are six columns for each row; that is, for each peak V that is used in the database search:
  • intensity: The absolute intensity value of the peak. It is calculated as y(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Intensity" in the present illustration.
  • Figure 22 shows an example of such a window, where a user, among other things, can find information about
  • PC hardware requirements: PC:s; that is, personal computers, or workstations.
  • a client-server solution in which the binary files executing the algorithms are run on a server and the GUI is run on one or many clients.
  • The protein database should be formatted in the so-called FASTA format and be stored on the server if a client-server solution is the user's choice. If a stand-alone version is used, the formatted protein database is stored on the computer at hand.
  • a rough outline of the method is the following.
  • The whole chosen database is digested. Then, for every peptide mass, the two numbers i and j are calculated, such that
  • The score value for an entry in the protein database is now defined as
  • the spectrum resemblance to the peak list x is computed. This resemblance is based on how well peaks from z(j) and x match each other. The criterion for a match between two peaks is such that their mutual distance has to be less than e. The spectrum resemblance is now based on a score that is written as
  • contains a scoring method based on the probability for matches and misses between the experimental spectrum and a theoretical spectrum.
  • Peak extraction not continued by the identification steps is valuable in many circumstances.
  • One such case is when a user is only interested in visually inspecting spectrum differences; for example, comparison of a protein spectrum from a healthy cell sample with a spectrum from a cell sample in disease. It is, however, also valuable if a user wants to perform protein identification using, in parallel with the present invention, some other protein identification software. This can easily be done using the "Copy" function available for the peak list in the GUI Results window, described above. Running different methods in parallel, in order to raise the statistical significance, is not uncommon when doing protein identification.
  • The example spectrum file is called spectrum.txt. It is loaded by choosing the box MSFiles under the Edit menu, clicking on the "Add Files" button, and then using the browser to select the spectrum file.
  • the present invention will process the spectrum, with all user options set to their default values.
  • the search result is shown in figure 23.
  • The best scoring protein candidate gets a score value ("Score") of 14.52, a p-value ("Probability") of 0.15 and a quality value ("Quality") of 3.89.
  • Score score value
  • Probability p-value
  • Q quality value
  • The quality value Q is directly related to the size needed of a random database in order to expect a certain score value. That size is in the present case exp(3.89) ≈ 50 times the size of the actual random database used for the search, maybe not a very convincing number.
  • The colour-coded flags in the left column also help a user to quickly assess the statistical significance of the search result.
  • a user can choose to change the values of the option parameters from their default values.
  • The user uses the filter option to remove known peaks from the digestive enzyme, and takes into account the possibility of missed cleavages and known post-translational modifications.
  • the present invention is rerun, and the result is shown in figure 24.
  • As can be seen there is a new top-scoring candidate. It gets a score value ("Score") of 16.42, a p-value ("Probability") ≈ 0, and a quality value Q of 12.1.
  • The needed size of a random database for expecting the reported score value is in the present case exp(12.1) ≈ 180000 times the size of the actual random database used for the search.
  • The top candidate is also the correct protein.
  • a quick glance by the user at the colour-coding of the flags in the left column gives strong support for the top-scoring candidate to actually be the unknown protein. Additional support is also given by the following three candidates; all of the same type as the top candidate.
  • The present invention allows for running many experimental spectra in one go; so-called batch jobs. This is extremely valuable, and basically the only method feasible when a user wants to perform high-throughput screening of many spectra in a fast and automated fashion.
  • a user selects all the desired spectrum files in the MSFiles box, and selects "Run batch" in the "Actions" menu. After a batch run a user clicks anywhere on the workspace, and a list of the processed spectra appears, as shown in figure 25. The result for each individual spectrum can then be studied in the same way as after processing a single spectrum, with access to spectrum graphs, lists of top-scoring protein candidates etc.
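
The flag colouring and the exp(Q) size factor quoted in the example runs above can be checked with a minimal sketch. The function names are hypothetical, and treating the boundary values 3 and 7 themselves as yellow is an assumption, since the text only says "between 3 and 7".

```python
from math import exp

def quality_flag(quality, low=3.0, high=7.0):
    """Map a quality value to a flag colour using the cut-offs from the illustration.

    Assumption: the boundary values themselves (3 and 7) are treated as yellow.
    """
    if quality > high:
        return "green"
    if quality >= low:
        return "yellow"
    return "red"

def random_database_factor(quality):
    """Factor exp(Q) by which the random database would have to grow."""
    return exp(quality)

for q in (3.89, 12.1):
    print(q, quality_flag(q), round(random_database_factor(q)))
```

With the two quality values reported above, exp(3.89) comes out just under 50 and exp(12.1) close to 180000, in line with the rounded figures in the text.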

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Hematology (AREA)
  • Chemical & Material Sciences (AREA)
  • Urology & Nephrology (AREA)
  • Physics & Mathematics (AREA)
  • Immunology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Food Science & Technology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Microbiology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Cell Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The present invention relates to a computer system and a method for selecting one or more candidate proteins from a plurality of proteins stored in a database. The invention relates in particular to a method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins. The method comprises the steps of: extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass/electric charge and intensity, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates. The present invention also relates to a graphical user interface rendering utilisation of the method easy.

Description

A COMPUTER SYSTEM AND METHOD USING MASS SPECTROMETRY DATA AND A PROTEIN DATABASE FOR IDENTIFYING UNKNOWN PROTEINS.
The present invention relates to a computer system and a method for selecting one or more candidate proteins from a plurality of proteins stored in a database.
Today, extensive research is carried out in order to determine the effects a protein may have in a living organism. While the effect may be established for a protein, it is often very difficult to determine the protein, or proteins, causing the effect; that is, identification of a specific protein is today not a trivial job.
Different attempts have been used in the past in order to solve the identification problem. In general, the known methods are based on semi-computerised comparisons between numerical representations of peptide peaks of known proteins and peptide peaks of the unknown protein.
Common to all the known methods is that they rely strongly on a human factor, in the sense that a person often judges a match, or no match, between a known protein and the unknown protein based on a graphical representation of the peaks of known proteins, selected by a closeness-of-fit algorithm, and the peaks of the unknown protein. With such methods, a correct identification is therefore strongly influenced by the skills of the person performing the selection.
It is known that when a process is strongly influenced by the skills of a human being, the process may become non-reproducible. In the present situation, this means that the certainty of the identification may be low, making the result less valuable.
Thus, an object of the present invention is to provide a method and a computer system for selecting at least one candidate protein from a database, in such a manner that the human factor is minimised. This and many other objects are believed to be fulfilled by a method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said method comprises preferably based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
It should be noted that if the raw experimental spectrum is of really bad quality, there may, quite naturally, not be any selected peaks, meaning that "a list of selected peaks" can contain no peaks at all in a really bad scenario.
The set of peak matches is preferably all those pairs of peaks that satisfy the condition that the distance between the two peak masses is less than a certain value. This certain value is in many preferred embodiments user defined, but may, of course, be defined based on the actual data.
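
The match criterion above can be illustrated with a minimal sketch: a selected peak and a theoretical peak count as a match when their masses differ by less than a tolerance. The function name `match_peaks` and the parameter `tolerance` are hypothetical, and the binary search is only one possible way to find the pairs.

```python
from bisect import bisect_left

def match_peaks(experimental, theoretical, tolerance):
    """Return (experimental_mass, theoretical_mass) pairs whose distance is below tolerance."""
    theo = sorted(theoretical)
    matches = []
    for x in experimental:
        # First theoretical mass that is not below the lower edge of the window.
        i = bisect_left(theo, x - tolerance)
        while i < len(theo) and theo[i] < x + tolerance:
            if abs(theo[i] - x) < tolerance:
                matches.append((x, theo[i]))
            i += 1
    return matches

pairs = match_peaks([1000.0, 1500.5], [999.9, 1200.0, 1500.4, 1501.0], 0.2)
print(pairs)
```

Sorting the theoretical masses once keeps each lookup cheap when many selected peaks are compared against the same candidate protein.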
In particular preferred embodiments of the method according to the present invention, the method comprises preferably the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being preferably determined as the negative logarithm of the probability for the set of peak matches to occur.
In accordance with the present invention the method preferably always computes a score value to every protein stored in the database. In order to assess the score, which may be interpreted in a wrong manner if no measure is taken to avoid it, the present invention may preferably further comprise the steps of determining the probability for getting a score value equal to or above a predefined score (σ), wherein
the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question and/or the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach from the data base a score value at least as large as the score value in question.
Thanks to the provision of these two probabilities, the assessment may lead to the conclusion that none of the proteins represented in the data base is a likely candidate as a comparison of these two probabilities normally will show that top-scoring candidates from the data base search have score values much higher than what can be expected from just random matching. This last situation is especially valuable if the unknown protein is not in the database.
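
The random-matching probability discussed above can be sketched as follows, under stated assumptions. The fraction of random proteins reaching a score is estimated empirically from a list of random scores, and the chance that at least one protein in a database of a given size reaches it is computed by treating the proteins as independent. Both function names are hypothetical, and this is not a reproduction of the equations disclosed in the description.

```python
def tail_fraction(random_scores, sigma):
    """Fraction of random-protein scores that are at least sigma."""
    return sum(1 for s in random_scores if s >= sigma) / len(random_scores)

def p_at_least_one(f_random, n_database):
    """Chance that at least one of n_database random proteins reaches the score.

    Assumption: the proteins are treated as independent trials.
    """
    return 1.0 - (1.0 - f_random) ** n_database

scores = [2.1, 3.4, 5.0, 7.7, 1.2, 4.4, 6.3, 2.9]
f = tail_fraction(scores, 6.0)
print(f, p_at_least_one(f, 100))
```

A top-scoring database protein is convincing only when this probability is small for its score value.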
In particularly preferred embodiments, the probability for the set of peak matches to occur preferably reflects the probability to have a predetermined number (r) of matches. Furthermore, the probability for the set of peak matches to occur may preferably be determined so that many matches are rewarded, while at the same time the propensity for large proteins to have many matches is taken into account.
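
One way to see how a score can reward many matches while penalising large proteins is a binomial model. The sketch below is an illustrative assumption, not the specialization used in the description: each of the protein's peptides is taken to match independently with the same probability `q`, and the score is the negative logarithm of the probability of exactly `r` matches. All names are hypothetical.

```python
from math import comb, log

def match_probability(r, n_peptides, q):
    """Binomial probability of exactly r matches among n_peptides peptides."""
    return comb(n_peptides, r) * q**r * (1.0 - q) ** (n_peptides - r)

def score(r, n_peptides, q):
    """Negative logarithm of the match probability; unlikely events score high."""
    return -log(match_probability(r, n_peptides, q))

# Same number of matches, small versus large protein.
print(score(5, 20, 0.05), score(5, 200, 0.05))
# More matches on the same protein.
print(score(8, 20, 0.05))
```

Under this assumption, five matches on a protein with 20 peptides score higher than five matches on a protein with 200 peptides, because random matches are more likely for the larger protein.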
The candidate protein(s) is preferably represented by a first set of peptide masses being a theoretical spectrum wherein each peak has the same intensity as all other peaks.
The extracting of noise-free mono-isotopic peptide peaks by the method according to the present invention may preferably comprise determining the intensity level of the spectrum where the signal-to-noise ratio is unity, such as substantially unity, preferably by use of equation 1 disclosed herein; determining the intensity level of the spectrum where the signal-to-noise ratio is zero, such as substantially zero, preferably by use of equation 2 disclosed herein; and locating peak candidates by locating all parts, such as substantially all parts, of the spectrum over noise and extracting, preferably by copying, the ranges to form peak candidates.
In connection hereto or in general, the extracting of noise-free mono-isotopic peptide peaks may preferably further comprise determining the peak entities mass/electric charge, intensity, width and signal-to-noise ratio for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
Alternatively or in connection to the above, the extracting of noise-free mono-isotopic peptide peaks may preferably comprise or further comprise determining a baseline of the spectrum and smoothing the baseline; determining a noise level and smoothing the noise level; and locating peak candidates by locating all parts, such as substantially all parts, of the spectrum over noise and extracting, preferably by copying, the ranges to form peak candidates.
Furthermore, in particularly preferred embodiments of the present invention, the extracting of noise-free mono-isotopic peptide peaks may further comprise fitting function parameters for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
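
The step of locating peak candidates as all parts of the spectrum over noise can be sketched as below. This is a simplified illustration: it assumes a single constant noise level, whereas the description speaks of a determined and smoothed noise level, and the function name is hypothetical.

```python
def locate_peak_candidates(intensities, noise_level):
    """Return (start, end) index pairs, end exclusive, of contiguous runs above noise_level."""
    ranges = []
    start = None
    for i, y in enumerate(intensities):
        if y > noise_level and start is None:
            start = i  # a run above the noise begins here
        elif y <= noise_level and start is not None:
            ranges.append((start, i))  # the run ended at the previous sample
            start = None
    if start is not None:
        # The spectrum ended while still above the noise.
        ranges.append((start, len(intensities)))
    return ranges

spectrum = [0, 1, 5, 6, 1, 0, 7, 8]
print(locate_peak_candidates(spectrum, 2))
```

Each returned range can then be copied out of the spectrum and passed on to the later steps (property calculation, clustering, deconvolution).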
Preferably, the method according to the present invention allows peak extraction of peptide peaks representing peptides having any value of electrical charge.
The method according to the present invention may preferably further comprise the step of filtering the list of selected peaks. Additionally, the filtering may comprise discarding one or more peaks preferably based on input from a user of the method, said input may claim one or more specific peak to be discarded. The list is preferably filtered based on one or more of the strategies: echoes to consider, intensity cut, low and mass, aldi, peaks to exclude, peaks to keep, width cut.
In preferred embodiments of the method, a peak match is defined as a situation where the distance between the two peak masses is less than a predefined match value according to equation 6 disclosed herein, and the predefined match value may preferably be user defined. The method according to the present invention may preferably further comprise the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being determined as the negative logarithm of the probability for the set of peak matches to occur. Typically and preferably, a score value is determined for all proteins in the data base. Additionally or in general, the method may preferably further comprise the step of determining the probability for getting a score value equal to or above a predefined score (σ).
According to preferred embodiments of the method, the step of determining the probability may preferably be based on the list of peptide peaks, the predefined match value and the probability density for the list of peptide peaks.
Preferably, the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question. Additionally or in combination thereto, the probability for getting a score value equal to or above a predefined score (σ) is preferably the probability to reach from the data base a score value at least as large as the score value in question.
In particularly preferred embodiments, the probability for getting a score value equal to or above a predefined score (σ) is calculated by equation 24 disclosed herein.
Preferably, the database stores a list of peptide masses and the corresponding parent proteins. Typically and preferably, the database results from a digestion of proteins, and the digestion may preferably have been performed by the method according to the second aspect of the present invention.
In a second aspect, the present invention preferably relates to a method for in silico digesting proteins, comprising establishing a plurality of protein sequences; checking, for each amino acid in the sequences, whether the amino acid acquires a post-translational modification and if so modifying the amino acid, and whether the current position coincides with a pre-specified cleavage site or is the right-most amino acid and if so modifying the amino acid accordingly; and computing and registering the masses for all possible combinations of minimal peptide masses.
In a third aspect, the present invention preferably relates to a computer system for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said computer system comprises preferably means for extracting noise-free mono-isotopic peptide peaks from a numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks, the extracting being based on a numerical representation of the mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s); and determining, such as computing means, the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
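
The in silico digestion of the second aspect can be sketched for the trypsin rule (cut on the C-terminal side of lysine K or arginine R, unless the next residue is proline P), including missed cleavages as contiguous joins of minimal peptides. This is a minimal sketch: the function names and the `max_missed` parameter are hypothetical, and post-translational modifications and mass computation are left out.

```python
def cleavage_sites(sequence):
    """Positions after which trypsin cuts: after K or R, unless followed by P."""
    sites = []
    for k, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[k + 1] != "P":
            sites.append(k + 1)
    return sites

def digest(sequence, max_missed=0):
    """Return the minimal peptides plus joins spanning up to max_missed missed sites."""
    if not sequence:
        return []
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # j - i - 1 is the number of missed cleavage sites inside the peptide.
        for j in range(i + 1, min(i + 2 + max_missed, len(bounds))):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides

print(digest("AKRPGKC"))
print(digest("AKRPGKC", max_missed=1))
```

In the example sequence the arginine is followed by proline, so no cut is made there; the peptide masses would follow by summing residue masses over each returned peptide.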
The computer system according to the present invention comprises preferably means for performing some or all of the steps of the method according to the first aspect of the present invention. Such means comprises preferably one or more computer processor, memory, disk storage, one or more data connections and/or the like.
In order to ease use of the computer system and method according to the present invention, a graphical user interface is provided in a fourth aspect of the present invention. This graphical user interface is preferably particularly useful for guiding a user through a protein identification process and for presenting the result of the identification process, as the interface preferably comprises a number of module fields each representing one or more modules adapted to perform one or more of the steps of the method according to the first aspect of the present invention and/or the second aspect of the present invention. In accordance with the invention, these fields are preferably graphically arranged and graphically linked so as to reflect a predefined executing order of the modules, the executing order being preferably the calculation flow governed by the underlying algorithms. The graphical user interface is preferably adapted to initiate execution of a module in response to input to the computer system. The graphical user interface is preferably also adapted to change the appearance of a field corresponding to the module being executed. The input is preferably provided by a pointing device, such as a computer mouse, and an associated button press. Additionally or in combination thereto, one or more of these fields change appearance during execution of their corresponding module(s).
The graphical user interface according to the present invention may preferably further comprise one or more result fields appearing after results have been and/or during results are being generated by one or more of said modules and displaying the results.
The graphical user interface typically and preferably further comprises input fields preferably appearing when a user input is required. Furthermore, the graphical user interface may preferably further comprise one or more dialog windows through which user input may be inputted and/or edited. In connection thereto or in general, one or more of the dialog windows may preferably allow a user to edit values stored in a data base, and the one or more dialog windows are preferably accessible via button(s) appearing on the interface.
Typically and preferably, the graphical user interface further comprises a tool bar via which actions can be executed by pushing buttons appearing on said tool bar, said tool bar preferably further comprising curtains, wherein each curtain represents a category of action and wherein each curtain comprises buttons for actions belonging to a particular category. Furthermore, the graphical user interface may preferably further comprise a set of windows that communicate the results of the protein identification process.
Detailed description of preferred embodiment
In the following, the present invention, and in particular preferred embodiments thereof, will be presented in connection with the accompanying figures, in which:
Fig. 1 shows a raw experimental spectrum.
Fig. 2 shows the algorithmic flow of the present invention, (a: automatic input u: user input).
Fig. 3 shows an example of a cluster that might be mistaken for a single, very broad, peak.
Fig. 4 shows a peak cluster.
Fig. 5 shows the isotope abundancy distributions for the four lowest-lying isotopes I(0, m), I(1, m), I(2, m) and I(3, m). (The horizontal axis: m; the vertical axis: isotope abundancy.)
Fig. 6 shows a simple illustration of the chemistry of post-translational modifications (ptm:s) and missed cleavages, and their potential effect on the peaks of a mass spectrum: a. no ptm:s, no missed cleavage; b. ptm:s, both fixed and variable; c. missed cleavage.
Fig. 7 shows an example of the distribution of the number of peptides per parent protein. On the y-axis is shown the values of the frequency distribution for the number of peptides per parent protein. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.
Fig. 8 shows an example of the distribution of the peptide masses. On the y-axis is shown the values of the peptide mass frequency distribution. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.
Fig. 9 shows an example of a clear case of protein identification. On the y-axis is shown the values of the two functions, subscripted "random" and "database", defined in equations eq. 25 and eq. 26 respectively. The top-scoring candidates from the database search have score values much higher than what can be expected from just random matching.
Fig. 10 shows an example of a case with no clear protein identification. On the y-axis is shown the values of the two functions, subscripted "random" and "database", defined in equations eq. 25 and eq. 26 respectively. The top-scoring candidates from the database search have score values not higher than what can be expected from just random matching. Further experimental measures have to be taken.
Fig. 11 shows the workspace as it appears on startup. The arrows between the analysis boxes illustrate the flow of data in the analysis, the boxes represent key steps in the analysis, as described in the sections "Algorithms". Graphs of the data and results will occupy the empty space on the left-hand side, once the analysis is started.
Fig. 12 shows the File menu.
Fig. 13 shows the Edit menu.
Fig. 14 shows the Actions menu.
Fig. 15 shows the dialogue window that comes up when the MSFiles box is activated. By clicking on the relevant fields the user chooses what spectrum files to be analyzed. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step, and move directly to the peak filtering step.
Fig. 16 shows the options menu for the peak filtering step, corresponding to the box Pepfil.
Fig. 17 shows the options menu related to the database digestion as well as the scoring and the validation steps; corresponding to the box Matcher.
Fig. 18 shows the window in which a user can specify his/her own post-translational modification.
Fig. 19 shows the mass spectrum during different stages in the analysis. The top graph is the unprocessed spectrum. The middle graph shows the extracted mono-isotopic peaks. The bottom graph shows the mono-isotopic peaks after the filtering step. Zooming in any of the three graphs will automatically zoom the two other graphs as well. The table at the bottom left of the workspace shows the top-scoring proteins from the database.
Fig. 20 shows a table presenting the top-scoring proteins of the database.
Fig. 21 shows a table presenting the peaks that were extracted and met the peak criteria in the filtering step. This interactive window enables a user to manually add or remove peaks that he/she thinks can affect the database search.
Fig. 22 shows detailed information from the database search about a particular database protein.
Fig. 23 shows the search result using the default parameters of the present invention. No likely protein candidate is found.
Fig. 24 shows the search result when the parameter values have been chosen judiciously, taking into account known information about the experimental spectrum, assumed post-translational modifications etc., leading to one very likely protein candidate, marked by a green flag.
Fig. 25 shows the workspace after a batch run. The result for each individual spectrum can be studied in the same way as after processing a single spectrum.
Note: The figures and the references to figures in this description are merely illustrations of the methods, functions and features of the present invention.
This Description
The description of the present invention is divided into six parts:
• Input and Output
• The Algorithms
• The Graphical User Interface
• The Computer Implementation
• Other Methods
• Applications
Input and Output
The present invention takes as input a spectrum from a mass spectrometer, a protein database, and a set of user options. The output is a table of those proteins (from the database) that are the most likely candidates for having generated the mass spectrum.
Input - the experimental mass spectrum
The relevant numbers representing an experimental mass spectrum, an example of which is shown in figure 1, are assumed to be stored in a data file. Different data file formats are accepted by the present invention. A general description of allowed formats is the following:
h
(m/z)_1   separation field   intensity_1
(m/z)_2   separation field   intensity_2
...
(m/z)_N   separation field   intensity_N
where
• h is a header row, which may or may not be empty. If it is not empty it may contain information about the experimental setup that generated the spectrum. If it is empty, it may also be left out.
• The first column should contain values of the entity mass/electric charge (m/z), measured in units of Dalton (Da).
• The second column is a field that separates the first and the third column. It can, for example, consist of a comma or white space.
• The third column should contain values of the intensity measured at the mass spectrometer.
• N is the number of datapoints in the spectrum.
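A minimal reader for this row format could look like the sketch below. It is an illustration under stated assumptions, not the invention's own parser: the function name `parse_spectrum` is hypothetical, the separation field is assumed to be a comma and/or white space, any leading row that does not start with two numbers is treated as the header row h, and the strictly increasing m/z ordering that the format requires is enforced.

```python
import re


def parse_spectrum(text):
    """Parse rows of 'm/z <separator> intensity' into two lists (x, y).

    Hypothetical helper: leading rows that do not parse as two numbers
    are treated as the optional header row h and skipped.
    """
    xs, ys = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        try:
            x, y = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            if not xs:                      # still in the header part
                continue
            raise ValueError(f"bad data row {line_no}: {line!r}")
        if xs and x <= xs[-1]:
            raise ValueError("m/z values must be strictly increasing")
        xs.append(x)
        ys.append(y)
    return xs, ys
```

For example, `parse_spectrum("MALDI run 7\n1000.1, 5\n1000.2, 7.5\n")` skips the header row and returns the two data rows as lists of floats.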
It is furthermore required that $(m/z)_{k+1} > (m/z)_k$ for $k = 1, \ldots, N - 1$. The entire m/z range of the experimental spectrum is therefore $[(m/z)_1, (m/z)_N]$. Figure 1 shows a mass spectrum where, as is conventional, the datapoint values of the first column (m/z) and the third column (intensity) correspond to the horizontal and vertical directions, respectively. Note: in the following, x and y are defined as x ≡ m/z and y ≡ intensity.
Input - the protein database
A protein database consists of a table of proteins. Depending on how well annotated the database is, the amount of information available varies. A minimum requirement for the present invention to work, is that for each database entry there is an identification tag (number or name) of the protein, and that its amino acid sequence is presented, where the amino acids are represented by their one-letter code. Additional but not necessary information is, for example, information about species, protein weight etc. Examples of protein databases that can be used are SWISSPROT [1], and NCBI non-redundant peptide sequence database.
Input - the user options
A judicious choice of user options strongly increases the chances of a successful protein identification. The user options are described in detail below, in the section "The Algorithms" as well as in the section "The Graphical User Interface".
Output
The output is a table presenting the top candidates: those proteins in the database whose theoretical spectra show the strongest resemblance to the unknown experimental spectrum. The measure of resemblance is a score value, defined and described in the section "The Algorithms" below. The proteins in the table are presented in descending order with respect to resemblance, hence having the most likely candidate on the top row, the second most likely candidate on the second row, etc. For each top candidate in the table, its score value and other parameters related to the search statistics, as well as its particular amino acid sequence, are presented. Given the contents of the output table, a graphical illustration of the output can also be given. This is further described in the section "The Graphical User Interface" below.
The Algorithms
The algorithms of this invention can be divided into five well-separated parts:
• the peak extraction
• the peak filtering
• in silico digestion of the database proteins
• the spectrum matching
• the validation
A flow-chart of the algorithms is shown in figure 2.
The peak extraction - first preferred embodiment
The peak extraction is divided into a set of five sub-processes. A user can influence which peaks are extracted through the choice of signal-to-noise ratio s/n. The user's choice is denoted by σ.
The peak extraction 1: Separating signal from noise
In the present invention the separation of signal from noise is done in a series of steps.
1. Divide the entire spectral x range into sub-intervals $m_i$ of width $\omega$, $i = 1, 2, \ldots, I$, where $I$ is the number of sub-intervals.
2. Let $Y_i$ and $Z_i$ be the maximum and minimum intensity values, respectively, in the interval $m_i$, and put $W_i = Y_i - Z_i$.
3. For each integer x value in the spectrum, $j$, place a symmetric window $M_j$ of width $\Omega$ around $j$. Let $i_l$ be the index of the leftmost sub-interval $m_i$ that is covered by $M_j$, and let $i_r$ be the index of the rightmost sub-interval $m_i$ that is covered by $M_j$.
4. Define
$$y^1(j) \equiv \frac{\sum_{i=i_l}^{i_r} Y_i / W_i}{\sum_{i=i_l}^{i_r} 1 / W_i} \qquad (1)$$
5. $y^1(j)$ is now taken as the level of s/n = 1 for $j$.
6. Define
$$y^0(j) \equiv \sum_{i=i_l}^{i_r} w_i \cdot Z_i, \quad \text{where} \quad w_i \equiv \frac{1 / W_i}{\sum_{i'=i_l}^{i_r} 1 / W_{i'}} \qquad (2)$$
7. $y^0(j)$ is now taken as the level of s/n = 0 for $j$.
8. For each datapoint $d_n = (x_n, y_n)$ find the closest integer x value $j_n = j(x_n, y_n)$. The signal-to-noise ratio for $d_n$ is calculated as
$$\sigma(d_n) = \frac{y_n - y^0(j_n)}{y^1(j_n) - y^0(j_n)}$$
9. Classify all datapoints $d_n = (x_n, y_n)$ into the three sets $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$ according to:
• $\mathcal{D}_{noise}$: $\sigma(d_n) < 1$
• $\mathcal{D}_{support}$: $1 \le \sigma(d_n) \le \sigma$
• $\mathcal{D}_{signal}$: $\sigma < \sigma(d_n)$
where σ is the signal-to-noise level chosen by the user.
Note 1: $y = y^1(j)$ is the intensity level that minimizes the (weighted) quadratic distance $D(y) = \sum_{i=i_l}^{i_r} (y - Y_i)^2 / W_i$, which therefore also gives a straightforward and unambiguous definition of the level s/n = 1.
Note 2: The reason to calculate the weighted quadratic distance $D(y) = \sum_{i=i_l}^{i_r} (y - Y_i)^2 / W_i$ and not just $D(y) = \sum_{i=i_l}^{i_r} (y - Y_i)^2$ is to prevent the peaks themselves from having too much influence. Computing a non-weighted average would make the level of s/n = 1 too high in regions with very high peaks or in regions with very many peaks.
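The steps above can be sketched in a few lines of code. This is a minimal illustration under assumptions, not the patented implementation: the function names and default window widths are hypothetical, the weights are taken as 1/W_i (the reading suggested by the weighted quadratic distance in Note 1), a sub-interval counts as "covered" by the window if it overlaps it, the window is assumed to be at least as wide as one sub-interval, and a small `eps` guards against sub-intervals with W_i = 0.

```python
def noise_levels(xs, ys, omega=1.0, big_omega=20.0, eps=1e-9):
    """Return dicts y0[j], y1[j]: the s/n = 0 and s/n = 1 levels at each
    integer x value j (xs must be sorted in increasing order)."""
    x_min = xs[0]
    n_sub = int((xs[-1] - x_min) // omega) + 1
    y_max = [None] * n_sub          # Y_i, maximum intensity per sub-interval
    y_min = [None] * n_sub          # Z_i, minimum intensity per sub-interval
    for x, y in zip(xs, ys):
        i = int((x - x_min) // omega)
        y_max[i] = y if y_max[i] is None else max(y_max[i], y)
        y_min[i] = y if y_min[i] is None else min(y_min[i], y)
    y0, y1 = {}, {}
    for j in range(round(xs[0]), round(xs[-1]) + 1):
        i_l = max(0, int((j - big_omega / 2 - x_min) // omega))
        i_r = min(n_sub - 1, int((j + big_omega / 2 - x_min) // omega))
        idx = [i for i in range(i_l, i_r + 1) if y_max[i] is not None]
        if not idx:
            continue
        # Assumed weights 1/W_i: tall peaks get little influence.
        w = [1.0 / max(y_max[i] - y_min[i], eps) for i in idx]
        total = sum(w)
        y1[j] = sum(wi * y_max[i] for wi, i in zip(w, idx)) / total
        y0[j] = sum(wi * y_min[i] for wi, i in zip(w, idx)) / total
    return y0, y1


def classify(xs, ys, y0, y1, sigma_user):
    """Label every datapoint as 'noise', 'support' or 'signal'."""
    labels = []
    for x, y in zip(xs, ys):
        j = round(x)                             # closest integer x value
        span = y1[j] - y0[j]
        sn = (y - y0[j]) / span if span > 0 else 0.0
        if sn < 1:
            labels.append("noise")
        elif sn <= sigma_user:
            labels.append("support")
        else:
            labels.append("signal")
    return labels
```

Because each sub-interval is weighted by the inverse of its intensity spread, a single tall peak barely moves the s/n = 1 level in its neighbourhood, which is the effect Note 2 describes.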
The peak extraction 2: Classification of the datapoints and peak determination
Given the three sets $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$, the present invention will now determine peak properties. A peak $\mathcal{P}$ is built up by datapoints ($d_n \equiv (x_n, y_n)$) according to the following criteria:
• $\mathcal{P} = [d_j = d_j(\mathcal{P});\ x_j < x_{j+1};\ j = 0, 1, \ldots, J(\mathcal{P}), J(\mathcal{P}) + 1]$
• $d_j \in \mathcal{D}_{support}$ or $d_j \in \mathcal{D}_{signal}$ for $j = 1, \ldots, J(\mathcal{P})$
Figure 3 shows a portion of a raw experimental spectrum. The non-constant line marked 1 (solid) connects experimental datapoints. The lowest constant line, line 2 (dashed), indicates the level where the signal-to-noise level is 1. The second lowest constant line, line 4 (dotted), indicates the level where the signal-to-noise level is the one that the user has chosen (= σ). Crosses marked 3 are those datapoints that belong to the set $\mathcal{D}_{signal}$. The dots on line 1 between lines 2 and 4 are those datapoints that belong to the set $\mathcal{D}_{support}$. As can be seen in the figure, there are at least four peaks, but unless one adds an extra criterion to the three peak criteria above, the algorithm will extract only one peak. This one peak will have an x range, $r_x(\mathcal{P}) \equiv x_{J(\mathcal{P})+1} - x_0$, covering approximately 6 Da. Since peaks representing different isotopes of the same singly charged peptide should be separated by a distance of approximately 1 Da, a peak cannot have an x range much larger than 1 Da. Therefore a fourth criterion is added to the three above:
• $r_x(\mathcal{P}) < r_x^{cut}$, where $r_x^{cut}$ is a value slightly above 1 Da.
In order to implement this fourth criterion, the present invention employs a procedure that systematically checks the value of rx(V):
1. Given:
• $\mathcal{P} = \mathcal{P}_i$
• the levels that separate $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$ (the lines 2 and 4 in figure 3 for i = 0)
2. Determine $r_x(\mathcal{P}_i)$. If
• $r_x(\mathcal{P}_i) < r_x^{cut}$: Determine the peak properties of $\mathcal{P}_i$, see below.
• $r_x(\mathcal{P}_i) > r_x^{cut}$: Continue to step 3.
3. Introduce a cut just above the y level of the datapoint in $\mathcal{P}_i$ where the lowest local minimum occurs.
4. This cut replaces line 2, hence increasing $\mathcal{D}_{noise}$ at the expense of $\mathcal{D}_{support}$ in this particular x range.
5. The cut divides $\mathcal{P}_i$ into two new peak sets $\mathcal{P}_{i1}$ and $\mathcal{P}_{i2}$.
6. Repeat steps 2 and 3 for $\mathcal{P}_{i1}$ and $\mathcal{P}_{i2}$ until all peaks satisfy $r_x(\mathcal{P}) < r_x^{cut}$.
7. Start with $\mathcal{P} = \mathcal{P}_0$; that is, i = 0.
In figure 3 the constant line marked 5 (double-dashed) indicates the last position of the moving cut. Having determined peaks that obey the four peak criteria, the present invention now proceeds to calculate peak properties.
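The moving-cut procedure can be sketched recursively. The following is a simplified illustration, not the invention's exact procedure: instead of raising a horizontal cut level, it splits the datapoint list at the index of the lowest interior local minimum and drops that datapoint, which yields two pieces per cut. The function names and the default value of `r_cut` are hypothetical.

```python
def x_range(points):
    """x extent of a peak given as a list of (x, y) datapoints sorted by x."""
    return points[-1][0] - points[0][0]


def split_peak(points, r_cut=1.2):
    """Recursively split a too-wide peak at its lowest interior local minimum."""
    if len(points) < 3 or x_range(points) < r_cut:
        return [points]
    minima = [k for k in range(1, len(points) - 1)
              if points[k][1] < points[k - 1][1]
              and points[k][1] <= points[k + 1][1]]
    if not minima:
        return [points]               # wide but without a valley: keep as is
    k = min(minima, key=lambda i: points[i][1])
    # The datapoint at the minimum falls below the cut and is discarded.
    return split_peak(points[:k], r_cut) + split_peak(points[k + 1:], r_cut)
```

Each recursive call works on a strictly shorter list, so the procedure terminates even when a wide candidate has no interior minimum.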
The peak extraction 3: Computing peak properties
The intensity of a peak $\mathcal{P}$ is in the present invention taken as the maximum intensity value of those datapoints that build up the peak. The center of mass and width of a peak are determined by a centroid calculation. Furthermore, the signal-to-noise ratio of the peak is determined through its maximum intensity value, using the definition of signal-to-noise ratio for an individual datapoint as described in step 8 in the passage "Separating signal from noise" above.
• $y(\mathcal{P}) = \max_{\mathcal{P}}(y_j)$
• $x_c(\mathcal{P}) = \dfrac{\sum_j x_j y_j}{\sum_j y_j}$
• $\sigma(\mathcal{P}) = \dfrac{y(\mathcal{P}) - y^0(x(\mathcal{P}))}{y^1(x(\mathcal{P})) - y^0(x(\mathcal{P}))}$
where $x(\mathcal{P})$ is the closest integer value to $x_c(\mathcal{P})$.
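As a rough sketch, the peak properties can be computed as below. The function name is hypothetical, `y0` and `y1` are assumed to be lookups of the s/n = 0 and s/n = 1 levels keyed by integer x value, and the width is taken here as the intensity-weighted standard deviation around the centroid, which is one possible choice and an assumption of this example.

```python
import math


def peak_properties(points, y0, y1):
    """Intensity, centroid, width and s/n of a peak.

    points: list of (x, y) datapoints of the peak.
    y0, y1: dicts mapping integer x values to the s/n = 0 and s/n = 1 levels.
    """
    height = max(y for _, y in points)                    # y(P)
    total = sum(y for _, y in points)
    centre = sum(x * y for x, y in points) / total        # x_c(P), centroid
    # Assumed width definition: intensity-weighted standard deviation.
    variance = sum(y * (x - centre) ** 2 for x, y in points) / total
    j = round(centre)                                     # x(P)
    sn = (height - y0[j]) / (y1[j] - y0[j])               # sigma(P)
    return {"intensity": height, "centre": centre,
            "width": math.sqrt(variance), "sn": sn}
```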
In figure 3 the vertical impulses marked 6 (dash-dot) indicate the central x values ($x_c(\mathcal{P})$) and maximum y values ($y(\mathcal{P})$) for the peaks that have been extracted.
The peak extraction 4: Determining peak clusters
When the properties of all peaks have been computed, the next step is to partition the peaks into peak clusters. Two peaks, $\mathcal{P}$ and $\mathcal{P}'$, are defined to be neighbours in, and members of, the same peak cluster if
$$1 - \tau < |x_c(\mathcal{P}) - x_c(\mathcal{P}')| < 1 + \tau,$$
where $\tau \approx 0.1$ Da. The reason for having the value of $\tau$ in that particular range is based on the fact that consecutive isotopes that belong to the same singly charged chemical compound should be separated by the mass of the neutron, ≈ 1 Da. An additional criterion is imposed on a cluster: at least one peak in a cluster needs to have at least one datapoint in $\mathcal{D}_{signal}$. Such a peak is said to be in $\mathcal{D}_{signal}$. The p:th cluster can, in turn, be defined as a set of peaks:
$$C_p \equiv [\mathcal{P}_p(r);\ \mathcal{P}_p(r) = (y(p, r), x_c(p, r), w(p, r)),\ r = 1, \ldots, r(p);\ \exists\, \mathcal{P}_p(r) \in \mathcal{D}_{signal}] \qquad (3)$$
where $\mathcal{P}_p(r)$ is peak number r in the p:th cluster $C_p$ and $r(p)$ (> 1) is the number of peaks in that cluster.
The peak extraction 5: Finding mono-isotopic peaks in a cluster
The present invention now proceeds to select the mono-isotopic peaks in a peak cluster. Consider the p:th cluster $C_p$ defined above, an example of which is shown in figure 4. In order to identify the mono-isotopic peaks, the present invention computes the peptide abundancies that build up the cluster. By
• using tabulated isotope abundancies for the elements that build up peptides (these elements are H, C, N, O and S)
• digesting a protein database
isotopic distributions as functions of peptide mass are readily computed: $I(i, m)$ is the portion of isotope i at mass m, such that $\sum_i I(i, m) = 1$, where i = 0 represents the mono-isotope. Figure 5 shows the distributions $I(i, m)$ for the four lowest-lying isotopes. Due to statistical variations there is a width in each distribution, indicated by error bars. The isotopic distributions can therefore be written as $I(i, m) = \mu(i, m) + \nu(i, m)$, where $\mu(i, m)$ is the center of a distribution and $\nu(i, m)$ is the (positive and negative) width around that center. In figure 5 the center and the width are the average and standard deviation, respectively. The values of the isotopic distributions are in the present invention kept in a table.
Consider now the vector $a_p \equiv a = (a_1, \ldots, a_r, \ldots, a_{r(p)})$. The present approach is such that a value of $a_r > 0$ will indicate that there is a peptide whose mono-isotopic component is located at peak number r in the cluster. The abundancy of this peptide is defined to be $a_r$. Also taken into account is the possibility that a peak can be built up by different isotope components stemming from different peptides. In a situation free from experimental noise and variations in the isotopic pattern (meaning $\nu(i, m) = 0$ and therefore $I(i, m) = \mu(i, m)$) the present invention would proceed to solve the following system of equations for the cluster $C_p$:
$$\begin{aligned}
a_1 \mu(0, m) &= y(p, 1) \\
a_1 \mu(1, m) + a_2 \mu(0, m) &= y(p, 2) \\
&\ \,\vdots \\
a_1 \mu(r - 1, m) + \ldots + a_r \mu(0, m) &= y(p, r) \\
&\ \,\vdots \\
a_1 \mu(r(p) - 1, m) + \ldots + a_{r(p)} \mu(0, m) &= y(p, r(p))
\end{aligned}$$
where m is taken as $x_c(p, 1)$. This system contains $r(p)$ equations and $r(p)$ unknowns, and therefore has a unique solution of peptide abundancies, $a = (a_1, \ldots, a_r, \ldots, a_{r(p)})$. In the ideal situation the solution will consist of a (small) number of $a_r$ such that $a_r > 0$ and then a (larger) number of r for which $a_r = 0$. Because of the presence of noise (figure 4) and variations in isotope distributions ($\nu(i, m) \neq 0$, as illustrated by the error bars in figure 5), the solution a can in general contain components $a_r < 0$; a non-realistic solution in the present context. The way to approach the problem will instead be to find a solution a such that it is the best possible with the constraint that $\forall r: a_r \ge 0$. By "best possible" solution is meant the following: Define
$$d(a, p, r) = y(p, r) - (a_1 \mu(r - 1, m) + \ldots + a_r \mu(0, m))$$
that is,
$$d(a, p, r) \equiv y(p, r) - \sum_{s=1}^{r} a_s \cdot \mu(r - s, m) \qquad (4)$$
and
$$D = D(a) \equiv \sum_{r=1}^{r(p)} d(a, p, r)^2 \qquad (5)$$
The best possible solution is the $a = \hat{a} = (\hat{a}_1, \ldots, \hat{a}_r, \ldots, \hat{a}_{r(p)})$ that simultaneously satisfies the three conditions
• $\forall r: a_r \ge 0$
• The condition $a_r = 0$ should be imposed on as many values of r as possible.
• D takes on a value less than the uncertainty given by the spectrum noise and the width in the isotope distributions.
The present invention employs the well-known technique of quadratic programming to solve this constrained minimization problem. So, for each cluster $C_p$ there will be mono-isotopic peptides in, and only in, those positions r where $\hat{a}_r > 0$.
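To make the constrained minimization concrete, the sketch below minimizes D(a) from eq. (5) subject to a_r ≥ 0 using cyclic coordinate descent with clamping at zero. This is a simple stand-in chosen for illustration, not the quadratic programming routine used by the invention, and it handles only the non-negativity condition (the sparsity and noise-threshold conditions are not implemented). The function name, the sweep count and the isotope fractions used in the usage note are hypothetical.

```python
def deconvolve_cluster(y, mu, sweeps=500):
    """Minimize sum_r (y[r] - sum_{s<=r} a[s] * mu[r - s])**2 with a[s] >= 0.

    y:  peak intensities y(p, r) of one cluster, r = 0 .. n-1 (0-based).
    mu: mean isotope fractions mu(i, m) for i = 0, 1, 2, ... at the cluster mass.
    """
    n = len(y)

    def m(r, s):
        """Coefficient of a[s] in equation r: mu(r - s), or 0 if out of table."""
        k = r - s
        return mu[k] if 0 <= k < len(mu) else 0.0

    a = [0.0] * n
    for _ in range(sweeps):
        for s in range(n):
            num = den = 0.0
            for r in range(s, n):
                # Prediction for peak r from all peptides except number s.
                others = sum(a[t] * m(r, t) for t in range(r + 1) if t != s)
                num += m(r, s) * (y[r] - others)
                den += m(r, s) ** 2
            # Exact one-dimensional minimizer, clamped to the constraint a >= 0.
            a[s] = max(0.0, num / den) if den > 0 else 0.0
    return a
```

With illustrative fractions `mu = [0.55, 0.30, 0.11, 0.04]` and intensities `y = [55.0, 30.0, 33.0, 16.0, 4.4]`, the result is close to `[100, 0, 40, 0, 0]`: two peptides, with mono-isotopic peaks at the first and third positions of the cluster.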
A note on spectra that may contain peaks representing peptides with higher charges
The methods for peak extraction of singly charged peptides described above can in a straightforward manner be generalized to spectra that may contain peptides with higher charges. An outline of such an extension is given here:
• Separating signal from noise: Same method as in the singly charged case
• Classification of the datapoints and peak determination: Same method as in the singly charged case
• Computing peak properties: Same method as in the singly charged case
• Determining peak clusters: Same method as in the singly charged case, with the following modification: Consecutive isotopes that belong to the same chemical compound of charge z should be separated by the mass of the neutron divided by z, ≈ 1/z Da. This fact implies that in this type of spectra each peak cluster gets a charge label z, and adjacent peaks in such a cluster are separated by 1/z Da.
• Creating disjoint systems of equations: A peak in a spectrum that may contain peptides with higher charges can represent peptides of different charges. This means that the systems of equations that describe the peak intensities in terms of isotopic distributions $I(i, m)$ and peptide abundancies $a_r$ may get contributions from more than one peak cluster. In this general case it is therefore necessary to introduce a procedure that creates a set of disjoint systems of equations, where a disjoint system may or may not get contributions from more than one peak cluster, and a peak cluster contributes to one and only one such disjoint system.
• Finding mono-isotopes in a cluster: Given a set of disjoint systems of equations: Same method as in the singly charged case.
The peak extraction - second preferred embodiment
1: The baseline of the spectrum is calculated. This is done by finding, over the given raw data, the smallest intensity within a running window (user option: -bcr [baseline constant range]). This calculation is performed by analysis of the measured intensity values by means of histograms, and the smallest valued bin in the histogram is chosen as the baseline value. The baseline is smoothed after the analysis to make sure that the curve is continuous.
2: The noise level calculation is performed in a similar way as the baseline calculation, but another running window size (user option: -nr [noise range]) is used. Here, the histogram bin value is chosen in a different way, utilizing the fact that true measured values are more sparse than noise. The histogram describes the density of the measured intensities, and the bin value under a user-defined threshold (user option: -nd [noise density]) is chosen as the noise level. The noise curve is smoothed to assure continuity.
3: Peak candidates are located. This means locating all parts of the spectrum above the noise level, and extracting (copying) these ranges to form peak candidates. A peak candidate is defined by an increase in intensity (with increasing mass), reaching a peak, followed by a decrease of intensity to some cutoff. This algorithm step is not very precise, and is improved on in the subsequent steps.
4: Fitting of parameters on the peak candidates. This is done in three passes:
a. A crude, but robust, fit is made to get estimates of the peak parameters.
b. The estimates from a. are used in a more refined single Gaussian fit. These peak parameters are usually used in subsequent analysis of the spectrum.
c. The last fit is an attempt to fit more than one Gaussian to the peak candidate. This is done to resolve a possible situation of superposition of peaks. If this is successful, that is, more than one peak was fitted, then these peak parameters are used, and the candidate results in more than one peak.
5: Parameter fitting always yields results, and the fitted parameters are analysed, so that strange, contradictory or too wide (user option: -wc [width cut]) peaks are filtered out.
6: All remaining peaks are now bundled into clusters. A cluster is defined by peaks at most 1 Da apart. The clusters are analysed and compound clusters (i.e. contributions from more than one peptide) are resolved, leaving only single peptide clusters. (User option: -ida [isotope drift accuracy] sets the limit on how much peak masses can differ from one neutron mass to be treated as coming from one peptide.)
7: The single peptide clusters are analysed; mono-isotopic mass, charge and a quality measure are calculated for each cluster. The user option -mir [min_isotopic_reduction] is used in the quality analysis of whether the single cluster is a superposition of more than one peptide differing by an integer times the mass of a neutron; -mir sets the acceptable deviation from the cluster model used. User option -mi [mono_isotopes] defines whether to report all extracted peaks or mono-isotopes only. User option -of [overflow] sets the intensity measurement limit of the mass spectrometer. This parameter is needed since the measured isotopic clusters become corrupt if the measurements are overflowed. The user options -im and -ic [signal-to-noise or absolute intensity] set the model to use and the cut-off value.
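The bundling of peaks into clusters (step 6, and "Determining peak clusters" in the first embodiment) can be sketched as follows. This is an illustration under assumptions: the function name and parameters are hypothetical, a peak joins a cluster when its distance to that cluster's last peak lies within `spacing ± tau`, and the additional requirement that a cluster contain at least one signal peak is not checked. For peptides of charge z, `spacing` would be set to 1/z.

```python
def cluster_peaks(centres, spacing=1.0, tau=0.1):
    """Group peak centres into isotope clusters.

    A peak is appended to the first cluster whose last member lies
    spacing +/- tau below it; otherwise it starts a new cluster.
    """
    clusters = []
    for x in sorted(centres):
        for cluster in clusters:
            if abs((x - cluster[-1]) - spacing) < tau:
                cluster.append(x)
                break
        else:
            clusters.append([x])
    return clusters
```

Checking every open cluster, rather than only the most recent one, lets two interleaved isotope series (for example, offset by 0.5 Da) end up in separate clusters.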
The peak filtering
The peak extraction step described above has resulted in a set of selected peaks. However, a user of the present invention may want to discard some peaks. Those unwanted peaks may or may not be part of the selected peaks; if they are, they will be filtered out, that is removed, in this step. The invention supports the following filters and any combination of them:
• echoes to consider: In a mass spectrum, so-called peak echoes sometimes occur. By this is meant that the experimental mass spectrum sometimes contains peaks that are false doublets of true peaks. These false doublets often appear at certain well-established distances from the true peaks. This filter handles that problem. Assume the user chooses a value d, measured in Dalton. If two selected peaks have a mutual distance of (d ± (tolerance window)) Da, the peak with the higher m/z value is considered to be an echo, and is therefore removed.
• intensity cut: All peaks with an intensity below the intensity cut value, specified by the user, are removed.
• lower & upper mass: A user may not want to use peaks below or above certain m/z values when the spectrum matching is performed. All selected peaks outside the desired range are removed.
• maldi: In a Maldi mass spectrometer, the digested peptides carry an extra H+ ion. Preparing for the subsequent spectrum matching and database search, the user can choose to have the m/z value of all selected peaks reduced by m/z(H+).
• peaks to exclude: In a mass spectrum some particular m/z values often represent calibrants, parts of the digestion enzyme, or known contaminants. These m/z values can be specified by the user, and if they appear in the set of selected peaks, within a tolerance window, the peaks are removed.
• peaks to keep: Suppose the number of selected peaks is S and the user only wants U peaks to go into the spectrum matching. In this filter those U peaks with the highest signal-to-noise are kept, and the other S-U are removed.
• width cut: A threshold for the width of peaks. Peaks with a width above the width cut value, specified by the user, are removed.
The simplest form of output after the peak extraction and peak filtering steps is a table where each row contains information about an extracted peak that has survived the filter step. A row consists of values for the parameters of peak properties such as m/z, intensity, peak width, signal-to-noise ratio and peptide abundancy.
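The filters described above can be combined in a single pass over the peak table. The sketch below is illustrative only: the function name, the dictionary keys, the default tolerance and the order in which the filters are applied are assumptions of this example; the proton mass is given as an approximate value.

```python
PROTON_MASS = 1.00728  # Da, approximate mass of H+


def filter_peaks(peaks, intensity_cut=0.0, lower_mass=0.0,
                 upper_mass=float("inf"), width_cut=float("inf"),
                 exclude=(), echo_distance=None, tolerance=0.05,
                 keep=None, maldi=False):
    """Apply the peak filters to a list of peak dicts.

    Each peak is a dict with keys 'mz', 'intensity', 'width' and 'sn'.
    """
    kept = [p for p in peaks
            if p["intensity"] >= intensity_cut
            and lower_mass <= p["mz"] <= upper_mass
            and p["width"] <= width_cut
            and not any(abs(p["mz"] - e) <= tolerance for e in exclude)]
    if echo_distance is not None:
        # A peak is an echo if another peak sits echo_distance below it.
        kept = [p for p in kept
                if not any(abs(p["mz"] - q["mz"] - echo_distance) <= tolerance
                           for q in kept)]
    if keep is not None:
        kept = sorted(kept, key=lambda p: p["sn"], reverse=True)[:keep]
    if maldi:
        kept = [dict(p, mz=p["mz"] - PROTON_MASS) for p in kept]
    return sorted(kept, key=lambda p: p["mz"])
```

Every filter is optional, so calling `filter_peaks(peaks)` with the defaults returns all peaks sorted by m/z.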
Digestion of proteins in the protein database
The purpose of digestion of proteins in a protein database, so-called digestion in silico, is to mimic the enzymatic digestion of the real unknown protein that takes place in the laboratory, and hence compute theoretical spectra, one theoretical spectrum for each database protein. Having that, the present invention can compare the experimental spectrum with each theoretical spectrum.
Now, in the laboratory, digestion has been carried out by a site-specific enzyme. By this is meant that the enzyme cleaves, that is cuts, the protein only at certain cleavage sites. One example of a digestion enzyme is trypsin, which cleaves only on the C-terminal side of the amino acids arginine and lysine, unless there is a neighbouring proline on the C-terminal side of the arginine or lysine. If that is the case, no cleavage will be done by the trypsin enzyme at that particular site.
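The trypsin rule just described translates directly into code. The following is a minimal sketch (the function name is hypothetical) that cuts a one-letter-code sequence after every K or R that is not followed by P, assuming perfect cleavage at every allowed site.

```python
def tryptic_peptides(sequence):
    """Split a protein sequence (one-letter codes, N- to C-terminus)
    after each K or R, unless the next residue is P."""
    peptides, start = [], 0
    for k, residue in enumerate(sequence):
        is_last = k == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[k + 1] != "P"):
            peptides.append(sequence[start:k + 1])
            start = k + 1
    return peptides
```

For example, `tryptic_peptides("AKRPGKC")` returns `["AK", "RPGK", "C"]`: the arginine is followed by proline, so no cut is made there.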
There are, however, two circumstances that complicate the mimicking of the protein digestion in the laboratory. These circumstances also change the masses of the peptides, the digestion products, and therefore have to be taken into account:
1: Post-translational modifications: It may happen that a protein gets modified by more or less complicated chemical compounds that cannot be predicted by only studying the nucleotide base-pair sequence of its corresponding gene. One example of a post-translational modification is methionine oxidation, where the amino acid methionine acquires an extra oxygen atom. A peptide containing methionine therefore gets its mass shifted upwards by the mass of an oxygen atom. Other examples of post-translational modifications can be found in [4]. Post-translational modifications can be divided into variable and fixed modifications. A variable modification is such that it may or may not occur; in the example above it would mean that some of the methionine amino acids acquire an extra oxygen atom while others do not. A fixed modification, on the other hand, always occurs; in the example above it would mean that every methionine amino acid in each peptide acquires an extra oxygen atom. Post-translational modifications are not really part of the protein digestion process, but need to be included when computing the theoretical spectra. It is therefore natural to incorporate this circumstance at the digestion stage.
2: Missed cleavages: It may also happen that an enzyme does not cut at a site on the protein where it is allowed to cut. This non-perfect cleavage gives rise to (heavier) peptides that would not be produced if enzyme cleavage always occurred at every allowed cleavage site. When doing the in silico digestion, this circumstance needs to be mimicked and included in the algorithm.
In figure 6 is shown schematically the occurrence of post-translational modifications and missed cleavages, and the effect these circumstances may have on a mass spectrum.
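Both complications can be illustrated with a short sketch. It is an example under assumptions rather than the invention's procedure: the function names are hypothetical, the residue masses are approximate monoisotopic values, one water molecule is added per peptide to obtain the neutral peptide mass, and methionine oxidation is used as the only variable modification in the usage note.

```python
from itertools import product

# Approximate monoisotopic residue masses in Da (one-letter codes).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # Da, added once per peptide
OXYGEN = 15.99491   # Da, mass shift of one methionine oxidation


def peptide_mass(peptide):
    """Unmodified neutral peptide mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[a] for a in peptide) + WATER


def variable_mod_masses(base_mass, site_counts, mod_masses):
    """Masses for every combination of 0..n occupied sites per variable
    modification; there are prod(n + 1) combinations."""
    ranges = [range(n + 1) for n in site_counts]
    return [base_mass + sum(k * m for k, m in zip(combo, mod_masses))
            for combo in product(*ranges)]


def with_missed_cleavages(minimal, max_missed):
    """Join up to max_missed + 1 adjacent minimal peptides."""
    joined = []
    for start in range(len(minimal)):
        stop = min(len(minimal), start + max_missed + 1)
        for end in range(start + 1, stop + 1):
            joined.append("".join(minimal[start:end]))
    return joined
```

For the minimal peptides `["AK", "RPGK", "C"]`, allowing one missed cleavage adds `"AKRPGK"` and `"RPGKC"`. A peptide with two methionines and variable oxidation yields three masses from `variable_mod_masses(peptide_mass(p), [p.count("M")], [OXYGEN])`.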
The general procedure of digestion in silico is the following: In a protein database the present invention will process entry j, where $j = 1, \ldots, N_{database}$ and $N_{database}$ is the number of proteins registered in the database. Entry j contains a unique identification tag T[j], and an amino acid sequence
$$A(j): a_1(j)\, a_2(j) \ldots a_k(j) \ldots a_{K(j)}(j)$$
where $a_k(j)$ is the k:th amino acid residue, counted from the N-terminal side of the protein j. $a_k(j)$ can be any of the 20 amino acids found in proteins, and is represented by a one-letter code. (For a reference of their code-letters, chemical composition and mass, see [3].) $K(j)$ is the number of amino acids in protein j.
The in silico digestion will mimic digestion done by an enzyme Y that cleaves specifically at the sites $S[Y] = (S[Y]_1, S[Y]_2, \ldots, S[Y]_{M(Y)})$. In the trypsin example above, Y = trypsin, $M(Y) = 2$, $S[Y]_1$ = (the C-terminal side of arginine, unless there is a neighbouring proline on the C-terminal side of the arginine) and $S[Y]_2$ = (the C-terminal side of lysine, unless there is a neighbouring proline on the C-terminal side of the lysine).
The set of allowed possible post-translational modifications is $M = [FM_1, \ldots, FM_f, \ldots, FM_F;\ VM_1, \ldots, VM_v, \ldots, VM_V]$, where FM and VM denote fixed and variable modifications respectively. To every fixed and variable modification is assigned a mass, $m(FM_f)$ and $m(VM_v)$ respectively.
The in silico digestion procedure for protein T[j] is now:
I: A peptide counter is initialized to c = 1; a modification counter is set to x(c) = 1; and a mass counter corresponding to a peptide $p[j, c, x(c)]$ is initialized to $m(p[j, c, x(c)]) = 0$.
II: Counters for all members of the set M are set to $n(FM_f) = 0$ and $n(VM_v) = 0$.
III: The invention reads off, from left to right, that is from the N-terminal to the C-terminal side, the protein sequence A(j).
IV: For each amino acid that is read off in the sequence, the invention checks whether
a. the current amino acid shall acquire a post-translational modification specified by M.
- No: continue to b
- Yes, and the modification is fixed, of type $FM_f$: $n(FM_f) \to n(FM_f) + 1$, continue to b
- Yes, and the modification is variable, of type $VM_v$: $n(VM_v) \to n(VM_v) + 1$, continue to b
b. the current position coincides with one of the cleavage sites specified by S[Y], or is the C-terminal site of the full protein sequence, that is the right-most amino acid, $a_{K(j)}(j)$.
- No:
n1: $m(p[j, c, x(c)]) \to m(p[j, c, x(c)]) +$ mass(current amino acid residue)
n2: Read off next amino acid; that is go to step III.
- Yes(1), it is an S[Y] site:
y11: $m(p[j, c, x(c)]) \to m(p[j, c, x(c)]) +$ mass(current amino acid residue)
y12: Take into account fixed modifications: $m(p[j, c, x(c)]) \to m(p[j, c, x(c)]) + \sum_{f=1}^{F} [n(FM_f) \cdot m(FM_f)]$
y13: Take into account variable modifications: For every possible combination of the variable modifications that have been registered for the peptide and that gives rise to different peptide masses (there are $X(c) = \prod_{v=1}^{V} [n(VM_v) + 1]$ such combinations), $m(p[j, c, x(c)]) \to m(p[j, c, x(c)]) + \sum_{v=1}^{V} [n'_v \cdot m(VM_v)]$ for each value $n'_v$; $0 \le n'_v \le n(VM_v)$. The integer x(c) runs between 1 and X(c).
y14: Each of the X(c) values of $m(p[j, c, x(c)])$ is registered as a peptide mass derived from protein T[j]; $n(FM_f)$ and $n(VM_v)$ are set to $n(FM_f) = 0$ and $n(VM_v) = 0$ for all values of f and v; $c \to c + 1$
y15: Read off next amino acid; that is go to step III.
- Yes(2), it is the C-terminal site of the protein:
y21: Go through the steps y12 to y14
y22: The reading off of the protein sequence A(j) terminates
V: Take into account missed cleavages: Having performed in silico digestion at every allowed cleavage site at protein $T[j]$, there is a set of minimal peptides $p[j, c', x(c')]$, where $1 \le c' \le c$ and $1 \le x(c') \le X(c')$. The invention now computes and registers the masses for all possible combinations of minimal peptides with the restriction that for each member of the combination, $p[j, c', x(c')]$ say, there has to be at least one other member, $p[j, c'', x(c'')]$, such that $|c' - c''| = 1$. There are $\sum_{m=1}^{c-1} \sum_{k=1}^{c-m} \prod_{l=k}^{k+m} X(l)$ such combinations. (If there are no post-translational modifications for protein $T[j]$, the number of additional combinations due to missed cleavages reduces to $c(c-1)/2$.) Each combination has its mass registered as a peptide mass derived from protein $T[j]$.
VI: Digestion of entry T[j] terminates. There is now a list of peptide masses for this protein entry that should be matched with the list of peaks that were extracted from the experimental spectrum.
VII: Match the list of peptide masses with the list of experimental peaks, and compute a score value for protein entry T[j]. How this procedure is carried out in the present invention is described in the next section.
VIII: Next entry: $j \to j + 1$.
When generating the peptides for all proteins in the protein database by the in silico digestion, the present invention also computes the following three distributions:
1. The frequency distribution of the number of generated peptides per parent protein
2. $\rho(\mu)$: The frequency distribution of peptide masses $\mu$
3. $\rho(\mu; N)$: The frequency distribution of peptide masses $\mu$ for peptides whose parent proteins have given rise to $N$ peptides
The distributions 1 and 2 are shown in figures 7 and 8. Distribution 2 is the sum of distributions 3 over all values of $N$, keeping the value of $\mu$ fixed. In the present invention these three distributions are kept in memory and are used in the spectrum matching and validation algorithms described below.
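The digestion procedure and the first of the three distributions can be illustrated with a minimal Python sketch. It is not the implementation of the invention: it handles only trypsin, omits fixed and variable modifications, and the names `MONO`, `digest` and `peptide_count_distribution` are hypothetical. The residue masses are standard monoisotopic values, and adding one water mass per peptide is an assumption that the procedure above does not state.

```python
from collections import Counter

# Monoisotopic residue masses in Da (standard values, rounded).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # assumption: one water is added per peptide


def cleavage_bounds(seq):
    """Slice boundaries of the minimal tryptic peptides of seq.

    Trypsin cuts on the C-terminal side of K or R unless the next residue is P.
    """
    cuts = [i + 1 for i in range(len(seq) - 1)
            if seq[i] in "KR" and seq[i + 1] != "P"]
    return [0] + cuts + [len(seq)]


def digest(seq, max_missed=1):
    """Return (peptide, mass) pairs, allowing up to max_missed missed cleavages."""
    bounds = cleavage_bounds(seq)
    n_minimal = len(bounds) - 1
    peptides = []
    for start in range(n_minimal):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > n_minimal:
                break
            pep = seq[bounds[start]:bounds[end]]
            peptides.append((pep, sum(MONO[a] for a in pep) + WATER))
    return peptides


def peptide_count_distribution(database, max_missed=1):
    """Distribution 1: number of proteins that gave rise to N peptides."""
    return Counter(len(digest(seq, max_missed)) for seq in database.values())


toy_db = {"T1": "AKRPGK", "T2": "GGKAAR"}  # made-up sequences
print(digest("AKRPGK"))
print(peptide_count_distribution(toy_db))
```

In the first toy sequence the arginine is followed by a proline, so no cut is made there and only two minimal peptides result; the third peptide comes from one missed cleavage.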
The spectrum matching - computing a score
A general definition of score value in the context of the present invention
In the present invention, the general approach to protein identification by spectrum peak matching is to regard a good match as an event that is unlikely to occur by chance. The more improbable a peak match is, the better it is, and the more it should contribute to the score. Now, given
• a peak list resulting from the peak extraction and peak filtering steps above
• a protein from the database that has been digested in silico, as described above, and hence is represented by a set of peptide masses; that is, by a theoretical spectrum (where each peak has the same intensity as all other peaks)
• a set of peak matches between the peak list and the theoretical spectrum
the present invention computes the probability for the set of peak matches to occur. The score is then taken as the negative logarithm of that probability, so that a high score value reflects an unlikely event, and hence a high degree of spectrum resemblance; that is, a good match. There are, of course, different ways to do this, basically meaning different ways to define the set of peak matches and different ways of computing probabilities. The aim in the present invention is, quite naturally, to reward many matches, and at the same time take into account the propensity for large database proteins to have many matches.
Let $x \equiv (x_1, x_2, \ldots, x_l, \ldots, x_L)$ be the masses of the peaks in the peak list, and $w \equiv (w_1, w_2, \ldots, w_l, \ldots, w_L)$ the corresponding peak widths. Let $z(j) \equiv (z_1(j), z_2(j), \ldots, z_i(j), \ldots, z_{N(j)}(j))$; $j = 1, \ldots, N_{database}$, where $N_{database}$ is the number of proteins in the protein database. $z_i(j)$ is the mass of the i:th peak in the theoretical spectrum of protein $T[j]$, and $N(j)$ is the number of peptides that resulted from in silico digestion of protein $T[j]$.
A match between a peak $x_l$ in the peak list and a peptide mass $z_i(j)$ in the theoretical spectrum has, in the present invention, occurred if
$$|x_l - z_i(j)| \le \delta_l \tag{6}$$
where $\delta_l$ is a certain (user-defined) value, the match tolerance window for a match at peak $x_l$. In a general, parameter-free, approach it would be natural to choose $\delta_l = w_l$, the width of the peak $x_l$. Let the vector $\delta = (\delta_1, \ldots, \delta_l, \ldots, \delta_L)$ represent the set of tolerance windows pertaining to the peak list $x$. For each database protein $T[j]$ there is therefore an $L \times N(j)$ matrix
$$M[j] \equiv M\{z(j); (x, w); \delta\} \tag{7}$$
such that
$$M_{li}[j] = \begin{cases} 1 & \text{if } |x_l - z_i(j)| \le \delta_l \\ 0 & \text{otherwise} \end{cases}$$
For each database protein $T[j]$ the present invention calculates the probability for the set of matches between the peak list $x$ and the theoretical spectrum $z(j)$ as described by $M[j]$. The score value for $T[j]$, $\sigma_j$ (not to be confused with the signal-to-noise ratio discussed in the peak extraction section above), is then taken as the negative natural logarithm of that probability:
$$\sigma_j \equiv \sigma[z(j); (x, w); \delta] = -\ln\left(\mathrm{Prob}\{M[j]\}\right) \tag{8}$$
Eq. 8 is taken as the general definition of a score value in the context of the present invention. There are now different ways to specialize that general expression. Here two examples of such specializations will be given.
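As an illustration of the match matrix, the following Python sketch builds $M[j]$ for one theoretical spectrum. It assumes that the match criterion of eq. 6 is the absolute-difference test $|x_l - z_i(j)| \le \delta_l$; the function names and the toy masses are hypothetical.

```python
def match_matrix(x, z, delta):
    """Return the L x N matrix M with M[l][i] = 1 if |x[l] - z[i]| <= delta[l]."""
    return [[1 if abs(xl - zi) <= dl else 0 for zi in z]
            for xl, dl in zip(x, delta)]


def matched_peaks(matrix):
    """Number of experimental peaks with at least one matching peptide mass."""
    return sum(1 for row in matrix if any(row))


x = [1000.0, 1500.0]            # experimental peak masses (toy values)
delta = [0.1, 0.1]              # tolerance window per peak
z = [1000.05, 1200.0, 1499.0]   # theoretical peptide masses (toy values)

M = match_matrix(x, z, delta)
print(M, matched_peaks(M))
```

Only the first peak lies within its tolerance window of a theoretical mass here, so one of the two peaks counts as matched.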
Specialized score expression, case 1: constant peptide mass distribution
Assume now that there is a match between $x_l$ and $z_i(j)$ as described by eq. 6. The probability for that match is taken to be
$$p_l = \int_{x_l - \delta_l}^{x_l + \delta_l} \rho(\mu)\, d\mu \tag{9}$$
where $\rho(\mu)$ is the mass frequency distribution for peptides in the digested database, described in the database digestion section above. Continuing, the probability for a miss, $q_l$, is then
$$q_l = 1 - p_l \tag{10}$$
For every peak $x_l$ we use all $N(j)$ peptides from $z(j)$ to look for a match. The probability for no match by $z(j)$ at $x_l$, $Q_l$, is therefore taken to be
$$Q_l = q_l^{N(j)} \tag{11}$$
The probability for at least one match by $z(j)$ at $x_l$, $P_l$, is then
$$P_l = 1 - Q_l = 1 - (1 - p_l)^{N(j)} \tag{12}$$
We proceed to define the polynomial
$$G_1[z(j); (x, w); \delta] \equiv G_1(\zeta) = \sum_{r=0}^{L} g_1[r]\, \zeta^r \equiv \prod_{l=1}^{L} (P_l\, \zeta + Q_l) \tag{13}$$
Now, given $(x, w)$, $z(j)$, and the match tolerance $\delta = (\delta_1, \ldots, \delta_L)$, the probability to have exactly $r$ matched peaks out of $L$ is $g_1[r]$, and the corresponding score is therefore
$$\sigma_j = -\ln(g_1[r]) \tag{14}$$
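The coefficients of a product of the form $\prod_l (P_l \zeta + Q_l)$ can be obtained by multiplying in one factor at a time. The Python sketch below does this for arbitrary per-peak probabilities; the coefficient of $\zeta^r$ is then the probability of $r$ matched peaks among $L$ when the peaks are treated as independent. The function name and the probability values are hypothetical.

```python
import math


def poly_coefficients(P):
    """Coefficients g[r] of prod_l (P_l * zeta + Q_l), with Q_l = 1 - P_l."""
    g = [1.0]
    for p in P:
        q = 1.0 - p
        nxt = [0.0] * (len(g) + 1)
        for r, c in enumerate(g):
            nxt[r] += c * q       # this peak finds no match
            nxt[r + 1] += c * p   # this peak finds at least one match
        g = nxt
    return g


P = [0.02, 0.05, 0.01, 0.03]      # toy per-peak match probabilities
g1 = poly_coefficients(P)
r = 2                             # observed number of matched peaks
print(g1, -math.log(g1[r]))       # coefficients and the score of eq. 14
```

The coefficients sum to one, as they must for a probability distribution over the number of matched peaks.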
In the present case two simplifications are now made:
1. $\rho(\mu) = \rho_{const} = 1/\Delta$, where $\Delta$ is the entire mass range for the database search, as specified by the user in the filtering step, described above
2. $\delta_l = \delta$ for all $l$
This gives $p_l = 2\delta/\Delta$, which means that $P_l$ and $Q_l$ also become independent of $l$; $P_l = P = 1 - (1 - 2\delta/\Delta)^{N(j)}$ and $Q_l = Q = (1 - 2\delta/\Delta)^{N(j)}$ for all $l$. The polynomial $G_1(\zeta)$ can now be written as
$$G_1(\zeta) = (P\, \zeta + Q)^L = \sum_{r=0}^{L} \zeta^r\, \frac{L!}{r!\,(L-r)!}\, P^r\, Q^{L-r} \tag{15}$$
This simplification means that
$$g_1[r] = \frac{L!}{r!\,(L-r)!}\, P^r\, Q^{L-r} \tag{16}$$
and the score for database protein $T[j]$ is now
$$\sigma_j = -r \ln(P) - (L - r) \ln(Q) - \ln\left(\frac{L!}{r!\,(L-r)!}\right) \tag{17}$$
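The closed-form score of eq. 17 can be evaluated directly. The sketch below is a hedged illustration rather than the invention's code: the function name, argument names and numerical values are hypothetical, and it assumes $0 < P < 1$ so that both logarithms are defined.

```python
import math


def score_case1(r, L, n_peptides, delta, mass_range):
    """Score of eq. 17 for r matched peaks out of L.

    n_peptides is N(j), delta the common tolerance and mass_range the
    width Delta of the searched mass interval.
    """
    p = 2.0 * delta / mass_range             # single peak/peptide match probability
    Q = (1.0 - p) ** n_peptides              # no peptide matches a given peak
    P = 1.0 - Q                              # at least one peptide matches it
    log_binom = (math.lgamma(L + 1) - math.lgamma(r + 1)
                 - math.lgamma(L - r + 1))   # ln of L! / (r! (L - r)!)
    return -r * math.log(P) - (L - r) * math.log(Q) - log_binom


# Toy numbers: 20 peaks, 40 peptides, 0.1 Da tolerance, 800-3000 Da search range.
for r in (3, 5, 8):
    print(r, score_case1(r, L=20, n_peptides=40, delta=0.1, mass_range=2200.0))
```

With these toy numbers the score grows with the number of matched peaks, because each additional match is individually improbable.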
Specialized score expression, case 2: non-constant peptide mass distribution
Assume, once again, that there is a match between $x_l$ and $z_i(j)$ as described by eq. 6. The probability for that match is taken to be
$$p_l[N(j)] = \int_{x_l - \delta_l}^{x_l + \delta_l} \rho(\mu; N(j))\, d\mu \tag{18}$$
where $\rho(\mu; N(j))$ is the mass frequency distribution for peptides whose parent proteins have given rise to $N(j)$ peptides. In this approach, where it is desired to assume as little as possible about parameter values, $\delta_l$ is set to $\delta_l = w_l$, the width of the peak $x_l$. Define
Figure imgf000032_0002
$\lambda$ is taken to be the probability that a peptide whose parent protein has given rise to $N(j)$ peptides will find a match with one of the $L$ peaks in $x$. $\kappa$ is then the probability for no match. Define the polynomial
$$G_2[z(j); (x, w); \delta] \equiv G_2(\zeta) \equiv (\lambda\, \zeta + \kappa)^{N(j)} = \sum_{n=0}^{N(j)} \zeta^n\, g_2[n; N(j)] \tag{20}$$
where
$$g_2[n; N(j)] = \frac{N(j)!}{n!\,(N(j) - n)!}\, \lambda^n\, \kappa^{N(j) - n} \tag{21}$$
is the probability that $n$ of the $N(j)$ peptides find matches with the peaks of $x$. The score value for protein $T[j]$ in this second case is then
$$\sigma_j = -\ln(g_2[n; N(j)]) \tag{22}$$
The validation - assessing the score
It is important to realize that a scoring method for spectrum matching will always produce a candidate from the protein database with the highest score value, irrespective of whether the unknown protein is registered in the database as an entry or not. It is therefore normally not enough to calculate score values for each protein in the database; those values also need to be assessed. The present invention addresses this problem in the following way:
Let $z \equiv (z_1, z_2, \ldots, z_N)$ be the set of peptide masses that results when digesting a database protein, and let $\psi(z)$ be the probability density for the vector $z$. Now, $\psi(z)$ is, in principle, an unknown function, but we make an approximation such that $\psi(z)$ is taken to be the joint distribution of the frequency distributions for the number of digested peptides per parent protein and the peptide masses. These two distributions are readily calculated when performing the theoretical digestion of database proteins, as was discussed in the database digestion section above, and two examples are shown in figures 7 and 8. Now, given the peptide peak list $x$, the widths of the peaks $w$, the peak matching tolerance $\delta$, and $\psi(z)$, the invention computes the probability density for a score value $\sigma = \sigma_k$:
$$\Pi(\sigma_k; (x, w); \delta) = \int dz\, \psi(z)\, \delta_D(\sigma_k - \sigma[z; (x, w); \delta]) \tag{23}$$
where $\delta_D$ is the Dirac delta function and the integral is computed over the space of vectors $z$. The vectors $z$ are sampled with the distribution $\psi(z)$. The score $\sigma[z; (x, w); \delta]$ is calculated using the expression eq. 8, or one of the expressions for the specialized cases. The probability to reach, at random, a score value at least as large as $\sigma_c$ is then calculated. This probability is written as
$$P_{random} \equiv \int_{\sigma_c}^{\infty} d\sigma_k\, \Pi(\sigma_k; (x, w); \delta) = 1 - \int dz\, \psi(z)\, \Theta(\sigma_c - \sigma[z; (x, w); \delta]) \equiv 1 - \langle \Theta(\sigma_c - \sigma_{random}) \rangle \tag{24}$$
where $\Theta$ is the Heaviside step function ($\Theta(y) = 1$ if $y > 0$, $\Theta(y) = 0$ otherwise). For comparison, the corresponding probability expression $P_{database} = 1 - \langle \Theta(\sigma_c - \sigma_{database}) \rangle$ is calculated when doing the actual database search. By taking the natural logarithm, the invention therefore produces two functions
$$F_{random} \equiv F_{random}(\sigma_c) = \ln(P_{random}) = \ln(1 - \langle \Theta(\sigma_c - \sigma_{random}) \rangle) \tag{25}$$
and
$$F_{database} \equiv F_{database}(\sigma_c) = \ln(P_{database}) = \ln(1 - \langle \Theta(\sigma_c - \sigma_{database}) \rangle) \tag{26}$$
The computational flow
It is important to understand how the score values $\sigma$ and the functions $F_{database}$ and $F_{random}$ are calculated for a given specific unknown raw spectrum (represented, after peak extraction, by the peak list $x$). Below is a step-wise description of the process.
$F_{database}$:
1. Digest a protein from the database.
2. Apply the score equation eq. 8 to compute a score $\sigma_{database}$.
3. Calculate $\Theta(\sigma_c - \sigma_{database})$ (= 1 for all $\sigma_c > \sigma_{database}$, and 0 for all other values of $\sigma_c$).
4. Repeat steps 1-3 for every protein in the database.
5. Calculate the average of $\Theta(\sigma_c - \sigma_{database})$ for every value of $\sigma_c$.
6. Use eq. 26 to calculate $F_{database}$.
$F_{random}$:
1. Use the frequency distributions for the number of digested peptides per parent protein and the peptide masses to create a random spectrum $z$. Examples of such distributions are shown in figures 7 and 8.
2. Apply the score equation eq. 8 to compute a score $\sigma_{random}$.
3. Calculate $\Theta(\sigma_c - \sigma_{random})$ (= 1 for all $\sigma_c > \sigma_{random}$, and 0 for all other values of $\sigma_c$).
4. Repeat steps 1-3 many times. The number of times is chosen by the user, as described in the section on The Graphical User Interface below.
5. Calculate the average of $\Theta(\sigma_c - \sigma_{random})$ for every value of $\sigma_c$.
6. Use eq. 25 to calculate $F_{random}$.
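The two step lists share one computation: collect a set of scores, then take the logarithm of the fraction that reaches $\sigma_c$. A minimal Python sketch of this, under stated assumptions, is given below. The scoring function is passed in as a parameter, the random spectra are drawn from plain lists of observed peptide counts and peptide masses (a simplification that ignores the dependence of the mass distribution on $N$), and all names are hypothetical.

```python
import math
import random


def log_tail(scores, sigma_c):
    """ln(1 - <Theta(sigma_c - sigma)>): log of the fraction of scores >= sigma_c."""
    frac = sum(1 for s in scores if s >= sigma_c) / len(scores)
    return math.log(frac) if frac > 0 else float("-inf")


def random_spectrum(count_samples, mass_samples, rng):
    """Draw a peptide count, then that many peptide masses (step 1 for F_random)."""
    n = rng.choice(count_samples)
    return [rng.choice(mass_samples) for _ in range(n)]


def f_random(score_fn, count_samples, mass_samples, sigma_c, n_trials=1000, seed=0):
    """Monte Carlo estimate of F_random(sigma_c), steps 1-6 of the second list."""
    rng = random.Random(seed)
    scores = [score_fn(random_spectrum(count_samples, mass_samples, rng))
              for _ in range(n_trials)]
    return log_tail(scores, sigma_c)


def f_database(database_scores, sigma_c):
    """F_database(sigma_c) from the scores of all real database proteins."""
    return log_tail(database_scores, sigma_c)


# Toy usage with a stand-in score (the spectrum length), not a real score function.
print(f_random(len, [2, 4], [1000.0, 1200.0], sigma_c=3, n_trials=200))
print(f_database([1.0, 2.0, 3.0, 4.0], sigma_c=3.0))
```

When no sampled score reaches $\sigma_c$ the estimate is minus infinity, which signals that more random spectra are needed to resolve that region of the tail.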
The interpretation and application of $F_{database}$ and $F_{random}$
The two functions are shown for two real situations in figures 9 and 10. In figure 9 one notices that the top-scoring candidates from the actual database search have score values much higher than what can be expected from just random coincidence, hence leading one to believe that the top candidate is truly the unknown protein. In figure 10, however, the top-scoring candidates have score values that are not larger than what one could expect from a random database search. This indicates that one should not consider the top-scoring candidate in this case to be a true candidate for the unknown protein. Further experimental measures should therefore be taken in order to get a satisfactory protein identification.
In more detail, put $F_{random} = F_{random}(\sigma_c)$; then the random probability to get a score value at least as high as $\sigma_c$, on one try, is
$$P_{random} = \exp(F_{random})$$
The probability to get a score value less than $\sigma_c$ is then $Q_{random} = 1 - P_{random}$. The methods of the present invention allow calculation of random probabilities when making $N > 1$ tries. Consider the polynomial $H(\zeta)$:
$$H(\zeta) \equiv (P_{random}\, \zeta + Q_{random})^N = \sum_{n=0}^{N} \zeta^n\, h(n; N) \tag{27}$$
where
$$h(n; N) = \frac{N!}{n!\,(N - n)!}\, P_{random}^{\,n}\, Q_{random}^{\,N-n} \tag{28}$$
is the probability for getting a score value of at least $\sigma_c$ $n$ times out of $N$. Differentiation with respect to $\zeta$ gives
$$H'(\zeta) = N\, P_{random}\, (P_{random}\, \zeta + Q_{random})^{N-1} = \sum_{n=0}^{N} n\, \zeta^{n-1}\, h(n; N) \tag{29}$$
Putting $\zeta = 1$ in the differentiated expression, and using $P_{random} + Q_{random} = 1$, leads to
$$N\, P_{random} = \sum_{n=0}^{N} n\, h(n; N) \tag{30}$$
The right hand side is nothing but the expectation value for the number of times of getting a score value above $\sigma_c$ on $N$ tries; that is, $\langle n \rangle$. This implies
$$N\, P_{random} = \langle n \rangle \tag{31}$$
So in order to expect a score value above $\sigma_c$ in $\langle n \rangle = n_{random}(\sigma_c)$ cases out of $N$ tries, given the random probability $P_{random}$, the following condition has to be satisfied:
$$N\, P_{random} = n_{random}(\sigma_c) \tag{32}$$
The size of a random database therefore needs to be of the order
$$N = N_{random}(\sigma_c) \equiv \frac{n_{random}(\sigma_c)}{P_{random}(\sigma_c)} \tag{33}$$
in order to contain $n_{random}(\sigma_c)$ random proteins that reach a score value at least $\sigma_c$. Now, the corresponding frequency derived from the real database search is
$$P_{database}(\sigma_c) = \frac{n_{database}(\sigma_c)}{N_{database}} \tag{34}$$
where $n_{database}(\sigma_c)$ is the number of real database proteins that reached a score value at least $\sigma_c$, and $N_{database}$ is the size of the real protein database. Hence, the ratio between the size of a random database containing $n_{random}(\sigma_c)$ random proteins with score values of at least $\sigma_c$, $N_{random}(\sigma_c)$, and the size of the real database, $N_{database}$, is
$$\frac{N_{random}(\sigma_c)}{N_{database}} = \frac{n_{random}(\sigma_c)}{n_{database}(\sigma_c)} \cdot \frac{P_{database}(\sigma_c)}{P_{random}(\sigma_c)} \tag{35}$$
So, consider a run of the present invention that ends up with $n_{database}(\sigma_{top}) = n_{top}$ top candidates. The size of the random database that one would expect to contain $n_{random}(\sigma_{top}) = n_{top}$ random proteins is therefore
$$N_{random}(\sigma_{top}) = N_{database} \cdot \exp(F_{database} - F_{random}) = N_{database} \cdot \exp(Q) \tag{36}$$
where $Q \equiv F_{database} - F_{random}$. $Q$ is in the present invention called the quality measure.
Now, as an example, if the random database is required to be 1000 times larger than the real database before a protein candidate with score value $\sigma_{top}$ is considered statistically significant, the quality measure of that top-scoring protein needs to be at least $Q = \ln(1000) \approx 7$. If, on the other hand, only a factor of 20 between the random database and the real database is required, the quality measure of the top-scoring protein needs to be only $Q = \ln(20) \approx 3$.
Closely related to the quality measure is the p-value. $p = p(\sigma_c)$ is defined as the random probability of getting at least one protein with a score value of at least $\sigma_c$ given the size of the real database, $N_{database}$. This implies
$$p = 1 - (1 - P_{random})^{N_{database}} \tag{37}$$
If $P_{random} \ll 1$, which it should be for any high-scoring database protein, the following approximation can be made:
$$p = 1 - (1 - P_{random})^{N_{database}} \approx 1 - (1 - N_{database} \cdot P_{random}) = N_{database} \cdot P_{random} \tag{38}$$
Using eqs. 33 - 38 leads to
$$p = N_{database} \cdot P_{random} = n_{database}(\sigma_c) \cdot \frac{P_{random}}{P_{database}} = n_{database}(\sigma_c) \cdot \exp(-Q) \tag{39}$$
If, after a run of the present invention, the number of real database proteins that get a score value of at least $\sigma_c$ is only one, there is a simple relationship between the quality measure and the p-value:
$$p = \exp(-Q) \tag{40}$$
If, however, the relations $P_{random} \ll 1$ or $n_{database}(\sigma_c) = 1$ do not hold, the expression for the relation between the p-value and the quality measure $Q$ becomes a bit more complicated, but the tight connection between these two parameters and their relation to the sizes of real and random databases is, of course, still there.
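The relations between the random probability, the quality measure and the p-value can be checked numerically. The Python sketch below is illustrative only: the function names and numbers are hypothetical, and the p-value assumes independent tries, so that the probability of at least one random protein reaching $\sigma_c$ among $N_{database}$ tries is $1 - (1 - P_{random})^{N_{database}}$.

```python
import math


def quality_measure(f_database, f_random):
    """Q = F_database - F_random."""
    return f_database - f_random


def p_value(p_random, n_database):
    """1 - (1 - p_random) ** n_database, evaluated stably for small p_random."""
    return -math.expm1(n_database * math.log1p(-p_random))


# Toy case: one real protein out of 100000 reaches sigma_c, and a single
# random spectrum reaches it with probability 1e-9.
n_db = 100_000
p_rand = 1e-9
Q = quality_measure(math.log(1 / n_db), math.log(p_rand))

print(Q)                       # about 9.2, i.e. ln(10000)
print(p_value(p_rand, n_db))   # close to n_db * p_rand
print(math.exp(-Q))            # eq. 40, valid because n_database(sigma_c) = 1
```

The last two printed numbers agree closely, which is the content of eq. 40 in the regime where the approximation of eq. 38 holds.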
The parameters score value, quality measure and p-value are in the present invention calculated and reported for every database protein.
The Main Application
The set of algorithms described above constitutes, when written as computer code, the Main Application. The Main Application, with its algorithmic flow shown in figure 2, solves all the tasks regarding protein identification that the present invention is claimed to solve. However, in addition to the algorithm-related computer code, the present invention also contains a Graphical User Interface (GUI). The interface, which inter alia makes the present invention more user-friendly, and a user more efficient, is described in the next section.
The Graphical User Interface
General properties of the Graphical User Interface
The purpose of the Graphical User Interface (GUI) is to provide all functions and parameters relevant for utilizing the present invention. In addition, partial results from the different steps of the analysis, described above in the section "The Algorithms", should, together with the final result, be presented in a graphical and clear way on the computer screen, and registered in appropriate files. A list of the most important features is the following:
1. All results provided by the algorithms can be accessed by the GUI.
2. The GUI is designed so that it is platform independent. By this is meant that the computer code is written such that the GUI can run on any computer irrespective of the computer's operating system.
3. The invention is designed so that the Main Application can run independently of the GUI.
4. The invention can, through the GUI, be run in a stepwise manner. This means that each algorithmic step, described above in the section "The Algorithms", can be executed such that the following step in the algorithmic flow will not be executed before the user chooses to do so.
5. All steps can be run in one go.
6. Several experimental spectra (representing unknown proteins) can be run consecutively without user intervention, so-called batch jobs.
The different parts of the Graphical User Interface workspace
In figure 11 is shown the GUI workspace as it may appear on a computer screen, before a run of the invention. As illustrated, the workspace is divided into
• The menu bar
• The icon bar
• The status bar
• The analysis boxes
described in detail below. The first two and the fourth workspace areas are interactive with the user. By this is meant that when a user clicks on one of the items in those areas, the user can either select input for the algorithms or start a process.
The menu bar
The menu bar, at the top of the workspace, see figure 11, consists of five menus with the following features:
File: The File menu, see figure 12, contains commands for opening, closing, and saving the workspace setup:
• New: Clears the current workspace setup and restores all parameter values to their default values.
• Open: Opens a previously saved workspace.
• Save: Saves the current workspace.
• Save as: Works just like "Save" with the exception that the user is always asked for a file name.
• Exit: Quits the session.
Edit: The Edit menu contains one item, Boxes, that gives the user access to all the option menus for the analysis boxes in the workspace. It is shown in figure 13.
View: The View menu is used to specify the items a user wants to see on the desktop. It has only one option: Hide icon toolbar, which specifies whether the icons at the top of the workspace should be shown or not.
Preferences: The Preferences menu is divided into
• Get default dir: Resets the working directory to the directory used at startup.
• Save default dir: Saves the current data directory as default directory.
• Server: Here a user specifies on what server the binary files of the present invention are located.
• Directories: Here a user specifies the directory where spectrum data files are located.
Actions: The Actions menu, see figure 14, has the following items:
• Run step: Runs only the high-lighted analysis step. By "high-lighted" is meant that the title bar of the corresponding analysis box is coloured.
• Run batch: Runs the entire analysis on every spectrum selected by the user.
• Run spectrum: Runs the entire analysis on a single spectrum selected by the user.
• Halt process: Halts a batch run.
The icon bar
The icon bar, see figure 11, consists of a set of often-used icons that correspond to features and functions controlled in the menu bar, described above.
The status bar
The status bar is located at the bottom of the workspace, see figure 11. It contains updated information about what is being currently processed by the present invention.
The analysis boxes
A short description of the analysis boxes is the following table:
Table
Part (box) name | Role/function/algorithm | Input from | Output to
MSFiles | Selects those files containing the raw spectra to be analysed. | the user | Pepex/the screen
Pepex | Extracts mono-isotopic peaks from the raw spectrum. | MSFiles/the user | Pepfil/the screen/file
Pepfil | Filters peaks from the peak list created by Pepex. Possibility for recalibration of a spectrum. | Pepex/the user | Matcher/the screen/file
Matcher | Digests the proteins in the protein database according to biological and chemical rules specified by the user. Matches peaks found in the raw spectrum with peaks digested in silico from the protein database. Calculates score values and assesses the result statistically. | Pepfil/the user | the screen/file
A more detailed description is given below.
The MSFiles box
When clicking on the box MSFiles, or by clicking in the MSFiles field in the Edit menu, a window appears, see figure 15. There the user can indicate the file (containing data for a raw experimental spectrum) or set of files to be analysed. By clicking on "Add Files" a browser will appear and a user can choose to run one spectrum or many spectra for a batch job. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step, and move directly to the peak filtering step.
The Pepex box
When clicking on the box Pepex another dialogue window appears. There a user can choose which signal-to-noise cut-off value (σ in the section "The peak extraction") to use for selection of peaks.
The Pepfil box
When clicking on the box Pepfil a dialogue window appears, an example of which is shown in figure 16. The left side of the window contains a list of the filters that are supported in the present invention. The right side contains the filters that are currently active. By writing in the Parameter fields, the user can choose values that correspond to the active filters. The filters that are supported, which were described above in the section "The Algorithms" , are the following:
1. echoes to consider
2. intensity cut
3. lower & upper mass
4. maldi
5. peaks to exclude
6. peaks to keep
7. width cut
In addition to this there is a set of buttons that controls which filters to be used:
• By marking a filter on the left side and then clicking on the > button, the user activates that particular filter; confirmed by it now appearing on the right side.
• Deactivating a filter is done by marking a filter on the right side, followed by clicking on the < button.
• Clicking on the trash bin button deactivates all the filters on the right side.
• Clicking on the folder button enables the user to activate a user-defined set of filters. This is done either by writing the name of the filter file in a designated field, or by using a browser for navigating in the file system. Note that a user-defined set of filters can only contain the filters in the list.
• The floppy disk button enables the user to save the current filter setup to a file.
There is also a calibration function that recalibrates the raw spectrum if the user so wishes. The recalibration is based on a set of peaks that the user selects. Such peaks often have well-established mass values, for example peptide fragments from the digestive enzyme that has been used.
The Matcher box
When clicking on the box Matcher a dialogue window appears, see figure 17. This window contains a set of option fields.
On the upper half the user chooses which species the proteins in the database search should belong to. Multiple choices of species can be made. Activating and deactivating works in the same way as in the Pepfil dialogue window.
In the middle left part of the dialogue window (following figure 17) the following choices are made:
• protein database
• digestion enzyme
• number of allowed missed cleavages
In the middle right part of the dialogue window (following figure 17) the user can specify expected post-translational modifications. There is a pre-defined set of those, but an editing function allows a user to create his/her own modifications, see figure 18.
In the lower part of the Matcher dialogue window there is a set of additional settings:
• Report, top: This is the number of the top-scoring proteins to be reported and presented in the result list.
• Validation database size: This is the number of random proteins that is used when calculating the function $F_{random}$ defined above in the section "The Algorithms".
• Mass tolerance (Da) & Mass tolerance (ppm): Two peaks are considered to match if their mutual distance is less than the chosen mass tolerance value. This tolerance value can be measured either in absolute m/z value, expressed in Dalton (Da), or relative to the peak mass, expressed in parts per million (ppm).
• Select masslist option: M+H or Mr: Is set to "M+H" if the peaks in the list that has been passed through the filter step have not had the mass of an H-atom subtracted. Otherwise this option should be set to Mr.
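The two tolerance units and the masslist option can be illustrated with a short Python sketch. The function names are hypothetical, the ppm tolerance is taken relative to the experimental peak mass as described above, and the hydrogen mass constant is an assumed standard monoisotopic value rather than a number given in this text.

```python
H_MASS = 1.00783  # assumption: monoisotopic mass of a hydrogen atom, in Da


def tolerance_in_da(peak_mass, tolerance, unit):
    """Convert a tolerance given in Da or ppm to an absolute window in Da."""
    if unit == "Da":
        return tolerance
    if unit == "ppm":
        return peak_mass * tolerance * 1e-6
    raise ValueError(f"unknown tolerance unit: {unit}")


def peaks_match(peak_mass, peptide_mass, tolerance, unit="Da"):
    """True if the two masses are closer than the chosen tolerance."""
    return abs(peak_mass - peptide_mass) < tolerance_in_da(peak_mass, tolerance, unit)


def to_mr(mass, masslist_option):
    """Return the mass with the H-atom mass removed when the list is in M+H form."""
    if masslist_option == "M+H":
        return mass - H_MASS
    if masslist_option == "Mr":
        return mass
    raise ValueError(f"unknown masslist option: {masslist_option}")


print(tolerance_in_da(2000.0, 50.0, "ppm"))       # 50 ppm at 2000 Da, about 0.1 Da
print(peaks_match(2000.0, 2000.08, 50.0, "ppm"))
print(to_mr(1001.00783, "M+H"))
```

A relative tolerance widens the window for heavier peptides, which is why the same 50 ppm setting corresponds to about 0.05 Da at 1000 Da and about 0.1 Da at 2000 Da.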
The output of results through the Graphical User Interface
The GUI output from a run of the present invention can be divided into
• Spectrum graphs
• A Results window that contains
— a list of top-scoring protein candidates
— a list of extracted peaks
• Detailed information about each top-scoring protein candidate
Spectrum graphs
In figure 19 is shown the GUI workspace after a run of the present invention. On the right side three spectrum graphs are presented. These graphs are related to, from top to bottom, the experimental raw spectrum, the extracted peptide peaks, and the peptide peaks left after filtering. Here a user can visually study the extracted peaks and relate them to the experimental spectrum. A zooming function enables detailed study over the whole m/z-range.
The Results window
In the Results window a user gets information about the top-scoring protein candidates as well as about the peaks that were used in the database search.
Protein candidates
In figure 20 is illustrated a list of the top-scoring protein candidates. The proteins are listed in descending order of spectrum resemblance in comparison with the filtered peak list extracted from the experimental spectrum. The list contains a number of rows (as many as chosen by the user in the Matcher box) where each row holds search results for one database protein. In the present illustration there are five columns for each row:
quality flag | protein id | score value | p-value | quality measure
• quality flag: The flag is a small rectangle whose colour is dependent on the quality measure of the fifth column and chosen cut-off values. In the present illustration it is implemented such that a quality value above 7 gives a green flag, a quality value between 3 and 7 gives a yellow flag, and below 3 gives a red flag. The statistical significance of the quality measure was discussed in the "Algorithms" section above.
• protein id: In this column is reported the name and database identification tag of the database protein. Denoted "Protein id" in the present illustration.
• score value: Described above in the "Algorithms" section. It is denoted "Score" in the present illustration.
• p-value: Described above in the "Algorithms" section. It is denoted "Probability" in the present illustration.
• quality measure: The parameter Q described in the "Algorithms" section. It is denoted "Qlty" in the present illustration.
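The colour rule for the quality flag amounts to two cut-off comparisons. The Python sketch below uses the cut-off values 7 and 3 of the present illustration as defaults; the function name is hypothetical, and assigning the boundary values 3 and 7 themselves to yellow is an assumption, since the text does not say which side they fall on.

```python
def quality_flag(q, green_cut=7.0, yellow_cut=3.0):
    """Map a quality measure Q to the flag colour shown in the result list."""
    if q > green_cut:
        return "green"
    if q >= yellow_cut:   # assumption: the boundary values count as yellow
        return "yellow"
    return "red"


for q in (9.2, 5.0, 1.4):
    print(q, quality_flag(q))
```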
By clicking on a protein id, a window will appear. This window contains detailed information about that particular protein, and is described below.
Peaks
In figure 21 is illustrated a list of the peaks that were used in the database search. In the present illustration the upper field contains only the m/z values for the peaks. By clicking on a "Copy" button a user can copy and then paste the set of m/z values into any desired application. There is also a list of rows where more detailed information about every selected peak is given. In the present illustration there are six columns for each row; that is, for each peak V that is used in the database search:
m/z | intensity | width | signal-to-noise ratio | quality | delete option
• m/z: The mass-to-charge ratio of the peak. It is calculated as $x_c(V)$, described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Mass" in the present illustration.
• intensity: The absolute intensity value of the peak. It is calculated as y(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Intensity" in the present illustration.
• width: The width of the peak. It is calculated as w(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Width" in the present illustration.
• signal-to-noise ratio: The signal-to-noise ratio of the peak. It is calculated as σ(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "S/N" in the present illustration.
• quality: A parameter whose value depends on the filters chosen by a user. In the present illustration (where it is not activated) it is denoted by "Quality".
• delete option: In this column, denoted "Deleted" in the present illustration, a user can choose to delete peaks that he/she does not want to be used in the database search. (A user can also add peaks, if so desired, by using the "Add" feature in this window.) By running "Run step" the database search will be redone with the new set of peaks.
Detailed protein information
As mentioned above a window with detailed information about a protein will appear when a user clicks in the protein id column of that particular protein. In figure 22 is shown an example of such a window, where a user, among other things, can find information about
• number of experimental peaks that found a match
• number of database peptides that found a match
• the corresponding sequence coverage
• the amino acid sequence of matched database peptides and their post-translational modifications
• statistics on peak-matching errors
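The text does not define how the sequence coverage is computed. A minimal sketch, assuming the common definition (the fraction of residues in the protein sequence covered by at least one matched peptide), could look as follows; the function name and the interval convention are hypothetical.

```python
from typing import Iterable, Tuple


def sequence_coverage(protein_length: int,
                      matched: Iterable[Tuple[int, int]]) -> float:
    """Fraction of residues covered by at least one matched peptide.

    Each matched peptide is given as (start, end) with 0-based start and
    exclusive end; overlapping peptides are counted only once.
    """
    if protein_length <= 0:
        return 0.0
    covered = [False] * protein_length
    for start, end in matched:
        for pos in range(max(start, 0), min(end, protein_length)):
            covered[pos] = True
    return sum(covered) / protein_length


# Three matched peptides, two of which overlap: 20 + 10 residues out of 100.
print(sequence_coverage(100, [(0, 10), (5, 20), (50, 60)]))
```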
The Computer Implementation
The actual implementation of the present invention can be summarized as follows:
• Computer code: Object-oriented design for all parts of the code.
• Hardware requirements: PCs (that is, personal computers) or workstations.
• Platform types:
1. A client-server solution in which the binary files executing the algorithms are run on a server and the GUI is run on one or many clients.
2. A stand-alone version, requiring only a single computer.
• The protein database: Should be formatted in a so-called FASTA format and be stored on the server if a client-server solution is the user's choice. If a stand-alone version is used, the formatted protein database is stored on the computer at hand.
• Internal communication: HTTP requests through which XML files are sent.
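As an illustration of the FASTA requirement mentioned above, the sketch below reads FASTA-formatted text into a mapping from protein identifier to sequence. It is a simplified reader written for this illustration, not the loader used by the described system; the function name and the choice to use the first word of the header line as identifier are assumptions.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text: '>' starts a header, following lines hold the sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""  # assumption: first word is the id
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = [">P1 first protein", "MKT", "AYI", "", ">P2", "GGR"]
print(parse_fasta(example))
```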
Other Methods
In this section we briefly describe two other methods that use mass spectrometry data for protein identification. Note that neither of these methods contains the peak extraction and peak filtering steps. This implies that they address only the database search. The same notation as in the section "The Algorithms" will be used.
Method 1
This approach is described in [5]; a method that has been implemented into a commercial software product, described in [6]. As has been described above, the algorithms of the present invention consist of five steps: peak extraction, peak filtering, digestion of the protein database, spectrum matching, and validation. The method of [6] includes the three latter steps. Here we will only discuss the spectrum matching part since database digestion is straightforward and non-controversial, and since the validation step of the method of [6] has, at least to our knowledge, not been publicly disclosed.
A rough outline of the method is the following. In this particular method, the whole chosen database is digested. Then, for every peptide mass, the two numbers $i$ and $j$ are calculated, such that

$$i \cdot 100\,\mathrm{Da} \approx \text{peptide mass}, \qquad j \cdot 10\,\mathrm{kDa} \approx \text{parent protein mass}$$

The matrix $f_{ij}$ is now created. It is constructed such that $f_{ij} = 0$ for all $i$ and $j$, initially. Then, for every peptide the numbers $i$ and $j$ are calculated and $f_{ij} \to f_{ij} + 1$. When all peptides have been processed, the matrix $f_{ij}$ is normalized to $m_{ij} = f_{ij} / \max_i[f_{ij}]$, where $\max_i[f_{ij}]$ is the largest $f_{ij}$ value in column $j$. The score value for an entry in the protein database is now defined as

$$\text{Score} \equiv \frac{50000}{M_{\text{protein}} \cdot \prod m_{ij}} \qquad (41)$$

where $M_{\text{protein}}$ is the mass of the protein measured in Dalton, and the product runs over those matrix elements which correspond to the matches between the peaks from the experimental spectrum and the peaks in the theoretical spectrum for the protein database entry. The score defined in eq. 41 takes into account the matches, but does not account for the peaks that do not find a match. It is also difficult to see how the normalization $M_{\text{protein}}/50000$ is argued for, except that it takes into account the general idea that large database proteins should have reduced scores. In conclusion, this scoring scheme is certainly a valuable step towards a sound and reliable algorithm for protein identification, but it does not use all the information that is available from the experimental spectrum as well as from the digested protein database.
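The construction of the frequency matrix and the score of method 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of [5] or [6]: the score is taken to have the form 50000 / (M_protein · ∏ m_ij), bin indices are obtained by flooring (the text only says "approximately"), and all function names and example masses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PEPTIDE_BIN_DA = 100.0      # i * 100 Da ~ peptide mass
PROTEIN_BIN_DA = 10_000.0   # j * 10 kDa ~ parent protein mass


def bin_indices(peptide_mass: float, protein_mass: float) -> Tuple[int, int]:
    """Return (i, j); flooring is an assumption made for this sketch."""
    return int(peptide_mass // PEPTIDE_BIN_DA), int(protein_mass // PROTEIN_BIN_DA)


def build_normalized_matrix(
        peptides: Iterable[Tuple[float, float]]) -> Dict[Tuple[int, int], float]:
    """Count peptides per (i, j) cell, then divide each cell by its column maximum."""
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for peptide_mass, protein_mass in peptides:
        counts[bin_indices(peptide_mass, protein_mass)] += 1
    column_max: Dict[int, int] = defaultdict(int)
    for (_, j), count in counts.items():
        column_max[j] = max(column_max[j], count)
    return {(i, j): count / column_max[j] for (i, j), count in counts.items()}


def score(protein_mass: float,
          matched_peptide_masses: Iterable[float],
          m: Dict[Tuple[int, int], float]) -> float:
    """Score of one database entry; the product runs over matched peptides only.

    Matched peptides are assumed to come from the digested database, so their
    cells exist in m.
    """
    product = 1.0
    for peptide_mass in matched_peptide_masses:
        product *= m[bin_indices(peptide_mass, protein_mass)]
    return 50000.0 / (protein_mass * product)


# (peptide mass, parent protein mass) pairs from a toy digested database.
digest = [(850.0, 25000.0), (870.0, 25000.0), (1230.0, 25000.0), (1510.0, 41000.0)]
m = build_normalized_matrix(digest)
print(score(25000.0, [850.0, 1230.0], m))
```

In this toy example the rarer peptide mass bin receives the smaller normalized frequency, so matching it raises the score more than matching a common bin. Unmatched peaks do not enter the computation at all, which is the limitation noted above.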
Method 2
Another approach, also implemented into a commercial software product, is described in [7] and [8], and is outlined here. For this method we will discuss the spectrum matching and validation steps.
For each theoretical spectrum z(j), representing database protein T[j], the spectrum resemblance to the peak list x is computed. This resemblance is based on how well peaks from z(j) and x match each other. The criterion for a match between two peaks is such that their mutual distance has to be less than e. The spectrum resemblance is now based on a score that is written as
[Equation image imgf000051_0001: definition of the score s[j]]
where x_k and z_k(j) are the masses of the k:th peak match between the peak list x and the theoretical spectrum z(j) respectively, N(j) is the number of peptides that resulted from in silico digestion of protein T[j], r is the number of matches, Δ is the mass range over which the search is done, and w_k is the width of the k:th matching peak from x.
This score is claimed by the authors of [7] to be the so-called Bayesian probability for protein T[j] to be the unknown protein. The weakness pertaining to that claim is that the score is normalized in such a way that it is assumed that the unknown protein can be found in the database that is being searched. It therefore has no bearing on unknown proteins not registered in the database. In the actual software application of this method, the inventors realise this problem, and therefore use the so-called Z-value for validation. The Z-value for database protein T[j] is defined as Z[j] = (s[j] − ⟨s⟩)/ω, where the score average ⟨s⟩ and the score standard deviation ω are taken over all database proteins. One sees that in the equation for the score s[j], matches (r) as well as misses (N(j) − r) are accounted for. Acknowledging the weakness of the score as a probabilistic measure of protein identification, and subsequently introducing the Z-value as a measure of validation, therefore improves the method. One notes, however, that method 2 does not incorporate the possibility to promote matches with high m/z values; all matches carry equal weight.
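The validation quantity of method 2 standardizes each score against the scores of all database proteins. A minimal sketch follows, assuming the population standard deviation is meant (the text does not say whether a population or sample estimate is used); the function name and the example scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def z_values(scores: List[float]) -> List[float]:
    """Return (s[j] - <s>) / omega for every database protein j."""
    average = mean(scores)
    omega = pstdev(scores)  # assumption: population standard deviation
    if omega == 0:
        return [0.0 for _ in scores]  # all scores equal: no entry stands out
    return [(s - average) / omega for s in scores]


# Illustrative scores for eight database proteins.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_values(scores))
```

A large positive value indicates a score that stands out from the bulk of the database, which is why this quantity can serve for validation even when the score itself is not a reliable probability.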
Comparison
When the present invention is compared to methods 1 and 2, one concludes that the differences in favour of the present invention are, in brief, the following. The present invention
• contains a scoring method based on the probability for matches and misses between the experimental spectrum and a theoretical spectrum.
• can in a natural way take into account and promote matches with high m/z values.
• contains a validation method that, based on the general properties of digested proteins, calculates what score values can probabilistically be expected on a random basis.
• is designed with a streamlined fully automatic flow, supported by a Graphical User Interface, starting with the peak extraction from the experimental spectrum and ending in protein identification.
Applications
Three applications of the present invention will be presented in this section.
Application 1 - peak extraction
Peak extraction not continued by the identification steps is valuable in many circumstances. One such case is when a user is only interested in visually inspecting spectrum differences; for example, comparing a protein spectrum from a healthy cell sample with a spectrum from a diseased cell sample. It is, however, also valuable if a user does want to perform protein identification using, in parallel with the present invention, some other protein identification software. This can easily be done using the "Copy" function available for the peak list in the GUI Results window, described above. Running different methods in parallel, in order to raise the statistical significance, is not uncommon when doing protein identification.
Application 2 - processing a single experimental spectrum
The example spectrum file is called spectrum.txt. It is loaded by choosing the box MSFiles under the Edit menu, clicking on the "Add Files" button, and then using the browser to select the spectrum file.
When the user selects "Run spectrum" in the "Actions" menu the present invention will process the spectrum, with all user options set to their default values. The search result is shown in figure 23. As can be seen the best scoring protein candidate gets a score value ("Score") of 14.52, a p-value ("Probability") of 0.15 and a quality value ("Quality") of 3.89. In the "Algorithms" section above it was shown that the quality value Q is directly related to the size needed of a random database in order to expect a certain score value. That size is in the present case exp(3.89) ≈ 50 times the size of the actual random database used for the search, which is perhaps not a very convincing number. The colour-coded flags in the left column also help a user to quickly assess the statistical significance of the search result.
By actually taking into account knowledge about the experimental setup, a user can choose to change the values of the option parameters from their default values. In the present case the user uses the filter option to remove known peaks from the digestive enzyme, and takes into account the possibility of missed cleavages and known post-translational modifications. With these new parameter values the present invention is rerun, and the result is shown in figure 24. As can be seen there is a new top-scoring candidate. It gets a score value ("Score") of 16.42, a p-value ("Probability") ≈ 0, and a quality value Q = 12.1. The needed size of a random database for expecting the reported score value is in the present case exp(12.1) ≈ 180000 times the size of the actual random database used for the search, a very convincing number leading the user to be certain that the top candidate is also the correct protein. As an aid, a quick glance by the user at the colour-coding of the flags in the left column gives strong support for the top-scoring candidate actually being the unknown protein. Additional support is also given by the following three candidates, all of the same type as the top candidate.
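The two database-size factors quoted in this application follow directly from the reported quality values, taking the factor to be exp(Q) as in the worked numbers above. A small sketch to reproduce the arithmetic; the function name is hypothetical.

```python
import math


def random_database_factor(quality: float) -> float:
    """Factor by which the random database would have to grow, taken as exp(Q)."""
    return math.exp(quality)


for label, q in (("default options", 3.89), ("adjusted options", 12.1)):
    print(f"{label}: Q = {q} -> factor of about {random_database_factor(q):,.0f}")
```

Because the factor grows exponentially with Q, the increase from 3.89 to 12.1 changes the factor by more than three orders of magnitude, which is why the second search result is far more convincing than the first.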
Application 3 - enabling high-throughput protein identification
The present invention allows for running many experimental spectra in one go; so-called batch jobs. This is extremely valuable, and basically the only method feasible when a user wants to perform high-throughput screening of many spectra in a fast and automated fashion. A user selects all the desired spectrum files in the MSFiles box, and selects "Run batch" in the "Actions" menu. After a batch run a user clicks anywhere on the workspace, and a list of the processed spectra appears, as shown in figure 25. The result for each individual spectrum can then be studied in the same way as after processing a single spectrum, with access to spectrum graphs, lists of top-scoring protein candidates etc.
References
[1] Bairoch and Apweiler: Nucleic Acids Res. 1996, 24, 21-25
[2] Bleasby and Wootton: Protein Engineer. 1990, 3, 153-159
[3] Alberts et al.: Molecular Biology of the Cell, Garland, New York, 1994
[4] Graves et al.: Co- and Posttranslational Modifications of Proteins: Chemical Principles and Biological Properties, Oxford, New York, 1994
[5] Pappin et al.: Current Biology 1993, 3, 327-332
[6] Perkins et al.: Electrophoresis 1999, 20, 3551-3567
[7] Zhang and Chait: Anal. Chem. 2000, 72, 2482-2489
[8] WO 00/73787

Claims

1. A method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said method comprising based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks; - selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
2. A method according to claim 1, wherein the probability for the set of peak matches to occur reflects the probability to have a predetermined number (r) of matches.
3. A method according to claim 1 or 2, wherein the probability for the set of peak matches to occur being determined so that the probability is rewarded by many matches, and at the same time takes into account the propensity for large proteins having many matches.
4. A method according to any of the preceding claims, wherein the candidate protein(s) is represented by a first set of peptide masses being a theoretical spectrum wherein each peak has the same intensity as all other peaks.
5. A method according to any of the preceding claims, wherein the extracting of noise-free mono-isotopic peptide peak comprising determining the intensity level of the spectrum where the signal-to-noise ratio is unity, such as substantial unity, preferably by use of equation 1 disclosed herein; determining the intensity level of the spectrum where the signal-to-noise ratio is zero, such as substantial zero, preferably by use of equation 2 disclosed herein; and locating peak candidates by locating all parts, such as substantial all parts, of the spectrum over noise and extract, preferably by copying, the ranges to form peak candidates.
6. A method according to claim 5, wherein the extracting of noise-free mono-isotopic peptide peak further comprising determining the peak entities mass/electric charge, intensity, width and signal to noise ratio for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
7. A method according to any of the preceding claims, wherein the extracting of noise-free mono-isotopic peptide peak comprising determining a baseline of the spectrum and smoothening the baseline; determining a noise level and smoothening the noise level; and locating peak candidates by locating all parts, such as substantial all parts, of the spectrum over noise and extract, preferably by copying, the ranges to form peak candidates.
8. A method according to claim 5 or 7, wherein the extracting of noise-free mono-isotopic peptide peak further comprising fitting function parameters for the peak candidates - bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.
9. A method according to any of claims 5-8, allowing peak extraction of peptide peaks representing peptides having any value of electrical charge.
10. A method according to any of the preceding claims, said method further comprising the step of filtering the list of selected peaks.
11. A method according to claim 10, wherein the filtering comprises discarding some peak based on input from a user of the method, said input may claim one or more specific peak to be discarded.
12. A method according to claim 10 or 11, wherein the list is filtered based on one or more of the strategies: echoes to consider, intensity cut, low and mass, maldi, peaks to exclude, peaks to keep, width cut.
13. A method according to any of the preceding claims, wherein a peak match is defined as a situation where the distance between the two peak masses is less than a predefined match value according to equation 6 disclosed herein.
14. A method according to claim 13, wherein the predefined match value is user defined.
15. A method according to any of the preceding claims, further comprising the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being determined as the negative logarithm of the probability for the set of peak matches to occur.
16. A method according to claim 15, wherein a score value is determined for all proteins in the data base.
17. A method according to claim 16, further comprising the step of determining the probability for getting a score value equal to or above a predefined score (σ).
18. A method according to any of the claims 15-17, wherein the step of determining the probability is based on the list of peptide peaks, the predefined match value and the probability density for the list of peptide peaks.
19. A method according to any of the claims 15-18, wherein the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question.
20. A method according to any of the claims 15-18, wherein the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach from the data base a score value at least as large as the score value in question.
21. A method according to claim 19 or 20, wherein the probability for getting a score value equal to or above a predefined score (σ) is calculated by equation 24 disclosed herein.
22. A method according to any of the preceding claims, wherein the database is storing a list of peptide masses and the corresponding parent proteins.
23. A method according to claim 22, wherein the database results from a digestion of proteins.
24. A method according to claim 23, wherein the digestion has been performed by a method according to claim 25.
25. A method for in silico digesting proteins, comprising establishing a plurality of protein sequences, checking, for each amino acid in the sequences, whether the amino acid acquires a posttranslational modification and if so modifying the amino acid, and whether the current position coincides with a cleavage sites pre-specified or is the current position right-most amino acid and if so modify the acid accordingly, and compute and register the masses for all possible combinations of minimal peptide masses.
26. A computer system for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said computer system comprising means for extracting noise-free mono-isotopic peptide peaks from a numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks, the extracting being based on a numerical representation of the mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining, such as computing means, the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
27. A computer system according to claim 26, further comprising means for performing some or all of the steps of the method according to any of the claims 1-24.
28. A graphical user interface for guiding a user through a protein identification process, said interface comprising a number of module fields each representing one or more module adapted to perform one or more of the steps according to the method according to any of the claims 1-24, said fields are graphically arranged and graphically linked so as to reflect a predefined executing order of the modules and said graphical user interface is adapted to in response to input to the computer system to initiating executing of a module and to change the appearance of the field corresponding to the module being executed.
29. A graphical user interface according to claim 28, wherein said input is provided by a pointing device, such as a computer mouse, and a thereto associated button press.
30. A graphical user interface according to claim 28 or 29, wherein the one or more of said fields are changing appearance during execution of their corresponding module(s).
31. A graphical user interface according to any of the claims 26-28, further comprising one or more result fields appearing after results have been and/or during results are being generated by one or more of said modules and displaying said results.
32. A graphical user interface according to any of the claims 28-31, further comprising input fields appearing when a user input is required.
33. A graphical user interface according to any of the claims 28-31, further comprising one or more dialog windows through which user input may be inputted and/or edited.
34. A graphical user interface according to claim 31, wherein one or more of the one or more dialog windows allow a user to edit values stored in a data base and wherein said one or more dialog windows preferably are accessible via button(s) appearing on the interface.
35. A graphical user interface according to any of the claims 28-33, further comprising a tool bar via which actions can be executed by pushing buttons appearing on said tool bar, said tool bar preferably further comprising curtains, wherein each curtain represents a category of action and wherein each curtain comprises buttons for actions belonging to a particular category.
36. A graphical user interface according to any of the claims 26-33, further comprising a set of windows that communicate the results of the protein identification process.
PCT/IB2002/004839 2001-11-01 2002-11-01 A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins WO2003038728A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002347462A AU2002347462A1 (en) 2001-11-01 2002-11-01 A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DKPA200101616 2001-11-01
DKPA200101616 2001-11-01

Publications (2)

Publication Number Publication Date
WO2003038728A2 true WO2003038728A2 (en) 2003-05-08
WO2003038728A3 WO2003038728A3 (en) 2003-11-06

Family

ID=8160803

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2002/004839 WO2003038728A2 (en) 2001-11-01 2002-11-01 A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins

Country Status (2)

Country Link
AU (1) AU2002347462A1 (en)
WO (1) WO2003038728A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2410608A (en) * 2004-02-02 2005-08-03 Agilent Technologies Inc System and methods for mass spectrometry analysis and dynamic library searching
WO2006133568A1 (en) * 2005-06-16 2006-12-21 Caprion Pharmaceuticals Inc. Virtual mass spectrometry
WO2008007821A1 (en) * 2006-07-12 2008-01-17 Korea Basic Science Institute A method for reconstructing protein database and a method for identifying proteins by using the same method
US7894650B2 (en) * 2005-11-10 2011-02-22 Microsoft Corporation Discover biological features using composite images
CN115129836A (en) * 2022-06-08 2022-09-30 阿里巴巴(中国)有限公司 Dialogue data processing method, device, device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103389335A (en) * 2012-05-11 2013-11-13 中国科学院大连化学物理研究所 Analysis device and method for identifying biomacromolecules

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5538897A (en) * 1994-03-14 1996-07-23 University Of Washington Use of mass spectrometry fragmentation patterns of peptides to identify amino acid sequences in databases
WO2000073787A1 (en) * 1999-05-27 2000-12-07 Rockefeller University An expert system for protein identification using mass spectrometric information combined with database searching
DE60127813T2 (en) * 2000-02-07 2007-12-27 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. METHOD FOR IDENTIFYING AND / OR CHARACTERIZING A (POLY) PEPTIDE
AU2001286059A1 (en) * 2000-09-08 2002-03-22 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2410608A (en) * 2004-02-02 2005-08-03 Agilent Technologies Inc System and methods for mass spectrometry analysis and dynamic library searching
GB2410608B (en) * 2004-02-02 2010-07-21 Agilent Technologies Inc Dynamic library searching
WO2006133568A1 (en) * 2005-06-16 2006-12-21 Caprion Pharmaceuticals Inc. Virtual mass spectrometry
US7894650B2 (en) * 2005-11-10 2011-02-22 Microsoft Corporation Discover biological features using composite images
US8275185B2 (en) 2005-11-10 2012-09-25 Microsoft Corporation Discover biological features using composite images
WO2008007821A1 (en) * 2006-07-12 2008-01-17 Korea Basic Science Institute A method for reconstructing protein database and a method for identifying proteins by using the same method
US8296300B2 (en) 2006-07-12 2012-10-23 Korea Basic Science Institute Method for reconstructing protein database and a method for screening proteins by using the same method
CN115129836A (en) * 2022-06-08 2022-09-30 阿里巴巴(中国)有限公司 Dialogue data processing method, device, device and storage medium

Also Published As

Publication number Publication date
WO2003038728A3 (en) 2003-11-06
AU2002347462A1 (en) 2003-05-12

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 EPC (EPO FORM 1205A OF 240804)

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP