WO2003038728A2

WO2003038728A2 - A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins

Info

Publication number: WO2003038728A2
Application number: PCT/IB2002/004839
Authority: WO
Inventors: Jari HÄKKINEN; Thorsteinn RÖGNVALDSSON; Jim Samuelsson
Original assignee: Biobridge Computing Ab
Priority date: 2001-11-01
Filing date: 2002-11-01
Publication date: 2003-05-08
Also published as: WO2003038728A3; AU2002347462A1

Abstract

The present invention relates to a computer system and a method for selecting one or more candidate proteins from a plurality of proteins stored in a database. The invention relates in particular to a method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins. The method comprises the steps of: extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass/electric charge and intensity, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates. The present invention also relates to a graphical user interface rendering utilisation of the method easy.

Description

A COMPUTER SYSTEM AND METHOD USING MASS SPECTROMETRY DATA AND A PROTEIN DATABASE FOR IDENTIFYING UNKNOWN PROTEINS.

The present invention relates to a computer system and a method for selecting one or more candidate proteins from a plurality of proteins stored in a database.

Today, extensive research is carried out in order to determine the effects a protein may have in a living organism. While the effect may be established for a protein it is often very difficult to determine the protein - or proteins - being the cause of the effect, that is identification of a specific protein is today not a trivial job.

Different attempts have been used in the past in order to solve the identification problem. In general, the known methods are based on semi-computerised comparisons between numerical representations of peptide peaks of known proteins and peptide peaks of the unknown protein.

Common to all the known methods is that they rely heavily on a human factor, in the sense that a person often judges a match - or no match - between a known protein and the unknown protein based on a graphical representation of the peaks of known proteins, selected by a closeness-of-fit algorithm, and the peaks of the unknown protein. As a result, a correct identification is strongly influenced by the skills of the person performing the selection.

It is known that when a process is strongly influenced by the skills of a human being, the process may become non-reproducible. In the present situation this means that the certainty of the identification may be low, resulting in a less valuable result.

Thus, an object of the present invention is to provide a method and a computer system for selecting at least one candidate protein from a database, in such a manner that the human factor is minimised. This and many other objects are believed to be fulfilled by a method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said method comprises preferably based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.

It should be noted that if the raw experimental spectrum is of very poor quality, there may, quite naturally, not be any selected peaks, meaning that "a list of selected peaks" can contain no peaks at all in the worst case.

The set of peak matches are preferably all those peaks that satisfy a condition that the distance between the two peak masses is less than a certain value. This certain value is in many preferred embodiments user defined, but may, of course, be defined based on the actual data.

In particular preferred embodiments of the method according to the present invention, the method comprises preferably the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being preferably determined as the negative logarithm of the probability for the set of peak matches to occur.
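As an illustration only, the matching condition and the score value can be sketched as follows. The function names, the use of the natural logarithm and the example masses are assumptions made for this sketch; how the probability of a set of matches is actually computed is not shown here.

```python
import math


def match_peaks(selected, theoretical, tol):
    """Return every (experimental, theoretical) mass pair closer than tol.

    tol plays the role of the user-defined match value; all masses in Da.
    """
    return [(x, t) for x in selected for t in theoretical if abs(x - t) < tol]


def score_from_probability(p):
    """Score value as the negative logarithm of the match-set probability.

    The base of the logarithm is an assumption (natural logarithm).
    """
    return -math.log(p)


# Hypothetical masses: only the first experimental peak has a partner within 0.1 Da.
matches = match_peaks([1000.5, 1500.0], [1000.52, 1499.0, 2000.0], tol=0.1)
```

A small probability for the observed set of matches thus maps to a large score value.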

In accordance with the present invention the method preferably always computes a score value to every protein stored in the database. In order to assess the score, which may be interpreted in a wrong manner if no measure is taken to avoid it, the present invention may preferably further comprise the steps of determining the probability for getting a score value equal to or above a predefined score (σ), wherein

the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question and/or the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach from the data base a score value at least as large as the score value in question.

Thanks to the provision of these two probabilities, the assessment may lead to the conclusion that none of the proteins represented in the data base is a likely candidate as a comparison of these two probabilities normally will show that top-scoring candidates from the data base search have score values much higher than what can be expected from just random matching. This last situation is especially valuable if the unknown protein is not in the database.

In particular preferred embodiments, the probability for the set of peak matches to occur preferably reflects the probability to have a predetermined number(r) of matches. Furthermore, the probability for the set of peak matches to occur may preferably be determined so that the probability is rewarded by many matches, and at the same time takes into account the propensity for large proteins having many matches.

The candidate protein(s) is preferably represented by a first set of peptide masses being a theoretical spectrum wherein each peak has the same intensity as all other peaks.

The extracting of noise-free mono-isotopic peptide peak by the method according to the present invention may preferably comprise determining the intensity level of the spectrum where the signal-to-noise ratio is unity, such as substantial unity, preferably by use of equation 1 disclosed herein; determining the intensity level of the spectrum where the signal-to-noise ratio is zero, such as substantial zero, preferably by use of equation 2 disclosed herein; and locating peak candidates by locating all parts, such as substantial all parts, of the spectrum over noise and extract, preferably by copying, the ranges to form peak candidates.

In connection hereto or in general, the extracting of noise-free mono-isotopic peptide peaks may preferably further comprise determining the peak entities mass/electric charge, intensity, width and signal-to-noise ratio for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.

Alternatively or in connection to the above, the extracting of noise-free mono-isotopic peptide peaks may preferably comprise or further comprise determining a baseline of the spectrum and smoothing the baseline; determining a noise level and smoothing the noise level; and locating peak candidates by locating all parts, such as substantially all parts, of the spectrum over noise and extracting, preferably by copying, the ranges to form peak candidates.

Furthermore, in particular preferred embodiments of the present invention, the extracting of noise-free mono-isotopic peptide peaks may further comprise fitting function parameters for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.

Preferably, the method according to the present invention allows peak extraction of peptide peaks representing peptides having any value of electrical charge.

The method according to the present invention may preferably further comprise the step of filtering the list of selected peaks. Additionally, the filtering may comprise discarding one or more peaks preferably based on input from a user of the method, said input may claim one or more specific peak to be discarded. The list is preferably filtered based on one or more of the strategies: echoes to consider, intensity cut, low and mass, aldi, peaks to exclude, peaks to keep, width cut.

In preferred embodiments of the method, a peak match is defined as a situation where the distance between the two peak masses is less than a predefined match value according to equation 6 disclosed herein, and the predefined match value may preferably be user defined. The method according to the present invention may preferably further comprise the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being determined as the negative logarithm of the probability for the set of peak matches to occur. Typically and preferably, a score value is determined for all proteins in the data base. Additionally or in general, the method may preferably further comprise the step of determining the probability for getting a score value equal to or above a predefined score (σ).

According to preferred embodiments of the method, the step of determining the probability may preferably be based on the list of peptide peaks, the predefined match value and the probability density for the list of peptide peaks.

Preferably, the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question. Additionally or in combination thereto, the probability for getting a score value equal to or above a predefined score (σ) is preferably the probability to reach from the data base a score value at least as large as the score value in question.

In particular preferred embodiments, the probability for getting a score value equal to or above a predefined score (σ) is calculated by equation 24 disclosed herein.

Preferably, the database stores a list of peptide masses and the corresponding parent proteins. Typically and preferably, the database results from a digestion of proteins, and the digestion may preferably have been performed by the method according to the second aspect of the present invention.

In a second aspect, the present invention preferably relates to a method for in silico digesting proteins, comprising establishing a plurality of protein sequences; checking, for each amino acid in the sequences, whether the amino acid acquires a post-translational modification and, if so, modifying the amino acid, and whether the current position coincides with a pre-specified cleavage site or is the right-most amino acid and, if so, modifying the amino acid accordingly; and computing and registering the masses for all possible combinations of minimal peptide masses.

In a third aspect, the present invention preferably relates to a computer system for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said computer system comprises preferably means for extracting noise-free mono-isotopic peptide peaks from a numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks, the extracting being based on a numerical representation of the mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s); and determining, such as computing means, the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.
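The in silico digestion of the second aspect may, purely by way of illustration, be sketched as below. The cleavage rule (cutting after K or R, similar to trypsin but ignoring any proline exception), the representation of a modification as a fixed per-residue mass shift, and all function names are assumptions of this sketch; the residue masses are standard mono-isotopic values in Da.

```python
# Standard mono-isotopic residue masses (Da); one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def minimal_peptides(sequence, cleave_after="KR"):
    """Cut after every cleavage residue and at the right-most amino acid."""
    pieces, start = [], 0
    for pos, aa in enumerate(sequence):
        if aa in cleave_after or pos == len(sequence) - 1:
            pieces.append(sequence[start:pos + 1])
            start = pos + 1
    return pieces


def peptide_mass(peptide, fixed_mods=None):
    """Sum residue masses plus optional per-residue mass shifts, plus water."""
    mods = fixed_mods or {}
    return sum(RESIDUE_MASS[aa] + mods.get(aa, 0.0) for aa in peptide) + WATER


def digest(sequence, max_missed=1, fixed_mods=None):
    """Register masses for all runs of up to max_missed + 1 consecutive minimal peptides."""
    pieces = minimal_peptides(sequence)
    result = []
    for i in range(len(pieces)):
        for k in range(max_missed + 1):
            if i + k < len(pieces):
                pep = "".join(pieces[i:i + k + 1])
                result.append((pep, peptide_mass(pep, fixed_mods)))
    return result
```

Calling `digest("AKGRC", max_missed=1)` on this hypothetical sequence yields the peptides AK, AKGR, GR, GRC and C together with their masses; variable modifications would multiply the number of registered masses and are left out here.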

The computer system according to the present invention comprises preferably means for performing some or all of the steps of the method according to the first aspect of the present invention. Such means comprises preferably one or more computer processor, memory, disk storage, one or more data connections and/or the like.

In order to ease use of the computer system and method according to the present invention, a graphical user interface is provided in a fourth aspect of the present invention. This graphical user interface is preferably particularly useful for guiding a user through a protein identification process and for presenting the result of the identification process, as the interface preferably comprises a number of module fields each representing one or more modules adapted to perform one or more of the steps of the method according to the first aspect of the present invention and/or the second aspect of the present invention. In accordance with the invention, these fields are preferably graphically arranged and graphically linked so as to reflect a predefined execution order of the modules, the execution order being preferably the calculation flow governed by the underlying algorithms. The graphical user interface is preferably adapted to initiate execution of a module in response to input to the computer system. The graphical user interface is preferably also adapted to change the appearance of a field corresponding to the module being executed. The input is preferably provided by a pointing device, such as a computer mouse, and an associated button press. Additionally or in combination thereto, one or more of these fields change appearance during execution of their corresponding module(s).

The graphical user interface according to the present invention may preferably further comprise one or more result fields appearing after results have been and/or during results are being generated by one or more of said modules and displaying the results.

The graphical user interface typically and preferably further comprises input fields preferably appearing when a user input is required. Furthermore, the graphical user interface may preferably further comprise one or more dialog windows through which user input may be entered and/or edited. In connection thereto or in general, one or more of the dialog windows may preferably allow a user to edit values stored in a data base, and the one or more dialog windows are preferably accessible via button(s) appearing on the interface.

Typically and preferably, the graphical user interface further comprises a tool bar via which actions can be executed by pushing buttons appearing on said tool bar, said tool bar preferably further comprising curtains, wherein each curtain represents a category of action and wherein each curtain comprises buttons for actions belonging to a particular category. Furthermore, the graphical user interface may preferably further comprise a set of windows that communicate the results of the protein identification process.

Detailed description of preferred embodiment

In the following, the present invention and in particular preferred embodiments thereof will be presented in connection with the accompanying figures, in which:

Fig. 1 shows a raw experimental spectrum.

Fig. 2 shows the algorithmic flow of the present invention, (a: automatic input u: user input).

Fig. 3 shows an example of a cluster that might be mistaken for a single, very broad, peak.

Fig. 4 shows a peak cluster.

Fig. 5 shows the isotope abundancy distributions for the four lowest-lying isotopes I(0, m), I(1, m), I(2, m) and I(3, m) (horizontal axis: m; vertical axis: isotope abundancy).

Fig. 6 shows a simple illustration of the chemistry of post-translational modifications (ptm:s) and missed cleavages, and their potential effect on the peaks of a mass spectrum: a. no ptm:s, no missed cleavage; b. ptm:s, both fixed and variable; c. missed cleavage.

Fig. 7 shows an example of the distribution of the number of peptides per parent protein. On the y-axis is shown the values of the frequency distribution for the number of peptides per parent protein. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.

Fig. 8 shows an example of the distribution of the peptide masses. On the y-axis is shown the values of the peptide mass frequency distribution. Note that the distribution is dependent on choice of database, post-translational modifications, and allowed number of missed cleavages.

Fig. 9 shows an example of a clear case of protein identification. On the y-axis is shown the values of the "random" and "database" functions defined in equations eq. 25 and eq. 26 respectively. The top-scoring candidates from the database search have score values much higher than what can be expected from just random matching.

Fig. 10 shows an example of a case with no clear protein identification. On the y-axis is shown the values of the "random" and "database" functions defined in equations eq. 25 and eq. 26 respectively. The top-scoring candidates from the database search have score values not higher than what can be expected from just random matching. Further experimental measures have to be taken.

Fig. 11 shows the workspace as it appears on startup. The arrows between the analysis boxes illustrate the flow of data in the analysis, the boxes represent key steps in the analysis, as described in the sections "Algorithms". Graphs of the data and results will occupy the empty space on the left-hand side, once the analysis is started.

Fig. 12 shows the File menu.

Fig. 13 shows the Edit menu.

Fig. 14 shows the Actions menu.

Fig. 15 shows the dialogue window that comes up when the MSFiles box is activated. By clicking on the relevant fields the user chooses what spectrum files to be analyzed. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step, and move directly to the peak filtering step.

Fig. 16 shows the options menu for the peak filtering step, corresponding to the box Pepfil.

Fig. 17 shows the options menu related to the database digestion as well as the scoring and the validation steps; corresponding to the box Matcher.

Fig. 18 shows the window in which a user can specify his/her own post-translational modification.

Fig. 19 shows the mass spectrum during different stages in the analysis. The top graph is the unprocessed spectrum. The middle graph shows the extracted mono-isotopic peaks. The bottom graph shows the mono-isotopic peaks after the filtering step. Zooming in any of the three graphs will automatically zoom the two other graphs as well. The table at the bottom left of the workspace shows the top-scoring proteins from the database.

Fig. 20 shows a table presenting the top-scoring proteins of the database.

Fig. 21 shows a table presenting the peaks that were extracted and met the peak criteria in the filtering step. This interactive window enables a user to manually add or remove peaks that he/she thinks can affect the database search.

Fig. 22 shows detailed information from the database search about a particular database protein.

Fig. 23 shows the search result using the default parameters of the present invention. No likely protein candidate is found.

Fig. 24 shows the search result when the parameter values have been chosen judiciously, taking into account known information about the experimental spectrum, assumed post-translational modifications etc., leading to one very likely protein candidate, marked by a green flag.

Fig. 25 shows the workspace after a batch run. The result for each individual spectrum can be studied in the same way as after processing a single spectrum.

Note: The figures and the references to figures in this description are merely illustrations of the methods, functions and features of the present invention.

This Description

The description of the present invention is divided into six parts:

• Input and Output

• The Algorithms

• The Graphical User Interface

• The Computer Implementation

• Other Methods

• Applications

Input and Output

The present invention takes as input a spectrum from a mass spectrometer, a protein database, and a set of user options. The output is a table of those proteins (from the database) that are the most likely candidates for having generated the mass spectrum.

Input - the experimental mass spectrum

The relevant numbers representing an experimental mass spectrum, an example of which is shown in figure 1, are assumed to be stored in a data file. Different data file formats are accepted by the present invention. A general description of allowed formats is the following:

h

(m/z)_1 separation field intensity_1

(m/z)_2 separation field intensity_2

...

(m/z)_N separation field intensity_N

where

• h is a header row, which may or may not be empty. If it is not empty it may contain information about the experimental setup that generated the spectrum. If it is empty, it may also be left out.

• The first column should contain values of the entity mass/electric charge (m/z), measured in units of Dalton (Da).

• The second column is a field that separates the first and the third column. It can, for example, consist of a comma or white space.

• The third column should contain values of the intensity measured at the mass spectrometer.

• N is the number of datapoints in the spectrum.
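A minimal sketch of a reader for this kind of file is given below. The function name, the rule that a leading non-numeric row is treated as the header h, and the example header text are assumptions of the sketch; it accepts a comma or white space as separation field and also checks that the m/z values increase strictly, as required of the spectrum.

```python
import re


def parse_spectrum(text):
    """Parse rows of 'm/z <separator> intensity'; return (header, datapoints)."""
    header, points = None, []
    for n, line in enumerate(text.splitlines()):
        line = line.strip()
        if not line:
            continue
        # Accept a comma, white space, or both as the separation field.
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        try:
            mz, intensity = float(fields[0]), float(fields[1])
        except (ValueError, IndexError):
            # Only the very first non-empty row may be a non-numeric header.
            if not points and header is None:
                header = line
                continue
            raise ValueError(f"unparsable row {n + 1}: {line!r}")
        if points and mz <= points[-1][0]:
            raise ValueError("m/z values must increase strictly")
        points.append((mz, intensity))
    return header, points


# Hypothetical file content with a header row and mixed separators.
header, points = parse_spectrum("MALDI run 7\n1000.0, 12.5\n1000.1 13.0\n")
```

A header row that happens to start with two numbers would be misread as a datapoint by this sketch; a real reader would need a format-specific rule.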

It is furthermore required that (m/z)_(k+1) > (m/z)_k for k = 1, ..., N - 1. The entire m/z-range of the experimental spectrum is therefore [(m/z)_1, (m/z)_N]. Figure 1 shows a mass spectrum where, as is conventional, the datapoint values of the first column (m/z) and the third column (intensity) correspond to the horizontal and vertical directions, respectively. Note: in the following, x and y are defined as x ≡ m/z and y ≡ intensity.

Input - the protein database

A protein database consists of a table of proteins. Depending on how well annotated the database is, the amount of information available varies. A minimum requirement for the present invention to work, is that for each database entry there is an identification tag (number or name) of the protein, and that its amino acid sequence is presented, where the amino acids are represented by their one-letter code. Additional but not necessary information is, for example, information about species, protein weight etc. Examples of protein databases that can be used are SWISSPROT [1], and NCBI non-redundant peptide sequence database.

Input - the user options

A judicious choice of user options strongly increases the chances for a successful protein identification. The user options are described in detail below, in the section "The Algorithms" as well as in the section "The Graphical User Interface".

Output

The output is a table presenting the top candidates, that is, those proteins in the database whose theoretical spectra show the strongest resemblance to the unknown experimental spectrum. The measure of resemblance is a score value, defined and described in the section "The Algorithms" below. The proteins in the table are presented in descending order with respect to resemblance, hence having the most likely candidate on the top row, the second most likely candidate on the second row etc. For each top candidate in the table, its score value and other parameters related to the search statistics, as well as its particular amino acid sequence, are presented. Given the contents of the output table, a graphical illustration of the output can also be given. This is further described in the section "The Graphical User Interface" below.

The Algorithms

The algorithms of this invention can be divided into five well-separated parts

• the peak extraction

• the peak filtering

• in silico digestion of the database proteins

• the spectrum matching

• the validation

A flow-chart of the algorithms is shown in figure 2.

The peak extraction - first preferred embodiment

The peak extraction is divided into a set of five sub-processes. A user can influence which peaks are extracted through the choice of signal-to-noise ratio s/n. The user's choice is denoted by σ.

The peak extraction 1: Separating signal from noise

In the present invention the separation of signal from noise is done in a series of steps.

1. Divide the entire spectral $x$ range into sub-intervals $m_i$ of width $\omega$, $i = 1, 2, \ldots, I$, where $I$ is the number of sub-intervals.

2. Let $Y_i$ and $Z_i$ be the maximum and minimum intensity values, respectively, in the interval $m_i$, and put $W_i = Y_i - Z_i$.

3. For each integer $x$ value in the spectrum, $j$, place a symmetric window $M_j$ of width $\Omega$ around $j$. Let $i_a$ be the index of the leftmost sub-interval $m_i$ that is covered by $M_j$, and let $i_b$ be the index of the rightmost sub-interval $m_i$ that is covered by $M_j$.

4. Define

$$y^1(j) \equiv \frac{\sum_{i=i_a}^{i_b} Y_i / W_i}{\sum_{i=i_a}^{i_b} 1 / W_i} \qquad (1)$$

5. $y^1(j)$ is now taken as the level of s/n = 1 for $j$.

6. Define

$$y^0(j) \equiv \frac{\sum_{i=i_a}^{i_b} w_i Z_i}{\sum_{i=i_a}^{i_b} w_i}, \quad \text{where } w_i \equiv \frac{1}{W_i} \qquad (2)$$

7. $y^0(j)$ is now taken as the level of s/n = 0 for $j$.

8. For each datapoint $d_n = (x_n, y_n)$ find the closest integer $x$ value $j_n = j(x_n, y_n)$. The signal-to-noise ratio for $d_n$ is calculated as

$$\sigma(d_n) = \frac{y_n - y^0(j_n)}{y^1(j_n) - y^0(j_n)}$$

9. Classify all datapoints $d_n = (x_n, y_n)$ into the three sets $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$ according to:

• $d_n \in \mathcal{D}_{noise}$ if $\sigma(d_n) < 1$

• $d_n \in \mathcal{D}_{support}$ if $1 \leq \sigma(d_n) \leq \sigma$

• $d_n \in \mathcal{D}_{signal}$ if $\sigma < \sigma(d_n)$

where $\sigma$ is the signal-to-noise level chosen by the user.

Note 1: $y = y^1(j)$ is the intensity level that minimizes the (weighted) quadratic distance $D(y) = \sum_{i=i_a}^{i_b} (y - Y_i)^2 / W_i$, which therefore also gives a straightforward and unambiguous definition of the level s/n = 1.

Note 2: The reason to calculate the weighted quadratic distance $D(y) = \sum_{i=i_a}^{i_b} (y - Y_i)^2 / W_i$ and not just $D(y) = \sum_{i=i_a}^{i_b} (y - Y_i)^2$ is to prevent the peaks themselves from having too much influence. Computing a non-weighted average would make the level of s/n = 1 too high in regions with very high peaks or in regions with very many peaks.
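The separation of signal from noise can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the weight is assumed to be the inverse of the intensity range of a sub-interval, a sub-interval is taken as "covered" when it lies entirely inside the window, the boundary between noise and support is placed at s/n = 1, and the function names and default widths are hypothetical.

```python
import math


def noise_levels(xs, ys, omega=1.0, big_omega=20.0):
    """Return (y0, y1): dicts giving the s/n = 0 and s/n = 1 levels per integer x value j."""
    x_min = xs[0]
    n_sub = int((xs[-1] - x_min) // omega) + 1
    hi = [None] * n_sub  # maximum intensity per sub-interval
    lo = [None] * n_sub  # minimum intensity per sub-interval
    for x, y in zip(xs, ys):
        i = min(int((x - x_min) // omega), n_sub - 1)
        hi[i] = y if hi[i] is None else max(hi[i], y)
        lo[i] = y if lo[i] is None else min(lo[i], y)
    y0, y1 = {}, {}
    for j in range(math.ceil(x_min), math.floor(xs[-1]) + 1):
        num1 = num0 = den = 0.0
        for i in range(n_sub):
            left = x_min + i * omega
            inside = left >= j - big_omega / 2 and left + omega <= j + big_omega / 2
            if hi[i] is None or not inside:
                continue
            # Assumed weight: inverse intensity range, guarded against a zero range.
            w = 1.0 / max(hi[i] - lo[i], 1e-12)
            num1 += w * hi[i]
            num0 += w * lo[i]
            den += w
        if den > 0:
            y1[j], y0[j] = num1 / den, num0 / den
    return y0, y1


def signal_to_noise(x, y, y0, y1):
    """s/n of one datapoint, 0 at the y0 level and 1 at the y1 level."""
    j = round(x)  # closest integer x value
    return (y - y0[j]) / (y1[j] - y0[j])  # undefined if the two levels coincide


def classify(sn, sigma_user):
    """Assign a datapoint to the noise, support or signal set."""
    if sn < 1:
        return "noise"
    return "support" if sn <= sigma_user else "signal"
```

Because the weights shrink where the intensity range is large, a sub-interval containing a tall peak contributes little to either level, which is the effect the weighting is meant to achieve.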

The peak extraction 2: Classification of the datapoints and peak determination

Given the three sets $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$, the present invention will now determine peak properties. A peak $\mathcal{P}$ is built up by datapoints ($d_n \equiv (x_n, y_n)$) according to the following criteria:

• $\mathcal{P} = [d_j = d_j(\mathcal{P});\ x_j < x_{j+1};\ j = 0, 1, \ldots, J(\mathcal{P}), J(\mathcal{P}) + 1]$

• $d_j \in \mathcal{D}_{support}$ or $d_j \in \mathcal{D}_{signal}$ for $j = 1, \ldots, J(\mathcal{P})$
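As a sketch, peak candidates can be formed by grouping consecutive datapoints that are not noise. The label strings and the function name are assumptions, and the sketch returns only the interior datapoints of each peak, on the assumption that the two bracketing datapoints are its noise neighbours.

```python
def build_peaks(labels):
    """Group consecutive non-noise datapoints into peaks.

    labels[n] is "noise", "support" or "signal" for datapoint n.
    Returns inclusive (first, last) index pairs of the non-noise runs.
    """
    peaks, start = [], None
    for n, label in enumerate(labels):
        if label != "noise" and start is None:
            start = n  # a new run of support/signal datapoints begins
        elif label == "noise" and start is not None:
            peaks.append((start, n - 1))  # the run ended at the previous datapoint
            start = None
    if start is not None:
        peaks.append((start, len(labels) - 1))
    return peaks


# Hypothetical labelling of eight consecutive datapoints.
runs = build_peaks(["noise", "support", "signal", "support",
                    "noise", "noise", "support", "noise"])
```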

In figure 3 is shown a portion of a raw experimental spectrum. The non-constant line marked 1 (solid) connects experimental datapoints. The lowest constant line, line 2 (dashed), indicates the level where the signal-to-noise level is 1. The second lowest constant line, line 4 (dotted), indicates the level where the signal-to-noise level is the one that the user has chosen ($= \sigma$). Crosses marked 3 are those datapoints that belong to the set $\mathcal{D}_{signal}$. The dots on line 1 between lines 2 and 4 are those datapoints that belong to the set $\mathcal{D}_{support}$. As can be seen in the figure, there are at least four peaks, but unless one adds an extra criterion to the three peak criteria above, the algorithm will extract only one peak. This one peak will have an $x$ range, $r_x(\mathcal{P}) \equiv x_{J(\mathcal{P})+1} - x_0$, covering approximately 6 Da. Since peaks representing different isotopes of the same singly charged peptide should be separated by a distance of approximately 1 Da, a peak cannot have an $x$ range very much more than 1 Da. Therefore a fourth criterion is added to the three above:

• $r_x(\mathcal{P}) < r_x^{cut}$, where $r_x^{cut}$ is a value slightly above 1 Da.

In order to implement this fourth criterion, the present invention employs a procedure that systematically checks the value of $r_x(\mathcal{P})$:

1. Given:

• $\mathcal{P} = \mathcal{P}_i$

• the levels that separate $\mathcal{D}_{noise}$, $\mathcal{D}_{support}$ and $\mathcal{D}_{signal}$ (the lines 2 and 4 in figure 3 for $i = 0$)

2. Determine $r_x(\mathcal{P}_i)$. If

• $r_x(\mathcal{P}_i) < r_x^{cut}$: Determine the peak properties of $\mathcal{P}_i$, see below.

• $r_x(\mathcal{P}_i) > r_x^{cut}$: Continue to step 3.

3. Introduce a cut just above the $y$ level of the datapoint in $\mathcal{P}_i$ where the lowest local minimum occurs.

4. This cut replaces line 2, hence increasing $\mathcal{D}_{noise}$ at the expense of $\mathcal{D}_{support}$ in this particular $x$ range.

5. The cut divides $\mathcal{P}_i$ into two new peak sets $\mathcal{P}_{i1}$ and $\mathcal{P}_{i2}$.

6. Repeat steps 2 and 3 for $\mathcal{P}_{i1}$ and $\mathcal{P}_{i2}$ until all peaks satisfy $r_x(\mathcal{P}) < r_x^{cut}$.

7. Start with $\mathcal{P} = \mathcal{P}_0$; that is, $i = 0$.

In figure 3 the constant line marked 5 (double-dashed) indicates the last position of the moving cut. Having determined peaks that obey the four peak criteria, the present invention now proceeds to calculate peak properties.
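The moving cut can be illustrated by the recursive sketch below. It rests on assumptions: a peak is given as an inclusive index range of non-noise datapoints, its x range is measured between its first and last interior datapoint rather than between the bracketing noise datapoints, every datapoint at or below the cut level is treated as noise, and the function name and the default of 1.2 Da for the maximum range are hypothetical.

```python
def split_peak(xs, ys, first, last, r_cut=1.2):
    """Split the run [first, last] until every piece has an x range below r_cut."""
    if xs[last] - xs[first] < r_cut:
        return [(first, last)]
    # Interior local minima of the intensity within the run.
    minima = [n for n in range(first + 1, last)
              if ys[n] < ys[n - 1] and ys[n] <= ys[n + 1]]
    if not minima:
        return [(first, last)]  # nothing to cut at; keep the broad peak
    level = min(ys[n] for n in minima)  # cut placed at the lowest local minimum
    runs, start = [], None
    for n in range(first, last + 1):
        if ys[n] > level and start is None:
            start = n
        elif ys[n] <= level and start is not None:
            runs.append((start, n - 1))
            start = None
    if start is not None:
        runs.append((start, last))
    # Each run is strictly shorter than its parent, so the recursion terminates.
    return [piece for a, b in runs for piece in split_peak(xs, ys, a, b, r_cut)]
```

Applied to a run spanning several Da with valleys between isotope peaks, the sketch returns one index range per resolved peak.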

The peak extraction 3: Computing peak properties

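The intensity of a peak $\mathcal{P}$ is in the present invention taken as the maximum intensity value of those datapoints that build up the peak. The center of mass and width of a peak are determined by a centroid calculation. Furthermore, the signal-to-noise ratio of the peak is determined through its maximum intensity value, using the definition of signal-to-noise ratio for an individual datapoint as described in step 8 in the passage "Separating signal from noise" above.

• $y(\mathcal{P}) = \max_{\mathcal{P}}(y_j)$

• $x_c(\mathcal{P})$ and $w(\mathcal{P})$: the center of mass and the width given by the centroid calculation

• $\sigma(\mathcal{P}) = \dfrac{y(\mathcal{P}) - y^0(\bar{x}(\mathcal{P}))}{y^1(\bar{x}(\mathcal{P})) - y^0(\bar{x}(\mathcal{P}))}$

where $\bar{x}(\mathcal{P})$ is the closest integer value to $x_c(\mathcal{P})$.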
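A sketch of the peak property calculation follows. The centroid is assumed to be the intensity-weighted mean of the x values and the width the intensity-weighted standard deviation around it; these specific formulas, the function name and the dictionaries holding the s/n = 0 and s/n = 1 levels per integer x value are assumptions of the sketch.

```python
import math


def peak_properties(xs, ys, first, last, y0, y1):
    """Return (intensity, centre, width, s/n) for datapoints first..last inclusive."""
    pts = list(zip(xs[first:last + 1], ys[first:last + 1]))
    height = max(y for _, y in pts)  # peak intensity = maximum datapoint intensity
    total = sum(y for _, y in pts)
    centre = sum(x * y for x, y in pts) / total  # assumed centroid definition
    width = math.sqrt(sum(y * (x - centre) ** 2 for x, y in pts) / total)
    j = round(centre)  # closest integer x value to the centroid
    sn = (height - y0[j]) / (y1[j] - y0[j])
    return height, centre, width, sn


# Hypothetical three-point peak centred at 1000 Da.
props = peak_properties([999.8, 1000.0, 1000.2], [1.0, 2.0, 1.0], 0, 2,
                        {1000: 0.5}, {1000: 1.0})
```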

In figure 3 the vertical impulses marked 6 (dash-dot) indicate the central $x$ values ($x_c(\mathcal{P})$) and maximum $y$ values ($y(\mathcal{P})$) for the peaks that have been extracted.

The peak extraction 4: Determining peak clusters

When the properties of all peaks have been computed, the next step is to partition the peaks into peak clusters. Two peaks, $\mathcal{P}$ and $\mathcal{P}'$, are defined to be neighbours in and members of the same peak cluster if

$$1 - \tau < |x_c(\mathcal{P}) - x_c(\mathcal{P}')| < 1 + \tau,$$

where $\tau \approx 0.1$ Da. The reason for requiring the separation to lie in that particular range is that consecutive isotopes belonging to the same singly charged chemical compound should be separated by the mass of the neutron, approximately 1 Da. An additional criterion is imposed on a cluster: at least one peak in a cluster needs to have at least one datapoint in $\mathcal{D}_{signal}$. Such a peak is said to be in $\mathcal{D}_{signal}$. The p:th cluster can, in turn, be defined as a set of peaks:

$$C_p \equiv [\mathcal{P}_p(r);\ \mathcal{P}_p(r) = (y(p, r), x_c(p, r), w(p, r)),\ r = 1, \ldots, r(p);\ \exists\, \mathcal{P}_p(r) \in \mathcal{D}_{signal}] \qquad (3)$$

where $\mathcal{P}_p(r)$ is peak number $r$ in the p:th cluster $C_p$, and $r(p)$ ($\geq 1$) is the number of peaks in that cluster.

The peak extraction 5: Finding mono-isotopic peaks in a cluster

The present invention proceeds now to select the mono-isotopic peaks in a peak cluster. Consider the p:th cluster C_p defined above, an example of which is shown in figure 4. In order to identify the mono-isotopic peaks, the present invention computes the peptide abundancies that build up the cluster. By

• using tabulated isotope abundancies for the elements that build up peptides (these elements are H, C, N, O and S)

• digesting a protein database

isotopic distributions as functions of peptide mass are readily computed: λ(i, m) is the portion of isotope i at mass m, such that Σ_i λ(i, m) = 1, where i = 0 represents the mono-isotope. Figure 5 shows the distributions λ(i, m) for the four lowest-lying isotopes. Due to statistical variations there is a width in each distribution, indicated by error bars. The isotopic distributions can therefore be written as λ(i, m) = μ(i, m) + ν(i, m), where μ(i, m) is the center of a distribution and ν(i, m) is the (positive and negative) width around that center. In figure 5 the center and the width are the average and standard deviation, respectively. The values of the isotopic distributions are in the present invention kept in a table.

Consider now the vector a_p ≡ a = (a_1, ..., a_r, ..., a_r(p)). The present approach is such that a value of a_r > 0 will indicate that there is a peptide whose mono-isotopic component is located at peak number r in the cluster. The abundancy of this peptide is defined to be a_r. Also taken into account is the possibility that a peak can be built up by different isotope components stemming from different peptides. In a situation free from experimental noise and variations in the isotopic pattern (meaning ν(i, m) = 0, so that λ(i, m) = μ(i, m)) the present invention would proceed to solve the following system of equations for the cluster C_p:

$$\begin{aligned}
a_1 \cdot \mu(0,m) &= y(p,1) \\
a_1 \cdot \mu(1,m) + a_2 \cdot \mu(0,m) &= y(p,2) \\
&\ \,\vdots \\
a_1 \cdot \mu(r-1,m) + \ldots + a_r \cdot \mu(0,m) &= y(p,r) \\
&\ \,\vdots \\
a_1 \cdot \mu(r(p)-1,m) + \ldots + a_{r(p)} \cdot \mu(0,m) &= y(p,r(p))
\end{aligned}$$

where m is taken as x_c(p, 1). This system contains r(p) equations and r(p) unknowns, and has therefore a unique solution of peptide abundancies, a = (a_1, ..., a_r, ..., a_r(p)). In the ideal situation the solution will consist of a (small) number of a_r such that a_r > 0 and then a (larger) number of a_r for which a_r = 0. Because of the presence of noise (figure 4) and variations in isotope distributions (ν(i, m) ≠ 0, as illustrated by the error bars in figure 5), the solution a can in general contain components a_r < 0, a non-realistic solution in the present context. The approach is instead to find a solution a that is the best possible under the constraint that a_r ≥ 0 for all r. By "best possible" solution is meant the following: Define

$$d(a,p,r) \equiv y(p,r) - \left(a_1 \cdot \mu(r-1,m) + \ldots + a_r \cdot \mu(0,m)\right) = y(p,r) - \sum_{s=1}^{r} a_s \cdot \mu(r-s,m) \tag{4}$$

and

$$D \equiv D(a) \equiv \sum_{r=1}^{r(p)} d(a,p,r)^2 \tag{5}$$

The best possible solution is the a = â = (â_1, ..., â_r, ..., â_r(p)) that simultaneously satisfies the three conditions

• a_r ≥ 0 for all r

• The condition a_r = 0 should be imposed on as many values of r as possible.

• D takes on a value less than the uncertainty given by the spectrum noise and the width in the isotope distributions.

The present invention employs the well-known technique of quadratic programming to solve this constrained minimization problem. So, for each cluster C_p there will be mono-isotopic peptides in and only in those positions where â_r > 0.
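The constrained minimization of D(a) in eq. 5 can be illustrated with a small sketch. It is not the invention's quadratic programming solver: it uses plain coordinate descent with clipping at zero, which handles the first condition (a_r ≥ 0) and minimizes D, but does not implement the sparsity and noise-threshold conditions. The function name `solve_cluster` and the isotope pattern `mu` are hypothetical illustration values, and the pattern is taken to be the same for all peaks in the cluster (m fixed at x_c(p, 1)).

```python
def solve_cluster(y, mu, sweeps=200):
    """Minimise D(a) = sum_r d(a,p,r)^2 subject to a_r >= 0.

    y  : peak intensities y(p, r) of one cluster, r = 1..r(p)
    mu : isotope pattern mu(i, m) for i = 0, 1, ... (ASSUMED illustration values)
    Returns (abundancies a, residual D).
    """
    n = len(y)
    # M[r][s] = mu(r - s, m) for s <= r: the lower-triangular system of eq. 4
    M = [[mu[r - s] if 0 <= r - s < len(mu) else 0.0 for s in range(n)]
         for r in range(n)]
    a = [0.0] * n
    resid = list(y)                      # d(a,p,r) for the current a
    for _ in range(sweeps):
        for s in range(n):
            col = [M[r][s] for r in range(n)]
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            # best value of a_s with all other components fixed, clipped at 0
            target = max(0.0, a[s] + sum(c * d for c, d in zip(col, resid)) / norm)
            step = target - a[s]
            resid = [d - c * step for c, d in zip(col, resid)]
            a[s] = target
    return a, sum(d * d for d in resid)


mu = [0.5, 0.3, 0.15, 0.05]              # hypothetical isotope pattern
y = [50.0, 30.0, 35.0, 17.0, 6.0]        # built from abundancies (100, 0, 40, 0, 0)
a, D = solve_cluster(y, mu)
```

In this noise-free example the sketch recovers mono-isotopic peptides at peaks 1 and 3 only. With noisy intensities the unconstrained solution may contain negative components, which the clipping step replaces by zero.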

A note on spectra that may contain peaks representing peptides with higher charges

The methods for peak extraction of singly charged peptides described above can in a straightforward manner be generalized to spectra that may contain peptides with higher charges. An outline of such an extension is given here:

• Separating signal from noise: Same method as in the singly charged case

• Classification of the datapoints and peak determination: Same method as in the singly charged case

• Computing peak properties: Same method as in the singly charged case

• Determining peak clusters: Same method as in the singly charged case, with the following modification: Consecutive isotopes that belong to the same chemical compound of charge z should be separated by the mass of the neutron divided by z, ≈ 1/z Da. This fact implies that in this type of spectra each peak cluster gets a charge label z, and adjacent peaks in such a cluster are separated by 1/z Da.

• Creating disjoint systems of equations: A peak in a spectrum that may contain peptides with higher charges can represent peptides of different charges. This means that the systems of equations that describe the peak intensities in terms of isotopic distributions λ(i, m) and peptide abundancies a_r may get contributions from more than one peak cluster. In this general case it is therefore necessary to introduce a procedure that creates a set of disjoint systems of equations, where a disjoint system may or may not get contributions from more than one peak cluster, and a peak cluster contributes to one and only one such disjoint system.

• Finding mono-isotopes in a cluster: Given a set of disjoint systems of equations: Same method as in the singly charged case

The peak extraction - second preferred embodiment

1: The baseline of the spectrum is calculated. This is done by finding, over the given raw data, the smallest intensity within a running window (user option: -bcr [baseline constant range]). This calculation is performed by analysis of the measured intensity values by means of histograms, and the smallest valued bin in the histogram is chosen as the baseline value. The baseline is smoothed after the analysis to make sure that the curve is continuous.

2: The noise level calculation is performed in a similar way as the baseline calculation, but with another running window size (user option: -nr [noise range]). Here, the histogram bin value is chosen in a different way, utilizing the fact that true measured values are more sparse than noise. The histogram describes the density of the measured intensities, and the bin value under a user-defined threshold (user option: -nd [noise density]) is chosen as the noise level. The noise curve is smoothed to assure continuity.

3: Peak candidates are located. This means locating all parts of the spectrum over noise, and extracting (copying) these ranges to form peak candidates. A peak candidate is defined by an increase in intensity (with increasing mass), reaching a peak, followed by a decrease of intensity to some cutoff. This algorithm step is not very precise, and is improved on in the subsequent steps.

4: Fitting of parameters on the peak candidates. This is done in three passes:

a. A crude, but robust, fit is made to get estimates on the peak parameters.

b. The estimates from a. are used in a more refined single gaussian fit. These peak parameters are usually used in subsequent analysis of the spectrum.

c. The last fit tries to fit more than one gaussian to the peak candidate. This is done to resolve a possible superposition of peaks. If this is successful, that is, more than one peak was fitted, then these peak parameters are used, and the candidate results in more than one peak.

5: Parameter fitting always yields results, and the fitted parameters are analysed, so that strange, contradictory or too wide (user option: -wc [width cut]) peaks are filtered out.

6: All remaining peaks are now bundled into clusters. A cluster is defined by peaks at most 1 Da apart. The clusters are analysed and compound clusters (i.e. contributions from more than 1 peptide) are resolved, leaving only single peptide clusters. (User option: -ida [isotope drift accuracy] sets the limit on how much peak masses can differ from one neutron mass to be treated as coming from one peptide.)

7: The single peptide clusters are analysed; mono-isotopic mass, charge and a quality measure are calculated for each cluster. The user option -mir [min_isotopic_reduction] is used in the quality analysis of whether the single cluster is a superposition of more than one peptide differing by an integer times the mass of a neutron; -mir sets the acceptable deviation from the cluster model used. User option -mi [mono_isotopes] defines whether to report all extracted peaks or mono-isotopes only. User option -of [overflow] sets the intensity measurement limit of the mass spectrometer. This parameter is needed since the measured isotopic clusters become corrupt if the measurements are overflowed. The user options -im and -ic [signal-to-noise or absolute intensity] set the model to use and the cut-off value.
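The bundling of peaks into clusters in step 6 can be sketched as follows. This is only an illustration under stated assumptions: the function name `cluster_peaks`, the greedy chaining of neighbours, the neutron mass rounded to 1.0 Da, and the default drift of 0.1 Da are all hypothetical choices, and the resolution of compound clusters is not shown.

```python
def cluster_peaks(masses, charge=1, drift=0.1, neutron=1.0):
    """Group peak centers into clusters of isotope-spaced neighbours.

    Two consecutive peaks join the same cluster when their distance differs
    from neutron/charge by at most `drift` (ASSUMED role of the isotope drift
    accuracy option). Returns a list of clusters, each a list of masses.
    """
    spacing = neutron / charge
    clusters = []
    for m in sorted(masses):
        if clusters and abs((m - clusters[-1][-1]) - spacing) <= drift:
            clusters[-1].append(m)       # continues the current isotope ladder
        else:
            clusters.append([m])         # starts a new cluster
    return clusters


singly = cluster_peaks([1000.5, 1001.52, 1002.49, 1100.0, 1100.4])
doubly = cluster_peaks([500.0, 500.5, 501.0], charge=2, drift=0.05)
```

Here the first three peaks form one singly charged cluster, while the peaks at 1100.0 and 1100.4 stay separate because their spacing is far from 1 Da.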

The peak filtering

The peak extraction step described above has resulted in a set of selected peaks. However, a user of the present invention may want to discard some peaks. Those unwanted peaks may or may not be part of the selected peaks; if they are, they will be filtered out, that is removed, in this step. The invention supports the following filters and any combination of them:

echoes to consider: In a mass spectrum, so-called peak echoes sometimes occur. By this is meant that the experimental mass spectrum sometimes contains peaks that are false doublets of true peaks. These false doublets often appear at certain well-established distances from the true peaks. This filter handles that problem. Assume the user chooses a value d, measured in Dalton. If two selected peaks have a mutual distance of (d ± (tolerance window)) Da, the peak with the higher m/z value is considered to be an echo, and is therefore removed.

intensity cut: All peaks with an intensity below the intensity cut value, specified by the user, are removed.

lower & upper mass: A user may not want to use peaks below or above certain m/z values when the spectrum matching is performed. All selected peaks outside the desired range are removed.

maldi: In a Maldi mass spectrometer, the digested peptides carry an extra H⁺ ion. Preparing for the subsequent spectrum matching and database search, the user can choose to have all selected peaks have their m/z value reduced by m/z(H⁺).

peaks to exclude: In a mass spectrum some particular m/z values often represent calibrants, parts of the digestion enzyme, or known contaminants. These m/z values can be specified by the user and if these appear in the set of selected peaks, within a tolerance window, the peaks are removed.

peaks to keep: Suppose the number of selected peaks is S and the user only wants U peaks to go into the spectrum matching. In this filter those U peaks with the highest signal-to-noise are kept, and the other S-U are removed.

width cut: A threshold for width of peaks. Peaks with a width above the width cut value, specified by the user, are removed.
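The filters listed above can be sketched in a few lines. The sketch is hypothetical in its details: the `Peak` record, the function name `filter_peaks`, the default tolerance, and the order in which the filters are applied are assumptions, since the text only states that any combination of filters is supported.

```python
from dataclasses import dataclass

PROTON = 1.00728  # mass of H+ in Da


@dataclass
class Peak:
    mz: float
    intensity: float
    width: float
    snr: float


def filter_peaks(peaks, echo=None, tol=0.05, intensity_cut=0.0,
                 lower=0.0, upper=float("inf"), width_cut=float("inf"),
                 exclude=(), keep=None, maldi=False):
    """Apply the peak filters; the order of application is an ASSUMPTION."""
    kept = sorted(peaks, key=lambda p: p.mz)
    if echo is not None:
        # echoes: drop the higher-m/z member of any pair echo +/- tol apart
        echoes = {id(q) for p in kept for q in kept
                  if abs((q.mz - p.mz) - echo) <= tol}
        kept = [p for p in kept if id(p) not in echoes]
    kept = [p for p in kept
            if p.intensity >= intensity_cut                 # intensity cut
            and lower <= p.mz <= upper                      # lower & upper mass
            and p.width <= width_cut                        # width cut
            and all(abs(p.mz - e) > tol for e in exclude)]  # peaks to exclude
    if keep is not None:
        # peaks to keep: the U peaks with the highest signal-to-noise
        best = sorted(kept, key=lambda p: p.snr, reverse=True)[:keep]
        kept = sorted(best, key=lambda p: p.mz)
    if maldi:
        # maldi: reduce every m/z by the mass of the extra H+ ion
        kept = [Peak(p.mz - PROTON, p.intensity, p.width, p.snr) for p in kept]
    return kept


peaks = [Peak(1000.0, 50, 0.1, 20), Peak(1022.0, 10, 0.1, 5),
         Peak(1500.0, 30, 0.1, 12), Peak(1800.0, 2, 0.1, 1),
         Peak(2000.0, 40, 0.9, 15)]
out = filter_peaks(peaks, echo=22.0, intensity_cut=5, width_cut=0.5, keep=2)
```

In this example the peak at 1022.0 is removed as an echo of 1000.0, the peak at 1800.0 falls below the intensity cut, and the peak at 2000.0 is too wide.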

The simplest form of output after the peak extraction and peak filtering steps is a table where each row contains information about an extracted peak that has survived the filter step. A row consists of values for the parameters of peak properties such as m/z, intensity, peak width, signal-to-noise ratio and peptide abundancy.

Digestion of proteins in the protein database

The purpose of digestion of proteins in a protein database, so-called digestion in silico, is to mimic the enzymatic digestion of the real unknown protein that takes place in the laboratory, and hence compute theoretical spectra, one theoretical spectrum for each database protein. Having that, the present invention can compare the experimental spectrum with each theoretical spectrum.

Now, in the laboratory, digestion has been carried out by a site-specific enzyme. By this is meant that the enzyme cleaves, that is cuts, the protein only at certain cleavage sites. One example of a digestion enzyme is trypsin, which cleaves only on the C-terminal side of the amino acids arginine and lysine, unless there is a neighbouring proline amino acid on the C-terminal side of the arginine or lysine. If that is the case no cleavage will be done by the trypsin enzyme at that particular site.
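The trypsin rule just described can be written down directly. The following is a minimal sketch with a hypothetical function name and a made-up example sequence; it assumes complete cleavage and the usual one-letter codes K (lysine), R (arginine) and P (proline).

```python
def tryptic_peptides(sequence):
    """Split a one-letter protein sequence after K or R, except before P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        last = i == len(sequence) - 1
        # cut after K/R unless the next residue is proline; always close the
        # final peptide at the C-terminal end of the protein
        if last or (residue in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


fragments = tryptic_peptides("MKWVTFRPLLKAG")
```

The arginine at position 7 is followed by a proline, so no cut is made there and the second peptide extends to the next lysine.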

There are, however, two circumstances that complicate the mimicking of the protein digestion in the laboratory. These circumstances also change the masses of the peptides, the digestion products, and have therefore to be taken into account:

1: Post-translational modifications: It may happen that a protein gets modified by more or less complicated chemical compounds that cannot be predicted by only studying the nucleotide base-pair sequence of its corresponding gene. One example of a post-translational modification is methionine oxidation, where the amino acid methionine acquires an extra oxygen atom. A peptide containing methionine therefore gets its mass shifted upwards by the mass of an oxygen atom. Other examples of post-translational modifications can be found in [4]. Post-translational modifications can be divided into variable and fixed modifications. A variable modification is such that it may or may not occur; in the example above it would mean that some of the methionine amino acids acquire an extra oxygen atom while others do not. A fixed modification, on the other hand, always occurs; in the example above it would mean that every methionine amino acid in each peptide acquires an extra oxygen atom. Post-translational modifications are not really part of the protein digestion process, but need to be included when computing the theoretical spectra. It is therefore natural to incorporate this circumstance at the digestion stage.

2: Missed cleavages: It may also happen that an enzyme does not cut at a site on the protein where it is allowed to cut. This non-perfect cleavage gives rise to (heavier) peptides that would not be produced if enzyme cleavage always occurred at every allowed cleavage site. When doing the in silico digestion, this circumstance needs to be mimicked and included in the algorithm.

Figure 6 shows schematically the occurrence of post-translational modifications and missed cleavages, and the effect these circumstances may have on a mass spectrum.
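The effect of the two circumstances on the list of peptide masses can be illustrated with a short sketch. Several details are assumptions made for the example: the function name `peptide_masses`, the `max_missed` limit on the number of joined peptides, the rounding used to merge equal masses, and the addition of one water mass per peptide for its terminal groups. The residue masses are standard mono-isotopic values.

```python
from itertools import product

# mono-isotopic residue masses in Da
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
           "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
           "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056      # ASSUMPTION: terminal H and OH are added to each peptide
OXIDATION = 15.99491  # one extra oxygen atom on methionine


def peptide_masses(peptides, max_missed=1, variable=(("M", OXIDATION),)):
    """Masses of peptides built from up to max_missed + 1 adjacent fragments.

    peptides : minimal peptides in sequence order (complete digestion)
    variable : (residue, mass shift) pairs of variable modifications; each of
               the n modifiable residues may or may not be modified, giving
               n + 1 distinct masses per modification type.
    """
    masses = set()
    for start in range(len(peptides)):
        last = min(start + max_missed + 1, len(peptides))
        for stop in range(start + 1, last + 1):
            seq = "".join(peptides[start:stop])       # missed cleavages join neighbours
            base = sum(RESIDUE[r] for r in seq) + WATER
            counts = [range(seq.count(res) + 1) for res, _ in variable]
            for combo in product(*counts):
                shift = sum(n * delta for n, (_, delta) in zip(combo, variable))
                masses.add(round(base + shift, 4))
    return sorted(masses)


masses = peptide_masses(["MK", "AG"], max_missed=1)
```

For the two minimal peptides MK and AG this gives five masses: AG, MK with and without oxidation, and the missed-cleavage product MKAG with and without oxidation.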

The general procedure of digestion in silico is the following: In a protein database the present invention will process entry j, where j = 1, ..., N_database and N_database is the number of proteins registered in the database. Entry j contains a unique identification tag T[j], and an amino acid sequence

A(j) : a_1(j) a_2(j) ... a_k(j) ... a_K(j)(j)

where a_k(j) is the k:th amino acid residue, counted from the N-terminal side of the protein j. a_k(j) can be any of the 20 amino acids found in proteins, and is represented by a one-letter code. (For a reference of their code-letters, chemical composition and mass, see [3].) K(j) is the number of amino acids in protein j.

The in silico digestion will mimic digestion done by an enzyme Y that cleaves specifically at the sites S[Y] = (S[Y]_1, S[Y]_2, ..., S[Y]_M(Y)). In the trypsin example above, Y = trypsin, M(Y) = 2, S[Y]_1 = (the C-terminal side of arginine, unless there is a neighbouring proline on the C-terminal side of the arginine) and S[Y]_2 = (the C-terminal side of lysine, unless there is a neighbouring proline on the C-terminal side of the lysine).

The set of allowed possible post-translational modifications is M = (FM_1, ..., FM_f, ..., FM_F; VM_1, ..., VM_v, ..., VM_V), where FM and VM denote fixed and variable modifications respectively. To every fixed and variable modification is assigned a mass, m(FM_f) and m(VM_v) respectively.

The in silico digestion procedure for protein T[j] is now:

I: A peptide counter is initialized to c = 1; a modification counter is set to x(c) = 1; and a mass counter corresponding to a peptide p[j, c, x(c)] is initialized to m(p[j, c, x(c)]) = 0.

II: Counters for all members of the set M are set to n(FM_f) = 0 and n(VM_v) = 0.

III: The invention reads off, from left to right, that is from the N-terminal to the C-terminal side, the protein sequence A(j).

IV: For each amino acid that is read off in the sequence, the invention checks whether

a. the current amino acid shall acquire a post-translational modification specified by M.

- No: continue to b

- Yes, and the modification is fixed, of type FM_f: n(FM_f) → n(FM_f) + 1, continue to b

- Yes, and the modification is variable, of type VM_v: n(VM_v) → n(VM_v) + 1, continue to b

b. the current position coincides with one of the cleavage sites specified by S[Y], or is the C-terminal site of the full protein sequence, that is the right-most amino acid, a_K(j)(j).

- No:

n_1: m(p[j, c, x(c)]) → m(p[j, c, x(c)]) + mass(current amino acid residue)

n_2: Read off next amino acid; that is go to step III.

- Yes(1), it is an S[Y] site:

y_11: m(p[j, c, x(c)]) → m(p[j, c, x(c)]) + mass(current amino acid residue)

y_12: Take into account fixed modifications: m(p[j, c, x(c)]) → m(p[j, c, x(c)]) + $\sum_{f=1}^{F} [n(FM_f) \cdot m(FM_f)]$

y_13: Take into account variable modifications: For every possible combination of the variable modifications that have been registered for the peptide and that gives rise to different peptide masses (there are $X(c) = \prod_{v=1}^{V} [n(VM_v) + 1]$ such combinations) m(p[j, c, x(c)]) → m(p[j, c, x(c)]) + $\sum_{v=1}^{V} [n'_v \cdot m(VM_v)]$ for each value n'_v; 0 ≤ n'_v ≤ n(VM_v). The integer x(c) runs between 1 and X(c).

y_14: Each of the X(c) values of m(p[j, c, x(c)]) is registered as a peptide mass derived from protein T[j]; n(FM_f) and n(VM_v) are set to n(FM_f) = 0 and n(VM_v) = 0 for all values of f and v; c → c + 1

y_15: Read off next amino acid; that is go to step III.

- Yes(2), it is the C-terminal site of the protein:

y_21: Go through the steps y_12 to y_14

y_22: The reading off of the protein sequence A(j) terminates

V: Take into account missed cleavages: Having performed in silico digestion at every allowed cleavage site at protein T[j], there is a set of minimal peptides p[j, c', x(c')], where 1 ≤ c' ≤ c and 1 ≤ x(c') ≤ X(c'). The invention now computes and registers the masses for all possible combinations of minimal peptides with the restriction that for each member of the combination, p[j, c', x(c')] say, there has to be at least one other member, p[j, c'', x(c'')], such that |c' - c''| = 1. There are $\sum_{k=2}^{c} \sum_{s=1}^{c-k+1} \prod_{l=s}^{s+k-1} X(l)$ such combinations. (If there are no post-translational modifications for protein T[j], the number of additional combinations due to missed cleavages reduces to c(c - 1)/2.) Each combination has its mass registered as a peptide mass derived from protein T[j].

VI: Digestion of entry T[j] terminates. There is now a list of peptide masses for this protein entry that should be matched with the list of peaks that were extracted from the experimental spectrum.

VII: Match the list of peptide masses with the list of experimental peaks, and compute a score value for protein entry T[j]. How this procedure is carried out in the present invention is described in the next section.

VIII: Next entry: j → j + 1.

When generating the peptides for all proteins in the protein database by the in silico digestion, the present invention also computes the following three distributions:

1. The frequency distribution of the number of generated peptides per parent protein

2. ρ(μ): The frequency distribution of peptide masses μ

3. ρ(μ; N): The frequency distribution of peptide masses μ for peptides whose parent proteins have given rise to N peptides

The distributions 1 and 2 are shown in figures 7 and 8. Distribution 2 is the sum of distributions 3 for all different values of N, keeping the value of μ fixed. In the present invention these three distributions are kept in memory and will be used in the spectrum matching and validation algorithms described below.

The spectrum matching - computing a score

A general definition of score value in the context of the present invention

The general approach of protein identification using spectrum peak matching is in the present invention to consider a good match as something unlikely to occur. Therefore, the more improbable a peak match is, the better it is, and should hence contribute to a higher score. Now, given

• a peak list resulting from the peak extraction and peak filtering steps above

• a protein from the database that has been digested in silico, as described above, and hence is represented by a set of peptide masses; that is, by a theoretical spectrum (where each peak has the same intensity as all other peaks)

• a set of peak matches between the peak list and the theoretical spectrum

the present invention computes the probability for the set of peak matches to occur. The score is then taken as the negative logarithm of that probability, so that a high score value reflects an unlikely event, and hence a high degree of spectrum resemblance; that is, a good match. There are, of course, different ways to do this, basically meaning different ways to define the set of peak matches and different ways of computing probabilities. The aim in the present invention is, quite naturally, to reward many matches, and at the same time take into account the propensity for large database proteins to have many matches.

Let x ≡ (x_1, x_2, ..., x_l, ..., x_L) be the masses of the peaks in the peak list, and w = (w_1, w_2, ..., w_l, ..., w_L) the corresponding peak widths. Let z(j) ≡ (z_1(j), z_2(j), ..., z_i(j), ..., z_N(j)(j)); j = 1, ..., N_database, where N_database is the number of proteins in the protein database. z_i(j) is the mass of the i:th peak in the theoretical spectrum of protein T[j], and N(j) is the number of peptides that resulted from in silico digestion of protein T[j].

A match between a peak x_l in the peak list and a peptide mass z_i(j) in the theoretical spectrum has, in the present invention, occurred if

$$|x_l - z_i(j)| \le \delta_l \tag{6}$$

where δ_l is a certain (user-defined) value, the match tolerance window for a match at peak x_l. In a general, parameter-free, approach it would be natural to choose δ_l = w_l, the width of the peak x_l. Let the vector δ = (δ_1, ..., δ_l, ..., δ_L) represent the set of tolerance windows pertaining to the peak list x. For each database protein T[j] there is therefore an L × N(j) matrix

$$M[j] \equiv M\{z(j); (x, w); \delta\} \tag{7}$$

such that

For each database protein T[j] the present invention calculates the probability for the set of matches between the peak list x and the theoretical spectrum z(j) as described by M[j]. The score value for T[j], σ_j (not to be confused with the signal-to-noise ratio discussed in the peak extraction section above), is then taken as the negative natural logarithm of that probability:

$$\sigma_j \equiv \sigma[z(j); (x, w); \delta] = -\ln\left(\mathrm{Prob}(M[j])\right) \tag{8}$$
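The match criterion and the match matrix can be sketched as follows. The function names are hypothetical, and the convention that a matrix element is 1 for a match and 0 otherwise is an assumption made for the illustration.

```python
def match_matrix(x, delta, z):
    """L x N(j) matrix of matches between peaks x and peptide masses z.

    ASSUMED convention: element [l][i] is 1 when |x_l - z_i| <= delta_l
    and 0 otherwise.
    """
    return [[1 if abs(xl - zi) <= dl else 0 for zi in z]
            for xl, dl in zip(x, delta)]


def matched_peak_count(matrix):
    """Number of experimental peaks with at least one matching peptide."""
    return sum(1 for row in matrix if any(row))


x = [1000.0, 1500.0, 2000.0]                 # peak list
delta = [0.1, 0.1, 0.2]                      # tolerance window per peak
z = [999.95, 1500.3, 2000.15, 2500.0]        # theoretical spectrum
M = match_matrix(x, delta, z)
r = matched_peak_count(M)
```

In this example the first and third peaks find a matching peptide mass within their tolerance windows, while the second peak does not.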

Eq. 8 is taken as the general definition of a score value in the context of the present invention. There are now different ways to specialize that general expression. Here two examples of such specializations will be given.

Specialized score expression, case 1: constant peptide mass distribution

Assume now that there is a match between x_l and z_i(j) as described by eq. 6. The probability for that match is taken to be

$$p_l = \int_{x_l - \delta_l}^{x_l + \delta_l} \rho(\mu)\, d\mu \tag{9}$$

where ρ(μ) is the mass frequency distribution for peptides in the digested database, described in the database digestion section above. Continuing, the probability for a miss, q_l, is then

$$q_l = 1 - p_l \tag{10}$$

For every peak x_l we use all N(j) peptides from z(j) to look for a match. The probability for no match by z(j) at x_l, Q_l, is therefore taken to be

$$Q_l = q_l^{N(j)} \tag{11}$$

The probability for at least one match by z(j) at x_l, P_l, is then

$$P_l = 1 - Q_l \tag{12}$$

We proceed to define the polynomial

$$G_1[z(j); (x, w); \delta] \equiv G_1(\zeta) = \sum_{r=0}^{L} g_1[r] \cdot \zeta^r \equiv \prod_{l=1}^{L} (P_l \cdot \zeta + Q_l) \tag{13}$$

Now, given (x, w), z(j), and the match tolerance δ = (δ_1, ..., δ_L), the probability to have r matches out of L is g_1[r], and the corresponding score is therefore

$$\sigma_j = -\ln(g_1[r]) \tag{14}$$

In the present case two simplifications are now made:

1. ρ(μ) = ρ_const = 1/Δ, where Δ is the entire mass range for the database search, as specified by the user in the filtering step, described above

2. δ_l = δ for all l

These give p_l = 2δ/Δ. This means that P_l and Q_l also become independent of l; P_l = P = 1 - (1 - 2δ/Δ)^N(j) and Q_l = Q = (1 - 2δ/Δ)^N(j) for all l. The polynomial G_1(ζ) can now be written as

$$G_1(\zeta) = (P \cdot \zeta + Q)^L = \sum_{r=0}^{L} \zeta^r \cdot \frac{L!}{r!\,(L-r)!} \cdot P^r \cdot Q^{L-r} \tag{15}$$

This simplification means that

$$g_1[r] = \frac{L!}{r!\,(L-r)!} \cdot P^r \cdot Q^{L-r} \tag{16}$$

and the score for database protein T[j] is now

$$\sigma_j = -r \cdot \ln(P) - (L - r) \cdot \ln(Q) - \ln\left(\frac{L!}{r!\,(L-r)!}\right) \tag{17}$$
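The score of eq. 17 is straightforward to evaluate. The sketch below uses a hypothetical function name and made-up parameter values, and assumes 0 < 2δ/Δ < 1 so that both logarithms are defined.

```python
import math


def score_case1(r, L, n_peptides, delta, mass_range):
    """Score of eq. 17 for r matched peaks out of L.

    n_peptides : N(j), number of peptides of the database protein
    delta      : common match tolerance in Da
    mass_range : Delta, the mass range of the database search in Da
    """
    p = 2.0 * delta / mass_range          # one peptide falls in one peak window
    Q = (1.0 - p) ** n_peptides           # no peptide matches a given peak
    P = 1.0 - Q                           # at least one peptide matches it
    log_binom = math.lgamma(L + 1) - math.lgamma(r + 1) - math.lgamma(L - r + 1)
    return -r * math.log(P) - (L - r) * math.log(Q) - log_binom


small = score_case1(5, 20, 50, 0.1, 2000.0)    # 5 of 20 peaks, 50 peptides
large = score_case1(5, 20, 500, 0.1, 2000.0)   # same matches, 500 peptides
```

With the same number of matched peaks, the protein with 500 peptides receives a lower score than the protein with 50 peptides, which illustrates how the expression accounts for the propensity of large database proteins to have many matches.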

Specialized score expression, case 2: non-constant peptide mass distribution

Assume, once again, that there is a match between x_l and z_i(j) as described by eq. 6. The probability for that match is taken to be

$$p_l[N(j)] = \int_{x_l - \delta_l}^{x_l + \delta_l} \rho(\mu; N(j))\, d\mu \tag{18}$$

where ρ(μ; N(j)) is the mass frequency distribution for peptides whose parent proteins have given rise to N(j) peptides. In this approach, where it is desired to assume as little as possible about parameter values, δ_l is set to δ_l = w_l, the width of the peak x_l. Define

λ is taken to be the probability that a peptide whose parent protein has given rise to N(j) peptides will find a match with one of the L peaks in x. κ is then the probability for no match. Define the polynomial

$$G_2[z(j); (x, w); \delta] \equiv G_2(\zeta) \equiv (\lambda \cdot \zeta + \kappa)^{N(j)} = \sum_{n=0}^{N(j)} \zeta^n \cdot g_2[n; N(j)] \tag{20}$$

where

$$g_2[n; N(j)] = \frac{N(j)!}{n!\,(N(j)-n)!} \cdot \lambda^n \cdot \kappa^{N(j)-n} \tag{21}$$

is the probability that n of the N(j) peptides find matches with the peaks of x. The score value for protein T[j] in this second case is then

$$\sigma_j = -\ln(g_2[n; N(j)]) \tag{22}$$

The validation - assessing the score

It is important to realize that when using a scoring method for spectrum matching, the nature of the method is such that there will always be a candidate from the protein database with the highest score value. This will be the case irrespective of whether the unknown protein is registered in the database as an entry or not. One therefore concludes that it is normally not enough to calculate score values for each protein in the database; those values preferably need to be assessed too. The present invention attacks this problem in the following way:

Let z ≡ (z_1, z_2, ..., z_N) be the set of peptide masses that results when digesting a database protein, and let φ(z) be the probability density for the vector z. Now, φ(z) is, in principle, an unknown function, but we make an approximation such that φ(z) is taken to be the joint distribution of the frequency distributions for the number of digested peptides per parent protein and the peptide masses. These two distributions are readily calculated when performing the theoretical digestion of database proteins, as was discussed in the database digestion section above, and two examples are shown in figures 7 and 8. Now, given the peptide peak list x, the widths of the peaks w, the peak matching tolerance δ, and φ(z), the invention computes the probability density for a score value σ = σ_h:

$$\Pi(\sigma_h; (x, w); \delta) = \int dz \cdot \phi(z) \cdot \delta_D(\sigma_h - \sigma[z; (x, w); \delta]) \tag{23}$$

where δ_D is the Dirac delta function and the integral is computed over the space of vectors z. The vectors z are sampled with the distribution φ(z). The score σ[z; (x, w); δ] is calculated using the expression eq. 8, or one of the expressions for the specialized cases. The probability to reach, at random, a score value at least as large as σ_c is then calculated. This probability is written as

$$P_{random} \equiv P_{random}(\sigma_c) = 1 - \int dz \cdot \phi(z) \cdot \theta(\sigma_c - \sigma[z; (x, w); \delta]) \equiv 1 - \langle \theta(\sigma_c - \sigma_{random}) \rangle \tag{24}$$

where θ is the Heaviside step function (θ(y) = 1 if y > 0, θ(y) = 0 otherwise). For comparison, the corresponding probability expression P_database = 1 - ⟨θ(σ_c - σ_database)⟩ is calculated when doing the actual database search. By taking the natural logarithm, the invention therefore produces two functions

$$F_{random} \equiv F_{random}(\sigma_c) = \ln(P_{random}) = \ln(1 - \langle \theta(\sigma_c - \sigma_{random}) \rangle) \tag{25}$$

and

$$F_{database} \equiv F_{database}(\sigma_c) = \ln(P_{database}) = \ln(1 - \langle \theta(\sigma_c - \sigma_{database}) \rangle) \tag{26}$$

The computational flow

It is important to understand how the score values σ and the functions F_database and F_random are calculated for a given specific unknown raw spectrum (represented, after peak extraction, by the peak list x). Below is a step-wise description of the process.

F_database:

1. Digest a protein from the database.

2. Apply the score equation eq. 8 to compute a score σ_database.

3. Calculate θ(σ_c - σ_database) (= 1 for all σ_c > σ_database, and 0 for all other values of σ_c.)

4. Repeat steps 1-3 for every protein in the database.

5. Calculate the average of θ(σ_c - σ_database) for every value of σ_c.

6. Use eq. 26 to calculate F_database.

F_random:

1. Use the frequency distributions for the number of digested peptides per parent protein and the peptide masses to create a random spectrum z. Examples of such distributions are shown in figures 7 and 8.

2. Apply the score equation eq. 8 to compute a score σ_random.

3. Calculate θ(σ_c - σ_random) (= 1 for all σ_c > σ_random, and 0 for all other values of σ_c.)

4. Repeat steps 1-3 many times. The number of times is chosen by the user, as described in the section on The Graphical User Interface below.

5. Calculate the average of θ(σ_c - σ_random) for every value of σ_c.

6. Use eq. 25 to calculate F_random.
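The two step lists above can be sketched in code. The function names are hypothetical, and drawing the number of peptides and the individual peptide masses independently from the two empirical distributions is an assumed way of sampling random spectra. The same `log_tail` function serves for both the database scores and the random scores.

```python
import math
import random


def log_tail(scores, cutoffs):
    """ln(1 - <theta(sigma_c - sigma)>) for each cutoff sigma_c.

    Returns None where no score reaches the cutoff (the logarithm of zero
    is undefined).
    """
    n = len(scores)
    out = []
    for c in cutoffs:
        below = sum(1 for s in scores if c > s)   # theta = 1 iff sigma_c > sigma
        tail = 1.0 - below / n                    # fraction with sigma >= sigma_c
        out.append(math.log(tail) if tail > 0 else None)
    return out


def random_scores(score, peptide_counts, peptide_masses, tries, rng):
    """Score `tries` random spectra drawn from the two empirical distributions.

    score          : function mapping a random spectrum z to a score value
    peptide_counts : observed numbers of peptides per parent protein
    peptide_masses : observed peptide masses
    ASSUMPTION: counts and masses are sampled independently.
    """
    results = []
    for _ in range(tries):
        n = rng.choice(peptide_counts)
        z = [rng.choice(peptide_masses) for _ in range(n)]
        results.append(score(z))
    return results


f_values = log_tail([1.0, 2.0, 3.0, 4.0], [0.0, 2.5, 10.0])
demo = random_scores(len, [3], [1000.0], tries=5, rng=random.Random(0))
```

In the small example half of the scores reach the cutoff 2.5, so the second value is ln(0.5), and no score reaches the cutoff 10.0.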

The interpretation and application of F_database and F_random

The two functions are shown for two real situations in figures 9 and 10. In figure 9 one notices that the top-scoring candidate from the actual database search has a score value much higher than what can be expected from just random coincidence, hence leading one to believe that the top candidate is truly the unknown protein. In figure 10, however, the top-scoring candidates have score values that are not larger than what one could expect from a random database search. This indicates that one should not consider the top-scoring candidate in this case to be a true candidate for the unknown protein. Further experimental measures should therefore be taken in order to get a satisfactory protein identification.

In more detail, put F_random = F_random(σ_c); then the random probability to get a score value at least as high as σ_c, on one try, is P_random = exp(F_random). The probability to get a score value less than σ_c is then Q_random = 1 - P_random. The methods of the present invention allow calculation of random probabilities when making N > 1 tries. Consider the polynomial H(ζ):

$$H(\zeta) \equiv (P_{random} \cdot \zeta + Q_{random})^N = \sum_{n=0}^{N} \zeta^n \cdot h(n; N) \tag{27}$$

where

$$h(n; N) = \frac{N!}{n!\,(N-n)!} \cdot P_{random}^{\,n} \cdot Q_{random}^{\,N-n} \tag{28}$$

is the probability for getting a score value of at least σ_c n times out of N. Differentiation with respect to ζ gives

$$H'(\zeta) = N \cdot P_{random} \cdot (P_{random} \cdot \zeta + Q_{random})^{N-1} = \sum_{n=0}^{N} n \cdot \zeta^{n-1} \cdot h(n; N) \tag{29}$$

Putting ζ = 1 in the differentiated expression leads to

$$N \cdot P_{random} = \sum_{n=0}^{N} n \cdot h(n; N) \tag{30}$$

The right hand side is nothing but the expectation value for the number of times of getting a score value above σ_c on N tries; that is ⟨n⟩. This implies

$$N \cdot P_{random} = \langle n \rangle \tag{31}$$

So in order to expect a score value above σ_c in ⟨n⟩ = n_random(σ_c) cases out of N tries, given the random probability P_random, the following condition has to be satisfied:

$$N \cdot P_{random} = n_{random}(\sigma_c) \tag{32}$$

The size of a random database therefore needs to be of the order

$$N = N_{random}(\sigma_c) \equiv \frac{n_{random}(\sigma_c)}{P_{random}(\sigma_c)} \tag{33}$$

in order to contain n_random(σ_c) random proteins that reach a score value at least σ_c. Now, the corresponding frequency derived from the real database search is

$$P_{database}(\sigma_c) = \frac{n_{database}(\sigma_c)}{N_{database}} \tag{34}$$

where n_database(σ_c) is the number of real database proteins that reached a score value at least σ_c, and N_database is the size of the real protein database. Hence, the ratio between the size of a random database containing n_random(σ_c) random proteins with score values of at least σ_c, N_random(σ_c), and the size of the real database, N_database, is

$$\frac{N_{random}(\sigma_c)}{N_{database}} = \frac{n_{random}(\sigma_c)}{n_{database}(\sigma_c)} \cdot \frac{P_{database}}{P_{random}} = \frac{n_{random}(\sigma_c)}{n_{database}(\sigma_c)} \cdot \exp(F_{database} - F_{random}) \tag{35}$$

So, consider a run of the present invention that ends up with n_database(σ_top) = n_top top candidates. The size of the random database that one would expect to contain n_random(σ_top) = n_top random proteins is therefore

$$N_{random}(\sigma_{top}) = N_{database} \cdot \exp(F_{database} - F_{random}) = N_{database} \cdot \exp(Q) \tag{36}$$

where Q ≡ F_database - F_random, evaluated at σ_top. Q is in the present invention called the quality measure.

Now, as an example, if it is demanded that the random database has to have a size 1000 times larger than the real database for considering a protein candidate with score value σ_top to be statistically significant, the quality measure of that top-scoring protein needs to be at least Q = ln(1000) ≈ 7. If, on the other hand, only a factor of 20 between the random database and the real database is demanded, the quality measure of the top-scoring protein needs to be only Q = ln(20) ≈ 3.
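The relation between a demanded size ratio and the quality measure can be illustrated with a few lines of Python. This is a minimal sketch, assuming only that Q is the natural logarithm of the ratio N_random / N_database as in eq. 36; the function names are hypothetical.

```python
import math


def required_quality(size_ratio: float) -> float:
    """Quality measure Q needed so that N_random / N_database equals size_ratio."""
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return math.log(size_ratio)


def size_ratio_from_quality(q: float) -> float:
    """Ratio N_random / N_database implied by a quality measure Q."""
    return math.exp(q)


# The two ratios used as examples in the text.
for ratio in (20, 1000):
    print(f"ratio {ratio}: Q = {required_quality(ratio):.2f}")
```

The two functions are inverses of each other, so a reported Q can be converted back into the size of the random database that would be needed to expect the same score by chance.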

Closely related to the quality measure is the p-value. p = p(σ_c) is defined as the random probability of getting at least one protein with a score value of at least σ_c given the size of the real database, N_database. This implies

$$p = 1 - (1 - P_{random})^{N_{database}} \qquad (37)$$

If P_random ≪ 1, which it should be for any high-scoring database protein, the following approximation can be made:

$$p = 1 - (1 - P_{random})^{N_{database}} \approx 1 - (1 - N_{database} \cdot P_{random}) = N_{database} \cdot P_{random} \qquad (38)$$

Using eqs. 33 - 38 leads to

$$p \approx N_{database} \cdot P_{random} = \frac{n_{random}(\sigma_c) \cdot N_{database}}{N_{random}(\sigma_c)} \qquad (39)$$

If, after a run of the present invention, the number of real database proteins that get a score value of at least σ_c is only one, there is a simple relationship between the quality measure and the p-value:

p = exp(-Q) (40)

If, however, the relation P_random ≪ 1 or the relation n_database(σ_c) = 1 does not hold, the expression for the relation between the p-value and the quality measure Q becomes a bit more complicated, but the tight connection between these two parameters and their relation to the sizes of real and random databases is, of course, still there.
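How well the small-probability approximation of the p-value holds can be seen by comparing it with the exact expression. The following Python sketch assumes the definition of p as the probability of at least one random hit in N_database independent tries; the function names and the numerical values are illustrative assumptions.

```python
import math


def p_value_exact(p_random: float, database_size: int) -> float:
    """p = 1 - (1 - P_random)^N_database, evaluated stably for tiny P_random."""
    if p_random >= 1.0:
        return 1.0
    # log1p and expm1 avoid the loss of precision of (1 - tiny) ** large.
    return -math.expm1(database_size * math.log1p(-p_random))


def p_value_approx(p_random: float, database_size: int) -> float:
    """First-order approximation p = N_database * P_random (valid for small p)."""
    return database_size * p_random


# Illustrative values only (assumptions, not taken from the text).
for p_random in (1e-8, 1e-6, 1e-4):
    exact = p_value_exact(p_random, 100_000)
    approx = p_value_approx(p_random, 100_000)
    print(f"P_random={p_random:g}: exact={exact:.6g}, approx={approx:.6g}")
```

For the larger values of P_random the approximation exceeds the exact value noticeably, which is the situation in which the simple relation between the p-value and the quality measure no longer holds.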

The parameters score value, quality measure and p-value are in the present invention calculated and reported for every database protein.

The Main Application

The set of algorithms described above constitutes, when written as computer code, the Main Application. The Main Application, with its algorithmic flow shown in figure 2, solves all the tasks regarding protein identification that the present invention is claimed to solve. However, in addition to the algorithm-related computer code, the present invention also contains a Graphical User Interface (GUI). The interface, which inter alia makes the present invention more user-friendly, and a user more efficient, is described in the next section.

The Graphical User Interface

General properties of the Graphical User Interface

The purpose of the Graphical User Interface (GUI) is to provide all functions and parameters relevant for utilizing the present invention. In addition, partial results from the different steps of the analysis, described above in the section "The Algorithms", should, together with the final result, be presented in a graphical and clear way on the computer screen, and registered in appropriate files. A list of the most important features is the following:

1. All results provided by the algorithms can be accessed by the GUI.

2. The GUI is designed so that it is platform independent. By this is meant that the computer code is written such that the GUI can run on any computer irrespective of the computer's operating system.

3. The invention is designed so that the Main Application can run independently of the GUI.

4. The invention can, through the GUI, be run in a stepwise manner. This means that each algorithmic step, described above in the section "The Algorithms", can be executed such that the following step in the algorithmic flow will not be executed before the user chooses to do so.

5. All steps can be run in one go.

6. Several experimental spectra (representing unknown proteins) can be run consecutively without user intervention, so-called batch jobs.

The different parts of the Graphical User Interface workspace

In figure 11 is shown the GUI workspace as it may appear on a computer screen, before a run of the invention. As illustrated, the workspace is divided into

• The menu bar

• The icon bar

• The status bar

• The analysis boxes

described in detail below. The first two and the fourth workspace areas are interactive with the user. By this is meant that when a user clicks on one of the items in those areas, the user can either select input for the algorithms or start a process.

The menu bar

The menu bar, at the top of the workspace, see figure 11, consists of five menus with the following features:

File: The File menu, see figure 12, contains commands for opening, closing, and saving the workspace setup:

• New: Clears the current workspace setup and restores all parameter values to their default values.

• Open: Opens a previously saved workspace.

• Save: Saves the current workspace.

• Save as: Works just like "Save" with the exception that the user is always asked for a file name.

• Exit: Quits the session.

Edit: The Edit menu contains one item, Boxes, that gives the user access to all the option menus for the analysis boxes in the workspace. It is shown in figure 13.

View: The View menu is used to specify the items a user wants to see on the desktop. It has only one option: Hide icon toolbar, which specifies whether the icons at the top of the workspace should be shown or not.

Preferences: The Preferences menu is divided into

• Get default dir: Resets the working directory to the directory used at startup.

• Save default dir: Saves the current data directory as default directory.

• Server: Here a user specifies on what server the binary files of the present invention are located.

• Directories: Here a user specifies the directory where spectrum data files are located.

Actions: The Actions menu, see figure 14, has the following items:

• Run step: Runs only the high-lighted analysis step. By "high-lighted" is meant that the title bar of the corresponding analysis box is coloured.

• Run batch: Runs the entire analysis on every spectrum selected by the user.

• Run spectrum: Runs the entire analysis on a single spectrum selected by the user.

• Halt process: Halts a batch run.

The icon bar

The icon bar, see figure 11, consists of a set of often-used icons that correspond to features and functions controlled in the menu bar, described above.

The status bar

The status bar is located at the bottom of the workspace, see figure 11. It contains updated information about what is being currently processed by the present invention.

The analysis boxes

A short description of the analysis boxes is the following table:

Part (box) name | Role/function/algorithm | Input from | Output to

MSFiles | Selects those files containing the raw spectra to be analysed. | the user | Pepex/the screen

Pepex | Extracts mono-isotopic peaks from the raw spectrum. | MSFiles/the user | Pepfil/the screen/file

Pepfil | Filters peaks from the peak list created by Pepex. Possibility for recalibration of a spectrum. | Pepex/the user | Matcher/the screen/file

Matcher | Digests the proteins in the protein database according to biological and chemical rules specified by the user. Matches peaks found in the raw spectrum with peaks digested in silico from the protein database. Calculates score values and assesses the result statistically. | Pepfil/the user | the screen/file

A more detailed description is given below.

The MSFiles box

When clicking on the box MSFiles, or by clicking in the MSFiles field in the Edit menu, a window appears, see figure 15. There the user can indicate the file (containing data for a raw experimental spectrum) or set of files to be analysed. By clicking on "Add Files" a browser will appear and a user can choose to run one spectrum or many spectra for a batch job. Note that a user, by clicking on the "Masslist" button, can also send in a list of already extracted peak mass values. Given that choice the present invention will skip the peak extraction step, and move directly to the peak filtering step.

The Pepex box

When clicking on the box Pepex another dialogue window appears. There a user can choose which signal-to-noise cut-off value (σ in the section "The peak extraction") to use for selection of peaks.

The Pepfil box

When clicking on the box Pepfil a dialogue window appears, an example of which is shown in figure 16. The left side of the window contains a list of the filters that are supported in the present invention. The right side contains the filters that are currently active. By writing in the Parameter fields, the user can choose values that correspond to the active filters. The filters that are supported, which were described above in the section "The Algorithms", are the following:

1. echoes to consider

2. intensity cut

3. lower & upper mass

4. maldi

5. peaks to exclude

6. peaks to keep

7. width cut
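As an illustration of how a set of active filters may act on a peak list, the following minimal Python sketch applies four of the listed filters in sequence. The exact semantics of each filter are defined in the section "The Algorithms"; the behaviour shown here (thresholds on intensity, mass range and width, and exclusion of listed masses within a tolerance), as well as all names and default values, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float
    intensity: float
    width: float


def apply_filters(peaks, *, min_intensity=None, mass_range=None,
                  max_width=None, exclude=(), tolerance=0.1):
    """Return the peaks that pass every active filter (None means inactive)."""
    kept = []
    for peak in peaks:
        # "intensity cut" (assumed: drop peaks below a minimum intensity)
        if min_intensity is not None and peak.intensity < min_intensity:
            continue
        # "lower & upper mass" (assumed: keep peaks inside the given m/z range)
        if mass_range is not None and not (mass_range[0] <= peak.mz <= mass_range[1]):
            continue
        # "width cut" (assumed: drop peaks wider than a maximum width)
        if max_width is not None and peak.width > max_width:
            continue
        # "peaks to exclude" (assumed: drop peaks close to a listed mass)
        if any(abs(peak.mz - mass) <= tolerance for mass in exclude):
            continue
        kept.append(peak)
    return kept
```

A call such as `apply_filters(peaks, min_intensity=50.0, exclude=(842.51,))` would then correspond to a setup in which only the "intensity cut" and "peaks to exclude" filters are active.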

In addition to this there is a set of buttons that controls which filters to be used:

• By marking a filter on the left side and then clicking on the > button, the user activates that particular filter; confirmed by it now appearing on the right side.

• Deactivating a filter is done by marking a filter on the right side, followed by clicking on the < button.

• Clicking on the trash bin button deactivates all the filters on the right side.

• Clicking on the folder button enables the user to activate a user-defined set of filters. This is done either by writing the name of the filter file in a designated field, or by using a browser for navigating in the file system. Note that a user-defined set of filters can only contain the filters in the list.

• The floppy disk button enables the user to save the current filter setup to a file.

There is also a calibration function that, if a user wishes to do so, recalibrates the raw spectrum. The recalibration is based on the set of peaks that the user selects. Such peaks often have well-established mass values, for example peptide fragments from the digestive enzyme that has been used.

The Matcher box

When clicking on the box Matcher a dialogue window appears, see figure 17. This window contains a set of option fields.

On the upper half the user chooses which species the proteins in the database search should belong to. Multiple choices of species can be made. Activating and deactivating works in the same way as in the Pepfil dialogue window.

In the middle left part of the dialogue window (following figure 17) the following choices are made:

• protein database

• digestion enzyme

• number of allowed missed cleavages
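To illustrate how the choice of digestion enzyme and the number of allowed missed cleavages interact, the following minimal Python sketch digests a protein sequence in silico. The trypsin-like rule used here (cut after K or R unless the next residue is P) and the function names are assumptions made for illustration; the enzyme rules actually offered in the dialogue window are not specified in this passage.

```python
def cleavage_sites(sequence: str) -> list[int]:
    """Positions after which an assumed trypsin-like enzyme cuts the sequence."""
    return [i + 1 for i, residue in enumerate(sequence[:-1])
            if residue in "KR" and sequence[i + 1] != "P"]


def digest(sequence: str, missed_cleavages: int = 0) -> list[str]:
    """All peptides obtainable with at most `missed_cleavages` skipped sites."""
    if not sequence:
        return []
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        # A peptide may span up to missed_cleavages + 1 minimal fragments.
        last = min(start + 1 + missed_cleavages, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


print(digest("AKRPGRLM", missed_cleavages=0))
print(digest("AKRPGRLM", missed_cleavages=1))
```

Allowing more missed cleavages enlarges the theoretical peptide list for every database protein, which increases both the chance of true matches and the chance of random ones.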

In the middle right part of the dialogue window (following figure 17) the user can specify expected post-translational modifications. There is a pre-defined set of those, but an editing function allows a user to create his/her own modifications, see figure 18.

In the lower part of the Matcher dialogue window there is a set of additional settings:

• Report, top: This is the number of the top-scoring proteins to be reported and presented in the result list.

• Validation database size: This is the number of random proteins that is used when calculating the function F_random defined above in the section "The Algorithms".

• Mass tolerance (Da) & Mass tolerance (ppm): Two peaks are considered to match if their mutual distance is less than the chosen mass tolerance value. This tolerance value can be measured either in absolute m/z value, expressed in Dalton (Da), or relative to the peak mass, expressed in parts per million (ppm).

• Select masslist option: M+H or M_r. Is set to "M+H" if the peaks in the list that has been passed through the filter step have not had the mass of an H-atom subtracted. Otherwise this option should be set to M_r.
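The two tolerance units and the masslist option can be sketched as follows. This is a minimal Python illustration and not the implementation of the Matcher; the function names are hypothetical, the ppm tolerance is assumed to be taken relative to the theoretical peak mass, and the hydrogen mass constant is an assumed value that follows the wording "the mass of an H-atom".

```python
HYDROGEN_MASS = 1.007825  # Da, monoisotopic mass of a hydrogen atom (assumed value)


def within_tolerance(observed, theoretical, *, tol_da=None, tol_ppm=None):
    """True if two peak masses match under an absolute (Da) or relative (ppm) tolerance."""
    if (tol_da is None) == (tol_ppm is None):
        raise ValueError("give exactly one of tol_da or tol_ppm")
    distance = abs(observed - theoretical)
    if tol_da is not None:
        return distance < tol_da
    # ppm tolerance, assumed here to be relative to the theoretical mass
    return distance < theoretical * tol_ppm * 1e-6


def to_neutral_mass(mass, masslist_option):
    """Convert a listed peak mass to M_r according to the masslist option."""
    if masslist_option == "M+H":
        return mass - HYDROGEN_MASS
    if masslist_option == "M_r":
        return mass
    raise ValueError(f"unknown masslist option: {masslist_option!r}")


print(within_tolerance(1000.02, 1000.0, tol_ppm=50))   # 0.02 Da against a 0.05 Da window
print(within_tolerance(1000.02, 1000.0, tol_ppm=10))   # 0.02 Da against a 0.01 Da window
```

An absolute tolerance gives the same window at every mass, whereas a ppm tolerance widens with the mass, which is why the two settings are offered as alternatives.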

The output of results through the Graphical User Interface

The GUI output from a run of the present invention can be divided into

• Spectrum graphs

• A Results window that contains

— a list of top-scoring protein candidates

— a list of extracted peaks

• Detailed information about each top-scoring protein candidate

Spectrum graphs

In figure 19 is shown the GUI workspace after a run of the present invention. On the right side three spectrum graphs are presented. These graphs are related to, from top to bottom, the experimental raw spectrum, the extracted peptide peaks, and the peptide peaks left after filtering. Here a user can visually study the extracted peaks and relate them to the experimental spectrum. A zooming function enables detailed study over the whole m/z-range.

The Results window

In the Results window a user gets information about the top-scoring protein candidates as well as about the peaks that were used in the database search.

Protein candidates

In figure 20 is illustrated a list of the top-scoring protein candidates. The proteins are listed in descending order of spectrum resemblance in comparison with the filtered peak list extracted from the experimental spectrum. The list contains a number of rows (as many as chosen by the user in the Matcher box) where each row holds search results for one database protein. In the present illustration there are five columns for each row:

quality flag | protein id | score value | p-value | quality measure

• quality flag: The flag is a small rectangle whose colour is dependent on the quality measure of the fifth column and chosen cut-off values. In the present illustration it is implemented such that a quality value above 7 gives a green flag, a quality value between 3 and 7 gives a yellow flag, and below 3 gives a red flag. The statistical significance of the quality measure was discussed in the "Algorithms" section above.

• protein id: In this column is reported the name and database identification tag of the database protein. Denoted "Protein id" in the present illustration.

• score value: Described above in the "Algorithms" section. It is denoted "Score" in the present illustration.

• p-value: Described above in the "Algorithms" section. It is denoted "Probability" in the present illustration.

• quality measure: The parameter Q described in the "Algorithms" section. It is denoted "Qlty" in the present illustration.
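The colour coding of the quality flag can be expressed as a small function. This is a minimal Python sketch; the function name is hypothetical, and the treatment of values exactly equal to 3 or 7 (here counted as yellow) is an assumption, since the text only speaks of values above 7, between 3 and 7, and below 3.

```python
def quality_flag(q: float, green_above: float = 7.0, yellow_from: float = 3.0) -> str:
    """Map a quality measure Q to the flag colour used in the results list."""
    if q > green_above:
        return "green"
    if q >= yellow_from:   # boundary values are assumed to be yellow
        return "yellow"
    return "red"


for q in (12.1, 3.89, 2.0):
    print(q, quality_flag(q))
```

With the default cut-off values, green corresponds to a random database roughly 1000 times larger than the real one and yellow to a factor of at least about 20.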

By clicking on a protein id, a window will appear. This window contains detailed information about that particular protein, and is described below.

Peaks

In figure 21 is illustrated a list of the peaks that were used in the database search. In the present illustration the upper field contains only the m/z values for the peaks. By clicking on a "Copy" button a user can copy and then paste the set of m/z values into any desired application. There is also a list of rows where more detailed information about every selected peak is given. In the present illustration there are six columns for each row; that is, for each peak V that is used in the database search:

m/z | intensity | width | signal-to-noise ratio | quality | delete option

• m/z: The mass-to-charge ratio of the peak. It is calculated as x_c(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Mass" in the present illustration.

• intensity: The absolute intensity value of the peak. It is calculated as y(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Intensity" in the present illustration.

• width: The width of the peak. It is calculated as w(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "Width" in the present illustration.

• signal-to-noise ratio: The signal-to-noise ratio of the peak. It is calculated as σ(V), described in the passage "Computing peak properties" of the "Algorithms" section. It is denoted by "S/N" in the present illustration.

• quality: A parameter whose value depends on the filters chosen by a user. In the present illustration (where it is not activated) it is denoted by "Quality".

• delete option: In this column, denoted "Deleted" in the present illustration, a user can choose to delete peaks that he/she does not want to be used in the database search. (A user can also add peaks, if so desired, by using the "Add" feature in this window.) By running "Run step" the database search will be redone with the new set of peaks.

Detailed protein information

As mentioned above a window with detailed information about a protein will appear when a user clicks in the protein id column of that particular protein. In figure 22 is shown an example of such a window, where a user, among other things, can find information about

• number of experimental peaks that found a match

• number of database peptides that found a match

• the corresponding sequence coverage

• the amino acid sequence of matched database peptides and their post-translational modifications

• statistics on peak-matching errors

The Computer Implementation

The actual implementation of the present invention can be summarized as follows:

• Computer code: Object-oriented design for all parts of the code.

• Hardware requirements: PCs, that is, personal computers, or workstations.

• Platform types:

1. A client-server solution in which the binary files executing the algorithms are run on a server and the GUI is run on one or many clients.

2. A stand-alone version, requiring only a single computer.

• The protein database: Should be formatted in a so-called FASTA format and be stored on the server if a client-server solution is the user's choice. If a stand-alone version is used, the formatted protein database is stored on the computer at hand.

• Internal communication: http requests through which xml files are sent.
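Since the protein database is expected in FASTA format, reading it can be sketched in a few lines of Python. This is a minimal illustration under the usual FASTA convention (a header line starting with ">" followed by one or more sequence lines); the function name is hypothetical and the sketch is not the reader used by the Main Application.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [">P1 example protein", "MKTAYIAK", "QRQISFVK", ">P2", "GGSR"]
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Because the function accepts any iterable of lines, it can be fed an open file object on the server or on a stand-alone computer alike.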

Other Methods

We will in this section briefly describe two other methods that use mass spectrometry data for protein identification. One should note that none of these methods contain the peak extraction and peak filtering steps. This implies that they address only the database search. The same notation as in the section "The Algorithms" will be used.

Method 1

This approach is described in [5]; a method that has been implemented into a commercial software product, described in [6]. As has been described above, the algorithms of the present invention consist of five steps: peak extraction, peak filtering, digestion of the protein database, spectrum matching, and validation. The method of [6] includes the three latter steps. Here we will only discuss the spectrum matching part since database digestion is straightforward and non-controversial, and since the validation step of the method of [6] has, at least to our knowledge, not been publicly disclosed.

A rough outline of the method is the following. In this particular method, the whole chosen database is digested. Then, for every peptide mass, the two numbers i and j are calculated, such that

i · 100 Da ≈ peptide mass

j · 10 kDa ≈ parent protein mass

The matrix f_ij is now created. It is constructed such that f_ij = 0 for all i and j, initially. Then, for every peptide the numbers i and j are calculated and f_ij → f_ij + 1. When all peptides have been processed, the matrix f_ij is normalized to m_ij ≡ f_ij / max_j[f_ij], where max_j[f_ij] is the largest f_ij value in column j. The score value for an entry in the protein database is now defined as

$$\text{Score} \equiv \frac{50000}{M_{protein} \cdot \prod m_{ij}} \qquad (41)$$

where M_protein is the mass of the protein measured in Dalton, and the product runs over those matrix elements which correspond to the matches between the peaks from the experimental spectrum and the peaks in the theoretical spectrum for the protein database entry. The score defined in eq. 41 takes into account the matches, but does not account for the peaks that do not find a match. It is also difficult to see how the normalization 50000/M_protein is argued for, except that it takes into account the general idea that large database proteins should have reduced scores. In conclusion, this scoring scheme is certainly a valuable step towards a sound and reliable algorithm for protein identification, but it does not use all the information that is available from the experimental spectrum as well as from the digested protein database.
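The frequency-matrix scoring of Method 1 can be sketched in Python as follows. This is a simplified illustration of the outline given above and not the implementation of [5] or [6]; the binning by integer division, the data layout and all names are assumptions.

```python
from collections import defaultdict

PEPTIDE_BIN_DA = 100.0      # bin width for peptide masses (index i)
PROTEIN_BIN_DA = 10_000.0   # bin width for parent protein masses (index j)


def build_normalised_matrix(database):
    """database: iterable of (protein_mass, [peptide_masses]); returns {(i, j): m_ij}."""
    counts = defaultdict(int)
    for protein_mass, peptide_masses in database:
        j = int(protein_mass // PROTEIN_BIN_DA)
        for peptide_mass in peptide_masses:
            i = int(peptide_mass // PEPTIDE_BIN_DA)
            counts[(i, j)] += 1                      # f_ij -> f_ij + 1
    column_max = defaultdict(int)
    for (i, j), f in counts.items():
        column_max[j] = max(column_max[j], f)        # largest f_ij in column j
    return {(i, j): f / column_max[j] for (i, j), f in counts.items()}


def score(protein_mass, matched_peptide_masses, m):
    """Score = 50000 / (M_protein * product of m_ij over the matched peptides)."""
    j = int(protein_mass // PROTEIN_BIN_DA)
    product = 1.0
    for peptide_mass in matched_peptide_masses:
        product *= m[(int(peptide_mass // PEPTIDE_BIN_DA), j)]
    return 50000.0 / (protein_mass * product)
```

Matches in sparsely populated cells have small m_ij and therefore raise the score, whereas unmatched peaks do not enter the expression at all, which is the limitation noted above.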

Method 2

Another approach, also implemented into a commercial software product, is described in [7] and [8], and is outlined here. For this method we will discuss the spectrum matching and validation steps.

For each theoretical spectrum z(j), representing database protein T[j], the spectrum resemblance to the peak list x is computed. This resemblance is based on how well peaks from z(j) and x match each other. The criterion for a match between two peaks is such that their mutual distance has to be less than e. The spectrum resemblance is now based on a score that is written as

[Equation for the score s[j]; not recoverable from the source text]

where x_k and z_k(j) are the masses of the k:th peak match between the peak list x and the theoretical spectrum z(j) respectively, N(j) is the number of peptides that resulted from in silico digestion of protein T[j], r is the number of matches, Δ is the mass range over which the search is done, and w_k is the width of the k:th matching peak from x.

This score is claimed by the authors of [7] to be the so-called Bayesian probability for protein T[j] to be the unknown protein. The weakness pertaining to that claim is that the score is normalized in such a way that it is assumed that the unknown protein can be found in the database that is being searched. It therefore has no bearing on unknown proteins not registered in the database. In the actual software application of this method, the inventors realise this problem, and are therefore using the so-called Z-value for validation. The Z-value for database protein T[j] is defined as Z[j] = (s[j] − <s>)/ω, where the score average <s> and the score standard deviation ω are taken over all database proteins. One sees that in the equation for the score s[j], matches (r) as well as misses (N(j) − r) are accounted for. Acknowledging the weakness of the score as a probabilistic measure of protein identification, and subsequently introducing the Z-value as a measure of validation, therefore improves the method. One notes, however, that method 2 does not incorporate the possibility to promote matches with high m/z values; all matches carry equal weight.
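The Z-value used for validation in Method 2 is a standardisation of the scores over the database and can be sketched as follows. This is a minimal Python illustration; the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is meant.

```python
from statistics import mean, pstdev


def z_values(scores):
    """Z[j] = (s[j] - <s>) / omega for every database protein score s[j]."""
    average = mean(scores)
    omega = pstdev(scores)   # population standard deviation (assumed)
    if omega == 0:
        raise ValueError("all scores are identical; the Z-value is undefined")
    return [(s - average) / omega for s in scores]


# Illustrative scores only (not data from the text).
print(z_values([1.0, 2.0, 3.0, 10.0]))
```

A candidate stands out when its Z-value is large, that is, when its score lies many standard deviations above the average score of the searched database.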

Comparison

When the present invention is compared to methods 1 and 2, one concludes that the differences, in favour of the present invention are, in brief, the following: The present invention

• contains a scoring method based on the probability for matches and misses between the experimental spectrum and a theoretical spectrum.

• can in a natural way take into account and promote matches with high m/z values.

• contains a validation method that, based on the general properties of digested proteins, calculates what score values can probabilistically be expected on a random basis.

• is designed with a streamlined fully automatic flow, supported by a Graphical User Interface, starting with the peak extraction from the experimental spectrum and ending in protein identification.

Applications

Three applications of the present invention will be presented in this section.

Application 1 - peak extraction

Peak extraction not continued by the identification steps is valuable in many circumstances. One such case is when a user is only interested in visually inspecting spectrum differences; for example comparison of a protein spectrum from a healthy cell sample with a spectrum from a diseased cell sample. It is, however, also valuable if a user does want to make protein identification using, in parallel with the present invention, some other protein identification software. This can easily be done using the "Copy" function available for the peak list in the GUI Results window, described above. Running different methods in parallel, in order to raise the statistical significance, is not uncommon when doing protein identification.

Application 2 - processing a single experimental spectrum

The example spectrum file is called spectrum.txt. It is loaded by choosing the box MSFiles under the Edit menu, clicking on the "Add Files" button, and then using the browser to select the spectrum file.

When the user selects "Run spectrum" in the "Actions" menu the present invention will process the spectrum, with all user options set to their default values. The search result is shown in figure 23. As can be seen the best scoring protein candidate gets a score value ("Score") of 14.52, a p-value ("Probability") of 0.15 and a quality value ("Quality") of 3.89. In the "Algorithms" section above it was shown that the quality value Q is directly related to the size needed of a random database in order to expect a certain score value. That size is in the present case exp(3.89) ≈ 50 times the size of the actual random database used for the search, maybe not a very convincing number. The colour-coded flags in the left column also help a user to quickly assess the statistical significance of the search result.

By actually taking into account knowledge about the experimental setup, a user can choose to change the values of the option parameters from their default values. In the present case the user uses the filter option to remove known peaks from the digestive enzyme, and takes into account the possibility of missed cleavages and known post-translational modifications. With these new parameter values the present invention is rerun, and the result is shown in figure 24. As can be seen there is a new top-scoring candidate. It gets a score value ("Score") of 16.42, a p-value ("Probability") ≈ 0, and a quality value Q = 12.1. The needed size of a random database for expecting the reported score value is in the present case exp(12.1) ≈ 180000 times the size of the actual random database used for the search. This is a very convincing number, leading the user to be certain that the top candidate is also the correct protein. As an aid, a quick glance by the user at the colour-coding of the flags in the left column gives strong support for the top-scoring candidate to actually be the unknown protein. Additional support is also given by the following three candidates, all of the same type as the top candidate.

Application 3 - enabling high-throughput protein identification

The present invention allows for running many experimental spectra in one go; so-called batch jobs. This is extremely valuable, and basically the only method feasible when a user wants to perform high-throughput screening of many spectra in a fast and automated fashion. A user selects all the desired spectrum files in the MSFiles box, and selects "Run batch" in the "Actions" menu. After a batch run a user clicks anywhere on the workspace, and a list of the processed spectra appears, as shown in figure 25. The result for each individual spectrum can then be studied in the same way as after processing a single spectrum, with access to spectrum graphs, lists of top-scoring protein candidates etc.

References

[1] Bairoch and Apweiler: Nucleic Acids Res. 1996, 24, 21-25

[2] Bleasby and Wootton: Protein Engineer. 1990, 3, 153-159

[3] Alberts et al.: Molecular Biology of the Cell, Garland, New York, 1994

[4] Graves et al.: Co- and Posttranslational Modifications of Proteins: Chemical Principles and Biological Properties, Oxford, New York, 1994

[5] Pappin et al.: Current Biology 1993, 3, 327-332

[6] Perkins et al.: Electrophoresis 1999, 20, 3551-3567

[7] Zhang and Chait: Anal. Chem. 2000, 72, 2482-2489

[8] WO 00/73787

Claims

1. A method for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said method comprising based on a numerical representation of a mass spectrum of a protein to be identified in the form of corresponding values of entity mass / electric charge and intensity, extracting noise-free mono-isotopic peptide peaks from the numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks; selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.

2. A method according to claim 1, wherein the probability for the set of peak matches to occur reflects the probability to have a predetermined number (r) of matches.

3. A method according to claim 1 or 2, wherein the probability for the set of peak matches to occur being determined so that the probability is rewarded by many matches, and at the same time takes into account the propensity for large proteins having many matches.

4. A method according to any of the preceding claims, wherein the candidate protein(s) is represented by a first set of peptide masses being a theoretical spectrum wherein each peak has the same intensity as all other peaks.

5. A method according to any of the preceding claims, wherein the extracting of noise-free mono-isotopic peptide peak comprising determining the intensity level of the spectrum where the signal-to-noise ratio is unity, such as substantial unity, preferably by use of equation 1 disclosed herein; determining the intensity level of the spectrum where the signal-to-noise ratio is zero, such as substantial zero, preferably by use of equation 2 disclosed herein; and locating peak candidates by locating all parts, such as substantial all parts, of the spectrum over noise and extract, preferably by copying, the ranges to form peak candidates.

6. A method according to claim 5, wherein the extracting of noise-free mono-isotopic peptide peak further comprising determining the peak entities mass/electric charge, intensity, width and signal to noise ratio for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.

7. A method according to any of the preceding claims, wherein the extracting of noise-free mono-isotopic peptide peak comprising determining a baseline of the spectrum and smoothening the baseline; determining a noise level and smoothening the noise level; and locating peak candidates by locating all parts, such as substantial all parts, of the spectrum over noise and extract, preferably by copying, the ranges to form peak candidates.

8. A method according to claim 5 or 7, wherein the extracting of noise-free mono-isotopic peptide peak further comprising fitting function parameters for the peak candidates; bundling of peak candidates into clusters; deconvolution of compound peak clusters into single peptide peak clusters; and resolving single peptide peak clusters into mono-isotopic peptide peaks.

9. A method according to any of claims 5-8, allowing peak extraction of peptide peaks representing peptides having any value of electrical charge.

10. A method according to any of the preceding claims, said method further comprising the step of filtering the list of selected peaks.

11. A method according to claim 10, wherein the filtering comprises discarding some peak based on input from a user of the method, said input may claim one or more specific peak to be discarded.

12. A method according to claim 10 or 11, wherein the list is filtered based on one or more of the strategies: echoes to consider, intensity cut, low and mass, maldi, peaks to exclude, peaks to keep, width cut.

13. A method according to any of the preceding claims, wherein a peak match is defined as a situation where the distance between the two peak masses is less than a predefined match value according to equation 6 disclosed herein.

14. A method according to claim 13, wherein the predefined match value is user defined.

15. A method according to any of the preceding claims, further comprising the step of determining a score value (σ) relating to the probability for the set of peak matches to occur, said score value being determined as the negative logarithm of the probability for the set of peak matches to occur.

16. A method according to claim 15, wherein a score value is determined for all proteins in the data base.

17. A method according to claim 16, further comprising the step of determining the probability for getting a score value equal to or above a predefined score (σ).

18. A method according to any of the claims 15-17, wherein the step of determining the probability is based on the list of peptide peaks, the predefined match value and the probability density for the list of peptide peaks.

19. A method according to any of the claims 15-18, wherein the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach randomly a score value at least as large as the score value in question.

20. A method according to any of the claims 15-18, wherein the probability for getting a score value equal to or above a predefined score (σ) is the probability to reach from the data base a score value at least as large as the score value in question.

21. A method according to claim 19 or 20, wherein the probability for getting a score value equal to or above a predefined score (σ) is calculated by equation 24 disclosed herein.

22. A method according to any of the preceding claims, wherein the database is storing a list of peptide masses and the corresponding parent proteins.

23. A method according to claim 22, wherein the database results from a digestion of proteins.

24. A method according to claim 23, wherein the digestion has been performed by a method according to claim 25.

25. A method for in silico digesting proteins, comprising establishing a plurality of protein sequences; checking, for each amino acid in the sequences, whether the amino acid acquires a posttranslational modification and, if so, modifying the amino acid, and whether the current position coincides with a pre-specified cleavage site or is the right-most amino acid and, if so, modifying the amino acid accordingly; and computing and registering the masses for all possible combinations of minimal peptide masses.
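By way of illustration only, the in silico digestion of claim 25 may be sketched as below. Several points are assumptions of the sketch rather than features of the claim: cleavage is taken to occur after lysine (K) or arginine (R), posttranslational modifications are modelled as a hypothetical per-residue mass shift, "all possible combinations" is read as all runs of adjacent minimal peptides (that is, missed cleavages), and masses are neutral mono-isotopic masses.

```python
WATER = 18.01056  # mono-isotopic mass of H2O added to every peptide

# Mono-isotopic residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def minimal_peptides(sequence, cleave_after="KR"):
    """Split at every cleavage site and at the right-most amino acid."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if aa in cleave_after or is_last:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide, modifications=None):
    """Sum residue masses plus optional per-residue shifts, plus water."""
    shifts = modifications or {}
    return sum(RESIDUE_MASS[aa] + shifts.get(aa, 0.0) for aa in peptide) + WATER


def digest(sequence, cleave_after="KR", modifications=None):
    """Register (peptide, mass) for every run of adjacent minimal peptides."""
    pieces = minimal_peptides(sequence, cleave_after)
    registered = []
    for i in range(len(pieces)):
        for j in range(i + 1, len(pieces) + 1):
            peptide = "".join(pieces[i:j])
            registered.append((peptide, peptide_mass(peptide, modifications)))
    return registered


if __name__ == "__main__":
    for peptide, mass in digest("AKGRS"):
        print(peptide, round(mass, 5))
```

The registered list corresponds to the database of peptide masses and parent proteins referred to in claim 22 once each entry is stored together with the identifier of the protein it came from.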

26. A computer system for providing a measure applicable in selecting proteins in a database storing a numerical representation of theoretical mass spectrum for a plurality of proteins, said computer system comprising means for extracting noise-free mono-isotopic peptide peaks from a numerical representation of the mass spectrum to be identified, thereby providing, if possible, a list of selected peaks, the extracting being based on a numerical representation of the mass spectrum of a protein to be identified in the form of corresponding values of entity mass/electric charge and intensity, selecting at least one candidate protein from the database, said candidate(s) being selected based on a closeness-of-fit algorithm providing a set of peak matches between the list of selected peaks and peaks of the candidate protein(s), and determining, such as computing means, the probability for the set of peak matches to occur, thereby providing a measure applicable in selecting protein candidates.

27. A computer system according to claim 26, further comprising means for performing some or all of the steps of the method according to any of the claims 1-24.

28. A graphical user interface for guiding a user through a protein identification process, said interface comprising a number of module fields each representing one or more modules adapted to perform one or more of the steps of the method according to any of the claims 1-24, wherein said fields are graphically arranged and graphically linked so as to reflect a predefined execution order of the modules, and wherein said graphical user interface is adapted, in response to input to the computer system, to initiate execution of a module and to change the appearance of the field corresponding to the module being executed.

29. A graphical user interface according to claim 28, wherein said input is provided by a pointing device, such as a computer mouse, and an associated button press.

30. A graphical user interface according to claim 28 or 29, wherein one or more of said fields change appearance during execution of their corresponding module(s).

31. A graphical user interface according to any of the claims 26-28, further comprising one or more result fields appearing after results have been generated and/or while results are being generated by one or more of said modules, and displaying said results.

32. A graphical user interface according to any of the claims 28-31, further comprising input fields appearing when a user input is required.

33. A graphical user interface according to any of the claims 28-31, further comprising one or more dialog windows through which user input may be inputted and/or edited.

34. A graphical user interface according to claim 31, wherein one or more of the one or more dialog windows allow a user to edit values stored in a data base and wherein said one or more dialog windows preferably are accessible via button(s) appearing on the interface.

35. A graphical user interface according to any of the claims 28-33, further comprising a tool bar via which actions can be executed by pushing buttons appearing on said tool bar, said tool bar preferably further comprising curtains, wherein each curtain represents a category of action and wherein each curtain comprises buttons for actions belonging to a particular category.

36. A graphical user interface according to any of the claims 26-33, further comprising a set of windows that communicate the results of the protein identification process.